Tracing the origin of the Atacama ‘alien’ skeleton: A step-by-step PCA example

They found Ata 15 years ago in a deserted Chilean town. She resembled no one on Earth. She had features that would make James Cameron immediately call his director of photography: an angular skull, a disproportionately large head, and slanted eyes. She was indeed featured in a documentary film about UFOs and extraterrestrials (though not a Cameron film).

Ata is short for the Atacama skeleton, named after the desert where she was found. Fifteen years later, a group of scientists led by Garry Nolan finally solved the puzzle. This ‘alien’ skeleton turned out to be a female human fetus with skeletal disorders, likely dysplasia or dwarfism. The research group even traced the origin of Ata and found that she was of Chilean descent. How did they reach that conclusion from the whole-genome sequence of the Atacama skeleton? In this blog post, we walk step by step through the principal component analysis (PCA) done in the study.

Check out the PCA 3D plot of the Atacama specimen and 1000 genomes data now!

Figure 1: Overview of the analysis presented in this blog post. Starting from the raw sequencing data of the Atacama specimen and the 1000G variants, we perform a simple alignment and variant calling against the GRCh37 reference genome. Then we calculate new coordinates for all samples on a two-dimensional graph.

Step 1: Preprocess the variant data

Nolan’s paper provides the BioSample ID of the Atacama skeleton: SAMN05545220. With this ID, we can easily retrieve the whole-genome sequencing (WGS) data of the specimen from NCBI (and so can you).

Then we use BWA-MEM to align the Atacama WGS data to the GRCh37 human reference genome.

Next step: extracting Ata’s variants and comparing them with those of different populations around the world. We use freebayes with default parameters to call Ata’s genomic variants, which returns a VCF file with 8,014,618 variant sites. Matching this output against the genetic variation data from the 1000 Genomes Project Phase 3 (1000G), we get the final VCF file ata.1000genomes.vcf.gz, which has 3,139,028 shared variants.
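To make the "matching" idea concrete, here is a minimal Python sketch. It assumes that two variants count as shared when chromosome, position, reference allele, and alternate allele all agree; that rule, and the function names, are our illustration rather than the exact procedure used to build ata.1000genomes.vcf.gz. On real whole-genome VCFs you would reach for a dedicated tool rather than plain Python.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from the data lines of a VCF."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # one key per alternate allele
            keys.add((chrom, int(pos), ref, allele))
    return keys


def shared_variants(sample_lines, panel_lines):
    """Variants present in both inputs (assumed matching rule: same key)."""
    return variant_keys(sample_lines) & variant_keys(panel_lines)


# Toy input, not real Atacama or 1000G data
sample = ["#CHROM\tPOS\tID\tREF\tALT\n", "1\t100\t.\tA\tG\n", "2\t50\t.\tG\tA\n"]
panel = ["#CHROM\tPOS\tID\tREF\tALT\n", "1\t100\trs1\tA\tG\n", "2\t50\trs3\tG\tC\n"]
print(sorted(shared_variants(sample, panel)))
```

Only the first toy variant survives, because the second one has a different alternate allele in the two files.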

Note that we did not follow the paper’s protocol exactly. To reduce processing time, we used freebayes instead of GATK and skipped any additional tools. This keeps things simple and hardly affects the result: we found 3,139,028 variants (about 19% fewer than the paper’s 3,874,934), and the output PCA is almost identical.

Figure 2: Inside a VCF file after merging the variants from both Ata and 1000G.

Step 2: PCA of 1000 Genomes and Atacama

At this point, we have more than 3 million variants shared between Ata and 1000G, stored in the VCF file ata.1000genomes.vcf.gz. Time to search for principal components (PCs)! We use PLINK 2.0 and the protocol below (courtesy of Kevin Blighe, who started this discussion on Biostars).

Convert the VCF file into BCF format

Convert BCF file to PLINK format

Prune variants

Perform PCA of the 1000 Genomes samples and Atacama
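To give a feel for what the PCA step computes, here is a toy pure-Python sketch. It is not PLINK's implementation and would never scale to millions of variants. It assumes genotypes are coded as alternate-allele counts (0, 1, or 2), and all function names are ours. The idea: center each variant, build a sample-by-sample similarity matrix, and take its leading eigenvector (found here by power iteration) as PC1.

```python
import math


def center_columns(genotypes):
    """Subtract each variant's mean so every column has mean zero."""
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in genotypes]


def gram_matrix(centered):
    """Sample-by-sample similarity: averaged dot products of centered rows."""
    m = len(centered[0])
    return [[sum(a * b for a, b in zip(r1, r2)) / m for r2 in centered]
            for r1 in centered]


def top_eigenvector(matrix, iterations=200):
    """Leading eigenvector of a symmetric matrix via power iteration."""
    n = len(matrix)
    vec = [float(i + 1) for i in range(n)]  # fixed, non-constant start vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break  # all-zero matrix: keep the last vector
        vec = [x / norm for x in nxt]
    return vec


def first_pc_scores(genotypes):
    """PC1 coordinate of every sample (rows = samples, columns = variants)."""
    return top_eigenvector(gram_matrix(center_columns(genotypes)))


# Four made-up samples from two made-up groups, four variants each
toy = [[0, 0, 2, 2], [0, 1, 2, 2], [2, 2, 0, 0], [2, 2, 0, 1]]
print(first_pc_scores(toy))
```

The first two samples land on one side of PC1 and the last two on the other. The sign of a principal component is arbitrary, so only the separation matters.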

Step 3: Visualize the PCA

Here it is: the final dataset for PCA. The data feature 2505 genomes (including that of the Atacama skeleton), 5 PCs, and the corresponding demographic information.
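If you would rather assemble such a table yourself, the sketch below joins PCA coordinates with population labels. It rests on two assumptions about file layout that you should check against your own files: the eigenvector file has a header row containing an IID column and PC1, PC2, ... columns, and the 1000G sample panel is tab-separated with sample, pop, and super_pop columns. Samples missing from the panel (such as Ata) get a placeholder label. The function names are hypothetical.

```python
def read_eigenvec(lines):
    """Parse eigenvector text into {sample_id: [PC1, PC2, ...]}.

    Assumes a header row such as '#IID PC1 PC2' or '#FID IID PC1 PC2'.
    """
    rows = [line.split() for line in lines if line.strip()]
    header = [name.lstrip("#") for name in rows[0]]
    id_col = header.index("IID")
    pc_cols = [i for i, name in enumerate(header) if name.startswith("PC")]
    return {row[id_col]: [float(row[i]) for i in pc_cols] for row in rows[1:]}


def read_panel(lines):
    """Parse a tab-separated panel into {sample: (pop, super_pop)}."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    header = rows[0]
    s, p, sp = (header.index(col) for col in ("sample", "pop", "super_pop"))
    return {row[s]: (row[p], row[sp]) for row in rows[1:]}


def build_table(eigenvec_lines, panel_lines, unlabeled=("Unknown", "Unknown")):
    """One record per sample: ID, population labels, and PC coordinates."""
    panel = read_panel(panel_lines)
    table = []
    for sample, coords in read_eigenvec(eigenvec_lines).items():
        pop, super_pop = panel.get(sample, unlabeled)
        table.append({"sample": sample, "pop": pop,
                      "super_pop": super_pop, "pcs": coords})
    return table


# Toy rows with made-up coordinates
eigenvec = ["#IID\tPC1\tPC2\n", "HG00096\t0.01\t-0.02\n", "Ata\t0.03\t0.04\n"]
panel = ["sample\tpop\tsuper_pop\tgender\n", "HG00096\tGBR\tEUR\tmale\n"]
for record in build_table(eigenvec, panel):
    print(record)
```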

Import the whole dataset into BioVinci. (If you haven’t got it on your computer, download it here.)

Once the graphing window pops up, go for the Scatter plot symbol underneath the blank area.

Drag PC1, PC2, and Superpopulation to X, Y, and Color by, respectively. You’ve got a 2D scatter plot right after that.

Figure 3: A two-dimensional scatter plot with the PCA result from PLINK.

Or drag PC1, PC2, and PC3 one by one to the X, Y, and Z placeholders on the left-hand side. There you go: a 3D scatter plot with a single color.

Well, it’s hard to make sense of the data from that monochromatic plot. To separate the Ata specimen from other populations, simply drag the Superpopulation column to the Color by field. Now you’ve got a complete 3D scatter plot for interpretation.

Figure 4: Creating a three-dimensional scatter plot with the PCA result from PLINK. Other annotations, e.g. Sub-region or Superpopulation, were retrieved from the data description of 1000G.

To see which subregion and gender Ata belongs to, just drag these columns into the Tooltip placeholder. The information will show when you hover your mouse over the data points.

Figure 5: The tooltip display

Step 4: Interpret

On the three-dimensional PCA graph, Ata sits closest to individuals of Mexican ancestry from Los Angeles (USA), Colombians from Medellin (Colombia), and Peruvians from Lima (Peru). This suggests that Ata has South American ancestry. The study contains plenty more analyses that dig deeper into this evidence. The authors estimated that the admixed genome consists of 53.8% European ancestry and 25.7% Andean Native ancestry, which matches ancestry estimates for Chilean individuals. The alignment also showed no presence of a Y chromosome and no evidence of the sex-determining SRY gene. This means Atacama is actually a “she.”
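We judged "closest" by eye on the 3D plot. If you want a number instead, one simple option (our assumption, not something taken from the study) is to average the PC coordinates of each population and rank the populations by Euclidean distance from Ata's point. The coordinates below are invented for illustration only.

```python
import math


def population_centroids(points):
    """points: iterable of (label, coords). Returns {label: mean coords}."""
    sums, counts = {}, {}
    for label, coords in points:
        if label not in sums:
            sums[label] = [0.0] * len(coords)
            counts[label] = 0
        for i, value in enumerate(coords):
            sums[label][i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in total]
            for label, total in sums.items()}


def euclidean(a, b):
    """Straight-line distance between two points in PC space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_populations(query, points):
    """Population labels sorted from nearest to farthest centroid."""
    centroids = population_centroids(points)
    return sorted(centroids, key=lambda label: euclidean(query, centroids[label]))


# Invented (PC1, PC2, PC3) values; the labels are 1000G population codes
samples = [
    ("PEL", [0.0, 0.1, 0.0]),
    ("PEL", [0.2, 0.1, 0.0]),
    ("CEU", [1.0, 1.0, 0.0]),
    ("YRI", [-2.0, 0.0, 1.0]),
]
print(rank_populations([0.1, 0.0, 0.0], samples))
```

Distances in PC space are only a rough guide, since each component explains a different share of the variance, so treat the ranking as a companion to the plot rather than a replacement for the paper's ancestry analyses.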

If you look up the variants found in the final VCF file (Figure 2), you may find many of them related to various skeletal disorders, such as cranioectodermal dysplasia, Greenberg skeletal dysplasia, or osteochondrodysplasias. These disorders alter bone formation and lead to the perplexing features that made Ata look like an ‘alien’: unusual skull shape, abnormally small stature, lack of 2 ribs, premature bone age, and so on.

There you have it. We hope this blog post has guided you through a step-by-step PCA example that traces the origin of the Atacama skeleton. The seemingly alien skeleton is actually a female human fetus of Chilean descent. Her perplexing features can be explained by multiple skeletal disorders, also revealed in her whole-genome sequencing data. Nolan and his team have presented us with a beautiful study. They combined sequencing technology, computation, and the human genome knowledge base to solve the puzzle of a specimen of unknown origin.


