Computational protocol: Genome data from a sixteenth century pig illuminate modern breed relationships

Similar protocols

Protocol publication

[…] To process raw ancient data, we first removed stretches of N′s and stretches of consecutive bases with 0, 1 or 2 quality scores from the 3' and 5' ends of the reads. Reads shorter than 30 nucleotides were discarded for further analyses, a common practice in aDNA studies to minimize the risk of erroneous alignments (, ; ). Post-mortem degradation results in a short length of aDNA sequences. As a result, adapter sequences ligated during library preparation can be present at the end of the reads. This can affect the correct mapping to the reference genome and it can also bias the single-nucleotide polymorphism (SNP) calling. Therefore, we used adapter removal () to remove adapter sequences from the reads, discarding sequences shorter than 30 bp after adapter trimming. We found 13% of the reads containing adapter sequence, keeping a total of 408 912 560 reads with an average length of 93 bp, which were aligned to the pig reference genome. We mapped reads to the current pig genome assembly (Sscrofa10.2) using BWA () with the quality trimming parameter set to a Sanger quality score of 15. Furthermore, to improve the aDNA read mapping against modern reference genomes the edit distance parameter was set to 0.02, and the seed region (the first 32 nucleotides) was disabled following recommendations in ). Finally, we removed duplicates with SAMtools rmdup option. In order to assess the level of human contamination, we mapped all the reads to the human reference assembly (GRCh37/hg19) using BWA.A metagenomic analysis and taxonomic classification of one million of around 133 millions reads non-mapping to the Sus scrofa assembly was performed using BLAST 2.2.27+ and MEGAN4 (). In order to assess whether we were infra-representing taxonomy diversity by down sampling to one million reads, we performed rarefaction curves by counting the number of detected leaves by cumulatively increasing the number of reads in bins up to the total of reads assigned at the level of genus. We observed that the curve approached a plateau with poor increase of new genus discovered when the last bins of our subset were analyzed. Also, we specifically assessed the presence of integrated or episomal viral DNA in the ancient pig sample. We mapped, using BWA-0.7.4 mem, all reads non-mapping to the reference Sus scrofa genome against a sequence database including all known genomes of DNA viruses. Mapping reads were then mapped again against individualized viral genomes using BWA-0.7.4 aln and we counted the number of uniquely mapping reads.For allele determination in the ancient sample, we considered only reads with minimum mapping quality of 20, and base quality (Phred score) of at least 30 if there was a single read covering that position or 20 with depths 2–5 × . Positions covered with >5 × were discarded, as being most likely caused by repetitive or copy number variant regions. To avoid post-mortem DNA damages that lead to increased C→T and G→A transitions, we only retained those positions where the ancient allele was also observed in any of the eight modern pig genomes used for this study (). This filtering should decrease dramatically the number of post-mortem changes accepted as true variants by a factor of at least ∼1/2 a16θ. This is the probability of finding a base in the eight modern samples that coincide with a given post-mortem damage. Suppose that the ancient sample has a post-mortem modification at a given (unknown) site; what is the probability that this change is also observed in any of the modern samples and therefore taken as a true variant? This is the probability of observing a variant in any of the modern samples (a16θ) times the probability that any of the two alleles coincide with any of the two nucleotides in the ancient sample (1/2), with a16θ being the expected number of polymorphisms per nucleotide to be found in eight diploid samples or 16 chromosomes, an being Ewens' constant (a16=3.4) and θ the variability per site (or ∼0.002 in pigs), the constant 1/2 occurs because half of the potential errors will be accepted, when they match any of the two alleles found in the modern population. In practice, this is an upper limit because post-mortem changes occur randomly in each DNA fragment and, for depths >1, this requires that the same post-mortem change has occurred in all fragments sequenced. An assumption here is that all samples sequenced originate from the same population (a neutral model is implicitly assumed); however, except for extreme selective events that are specific of the ancient sample, this assumption should have only a minor effect.Admittedly, this filtering will also remove true variants that are observed only in the ancient sample. These will tend to be variants at low frequency in the whole pig population; otherwise, it is likely that they are observed in any of the modern samples as well. The expected percentage of polymorphisms, that are singletons in the ancient sample and are therefore discarded even if being true variants, is also obtained from Ewen's sampling term. Assuming a standard neutral model, this corresponds 1−a16/a17∼2% this follows from the fact that we ascertain 16 modern chromosomes but only one ancient allele; and therefore we expect to observe a16 SNPs within modern samples, and a17 including the ancient sample. In all, the bias due to removing singletons is expected to be small. This reasoning discards the possibility that the ancient sample is homozygous for a variant not observed in any of the modern samples. This event is, however, highly unlikely because shallow coverage prevents confidently observing both alleles for most of sites in the diploid ancient genome. [...] Modern sample () reads were aligned with BWA () allowing for seven mismatches. Genotypes were called using the SAMtools mpileup option and filtered with varFilter, all modern samples were analyzed together setting a minimum depth to 5 × and a maximum depth of twice the average sample's depth plus one, minimum map quality of 20 and minimum base quality of 20. Setting a maximum depth was done to minimize risks of wrongly called SNPs caused by copy number variants or repetitive regions, as in or . The resulting vcf file was merged with the ancient reads. For further analyses, we retained only the positions without missing data in any of the samples and where the ancient reads were compatible with the modern sample genotypes. Genotypes were stored and managed as plink () files, using custom perl and shell scripts as needed. […]

Pipeline specifications

Software tools BWA, SAMtools, MEGAN, PLINK
Applications Metagenomic sequencing analysis, GWAS
Organisms Sus scrofa