Computational protocol: Ancient mitogenomes of Phoenicians from Sardinia and Lebanon: A story of settlement, integration, and female mobility

Similar protocols

Protocol publication

[…] Ancient DNA sequences were also processed following the protocols described previously []. Specifically, ancient sequences were subjected to AdapterRemoval [], where sequencing adaptors, reads with stretches of Ns, bases with quality scores of <30 and short reads (<25 bp) were removed. Additionally, where paired end fragments overlapped by at least 11 base pairs, reads were merged. Merged and unmerged reads were aligned separately with Burrows-Wheeler Aligner (BWA) using recommended ancient DNA settings in which seeding was disabled (-l 1024), gap opens was set to 2 (-o 2) and maximum edit distance was set to 0.03 (-n0.03) []. To test for contamination with laboratory reagents, all reads were mapped to a composite reference genome constructed using the revised Cambridge Reference Sequence (rCRS) for humans [], along with the reference mitochondrial genomes for cow (Bos taurus), pig (Sus scrofa) and chicken (Gallus gallus) (GenBank refs NC_006853, NC_0012095.1 and NC_001323.1). The contamination ratio was determined by calculating the ratio of all reads with mapping quality > = 20 for each reference. We determined the contamination ratio by calculating the ratio of all reads with a mapping quality > = 20 for each reference, compared to those for the human CRS.PCR duplicates were marked and removed from unmerged reads using Picard tools 1.92 (http://broadinstitute.github.io/picard/), and from merged reads using a python script developed by Fu et al. []. MapDamage (v2.0.2–9) [] was used to identify characteristic aDNA damage patterns, with the ‘-rescale’ option to lower the quality score of likely damaged sites. Thiamines (T) found at the 5’end of a read and guanines (G) found at the 3’ ends within the first two bases were rescaled to 0 for their quality scores. Modern human contamination levels were assessed using the ContamMix package with a set of 311 representative modern human mtDNA sequences [].Variant call files (VCFs) were generated for all reads that mapped to the human reference genome using the GATK Haplotype Caller (v3.5) [] with settings specific for haploid genomes and were filtered for mapping quality (<20) and bases that were covered by fewer than 3 reads. We only analysed sequences having a minimum of 96% coverage. VCF files were converted to whole-sequence FASTA files using in-house commands and GATK’s FastaAlternateReferenceMaker. Coverage plots were created for all reads using R software [], and plots of fragment length were created from merged reads with Picard’s CollectInsertSizeMetrics (Picard tools 1.92, http://broadinstitute.github.io/picard/). Consensus sequences were created (including indels) and these were deposited in GenBank. Sequences were assigned to haplogroup using Haplogrep [] with Phylotree build 17. All reads generated have been submitted to the NCBI Sequence Read Archive (SRP123440), identified by lab sample number. [...] Median-Joining network analysis was applied to all ancient sequences using POPART [] with default settings. Median-Joining networks were constructed by combining minimum-spanning trees within a single network.Discriminant analysis of principal components (DAPC) was used to investigate clustering of related individuals. DAPC is a multivariate analysis used to extract information of genetic relationships for between- and within-species/population sequence data []. The DAPC approach is integrated into the R package adegenet []. We used the fasta2DNAbin function provided by adegenet to load the sequence data into R. We then transformed the data to a genind file using DNAbin2genind (DAPC input format). The principal components (PC) and discriminant functions to retain were chosen using the plots created by the DAPC function (dapc). All R packages used for the cluster analysis can be freely downloaded at: http://cran.rproject.org/web/packages/.To investigate the phylogenetic relationships between the ancient mitogenomes that we obtained and modern population data, Maximum likelihood (ML) trees were constructed using PhyML []. Modern sequences for each haplogroup identified in our ancient samples were downloaded from Genbank. The most appropriate nucleotide substitution model was determined using jModeltest []. […]

Pipeline specifications