Computational protocol: Genetic substructure in cynomolgus macaques (Macaca fascicularis) on the island of Mauritius

Protocol publication

[…] Sequence reads were assembled and cleaned up using CodonCode Aligner v4.1.1 (CodonCode Corporation, Centerville, MA). Two alignments were created with ClustalW [] for phylogenetic analysis, one for the mtDNA sequence data and one for the Y chromosome data. Sequences from Tosi et al. [] and Tosi and Coke [] were included as references to establish the provenance of our cynomolgus macaque samples, and a single baboon (Papio sp.) sequence was included in each alignment as an outgroup. Each alignment was run in jModelTest v2.1.1 [, ], and Akaike information criterion (AIC) calculations were used to determine the best-fit model of nucleotide substitution. For the Y chromosome dataset, the model used was GTR + G with gamma shape parameter alpha = 0.1450; for the mtDNA dataset, it was HKY + I + G with alpha = 0.3560 and proportion of invariable sites I = 0.4194. Both sets of parameter values are based on model-averaged estimates. Maximum likelihood phylogenetic analyses were carried out using PhyML 3.0 [], with the best of nearest neighbor interchange (NNI) and subtree pruning and regrafting (SPR) tree topology searches, a BioNJ starting tree, and bootstrap analysis (n = 100).

[…] Using the genotypes ascertained on the SNP panel, population substructure was interrogated using STRUCTURE 2.3.4 [, ]. STRUCTURE uses a Bayesian approach to identify subpopulation structure, returning the log probability of the data, ln Pr(X|K), for a given number of discrete clusters (K). For initial analyses, the default settings of STRUCTURE were used, following the configuration of Falush et al. [], with 10,000 burn-in and 40,000 Markov chain Monte Carlo (MCMC) repetitions. The degree of admixture, alpha, was estimated from the data, and lambda, a parameter describing the distribution of allele frequencies, was fixed at its default value. Allele frequencies were assumed to be correlated between clusters. For each number of clusters from one (K = 1) to five (K = 5), 100 runs were tested.
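The protocol does not state how the replicate runs were summarized, so the following is only a minimal sketch of one way to compare values of K from STRUCTURE output. It assumes the ln Pr(X|K) estimate of each run has already been extracted into a Python dictionary. All numbers below are hypothetical, and the function names `summarize_runs` and `approx_posterior` are illustrative and not part of STRUCTURE. The posterior step uses the ad hoc approximation Pr(K|X) ∝ exp(ln Pr(X|K)) under a uniform prior on K, applied here to the per-K mean as a simplifying assumption.

```python
import math
import statistics


def summarize_runs(lnp_by_k):
    """Return {K: (mean, sd)} of ln Pr(X|K) over the replicate runs for each K."""
    return {
        k: (statistics.mean(v), statistics.stdev(v) if len(v) > 1 else 0.0)
        for k, v in lnp_by_k.items()
    }


def approx_posterior(mean_lnp):
    """Approximate Pr(K|X) under a uniform prior on K.

    Subtracting the maximum before exponentiating avoids numerical underflow,
    because ln Pr(X|K) values are typically large and negative.
    """
    top = max(mean_lnp.values())
    weights = {k: math.exp(v - top) for k, v in mean_lnp.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical ln Pr(X|K) values: three replicate runs for each of K = 1..3.
lnp_by_k = {
    1: [-5210.4, -5210.9, -5210.1],
    2: [-5105.2, -5106.8, -5104.9],
    3: [-5120.7, -5131.3, -5118.0],
}

summary = summarize_runs(lnp_by_k)
means = {k: mean for k, (mean, _sd) in summary.items()}
posterior = approx_posterior(means)
best_k = max(means, key=means.get)

for k in sorted(summary):
    mean, sd = summary[k]
    print(f"K={k}: mean lnP={mean:.1f}, sd={sd:.2f}, approx Pr(K|X)={posterior[k]:.3g}")
print("K with highest mean ln Pr(X|K):", best_k)
```

A large standard deviation across runs at a given K, as in the hypothetical K = 3 values here, would suggest that the chains have not converged consistently and that longer runs may be worth trying.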
To test the robustness of these assumptions, the same analysis was also run with 50,000 burn-in and 250,000 MCMC repetitions, a data-derived lambda (2.22), and allele frequencies assumed to be independent between populations (Additional file : Figure S1, Additional file : Figure S2 and Additional file : Figure S3).

A Discriminant Analysis of Principal Components (DAPC) was also performed using the adegenet package v1.4-2 in R [, –]. DAPC uses the k-means clustering algorithm and the Bayesian Information Criterion (BIC) to determine the number of population clusters, K, maximizing variance between groups while minimizing variance within groups. SNP data were first transformed using a Principal Component Analysis (PCA) and then analyzed with k-means, using k values from 1 to 10, to identify the optimal number of clusters. DAPC then constructs synthetic variables, the discriminant functions, as linear combinations of alleles that show the greatest between-group variation and the smallest within-group variation []. This method differs from traditional PCA in that it minimizes within-group variability. […]
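The contrast between PCA and discriminant functions can be illustrated with a small sketch. This is not adegenet's implementation, which runs in R. It only shows the criterion involved, using hypothetical two-dimensional data and an illustrative helper, `between_within_ratio`, that projects points onto a candidate axis and compares between-group with within-group sums of squares. In this toy data the x-axis carries the most total variance, which is what PCA favors, but it does not separate the groups at all, whereas the y-axis does.

```python
def project(points, axis):
    """Project each point onto a direction vector (dot product)."""
    return [sum(p_i * a_i for p_i, a_i in zip(p, axis)) for p in points]


def total_ss(values):
    """Total sum of squares around the grand mean."""
    grand = sum(values) / len(values)
    return sum((v - grand) ** 2 for v in values)


def between_within_ratio(points, labels, axis):
    """Ratio of between-group to within-group sum of squares along one axis."""
    scores = project(points, axis)
    grand = sum(scores) / len(scores)
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)
    between = 0.0
    within = 0.0
    for members in groups.values():
        mean = sum(members) / len(members)
        between += len(members) * (mean - grand) ** 2
        within += sum((m - mean) ** 2 for m in members)
    return between / within


# Hypothetical data: two groups spread widely along x but separated along y.
points = [(-3, 1.0), (-1, 1.2), (1, 0.8), (3, 1.0),
          (-3, -1.0), (-1, -0.8), (1, -1.2), (3, -1.0)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

x_axis, y_axis = (1, 0), (0, 1)
var_x = total_ss(project(points, x_axis))
var_y = total_ss(project(points, y_axis))
ratio_x = between_within_ratio(points, labels, x_axis)
ratio_y = between_within_ratio(points, labels, y_axis)

print(f"total SS along x: {var_x:.2f}, along y: {var_y:.2f}")
print(f"between/within ratio along x: {ratio_x:.2f}, along y: {ratio_y:.2f}")
```

An axis chosen for total variance alone would be close to x here, while an axis chosen for the between/within ratio would be close to y, which is the sense in which a discriminant function differs from a principal component.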

Pipeline specifications

Software tools: CodonCode Aligner, Clustal W, jModelTest, PhyML, adegenet
Applications: Phylogenetics, Population genetic analysis
Organisms: Macaca fascicularis, Homo sapiens