Computational protocol: The genetic history of Cochin Jews from India

Protocol publication

[…] We applied various QC steps to the merged data set of Indian and Jewish populations (and the HapMap3 and Coriell populations). Briefly, we removed single nucleotide polymorphisms (SNPs) with a low call rate (below 95%) and removed individuals based on two criteria:

Genetic outliers: genetic outliers, as defined by the default parameters of SMARTPCA (Patterson et al.), were removed from each population with at least five samples, analyzing each population separately and using autosomal SNPs. As the population structure of Cochin Jews is complex and may comprise different groups, we did not filter genetic outliers in this population.

Relatives: from each pair of related individuals, we retained only one individual. For this purpose, we represented the data as a graph in which each vertex represented an individual and two vertices were connected by an edge if the corresponding individuals were related. We used a greedy algorithm (Halldórsson and Radhakrishnan) to find a maximal independent set in this graph, which corresponds to a maximal set of unrelated individuals. As in previous studies (Campbell et al.; Waldman et al.), two individuals were considered related if their total autosomal identity-by-descent (IBD) sharing was larger than 800 cM and if they shared at least 10 segments of at least 10 cM each (see below how IBD sharing was calculated).

Following these QC steps, the merged data set (of Jewish, Indian, HapMap3 and Coriell populations) included 465,604 autosomal SNPs and 25,165 X chromosome SNPs (in the non-pseudoautosomal regions) for 1698 individuals. Further merging with the HGDP data set yielded 1756 samples with 274,454 shared autosomal SNPs. The number of samples from each population is shown in Supplementary Table S1 (Supplementary Material online).

For the following analyses, we used a set of SNPs filtered by linkage disequilibrium (LD): PCA, FST, ADMIXTURE, runs of homozygosity and heterozygosity. For each pair of SNPs showing LD of r² > 0.5, we kept only one representative (using the r2thresh and killr2 flags of SMARTPCA (Patterson et al.)). This filtering was done separately for each analysis, depending on LD in the specific set of populations used in that analysis. Other analyses presented here were performed on the full data sets described above. […]
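The relatedness filter described above can be illustrated with a minimal Python sketch. It is not the authors' code. The excerpt says only that "a greedy algorithm" was used, so the minimum-degree greedy rule below is an assumption, as is the choice to compute total sharing as the sum of all segment lengths supplied for a pair. The function names and the input layout (a dictionary mapping a pair of sample IDs to a list of IBD segment lengths in cM) are hypothetical.

```python
# Thresholds quoted in the protocol text.
MIN_TOTAL_CM = 800.0     # total autosomal IBD sharing must exceed this
MIN_SEGMENT_CM = 10.0    # a segment counts as "long" if at least this length
MIN_LONG_SEGMENTS = 10   # at least this many long segments are required


def is_related(segment_lengths_cm):
    """Apply the two-part relatedness rule to one pair's IBD segments (cM).

    Assumption: total sharing is the sum of all segments passed in.
    """
    total = sum(segment_lengths_cm)
    n_long = sum(1 for length in segment_lengths_cm if length >= MIN_SEGMENT_CM)
    return total > MIN_TOTAL_CM and n_long >= MIN_LONG_SEGMENTS


def build_graph(individuals, pair_segments):
    """Return an adjacency map with an edge between each related pair."""
    graph = {ind: set() for ind in individuals}
    for (a, b), lengths in pair_segments.items():
        if is_related(lengths):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_unrelated_set(graph):
    """Greedy maximal independent set (assumed minimum-degree variant).

    Repeatedly keep the vertex with the fewest remaining relatives, then
    drop that vertex and all of its relatives from the graph.
    """
    remaining = {v: set(nbrs) for v, nbrs in graph.items()}
    kept = set()
    while remaining:
        # Ties are broken by sample ID so the result is deterministic.
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        kept.add(v)
        removed = remaining[v] | {v}
        for u in removed:
            remaining.pop(u, None)
        for nbrs in remaining.values():
            nbrs -= removed
    return kept


if __name__ == "__main__":
    samples = ["A", "B", "C", "D"]
    segments = {
        ("A", "B"): [100.0] * 10,  # 1000 cM in 10 long segments: related
        ("B", "C"): [90.0] * 10,   # 900 cM in 10 long segments: related
        ("C", "D"): [200.0] * 5,   # 1000 cM but only 5 segments: not related
    }
    print(sorted(greedy_unrelated_set(build_graph(samples, segments))))
```

The result is maximal in the sense that no further individual can be added without including a related pair. It is not guaranteed to be the largest possible unrelated set.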

Pipeline specifications

Software tools: EIGENSOFT, ADMIXTURE
Databases: HGDP
Application: Population genetic analysis
Diseases: Genetic Diseases, Inborn