Computational protocol: Global Carrier Rates of Rare Inherited Disorders Using Population Exome Sequences

Protocol publication

[…] Genotype data produced by the genotyping pipelines of the 1000G (Phase 1) (http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/) and NHLBI (http://www.nhlbi.nih.gov/) projects were collected in VCF format. The dataset consisted of a total of 15,190 haploid exomes from high-coverage exome sequence data derived from 14 + 2 ethnic groups. The NHLBI data come from individuals sequenced as part of various disease-specific studies and therefore may not fully reflect the genetic population structure, whereas 1000G collected healthy individuals. The validity of part of the NHLBI dataset was previously assessed by NHLBI using Sanger sequencing [novel singleton variants, 143/145 (99%); novel nonsingleton variants, 316/323 (98%)]. The genotype accuracy of 1000G was estimated at 97.4% (20,687/21,235) by comparison with the HapMap genotype calls. The 1000G and NHLBI datasets (VCF files) were filtered in Variant Tools (http://varianttools.sourceforge.net/Annotation/HomePage) and Microsoft Excel by total read depth, the number of individuals with coverage at the site, the fraction of reads carrying the mutation in each heterozygote, and the average position of the mutation allele along a read.
Eighteen recessively inherited diseases were provisionally selected from the literature (published from 1957 to 2014) retrieved through NCBI OMIM (http://www.ncbi.nlm.nih.gov/omim) and PubMed (http://www.ncbi.nlm.nih.gov/pubmed). Causative mutations for these disorders were extracted from the 1000G and NHLBI datasets by chromosomal position (UTR, coding, intronic, and splice-site regions). ClinVar and HGMD were additionally reviewed to collect mutations. The identified mutations were then classified by mutation type, allele frequency, racial group, and clinical impact. Information on mutation types, positions, reference sequences, and pathogenicity was retrieved from NCBI dbSNP (http://www.nlm.nih.gov/SNP/) and the UCSC genome browser (http://genome.ucsc.edu/) to generate exome-based epidemiology.
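The four site-level filters named above (total read depth, number of individuals with coverage, fraction of mutation reads in each heterozygote, and average position of the mutation allele along a read) could be expressed roughly as follows. This is only a sketch in plain Python, not the Variant Tools or Excel procedure actually used: the class names, field names, and every threshold value are hypothetical placeholders, since the protocol does not report the cut-offs.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SiteStats:
    """Per-site summary values (hypothetical field names)."""
    total_depth: int                    # total read depth at the site
    n_covered: int                      # individuals with coverage at the site
    het_alt_fractions: Sequence[float]  # mutation-read fraction, one per heterozygote
    mean_alt_read_pos: float            # mean relative position (0..1) of the mutation allele along a read


@dataclass(frozen=True)
class Thresholds:
    """Placeholder cut-offs; the values used in the protocol are not stated."""
    min_total_depth: int = 100
    min_n_covered: int = 50
    het_fraction_range: Tuple[float, float] = (0.3, 0.7)
    read_pos_range: Tuple[float, float] = (0.1, 0.9)


def passes_filters(site: SiteStats, t: Thresholds = Thresholds()) -> bool:
    """Return True only if the site satisfies all four criteria."""
    frac_lo, frac_hi = t.het_fraction_range
    pos_lo, pos_hi = t.read_pos_range
    return (
        site.total_depth >= t.min_total_depth
        and site.n_covered >= t.min_n_covered
        and all(frac_lo <= f <= frac_hi for f in site.het_alt_fractions)
        and pos_lo <= site.mean_alt_read_pos <= pos_hi
    )
```

A site would be kept only when `passes_filters(site)` returns `True`; in practice the thresholds would be tuned to the coverage of each dataset.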
Statistical analysis, including calculation of carrier rates (%), was performed in Excel. The ExAC Browser (http://exac.broadinstitute.org/) was additionally searched for the mutation alleles of the 18 inherited disorders. A global map of carrier rate distribution was manually constructed for 15 recessive disorders collated from literature sources. The world map was obtained from Free Editable Worldmap (http://free-editable-worldmap-for-powerpoint.en.softonic.com/) and modified. […]
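The protocol does not give the formula behind the carrier rate (%). One common way to estimate it from allele counts is the Hardy-Weinberg heterozygote frequency 2pq, where q is the frequency of the disease allele and p = 1 - q. The sketch below assumes that model; the function name and the example count are hypothetical, and only the total of 15,190 haploid exomes comes from the text.

```python
def carrier_rate_percent(allele_count: int, haploid_exomes: int) -> float:
    """Estimated heterozygous carrier rate (%) under Hardy-Weinberg (2pq).

    allele_count   -- observed copies of the disease allele(s), summed over
                      all causative mutations counted for one disorder
    haploid_exomes -- number of haploid exomes (chromosomes) sampled
    """
    if haploid_exomes <= 0:
        raise ValueError("haploid_exomes must be positive")
    if not 0 <= allele_count <= haploid_exomes:
        raise ValueError("allele_count must lie between 0 and haploid_exomes")
    q = allele_count / haploid_exomes  # disease-allele frequency
    return 100.0 * 2.0 * q * (1.0 - q)  # 2pq expressed as a percentage


# Illustrative call with a made-up allele count over the full dataset size.
rate = carrier_rate_percent(allele_count=38, haploid_exomes=15190)
print(f"Estimated carrier rate: {rate:.3f}%")
```

For rare alleles 2pq is close to 2q, so the estimate is nearly the allele count divided by the number of diploid individuals; the full expression is kept here so that it also behaves sensibly for common alleles.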

Pipeline specifications

Software tools: vtools, VariantTools
Databases: ClinVar, dbSNP, OMIM, HGMD, UCSC Genome Browser, ExAC browser
Application: Genome data visualization
Diseases: Anemia, Sickle Cell; Genetic Diseases, Inborn