Computational protocol: Whole Exome Sequencing to Identify Genetic Variants Associated with Raised Atherosclerotic Lesions in Young Persons

Similar protocols

Protocol publication

[…] Genomic DNA samples were purified from archived PDAY liver tissues using an automated paramagnetic bead based extraction system (Promega Magnasil). The genomic DNA samples were used for exome capture with TargetSeq reagents (Life Technologies, Inc.) based on high density oligonucleotide hybridization of GENCODE annotated coding exons, NCBI CCDS, exon flanking sequences (including intron splice sites), small non-coding RNAs (e.g., microRNAs) and a selection of miRNA binding sites. We also designed custom probes to capture promoter regions located within ~200 bp upstream from transcriptional start sites that typically contain the majority of transcription factor binding sites. These custom capture probes included proximal promoter regions for 18,661 genes after excluding pseudogenes, unidentified LOC genes, repetitive non-coding RNAs (tRNAs, ribosomal RNAs), and intractable repeat regions (polynucleotides and Alu repeats). After capture, we used automated library construction (AB Library Builder, Life Technologies, Inc.) with addition of barcodes for multiplexed sequencing. We began sequencing with the oligonucleotide ligation-based SOLiD 5500xl platform (Life Technologies, Inc.) that performs massively parallel sequencing of individual DNA molecules amplified on beads affixed to glass slides (288 subjects). For the remaining 697 subjects, we used the Ion Proton platform (Life Technologies, Inc.), based on proton assays for polymerase sequencing of individual DNA molecules in wells of modified semiconductor chips. For both platforms, we sequenced each case with their corresponding controls in the same sequencing run (SOLiD slide, Proton chip). We constructed BAM files (CUSHAW for SOLiD, tmap1 aligner for Proton) for variant calling using Freebayes algorithms, and annotated the resulting VCF files using VEP. We created a single VCF by merging the calls from each sequencing platform, and filled in the reference sequences (non-variable positions) using a jointly called VCF from the two platforms. [...] For all association analyses, we used linear mixed-model analyses from the EPACTS software pipeline (http://genome.sph.umich.edu/wiki/EPACTS) to adjust for population stratification and to minimize batch effects from sequencing platforms by calculating the kinship matrix between samples from the exome VCF file. EPACTS supports linear mixed-model based tests for both single variant tests (common variants) and gene-based tests (low frequency and rare variants). Age, gender, and race were used as covariates in all statistical tests. First, we evaluated association signals of individual common variants with minimum minor allele frequencies (MAF) of 0.01 and with minimum minor allele count of 5. We used the ‘q.emmax’ test in the EPACTS pipeline, which is an implementation of the linear mixed-model. We treated case/control status as a quantitative trait after adjusting for covariates. We then tested associations of low frequency and rare variants (MAF < 0.05) using mixed-model gene-based tests to find whether functional rare variants in each gene are jointly associated with the phenotype. Variants were annotated by Variant Effect Predictor (VEP) with gene symbols and functional effects, and only missense, nonsense, and splice variants with MAF less than 0.05 were included in the gene-based tests. We used two different EMMAX-based rare variant tests: the collapsing test (emmaxCMC) and the kernel-based test (mixed model SKAT) to test different hypotheses. The collapsing test assumes all functional rare variants in the gene would have effects of the same direction, while the kernel-based test assumes that each variant can have either positive or negative effect to the phenotype. Single-variant and gene-based test results were then sub-selected using the list of candidate genes from CAD GWAS. For single-variant results, we also ran conditional analysis to evaluate dependence of our top hit from the candidate region (rs4773144) to known GWAS variants in COL4A1/COL4A2. All statistical tests used Bonferroni corrections to account for multiple testing. […]

Pipeline specifications

Software tools FreeBayes, EPACTS, EMMAX, VEP, RVTESTS, SKAT
Applications WES analysis, GWAS
Organisms Homo sapiens