Computational protocol: Combined examination of sequence and copy number variations in human deafness genes improves diagnosis for cases of genetic deafness

Similar protocols

Protocol publication

[…] High-molecular weight gDNA (5 μg) was fragmented ultrasonically with the Covaris E210 DNA shearing instrument (Covaris Inc., Woburn, MA) to an average size of 300 bps for subsequent construction of Illumina NGS libraries. The Covaris protocol is set at 3 minutes total duration, duty cycle 10%, intensity 5, and 200 cycles per burst. Fragmented gDNA libraries for sequencing on the Illumina platform (HiSeq2000) were prepared with the NEBNext™ DNA Sample Prep Master Mix set (E6040, NEB Biolab, Ipswich, MA). End repair of DNA fragments, addition of a 3’ adenine (A), adaptor ligation, and reaction clean-up were carried out following the manufacturer’s protocol. gDNA libraries were cleaned and size selected using the AMPure DNA Purification kit (Beckman Agencourt, Danvers, MA). The ligated product (20 ng) was amplified for 14 PCR cycles with Illumina PCR primers InPE1.0 and indexing primer following the manufacturer’s instructions. The PCR products were purified again with QIAquick MinElute column and eluted into 50 μl of hybridization buffer (HB, Roche NimbleGen, Madison, WI).Targeted capture of deafness genes (Additional file : Table S1) and the examination for the enrichment of these genes by quantitative PCR (qPCR) were carried out by protocols described in our published paper []. Systematic comparison of different tools in population-scale genomic CNV analysis found notable differences between different methods used in identifying genomic regions ascertained, size-range and breakpoint []. Since only partially-overlapping CNVs are identified using different methods, some studies used as many as 36 CNV call-sets to help improve the accuracy of finding high-confidence CNVs []. In our study, FASTQ data files generated after sequencing with the Illumina HiSeq2000 were processed using four independent bioinformatic data processing pipelines, since the use of multiple independent bioinformatic pipelines help improve accuracy in finding high-confidence CNVs []. These bioinformatic platforms are:(1) A web-based open source platform called Galaxy (https://usegalaxy.org/), which runs on a local Linux server. The tools we used on Galaxy Platform were: (a) BWA (ver. 0.7.4), to generate SAM (Sequence Align/Map) files; (b) Samtools (ver.0.1.19), which were used to transform Binary SAM into BAM and were sorted with samtools; (c) Picard (ver. 1.79), this was used to remove PCR duplicates in the sorted BAM files; (d) GenomeAnalysisTK-1.6 (GATK-1.6). The duplicate-removed BAM files were used as inputs of GATK-1.6 for InDel re-alignment and base quality recalibration using known InDels from dbSNP137 and the 1000 Genome project. Target region coverage and VCF (Variant Calling Format) files of SNP/InDel calling were generated by GATK, based on processed BAM files.(2) A software package developed at the Broad Institute called Genome Analysis Toolkit (GATK) (http://www.broadinstitute.org/gatk/).(3) A paid commercial bioinformatic analysis platform (http://www.dnanexus.com).(4) A proprietary bioinformatic processing pipeline called NGSeq, developed in-house specifically for analyzing mutations in targeted deafness genes.The variation and CNV results presented here are consensus results of the four bioinformatic pipelines. The consensus rates among different platforms were high, with the lowest being 94.5% between the GATK and Galaxy pipelines. Sequencing results obtained by the NGS method for the coding region of the GJB2 gene were compared to those obtained by the Sanger method (N = 8) for validation purposes. A high percentage (99.2%) of detected sequence variations (SNPs) in the GJB2 gene as identified by the two methods agreed with each other. All samples processed in this project had an on-target average coverage of greater than 100 (Additional file : Table S1). Distribution of average coverage (n = 10) at 0x, 1x, 10x & 20x was 1.1 ± 0.5, 4.8 ± 1.1, 5.6 ± 1.9 and 11.2 ± 2.9, respectively. After controlling for data quality (coverage ≥20 and Phred-like quality score ≥30), we obtained the VCF reports for the coding regions and the exon-intron boundaries of the targeted deafness genes (Additional file : Table S1). Variations contained in VCF reports were filtered by a custom knowledge database in order to identify candidate disease-causing mutations. Information in this database was collected from the following sources: (1) Human Genome Mutation Database (HGMD); (2) data from unpublished mutation and normal hearing control results obtained in the authors’ lab; (3) data made from consensus predications of both PolyPhen and Shift bioinformatic algorithms []. Variants with high population allele frequencies (>0.02), as reported by data collected from the 1000 genome project, were filtered out. The identified variations were classified into nine categories (see Additional file : Table S2). Mutations in these categories can be defined as disease-causing, highly likely to be disease-causing, predicted but unconfirmed to be disease-causing, carrier of mutations, and novel mutations with low population allele frequencies (less than 0.005) of unknown significance.In the hybridization step for capturing the targeted deafness genes, we normally hybridized 20–30 samples together. Samples were differentiated by different barcodes. In-house generated cDNA capture probes with lengths ranging from 100 to ~5000 bps were used. Biotinylated probes were hybridized overnight with fragmented genomic DNA (gDNA) at 47°C in a thermocycler (Bio-Rad T100, Hercules, CA). Captured gDNA fragments were enriched using streptavidin dynabeads (Beckman Coulter, Brea, CA). For targeted NGS projects, on-target coverage after the capturing step is a more appropriate index for describing the quality of sequencing, not the number of overall reads. The average depth of coverage for each exon of targeted deafness genes was used in the calculation of CNVs. Compared to previous methods used to detect CNVs from whole human genome or exome NGS data [], this is a novel method used to specifically detect CNVs from NGS data obtained in a disease (e.g., deafness) panel. The mean coverage can be affected by CNVs, the relative quantity of DNA in each sample loaded for hybridization, and the experimental conditions. By analyzing the CNVs only for samples that were hybridized together, we could control the level of variation introduced by the experimental conditions. The relative quantity of loading DNA for each sample was normalized by the summed total of the average depth of coverage (Sd), which yielded a normalized average depth of coverage Nadc = (depth of coverage of specific exon)/Sd. The ratio of CNVs (RCNV) for each exon of the deafness genes was calculated by the following equation: R CNV = N adc / ( average N adc of all other samples in the same hybridization for this exon ) For normal copy numbers (two copies) of any exons, the ratio should be close to one. Heterozygous and complete deletions would result in a ratio close to 0.5 and 0, respectively. CNV gains would yield ratios of 1.5 or larger (illustrated in Figure ). The resolution of our method used to detect CNVs is the size of a single exon as determined by comparing the average depth of coverage for each exon. Bioinformatic analysis of the gene structure showed that the average exon size for the 80 deafness genes (Additional file : Table S1) is 4491 bps.In order to validate our NGS-based approach in identifying the CNVs, we randomly selected 15 putative CNVs and examined these CNVs again using the independent qPCR approach. A commercially-available CNV test kit based on SYBR green chemistry (DyNAmo ColorFlash SYBR Green qPCR Kit, Thermo Scientific Inc., Pittsburgh PA) was used. A house-keeping gene (GAPDH) was used as a reference for the qPCR quantification. Primers for amplifying selected genes and reference genes (Additional file : Table S4) were designed by Primer3 (http://frodo.wi.mit.edu). Triplicate qPCR reactions were conducted for each sample, each using a reaction volume of 25 μl. The relative abundance of genomic DNA in samples was calculated by looking at the differences in the cycle numbers obtained in the linear increasing phase, using the following steps:1) Calculating threshold cycle (Ct) differences between the target gene and the reference gene ΔC t calibrator = C t target , calibrator – C t reference , calibrator ΔC t test = C t target , test – C t reference , test 2) Normalizing ΔCt of the test samples to the ΔCt of the calibrator: ΔΔCt = ΔCt test – ΔCt calibrator 3) Calculating the CNV ratio using following formula: CNV ratio = 2− ΔΔCt […]

Pipeline specifications

Software tools Galaxy, BWA, SAMtools, Picard, GATK, DNAnexus, PolyPhen, Primer3
Databases HGMD
Applications Miscellaneous, qPCR
Organisms Homo sapiens
Diseases Genetic Diseases, Inborn