Computational protocol: Comparison of solution-based exome capture methods for next generation sequencing

Protocol publication

[…] Human genome build hg19 (GRCh37) Primary Assembly (not including the unplaced scaffolds) was used as the reference sequence throughout the analyses. Both Agilent and NimbleGen have used exon annotations from the CCDS and miRNA annotations from the miRBase based on human genome build hg18 as the basis for their capture designs in the smaller kits. In the probe designs for the larger kits, Agilent has used the CCDS (March 2009), GENCODE, RefSeq, Rfam and miRBase v.13 annotations based on human genome hg19, whereas the NimbleGen SeqCap v2.0 design relies on the CCDS (September 2009), RefSeq (UCSC, January 2010), and miRBase (v.14, September 2009) annotations, as well as on additional genes from customer inputs. The updated kits included capture probes for unplaced chromosomal positions as well (namely, 378 probe regions in Agilent SureSelect 50 Mb and 99 in NimbleGen SeqCap v2.0), but these regions were removed from our further analyses.

CTRs were defined for all of the capture kits as the companies' given probe positions. These needed to be lifted over from the given hg18 build positions to the recent hg19 positions for the smaller kits, whereas the updated kits' designs had already been made using the hg19 build. In some of our statistics (see Results), we included the flanking 100 bp near all the given probe positions into the CTRs (CTR + flank). Exon annotations from the CCDS project build v59 (EnsEMBL) were used []. A common target region for the capture methods was defined as the probe regions that were included in all of the probe designs.

For the probe design comparisons (Figure ; Additional file ), the exon regions of interest were defined by combining CCDS and UCSC known exon [] annotated regions as well as all the kits' capture target regions into a single query. Overlapping genomic regions were merged as single positions in the query.
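The merging of overlapping regions into a single query, and the optional 100 bp flank (CTR + flank), can be illustrated with a minimal sketch. This is not the authors' code: the function name `merge_regions`, the tuple layout `(chrom, start, end)` and the half-open coordinate convention are assumptions made for illustration only.

```python
from collections import defaultdict


def merge_regions(regions, flank=0):
    """Merge overlapping regions per chromosome.

    regions: iterable of (chrom, start, end) tuples. Half-open coordinates
             (start inclusive, end exclusive) are ASSUMED here.
    flank:   number of bases added to both sides of every region before
             merging (e.g. 100 for a CTR + flank style target).
    Regions that merely touch (end == next start) are kept separate,
    because they do not share a base.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        # Clamp at 0 so a flank never produces a negative coordinate.
        by_chrom[chrom].append((max(0, start - flank), end + flank))

    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start < cur_end:
                # Overlap of at least one base: extend the current region.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged


example = [
    ("chr1", 100, 200),
    ("chr1", 150, 300),
    ("chr1", 400, 500),
    ("chr2", 10, 20),
]
merged_plain = merge_regions(example)
merged_flank = merge_regions(example, flank=100)
```

In practice the same operation is usually done with an interval tool on BED files; the pure-Python version is shown only to make the merging rule explicit.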
For any given kit, an exon region was considered to be included in the kit if its capture probe positions overlapped with the combined query for one base pair or more. The numbers of included exon regions are given in the figures.

All sequence data were analyzed using an in-house developed SAMtools-based bioinformatics pipeline for quality control, short read alignment, variant identification and annotation (VCP; Figure ). Image analyses and base calling of the raw sequencing data were first performed on the Illumina RTA v1.6.32.0 sequence analysis pipeline. In the VCP, the sequences were then trimmed of any possible B block in the quality scores from the end of the read. After this, if any pair had a read shorter than 36 bp, the pair was removed. The quality scores were converted to Sanger Phred scores using Emboss (version 6.3.1) [] and aligned using BWA (version 0.5.8c) [] against human genome build hg19. The genome was downloaded from EnsEMBL (version 59). After alignment, potential PCR duplicates were removed with Picard MarkDuplicates (version 1.32).

SNVs were called with SAMtools' pileup (version 0.1.8) []. The pileup results were first filtered by requiring the variant allele quality to be 20 or more and then with the SAMtools' VarFilter. We calculated quality ratios for the variants as a ratio of A/(A + B), where A and B were defined as follows: if there were call bases of both the reference base and variant base in the variant position, A was the sum of allele qualities of the reference call bases and B was the sum of allele qualities of the variant call bases; if there were two different variant call bases and no reference call bases, the variant call base with a higher allele quality sum was the A and the other call base was the B; if all the call bases in the variant position were variant calls of the same base, the quality ratio was defined to be 0.
In variant positions with call bases of more than two alleles the ratio was defined to be -1, and they were filtered from subsequent analyses. Finally, single nucleotide variants called by pileup were filtered in the VCP according to the described quality ratio: any variant call with a quality ratio of more than 0.8 was considered as a reference call and was filtered out. In addition, we included our own base calls for the called variants based on the quality ratio. Any call with a quality ratio between 0.2 and 0.8 was considered to be heterozygous and calls below 0.2 to be homozygous variant calls.

For the control I sample, GATK base quality score recalibration and genotype calling were done with recommended parameter settings for whole exome sequencing []. Known variants for quality score recalibration were from the 1000 Genomes Project (phase 1 consensus SNPs, May 2011 data release).

In addition to SNVs, small indels were called for the control I sample using SAMtools' pileup as well. The results were filtered by requiring the quality to be 50 or more and then with the SAMtools' VarFilter. No alleles other than the indel or reference allele calls were allowed for the indel variant positions.

We hypothesized that indel, inversion or translocation break points could be identified from the aligned sequence data by examining genomic positions, where a sufficient number of overlapping reads had the same start or end position without being PCR duplicates. Such positions could be caused by soft-clipping of reads done by BWA: if only the start of a read aligned to the reference sequence, but the rest of the read did not align adjacently to it, BWA aligned only the start of the read and reported a soft-clip from the un-aligned part. Another possible cause for these positions was B blocks in the quality scores, starting from the same position for the overlapping reads, and subsequent B block trimming. These positions were named REAs.
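For the SNV quality ratio A/(A + B) defined earlier in this protocol, with its 0.2 and 0.8 cut-offs, the rules can be written out as a short sketch. It is an illustration of the published rules, not the VCP implementation. The function names, the `(base, quality)` input format and the inclusive treatment of the 0.2 and 0.8 boundaries are assumptions. Qualities are assumed to be positive, which holds after the upstream filter on allele quality.

```python
from collections import defaultdict


def quality_ratio(calls, ref_base):
    """Return the quality ratio A / (A + B) for one variant position.

    calls:    list of (base, allele_quality) tuples for the position
              (hypothetical input format; qualities assumed > 0).
    ref_base: the reference base at the position.
    """
    sums = defaultdict(int)
    for base, qual in calls:
        sums[base] += qual

    alleles = set(sums)
    if len(alleles) > 2:
        return -1.0  # more than two alleles: flagged for filtering

    variants = alleles - {ref_base}
    if not variants:
        # Not covered by the protocol text: a position without any
        # variant call is not a variant position at all.
        raise ValueError("no variant call bases at this position")

    if ref_base in alleles:
        # Reference and one variant allele present.
        a = sums[ref_base]
        b = sums[next(iter(variants))]
    elif len(variants) == 2:
        # Two different variant alleles, no reference calls:
        # the allele with the higher quality sum plays the role of A.
        a, b = sorted(sums.values(), reverse=True)
    else:
        return 0.0  # all calls support the same variant base

    return a / (a + b)


def classify(ratio):
    """Map a quality ratio to a call, following the stated cut-offs."""
    if ratio == -1.0:
        return "filtered"
    if ratio > 0.8:
        return "reference"
    if ratio >= 0.2:  # boundaries treated as inclusive (assumption)
        return "heterozygous"
    return "homozygous_variant"


het = classify(quality_ratio([("A", 30), ("A", 30), ("G", 30), ("G", 30)], "A"))
hom = classify(quality_ratio([("G", 30), ("G", 35)], "A"))
ref = classify(quality_ratio([("A", 30)] * 9 + [("G", 30)], "A"))
multi = classify(quality_ratio([("A", 30), ("G", 30), ("T", 30)], "A"))
```

Here `het`, `hom`, `ref` and `multi` exercise the heterozygous, homozygous variant, reference and multi-allelic branches respectively.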
REAs were searched for in the control I sample from the aligned read file. At least five reads, all of them either starting or ending in the same position, and a minimum contribution of 30% to the total coverage in the position, were required for a REA to be reported. Associated soft-clipped sequences were reported together with REAs.

GC content was defined for the CTRs and the common target region as a mean percentage of G and C bases in the targets, calculated from human genome build hg19 (GRCh37) based FASTA formatted target files with the Emboss geecee script []. For the SNP analyses, GC content was defined as the percentage of G and C bases in the distinct target (for example, a single exon) adjacent to the SNP. Mapabilities were retrieved from the UCSC Table Browser using track: mapability, CRG Align 75 (wgEncodeCrgMapabilityAlign75mer). In this track, a mapability of 1.0 means one match in the genome for k-mer sequences of 75 bp, 0.5 means two matches in the genome and so on. Mean mapability was calculated for each distinct target region. Similarly for the SNP analyses, mapability for a SNP was defined as mean mapability in the region adjacent to the SNP.

Student's t-test was used to test for statistical significance in the differences between the sequence alignment results and between the SNV allele balances. T-distribution and equal variance were assumed for the results, though it should be noted that with a small number of samples the results should be interpreted with caution. Uncorrected two-tailed P-values are given in the text. […]
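The REA reporting criteria described above (at least five reads sharing a start or an end position, contributing at least 30% of the coverage at that position) can be sketched as follows. This is one possible reading of the criteria, not the authors' implementation. The function name `find_reas`, the representation of reads as `(start, end)` tuples with inclusive coordinates on a single chromosome, and the separate treatment of start and end groups are all assumptions. Duplicate removal is assumed to have happened upstream, and soft-clipped sequences are not handled.

```python
from collections import Counter


def find_reas(reads, min_reads=5, min_fraction=0.30):
    """Report positions where many reads share a start or an end.

    reads: list of (start, end) aligned intervals on one chromosome,
           inclusive coordinates, PCR duplicates already removed
           (hypothetical input format).
    Returns a list of (position, kind, n_reads, coverage) tuples,
    where kind is "start" or "end".
    """
    starts = Counter(start for start, _ in reads)
    ends = Counter(end for _, end in reads)

    reported = []
    for kind, counts in (("start", starts), ("end", ends)):
        for pos, n in sorted(counts.items()):
            if n < min_reads:
                continue
            # Total coverage: every read whose alignment spans the position.
            # A linear scan is fine for a sketch; real data would need an
            # indexed structure.
            coverage = sum(1 for start, end in reads if start <= pos <= end)
            if n / coverage >= min_fraction:
                reported.append((pos, kind, n, coverage))
    return reported


# Five reads starting at position 100 with different ends (so they are
# not duplicates), plus five background reads spanning position 100.
clustered = [(100, 150 + 5 * i) for i in range(5)]
light_background = [(60 + 2 * i, 135 + 2 * i) for i in range(5)]
heavy_background = [(50 + i, 125 + i) for i in range(20)]

hits = find_reas(clustered + light_background)
no_hits = find_reas(clustered + heavy_background)
```

With the light background the five clustered reads make up half of the coverage at position 100 and the position is reported; with the heavy background they make up only 5 of 25 reads and the position falls below the 30% threshold.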

Pipeline specifications

Software tools: SAMtools, EMBOSS, BWA, Picard, GATK
Databases: Rfam, miRBase, GENCODE
Application: WES analysis
Organisms: Homo sapiens