Computational protocol: Accurate detection of subclonal single nucleotide variants in whole genome amplified and pooled cancer samples using HaloPlex target enrichment

Protocol publication

[…] Sixteen bone marrow samples collected at diagnosis from patients with childhood acute lymphoblastic leukemia (ALL) were analyzed in this study (Additional file : Table S1). All patients were treated at Swedish centers according to the Nordic Society for Pediatric Haematology and Oncology (NOPHO) 1992 and 2000 ALL protocols []. The study was approved by the Regional Ethical Review Board in Uppsala, Sweden. The study was conducted according to the guidelines of the Declaration of Helsinki, and all patients and/or guardians provided written or oral informed consent.

We have previously performed whole genome sequencing of two of the ALL samples included in the study (ALL1 and ALL2), along with matched normal samples from the same patients (Lindqvist et al., manuscript in preparation). Both patients responded well to therapy, and at examination after cessation of therapy 2-2½ years after diagnosis they were found to be in first continuous complete remission (CCR 1), with <0.01% blast cells in the bone marrow according to PCR. The normal blood samples were collected 2½-3 years later, when the patients were still in CCR 1 with normal hematological parameters. The patients are today clinically well in CCR 1 another 2½-3 years later. Thus, there is good evidence that the normal blood samples did not contain any leukemic cells. The proportion of leukemic cells in the cancer samples was estimated to be >90% by light microscopy in May-Grünwald-Giemsa-stained cytocentrifuge preparations.

For whole genome sequencing, an average of 138 Gb of paired-end sequence data was generated for each sample using the HiSeq2000 or GAIIx instruments (Illumina). Sequence reads were trimmed from the 3’ end and aligned to the human reference genome (version hg19) using BWA version 0.5.9 [] with default parameters. Read realignment and base quality recalibration were performed using GATK version 1.0.5909 [].
Read realignment was performed around candidate indels identified during the run and indels previously called in the data using VarScan []. During base quality recalibration, dbSNP132 and the BAQ option were used. PCR duplicates were excluded, as were read pairs in which at least one read fulfilled any of the following criteria: trimmed to <25 bp, >3 mismatches, or MAPQ <30. Somatic SNVs were predicted with MuTect version 1.0.27200 [] and SomaticSniper version 1.0.0 [] with default parameters. MuTect SNVs labeled REJECT and SomaticSniper SNVs with somatic score <40 were discarded. In addition, SNVs with an allele fraction <0.2, SNVs present in dbSNP135, and SNVs overlapping a repeated region present in the tracks “rmsk” or “simpleRepeats” from the UCSC table browser were excluded from further analysis.

[...] Sequence reads were trimmed to remove Illumina adapter sequences with CutAdapt version 1.1 [] and aligned to the human reference genome (version hg19) with MOSAIK version 2.1.33 with default parameters. Realignment and recalibration of base quality scores using dbSNP137 were performed with GATK version 1.0.5909 []. Read realignment was performed around candidate indels identified during the run and around SNPs and indels in dbSNP137 that were located in the regions covered by the HaloPlex design. Reads with MAPQ = 0 were discarded.

Allele fractions at sites with candidate SNVs detected in WGS data and at germline SNPs were calculated with a custom Python script (publicly available at https://github.com/Molmed/Berglund-Lindqvist-2013). Variant calling was based on these allele fractions. In individual samples, a candidate SNV was classified as somatic if it fulfilled all of the following criteria: allele fraction ≥0.1 in the gDNA ALL sample, allele fraction <0.01 in the matched normal sample, and HaloPlex sequence depth ≥30 in both samples.
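The classification rule for individual samples can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' script from the repository above: the `SiteCounts` class, the `is_somatic` function, and the constant names are hypothetical, and the allele fraction is assumed to be the number of reads supporting the variant divided by the total depth at the site.

```python
from dataclasses import dataclass

# Thresholds taken from the text; the names are hypothetical.
MIN_TUMOR_AF = 0.1     # allele fraction >= 0.1 in the gDNA ALL sample
MAX_NORMAL_AF = 0.01   # allele fraction < 0.01 in the matched normal sample
MIN_DEPTH = 30         # HaloPlex sequence depth >= 30 in both samples


@dataclass
class SiteCounts:
    """Read counts at one candidate site in one sample (hypothetical container)."""
    alt_reads: int  # reads supporting the variant allele
    depth: int      # total reads covering the site

    @property
    def allele_fraction(self) -> float:
        # Assumption: allele fraction = variant reads / total depth.
        return self.alt_reads / self.depth if self.depth else 0.0


def is_somatic(tumor: SiteCounts, normal: SiteCounts) -> bool:
    """Return True if the site meets all three criteria stated in the text."""
    if tumor.depth < MIN_DEPTH or normal.depth < MIN_DEPTH:
        return False
    return (tumor.allele_fraction >= MIN_TUMOR_AF
            and normal.allele_fraction < MAX_NORMAL_AF)


if __name__ == "__main__":
    tumor = SiteCounts(alt_reads=45, depth=100)
    normal = SiteCounts(alt_reads=0, depth=80)
    print(is_somatic(tumor, normal))
```

Sites that fail the depth requirement return `False` here; a real pipeline might instead report them as not assessable, which the text does not specify.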
In pools, we considered a validated somatic SNV to be detected if the allele fraction was ≥0.05 divided by the number of samples in the pool.

For de novo SNV calling in pools, we investigated the allele fractions at every site in the 147 kb target region that had a sequence depth ≥30 per sample included in the pool. We only searched for variants in the unknown (i.e., not whole genome sequenced) samples. We applied several criteria to filter out putative germline SNPs and false positive calls. First, we assumed that variants present in more than one of the samples in a pool are likely to be germline, and we focused on finding variants with an allele fraction suggesting that they are present in only one sample. We set the expected allele fraction for such variants to 0.5 divided by the number of samples in the pool. We excluded variants with an allele fraction less than half or more than twice the expected value. Second, we filtered out all variants present in dbSNP137. Third, we excluded variants that had an allele fraction >1% in any of the other experiments included in the study except the replicate experiment. This step filters out germline variants that are not in dbSNP as well as putative false positive calls caused by alignment artifacts. Putative novel SNVs were validated by PCR amplification and Sanger sequencing of each of the samples included in the pools individually. Figures were generated using R version 3.0.1.

[...] PCR primers were designed using Primer3Plus []. PCR was performed using a Smart Taq Hot Thermostable DNA Polymerase Set (Naxo, Estonia) for 35 cycles. Sanger sequencing was performed with an ABI3730XL instrument at the Genome Center in Uppsala, Sweden. The sequence traces were analyzed with the Sequencher software (Applied Biosystems). […]
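The pool thresholds described above reduce to simple arithmetic, sketched below in Python. This is a hedged illustration rather than the authors' implementation: all function and parameter names are hypothetical, and the depth requirement of "≥30 per sample included in the pool" is interpreted here as a total pool depth of at least 30 times the number of pooled samples, which is an assumption.

```python
def detection_threshold(n_samples: int) -> float:
    """Minimum pool allele fraction for a validated somatic SNV: 0.05 / n."""
    return 0.05 / n_samples


def expected_single_sample_af(n_samples: int) -> float:
    """Expected pool allele fraction for a variant present in one sample: 0.5 / n."""
    return 0.5 / n_samples


def passes_de_novo_filters(af: float,
                           depth: int,
                           n_samples: int,
                           in_dbsnp: bool,
                           max_af_other_experiments: float) -> bool:
    """Apply the depth requirement and the three filters described in the text."""
    # Assumption: "depth >= 30 per sample" means total depth >= 30 * n_samples.
    if depth < 30 * n_samples:
        return False
    # Filter 1: keep allele fractions within half to twice the expected value.
    expected = expected_single_sample_af(n_samples)
    if af < expected / 2 or af > expected * 2:
        return False
    # Filter 2: drop variants present in dbSNP137.
    if in_dbsnp:
        return False
    # Filter 3: drop variants with allele fraction > 1% in any other experiment
    # (the replicate experiment is assumed to be excluded by the caller).
    if max_af_other_experiments > 0.01:
        return False
    return True


if __name__ == "__main__":
    n = 5
    print(detection_threshold(n), expected_single_sample_af(n))
    print(passes_de_novo_filters(af=0.09, depth=400, n_samples=n,
                                 in_dbsnp=False, max_af_other_experiments=0.0))
```

For a pool of five samples, the expected allele fraction of a variant private to one sample is 0.5 / 5 = 0.1, so the accepted window is 0.05 to 0.2.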

Pipeline specifications

Software tools: BWA, GATK, VarScan, MuTect, SomaticSniper, cutadapt, MOSAIK, Primer3, Sequencher
Databases: dbSNP
Applications: WGS analysis, qPCR
Diseases: Neoplasms