Computational protocol: Comprehensive statistical inference of the clonal structure of cancer from multiple biopsies

Protocol publication

[…] We assume that the next-generation sequencing data have been mapped to the reference genome and that the mapped BAM files are ready for analysis. Pre-processing of the data consists of three steps (Supplementary Fig. ). First, we identify the genomic sites that will be included in the model. Our model captures both CNA events and SNV events; therefore, two types of genomic sites are included. For CNAs, we consider germline heterozygous sites, since at these sites we can monitor not only absolute copy number changes (via the tumor-normal read depth difference) but also what happens to each of the two individual copies (via allelic imbalance). For SNVs, we consider the somatic mutation sites, i.e. sites that host an SNV event in any of the tumor biopsies. We use SAMtools to identify germline heterozygous sites from the germline (normal) BAM file, and MuTect to identify somatic mutation sites from the tumor and normal BAM files. Second, we filter out unreliable sites and reads using MuTect. Third, we adjust for GC content and mappability. Short reads from next-generation sequencers are not uniformly distributed across the genome: more reads are expected from regions with higher GC content and mappability. This bias cannot be fully removed by normalizing against another next-generation sequencing library (e.g. from a normal biopsy) from the same patient. We therefore use HMMcopy to correct the read counts for GC content and mappability. [...]
Unlike previous methods such as PhyloWGS, SPRUCE and Canopy, which treat CNA or SNV events as the entities in the model, our model THEMIS and its predecessor TITAN treat individual genomic positions as the entities and can therefore perform CNA calling during tumor heterogeneity analysis. Both THEMIS and TITAN are dynamic graphical models in which each frame represents a single genomic position and CNA events are captured by hidden Markov chains.
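As a rough illustration of the two signals monitored at germline heterozygous sites during pre-processing (allelic imbalance and the tumor-normal read depth difference), the sketch below computes both from raw read counts. The `SiteCounts` container, the function names and the library-size scaling are assumptions made for this example; they are not part of SAMtools, MuTect, HMMcopy or THEMIS, and the sketch does not perform the GC-content and mappability correction described above.

```python
import math
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Read counts at one germline heterozygous site (hypothetical container)."""
    tumor_ref: int     # tumor reads supporting the reference allele
    tumor_alt: int     # tumor reads supporting the alternate allele
    normal_depth: int  # total reads covering the site in the matched normal


def allelic_ratio(site: SiteCounts) -> float:
    """Fraction of tumor reads carrying the reference allele.

    Values near 0.5 indicate allelic balance; departures from 0.5
    indicate allelic imbalance.
    """
    depth = site.tumor_ref + site.tumor_alt
    if depth == 0:
        raise ValueError("no tumor reads at this site")
    return site.tumor_ref / depth


def log_depth_ratio(site: SiteCounts, tumor_total: int, normal_total: int) -> float:
    """log2 of the tumor/normal depth ratio, scaled by total library size.

    Library-size scaling is a simplification assumed for this sketch.
    """
    tumor_depth = site.tumor_ref + site.tumor_alt
    if tumor_depth == 0 or site.normal_depth == 0:
        raise ValueError("depth ratio is undefined for zero coverage")
    tumor_frac = tumor_depth / tumor_total
    normal_frac = site.normal_depth / normal_total
    return math.log2(tumor_frac / normal_frac)


if __name__ == "__main__":
    site = SiteCounts(tumor_ref=90, tumor_alt=30, normal_depth=60)
    print(allelic_ratio(site))
    print(log_depth_ratio(site, tumor_total=1_000_000, normal_total=1_000_000))
```

In practice these quantities would be computed only at sites that pass the filtering step, and the depths would first be corrected for GC content and mappability.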
Because THEMIS shares TITAN's model structure, it inherits five key assumptions from TITAN:
1. Two primary observed variables (allelic imbalance and the tumor-normal read depth ratio) reflect the underlying somatic genotype of the tumor at germline heterozygous sites.
2. CNA events span multiple contiguous germline heterozygous sites.
3. The observed NGS data come from heterogeneous cellular populations, including normal cells and tumor subpopulations.
4. Two mutation events are observed at the same cellular prevalence if and only if the two events come from the same subpopulation.
5. Only one CNA event can arise in only one tumor subpopulation at each genomic position.
Note that Assumption 4, although used by many tumor heterogeneity models, can fail when two different subclones in a tumor have the same cellular prevalence. The purpose of Assumption 5 is to keep the heterogeneity model simple and identifiable; however, it prevents us from modeling more complicated situations in which multiple CNAs arise in the same genomic region.
We usually have around 30–50 thousand germline heterozygous sites and several hundred somatic mutation sites in whole-exome sequencing data from a single biopsy. With reasonable sequencing depth (more than ∼100 reads per position on average), the underlying genotypes (i.e. the type of CNA event) can be inferred accurately from the contiguous germline heterozygous sites.
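The first and third assumptions (the observed allelic imbalance and read depth ratio reflect the tumor genotype, and the data come from a mixture of normal cells and tumor subpopulations) can be made concrete with a small mixture calculation. The sketch below assumes three populations: normal cells, tumor cells without the CNA (both diploid and heterozygous), and tumor cells carrying one CNA genotype at a given cellular prevalence. The function name, its parameters and the use of a diploid baseline for the depth ratio are assumptions for illustration; this is not presented as the emission model actually used by THEMIS or TITAN.

```python
import math


def expected_signals(normal_frac, prevalence, ref_copies, total_copies):
    """Expected reference-allele ratio and log2 depth ratio at one site.

    normal_frac  -- fraction of normal cells in the biopsy
    prevalence   -- fraction of tumor cells carrying the CNA (cellular prevalence)
    ref_copies   -- copies of the reference allele in cells with the CNA
    total_copies -- total copies of the locus in cells with the CNA
    """
    if not (0.0 <= normal_frac <= 1.0 and 0.0 <= prevalence <= 1.0):
        raise ValueError("fractions must lie in [0, 1]")
    if not (0 <= ref_copies <= total_copies):
        raise ValueError("need 0 <= ref_copies <= total_copies")

    # Weight of cells carrying the CNA; all other cells are assumed
    # diploid heterozygous (1 reference copy out of 2).
    w_event = (1.0 - normal_frac) * prevalence
    w_diploid = 1.0 - w_event

    ref = w_diploid * 1 + w_event * ref_copies
    total = w_diploid * 2 + w_event * total_copies
    if total == 0:
        raise ValueError("locus is absent from every cell in the mixture")

    # Depth ratio is taken relative to a diploid baseline (an assumption).
    return ref / total, math.log2(total / 2.0)


if __name__ == "__main__":
    # Single-copy loss of the alternate allele in half of the tumor cells,
    # with 20% normal contamination.
    ratio, log_depth = expected_signals(0.2, 0.5, ref_copies=1, total_copies=1)
    print(ratio, log_depth)
```

Under this simplification, different combinations of genotype and cellular prevalence can produce similar expected signals at a single site, which is one reason events spanning many contiguous sites are easier to resolve than isolated ones.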
Integrating the somatic mutation sites and germline heterozygous sites using two factorial Markov chains allows us to model sites that harbor both a CNA event and a somatic mutation. When the observed variables at a somatic mutation site suggest that the genotype or the subclone assignment at that site disagrees with the neighboring germline heterozygous sites, THEMIS can still infer the correct hidden genotype and subclone assignment based on the observed variables at the somatic mutation site. Furthermore, because there will typically be many contiguous germline heterozygous sites before and after this somatic mutation site, the disagreement will not be propagated to nearby germline heterozygous sites.
We adopted these particular modeling choices and assumptions based on the sequencing quality and depth in our data. However, we encourage users to adjust these modeling choices and assumptions as appropriate for their own data. The extensible modeling platform employed by THEMIS should make it easy to implement variants of the model proposed here. […]

Pipeline specifications

Software tools: SAMtools, MuTect, HMMcopy, PhyloWGS, TITAN, THetA
Applications: WGS analysis, WES analysis
Organisms: Homo sapiens