Computational protocol: Estimating and accounting for tumor purity in the analysis of DNA methylation data from cancer studies

Protocol publication

[…] The previously developed InfiniumPurify method for purity estimation is based on an important observation from 450 k methylation data: the number of probes with intermediate methylation levels is significantly greater in tumors than in normal samples. Many of the intermediately methylated CpG sites result from sample mixtures and carry information about the mixing proportion (tumor purity). InfiniumPurify first identifies a number of informative differentially methylated CpG sites (iDMCs) from cancer-normal comparisons and then estimates purity from the probability density of the methylation levels of the iDMCs. An important drawback of the previous version of InfiniumPurify is that the selection of iDMCs requires a number of cancer and normal samples. For cancer types with no or only a few normal samples in TCGA, such as ovarian carcinoma (no normal samples) or glioblastoma (only one normal sample), InfiniumPurify would fail or lack the statistical power to find reliable iDMCs. Our previous method was therefore able to provide estimated tumor purities for only nine cancer types in TCGA. This greatly limits the application of InfiniumPurify in smaller-scale studies or to new cancer types.

We obtained all 450 k methylation data from TCGA (8830 tumor samples and 703 normal samples for 32 cancer types) to study the effect of iDMC selection and purity estimation. We found that it is possible to use a group of “universal” normal samples to obtain iDMCs and then apply them to purity estimation for different cancers. We redesigned the purity estimation algorithm so that it can be applied to data without normal controls or replicates. The essence of the updated method is to combine normal samples from different tissue types, construct a panel of normal methylomes, and then detect iDMCs for each cancer type using this panel for downstream purity estimation.
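As an illustration of the panel-based iDMC selection described above, the following is a minimal Python sketch, not the InfiniumPurify implementation. It assumes a Wilcoxon rank-sum z-score as the per-probe ranking statistic (the excerpt does not name the test). The function names (`pool_normals`, `rank_sum_z`, `select_idmcs`), the `top_k` cutoff, the probe IDs, and the beta values are all hypothetical.

```python
from math import sqrt


def pool_normals(normals_by_tissue):
    """Merge per-tissue normal beta values into one 'universal' panel per probe."""
    panel = {}
    for tissue_betas in normals_by_tissue.values():
        for probe, betas in tissue_betas.items():
            panel.setdefault(probe, []).extend(betas)
    return panel


def rank_sum_z(tumor, normal):
    """Normal-approximation z-score of the rank-sum statistic for the tumor group.

    Tied values receive their average rank; no tie correction is applied
    to the variance (an assumption made to keep the sketch short).
    """
    pooled = sorted([(v, 0) for v in tumor] + [(v, 1) for v in normal])
    n1, n2 = len(tumor), len(normal)
    tumor_rank_sum = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        tumor_rank_sum += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    expected = n1 * (n1 + n2 + 1) / 2.0
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (tumor_rank_sum - expected) / sd


def select_idmcs(tumor_betas, normal_panel, top_k):
    """Return the top_k probes by |z|, labelled hyper- or hypo-methylated in tumor."""
    scores = {
        probe: rank_sum_z(betas, normal_panel[probe])
        for probe, betas in tumor_betas.items()
        if probe in normal_panel
    }
    ranked = sorted(scores, key=lambda p: abs(scores[p]), reverse=True)[:top_k]
    return {p: ("hyper" if scores[p] > 0 else "hypo") for p in ranked}


# Toy data: normals from two tissues stand in for the multi-tissue panel.
normals_by_tissue = {
    "tissue_1": {"cg_a": [0.10, 0.12], "cg_b": [0.80, 0.85], "cg_c": [0.50, 0.52]},
    "tissue_2": {"cg_a": [0.08, 0.11], "cg_b": [0.82, 0.90], "cg_c": [0.48, 0.55]},
}
tumor_betas = {
    "cg_a": [0.45, 0.50, 0.60, 0.40],
    "cg_b": [0.30, 0.45, 0.50, 0.40],
    "cg_c": [0.50, 0.49, 0.53, 0.51],
}

panel = pool_normals(normals_by_tissue)
idmcs = select_idmcs(tumor_betas, panel, top_k=2)
print(idmcs)
```

In this toy input, `cg_c` behaves the same in tumor and normal samples and is therefore not selected, while the two probes that separate tumors from the pooled normals are kept with their direction of change.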
Another important improvement in the current version of InfiniumPurify is that it does not rely on ABSOLUTE to calibrate the estimates. Therefore, all purity results in this paper are derived from 450 k methylation array data alone. Comparison with existing methods shows that tumor purities estimated using universal normal samples are comparable with those from the previous version, and even better for cancer types with only a small number of normal samples. The algorithm of the updated InfiniumPurify is illustrated in Fig. , and is detailed in the “” section.

Fig. 1 [...]

We further applied the proposed DM calling method to all TCGA data for which 450 k data were available. We compared the DMC calling results with those from minfi, arguably the most widely used package for 450 k data analysis, and RefFreeEWAS, which considers cell composition in DM calling. We ran minfi with default parameters and specified K = 2 in RefFreeEWAS, corresponding to two components (cancer and normal) in the cell mixture. We note that the comparison is not completely fair, since minfi does not consider purity and RefFreeEWAS is not designed for cancer-normal comparison (as discussed in the “” section). However, because there is currently no DM calling method that accounts for purity, the results presented in this section simply demonstrate that DM calling results can be significantly improved with proper consideration of purity. Even though there are a number of other DM calling tools for 450 k data, none of them considers tumor purity, so we expect them to provide results similar to those of minfi. For this reason, those methods are not included in the comparison.

First, we examined the sensitivity of DM calling. Figure  shows the number of significant DMCs (defined as false discovery rate (FDR) < 0.01) detected for all cancer types for which data are available. The proposed method detects the most DMCs in almost all datasets, demonstrating higher sensitivity.
This makes sense because, once purity is taken into account, the within-group variance among the cancer samples is reduced, leading to a more powerful statistical test. The gain in sensitivity can be substantial: for example, the number of DMCs detected in THCA (thyroid carcinoma) is almost double that detected by minfi. On average, the proposed method detects over 20% more DMCs than the other methods. We also investigated the overlaps of DMCs called by the different methods, shown as Venn diagrams in Additional file : Figure S10. The DMCs called by the three methods overlap considerably for all tested cancer types, especially between InfiniumPurify and minfi.

Fig. 4

We compared the absolute methylation differences for InfiniumPurify-exclusive, minfi-exclusive, and common DMCs from BRCA data. As shown in Additional file : Figure S11, InfiniumPurify-exclusive DMCs show a much higher methylation difference between matched tumor and normal samples than minfi-exclusive DMCs. This is because the InfiniumPurify-exclusive DMCs have large within-group variances, caused by variation in tumor purity, and thus cannot be detected by minfi. After correcting for purity, the within-group variances are reduced and these sites are called as DMCs. This further illustrates the importance of purity correction in DM calling.

Next, we looked at the spatial correlations of test statistics from the different methods. For each cancer type, we first selected pairs of CpG sites less than 50 base pairs apart and computed the Pearson’s correlation of their test statistics. It is known that methylation levels have strong spatial correlation; that is, nearby CpG sites usually have similar methylation levels. Therefore, the differential methylation statuses of nearby CpG sites are likely to be similar, and this is the reasoning behind grouping DMCs into DMRs in whole-genome methylation data.
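The pair-selection and correlation step just described can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that all same-chromosome pairs closer than 50 bp are used (the excerpt does not say whether only adjacent sites are paired), and the function names, coordinates, and test statistics are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def nearby_pairs(sites, max_dist=50):
    """Collect test-statistic pairs for CpG sites less than max_dist bp apart.

    sites: iterable of (chromosome, position, test_statistic) tuples.
    """
    by_chrom = {}
    for chrom, pos, stat in sites:
        by_chrom.setdefault(chrom, []).append((pos, stat))
    pairs = []
    for entries in by_chrom.values():
        entries.sort()  # order by genomic position within the chromosome
        for i, (pos_i, stat_i) in enumerate(entries):
            for pos_j, stat_j in entries[i + 1:]:
                if pos_j - pos_i >= max_dist:
                    break  # later sites are even farther away
                pairs.append((stat_i, stat_j))
    return pairs


def spatial_correlation(sites, max_dist=50):
    """Pearson correlation between the statistics of nearby CpG pairs."""
    pairs = nearby_pairs(sites, max_dist)
    return pearson([a for a, _ in pairs], [b for _, b in pairs])


# Toy input: (chromosome, position, test statistic) for seven CpG sites.
sites = [
    ("chr1", 100, 2.0), ("chr1", 120, 2.2),
    ("chr1", 500, -1.0), ("chr1", 530, -1.3),
    ("chr2", 100, 0.5), ("chr2", 140, 0.4),
    ("chr2", 400, 3.0),
]
pairs = nearby_pairs(sites)
r = spatial_correlation(sites)
print(len(pairs), round(r, 3))
```

Running the same function on the test statistics of each method for one cancer type would give one correlation per method, which is the quantity being compared across methods.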
Thus, we argue that a better DMC calling method should produce test statistics with stronger spatial correlation. Figure  compares the spatial correlations of test statistics from the three methods; the proposed method provides the highest correlation for all cancer types. This indicates that, by accounting for purity in DM detection, the DM statuses of nearby CpG sites become more similar.

We further looked at the correlations among test statistics from different types of cancer. Even though different cancer types have distinct etiologies, they also share many commonalities, such as hyper-methylation in CpG islands and genic regions and global hypo-methylation across the whole genome, especially for highly and moderately repeated DNA sequences. Hence, we believe that different cancers share many epigenetic dynamics and expect the test statistics to be well correlated across cancer types. Figure  shows, for each cancer type, the average correlation of its test statistics with those of other cancers. All inter-cancer correlations from the three methods are shown in Additional file : Figure S12. Overall, the test statistics from the proposed method are more strongly correlated, again suggesting that the results are more consistent.

Finally, we looked at the biological implications of the DM calling results. We first identified, for each method, the top 1000 genes with the most DMCs (termed DMGs). The DMCs mapped to these genes were then input to the gometh function in the missMethyl package to test their enrichment in “PATHWAYS_IN_CANCER” from KEGG. Compared with a simple Chi-square test, the gometh function adjusts for the bias arising from different numbers of probes on different genes, and thus provides more objective results. Figure  shows the -log10 of the p values for the enrichment of DMGs in “PATHWAYS_IN_CANCER,” which contains 328 genes involved in all cancer types. The p values from the proposed method are much smaller, indicating stronger enrichment.
We further examined the enrichment of DMGs in pathways related to specific cancer types (Additional file : Figure S13). Specifically, we looked at the enrichment of DMGs from COAD (colon adenocarcinoma) in the COLORECTAL_CANCER pathway, UCEC (uterine corpus endometrial carcinoma) in the ENDOMETRIAL_CANCER pathway, PRAD (prostate adenocarcinoma) in the PROSTATE_CANCER pathway, THCA in the THYROID_CANCER pathway, BLCA (bladder urothelial carcinoma) in the BLADDER_CANCER pathway, and LUAD (lung adenocarcinoma) in the NON_SMALL_CELL_LUNG_CANCER pathway. Again, the enrichments are in general stronger with the proposed method. These results support the conclusion that the proposed method generates more biologically meaningful results.

To better understand the differences in DM calling results between the proposed method and the other methods, we explored the raw data for CpG sites with substantial discrepancies between the test results from InfiniumPurify and minfi. Additional file : Figure S14 shows several examples of such CpG sites. These CpG sites are not statistically significant according to minfi, mainly because of the large variance in the cancer group. However, the middle panel shows the scatter plot of beta values versus purities, indicating that the large within-group variance is mostly caused by the variation in purity across samples. After correcting for the purity effect, as shown in the right panel, the adjusted beta values become higher and the means of the two groups are visibly different. This leads to a highly significant test result with tiny p values (p < 1e-20). These examples illustrate the importance of correcting for purity in the DM calling procedure.

Taken together, the results presented in this section show that the proposed DM calling method is more sensitive and accurate and provides more biologically interpretable results than existing methods. […]
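To make the variance argument concrete, here is a small Python sketch of how purity variation alone can inflate within-group variance at a single CpG site, and how a correction removes it. It assumes a simple two-component linear mixture, observed beta = purity × tumor beta + (1 − purity) × normal beta. The excerpt does not specify the correction model used by InfiniumPurify, so this model, the function name `adjust_for_purity`, and all numbers are assumptions for illustration.

```python
from statistics import mean, pvariance


def adjust_for_purity(observed_beta, purity, normal_mean):
    """Solve the assumed linear mixture for the tumor-only beta value.

    observed = purity * tumor + (1 - purity) * normal
    The result is clipped to the valid beta range [0, 1].
    """
    tumor_beta = (observed_beta - (1.0 - purity) * normal_mean) / purity
    return min(1.0, max(0.0, tumor_beta))


# Hypothetical site: low methylation in normals, high in pure tumor cells.
normal_betas = [0.18, 0.20, 0.22]
normal_mean = mean(normal_betas)
true_tumor_beta = 0.8

# Tumor samples differ only in purity, yet their observed betas spread widely.
purities = [0.3, 0.5, 0.7, 0.9]
observed = [p * true_tumor_beta + (1.0 - p) * normal_mean for p in purities]

# After adjustment, the samples agree and sit further from the normal mean.
adjusted = [adjust_for_purity(b, p, normal_mean) for b, p in zip(observed, purities)]

print("observed:", [round(b, 3) for b in observed])
print("adjusted:", [round(b, 3) for b in adjusted])
print("variance before:", pvariance(observed))
print("variance after: ", pvariance(adjusted))
```

In this toy case the adjusted values are both higher and far less variable than the observed ones, which mirrors the behaviour described for the example CpG sites: a smaller within-group variance and a larger mean difference both push a two-group test toward significance.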

Pipeline specifications