Computational protocol: A novel analysis strategy for integrating methylation and expression data reveals core pathways for thyroid cancer aetiology

Protocol publication

[…] Methylation is a region-specific rather than a gene-specific phenomenon: methylation at different regions of a gene can have different outcomes. In our methylation analysis, we investigated methylation in the first exon, 3'UTR, 5'UTR, gene body, intergenic regions and transcription start sites using the ChAMP package [], which is available in R. The ChAMP pipeline is specifically designed for the analysis of the Illumina HumanMethylation450k chip and uses a sliding-window approach (Probe Lasso) to annotate CpG regions with genomic locations [].

In array-based methylation experiments, both the Beta-value and the M-value are used to measure methylation levels. The Beta-value estimates the methylation level as the ratio of the methylated probe intensity to the overall intensity, whereas the M-value is a logit transformation of the Beta-value. We used Beta-values in our analysis because they are easier to interpret biologically: a Beta-value roughly corresponds to the percentage of methylation at a specific site [].

After obtaining intensity data from TCGA, intra-array normalization was performed using the BMIQ method [] to avoid the bias introduced by the Infinium type 2 probe design. To assess the similarity of the normalized methylation samples in both batches and in the pooled data, multidimensional scaling (MDS) plots based on the 1000 most variable probes, together with the corresponding hierarchical clustering plots, are shown in the accompanying figures. In the MDS and clustering plots, not all tumour samples clustered together and, specifically in Batch230, control samples fell into separate clusters. To confirm this observation, we repeated the analysis three times, double-checking the parameters each time. Overall, the separation was clearer for the pooled dataset, which showed distinct "control" clusters.
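The two metrics and the probe-selection step described above can be illustrated with a minimal sketch. It is not the ChAMP implementation: the function names, the probe IDs and the stabilising offset of 100 added to the denominator are assumptions made for illustration, and the M-value is taken as the base-2 logit of the Beta-value.

```python
import math
from statistics import pvariance


def beta_value(meth, unmeth, offset=100.0):
    """Beta = methylated / (methylated + unmethylated + offset).

    The offset is an assumed stabiliser for low-intensity probes;
    the source text only describes the ratio itself.
    """
    return meth / (meth + unmeth + offset)


def m_value(beta, eps=1e-6):
    """M = log2(beta / (1 - beta)); beta is clipped away from 0 and 1."""
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def top_variable_probes(beta_matrix, n):
    """Return the n probe IDs with the largest variance across samples.

    beta_matrix maps a probe ID to its list of Beta-values (one per sample).
    """
    ranked = sorted(
        beta_matrix,
        key=lambda probe: pvariance(beta_matrix[probe]),
        reverse=True,
    )
    return ranked[:n]


if __name__ == "__main__":
    # Hypothetical Beta-values for three probes across three samples.
    betas = {
        "probe_a": [0.1, 0.9, 0.5],
        "probe_b": [0.5, 0.5, 0.5],
        "probe_c": [0.4, 0.6, 0.5],
    }
    print(top_variable_probes(betas, 2))
    print(m_value(beta_value(900.0, 0.0)))
```

In a setting like the one described, `top_variable_probes` would be called with `n=1000`, and the resulting sub-matrix passed to an MDS or hierarchical clustering routine.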
Given that TCGA is a well-designed database, we were reluctant to exclude the outlying samples; we therefore continued the analysis without removing any samples, focusing on the pooled dataset. The better performance of the pooled data relative to the individual batches may be because pooling increases the confidence with which methylation and expression levels are measured for each gene, which in turn increases the significance associated with each gene.

After BMIQ normalization, the magnitude of batch effects was assessed and corrected using ComBat, an empirical Bayes method that corrects for technical variation related to the slide []. After pre-processing, analysis of copy number aberrations (CNA) and segmentation of methylation variable positions (MVPs) into biologically relevant differentially methylated regions (DMRs) were conducted using the "champ.MVP" function of the ChAMP package. To control for false positives, the Benjamini-Hochberg correction [] was applied to all p-values. […]
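The Benjamini-Hochberg step mentioned above can be sketched as follows. This is a generic implementation of the standard step-up adjustment, not code taken from ChAMP; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    For the p-value of rank r (1 = smallest) among m tests, the raw
    adjustment is p * m / r. Walking from the largest rank downwards and
    keeping a running minimum makes the adjusted values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values for four probes.
    raw = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"raw={p:.3f} adjusted={q:.3f}")
```

A probe would then be called significant when its adjusted p-value falls below the chosen false discovery rate threshold.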

Pipeline specifications

Software tools: ChAMP, BMIQ, ComBat
Databases: TCGA Data Portal
Application: DNA methylation array analysis
Diseases: Thyroid Neoplasms