Computational protocol: An integrative study on the impact of highly differentially methylated genes on expression and cancer etiology

Similar protocols

Protocol publication

[…] Methylation is a region-specific, rather than a gene-specific phenomenon; hence, methylation in different gene regions can lead to diverse consequences. In our methylation analysis, we benefited from the ChAMP pipeline [], which included in the R-Bioconductor package and is specifically designed for the analysis of Illumina HumanMethylation450k chip data. ChAMP employs a sliding window approach (Probe Lasso) for annotating CpG regions with genomic locations []. CHAMP allows of the investigation of methylation occurring in different genomic regions, including in the first exon, 3'UTR, 5'UTR, gene body, intergenic region and within 200 bp and 1500 bp proximities of the transcription start sites. Moreover, the beta values associated with each methylation change are used as estimates of methylation levels, which is the ratio of the methylation probe intensity to the overall intensity and provides an intuitive biological interpretation. []After downloading the methylation intensity data from TCGA, the BMIQ normalization method [] was applied to avoid the bias introduced by the Infinium type 2 probe design. The magnitude of the batch effects was corrected using the ComBat normalization method, which is an empirical Bayes-based method of correcting for technical variation related to a slide []. Moreover, because possible polymorphisms in an individual’s genome may affect the methylation status of probes, we excluded SNPs with frequencies greater than 0.05 based on the 1000 Genomes Project [].After pre-processing, analyses of copy number aberrations (CNA) and the segmentation of methylation variable positions (MVPs) into biologically relevant differentially methylated regions (DMRs) were conducted using the “champ.MVP” function of the CHAMP package. When conducting the analysis of copy number aberrations, we focused on the entire gene, rather than only including particular genomic regions. Individual tumor samples were evaluated against pooled normal samples, and the corresponding regions and segmental mean changes were reported in the output of the analysis. To determine whether copy number aberrations led to corresponding expression changes for the same gene, we used a segmental mean change of 2 as the threshold. Moreover, we annotated genes that exhibited both increases and decreases in different samples from the same dataset as “not important”.In contrast, to avoid false-positive results in the differential methylation analyses, the Benjamini-Hochberg calculation [] was applied for all p-values, and genes with detected Benjamini-Hochberg false discovery rates (FDR) below 0.1 were selected as “differentially methylated”.Additionally, the Illumina HumanMethylation450k chip provides information about more than 450,000 different regions predicted in approximately 22,000 genes in the human body. Consequently, there is usually more than one differentially methylated region that falls within the borders of a given gene, which causes discrepancies in the data. To solve this problem, we evaluated regions of differing methylation within each gene and defined a “general trend of change” for each gene by checking whether the majority of changes were upregulation or downregulation. Depending on the direction of the change, the CpG region exhibiting the greatest methylation change, an FDR below 0.1, and a change in the same direction as the general trend was taken as representing the change in the methylation level for the whole gene.To investigate the effects of large methylation changes, we first defined “large methylation change” threshold. For this purpose, we pooled all of the data and identified the distribution that best fit the data using the “fitdistr” function of the MASS R package []. Thus, the Cauchy distribution was found to best explain the data. We calculated the central value and scaling parameter for the pooled data as the Cauchy distribution parameters. For central value estimation, we took the truncated mean of the middle 24% of the sample order statistics, which has been demonstrated to be valid for the Cauchy distribution []. In our pooled data, we detected a central value of 16.8%. In contrast, the log-likelihood function was used for the scaling parameter. The corresponding log-likelihood formula can be found below: ∑i=1nγ2γ2+[χi−χ0]2−n2=0 where n is the sample size; y is the scaling parameter; and x0 is the central value. After the calculation of each sample value, we identified a scaling parameter of 7.77. We used the central value plus two scaling parameters away from the center as the “large methylation change threshold”; thus, we set 32.2% as the high methylation threshold. [...] RNA sequencing analyses for all cancer types were performed using the edgeR [] Bioconductor [] package. The raw RNA sequencing reads associated with each sample were not available on the TCGA Server; hence, the quality control, pre-processing, mapping and counting procedures were performed by the providers of the data [, , , ]. We worked on counting the data produced via the RSEM procedure [], and we applied EdgeR for the detection of differential expression between the tumor and control samples. EdgeR benefits from empirical Bayes estimation and tests based on the negative binomial distribution []. Similar to the methylation analysis, we performed Benjamini-Hochberg corrections [] for all p-values. Finally, genes showing a fold-change >2 were selected as “differentially expressed” for our analysis. […]

Pipeline specifications

Software tools edgeR, RSEM
Databases TCGA Data Portal
Application RNA-seq analysis
Diseases Neoplasms