[…] hile keeping variation associated with the covariate of interest (such as case/control status). Unsurprisingly, batch effects have been observed in studies using the 450k array [].

As an example of unwanted variation that is biological in origin, we draw attention to the issue of cell-type heterogeneity, which has received considerable attention in the literature on DNA methylation [-]. This issue arises when primary samples are profiled; primary samples are usually a complicated mixture of cell types. This mixture can substantially increase the unwanted variation in the data and can even confound the analysis if the cell-type distribution depends on a phenotype of interest. It has been shown that SVA can help mitigate the effect of cell-type heterogeneity [], but other approaches are also useful [-].

In this work, we propose an unsupervised method that we call functional normalization, which uses control probes to act as surrogates for unwanted variation. We apply this method to the analysis of 450k array data, and show that functional normalization outperforms all existing normalization methods in the analysis of data sets with global methylation differences, including studies of human cancer. We also show that functional normalization outperforms the batch removal tools SVA [,], ComBat [] and RUV [] in this setting. Our evaluation metrics focus on assessing the degree of replication between large-scale studies, arguably the most important biologically relevant end point for such studies. Our method is available as the ‘preprocessFunnorm’ function in the minfi package [] through the Bioconductor project [].

The 450k array contains 848 control probes. These probes can roughly be divided into negative control probes (613), probes intended for between-array normalization (186) and the remainder (49), which are designed for quality control, including assessing the bisulfite conversion rate (see and Additional file : Supplementary Materials).
Importantly for our proposed method, none of these probes are designed to measure a biological signal.

Figure a shows a heat map of a simple summary (see ) of these control probes, for 200 samples assayed on four plates (Ontario data set). Columns are the control measure summaries and rows are samples. Th […]
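To make the idea of control probes acting as surrogates for unwanted variation more concrete, the following is a minimal sketch on simulated data. It assumes that a samples-by-control-summaries matrix is reduced to a few covariates by principal component analysis; that choice is an illustrative assumption and is not taken from the passage above. The function name `control_surrogates`, the number of control summaries (40), the size of the plate shifts and the noise level are all hypothetical. Only the layout of 200 samples on four plates mirrors the text. This is not the implementation of ‘preprocessFunnorm’.

```python
import numpy as np


def control_surrogates(control_summaries, n_components=2):
    """Return leading principal-component scores of a samples x controls matrix.

    Hypothetical helper for illustration only; not part of minfi.
    """
    X = np.asarray(control_summaries, dtype=float)
    # Center each control summary (column) across samples.
    X = X - X.mean(axis=0)
    # Scale columns to unit variance so no single summary dominates.
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # leave constant columns untouched
    X = X / sd
    # SVD of the standardized matrix; U * s gives the component scores.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * s[:n_components]


# Simulated example (assumption): 200 samples on four plates, 40 summaries.
rng = np.random.default_rng(0)
plate = np.repeat(np.arange(4), 50)                   # plate label per sample
plate_shift = np.array([-1.5, -0.5, 0.5, 1.5])[plate]  # technical shift per sample
loadings = rng.normal(size=40)                        # how each summary responds
controls = np.outer(plate_shift, loadings) + 0.3 * rng.normal(size=(200, 40))

scores = control_surrogates(controls, n_components=2)

# The first component should track the simulated plate effect closely,
# even though no biological signal was used to compute it.
corr = np.corrcoef(scores[:, 0], plate_shift)[0, 1]
print(f"|correlation| between component 1 and plate shift: {abs(corr):.3f}")
```

In this toy setting the scores could then serve as covariates when adjusting the methylation measurements. Because control probes are not designed to measure biological signal, such covariates are unlikely to absorb variation associated with the phenotype of interest.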

