Computational protocol: Estimation of Cell Type Composition Including T and B Cell Subtypes for Whole Blood Methylation Microarray Data

Protocol publication

[…] Each data set from each study and each cell type was processed independently using the same quality control and normalization pipeline. All data sets were preprocessed from raw beta values, which represent the proportion of methylation at each CpG site for each sample. First, any data point for which no significant signal above background could be detected was set to missing, using a cutoff of 0.01 for Illumina's detection p-value. Next, all CpGs with >10% missing data in the data set were removed, and all samples with >1% missing data were removed. To allow normalization, missing values were imputed using the impute.knn function of the impute package in R version 3.1.1. Data were then batch normalized using the ComBat function (Johnson et al.), applied to subsets of 20,000 CpGs run in parallel to improve computational efficiency. For the purposes of batch correction, a batch was defined as a single array consisting of 12 samples. For smaller data sets in which all samples were run on a single array, the batch normalization step was omitted. Next, samples were normalized to adjust for differences between the Infinium I and Infinium II probe chemistries on the M450 array using a method that fits a polynomial curve to adjacent Infinium I and Infinium II CpGs within 50 bp of one another (Absher et al.). Supplementary Figure displays the results of the normalization method on the global distribution of beta values in comparison to the raw beta values and BMIQ-normalized beta values (Teschendorff et al.), a widely used normalization method for M450 data. Finally, missing values were reintroduced at every position where an imputed value had been placed prior to normalization.

A principal component analysis (PCA) was conducted using a random subset of 5000 CpGs for all data sets with main cell type data.
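A PCA on a random CpG subset of this kind can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' code: the function name `pca_scores`, the fixed random seed, the choice to drop CpGs with any missing value before sampling, and the synthetic two-cell-type data are all assumptions made for the example.

```python
import numpy as np

def pca_scores(betas, n_cpgs=5000, n_components=2, seed=0):
    """Project samples onto the principal components of a random CpG subset.

    betas: array of shape (n_cpgs_total, n_samples) holding beta values.
    Returns (scores, explained): scores has shape (n_samples, k) and
    explained holds the fraction of variance carried by each component.
    """
    betas = np.asarray(betas, dtype=float)
    # Assumption: only CpGs observed in every sample enter the PCA.
    complete = ~np.isnan(betas).any(axis=1)
    betas = betas[complete]
    if betas.shape[0] == 0:
        raise ValueError("no CpG is observed in every sample")

    # Draw the random subset of CpGs (5000 in the protocol text).
    rng = np.random.default_rng(seed)
    k = min(n_cpgs, betas.shape[0])
    rows = rng.choice(betas.shape[0], size=k, replace=False)

    x = betas[rows].T                       # samples x CpGs
    x = x - x.mean(axis=0, keepdims=True)   # center each CpG across samples
    u, s, _ = np.linalg.svd(x, full_matrices=False)

    n_components = min(n_components, s.size)
    scores = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]

# Synthetic example: two hypothetical cell types with four samples each.
demo_rng = np.random.default_rng(1)
type_a = demo_rng.normal(0.2, 0.02, size=(6000, 4))
type_b = demo_rng.normal(0.8, 0.02, size=(6000, 4))
demo = np.clip(np.hstack([type_a, type_b]), 0.0, 1.0)
scores, explained = pca_scores(demo)
```

Plotting the first two columns of `scores`, colored by cell type, gives the kind of clustering view that can be inspected for separation between cell types and for outlying samples.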
The PCA was used to identify the data sets best suited to our model, namely those showing clean clustering and clear separation of one cell type from another. It was also used to identify outliers within each cell-type set, which were excluded from the analysis. Additional PCAs were conducted independently in the same manner for CD4+ T cell subtypes, CD8+ T cell subtypes, and B cell subtypes. For each cell type and subtype, the median across all QC-filtered samples of that cell type was calculated for each QC-filtered CpG and used as the covariate basis for that cell type in the model. For model fitting and estimation, all CpGs containing SNPs within the probe sequence with minor allele frequencies above 0.01 were removed from the data. […]
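The quality-control filtering and the per-cell-type median described above can be sketched as follows. This is a simplified illustration using only the Python standard library, not the authors' R pipeline. The function names, the use of `None` as the missing marker, the treatment of a p-value exactly at the cutoff as detected, and the toy CpG and sample identifiers are all assumptions. Imputation, batch correction, and probe-type normalization are not shown.

```python
from statistics import median

MISSING = None  # marker for a value set to missing

def mask_undetected(betas, det_p, cutoff=0.01):
    """Set beta values to missing where the detection p-value exceeds cutoff."""
    return [
        [b if p <= cutoff else MISSING for b, p in zip(b_row, p_row)]
        for b_row, p_row in zip(betas, det_p)
    ]

def missing_fraction(values):
    """Fraction of missing entries; an empty list counts as fully missing."""
    if not values:
        return 1.0
    return sum(v is MISSING for v in values) / len(values)

def qc_filter(betas, cpg_ids, sample_ids,
              max_cpg_missing=0.10, max_sample_missing=0.01):
    """Drop CpGs with >10% missing values, then samples with >1% missing.

    betas is a list of rows (one per CpG), each with one value per sample.
    Assumption: sample missingness is computed on the retained CpGs.
    """
    keep_rows = [i for i, row in enumerate(betas)
                 if missing_fraction(row) <= max_cpg_missing]
    rows = [betas[i] for i in keep_rows]
    keep_cols = [j for j in range(len(sample_ids))
                 if missing_fraction([r[j] for r in rows]) <= max_sample_missing]
    filtered = [[r[j] for j in keep_cols] for r in rows]
    return (filtered,
            [cpg_ids[i] for i in keep_rows],
            [sample_ids[j] for j in keep_cols])

def median_profile(betas):
    """Per-CpG median across samples of one cell type, ignoring missing values."""
    profile = []
    for row in betas:
        observed = [v for v in row if v is not MISSING]
        profile.append(median(observed) if observed else MISSING)
    return profile

# Toy data for one hypothetical cell type: 3 CpGs x 3 samples.
cpg_ids = ["cg_a", "cg_b", "cg_c"]
sample_ids = ["s1", "s2", "s3"]
betas = [[0.10, 0.12, 0.11],
         [0.50, 0.55, 0.52],
         [0.80, 0.90, 0.85]]
det_p = [[0.001, 0.002, 0.001],
         [0.001, 0.200, 0.001],   # cg_b fails detection in s2
         [0.001, 0.001, 0.003]]

masked = mask_undetected(betas, det_p)
filtered, kept_cpgs, kept_samples = qc_filter(masked, cpg_ids, sample_ids)
basis = median_profile(filtered)
```

Running the same steps once per cell type and subtype would give one median vector per type. Placed side by side as columns, these vectors form the covariate basis referred to in the text.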

Pipeline specifications

Software tools: ComBat, BMIQ
Application: DNA methylation array analysis