Computational protocol: A Transcriptional Signature of Fatigue Derived from Patients with Primary Sjögren’s Syndrome

Protocol publication

[…] Gene expression data were prepared for analysis using the microarray packages provided by Bioconductor, as described by Cockell and colleagues. Data were transformed to stabilise the variance across probes before robust spline normalisation using the lumi package. The arrayQualityMetrics package was used to detect outliers. The lumi command detectionCall was used to filter probes at a detection p-value threshold of 0.01. This filtering step was omitted before gene set enrichment analysis (GSEA), since the algorithm requires unfiltered data. Batch effects were removed using ComBat. Gene annotations were retrieved from the lumiHumanAll.db package.

The expression data were then analysed using several parallel approaches:

1. Differentially expressed genes between "high fatigue" and "low fatigue" pSS patients were identified using the limma package at a fold-change cutoff of 1.2 and a p-value cutoff of 0.05 after adjustment using the Benjamini-Hochberg false discovery rate. Other clinical factors were corrected for by inclusion in the linear fits.

2. The Fatigue VAS scores were analysed as a continuous variable by fitting a linear regression model to the expression data, including both the pSS and healthy control groups. Since fatigue data were not available for the controls, their individual scores were set to 0. Other clinical factors were corrected for by inclusion in the regression models. The p-values were adjusted using the Benjamini-Hochberg false discovery rate and a significance cutoff of 0.05 was applied.

3. The type I IFN signature was calculated for all patients based on the five IFN-induced genes identified by Brkic and colleagues. Scores were calculated for each patient as the number of healthy control standard deviations above the healthy control mean, summed over all five genes, as described by Kirou and co-workers. Patients with a score exceeding 10 were considered to be IFN-positive.

4. GSEA and leading edge analysis were carried out using the GSEA software package. Gene sets were taken from version 4 of the Molecular Signature Database (MSigDB). All 1320 canonical pathway gene sets (collection C2:CP) were tested. Additionally, the fatigue-related features identified (point 5) were analysed as a bespoke input gene set. Gene sets were considered significant at an FDR cut-off of 25%. Real gene ordering was used to detect enrichments in the low and high groups, while absolute gene ordering was used to detect other non-random distributions.

5. Machine learning was carried out on the high and low fatigue groups using radial kernel support vector machines (SVMs) run in the e1071 package. Hyperparameters were selected and inputs were pre-processed using the caret package, and 10-fold cross-validation was applied. The performance of the classifiers was evaluated using the area under the curve (AUC) of receiver operating characteristic (ROC) curves. The error of the AUC was calculated as the standard error of the Wilcoxon statistic, SE(W), using Eq (1), where $\theta$ is the AUC, $C_p$ is the number of positive examples, $C_n$ is the number of negative examples, and $Q_1$ and $Q_2$ are the probabilities of incorrect group assignment as defined by Eqs (2) and (3), respectively.

$$SE(W) = \sqrt{\frac{\theta(1-\theta) + (C_p - 1)(Q_1 - \theta^2) + (C_n - 1)(Q_2 - \theta^2)}{C_p C_n}} \tag{1}$$

$$Q_1 = \frac{\theta}{2 - \theta} \tag{2}$$

$$Q_2 = \frac{2\theta^2}{1 + \theta} \tag{3}$$

[…]
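
Two of the calculations described above are small enough to sketch directly. The following is a minimal illustration in Python, not the authors' code (the analysis itself was run with R packages). The function names `ifn_score` and `auc_standard_error`, the dictionary-based inputs, and the choice of the sample standard deviation are all assumptions made for the example. The standard error is written with the square root over the whole fraction, which is the usual form of this expression (Hanley and McNeil).

```python
import math
import statistics


def ifn_score(patient, controls):
    """Sum, over genes, of (patient value - control mean) / control SD.

    patient:  dict mapping gene name -> expression value for one patient
    controls: dict mapping gene name -> list of healthy-control values
    Both input layouts are hypothetical, chosen only for this sketch.
    """
    score = 0.0
    for gene, value in patient.items():
        hc_values = controls[gene]
        # Assumption: sample SD (n - 1 denominator); the text does not specify.
        hc_mean = statistics.mean(hc_values)
        hc_sd = statistics.stdev(hc_values)
        score += (value - hc_mean) / hc_sd
    return score


def auc_standard_error(theta, c_p, c_n):
    """Standard error of the Wilcoxon statistic for an AUC of theta.

    theta: area under the ROC curve
    c_p:   number of positive examples
    c_n:   number of negative examples
    """
    q1 = theta / (2.0 - theta)             # Q1 = theta / (2 - theta)
    q2 = 2.0 * theta ** 2 / (1.0 + theta)  # Q2 = 2 theta^2 / (1 + theta)
    numerator = (
        theta * (1.0 - theta)
        + (c_p - 1) * (q1 - theta ** 2)
        + (c_n - 1) * (q2 - theta ** 2)
    )
    return math.sqrt(numerator / (c_p * c_n))
```

With these helpers, a patient would be called IFN-positive when `ifn_score(...)` over the five genes exceeds 10, and `auc_standard_error(auc, c_p, c_n)` gives the error bar to report alongside a cross-validated AUC.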

Pipeline specifications

Software tools: lumi, arrayQualityMetrics, GSEA, ComBat, limma
Application: Gene expression microarray analysis
Organisms: Homo sapiens
Diseases: Autoimmune Diseases, Xerostomia