Computational protocol: Predicting disease progression in amyotrophic lateral sclerosis

Protocol publication

[…] A nonlinear, nonparametric random forest (RF) algorithm was trained using the PRO‐ACT dataset and the “randomForest” R package (Fig. ). A preliminary model was trained using 21 available predictor variables from the baseline visit, and the relative contribution of each variable to reducing model error was determined by quantifying the error rate for each variable (Table ). A second RF model was then retrained using only those variables that contributed more than a 2% reduction in model error, and this model was used for further testing on the clinic population (Table , upper non‐gray area). This variable reduction step was included to reduce the chance of model overfitting. The final model consisted of 13 predictor variables. Performance of the random forest model was compared to the pre‐slope model and a parametric generalized linear (GLM) model.

The pre‐slope model is a nonparametric linear model that is often used in a clinical or research setting (Fig. ). It did not use the PRO‐ACT data; instead, it was calculated for every patient using a presumed perfect ALSFRS‐R score the day before the time of onset and the score at baseline, so all patients had a y intercept of 48. This model is not based on any assumptions about population‐level disease characteristics; rather, it calculates a patient's ALSFRS‐R slope over time by assuming full functionality prior to the first onset of symptoms and extrapolates a future score along the presumed linear trajectory of ALSFRS‐R progression.

The parametric generalized linear (GLM) model was developed using PRO‐ACT patient data and the “lm” function in base R (Fig. ). The model was fit using four variables: the time since baseline, the time from symptom onset to baseline, the ALSFRS‐R score at baseline, and the slope of the ALSFRS‐R score at baseline (calculated using a score of 48 the day prior to the day of symptom onset).
These four variables were selected based on the four most important noncollinear variables revealed by the variable importance list generated from the RF model (Table ):

$$\mathrm{ALSFRSR}_{i,T} = -3.443 - 0.02\,T - 0.0027\,t + 1.044\,\mathrm{ALSFRSR}_{0} - 61.94\,m_{\mathrm{ALSFRSR}_{0}} + \varepsilon$$

where
$\mathrm{ALSFRSR}_{i,T}$ is the predicted ALSFRS‐R score for patient $i$ at time $T$,
$T$ is the time since baseline,
$t$ is the time from symptom onset to baseline,
$\mathrm{ALSFRSR}_{0}$ is the ALSFRS‐R score at baseline,
$m_{\mathrm{ALSFRSR}_{0}}$ is the slope of the ALSFRS‐R score at baseline, and
$\varepsilon \sim \mathcal{N}(0, 5.902^{2})$ is the error term.

[...] All computations were performed using the R statistical computing system (version 3.1.0) and the R base packages and add‐on packages randomForest, plyr, and ggplot2. The data are available to registered PRO‐ACT users. […]
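
The pre-slope calculation and the linear-model equation described above can be sketched as follows. This is an illustration in Python, not the authors' R code. The function and constant names are hypothetical, and the excerpt does not state the time units or the sign convention of the slope. The sketch assumes times in days and leaves the slope fed to the linear model as a caller-supplied argument. The linear-model function returns only the deterministic part of the equation, without the error term.

```python
# Illustrative sketch only; names are hypothetical and not from the paper.
# Assumption: all times are measured in days (the excerpt does not say).

MAX_ALSFRS_R = 48.0  # presumed perfect score the day before symptom onset

# Coefficients copied from the published linear-model equation.
INTERCEPT = -3.443
COEF_TIME_SINCE_BASELINE = -0.02
COEF_ONSET_TO_BASELINE = -0.0027
COEF_BASELINE_SCORE = 1.044
COEF_BASELINE_SLOPE = -61.94


def pre_slope(baseline_score: float, days_onset_to_baseline: float) -> float:
    """Slope of the line from (day before onset, 48) to (baseline, score)."""
    # The anchor point sits one day before onset, hence the extra day.
    elapsed_days = days_onset_to_baseline + 1.0
    return (baseline_score - MAX_ALSFRS_R) / elapsed_days


def pre_slope_predict(
    baseline_score: float,
    days_onset_to_baseline: float,
    days_since_baseline: float,
) -> float:
    """Extrapolate the pre-slope line to a time after baseline."""
    slope = pre_slope(baseline_score, days_onset_to_baseline)
    return baseline_score + slope * days_since_baseline


def glm_mean_prediction(
    time_since_baseline: float,
    onset_to_baseline: float,
    baseline_score: float,
    baseline_slope: float,
) -> float:
    """Evaluate the published equation without the error term.

    Assumption: baseline_slope must use the same units and sign
    convention the authors used when fitting, which the excerpt
    does not specify.
    """
    return (
        INTERCEPT
        + COEF_TIME_SINCE_BASELINE * time_since_baseline
        + COEF_ONSET_TO_BASELINE * onset_to_baseline
        + COEF_BASELINE_SCORE * baseline_score
        + COEF_BASELINE_SLOPE * baseline_slope
    )


if __name__ == "__main__":
    # Hypothetical patient: score 36 at baseline, 364 days after onset.
    print(pre_slope(36.0, 364.0))
    print(pre_slope_predict(36.0, 364.0, 365.0))
    print(glm_mean_prediction(100.0, 365.0, 40.0, 0.0))
```

Because the pre-slope line passes through the baseline score, extrapolating from the anchor point (48 the day before onset) and extrapolating from baseline give the same prediction. The sketch uses the second form for brevity.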

Pipeline specifications

Software tools: randomForest, ggplot2
Databases: PRO-ACT
Application: Miscellaneous
Organisms: Homo sapiens
Diseases: Amyotrophic Lateral Sclerosis, Glioma