Computational protocol: Predicting all-cause risk of 30-day hospital readmission using artificial neural networks

Protocol publication

[…] Electronic health records corresponding to 323,813 inpatient stays were extracted from Sutter Health’s EPIC electronic record system. A summary of the population under study is given in the accompanying table. We had access to all Sutter EHR data from 2009 through the end of 2015. Because many hospitals only recently completed their EHR integration, roughly 80% of the data comes from 2013–2015. To ensure data consistency, we limited the hospitals under study to those with over 3,000 inpatient records and excluded skilled nursing and other specialty facilities. The accompanying table lists the total number of records for each hospital and their respective readmission rates.

We studied all inpatient visits to all Sutter hospitals. Hospital transfers and elective admissions were excluded. In this way, a Boolean 30-day readmission label was created for each hospital admission.

In the current version of its EHR system, Sutter Health captures a few social determinants of health (SDoH) data fields, such as history of alcohol and tobacco use. We supplemented those data with block-level 2010 census data [] by matching patients’ addresses. The Google Geocoding API was used to determine the coordinates of each patient’s home address, and a spatial join was performed with the open-source QGIS platform [] to find the corresponding census tract and block IDs.

The data was transferred from Sutter to a HIPAA-compliant cloud service, where it was stored in a PostgreSQL database. An open-source framework [], written in Python, was built to systematically extract features from the dataset. In total, 335,815 patient records with 1,667 distinct features, comprising 15 feature sets, were extracted from the database, as summarized in the accompanying table.

Each type of feature (age, length of stay, etc.) was studied independently using Jupyter Notebook, an interactive Python tool for data exploration and analysis. Using the pandas [] library, we explored the quality and completeness of the data for each feature, identified quirks, and came to a holistic understanding of each feature before using it in our models. Each feature-study notebook provided a readable document mixing code and results, allowing the research team to share findings with one another in a clear and technically reproducible way.

[…] Initially, we experimented with several classic and modern classifiers, including logistic regression, random forests [], and neural networks. In each case, 5-fold cross-validation was performed, with 20% of the data kept hidden from the model. We found that the neural network models substantially outperformed the other models in overall performance and recall, and the neural network model was about 10 times faster to train than the random forest model, the second-best-performing model. We therefore focused on optimizing the neural network model.

After evaluating a variety of neural network architectures, we found the best-performing model to be a two-layer neural network containing one dense hidden layer half the size of the input layer, with dropout between all layers to prevent overfitting. Our model architecture is shown in the accompanying figure. To train the neural network, we used the Keras framework [] on top of Google’s TensorFlow [] library. We trained in batches of 64 samples using the Adam optimizer [], limiting training to 5 epochs because we found that further training tended to result in overfitting, as indicated by validation accuracy decreasing with each epoch while training loss continued to improve.
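The sketches below illustrate, in simplified form, several of the steps described above; variable, column, and file names are placeholders rather than parts of the actual Sutter pipeline. First, a minimal pandas sketch of the 30-day labeling step, assuming one row per inpatient stay with hypothetical columns patient_id, admit_date, discharge_date, is_elective, and is_transfer:

```python
import pandas as pd

def label_30day_readmissions(admissions: pd.DataFrame) -> pd.DataFrame:
    """Attach a Boolean 30-day readmission label to each inpatient stay.

    Assumes one row per stay with hypothetical columns: patient_id,
    admit_date, discharge_date, is_elective, is_transfer.
    """
    df = admissions.copy()
    df["admit_date"] = pd.to_datetime(df["admit_date"])
    df["discharge_date"] = pd.to_datetime(df["discharge_date"])

    # Exclude hospital transfers and elective admissions, as in the protocol.
    df = df[~df["is_transfer"] & ~df["is_elective"]]

    # Order each patient's stays chronologically and look at the next admission.
    df = df.sort_values(["patient_id", "admit_date"])
    next_admit = df.groupby("patient_id")["admit_date"].shift(-1)

    # Label is True when the next admission starts within 30 days of discharge.
    gap = next_admit - df["discharge_date"]
    df["readmit_30d"] = gap.notna() & (gap <= pd.Timedelta(days=30))
    return df
```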
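The census linkage itself was done with the Google Geocoding API and a spatial join in QGIS; the following is an equivalent, purely illustrative workflow in Python using the googlemaps client and geopandas, with a hypothetical address table and a hypothetical 2010 census block shapefile:

```python
import googlemaps
import geopandas as gpd
import pandas as pd

gmaps = googlemaps.Client(key="YOUR_API_KEY")            # placeholder key
patients = pd.read_csv("patient_addresses.csv")          # hypothetical: patient_id, address
blocks = gpd.read_file("census_blocks_2010.shp")         # hypothetical 2010 block shapefile with tract/block IDs

def geocode(address):
    """Return (lon, lat) for an address, or (None, None) if no match is found."""
    results = gmaps.geocode(address)
    if not results:
        return None, None
    loc = results[0]["geometry"]["location"]
    return loc["lng"], loc["lat"]

coords = patients["address"].apply(geocode)
patients["lon"] = coords.str[0]
patients["lat"] = coords.str[1]
patients = patients.dropna(subset=["lon", "lat"])

# Build point geometries and spatially join them to census blocks, mirroring
# the "join attributes by location" step performed in QGIS.
points = gpd.GeoDataFrame(
    patients,
    geometry=gpd.points_from_xy(patients["lon"], patients["lat"]),
    crs="EPSG:4326",
).to_crs(blocks.crs)

linked = gpd.sjoin(points, blocks, how="left", predicate="within")
# `linked` now carries a census tract and block ID for each geocoded patient.
```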
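The initial classifier comparison could be set up along the following lines with scikit-learn, shown here on synthetic data and with an MLP standing in for the Keras model; the hyperparameters are assumptions, not the values used in the protocol:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data standing in for the extracted features and labels.
X, y = make_classification(n_samples=5000, n_features=100, weights=[0.85], random_state=0)

models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0),
    "neural network": make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(50,), random_state=0)),
}

# 5-fold cross-validation: each fold keeps 20% of the data hidden from the model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```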
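A minimal Keras/TensorFlow sketch of the architecture described above (one dense hidden layer half the width of the input, dropout between all layers, the Adam optimizer, batches of 64, and 5 epochs); the dropout rate, activations, and loss are assumptions, since the excerpt does not specify them:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features, dropout_rate=0.5):
    """Two-layer network: one dense hidden layer half the input width, dropout between all layers."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dropout(dropout_rate),                      # dropout after the input layer
        layers.Dense(n_features // 2, activation="relu"),  # hidden layer, half the input size
        layers.Dropout(dropout_rate),                      # dropout before the output layer
        layers.Dense(1, activation="sigmoid"),             # predicted probability of 30-day readmission
    ])
    model.compile(optimizer="adam",                        # Adam optimizer, as in the protocol
                  loss="binary_crossentropy",
                  metrics=[keras.metrics.AUC(name="auc")])
    return model

# Synthetic data standing in for the extracted feature matrix and labels.
X_train = np.random.rand(10000, 100).astype("float32")
y_train = (np.random.rand(10000) < 0.15).astype("float32")

model = build_model(X_train.shape[1])
model.fit(X_train, y_train,
          batch_size=64,          # batches of 64 samples
          epochs=5,               # limited to 5 epochs to avoid overfitting
          validation_split=0.2)
```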
Initially, we trained the model on all 1,667 features extracted from the dataset. We then retrained the model using the top N features most correlated with 30-day readmission, for different values of N. As shown in the accompanying figure, the model achieved over 95% of the optimal precision when limited to the top 100 features, suggesting that 100 features is a reasonable cutoff for achieving near-optimal performance at a fraction of the training time and model size required for the full model. The accompanying table summarizes the features most correlated with readmission risk.

Measuring a model’s performance cannot be completely separated from its intended use. While one metric, the area under the ROC curve (AUC), is designed to measure model behavior across the full range of possible operating points, in practice risk models are only ever used to flag a minority of the patient population, so this statistic is not fully relevant. Metrics like precision and recall require a yes/no intervention threshold before they can even be computed, something we lack because this model is not slated for a specific clinical program. For simplicity, we assumed the model would be used in an intervention targeting the 25% of patients with the highest predicted risk. We chose 25% because this is the fraction of patients that LACE naturally flags as high-risk, so we conservatively compare to LACE on its best terms.

Additionally, we wanted to understand the predictive power of each set of features. To do so, we removed individual feature sets, one at a time, and compared the performance (in terms of AUC) with that of the best-performing model.

Providers often want to focus their interventions on a specific patient population based on age, geography, or medical condition. It is therefore important to measure how well the model performs in each of those subpopulations. In addition, CMS has so far penalized hospitals for excessive readmission of patients with heart failure (HF), chronic obstructive pulmonary disease (COPD), acute myocardial infarction (AMI), or pneumonia [5]. We compared the performance of our model against LACE in each of those subpopulations. […]
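To make the evaluation at the 25% intervention threshold concrete, the following is a small sketch of computing AUC together with precision on the top quarter of predicted risks; the labels and scores here are synthetic stand-ins for the model’s (or LACE’s) output:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def precision_at_top_fraction(y_true, y_score, fraction=0.25):
    """Precision when intervening on the `fraction` of patients with the highest predicted risk."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_flagged = max(1, int(round(fraction * len(y_score))))
    flagged = np.argsort(y_score)[::-1][:n_flagged]   # indices of the highest-risk patients
    return y_true[flagged].mean()

# Synthetic labels and risk scores standing in for real model output.
rng = np.random.default_rng(0)
y_true = rng.random(10_000) < 0.15
scores = 0.3 * y_true + rng.random(10_000)

print("AUC:", roc_auc_score(y_true, scores))
print("Precision at the top 25%:", precision_at_top_fraction(y_true, scores, fraction=0.25))
```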
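The feature-set ablation can be sketched as follows, assuming a mapping from feature-set names to the column indices they contribute; a simple logistic regression stands in for the full neural network so the example stays self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def ablate_feature_sets(X, y, feature_sets, make_model=lambda: LogisticRegression(max_iter=1000)):
    """Drop one feature set at a time and report the held-out AUC of the retrained model.

    `feature_sets` maps a feature-set name to the column indices it contributes.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

    baseline = make_model().fit(X_tr, y_tr)
    results = {"all features": roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1])}

    for name, cols in feature_sets.items():
        drop = set(cols)
        keep = [c for c in range(X.shape[1]) if c not in drop]
        model = make_model().fit(X_tr[:, keep], y_tr)
        results[f"without {name}"] = roc_auc_score(y_te, model.predict_proba(X_te[:, keep])[:, 1])
    return results

# Example on synthetic data with two hypothetical feature sets.
X, y = make_classification(n_samples=3000, n_features=40, weights=[0.85], random_state=0)
print(ablate_feature_sets(X, y, {"demographics": range(0, 10), "labs": range(10, 25)}))
```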

Pipeline specifications

Software tools: Jupyter Notebook, TensorFlow
Applications: Miscellaneous, Computational neuroscience modelling
Organisms: Homo sapiens