Computational protocol: A multivariate analysis of CalEnviroScreen: comparing environmental and socioeconomic stressors versus chronic disease

Protocol publication

[…] Statistical analysis was performed in R (version 3.4.0). Principal component analysis (PCA) was conducted for data reduction and to allow examination of both the contribution of individual variables to the variance explained (i.e., variable loadings) and the multivariate data structure, similar to other public health studies. For all census tracts that had CalEnviroScreen results (n = 7929), PCA was performed on the correlation matrix using the R package FactoMineR. Prior to PCA, missing values were imputed using the imputePCA command in the missMDA package.

Separate PCAs were performed to achieve different study objectives. A PCA was performed on all 20 CalEnviroScreen variables in combination to examine multivariate patterns of the entire data set. Two separate PCAs were also performed, one on the 12 environmental variables and one on the 5 socioeconomic variables, to generate and evaluate a smaller number of variables representing environmental hazard and socioeconomic status, respectively. Finally, another PCA was performed on the 17 environmental and socioeconomic variables combined. This PCA, which excluded the three health outcome variables (asthma, low birth weight, and cardiovascular disease), was compared to the hospitalization rate disease burden measure (described above). The goal of this analysis was to examine how the exposure and population variables underlying CalEnviroScreen are generally associated with disease burden.

Environmental hazard and socioeconomic status variables (principal components) were compared to the disease burden measure (hospitalization rate for 14 diagnoses) using simultaneous autoregressive (SAR) models, employing the R package spdep. SAR was chosen as a spatial autoregressive model appropriate for describing and testing linear relationships in the presence of spatial autocorrelation.
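For orientation, a SAR model with a spatial dependence parameter λ is commonly written in the error-model form below. This is a sketch under assumptions: the text does not state which SAR variant was fitted or how the spatial weights matrix was built, so the error form and the symbols y (transformed disease burden), X (principal component predictors), β (coefficients), W (spatial weights matrix), and ε (independent errors) are illustrative notation rather than the authors' specification.

```latex
\begin{aligned}
y &= X\beta + u, \\
u &= \lambda W u + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^{2} I)
\end{aligned}
```

When λ = 0 the model reduces to ordinary least squares, which is why a likelihood ratio test on λ indicates whether the spatial term is needed.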
Appropriate treatment of spatial autocorrelation was assessed based on Moran plots showing no association with spatially lagged means, a global Moran’s I that was not significant, and a spatial dependence parameter (λ) that was significant via likelihood ratio test. Models were selected by minimizing the Bayesian Information Criterion (BIC). Parameter inclusion was based on reported p values (α = 0.05) and on ∆BIC, employing the rule of thumb that ∆BIC ≥ 2 provides positive evidence of model improvement. Nagelkerke pseudo-R² was calculated as a measure of goodness of fit for SAR models. Analogous in meaning to traditional R² (though not directly comparable), the Nagelkerke pseudo-R² estimates, on a scale from 0 to 1, the improvement in the proportion of variation explained by the fitted model versus a null (intercept-only) model. To compare the contribution of each parameter to the final variation explained by the model, the pseudo-R² was compared between the full model and the model with that parameter removed.

Prior to statistical analysis, all variables were transformed to approximate the normal distribution and multivariate linearity required for linear model analysis. Transformations included log10 (7 variables), cube root (6 variables), square root (5 variables), and arcsine square root (drinking water). PM2.5 and low birth weight did not require transformation (Additional File: Table S1). The combined disease burden measure (DB) exhibited skewness and long tails (leptokurtic), and standard transformations failed to achieve normally distributed model residuals. Normal residuals were therefore achieved by employing a modulus transformation, sign(DB) × ln(|DB| + 1), following John and Draper. The predictor variables for the SAR were centered and scaled by subtracting the mean and dividing by the standard deviation.
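The modulus transformation and the centering and scaling step can be illustrated with a minimal Python sketch. The study itself used R. The function names and the input values here are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not say which form was used.

```python
import math
import statistics


def modulus_log(x):
    """Log form of the modulus transformation: sign(x) * ln(|x| + 1)."""
    return math.copysign(math.log1p(abs(x)), x)


def standardize(values):
    """Center on the mean and scale by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: n - 1 denominator
    return [(v - mean) / sd for v in values]


# Made-up values for illustration only; these are not study data.
disease_burden = [0.0, 1.5, 4.0, 20.0, 150.0]
predictor = [2.0, 4.0, 4.0, 6.0, 9.0]

db_transformed = [modulus_log(v) for v in disease_burden]
predictor_z = standardize(predictor)
```

The transformation keeps the sign of its input and leaves zero at zero, so it compresses long tails without requiring strictly positive data.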
This standardization converted the transformed variables (Additional File: Table S1) to the same unit normal distribution, such that a comparison of model coefficients would approximately indicate the relative contribution of each variable to disease burden.

In the interest of independent assessment, we did not communicate with OEHHA, CalEPA, or any members of the CalEnviroScreen development team regarding any aspect of this study. […]
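The model-comparison quantities described above (∆BIC and the change in pseudo-R² when a parameter is dropped) can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: the log-likelihoods and parameter counts are invented, only n = 7929 comes from the text, and the pseudo-R² uses the likelihood-ratio form 1 − exp(−(2/n)(ℓ_model − ℓ_null)). The text does not say whether a rescaled variant was applied.

```python
import math


def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: k * ln(n) - 2 * log-likelihood."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


def pseudo_r2(log_lik_model, log_lik_null, n_obs):
    """Likelihood-ratio pseudo-R2 of a fitted model versus a null model."""
    return 1.0 - math.exp(-(2.0 / n_obs) * (log_lik_model - log_lik_null))


n_obs = 7929           # census tracts, from the text
# Hypothetical log-likelihoods and parameter counts, for illustration only.
ll_null = -11000.0     # intercept-only model
ll_reduced = -9050.0   # full model with one parameter removed (4 parameters)
ll_full = -9000.0      # full model (5 parameters)

# Positive delta_bic means the full model has the lower (better) BIC.
delta_bic = bic(ll_reduced, 4, n_obs) - bic(ll_full, 5, n_obs)
keep_parameter = delta_bic >= 2.0  # rule of thumb from the text

# Contribution of the dropped parameter to the variation explained.
contribution = (pseudo_r2(ll_full, ll_null, n_obs)
                - pseudo_r2(ll_reduced, ll_null, n_obs))
```

Comparing the full model against each reduced model in turn gives one contribution value per parameter, which is how the per-parameter comparison in the text can be read.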

Pipeline specifications

Software tools: FactoMineR, spdep
Application: Miscellaneous
Chemicals: Ozone