Computational protocol: Identifying maternal and infant factors associated with newborn size in rural Bangladesh by partial least squares (PLS) regression analysis

Similar protocols

Protocol publication

[…] Classical regression methods typically face four main challenges: (i) a large number of variables, (ii) correlated predictors, (iii) a small sample size relative to the number of variables and (iv) more than one response variable to be modelled simultaneously []. To overcome these problems, researchers usually take one of two measures: they may remove some variables [] or use multivariate reduction techniques such as principal component analysis to reduce the dimensionality of the predictor or response variables []. However, removing variables risks retaining redundant variables that have no significant effect on the response. On the other hand, although dimensionality reduction techniques replace the predictors with a smaller number of latent variables, those latent variables are usually derived by maximizing the variation among the predictors rather than their covariation with the response variables. Consequently, the resulting patterns or syndromes within the predictor variables may make little or no biological sense []. The appropriate solution to these challenges is PLS regression [].

Although PLS regression is comparatively new, its use in research is gradually increasing. The great strength of PLS regression is parsimony []. Initially used in analytical chemistry [–], PLS is now gaining popularity in public health [–], bioinformatics [], ecology [,] and agriculture []. Although it is computationally much more intensive, the availability of statistical packages such as R, SAS, STATA, MATLAB and STATISTICA facilitates its wider application.

Similar to principal component regression (PCR), PLS regression is a data-dimension reduction method that extracts a set of orthogonal factors, called latent variables, which are used as predictors in the regression model []. The major difference from PCR is that principal components are determined solely by the X variables, whereas in PLS both the X and Y variables influence the construction of the latent variables. The intention of PLS is to form components (latent variables) that capture most of the information in the X variables that is useful for predicting Y, while reducing the dimensionality of the regression problem by using fewer components than the number of X variables. PLS is considered especially useful for constructing prediction equations when there are many explanatory variables and comparatively little sample data [].

PLS regression identifies latent variables, stored in a matrix T, that model X and predict Y simultaneously. This can be written as

X = T P^T and Ŷ = T B C^T   (1)

where P and C are the loadings of X and Y, respectively, and B is a diagonal matrix of regression weights. The latent variables are ordered according to the variance of Ŷ they explain. Ŷ can also be written as

Ŷ = T B C^T = X B_PLS, where B_PLS = (P^T)^+ B C^T   (2)

and (P^T)^+ is the Moore-Penrose pseudo-inverse of P^T. The matrix B_PLS has J rows and K columns and is equivalent to the regression weights of multiple regression. The latent variables are computed iteratively using singular value decomposition (SVD): in each iteration, SVD constructs orthogonal latent variables for X and Y and the corresponding regression weights [].
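Equations (1) and (2) can be checked numerically. The following is a minimal sketch only, not the study analysis: it uses simulated data with hypothetical variable names and the R package "pls" to confirm that the centred X decomposes into T P^T plus a residual, and that the fitted Ŷ equals X multiplied by a single coefficient matrix B_PLS.

```r
## Sketch only (not the study data): simulated X and Y with hypothetical names,
## used to check the matrix forms of equations (1) and (2).
library(pls)

set.seed(1)
n <- 100
X <- matrix(rnorm(n * 6), n, 6, dimnames = list(NULL, paste0("x", 1:6)))
Y <- cbind(y1 = drop(X %*% rnorm(6)) + rnorm(n),
           y2 = drop(X %*% rnorm(6)) + rnorm(n))

fit <- plsr(Y ~ X, ncomp = 3)              # PLS fit with 3 latent variables

## Equation (1): the centred X is approximated by T P^T (exactly once the
## number of components reaches the rank of X).
Xc   <- scale(X, center = TRUE, scale = FALSE)
Xrec <- scores(fit) %*% t(loadings(fit))   # T %*% P^T
mean((Xc - Xrec)^2)                        # residual variance left by 3 components

## Equation (2): the fitted values can be written as X B_PLS (plus centring),
## i.e., a single coefficient matrix with J rows and K columns.
B_pls <- coef(fit, ncomp = 3)[, , 1]       # J x K regression weights
Yhat  <- sweep(Xc %*% B_pls, 2, colMeans(Y), "+")
max(abs(Yhat - fitted(fit)[, , 3]))        # ~ 0 up to rounding error
```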
The algorithm for PLS regression is as follows:

Step 1: Transform X and Y into Z-scores and store them in matrices X0 and Y0.
Step 2: Compute the correlation matrix between X0 and Y0, R1 = X0^T Y0.
Step 3: Perform singular value decomposition (SVD) on R1 and take the left and right singular vectors, w1 and c1, corresponding to the largest singular value λ1.
Step 4: The first latent variable for X is given by T1 = X0 w1.
Step 5: Normalize T1 such that T1^T T1 = 1.
Step 6: The loadings of X0 on T1 are computed as P1 = X0^T T1, and X̂1 = T1 P1^T.
Step 7: Compute U1 = Y0 c1 and

Ŷ1 = U1 c1^T = T1 b1 c1^T, where b1 = T1^T U1   (3)

The scalar b1 is the slope of the regression of Ŷ on T1. Equation (3) shows that Ŷ is obtained by linear regression on the latent variable extracted from X0. The matrices X̂1 and Ŷ1 are then subtracted from the original X0 and Y0, respectively, to give the deflated X1 and Y1.
Step 8: Compute the input matrices for the next iteration, X1 = X0 − X̂1 and Y1 = Y0 − Ŷ1.
Step 9: The first set of latent variables has now been extracted. Performing SVD on R2 = X1^T Y1 gives w2, c2, T2 and b2, and the new deflated matrices X2 and Y2.
Step 10: The iterative process continues until X is completely decomposed into L components (where L is the rank of X). When this is done, the weights (i.e., all the w's) for X are stored in the J by L matrix W (whose l-th column is wl).

The latent variables of X are stored in matrix T, the weights for Y in matrix C, the latent variables of Y in matrix U, the loadings for X in matrix P, and the regression weights in a diagonal matrix B. The regression weights are used to predict Y from X.

The question now is how many components (t's) should be retained in the final model. The answer can be obtained by comparing the cross-validated root mean squared error of prediction (RMSEP) for different numbers of components: the number of components beyond which the cross-validated RMSEP shows no meaningful change is used in the final model. To choose the optimum number of components for both PLS and principal component regression, RMSEP was calculated for different numbers of components. We also performed approximate t-tests of the regression coefficients based on jackknife variance estimates [].

We constructed a correlation plot of the variables to observe how the variables are correlated with each other, and in particular how the birth size variables are correlated with the maternal variables. The closer a variable appears to the perimeter of the circle, the better it is represented; highly correlated variables appear near each other, negatively correlated variables tend to appear at opposite extremes, and uncorrelated variables are orthogonal to each other. We also plotted the scores of the first two components, t1 vs t2, which helped us to assess whether there is any natural grouping or interaction among the variables.

To examine the advantage of PLS regression over principal component regression, we calculated Pearson's correlation coefficients between the predicted values (from PLS and principal component regression with 1 to 5 components, respectively) and the observed values of the infant's size variables. This correlation coefficient indicates the predictive power of the model: if the model has perfect predictive ability, the correlation coefficient will be 1; thus, the higher the correlation coefficient, the greater the predictive power of a given model. For this analysis, we used the R packages "plsdepot", "pls" and "mixOmics". […]
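As an illustration of this workflow, the sketch below uses simulated stand-ins for the maternal predictors and newborn size variables (names such as maternal_1 and birth_weight are hypothetical, not the study variables). It fits PLS and principal component regressions with cross-validation using the "pls" package, compares RMSEP over 1 to 5 components, applies the jackknife-based approximate t-tests, computes the predicted-versus-observed correlations for both models, and draws the correlation circle and score plots with "mixOmics".

```r
## Sketch only: simulated data; variable names are hypothetical stand-ins
## for the study's maternal predictors and newborn size outcomes.
library(pls)        # plsr(), pcr(), RMSEP(), jack.test()
library(mixOmics)   # pls(), plotVar(), plotIndiv()

set.seed(2024)
n  <- 200
Xp <- matrix(rnorm(n * 8), n, 8,
             dimnames = list(NULL, paste0("maternal_", 1:8)))
Yr <- cbind(birth_weight = drop(Xp %*% rnorm(8)) + rnorm(n),
            birth_length = drop(Xp %*% rnorm(8)) + rnorm(n))
dat <- data.frame(size = I(Yr), Xp)    # keep the two responses as one matrix column

## PLS and principal component regression with cross-validation;
## jackknife = TRUE retains the CV fits needed for the approximate t-tests.
fit_pls <- plsr(size ~ ., data = dat, ncomp = 5, scale = TRUE,
                validation = "CV", jackknife = TRUE)
fit_pcr <- pcr(size ~ ., data = dat, ncomp = 5, scale = TRUE,
               validation = "CV")

## Cross-validated RMSEP by number of components: retain the number beyond
## which RMSEP no longer changes meaningfully.
plot(RMSEP(fit_pls), legendpos = "topright")
jack.test(fit_pls, ncomp = 2)          # approximate t-tests of the coefficients

## Predictive power: Pearson correlation between predicted and observed
## values for 1 to 5 components (rows = responses, columns = components).
sapply(1:5, function(k) diag(cor(predict(fit_pls, ncomp = k)[, , 1], Yr)))
sapply(1:5, function(k) diag(cor(predict(fit_pcr, ncomp = k)[, , 1], Yr)))

## Correlation circle and score (t1 vs t2) plots with mixOmics.
fit_mo <- mixOmics::pls(Xp, Yr, ncomp = 2)
plotVar(fit_mo)     # variables near each other are highly correlated
plotIndiv(fit_mo)   # scores of the observations on the first two components
```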

Pipeline specifications

Software tools Statistica, mixOmics
Application Miscellaneous
Organisms Homo sapiens
Diseases Epilepsies, Partial, Obstetric Labor, Premature
Chemicals Vitamin A