PCR, PLS, or OPLS? Evaluation of different regression techniques for hypothesis generation

In the current era of 'big data', scientists can quickly amass enormous amounts of data in a limited number of experiments. The investigators then try to hypothesize about the root cause based on the observed trends for the predictors and the response variable. This involves identifying the discriminatory predictors that are most responsible for explaining variation in the response variable. In the current work, we investigated three related multivariate techniques: Principal Component Regression (PCR), Partial Least Squares or Projections to Latent Structures (PLS), and Orthogonal Partial Least Squares (OPLS). To perform a comparative analysis, we used a publicly available dataset for Parkinson's disease patients. We first performed the analysis using a cross-validated number of principal components for each of the aforementioned techniques. Our results demonstrated that PLS and OPLS were better suited than PCR for identifying the discriminatory predictors. Since the X data did not exhibit strong correlation, we also performed Multiple Linear Regression (MLR) on the dataset. A comparison of the top five discriminatory predictors identified by the four techniques showed a substantial overlap between the results obtained by PLS, OPLS, and MLR, while all three diverged significantly from the variables identified by PCR. A further investigation of the data revealed that PCR could successfully identify the discriminatory variables if the number of principal components in the regression model was increased. In summary, we recommend using PLS or OPLS for hypothesis generation and systemizing the selection process for principal components when using PCR.


Introduction
Many scientific investigations involve generating experimental data for multiple input or independent variables (termed predictors) with the hope that one would be able to find the predictors responsible for the observed trends in the response variable. Multiple linear regression (MLR) is a technique used widely to establish a relationship between the response variable and the predictors. However, a prerequisite to performing MLR is that the predictors should be uncorrelated (ideally orthogonal) with each other. Also, to perform MLR, the number of observations must exceed the number of predictors. The MLR technique is primarily concerned with explaining the variation in Y as a function of X; it does not, however, address collinearity that may exist amongst predictors. The condition of orthogonality can be achieved in systematically designed experiments but is not attainable in cases where multidimensional data is generated with a limited number of experiments or when predictors are correlated with each other. In such situations, to enable a rigorous analysis, the correlated variables can be transformed into a new multivariate space containing uncorrelated variables known as principal components. This transformation helps reduce the number of dimensions (the new primary axes) necessary for describing the data and facilitates an enhanced understanding of the correlations within predictors and the relationships between the predictors and the response variable. In this article, we explored three relevant techniques: PCR, PLS, and OPLS. We chose these techniques as they are all based on the notion of latent variables, i.e., variables that are linear combinations of the original variables.
Brief descriptions of these techniques are provided below: PCR: In this methodology, Principal Component Analysis (PCA) (Jackson, 1991) is first applied to the X data to derive the latent variables or principal components. The first principal component is chosen so that the R2X value is maximized. The successive principal components must be orthogonal to the preceding ones and maximize the remaining R2X at every iteration. Note that the principal components are linear combinations of the predictors and are mathematically related to them by the corresponding loading values. The loading values can be interpreted as the coefficients in the said linear combinations and vary between -1 and 1. Also, the observations in the new multivariate space are characterized by the scores associated with the principal components, where the scores can be interpreted as the coordinates of the observations in the new multivariate space containing the principal components as the primary axes.
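The scores and loadings described above can be illustrated with a minimal NumPy sketch using PCA via the singular value decomposition (toy numbers, not the study data; this is not the SIMCA or MATLAB implementation used in this work):

```python
import numpy as np

# Toy illustration of PCA on a small data matrix (hypothetical values).
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0],
              [2.3, 2.7]])
Xc = X - X.mean(axis=0)              # mean-centering

# SVD yields the principal components: columns of V are the loadings,
# and U * S are the scores (coordinates in the new latent space).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt.T                      # each column: a unit-length loading vector
scores = U * S                       # each row: an observation's coordinates

# R2X per component: fraction of total X variance explained.
r2x = S**2 / np.sum(S**2)
```

Note that the scores and loadings exactly reconstruct the centered data, and the R2X values decrease with each successive component, mirroring the iterative maximization described above.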
After the principal components are selected, they are subjected to MLR to generate a PCR model. Note that the derivation of orthogonal components allows the application of MLR since the principal components are orthogonal to each other. Also, in situations where the number of predictors far exceeds the number of observations, reducing the number of dimensions allows generating a regression model that would not have been possible with the original data. For a detailed review of the technique, the reader is directed to some of the earlier publications on the subject (Jeffers, 1967; Hawkins, 1973; Mansfield et al., 1977).
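The two-step procedure (PCA, then MLR on the retained scores) can be sketched as follows. The data here are hypothetical and constructed to be exactly rank-2, so that two components suffice; this is a minimal NumPy sketch, not the implementation used in the study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 observations, 5 predictors that are exactly
# rank-2 by construction, so two principal components capture all of X.
n, p, k = 20, 5, 2
base = rng.normal(size=(n, 2))
X = base @ rng.normal(size=(2, p))
y = X[:, 0] - 0.5 * X[:, 1]

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Step 1: PCA on X, keeping the first k principal components.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
T = (U * S)[:, :k]                  # scores of the retained components
P = Vt.T[:, :k]                     # corresponding loadings

# Step 2: MLR of y on the scores; valid because the scores are orthogonal.
q = np.linalg.lstsq(T, yc, rcond=None)[0]

# Back-transform to coefficients of the original predictors.
beta = P @ q
y_hat = Xc @ beta + y.mean()
```

Because the toy X is exactly rank-2, the two-component PCR model reproduces y; with real data, the retained components capture only part of the X variation, which is the crux of the comparison in this work.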

PLS:
The PLS technique attempts to maximize the X-Y covariance while assuming the existence of a small number of latent variables in the X data that predict the response variable. In other words, this technique involves maximizing the relationship between the X and Y data while preserving the correlation structure amongst the predictors. Mathematically, the principal components in PLS are chosen so that at each step the algorithm finds a new component that maximizes the product of the variance of the predictors and the square of their correlation with the response variable (Hastie et al., 2017a). Please refer to the relevant publications (Wold et al., 1993; Wold, 2001) for statistical details about the technique.
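A minimal sketch of the classical NIPALS algorithm for PLS with a single response (PLS1) is shown below. This is an illustrative NumPy implementation, not the SIMCA-P algorithm used in the study; the function name is our own:

```python
import numpy as np

def pls1_nipals(X, y, n_comp):
    """PLS1 via NIPALS (minimal sketch, single response variable)."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    p = X.shape[1]
    W = np.zeros((p, n_comp))       # X-weights
    P = np.zeros((p, n_comp))       # X-loadings
    q = np.zeros(n_comp)            # y-loadings
    for a in range(n_comp):
        w = X.T @ y                 # direction maximizing X-y covariance
        nw = np.linalg.norm(w)
        if nw < 1e-12:              # y already fully explained
            W, P, q = W[:, :a], P[:, :a], q[:a]
            break
        w /= nw
        t = X @ w                   # scores for this component
        tt = t @ t
        P[:, a] = X.T @ t / tt
        q[a] = y @ t / tt
        X = X - np.outer(t, P[:, a])   # deflate X
        y = y - t * q[a]               # deflate y
        W[:, a] = w
    # Regression coefficients in terms of the centered original X.
    return W @ np.linalg.solve(P.T @ W, q)

# Toy check on hypothetical noiseless data: with all components retained,
# PLS1 reproduces the ordinary least squares solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, 2.0, 0.0, -1.0])
beta = pls1_nipals(X, y, n_comp=4)
```

With fewer components than predictors, the coefficients are shrunk toward the directions of highest X-y covariance, which is what makes PLS robust to correlated predictors.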
OPLS: Consistent with the PLS methodology, the OPLS technique aims to maximize the relationship between X and Y while retaining latent X variables; additionally, it attempts to filter out the X information that is unrelated to Y. Mathematically, the OPLS method uses a technique named orthogonal signal correction (Wold et al., 1998), where the first principal component, termed the predictive component, maximizes the available X-Y covariance. The succeeding components capture variance in the orthogonal directions, i.e., those that are not statistically correlated with the response variable. While PLS does a reasonable job of reducing random noise in the data, the OPLS technique allows removing the structured noise present in the X data that is uncorrelated with Y. This helps simplify the model by reducing the number of principal components and allows the analysis of the main sources of orthogonal variation (Goueguel, 2019). For details about the technique, please refer to (Eriksson et al., 2006a).
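The orthogonal-filtering step can be sketched as below: one Y-orthogonal component is identified and removed from X before the predictive model is fit. This is a simplified single-component illustration of the orthogonal signal correction idea, with a hypothetical function name, not the SIMCA-P implementation:

```python
import numpy as np

def remove_orthogonal_component(X, y):
    """Remove one Y-orthogonal component from X (minimal sketch of the
    orthogonal signal correction step underlying OPLS)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    w /= np.linalg.norm(w)              # predictive direction (X-y covariance)
    t = Xc @ w                          # predictive scores
    p = Xc.T @ t / (t @ t)              # loading of the predictive component
    w_o = p - (w @ p) * w               # part of the loading orthogonal to w
    w_o /= np.linalg.norm(w_o)
    t_o = Xc @ w_o                      # orthogonal scores
    p_o = Xc.T @ t_o / (t_o @ t_o)
    X_filt = Xc - np.outer(t_o, p_o)    # X with structured Y-orthogonal variation removed
    return X_filt, t_o

# Toy check on hypothetical data: the removed scores carry no Y information.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = X[:, 0] + 0.1 * rng.normal(size=40)
X_filt, t_o = remove_orthogonal_component(X, y)
```

By construction, the removed scores t_o are uncorrelated with the (centered) response, which is why filtering them out simplifies the subsequent predictive model without discarding X-Y covariance.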
Although PLS is widely used for data mining purposes and is offered in a number of data mining platforms, the same cannot be said of PCR and OPLS. The objective of this work was to compare these techniques and provide recommendations regarding which should be used for hypothesis generation.

Materials and Methods
The data used in this article was obtained from a Kaggle database (Kaggle Inc., 2021) and originally belonged to the work performed by Hlavnička et al. (2017), in which the researchers showed that latent parkinsonian speech aberrations can be captured even in patients with Rapid Eye Movement (REM) sleep behavior disorder. The dataset included 30 patients with early untreated Parkinson's disease (designated as PD), 50 patients with REM sleep behavior disorder (designated as RBD) who were at high risk of developing Parkinson's disease or other synucleinopathies, and 50 healthy controls (designated as HC). A professional neurologist with experience in movement disorders examined the patients and provided them with a clinical score on the Unified Parkinson's Disease Rating Scale (designated as UPDRS III total (-)). The patients were also evaluated by a speech specialist: they read a standardized, phonetically balanced text of 80 words and monologized about their current activities, interests, family, or job for approximately 90 seconds. Table 1 lists all the predictors in the dataset that were used for this work. Note that the data for three of the predictors available in the dataset (Antiparkinsonian medication, Antipsychotic medication, and Levodopa equivalent (mg/day)) were not used as they did not contain significant variability between patients. The current work also ignored the 50 HC observations since they did not contain the corresponding Y (UPDRS III total (-)) data. Also, the Hoehn & Yahr scale (-) response was not included as a Y parameter since the related scores were available for the PD patients only. Overall, the modeled data included 33 X variables and 1 Y variable.

We used MATLAB (Version R2019a) to run the PCR model, SIMCA-P (Version 16) to run the PLS and OPLS models, and JMP (Version 15) to run the MLR models. After compiling the predictors in a suitable format, mean-centering and univariate scaling were used for centering and scaling the data (Eriksson et al., 2006b). To determine the principal components that should be retained in the PCA, PLS, and OPLS algorithms, a cross-validation approach similar to a methodology described elsewhere (Eastment and Krzanowski, 1982) was used.
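The centering and scaling step can be sketched as follows (toy numbers, not the study data): each column is shifted to zero mean and divided by its sample standard deviation, so that predictors on very different scales contribute comparably to the latent components.

```python
import numpy as np

# Mean-centering and univariate (unit-variance) scaling: each column
# ends up with zero mean and unit sample standard deviation.
X = np.array([[1.0, 200.0],
              [2.0, 240.0],
              [3.0, 280.0],
              [4.0, 320.0]])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```

Without this step, the second column above (with values in the hundreds) would dominate the variance-driven component selection.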

PCR
Prior to running the PCR regression model, PCA was applied to the X data, resulting in 33 principal components. However, based on the predefined rules in SIMCA, only the first 2 components were retained. Figure 1(a) shows the R2 and Q2 values associated with the selected components and Figure 1

MLR
Considering that the R2X(cum) value was only 0.32, which does not reflect a strong correlation between predictors, we also explored MLR to evaluate the predictors related to the response variable. The regression model yielded a moderate fit (R2 = 0.69) and a p-value (Prob > F) of 0.0001, indicating that at least one predictor had a significant effect on Y. The scaled estimates of the regression model are shown in Figure 5. DIS, HPF, AST-M, RLR, and PIR-M were identified as the most important predictors influencing the model (Figure 5). If one selects the predictors strictly on the basis of p-value using a strict criterion (say, a significance level of 0.05), the analysis shown in Figure 5 would identify only two significant predictors. This points to a limitation of MLR when dealing with correlated data. Since the MLR analysis only focuses on establishing a relationship between X and Y without any summarization of the X data, the analysis can end up weighing correlated variables equally and, in turn, excluding discriminatory predictors that might otherwise be important. Comparison between the different algorithms showed that the PLS and OPLS algorithms yielded similar results, since four out of the top five predictors were the same between the two models. MLR shared three of its top predictors with PLS and OPLS. Two of the three predictors common to these three techniques were also shared by PCR.
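The coefficient-splitting behavior of MLR on correlated predictors can be demonstrated with a small numeric sketch (hypothetical data, unrelated to the Parkinson's dataset): two nearly identical predictors share one underlying signal, and least squares divides the weight between them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two nearly identical (strongly correlated) predictors built from one
# shared signal z; the response depends only on z.
z = rng.normal(size=50)
x1 = z + 0.01 * rng.normal(size=50)
x2 = z + 0.01 * rng.normal(size=50)
y = z + 0.1 * rng.normal(size=50)

A = np.column_stack([x1, x2, np.ones(50)])  # design matrix with intercept
beta = np.linalg.lstsq(A, y, rcond=None)[0]
# The combined weight beta[0] + beta[1] is stable (close to 1), but how it
# splits between x1 and x2 is essentially arbitrary, so the per-predictor
# coefficients and p-values understate each predictor's importance.
```

This is exactly the mechanism by which an individually important but correlated predictor can fail a strict p-value cutoff in MLR.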

Evaluation of increased number of principal components in PCR model
As described earlier, the PCR model with two components did not result in a significant fit of the data, and neither of the two principal components demonstrated a significant effect on Y (Figure 2). We checked whether the data fit and model significance could be improved by increasing the number of principal components. We attempted to generate an MLR model with all 33 principal components, but this attempt was not successful, since the scores corresponding to the last (33rd) component were extremely low for all the observations. Note that the first component explained the maximum variation in X and the 33rd component explained the least.
Removing the last component resulted in a model that could successfully fit the remaining 32 principal components. The model resulted in a moderate R2 value of 0.69 and yielded a p-value (Prob > F) of 0.0001, indicating that at least one principal component had a significant effect on Y. Six principal components, including t3, t4, t9, t16, and t17, were found to have statistically significant effects (Figure 6). Similar to the procedure we adopted earlier, the regression coefficients from this model were multiplied by the appropriate loadings to determine the coefficients for the X-Y relationship. Subsequently, the Y values were predicted based on the X-Y relationship (Figure 5). Considering that there was no overlap between the X predictors with the maximum variation and those whose variation most aligned with the response variable, it is not surprising that PCR resulted in a poor fit of the data.
Curiously, there was a significant overlap between the results of PLS and OPLS, and both models resulted in the same R2X(cum), R2Y(cum), and Q2Y(cum) values. Finally, we demonstrated that the number of selected principal components is critical to the accuracy of the results obtained with PCR. There are many ways by which the principal components can be selected prior to running the PCR regression model. Clearly, as shown in the current work, selecting the first few principal components based on cross validation is not the optimal approach. The fact that the six significant principal components from the 32-component model were not the components that explained the maximum variation in X shows that selecting the principal components in the order of their ability to summarize X is not the right approach to efficiently capture the X-Y covariance. While one can choose a higher number of principal components, as has been done in this work, using an excessive number of principal components can result in overfitting of the data. Alternative approaches could be selecting the principal components based on subset-selection methods such as stepwise regression, or shrinkage-based methodologies such as Ridge Regression or the Lasso (Hastie et al., 2017b). Appropriate selection of principal components to enable PCR has been the subject of prior work (Naes and Martens, 1998; Sutter et al., 1992) and should be revisited to develop an efficient algorithm that can successfully identify discriminatory predictors.
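As an illustration of the shrinkage alternative mentioned above, the sketch below regresses the response on all principal-component scores with a ridge penalty rather than a hard cutoff on the number of components; components are then down-weighted smoothly in proportion to their variance. The data and the penalty value are hypothetical, and this is a minimal NumPy sketch, not a recommended production procedure.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 30 observations, 8 predictors; y depends on two of them.
X = rng.normal(size=(30, 8))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=30)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# All principal-component scores, then a ridge (shrinkage) fit in place of
# a hard cutoff on the number of retained components.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * S                                        # scores for all components

lam = 1.0                                        # ridge penalty (illustrative value)
q = np.linalg.solve(T.T @ T + lam * np.eye(T.shape[1]), T.T @ yc)
beta = Vt.T @ q                                  # coefficients for the original predictors
```

Because T.T @ T is diagonal in the principal-component basis, the penalty shrinks low-variance components the most, avoiding both the instability of near-zero-variance components (as seen with the 33rd component above) and an arbitrary choice of cutoff.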