How do I run a PLS regression with XLSTAT?
This tutorial is based on data that have extensively analyzed in [Tenenhaus, M., Pagès, J., Ambroisine L. and & Guinot, C. (2005). PLS methodology for studying relationships between hedonic judgments and product characteristics. Food Quality an Preference. 16, 4, pp 315-325]. The data used in this article correspond to 6 orange juices described by 16 physico-chemical descriptors and evaluated by 96 judges. PLS regression is going to allow us to obtain a simultaneous map of the judges, the descriptors and the products, and then to analyze for some judges which descriptors are related to their preferences. An Excel sheet containing both the data and the results for use in this tutorial can be downloaded by clicking here.
To activate the PLS regression dialog box, start first XLSTAT-PLS by clicking on the button in the Excel toolbar, then select the XLSTAT-PLS / PLS Regression command in the Excel menu or click the corresponding button on the "Modeling data" toolbar.
Once you have clicked the button, the PLS Regression dialog box is displayed. In the Dependent variable(s) field, select with the mouse the ratings of the 96 judges (see the tutorial on how to select data for more information on this topic). The ratings are the "Ys" of the model as we want to explain the ratings given by the judges. In Quantitative variable(s) field, select the explanatory variables, that are in our case the physico-chemical descriptors. The name of the orange juices have also been selected as "Observation labels". In the Options tab of the dialog box, make sure that "Automatic" is activated. Last, in the "Charts" tab, the "Colored labels" option has been activated in order to make the reading of the charts easier. The Vectors option has been unchecked in order not to saturate the charts.
The extremely fast computations start when you click on OK. The display of the results is stopped to allow you to select the axes for the maps.
You only need to click on "Done" so that the charts are only displayed for the first two axes.
Interpreting the results of a Partial Least Squares regression
The display of the results takes about 2 minutes as there are many tables and charts because of the 96 dependent variables.
After the tables that display the basic statistics and the correlations between all the selected variables (dependent variables are displayed in blue and quantitative explanatory variables in black), the results specific to the PLS regression are displayed.
The first table and the corresponding bar chart allow to visualize the quality of the PLS regression as a function of the number of components.
The Q² cumulated index measures the global goodness of fit and the predictive quality of the 96 models. XLSTAT-PLS has automatically selected 4 components. We see that Q² remains low even with 4 components (ideally it should be close to 1). This suggests that the quality of the fit varies a lot depending on the judge. The cumulated R²Y and R²X cum that correspond to the correlations between the explanatory (X) and dependent (Y) variables with the components are very close to 1 with 4 components. This indicates that the 4 components generated by the PLS regression summarize well both the Xs and the Ys.
The first correlations map allows to visualize on the first two components the correlations between the Xs and the components, and the Ys and the components.
We can see that for some judges displayed at the center of the map, the correlations are low. By looking at the corresponding table, we see that for example, the J54 judge is only correlated with the fourth component, that is globally little correlated with the explanatory variables. Regarding the explanatory variables we notice that the Vitamin C is not well represented on the first two dimensions. We can interpret this as the fact that this variable explains only little the preferences of the judges, which is not surprising as it does not have a strong effect on taste or other criteria that could easily influence the judges' preference. We notice the strong correlations between the fructose and the glucose, between the two pH, and the negative correlation between the pH and the acidity and titer. One should also notice how different the judges are: they are not concentrated on one part of the correlations circle, but well distributed all around it.
The map that displays the dependent variables on the c vectors, and the explanatory variables on the w* vectors allows to visualize the global relationship between the variables. The w* are related to the weights of the variables in the models.
If one projects an explanatory variable on the vector of a dependent variable (the vectors are displayed only if there are less than 50 dependent variables) we have an idea of the influence of the explanatory variable in the modeling of the dependent variable.
The coordinates of the orange juices in the space of the t coordinates are available in a table and displayed on a map. We notice that the products are well distinguished.
A new correlations map allows to superimpose the products on the previous correlations map. In the legend we replaced "Obs" by "Juices", by modifying the series name in the Excel toolbar, after we selected the series by clicking on one of the points. As almost always with XLSTAT, the charts are Excel charts and can easily be modified.
Running a second Partial Least Squares regression on clustered data
In their article, Tenenhaus et al. interpret this chart in detail. They deduce from it the existence of 4 well identified clusters of judges. They advise re-running PLS regressions on each of these groups. This way they obtain better Q² and R². For the first group, the R²Y is 0.63 instead of the 0.53 we obtained with all the judges.
Two tables giving results for the u and u~ components are then displayed. A chart allows to visualize the observations (in our case the juices) in the space of the u~.
The following tables allow to visualize for each dependent variable the Q² et cumulated Q² indexes as a function of the number of components. We notice that for several variables, the maximum of the cumulated Q² is obtained for one or two components only (for example J5, J6, J7).
A series of tables with the R² for each entry variable with the t components is optionally displayed. The option is not activated by default, and the tables are not taken into account in this tutorial.
The following table displays the VIPs (Variable Importance for the Projection) for each explanatory variable, for an increasing number of components. This allows to quickly identify which are the explanatory variables that contribute the most to the models. For the model with one component we can see that the Vitamin C, the Sweetening power, the Odor intensity and the Taste intensity have a low infludence on the models.
The next table displays the parameters (or coefficients) of the models corresponding to each dependent variable. The equations of the models are displayed below that table. The equations can be later reused for simulation or prediction purposes.
For each model, XLSTAT-PLS displays the goodness of fit coefficients, the standardized coefficients table, and the table of predictions and residuals. The analysis of the model corresponding to judge J1 allows to conclude that the model is well fitted (R² equals 0.88). However the number of degrees of freedom is low, and might be facing an overfitting problem. This is also confirmed when we look at the standardized coefficients: for each coefficient the confidence intervals are wide and include 0. As we noticed that the cumulated Q² corresponding to this model reaches its maximum value with 2 components, it is likely that a model with only two components would be better.
We have performed a new PLS analysis, using just J1 as a dependent variable, and forcing the number of components (see the options tab) to 2. The results are displayed on the PLS2 sheet. We can now see that the quality of the results has been improved. The following chart correspond to the standardized coefficients of the new model.
We can see here that the coefficients are significantly different from zero only for "Smell intensity" and "Odor typicity". The predictions and residuals table allows to verify that the rates given by the judge 1 are well reproduced by the model.
Last, the table with the DModX and DModY, and the corresponding charts that allow to quickly identify potential outliers, are displayed. In our case there aren't any outlier as all values are lower than DCritX or DCritY.
Watch this video to view a demonstration.
Click here for other tutorials.