# XLSTAT-PLS

The XLSTAT-Basic statistical add-in for Excel includes a number of advanced modeling tools for Partial Least Squares (PLS) regression and Principal Component Regression (PCR). These regression methods lift some of the constraints of classical linear regression and analysis of variance, such as the requirement that the explanatory variables not be collinear and that the sample size exceed the number of explanatory variables.

Partial Least Squares regression is a statistical method that was developed during the 1980s and is now used in a growing number of industries and research fields. With PLS regression it is possible to model one or more dependent variables with a very large number of explanatory variables, regardless of the number of observations, while limiting the risk of overfitting. This module also provides Principal Component Regression (PCR) and Ordinary Least Squares (OLS) regression, both of which offer useful analytical capabilities.

## Features

- Partial Least Squares regression (PLS)
- Principal Component Regression (PCR)
- Ordinary Least Squares regression (OLS)
- PLS discriminant analysis

**This analysis is available in the XLSTAT-Basic add-in for Microsoft Excel™**

# DETAILED DESCRIPTIONS

# Partial Least Squares regression

Partial Least Squares regression (PLS) is a quick and efficient regression method that is optimal with respect to a criterion based on covariance. It is recommended when the number of variables is high and the explanatory variables are likely to be correlated.

### Partial Least Squares regression principle

The idea of PLS regression is to create, starting from a table with n observations described by p variables, a set of h components with h < p.

### PLS1 and PLS2 algorithms

Some programs differentiate PLS1 from PLS2. PLS1 covers the case of a single dependent variable, and PLS2 the case of several dependent variables. In the algorithms used by XLSTAT, PLS1 is simply a special case of PLS2.

### Partial Least Squares regression model equations

In the case of the OLS and PCR methods, if models need to be computed for several dependent variables, the computation of the models is simply a loop on the columns of the dependent variables table Y. In the case of PLS regression, the covariance structure of Y also influences the computations.

The PLS regression model is written:

$$
\begin{aligned}
Y &= T_h C'_h + E_h \\
  &= X W^*_h C'_h + E_h \\
  &= X W_h (P'_h W_h)^{-1} C'_h + E_h
\end{aligned}
$$

where $Y$ is the matrix of the dependent variables and $X$ is the matrix of the explanatory variables. $T_h$, $C_h$, $W^*_h$, $W_h$ and $P_h$ are the matrices generated by the PLS algorithm, and $E_h$ is the matrix of the residuals.

The matrix B of the regression coefficients of Y on X, with h components generated by the PLS regression algorithm is given by:

$$
B = W_h (P'_h W_h)^{-1} C'_h
$$

*Note: like OLS and PCR, PLS regression leads to a linear model.*
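The model equations above can be checked numerically. The following is a minimal sketch using the textbook NIPALS iteration for PLS2 on centered data; it is not XLSTAT's implementation, which the text does not detail. The function name `pls2_nipals`, the convergence settings and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def pls2_nipals(X, Y, h, tol=1e-12, max_iter=500):
    """Illustrative NIPALS PLS2 on column-centered X (n x p) and Y (n x q)."""
    Xk = np.array(X, dtype=float)   # working copies, deflated at each component
    Yk = np.array(Y, dtype=float)
    n, p = Xk.shape
    q = Yk.shape[1]
    T, W = np.zeros((n, h)), np.zeros((p, h))
    P, C = np.zeros((p, h)), np.zeros((q, h))
    for a in range(h):
        # Start from the Y column with the largest remaining variance.
        u = Yk[:, int(np.argmax(Yk.var(axis=0)))]
        t_old = np.zeros(n)
        for _ in range(max_iter):
            w = Xk.T @ u
            w /= np.linalg.norm(w)      # X weights, unit norm
            t = Xk @ w                  # X scores
            c = Yk.T @ t / (t @ t)      # Y weights
            u = Yk @ c / (c @ c)        # Y scores
            if np.linalg.norm(t - t_old) <= tol * np.linalg.norm(t):
                break
            t_old = t
        p_a = Xk.T @ t / (t @ t)        # X loadings
        Xk = Xk - np.outer(t, p_a)      # deflate X
        Yk = Yk - np.outer(t, c)        # deflate Y
        T[:, a], W[:, a], P[:, a], C[:, a] = t, w, p_a, c
    W_star = W @ np.linalg.inv(P.T @ W)  # W*_h = W_h (P'_h W_h)^{-1}
    B = W_star @ C.T                     # regression coefficients of Y on X
    return T, C, W, P, W_star, B, Yk     # Yk now holds the residuals E_h

# Simulated, centered data (assumption: 30 observations, 5 X columns, 2 Y columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
X -= X.mean(axis=0)
Y = X @ rng.normal(size=(5, 2)) + 0.1 * rng.normal(size=(30, 2))
Y -= Y.mean(axis=0)

T, C, W, P, W_star, B, E = pls2_nipals(X, Y, h=2)
print(np.allclose(T, X @ W_star))    # checks T_h = X W*_h
print(np.allclose(Y, X @ B + E))     # checks Y = X B + E_h
```

In this sketch, predictions for new centered observations would be `X_new @ B`, with the Y means added back. When h equals the number of columns of a full-rank X, `B` coincides with the OLS coefficients.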

### PLS regression results: Correlation, observations charts and biplots

A great advantage of PLS regression over classical regression is the set of available charts that describe the data structure. Thanks to the correlation and loading plots, it is easy to study the relationships among the variables. These can be relationships among the explanatory variables or among the dependent variables, as well as between explanatory and dependent variables.

The score plot gives information about sample proximity and dataset structure.

The biplot gathers all this information in one chart.

### Prediction with Partial Least Squares regression

PLS regression is also used to build predictive models. XLSTAT enables you to predict values for new samples.

### General remarks about PLS regression

The three methods (Partial Least Squares regression, Principal Component Regression and Ordinary Least Squares regression) give the same results if the number of components obtained from the PCA (in PCR) or from the PLS regression is equal to the number of explanatory variables.

The components obtained from PLS regression are built to explain Y as well as possible, while the components of PCR are built to describe X as well as possible. XLSTAT-PLS partly compensates for this drawback of PCR by allowing the selection of the components that are most correlated with Y.

# Principal Component Regression

### Principal Component Regression principle

PCR (Principal Components Regression) is a regression method that can be divided into three steps:

- The first step is to run a PCA (Principal Components Analysis) on the table of the explanatory variables,
- Then run an Ordinary Least Squares regression (OLS regression) on the selected components,
- Finally compute the parameters of the model that correspond to the input variables.

### Principal Component Regression models

PCA transforms an X table with n observations described by p variables into an S table with n scores described by q components, where q is less than or equal to p and such that (S’S) is invertible. An additional selection can be applied to the components so that only the r components most correlated with the Y variable are kept for the OLS regression step. This yields the R table.

The OLS regression is performed on the Y and R tables. In order to circumvent the interpretation problem with the parameters obtained from the regression, XLSTAT transforms the results back into the initial space to obtain the parameters and the confidence intervals that correspond to the input variables.
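The three PCR steps can be sketched as follows. This is a minimal illustration, not XLSTAT's procedure: it assumes a PCA on centered, unscaled variables (the text does not say whether the PCA is run on covariances or correlations), it handles a single dependent variable, and it omits the confidence intervals. The function name `pcr` and its parameter `r` are hypothetical.

```python
import numpy as np

def pcr(X, y, r=None):
    """PCA on X, OLS on the retained scores, coefficients mapped back to X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean

    # Step 1: PCA via SVD; keep the q components with non-negligible variance
    # so that S'S is invertible.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    q = int(np.sum(s > s[0] * 1e-10))
    V = Vt[:q].T                 # p x q loadings
    S = Xc @ V                   # n x q scores (the S table)

    # Optional selection: keep the r components most correlated with y
    # (this gives the R table).
    if r is not None and r < q:
        corr = np.abs(S.T @ yc) / (np.linalg.norm(S, axis=0) * np.linalg.norm(yc))
        keep = np.sort(np.argsort(corr)[::-1][:r])
        V, S = V[:, keep], S[:, keep]

    # Step 2: OLS of y on the retained scores.
    gamma = np.linalg.solve(S.T @ S, S.T @ yc)

    # Step 3: transform back to the space of the input variables.
    beta = V @ gamma
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Simulated data (assumption): 40 observations, 4 explanatory variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=40)

b0_all, b_all = pcr(X, y)        # all components: same fit as OLS
b0_two, b_two = pcr(X, y, r=2)   # only the 2 components most correlated with y
print(b0_all, b_all)
print(b0_two, b_two)
```

Mapping the coefficients back through the loadings is what lets the result be read in terms of the original variables rather than the components.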

### PCR results: Correlation and observations charts and biplots

As PCR is built on PCA, a great advantage of PCR over classical regression is the set of available charts that describe the data structure.

Thanks to the correlation and loading plots, it is easy to study the relationships among the variables. These can be relationships among the explanatory variables, as well as between explanatory and dependent variables.

The score plot gives information about sample proximity and dataset structure.

The biplot gathers all this information in one chart.

### Prediction with Principal Component Regression

Principal Component Regression is also used to build predictive models. XLSTAT enables you to predict values for new samples.

# Ordinary Least Squares regression (OLS)

### Equations for the Ordinary Least Squares regression

Ordinary Least Squares regression (OLS) is more commonly named linear regression (simple or multiple depending on the number of explanatory variables).

In the case of a model with p explanatory variables, the OLS regression model is written:

$$
Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon
$$

where $Y$ is the dependent variable, $\beta_0$ is the intercept of the model, $X_j$ is the j-th explanatory variable of the model (j = 1 to p), and $\varepsilon$ is the random error with expectation 0 and variance $\sigma^2$.

In the case where there are n observations, the predicted value of the dependent variable Y for the i-th observation is given by:

$$
\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij}
$$

The OLS method corresponds to minimizing the sum of squared differences between the observed and predicted values. This minimization leads to the following estimators of the parameters of the model:

$$
\hat{\beta} = (X' D X)^{-1} X' D y
$$

$$
\hat{\sigma}^2 = \frac{1}{W - p^*} \sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2
$$

where $\hat{\beta}$ is the vector of the estimators of the $\beta_j$ parameters, $X$ is the matrix of the explanatory variables preceded by a vector of 1s, $y$ is the vector of the n observed values of the dependent variable, $p^*$ is the number of explanatory variables, to which we add 1 if the intercept is not fixed, $w_i$ is the weight of the i-th observation, $W$ is the sum of the $w_i$ weights, and $D$ is a matrix with the $w_i$ weights on its diagonal.

The vector of the predicted values is written:

$$
\hat{y} = X (X' D X)^{-1} X' D y
$$
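The estimators above translate almost line for line into NumPy. The sketch below assumes the intercept is estimated (so p* = p + 1) and uses simulated data and weights; the function name `weighted_ols` is hypothetical.

```python
import numpy as np

def weighted_ols(X_vars, y, w):
    """Weighted OLS: beta = (X'DX)^{-1} X'Dy, with X preceded by a column of 1s."""
    X_vars = np.asarray(X_vars, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    X = np.column_stack([np.ones(len(y)), X_vars])  # column of 1s for the intercept
    D = np.diag(w)                                  # weights on the diagonal
    beta = np.linalg.solve(X.T @ D @ X, X.T @ D @ y)
    y_hat = X @ beta                                # predicted values
    W = w.sum()                                     # sum of the weights
    p_star = X.shape[1]                             # p + 1: intercept is estimated
    sigma2 = np.sum(w * (y - y_hat) ** 2) / (W - p_star)
    return beta, y_hat, sigma2

# Simulated data (assumption): 25 observations, 3 explanatory variables.
rng = np.random.default_rng(0)
X_vars = rng.normal(size=(25, 3))
y = 0.5 + X_vars @ np.array([1.0, -2.0, 3.0]) + 0.2 * rng.normal(size=25)
w = rng.uniform(0.5, 2.0, size=25)

beta, y_hat, sigma2 = weighted_ols(X_vars, y, w)
print(beta, sigma2)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, which is usually the numerically safer choice.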

### Limitation of the Ordinary Least Squares regression

The limitations of OLS regression come from the need to invert the X’X matrix: the rank of the matrix must be p+1, and numerical problems may arise if the matrix is not well behaved. XLSTAT uses algorithms due to Dempster (1969) that make it possible to circumvent these two issues: if the matrix rank equals q, where q is strictly lower than p+1, some variables are removed from the model, either because they are constant or because they belong to a block of collinear variables.
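To illustrate the kind of removal described above, here is a simple greedy rank check. It is not Dempster's algorithm, whose details the text does not give; it only shows how constant columns and columns that depend linearly on earlier ones can be detected. The function name `drop_dependent_columns` and the data are assumptions.

```python
import numpy as np

def drop_dependent_columns(X_vars):
    """Return indices of columns that add rank beyond the intercept and earlier columns."""
    X_vars = np.asarray(X_vars, dtype=float)
    n = X_vars.shape[0]
    basis = np.ones((n, 1))          # the intercept column is always kept
    kept = []
    for j in range(X_vars.shape[1]):
        candidate = np.column_stack([basis, X_vars[:, j]])
        # Keep the column only if it increases the rank of the design matrix.
        if np.linalg.matrix_rank(candidate) > basis.shape[1]:
            basis = candidate
            kept.append(j)
    return kept

rng = np.random.default_rng(0)
x1 = rng.normal(size=20)
x2 = rng.normal(size=20)
X_vars = np.column_stack([
    x1,
    x2,
    np.full(20, 5.0),   # constant: collinear with the intercept
    x1 + x2,            # exact linear combination of the first two columns
])

kept = drop_dependent_columns(X_vars)
print(kept)
```

Because the check is greedy, the result depends on column order: the first columns of a collinear block are kept and the later ones dropped.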

### Variable selection in the OLS regression

Variables are selected automatically if the user selects too many variables relative to the number of observations. The theoretical limit is n-1, as with greater values the X’X matrix becomes non-invertible.
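The n-1 limit follows from the fact that X’X has the same rank as X, which cannot exceed n. A small numerical check, using simulated data and a hypothetical helper `xtx_is_invertible`, might look like this:

```python
import numpy as np

def xtx_is_invertible(n, p, seed=0):
    """Check whether X'X has full rank p + 1 for a random n x (p + 1) design matrix."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    # rank(X'X) equals rank(X), and rank(X) is at most min(n, p + 1).
    return bool(np.linalg.matrix_rank(X) == p + 1)

n = 6
for p in (4, 5, 6):
    print(p, xtx_is_invertible(n, p))
```

With n = 6 observations, the check passes up to p = 5 = n-1 and fails from p = 6 onward, since the design matrix then has more columns than rows.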

Deleting some of the variables may, however, not be optimal: in some cases a variable is not added to the model because it is almost collinear with some other variables or with a block of variables, whereas it might be more relevant to remove a variable that is already in the model and to add the new variable.

For that reason, and also to handle cases where there are many explanatory variables, other methods have been developed.

### Prediction

Linear regression is often used to predict output values for new samples. XLSTAT enables you to characterize the quality of the model for prediction before you go ahead and use it for that purpose.

# PLS discriminant analysis

PLS regression can be adapted to fit discriminant analysis. PLS discriminant analysis uses the PLS algorithm to explain and predict the membership of observations in several classes using quantitative or qualitative explanatory variables. XLSTAT-PLS uses the PLS2 algorithm applied to the full disjunctive table obtained from the qualitative dependent variable.

PLS discriminant analysis can be applied in many cases where classical discriminant analysis cannot, for example when the number of observations is low and the number of explanatory variables is high. When there are missing values, PLS discriminant analysis can be applied to the data that is available. Finally, like PLS regression, it is well suited to cases where multicollinearity between explanatory variables is high.

As many models as there are categories of the dependent variable are obtained. An observation is assigned to the category whose equation yields the highest value.
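The approach described above (disjunctive table, PLS2, one equation per category, highest value wins) can be sketched as follows. This is an illustration under assumptions, not XLSTAT's code: it uses a textbook NIPALS PLS2 with a fixed number of inner iterations, quantitative explanatory variables only, no missing values, and simulated data. All function names are hypothetical.

```python
import numpy as np

def disjunctive_table(labels):
    """One indicator column per category of the qualitative dependent variable."""
    classes = sorted(set(labels))
    Y = np.array([[1.0 if lab == c else 0.0 for c in classes] for lab in labels])
    return classes, Y

def pls2_coefficients(X, Y, h, n_iter=200):
    """Textbook NIPALS PLS2 on centered X and Y; returns B = W (P'W)^{-1} C'."""
    Xk, Yk = X.copy(), Y.copy()
    W, P, C = [], [], []
    for _comp in range(h):
        u = Yk[:, int(np.argmax(Yk.var(axis=0)))]
        for _it in range(n_iter):
            w = Xk.T @ u
            w /= np.linalg.norm(w)
            t = Xk @ w
            c = Yk.T @ t / (t @ t)
            u = Yk @ c / (c @ c)
        p = Xk.T @ t / (t @ t)
        Xk = Xk - np.outer(t, p)
        Yk = Yk - np.outer(t, c)
        W.append(w); P.append(p); C.append(c)
    W, P, C = np.column_stack(W), np.column_stack(P), np.column_stack(C)
    return W @ np.linalg.inv(P.T @ W) @ C.T

def fit_plsda(X, labels, h):
    classes, Y = disjunctive_table(labels)
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    B = pls2_coefficients(X - x_mean, Y - y_mean, h)
    return classes, x_mean, y_mean, B

def predict_plsda(model, X_new):
    classes, x_mean, y_mean, B = model
    scores = y_mean + (X_new - x_mean) @ B      # one equation per category
    return [classes[i] for i in scores.argmax(axis=1)]

# Simulated data (assumption): two well-separated groups, 3 explanatory variables.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 3)),
               rng.normal(8.0, 1.0, size=(20, 3))])
labels = ["a"] * 20 + ["b"] * 20

model = fit_plsda(X, labels, h=1)
predicted = predict_plsda(model, X)

# Confusion matrix: rows are observed categories, columns are predicted ones.
classes = model[0]
confusion = np.zeros((len(classes), len(classes)), dtype=int)
for true, pred in zip(labels, predicted):
    confusion[classes.index(true), classes.index(pred)] += 1
print(confusion)
```

The confusion matrix here is computed on the training data only, so it says nothing about performance on new observations.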

PLS discriminant analysis offers an interesting alternative to classical linear discriminant analysis.

The output combines the outputs of PLS regression with classical discriminant analysis outputs such as the confusion matrix.