XLSTAT-CCR, Correlated Component Regression
The XLSTAT-CCR module focuses on regression analysis (linear regression, logistic regression, etc.) where large numbers of correlated predictors may be available. On many data sets, Correlated Component Regression (CCR) has been shown to outperform penalized regression techniques such as Lasso, and other methods such as Naive Bayes and PLS regression.
XLSTAT-CCR develops reliable regression models using CCR methods. CCR models may even include more predictors than cases, a situation that is impossible with traditional regression methods. CCR was developed by Dr. Jay Magidson for simultaneously estimating regression models and excluding irrelevant predictors. Reliable models are obtained using a fast algorithm that incorporates M-fold cross-validation for tuning model parameters (optimal number of predictors and amount of regularization).
Statistical Innovations, a Boston-based firm that specializes in innovative applications of statistical modeling, was founded in 1981 by Dr. Jay Magidson. Statistical Innovations has since pioneered many techniques that have become standard in data analysis and data mining.
For more information on Statistical Innovations, please visit www.statisticalinnovations.com.
A trial version of XLSTAT-CCR is included in the main XLSTAT download.
Prices and ordering
For prices, on-line ordering and other purchasing information please go to our ordering page.
Correlated Component Regression (CCR)
The four regression methods available in the Correlated Component Regression (CCR) module use fast cross-validation to determine the amount of regularization needed to produce reliable predictions from data with P correlated explanatory (X) variables, where multicollinearity may exist and P can be greater than the sample size N. The methods are based on Generalized Linear Models (GLM). As an option, the CCR step-down algorithm may be activated to exclude irrelevant Xs.
The linear part of the model is a weighted average of K components S = (S1, S2, …, SK), where each component is itself a linear combination of the predictors. For a dichotomous Y, these methods provide an alternative to logistic regression (CCR-Logistic) and linear discriminant analysis (CCR-LDA). For a continuous Y, they provide an alternative to traditional linear regression methods, where the components may be correlated (CCR-LM procedure) or are restricted to be uncorrelated, as with the components obtained by PLS regression techniques (CCR-PLS). Typically K < P, resulting in model regularization that reduces prediction error.
Traditional maximum likelihood regression methods, which employ no regularization at all, can be obtained as a special case of these models when K=P (the saturated model). Regularization, inherent in the CCR methods, reduces the variance (instability) of prediction and also lowers the mean squared error of prediction when the predictors have moderate to high correlation. The smaller the value for K, the more regularization is applied. Typically, K will be less than 10 (quite often K = 2, 3 or 4) regardless of P. M-fold cross-validation techniques are used to determine the amount of regularization K* to apply, and the number of predictors P* to include in the model when the step-down algorithm is utilized.
When the CCR step-down option is activated with M-fold cross-validation, output includes a table of predictor counts, reporting the number of times each predictor is included in a model estimated with one omitted fold. The counts can be used as an alternative measure of variable importance (Tenenhaus, 2010), as a supplement to the standardized regression coefficients. Additional options can limit the number of predictors to be included in the model.
The regression methods in the XLSTAT-CCR module differ according to the assumptions made about the scale type of the dependent variable Y (continuous vs. dichotomous), and the distributions (if any) assumed about the predictors.
Linear regression (CCR.LM, PLS)
Predictions for the dependent variable Y based on the linear regression model are obtained as follows:
$$\mathrm{Pred}(Y) = S (S' D S)^{-1} S' D Y$$
where D is a diagonal matrix with case weights as the diagonal entries.
For example, with K=2 components we have:
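$$\mathrm{Pred}(Y) = \alpha + b_{1.2} S_1 + b_{2.1} S_2$$
where $b_{1.2}$ and $b_{2.1}$ are the component weights, and the components are defined as:
$$S_1 = \sum_{g=1}^{P} \lambda_{g.1} X_g \quad \text{and} \quad S_2 = \sum_{g=1}^{P} \lambda_{g.2} X_g$$
where $\lambda_{g.1}$ and $\lambda_{g.2}$ are the component coefficients (loadings) for the gth predictor on components $S_1$ and $S_2$ respectively.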
The component weights and loadings are obtained from traditional OLS regression. By substitution we get the reduced form expression:
$$\mathrm{Pred}(Y) = \alpha + \sum_{g=1}^{P} (b_{1.2} \lambda_{g.1} + b_{2.1} \lambda_{g.2}) X_g$$
where $\beta_g = b_{1.2} \lambda_{g.1} + b_{2.1} \lambda_{g.2}$ is the (regularized) regression coefficient associated with predictor $X_g$.
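The substitution above can be checked numerically. The following is a minimal sketch in Python with NumPy, not the module's implementation: the loadings are drawn at random purely for illustration (the module estimates them from the data), the data are simulated, and the intercept α is handled by adding a column of ones to S, which is an assumption about how the projection formula is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K = 50, 6, 2

# Simulated data and case weights (all hypothetical).
X = rng.normal(size=(N, P))
y = X @ rng.normal(size=P) + rng.normal(size=N)
case_weights = rng.uniform(0.5, 2.0, size=N)  # diagonal entries of D

# Hypothetical loadings lambda_{g.k} (P x K), for illustration only.
loadings = rng.normal(size=(P, K))

# Components S_k = sum_g lambda_{g.k} X_g, with a column of ones for alpha.
S = X @ loadings
design = np.column_stack([np.ones(N), S])

# Pred(Y) = S (S'DS)^{-1} S'DY, applied to the design with intercept.
D = np.diag(case_weights)
coef = np.linalg.solve(design.T @ D @ design, design.T @ D @ y)
alpha, b = coef[0], coef[1:]
pred_components = design @ coef

# Reduced form: beta_g = sum_k b_k * lambda_{g.k}.
beta = loadings @ b
pred_reduced = alpha + X @ beta
```

Both prediction vectors agree up to floating-point error, which is the content of the reduced form expression: the K component weights and the P x K loadings collapse into P regression coefficients.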
Regardless of which linear regression model (CCR.LM or PLS) is used to generate the predictions, when the number of components K equals the number of predictors P, the results are identical to those obtained from traditional least squares (OLS or WLS) regression. Traditional least squares regression produces unbiased predictions, but such predictions may have large variance and hence higher mean squared error than regularized solutions (K < P). Thus, predictions obtained from the CCR module are typically more reliable than those obtained from a traditional regression model.
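The K = P equivalence can be illustrated with a short Python sketch. It assumes only that the P components are linearly independent combinations of the predictors; the loadings below are random and hypothetical, not those estimated by CCR.LM or PLS. Because the components then span the same space as the predictors, least squares on the components reproduces the OLS predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 40, 5
X = rng.normal(size=(N, P))
y = X @ rng.normal(size=P) + rng.normal(size=N)

def fit_predict(design, target):
    """Least squares predictions from a design matrix, with an intercept."""
    A = np.column_stack([np.ones(len(design)), design])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return A @ coef

# Hypothetical P x P loadings; a random Gaussian matrix is full rank
# with probability 1, so the P components span the predictor space.
loadings = rng.normal(size=(P, P))

pred_saturated = fit_predict(X @ loadings, y)  # K = P components
pred_ols = fit_predict(X, y)                   # traditional OLS
```

With K < P the components span only a subspace of the predictors, so the two sets of predictions generally differ; that restriction is the source of the regularization.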
Methods CCR.LM and PLS assume that the dependent variable Y is continuous:
- CCR.LM is invariant to standardization and also allows the components to be correlated (recommended)
- PLS produces different results depending upon whether or not the predictors are standardized to have variance 1. By default, the PLS ‘standardize’ option is activated.
Logistic Regression (CCR.Logistic) and Linear Discriminant Analysis (CCR.LDA)
Logistic regression is the standard regression (classification) approach for predicting a dichotomous dependent variable. Both Linear and Logistic regression are GLM (Generalized Linear Models) in that a linear combination of the explanatory variables (‘linear predictor’) is used to predict a function of the dependent variable. In the case of linear regression, the mean of Y is predicted as a linear function of the X variables. For logistic regression, the logit of Y is predicted as a linear function of X.
$$\mathrm{Logit}(Y \mid S) = \alpha + b_{1.2} S_1 + b_{2.1} S_2$$
which in reduced form yields:
$$\mathrm{Logit}(Y \mid X) = \alpha + \sum_{g=1}^{P} (b_{1.2} \lambda_{g.1} + b_{2.1} \lambda_{g.2}) X_g$$
Logit(Y), defined as the natural logarithm of the probability of being in dependent variable group 1 (say Y=1) divided by the probability of being in group 2 (say Y=0), can easily be transformed to yield the probability of being in either category. For example, the conditional probability of being in group 1 can be expressed as:
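$$\mathrm{Prob}(Y=1 \mid X) = \frac{\exp(\mathrm{Logit}(Y \mid X))}{1 + \exp(\mathrm{Logit}(Y \mid X))} = \frac{1}{1 + \exp(-\mathrm{Logit}(Y \mid X))}$$
and
$$\mathrm{Prob}(Y=0 \mid X) = \frac{1}{1 + \exp(\mathrm{Logit}(Y \mid X))}$$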
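The transformation from logit to probability can be written as a few lines of Python. This is an illustrative sketch only: the function names, the coefficients and the case values are hypothetical and do not come from the module.

```python
import math

def logit_from_predictors(alpha, beta, x):
    """Linear predictor: alpha + sum_g beta_g * x_g."""
    return alpha + sum(b * v for b, v in zip(beta, x))

def prob_group1(logit):
    """Prob(Y=1|X) = 1 / (1 + exp(-Logit))."""
    return 1.0 / (1.0 + math.exp(-logit))

def prob_group0(logit):
    """Prob(Y=0|X) = 1 / (1 + exp(Logit))."""
    return 1.0 / (1.0 + math.exp(logit))

# Hypothetical reduced-form coefficients and one case, for illustration.
logit = logit_from_predictors(alpha=-0.5, beta=[0.8, -0.3], x=[1.0, 2.0])
p1, p0 = prob_group1(logit), prob_group0(logit)
```

The two probabilities always sum to 1, and a logit of 0 corresponds to a probability of 0.5 for each group.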
Thus, the logistic regression model is a model for predicting the probability of being in a particular group. Predictions are reported for group 1, which is defined as the category of Y associated with the higher of the 2 numeric values taken on by Y.
Linear Discriminant Analysis (LDA) is another model used commonly to obtain predicted probabilities for a dichotomous Y:
- CCR.LDA assumes that the X variables follow a multivariate normal distribution within each Y group, with different group means but common variances and covariances.
- CCR.Logistic makes no distributional assumptions.
The CCR.LM method applies CCR techniques to obtain a regularized linear regression based on the Correlated Component Regression (CCR) model for a continuous Y (Magidson, 2010; Magidson and Wassmann, 2010). It is recommended especially in cases where several explanatory variables have moderate to high correlation.
Method CCR.LM differs from Method PLS in that the components are allowed to be correlated, there is no need to deflate (and then restore) predictors, and similar to traditional OLS regression, predictions are invariant to linear transformations applied to the predictors. Thus, the explanatory variables do not need to be standardized prior to estimation.
The PLS method applies CCR techniques to obtain a regularized linear regression based on the PLS regression (PLS) model for a continuous Y. For an introduction to PLS regression see Tenenhaus (1998). For a comparison of the CCR.LM and PLS methods see Tenenhaus (2011).
Unlike CCR.LM which is invariant with respect to the scale of the predictors, when K < P, PLS regression can yield substantially different predictions depending upon whether the predictors are standardized or not. For a detailed comparison of CCR.LM, PLS with unstandardized Xs and PLS with standardized Xs, see Magidson (2011).
CCR.LDA and CCR.Logistic
The CCR.LDA and CCR.Logistic methods apply CCR techniques to obtain regularized regressions based on the Correlated Component Regression (CCR) model for a dichotomous Y.
Notes on Correlated Component Regression:
Case of K = P
When P < N, setting K = P yields the saturated regression model corresponding to the selected method (CCR.LM, PLS, CCR.LDA, or CCR.Logistic):
- Method CCR.LM (or PLS) is equivalent to OLS regression (for K = P)
- Method CCR.Logistic yields traditional Logistic regression (for K = P)
- Method CCR.LDA yields traditional Linear Discriminant Analysis (for K = P), where prior probabilities are computed from group sizes.
R rounds of M-fold Cross-validation (CV) may be used to determine the number of components K* and number of predictors P* to include in a model. For R>1 rounds, the standard error of the relevant CV statistic is also reported. When multiple records (rows) are associated with the same case ID (in XLSTAT, case IDs are specified using ‘Observation labels’), for each round, the CV procedure assigns all records corresponding to the same case to the same fold.
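The rule that all records of a case share a fold can be sketched in Python as follows. The function name `assign_folds`, the round-robin allocation and the use of a seed per round are assumptions made for illustration; the module's actual fold allocation is not described here.

```python
import random

def assign_folds(case_ids, m_folds, seed=0):
    """Return one fold index (0..m_folds-1) per record.

    Folds are drawn at the case level, so records that share a case ID
    always land in the same fold.
    """
    unique_cases = sorted(set(case_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_cases)
    # Deal the shuffled cases out to the folds in round-robin order.
    fold_of_case = {case: i % m_folds for i, case in enumerate(unique_cases)}
    return [fold_of_case[case] for case in case_ids]

# Hypothetical observation labels: cases "a" and "c" have several records.
case_ids = ["a", "a", "b", "c", "c", "c", "d"]

# One allocation per round; a different seed gives each round its own split.
folds_by_round = [assign_folds(case_ids, m_folds=3, seed=r) for r in range(2)]
```

Repeating the allocation with R different seeds gives the R rounds, and the spread of the CV statistic across rounds is what a standard error would summarize.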
The Automatic Option in M-fold Cross-Validation
When the CV option is run in Automatic mode (see the ‘Automatic’ option in the Options tab), a maximum number K is specified for the number of components, all K models containing between 1 and K components are estimated, and the model selected, K*, is the one with the best CV statistic. When the step-down option is also activated, the K models are estimated with all predictors prior to beginning the step-down algorithm.
The CV statistic used to determine K* depends upon the model type as follows:
- For CCR.LM or PLS: The CV-R2 is the default statistic. The Normed Mean Squared Error (NMSE) can be used instead.
- For CCR.LDA or CCR.Logistic: The CV-Accuracy (ACC), based on the probability cut-point of .5, is used by default. In the case of two or more values of K yielding identical values for CV-Accuracy, the one with the higher value for the Area Under the ROC Curve (AUC) is selected.
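The selection rule for the dichotomous methods can be sketched in Python as a sort on two keys. The CV results below are made-up numbers, and the function name `select_k` is hypothetical; the sketch only illustrates "best accuracy, ties broken by AUC".

```python
def select_k(results):
    """Pick K* as the K with the highest CV accuracy, breaking ties by AUC.

    `results` is a list of dicts with keys 'k', 'accuracy' and 'auc'.
    """
    best = max(results, key=lambda r: (r["accuracy"], r["auc"]))
    return best["k"]

# Hypothetical CV statistics for models with 1 to 4 components.
cv_results = [
    {"k": 1, "accuracy": 0.80, "auc": 0.85},
    {"k": 2, "accuracy": 0.84, "auc": 0.88},
    {"k": 3, "accuracy": 0.84, "auc": 0.90},
    {"k": 4, "accuracy": 0.82, "auc": 0.91},
]

k_star = select_k(cv_results)
```

Here K = 2 and K = 3 tie on accuracy, so the higher AUC decides in favor of K = 3. K = 4 has the highest AUC overall but is not selected because its accuracy is lower. For CCR.LM or PLS the same pattern would apply with a single key, the CV-R2.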
Predictor Selection Using the CCR/Step-Down Algorithm
In step 1 of the step-down option, a model containing all predictors is estimated with K* components (where K* is specified by the user or determined by the program if the Automatic option is activated), and the relevant CV statistics are computed. In step 2, the model is then re-estimated after excluding the predictor whose standardized coefficient is smallest in absolute value, and CV statistics are computed again. Note that both steps 1 and 2 are performed within each subsample formed by eliminating one of the folds. This process continues until the user-specified minimum number of predictors remains in the model (by default, Pmin = 1). The number of predictors included in the reported model, P*, is the one with the best CV statistic.
In any step of the algorithm, if the number of predictors remaining in the model falls below K*, the number of components is automatically reduced by 1, so that the model remains saturated. For example, suppose that K* = 5, but after a certain number of predictors are eliminated only P = 4 predictors remain. Then K* is reduced to 4 and the step-down algorithm continues.
If a maximum number of predictors to be included in a model, Pmax, is specified, the step-down algorithm still begins with all predictors included in the model, but results are reported only for P less than or equal to Pmax, and the CV statistics are only examined for P in the range [Pmin , Pmax].
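The step-down procedure described above can be outlined in Python. This is a sketch under several stated assumptions, not the module's code: the estimator `ols_fit` is a plain OLS stand-in that accepts the number of components k but ignores it (a CCR fit would use it), the CV statistic is taken to be 1 minus the ratio of held-out squared error to total sum of squares, and the names `step_down_cv`, `p_min` and `p_max` are hypothetical.

```python
import numpy as np

def ols_fit(X, y, k):
    """Stand-in estimator (NOT CCR): OLS with intercept; k is ignored."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]

def step_down_cv(X, y, folds, k_star, fit=ols_fit, p_min=1, p_max=None):
    """Backward elimination within each training subsample, scored by CV."""
    n, p_all = X.shape
    p_max = p_all if p_max is None else p_max
    sizes = range(p_min, p_all + 1)
    preds = {p: np.empty(n) for p in sizes}
    counts = {p: np.zeros(p_all, dtype=int) for p in sizes}
    for f in np.unique(folds):
        train, test = folds != f, folds == f
        active = list(range(p_all))      # always start with all predictors
        k = k_star
        while True:
            p = len(active)
            k = min(k, p)                # never more components than predictors
            Xa = X[train][:, active]
            alpha, beta = fit(Xa, y[train], k)
            preds[p][test] = alpha + X[test][:, active] @ beta
            counts[p][active] += 1       # predictor inclusion counts
            if p == p_min:
                break
            # Standardized coefficients up to a common factor 1/sd(y).
            std_beta = beta * Xa.std(axis=0)
            active.pop(int(np.argmin(np.abs(std_beta))))
    ss_tot = np.sum((y - y.mean()) ** 2)
    cv_r2 = {p: 1.0 - np.sum((y - preds[p]) ** 2) / ss_tot for p in sizes}
    # Only model sizes in [p_min, p_max] are examined when choosing P*.
    p_star = max((p for p in sizes if p <= p_max), key=cv_r2.get)
    return p_star, cv_r2, counts[p_star]

# Simulated example: only the first two of six predictors matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=100)
folds = np.arange(100) % 5
p_star, cv_r2, counts = step_down_cv(X, y, folds, k_star=3)
```

The returned `counts` vector plays the role of the table of predictor counts: for the selected model size it records in how many of the M training subsamples each predictor was retained.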