Using Correlated Component Regression with a Dichotomous Y and Many Correlated Predictors
Dataset for running Correlated Component Regression LDA model (CCR.LDA)
This tutorial is based on data simulated according to the assumptions of Linear Discriminant Analysis (LDA) with 2 groups (ZPC1=1,0). The number of available predictors is P = 84 including 28 valid predictors (listed in Table 1 with their true coefficients), some with high within-group correlation, and 56 irrelevant predictors ‘INDPT1’ – ‘INDPT28’ and ‘extra1’ – ‘extra28’ (with true coefficients equal to 0). We generated 100 simulated samples, each consisting of N=50 cases, with equal group sizes N1 = N2 = 25. You can download the data here.
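As a rough illustration of what "simulated according to the assumptions of LDA" means, the sketch below draws one sample of N=50 cases from two multivariate normal groups that share a covariance matrix. The mean shift (0.5), the within-group correlation (0.6), and the block structure of the covariance are hypothetical values chosen for illustration; they are not the values from Table 1 or from the tutorial's data file.

```python
import numpy as np

def simulate_lda_sample(n_per_group=25, n_valid=28, n_irrelevant=56,
                        rho=0.6, shift=0.5, seed=0):
    """Draw one two-group sample under LDA assumptions (illustrative values only)."""
    rng = np.random.default_rng(seed)
    p = n_valid + n_irrelevant

    # Hypothetical group mean difference: only the valid predictors differ.
    delta = np.zeros(p)
    delta[:n_valid] = shift

    # Common within-group covariance (assumption): valid predictors are
    # equicorrelated, irrelevant predictors are independent with unit variance.
    cov = np.eye(p)
    cov[:n_valid, :n_valid] = rho + (1 - rho) * np.eye(n_valid)

    # Group membership: first 25 cases in group 1, last 25 in group 0.
    y = np.repeat([1, 0], n_per_group)
    noise = rng.multivariate_normal(np.zeros(p), cov, size=2 * n_per_group)
    X = noise + np.outer(y, delta)

    # Under LDA the true logit coefficients are cov^{-1} (mu1 - mu0).
    true_beta = np.linalg.solve(cov, delta)
    return X, y, true_beta

X, y, true_beta = simulate_lda_sample()
print(X.shape, int(y.sum()), true_beta[:3], true_beta[-3:])
```

With this covariance structure the irrelevant predictors have true coefficients of exactly 0, which mirrors the role of the 'INDPT' and 'extra' variables in the tutorial data.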
Table 1: True LDA Logit Model Coefficients
Goal of the CCR.LDA model in this example
CCR will apply the proper amount of regularization (K components) to reduce the confounding effects of high predictor correlation, and the CCR step-down algorithm will be used to exclude irrelevant and weak predictors, resulting in a model with a relatively small number of predictors P*. This results in a sparse model that provides better prediction (better classification) and coefficient estimates closer to the true values than traditional stepwise LDA, which imposes no regularization at all.
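The two ideas in this paragraph, K-component regularization and step-down elimination, can be sketched in a few lines. The code below is a simplified reading of the approach described in Magidson (2010), not XLSTAT's implementation: it uses ordinary least squares on a 0/1 outcome as a stand-in for the LDA variant (for two groups the least-squares direction is proportional to the LDA direction), and the function names `ccr_lm_coefficients` and `step_down` are hypothetical. Each component is a weighted sum of the predictors, where a predictor's weight is its coefficient when Y is regressed on that predictor plus the earlier components.

```python
import numpy as np

def ccr_lm_coefficients(X, y, n_components):
    """Simplified CCR sketch: coefficients for the centered predictors."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n, p = Xc.shape
    components = np.empty((n, 0))
    loadings = []  # one length-p weight vector per component
    for _ in range(n_components):
        lam = np.empty(p)
        for g in range(p):
            # Regress y on the earlier components plus predictor g alone;
            # the weight for predictor g is its coefficient in that fit.
            design = np.column_stack([components, Xc[:, g]])
            coef, *_ = np.linalg.lstsq(design, yc, rcond=None)
            lam[g] = coef[-1]
        loadings.append(lam)
        components = np.column_stack([components, Xc @ lam])
    # Regress y on the K components, then express the fit in terms of X.
    b, *_ = np.linalg.lstsq(components, yc, rcond=None)
    return np.column_stack(loadings) @ b

def step_down(X, y, n_components, n_keep):
    """Drop the predictor with the smallest |standardized coefficient| each round."""
    kept = list(range(X.shape[1]))
    while len(kept) > n_keep:
        Xk = X[:, kept]
        beta = ccr_lm_coefficients(Xk, y, min(n_components, len(kept)))
        standardized = np.abs(beta * Xk.std(axis=0))
        kept.pop(int(np.argmin(standardized)))
    return kept

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] + 0.1 * rng.normal(size=200)
print(step_down(X_demo, y_demo, n_components=1, n_keep=2))
```

In the tutorial the number of retained predictors P* is chosen by cross-validation rather than fixed in advance; the `n_keep` argument here only keeps the sketch short.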
For illustration, this tutorial focuses on simulation #1 (N=50). A summary of the results from all 100 simulations can be found in Magidson (2010).
Setting up a Correlated Component Regression LDA
To open the Correlated Component Regression dialog box, first start XLSTAT by clicking its button in the Excel toolbar, then select the XLSTAT / Modeling data / Correlated Component Regression command in the Excel menu, or click the corresponding button on the Modeling data toolbar.
Once you have clicked the button, the Correlated Component Regression dialog box is displayed with the Method=CCR.LM (linear regression model) selected by default. In the Method section, select the CCR.LDA (linear discriminant analysis model) option.
Figure 1. General Tab
In the Y/ Dependent variables field, use your mouse to select the (column A) variable ‘ZPC1’.
The ZPC1 values are the "Ys" of the model, since we want to predict the probability of being in group ZPC1=1 as a function of the 84 predictors. Specifically, Logit(Y) is modeled as a linear function of the predictors, where Logit(Y) = ln(P/(1-P)) and P = Prob[Y=1|X]. Equivalently, P = exp(Logit(Y))/(1+exp(Logit(Y))).
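The relationship between the logit and the predicted probability can be checked numerically. The helper names `logit` and `inverse_logit` below are illustrative and are not part of XLSTAT.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Probability corresponding to a log-odds value z."""
    return 1.0 / (1.0 + math.exp(-z))

# A logit of 0 corresponds to a probability of 0.5, and the two
# functions undo each other.
print(logit(0.5), inverse_logit(logit(0.8)))
```

A case is classified into group ZPC1=1 when its predicted probability exceeds the chosen cutoff; with a cutoff of 0.5 this is the same as a positive logit.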
In the X/ Predictors field, select the 84 predictors.
In the Observation labels field, select the case ID of the subjects (ID).
Figure 2. General Tab
In the Options tab of the dialog box, enter ‘5’ as the number of components and activate the Step-down option. Make sure that the settings are as shown below.
Figure 3. Options Tab
In the Validation tab of the dialog box, activate the Validation option and select ‘N last rows’ from the Validation set drop down menu. In the Number of observations field, type ‘4950’. We have now specified the ‘Training set’ as the first 50 rows of the data file (simulation #1) and the last 4,950 rows of the data file will be used as the validation set (simulations #2-100). Activate the Cross-validation option and change the default number of folds from ‘10’ to ‘5’. Activate the ‘Stratify’ option.
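The effect of these settings can be sketched outside XLSTAT. The code below splits 5,000 rows into the first 50 (training) and the last 4,950 (validation), then assigns the training cases to 5 stratified folds so that each fold contains the same number of cases from each group. How XLSTAT assigns folds internally is not documented here, so the random assignment and the function names are assumptions made for illustration.

```python
import numpy as np

def split_last_rows(n_rows, n_validation):
    """Return (training indices, validation indices) for an 'N last rows' split."""
    idx = np.arange(n_rows)
    cut = n_rows - n_validation
    return idx[:cut], idx[cut:]

def stratified_folds(y, n_folds=5, seed=0):
    """Assign each case a fold number, balancing group membership across folds."""
    rng = np.random.default_rng(seed)
    folds = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        # Shuffle the members of this group, then deal them out round-robin.
        members = rng.permutation(np.flatnonzero(y == label))
        folds[members] = np.arange(len(members)) % n_folds
    return folds

train_idx, valid_idx = split_last_rows(5000, 4950)
y_train = np.repeat([1, 0], 25)          # 25 cases per group, as in simulation #1
folds = stratified_folds(y_train, n_folds=5)
print(len(train_idx), len(valid_idx), np.bincount(folds))
```

With 25 cases per group and 5 folds, each fold holds 5 cases from each group, so every cross-validation model is estimated on 40 cases and evaluated on 10.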
Make sure that the settings are as shown below.
Figure 4. Validation Tab
Estimate the 5-component model
Click OK to estimate the model.
Interpreting the Results of a CCR Model with 10 Predictors
The Cross-Validation Step-down Plot shows that for K=5 components the Cross-validation Accuracy (CV-ACC) is best with P=10 predictors.
Figure 5. Plot of Cross-validated Area Under ROC Curve (CV-AUC) and Cross-validated Accuracy (CV-ACC) for K=5, N=50
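The two statistics shown in the plot can be computed from held-out scores as follows. This is a minimal sketch: the cutoff of 0 on the logit scale and the pairwise definition of the area under the ROC curve are standard choices, but whether XLSTAT uses exactly these conventions is an assumption.

```python
import numpy as np

def accuracy(y_true, scores, cutoff=0.0):
    """Share of cases whose predicted group (score > cutoff) matches y_true."""
    predicted = (scores > cutoff).astype(int)
    return float(np.mean(predicted == y_true))

def auc(y_true, scores):
    """Probability that a random group-1 case outscores a random group-0 case."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    # Ties count as half a win.
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

y_true = np.array([1, 1, 0, 0])
scores = np.array([2.0, -0.5, 0.5, -1.0])
print(accuracy(y_true, scores), auc(y_true, scores))  # 0.5 0.75
```

In cross-validation, each case is scored by the model estimated without its fold, and the statistics are then computed over all held-out scores.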
Correlated Component Regression: Unstandardized coefficients for the 5-component model with 10 predictors are given below.
These results obtained from CCR.LDA outperform step-wise linear discriminant analysis in the following respects:
- More valid predictors included in the model: 10 for CCR.LDA vs. 4 for step-wise LDA.
- Fewer irrelevant predictors included in the model: 0 for CCR.LDA vs. 2 for step-wise LDA.
- Higher accuracy as determined from the validation sample: 83.6% for CCR.LDA vs. 77.8% for step-wise LDA.
The results for step-wise LDA are provided below.
Overall, the results based on all simulated samples show that CCR.LDA outperforms step-wise LDA as well as penalized regression on these data (Magidson, 2010).
Reference: Magidson (2010). Correlated Component Regression: A Prediction/Classification Methodology for Possibly Many Features. 2010 Proceedings of the American Statistical Association.