# XLSTAT-Multiblock Data Analysis

XLSTAT's Multiblock Data Analysis module (formerly called XLSTAT-ADA) makes it possible for XLSTAT users to run advanced multivariate analyses. These methods are useful in a variety of applications, ranging from ecology to marketing.

## Features

- Canonical Correspondence Analysis (CCA and partial CCA)
- Generalized Procrustean Analysis (GPA)
- Multiple Factor Analysis (MFA)
- Redundancy analysis (RDA)
- Principal Coordinate Analysis
- Canonical Correlation Analysis (CCorA)

You can find a tutorial about using XLSTAT-Multiblock Data Analysis here.

## Demo version

A trial version of XLSTAT-Multiblock Data Analysis is included in the main XLSTAT-Base download.

## Prices and ordering

These analyses are included in the XLSTAT-Ecology, XLSTAT-Psy and XLSTAT-Premium packages.

# DETAILED DESCRIPTIONS

# Canonical Correspondence Analysis (CCA and partial CCA)

View a tutorial

### What is Canonical Correspondence Analysis

Canonical Correspondence Analysis (CCA) was developed to allow ecologists to relate the abundance of species to environmental variables. However, the method can also be used in other domains, such as geomarketing and demographic analysis.

Canonical Correspondence Analysis allows you to obtain a simultaneous representation of the sites, the objects, and the variables describing the sites in two or three dimensions that are optimal for a variance criterion.

### Principles of Canonical Correspondence Analysis

Let T1 be a contingency table corresponding to the counts on n sites of p objects. This table can be analyzed using Correspondence Analysis (CA) to obtain a simultaneous map of the sites and objects in two or three dimensions.

Let T2 be a table that contains the measures recorded on the same n sites for q quantitative and/or qualitative variables.

Canonical Correspondence Analysis can be divided into two parts:

- A constrained analysis in a space whose number of dimensions equals q. This part is the one of main interest, as it corresponds to the analysis of the relation between the two tables T1 and T2.
- An unconstrained part, which corresponds to the analysis of the residuals. The number of dimensions for the unconstrained CCA is equal to min(n-1-q, p-1).
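The two parts above can be sketched with NumPy. This is a minimal illustration of the usual transformation-based algorithm (chi-square residuals of T1, weighted regression on T2, then a singular value decomposition of the fitted values); the variable names and simulated data are illustrative, not XLSTAT's internals:

```python
import numpy as np

rng = np.random.default_rng(0)
T1 = rng.integers(1, 20, size=(10, 6)).astype(float)  # counts of p=6 objects on n=10 sites
T2 = rng.normal(size=(10, 3))                         # q=3 quantitative variables

P = T1 / T1.sum()                                     # relative frequencies
r, c = P.sum(axis=1), P.sum(axis=0)                   # site and object weights
Q = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))    # chi-square standardized residuals

Xc = T2 - r @ T2                                      # weighted centering of T2
Xw = np.sqrt(r)[:, None] * Xc                         # weight the rows of T2
H = Xw @ np.linalg.pinv(Xw)                           # projector onto the T2 space
Q_fit = H @ Q                                         # constrained part (rank <= q)

eig = np.linalg.svd(Q_fit, compute_uv=False) ** 2     # constrained eigenvalues
```

The eigenvalues of the unconstrained part would come from the residual matrix `Q - Q_fit` in the same way.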

### Two methods derived from Canonical Correspondence Analysis

- Partial Canonical Correspondence Analysis adds a preliminary step. The T2 table is subdivided into two groups of variables: the first group contains conditioning variables whose effect we want to remove, because it is either known or of no interest for the study. A first Canonical Correspondence Analysis is run using these variables. A second Canonical Correspondence Analysis is then run using the second group of variables, whose effect we want to analyze. Partial Canonical Correspondence Analysis therefore allows you to analyze the effect of the second group of variables after the effect of the first group has been removed.
- PLS-Canonical Correspondence Analysis: It is possible to relate discriminant PLS to Canonical Correspondence Analysis. Addinsoft is the first software editor to propose a comprehensive and effective integration of the two methods. After a restructuring of the data, a PLS step is applied, either to create orthogonal PLS components that are optimally designed for the Canonical Correspondence Analysis (which removes the constraints on the number of variables that can be used), or to select the most influential variables before running the Canonical Correspondence Analysis. As the calculations and results of the Canonical Correspondence Analysis step are identical to those of the classical Canonical Correspondence Analysis, users can see this approach as a selection method that identifies the variables of highest interest, either because they are selected in the model, or by looking at the chart of the VIPs (see the section on PLS regression for more information). In the case of a partial Canonical Correspondence Analysis, the preliminary step is unchanged.

### Results for Canonical Correspondence Analysis in XLSTAT

- Inertia: This table displays the distribution of the inertia between the constrained and the unconstrained Canonical Correspondence Analysis.
- Eigenvalues and percentages of inertia: These tables display, for the constrained and the unconstrained Canonical Correspondence Analysis, the eigenvalues, the corresponding inertia, and the corresponding percentages, either in terms of constrained (or unconstrained) inertia, or in terms of total inertia.
- Weighted averages: This table displays the weighted means as well as the global weighted means.
- Principal coordinates and standard coordinates: The principal coordinates and standard coordinates of the sites, the objects and the variables are then displayed. These coordinates are used to produce the various charts.
- Regression coefficients: This table displays the regression coefficients of the variables in the factor space.
- Sites and objects maps:
  - Sites and objects / Symmetric chart
  - Sites / Asymmetric chart
  - Objects / Asymmetric chart
  - Sites
  - Objects

# Generalized Procrustean Analysis (GPA)

View a tutorial

### When to use Generalized Procrustean Analysis

Generalized Procrustean Analysis is used in sensory data analysis before a Preference Mapping to reduce scale effects and to obtain a consensus configuration. It also allows you to compare the proximity between the terms used by different experts to describe products.

### Principle of Generalized Procrustean Analysis

We call a configuration an n x p matrix that corresponds to the description of n objects (or individuals/cases/products) on p dimensions (or attributes/variables/criteria/descriptors).

We call the consensus configuration the mean configuration computed from the m configurations. Procrustes Analysis is an iterative method that reduces, by applying transformations to the configurations (rescaling, translations, rotations, reflections), the distance of the m configurations to the consensus configuration, the latter being updated after each transformation.

Let us take the example of 5 experts rating 4 cheeses according to 3 criteria, with ratings from 1 to 10. One expert may tend to be harsher in his ratings, leading to a downward shift of his ratings, while another may tend to rate around the average, without daring to use extreme ratings. Working on a simple average configuration could therefore lead to false interpretations. One can easily see that a translation of the first expert's ratings is necessary, and that rescaling the second expert's ratings would likely bring them closer to those of the other experts.

Once the consensus configuration has been obtained, it is possible to run a PCA (Principal Components Analysis) on the consensus configuration in order to allow an optimal visualization in two or three dimensions.
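The iterative scheme described above can be sketched as follows. This is a bare-bones Procrustes loop (translation, rotation/reflection, isotropic rescaling), not XLSTAT's Gower or Commandeur implementation, and the `gpa` function and simulated data are illustrative assumptions:

```python
import numpy as np

def gpa(configs, n_iter=100):
    """Bare-bones GPA: translation, rotation/reflection, isotropic rescaling."""
    X = [C - C.mean(axis=0) for C in configs]        # remove translation
    consensus = np.mean(X, axis=0)
    for _ in range(n_iter):
        for i in range(len(X)):
            # best rotation/reflection of configuration i onto the consensus
            U, _, Vt = np.linalg.svd(X[i].T @ consensus)
            Xi = X[i] @ U @ Vt
            # least-squares isotropic rescaling toward the consensus
            s = np.trace(Xi.T @ consensus) / np.trace(Xi.T @ Xi)
            X[i] = s * Xi
        consensus = np.mean(X, axis=0)               # update the consensus
    return consensus, X

# 5 experts rating 4 cheeses on 3 criteria, each with a personal shift and scale
rng = np.random.default_rng(0)
base = rng.uniform(1, 10, size=(4, 3))
configs = [rng.uniform(0.5, 1.5) * base + rng.uniform(-2, 2)
           + rng.normal(scale=0.3, size=(4, 3)) for _ in range(5)]
consensus, aligned = gpa(configs)
residual = sum(((A - consensus) ** 2).sum() for A in aligned)
```

Each step solves a least-squares subproblem, so the residual distance to the consensus can only decrease from one iteration to the next.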

There exist two cases:

- If the number and the designation of the p dimensions are identical for the m configurations, one speaks in sensory analysis of conventional profiles.
- If the number p and the designation of the dimensions vary from one configuration to the other, one speaks in sensory analysis of free profiles, and the data can then only be represented by a series of m matrices of size n x p(k), k=1,2, …, m.

### Algorithms for Generalized Procrustean Analysis used in XLSTAT

XLSTAT is the only product offering the choice between the two main available algorithms: the one based on the work initiated by John Gower (1975), and the later one described in the thesis of Jacques Commandeur (1991). Which algorithm performs best (in terms of least squares) depends on the dataset, but the Commandeur algorithm is the only one that can take missing data into account; by missing data we mean here that for a given configuration and a given observation or row, the values were not recorded for all the dimensions of the configuration. The latter can happen in sensory data analysis if one of the judges has not evaluated a product.

### Results for the Generalized Procrustean Analysis in XLSTAT

#### PANOVA table

Inspired by the format of the analysis of variance table of the linear model, this table allows you to evaluate the relative contribution of each transformation to the evolution of the variance. It displays the residual variance before and after the transformations, and the contribution of the rescaling, rotation and translation steps to the evolution of the variance. Fisher's F statistic enables you to compare the relative contributions of the transformations, and the corresponding probabilities help you determine whether the contributions are significant or not.

#### Residuals

Residuals by object: This table and the corresponding bar chart allow you to visualize the distribution of the residual variance by object. It is thus possible to identify the objects for which the GPA has been the least efficient, in other words, the objects that are farthest from the consensus configuration.

Residuals by configuration: This table and the corresponding bar chart allow you to visualize the distribution of the residual variance by configuration. It is thus possible to identify the configurations for which the GPA has been the least efficient, in other words, the configurations that are farthest from the consensus configuration.

#### Scaling factors for each configuration

The scaling factors for each configuration, presented either in a table or a plot, allow you to compare the scaling factors applied to the configurations. In sensory analysis, this helps to understand how the experts use the rating scales.

#### Results of the consensus test

To evaluate the effectiveness of the Generalized Procrustean Analysis, XLSTAT displays the number of permutations that have been performed, the value of Rc, which corresponds to the proportion of the original variance explained by the consensus configuration, and the quantile corresponding to Rc, calculated using the distribution of Rc obtained from the permutations. You need to set a confidence level (typically 95%); if the quantile is beyond that level, one concludes that the Generalized Procrustean Analysis significantly reduced the variance.

#### Results of the dimensions test

For each factor retained at the end of the PCA step, XLSTAT displays the number of permutations that have been performed, the F calculated after the Generalized Procrustean Analysis (F being the ratio of the variance between the objects to the variance between the configurations), and the quantile corresponding to F, calculated using the distribution of F obtained from the permutations, in order to evaluate whether a dimension contributes significantly to the quality of the Generalized Procrustean Analysis.

You need to set a confidence level (typically 95%); if the quantile is beyond that level, one concludes that the factor contributes significantly. As an indication, the critical values and the p-value corresponding to Fisher's F distribution for the selected alpha significance level are also displayed. The conclusions drawn from Fisher's F distribution may differ markedly from what the permutation test indicates: using Fisher's F distribution requires assuming the normality of the data, which is not necessarily the case.

#### Results for the consensus configuration

- Objects coordinates before the PCA: This table corresponds to the mean over the configurations of the objects coordinates, after the Generalized Procrustean Analysis transformations and before the PCA.
- Eigenvalues: If a PCA has been requested, the table of the eigenvalues and the corresponding scree-plot are displayed. The percentage of the total variability corresponding to each axis is computed from the eigenvalues.
- Correlations of the variables with the factors: These results correspond to the correlations between the variables of the consensus configuration before and after the transformations (Generalized Procrustean Analysis and PCA if the latter has been requested). These results are not displayed on the circle of correlations as they are not always interpretable.
- Objects coordinates: This table corresponds to the mean over the configurations of the objects coordinates, after the transformations (Generalized Procrustean Analysis and PCA if the latter has been requested). These results are displayed on the objects charts.

#### Results for the configurations after transformations

- Variance by configuration and by dimension: This table allows you to visualize how the percentage of total variability corresponding to each axis is divided up among the configurations.
- Correlations of the variables with the factors: These results, displayed for all the configurations, correspond to the correlations between the variables of the configurations before and after the transformations (GPA and PCA if the latter has been requested). These results are displayed on the circle of correlations.
- Objects coordinates (presentation by configuration): This series of tables corresponds to the objects coordinates for each configuration after the transformations (GPA and PCA if the latter has been requested). These results are displayed on the first series of objects charts.
- Objects coordinates (presentation by object): This series of tables corresponds to the objects coordinates for each configuration after the transformations (GPA and PCA if the latter has been requested). These results are displayed on the second series of objects charts.

# Multiple Factor Analysis (MFA)

View a tutorial

### When to use Multiple Factor Analysis

Multiple Factor Analysis (MFA) makes it possible to analyze several tables of variables simultaneously, and to obtain results, in particular charts, that allow studying the relationship between the observations, the variables and tables. Within a table the variables must be of the same type (quantitative or qualitative), but the tables can be of different types.

This method can be very useful to analyze surveys for which one can identify several groups of variables, or for which the same questions are asked at several time intervals.

### Principles of Multiple Factor Analysis

Multiple Factor Analysis is a synthesis of PCA (Principal Component Analysis) and MCA (Multiple Correspondence Analysis), which it generalizes to enable the joint use of quantitative and qualitative variables. The methodology of the MFA breaks up into two phases:

- For each table, we successively carry out a PCA or an MCA, according to the type of the variables in the table. The first eigenvalue of each analysis is stored and used to weight the tables in the second phase of the analysis.
- We then carry out a weighted PCA on the columns of all the tables, the tables of qualitative variables having been transformed into complete disjunctive tables, each indicator variable having a weight that is a function of the frequency of the corresponding category. The weighting of the tables prevents the tables with more variables from weighing too much in the analysis.
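The two phases above can be sketched for the all-quantitative case (so each phase-1 analysis is a PCA). Dividing each standardized table by the square root of its first eigenvalue gives every table a first eigenvalue of 1 in the global analysis, which is what balances their influence; names and data below are illustrative:

```python
import numpy as np

def first_eigenvalue(Z):
    """Largest PCA eigenvalue of a column-centered table."""
    return np.linalg.svd(Z, compute_uv=False)[0] ** 2 / (Z.shape[0] - 1)

rng = np.random.default_rng(1)
tables = [rng.normal(size=(20, 4)), rng.normal(size=(20, 7))]  # two quantitative tables

# phase 1: standardize each table and store its first eigenvalue
Z = [(T - T.mean(0)) / T.std(0, ddof=1) for T in tables]
lam1 = [first_eigenvalue(z) for z in Z]

# phase 2: weighted global PCA -- dividing a table by sqrt(lambda_1)
# rescales it so its own first eigenvalue becomes 1
G = np.hstack([z / np.sqrt(l) for z, l in zip(Z, lam1)])
U, s, Vt = np.linalg.svd(G, full_matrices=False)
scores = U * s                                   # MFA coordinates of the observations
mfa_eigs = s ** 2 / (G.shape[0] - 1)
```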

The originality of the method is that it allows you to visualize, in a two- or three-dimensional space, the tables (each table being represented by a point), the variables, the principal axes of the analyses of the first phase, and the individuals. In addition, one can study the impact of the other tables on an observation by simultaneously visualizing the observation described by all the variables and the projected observations described by the variables of only one table.

### Results for Multiple Factor Analysis

#### Correlation/Covariance matrix

This table shows the correlations between all the quantitative variables. The type of coefficient depends on what has been chosen in the dialog box.

#### Results on individual tables

The results of the analyses performed on each individual table (PCA or MCA) are then displayed. These results are identical to those you would obtain after running the PCA or MCA function of XLSTAT.

#### Multiple Factor Analysis

Afterwards, the results of the second phase of the MFA are displayed.

- Eigenvalues: The eigenvalues and corresponding chart (scree plot) are displayed. The number of eigenvalues displayed is equal to the number of non-null eigenvalues.
- Eigenvectors: This table shows the eigenvectors obtained from the spectral decomposition. These vectors take into account the variable weights used in the Multiple Factor Analysis.
- Coordinates of the tables: The coordinates of the tables are then displayed and used to create the plots of the tables. These plots allow you to visualize the distance between the tables. The coordinates of the supplementary tables are displayed in the second part of the table.
- Contributions (%): Contributions are an interpretation aid. The tables which had the highest influence in building the axes are those whose contributions are highest.
- Squared cosines: As in other factor methods, squared cosine analysis is used to avoid interpretation errors due to projection effects. If the squared cosines associated with the axes used on a chart are low, the position of the observation or the variable in question should not be interpreted.
- Lg coefficients: The Lg coefficients of relationship between the tables measure to what extent the tables are related two by two. The more the variables of a first table are related to the variables of a second table, the higher the Lg coefficient.
- RV coefficients: The RV coefficients of relationship between the tables are another measure derived from the Lg coefficients. The value of the RV coefficients varies between 0 and 1.

#### Results for quantitative variables

The results that follow concern the quantitative variables. As for a PCA, the coordinates of the variables (factor loadings), their correlation with the axes, the contributions and the squared cosines are displayed.

The coordinates of the partial axes, and even more so their correlations, allow you to visualize, in the new space, the link between the factors obtained in the first phase of the Multiple Factor Analysis and those obtained in the second phase.

The results that concern the observations are then displayed as they are after a PCA (coordinates, contributions in %, and squared cosines).

Last, the coordinates of the projected points in the space resulting from the Multiple Factor Analysis are displayed. The projected points correspond to projections of the observations in the spaces reduced to the dimensions of each table. Superimposing the projected points with the complete observations makes it possible to visualize both the diversity of the information brought by the various tables for a given observation, and the relative distances between two observations according to the various tables.

# Redundancy analysis (RDA)

### What is Redundancy Analysis

Redundancy Analysis (RDA) was developed by Van den Wollenberg (1977) as an alternative to Canonical Correlation Analysis (CCorA).

Redundancy Analysis allows you to study the relationship between two tables of variables Y and X. While Canonical Correlation Analysis is a symmetric method, Redundancy Analysis is non-symmetric. In Canonical Correlation Analysis, the components extracted from both tables are such that their correlation is maximized. In Redundancy Analysis, the components extracted from X are such that they are as correlated as possible with the variables of Y. Then, the components of Y are extracted so that they are as correlated as possible with the components extracted from X.

### Principles of Redundancy Analysis

Let Y be a table of response variables with n observations and p variables. This table can be analyzed using Principal Components Analysis (PCA) to obtain a simultaneous map of the observations and the variables in two or three dimensions.

Let X be a table that contains the measures recorded for the same n observations on q quantitative and/or qualitative variables.

Redundancy Analysis allows you to obtain a simultaneous representation of the observations, the Y variables, and the X variables in two or three dimensions that is optimal for a covariance criterion (Ter Braak 1986).

Redundancy Analysis can be divided into two parts:

- A constrained analysis in a space whose number of dimensions equals min(n-1, p, q). This part is the one of main interest, as it corresponds to the analysis of the relation between the two tables.
- An unconstrained part, which corresponds to the analysis of the residuals. The number of dimensions for the unconstrained RDA is equal to min(n-1, p).
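For quantitative X, the decomposition above amounts to a multivariate regression of Y on X followed by a PCA of the fitted values (constrained part) and of the residuals (unconstrained part). A minimal sketch on simulated data (names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 30, 5, 3
X = rng.normal(size=(n, q))                            # explanatory table
Y = X @ rng.normal(size=(q, p)) + 0.3 * rng.normal(size=(n, p))  # response table

Yc, Xc = Y - Y.mean(0), X - X.mean(0)
B = np.linalg.lstsq(Xc, Yc, rcond=None)[0]             # multivariate regression
Yhat = Xc @ B                                          # constrained (fitted) part

# PCA of the fitted values and of the residuals
constrained_eigs = np.linalg.svd(Yhat, compute_uv=False) ** 2 / (n - 1)
resid_eigs = np.linalg.svd(Yc - Yhat, compute_uv=False) ** 2 / (n - 1)
```

The constrained and unconstrained inertias add up to the total inertia of Y, mirroring the two-part decomposition described above.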

It is also possible to use Partial Redundancy Analysis, which adds a preliminary step. The X table is subdivided into two groups. The first group, X(1), contains conditioning variables whose effect we want to remove, because it is either known or of no interest for the study. Regressions are run on the Y and X(2) tables, and the residuals of the regressions are used for the Redundancy Analysis step. Partial Redundancy Analysis allows you to analyze the effect of the second group of variables after the effect of the first group has been removed.

### Biplot scaling in Redundancy Analysis

XLSTAT offers three different types of scaling. The type of scaling changes the way the scores of the response variables and the observations are computed, and consequently their respective positions on the plot.

### Results for Redundancy Analysis in XLSTAT

If a permutation test was requested, its results are first displayed so that we can check if the relationship between the tables is significant or not.

Eigenvalues and percentages of inertia: These tables display, for the constrained RDA and the unconstrained RDA, the eigenvalues, the corresponding inertia, and the corresponding percentages, either in terms of constrained (or unconstrained) inertia, or in terms of total inertia.

The scores of the observations, response variables and explanatory variables are then displayed. These coordinates are used to produce a summary plot, which allows you to visualize the relationship between the sites, the objects and the variables. When qualitative variables have been included, the corresponding categories are displayed with a hollow red circle.

# Principal Coordinate Analysis

View a tutorial

### Principles of Principal Coordinate Analysis

Principal Coordinate Analysis (often referred to as PCoA) is aimed at graphically representing a resemblance matrix between p elements (individuals, variables, objects, among others).

The algorithm can be divided into three steps:

- Computation of a distance matrix for the p elements
- Centering of the matrix by rows and columns
- Eigen-decomposition of the centered distance matrix

The rescaled eigenvectors correspond to the principal coordinates, which can be used to display the p objects in a space with 1, 2, …, p-1 dimensions.

As with PCA (Principal Component Analysis), the eigenvalues can be interpreted in terms of the percentage of total variability represented in a reduced space.
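The three steps can be sketched in a few lines; with a Euclidean distance matrix, the principal coordinates reproduce the original distances exactly (names and data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 4))                          # 6 elements described by 4 variables
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # step 1: squared Euclidean distances

n = D2.shape[0]
J = np.eye(n) - np.ones((n, n)) / n                  # centering operator
B = -0.5 * J @ D2 @ J                                # step 2: Gower double centering

w, V = np.linalg.eigh(B)                             # step 3: eigen-decomposition
order = np.argsort(w)[::-1]
w, V = w[order], V[:, order]
keep = w > 1e-10
coords = V[:, keep] * np.sqrt(w[keep])               # rescaled eigenvectors = principal coordinates
```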

### Results of Principal Coordinate Analysis in XLSTAT

- Delta1 matrix: This matrix corresponds to Gower's D1 matrix, used to compute the eigen-decomposition.
- Eigenvalues and percentage of inertia: This table displays the eigenvalues and the corresponding percentage of inertia.
- Principal coordinates: This table displays the principal coordinates of the objects, which are used to create the chart on which the proximities between the objects can be interpreted.
- Contributions: This table displays the contributions that help evaluate how much an object contributes to a given axis.
- Squared cosines: This table displays the squared cosines that help evaluate how close an object is to a given axis.

### Principal Coordinate Analysis and Principal Component Analysis

PCA and Principal Coordinate Analysis are quite similar in that PCA can also represent observations in a space with fewer dimensions, the latter being optimal in terms of variability carried. A Principal Coordinate Analysis applied to a matrix of Euclidean distances between observations (calculated after standardization of the columns using the unbiased standard deviation) leads to the same results as a PCA based on the correlation matrix. The eigenvalues obtained with the Principal Coordinate Analysis are equal to (p-1) times those obtained with the PCA.
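This equivalence is easy to check numerically. In the sketch below, `n` plays the role of the document's p (the number of elements); the data are simulated:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(8, 3))                          # 8 observations, 3 variables
n = X.shape[0]
Z = (X - X.mean(0)) / X.std(0, ddof=1)               # standardize with the unbiased std

# PCA eigenvalues, from the correlation matrix
pca_eigs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

# PCoA on the squared Euclidean distances between the standardized rows
D2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
J = np.eye(n) - np.ones((n, n)) / n
pcoa_eigs = np.sort(np.linalg.eigvalsh(-0.5 * J @ D2 @ J))[::-1][:X.shape[1]]
```

The non-null PCoA eigenvalues equal (n-1) times the PCA eigenvalues, as stated above.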

### Principal Coordinate Analysis and Multidimensional Scaling

Principal Coordinate Analysis and MDS (Multidimensional Scaling) share the same goal of representing objects for which we have a proximity matrix.

MDS has two drawbacks when compared with Principal Coordinate Analysis:

- The algorithm is much more complex and performs slower.
- Axes obtained with MDS cannot be interpreted in terms of variability.

MDS has two advantages compared with Principal Coordinate Analysis:

- The algorithm allows having missing data in the proximity matrix.
- The non-metric version of MDS provides a simple and clear way to handle matrices where only the ranking of the distances is important.

# Canonical Correlation Analysis (CCorA)

View a tutorial

### Origins and aim of Canonical Correlation Analysis

Canonical Correlation Analysis (CCorA, sometimes CCA, but we prefer to use CCA for Canonical Correspondence Analysis) is one of the many statistical methods that allow studying the relationship between two sets of variables. It studies the correlation between the two sets and extracts from the two tables a set of canonical variables that are as correlated as possible with both tables and orthogonal to each other.

Discovered by Hotelling (1936), this method has been used a lot in ecology, but it has been supplanted by RDA (Redundancy Analysis) and by CCA (Canonical Correspondence Analysis).

### Principles of Canonical Correlation Analysis

This method is symmetrical, contrary to RDA, and is not oriented towards prediction. Let Y1 and Y2 be two tables, with respectively p and q variables. Canonical Correlation Analysis aims at obtaining two vectors a(i) and b(i) such that

ρ(i) = cor[Y_{1}a_{(i)}, Y_{2}b_{(i)}] = cov(Y_{1}a_{(i)}, Y_{2}b_{(i)}) / √[var(Y_{1}a_{(i)}) · var(Y_{2}b_{(i)})]

is maximized. Constraints must be introduced so that the solution for a(i) and b(i) is unique. As we are ultimately trying to maximize the covariance between Y_{1}a_{(i)} and Y_{2}b_{(i)} while controlling their respective variances, we might obtain components that are well correlated with each other, but that do not explain Y_{1} and Y_{2} well. Once the solution has been obtained for i=1, we look for the solution for i=2, where a_{(2)} and b_{(2)} must be orthogonal to a_{(1)} and b_{(1)}, respectively, and so on. The number of vectors that can be extracted is at most equal to min(p, q).
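A compact way to obtain the ρ(i), a(i) and b(i) is to whiten each centered table with a QR decomposition and take the SVD of their cross-product; a sketch on simulated data (not XLSTAT's internal algorithm, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, q = 50, 4, 3
Y1 = rng.normal(size=(n, p))
Y2 = rng.normal(size=(n, q))

Y1c, Y2c = Y1 - Y1.mean(0), Y2 - Y2.mean(0)
Q1, R1 = np.linalg.qr(Y1c)                  # whiten each centered table
Q2, R2 = np.linalg.qr(Y2c)
U, rho, Vt = np.linalg.svd(Q1.T @ Q2)       # singular values = canonical correlations
k = min(p, q)                               # at most min(p, q) pairs can be extracted
A = np.linalg.solve(R1, U[:, :k])           # vectors a_(i), as columns of A
B = np.linalg.solve(R2, Vt.T[:, :k])        # vectors b_(i), as columns of B
```

The correlation between the first pair of canonical variables, Y1c @ A[:, 0] and Y2c @ B[:, 0], equals the first singular value rho[0].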

*Note: The inter-battery analysis of Tucker (1958) is an alternative where one wants to maximize the covariance between the Y_{1}a_{(i)} and Y_{2}b_{(i)} components.*

### Results for Canonical Correlation Analysis in XLSTAT

- Similarity matrix: The matrix that corresponds to the "type of analysis" chosen in the dialog box is displayed.
- Eigenvalues and percentages of inertia: This table displays the eigenvalues, the corresponding inertia, and the corresponding percentages. Note: in some software, the eigenvalues that are displayed are equal to L / (1-L), where L is the eigenvalue given by XLSTAT.
- Wilks' Lambda test: This test allows you to determine whether the two tables Y1 and Y2 are significantly related to each canonical variable.
- Canonical correlations: The canonical correlations, bounded by 0 and 1, are higher when the correlation between Y1 and Y2 is high. However, they do not tell to what extent the canonical variables are related to Y1 and Y2. The squared canonical correlations are equal to the eigenvalues and therefore correspond to the percentage of variability carried by the canonical variable.

The results listed below are computed separately for each of the two groups of input variables.

- Redundancy coefficients: These coefficients measure, for each set of input variables, what proportion of the variability of the input variables is predicted by the canonical variables.
- Canonical coefficients: These coefficients (also called canonical weights, or canonical function coefficients) indicate how the canonical variables were constructed, as they correspond to the coefficients of the linear combination that generates the canonical variables from the input variables. They are standardized if the input variables have been standardized; in that case, the relative weights of the input variables can be compared.
- Correlations between input variables and canonical variables: These correlations (also called structure correlation coefficients, or canonical factor loadings) help in understanding how the canonical variables are related to the input variables.
- Canonical variable adequacy coefficients: These coefficients correspond, for a given canonical variable, to the sum of the squared correlations between the input variables and the canonical variables, divided by the number of input variables. They give the percentage of variability taken into account by the canonical variable of interest.
- Squared cosines: The squared cosines of the input variables in the space of the canonical variables indicate whether an input variable is well represented in that space. The squared cosines for a given input variable sum to 1. The sum over a reduced number of canonical axes gives the communality.
- Scores: The scores correspond to the coordinates of the observations in the space of the canonical variables.