# XLSTAT-OMICs

XLSTAT-OMICs is a user-friendly module for analyzing high-throughput OMICs data (genomics, transcriptomics, proteomics, metabolomics). Analyzing such data has never been as easy as with XLSTAT-OMICs. Like every XLSTAT module, it is integrated in Microsoft Excel and lets you obtain actionable results in just two or three clicks. You can run differential expression or regulation analyses, compare groups and build heat maps within Excel, while using the most advanced technology and fast algorithms (XLSTAT uses parallel programming).

## Features

## Demo version

A trial version of XLSTAT-OMICs is included in the main XLSTAT download.

## Prices and ordering

These analyses are included in the XLStat-Ecology, XLStat-Biomed and XLStat-Premium packages.

# DETAILED DESCRIPTIONS

# Differential expression

### What is differential expression?

Differential expression analysis identifies features (genes, proteins, metabolites…) that are significantly affected by explanatory variables. For example, we might be interested in identifying proteins that are differentially expressed between healthy and diseased individuals. In this kind of study, datasets are often very large (high-throughput data). At this stage, we may talk about *omics* data analyses, in reference to analyses performed over the genome (gen*omics*), the transcriptome (transcript*omics*), the proteome (prote*omics*), the metabolome (metabol*omics*), etc.

To test whether features are differentially expressed, we often use traditional statistical tests. However, the size of the data may cause problems in terms of computation time as well as readability and statistical reliability of the results. These tests must therefore be slightly adapted to overcome these problems.

### Statistical tests

The statistical tests proposed in the differential expression tool in XLSTAT are traditional parametric or non-parametric tests (Student t-test, ANOVA, Mann-Whitney, Kruskal-Wallis).

### Post-hoc corrections

The p-value represents the risk that we take of being wrong when stating that an effect is statistically significant. Every additional test produces an additional p-value, and with it a higher risk of detecting effects that appear significant but do not exist in reality. At a significance level alpha of 5%, we would expect about 5 of 100 computed p-values to be significant by chance alone, even when no real effect exists. When working with high-throughput data, we often test the effect of an explanatory variable on the expression of thousands of genes, thus generating thousands of p-values. Consequently, p-values should be corrected (that is, increased or penalized) as their number grows. XLSTAT proposes three common p-value correction methods:

**Benjamini-Hochberg**: this procedure makes sure that p-values increase both with their number and the proportion of non-significant p-values. It is part of the FDR (False Discovery Rate) correction procedure family. The Benjamini-Hochberg correction is not very conservative (not very severe). It is therefore well suited to situations where we are looking for a large number of genes which are likely affected by the explanatory variables. It is widely used in differential expression studies.

The corrected p-value according to the Benjamini-Hochberg procedure is defined by:

p_{BenjaminiHochberg} = min( p* nbp / j , 1)

where p is the original (uncorrected) p-value, nbp is the number of computed p-values in total and j is the rank of the original p-value when p-values are sorted in ascending order.
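As a minimal sketch, the formula above can be applied in plain Python as follows. The function name `benjamini_hochberg` and the sample p-values are illustrative assumptions, not part of XLSTAT. The code implements only the formula as written; some implementations (for example R's `p.adjust`) additionally enforce that corrected p-values never decrease with rank, and whether XLSTAT does so is not stated here.

```python
def benjamini_hochberg(p_values):
    """Return min(p * nbp / j, 1) for each p-value, in the input order."""
    nbp = len(p_values)
    # Indices of the p-values sorted in ascending order; rank j = position + 1.
    order = sorted(range(nbp), key=lambda i: p_values[i])
    corrected = [0.0] * nbp
    for position, index in enumerate(order):
        j = position + 1
        corrected[index] = min(p_values[index] * nbp / j, 1.0)
    return corrected


# Hypothetical raw p-values for five features.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw={p:.3f}  corrected={q:.5f}")
```

The smallest p-value is multiplied by the full number of tests, while the largest is left unchanged, which is why this correction is less severe than a uniform penalty.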

**Benjamini-Yekutieli**: this procedure makes sure that p-values increase both with their number and the proportion of non-significant p-values. It is part of the FDR (False Discovery Rate) correction procedure family. Unlike the Benjamini-Hochberg approach, it takes into account a possible dependence between the tested features, which makes it more conservative than Benjamini-Hochberg. However, it is far less stringent than the Bonferroni approach described next.

The corrected p-value according to the Benjamini-Yekutieli procedure is defined by:

p_{BenjaminiYekutieli} = min[( p * nbp * ∑_{i=1…nbp}1/i ) / j , 1]

where p is the original p-value, nbp is the number of computed p-values in total and j is the rank of the original p-value when p-values are sorted in ascending order.
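The following sketch applies the Benjamini-Yekutieli formula above in plain Python. The function name `benjamini_yekutieli` and the sample values are assumptions made for illustration. As with the formula itself, no monotonicity step is applied.

```python
def benjamini_yekutieli(p_values):
    """Return min(p * nbp * sum(1/i) / j, 1) for each p-value, in input order."""
    nbp = len(p_values)
    # Harmonic sum 1 + 1/2 + ... + 1/nbp, the extra penalty for dependence.
    harmonic = sum(1.0 / i for i in range(1, nbp + 1))
    # Indices of the p-values sorted in ascending order; rank j = position + 1.
    order = sorted(range(nbp), key=lambda i: p_values[i])
    corrected = [0.0] * nbp
    for position, index in enumerate(order):
        j = position + 1
        corrected[index] = min(p_values[index] * nbp * harmonic / j, 1.0)
    return corrected


# Hypothetical raw p-values for five features.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
for p, q in zip(raw, benjamini_yekutieli(raw)):
    print(f"raw={p:.3f}  corrected={q:.5f}")
```

With five p-values the harmonic sum is about 2.28, so every corrected value is about 2.28 times its Benjamini-Hochberg counterpart before capping at 1.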

**Bonferroni**: p-values increase only with their number. This procedure is very conservative. It is part of the FWER (Familywise error rate) correction procedure family. It is rarely used in differential expression analyses. It is useful when the goal of the study is to select a very low number of differentially expressed features.

The corrected p-value according to the Bonferroni procedure is defined by:

p_{Bonferroni} = min( p * nbp, 1 )

where p is the original p-value and nbp is the number of computed p-values in total.
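For comparison, here is a minimal sketch of the Bonferroni formula above in plain Python. The function name `bonferroni`, the sample p-values and the 5% significance level are illustrative assumptions.

```python
def bonferroni(p_values):
    """Return min(p * nbp, 1) for each p-value, in the input order."""
    nbp = len(p_values)
    return [min(p * nbp, 1.0) for p in p_values]


# Hypothetical raw p-values for five features.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
corrected = bonferroni(raw)
alpha = 0.05
n_significant = sum(1 for q in corrected if q < alpha)
print(f"{n_significant} of {len(raw)} features remain significant at alpha={alpha}")
```

Because every p-value is multiplied by the total number of tests regardless of its rank, the penalty grows quickly with thousands of features, which matches the very conservative behavior described above.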

**Multiple pairwise comparisons**

After one-way ANOVAs or Kruskal-Wallis tests, it is possible to perform multiple pairwise comparisons for each feature taken separately.

### Non-specific filtering

Before launching the analyses, it is useful to filter out features with very low variability across individuals. Non-specific filtering has two major advantages:

- It saves computation time by skipping features that are very unlikely to be differentially expressed.

- It limits post-hoc penalizations, as fewer p-values are computed.

Two methods are available in XLSTAT:

- The user specifies a variability threshold (interquartile range or standard deviation), and features with lower variability are eliminated prior to analyses.

- The user specifies a percentage of features with low variability (interquartile range or standard deviation) to be removed prior to analyses.
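The two filtering methods above can be sketched in plain Python as follows. The function names, the feature names and the data are hypothetical, and the quartile convention used for the interquartile range (the default of `statistics.quantiles`) is an assumption; XLSTAT may compute quartiles differently.

```python
import statistics


def iqr(values):
    """Interquartile range, using the default quartile method of the stdlib."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def filter_by_threshold(features, threshold, spread=statistics.stdev):
    """Keep only features whose variability is at least `threshold`."""
    return {name: v for name, v in features.items() if spread(v) >= threshold}


def filter_by_percentage(features, percent, spread=statistics.stdev):
    """Remove the `percent`% of features with the lowest variability."""
    ranked = sorted(features, key=lambda name: spread(features[name]))
    n_removed = int(len(ranked) * percent / 100)
    kept = set(ranked[n_removed:])
    return {name: v for name, v in features.items() if name in kept}


# Hypothetical expression values: one list per feature, one value per individual.
features = {
    "gene_a": [5.0, 5.1, 4.9, 5.0],
    "gene_b": [2.0, 8.0, 3.0, 9.0],
    "gene_c": [1.0, 1.5, 0.5, 1.0],
    "gene_d": [4.0, 4.0, 4.0, 4.1],
}
print(sorted(filter_by_threshold(features, 0.1)))
print(sorted(filter_by_percentage(features, 50)))
print(sorted(filter_by_threshold(features, 1.0, spread=iqr)))
```

Passing the variability measure as a parameter keeps the two filtering rules independent of the choice between standard deviation and interquartile range.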

### Biological effects and statistical effects: the volcano plot

A statistically significant effect is not necessarily interesting at the biological scale. An experiment involving very precise measurements with a high number of replicates may yield low p-values associated with very weak biological differences. It is thus recommended to keep an eye on biological effects and not to rely only on p-values. The **volcano plot** is a scatter chart that combines statistical effects on the y-axis and biological effects on the x-axis for a whole individuals/features matrix. The only constraint is that it can only be used to examine the difference between the two levels of a two-level qualitative explanatory variable.

The y axis coordinates are -log10( p-values ) making the chart easier to read: high values reflect the most significant effects whereas low values correspond to effects which are less significant.

XLSTAT provides two ways of building the x axis coordinates:

- Difference between the mean of the first level and the mean of the second, for each feature. Generally, we use this format when handling data on a transformed scale such as log or square root.

- Log2 of fold change between the two means: log2( mean1 / mean2 ). This format should preferably be used with untransformed data.
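As a rough illustration, the coordinates of one point of the volcano plot could be computed as below. The function name `volcano_point`, its parameters and the sample values are assumptions introduced for this sketch; the p-value is taken as given, from whichever test and correction were selected.

```python
import math
from statistics import mean


def volcano_point(group1, group2, p_value, use_log2_fold_change=True):
    """Return (x, y) for one feature measured in two groups of individuals."""
    mean1, mean2 = mean(group1), mean(group2)
    if use_log2_fold_change:
        # Log2 of fold change; both means must be strictly positive.
        x = math.log2(mean1 / mean2)
    else:
        # Difference between means, for data already on a transformed scale.
        x = mean1 - mean2
    # High y values correspond to the most significant effects.
    y = -math.log10(p_value)
    return x, y


# Hypothetical feature: four-fold higher in the first group, p-value of 0.001.
x, y = volcano_point([8.0, 8.0], [2.0, 2.0], 0.001)
print(f"x = {x:.2f}, y = {y:.2f}")
```

Features that end up far from zero on the x axis and high on the y axis combine a strong biological effect with a strong statistical effect.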

**Results**

For each explanatory variable, XLSTAT provides the following results:

**X features with the lowest p-values table**: it contains information about the x features with the lowest p-values. Features are sorted in ascending order of p-values. The p-values column contains the p-values corrected according to the selected post-hoc correction method. The significant column indicates whether the corresponding p-value is significant at the selected significance level. If the multiple pairwise comparisons option has been activated, additional columns appear. Depending on the selected type of test, they contain the means (parametric tests) or medians (non-parametric tests) of the explanatory variable’s levels. Within each feature, levels are associated with letters summarizing the multiple pairwise comparisons. Two levels sharing the same letter are not significantly different.

**Charts**: A histogram depicting the distribution of corrected p-values is followed by a volcano plot allowing the user to pinpoint features with the highest statistical and biological effects.

### References

**Benjamini Y. and Hochberg Y. (1995)**. Controlling the false discovery rate: a practical and powerful approach to multiple testing. *Journal of the Royal Statistical Society, Series B*, **57**, 289–300.

**Benjamini Y. and Yekutieli D. (2001)**. The control of the false discovery rate in multiple hypothesis testing under dependency. *Annals of Statistics*, **29**, 1165–88.

**Hahne F., Huber W., Gentleman R. and Falcon S. (2008)**. Bioconductor Case Studies. Springer.

# Heat maps

### Heat maps and OMICS data

While exploring individuals/features matrices in an OMICS framework, it is useful to examine how correlated features (i.e. genes, proteins, metabolites) correspond to similar individuals (i.e. samples). For example, a cluster of diseased kidney tissue samples may be characterized by a high expression of a group of genes, compared to other samples. The heat map tool in XLSTAT supports this kind of exploration.

### How it works in XLSTAT

Both features and individuals are clustered independently using agglomerative hierarchical clustering based on Euclidean distances, optionally preceded by the k-means algorithm depending on the matrix’s size. The data matrix’s rows and columns are then permuted according to the corresponding clusterings, which brings similar columns closer to each other and similar rows closer to each other. A heat map is then displayed, reflecting the data in the permuted matrix (data values are replaced by corresponding color intensities).
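The clustering-then-permutation idea can be sketched in plain Python as follows. This is a naive illustration and not XLSTAT's implementation: the function names are hypothetical, average linkage is an assumption (the linkage criterion is not stated here), the optional k-means step is left out, and the data are invented.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_order(vectors):
    """Leaf order from naive average-linkage agglomerative clustering."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between members of both clusters.
                d = sum(euclidean(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0]


def permute_for_heat_map(matrix):
    """Cluster rows and columns independently, then reorder the matrix."""
    row_order = cluster_order(matrix)
    col_order = cluster_order([list(col) for col in zip(*matrix)])
    permuted = [[matrix[r][c] for c in col_order] for r in row_order]
    return row_order, col_order, permuted


# Hypothetical matrix: rows are features, columns are individuals.
matrix = [
    [9.0, 1.0, 8.5, 1.2],
    [1.0, 9.0, 1.5, 8.8],
    [8.8, 1.1, 9.1, 0.9],
    [0.9, 8.7, 1.2, 9.2],
]
rows, cols, permuted = permute_for_heat_map(matrix)
print("row order:", rows)
print("column order:", cols)
for line in permuted:
    print(line)
```

After the permutation, high and low values gather into blocks, which is what makes homogeneous color rectangles visible once values are mapped to color intensities.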

### Non-specific filtering

Before launching the analyses, it is useful to filter out features with very low variability across individuals. In heat map analysis, non-specific filtering has two major advantages:

- It saves computation time by skipping features that are very unlikely to be differentially expressed.

- It improves the readability of the heat map chart.

Two methods are available in XLSTAT:

- The user specifies a variability threshold (interquartile range or standard deviation), and features with lower variability are eliminated prior to analyses.

- The user specifies a percentage of features with low variability (interquartile range or standard deviation) to be removed prior to analyses.

### Results

**Summary statistics:** The tables of descriptive statistics show simple statistics for all individuals: the number of observations, the number of missing values, the number of non-missing values, the mean and the (unbiased) standard deviation.

**Heat map:** The features dendrogram is displayed vertically (rows) and the individuals dendrogram is displayed horizontally (columns). A heat map is added to the chart, reflecting data values.

Similarly expressed features are characterized by horizontal rectangles of homogeneous color along the map.

Similar individuals are characterized by vertical rectangles of homogeneous color along the map.

Clusters of similar individuals characterized by clusters of similarly expressed features can be detected by examining rectangles or squares of homogeneous color at the intersection between feature clusters and individual clusters inside the map.

### References

**Hahne F., Huber W., Gentleman R. and Falcon S. (2008)**. Bioconductor Case Studies. Springer.
