How do I create an intelligent pivot table with XLSTAT-Pivot?
An Excel sheet (zipped file) containing both the data and the results is available for download. The data were collected during the 1994 Census by the U.S. Census Bureau (http://www.census.gov/). This dataset has been used several times by statisticians to evaluate the predictive performance of new algorithms. Each record contains 15 descriptors about an individual, such as age, occupation, education and sex. The number of records has been limited to 32561. The weight variable (which allows each individual to represent a certain percentage of the population) is not used in the example below. The next release of XLSTAT-Pivot will be able to take weights into account.
The goal here is to quickly build a pivot table and a contribution chart that help the user understand which factors, and which combinations of factors, most influence whether an individual's revenue is greater or lower than $50k (the corresponding variable is in column O). XLSTAT-Pivot makes this quick and easy.
Once XLSTAT is open, select the XLSTAT/KXEN's modules/XLSTAT-Pivot command, or click on the corresponding button of the "KXEN" toolbar (see below).
Once you have clicked on the button, the XLSTAT-Pivot dialog box appears. Select the data on the Excel sheet. As the first row contains the labels and the following rows contain the data, you can use the quickest selection mode of XLSTAT: select whole columns directly by clicking on their letters. Select the "Labels included" option, as the first row contains the names of the variables. As we want to save memory and disk space, we ask XLSTAT-Pivot to delete the intermediary sheets. Note that the explanatory variables can be either qualitative or quantitative. XLSTAT-Pivot automatically determines the type of each variable, which makes it possible to mix qualitative and quantitative variables in the "Explanatory variables" field. Note that we have made a multiple selection for the explanatory variables, as we do not want to include the "Weight" column in the model (use the Ctrl key and the mouse to make multiple selections).
As the variable to explain is a binary variable, the corresponding option is selected. Note that a binary variable is transformed into a "0/1" variable, with the 1s corresponding to the least frequent category. In our case this is the ">50K" category.
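The 0/1 recoding described above can be illustrated outside of Excel. The following is only a minimal Python sketch of the rule "1 = least frequent category"; the function name and the sample values are hypothetical and are not part of XLSTAT-Pivot.

```python
from collections import Counter

def encode_binary_target(values):
    """Map a two-category column to 0/1, with 1 for the least frequent category."""
    counts = Counter(values)
    if len(counts) != 2:
        raise ValueError("expected exactly two categories")
    # most_common() sorts by decreasing frequency, so the last entry is the rarest.
    rare_category = counts.most_common()[-1][0]
    encoded = [1 if v == rare_category else 0 for v in values]
    return encoded, rare_category

# Illustrative values only (not taken from the Census file).
income = ["<=50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K"]
encoded, positive_label = encode_binary_target(income)
print(positive_label, encoded)
```

With these sample values, ">50K" is the rarer category, so it is the one coded as 1.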
Then click on "Format" so that XLSTAT-Pivot can start reformatting the data. XLSTAT-Pivot first looks for missing data and offers three options: remove the individuals concerned, let the Pivot algorithm replace the missing values with the mean (quantitative variables) or the mode (qualitative variables), or create a new category if the missing values seem to provide information to the model. In this case we decide to remove the individuals with missing values. The reformatted data are displayed on a new sheet.
Then select "Prepare a description" and click on "Prepare" if you want to verify that XLSTAT-Pivot has correctly recognized the type of the data. Then click on the "Edit" button to view the type of each variable.
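To make the three ways of handling missing data more concrete, here is a small Python sketch. It is not the XLSTAT-Pivot implementation: the function names, the use of None to mark a missing cell, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

MISSING = None  # assumption: a missing cell is represented as None

def drop_incomplete(rows):
    """Option 1: remove individuals that have at least one missing value."""
    return [r for r in rows if all(v is not MISSING for v in r.values())]

def impute(rows, quantitative, qualitative):
    """Option 2: replace missing values by the mean (quantitative) or the mode (qualitative)."""
    filled = [dict(r) for r in rows]  # work on copies, leave the input untouched
    for col in quantitative:
        col_mean = mean(r[col] for r in rows if r[col] is not MISSING)
        for r in filled:
            if r[col] is MISSING:
                r[col] = col_mean
    for col in qualitative:
        observed = [r[col] for r in rows if r[col] is not MISSING]
        col_mode = Counter(observed).most_common(1)[0][0]
        for r in filled:
            if r[col] is MISSING:
                r[col] = col_mode
    return filled

def missing_as_category(rows, qualitative, label="Missing"):
    """Option 3: treat a missing qualitative value as a category of its own."""
    filled = [dict(r) for r in rows]
    for r in filled:
        for col in qualitative:
            if r[col] is MISSING:
                r[col] = label
    return filled

# Illustrative rows only (not taken from the Census file).
rows = [
    {"age": 39, "occupation": "Sales"},
    {"age": None, "occupation": "Sales"},
    {"age": 51, "occupation": None},
    {"age": 30, "occupation": "Tech-support"},
]
complete = drop_incomplete(rows)
imputed = impute(rows, quantitative=["age"], qualitative=["occupation"])
flagged = missing_as_category(rows, qualitative=["occupation"])
print(len(complete), imputed[1]["age"], imputed[2]["occupation"], flagged[2]["occupation"])
```

In this tutorial the first option (removing the individuals) is the one that is used.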
We decide to change the type of the "Number of years of study" variable from Ordinal to Continuous and then click on "Validate". We then select the "Model data" option, and click on "Model" to start the modeling phase. XLSTAT-Pivot displays the computing information in the dialog box, until it has found the optimal solution.
The last dialog box displays the options for creating the optimal pivot tables while already giving a global idea of the underlying model:
- Ki: This indicator, given in %, measures the information brought by the explanatory variables to explain the target variable. The concept is quite similar to the R² of linear regression. The closer the Ki is to 100%, the better the explanatory variables explain the target variable.
- Kr: This indicator measures the robustness of the model, that is, its capacity to adapt to new data sets. XLSTAT-Pivot uses 75% of the data to adjust the model and 25% of the data to validate it. A model is said to be robust if its Kr is greater than 95%.
Select the variables that you want to use in the pivot tables. The contribution of each variable to the model is displayed next to its name (the higher the contribution, the more information the variable brings to explain the variability of the target variable). Once you are satisfied with the selection (in this example we did not change any of the default options), click on "Create".
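The 75% / 25% split mentioned for Kr can be pictured with the short Python sketch below. It only illustrates a hold-out split: how XLSTAT-Pivot actually chooses the two subsets is not described here, so the random shuffle, the seed and the function name are assumptions, and the sketch does not compute Ki or Kr.

```python
import random

def split_estimation_validation(rows, estimation_share=0.75, seed=0):
    """Split rows into an estimation part and a validation part (hold-out)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    cut = int(len(shuffled) * estimation_share)
    return shuffled[:cut], shuffled[cut:]

# 1000 placeholder record identifiers, for illustration only.
records = list(range(1000))
estimation, validation = split_estimation_validation(records)
print(len(estimation), len(validation))
```

The model would be adjusted on the first subset and checked on the second one; a robust model behaves similarly on both.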
Interpreting an intelligent pivot table
A new sheet is displayed with a histogram of the contributions of the variables, and a dynamic pivot table.
The chart confirms that the variables that have the highest effect on the revenue are the Education level followed by the Marital status.
The dynamic pivot table can display up to 4 values for each combination of categories:
- Target average: percentage of the cases where the target variable is equal to 1 in the case of a binary variable; average of the target variable calculated on the sub-population corresponding to the combination in the case of a continuous variable;
- Target size: count of the "1" occurrences of the target variable in the case of a binary variable; sum of the target variable calculated on the sub-population corresponding to the combination in the case of a continuous variable;
- Population size %: percentage of the overall population corresponding to the combination;
- Population size: population size corresponding to the combination.
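For a binary target, the four values listed above can be reproduced by hand. The Python sketch below is a minimal illustration under stated assumptions: the function name, the dictionary keys and the eight sample records are hypothetical, and the code is not how XLSTAT-Pivot builds its tables.

```python
from collections import defaultdict

def pivot_cells(records, row_key, col_key, target_key):
    """Compute the four pivot values for each (row category, column category) pair."""
    total = len(records)
    groups = defaultdict(list)
    for rec in records:
        groups[(rec[row_key], rec[col_key])].append(rec[target_key])
    cells = {}
    for combo, targets in groups.items():
        size = len(targets)   # individuals in this combination
        ones = sum(targets)   # occurrences of "1" for the binary target
        cells[combo] = {
            "target_average": 100.0 * ones / size,       # % of 1s in the combination
            "target_size": ones,                         # count of 1s
            "population_size_pct": 100.0 * size / total, # share of the whole population
            "population_size": size,
        }
    return cells

# Illustrative records only (not taken from the Census file).
records = [
    {"education": "Doctorate", "marital": "Married", "target": 1},
    {"education": "Doctorate", "marital": "Married", "target": 1},
    {"education": "Doctorate", "marital": "Never-married", "target": 0},
    {"education": "HS-grad", "marital": "Married", "target": 1},
    {"education": "HS-grad", "marital": "Married", "target": 0},
    {"education": "HS-grad", "marital": "Never-married", "target": 0},
    {"education": "HS-grad", "marital": "Never-married", "target": 0},
    {"education": "HS-grad", "marital": "Never-married", "target": 0},
]
cells = pivot_cells(records, "education", "marital", "target")
# Combination with the highest share of 1s.
best = max(cells, key=lambda combo: cells[combo]["target_average"])
print(best, cells[best])
```

Sorting the combinations by "target_average" is the manual equivalent of scanning the pivot table for the cells with the highest percentage of 1s.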
Click here to see a screenshot of the pivot table.
We should now analyze the dynamic pivot table to identify the combinations that most influence the fact that people earn more than $50k. We can see that the combination with the highest % of ">50K" people corresponds to the categories [Doctorate ; Prof-School] and [Married-civ-spouse].
Note that once you have a pivot table, it might be interesting to run a correspondence analysis to see how the categories of the various explanatory variables are related to each other. To build the input table, keep only the "Target size" values.