---
title: "Variable importance in random forests"
date: "2016-02-22"
output: html_document
---
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(collapse=TRUE)
```
Consider we run a random forest on $n$ independent realizations of a random vector $(X_1,X_2,X_3,Y)$, assuming $Y$ is a numerical response variable.
The theoretical classifier is the function $f$ such that
$$
E[Y\mid X_1, X_2, X_3]=f(X_1, X_2, X_3).
$$
A random forest also returns a so-called *variable importance* $\hat I_j$ for each predicting variable $X_j$. I am going to explain what is the variable importance.
I take $j=1$ not to be invaded by the notations.
In the paper ["Correlation and variable importance in random forests" (Gregorutti & al)](http://arxiv.org/abs/1310.5726), it is shown that the variable importance $\hat I_1$ of $X_1$ goes to the *population variable importance*
$$
I_1 = E\left[{\bigl(Y-f(X'_1,X_2,X_3)\bigr)}^2\right] - E\left[{\bigl(Y-f(X_1,X_2,X_3)\bigr)}^2\right]
$$
where $X'_1$ is a random variable having the same distribution as $X_1$ but is independent of all other random variables $X_2,X_3,Y$. It is always a nonnegative number. Roughly speaking, $X_1$ has a high importance $I_1$ if the prediction error has a high increase when one breaks the link between $X_1$ and $Y$.
The convergence $\hat I_j \to I_j$ shown by Gregorutti & al was an expected result.
The variable importance $\hat I_j$ of the $j$-th predictor $X_j$ is defined as follows. For each tree $t$ of the random forest, there's a classifier $\hat f_t$. The mean squared error $MSE_t$ in tree $t$ is the mean squared prediction error in the out-of-bag (*OOB*) sample of tree $t$. The $j$-th perturbed mean squared error $MSE_t^{(j)}$ is defined similarly after randomly permuting the values of $j$-th variable in the OOB sample. Finally $\hat I_j$ is defined as the average difference $\overline{MSE}^{(j)} - \overline{MSE}$ over all trees.
Let us check this convergence with the `randomForest` package. I will take
$$
f(X_1, X_2, X_3) = X_1 + X_2X_3.
$$
This function $f$ is of the form
$$
f(X_1, X_2, X_3) = f_1(X_1) + f_{23}(X_2X_3).
$$
It is shown in Gregorutti & al's paper that $I_1= 2Var(f_1(X_1))$ in such a case.
Before running the random forest, we will check this result and evaluate $I_2$ and $I_3$ with the help of simulations. The distribution of my random vector $(X_1,X_2,X_3,Y)$ can be seen on these simulations:
```{r sims}
set.seed(666)
N <- 25000
x1 <- rnorm(N)
x2 <- x1 + rnorm(N)
x3 <- x1 + x2 + rnorm(N)
f <- function(x1, x2, x3) x1 + x2*x3
sigma <- 1
y <- f(x1, x2, x3) + rnorm(N, sd=sigma)
```
Thus $E\left[{\bigl(Y-f(X_1,X_2,X_3)\bigr)}^2\right] = \sigma^2$ with $\sigma=1$.
The evaluation of $I_1$ with the help of simulations indeed returns a result close to $2Var(X_1)=2$:
```{r I1}
xx1 <- rnorm(N)
( I1 <- mean((y-f(xx1,x2,x3))^2) - sigma^2 )
```
One finds $I_2 \approx I_3 \approx 40$:
```{r I2I3}
xx2 <- xx1 + rnorm(N)
( I2 <- mean((y-f(x1,xx2,x3))^2) - sigma^2 )
xx3 <- xx1 + xx2 + rnorm(N)
( I3 <- mean((y-f(x1,x2,xx3))^2) - sigma^2 )
```
Now let us try the random forest on the first $n$ simulations with $n=300$.
```{r XY}
XY <- data.frame(x1, x2, x3, y)[1:300, ]
```
I firstly tune the `mtry` parameter with the `train` function of the `caret` package. Recall that `mtry` is the number of variables selected at random for the splitting in each tree. Here it can be $2$ or $3$ since there are only $3$ predictors.
```{r training, message=FALSE}
library(caret)
( training <- train(x=XY[,1:3], y=XY$y, tuneLength=2) )
```
The selected value of `mtry` was $2$. Now I run the random forest, requiring the importances with the option `importance=TRUE`:
```{r rf, message=FALSE}
library(randomForest)
rf <- randomForest(y ~ ., data=XY, importance=TRUE, mtry=training$finalModel$tuneValue$mtry)
```
The variable importances $\hat I_j$ are provided by the `importance` function as follows:
```{r}
importance(rf, type=1, scale=FALSE)
```
We could conclude they are not very close to their theoretical values. But the ratios $\hat I_j / \hat I_{j'}$ are rather good estimates of their theoretical values.