{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tree-based methods\n", "\n", "### Objective\n", "\n", "This notebook works through tree-based methods for classification problems.\n", "\n", "### Learning objective\n", "\n", "After finishing this notebook, you should be able to explain **decision trees**, **bagging trees**, and **random forests**, using their scikit-learn implementations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import required libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "assert sys.version_info >= (3, 6)\n", "\n", "import numpy\n", "assert numpy.__version__ >= \"1.17.3\"\n", "import numpy as np\n", "\n", "import pandas\n", "assert pandas.__version__ >= \"0.25.1\"\n", "import pandas as pd\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "import seaborn\n", "assert seaborn.__version__ >= \"0.9.0\"\n", "import seaborn as sns\n", "\n", "import sklearn\n", "assert sklearn.__version__ >= \"0.21.3\"\n", "\n", "from sklearn import metrics\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Endometrium and uterus tumors dataset\n", "\n", "In this dataset, each observation is a tumor, described by the expression of 3,000 genes. The expression of a gene is a measure of how much of that gene is present in the biological sample. Because this affects how much of the protein this gene codes for is produced, and because proteins dictate what cells can do, gene expression gives us valuable information about the tumor. In particular, the expression of the same gene in the same individual differs across tissues.\n", "\n", "The goal is to separate the **endometrium** tumors from the **uterine** ones." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Loading the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "endometrium_data = pd.read_csv('data/endometrium_uterus_tumor.csv', sep=\",\")\n", "endometrium_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Creating the X matrix and the target vector" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = endometrium_data.drop(['ID_REF', 'Tissue'], axis=1).values\n", "y = pd.get_dummies(endometrium_data['Tissue']).values[:, 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import model_selection\n", "\n", "skf = model_selection.StratifiedKFold(n_splits=5)\n", "skf.get_n_splits(X, y)\n", "folds = [(tr, te) for (tr, te) in skf.split(X, y)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Cross-validation procedures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the functions defined in the ```cross_validation.py``` file. This module enables us to reuse code across different Jupyter notebooks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from cross_validation import cross_validate_clf, cross_validate_clf_optimize" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Decision Trees\n",
"\n", "A decision tree predicts the value of a target variable by learning simple decision rules inferred from the data features.\n", "\n", "In scikit-learn, decision trees are implemented in [tree.DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for classification and [tree.DecisionTreeRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) for regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import tree\n", "from sklearn.tree import DecisionTreeClassifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "ypred_dt = []  # cross-validated predicted probabilities, one array per tree\n", "for tree_index in range(5):\n", "    clf = tree.DecisionTreeClassifier()\n", "    pred_proba = cross_validate_clf(X, y, clf, folds)\n", "    ypred_dt.append(pred_proba)\n", "    print(\"%.3f\" % metrics.accuracy_score(y, np.where(pred_proba > 0.5, 1, 0)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fpr_dt = []  # will hold the 5 arrays of false positive rates (1 per tree)\n", "tpr_dt = []  # will hold the 5 arrays of true positive rates (1 per tree)\n", "auc_dt = []  # will hold the 5 areas under the ROC curve (1 per tree)\n", "\n", "for tree_index in range(5):\n", "    # Compute the ROC curve of the current tree\n", "    fpr_dt_tmp, tpr_dt_tmp, thresholds = metrics.roc_curve(y, ypred_dt[tree_index], pos_label=1)\n", "    # Compute the area under the ROC curve of the current tree\n", "    auc_dt_tmp = metrics.auc(fpr_dt_tmp, tpr_dt_tmp)\n", "    fpr_dt.append(fpr_dt_tmp)\n", "    tpr_dt.append(tpr_dt_tmp)\n", "    auc_dt.append(auc_dt_tmp)\n", "\n", "# Plot the first 4 ROC curves\n", "for tree_index in range(4):\n", "    plt.plot(fpr_dt[tree_index], tpr_dt[tree_index], '-')\n", "\n", "# Plot the last ROC curve, with a label that gives the mean/std AUC\n", "plt.plot(fpr_dt[-1], tpr_dt[-1], '-',\n", "         label='DT (AUC = %0.2f +/- %0.2f)' % (np.mean(auc_dt), np.std(auc_dt)))\n", "\n", "# Label the ROC plot\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curves', fontsize=16)\n", "plt.legend(loc=\"lower right\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Decision trees perform quite poorly on this dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question:** What parameters of DecisionTreeClassifier can you play with to define trees differently from the default settings? Cross-validate these using a grid search with [model_selection.GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). Plot the optimal decision tree on the previous plot. Did you manage to improve performance?"
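] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer (sketch):** The main parameters controlling the shape of a tree are `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features`, and the split `criterion`. The cell below is an illustrative addition (not part of the original lab): it fits one depth-limited and one unconstrained tree on the full data to show how `max_depth` changes the learned tree." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch: compare a depth-limited tree to an unconstrained one.\n", "shallow_tree = tree.DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)\n", "full_tree = tree.DecisionTreeClassifier(random_state=0).fit(X, y)\n", "print('max_depth=3 -> depth:', shallow_tree.get_depth(), '| leaves:', shallow_tree.get_n_leaves())\n", "print('default     -> depth:', full_tree.get_depth(), '| leaves:', full_tree.get_n_leaves())"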
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import model_selection\n", "\n", "# Define the grid of parameters to test\n", "param_grid = {'criterion': ['gini', 'entropy'],\n", "              'min_samples_split': [2, 4, 16, 256]}\n", "\n", "clf = model_selection.GridSearchCV(tree.DecisionTreeClassifier(random_state=100),\n", "                                   param_grid,\n", "                                   scoring='roc_auc',\n", "                                   cv=5, iid=True)\n", "\n", "# Cross-validate the GridSearchCV object\n", "ypred_dt_opt = cross_validate_clf_optimize(X, y, clf, folds)\n", "\n", "# Compute the ROC curve for the optimized DecisionTreeClassifier\n", "fpr_dt_opt, tpr_dt_opt, thresholds = metrics.roc_curve(y, ypred_dt_opt, pos_label=1)\n", "auc_dt_opt = metrics.auc(fpr_dt_opt, tpr_dt_opt)\n", "\n", "# Plot the ROC curves of the 5 decision trees from earlier\n", "fig = plt.figure(figsize=(5, 5))\n", "\n", "for tree_index in range(4):\n", "    plt.plot(fpr_dt[tree_index], tpr_dt[tree_index], '-', color='blue')\n", "\n", "plt.plot(fpr_dt[-1], tpr_dt[-1], '-', color='blue',\n", "         label='DT (AUC = %0.2f (+/- %0.2f))' % (np.mean(auc_dt), np.std(auc_dt)))\n", "\n", "# Plot the ROC curve of the optimized DecisionTreeClassifier\n", "plt.plot(fpr_dt_opt, tpr_dt_opt, color='orange', label='DT optimized (AUC=%0.2f)' % auc_dt_opt)\n", "\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curves', fontsize=16)\n", "plt.legend(loc=\"lower right\", fontsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no clearly optimal parameter setting, and performance remains poor." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Bagging trees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will resort to ensemble methods to try to improve the performance of single decision trees. Let us start with _bagging trees_: the different trees are built on _bootstrap samples_ of the data, that is, samples built by randomly drawing n points _with replacement_ from the original data, where n is the number of points in the training set.\n", "\n", "Bagging is efficient when used with low-bias, high-variance weak learners. Indeed, by averaging such estimators, we lower the variance: the average is a smoother estimator, which is still centered around the true target (low bias).\n", "\n", "Bagging decision trees hence makes sense, as decision trees have:\n", "* low bias: intuitively, the conditions checked along a path from the root compound with one another, so the tree keeps narrowing down on the training data (it becomes highly tuned to the training set);\n", "* high variance: decision trees are very sensitive to where and how they split, so even small changes in the input values can result in a very different tree structure.\n", "\n", "\n", "**Note**: Bagging trees and random forests start being really powerful when using a large number of trees (several hundred). This is computationally more intensive, especially when the number of features is large, as in this lab. For the sake of computational time, we suggest using small numbers of trees here, but you might want to repeat this lab with larger numbers of trees at home." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question** Cross-validate a bagging ensemble of 5 decision trees on the data. Plot the resulting ROC curve, compared to the 5 decision trees you trained earlier.\n", "\n", "Use [ensemble.BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import ensemble\n", "\n", "# Initialize a bag of trees\n", "clf = ensemble.BaggingClassifier(n_estimators=5, max_features=X.shape[1], random_state=100)\n", "\n", "# Cross-validate the bagging trees on the tumor data\n", "ypred_bt = cross_validate_clf(X, y, clf, folds)\n", "\n", "# Compute the ROC curve of the bagging trees\n", "fpr_bt, tpr_bt, thresholds = metrics.roc_curve(y, ypred_bt, pos_label=1)\n", "auc_bt = metrics.auc(fpr_bt, tpr_bt)\n", "\n", "# Plot the ROC curves of the 5 decision trees from earlier\n", "fig = plt.figure(figsize=(5, 5))\n", "\n", "for tree_index in range(4):\n", "    plt.plot(fpr_dt[tree_index], tpr_dt[tree_index], '-', color='blue')\n", "plt.plot(fpr_dt[-1], tpr_dt[-1], '-', color='blue',\n", "         label='Decision Tree (AUC = %0.2f (+/- %0.2f))' % (np.mean(auc_dt), np.std(auc_dt)))\n", "\n", "# Plot the ROC curve of the bagging trees\n", "plt.plot(fpr_bt, tpr_bt, color='orange', label='Bagging Tree (AUC=%0.2f)' % auc_bt)\n", "\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curves', fontsize=16)\n", "plt.legend(loc=\"lower right\", fontsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question:__ How do the bagging trees perform compared to individual trees?\n", "\n", "__Answer:__ Combining five trees enables the bagging ensemble to perform better than any of the individual decision trees." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question** Use ```cross_validate_clf``` to evaluate the bagging method for different numbers of decision trees. How many trees did you find to be an optimal choice?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Numbers of trees to try\n", "list_n_trees = [5, 10, 20, 50, 80]\n", "\n", "# Start a ROC curve plot\n", "fig = plt.figure(figsize=(5, 5))\n", "\n", "for idx, n_trees in enumerate(list_n_trees):\n", "    # Initialize a bag of trees with n_trees trees\n", "    clf = ensemble.BaggingClassifier(n_estimators=n_trees, random_state=100)\n", "\n", "    # Cross-validate the bagging trees on the tumor data\n", "    ypred_bt_opt = cross_validate_clf(X, y, clf, folds)\n", "\n", "    # Compute the ROC curve\n", "    fpr_bt_opt, tpr_bt_opt, thresholds = metrics.roc_curve(y, ypred_bt_opt, pos_label=1)\n", "    auc_bt_opt = metrics.auc(fpr_bt_opt, tpr_bt_opt)\n", "\n", "    # Plot the ROC curve\n", "    plt.plot(fpr_bt_opt, tpr_bt_opt, '-',\n", "             label='Bagging Tree %d trees (AUC = %0.2f)' % (n_trees, auc_bt_opt))\n", "\n", "# Plot the ROC curve of the optimal decision tree\n", "plt.plot(fpr_dt_opt, tpr_dt_opt, label='DT optimized (AUC=%0.2f)' % auc_dt_opt)\n", "\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curves', fontsize=16)\n", "plt.legend(loc=\"lower right\", fontsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Increasing the number of trees improves performance, though only marginally." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Random Forests"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In practice, bagging alone is typically not enough. To obtain a good reduction in variance, we need the aggregated models to be uncorrelated, so that they make “different errors”. Bagging will usually give you highly correlated models that make the same errors, and will therefore not reduce the variance of the combined predictor.\n", "\n", "**Question** What is the difference between bagging trees and random forests? How does it intuitively mitigate the correlation between trees?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize a random forest with 10 trees\n", "clf = ensemble.RandomForestClassifier(n_estimators=10, criterion='entropy')\n", "\n", "# Cross-validate the random forest on the tumor data\n", "ypred_rf = cross_validate_clf(X, y, clf, folds)\n", "\n", "# Compute the ROC curve of the random forest\n", "fpr_rf, tpr_rf, thresholds = metrics.roc_curve(y, ypred_rf, pos_label=1)\n", "auc_rf = metrics.auc(fpr_rf, tpr_rf)\n", "\n", "# Plot the ROC curves of the 5 decision trees\n", "fig = plt.figure(figsize=(5, 5))\n", "\n", "for tree_index in range(4):\n", "    plt.plot(fpr_dt[tree_index], tpr_dt[tree_index], '-', color='grey')\n", "\n", "plt.plot(fpr_dt[-1], tpr_dt[-1], '-', color='grey',\n", "         label='Decision Tree (AUC = %0.2f (+/- %0.2f))' % (np.mean(auc_dt), np.std(auc_dt)))\n", "\n", "# Plot the ROC curve of the bagging trees (5 trees)\n", "plt.plot(fpr_bt, tpr_bt, label='Bagging Tree (AUC=%0.2f)' % auc_bt)\n", "\n", "# Plot the ROC curve of the random forest (10 trees)\n", "plt.plot(fpr_rf, tpr_rf, label='Random Forest (AUC=%0.2f)' % auc_rf)\n", "\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "\n", "plt.title('ROC curves', fontsize=16)\n", "plt.legend(loc=\"lower right\", fontsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As can be seen, with a low number of trees the ```random forest``` performs similarly to the bagging trees." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "?ensemble.RandomForestClassifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question** What are the main parameters of a random forest that can be optimized? (A brief answer sketch follows the next question.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question** Use ```cross_validate_clf_optimize``` to optimize\n", "\n", " * the number of decision trees \n", " * the number of features to consider at each split."
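] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Answer (sketch):** The two key parameters are `n_estimators` (the number of trees) and `max_features` (the number of features drawn at random as split candidates at each node, which is what decorrelates the trees). The cell below is an illustrative addition, not part of the original lab: it fits a small forest and reports how many features each split may consider." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch: with max_features='sqrt', each split draws a random\n", "# subset of about sqrt(3000) ~ 54 candidate features instead of all 3,000.\n", "rf = ensemble.RandomForestClassifier(n_estimators=10, max_features='sqrt', random_state=0)\n", "rf.fit(X, y)\n", "print('total features:              ', X.shape[1])\n", "print('candidate features per split:', rf.estimators_[0].max_features_)"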
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define the grid of parameters to test\n", "param_grid = {'criterion': ['gini', 'entropy'],\n", "              'max_features': ['auto', 'log2'],\n", "              'min_samples_leaf': [1, 2, 5],\n", "              'n_estimators': [5, 10, 20, 50, 80]\n", "              }\n", "\n", "# Initialize a GridSearchCV object that will be used to cross-validate a random forest with these parameters\n", "clf = model_selection.GridSearchCV(ensemble.RandomForestClassifier(),\n", "                                   param_grid,\n", "                                   scoring='roc_auc',\n", "                                   cv=5, iid=True)\n", "\n", "# Cross-validate the GridSearchCV object\n", "ypred_rf_opt = cross_validate_clf_optimize(X, y, clf, folds)\n", "\n", "# Compute the ROC curve for the optimized random forest\n", "fpr_rf_opt, tpr_rf_opt, thresholds = metrics.roc_curve(y, ypred_rf_opt, pos_label=1)\n", "auc_rf_opt = metrics.auc(fpr_rf_opt, tpr_rf_opt)\n", "\n", "\n", "fig = plt.figure(figsize=(5, 5))\n", "\n", "# Plot the ROC curve of the optimized DecisionTreeClassifier\n", "plt.plot(fpr_dt_opt, tpr_dt_opt, color='grey', label='DT optimized (AUC=%0.2f)' % auc_dt_opt)\n", "\n", "# Plot the ROC curve of the last bagging ensemble (80 trees)\n", "plt.plot(fpr_bt_opt, tpr_bt_opt, label='BT 80 trees (AUC=%0.2f)' % auc_bt_opt)\n", "\n", "# Plot the ROC curve of the optimized random forest\n", "plt.plot(fpr_rf_opt, tpr_rf_opt, label='RF optimized (AUC = %0.2f)' % (auc_rf_opt))\n", "\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curves', fontsize=16)\n", "plt.legend(loc=\"lower right\", fontsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. How many trees do you find to be an optimal choice?\n", "2. How does the optimal random forest compare to the optimal bagging trees?\n", "3. How do the training times of the random forest and the bagging trees compare? (A quick timing sketch follows the next question.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question** How do your tree-based classifiers compare to regularized logistic regression models?\n", "\n", "Plot the corresponding ROC curves."
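] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before turning to the linear baselines, here is a quick illustrative timing check for question 3 above (an added sketch, not part of the original lab). Random forests are usually faster to train than bagged trees on this data, because each split only evaluates a random subset of the 3,000 features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "# Fit both ensembles once on the full data and compare wall-clock times.\n", "for name, clf in [('bagging trees', ensemble.BaggingClassifier(n_estimators=20, random_state=0)),\n", "                  ('random forest', ensemble.RandomForestClassifier(n_estimators=20, random_state=0))]:\n", "    t0 = time.time()\n", "    clf.fit(X, y)\n", "    print('%s: %.2f s' % (name, time.time() - t0))"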
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import linear_model\n", "\n", "# Evaluate an optimized l1-regularized logistic regression\n", "param_grid = {'C': np.logspace(-3, 3, 7)}\n", "clf = model_selection.GridSearchCV(linear_model.LogisticRegression(penalty='l1', solver='liblinear'),\n", "                                   param_grid,\n", "                                   scoring='roc_auc',\n", "                                   cv=3,\n", "                                   iid=True)\n", "ypred_l1 = cross_validate_clf_optimize(X, y, clf, folds)\n", "fpr_l1, tpr_l1, thresholds_l1 = metrics.roc_curve(y, ypred_l1, pos_label=1)\n", "auc_l1 = metrics.auc(fpr_l1, tpr_l1)\n", "print('nb features of best sparse model:', len(np.where(clf.best_estimator_.coef_ != 0)[0]))\n", "\n", "# Evaluate an optimized l2-regularized logistic regression\n", "clf = model_selection.GridSearchCV(linear_model.LogisticRegression(penalty='l2', solver='liblinear'),\n", "                                   param_grid,\n", "                                   scoring='roc_auc',\n", "                                   cv=3,\n", "                                   iid=True)\n", "ypred_l2 = cross_validate_clf_optimize(X, y, clf, folds)\n", "fpr_l2, tpr_l2, thresholds_l2 = metrics.roc_curve(y, ypred_l2, pos_label=1)\n", "auc_l2 = metrics.auc(fpr_l2, tpr_l2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot the ROC curves\n", "fig = plt.figure(figsize=(5, 5))\n", "\n", "plt.plot(fpr_rf_opt, tpr_rf_opt,\n", "         label='RF optimized (AUC = %0.2f)' % (auc_rf_opt))\n", "plt.plot(fpr_bt_opt, tpr_bt_opt,\n", "         label='BT optimized (AUC = %0.2f)' % (auc_bt_opt))\n", "plt.plot(fpr_l1, tpr_l1,\n", "         label='l1 optimized (AUC = %0.2f)' % (auc_l1))\n", "plt.plot(fpr_l2, tpr_l2,\n", "         label='l2 optimized (AUC = %0.2f)' % (auc_l2))\n", "\n", "plt.xlabel('False Positive Rate', fontsize=16)\n", "plt.ylabel('True Positive Rate', fontsize=16)\n", "plt.title('ROC curves', fontsize=16)\n", "plt.legend(loc=\"lower right\", fontsize=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As can be seen, the random forest slightly outperforms the $L_1$- and $L_2$-regularized logistic regressions (i.e., lasso- and ridge-type models). The regularized linear models nevertheless remain competitive: a much simpler model can reach similar performance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature importance via RandomForest\n", "\n", "The performance obtained by the l1-regularized logistic regression on the endometrium-vs-uterus dataset suggests that a well-chosen subset of features can yield good predictive models of gene expression data.\n", "\n", "It is worth noticing that tree-based methods 'naturally' compute a measure of feature importance, which can be directly used for selecting features. Indeed, at each split in each tree, the improvement in the split criterion is the importance measure attributed to the splitting variable. These importances are accumulated over all trees in the forest.\n", "\n", "In scikit-learn, this feature importance is accessible via the `feature_importances_` attribute of random forests, and can be exploited via the meta-transformer [feature_selection.SelectFromModel](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html). A short sketch inspecting the raw importances follows."
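] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a small illustrative sketch (added, not part of the original lab), the cell below fits a forest and inspects its raw `feature_importances_` before delegating the selection to `SelectFromModel`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch: importances are non-negative and sum to 1.\n", "rf = ensemble.RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)\n", "importances = rf.feature_importances_\n", "print('importances sum to %.2f' % importances.sum())\n", "top10 = np.argsort(importances)[::-1][:10]\n", "print('top-10 feature column indices:', top10)"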
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_selection import SelectFromModel\n", "from sklearn import pipeline\n", "\n", "from sklearn import ensemble\n", "from sklearn import linear_model\n", "from sklearn.model_selection import GridSearchCV, StratifiedKFold" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Creating folds\n", "skf = StratifiedKFold(n_splits=5)\n", "skf.get_n_splits(X, y)\n", "folds = [(tr, te) for (tr, te) in skf.split(X, y)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "THRESHOLD_OPTIONS = ['mean', '1.5*mean', '2*mean', '5*mean']\n", "C_OPTIONS = np.logspace(-3, 3, 7)\n", "N_ESTIMATORS_OPTIONS = [10, 20, 50]\n", "\n", "param_grid = [\n", "    {\n", "        'feature_selection': [SelectFromModel(ensemble.RandomForestClassifier(n_estimators=50))],\n", "        'feature_selection__threshold': THRESHOLD_OPTIONS,\n", "        'classification': [ensemble.RandomForestClassifier()],\n", "        'classification__n_estimators': N_ESTIMATORS_OPTIONS,\n", "    },\n", "    {\n", "        'feature_selection': [SelectFromModel(ensemble.RandomForestClassifier(n_estimators=50))],\n", "        'feature_selection__threshold': THRESHOLD_OPTIONS,\n", "        'classification': [linear_model.LogisticRegression(penalty='l2', solver='liblinear')],\n", "        'classification__C': C_OPTIONS,\n", "    },\n", "]\n", "\n", "pipe = pipeline.Pipeline([\n", "    ('feature_selection', SelectFromModel(ensemble.RandomForestClassifier(n_estimators=50))),\n", "    ('classification', ensemble.RandomForestClassifier())\n", "])\n", "grid = GridSearchCV(pipe, cv=folds, n_jobs=3, param_grid=param_grid, scoring='roc_auc', iid=True)\n", "grid.fit(X, y)\n", "\n", "grid_l1_log_reg = GridSearchCV(linear_model.LogisticRegression(penalty='l1', solver='liblinear'), cv=folds, n_jobs=3,\n", "                               param_grid={'C': C_OPTIONS}, scoring='roc_auc', iid=True)\n", "grid_l1_log_reg.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us now use a DataFrame to display the results of our evaluation procedure." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Build one column per pipeline parameter, plus a 'score' column\n", "data_frame = {name: [] for i in range(len(grid.cv_results_['params'])) for name in grid.cv_results_['params'][i]}\n", "data_frame['score'] = []\n", "\n", "# Fill the rows in order of decreasing mean test AUC\n", "sorted_index_score = np.argsort(grid.cv_results_['mean_test_score'])[::-1]\n", "for ind in sorted_index_score:\n", "    data_frame['score'].append(grid.cv_results_['mean_test_score'][ind])\n", "    for name in data_frame.keys():\n", "        if name in grid.cv_results_['params'][ind]:\n", "            data_frame[name].append(grid.cv_results_['params'][ind][name])\n", "        elif name != 'score':\n", "            # Parameters absent from this grid entry get None\n", "            data_frame[name].append(None)\n", "\n", "pd.DataFrame(data_frame).head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_frame = {name: [] for i in range(len(grid_l1_log_reg.cv_results_['params']))\n", "              for name in grid_l1_log_reg.cv_results_['params'][i]}\n", "data_frame['score'] = []\n", "\n", "sorted_index_score = np.argsort(grid_l1_log_reg.cv_results_['mean_test_score'])[::-1]\n", "for ind in sorted_index_score:\n", "    data_frame['score'].append(grid_l1_log_reg.cv_results_['mean_test_score'][ind])\n", "    for name in data_frame.keys():\n", "        if name in grid_l1_log_reg.cv_results_['params'][ind]:\n", "            data_frame[name].append(grid_l1_log_reg.cv_results_['params'][ind][name])\n", "        elif name != 'score':\n", "            data_frame[name].append(None)\n", "\n", "pd.DataFrame(data_frame).head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('number of original features:', X.shape[1])\n", "\n", "tree_based_feature_selection = SelectFromModel(estimator=ensemble.RandomForestClassifier(n_estimators=50),\n", "                                               threshold='mean')\n", "tree_based_feature_selection.fit(X, y)\n", "print('number of selected features by random forest:', len(tree_based_feature_selection.get_support(indices=True)))\n", "\n", "l1_clf = linear_model.LogisticRegression(penalty='l1', C=100, solver='liblinear')\n", "l1_clf.fit(X, y)\n", "print('number of selected features by log. reg. with L1 regularization:', len(np.where(l1_clf.coef_ != 0)[0]))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }