{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Section 1-3 - Parameter Tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In previous sections, we took the approach of using Scikit-learn as a black box. We now review how to tune the parameters of the model to make more accurate predictions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas - Extracting data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "df = pd.read_csv('data.csv')\n", "\n", "df_train = df.iloc[:712, :]\n", "df_test = df.iloc[712:, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas - Cleaning data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_train = df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)\n", "\n", "age_mean = df_train['Age'].mean()\n", "df_train['Age'] = df_train['Age'].fillna(age_mean)\n", "\n", "df_train['Embarked'] = df_train['Embarked'].fillna('S')\n", "\n", "df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})\n", "df_train = pd.concat([df_train, pd.get_dummies(df_train['Embarked'], prefix='Embarked')], axis=1)\n", "\n", "df_train = df_train.drop(['Embarked'], axis=1)\n", "\n", "X_train = df_train.iloc[:, 2:].values\n", "y_train = df_train['Survived']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scikit-learn - Training the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The documentation for the Random Forest Classifier details the different input parameters of the model. These input parameters include the number of trees, and the number of branches each tree has. It is unclear, off-the-bat, which values would be optimal. \n", "\n", "http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "GridSearchCV allows us to test the desired range of input parameters, and review the performance of each set of values on a cross-validation basis. Here we review the number of features considered at each step a branch is made (max_features: 50% or 100% of features) and the maximum number of branches (max_depth: 5 levels or no limitations). " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.grid_search import GridSearchCV\n", "\n", "parameter_grid = {\n", " 'max_features': [0.5, 1.],\n", " 'max_depth': [5., None]\n", "}\n", "\n", "grid_search = GridSearchCV(RandomForestClassifier(n_estimators = 100), parameter_grid, cv=5, verbose=3)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 4 candidates, totalling 20 fits\n", "[CV] max_features=0.5, max_depth=5.0 .................................\n", "[CV] ........ max_features=0.5, max_depth=5.0, score=0.818182 - 0.2s\n", "[CV] max_features=0.5, max_depth=5.0 .................................\n", "[CV] ........ max_features=0.5, max_depth=5.0, score=0.797203 - 0.2s\n", "[CV] max_features=0.5, max_depth=5.0 .................................\n", "[CV] ........ max_features=0.5, max_depth=5.0, score=0.888112 - 0.2s\n", "[CV] max_features=0.5, max_depth=5.0 .................................\n", "[CV] ........ max_features=0.5, max_depth=5.0, score=0.823944 - 0.2s\n", "[CV] max_features=0.5, max_depth=5.0 .................................\n", "[CV] ........ max_features=0.5, max_depth=5.0, score=0.773050 - 0.2s\n", "[CV] max_features=1.0, max_depth=5.0 .................................\n", "[CV] ........ max_features=1.0, max_depth=5.0, score=0.797203 - 0.2s\n", "[CV] max_features=1.0, max_depth=5.0 .................................\n", "[CV] ........ max_features=1.0, max_depth=5.0, score=0.818182 - 0.2s\n", "[CV] max_features=1.0, max_depth=5.0 .................................\n", "[CV] ........ max_features=1.0, max_depth=5.0, score=0.874126 - 0.2s\n", "[CV] max_features=1.0, max_depth=5.0 .................................\n", "[CV] ........ max_features=1.0, max_depth=5.0, score=0.830986 - 0.2s\n", "[CV] max_features=1.0, max_depth=5.0 .................................\n", "[CV] ........ max_features=1.0, max_depth=5.0, score=0.794326 - 0.2s\n", "[CV] max_features=0.5, max_depth=None ................................\n", "[CV] ....... max_features=0.5, max_depth=None, score=0.783217 - 0.2s\n", "[CV] max_features=0.5, max_depth=None ................................\n", "[CV] ....... max_features=0.5, max_depth=None, score=0.776224 - 0.2s\n", "[CV] max_features=0.5, max_depth=None ................................\n", "[CV] ....... max_features=0.5, max_depth=None, score=0.860140 - 0.2s\n", "[CV] max_features=0.5, max_depth=None ................................\n", "[CV] ....... max_features=0.5, max_depth=None, score=0.788732 - 0.2s\n", "[CV] max_features=0.5, max_depth=None ................................\n", "[CV] ....... max_features=0.5, max_depth=None, score=0.794326 - 0.2s\n", "[CV] max_features=1.0, max_depth=None ................................\n", "[CV] ....... max_features=1.0, max_depth=None, score=0.783217 - 0.2s\n", "[CV] max_features=1.0, max_depth=None ................................\n", "[CV] ....... max_features=1.0, max_depth=None, score=0.797203 - 0.2s\n", "[CV] max_features=1.0, max_depth=None ................................\n", "[CV] ....... max_features=1.0, max_depth=None, score=0.881119 - 0.2s\n", "[CV] max_features=1.0, max_depth=None ................................\n", "[CV] ....... max_features=1.0, max_depth=None, score=0.788732 - 0.2s\n", "[CV] max_features=1.0, max_depth=None ................................\n", "[CV] ....... max_features=1.0, max_depth=None, score=0.765957 - 0.2s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 3.8s finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=5, error_score='raise',\n", " estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,\n", " oob_score=False, random_state=None, verbose=0,\n", " warm_start=False),\n", " fit_params={}, iid=True, n_jobs=1,\n", " param_grid={'max_features': [0.5, 1.0], 'max_depth': [5.0, None]},\n", " pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now review the results." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[mean: 0.82022, std: 0.03842, params: {'max_features': 0.5, 'max_depth': 5.0},\n", " mean: 0.82303, std: 0.02894, params: {'max_features': 1.0, 'max_depth': 5.0},\n", " mean: 0.80056, std: 0.03040, params: {'max_features': 0.5, 'max_depth': None},\n", " mean: 0.80337, std: 0.04026, params: {'max_features': 1.0, 'max_depth': None}]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search.grid_scores_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now review the best-performing tuning parameters." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'max_depth': 5.0, 'max_features': 1.0}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then set these tuning parameters to our model." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model = RandomForestClassifier(n_estimators = 100, max_features=1.0, max_depth=5.0, random_state=0)\n", "model = model.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**\n", "\n", "- Write the code so that grid_search refits model with the best tuning parameters to the entire dataset after these parameters are found, and hence allow us to skip the two lines of code above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scikit-learn - Making predictions" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)\n", "\n", "df_test['Age'] = df_test['Age'].fillna(age_mean)\n", "df_test['Embarked'] = df_test['Embarked'].fillna('S')\n", "\n", "df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)\n", "df_test = pd.concat([df_test, pd.get_dummies(df_test['Embarked'], prefix='Embarked')],axis=1)\n", "\n", "df_test = df_test.drop(['Embarked'], axis=1)\n", "\n", "X_test = df_test.iloc[:, 2:]\n", "y_test = df_test['Survived']\n", "\n", "y_prediction = model.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.86033519553072624" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(y_prediction == y_test) / float(len(y_test))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }