{ "metadata": { "name": "", "signature": "sha256:13189f520865582176d281b22c1da4c5c7c59e0f9d43c5c58eba67d27bdec76b" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "For every classifier, there are __a lot of__ parameters to be optimized in Scikit-Learn to get an optimal accuracy for the classifier in the cross-validattion dataset. You could always try and fail and see which parameters work for best but it is time consuming(computation, storage and reporting) and cumbersome in general if you want to handle the parameter search by yourself. Consider these parameters an extra burden when you try to choose an optimal classifier out of multiple classifiers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn provides a grid search with cross validation so that when you do the parameter search in `GridSearchCV`, it automatically finds the best parameters tested on a cross-validation so you do not have to search the parameters in different folds." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One thing to note is that it may take a lot of time to do on a regular machine if you have a lot of parameters to try as the number of parameters will be a combinatorial of those parameters. So you may want to use a powerful and multicore machine to do the grid search in the parameter space. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline \n", "\n", "import matplotlib as mlp\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import sklearn\n", "from sklearn import cross_validation\n", "from sklearn import datasets\n", "from sklearn import ensemble\n", "from sklearn import grid_search\n", "from sklearn import metrics\n", "import time\n", "\n", "plt.style.use('fivethirtyeight')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "california_housing = datasets.california_housing.fetch_california_housing()\n", "california_housing_data = california_housing['data']\n", "california_housing_labels = california_housing['target']# 'target' variables\n", "california_housing_feature_names = california_housing['feature_names']" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "X_train, X_test, y_train, y_test = cross_validation.train_test_split(california_housing_data,\n", " california_housing_labels,\n", " test_size=0.2,\n", " random_state=0)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to create a grid search object, it is necessary to pass the estimator, parameters to optimize for that estimator, optionally you could pass `n_job` in order to parallellize the parameter search. Grid search in parameter space is embarrasingly parallel by nature as different parameter combinations are independent from each other. You could also pass different scoring functions if you want to use another scoring function to optimize the best parameters for your classifier." ] }, { "cell_type": "code", "collapsed": false, "input": [ "param_grid = {\n", " 'learning_rate': [0.1, 0.05, 0.01],\n", " 'max_depth': [4, 6],\n", " 'min_samples_leaf': [3, 9, 15],\n", " 'n_estimators': [1000, 2000, 3000],\n", " }\n", "\n", "est = ensemble.GradientBoostingRegressor()\n", "\n", "start_time = time.time()\n", "gs_cv = grid_search.GridSearchCV(est, param_grid, n_jobs=4).fit(X_train, y_train)\n", "end_time = time.time()\n", "\n", "print('It took {} seconds'.format(end_time - start_time))\n", "# best hyperparameter setting\n", "gs_cv.best_params_" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "It took 2920.44363379 seconds\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "{'learning_rate': 0.05,\n", " 'max_depth': 6,\n", " 'min_samples_leaf': 15,\n", " 'n_estimators': 2000}" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to get scores for each different parameter combination, `grid_scores_` attribute of GridSearch object provides a nice way to provide all of the scores for each parameter combination." ] }, { "cell_type": "code", "collapsed": false, "input": [ "gs_cv.grid_scores_" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "[mean: 0.83473, std: 0.00265, params: {'n_estimators': 1000, 'learning_rate': 0.1, 'max_depth': 4, 'min_samples_leaf': 3},\n", " mean: 0.83579, std: 0.00259, params: {'n_estimators': 2000, 'learning_rate': 0.1, 'max_depth': 4, 'min_samples_leaf': 3},\n", " mean: 0.83533, std: 0.00238, params: {'n_estimators': 3000, 'learning_rate': 0.1, 'max_depth': 4, 'min_samples_leaf': 3},\n", " mean: 0.83513, std: 0.00268, params: {'n_estimators': 1000, 'learning_rate': 0.1, 'max_depth': 4, 'min_samples_leaf': 9},\n", " mean: 0.83493, std: 0.00242, params: {'n_estimators': 2000, 'learning_rate': 0.1, 'max_depth': 4, 'min_samples_leaf': 9},\n", " mean: 0.83368, std: 0.00249, params: {'n_estimators': 3000, 'learning_rate': 0.1, 'max_depth': 4, 'min_samples_leaf': 9},\n", " mean: 0.83669, std: 0.00174, params: {'n_estimators': 1000, 'learning_rate': 0.1, 'max_depth': 4, 'min_samples_leaf': 15},\n", " mean: 0.83715, std: 0.00267, params: {'n_estimators': 2000, 'learning_rate': 0.1, 'max_depth': 4, 'min_samples_leaf': 15},\n", " mean: 0.83590, std: 0.00264, params: {'n_estimators': 3000, 'learning_rate': 0.1, 'max_depth': 4, 'min_samples_leaf': 15},\n", " mean: 0.83683, std: 0.00123, params: {'n_estimators': 1000, 'learning_rate': 0.1, 'max_depth': 6, 'min_samples_leaf': 3},\n", " mean: 0.83672, std: 0.00126, params: {'n_estimators': 2000, 'learning_rate': 0.1, 'max_depth': 6, 'min_samples_leaf': 3},\n", " mean: 0.83642, std: 0.00138, params: {'n_estimators': 3000, 'learning_rate': 0.1, 'max_depth': 6, 'min_samples_leaf': 3},\n", " mean: 0.83686, std: 0.00217, params: {'n_estimators': 1000, 'learning_rate': 0.1, 'max_depth': 6, 'min_samples_leaf': 9},\n", " mean: 0.83616, std: 0.00187, params: {'n_estimators': 2000, 'learning_rate': 0.1, 'max_depth': 6, 'min_samples_leaf': 9},\n", " mean: 0.83541, std: 0.00188, params: {'n_estimators': 3000, 'learning_rate': 0.1, 'max_depth': 6, 'min_samples_leaf': 9},\n", " mean: 0.83755, std: 0.00155, params: {'n_estimators': 1000, 'learning_rate': 0.1, 'max_depth': 6, 'min_samples_leaf': 15},\n", " mean: 0.83602, std: 0.00125, params: {'n_estimators': 2000, 'learning_rate': 0.1, 'max_depth': 6, 'min_samples_leaf': 15},\n", " mean: 0.83504, std: 0.00146, params: {'n_estimators': 3000, 'learning_rate': 0.1, 'max_depth': 6, 'min_samples_leaf': 15},\n", " mean: 0.83159, std: 0.00199, params: {'n_estimators': 1000, 'learning_rate': 0.05, 'max_depth': 4, 'min_samples_leaf': 3},\n", " mean: 0.83444, std: 0.00241, params: {'n_estimators': 2000, 'learning_rate': 0.05, 'max_depth': 4, 'min_samples_leaf': 3},\n", " mean: 0.83544, std: 0.00223, params: {'n_estimators': 3000, 'learning_rate': 0.05, 'max_depth': 4, 'min_samples_leaf': 3},\n", " mean: 0.83204, std: 0.00182, params: {'n_estimators': 1000, 'learning_rate': 0.05, 'max_depth': 4, 'min_samples_leaf': 9},\n", " mean: 0.83496, std: 0.00233, params: {'n_estimators': 2000, 'learning_rate': 0.05, 'max_depth': 4, 'min_samples_leaf': 9},\n", " mean: 0.83548, std: 0.00208, params: {'n_estimators': 3000, 'learning_rate': 0.05, 'max_depth': 4, 'min_samples_leaf': 9},\n", " mean: 0.83380, std: 0.00185, params: {'n_estimators': 1000, 'learning_rate': 0.05, 'max_depth': 4, 'min_samples_leaf': 15},\n", " mean: 0.83745, std: 0.00270, params: {'n_estimators': 2000, 'learning_rate': 0.05, 'max_depth': 4, 'min_samples_leaf': 15},\n", " mean: 0.83832, std: 0.00334, params: {'n_estimators': 3000, 'learning_rate': 0.05, 'max_depth': 4, 'min_samples_leaf': 15},\n", " mean: 0.83698, std: 0.00283, params: {'n_estimators': 1000, 'learning_rate': 0.05, 'max_depth': 6, 'min_samples_leaf': 3},\n", " mean: 0.83833, std: 0.00332, params: {'n_estimators': 2000, 'learning_rate': 0.05, 'max_depth': 6, 'min_samples_leaf': 3},\n", " mean: 0.83799, std: 0.00338, params: {'n_estimators': 3000, 'learning_rate': 0.05, 'max_depth': 6, 'min_samples_leaf': 3},\n", " mean: 0.83716, std: 0.00246, params: {'n_estimators': 1000, 'learning_rate': 0.05, 'max_depth': 6, 'min_samples_leaf': 9},\n", " mean: 0.83759, std: 0.00199, params: {'n_estimators': 2000, 'learning_rate': 0.05, 'max_depth': 6, 'min_samples_leaf': 9},\n", " mean: 0.83729, std: 0.00182, params: {'n_estimators': 3000, 'learning_rate': 0.05, 'max_depth': 6, 'min_samples_leaf': 9},\n", " mean: 0.83903, std: 0.00159, params: {'n_estimators': 1000, 'learning_rate': 0.05, 'max_depth': 6, 'min_samples_leaf': 15},\n", " mean: 0.83928, std: 0.00207, params: {'n_estimators': 2000, 'learning_rate': 0.05, 'max_depth': 6, 'min_samples_leaf': 15},\n", " mean: 0.83862, std: 0.00207, params: {'n_estimators': 3000, 'learning_rate': 0.05, 'max_depth': 6, 'min_samples_leaf': 15},\n", " mean: 0.80416, std: 0.00012, params: {'n_estimators': 1000, 'learning_rate': 0.01, 'max_depth': 4, 'min_samples_leaf': 3},\n", " mean: 0.82043, std: 0.00121, params: {'n_estimators': 2000, 'learning_rate': 0.01, 'max_depth': 4, 'min_samples_leaf': 3},\n", " mean: 0.82661, std: 0.00154, params: {'n_estimators': 3000, 'learning_rate': 0.01, 'max_depth': 4, 'min_samples_leaf': 3},\n", " mean: 0.80566, std: 0.00118, params: {'n_estimators': 1000, 'learning_rate': 0.01, 'max_depth': 4, 'min_samples_leaf': 9},\n", " mean: 0.82172, std: 0.00089, params: {'n_estimators': 2000, 'learning_rate': 0.01, 'max_depth': 4, 'min_samples_leaf': 9},\n", " mean: 0.82785, std: 0.00182, params: {'n_estimators': 3000, 'learning_rate': 0.01, 'max_depth': 4, 'min_samples_leaf': 9},\n", " mean: 0.80542, std: 0.00182, params: {'n_estimators': 1000, 'learning_rate': 0.01, 'max_depth': 4, 'min_samples_leaf': 15},\n", " mean: 0.82148, std: 0.00052, params: {'n_estimators': 2000, 'learning_rate': 0.01, 'max_depth': 4, 'min_samples_leaf': 15},\n", " mean: 0.82741, std: 0.00095, params: {'n_estimators': 3000, 'learning_rate': 0.01, 'max_depth': 4, 'min_samples_leaf': 15},\n", " mean: 0.82472, std: 0.00189, params: {'n_estimators': 1000, 'learning_rate': 0.01, 'max_depth': 6, 'min_samples_leaf': 3},\n", " mean: 0.83270, std: 0.00156, params: {'n_estimators': 2000, 'learning_rate': 0.01, 'max_depth': 6, 'min_samples_leaf': 3},\n", " mean: 0.83506, std: 0.00197, params: {'n_estimators': 3000, 'learning_rate': 0.01, 'max_depth': 6, 'min_samples_leaf': 3},\n", " mean: 0.82550, std: 0.00247, params: {'n_estimators': 1000, 'learning_rate': 0.01, 'max_depth': 6, 'min_samples_leaf': 9},\n", " mean: 0.83370, std: 0.00171, params: {'n_estimators': 2000, 'learning_rate': 0.01, 'max_depth': 6, 'min_samples_leaf': 9},\n", " mean: 0.83645, std: 0.00171, params: {'n_estimators': 3000, 'learning_rate': 0.01, 'max_depth': 6, 'min_samples_leaf': 9},\n", " mean: 0.82603, std: 0.00174, params: {'n_estimators': 1000, 'learning_rate': 0.01, 'max_depth': 6, 'min_samples_leaf': 15},\n", " mean: 0.83392, std: 0.00143, params: {'n_estimators': 2000, 'learning_rate': 0.01, 'max_depth': 6, 'min_samples_leaf': 15},\n", " mean: 0.83648, std: 0.00186, params: {'n_estimators': 3000, 'learning_rate': 0.01, 'max_depth': 6, 'min_samples_leaf': 15}]" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could look at various metrics as well, passing `scoring` optional parameter to `GridSearchCV`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TODO\n", "- You could look at `mean_squared_error` score by providing `scoring` parameter to `GridSearchCV`." ] } ], "metadata": {} } ] }