{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Learning scikit-learn " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## An Introduction to Machine Learning in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### at PyData Chicago 2016" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%load_ext watermark\n", "%watermark -a \"Sebastian Raschka\" -u -d -p numpy,scipy,matplotlib,sklearn,pandas,mlxtend" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "\n", "* [1 Introduction to Machine Learning](#1-Introduction-to-Machine-Learning)\n", "* [2 Linear Regression](#2-Linear-Regression)\n", " * [Loading the dataset](#Loading-the-dataset)\n", " * [Preparing the dataset](#Preparing-the-dataset)\n", " * [Fitting the model](#Fitting-the-model)\n", " * [Evaluating the model](#Evaluating-the-model)\n", "* [3 Introduction to Classification](#3-Introduction-to-Classification)\n", " * [The Iris dataset](#The-Iris-dataset)\n", " * [Class label encoding](#Class-label-encoding)\n", " * [Scikit-learn's in-build datasets](#Scikit-learn's-in-build-datasets)\n", " * [Test/train splits](#Test/train-splits)\n", " * [Logistic Regression](#Logistic-Regression)\n", " * [K-Nearest Neighbors](#K-Nearest-Neighbors)\n", " * [3 - Exercises](#3---Exercises)\n", "* [4 - Feature Preprocessing & scikit-learn Pipelines](#4---Feature-Preprocessing-&-scikit-learn-Pipelines)\n", " * [Categorical features: nominal vs ordinal](#Categorical-features:-nominal-vs-ordinal)\n", " * [Normalization](#Normalization)\n", " * [Pipelines](#Pipelines)\n", " * [4 - Exercises](#4---Exercises)\n", "* [5 - Dimensionality Reduction: Feature Selection & Extraction](#5---Dimensionality-Reduction:-Feature-Selection-&-Extraction)\n", " * [Recursive Feature Elimination](#Recursive-Feature-Elimination)\n", " * [Sequential Feature 
Selection](#Sequential-Feature-Selection)\n", " * [Principal Component Analysis](#Principal-Component-Analysis)\n", "* [6 - Model Evaluation & Hyperparameter Tuning](#6---Model-Evaluation-&-Hyperparameter-Tuning)\n", " * [Wine Dataset](#Wine-Dataset)\n", " * [Stratified K-Fold](#Stratified-K-Fold)\n", " * [Grid Search](#Grid-Search)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1 Introduction to Machine Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2 Linear Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Source: R.J. Gladstone (1905). \"A Study of the Relations of the Brain to \n", "to the Size of the Head\", Biometrika, Vol. 4, pp105-123\n", "\n", "\n", "Description: Brain weight (grams) and head size (cubic cm) for 237\n", "adults classified by gender and age group.\n", "\n", "\n", "Variables/Columns\n", "- Gender (1=Male, 2=Female)\n", "- Age Range (1=20-46, 2=46+)\n", "- Head size (cm^3)\n", "- Brain weight (grams)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.read_csv('dataset_brain.txt', \n", " encoding='utf-8', \n", " comment='#',\n", " sep='\\s+')\n", "df.tail()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.scatter(df['head-size'], df['brain-weight'])\n", "plt.xlabel('Head size (cm^3)')\n", "plt.ylabel('Brain weight (grams)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preparing the dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y = df['brain-weight'].values\n", "y.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X = df['head-size'].values\n", "X = X[:, np.newaxis]\n", "X.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=123)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.scatter(X_train, y_train, 
c='blue', marker='o')\n", "plt.scatter(X_test, y_test, c='red', marker='s')\n", "plt.xlabel('Head size (cm^3)')\n", "plt.ylabel('Brain weight (grams)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fitting the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "lr = LinearRegression()\n", "lr.fit(X_train, y_train)\n", "y_pred = lr.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluating the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "res_sum_of_squares = ((y_test - y_pred) ** 2).sum()\n", "tot_sum_of_squares = ((y_test - y_test.mean()) ** 2).sum()\n", "r2_score = 1 - (res_sum_of_squares / tot_sum_of_squares)\n", "print('R2 score: %.3f' % r2_score)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('R2 score: %.3f' % lr.score(X_test, y_test))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "lr.coef_" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "lr.intercept_" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "min_pred = X_train.min() * lr.coef_ + lr.intercept_\n", "max_pred = X_train.max() * lr.coef_ + lr.intercept_\n", "\n", "plt.scatter(X_train, y_train, c='blue', marker='o')\n", "plt.plot([X_train.min(), X_train.max()],\n", " [min_pred, max_pred],\n", " color='red',\n", " linewidth=4)\n", "plt.xlabel('Head size (cm^3)')\n", "plt.ylabel('Brain weight (grams)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
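Instead of computing R² by hand, `sklearn.metrics.r2_score` gives the same result. A small self-contained cross-check; the noisy linear relationship below is synthetic data made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# synthetic 'head size vs. brain weight'-style data (illustration only)
rng = np.random.RandomState(0)
X = rng.uniform(2700, 4800, size=100)[:, np.newaxis]
y = 0.26 * X.ravel() + 325 + rng.normal(0, 50, size=100)

lr = LinearRegression().fit(X, y)
y_pred = lr.predict(X)

# manual R^2, computed the same way as in the cell above
res_ss = ((y - y_pred) ** 2).sum()
tot_ss = ((y - y.mean()) ** 2).sum()
r2_manual = 1 - res_ss / tot_ss

assert np.isclose(r2_manual, r2_score(y, y_pred))
```

`lr.score(X, y)` is a third route to the same number, since regressors use R² as their default score.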
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3 Introduction to Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Iris dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.read_csv('dataset_iris.txt', \n", " encoding='utf-8', \n", " comment='#',\n", " sep=',')\n", "df.tail()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X = df.iloc[:, :4].values \n", "y = df['class'].values\n", "np.unique(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Class label encoding" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "l_encoder = LabelEncoder()\n", "l_encoder.fit(y)\n", "l_encoder.classes_" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y_enc = l_encoder.transform(y)\n", "np.unique(y_enc)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "np.unique(l_encoder.inverse_transform(y_enc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scikit-learn's in-build datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "print(iris['DESCR'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test/train splits" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X, y = iris.data[:, :2], iris.target\n", "# ! 
We only use 2 features for visualization purposes\n", "\n", "print('Class labels:', np.unique(y))\n", "print('Class proportions:', np.bincount(y))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=123)\n", "\n", "print('Class labels:', np.unique(y_train))\n", "print('Class proportions:', np.bincount(y_train))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=123,\n", " stratify=y)\n", "\n", "print('Class labels:', np.unique(y_train))\n", "print('Class proportions:', np.bincount(y_train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logistic Regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "lr = LogisticRegression(solver='newton-cg', \n", " multi_class='multinomial', \n", " random_state=1)\n", "\n", "lr.fit(X_train, y_train)\n", "print('Test accuracy %.2f' % lr.score(X_test, y_test))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from mlxtend.evaluate import plot_decision_regions\n", "\n", "plot_decision_regions(X=X, y=y, clf=lr, X_highlight=X_test)\n", "plt.xlabel('sepal length [cm]')\n", "plt.ylabel('sepal width [cm]');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### K-Nearest Neighbors" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.neighbors import 
KNeighborsClassifier\n", "\n", "kn = KNeighborsClassifier(n_neighbors=4)\n", "\n", "kn.fit(X_train, y_train)\n", "print('Test accuracy %.2f' % kn.score(X_test, y_test))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot_decision_regions(X=X, y=y, clf=kn, X_highlight=X_test)\n", "plt.xlabel('sepal length [cm]')\n", "plt.ylabel('sepal width [cm]');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3 - Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Which of the two models above would you prefer if you had to choose? Why?\n", "- What would be possible ways to resolve ties in KNN when `n_neighbors` is an even number?\n", "- Can you find the right spot in the scikit-learn documentation to read about how scikit-learn handles this?\n", "- Train & evaluate the Logistic Regression and KNN algorithms on the 4-dimensional iris dataset. \n", " - What performance do you observe? \n", " - Why is it different vs. using only 2 dimensions? \n", " - Would adding more dimensions help?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
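As a starting point for the last exercise, the sketch below trains both classifiers on all four iris features. Note two assumptions: newer scikit-learn versions import `train_test_split` from `sklearn.model_selection` rather than `sklearn.cross_validation`, and default solver settings are used here instead of the `newton-cg` setup above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target  # all 4 features this time

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

# fit both models and compare test accuracy
scores = {}
for clf in (LogisticRegression(max_iter=1000),
            KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X_train, y_train)
    name = clf.__class__.__name__
    scores[name] = clf.score(X_test, y_test)
    print('%s: %.2f' % (name, scores[name]))
```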
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4 - Feature Preprocessing & scikit-learn Pipelines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Categorical features: nominal vs ordinal" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame([\n", " ['green', 'M', 10.0], \n", " ['red', 'L', 13.5], \n", " ['blue', 'XL', 15.3]])\n", "\n", "df.columns = ['color', 'size', 'prize']\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_extraction import DictVectorizer\n", "\n", "dvec = DictVectorizer(sparse=False)\n", "\n", "X = dvec.fit_transform(df.transpose().to_dict().values())\n", "X" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "size_mapping = {\n", " 'XL': 3,\n", " 'L': 2,\n", " 'M': 1}\n", "\n", "df['size'] = df['size'].map(size_mapping)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X = dvec.fit_transform(df.transpose().to_dict().values())\n", "X" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalization" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.DataFrame([1., 2., 3., 4., 5., 6.], columns=['feature'])\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import MinMaxScaler\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "mmxsc = MinMaxScaler()\n", "stdsc = StandardScaler()\n", "\n", "X = df['feature'].values[:, np.newaxis]\n", "\n", "df['minmax'] = mmxsc.fit_transform(X)\n", "df['z-score'] = stdsc.fit_transform(X)\n", "\n", "df" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "### Pipelines" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\n", "from sklearn.cross_validation import train_test_split\n", "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "X, y = iris.data, iris.target\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=123,\n", " stratify=y)\n", "\n", "lr = LogisticRegression(solver='newton-cg', \n", " multi_class='multinomial', \n", " random_state=1)\n", "\n", "lr_pipe = make_pipeline(StandardScaler(), lr)\n", "\n", "lr_pipe.fit(X_train, y_train)\n", "lr_pipe.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "lr_pipe.named_steps" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "lr_pipe.named_steps['standardscaler'].transform(X[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4 - Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Why is it important that we scale test and training sets separately?\n", "- Fit a KNN classifier to the standardized Iris dataset. Do you notice difference in the predictive performance of the model compared to the non-standardized one? Why or why not?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5 - Dimensionality Reduction: Feature Selection & Extraction" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "X, y = iris.data, iris.target\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=123, stratify=y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Recursive Feature Elimination" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.feature_selection import RFECV\n", "\n", "lr = LogisticRegression()\n", "rfe = RFECV(lr, step=1, cv=5, scoring='accuracy')\n", "\n", "rfe.fit(X_train, y_train)\n", "print('Number of features:', rfe.n_features_)\n", "print('Feature ranking', rfe.ranking_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sequential Feature Selection" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", "from mlxtend.feature_selection import plot_sequential_feature_selection as plot_sfs\n", "\n", "\n", "sfs = SFS(lr, k_features=4, forward=True, floating=False, cv=5)\n", "\n", "sfs.fit(X_train, y_train)\n", "sfs = SFS(lr, \n", " k_features=4, \n", " forward=True, \n", " floating=False, \n", " scoring='accuracy',\n", " cv=2)\n", "\n", "sfs = sfs.fit(X, y)\n", "fig1 = plot_sfs(sfs.get_metric_dict())\n", "\n", "plt.ylim([0.8, 1])\n", "plt.title('Sequential Forward Selection (w. 
StdDev)')\n", "plt.grid()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sfs.subsets_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Principal Component Analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "sc = StandardScaler()\n", "pca = PCA(n_components=4)\n", "\n", "X_train_std = sc.fit_transform(X_train)\n", "pca.fit(X_train_std)\n", "\n", "var_exp = pca.explained_variance_ratio_\n", "cum_var_exp = np.cumsum(var_exp)\n", "\n", "idx = [i for i in range(len(var_exp))]\n", "labels = [str(i + 1) for i in idx]\n", "with plt.style.context('seaborn-whitegrid'):\n", " plt.bar(range(4), var_exp, alpha=0.5, align='center',\n", " label='individual explained variance')\n", " plt.step(range(4), cum_var_exp, where='mid',\n", " label='cumulative explained variance')\n", " plt.ylabel('Explained variance ratio')\n", " plt.xlabel('Principal components')\n", " plt.xticks(idx, labels)\n", " plt.legend(loc='center right')\n", " plt.tight_layout()\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_train_pca = pca.transform(X_train_std)\n", "\n", "for lab, col, mar in zip((0, 1, 2),\n", " ('blue', 'red', 'green'),\n", " ('o', 's', '^')):\n", " plt.scatter(X_train_pca[y_train == lab, 0],\n", " X_train_pca[y_train == lab, 1],\n", " label=lab,\n", " marker=mar,\n", " c=col)\n", "plt.xlabel('Principal Component 1')\n", "plt.ylabel('Principal Component 2')\n", "plt.legend(loc='lower right')\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
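A common follow-up to the cumulative-variance plot is picking the smallest number of components that retains a chosen fraction of the variance. A sketch on the standardized iris data, with the 95% threshold as an arbitrary example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=None)  # keep all components
pca.fit(X_std)

var_exp = pca.explained_variance_ratio_  # sorted in decreasing order
cum_var_exp = np.cumsum(var_exp)

# smallest number of components explaining at least 95% of the variance
n_keep = int(np.argmax(cum_var_exp >= 0.95)) + 1
print(var_exp)
print('components to keep:', n_keep)
```

Newer scikit-learn versions also accept a float, e.g. `PCA(n_components=0.95)`, to perform this selection directly.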
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6 - Model Evaluation & Hyperparameter Tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wine Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from mlxtend.data import wine_data\n", "\n", "X, y = wine_data()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wine dataset.\n", "\n", "Source : https://archive.ics.uci.edu/ml/datasets/Wine\n", "\n", "Number of samples : 178\n", "\n", "Class labels : {0, 1, 2}, distribution: [59, 71, 48]\n", "\n", "Dataset Attributes:\n", "\n", "1. Alcohol\n", "2. Malic acid\n", "3. Ash\n", "4. Alcalinity of ash\n", "5. Magnesium\n", "6. Total phenols\n", "7. Flavanoids\n", "8. Nonflavanoid phenols\n", "9. Proanthocyanins\n", "10. Color intensity\n", "11. Hue\n", "12. OD280/OD315 of diluted wines\n", "13. Proline\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stratified K-Fold" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "from sklearn.decomposition import PCA\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.cross_validation import StratifiedKFold\n", "from sklearn.neighbors import KNeighborsClassifier as KNN\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=123, stratify=y)\n", "\n", "pipe_kn = make_pipeline(StandardScaler(), \n", " PCA(n_components=1),\n", " KNN(n_neighbors=3))\n", "\n", "kfold = StratifiedKFold(y=y_train, \n", " n_folds=10,\n", " random_state=1)\n", "\n", "scores = []\n", "for k, (train, test) in enumerate(kfold):\n", " pipe_kn.fit(X_train[train], y_train[train])\n", " score = pipe_kn.score(X_train[test], y_train[test])\n", " scores.append(score)\n", " print('Fold: %s, Class dist.: %s, Acc: %.3f' % (k+1,\n", " 
np.bincount(y_train[train]), score))\n", " \n", "print('\\nCV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import cross_val_score\n", "\n", "scores = cross_val_score(estimator=pipe_kn,\n", " X=X_train,\n", " y=y_train,\n", " cv=10,\n", " n_jobs=2)\n", "\n", "print('CV accuracy scores: %s' % scores)\n", "print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grid Search" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pipe_kn.named_steps" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.grid_search import GridSearchCV\n", "\n", "\n", "param_grid = {'pca__n_components': [1, 2, 3, 4, 5, 6, None],\n", " 'kneighborsclassifier__n_neighbors': [1, 3, 5, 7, 9, 11]}\n", "\n", "gs = GridSearchCV(estimator=pipe_kn, \n", " param_grid=param_grid, \n", " scoring='accuracy', \n", " cv=10,\n", " n_jobs=2,\n", " refit=True)\n", "gs = gs.fit(X_train, y_train)\n", "print(gs.best_score_)\n", "print(gs.best_params_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gs.score(X_test, y_test)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }