{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Learning scikit-learn " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## An Introduction to Machine Learning in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### at PyData Chicago 2016" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%load_ext watermark\n", "%watermark -a \"Sebastian Raschka\" -u -d -p numpy,scipy,matplotlib,sklearn,pandas,mlxtend" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "\n", "* [1 Introduction to Machine Learning](#1-Introduction-to-Machine-Learning)\n", "* [2 Linear Regression](#2-Linear-Regression)\n", " * [Loading the dataset](#Loading-the-dataset)\n", " * [Preparing the dataset](#Preparing-the-dataset)\n", " * [Fitting the model](#Fitting-the-model)\n", " * [Evaluating the model](#Evaluating-the-model)\n", "* [3 Introduction to Classification](#3-Introduction-to-Classification)\n", " * [The Iris dataset](#The-Iris-dataset)\n", " * [Class label encoding](#Class-label-encoding)\n", " * [Scikit-learn's in-build datasets](#Scikit-learn's-in-build-datasets)\n", " * [Test/train splits](#Test/train-splits)\n", " * [Logistic Regression](#Logistic-Regression)\n", " * [K-Nearest Neighbors](#K-Nearest-Neighbors)\n", " * [3 - Exercises](#3---Exercises)\n", "* [4 - Feature Preprocessing & scikit-learn Pipelines](#4---Feature-Preprocessing-&-scikit-learn-Pipelines)\n", " * [Categorical features: nominal vs ordinal](#Categorical-features:-nominal-vs-ordinal)\n", " * [Normalization](#Normalization)\n", " * [Pipelines](#Pipelines)\n", " * [4 - Exercises](#4---Exercises)\n", "* [5 - Dimensionality Reduction: Feature Selection & Extraction](#5---Dimensionality-Reduction:-Feature-Selection-&-Extraction)\n", " * [Recursive Feature Elimination](#Recursive-Feature-Elimination)\n", " * [Sequential Feature 
Selection](#Sequential-Feature-Selection)\n", " * [Principal Component Analysis](#Principal-Component-Analysis)\n", "* [6 - Model Evaluation & Hyperparameter Tuning](#6---Model-Evaluation-&-Hyperparameter-Tuning)\n", " * [Wine Dataset](#Wine-Dataset)\n", " * [Stratified K-Fold](#Stratified-K-Fold)\n", " * [Grid Search](#Grid-Search)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1 Introduction to Machine Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2 Linear Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Source: R.J. Gladstone (1905). \"A Study of the Relations of the Brain to \n", "to the Size of the Head\", Biometrika, Vol. 4, pp105-123\n", "\n", "\n", "Description: Brain weight (grams) and head size (cubic cm) for 237\n", "adults classified by gender and age group.\n", "\n", "\n", "Variables/Columns\n", "- Gender (1=Male, 2=Female)\n", "- Age Range (1=20-46, 2=46+)\n", "- Head size (cm^3)\n", "- Brain weight (grams)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.read_csv('dataset_brain.txt', \n", " encoding='utf-8', \n", " comment='#',\n", " sep='\\s+')\n", "df.tail()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.scatter(df['head-size'], df['brain-weight'])\n", "plt.xlabel('Head size (cm^3)')\n", "plt.ylabel('Brain weight (grams)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preparing the dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y = df['brain-weight'].values\n", "y.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X = df['head-size'].values\n", "X = X[:, np.newaxis]\n", "X.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=123)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.scatter(X_train, y_train, 
c='blue', marker='o')\n", "plt.scatter(X_test, y_test, c='red', marker='s')\n", "plt.xlabel('Head size (cm^3)')\n", "plt.ylabel('Brain weight (grams)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fitting the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "lr = LinearRegression()\n", "lr.fit(X_train, y_train)\n", "y_pred = lr.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluating the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "res_sum_of_squares = ((y_test - y_pred) ** 2).sum()\n", "tot_sum_of_squares = ((y_test - y_test.mean()) ** 2).sum()\n", "r2_score = 1 - (res_sum_of_squares / tot_sum_of_squares)\n", "print('R2 score: %.3f' % r2_score)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('R2 score: %.3f' % lr.score(X_test, y_test))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "lr.coef_" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "lr.intercept_" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "min_pred = X_train.min() * lr.coef_ + lr.intercept_\n", "max_pred = X_train.max() * lr.coef_ + lr.intercept_\n", "\n", "plt.scatter(X_train, y_train, c='blue', marker='o')\n", "plt.plot([X_train.min(), X_train.max()],\n", " [min_pred, max_pred],\n", " color='red',\n", " linewidth=4)\n", "plt.xlabel('Head size (cm^3)')\n", "plt.ylabel('Brain weight (grams)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
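Instead of computing R² by hand, `sklearn.metrics.r2_score` gives the same result. A small self-contained cross-check; the noisy linear relationship below is synthetic data made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# synthetic 'head size vs. brain weight'-style data (illustration only)
rng = np.random.RandomState(0)
X = rng.uniform(2700, 4800, size=100)[:, np.newaxis]
y = 0.26 * X.ravel() + 325 + rng.normal(0, 50, size=100)

lr = LinearRegression().fit(X, y)
y_pred = lr.predict(X)

# manual R^2, computed the same way as in the cell above
res_ss = ((y - y_pred) ** 2).sum()
tot_ss = ((y - y.mean()) ** 2).sum()
r2_manual = 1 - res_ss / tot_ss

assert np.isclose(r2_manual, r2_score(y, y_pred))
```

`lr.score(X, y)` is a third route to the same number, since regressors use R² as their default score.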
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3 Introduction to Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Iris dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.read_csv('dataset_iris.txt', \n", " encoding='utf-8', \n", " comment='#',\n", " sep=',')\n", "df.tail()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X = df.iloc[:, :4].values \n", "y = df['class'].values\n", "np.unique(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Class label encoding" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "l_encoder = LabelEncoder()\n", "l_encoder.fit(y)\n", "l_encoder.classes_" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y_enc = l_encoder.transform(y)\n", "np.unique(y_enc)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "np.unique(l_encoder.inverse_transform(y_enc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scikit-learn's in-build datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "print(iris['DESCR'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test/train splits" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X, y = iris.data[:, :2], iris.target\n", "# ! 
We only use 2 features for visualization purposes\n", "\n", "print('Class labels:', np.unique(y))\n", "print('Class proportions:', np.bincount(y))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=123)\n", "\n", "print('Class labels:', np.unique(y_train))\n", "print('Class proportions:', np.bincount(y_train))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=123,\n", " stratify=y)\n", "\n", "print('Class labels:', np.unique(y_train))\n", "print('Class proportions:', np.bincount(y_train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logistic Regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "lr = LogisticRegression(solver='newton-cg', \n", " multi_class='multinomial', \n", " random_state=1)\n", "\n", "lr.fit(X_train, y_train)\n", "print('Test accuracy %.2f' % lr.score(X_test, y_test))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from mlxtend.evaluate import plot_decision_regions\n", "\n", "plot_decision_regions(X=X, y=y, clf=lr, X_highlight=X_test)\n", "plt.xlabel('sepal length [cm]')\n", "plt.ylabel('sepal width [cm]');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### K-Nearest Neighbors" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.neighbors import 
KNeighborsClassifier\n", "\n", "kn = KNeighborsClassifier(n_neighbors=4)\n", "\n", "kn.fit(X_train, y_train)\n", "print('Test accuracy %.2f' % kn.score(X_test, y_test))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot_decision_regions(X=X, y=y, clf=kn, X_highlight=X_test)\n", "plt.xlabel('sepal length [cm]')\n", "plt.ylabel('sepal width [cm]');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3 - Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Which of the two models above would you prefer if you had to choose? Why?\n", "- What would be possible ways to resolve ties in KNN when `n_neighbors` is an even number?\n", "- Can you find the right spot in the scikit-learn documentation to read about how scikit-learn handles this?\n", "- Train & evaluate the Logistic Regression and KNN algorithms on the 4-dimensional iris dataset. \n", " - What performance do you observe? \n", " - Why is it different vs. using only 2 dimensions? \n", " - Would adding more dimensions help?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
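As a starting point for the last exercise, the sketch below trains both classifiers on all four iris features. Note two assumptions: newer scikit-learn versions import `train_test_split` from `sklearn.model_selection` rather than `sklearn.cross_validation`, and default solver settings are used here instead of the `newton-cg` setup above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target  # all 4 features this time

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

# fit both models and compare test accuracy
scores = {}
for clf in (LogisticRegression(max_iter=1000),
            KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X_train, y_train)
    name = clf.__class__.__name__
    scores[name] = clf.score(X_test, y_test)
    print('%s: %.2f' % (name, scores[name]))
```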
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4 - Feature Preprocessing & scikit-learn Pipelines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Categorical features: nominal vs ordinal" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame([\n", " ['green', 'M', 10.0], \n", " ['red', 'L', 13.5], \n", " ['blue', 'XL', 15.3]])\n", "\n", "df.columns = ['color', 'size', 'prize']\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_extraction import DictVectorizer\n", "\n", "dvec = DictVectorizer(sparse=False)\n", "\n", "X = dvec.fit_transform(df.transpose().to_dict().values())\n", "X" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "size_mapping = {\n", " 'XL': 3,\n", " 'L': 2,\n", " 'M': 1}\n", "\n", "df['size'] = df['size'].map(size_mapping)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X = dvec.fit_transform(df.transpose().to_dict().values())\n", "X" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalization" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.DataFrame([1., 2., 3., 4., 5., 6.], columns=['feature'])\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import MinMaxScaler\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "mmxsc = MinMaxScaler()\n", "stdsc = StandardScaler()\n", "\n", "X = df['feature'].values[:, np.newaxis]\n", "\n", "df['minmax'] = mmxsc.fit_transform(X)\n", "df['z-score'] = stdsc.fit_transform(X)\n", "\n", "df" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "### Pipelines" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\n", "from sklearn.cross_validation import train_test_split\n", "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "X, y = iris.data, iris.target\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=123,\n", " stratify=y)\n", "\n", "lr = LogisticRegression(solver='newton-cg', \n", " multi_class='multinomial', \n", " random_state=1)\n", "\n", "lr_pipe = make_pipeline(StandardScaler(), lr)\n", "\n", "lr_pipe.fit(X_train, y_train)\n", "lr_pipe.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "lr_pipe.named_steps" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "lr_pipe.named_steps['standardscaler'].transform(X[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4 - Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Why is it important that we scale test and training sets separately?\n", "- Fit a KNN classifier to the standardized Iris dataset. Do you notice difference in the predictive performance of the model compared to the non-standardized one? Why or why not?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5 - Dimensionality Reduction: Feature Selection & Extraction" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "X, y = iris.data, iris.target\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=123, stratify=y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Recursive Feature Elimination" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.feature_selection import RFECV\n", "\n", "lr = LogisticRegression()\n", "rfe = RFECV(lr, step=1, cv=5, scoring='accuracy')\n", "\n", "rfe.fit(X_train, y_train)\n", "print('Number of features:', rfe.n_features_)\n", "print('Feature ranking', rfe.ranking_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sequential Feature Selection" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from mlxtend.feature_selection import SequentialFeatureSelector as SFS\n", "from mlxtend.feature_selection import plot_sequential_feature_selection as plot_sfs\n", "\n", "\n", "sfs = SFS(lr, k_features=4, forward=True, floating=False, cv=5)\n", "\n", "sfs.fit(X_train, y_train)\n", "sfs = SFS(lr, \n", " k_features=4, \n", " forward=True, \n", " floating=False, \n", " scoring='accuracy',\n", " cv=2)\n", "\n", "sfs = sfs.fit(X, y)\n", "fig1 = plot_sfs(sfs.get_metric_dict())\n", "\n", "plt.ylim([0.8, 1])\n", "plt.title('Sequential Forward Selection (w. 
StdDev)')\n", "plt.grid()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sfs.subsets_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Principal Component Analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "sc = StandardScaler()\n", "pca = PCA(n_components=4)\n", "\n", "X_train_std = sc.fit_transform(X_train)\n", "pca.fit(X_train_std)\n", "\n", "var_exp = pca.explained_variance_ratio_\n", "cum_var_exp = np.cumsum(var_exp)\n", "\n", "idx = [i for i in range(len(var_exp))]\n", "labels = [str(i + 1) for i in idx]\n", "with plt.style.context('seaborn-whitegrid'):\n", " plt.bar(range(4), var_exp, alpha=0.5, align='center',\n", " label='individual explained variance')\n", " plt.step(range(4), cum_var_exp, where='mid',\n", " label='cumulative explained variance')\n", " plt.ylabel('Explained variance ratio')\n", " plt.xlabel('Principal components')\n", " plt.xticks(idx, labels)\n", " plt.legend(loc='center right')\n", " plt.tight_layout()\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_train_pca = pca.transform(X_train_std)\n", "\n", "for lab, col, mar in zip((0, 1, 2),\n", " ('blue', 'red', 'green'),\n", " ('o', 's', '^')):\n", " plt.scatter(X_train_pca[y_train == lab, 0],\n", " X_train_pca[y_train == lab, 1],\n", " label=lab,\n", " marker=mar,\n", " c=col)\n", "plt.xlabel('Principal Component 1')\n", "plt.ylabel('Principal Component 2')\n", "plt.legend(loc='lower right')\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
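A common follow-up to the cumulative-variance plot is picking the smallest number of components that retains a chosen fraction of the variance. A sketch on the standardized iris data, with the 95% threshold as an arbitrary example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=None)  # keep all components
pca.fit(X_std)

var_exp = pca.explained_variance_ratio_  # sorted in decreasing order
cum_var_exp = np.cumsum(var_exp)

# smallest number of components explaining at least 95% of the variance
n_keep = int(np.argmax(cum_var_exp >= 0.95)) + 1
print(var_exp)
print('components to keep:', n_keep)
```

Newer scikit-learn versions also accept a float, e.g. `PCA(n_components=0.95)`, to perform this selection directly.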
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6 - Model Evaluation & Hyperparameter Tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wine Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from mlxtend.data import wine_data\n", "\n", "X, y = wine_data()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wine dataset.\n", "\n", "Source : https://archive.ics.uci.edu/ml/datasets/Wine\n", "\n", "Number of samples : 178\n", "\n", "Class labels : {0, 1, 2}, distribution: [59, 71, 48]\n", "\n", "Dataset Attributes:\n", "\n", "1. Alcohol\n", "2. Malic acid\n", "3. Ash\n", "4. Alcalinity of ash\n", "5. Magnesium\n", "6. Total phenols\n", "7. Flavanoids\n", "8. Nonflavanoid phenols\n", "9. Proanthocyanins\n", "10. Color intensity\n", "11. Hue\n", "12. OD280/OD315 of diluted wines\n", "13. Proline\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stratified K-Fold" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "from sklearn.decomposition import PCA\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.cross_validation import StratifiedKFold\n", "from sklearn.neighbors import KNeighborsClassifier as KNN\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.3, random_state=123, stratify=y)\n", "\n", "pipe_kn = make_pipeline(StandardScaler(), \n", " PCA(n_components=1),\n", " KNN(n_neighbors=3))\n", "\n", "kfold = StratifiedKFold(y=y_train, \n", " n_folds=10,\n", " random_state=1)\n", "\n", "scores = []\n", "for k, (train, test) in enumerate(kfold):\n", " pipe_kn.fit(X_train[train], y_train[train])\n", " score = pipe_kn.score(X_train[test], y_train[test])\n", " scores.append(score)\n", " print('Fold: %s, Class dist.: %s, Acc: %.3f' % (k+1,\n", " 
np.bincount(y_train[train]), score))\n", " \n", "print('\\nCV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import cross_val_score\n", "\n", "scores = cross_val_score(estimator=pipe_kn,\n", " X=X_train,\n", " y=y_train,\n", " cv=10,\n", " n_jobs=2)\n", "\n", "print('CV accuracy scores: %s' % scores)\n", "print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grid Search" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pipe_kn.named_steps" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.grid_search import GridSearchCV\n", "\n", "\n", "param_grid = {'pca__n_components': [1, 2, 3, 4, 5, 6, None],\n", " 'kneighborsclassifier__n_neighbors': [1, 3, 5, 7, 9, 11]}\n", "\n", "gs = GridSearchCV(estimator=pipe_kn, \n", " param_grid=param_grid, \n", " scoring='accuracy', \n", " cv=10,\n", " n_jobs=2,\n", " refit=True)\n", "gs = gs.fit(X_train, y_train)\n", "print(gs.best_score_)\n", "print(gs.best_params_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "gs.score(X_test, y_test)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }