{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Andreas Mueller, Kyle Kastner, Sebastian Raschka \n", "last updated: 2016-06-26 \n", "\n", "CPython 3.5.1\n", "IPython 4.2.0\n", "\n", "numpy 1.11.0\n", "scipy 0.17.1\n", "matplotlib 1.5.1\n", "scikit-learn 0.17.1\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# SciPy 2016 Scikit-learn Tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Case Study - Face Recognition with Eigenfaces" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we'll take a look at a simple facial recognition example.\n", "This uses a dataset available within scikit-learn consisting of a\n", "subset of the [Labeled Faces in the Wild](http://vis-www.cs.umass.edu/lfw/)\n", "data. Note that this is a relatively large download (~200MB) so it may\n", "take a while to execute." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import datasets\n", "\n", "lfw_people = datasets.fetch_lfw_people(min_faces_per_person=70, \n", " resize=0.4,\n", " data_home='datasets')\n", "lfw_people.data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you're on a unix-based system such as linux or Mac OSX, these shell commands\n", "can be used to see the downloaded dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!ls datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!du -sh datasets/lfw_home" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once again, let's visualize these faces to see what we're working with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig = plt.figure(figsize=(8, 6))\n", "# plot several images\n", "for i in range(15):\n", " ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])\n", " ax.imshow(lfw_people.images[i], cmap=plt.cm.bone)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "\n", "plt.figure(figsize=(10, 2))\n", "\n", "unique_targets = np.unique(lfw_people.target)\n", "counts = [(lfw_people.target == i).sum() for i in unique_targets]\n", "\n", "plt.xticks(unique_targets, lfw_people.target_names[unique_targets])\n", "locs, labels = plt.xticks()\n", "plt.setp(labels, rotation=45, size=14)\n", "plt.bar(unique_targets, counts);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One thing to note is that these faces have already been centered and scaled\n", "to a common size. This is an important preprocessing piece for facial\n", "recognition, and is a process that can require a large collection of training\n", "data. This can be done in scikit-learn, but the challenge is gathering a\n", "sufficient amount of training data for the algorithm to work.\n", "\n", "Fortunately, centering and scaling has already been applied to this dataset. 
\n", "\n", "(One good resource is [OpenCV](http://opencv.willowgarage.com/wiki/FaceRecognition), the\n", "*Open Computer Vision Library*.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we'll perform a Support Vector classification of the images; like always, we start with a typical train-test split on the images:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " lfw_people.data, \n", " lfw_people.target,\n", " test_size=0.25,\n", " stratify=lfw_people.target,\n", " random_state=0)\n", "\n", "print('Train size:', X_train.shape)\n", "print('Test size:', X_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing: Principal Component Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1850 dimensions are a lot for fitting an SVM. We can use PCA to reduce these 1850 features to a manageable\n", "size, while maintaining most of the information in the dataset. Here it is useful to use a variant\n", "of PCA called ``RandomizedPCA``, which is an approximation of PCA that can be much faster for large\n", "datasets. We saw this method in the previous notebook, and will use it again here:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import decomposition\n", "\n", "pca = decomposition.RandomizedPCA(n_components=150, \n", " whiten=True,\n", " random_state=1999)\n", "pca.fit(X_train)\n", "X_train_pca = pca.transform(X_train)\n", "X_test_pca = pca.transform(X_test)\n", "\n", "print('Train size after PCA:', X_train_pca.shape)\n", "print('Test size after PCA:', X_test_pca.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These projected components correspond to factors in a linear combination of\n", "component images such that the combination approaches the original face. In general, PCA can be a powerful technique for preprocessing that *can* improve classification performance substantially in certain applications." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fitting a Support Vector Machine" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll perform support-vector-machine classification on this reduced dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import svm\n", "\n", "clf = svm.SVC(C=5.,\n", " gamma=0.001,\n", " random_state=1)\n", "clf.fit(X_train_pca, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can evaluate how well this classification did. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Fitting a Support Vector Machine" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll perform support-vector-machine classification on this reduced dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn import svm\n", "\n", "clf = svm.SVC(C=5.,\n", " gamma=0.001,\n", " random_state=1)\n", "clf.fit(X_train_pca, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can evaluate how well this classification did. First, we might plot a\n", "few of the test cases with the labels learned from the training set:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig = plt.figure(figsize=(8, 6))\n", "# show each test image with its predicted name; incorrect predictions are titled in red\n", "for i in range(15):\n", " ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])\n", " ax.imshow(X_test[i].reshape((50, 37)), cmap=plt.cm.bone)\n", " y_pred = clf.predict(X_test_pca[i].reshape(1, -1))[0]\n", " color = 'black' if y_pred == y_test[i] else 'red'\n", " ax.set_title(lfw_people.target_names[y_pred], fontsize='small', color=color)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The classifier is correct on an impressive number of images given the simplicity\n", "of its learning model! Using a support vector classifier on just 150 features derived from\n", "the pixel-level data, the algorithm correctly identifies a large fraction of the\n", "people in the images.\n", "\n", "Again, we can\n", "quantify this effectiveness using ``clf.score``:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print('Accuracy: %.2f%%' % (clf.score(X_test_pca, y_test)*100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final Note" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we have used PCA \"eigenfaces\" as a preprocessing step for facial recognition.\n", "We chose this approach because PCA is a broadly applicable technique that can\n", "be useful for a wide array of data types. For more details on the eigenfaces approach, see the original paper by [Turk and Pentland, Eigenfaces for Recognition](http://www.face-rec.org/algorithms/PCA/jcn.pdf). Research in the field of facial recognition has moved far beyond this paper, and has shown that specific feature extraction methods can be more effective. However, eigenfaces remains a canonical example of machine learning \"in the wild\": a simple method that gives good results." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }