{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Implementing Machine Learning Concepts in Python Workshop, Ampersand 2017\n",
    "* This is a Jupyter notebook with which you can follow the workshop examples\n",
    "* We first analyse the iris dataset and will then do some basic machine learning with it\n",
    "* You can modify each box and play with the code\n",
    "* Useful keyboard shortcuts:\n",
    "\n",
    "|shortcut        |action                         |\n",
    "|----------------|-------------------------------|\n",
    "|shift+enter     |execute code                   |\n",
    "|mouse or enter  |enter box (enter edit mode)    |\n",
    "|esc             |escape box (enter command mode)|\n",
    "|In command mode:|                               |\n",
    "|b               | insert cell below             |\n",
    "|a               | insert cell above             |\n",
    "|s               | save                          |\n",
    "|x/c/v           | cut/copy/paste                |\n",
    "|In edit mode:   |                               |\n",
    "|Tab             | complete command              |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## First we need to import Python packages that we are using:\n",
    "* numpy for basic array analysis\n",
    "* pandas to analyze the dataset\n",
    "* sklearn (scikit-learn) for supervised machine learning\n",
    "* matplotlib for plotting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn import datasets, svm\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import confusion_matrix\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the dataset using scikit-learn\n",
    "\n",
    "The iris dataset is a built-in standard example in scikit-learn so we can load it easily"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "iris = datasets.load_iris()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It comes with various attributes:\n",
    "* `data`: flower properties (`data.shape` gives the associated dimensions)\n",
    "* `target`: the kind of flower (0, 1, or 2)\n",
    "* `target_names`: the name of the flower (setosa, versicolor, virginica)\n",
    "* `feature_names`: list of feature names (e.g. \"petal length (cm)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "iris.data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "iris.target"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "iris.target_names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "iris.feature_names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This *list comprehension* creates a list of flower names from the list of numbers in `iris.target`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "target_names = [iris.target_names[i] for i in iris.target]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(target_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we create a Python dictionary for the data: (key, value) pairs where the keys are the property names and the values are arrays of 150 numbers (flower names for `target`) for every property."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "irisdict = dict(zip(iris.feature_names, iris.data.T))\n",
    "irisdict['target'] = target_names\n",
    "irisdict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analyzing the dataset using Pandas\n",
    "\n",
    "The above dictionary can then be convered into a Pandas DataFrame, which allows for pretty printing, easy data analysis and easy plotting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df = pd.DataFrame(data=irisdict,\n",
    "                  columns=iris.feature_names + ['target'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The below syntax allows to select only iris setosa flowers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df[df['target']=='setosa'].shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Especially useful is the ability to classify the data into groups (one for every flower type)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "grouped = df.groupby('target')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So for each type we can query the average:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "grouped.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and create a bar plot of those averages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "grouped.mean().plot(kind='bar')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we would like to create scatter plots for the various properties. This can be accomplished using a mapping (dictionary) from the flower type to the colour we'd like to use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "colors = {'setosa': 'red', 'versicolor': 'blue', 'virginica': 'green'}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then by iterating over the groups we can do a coloured scatter plot for every flower type (first petals and then sepals)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "for key, group in grouped:\n",
    "    group.plot(ax=ax, kind='scatter', x='petal length (cm)', y='petal width (cm)', label=key, color=colors[key])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "for key, group in grouped:\n",
    "    group.plot(ax=ax, kind='scatter', x='sepal length (cm)', y='sepal width (cm)', label=key, color=colors[key])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For unsupervised machine learning the type is not told during training so the algorithm will have to create clusters in structures as below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df.plot(kind='scatter', x='petal length (cm)', y='petal width (cm)', color='black')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So the algorithm will probably only be able to identify two types (senosa and versicolor/virginica)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Machine learning using scikit-learn\n",
    "\n",
    "First let's define some shortcuts for `iris.data` and `iris.target`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X, y = iris.data, iris.target"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use [support vector machines](https://en.wikipedia.org/wiki/Support_vector_machine) in this example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "classifier = svm.SVC()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And use all elements except the last one as examples to learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "classifier.fit(X[:-1], y[:-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the classifier has been fitted it can be used to predict with the last element as input"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "classifier.predict(X[-1:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The prediction was correct!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y[-1:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Computing a confusion matrix\n",
    "\n",
    "It is also possible to automatically split the dataset into training data and test cases (75%/25% default split)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)\n",
    "X_train.shape, X_test.shape, y_train.shape, y_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run classifier, using a model that is too regularized (C too low) to see the impact on the results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "classifier = svm.SVC(kernel='linear', C=0.01)\n",
    "y_pred = classifier.fit(X_train, y_train).predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Compute confusion matrix\n",
    "cnf_matrix = confusion_matrix(y_test, y_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The confusion matrix counts all matches on the diagonal, mismatches off-diagonal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "cnf_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get fractions instead, we need to scale row-wise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "np.set_printoptions(precision=2)\n",
    "cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, np.newaxis]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The below code is a function that pretty-plots a confusion matrix (source: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def plot_confusion_matrix(cm, classes,\n",
    "                          normalize=False,\n",
    "                          title='Confusion matrix',\n",
    "                          cmap=plt.cm.Blues):\n",
    "    \"\"\"\n",
    "    This function prints and plots the confusion matrix.\n",
    "    Normalization can be applied by setting `normalize=True`.\n",
    "    \"\"\"\n",
    "    plt.imshow(cm, interpolation='nearest', cmap=cmap)\n",
    "    plt.title(title)\n",
    "    plt.colorbar()\n",
    "    tick_marks = np.arange(len(classes))\n",
    "    plt.xticks(tick_marks, classes, rotation=45)\n",
    "    plt.yticks(tick_marks, classes)\n",
    "\n",
    "    if normalize:\n",
    "        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n",
    "        print(\"Normalized confusion matrix\")\n",
    "    else:\n",
    "        print('Confusion matrix, without normalization')\n",
    "\n",
    "    print(cm)\n",
    "\n",
    "    thresh = cm.max() / 2.\n",
    "    for i in range(cm.shape[0]):\n",
    "        for j in range(cm.shape[1]):\n",
    "            plt.text(j, i, cm[i, j],\n",
    "                 horizontalalignment=\"center\",\n",
    "                 color=\"white\" if cm[i, j] > thresh else \"black\")\n",
    "\n",
    "    plt.tight_layout()\n",
    "    plt.ylabel('True label')\n",
    "    plt.xlabel('Predicted label')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using this function we can then plot both the non-normalized and normalized confusion matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Plot non-normalized confusion matrix\n",
    "plt.figure()\n",
    "plot_confusion_matrix(cnf_matrix, classes=iris.target_names,\n",
    "                      title='Confusion matrix, without normalization')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Plot normalized confusion matrix\n",
    "plt.figure()\n",
    "plot_confusion_matrix(cnf_matrix, classes=iris.target_names, normalize=True,\n",
    "                      title='Confusion matrix, with normalization')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}