{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Implementing Machine Learning Concepts in Python Workshop, Ampersand 2017\n", "* This is a Jupyter notebook with which you can follow the workshop examples\n", "* We first analyse the iris dataset and will then do some basic machine learning with it\n", "* You can modify each box and play with the code\n", "* Useful keyboard shortcuts:\n", "\n", "|shortcut |action |\n", "|----------------|-------------------------------|\n", "|shift+enter |execute code |\n", "|mouse or enter |enter box (enter edit mode) |\n", "|esc |escape box (enter command mode)|\n", "|In command mode:| |\n", "|b | insert cell below |\n", "|a | insert cell above |\n", "|s | save |\n", "|x/c/v | cut/copy/paste |\n", "|In edit mode: | |\n", "|Tab | complete command |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First we need to import Python packages that we are using:\n", "* numpy for basic array analysis\n", "* pandas to analyze the dataset\n", "* sklearn (scikit-learn) for supervised machine learning\n", "* matplotlib for plotting" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn import datasets, svm\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the dataset using scikit-learn\n", "\n", "The iris dataset is a built-in standard example in scikit-learn so we can load it easily" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "iris = datasets.load_iris()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It comes with various attributes:\n", "* `data`: flower properties (`data.shape` gives the associated 
dimensions)\n", "* `target`: the species of each flower, encoded as 0, 1, or 2\n", "* `target_names`: the names of the species (setosa, versicolor, virginica)\n", "* `feature_names`: list of feature names (e.g. \"petal length (cm)\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "iris.data.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "iris.target" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "iris.target_names" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "iris.feature_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This *list comprehension* creates a list of flower names from the array of numbers in `iris.target`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "target_names = [iris.target_names[i] for i in iris.target]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(target_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we create a Python dictionary for the data: (key, value) pairs where each key is a feature name and each value is an array of 150 numbers, one per flower. The extra `target` key holds the 150 flower names." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "irisdict = dict(zip(iris.feature_names, iris.data.T))\n", "irisdict['target'] = target_names\n", "irisdict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyzing the dataset using Pandas\n", "\n", "The above dictionary can then be converted into a Pandas DataFrame, which allows for pretty printing, easy data analysis and easy plotting."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df = pd.DataFrame(data=irisdict,\n", "                  columns=iris.feature_names + ['target'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The syntax below selects only the iris setosa flowers (`.shape` then reports how many rows match):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df[df['target'] == 'setosa'].shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Especially useful is the ability to split the data into groups (one for every flower type):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "grouped = df.groupby('target')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So for each type we can query the average of every feature:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "grouped.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and create a bar plot of those averages:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "grouped.mean().plot(kind='bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we would like to create scatter plots for the various properties. 
This can be accomplished using a mapping (dictionary) from the flower type to the colour we'd like to use:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "colors = {'setosa': 'red', 'versicolor': 'blue', 'virginica': 'green'}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then by iterating over the groups we can do a coloured scatter plot for every flower type (first petals and then sepals)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "for key, group in grouped:\n", "    group.plot(ax=ax, kind='scatter', x='petal length (cm)', y='petal width (cm)', label=key, color=colors[key])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "for key, group in grouped:\n", "    group.plot(ax=ax, kind='scatter', x='sepal length (cm)', y='sepal width (cm)', label=key, color=colors[key])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In unsupervised machine learning the flower type is not provided during training, so the algorithm has to find clusters in unlabelled data like the plot below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df.plot(kind='scatter', x='petal length (cm)', y='petal width (cm)', color='black')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So a clustering algorithm will probably only be able to identify two groups (setosa on the one hand, versicolor and virginica combined on the other)."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Machine learning using scikit-learn\n", "\n", "First let's define some shortcuts for `iris.data` and `iris.target`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X, y = iris.data, iris.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use [support vector machines](https://en.wikipedia.org/wiki/Support_vector_machine) in this example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "classifier = svm.SVC()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use all elements except the last one as training examples:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "classifier.fit(X[:-1], y[:-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the classifier has been fitted, it can be used to predict the type of the last element, which it has not seen during training:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "classifier.predict(X[-1:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The prediction was correct, as a comparison with the true label shows:"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y[-1:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing a confusion matrix\n", "\n", "It is also possible to automatically split the dataset into training data and test cases (75%/25% default split):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)\n", "X_train.shape, X_test.shape, y_train.shape, y_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the classifier, using a model that is too regularized (C too low), to see the impact on the results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "classifier = svm.SVC(kernel='linear', C=0.01)\n", "y_pred = classifier.fit(X_train, y_train).predict(X_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Compute confusion matrix\n", "cnf_matrix = confusion_matrix(y_test, y_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The confusion matrix counts correct predictions on the diagonal and misclassifications off the diagonal (rows are true labels, columns are predicted labels):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cnf_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get fractions instead, we divide each row by its sum."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "np.set_printoptions(precision=2)\n", "cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, np.newaxis]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below defines a function that plots a confusion matrix (adapted from http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plot_confusion_matrix(cm, classes,\n", "                          normalize=False,\n", "                          title='Confusion matrix',\n", "                          cmap=plt.cm.Blues):\n", "    \"\"\"\n", "    This function prints and plots the confusion matrix.\n", "    Normalization can be applied by setting `normalize=True`.\n", "    \"\"\"\n", "    # Normalize first so that the colours and the printed values agree\n", "    if normalize:\n", "        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", "        print(\"Normalized confusion matrix\")\n", "    else:\n", "        print('Confusion matrix, without normalization')\n", "\n", "    print(cm)\n", "\n", "    plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", "    plt.title(title)\n", "    plt.colorbar()\n", "    tick_marks = np.arange(len(classes))\n", "    plt.xticks(tick_marks, classes, rotation=45)\n", "    plt.yticks(tick_marks, classes)\n", "\n", "    # Two decimals for fractions, plain integers for counts\n", "    fmt = '.2f' if normalize else 'd'\n", "    thresh = cm.max() / 2.\n", "    for i in range(cm.shape[0]):\n", "        for j in range(cm.shape[1]):\n", "            plt.text(j, i, format(cm[i, j], fmt),\n", "                     horizontalalignment=\"center\",\n", "                     color=\"white\" if cm[i, j] > thresh else \"black\")\n", "\n", "    plt.tight_layout()\n", "    plt.ylabel('True label')\n", "    plt.xlabel('Predicted label')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using this function we can then plot both the non-normalized and normalized confusion matrix."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Plot non-normalized confusion matrix\n", "plt.figure()\n", "plot_confusion_matrix(cnf_matrix, classes=iris.target_names,\n", "                      title='Confusion matrix, without normalization')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Plot normalized confusion matrix\n", "plt.figure()\n", "plot_confusion_matrix(cnf_matrix, classes=iris.target_names, normalize=True,\n", "                      title='Confusion matrix, with normalization')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }