# Implementing Machine Learning Concepts in Python Workshop, Ampersand 2017
* This is a Jupyter notebook with which you can follow the workshop examples
* We first analyse the iris dataset and will then do some basic machine learning with it
* You can modify each box and play with the code
* Useful keyboard shortcuts:

|shortcut |action |
|----------------|-------------------------------|
|shift+enter |execute code |
|mouse or enter |enter box (enter edit mode) |
|esc |escape box (enter command mode)|
|In command mode:| |
|b | insert cell below |
|a | insert cell above |
|s | save |
|x/c/v | cut/copy/paste |
|In edit mode: | |
|Tab | complete command |

## First we need to import Python packages that we are using:
* numpy for basic array analysis
* pandas to analyze the dataset
* sklearn (scikit-learn) for supervised machine learning
* matplotlib for plotting

In [None]:
import numpy as np
import pandas as pd
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
%matplotlib inline

## Loading the dataset using scikit-learn

The iris dataset is a standard example built into scikit-learn, so we can load it with a single call:

In [None]:
iris = datasets.load_iris()

It comes with various attributes:
* `data`: flower properties (`data.shape` gives the associated dimensions)
* `target`: the kind of flower (0, 1, or 2)
* `target_names`: the name of the flower (setosa, versicolor, virginica)
* `feature_names`: list of feature names (e.g. "petal length (cm)")

In [None]:
iris.data.shape

In [None]:
iris.target

In [None]:
iris.target_names

In [None]:
iris.feature_names

The following *list comprehension* creates a list of flower names from the integer codes in `iris.target`:

In [None]:
target_names = [iris.target_names[i] for i in iris.target]

In [None]:
print(target_names)
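
Because `iris.target_names` is a NumPy array, the same lookup can also be written as a single indexing operation. The sketch below uses small stand-in arrays (`names` and `codes` are hypothetical, not the real dataset) to show that both forms give the same result.

```python
import numpy as np

# Hypothetical stand-ins for iris.target_names and iris.target
names = np.array(['setosa', 'versicolor', 'virginica'])
codes = np.array([0, 0, 1, 2, 1])

# List comprehension: look up one name per integer code
via_loop = [names[i] for i in codes]

# NumPy "fancy indexing": index the name array with the whole code array
via_indexing = names[codes]

print(via_loop)
print(list(via_indexing))
```

The list comprehension is easier to read for newcomers; the indexing form returns a NumPy array and avoids the explicit loop.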

Next we create a Python dictionary for the data. Its keys are the property names and each value is an array of 150 numbers for that property (for `target`, the value is the list of 150 flower names).

In [None]:
irisdict = dict(zip(iris.feature_names, iris.data.T))
irisdict['target'] = target_names
irisdict
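
To see why the transpose `.T` is needed, here is a minimal sketch on a made-up array (`feature_names` and `data` below are hypothetical, with 3 samples and 2 features). The data array has one row per flower, while the dictionary needs one array per feature, which is exactly what the rows of the transposed array provide.

```python
import numpy as np

# Hypothetical miniature dataset: 3 samples (rows) x 2 features (columns)
feature_names = ['length', 'width']
data = np.array([[5.1, 3.5],
                 [4.9, 3.0],
                 [6.2, 2.9]])

# data.T has one row per feature, so zip pairs each name with its column
columns = dict(zip(feature_names, data.T))

for name, values in columns.items():
    print(name, values)
```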

## Analyzing the dataset using Pandas

The above dictionary can then be converted into a Pandas DataFrame, which allows for pretty printing, easy data analysis and easy plotting.

In [None]:
df = pd.DataFrame(data=irisdict,
 columns=iris.feature_names + ['target'])

In [None]:
df

In [None]:
df.shape

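The syntax below selects only the iris setosa flowers: the comparison inside the brackets produces a True/False value for every row, and only the rows marked True are kept.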

In [None]:
df[df['target']=='setosa'].shape
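
The selection works in two steps, which the following sketch makes explicit on a tiny made-up table (`toy` is hypothetical and only reuses the `target` column name).

```python
import pandas as pd

# Hypothetical miniature table with the same 'target' column name
toy = pd.DataFrame({
    'petal length (cm)': [1.4, 4.7, 1.3, 6.0],
    'target': ['setosa', 'versicolor', 'setosa', 'virginica'],
})

# Step 1: the comparison yields one True/False value per row
mask = toy['target'] == 'setosa'
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows marked True
print(toy[mask])
```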

Especially useful is the ability to split the data into groups (one for every flower type):

In [None]:
grouped = df.groupby('target')

So for each type we can query the average:

In [None]:
grouped.mean()

We can also create a bar plot of those averages:

In [None]:
grouped.mean().plot(kind='bar')
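
What `groupby` produces can be inspected on a small made-up table (`toy` below is hypothetical). Iterating over the grouped object yields one `(key, sub-DataFrame)` pair per group, and aggregations such as `mean()` are computed within each group.

```python
import pandas as pd

# Hypothetical miniature table with two flower types
toy = pd.DataFrame({
    'petal length (cm)': [1.4, 1.6, 4.0, 5.0],
    'target': ['setosa', 'setosa', 'versicolor', 'versicolor'],
})

# Iterating over a groupby object yields (key, sub-DataFrame) pairs
for key, group in toy.groupby('target'):
    print(key, len(group))

# mean() is computed separately within each group
means = toy.groupby('target').mean()
print(means)
```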

Next we would like to create scatter plots for the various properties. This can be accomplished using a mapping (dictionary) from the flower type to the colour we'd like to use:

In [None]:
colors = {'setosa': 'red', 'versicolor': 'blue', 'virginica': 'green'}

Then by iterating over the groups we can do a coloured scatter plot for every flower type (first petals and then sepals).

In [None]:
fig, ax = plt.subplots()
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='petal length (cm)', y='petal width (cm)', label=key, color=colors[key])

In [None]:
fig, ax = plt.subplots()
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='sepal length (cm)', y='sepal width (cm)', label=key, color=colors[key])

In unsupervised machine learning the flower type is not provided during training, so the algorithm has to find clusters on its own in unlabelled data such as the plot below:

In [None]:
df.plot(kind='scatter', x='petal length (cm)', y='petal width (cm)', color='black')

So the algorithm will probably be able to identify only two groups: setosa on the one hand, and versicolor/virginica combined on the other.
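
This expectation can be checked with a clustering algorithm. The sketch below uses k-means (`sklearn.cluster.KMeans`), which is not part of the workshop material and is chosen here only as one common example of unsupervised learning. It asks for two clusters on the petal measurements and then counts how the true flower types are distributed over them.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
petals = iris.data[:, 2:4]  # petal length and petal width only

# Ask for two clusters; the true labels are never shown to the algorithm
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(petals)

# Count how many flowers of each true type end up in each cluster
for t, name in enumerate(iris.target_names):
    counts = np.bincount(labels[iris.target == t], minlength=2)
    print(name, counts)
```

The cluster numbers themselves (0 or 1) are arbitrary; what matters is which flowers are grouped together.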

## Machine learning using scikit-learn

First let's define some shortcuts for `iris.data` and `iris.target`:

In [None]:
X, y = iris.data, iris.target

We will use [support vector machines](https://en.wikipedia.org/wiki/Support_vector_machine) in this example:

In [None]:
classifier = svm.SVC()

We use all elements except the last one as training examples:

In [None]:
classifier.fit(X[:-1], y[:-1])

Once the classifier has been fitted, it can be used to predict the flower type of the last element:

In [None]:
classifier.predict(X[-1:])

The prediction is correct, as a comparison with the true label shows:

In [None]:
y[-1:]

### Computing a confusion matrix

It is also possible to split the dataset automatically into training data and test cases (by default 75% for training and 25% for testing):

In [None]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
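
The effect of the default split and of `random_state` can be seen on stand-in data of the same size. In this sketch `X_demo` and `y_demo` are hypothetical arrays with 150 samples, not the iris data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 150 samples with 4 features, 3 balanced classes
X_demo = np.arange(600).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

# Default split: a quarter of the samples is held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces exactly the same shuffle
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, random_state=0)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` is what makes the results of a notebook like this one reproducible from run to run.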

Next we run the classifier with a model that is too strongly regularized (`C` too low) to see the impact on the results:

In [None]:
classifier = svm.SVC(kernel='linear', C=0.01)
y_pred = classifier.fit(X_train, y_train).predict(X_test)
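
To get a feeling for the influence of `C`, one can fit the same linear SVM with two different values and compare the accuracy on the test set. This is a minimal sketch; the value `C=1.0` is chosen here only for comparison and is not a tuned setting.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Fit the same linear SVM with a very small and a moderate value of C
scores = {}
for C in (0.01, 1.0):
    model = svm.SVC(kernel='linear', C=C).fit(X_train, y_train)
    scores[C] = model.score(X_test, y_test)  # fraction of correct predictions

print(scores)
```

A smaller `C` means stronger regularization: the model tolerates more misclassified training points in exchange for a wider margin.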

In [None]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

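The rows of the confusion matrix correspond to the true labels and the columns to the predicted labels, so correct predictions are counted on the diagonal and mismatches off the diagonal: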

In [None]:
cnf_matrix
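
How such a matrix is assembled can be shown by building one by hand. The sketch below uses made-up labels (`y_true` and `y_hat` are hypothetical) and plain NumPy instead of `confusion_matrix`, following the same convention of true class in the rows and predicted class in the columns.

```python
import numpy as np

# Hypothetical true and predicted class labels for six samples
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 1, 2, 2, 2])

# Entry [i, j] counts samples whose true class is i and predicted class is j
n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cm, (y_true, y_hat), 1)
print(cm)

# Correct predictions sit on the diagonal, so accuracy = trace / total
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```

Here one sample of class 1 was predicted as class 2, which shows up as the single off-diagonal entry in row 1, column 2.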

To get fractions instead of counts, we divide each row by its sum, i.e. by the number of test samples of that true class:

In [None]:
np.set_printoptions(precision=2)
cnf_matrix.astype('float') / cnf_matrix.sum(axis=1)[:, np.newaxis]
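
The `[:, np.newaxis]` part is what makes the division act row by row. The following sketch spells out the intermediate shapes on a made-up matrix (`counts` is hypothetical, not the result obtained above).

```python
import numpy as np

# Hypothetical confusion matrix with counts (rows: true class)
counts = np.array([[5, 0, 0],
                   [0, 3, 1],
                   [0, 2, 4]])

row_sums = counts.sum(axis=1)       # shape (3,): one total per true class
column = row_sums[:, np.newaxis]    # shape (3, 1): broadcasts across each row
fractions = counts.astype(float) / column
print(fractions)
```

After the division every row sums to 1. Note that a class with no samples in the test set would have a row sum of zero and lead to a division by zero.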

The code below defines a function that prints and plots a confusion matrix (adapted from http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html):

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # Normalize before plotting so that the colours match the printed values
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Show two decimals for fractions and plain integers for counts
    fmt = '.2f' if normalize else 'd'
    # Use white text on dark cells and black text on light cells
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, format(cm[i, j], fmt),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Using this function we can then plot both the non-normalized and normalized confusion matrix.

In [None]:
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=iris.target_names,
 title='Confusion matrix, without normalization')

In [None]:
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=iris.target_names, normalize=True,
 title='Confusion matrix, with normalization')