{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Experimentation with a simple Machine Learning algorithm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the last exercise, we used a scatterplot to identify the characteristics (or **features**) that can help us determine whether someone is more likely to be a rugby player or a ballet dancer. By exploring the data visually, we learnt that *height* and *weight* can help us with this classification.\n", "\n", "In this exercise, we are going to teach a computer to perform this classification, such that when we give it new data, it will be able to say which group the person falls into. To do this, we are going to use the **$k$-nearest neighbour** algorithm, which is one of the simplest algorithms used in Machine Learning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data we used are shown in the table below. We're going to use this as our **training** dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "|Person|Sex (0/1)|Age |Weight (Kg)|Height (cm)|\n", "|------|---------|----|-----------|-----------|\n", "| 1| 1| 24| 63| 190|\n", "| 2| 1| 20| 55| 185|\n", "| 3| 1| 25| 75| 202|\n", "| 4| 1| 30| 50| 180|\n", "| 5| 1| 19| 57| 174|\n", "| 6| 0| 31| 66| 174|\n", "| 7| 0| 31| 85| 150|\n", "| 8| 0| 28| 93| 145|\n", "| 9| 0| 29| 75| 130|\n", "| 10| 0| 24| 99| 163|\n", "| 11| 0| 30| 100| 171|\n", "| 12| 1| 25| 84| 168|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we **train** a machine learning algorithm, we do this with **labelled** data, which means that we give it a set of inputs, and tell it what outputs to expect. In this instance, we're going to give our algorithm a set of heights and corresponding weights, and tell it whether the person is a ballet dancer or rugby player. This enables the algorithm to **learn** how tall/heavy ballet dancers and rugby players are, and therefore when a new height and weight is given to it, to **predict** what that person is likely to be." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Execute the cells below to see a graph that shows weight plotted against height. This time, we've labelled the data, so that the rugby players are shown as red triangles, and ballet dancers as blue dots.\n", "\n", "The first cell imports the graph library and sets up the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Import the graph library\n", "import matplotlib.pyplot as plt\n", "\n", "#label the data for rugby players\n", "rugby_heights = [150, 145, 130, 163, 171, 168]\n", "rugby_weights = [85, 93, 75, 99, 100, 84]\n", "\n", "#label the data for ballet dancers\n", "ballet_heights = [190, 185, 202, 180, 174, 174]\n", "ballet_weights = [63, 55, 75, 50, 57, 66]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This second cell configures the graph and displays it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#show the rugby players as red triangles\n", "plt.scatter(rugby_heights, rugby_weights, color = 'red', marker = '^')\n", "\n", "#show the ballet dancers as blue dots\n", "plt.scatter(ballet_heights, ballet_weights, color = 'blue')\n", "\n", "#give the graph a title\n", "plt.title(\"Data distribution on heights and weights\")\n", "\n", "#label the axes\n", "plt.xlabel(\"Height\")\n", "plt.ylabel(\"Weight\")\n", "\n", "#display the graph\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using a machine learning algorithm to perform the classification\n", "Now let's write a program to do make this classification. The **$k$-nearest neighbour algorithm** is designed (as the name suggests), to identify which class a new input belongs to according to its nearest $k$ neighbours. $k$ is a **parameter** that we can set to a particular value. You'll experiment with this later. For now let's set $k$ to 6. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$k$-nearest neighbour works by calculating the distance between a new data point we give to it, and the existing set of data points it has in its training set. As the training set is labelled, it knows which class the nearest 6 data points belong to. If more of these are rugby players, it will classify the input as a rugby player. If more of them are ballet dancers, it will classify it as a ballet dancer.\n", "\n", "To determine how far one data point is from another, we calculate the **Euclidean distance**: $$distance = \\sqrt{(x_{point_2} - x_{point_1})^2 + (y_{point_2}-y_{point_1})^2}$$\n", "\n", "In this equation, consider $x$ and $y$ as the coordinates of the point in a graph. The algorithm uses this formula to work out the distance between the input point, and the other points in the graph." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to prepare the data for the machine learning algorithm. To do this, let's put it into a **matrix** (which you can think of as a table that the algorithm can use for referencing data). Execute the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Import the libraries we need\n", "import numpy as np\n", "from numpy import *\n", "\n", "#Take the heights and weights from the table\n", "height = [190, 185, 202, 180, 174, 174, 150, 145, 130, 163, 171, 168]\n", "weight = [63, 55, 75, 50, 57, 66, 85, 93, 75, 99, 100, 84]\n", "\n", "#Create a matrix containing the heights and weights\n", "features_matrix = np.c_[transpose(height), transpose(weight)]\n", "\n", "#Create a matrix which contains groundtruth labels for the features.\n", "#Here 0 indicates a rugby player, and 1 indicates a ballet dancer\n", "labels_matrix = transpose([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])\n", "\n", "#Map the labels to the features, so the algorithm knows which points are ballet dancers and which are rugby players.\n", "training_matrix = np.c_[features_matrix, labels_matrix]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's write the algorithm to classify the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Import the math library to do some of the calculations for us\n", "import math\n", "\n", "#Set the value of k (the number of neighbours we want it to use)\n", "k = 6\n", "\n", "#create an empty matrix to hold the distances\n", "distances = np.empty([0,2])\n", "\n", "#provide the height/weight input we want to classify\n", "input_weight = 50\n", "input_height = 170\n", "\n", "#For every data point in the training set, calculate the distance it is from the new data point\n", "for training_point in training_matrix:\n", " distance = math.sqrt((training_point[0] - input_height)**2 + (training_point[1] - input_weight)**2)\n", " #store the distances\n", " distances = np.vstack([distances, [distance, training_point[2]]])\n", "\n", "#sort the distances from shortest to longest\n", "distances = distances[distances[:,0].argsort()]\n", "\n", "#create variables to store the number of neighbours in each class\n", "ballet_dancer = 0\n", "rugby_player = 0\n", "\n", "#go through the first k items in the list\n", "for i in range(0, k):\n", " #if the item in the list has the label '1', this means it is a ballet dancer\n", " if distances[i][1] == 1:\n", " #increment the ballet dancer variable\n", " ballet_dancer = ballet_dancer + 1\n", " else:\n", " #if it isn't a ballet dancer, we know it's a rugby player\n", " rugby_player = rugby_player + 1\n", "\n", "#If the value of the ballet_dancer variable is higher, classify as 'Ballet dancer'\n", "if ballet_dancer > rugby_player:\n", " print(\"Ballet dancer\")\n", "#Otherwise, classify as 'Rugby player'\n", "else:\n", " print(\"Rugby player\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This table shows statistics from a sample of some French players from the Women's Rugby World Cup:\n", "\n", "|Name |Sex (0/1)|Weight (Kg)|Height (cm)|\n", "|------------------|---------|-----------|-----------|\n", "|Audrey Abadie | 1| 62| 166|\n", "|Monserrat Amedee | 1| 68| 174|\n", "|Manon Andre | 1| 84| 180|\n", "|Julie Annery | 1| 65| 171| \n", "|Lise Arricastre | 1| 83| 165|\n", "|Caroline Boujard | 1| 67| 173|\n", "|Lenaig Corson | 1| 85| 185|\n", "|Annaelle Dehayes | 1| 94| 175|\n", "|Caroline Drouin | 1| 71| 172|\n", "|Julie Duval | 1| 71| 161|\n", "|Audrey Forlani | 1| 82| 176|\n", "|Carla Neisen | 1| 67| 165|\n", "|Chloe Pelle | 1| 70| 162|\n", "\n", "Are the team always accurately classified when you enter their data?\n", "\n", "Enter the data for Julie Duval, and try manipulating $k$ (increasing it and decreasing it). What do you notice happening? Try this for Annaelle Deshayes and Carla Niesen. Use the scatterplot to help you understand why changing $k$ changes how people are classified." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }