{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# K-Nearest Neighbors\n", "\n", "Implement the [KNN classification algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) for classification.\n", "\n", "### Objective\n", "\n", "This notebook aims to introduce the nearest neighbors model in the classification problem. In this case, we will first implement the model and test it using the [Iris flower data set](https://archive.ics.uci.edu/ml/datasets/iris) where the goal is to the classe of a flower based on its features. Next, we will use the Endometrium vs. Uterus cancer data set, where goal is to separate **endometrium** tumors from the **uterine** ones.\n", "\n", "### Learning objective\n", "\n", "After finished this notebook, you should be able to explain **nearest neighbors model**, including how to use the scikit-learn implementation. \n", "\n", "### Iris flower data set\n", "\n", "![flower](https://archive.ics.uci.edu/ml/assets/MLimages/Large53.jpg)\n", "\n", "The [Iris flower data set](https://archive.ics.uci.edu/ml/datasets/iris) consists of 150 data points, where each data point is a feature vector $\\boldsymbol x \\in \\mathbb{R}^4$ describing the attribute of a flower in the data set. \n", "\n", "The four dimensions represent:\n", "\n", "1. sepal length in cm \n", "2. sepal width in cm \n", "3. petal length in cm \n", "4. petal width in cm \n", "\n", "and the corresponding target $y \\in \\mathbb{Z}$ describes the class of the flower. There are three classes of flowers in this data set. They are:\n", "\n", "0. Iris Setosa\n", "1. Iris Versicolour \n", "2. Iris Virginica\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "assert sys.version_info >= (3, 6)\n", "\n", "import numpy\n", "assert numpy.__version__ >=\"1.17.3\" \n", "import numpy as np\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "import pandas\n", "assert pandas.__version__ >= \"0.25.1\"\n", "import pandas as pd\n", "\n", "import sklearn\n", "assert sklearn.__version__ >= \"0.21.3\"\n", "\n", "from sklearn import datasets\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Implementing the nearest neighbors method" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1. Loading the iris data set" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris = datasets.load_iris()\n", "\n", "print('data shape is {}'.format(iris.data.shape))\n", "print('class shape is {}'.format(iris.target.shape))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the simplicity, we will only use the first two features (i.e., sepal length and sepal width) of the data set to classify the flowers.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = iris.data[:, :2] \n", "y = iris.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2. Visualizing the classes of the flowers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a scatter plot of the dataset below. The x and y axis represent the sepal length and sepal width of the dataset, and the color of the points represent the different classes of flowers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from matplotlib.colors import ListedColormap\n", "\n", "cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])\n", "\n", "K = 3\n", "x = X[-1]\n", "\n", "fig, ax = plt.subplots(figsize=(4,4))\n", "for i, iris_class in enumerate(['Iris Setosa', 'Iris Versicolour', 'Iris Virginica']):\n", " idx = y==i\n", " ax.scatter(X[idx,0], X[idx,1], \n", " c=cmap_bold.colors[i], edgecolor='k', \n", " s=20, label=iris_class);\n", "\n", "ax.set(xlabel='sepal length (cm)', ylabel='sepal width (cm)')\n", "ax.legend();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The idea behind a k-nearest neighbor classifier is very simple: \n", "\n", "1. Given a training set $\\boldsymbol X \\in \\mathbb{R}^{N \\times D}$ and $\\boldsymbol y \\in \\mathbb{Z}^N$; where, $N$ is the number of data points in the data set, and $D$ is the dimensionality of the data.\n", "2. We predict the label of a novel point $\\boldsymbol x \\in \\mathbb{R}^{D}$ __as the label of the majority of its \"K-nearest neighbor\"__ by using distance measure (e.g., the Euclidean distance).\n", "\n", "\n", "First, we need to compute the pairwise distance matrix with the distances between the rows of two matrices." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def pairwise_distance_matrix(X, Y):\n", " \n", " \"\"\"Compute the pairwise distance between the rows of X and the rows of Y\n", "\n", " Arguments\n", " ----------\n", " X: ndarray of size (N, D)\n", " Y: ndarray of size (M, D)\n", " \n", " Returns\n", " --------\n", " distance_matrix: matrix of shape (N, M), each entry distance_matrix[i,j] is the distance between\n", " ith row of X and the jth row of Y.\n", " \"\"\"\n", " \n", " N, D = X.shape\n", " M, _ = Y.shape\n", " \n", " distance_matrix = None\n", " \n", " return distance_matrix\n", "\n", "\n", "def KNN(k, X, y, x):\n", " \n", " \"\"\"K nearest neighbors\n", " \n", " Arguments\n", " ----------\n", " \n", " k: number of nearest neighbors\n", " X: training set\n", " y: training labels\n", " x: test set\n", " \n", " Returns\n", " --------\n", " \n", " number of k-nearest neighbors for each of the classes\n", " \"\"\"\n", " \n", " x = x.reshape(1, -1) if len(x.shape) == 1 else x\n", " \n", " N, D = X.shape\n", " num_classes = None\n", " \n", " distances = None\n", "\n", " # make the predictions\n", " ypred = np.zeros(num_classes)\n", " \n", " # find the labels of the k-nearest neighbors\n", " classes = None\n", " \n", " for c in np.unique(classes):\n", " ypred[c] = len(classes[classes == c])\n", " \n", " return np.argmax(ypred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3. Visualizing the decision boundary of the kNN" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n", "y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n", "\n", "step = 0.1\n", "xx, yy = np.meshgrid(np.arange(x_min, x_max, step),\n", " np.arange(y_min, y_max, step))\n", "\n", "ypred = []\n", "for data in np.array([xx.ravel(), yy.ravel()]).T:\n", " ypred.append(KNN(K, X, y, data.reshape(1,2)))\n", "\n", "fig, ax = plt.subplots(figsize=(5,5))\n", "\n", "ax.pcolormesh(xx, yy, np.array(ypred).reshape(xx.shape), cmap=cmap_light)\n", "ax.scatter(X[:,0], X[:,1], c=y, cmap=cmap_bold, edgecolor='k', s=20);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Endometrium vs. Uterus cancer classification using scikit-learn\n", "\n", "In this data set, each observation is a tumor, and it is described by the expression of 3,000 genes. The expression of a gene is a measure of how much of that gene is present in the biological sample. Because this affects how much of the protein this gene codes for is produced, and because proteins dictacte what cells can do, gene expression gives us valuable information about the tumor. In particular, the expression of the same gene in the same individual is different in different tissues. \n", "\n", "The goal is to separate the **endometrium** tumors from the **uterine** ones." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1. Loading the endometrium cancer data set" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# read the data tumor data set: `data/endometrium_uterus_tumor.csv`\n", "endometrium_data = None" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(endometrium_data.shape)\n", "\n", "endometrium_data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.unique(endometrium_data['Tissue'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Drop the **target** feature (i.e., `Tissue`) and the unnecessary ones for the classification. \n", "2. Encode the target variable `Tissue` to be 0 for Endometrium tumeur and 1 for Uterus.\n", "\n", "_Hint_: check the method `get_dummies` of pandas." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# drop from the dataframe the columns Tissue and ID_REF\n", "X = None\n", "\n", "# encode the target variable and get its values\n", "y = None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot a scatter of two variables of the data set. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(X[:,0], X[:,1], c=y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import model_selection\n", "\n", "def stratifiedKFold(y, num_folds):\n", " \n", " kf = model_selection.StratifiedKFold(n_splits=num_folds)\n", " folds_regr = [(tr, te) for (tr, te) in kf.split(np.zeros(y.size), y)]\n", " return folds_regr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a 10-cross validation folds on the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cv_folds = stratifiedKFold(y, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cross validate 20 *k*-NN classifiers on the loaded datset using the `cross_validate` function of the module `cross_validation`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from cross_validation import cross_validate" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2. *k*-nearest neighbours classifier\n", "\n", "In scikit-learn, a k-neighbours classifier can be initialised as `knn_clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors=k)`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import neighbors\n", "from sklearn import metrics\n", "\n", "aurocs_clf = []\n", "\n", "# Create a range of values of k. \n", "k_range = range(1,40,2) \n", "\n", "for k in k_range:\n", " clf_knn = None # Create a k-neighbors classifier\n", " y_pred = cross_validate(X, y, clf_knn, cv_folds)\n", " \n", " fpr, tpr, thresholdss = metrics.roc_curve(y, y_pred[:,1])\n", " aurocs_clf.append(metrics.auc(fpr,tpr))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.plot(k_range, aurocs_clf, color='blue')\n", "plt.xlabel('Number of nearest neighbours')\n", "plt.ylabel('Cross-validated AUC')\n", "plt.title('Nearest neighbours classification - cross validated AUC.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question.** Find the best value for the parameter `n_neighbors` by finding the one that gives the maximum value of AUC." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_k = None\n", "print(k_range[best_k])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `sklearn.model_selection.GridSearchCV` to find the best number of neighbors (i.e., `n_neighbors` parameter)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import model_selection\n", "\n", "classifier = None\n", "\n", "param_grid = {'n_neighbors': k_range}\n", "clf_knn_opt = model_selection.GridSearchCV(classifier, \n", " param_grid=param_grid, \n", " cv=cv_folds,\n", " scoring='roc_auc',\n", " iid=True)\n", "clf_knn_opt.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finding the best number of neighbors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print (clf_knn_opt.best_params_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Question__: Is the optimum equal to the one computed earlier (i.e. `best_k`)?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] } ], "metadata": { "coursera": { "course_slug": "mathematics-machine-learning-pca", "graded_item_id": "kGOjp", "launcher_item_id": "Myc4L" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }