{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MATH80629A\n", "# Week \\#3 - Supervised Learning - Exercices\n", "\n", "In this practical session we will explore several classification models." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cloning into '80-629'...\n", "remote: Enumerating objects: 797, done.\u001b[K\n", "remote: Counting objects: 100% (23/23), done.\u001b[K\n", "remote: Compressing objects: 100% (12/12), done.\u001b[K\n", "remote: Total 797 (delta 12), reused 20 (delta 11), pack-reused 774\u001b[K\n", "Receiving objects: 100% (797/797), 101.99 MiB | 2.58 MiB/s, done.\n", "Resolving deltas: 100% (462/462), done.\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np \n", "import matplotlib.pyplot as plt\n", "import sklearn as sk\n", "%matplotlib inline\n", "from sklearn.datasets import make_classification\n", "\n", "# Code to obtain utils.py\n", "!rm -rf 80-629\n", "!git clone https://github.com/lcharlin/80-629/\n", "import sys\n", "sys.path += ['80-629/week3-Supervised/']\n", "\n", "\n", "from utils import generate_data, plot_predictions, plot_svc_decision_function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### First let's generate three datasets and plot the training and testing splits\n", "\n", "We will use 3 distinct binary classification datasets. Each datum is in two dimensions. \n", "\n", "Each dataset contains a training set and a test set. Each of the datasets was generated by a distinct data generative process, resulting in distinct class separations.\n", "\n", "We start by visualizing the three datsets." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "datasets_train, datasets_test = generate_data()\n", "\n", "for i, (ds_train, ds_test) in enumerate(zip(datasets_train, datasets_test)):\n", " ds_train = datasets_train[i]\n", " ds_test = datasets_test[i]\n", "\n", " X = ds_train[0]\n", " Y = ds_train[1]\n", " \n", " X_test = ds_test[0]\n", " Y_test = ds_test[1]\n", "\n", " i_c0 = (Y == 0)\n", " i_c1 = (Y == 1)\n", " \n", " #TRAIN: true and false predictions \n", " i_c0_t = (Y[i_c0]==0); i_c0_f = (Y[i_c0]==1)\n", " i_c1_t = (Y[i_c1]==1); i_c1_f = (Y[i_c1]==0)\n", " \n", " plot_predictions(i+1, X, Y, X_test, Y_test, pred_train=Y, pred_test=Y_test )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Linear Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1. Linear least squares for classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous practical session, we explored how to tune a linear regression using a least squares loss function. We saw that minimization of the least square loss function led to a simple closed-form solution (see also Week #2 Slide 31). \n", "\n", "As discussed in the capsules, a similar idea can also be usef for classification tasks, with the difference that now we additionally require a decision rule for classification. The decision rule will allow us to classify the predicted values (*real numbers*) in a class (for example *binary*). \n", "\n", "We express the decision rule as $sign(y(x))$, where $y(x) = W^\\top x$ is our model. \n", "\n", "Where we define: \n", "\n", "$$ sign(a) = \\left\\{\\begin{array}{} \n", "+1, &\\text{if } a > 0\\\\\n", "-1, &\\text{otherwise}\n", "\\end{array}\\right. $$\n", "\n", "Thus, such a classifier will return class '+1' for all the points lying on one side of the decision boundary and '-1' for the ones lying on the other side.\n", "\n", "Let's implement this simple classifier." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# least squares for classification\n", "def train_LSC(X, Y):\n", "    # Note: we transform the labels from {0,1} to {-1,+1}\n", "    # (using the Y argument rather than the global ds_train)\n", "    Y = (Y*2)-1\n", "    Y = Y.reshape(Y.size,1)\n", "\n", "    # We learn the parameters of the model\n", "    # (this is the same estimation as the one last week for OLS)\n", "    A = np.linalg.inv(np.dot(X.T, X))\n", "    B = np.dot(X.T, Y)\n", "\n", "    return np.dot(A, B)\n", "\n", "# for visualizing the decision boundary\n", "def calculate_decision_boundary(W):\n", "    x_1 = np.linspace(-10,10) # <- this gives the x1 values\n", "\n", "    # the goal is to calculate x2 given x1 and the weights\n", "    x_2 = (-W[0] - W[1]*x_1) / W[2]\n", "    return x_1, x_2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i, (ds_train, ds_test) in enumerate(zip(datasets_train, datasets_test)):\n", "    X = ds_train[0]\n", "    # Add a column of ones to account for the bias term\n", "    X_b = np.array([np.ones(len(X)), X[:,0], X[:,1]]).T\n", "    Y = ds_train[1]\n", "\n", "    X_test = ds_test[0]\n", "    # Add a column of ones to account for the bias term\n", "    X_test_b = np.array([np.ones(len(X_test)), X_test[:,0], X_test[:,1]]).T\n", "    Y_test = ds_test[1]\n", "\n", "    # 1) train weights\n", "    # W has shape (dim + 1) x 1: the bias plus one weight per feature (dim = 2)\n", "    W = train_LSC(X_b, Y)\n", "\n", "    # 2) Calculate the discriminant function and predict\n", "    # a) for test data\n", "    y_x = np.dot(W.T, X_test_b.T)\n", "    pred_test = 1*(y_x>0)[0]\n", "\n", "    # b) for training data\n", "    y_x = np.dot(W.T, X_b.T)\n", "    pred_train = 1*(y_x>0)[0]\n", "\n", "    # 3) Calculate decision boundary\n", "    line_x, line_y = calculate_decision_boundary(W)\n", "\n", "    # 4) plot predictions\n", "    plot_predictions(i+1 ,X,Y, X_test, Y_test, 
pred_train, pred_test, line_x, line_y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 1 (Lin. Classifier):** for which dataset does the model achieve the best performance? Explain in your own words why this is the case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2. SVM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we will use the sklearn library for the first time. We will train and evaluate a Support Vector Machine (SVM) model, which was briefly introduced in the capsules." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's generate some linearly separable data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's generate a new linearly separable dataset\n", "X_l,Y_l = make_classification(n_features=2, n_redundant=0, n_informative=2,\n", "                              random_state=1, n_clusters_per_class=1)\n", "fig = plt.figure(figsize = (10,5))\n", "xfit = np.linspace(-2, 2)\n", "plt.scatter(X_l[:, 0], X_l[:, 1], c=Y_l)\n", "\n", "# draw three candidate lines (slope m, intercept b), each with a band of half-width d\n", "for m, b, d in [(1.5, 1, 0.2), (10, 1, 2), (-1.5, 1.2, 0.2)]:\n", "    yfit = m * xfit + b\n", "    plt.plot(xfit, yfit, '-k')\n", "    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',\n", "                     color='#AAAAAA', alpha=0.4)\n", "plt.xlim(-2,2)\n", "plt.ylim(-0.5,2.2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that there exist many decision boundaries that can be used to separate the given dataset.\n", "\n", "SVM chooses the line that maximizes the margin to the closest point. Let's use sklearn to train an SVM model and visualize the results."
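, "\n", "\n", "Before doing so, here is a rough sketch of what maximizing the margin means. This is the standard textbook hard-margin formulation, and the notation is our own assumption rather than something taken from the capsules: with labels $t_i \\in \\{-1, +1\\}$, weights $w$ and bias $b$, the SVM solves\n", "\\begin{align}\n", " \\min_{w, b} \\; \\frac{1}{2} \\|w\\|^2 \\quad \\text{subject to} \\quad t_i (w^\\top x_i + b) \\geq 1 \\;\\; \\text{for all } i.\n", "\\end{align}\n", "The distance between the two margin lines is $2 / \\|w\\|$, so a small $\\|w\\|$ corresponds to a wide margin."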
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#import the SVM classifier\n", "from sklearn.svm import SVC\n", "model = SVC(kernel='linear', C=1E10)\n", "\n", "#let us fit the model (learn its parameters)\n", "model.fit(X_l, Y_l)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plt.figure(figsize = (10,5))\n", "ax = fig.add_subplot(111)\n", "ax.scatter(X_l[:, 0], X_l[:, 1], c = Y_l)\n", "ax = plot_svc_decision_function(model,ax)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the model has chosen the decision boundary with the maximal margin. The points that touch the (dotted) margin lines are called support vectors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now try to classify Dataset 3 from our initial set using SVM." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X, Y = datasets_train[2]\n", "X_test, Y_test = datasets_test[2]\n", "\n", "# DEFINE SVM model\n", "model = SVC(kernel='linear', C=1E10)\n", "\n", "#let's fit the model (learn its parameters)\n", "model.fit(X, Y)\n", "\n", "\n", "fig = plt.figure(figsize = (10,5))\n", "ax = fig.add_subplot(111)\n", "\n", "plt.scatter(X[:, 0], X[:, 1], c=Y)\n", "plot_svc_decision_function(model,ax)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 1 (SVM)**: complete the cell below to calculate the training and the test accuracy of the above model.\n", "\n", "Hint: check the documentation of the sklearn SVM classifier [here](https://scikit-learn.org/stable/modules/svm.html#svc). Find the method that returns the predictions of the trained model."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "acc_train = None #QUESTION - complete this line\n", "acc_test = None #QUESTION - complete this line\n", "\n", "print(\"Accuracy train: \", acc_train, \"%\")\n", "print(\"Accuracy test: \", acc_test, \"%\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that our dataset is noisy, i.e. the data points from the two classes overlap and there is no decision boundary for our SVM model that can perfectly separate all the points.\n", "\n", "While a large amount of research has been conducted on SVMs providing a solid theoretical foundation for this algorithm, here we will solely try to gain some intuition on how the algorithm works. Intuitively, the algorithm tries to optimize the trade-off between maximizing the margin around the decision boundary and minimizing the training error. We see in the above graph that the margin around the decision boundary is so tiny that it is almost not visible (no dotted lines). One of the possible reasons for this is that the trade-off between the error minimization and maximization of the margin around the decision boundary is miscalibrated. \n", "\n", "As mentioned in the previous tutorial, hyperparameter tuning is very important in machine learning and (almost) any result can be improved through hyperparameter tuning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 2 (SVM): [Advanced question that goes beyond the material covered in the capsule]** check the documentation of the sklearn SVM classifier [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). Try to find the parameter that controls the regularization strength and change it in the definition of the SVM model above to allow for a \"softer-margin\". Measure the performance of your model. 
Can you observe any improvement?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's run the SVM classifier on all 3 datasets (*Note: please complete the previous question first, otherwise this next cell will run for a long time*)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i, (ds_train, ds_test) in enumerate(zip(datasets_train, datasets_test)):\n", "    X = ds_train[0]\n", "    Y = ds_train[1]\n", "\n", "    X_test = ds_test[0]\n", "    Y_test = ds_test[1]\n", "\n", "    # 1) train the model (learn its parameters)\n", "    # we use the model from the cell above\n", "    model.fit(X, Y)\n", "\n", "    # 2) Predict\n", "    # a) for test data\n", "    pred_test = model.predict(X_test)\n", "\n", "    # b) for training data\n", "    pred_train = model.predict(X)\n", "\n", "    # 3) plot predictions\n", "    plot_predictions(i,X,Y, X_test, Y_test, pred_train, pred_test, plot_svm=model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Probabilistic Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1. Naive Bayes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Up to now we have used linear classifiers, that is, classifiers with a linear decision boundary. Here, we will implement the Naive Bayes classifier, which was introduced in the capsules, and see whether it can come up with better decision boundaries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our toy datasets, the target variables $y \\in \\{0,1 \\}$ are simple binary variables and hence can be modelled with a *Bernoulli* distribution $p(y=k|\\pi) = \\pi^k(1-\\pi)^{1-k}$. Our feature variables $x \\in \\mathbb{R}^2$ are continuous variables. 
As discussed in class, we can model continuous variables with a Gaussian distribution $\\mathcal{N}(x_j | \\mu_{jk}, \\sigma^2_{jk})$.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The fitting procedure for a Naive Bayes classifier is described below. Fill in the gaps with the missing text (double-click on this cell to edit it).\n", "\n", "1. Calculate the maximum likelihood estimate (MLE) of all the parameters of our distributions, $\\theta \\in \\{ \\sigma^2_{jk}, \\mu_{jk}, \\pi \\}$, using the training set.\n", "2. Use the estimated parameters and Bayes' theorem to make predictions. The corresponding formula is\n", "\n", "   $posterior \\propto joint \\space distribution = class\\space conditional * class\\space prior$\n", "\n", "   $P(y = k|x) \\propto P(y = k, x) = P(x | y = k) P(y = k)$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by building the Naive Bayes model for our second dataset. Afterwards, we can easily train and test our implementation on all datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X, Y = datasets_train[1] # we use the 2 circles dataset\n", "X_test, Y_test = datasets_test[1] # we use the 2 circles dataset\n", "\n", "#class indices\n", "i_c0 = (Y == 0)\n", "i_c1 = (Y == 1)\n", "\n", "X_0 = X[i_c0] #data of class 0\n", "X_1 = X[i_c1] #data of class 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Training the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by calculating the prior probability of each class. The MLE (maximum likelihood estimate) of the parameter of a Bernoulli random variable is simply the sample mean:\n", "\\begin{align}\n", " \\hat{\\pi} = \\frac{\\sum_{i=1}^n y_i}{n},\n", "\\end{align}\n", "where $n$ is the total number of samples."
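, "\n", "\n", "To see where this comes from (a short derivation sketch, assuming the labels $y_i \\in \\{0,1\\}$ are independent draws from the Bernoulli distribution above), write the log-likelihood and set its derivative with respect to $\\pi$ to zero:\n", "\\begin{align}\n", " \\log p(y_1, \\dots, y_n | \\pi) &= \\sum_{i=1}^n \\left[ y_i \\log \\pi + (1 - y_i) \\log (1 - \\pi) \\right], \\\\\n", " \\frac{\\sum_{i=1}^n y_i}{\\pi} - \\frac{n - \\sum_{i=1}^n y_i}{1 - \\pi} &= 0 \\quad \\Rightarrow \\quad \\hat{\\pi} = \\frac{1}{n} \\sum_{i=1}^n y_i.\n", "\\end{align}"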
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 2 (NB):** Fill in the missing code in the following cell to calculate the prior probabilities for each class." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def calculate_prios(X_0, X_1, X):\n", "    # TODO: calculate prior class 0\n", "    prior_k0 = 0 # TODO: fix this line\n", "\n", "    # TODO: calculate prior class 1\n", "    prior_k1 = 0 # TODO: fix this line\n", "\n", "    # let's store the priors in a dictionary\n", "    prior_dict = {\"class 0\":prior_k0 , \"class 1\":prior_k1 }\n", "\n", "    return prior_dict" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prior_dict = calculate_prios(X_0, X_1, X)\n", "print(prior_dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's calculate the MLE for the parameters of a Gaussian. Note that although the Gaussian is embedded into a larger NB model, taking the log of the likelihood separates the different terms. We can call each dimension of our input dataset X a *feature*.\n", "\n", "As explained e.g. [here](https://www.statlect.com/fundamentals-of-statistics/normal-distribution-maximum-likelihood), the maximum likelihood estimate of the **mean** is simply the empirical mean:\n", "\n", "$\\hat{\\mu} = \\frac{1}{n} \\sum_{i=1}^{n} x^{(i)}$. 
For our algorithm we need to calculate this quantity for each class ($k$) and for each feature ($j$):\n", "\\begin{align}\n", " \\hat{\\mu}_{jk} = \\frac{\\sum_{i=1}^{n} \\mathbb{1}(y^{(i)}=k) x_{j}^{(i)}}{ \\sum_{i=1}^{n} \\mathbb{1}(y^{(i)}=k)} \n", "\\end{align}\n", "(*Note: $\\mathbb{1}(y^{(i)}=k)$ evaluates to 1 when $y^{(i)}=k$ and to 0 otherwise.*)\n", "\n", "\n", "Further, the MLE of the *variance* of a Gaussian distribution is simply the empirical variance:\n", "\n", "$\\hat{\\sigma}^2 = \\frac{1}{n} \\sum_{i=1}^{n} (x^{(i)} - \\hat{\\mu})^2$\n", "\n", "(*Note: the variance also needs to be calculated per class and per feature using our training data. The code below stores its square root, the standard deviation, as returned by `np.std`.*)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 3 (NB)**: complete the implementation of the MLE estimates in the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def calculate_MLE(X_0,X_1):\n", "    # Calculate statistics (mean, standard deviation) for each feature, per class\n", "    class_summaries = dict()\n", "\n", "    feat_0_class_0 = np.mean(X_0[:,0]), np.std(X_0[:,0])\n", "    feat_1_class_0 = np.mean(X_0[:,1]), np.std(X_0[:,1])\n", "\n", "    # put them in the summaries dictionary\n", "    class_summaries['class 0'] = [feat_0_class_0, feat_1_class_0]\n", "\n", "    feat_0_class_1 = 0,0 #QUESTION <- complete this line by calculating the statistics for class 1 feature 0\n", "    feat_1_class_1 = 0,0 #QUESTION <- complete this line by calculating the statistics for class 1 feature 1\n", "\n", "    # put them in the summaries dictionary\n", "    class_summaries['class 1'] = [feat_0_class_1, feat_1_class_1]\n", "    return class_summaries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class_summaries = calculate_MLE(X_0, X_1)\n", "# let's print the result\n", "for c, class_summary in class_summaries.items():\n", "    print(c)\n", "    print(class_summary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To summarize:\n", "\n", "- Our 
*class_summaries* dictionary (printed in the previous cell) contains, for each of the two classes, a list with the empirical mean and standard deviation of each feature, calculated on our training data.\n", "\n", "- Our *prior_dict* dictionary contains the prior probabilities for each class.\n", "\n", "Congratulations, we have completed the training procedure of our Naive Bayes classifier.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Now let's make predictions using our trained model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we assumed that our features are drawn from Gaussian distributions, we need to implement the Gaussian probability density function (pdf), which returns the density of the input for a given mean and standard deviation (we will use the per-class MLE parameters estimated earlier)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$ \\mathcal{N}(x,\\mu,\\sigma) = \\frac{1}{ \\sigma \\sqrt{2 \\pi}} e^{\\left(-\\frac{{\\left(\\mu - x\\right)}^{2}}{2 \\, \\sigma^{2}}\\right)} $\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def Gaussian_PDF(x, mean, stdev):\n", "    # density of a Gaussian with the given mean and standard deviation, evaluated at x\n", "    exponent = np.exp(-((x-mean)**2 / (2 * stdev**2 )))\n", "    return (1 / (np.sqrt(2 * np.pi) * stdev)) * exponent\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's apply our model to a single test point.\n", "\n", "Remember, all we need to do is to apply Bayes' theorem using the parameters we estimated earlier." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#test point\n", "test_point = X_test[2]\n", "print('test point: ', test_point)\n", "print(f'value of feature 1: {test_point[0]}, value of feature 2: {test_point[1]}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 4 (NB):** complete the function in the cell below that calculates the (unnormalized) posterior probability for a given test point."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def calculate_probability(x_test, class_summaries, prior_dict):\n", "    # we will store our resulting probabilities in this dictionary:\n", "    resulting_probabilities = dict()\n", "\n", "    for c, class_stats in class_summaries.items():\n", "\n", "        prior_c = prior_dict[c]\n", "        resulting_probabilities[c] = prior_c # put P(y) in the dictionary containing the result\n", "\n", "        for i, feature_stat in enumerate(class_stats):\n", "            mean, stdev = feature_stat\n", "            resulting_probabilities[c] *= 0 #QUESTION <- complete this line, here the posterior for the test sample should be calculated\n", "\n", "    return resulting_probabilities\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "probs = calculate_probability(test_point, class_summaries,prior_dict)\n", "print(probs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output *'probs'* is a dictionary containing quantities proportional to the posterior probability of each class given the input data sample. Our decision rule is simply to pick the class with the highest unnormalized probability." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 5 (NB)**: why is the Naive Bayes classifier called 'naive'? State the line of the *'calculate_probability'* function that implements the 'naive' assumption." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(Bonus) Question 6 (NB)**: why don't the output probabilities of our Naive Bayes classifier sum to 1?\n", "\n", "Hint: have a look at the formula we used for the posterior calculation, defined at the beginning of the section."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's run our Naive Bayes model on all 3 datasets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def train_NB(X,Y):\n", "    #class indices\n", "    i_c0 = (Y == 0)\n", "    i_c1 = (Y == 1)\n", "\n", "    X_0 = X[i_c0] #data of class 0\n", "    X_1 = X[i_c1] #data of class 1\n", "\n", "    prior_dict = calculate_prios(X_0, X_1, Y)\n", "\n", "    #Calculate statistics for features per class\n", "    class_summaries = calculate_MLE(X_0,X_1)\n", "\n", "    return prior_dict, class_summaries\n", "\n", "\n", "def test_NB(X, class_summaries, prior_dict):\n", "    # one row of unnormalized class probabilities per test point\n", "    result = []\n", "    for x_test in X:\n", "        probs = calculate_probability(x_test, class_summaries, prior_dict)\n", "        result.append(list(probs.values()))\n", "    return np.array(result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i, (ds_train, ds_test) in enumerate(zip(datasets_train, datasets_test)):\n", "    X = ds_train[0]\n", "    Y = ds_train[1]\n", "\n", "    X_test = ds_test[0]\n", "    Y_test = ds_test[1]\n", "\n", "    # 1) train model\n", "    prior_dict, class_summaries = train_NB(X,Y)\n", "\n", "    # 2) Predict\n", "    # a) for test data\n", "    posterior = test_NB(X_test,class_summaries, prior_dict )\n", "    pred_test = np.argmax(posterior,1)\n", "\n", "    # b) for training data\n", "    posterior = test_NB(X, class_summaries, prior_dict )\n", "    pred_train = np.argmax(posterior,1)\n", "\n", "    # 3) plot predictions\n", "    plot_predictions(i,X,Y, X_test, Y_test, pred_train, pred_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 7 (NB):** for which datasets did the Naive Bayes classifier perform well? Describe in your own words why this is the case. Why did the Naive Bayes classifier work so badly for the first dataset?"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, the sklearn package offers an implementation of the Naive Bayes classifier. Let's try it out." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.naive_bayes import GaussianNB\n", "\n", "for i, (ds_train, ds_test) in enumerate(zip(datasets_train, datasets_test)):\n", "    X = ds_train[0]\n", "    Y = ds_train[1]\n", "\n", "    X_test = ds_test[0]\n", "    Y_test = ds_test[1]\n", "\n", "    # 1) train the model (learn its parameters)\n", "    model = GaussianNB()\n", "    model.fit(X, Y)\n", "\n", "    # 2) Predict\n", "    # a) for test data\n", "    pred_test = model.predict(X_test)\n", "\n", "    # b) for training data\n", "    pred_train = model.predict(X)\n", "\n", "    # 3) plot predictions\n", "    plot_predictions(i,X,Y, X_test, Y_test, pred_train, pred_test, plot_nb=model )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2. Example of Naive Bayes for text classification." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have understood how Naive Bayes works, let's walk through a text classification example taken from the exercises notebook accompanying the 'Python Data Science Handbook' [book](https://www.oreilly.com/library/view/python-data-science/9781491912126/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now apply the Naive Bayes classifier to the text classification task. As discussed in class, in text classification the features are related to word counts or frequencies within the documents to be classified.\n", "\n", "Let's download the 20 Newsgroups corpus and take a look at some target names."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "\n", "data = fetch_20newsgroups()\n", "data.target_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For simplicity, we will select just a few of these categories and download the training and testing sets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "categories = ['talk.religion.misc', 'soc.religion.christian',\n", "              'sci.space', 'comp.graphics']\n", "train = fetch_20newsgroups(subset='train', categories=categories)\n", "test = fetch_20newsgroups(subset='test', categories=categories)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is an example entry from the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(train.data[5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's convert the content of each string into a vector of numbers using the [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) vectorizer and create a pipeline that attaches it to a multinomial Naive Bayes classifier." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.pipeline import make_pipeline\n", "\n", "model = make_pipeline(TfidfVectorizer(), MultinomialNB())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now fit the model on the training data and predict the labels for the unseen test data."
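, "\n", "\n", "As a sketch of what prediction amounts to for a multinomial Naive Bayes model (standard textbook form; the symbols $\\theta_{kj}$ and $x_j$ are our own notation, not taken from the handbook): let $x_j$ be the TF-IDF weight of word $j$ in a document and $\\theta_{kj}$ the estimated probability of word $j$ under class $k$. The predicted class is\n", "\\begin{align}\n", " \\hat{y} = \\arg\\max_k \\left[ \\log P(y = k) + \\sum_j x_j \\log \\theta_{kj} \\right].\n", "\\end{align}\n", "Working with logarithms turns a product of many small probabilities into a sum, which avoids numerical underflow."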
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.fit(train.data, train.target)\n", "labels = model.predict(test.data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's evaluate the performance of our learned classifier using a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "from sklearn.metrics import confusion_matrix\n", "mat = confusion_matrix(test.target, labels)\n", "# transpose so that true labels are on the x-axis and predicted labels on the y-axis\n", "sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,\n", "            xticklabels=train.target_names, yticklabels=train.target_names)\n", "plt.xlabel('true label')\n", "plt.ylabel('predicted label');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, this simple classifier separates the space talk from the computer talk relatively well, but it tends to confuse the general religion newsgroup with the Christianity one."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The very cool thing here is that we now have the tools to determine the category for *any* string, using the ``predict()`` method of this pipeline.\n", "Here's a quick utility function that will return the prediction for a single string:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def predict_category(s, train=train, model=model):\n", " pred = model.predict([s])\n", " return train.target_names[pred[0]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predict_category('sending a payload to the ISS')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predict_category('discussing islam vs atheism')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predict_category('determining the screen resolution')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember that this is nothing more sophisticated than a simple probability model for the (weighted) frequency of each word in the string; nevertheless, the result is striking.\n", "Even a very naive algorithm, when used carefully and trained on a large set of high-dimensional data, can be surprisingly effective." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }