{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Evaluating machine learning models\n", "\n", "Machine learning involves automatically learning how to compute functions from examples. There are several ways that this process can go wrong, including:\n", "\n", "0. Overfitting the training examples,\n", "1. Optimizing for the wrong objective,\n", "2. Starting with the wrong features, and\n", "3. Data drift (which we'll treat in [another notebook](05-data-drift.ipynb)).\n", "\n", "Let's look at some of these problems now." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "data = pd.read_parquet(\"data/training.parquet\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Overfitting\n", "\n", "The first concern we'd like to address is *overfitting*, in which we have a model whose performance on training examples is materially different from its performance in production. We'll see that in action with a simple example. First, let's choose some of our data as training examples:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "overfit_training = data.sample(1000)\n", "overfit_training.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll \"train\" a very simple \"model\" from these examples: we'll just memorize hashes of every example so we can look up whether a given example is legitimate or not. We'll program defensively, too: if we don't find an example in either set, we'll call it legitimate if it has an even number of characters." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "class OverfittingSpamModel(object):\n", " def __init__(self):\n", " self.legit = set()\n", " self.spam = set()\n", " \n", " def fit(self, df):\n", " for tup in df.itertuples():\n", " if tup.label == \"legitimate\":\n", " self.legit.add(hash(tup.text))\n", " else:\n", " self.spam.add(hash(tup.text))\n", " \n", " def predict(self, text):\n", " h = hash(text)\n", " if h in self.legit:\n", " return \"legitimate\"\n", " elif h in self.spam:\n", " return \"spam\"\n", " else:\n", " return (len(text) % 2 == 0) and \"legitimate\" or \"spam\"\n", "\n", "osm = OverfittingSpamModel()\n", "osm.fit(overfit_training)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can try this out with some of our training examples to see how well it works:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for row in overfit_training.sample(10).itertuples():\n", " print(\"text is '%s...': actual label is %s; predicted label is %s\" % (row.text[0:20], row.label, osm.predict(row.text)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This model appears to work really well! 
We can quantify that by computing its accuracy over the whole training set:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def model_accuracy(osm, df):\n", "    correct = 0\n", "    incorrect = 0\n", "    for row in df.itertuples():\n", "        if row.label == osm.predict(row.text):\n", "            correct += 1\n", "        else:\n", "            incorrect += 1\n", "\n", "    # avoid dividing by zero on an empty data frame\n", "    if correct + incorrect == 0:\n", "        return 100\n", "\n", "    return (float(correct) / float(correct + incorrect) * 100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_accuracy(osm, overfit_training)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our model is enormously successful! It has one hundred percent accuracy. We probably expected this result, but it's always nice when things work out as you expected they would. Let's see how well our model has generalized to data it _hasn't_ seen by testing it on our full dataset, which includes 39,000 more examples that weren't in our training sample." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_accuracy(osm, data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uh oh! It appears that our model is not much better than a coin toss once it's running on data it hasn't already seen. If we had put this model in production, our application surely wouldn't have performed well.\n", "\n", "We want a way to identify this problem before we put a model into production, and you've essentially seen it already: when we train a model, we don't use all of the data we have available to us. Instead, we divide our examples into distinct _training_ and _test_ sets, usually with about 70% of the examples in the former and 30% in the latter.\n", "The training algorithm only considers the examples in the training set. After training our model, we can evaluate its performance on both the training set (which it saw during training) and the test set (which it didn't). If the performance is materially different on the two sets, we know that we've overfit the data when training our model.\n", "\n", "Next up, we'll deal with the question of what metrics we should use to evaluate our performance. We used accuracy above, but is it always the best option?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Evaluation metrics and types of error\n", "\n", "Our training data set is _balanced_ between classes -- there are equal numbers of legitimate and spam documents. But data in the real world are typically not balanced, and are often wildly unbalanced. 
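We can confirm that our own data set really is evenly split with a quick count:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# count how many documents carry each label; the two counts should be equal\n", "data.label.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "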
Real-world class balances, for example, can be dramatically skewed:\n", "\n", "- The worldwide incidence of Rh-negative blood types is approximately 6 percent;\n", "- Between one and three percent of actual consumer payment transactions are fraudulent; and\n", "- A rare disease may have an incidence rate on the order of one in ten thousand per year.\n", "\n", "In cases like these, it would be possible to develop an accurate model that nonetheless produced no meaningful results; for example:\n", "\n", "- A blood type tester that always returned \"Rh-positive\" would be accurate roughly 94% of the time on a sufficiently diverse population;\n", "- A fraud detector that always returned \"not fraudulent\" would be accurate between 97% and 99% of the time -- until, that is, fraudsters determined that their charges would likely go through, increasing the rate of fraudulent charges; and\n", "- A technique to screen for a very rare disease could be quite accurate by simply never identifying disease.\n", "\n", "In many applications, we're not only interested in correctly identifying members of one class -- we're interested in correctly identifying members of both classes. We can capture this behavior by using better metrics than accuracy.\n", "\n", "To learn about these metrics, let's start with an unbalanced data set, in which 90% of the messages are spam." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "legit_sample = data[data.label == 'legitimate'].sample(2000)\n", "spam_sample = data[data.label == 'spam'].sample(18000)\n", "unbalanced = pd.concat([legit_sample, spam_sample])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To avoid overfitting, we'll split the unbalanced data set into training and test sets, using functionality from scikit-learn." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "unbalanced_train, unbalanced_test = train_test_split(unbalanced, test_size=0.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll now create a simple model that should work pretty well for spam messages but not necessarily as well for legitimate ones. 
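Before we build it, it's worth a quick sanity check that both splits kept roughly the 90/10 spam-to-legitimate mix (note that `train_test_split` only stratifies if you pass it a `stratify` argument, so the proportions are only approximately preserved):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# fraction of each label in the training and test splits; both should be close to 0.9 spam\n", "print(unbalanced_train.label.value_counts(normalize=True))\n", "print(unbalanced_test.label.value_counts(normalize=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, the model itself: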
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from collections import defaultdict\n", "import re\n", " \n", "class SensitiveSpamModel(object):\n", " \n", " def __init__(self):\n", " self.legit = set()\n", " self.spam = set()\n", " \n", " def fit(self, df):\n", " \"\"\" Train a model based on the most frequent unique \n", " words in each class of documents \"\"\"\n", " legit_words = defaultdict(lambda: 0)\n", " spam_words = defaultdict(lambda: 0)\n", " \n", " for tup in df.itertuples():\n", " target = spam_words\n", " if tup.label == \"legitimate\":\n", " target = legit_words\n", " for word in re.split(r\"\\W+\", tup.text):\n", " if len(word) > 0:\n", " target[word.lower()] += 1\n", " \n", " # remove words common to both classes\n", " for word in set(legit_words.keys()).intersection(set(spam_words.keys())):\n", " del legit_words[word]\n", " del spam_words[word]\n", " \n", " top_legit_words = sorted(legit_words.items(), key=lambda kv: kv[1], reverse=True)\n", " top_spam_words = sorted(spam_words.items(), key=lambda kv: kv[1], reverse=True)\n", " \n", " # store ten times as many words from the spam set\n", " self.legit = set([t[0] for t in top_legit_words[:100]])\n", " self.spam = set([t[0] for t in top_spam_words[:1000]])\n", " \n", " def predict(self, text):\n", " legit_score = 0\n", " spam_score = 0\n", " \n", " for word in re.split(r\"\\W+\", text):\n", " w = word.lower()\n", " if word in self.legit:\n", " legit_score = legit_score + 1\n", " elif word in self.spam:\n", " spam_score = spam_score + 1\n", " \n", " # bias results towards spam in the event of ties\n", " return (legit_score > spam_score) and \"legitimate\" or \"spam\"\n", "\n", "ssm = SensitiveSpamModel()\n", "ssm.fit(unbalanced_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the accuracy on our training sample." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_accuracy(ssm, unbalanced_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make sure we've not overfit our training sample, let's check the accuracy on the test sample." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_accuracy(ssm, unbalanced_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's not quite as good as the results on our training sample, but it's still pretty decent (that is, it's better than just always returning \"spam\" would be given the balance of the classes). \n", "\n", "However, we get a different picture if we look at our model's performance on the balanced data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_accuracy(ssm, data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the accuracy is even worse if we look at a sample where the label balance is reversed (i.e., only 10% of documents are spam):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "legit_sample = data[data.label == 'legitimate'].sample(900)\n", "spam_sample = data[data.label == 'spam'].sample(100)\n", "legit_biased = pd.DataFrame.append(legit_sample, spam_sample)\n", "model_accuracy(ssm, legit_biased)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'd like to understand the performance of our model with some metric that captures not only the overall accuracy but the accuracy for positive cases and the accuracy for negative cases. That is, if we assume that our goal is to identify spam documents, we care about:\n", "\n", "- *true positives*, which are spam documents that our model predicts as spam;\n", "- *true negatives*, which are legitimate documents that our model predicts as legitimate;\n", "- *false positives*, which are legitimate documents that our model predicts as spam; and\n", "- *false negatives*, which are spam documents that our model predicts as legitimate\n", "\n", "The proportions between these quantities can provide interesting metrics. For example, the ratio of true positives to actual positives (that is, true positives + false negatives) is called *recall*, which indicates the percentage of spam documents we've selected. The ratio of true positives to all predicted positives (that is, true positives + false positives) is called *precision*, which indicates the percentage of predicted spam documents that are actually spam. Ideally, a good classifier would have both high precision and high recall, but in some applications either precision or recall is more important.\n", "\n", "We can visualize the overall performance of a classifier with a *confusion matrix*, which plots actual labels in rows and predicted labels in columns. The confusion matrix thus puts correct predictions along one diagonal and various kinds of incorrect predictions elsewhere. Let's see one in action for our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mlworkflows import plot\n", "\n", "def true_and_predicted(df, model):\n", " return (df.label.values, [model.predict(txt) for txt in df.text.values])\n", "\n", "df, chart = plot.binary_confusion_matrix(*true_and_predicted(unbalanced, ssm))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "chart" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It should be clear that accuracy isn't always the best metric. We're not merely interested in accuracy, but also in false positive rate and false negative rate. Furthermore, in some cases, we care more about false negatives, and in some cases, we care more about false positives: for example, in a disease screening application, false positives may be costly but false negatives may be deadly. 
The [Fβ-score](https://en.wikipedia.org/wiki/F1_score) is a family of measures that allows us to ascribe relative importance to false positives and false negatives while evaluating a classifier.\n", "\n", "If we want to weight false positives and false negatives equally, the formula looks like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def f1_score(actual_values, predicted_values, positive_label=\"spam\"):\n", "    actual_positives = float(len([v for v in actual_values if v == positive_label]))\n", "    true_positives = float(len([t for t in zip(actual_values, predicted_values) if t[0] == t[1] == positive_label]))\n", "    predicted_positives = float(len([v for v in predicted_values if v == positive_label]))\n", "\n", "    # note: this simple version assumes at least one actual and one predicted positive\n", "    precision = true_positives / predicted_positives\n", "    recall = true_positives / actual_positives\n", "    return 2 * ((precision * recall) / (precision + recall))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "f1_score(*true_and_predicted(unbalanced, ssm))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can generalize this formula to emphasize false negatives or false positives. The Fβ-score weights recall β times as heavily as precision -- thus, if β is greater than one, false negatives will be weighted more heavily, but if β is less than one, false positives will be weighted more heavily." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def fbeta_score(actual_values, predicted_values, beta, positive_label=\"spam\"):\n", "    actual_positives = float(len([v for v in actual_values if v == positive_label]))\n", "    true_positives = float(len([t for t in zip(actual_values, predicted_values) if t[0] == t[1] == positive_label]))\n", "    predicted_positives = float(len([v for v in predicted_values if v == positive_label]))\n", "\n", "    precision = true_positives / predicted_positives\n", "    recall = true_positives / actual_positives\n", "    return (1 + (beta * beta)) * ((precision * recall) / (((beta * beta) * precision) + recall))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fbeta_score(*true_and_predicted(unbalanced, ssm), 1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fbeta_score(*true_and_predicted(unbalanced, ssm), 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fbeta_score(*true_and_predicted(unbalanced, ssm), 0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've included these implementations to show you how simple the formula is, but for real applications, you'd want to use the [scikit-learn implementation of F1-score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) or [Fβ-score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html#sklearn.metrics.fbeta_score). The main takeaway from this section should be that the metrics you use to evaluate your models are important, and they can affect both how you train your models and how you monitor them in production." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Problems with feature selection\n", "\n", "Sometimes, the features we choose wind up not being that informative. 
In [our previous notebook](01-vectors-and-visualization.ipynb), we saw an example of a very simple feature that exposed some structure in our data. However, we won't always be so lucky as to land on a feature engineering approach that makes it so easy to separate kinds of examples. Let's say that, instead of hashing k-shingles, we wanted to count the number of words of each length. We'll turn our training set into a set of labeled count vectors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def count_frequency(txt, MAX_LEN=7):\n", "    \"\"\" returns an array of the number of words\n", "        with a given length for each length up to MAX_LEN \"\"\"\n", "    result = np.zeros(MAX_LEN + 1)\n", "    for word in re.split(r\"\\W+\", txt):\n", "        idx = min(len(word), MAX_LEN)\n", "        result[idx] += 1.0\n", "    return result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, we'll use PCA to project our points into two dimensions and then plot them to see if there's any structure in our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot.plot_pca(data, \"text\", func=count_frequency, x=\"x\", y=\"y\", color=\"label\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Zoom and pan to examine that plot more closely, if you haven't already. Surprisingly, there is still some structure to our data, but the boundary between the classes is large and fuzzy. Just to be sure, we could try using t-SNE to visualize these data as well." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(0xc0ffee)\n", "plot.plot_tsne(data, \"text\", func=count_frequency, tsne_sample=800, x=\"x\", y=\"y\", color=\"label\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The structure in the t-SNE plot isn't particularly obvious, either. However, some potential feature engineering approaches are even worse. For example, we could encode merely whether or not we've seen a word of a given length, so that `\"Bob ate chips\"` would be encoded as `[0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, ...]` -- i.e., it would have a `1.0` in positions 3 and 5 and a `0.0` everywhere else." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def count_existence(txt, MAX_LEN=15):\n", "    \"\"\" returns an array of 1.0 if the input had a word with the given length\n", "        and 0.0 if the input did not, for each length up to MAX_LEN \"\"\"\n", "    result = np.zeros(MAX_LEN + 1)\n", "    for word in re.split(r\"\\W+\", txt):\n", "        idx = min(len(word), MAX_LEN)\n", "        result[idx] = 1.0\n", "    return result\n", "\n", "plot.plot_pca(data, \"text\", func=count_existence, x=\"x\", y=\"y\", color=\"label\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, we can see if t-SNE gets us a more useful visualization of the structure of our data." 
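, "\n", "\n", "First, though, it's worth a quick sanity check of the encoding itself: running the example from above through `count_existence` should produce a `1.0` at indices 3 and 5 only." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# \"Bob\" and \"ate\" have three letters and \"chips\" has five, so we expect\n", "# a 1.0 at indices 3 and 5 and a 0.0 everywhere else\n", "count_existence(\"Bob ate chips\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's see what t-SNE makes of these features..." 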
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(0xc0ffee)\n", "plot.plot_tsne(data, \"text\", func=count_existence, tsne_sample=800, x=\"x\", y=\"y\", color=\"label\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "...but, also as before, it's not obvious how we'd separate these classes of document based on the features we have.\n", "\n", "This kind of encoding can actually work quite well for some learning problems, but it should be clear that it doesn't work that well for this one. In a situation like this, when visualizing your feature vectors shows a noisy or indistinguishable map, you may need to go back to the drawing board and choose a different feature engineering approach.\n", "\n", "In the next part of the workshop, you get to choose between two different feature engineering approaches. You can proceed either to [this notebook, which shows an approach based on summaries of various properties of the documents](03-feature-engineering-summaries.ipynb) or [this notebook, which shows an approach based on which terms appear most frequently in documents](03-feature-engineering-tfidf.ipynb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python (myenv)", "language": "python", "name": "myenv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }