{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Classification" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Think Bayes, Second Edition\n", "\n", "Copyright 2020 Allen B. Downey\n", "\n", "License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "# If we're running on Colab, install empiricaldist\n", "# https://pypi.org/project/empiricaldist/\n", "\n", "import sys\n", "IN_COLAB = 'google.colab' in sys.modules\n", "\n", "if IN_COLAB:\n", " !pip install empiricaldist" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Get utils.py\n", "\n", "from os.path import basename, exists\n", "\n", "def download(url):\n", " filename = basename(url)\n", " if not exists(filename):\n", " from urllib.request import urlretrieve\n", " local, _ = urlretrieve(url, filename)\n", " print('Downloaded ' + local)\n", " \n", "download('https://github.com/AllenDowney/ThinkBayes2/raw/master/soln/utils.py')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [], "source": [ "from utils import set_pyplot_params\n", "\n", "set_pyplot_params()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [], "source": [ "from utils import Or70, Pu50, Gr30\n", "\n", "color_list3 = [Or70, Pu50, Gr30]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "from cycler import cycler\n", "\n", "marker_cycle = cycler(marker=['s', 'o', '^'])\n", "color_cycle = cycler(color=color_list3)\n", "line_cycle = cycler(linestyle=['-', '--', ':'])\n", "\n", "plt.rcParams['axes.prop_cycle'] = (color_cycle + \n", " marker_cycle + \n", " line_cycle)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Classification might be the most well-known application of Bayesian methods, made famous in the 1990s as the basis of the first generation of [spam filters](https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering).\n", "\n", "In this chapter, I'll demonstrate Bayesian classification using data collected and made available by Dr. Kristen Gorman at the Palmer Long-Term Ecological Research Station in Antarctica (see Gorman, Williams, and Fraser, [\"Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus *Pygoscelis*)\"](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0090081), March 2014).\n", "We'll use this data to classify penguins by species." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "The following cell downloads the raw data." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Load the data files from \n", "# https://github.com/allisonhorst/palmerpenguins\n", "# With gratitude to Allison Horst (@allison_horst)\n", "\n", "download('https://github.com/allisonhorst/palmerpenguins/raw/main/inst/extdata/penguins_raw.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Penguin Data\n", "\n", "I'll use Pandas to load the data into a `DataFrame`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('penguins_raw.csv').dropna(subset=['Body Mass (g)'])\n", "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset contains one row for each penguin and one column for each variable." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [] }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "For convenience, I'll create a new column called `Species2` that contains a shorter version of the species names." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [] }, "outputs": [], "source": [ "def shorten(species):\n", " return species.split()[0]\n", "\n", "df['Species2'] = df['Species'].apply(shorten)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Three species of penguins are represented in the dataset: Adélie, Chinstrap and Gentoo." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "These species are shown in this illustration (by Allison Horst, available under the [CC-BY](https://creativecommons.org/licenses/by/2.0/) license):\n", "\n", "\"Drawing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The measurements we'll use are:\n", "\n", "* Body Mass in grams (g).\n", "\n", "* Flipper Length in millimeters (mm).\n", "\n", "* Culmen Length in millimeters. \n", "\n", "* Culmen Depth in millimeters.\n", "\n", "If you are not familiar with the word \"culmen\", it refers to the [top margin of the beak](https://en.wikipedia.org/wiki/Bird_measurement#Culmen)." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "The culmen is shown in the following illustration (also by Allison Horst):\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These measurements will be most useful for classification if there are substantial differences between species and small variation within species. To see whether that is true, and to what degree, I'll plot cumulative distribution functions (CDFs) of each measurement for each species. \n", "\n", "The following function takes the `DataFrame` and a column name.\n", "It returns a dictionary that maps from each species name to a `Cdf` of the values in the column named `colname`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def make_cdf_map(df, colname, by='Species2'):\n", " \"\"\"Make a CDF for each species.\"\"\"\n", " cdf_map = {}\n", " grouped = df.groupby(by)[colname]\n", " for species, group in grouped:\n", " cdf_map[species] = Cdf.from_seq(group, name=species)\n", " return cdf_map" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "The following function plots a `Cdf` of the values in the given column for each species: " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [] }, "outputs": [], "source": [ "from empiricaldist import Cdf\n", "from utils import decorate\n", "\n", "def plot_cdfs(df, colname, by='Species2'):\n", " \"\"\"Make a CDF for each species.\n", " \n", " df: DataFrame\n", " colname: string column name\n", " by: string column name\n", "\n", " returns: dictionary from species name to Cdf\n", " \"\"\"\n", " cdf_map = make_cdf_map(df, colname, by)\n", " \n", " for species, cdf in cdf_map.items():\n", " cdf.plot(label=species, marker='')\n", " \n", " decorate(xlabel=colname,\n", " ylabel='CDF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's what the distributions look like for culmen length." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "tags": [] }, "outputs": [], "source": [ "colname = 'Culmen Length (mm)'\n", "plot_cdfs(df, colname)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like we can use culmen length to identify Adélie penguins, but the distributions for the other two species almost entirely overlap.\n", "\n", "Here are the distributions for flipper length." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "tags": [] }, "outputs": [], "source": [ "colname = 'Flipper Length (mm)'\n", "plot_cdfs(df, colname)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using flipper length, we can distinguish Gentoo penguins from the other two species. So with just these two features, it seems like we should be able to classify penguins with some accuracy.\n", "\n", "All of these CDFs show the sigmoid shape characteristic of the normal distribution; I will take advantage of that observation in the next section." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Here are the distributions for culmen depth." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "tags": [] }, "outputs": [], "source": [ "colname = 'Culmen Depth (mm)'\n", "plot_cdfs(df, colname)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "And here are the distributions of body mass." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "tags": [] }, "outputs": [], "source": [ "colname = 'Body Mass (g)'\n", "plot_cdfs(df, colname)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Culmen depth and body mass distinguish Gentoo penguins from the other two species, but these features might not add a lot of additional information, beyond what we get from flipper length and culmen length." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normal Models\n", "\n", "Let's use these features to classify penguins. We'll proceed in the usual Bayesian way:\n", "\n", "1. Define a prior distribution with the three possible species and a prior probability for each,\n", "\n", "2. Compute the likelihood of the data for each hypothetical species, and then\n", "\n", "3. Compute the posterior probability of each hypothesis.\n", "\n", "To compute the likelihood of the data under each hypothesis, I'll use the data to estimate the parameters of a normal distribution for each species.\n", "\n", "The following function takes a `DataFrame` and a column name; it returns a dictionary that maps from each species name to a `norm` object.\n", "\n", "`norm` is defined in SciPy; it represents a normal distribution with a given mean and standard deviation." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import norm\n", "\n", "def make_norm_map(df, colname, by='Species2'):\n", " \"\"\"Make a map from species to norm object.\"\"\"\n", " norm_map = {}\n", " grouped = df.groupby(by)[colname]\n", " for species, group in grouped:\n", " mean = group.mean()\n", " std = group.std()\n", " norm_map[species] = norm(mean, std)\n", " return norm_map" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, here's the dictionary of `norm` objects for flipper length:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "flipper_map = make_norm_map(df, 'Flipper Length (mm)')\n", "flipper_map.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now suppose we measure a penguin and find that its flipper is 193 cm. What is the probability of that measurement under each hypothesis?\n", "\n", "The `norm` object provides `pdf`, which computes the probability density function (PDF) of the normal distribution. We can use it to compute the likelihood of the observed data in a given distribution." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "data = 193\n", "flipper_map['Adelie'].pdf(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a probability density, so we can't interpret it as a probability. But it is proportional to the likelihood of the data, so we can use it to update the prior.\n", "\n", "Here's how we compute the likelihood of the data in each distribution." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "hypos = flipper_map.keys()\n", "likelihood = [flipper_map[hypo].pdf(data) for hypo in hypos]\n", "likelihood" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we're ready to do the update." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Update\n", "\n", "As usual I'll use a `Pmf` to represent the prior distribution. For simplicity, let's assume that the three species are equally likely." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "from empiricaldist import Pmf\n", "\n", "prior = Pmf(1/3, hypos)\n", "prior" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can do the update in the usual way." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "posterior = prior * likelihood\n", "posterior.normalize()\n", "posterior" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A penguin with a 193 mm flipper is unlikely to be a Gentoo, but might be either an Adélie or Chinstrap (assuming that the three species were equally likely before the measurement). \n", "\n", "The following function encapsulates the steps we just ran.\n", "It takes a `Pmf` representing the prior distribution, the observed data, and a map from each hypothesis to the distribution of the feature." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def update_penguin(prior, data, norm_map):\n", " \"\"\"Update hypothetical species.\"\"\"\n", " hypos = prior.qs\n", " likelihood = [norm_map[hypo].pdf(data) for hypo in hypos]\n", " posterior = prior * likelihood\n", " posterior.normalize()\n", " return posterior" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The return value is the posterior distribution.\n", "\n", "Here's the previous example again, using `update_penguin`:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "posterior1 = update_penguin(prior, 193, flipper_map)\n", "posterior1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we saw in the CDFs, flipper length does not distinguish strongly between Adélie and Chinstrap penguins.\n", "\n", "But culmen length *can* make this distinction, so let's use it to do a second round of classification.\n", "First we estimate distributions of culmen length for each species like this:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "culmen_map = make_norm_map(df, 'Culmen Length (mm)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now suppose we see a penguin with culmen length 48 mm.\n", "We can use this data to update the prior." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "posterior2 = update_penguin(prior, 48, culmen_map)\n", "posterior2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A penguin with culmen length 48 mm is about equally likely to be a Chinstrap or Gentoo.\n", "\n", "Using one feature at a time, we can often rule out one species or another, but we generally can't identify species with confidence.\n", "We can do better using multiple features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Naive Bayesian Classification\n", "\n", "To make it easier to do multiple updates, I'll use the following function, which takes a prior `Pmf`, a sequence of measurements and a corresponding sequence of dictionaries containing estimated distributions." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def update_naive(prior, data_seq, norm_maps):\n", " \"\"\"Naive Bayesian classifier\n", " \n", " prior: Pmf\n", " data_seq: sequence of measurements\n", " norm_maps: sequence of maps from species to distribution\n", " \n", " returns: Pmf representing the posterior distribution\n", " \"\"\"\n", " posterior = prior.copy()\n", " for data, norm_map in zip(data_seq, norm_maps):\n", " posterior = update_penguin(posterior, data, norm_map)\n", " return posterior" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It performs a series of updates, using one variable at a time, and returns the posterior `Pmf`.\n", "\n", "To test it, I'll use the same features we looked at in the previous section: culmen length and flipper length." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "colnames = ['Flipper Length (mm)', 'Culmen Length (mm)']\n", "norm_maps = [flipper_map, culmen_map]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now suppose we find a penguin with flipper length 193 mm and culmen length 48.\n", "Here's the update:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "data_seq = 193, 48\n", "posterior = update_naive(prior, data_seq, norm_maps)\n", "posterior" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is almost certain to be a Chinstrap." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "posterior.max_prob()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can loop through the dataset and classify each penguin with these two features." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "df['Classification'] = \"None\"\n", "\n", "for i, row in df.iterrows():\n", " data_seq = row[colnames]\n", " posterior = update_naive(prior, data_seq, norm_maps)\n", " df.loc[i, 'Classification'] = posterior.max_prob()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This loop adds a column called `Classification` to the `DataFrame`; it contains the species with the maximum posterior probability for each penguin.\n", "\n", "So let's see how many we got right." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "tags": [] }, "outputs": [], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "valid = df['Classification'].notna()\n", "valid.sum()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "same = df['Species2'] == df['Classification']\n", "same.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 344 penguins in the dataset, but two of them are missing measurements, so we have 342 valid cases.\n", "Of those, 324 are classified correctly, which is almost 95%." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "same.sum() / valid.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function encapsulates these steps." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "def accuracy(df):\n", " \"\"\"Compute the accuracy of classification.\"\"\"\n", " valid = df['Classification'].notna()\n", " same = df['Species2'] == df['Classification']\n", " return same.sum() / valid.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The classifier we used in this section is called \"naive\" because it ignores correlations between the features. To see why that matters, I'll make a less naive classifier: one that takes into account the joint distribution of the features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joint Distributions\n", "\n", "I'll start by making a scatter plot of the data." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "def scatterplot(df, var1, var2):\n", " \"\"\"Make a scatter plot.\"\"\"\n", " grouped = df.groupby('Species2')\n", " for species, group in grouped:\n", " plt.plot(group[var1], group[var2],\n", " label=species, ls=\"None\", alpha=0.3)\n", " \n", " decorate(xlabel=var1, ylabel=var2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a scatter plot of culmen length and flipper length for the three species." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "var1 = 'Flipper Length (mm)'\n", "var2 = 'Culmen Length (mm)'\n", "scatterplot(df, var1, var2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Within each species, the joint distribution of these measurements forms an oval shape, at least roughly. The orientation of the ovals is along a diagonal, which indicates that there is a correlation between culmen length and flipper length.\n", "\n", "If we ignore these correlations, we are assuming that the features are independent. To see what that looks like, I'll make a joint distribution for each species assuming independence.\n", "\n", "The following function makes a discrete `Pmf` that approximates a normal distribution." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "def make_pmf_norm(dist, sigmas=3, n=101):\n", " \"\"\"Make a Pmf approximation to a normal distribution.\"\"\"\n", " mean, std = dist.mean(), dist.std()\n", " low = mean - sigmas * std\n", " high = mean + sigmas * std\n", " qs = np.linspace(low, high, n)\n", " ps = dist.pdf(qs)\n", " pmf = Pmf(ps, qs)\n", " pmf.normalize()\n", " return pmf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use it, along with `make_joint`, to make a joint distribution of culmen length and flipper length for each species." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "from utils import make_joint\n", "\n", "joint_map = {}\n", "for species in hypos:\n", " pmf1 = make_pmf_norm(flipper_map[species])\n", " pmf2 = make_pmf_norm(culmen_map[species])\n", " joint_map[species] = make_joint(pmf1, pmf2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following figure compares a scatter plot of the data to the contours of the joint distributions, assuming independence." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "from utils import plot_contour\n", "\n", "scatterplot(df, var1, var2)\n", "for species in hypos:\n", " plot_contour(joint_map[species], alpha=0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The contours of a joint normal distribution form ellipses.\n", "In this example, because the features are uncorrelated, the ellipses are aligned with the axes.\n", "But they are not well aligned with the data.\n", "\n", "We can make a better model of the data, and use it to compute better likelihoods, with a multivariate normal distribution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multivariate Normal Distribution\n", "\n", "As we have seen, a univariate normal distribution is characterized by its mean and standard deviation.\n", "\n", "A multivariate normal distribution is characterized by the means of the features and the **covariance matrix**, which contains **variances**, which quantify the spread of the features, and the **covariances**, which quantify the relationships among them.\n", "\n", "We can use the data to estimate the means and covariance matrix for the population of penguins.\n", "First I'll select the columns we want." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "features = df[[var1, var2]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And compute the means." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "mean = features.mean()\n", "mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also compute the covariance matrix:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "cov = features.cov()\n", "cov" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is a `DataFrame` with one row and one column for each feature. The elements on the diagonal are the variances; the elements off the diagonal are covariances.\n", "\n", "By themselves, variances and covariances are hard to interpret. We can use them to compute standard deviations and correlation coefficients, which are easier to interpret, but the details of that calculation are not important right now.\n", "\n", "Instead, we'll pass the covariance matrix to `multivariate_normal`, which is a SciPy function that creates an object that represents a multivariate normal distribution.\n", "\n", "As arguments it takes a sequence of means and a covariance matrix: " ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import multivariate_normal\n", "\n", "multinorm = multivariate_normal(mean, cov)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function makes a `multivariate_normal` object for each species." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "def make_multinorm_map(df, colnames):\n", " \"\"\"Make a map from each species to a multivariate normal.\"\"\"\n", " multinorm_map = {}\n", " grouped = df.groupby('Species2')\n", " for species, group in grouped:\n", " features = group[colnames]\n", " mean = features.mean()\n", " cov = features.cov()\n", " multinorm_map[species] = multivariate_normal(mean, cov)\n", " return multinorm_map" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's how we make this map for the first two features, flipper length and culmen length." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "multinorm_map = make_multinorm_map(df, [var1, var2])" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Visualizing a Multivariate Normal Distribution\n", "\n", "This section uses some NumPy magic to generate contour plots for multivariate normal distributions. If that's interesting for you, great! Otherwise, feel free to skip to the results. In the next section we'll do the actual classification, which turns out to be easier than the visualization.\n", "\n", "I'll start by making a contour map for the distribution of features among Adélie penguins. \n", "Here are the univariate distributions for the two features we'll use and the multivariate distribution we just computed." ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "tags": [] }, "outputs": [], "source": [ "norm1 = flipper_map['Adelie']\n", "norm2 = culmen_map['Adelie']\n", "multinorm = multinorm_map['Adelie']" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "I'll make a discrete `Pmf` approximation for each of the univariate distributions." ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "tags": [] }, "outputs": [], "source": [ "pmf1 = make_pmf_norm(norm1)\n", "pmf2 = make_pmf_norm(norm2)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "And use them to make a mesh grid that contains all pairs of values." ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "tags": [] }, "outputs": [], "source": [ "X, Y = np.meshgrid(pmf1.qs, pmf2.qs)\n", "X.shape" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "The mesh is represented by two arrays: the first contains the quantities from `pmf1` along the `x` axis; the second contains the quantities from `pmf2` along the `y` axis.\n", "\n", "In order to evaluate the multivariate distribution for each pair of values, we have to \"stack\" the arrays." ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "tags": [] }, "outputs": [], "source": [ "pos = np.dstack((X, Y))\n", "pos.shape" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "The result is a 3-D array that you can think of as a 2-D array of pairs. When we pass this array to `multinorm.pdf`, it evaluates the probability density function of the distribution for each pair of values." ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "tags": [] }, "outputs": [], "source": [ "densities = multinorm.pdf(pos)\n", "densities.shape" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "The result is an array of probability densities. If we put them in a `DataFrame` and normalize them, the result is a discrete approximation of the joint distribution of the two features." ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "tags": [] }, "outputs": [], "source": [ "from utils import normalize\n", "\n", "joint = pd.DataFrame(densities, columns=pmf1.qs, index=pmf2.qs)\n", "normalize(joint)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Here's what the result looks like." ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "tags": [] }, "outputs": [], "source": [ "plot_contour(joint)\n", "decorate(xlabel=var1,\n", " ylabel=var2)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "The contours of a multivariate normal distribution are still ellipses, but now that we have taken into account the correlation between the features, the ellipses are no longer aligned with the axes.\n", "\n", "The following function encapsulate the steps we just did." ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "tags": [] }, "outputs": [], "source": [ "def make_joint(norm1, norm2, multinorm):\n", " \"\"\"Make a joint distribution.\n", " \n", " norm1: `norm` object representing the distribution of the first feature\n", " norm2: `norm` object representing the distribution of the second feature\n", " multinorm: `multivariate_normal` object representing the joint distribution\n", " \"\"\"\n", " pmf1 = make_pmf_norm(norm1)\n", " pmf2 = make_pmf_norm(norm2)\n", " X, Y = np.meshgrid(pmf1.qs, pmf2.qs)\n", " pos = np.dstack((X, Y))\n", " densities = multinorm.pdf(pos)\n", " joint = pd.DataFrame(densities, columns=pmf1.qs, index=pmf2.qs)\n", " return joint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following figure shows a scatter plot of the data along with the contours of the multivariate normal distribution for each species." ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "tags": [] }, "outputs": [], "source": [ "scatterplot(df, var1, var2)\n", "\n", "for species in hypos:\n", " norm1 = flipper_map[species]\n", " norm2 = culmen_map[species]\n", " multinorm = multinorm_map[species]\n", " joint = make_joint(norm1, norm2, multinorm)\n", " plot_contour(joint, alpha=0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the multivariate normal distribution takes into account the correlations between features, it is a better model for the data. And there is less overlap in the contours of the three distributions, which suggests that they should yield better classifications." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Less Naive Classifier\n", "\n", "In a previous section we used `update_penguin` to update a prior `Pmf` based on observed data and a collection of `norm` objects that model the distribution of observations under each hypothesis. Here it is again:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "def update_penguin(prior, data, norm_map):\n", " \"\"\"Update hypothetical species.\"\"\"\n", " hypos = prior.qs\n", " likelihood = [norm_map[hypo].pdf(data) for hypo in hypos]\n", " posterior = prior * likelihood\n", " posterior.normalize()\n", " return posterior" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Last time we used this function, the values in `norm_map` were `norm` objects, but it also works if they are `multivariate_normal` objects.\n", "\n", "We can use it to classify a penguin with flipper length 193 and culmen length 48:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "data = 193, 48\n", "update_penguin(prior, data, multinorm_map)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A penguin with those measurements is almost certainly a Chinstrap.\n", "\n", "Now let's see if this classifier does any better than the naive Bayesian classifier.\n", "I'll apply it to each penguin in the dataset:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "df['Classification'] = \"None\"\n", "\n", "for i, row in df.iterrows():\n", " data = row[colnames]\n", " posterior = update_penguin(prior, data, multinorm_map)\n", " df.loc[i, 'Classification'] = posterior.idxmax()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And compute the accuracy:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "accuracy(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It turns out to be only a little better: the accuracy is 95.3%, compared to 94.7% for the naive Bayesian classifier." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "In this chapter, we implemented a naive Bayesian classifier, which is \"naive\" in the sense that it assumes that the features it uses for classification are independent.\n", "\n", "To see how bad that assumption is, we also implemented a classifier that uses a multivariate normal distribution to model the joint distribution of the features, which includes their dependencies.\n", "\n", "In this example, the non-naive classifier is only marginally better.\n", "In one way, that's disappointing. After all that work, it would have been nice to see a bigger improvement.\n", "But in another way, it's good news. In general, a naive Bayesian classifier is easier to implement and requires less computation. If it works nearly as well as a more complex algorithm, it might be a good choice for practical purposes.\n", "\n", "Speaking of practical purposes, you might have noticed that this example isn't very useful. If we want to identify the species of a penguin, there are easier ways than measuring its flippers and beak.\n", "\n", "But there *are* scientific uses for this type of classification. One of them is the subject of the research paper we started with: [sexual dimorphism](https://en.wikipedia.org/wiki/Sexual_dimorphism), that is, differences in shape between male and female animals.\n", "\n", "In some species, like angler fish, males and females look very different. In other species, like mockingbirds, they are difficult to tell apart.\n", "And dimorphism is worth studying because it provides insight into social behavior, sexual selection, and evolution. \n", "\n", "One way to quantify the degree of sexual dimorphism in a species is to use a classification algorithm like the one in this chapter. If you can find a set of features that makes it possible to classify individuals by sex with high accuracy, that's evidence of high dimorphism.\n", "\n", "As an exercise, you can use the dataset from this chapter to classify penguins by sex and see which of the three species is the most dimorphic." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** In my example I used culmen length and flipper length because they seemed to provide the most power to distinguish the three species. But maybe we can do better by using more features.\n", "\n", "Make a naive Bayesian classifier that uses all four measurements in the dataset: culmen length and depth, flipper length, and body mass.\n", "Is it more accurate than the model with two features?" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise:** One of the reasons the penguin dataset was collected was to quantify sexual dimorphism in different penguin species, that is, physical differences between male and female penguins. One way to quantify dimorphism is to use measurements to classify penguins by sex. If a species is more dimorphic, we expect to be able to classify them more accurately.\n", "\n", "As an exercise, pick a species and use a Bayesian classifier (naive or not) to classify the penguins by sex. Which features are most useful? What accuracy can you achieve?" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "Note: One Gentoo penguin has an invalid value for `Sex`. I used the following code to select one species and filter out invalid data." ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "tags": [] }, "outputs": [], "source": [ "gentoo = (df['Species2'] == 'Gentoo')\n", "subset = df[gentoo].copy()" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "tags": [] }, "outputs": [], "source": [ "subset['Sex'].value_counts()" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "tags": [] }, "outputs": [], "source": [ "valid = df['Sex'] != '.'\n", "valid.sum()" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "tags": [] }, "outputs": [], "source": [ "subset = df[valid & gentoo].copy()" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "OK, you can finish it off from here." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [], "source": [ "# Solution goes here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 4 }