{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Contingency tables using sparse coordinate matrices\n", "\n", "> I like sparseness. There's something about that minimalist feel that can make\n", "> something have an immediate impact and make it unique. I'll probably always\n", "> work with that formula. I just don't know how.\n", ">\n", "> — Britt Daniel, lead singer of *Spoon*.\n", "\n", "Many real-world matrices are *sparse*, which means that most of their values are zero.\n", "\n", "Using numpy arrays to manipulate sparse matrices wastes a lot of time and energy multiplying many, many values by 0.\n", "Instead, we can use SciPy's `sparse` module to solve these efficiently, examining only non-zero values.\n", "In addition to helping solve these \"canonical\" sparse matrix problems, `sparse` can be used for problems that are not obviously related to sparse matrices.\n", "\n", "One such problem is the comparison of image segmentations.\n", "(Review chapter 3 for a definition of segmentation.)\n", "\n", "The code sample motivating this chapter uses sparse matrices twice: First, we\n", "use code nominated by Andreas Mueller to compute a *contingency matrix* that\n", "counts the correspondence of labels between two segmentations. Then, with\n", "suggestions from Jaime Fernández del Río and Warren Weckesser, we use that\n", "contingency matrix to compute the *variation of information*, which measures\n", "the differences between segmentations." 
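, "\n", "As a reminder (this is the standard definition, which the chapter works up to), the *variation of information* between two segmentations, viewed as random variables $X$ and $Y$ with joint distribution $P_{xy}$ (the normalized contingency matrix), is the sum of two conditional entropies:\n", "\n", "$$VI(X; Y) = H(X|Y) + H(Y|X)$$\n", "\n", "where, for example, $H(Y|X) = -\\sum_{x, y}{P_{xy} \\log{(P_{xy} / P_x)}}$.\n", "These two terms appear as `hygx` and `hxgy` in the function below."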
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def variation_of_information(x, y):\n", "    # compute contingency matrix, aka joint probability matrix\n", "    n = x.size\n", "    Pxy = sparse.coo_matrix((np.full(n, 1/n), (x.ravel(), y.ravel())),\n", "                            dtype=float).tocsr()\n", "\n", "    # compute marginal probabilities, converting to 1D array\n", "    px = np.ravel(Pxy.sum(axis=1))\n", "    py = np.ravel(Pxy.sum(axis=0))\n", "\n", "    # use sparse matrix linear algebra to compute VI\n", "    # first, compute the inverse diagonal matrices\n", "    Px_inv = sparse.diags(invert_nonzero(px))\n", "    Py_inv = sparse.diags(invert_nonzero(py))\n", "\n", "    # then, compute the entropies\n", "    hygx = px @ xlog1x(Px_inv @ Pxy).sum(axis=1)\n", "    hxgy = xlog1x(Pxy @ Py_inv).sum(axis=0) @ py\n", "\n", "    # return the sum of these\n", "    return float(hygx + hxgy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Python 3.5 pro-tip! {.callout}**\n", ">\n", "> The `@` symbols in the above code represent the *matrix multiplication*\n", "> operator, introduced in Python 3.5 in 2015. This is one of the most\n", "> compelling arguments for scientific programmers to use Python 3: it enables\n", "> linear algebra algorithms to be written in code that remains very\n", "> close to the original mathematics. Compare the above:\n", ">\n", "> `hygx = px @ xlog1x(Px_inv @ Pxy).sum(axis=1)`\n", ">\n", "> with the equivalent Python 2 code:\n", ">\n", "> `hygx = px.dot(xlog1x(Px_inv.dot(Pxy)).sum(axis=1))`\n", ">\n", "> By using the `@` operator to stay closer to mathematical notation, we\n", "> can avoid implementation errors and produce code that is much easier to read.\n", ">\n", "> Actually, SciPy's authors knew this long before the `@` operator was\n", "> introduced, and altered the meaning of the\n", "> `*` operator when the inputs are SciPy matrices. 
Available in Python\n", "> 2.7, it lets us produce nice, readable code like the above:\n", ">\n", "> `hygx = px * xlog1x(Px_inv * Pxy).sum(axis=1)`\n", ">\n", "> But there is a huge catch: this code will behave differently when `px` or\n", "> `Px_inv` are SciPy matrices than when they are not! If `Px_inv` and `Pxy` are\n", "> NumPy arrays, `*` produces the element-wise multiplication, while if they are\n", "> SciPy matrices, it produces the matrix product! As you can imagine, this is\n", "> the source of a great many errors, and much of the SciPy community has\n", "> abandoned this use in favor of the uglier but unambiguous `.dot` method.\n", ">\n", "> Python 3.5's `@` operator gives us the best of both worlds!\n", "\n", "## Contingency tables\n", "\n", "But let's start simple and work our way up to segmentations.\n", "\n", "Suppose you just started working as a data scientist at email startup Spam-o-matic.\n", "You are tasked with building a detector for spam email.\n", "You encode the detector outcome as a numeric value, 0 for not spam and 1 for spam.\n", "\n", "If you have a set of 10 emails to classify, you end up with a vector of *predictions*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "pred = np.array([0, 1, 0, 0, 1, 1, 1, 0, 1, 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can check how well you've done by comparing it to a vector of *ground truth*: classifications obtained by inspecting each message by hand."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gt = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, classification is hard for computers, so the values in `pred` and `gt` don't match up exactly.\n", "At positions where `pred` is 0 and `gt` is 0, the prediction has correctly identified a message as non-spam.\n", "This is called a *true negative*.\n", "Conversely, at positions where both values are 1, the predictor has correctly identified a spam message, and found a *true positive*.\n", "\n", "Then, there are two kinds of errors.\n", "If we let a spam message (where `gt` is 1) through to the user's inbox (`pred` is 0), we've made a *false negative* error.\n", "If we predict a legitimate message (`gt` is 0) to be spam (`pred` is 1), we've made a *false positive* prediction.\n", "(An email from the director of my scientific institute once landed in my spam folder. The reason? His announcement of a postdoc talk competition started with \"You could win $500!\")\n", "\n", "If we want to measure how well we are doing, we have to count the above kinds of errors using a *contingency matrix*.\n", "(This is also sometimes called a confusion matrix. 
The name is apt.)\n", "For this, we place the prediction labels along the rows and the ground truth labels along the columns.\n", "Then we count the number of times they correspond.\n", "So, for example, since there are 4 true positives (where `pred` and `gt` are both 1), the matrix will have a value of 4 at position (1, 1).\n", "\n", "Generally:\n", "\n", "$$C_{i, j} = \\sum_k{\\mathbb{I}(p_k = i) \\mathbb{I}(g_k = j)}$$\n", "\n", "Here's an intuitive but inefficient way of building the above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def confusion_matrix(pred, gt):\n", "    cont = np.zeros((2, 2))\n", "    for i in [0, 1]:\n", "        for j in [0, 1]:\n", "            cont[i, j] = np.sum((pred == i) & (gt == j))\n", "    return cont" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check that this gives us the right counts:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "confusion_matrix(pred, gt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "**Exercise:** Why did we call this inefficient?\n", "\n", "\n", "\n", "**Solution:** From chapter 1, you recall that `arr == k` creates an array of\n", "Boolean (`True` or `False`) values of the same size as `arr`. This, as you\n", "might expect, requires a full pass over `arr`. Therefore, in the above\n", "solution, we make a full pass over each of `pred` and `gt` for every\n", "combination of values in `pred` and `gt`. In principle, we can compute `cont`\n", "using just a single pass over both arrays, so these multiple passes are\n", "inefficient.\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "**Exercise:** Write an alternative way of computing the confusion matrix that only makes a single pass through `pred` and `gt`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def confusion_matrix1(pred, gt):\n", " cont = np.zeros((2, 2))\n", " # your code goes here\n", " return cont" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "**Solution:** We offer two solutions here, although many are possible.\n", "\n", "Our first solution uses Python's built-in `zip` function to pair together\n", "labels from `pred` and `gt`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def confusion_matrix1(pred, gt):\n", " cont = np.zeros((2, 2))\n", " for i, j in zip(pred, gt):\n", " cont[i, j] += 1\n", " return cont" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our second solution is to iterate over all possible indices of `pred` and `gt`\n", "and manually grab the corresponding value from each array:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def confusion_matrix1(pred, gt):\n", " cont = np.zeros((2, 2))\n", " for idx in range(len(pred)):\n", " i = pred[idx]\n", " j = gt[idx]\n", " cont[i, j] += 1\n", " return cont" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first option might be considered the more \"Pythonic\" of the two, but the\n", "second one is easier to speed up by translating and compiling in languages or\n", "tools such as C, Cython, and Numba (which are a topic for another book).\n", "\n", "\n", "\n", "\n", "\n", "We can make this example a bit more general:\n", "Instead of classifying spam and non-spam, we can classify spam, newsletters,\n", "sales and promotions, mailing lists, and personal email.\n", "That's 5 categories, which we'll label 0 to 4.\n", "The confusion matrix will now be 5-by-5, with matches counted on the diagonal,\n", "and errors counted on the off-diagonal entries.\n", "\n", "The definition of the `confusion_matrix` function, above, doesn't extend well\n", "to this 
larger matrix, because now we must have *twenty-five* passes through the\n", "result and ground truth arrays.\n", "This problem only grows as we add more email categories, such as social media\n", "notifications.\n", "\n", "\n", "\n", "**Exercise:** Write a function to compute the confusion matrix in one pass, as\n", "above, but instead of assuming two categories, infer the number of categories\n", "from the input." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def general_confusion_matrix(pred, gt):\n", "    n_classes = None  # replace `None` with something useful\n", "    # your code goes here\n", "    return cont" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "**Solution:** We merely need to make an initial pass through both input arrays to determine\n", "the maximum label. We then add 1 to it to account for the zero label and\n", "Python's 0-based indexing. Finally, we create the matrix and fill it in the same\n", "way as above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def general_confusion_matrix(pred, gt):\n", "    n_classes = max(np.max(pred), np.max(gt)) + 1\n", "    cont = np.zeros((n_classes, n_classes))\n", "    for i, j in zip(pred, gt):\n", "        cont[i, j] += 1\n", "    return cont" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "Your one-pass solution will scale well with the number of classes, but, because\n", "the for-loop runs in the Python interpreter, it will be slow when you have a\n", "large number of documents.\n", "Also, because some classes are easier to mistake for one another, the matrix\n", "will be *sparse*, with many 0 entries.\n", "Indeed, as the number of classes increases, dedicating lots of memory space to\n", "the 0 entries of the contingency matrix is increasingly wasteful.\n", "Instead, we can use the `sparse` module of SciPy, which contains objects to\n", "efficiently represent sparse matrices.\n", "\n", 
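"As a preview of where we are headed, here is a sketch of how `sparse` can build the\n", "entire contingency matrix in a single pass, with the looping done in compiled code.\n", "It uses the same trick as the `variation_of_information` function at the start of\n", "the chapter; the function name `sparse_confusion_matrix` is ours, not SciPy's.\n", "\n", "```python\n", "import numpy as np\n", "from scipy import sparse\n", "\n", "def sparse_confusion_matrix(pred, gt):\n", "    # one entry of weight 1 per (prediction, truth) pair;\n", "    # duplicate coordinates are summed when the COO matrix is materialized\n", "    cont = sparse.coo_matrix((np.ones(pred.size), (pred, gt)))\n", "    return cont.toarray()\n", "\n", "pred = np.array([0, 1, 0, 0, 1, 1, 1, 0, 1, 1])\n", "gt = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])\n", "print(sparse_confusion_matrix(pred, gt))\n", "# [[3. 1.]\n", "#  [2. 4.]]\n", "```\n", "\n", 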
"## scipy.sparse data formats\n", "\n", "We covered the internal data format of NumPy arrays in Chapter 1.\n", "We hope you agree that it's a fairly intuitive, and, in some sense, inevitable\n", "format to hold n-dimensional array data.\n", "For sparse matrices, there is actually a wide array of possible formats, and\n", "the \"right\" format depends on the problem you want to solve.\n", "We'll cover the two most commonly used formats, but for a complete list, see the\n", "comparison table later in the chapter, as well as the online documentation for\n", "`scipy.sparse`.\n", "\n", "### COO (COOrdinate) format\n", "\n", "Perhaps the most intuitive is the coordinate, or COO, format.\n", "This uses three 1D arrays to represent a 2D matrix $A$.\n", "Each of these arrays has length equal to the number of nonzero values in $A$,\n", "and together they list the (i, j, value) coordinates of every entry that is not\n", "equal to 0:\n", "\n", "- the `row` and `col` arrays, which together specify the location of each\n", "  non-zero entry (row and column indices, respectively).\n", "- the `data` array, which specifies the *value* at each of those locations.\n", "\n", "Every part of the matrix that is not represented by the `(row, col)` pairs is\n", "considered to be 0.\n", "Much more efficient!\n", "\n", "So, to represent the matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = np.array([[ 4, 0, 3],\n", "              [ 0, 32, 0]], dtype=float)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy import sparse\n", "\n", "data = np.array([4, 3, 32], dtype=float)\n", "row = np.array([0, 0, 1])\n", "col = np.array([0, 2, 1])\n", "\n", "s_coo = sparse.coo_matrix((data, (row, col)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.toarray()` method of every sparse format in 
`scipy.sparse` returns a\n", "NumPy array representation of the sparse data.\n", "We can use this to check that we created `s_coo` correctly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s_coo.toarray()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Equivalently, we can use the `.A` *property*, which is just like an attribute,\n", "but actually executes a function. `.A` is a particularly dangerous property,\n", "because it can hide a potentially very large operation: the dense version\n", "of a sparse matrix can be orders of magnitude bigger than the sparse matrix\n", "itself, bringing a computer to its knees in just three keystrokes!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s_coo.A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this chapter, and elsewhere, we recommend using the `toarray()` method\n", "wherever it does not impair readability, as it more clearly signals a\n", "potentially expensive operation. 
However, we will use `.A` where it makes\n", "the code more readable by virtue of its brevity (for example, when trying to\n", "implement a sequence of mathematical equations).\n", "\n", "\n", "\n", "**Exercise:** Write out the COO representation of the following matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2 = np.array([[0, 0, 6, 0, 0],\n", "               [1, 2, 0, 4, 5],\n", "               [0, 1, 0, 0, 0],\n", "               [9, 0, 0, 0, 0],\n", "               [0, 0, 0, 6, 7]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "**Solution:** We first list the non-zero elements of the array, left-to-right and\n", "top-to-bottom, as if reading a book:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2_data = np.array([6, 1, 2, 4, 5, 1, 9, 6, 7])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then list the row indices of those values in the same order:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2_row = np.array([0, 1, 1, 1, 1, 2, 3, 4, 4])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally the column indices:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2_col = np.array([2, 0, 1, 3, 4, 1, 0, 3, 4])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check that these produce the right matrix by checking equality\n", "in both directions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2_coo0 = sparse.coo_matrix(s2)\n", "print(s2_coo0.data)\n", "print(s2_coo0.row)\n", "print(s2_coo0.col)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2_coo1 = sparse.coo_matrix((s2_data, (s2_row, s2_col)))\n", "print(s2_coo1.toarray())" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "Unfortunately, although the COO format is intuitive, it is not optimized for\n", "minimal memory use, or for fast traversal of the array during\n", "computations.\n", "(Remember from Chapter 1: *data locality* is very important for efficient\n", "computation!)\n", "However, you can look at your COO representation above to help you identify\n", "redundant information:\n", "Notice all those repeated `1`s?\n", "\n", "### CSR (Compressed Sparse Row) format\n", "\n", "If we use COO to enumerate the nonzero entries row-by-row, rather than in\n", "arbitrary order (which the format allows), we end up with many consecutive,\n", "repeated values in the `row` array.\n", "These can be compressed by indicating the *indices* in `col` where the next row\n", "starts, rather than repeatedly writing the row index.\n", "This is the basis for the *compressed sparse row* or *CSR* format.\n", "\n", "Let's work through the example above. In CSR format, the `col` and `data`\n", "arrays are unchanged (but `col` is renamed to `indices`). 
However, the `row`\n", "array, instead of indicating the rows, indicates *where* in `col` each row\n", "begins, and is renamed to `indptr`, for \"index pointer\".\n", "\n", "So, let's look at `row` and `col` in COO format, ignoring `data`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "row = [0, 1, 1, 1, 1, 2, 3, 4, 4]\n", "col = [2, 0, 1, 3, 4, 1, 0, 3, 4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each new row begins at the index where `row` changes.\n", "The 0th row starts at index 0, and the 1st row starts at index 1, but the 2nd\n", "row starts where \"2\" first appears in `row`, at index 5.\n", "Then, the indices increase by 1 for rows 3 and 4, to 6 and 7.\n", "The final index, indicating the end of the matrix, is the total number of\n", "nonzero values (9).\n", "So:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "indptr = [0, 1, 5, 6, 7, 9]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use these hand-computed arrays to build a CSR matrix in SciPy.\n", "We can check our work by comparing the `.A` output from our COO and\n", "CSR representations to the numpy array `s2` that we defined earlier." 
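, "\n", "(As a small aside, SciPy can also derive these arrays for us: converting `s2` to\n", "CSR format exposes `indptr` and `indices` directly, which we can compare with our\n", "hand-computed versions. The sketch below repeats the definition of `s2` so that it\n", "is self-contained.)\n", "\n", "```python\n", "import numpy as np\n", "from scipy import sparse\n", "\n", "s2 = np.array([[0, 0, 6, 0, 0],\n", "               [1, 2, 0, 4, 5],\n", "               [0, 1, 0, 0, 0],\n", "               [9, 0, 0, 0, 0],\n", "               [0, 0, 0, 6, 7]])\n", "\n", "s2_csr = sparse.csr_matrix(s2)\n", "print(s2_csr.indptr)   # [0 1 5 6 7 9]\n", "print(s2_csr.indices)  # [2 0 1 3 4 1 0 3 4]\n", "```"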
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = np.array([6, 1, 2, 4, 5, 1, 9, 6, 7])\n", "\n", "coo = sparse.coo_matrix((data, (row, col)))\n", "csr = sparse.csr_matrix((data, col, indptr))\n", "\n", "print('The COO and CSR arrays are equal: ',\n", "      np.all(coo.A == csr.A))\n", "print('The CSR and NumPy arrays are equal: ',\n", "      np.all(s2 == csr.A))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ability to store large, sparse matrices, and perform computations on them,\n", "is incredibly powerful, and can be applied in many domains.\n", "\n", "For example,\n", "one can think of the entire web as a large, sparse, $N \\times N$ matrix.\n", "Each entry $X_{ij}$ indicates whether web page $i$ links to page $j$.\n", "By normalizing this matrix and solving for its dominant eigenvector,\n", "one obtains the so-called PageRank—one of the numbers Google uses to\n", "order your search results. (You can read more about this in the next chapter.)\n", "\n", "As another example, we can represent the human brain as a large $m \\times m$\n", "graph, where there are $m$ nodes (positions) in which you\n", "measure activity using an MRI scanner. After measuring for a while,\n", "correlations can be calculated and entered into a matrix $C_{ij}$.\n", "Thresholding this matrix produces a sparse matrix of ones and zeros.\n", "The eigenvector corresponding to the second-smallest eigenvalue of the\n", "Laplacian of this matrix partitions the $m$ brain areas into subgroups, which,\n", "it turns out, are often related to functional regions of the brain [^Newman]!\n", "\n", "[^Newman]: Newman MEJ (2006). Modularity and community structure in networks.\n", "  PNAS 103(23):8577-8582. DOI:10.1073/pnas.0601602103\n", "\n", "
\n", "|  | bsr_matrix | coo_matrix | csc_matrix | csr_matrix | dia_matrix | dok_matrix | lil_matrix |\n", "|---|---|---|---|---|---|---|---|\n", "| Full name | Block Sparse Row | Coordinate | Compressed Sparse Column | Compressed Sparse Row | Diagonal | Dictionary of Keys | Row-based linked-list |\n", "| Note | Similar to CSR | Only used to construct sparse matrices, which are then converted to CSC or CSR for further operations. |  |  |  | Used to construct sparse matrices incrementally. | Used to construct sparse matrices incrementally. |\n", "| Use cases | Storage of matrices with dense blocks, e.g. from discretized problems | Fast, straightforward construction; duplicate coordinates are summed on conversion, which is handy for, e.g., contingency tables | Efficient arithmetic and column slicing; fast matrix-vector products | Efficient arithmetic and row slicing; fast matrix-vector products | Efficient storage of, and arithmetic on, matrices whose nonzero entries lie along diagonals | Inexpensive changes to sparsity structure; fast access to individual entries | Inexpensive changes to sparsity structure; flexible slicing |\n", "| Cons | See CSR | No arithmetic operations; no slicing | Slow row slicing (see CSR); expensive changes to sparsity structure (see LIL, DOK) | Slow column slicing (see CSC); expensive changes to sparsity structure (see LIL, DOK) | Sparsity structure restricted to diagonals | Slow arithmetic | Slow arithmetic; slow column slicing |"