{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Contingency tables using sparse coordinate matrices\n", "\n", "> I like sparseness. There's something about that minimalist feel that can make\n", "> something have an immediate impact and make it unique. I'll probably always\n", "> work with that formula. I just don't know how.\n", ">\n", "> — Britt Daniel, lead singer of *Spoon*.\n", "\n", "Many real-world matrices are *sparse*, which means that most of their values are zero.\n", "\n", "Using numpy arrays to manipulate sparse matrices wastes a lot of time and energy multiplying many, many values by 0.\n", "Instead, we can use SciPy's `sparse` module to solve these efficiently, examining only non-zero values.\n", "In addition to helping solve these \"canonical\" sparse matrix problems, `sparse` can be used for problems that are not obviously related to sparse matrices.\n", "\n", "One such problem is the comparison of image segmentations.\n", "(Review chapter 3 for a definition of segmentation.)\n", "\n", "The code sample motivating this chapter uses sparse matrices twice: First, we\n", "use code nominated by Andreas Mueller to compute a *contingency matrix* that\n", "counts the correspondence of labels between two segmentations. Then, with\n", "suggestions from Jaime Fernández del Río and Warren Weckesser, we use that\n", "contingency matrix to compute the *variation of information*, which measures\n", "the differences between segmentations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def variation_of_information(x, y):\n", " # compute contingency matrix, aka joint probability matrix\n", " n = x.size\n", " Pxy = sparse.coo_matrix((np.full(n, 1/n), (x.ravel(), y.ravel())),\n", " dtype=float).tocsr()\n", "\n", " # compute marginal probabilities, converting to 1D array\n", " px = np.ravel(Pxy.sum(axis=1))\n", " py = np.ravel(Pxy.sum(axis=0))\n", "\n", " # use sparse matrix linear algebra to compute VI\n", " # first, compute the inverse diagonal matrices\n", " Px_inv = sparse.diags(invert_nonzero(px))\n", " Py_inv = sparse.diags(invert_nonzero(py))\n", "\n", " # then, compute the entropies\n", " hygx = px @ xlog1x(Px_inv @ Pxy).sum(axis=1)\n", " hxgy = xlog1x(Pxy @ Py_inv).sum(axis=0) @ py\n", "\n", " # return the sum of these\n", " return float(hygx + hxgy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Python 3.5 pro-tip! {.callout}**\n", ">\n", "> The `@` symbols in the above paragraph represent the *matrix multiplication*\n", "> operator, and were introduced in Python 3.5 in 2015. This is one of the most\n", "> compelling arguments to use Python 3 for scientific programmers: they enable\n", "> the programming of linear algebra algorithms using code that remains very\n", "> close to the original mathematics. Compare the above:\n", ">\n", "> `hygx = px @ xlog1x(Px_inv @ Pxy).sum(axis=1)`\n", ">\n", "> with the equivalent Python 2 code:\n", ">\n", "> `hygx = px.dot(xlog1x(Px_inv.dot(Pxy)).sum(axis=1))`\n", ">\n", "> By using the `@` operator to stay closer to mathematical notation, we\n", "> can avoid implementation errors and produce code that is much easier to read.\n", ">\n", "> Actually, SciPy's authors knew this long before the `@` operator was\n", "> introduced, and actually altered the meaning of the\n", "> `*` operator when the inputs are SciPy matrices. Available in Python\n", "> 2.7, it lets us produce nice, readable code like the above:\n", ">\n", "> `hygx = -px * xlog(Px_inv * Pxy).sum(axis=1)`\n", ">\n", "> But there is a huge catch: this code will behave differently when `px` or\n", "> `Px_inv` are SciPy matrices than when they are not! If `Px_inv` and `Pxy` are\n", "> NumPy arrays, `*` produces the element-wise multiplication, while if they are\n", "> SciPy matrices, it produces the matrix product! As you can imagine, this is\n", "> the source of a great many errors, and much of the SciPy community has\n", "> abandoned this use in favor of the uglier but unambiguous `.dot` method.\n", ">\n", "> Python 3.5's `@` operator gives us the best of both worlds!\n", "\n", "## Contingency tables\n", "\n", "But let's start simple and work our way up to segmentations.\n", "\n", "Suppose you just started working as a data scientist at email startup Spam-o-matic.\n", "You are tasked with building a detector for spam email.\n", "You encode the detector outcome as a numeric value, 0 for not spam and 1 for spam.\n", "\n", "If you have a set of 10 emails to classify, you end up with a vector of *predictions*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "pred = np.array([0, 1, 0, 0, 1, 1, 1, 0, 1, 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can check how well you've done by comparing it to a vector of *ground truth*, classifications obtained by inspecting each message by hand." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gt = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, classification is hard for computers, so the values in `pred` and `gt` don't match up exactly.\n", "At positions where `pred` is 0 and `gt` is 0, the prediction has correctly identified a message as non-spam.\n", "This is called a *true negative*.\n", "Conversely, at positions where both values are 1, the predictor has correctly identified a spam message, and found a *true positive*.\n", "\n", "Then, there are two kinds of errors.\n", "If we let a spam message (where `gt` is 1) through to the user's inbox (`pred` is 0), we've made a *false negative* error.\n", "If we predict a legitimate message (`gt` is 0) to be spam (`pred` is 1), we've made a *false positive* prediction.\n", "(An email from the director of my scientific institute once landed in my spam folder. The reason? His announcement of a postdoc talk competition started with \"You could win $500!\")\n", "\n", "If we want to measure how well we are doing, we have to count the above kinds of errors using a *contingency matrix*.\n", "(This is also sometimes called a confusion matrix. The name is apt.)\n", "For this, we place the prediction labels along the rows and the ground truth labels along the columns.\n", "Then we count the number of times they correspond.\n", "So, for example, since there are 4 true positives (where `pred` and `gt` are both 1), the matrix will have a value of 4 at position (1, 1).\n", "\n", "Generally:\n", "\n", "$$C_{i, j} = \\sum_k{\\mathbb{I}(p_k = i) \\mathbb{I}(g_k = j)}$$\n", "\n", "Here's an intuitive, but inefficient way of building the above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def confusion_matrix(pred, gt):\n", " cont = np.zeros((2, 2))\n", " for i in [0, 1]:\n", " for j in [0, 1]:\n", " cont[i, j] = np.sum((pred == i) & (gt == j))\n", " return cont" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check that this gives use the right counts:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "confusion_matrix(pred, gt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "**Exercise:** Why did we call this inefficient?\n", "\n", "\n", "\n", "**Solution:** From chapter 1, you recall that `arr == k` creates an array of\n", "Boolean (`True` or `False`) values of the same size as `arr`. This, as you\n", "might expect, requires a full pass over `arr`. Therefore, in the above\n", "solution, we make a full pass over each of `pred` and `gt` for every\n", "combination of values in `pred` and `gt`. In principle, we can compute `cont`\n", "using just a single pass over both arrays, so these multiple passes are\n", "inefficient.\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "**Exercise:** Write an alternative way of computing the confusion matrix that only makes a single pass through `pred` and `gt`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def confusion_matrix1(pred, gt):\n", " cont = np.zeros((2, 2))\n", " # your code goes here\n", " return cont" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "**Solution:** We offer two solutions here, although many are possible.\n", "\n", "Our first solution uses Python's built-in `zip` function to pair together\n", "labels from `pred` and `gt`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def confusion_matrix1(pred, gt):\n", " cont = np.zeros((2, 2))\n", " for i, j in zip(pred, gt):\n", " cont[i, j] += 1\n", " return cont" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our second solution is to iterate over all possible indices of `pred` and `gt`\n", "and manually grab the corresponding value from each array:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def confusion_matrix1(pred, gt):\n", " cont = np.zeros((2, 2))\n", " for idx in range(len(pred)):\n", " i = pred[idx]\n", " j = gt[idx]\n", " cont[i, j] += 1\n", " return cont" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first option might be considered the more \"Pythonic\" of the two, but the\n", "second one is easier to speed up by translating and compiling in languages or\n", "tools such as C, Cython, and Numba (which are a topic for another book).\n", "\n", "\n", "\n", "\n", "\n", "We can make this example a bit more general:\n", "Instead of classifying spam and non-spam, we can classify spam, newsletters,\n", "sales and promotions, mailing lists, and personal email.\n", "That's 5 categories, which we'll label 0 to 4.\n", "The confusion matrix will now be 5-by-5, with matches counted on the diagonal,\n", "and errors counted on the off-diagonal entries.\n", "\n", "The definition of the `confusion_matrix` function, above, doesn't extend well\n", "to this larger matrix, because now we must have *twenty-five* passes though the\n", "result and ground truth arrays.\n", "This problem only grows as we add more email categories, such as social media\n", "notifications.\n", "\n", "\n", "\n", "**Exercise:** Write a function to compute the confusion matrix in one pass, as\n", "above, but instead of assuming two categories, infer the number of categories\n", "from the input." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def general_confusion_matrix(pred, gt):\n", " n_classes = None # replace `None` with something useful\n", " # your code goes here\n", " return cont" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "We merely need to make an initial pass through both input arrays to determine\n", "the maximum label. We then add 1 to it to account for the zero label and\n", "Python's 0-based indexing. We then create the matrix and fill it in the same\n", "way as above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def general_confusion_matrix(pred, gt):\n", " n_classes = max(np.max(pred), np.max(gt)) + 1\n", " cont = np.zeros((n_classes, n_classes))\n", " for i, j in zip(pred, gt):\n", " cont[i, j] += 1\n", " return cont" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "Your one-pass solution will scale well with the number of classes, but, because\n", "the for-loop runs in the Python interpreter, it will be slow when you have a\n", "large number of documents.\n", "Also, because some classes are easier to mistake for one another, the matrix\n", "will be *sparse*, with many 0 entries.\n", "Indeed, as the number of classes increases, dedicating lots of memory space to\n", "the 0 entries of the contingency matrix is increasingly wasteful.\n", "Instead, we can use the `sparse` module of SciPy, which contains objects to\n", "efficiently represent sparse matrices.\n", "\n", "## scipy.sparse data formats\n", "\n", "We covered the internal data format of NumPy arrays in Chapter 1.\n", "We hope you agree that it's a fairly intuitive, and, in some sense, inevitable\n", "format to hold n-dimensional array data.\n", "For sparse matrices, there are actually a wide array of possible formats, and\n", "the \"right\" format depends on the problem you want to solve.\n", "We'll cover the two most commonly-used formats, but for a complete list, see the\n", "comparison table later in the chapter, as well as the online documentation for\n", "`scipy.sparse`.\n", "\n", "### COO (COOrdinate) format\n", "\n", "Perhaps the most intuitive is the coordinate, or COO, format.\n", "This uses three 1D arrays to represent a 2D matrix $A$.\n", "Each of these arrays has length equal to the number of nonzero values in $A$,\n", "and together they list (i, j, value) coordinates of every entry that is not\n", "equal to 0.\n", "\n", "- the `row` and `col` arrays, which together specify the location of each\n", " non-zero entry (row and column indices, respectively).\n", "- the `data` array, which specifies the *value* at each of those locations.\n", "\n", "Every part of the matrix that is not represented by the `(row, col)` pairs is\n", "considered to be 0.\n", "Much more efficient!\n", "\n", "So, to represent the matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = np.array([[ 4, 0, 3],\n", " [ 0, 32, 0]], dtype=float)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy import sparse\n", "\n", "data = np.array([4, 3, 32], dtype=float)\n", "row = np.array([0, 0, 1])\n", "col = np.array([0, 2, 1])\n", "\n", "s_coo = sparse.coo_matrix((data, (row, col)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.toarray()` method of every sparse format in `scipy.sparse` returns a\n", "NumPy array representation of the sparse data.\n", "We can use this to check that we created `s_coo` correctly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s_coo.toarray()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Identically, we can use the `.A` *property*, which is just like an attribute,\n", "but actually executes a function. `.A` is a particularly dangerous property,\n", "because it can hide a potentially very large operation: the dense version\n", "of a sparse matrix can be orders of magnitude bigger than the sparse matrix\n", "itself, bringing a computer to its knees, in just three keystrokes!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s_coo.A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this chapter, and elsewhere, we recommend using the `toarray()` method\n", "wherever it does not impair readability, as it more clearly signals a\n", "potentially expensive operation. However, we will use `.A` where it makes\n", "the code more readable by virtue of its brevity (for example, when trying to\n", "implement a sequence of mathematical equations).\n", "\n", "\n", "\n", "**Exercise:** write out the COO representation of the following matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2 = np.array([[0, 0, 6, 0, 0],\n", " [1, 2, 0, 4, 5],\n", " [0, 1, 0, 0, 0],\n", " [9, 0, 0, 0, 0],\n", " [0, 0, 0, 6, 7]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "**Solution:** We first list the non-zero elements of the array, left-to-right and\n", "top-to-bottom, as if reading a book:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2_data = np.array([6, 1, 2, 4, 5, 1, 9, 6, 7])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then list the row indices of those values in the same order:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2_row = np.array([0, 1, 1, 1, 1, 2, 3, 4, 4])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally the column indices:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2_col = np.array([2, 0, 1, 3, 4, 1, 0, 3, 4])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can easily check that these produce the right matrix, by checking equality\n", "in both directions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2_coo0 = sparse.coo_matrix(s2)\n", "print(s2_coo0.data)\n", "print(s2_coo0.row)\n", "print(s2_coo0.col)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s2_coo1 = sparse.coo_matrix((s2_data, (s2_row, s2_col)))\n", "print(s2_coo1.toarray())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "Unfortunately, although the COO format is intuitive, it's not very optimized to\n", "use the minimum amount of memory, or to traverse the array as quickly as\n", "possible during computations.\n", "(Remember from Chapter 1, *data locality* is very important to efficient\n", "computation!)\n", "However, you can look at your COO representation above to help you identify\n", "redundant information:\n", "Notice all those repeated `1`s?\n", "\n", "### CSR (Compressed Sparse Row) format\n", "\n", "If we use COO to enumerate the nonzero entries row-by-row, rather than in\n", "arbitrary order (which the format allows), we end up with many consecutive,\n", "repeated values in the `row` array.\n", "These can be compressed by indicating the *indices* in `col` where the next row\n", "starts, rather than repeatedly writing the row index.\n", "This is the basis for the *compressed sparse row* or *CSR* format.\n", "\n", "Let's work through the example above. In CSR format, the `col` and `data`\n", "arrays are unchanged (but `col` is renamed to `indices`). However, the `row`\n", "array, instead of indicating the rows, indicates *where* in `col` each row\n", "begins, and is renamed to `indptr`, for \"index pointer\".\n", "\n", "So, let's look at `row` and `col` in COO format, ignoring `data`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "row = [0, 1, 1, 1, 1, 2, 3, 4, 4]\n", "col = [2, 0, 1, 3, 4, 1, 0, 3, 4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each new row begins at the index where `row` changes.\n", "The 0th row starts at index 0, and the 1st row starts at index 1, but the 2nd\n", "row starts where \"2\" first appears in `row`, at index 5.\n", "Then, the indices increase by 1 for rows 3 and 4, to 6 and 7.\n", "The final index, indicating the end of the matrix, is the total number of\n", "nonzero values (9).\n", "So:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "indptr = [0, 1, 5, 6, 7, 9]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use these hand-computed arrays to build a CSR matrix in SciPy.\n", "We can check our work by comparing the `.A` output from our COO and\n", "CSR representations to the numpy array `s2` that we defined earlier." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = np.array([6, 1, 2, 4, 5, 1, 9, 6, 7])\n", "\n", "coo = sparse.coo_matrix((data, (row, col)))\n", "csr = sparse.csr_matrix((data, col, indptr))\n", "\n", "print('The COO and CSR arrays are equal: ',\n", " np.all(coo.A == csr.A))\n", "print('The CSR and NumPy arrays are equal: ',\n", " np.all(s2 == csr.A))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ability to store large, sparse matrices, and perform computations on them,\n", "is incredibly powerful, and can be applied in many domains.\n", "\n", "For example,\n", "one can think of the entire web as a large, sparse, $N \\times N$ matrix.\n", "Each entry $X_{ij}$ indicates whether web page $i$ links to page $j$.\n", "By normalizing this matrix and solving for its dominant eigenvector,\n", "one obtains the so-called PageRank—one of the numbers Google uses to\n", "order your search results. (You can read more about this in the next chapter.)\n", "\n", "As another example, we can represent the human brain as a large $m \\times m$\n", "graph, where there are $m$ nodes (positions) in which you\n", "measure activity using an MRI scanner. After a while of measuring,\n", "correlations can be calculated and entered into a matrix $C_{ij}$.\n", "Thresholding this matrix produces a sparse matrix of ones and zeros. \n", "The eigenvector corresponding to the second-smallest eigenvalue of this matrix\n", "partitions the $m$ brain areas into subgroups, which, it turns out,\n", "are often related to functional regions of the brain [^Newman]!\n", "\n", "[^Newman]: Newman MEJ (2006). Modularity and community structure in networks.\n", " PNAS 103(23):8577-8582. DOI:10.1073/pnas.0601602103\n", "\n", "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
bsr_matrixcoo_matrixcsc_matrixcsr_matrixdia_matrixdok_matrixlil_matrix

Full name

Block Sparse Row

Coordinate

Compressed Sparse Column

Compressed Sparse Row

Diagonal

Dictionary of Keys

Row-based linked-list

Note

Similar to CSR

Only used to construct sparse matrices, which are then converted to CSC or CSR for further operations.

Used to construct sparse matrices incrementally.

Used to construct sparse matrices incrementally.

Use cases

    \n", "
  • Storage of dense sub-matrices
  • \n", "
  • Often used in numerical analyses of discretized problems,
  • \n", "
  • such as finite elements, differential equations
  • \n", "
    \n", "
  • Fast and straightforward way of constructing sparse matrices
  • \n", "
  • During construction, duplicate coordinates are summed—useful for, e.g., finite element analysis
  • \n", "
    \n", "
  • Arithmetic operations (supports addition, subtraction, multiplication, division, and matrix power
  • \n", "
  • Efficient column slicing
  • \n", "
  • Fast matrix-vector products (CSR, BSR can be faster, depending on the problem)
  • \n", "
    \n", "
  • Arithmetic operations
  • \n", "
  • Efficient row slicing
  • \n", "
  • Fast matrix-vector products
  • \n", "
    \n", "
  • Arithmetic operations
  • \n", "
    \n", "
  • Changes in sparsity structure are inexpensive
  • \n", "
  • Arithmetic operations
  • \n", "
  • Fast access to individual elements
  • \n", "
  • Efficient conversion to COO (but no duplicates allowed)
  • \n", "
    \n", "
  • Changes in sparsity structure are inexpensive
  • \n", "
  • Flexible slicing
  • \n", "

Cons

    \n", "
  • No arithmetic operations
  • \n", "
  • No slicing
  • \n", "
    \n", "
  • Slow row slicing (see CSR)
  • \n", "
  • Changes to sparsity structure are expensive (see LIL, DOK)
  • \n", "
    \n", "
  • Slow column slicing (see CSC)
  • \n", "
  • Changes to sparsity structure are expensive (see LIL, DOK)
  • \n", "
    \n", "
  • Sparsity structure limited to values on diagonals
  • \n", "
    \n", "
  • Expensive for arithmetic operations
  • \n", "
  • Slow matrix-vector products
  • \n", "
    \n", "
  • Expensive for arithmetic operations
  • \n", "
  • Slow column slicing
  • \n", "
  • Slow matrix-vector products
  • \n", "
\n", "
\n", "\n", "## Applications of sparse matrices: image transformations\n", "\n", "Libraries like scikit-image and SciPy already contain algorithms for\n", "transforming (rotating & warping) images effectively, but what if you\n", "were head of the NumPy Agency for Space Affairs and had to rotate\n", "millions of images streaming in from the newly launched Jupyter\n", "Orbiter?\n", "\n", "In such cases, you want to squeeze every ounce of performance from your\n", "computer. It turns out that we can do a lot better than even the optimized C\n", "code in SciPy's `ndimage` if we are repeatedly applying the *same*\n", "transformation.\n", "\n", "We'll use the following image from scikit-image as example data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make plots appear inline, set custom plotting style\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.style.use('style/elegant.mplstyle')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from skimage import data\n", "image = data.camera()\n", "plt.imshow(image);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "As a test operation, we'll be rotating the image by 30 degrees. We begin\n", "by defining the transformation matrix, $H$ which, when multiplied with a\n", "coordinate from the input image, $[r, c, 1]$, will give us the\n", "corresponding coordinate in the output, $[r', c', 1]$. (Note: we are using\n", "[homogeneous coordinates](https://en.wikipedia.org/wiki/Homogeneous_coordinates),\n", "which have a 1 appended to them and which give greater flexibility when\n", "defining linear transforms.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "angle = 30\n", "c = np.cos(np.deg2rad(angle))\n", "s = np.sin(np.deg2rad(angle))\n", "\n", "H = np.array([[c, -s, 0],\n", " [s, c, 0],\n", " [0, 0, 1]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can verify that this works by multiplying H with the point (1, 0). A\n", "30-degree counterclockwise rotation around the origin (0, 0) should take us\n", "to point $(\\frac{\\sqrt{3}}{2}, \\frac{1}{2})$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "point = np.array([1, 0, 1])\n", "print(np.sqrt(3) / 2)\n", "print(H @ point)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, applying the 30-degree rotation three times should get us to the\n", "column axis, at point (0, 1). We can see that this works, minus some floating\n", "point approximation error:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(H @ H @ H @ point)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we will build a function that defines a \"sparse operator\". The goal of\n", "the sparse operator is to take all pixels of the output image, figure out where\n", "they came from in the input image, and do the appropriate (bi-linear)\n", "interpolation (see figure below) to calculate their values. It does this using just\n", "matrix multiplication on the image values, and thus is extremely fast.\n", "\n", "\n", "\n", "\n", "Let's look at the function that builds our sparse operator:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from itertools import product\n", "\n", "def homography(tf, image_shape):\n", " \"\"\"Represent homographic transformation & interpolation as linear operator.\n", "\n", " Parameters\n", " ----------\n", " tf : (3, 3) ndarray\n", " Transformation matrix.\n", " image_shape : (M, N)\n", " Shape of input gray image.\n", "\n", " Returns\n", " -------\n", " A : (M * N, M * N) sparse matrix\n", " Linear-operator representing transformation + bilinear interpolation.\n", "\n", " \"\"\"\n", " # Invert matrix. This tells us, for each output pixel, where to\n", " # find its corresponding input pixel.\n", " H = np.linalg.inv(tf)\n", "\n", " m, n = image_shape\n", "\n", " # We are going to construct a COO matrix, often called IJK matrix,\n", " # for which we'll need row coordinates (I), column coordinates (J),\n", " # and values (K).\n", " row, col, values = [], [], []\n", "\n", " # For each pixel in the output image...\n", " for sparse_op_row, (out_row, out_col) in \\\n", " enumerate(product(range(m), range(n))):\n", "\n", " # Compute where it came from in the input image\n", " in_row, in_col, in_abs = H @ [out_row, out_col, 1]\n", " in_row /= in_abs\n", " in_col /= in_abs\n", "\n", " # if the coordinates are outside of the original image, ignore this\n", " # coordinate; we will have 0 at this position\n", " if (not 0 <= in_row < m - 1 or\n", " not 0 <= in_col < n - 1):\n", " continue\n", "\n", " # We want to find the four surrounding pixels, so that we\n", " # can interpolate their values to find an accurate\n", " # estimation of the output pixel value\n", " # We start with the top, left corner, noting that the remaining\n", " # points are 1 away in each direction.\n", " top = int(np.floor(in_row))\n", " left = int(np.floor(in_col))\n", "\n", " # Calculate the position of the output pixel, mapped into\n", " # the input image, within the four selected pixels\n", " # https://commons.wikimedia.org/wiki/File:BilinearInterpolation.svg\n", " t = in_row - top\n", " u = in_col - left\n", "\n", " # The current row of the sparse operator matrix is given by the\n", " # raveled output pixel coordinates, contained in `sparse_op_row`.\n", " # We will take the weighted average of the four surrounding input\n", " # pixels, corresponding to four columns. So we need to repeat the row\n", " # index four times.\n", " row.extend([sparse_op_row] * 4)\n", "\n", " # The actual weights are calculated according to the bilinear\n", " # interpolation algorithm, as shown at\n", " # https://en.wikipedia.org/wiki/Bilinear_interpolation\n", " sparse_op_col = np.ravel_multi_index(\n", " ([top, top, top + 1, top + 1 ],\n", " [left, left + 1, left, left + 1]), dims=(m, n))\n", " col.extend(sparse_op_col)\n", " values.extend([(1-t) * (1-u), (1-t) * u, t * (1-u), t * u])\n", "\n", " operator = sparse.coo_matrix((values, (row, col)),\n", " shape=(m*n, m*n)).tocsr()\n", " return operator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that we apply the sparse operator as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def apply_transform(image, tf):\n", " return (tf @ image.flat).reshape(image.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try it out!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tf = homography(H, image.shape)\n", "out = apply_transform(image, tf)\n", "plt.imshow(out);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "There's that rotation!\n", "\n", "\n", "\n", "**Exercise:** The rotation happens around the origin, coordinate (0, 0). But\n", "can you rotate the image around its center?\n", "\n", "**Hint:** The transformation matrix for a *translation*, i.e. sliding the image\n", "up/down or left/right, is given by:\n", "\n", "$$\n", "H_{tr} =\n", "\\begin{bmatrix}\n", " 1 & 0 & t_r \\\\\n", " 0 & 1 & t_c \\\\\n", " 0 & 0 & 1\n", "\\end{bmatrix}\n", "$$\n", "\n", "when you want to move the image $t_r$ pixels down and $t_c$ pixels right.\n", "\n", "\n", "\n", "**Solution:** We can *compose* transformations by multiplying them. We know how to rotate\n", "an image about the origin, as well as how to slide it around. So what we will\n", "do is slide the image so that the center is at the origin, rotate it, and then\n", "slide it back." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def transform_rotate_about_center(shape, degrees):\n", " \"\"\"Return the homography matrix for a rotation about an image center.\"\"\"\n", " c = np.cos(np.deg2rad(angle))\n", " s = np.sin(np.deg2rad(angle))\n", "\n", " H_rot = np.array([[c, -s, 0],\n", " [s, c, 0],\n", " [0, 0, 1]])\n", " # compute image center coordinates\n", " center = np.array(image.shape) / 2\n", " # matrix to center image on origin\n", " H_tr0 = np.array([[1, 0, -center[0]],\n", " [0, 1, -center[1]],\n", " [0, 0, 1]])\n", " # matrix to move center back\n", " H_tr1 = np.array([[1, 0, center[0]],\n", " [0, 1, center[1]],\n", " [0, 0, 1]])\n", " # complete transformation matrix\n", " H_rot_cent = H_tr1 @ H_rot @ H_tr0\n", "\n", " sparse_op = homography(H_rot_cent, image.shape)\n", "\n", " return sparse_op" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can test that this works:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tf = transform_rotate_about_center(image.shape, 30)\n", "plt.imshow(apply_transform(image, tf));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "As mentioned above, this sparse linear operator approach to image\n", "transformation is extremely fast.\n", "Let's measure how it performs in comparison to `ndimage`. To make the comparison\n", "fair, we need to tell `ndimage` that we want linear interpolation with `order=1`,\n", "and that we want to ignore pixels outside of the original shape, with\n", "`reshape=False`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%timeit apply_transform(image, tf)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy import ndimage as ndi\n", "%timeit ndi.rotate(image, 30, reshape=False, order=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On our machines, we see a speed-up of approximately 10 times. While\n", "this example does only a rotation, there is no reason why we cannot do\n", "more complicated warping operations, such as correcting for a distorted lens\n", "during imaging, or making people pull funny faces. Once the transform has been\n", "computed, applying it repeatedly is extremely fast, thanks to sparse matrix\n", "algebra.\n", "\n", "So now that we've seen a \"standard\" use of SciPy's sparse matrices, let's have\n", "a look at the out-of-the-box use that inspired this chapter.\n", "\n", "## Back to contingency tables\n", "\n", "You might recall that we are trying to quickly build a sparse, joint\n", "probability matrix using SciPy's sparse formats. We know that the COO format\n", "stores sparse data as three arrays, containing the row and column coordinates\n", "of nonzero entries, as well as their values. But we can use a little known\n", "feature of COO to obtain our matrix extremely quickly.\n", "\n", "Have a look at this data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "row = [0, 0, 2]\n", "col = [1, 1, 2]\n", "dat = [5, 7, 1]\n", "S = sparse.coo_matrix((dat, (row, col)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the entry at (row, column) position (0, 1) appears twice: first as\n", "5, and then at 7. What should the matrix value at (0, 1) be? Cases could be\n", "made for both the earliest entry encountered, or the latest, but what was in\n", "fact chosen is the *sum*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(S.toarray())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, COO format will sum together repeated entries... Which is exactly what we\n", "need to do to make a contingency matrix! Indeed, our task is pretty much done:\n", "we can set `pred` as the rows, `gt` as the columns, and simply 1 as the values.\n", "The ones will get summed together and count the number of times that label $i$\n", "in `pred` occurs together with label $j$ in `gt` at position $i, j$ in the\n", "matrix! Let's try it out:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy import sparse\n", "\n", "def confusion_matrix(pred, gt):\n", " cont = sparse.coo_matrix((np.ones(pred.size), (pred, gt)))\n", " return cont" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To look at a small one, we simply use the `.toarray` method, as above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cont = confusion_matrix(pred, gt)\n", "print(cont)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(cont.toarray())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It works!\n", "\n", "\n", "\n", "**Exercise:** Remember from Chapter 1 that NumPy has built-in tools for\n", "repeating arrays using *broadcasting*. How can you reduce the memory footprint\n", "required for the contingency matrix computation?\n", "\n", "**Hint:** Look at the documentation for the function `np.broadcast_to`.\n", "\n", "\n", "\n", "**Solution:** The `np.ones` array that we create is read-only: it will only be used as the\n", "values to sum by `coo_matrix`. We can use `broadcast_to` to create a similar\n", "array with only one element, \"virtually\" repeated n times:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def confusion_matrix(pred, gt):\n", " n = pred.size\n", " ones = np.broadcast_to(1., n) # virtual array of 1s of size n\n", " cont = sparse.coo_matrix((ones, (pred, gt)))\n", " return cont" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make sure it still works as expected:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cont = confusion_matrix(pred, gt)\n", "print(cont.toarray())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Boom. Instead of making an array as big as the original data, we just make\n", "one of size 1. As we handle bigger and bigger datasets, such optimizations become\n", "increasingly important.\n", "\n", "\n", "\n", "\n", "\n", "## Contingency tables in segmentation\n", "\n", "You can think of the segmentation of an image in the same way as the classification problem above:\n", "The segment label at each *pixel* is a *prediction* about which *class* the pixel belongs to.\n", "And numpy arrays allow us to do this transparently, because their `.ravel()` method returns a 1D view of the underlying data.\n", "\n", "As an example, here's a segmentation of a tiny 3 by 3 image:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "seg = np.array([[1, 1, 2],\n", " [1, 2, 2],\n", " [3, 3, 3]], dtype=int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here’s the ground truth, what some person said was the correct way to segment this image:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gt = np.array([[1, 1, 1],\n", " [1, 1, 1],\n", " [2, 2, 2]], dtype=int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can think of these two as classifications, just like before. Every pixel is\n", "a different prediction." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(seg.ravel())\n", "print(gt.ravel())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, like above, the contingency matrix is given by:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cont = sparse.coo_matrix((np.ones(seg.size),\n", " (seg.ravel(), gt.ravel())))\n", "print(cont)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some indices appear more than once, but we can use the summing feature of the\n", "COO format to confirm that this represents the matrix we want:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(cont.toarray())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do we convert this table into a measure of how well `seg` represents `gt`?\n", "Segmentation is a hard problem, so it's important to measure how well a\n", "segmentation algorithm is doing, by comparing its output to a \"ground truth\"\n", "segmentation that is manually produced by a human.\n", "\n", "But, even this comparison is not an easy task. How do we define how \"close\" an\n", "automated segmentation is to a ground truth? We'll illustrate one method, the\n", "*variation of information* or VI (Meila, 2005). This is defined as the answer\n", "to the following question: on average, for a random pixel, if we are given its\n", "segment ID in one segmentation, how much more *information* do we need to\n", "determine its ID in the other segmentation?\n", "\n", "Intuitively, if the two segmentations are exactly alike, then knowing the segment\n", "ID in one tells you the segment ID in the other, with no additional information.\n", "But as the segmentations become more different, knowing an ID in one doesn't tell\n", "you the ID in the other without more information.\n", "\n", "## Information theory in brief\n", "\n", "In order to answer this question, we'll need a quick primer on information\n", "theory. We need to be brief but if you want more information (heh), you should\n", "look at Christopher Olah's stellar blog post,\n", "[Visual Information Theory](https://colah.github.io/posts/2015-09-Visual-Information/).\n", "\n", "The basic unit of information is the *bit*, commonly shown as a 0 or 1,\n", "representing an equal probability choice between two options.\n", "This is straightforward: if I want to tell you whether a coin toss landed as\n", "heads or tails, I need one bit, which can take many forms:\n", "a long or short pulse over a telegraph wire (as in Morse code), a light\n", "flashing one of two colors, or a single number taking values 0 or 1.\n", "Importantly, I *always* need one bit, because the outcome of a coin toss is\n", "random.\n", "\n", "It turns out that we can extend this concept to *fractional* bits for events\n", "that are *less* random.\n", "Suppose, for example, that you need to transmit whether it rained today in Los\n", "Angeles.\n", "At first glance, it seems that this requires 1 bit as well: 0 for it didn't\n", "rain, 1 for it rained.\n", "However, rain in LA is a rare event, so over time we can actually get away with\n", "transmitting much less information:\n", "Transmit a 0 *occasionally* just to make sure that our communication is still\n", "working, but otherwise simply *assume* that the signal is 0, and send 1 only on\n", "those rare occasions that it rains.\n", "\n", "Thus, when two events are *not* equally likely, we need *less* than 1 bit to\n", "represent them.\n", "Generally, we measure this for any random variable $X$ (which could have more\n", "than two possible values) by using the *entropy* function $H$:\n", "\n", "$$\n", "\\begin{aligned}\n", "H(X) & = \\sum_{x}{p_x \\log_2\\left(\\frac{1}{p_x}\\right)} \\\\\n", " & = -\\sum_{x}{p_x \\log_2\\left(p_x\\right)}\n", "\\end{aligned}\n", "$$\n", "\n", "where the $x$s are possible values of $X$, and $p_x$ is the probability of $X$\n", "taking value $x$.\n", "\n", "So, the entropy of a coin toss $T$ that can take values heads ($h$) and tails\n", "($t$) is:\n", "\n", "$$\n", "\\begin{aligned}\n", "H(T) & = p_h \\log_2(1/p_h) + p_t \\log_2(1/p_t) \\\\\n", " & = 1/2 log_2(2) + 1/2 \\log_2(2) \\\\\n", " & = 1/2 \\cdot 1 + 1/2 \\cdot 1 \\\\\n", " & = 1\n", "\\end{aligned}\n", "$$\n", "\n", "The long-term probability of rain on any given day in LA is about 1 in 6, so\n", "the entropy of rain in LA, $R$, taking values rain ($r$) or shine ($s$) is:\n", "\n", "$$\n", "\\begin{aligned}\n", "H(R) & = p_r \\log_2(1/p_r) + p_s \\log_2(1/p_s) \\\\\n", " & = 1/6 \\log_2(6) + 5/6 \\log_2(6/5) \\\\\n", " & \\approx 0.65 \\textrm{ bits}\n", "\\end{aligned}\n", "$$\n", "\n", "A special kind of entropy is the *conditional* entropy.\n", "This is the entropy of a variable *assuming* that you also know something else\n", "about that variable.\n", "For example: what is the entropy of rain *given* that you know the month?\n", "This is written as:\n", "\n", "$$\n", "H(R | M) = \\sum_{m = 1}^{12}{p(m)H(R | M = m)}\n", "$$\n", "\n", "and\n", "\n", "$$\n", "\\begin{aligned}\n", "H(R | M=m) &= {p_{r|m}\\log_2\\left(\\frac{1}{p_{r|m}}\\right) +\n", " p_{s|m}\\log_2\\left(\\frac{1}{p_{s|m}}\\right)} \\\\\n", " &= {\\frac{p_{rm}}{p_m}\\log_2\\left(\\frac{p_m}{p_{rm}}\\right) +\n", " \\frac{p_{sm}}{p_m}\\log_2\\left(\\frac{p_m}{p_{sm}}\\right)} \\\\\n", " &= {-\\frac{p_{rm}}{p_m}\\log_2\\left(\\frac{p_{rm}}{p_{m}}\\right) -\n", " \\frac{p_{sm}}{p_m}\\log_2\\left(\\frac{p_{sm}}{p_{m}}\\right)}\n", "\\end{aligned}\n", "$$\n", "\n", "You now have all the information theory you need to understand the variation\n", "of information.\n", "In the above example, events are days, and they have two properties:\n", "\n", "- rain/shine\n", "- month\n", "\n", "By observing many days, we can build a *contingency matrix*, just like the\n", "ones in the classification examples, measuring the month of a day and whether\n", "it rained.\n", "We're not going to travel to LA to do this (fun though it would be), and\n", "instead we use the historical table below, roughly eyeballed from\n", "[WeatherSpark](https://weatherspark.com/averages/30699/Los-Angeles-California-United-States):\n", "\n", "| Month | P(rain) | P(shine) |\n", "| -----:| -------- | -------- |\n", "|1|0.25|0.75|\n", "|2|0.27|0.73|\n", "|3|0.24|0.76|\n", "|4|0.18|0.82|\n", "|5|0.14|0.86|\n", "|6|0.11|0.89|\n", "|7|0.07|0.93|\n", "|8|0.08|0.92|\n", "|9|0.10|0.90|\n", "|10|0.15|0.85|\n", "|11|0.18|0.82|\n", "|12|0.23|0.77|\n", "\n", "The conditional entropy of rain given month is then:\n", "\n", "$$\n", "\\begin{aligned}\n", "H(R|M) & = -\\frac{1}{12} \\left( 0.25 \\log_2(0.25) +\n", " 0.75 \\log_2(0.75) \\right) -\n", " \\frac{1}{12} \\left( 0.27 \\log_2(0.27) +\n", " 0.73 \\log_2(0.73) \\right) \\\\\n", " & - ... -\n", " \\frac{1}{12} \\left( 0.23 \\log_2(0.23) +\n", " 0.77 \\log_2(0.77) \\right) \\\\\n", " & \\approx 0.626 \\textrm{ bits}\n", "\\end{aligned}\n", "$$\n", "\n", "So, by using the month, we've reduced the randomness of the signal, but not by\n", "much!\n", "\n", "We can also compute the conditional entropy of month given rain, which\n", "measures how much information we need to determine the month if we know it\n", "rained.\n", "Intuitively, we know that this is better than going in blind, since it's\n", "more likely to rain in the winter months.\n", "\n", "\n", "\n", "**Exercise:** Compute the conditional entropy of month given rain. What is the\n", "entropy of the month variable? (Ignore the different number of days in a\n", "month.) Which one is greater? (*Hint:* the probabilities in the table are\n", "the conditional probabilities of rain given month.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prains = np.array([25, 27, 24, 18, 14, 11, 7, 8, 10, 15, 18, 23]) / 100\n", "pshine = 1 - prains\n", "p_rain_g_month = np.column_stack([prains, pshine])\n", "# replace 'None' below with expression for non-conditional contingency\n", "# table. Hint: the values in the table must sum to 1.\n", "p_rain_month = None\n", "# Add your code below to compute H(M|R) and H(M)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "**Solution:** To obtain the joint probability table, we simply divide the table by its total,\n", "in this case, 12:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('table total:', np.sum(p_rain_g_month))\n", "p_rain_month = p_rain_g_month / np.sum(p_rain_g_month)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can compute the conditional entropy of the month given rain. (This is\n", "like asking: if we know it's raining, how much more information do we need to\n", "know to figure out what month it is, on average?)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p_rain = np.sum(p_rain_month, axis=0)\n", "p_month_g_rain = p_rain_month / p_rain\n", "Hmr = np.sum(p_rain * p_month_g_rain * np.log2(1 / p_month_g_rain))\n", "print(Hmr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's compare that to the entropy of the months:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p_month = np.sum(p_rain_month, axis=1) # 1/12, but this method is more general\n", "Hm = np.sum(p_month * np.log2(1 / p_month))\n", "print(Hm)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we can see that knowing whether it rained today got us 2 hundredths of a bit\n", "closer to guessing what month it is! Don't bet the farm on that guess.\n", "\n", "\n", "\n", "\n", "\n", "Together, these two values define the variation of information (VI):\n", "\n", "$$\n", "VI(A, B) = H(A | B) + H(B | A)\n", "$$\n", "\n", "## Information theory in segmentation: variation of information\n", "\n", "Back in the image segmentation context, \"days\" become \"pixels\", and \"rain\" and \"month\"\n", "become \"label in automated segmentation ($S$)\" and \"label ground truth ($T$)\".\n", "Then, the conditional entropy of the automatic segmentation given the ground\n", "truth measures how much additional\n", "information we need to determine a pixel's identity in $S$ if we are told its\n", "identity in $T$.\n", "For example, if every $T$ segment $g$ is split into two equally-sized\n", "segments $a_1$ and $a_2$ in $S$, then $H(S|T) = 1$, because after knowing a\n", "pixel is in $g$, you\n", "still need 1 additional bit to know whether it belongs to $a_1$ or $a_2$.\n", "However, $H(T | S) = 0$, because regardless of whether a pixel is in $a_1$ or\n", "$a_2$, it is guaranteed to be in $g$, so you need no more information than\n", "the segment in S.\n", "\n", "So, together, in this case,\n", "\n", "$$\n", "VI(S, T) = H(S | T) + H(T | S) = 1 + 0 = 1 \\textrm{ bit.}\n", "$$\n", "\n", "Here's a simple example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "S = np.array([[0, 1],\n", " [2, 3]], int)\n", "\n", "T = np.array([[0, 1],\n", " [0, 1]], int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we have two segmentations of a four-pixel image: `S` and `T`. `S`\n", "puts every pixel in its own segment, while `T` puts the left two pixels in\n", "segment 0 and the right two pixels in segment 1.\n", "Now, we make a contingency table of the pixel labels, just as we did with\n", "the spam prediction labels.\n", "The only difference is that the label arrays are 2-dimensional, instead of\n", "the 1D arrays of predictions.\n", "In fact, this doesn't matter:\n", "remember that numpy arrays are actually linear (1D) chunks of data with some\n", "shape and other metadata attached.\n", "As we mentioned before, we can ignore the shape by using the arrays' `.ravel()` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "S.ravel()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can just make the contingency table in the same way as when we were\n", "predicting spam:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cont = sparse.coo_matrix((np.broadcast_to(1., S.size),\n", " (S.ravel(), T.ravel())))\n", "cont = cont.toarray()\n", "cont" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to make this a table of probabilities, instead of counts, we simply\n", "divide by the total number of pixels:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cont /= np.sum(cont)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can use this table to compute the probabilities of labels in *either*\n", "`S` or `T`, using the axis-wise sums:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p_S = np.sum(cont, axis=1)\n", "p_T = np.sum(cont, axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a small kink in writing Python code to compute entropy:\n", "although $0 \\log(0)$ is defined to be equal to 0, in Python, it is undefined,\n", "and results in a `nan` (not a number) value:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('The log of 0 is: ', np.log2(0))\n", "print('0 times the log of 0 is: ', 0 * np.log2(0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Therefore, we have to use numpy indexing to mask out the 0 values.\n", "Additionally, we'll need a slightly different strategy depending on whether the\n", "input is a numpy array or a SciPy sparse matrix.\n", "We'll write the following convenience function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def xlog1x(arr_or_mat):\n", " \"\"\"Compute the element-wise entropy function of an array or matrix.\n", "\n", " Parameters\n", " ----------\n", " arr_or_mat : numpy array or scipy sparse matrix\n", " The input array of probabilities. Only sparse matrix formats with a\n", " `data` attribute are supported.\n", "\n", " Returns\n", " -------\n", " out : array or sparse matrix, same type as input\n", " The resulting array. Zero entries in the input remain as zero,\n", " all other entries are multiplied by the log (base 2) of their\n", " inverse.\n", " \"\"\"\n", " out = arr_or_mat.copy()\n", " if isinstance(out, sparse.spmatrix):\n", " arr = out.data\n", " else:\n", " arr = out\n", " nz = np.nonzero(arr)\n", " arr[nz] *= -np.log2(arr[nz])\n", " return out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make sure it works:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = np.array([0.25, 0.25, 0, 0.25, 0.25])\n", "xlog1x(a)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mat = sparse.csr_matrix([[0.125, 0.125, 0.25, 0],\n", " [0.125, 0.125, 0, 0.25]])\n", "xlog1x(mat).A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, the conditional entropy of $S$ given $T$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "H_ST = np.sum(np.sum(xlog1x(cont / p_T), axis=0) * p_T)\n", "H_ST" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the converse:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "H_TS = np.sum(np.sum(xlog1x(cont / p_S[:, np.newaxis]), axis=1) * p_S)\n", "H_TS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Converting NumPy array code to use sparse matrices\n", "\n", "We used numpy arrays and broadcasting in the above examples, which, as we've\n", "seen many times, is a powerful way to analyze data in Python.\n", "However, for segmentations of complex images, possibly containing thousands of\n", "segments, it rapidly becomes inefficient.\n", "We can instead use `sparse` throughout the calculation, and recast some of the\n", "NumPy magic as linear algebra operations.\n", "This was\n", "[suggested](http://stackoverflow.com/questions/16043299/substitute-for-numpy-broadcasting-using-scipy-sparse-csc-matrix)\n", "to us by Warren Weckesser on StackOverflow.\n", "\n", "The linear algebra version efficiently computes a contingency matrix for very\n", "large amounts of data, up to billions of points, and is elegantly concise." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from scipy import sparse\n", "\n", "\n", "def invert_nonzero(arr):\n", " arr_inv = arr.copy()\n", " nz = np.nonzero(arr)\n", " arr_inv[nz] = 1 / arr[nz]\n", " return arr_inv\n", "\n", "\n", "def variation_of_information(x, y):\n", " # compute contingency matrix, aka joint probability matrix\n", " n = x.size\n", " Pxy = sparse.coo_matrix((np.full(n, 1/n), (x.ravel(), y.ravel())),\n", " dtype=float).tocsr()\n", "\n", " # compute marginal probabilities, converting to 1D array\n", " px = np.ravel(Pxy.sum(axis=1))\n", " py = np.ravel(Pxy.sum(axis=0))\n", "\n", " # use sparse matrix linear algebra to compute VI\n", " # first, compute the inverse diagonal matrices\n", " Px_inv = sparse.diags(invert_nonzero(px))\n", " Py_inv = sparse.diags(invert_nonzero(py))\n", "\n", " # then, compute the entropies\n", " hygx = px @ xlog1x(Px_inv @ Pxy).sum(axis=1)\n", " hxgy = xlog1x(Pxy @ Py_inv).sum(axis=0) @ py\n", "\n", " # return the sum of these\n", " return float(hygx + hxgy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check that this gives the right value (1) for the VI of our toy `S`\n", "and `T`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "variation_of_information(S, T)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see how we use three types of sparse matrices (COO, CSR, and diagonal)\n", "to efficiently solve the entropy calculation in the case of sparse contingency\n", "matrices, where NumPy would be inefficient.\n", "(Indeed, this whole approach was inspired by a Python `MemoryError`!)\n", "\n", "## Using variation of information\n", "\n", "To finish, let's demonstrate the use of VI to estimate the best possible\n", "automated segmentation of an image.\n", "You may remember our friendly stalking tiger from chapter 3.\n", "(If you don't, you might want to work on your threat-assessment skills!)\n", "Using our skills from chapter 3, we're going to generate a number of possible ways of segmenting the tiger image, and then figure out the best one." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from skimage import io\n", "\n", "url = ('http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds'\n", " '/BSDS300/html/images/plain/normal/color/108073.jpg')\n", "tiger = io.imread(url)\n", "\n", "plt.imshow(tiger);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "In order to check our image segmentation, we're going to need some ground truth.\n", "It turns out that humans are awesome at detecting tigers (natural selection for the win!), so all we need to do is ask a human to find the tiger.\n", "Luckily, researchers at Berkeley have already asked dozens of humans to look at this image and manually segment it [^bsds].\n", "Let's grab one of the segmentation images from the [Berkeley Segmentation Dataset and Benchmark](https://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/).\n", "It's worth noting that there is quite substantial variation between the segmentations performed by humans.\n", "If you look through the [various tiger segmentations](https://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/BSDS300/html/dataset/images/color/108073.html), you will see that some humans are more pedantic than others about tracing around the reeds, while others consider the reflections to be objects worth segmenting out from the rest of the water.\n", "We have chosen a segmentation that we like (one with pedantic-reed-tracing, because we are perfectionistic scientist-types.)\n", "But to be clear, we really have no single ground truth!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy import ndimage as ndi\n", "from skimage import color\n", "\n", "human_seg_url = ('http://www.eecs.berkeley.edu/Research/Projects/CS/'\n", " 'vision/bsds/BSDS300/html/images/human/normal/'\n", " 'outline/color/1122/108073.jpg')\n", "boundaries = io.imread(human_seg_url)\n", "plt.imshow(boundaries);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "Overlaying the tiger image with the human segmentation, we can see that (unsurprisingly) this person does a pretty good job of finding the tiger.\n", "They have also segmented out the river bank, and a tuft of reeds.\n", "Nice job, human #1122!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "human_seg = ndi.label(boundaries > 100)[0]\n", "plt.imshow(color.label2rgb(human_seg, tiger));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "Now, let's grab our image segmentation code from chapter 3, and see how well a Python does at recognizing a tiger!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Draw a region adjacency graph (RAG) - all code from Ch3\n", "import networkx as nx\n", "import numpy as np\n", "from skimage.future import graph\n", "\n", "def add_edge_filter(values, graph):\n", " current = values[0]\n", " neighbors = values[1:]\n", " for neighbor in neighbors:\n", " graph.add_edge(current, neighbor)\n", " return 0. # generic_filter requires a return value, which we ignore!\n", "\n", "def build_rag(labels, image):\n", " g = nx.Graph()\n", " footprint = ndi.generate_binary_structure(labels.ndim, connectivity=1)\n", " for j in range(labels.ndim):\n", " fp = np.swapaxes(footprint, j, 0)\n", " fp[0, ...] = 0 # zero out top of footprint on each axis\n", " _ = ndi.generic_filter(labels, add_edge_filter, footprint=footprint,\n", " mode='nearest', extra_arguments=(g,))\n", " for n in g:\n", " g.nodes[n]['total color'] = np.zeros(3, np.double)\n", " g.nodes[n]['pixel count'] = 0\n", " for index in np.ndindex(labels.shape):\n", " n = labels[index]\n", " g.nodes[n]['total color'] += image[index]\n", " g.nodes[n]['pixel count'] += 1\n", " return g\n", "\n", "def threshold_graph(g, t):\n", " to_remove = [(u, v) for (u, v, d) in g.edges(data=True)\n", " if d['weight'] > t]\n", " g.remove_edges_from(to_remove)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Baseline segmentation\n", "from skimage import segmentation\n", "seg = segmentation.slic(tiger, n_segments=30, compactness=40.0,\n", " enforce_connectivity=True, sigma=3)\n", "plt.imshow(color.label2rgb(seg, tiger));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "In chapter 3, we set the graph threshold at 80 and sort of hand-waved over the whole thing.\n", "Now we're going to have a closer look at how this threshold impacts our segmentation accuracy.\n", "Let's pop the segmentation code into a function so we can play with it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def rag_segmentation(base_seg, image, threshold=80):\n", " g = build_rag(base_seg, image)\n", " for n in g:\n", " node = g.nodes[n]\n", " node['mean'] = node['total color'] / node['pixel count']\n", " for u, v in g.edges():\n", " d = g.nodes[u]['mean'] - g.nodes[v]['mean']\n", " g[u][v]['weight'] = np.linalg.norm(d)\n", "\n", " threshold_graph(g, threshold)\n", "\n", " map_array = np.zeros(np.max(seg) + 1, int)\n", " for i, segment in enumerate(nx.connected_components(g)):\n", " for initial in segment:\n", " map_array[int(initial)] = i\n", " segmented = map_array[seg]\n", " return(segmented)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try a few thresholds and see what happens:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "auto_seg_10 = rag_segmentation(seg, tiger, threshold=10)\n", "plt.imshow(color.label2rgb(auto_seg_10, tiger));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "auto_seg_40 = rag_segmentation(seg, tiger, threshold=40)\n", "plt.imshow(color.label2rgb(auto_seg_40, tiger));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Actually, in chapter 3 we did the segmentation a bunch of times with different thresholds and then (because we're human, so we can) picked one that produced a good segmentation.\n", "This is a completely unsatisfying way to program image segmentation.\n", "Clearly, we need a way to automate this.\n", "\n", "We can see that the higher threshold seems to producing a better segmentation.\n", "But we have a ground truth, so we can actually put a number to this!\n", "Using all our sparse matrix skills, we can calculate the *variation of information* or VI for each segmentation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "variation_of_information(auto_seg_10, human_seg)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "variation_of_information(auto_seg_40, human_seg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The high threshold has a smaller variation of information, so it's a better segmentation!\n", "Now we can calculate the VI for a range of possible thresholds and see which one gives us closes segmentation to the human ground truth." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Try many thresholds\n", "def vi_at_threshold(seg, tiger, human_seg, threshold):\n", " auto_seg = rag_segmentation(seg, tiger, threshold)\n", " return variation_of_information(auto_seg, human_seg)\n", "\n", "thresholds = range(0, 110, 10)\n", "vi_per_threshold = [vi_at_threshold(seg, tiger, human_seg, threshold)\n", " for threshold in thresholds]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.plot(thresholds, vi_per_threshold);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Unsurprisingly, it turns out that eyeballing it and picking threshold=80, did give us one of the best segmentations.\n", "But now we have a way to automate this process for any image!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "auto_seg = rag_segmentation(seg, tiger, threshold=80)\n", "plt.imshow(color.label2rgb(auto_seg, tiger));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "**Exercise:** Segmentation in practice\n", "\n", "Try finding the best threshold for a selection of other images from the [Berkeley Segmentation Dataset and Benchmark](https://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/) [^bsds].\n", "Using the mean or median of those thresholds, then go and segment a new image. Did you get a reasonable segmentation?\n", "\n", "\n", "\n", "Sparse matrices are an efficient way of representing data with many gaps – a\n", "situation that occurs surprisingly often. After reading this chapter, you'll\n", "probably start noticing opportunities to use them all the time... And you'll\n", "know how.\n", "\n", "One particular situation where sparse matrices come extremely handy is in\n", "sparse linear algebra. Read on to the next chapter to find out more!\n", "\n", "[^bsds]: P. Arbelaez, M. Maire, C. Fowlkes and J. Malik. IEEE TPAMI, Vol. 33, No. 5, pp. 898-916, May 2011." ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 4 }