{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Errors, or bugs, in your software\n", "\n", "Today we'll cover dealing with errors in your Python code, an important aspect of writing software.\n", "\n", "#### What is a software bug?\n", "\n", "According to [Wikipedia](https://en.wikipedia.org/wiki/Software_bug) (accessed 16 Oct 2018), a software bug is an error, flaw, failure, or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or behave in unintended ways.\n", "\n", "#### Where did the terminology come from?\n", "\n", "Engineers have used the term well before electronic computers and software. Sometimes Thomas Edison is credited with the first recorded use of bug in that fashion. [[Wikipedia](https://en.wikipedia.org/wiki/Software_bug#Etymology)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### If incorrect code is never executed, is it a bug?\n", "\n", "This is the software equivalent to \"If a tree falls and no one hears it, does it make a sound?\". " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Three classes of bugs\n", "\n", "Let's discuss three major types of bugs in your code, from easiest to most difficult to diagnose:\n", "\n", "1. **Syntax errors:** Errors where the code is not written in a valid way. (Generally easiest to fix.)\n", "1. **Runtime errors:** Errors where code is syntactically valid, but fails to execute. Often throwing exceptions here. (Sometimes easy to fix, harder when in other's code.)\n", "1. **Semantic errors:** Errors where code is syntactically valid, but contain errors in logic. (Can be difficult to fix.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Syntax errors" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print \"This should only work in 2.X NOT used in this class.\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"This should only work in 3.x used in this class.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"This should only work in 3.x used in this class.\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "hide" ] }, "source": [ "## INSTRUCTOR NOTE:\n", "\n", "1. Run as-is. Run. Error. Returns `SyntaxError: Missing parentheses in call to print.`\n", "1. Add parentheses. Run. Still an error. Returns `SyntaxError: EOL while scanning string literal`.\n", "1. Add closing quotation mark. Run. Should be successful." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Runtime errors" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# invalid operation\n", "a = 0\n", "5/a # Division by zero" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# invalid operation\n", "input = \"40\"\n", "input/11 # Incompatiable types for the operation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "while True:\n", " pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Semantic errors\n", "\n", "Say we're trying to confirm that a trigonometric identity holds. Let's use the basic relationship between sine and cosine, given by the Pythagorean identity\n", "\n", "$$\n", "\\sin^2 \\theta + \\cos^2 \\theta = 1\n", "$$\n", "\n", "We can write a function to check this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "def pythagorean_identity(theta):\n", " return math.sin(theta)**2 + math.cos(theta)**4" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pythagorean_identity(.4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to find and resolve bugs?\n", "\n", "Debugging has the following steps:\n", "\n", "1. **Detection** of an exception or invalid results. \n", "2. **Isolation** of where the program causes the error. This is often the most difficult step.\n", "3. **Resolution** of how to change the code to eliminate the error. Mostly, it's not too bad, but sometimes this can cause major revisions in codes.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Detection of Bugs\n", "\n", "The detection of bugs is too often done by chance. While running your Python code, you encounter unexpected functionality, exceptions, or syntax errors. While we'll focus on this in today's lecture, you should never leave this up to chance in the future.\n", "\n", "Software testing practices allow for thoughtful detection of bugs in software. We'll discuss more in the lecture on testing." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Isolation of Bugs\n", "\n", "There are three main methods commonly used for bug isolation:\n", "1. The \"thought\" method. Think about how your code is structured and so what part of your could would most likely lead to the exception or invalid result.\n", "2. Inserting ``print`` statements (or other logging techniques)\n", "3. Using a line-by-line debugger like ``pdb``.\n", "\n", "Typically, all three are used in combination, often repeatedly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using `print` statements\n", "\n", "Say we're trying to compute the **entropy** of a set of probabilities. The\n", "form of the equation is\n", "\n", "$$\n", "H = -\\sum_i p_i \\log(p_i)\n", "$$\n", "\n", "The choice of base for the logarithm varies for different applications. Base 2 gives the unit of bits, while base `e` gives nats, and base 10 gives units of \"dits\". \n", "\n", "We can write the function like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def entropy(p):\n", " items = []\n", " for p_i in p:\n", " interm = p_i * np.log2(p_i)\n", " items.append(interm)\n", " return -np.sum(items)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Imagine if we have the nucleotides in DNA: A, T, C & G for adenine, thymine, cytosine, and guanine, respectively. If we have a given DNA sequence, e.g. `ATCGATCG` we can compute the observed probability for each nucleotide in the sequence as:\n", "\n", "| Nucleotide | Occurrences | P(Nucleotide) |\n", "|------------|-------------|---------------|\n", "| A | 2 | .25 |\n", "| T | 2 | .25 |\n", "| C | 2 | .25 |\n", "| G | 2 | .25 |\n", "\n", "Thus, we have 4 states that are all equally likely, we can compute the entropy with the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "entropy([.25, .25, .25, .25])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To reinforce this concept, let's imagine we want to encode the nucleotides A, T, C, G as bits. Without any knowledge of nucleotide frequency, i.e. assuming they are all equally likely, we could define a bitwise representation of the nucleotides as:\n", "\n", "| Nucleotide | Bit Representation |\n", "|------------|--------------------|\n", "| A | `00` |\n", "| T | `01` |\n", "| C | `10` |\n", "| G | `11` |\n", "\n", "We can see that it takes two bits to represent the nucleotides uniquely. This is exactly what the `entropy` function predicts for the representation given they are all equality, e.g. `entropy([.25, .25, .25, .25])`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "But there are some problems with this function. Namely, nothing prevents the user from giving it the following inputs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "entropy([.5, .5, .5, .5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that the inputs to the function `entropy` are supposed to be probabilities. The sum of probabilities of all states cannot exceed 1. Why does the function let the user pass in probabilities that don't sum to 1?\n", "\n", "Let's start by adding in some print statements to help us look at what is happening." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def entropy(p):\n", " print(f\"p: {p}\")\n", " items = []\n", " for p_i in p:\n", " print(f\"p_i: {p_i}\")\n", " interm = p_i * np.log2(p_i)\n", " print(f\"interm: {interm}\")\n", " items.append(interm)\n", " print(f\"items: {items}\")\n", " return -np.sum(items)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "entropy([0.5, 1.2])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "entropy([0.1, 0.7, 0.2])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def entropy(p):\n", " \"\"\"\n", " arg p: list of float\n", " \"\"\"\n", " # begin by checking all inputs are between 0 and 1 inclusive\n", " check1 = []\n", " for ele in p:\n", " check1.append((ele <= 1) and (ele >= 0))\n", " else:\n", " pass\n", " # verify the sum of the probabilities is 1\n", " # note, the use of np.isclose is correct but the following \n", " # may not return the expected result\n", " # check2 = 1 == np.sum(p)\n", " check2 = np.isclose(1, np.sum(p), atol=1e-08)\n", " if all(check1) and check2:\n", " items = p * np.log2(p)\n", " return -np.sum(items)\n", " else:\n", " return -1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all([True, True, True])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all([False, True, True])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.isclose(1, .9999)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.isclose(1, .999999)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.isclose(1, .9999, atol=1e-03)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(entropy([0.5, 0.5]))\n", "print(entropy([0.25, 0.25, 0.25, 0.25]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "entropy([1.2, 0, .5])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we can't easily see the bug here, let's add print statements to see the variables change over time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## INSTRUCTOR NOTE:\n", "\n", "1. Add print statements in tiered way, starting with simple print statements.\n", "1. Point out that may need slight refactor on result.\n", "\n", " ```python\n", " def entropy(p):\n", " print(p)\n", " items = p * np.log(p)\n", " print(items)\n", " result = -np.sum(items)\n", " print(result)\n", " return result\n", " ```\n", "\n", "1. Show complication of reading multiple print statements without labels.\n", "1. Add labels so code looks like below.\n", "\n", " ```python\n", " def entropy(p):\n", " print(\"p=%s\" % p)\n", " items = p * np.log(p)\n", " print(\"items=%s\" % items)\n", " result = -np.sum(items)\n", " print(\"result=%s\" % result)\n", " return result\n", " ```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the print statements significantly reduce legibility of the code. We would like to remove them when we're done debugging." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it works fine for the first set of inputs. Let's try other inputs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "entropy([0.5, 0, .5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We should have documented the inputs to the function!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a vector of probabilities.\n", "p = np.arange(start=5., stop=-1., step=-0.5)\n", "p /= np.sum(p)\n", "p" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "entropy(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get ``nan``, which stands for \"Not a Number\". What's going on here?\n", "\n", "Let's add our print statements again, but it only fails later in the range of numbers. We may choose to print only if we find a `nan`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "def entropy1(p):\n", " print(\"p=%s\" % str(p))\n", " items = p * np.log2(p)\n", " if [np.isnan(el) for el in items]:\n", " print(items)\n", " else:\n", " pass\n", " return -np.sum(items)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "entropy1([.1, .2])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "entropy1(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By printing some of the intermediate items, we see the problem: 0 * np.log(0) is resulting in a NaN. Though mathematically it's true that limx→0[xlog(x)]=0limx→0[xlog⁡(x)]=0, the fact that we're performing the computation numerically means that we don't obtain this result.\n", "\n", "Often, inserting a few print statements can be enough to figure out what's going on.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def entropy2(p):\n", " p = np.asarray(p) # convert p to array if necessary\n", " print(p)\n", " items = []\n", " for val in p:\n", " item = val * np.log2(val)\n", " if np.isnan(item):\n", " print(\"%f makes a nan\" % val)\n", " else:\n", " pass\n", " items.append(item)\n", " #items = p * np.log(ps)\n", " return -np.sum(items)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "entropy2(p)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Python's debugger, `pdb`\n", "\n", "Python comes with a built-in debugger called [pdb](http://docs.python.org/2/library/pdb.html). It allows you to step line-by-line through a computation and examine what's happening at each step. Note that this should probably be your last resort in tracing down a bug. I've probably used it a dozen times or so in five years of coding. But it can be a useful tool to have in your toolbelt.\n", "\n", "You can use the debugger by inserting the line\n", "``` python\n", "import pdb; pdb.set_trace()\n", "```\n", "within your script. To leave the debugger, type \"exit()\". To see the commands you can use, type \"help\".\n", "\n", "Let's try this out:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def entropy(p):\n", " items = p * np.log2(p)\n", " import pdb; pdb.set_trace()\n", " return -np.sum(items)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This can be a more convenient way to debug programs and step through the actual execution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p = [.1, -.2, .3]\n", "entropy(p)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p = \"[0.1, 0.3, 0.5, 0.7, 0.9]\"\n", "entropy(p)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.1" } }, "nbformat": 4, "nbformat_minor": 1 }