{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Basic Programming With Python: Back to the Command Line" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Objectives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "FIXME" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Lesson" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The IPython Notebook and other interactive tools are great for prototyping code and exploring data,\n", "but sooner or later we will want to use our program in a pipeline\n", "or run it in a shell script to process thousands of data files.\n", "In order to do that,\n", "we need to make it work like other Unix command-line tools." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat fractal_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "000100000\r\n", "001100110\r\n", "000111100\r\n", "001110000\r\n", "001100000\r\n", "001110000\r\n" ] } ], "prompt_number": 46 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to calculate statistics on these fractals;\n", "more specifically,\n", "we want a program that will read one or more files\n", "and report the average density of each row.\n", "For the file above,\n", "this might be:\n", "\n", " $ fracdens file_1.txt\n", " 0.25\n", " 0.5\n", " 0.5\n", " 0.325\n", " 0.25\n", " 0.325\n", "\n", "but we might also want to look at the density of the first four lines\n", "\n", " head -4 file_1.txt | fracdens\n", "\n", "or the densities of several files one after another:\n", "\n", " fracdens file_1.txt file_2.txt\n", "\n", "or merge densities of several files of the same size:\n", "\n", " fracdens -m file_1.txt file_2.txt file_3.txt\n", "\n", "Our overall requirements are:\n", "\n", "1. 
If no filename is given on the command line, read data from [standard input](glossary.html#standard_input).\n", "2. If one or more filenames are given, read data from them and report statistics for each file separately.\n", "3. If the -m flag is given, merge the data for several files (i.e., calculate the average density per line across those files).\n", "\n", "To make this work,\n", "we need to know how to handle command-line arguments in a program,\n", "and how to get at standard input.\n", "We'll tackle these questions in turn below." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Command-Line Arguments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the text editor of your choice,\n", "save the following in a text file:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat sys_version.py" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "import sys\r\n", "print 'version is', sys.version\r\n" ] } ], "prompt_number": 47 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first line imports a library called `sys`,\n", "which is short for \"system\".\n", "It defines values such as `sys.version`,\n", "which describes which version of Python we are running.\n", "We can run this script from within the IPython Notebook like this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run sys_version.py" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "version is 2.7.5 |Anaconda 1.6.1 (x86_64)| (default, Jun 28 2013, 22:20:13) \n", "[GCC 4.0.1 (Apple Inc. 
build 5493)]\n" ] } ], "prompt_number": 48 }, { "cell_type": "markdown", "metadata": {}, "source": [ "or like this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!ipython sys_version.py" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "version is 2.7.5 |Anaconda 1.6.1 (x86_64)| (default, Jun 28 2013, 22:20:13) \r\n", "[GCC 4.0.1 (Apple Inc. build 5493)]\r\n" ] } ], "prompt_number": 49 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first method, `%run`,\n", "uses a special command in the IPython Notebook to run a program in a `.py` file.\n", "The second method is more general:\n", "the exclamation mark `!` tells the Notebook to run a shell command,\n", "and it just so happens that the command we run is `ipython` with the name of the script." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's another script that does something more interesting:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat argv_list.py" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "import sys\r\n", "print 'sys.argv is', sys.argv\r\n" ] } ], "prompt_number": 50 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The strange name `argv` stands for \"argument values\".\n", "Whenever Python runs a program,\n", "it takes all of the values given on the command line\n", "and puts them in the list `sys.argv`\n", "so that the program can determine what they were.\n", "If we run this program with no arguments:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!ipython argv_list.py" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "sys.argv is ['/Users/gwilson/bc/lessons/swc-python/argv_list.py']\r\n" ] } ], "prompt_number": 51 }, { "cell_type": "markdown", "metadata": {}, "source": [ "the only thing in the list is the full path to our 
script,\n", "which is always `sys.argv[0]`.\n", "If we run it with a few arguments, however:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!ipython argv_list.py first second third" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "sys.argv is ['/Users/gwilson/bc/lessons/swc-python/argv_list.py', 'first', 'second', 'third']\r\n" ] } ], "prompt_number": 52 }, { "cell_type": "markdown", "metadata": {}, "source": [ "then Python adds each of those arguments to that magic list.\n", "And if we use a wildcard:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!ipython argv_list.py fractal_*.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "sys.argv is ['/Users/gwilson/bc/lessons/swc-python/argv_list.py', 'fractal_1.txt', 'fractal_2.txt', 'fractal_3.txt']\r\n" ] } ], "prompt_number": 53 }, { "cell_type": "markdown", "metadata": {}, "source": [ "then the shell expands it *before* calling our script,\n", "so that `sys.argv` holds the complete list of arguments,\n", "rather than the string containing the wildcard.\n", "Note,\n", "by the way,\n", "that the `%run` magic does almost the same thing—the\n", "only difference is that we get the relative path to the script\n", "instead of the absolute path as `sys.argv[0]`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run argv_list.py fractal_*.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "sys.argv is ['argv_list.py', 'fractal_1.txt', 'fractal_2.txt', 'fractal_3.txt']\n" ] } ], "prompt_number": 54 }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this in hand,\n", "let's build a version of `fracdens` that processes one or more files independently of each other\n", "(i.e.,\n", "that doesn't look for a `-m` flag,\n", "and doesn't read from standard input).\n", "The first step is to 
write a `main` function that outlines our implementation,\n", "and a placeholder for the function that does the actual work:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat fracdens_1.py" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "def main():\r\n", "    script = sys.argv[0]\r\n", "    filenames = sys.argv[1:]\r\n", "    for f in filenames:\r\n", "        process(f)\r\n", "\r\n", "def process(filename):\r\n", "    print filename\r\n" ] } ], "prompt_number": 55 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `main` function gets the name of the script from `sys.argv[0]`,\n", "because that's where it's always put,\n", "and the list of files to be processed from `sys.argv[1:]`.\n", "The colon inside the brackets is important:\n", "the expression `list[1:]` means,\n", "\"All the elements of the list from index 1 to the end.\"\n", "Here's a simple test:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run fracdens_1.py fractal_1.txt" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 56 }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no output because we have defined two functions,\n", "but haven't actually called either of them.\n", "Let's add a call to `main`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat fracdens_2.py" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "def main():\r\n", "    script = sys.argv[0]\r\n", "    filenames = sys.argv[1:]\r\n", "    for f in filenames:\r\n", "        process(f)\r\n", "\r\n", "def process(filename):\r\n", "    print filename\r\n", "\r\n", "# Run the program.\r\n", "main()\r\n" ] } ], "prompt_number": 57 }, { "cell_type": "markdown", "metadata": {}, "source": [ "and run that:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run fracdens_2.py fractal_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "ename": 
"NameError", "evalue": "global name 'sys' is not defined", "output_type": "pyerr", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/Users/gwilson/anaconda/lib/python2.7/site-packages/IPython/utils/py3compat.pyc\u001b[0m in \u001b[0;36mexecfile\u001b[0;34m(fname, *where)\u001b[0m\n\u001b[1;32m 202\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 203\u001b[0m \u001b[0mfilename\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfname\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 204\u001b[0;31m \u001b[0m__builtin__\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexecfile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilename\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/Users/gwilson/bc/lessons/swc-python/fracdens_2.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0;31m# Run the program.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 11\u001b[0;31m \u001b[0mmain\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/Users/gwilson/bc/lessons/swc-python/fracdens_2.py\u001b[0m in \u001b[0;36mmain\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mmain\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mscript\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0margv\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0mfilenames\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0msys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0margv\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mf\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mfilenames\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mprocess\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mf\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mNameError\u001b[0m: global name 'sys' is not defined" ] } ], "prompt_number": 58 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Oops:\n", "we have imported `sys` in this notebook,\n", "but we haven't imported it in our script,\n", "and `%run` executes the script in a fresh namespace of its own.\n", "Let's make one more change:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat fracdens_3.py" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "import sys\r\n", "\r\n", "def main():\r\n", "    script = sys.argv[0]\r\n", "    filenames = sys.argv[1:]\r\n", "    for f in filenames:\r\n", "        process(f)\r\n", "\r\n", "def process(filename):\r\n", "    print filename\r\n", "\r\n", "# Run the program.\r\n", "main()\r\n" ] } ], "prompt_number": 59 }, { "cell_type": "markdown", "metadata": {}, "source": [ "and run that:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run fracdens_3.py fractal_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "fractal_1.txt\n" ] } ], "prompt_number": 60 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Success!\n", "Now,\n", "what if we run it with several filenames?" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "%run fracdens_3.py fractal_*.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "fractal_1.txt\n", "fractal_2.txt\n", "fractal_3.txt\n" ] } ], "prompt_number": 61 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good:\n", "we appear to be getting the filenames correctly." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Handling Command-Line Flags" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to teach our program to handle the `-m` flag\n", "that tells it to merge data from all the files.\n", "By convention,\n", "flags usually appear before lists of filenames,\n", "so we could simply do this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def handle_args():\n", "    script = sys.argv[0]\n", "    if sys.argv[1] == '-m':\n", "        merge_data = True\n", "        filenames = sys.argv[2:]\n", "    else:\n", "        merge_data = False\n", "        filenames = sys.argv[1:]\n", "    return script, merge_data, filenames" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 62 }, { "cell_type": "markdown", "metadata": {}, "source": [ "But there are at least three things wrong with this approach:\n", "\n", "1.  It doesn't scale:\n", "    if our program eventually takes several arguments\n", "    (e.g., to control what statistics are calculated),\n", "    we're going to have a lot of branches in that `if` statement.\n", "\n", "2.  It contains a bug:\n", "    if we don't provide any arguments or filenames,\n", "    the attempt on line 3 to check `sys.argv[1]` will fail with an `IndexError`.\n", "    This means that:\n", "\n", "        fracdens < fractal_221.txt\n", "\n", "    will blow up instead of printing statistics for `fractal_221.txt`\n", "    (which the program is reading from standard input).\n", "\n", "3.  
It's hard to test.\n", "    As written,\n", "    `handle_args` always gets data from `sys.argv`,\n", "    which means we can't easily write unit tests for it using something like Ears.\n", "    What we really ought to do is pass in the list of strings to be processed,\n", "    so that we can write unit tests,\n", "    and then call `handle_args` with `sys.argv` as an argument from `main`,\n", "    and with other lists of strings as arguments from tests." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a better version of `handle_args` that uses Python's `optparse` library to handle arguments:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from optparse import OptionParser\n", "\n", "def handle_args(args):\n", "    script, rest = args[0], args[1:]\n", "    parser = OptionParser()\n", "    parser.add_option('-m', '--merge', dest='merge', help='Merge data from all files',\n", "                      default=False, action='store_true')\n", "    options, args = parser.parse_args(args=rest)\n", "    return script, options, args" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 63 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `optparse` library defines a tool called `OptionParser`,\n", "which knows how to handle complex arrangements of command-line parameters.\n", "Line 5 of the code above creates one of these;\n", "lines 6 and 7 then tell it that our command line may contain one flag that:\n", "\n", "* has a short form `-m` and a long form `--merge` (which are interchangeable);\n", "* is stored in the property called `merge` of the `options` object;\n", "* is false by default, but should be set to true if the flag is present; and\n", "* tells the program to merge data from all files.\n", "\n", "Let's try it out:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "script, flags, filenames = handle_args(['fracdens', 'fractal_1.txt'])\n", "print 'script name is', script\n", "print 'flags.merge is', flags.merge\n", "print 'filenames are', filenames" ], "language": 
"python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "script name is fracdens\n", "flags.merge is False\n", "filenames are ['fractal_1.txt']\n" ] } ], "prompt_number": 64 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And again:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "script, flags, filenames = handle_args(['fracdens', '-m', 'fractal_1.txt', 'fractal_2.txt'])\n", "print 'script name is', script\n", "print 'flags.merge is', flags.merge\n", "print 'filenames are', filenames" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "script name is fracdens\n", "flags.merge is True\n", "filenames are ['fractal_1.txt', 'fractal_2.txt']\n" ] } ], "prompt_number": 65 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course,\n", "the right thing to do here is to write a few unit tests:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import ears\n", "\n", "def test_parse_no_args():\n", "    script, flags, filenames = handle_args(['fracdens'])\n", "    assert script == 'fracdens'\n", "    assert not flags.merge\n", "    assert filenames == []\n", "\n", "def test_parse_one_filename():\n", "    script, flags, filenames = handle_args(['fracdens', 'fractal_1.txt'])\n", "    assert script == 'fracdens'\n", "    assert not flags.merge\n", "    assert filenames == ['fractal_1.txt']\n", "\n", "def test_parse_just_merge():\n", "    script, flags, filenames = handle_args(['fracdens', '-m'])\n", "    assert script == 'fracdens'\n", "    assert flags.merge\n", "    assert filenames == []\n", "\n", "def test_parse_merge_multiple():\n", "    script, flags, filenames = handle_args(['fracdens', '-m', 'fractal_1.txt', 'fractal_2.txt'])\n", "    assert script == 'fracdens'\n", "    assert flags.merge\n", "    assert len(filenames) == 2\n", "\n", "def test_flag_after_filenames():\n", "    # optparse accepts flags after filenames by default\n", "    script, flags, filenames = handle_args(['fracdens', 'fractal_1.txt', '-m'])\n", "    assert flags.merge\n", "    assert filenames == ['fractal_1.txt']\n", "\n", 
"def test_unknown_flag():\n", "    try:\n", "        script, flags, filenames = handle_args(['fracdens', '-X', 'fractal_1.txt'])\n", "        assert False, 'Should have had an exception'\n", "    except SystemExit:\n", "        pass # exception as expected\n", "\n", "ears.run()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "......\n", "6 pass, 0 fail, 0 error\n" ] }, { "output_type": "stream", "stream": "stderr", "text": [ "Usage: -c [options]\n", "\n", "-c: error: no such option: -X\n" ] } ], "prompt_number": 66 }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of our tests pass,\n", "though `OptionParser` does produce an error message during one of them:\n", "it reports an unknown flag by printing a usage message and calling `sys.exit`,\n", "which raises the `SystemExit` exception that `test_unknown_flag` expects." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "### *Testing Error Handling*\n", "\n", "\n", "A warning light that doesn't work is worse than no warning light at all,\n", "since it gives people a false sense of security.\n", "Similarly,\n", "error handling that doesn't actually handle errors can fool programmers into thinking that\n", "X, Y, or Z **can't** be wrong\n", "when it actually is.\n", "Our unit tests therefore ought to check that the right exceptions are raised when they should be,\n", "and as the code above shows,\n", "there's a pattern for doing this.\n", "The `except` clause should name the exception we expect\n", "(written here as the placeholder `ExpectedError`),\n", "because a bare `except:` would also catch the `AssertionError` raised by `assert False`,\n", "and the test could then never fail:\n", "\n", "\n", "    try:\n", "        function_that_should_raise_exception()\n", "        assert False, 'Exception was not raised!'\n", "    except ExpectedError:\n", "        pass # do nothing because exception was raised correctly\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "### *Testing and Learning*\n", "\n", "\n", "In practice,\n", "we probably wouldn't write these unit tests if we were familiar with the `optparse` library.\n", "Since we're just introducing it,\n", "though,\n", "and are always looking for opportunities to show what unit testing looks like,\n", "we've included the tests.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's add `handle_args` to our program:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat fracdens_4.py" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "import sys\r\n", "from optparse import OptionParser\r\n", "\r\n", "def main():\r\n", "    script, flags, filenames = handle_args(sys.argv)\r\n", "    for f in filenames:\r\n", "        process(f)\r\n", "\r\n", "def handle_args(args):\r\n", "    script, rest = args[0], args[1:]\r\n", "    parser = OptionParser()\r\n", "    parser.add_option('-m', '--merge', dest='merge', help='Merge data from all files',\r\n", "                      default=False, action='store_true')\r\n", "    options, args = parser.parse_args(args=rest)\r\n", "    return script, options, args\r\n", "\r\n", "def process(filename):\r\n", "    print filename\r\n", "\r\n", "# Run the program.\r\n", "main()\r\n" ] } ], "prompt_number": 67 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This runs,\n", "but it doesn't take merging into account.\n", "If `flags.merge` is `True`,\n", "`process` should produce one set of statistics for all of our files\n", "instead of one set per file.\n", "We could handle this as a special case:\n", "\n", "    if flags.merge:\n", "        process_all_files(filenames)\n", "    else:\n", "        process_single_file(filenames[0])\n", "\n", "but there's a simpler way.\n", "If our program merges data,\n", "but only has one file to process,\n", "it will produce the statistics for just that file.\n", "We can therefore do this:\n", "\n", "    if flags.merge:\n", "        process_all_files(filenames)\n", "    else:\n", "        for f in filenames:\n", "            temp = [f]\n", "            process_all_files(temp)\n", "\n", "i.e.,\n", "create a list containing just one filename for each filename we have,\n", "and process those lists one by one.\n", "Let's simplify this a bit:\n", "\n", "    if flags.merge:\n", "        process(filenames)\n", "    else:\n", "        for f in filenames:\n", "            process([f])\n", "\n", "Now,\n", "what does `process` look 
like?\n", "\n", "    def process(filenames):\n", "        print ...\n", "\n", "Hm:\n", "what are we supposed to print?\n", "If `filenames` contains just one filename, we print that\n", "(since we're printing statistics for each file separately),\n", "while if it contains more than one,\n", "we print something like the word \"all\".\n", "\n", "That would work,\n", "but putting a test on the length of `filenames` in `process` feels like an awkward design.\n", "Instead,\n", "let's modify it so that the main program decides what should be printed:\n", "\n", "    if flags.merge:\n", "        process('all', filenames)\n", "    else:\n", "        for f in filenames:\n", "            process(f, [f])\n", "\n", "This feels better:\n", "the decision about processing files one by one or all together is made in `main`,\n", "and so is the decision about what to print in the output.\n", "Our placeholder version of `process` is then something like:\n", "\n", "    def process(title, filenames):\n", "        print title\n", "        print 'files:',\n", "        for f in filenames:\n", "            print f, # eventually replace this with real code\n", "        print # make sure there's a newline at the end\n", "\n", "Let's try running that:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run fracdens_5.py fractal_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "fractal_1.txt\n", "files: fractal_1.txt\n" ] } ], "prompt_number": 68 }, { "cell_type": "code", "collapsed": false, "input": [ "%run fracdens_5.py -m fractal_*.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "all\n", "files: fractal_1.txt fractal_2.txt fractal_3.txt\n" ] } ], "prompt_number": 69 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "### *Don't Do It This Way*\n", "\n", "\n", "We now have five Python files named `fracdens_1.py` to `fracdens_5.py`.\n", "You should **not** do this when you are writing code yourself:\n", "instead,\n", "you should use a version control system to manage the file's evolution,\n", "and commit it each time you add some useful feature.\n", "We can't do this because we want to display successive versions simultaneously in a notebook,\n", "but that's a rare case.\n", "\n", "
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Handling Standard Input" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next thing our program has to do is read data from standard input if no filenames are given\n", "so that we can put it in a pipeline,\n", "redirect input to it,\n", "and so on.\n", "Let's experiment in another script:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat count_stdin.py" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "import sys\r\n", "\r\n", "count = 0\r\n", "for line in sys.stdin:\r\n", " count += 1\r\n", "\r\n", "print '{0} lines in standard input'.format(count)\r\n" ] } ], "prompt_number": 70 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This little program reads lines from a special \"file\" called `sys.stdin`,\n", "which is actually the program's standard input.\n", "We don't have to open it—Python and the operating system\n", "automatically take care of that between themselves when the program is run—\n", "but we can do almost anything with it that we could do to a regular file.\n", "Let's try running it as if it were a regular command-line program:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!ipython count_stdin.py < fractal_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "6 lines in standard input\r\n" ] } ], "prompt_number": 71 }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we run it using `%run`?" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "%run count_stdin.py < fractal_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0 lines in standard input\n" ] } ], "prompt_number": 72 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see,\n", "`%run` doesn't understand file redirection:\n", "that's a shell thing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "### *No Input Takes a Long Time to Read*\n", "\n", "\n", "A common mistake is to try to run something that reads from standard input like this:\n", "\n", "    !ipython count_stdin.py fractal_1.txt\n", "\n", "i.e., to forget the `<` character that redirects the file to standard input.\n", "In this case,\n", "there's nothing in standard input,\n", "so the program waits at the start of the loop for someone to type something on the keyboard.\n", "Since there's no way for us to do this,\n", "our program is stuck,\n", "and we have to halt it using the `Interrupt` option from the `Kernel` menu in the Notebook.\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now need to rewrite `process` to handle standard input as well as files on disk,\n", "and to handle merging.\n", "There are three cases:\n", "\n", "* No filenames on the command line: process standard input as a single \"file\".\n", "* No merge flag: process each named file separately.\n", "* Merge flag provided: process all named files in a batch.\n", "\n", "This gives us the following main program (which we've called `fracdens_6.py`):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def main():\n", "    script, flags, filenames = handle_args(sys.argv)\n", "    if filenames == []:\n", "        process('stdin', None)\n", "    elif flags.merge:\n", "        process('all', filenames)\n", "    else:\n", "        for f in filenames:\n", "            process(f, [f])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 76 }, { "cell_type": "markdown", "metadata": {}, "source": [ "and this update to `process`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def process(title, filenames):\n", "    if filenames is None:\n", "        densities = calc_density(sys.stdin)\n", "    else:\n", "        for f in filenames:\n", "            with open(f, 'r') as source:\n", "                densities = calc_density(source)\n", "    display(title, densities)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 77 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This version of `process` uses `sys.stdin` as a data source if there aren't any filenames,\n", "and opens files one by one to create a readable source if there are.\n", "It then uses two functions `calc_density` and `display` to calculate densities and display the results.\n", "These ought to be straightforward to write,\n", "but before we dive into them,\n", "we have a bug to fix.\n", "\n", "Consider what happens if we run `process` with a list of filenames.\n", "We're supposed to merge all the data from them to find the overall average density,\n", "but what we're 
actually doing is calculating densities for each file separately,\n", "and then reporting the densities of the last file.\n", "Somehow,\n", "we need to accumulate statistics across all of our files.\n", "\n", "Here's one approach:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def process(title, filenames):\n", "    if filenames is None:\n", "        densities = calc_density(sys.stdin)\n", "    else:\n", "        with open(filenames[0], 'r') as source:\n", "            densities = calc_density(source)\n", "        for f in filenames[1:]:\n", "            with open(f, 'r') as source:\n", "                combine_densities(densities, source)\n", "    display(title, densities)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 78 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can (hopefully) guess from the names we've chosen for our functions,\n", "`calc_density` calculates densities for a single file,\n", "while `combine_densities` combines data from an open file with a running total of densities seen so far.\n", "The `else` branch of the function starts by getting the densities of the first data set,\n", "then uses `combine_densities` to add in the densities from all the other data sets.\n", "If there aren't any—i.e.,\n", "if `filenames[1:]` is the empty list—\n", "then `combine_densities` will never be called,\n", "and `densities` will just hold the densities from the first file." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is on the right track,\n", "but it doesn't quite work.\n", "The density of a line in our fractal is defined to be\n", "the number of filled cells in a line\n", "divided by the width of that line.\n", "Adding or averaging the densities from different files one by one\n", "isn't going to give us the right answer.\n", "Instead,\n", "we need to add up the number of filled cells per line across all our files,\n", "and divide by the total width—i.e., the number of files times their width—at the end.\n", "Let's rewrite `process` one more time:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def process(title, filenames):\n", "    if filenames is None:\n", "        number = 1\n", "        width, filled = count(sys.stdin)\n", "    else:\n", "        number = len(filenames)\n", "        with open(filenames[0], 'r') as source:\n", "            width, filled = count(source)\n", "        for f in filenames[1:]:\n", "            with open(f, 'r') as source:\n", "                filled = combine(source, filled)\n", "    display(title, filled, number * width)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 79 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of calculating densities file by file,\n", "we're using two functions called `count` and `combine` to count the number of filled cells per line\n", "and combine a list of counts seen so far with counts from yet another file.\n", "The first of these functions returns both the width of the data and the list of counts per line\n", "so that we can correctly calculate averages,\n", "and both branches of the `if` set the value of `number` for this purpose as well." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could now go ahead and write `count`, `combine`, and `display`,\n", "but we can simplify things one more time before doing so.\n", "`count` is going to process input data line by line and return a list of numbers.\n", "`combine` is going to do this as well;\n", "the only difference is that\n", "it will produce its output by adding line counts to existing totals.\n", "Let's rewrite `process` so that all the reading and line-by-line counting happens in `count`,\n", "and `combine` just adds values together:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def process(title, filenames):\n", "    if filenames is None:\n", "        number = 1\n", "        width, filled = count(sys.stdin)\n", "    else:\n", "        number = len(filenames)\n", "        with open(filenames[0], 'r') as source:\n", "            width, filled = count(source)\n", "        for f in filenames[1:]:\n", "            new_width, new_filled = count(source)\n", "            assert new_width == width, 'File widths are not the same'\n", "            filled = combine(filled, new_filled)\n", "    display(title, filled, number * width)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 89 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're finally done with `process`:\n", "each of the functions it depends on does exactly one simple job,\n", "and we've even included a self-check to make sure that all the input files have the same width.\n", "Our three remaining functions are now almost trivial to write:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def count(source):\n", "    result = []\n", "    for line in source:\n", "        line = line.strip()\n", "        width = len(line)\n", "        n = line.count('1')\n", "        result.append(n)\n", "    return width, result\n", "\n", "def combine(left, right):\n", "    assert len(left) == len(right), 'Data sets have unequal lengths'\n", "    result = []\n", "    for i in range(len(left)):\n", "        result.append(left[i] + right[i])\n", "    return result\n", "\n", "def display(title, counts, scaling):\n", "    print title\n", "    for c in counts:\n", "        print float(c) / scaling" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 90 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try running it on a single input file:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run fracdens_6.py fractal_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "fractal_1.txt\n", "0.111111111111\n", "0.444444444444\n", "0.444444444444\n", "0.333333333333\n", "0.222222222222\n", "0.333333333333\n" ] } ], "prompt_number": 91 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we double-check the file (which is nine cells wide):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat fractal_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "000100000\r\n", "001100110\r\n", "000111100\r\n", "001110000\r\n", "001100000\r\n", "001110000\r\n" ] } ], "prompt_number": 92 }, { "cell_type": "markdown", "metadata": {}, "source": [ "that seems to be the right answer:\n", "1/9, two lines of 4/9, a 3/9, a 2/9, and another 3/9.\n", "Let's try three files separately:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run fracdens_6.py fractal_1.txt fractal_2.txt fractal_3.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "fractal_1.txt\n", "0.111111111111\n", "0.444444444444\n", "0.444444444444\n", "0.333333333333\n", "0.222222222222\n", "0.333333333333\n", "fractal_2.txt\n", "0.111111111111\n", "0.444444444444\n", "0.444444444444\n", "0.444444444444\n", "0.222222222222\n", "0.444444444444\n", "fractal_3.txt\n", "0.111111111111\n", "0.555555555556\n", "0.333333333333\n", "0.444444444444\n", "0.333333333333\n", "0.333333333333\n" ] } ], "prompt_number": 93 }, { "cell_type": "markdown", "metadata": {}, "source": [ "and standard input:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!ipython fracdens_6.py < fractal_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "stdin\r\n", "0.111111111111\r\n", "0.444444444444\r\n", "0.444444444444\r\n", "0.333333333333\r\n", "0.222222222222\r\n", "0.333333333333\r\n" ] } ], "prompt_number": 94 }, { "cell_type": "markdown", "metadata": {}, "source": [ "All that's left to test is merging.\n", "Clearly,\n", "if we \"merge\" one file,\n", "we should get the same answer that we got for that file:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run fracdens_6.py -m fractal_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "all\n", "0.111111111111\n", "0.444444444444\n", "0.444444444444\n", "0.333333333333\n", "0.222222222222\n", "0.333333333333\n" ] } ], "prompt_number": 95 }, { "cell_type": "markdown", "metadata": {}, "source": [ "and if we merge data for that file *with itself*,\n", "we should get the same answer\n", "(which is an easier thing to check than merging data from two different files):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run fracdens_6.py -m fractal_1.txt fractal_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "I/O operation on closed file", "output_type": "pyerr", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/Users/gwilson/anaconda/lib/python2.7/site-packages/IPython/utils/py3compat.pyc\u001b[0m in \u001b[0;36mexecfile\u001b[0;34m(fname, *where)\u001b[0m\n\u001b[1;32m 202\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 203\u001b[0m \u001b[0mfilename\u001b[0m 
\u001b[0;34m=\u001b[0m \u001b[0mfname\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 204\u001b[0;31m \u001b[0m__builtin__\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexecfile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilename\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0mwhere\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/Users/gwilson/bc/lessons/swc-python/fracdens_6.py\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 56\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 57\u001b[0m \u001b[0;31m# Run the program.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 58\u001b[0;31m \u001b[0mmain\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/Users/gwilson/bc/lessons/swc-python/fracdens_6.py\u001b[0m in \u001b[0;36mmain\u001b[0;34m()\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0mprocess\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'stdin'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mflags\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmerge\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0mprocess\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'all'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfilenames\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 10\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mf\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mfilenames\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/gwilson/bc/lessons/swc-python/fracdens_6.py\u001b[0m in \u001b[0;36mprocess\u001b[0;34m(title, filenames)\u001b[0m\n\u001b[1;32m 29\u001b[0m \u001b[0mwidth\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mfilled\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcount\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 30\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mf\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mfilenames\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 31\u001b[0;31m \u001b[0mnew_width\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnew_filled\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcount\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 32\u001b[0m \u001b[0;32massert\u001b[0m \u001b[0mnew_width\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0mwidth\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'File widths are not the same'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 33\u001b[0m \u001b[0mfilled\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcombine\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilled\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnew_filled\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/gwilson/bc/lessons/swc-python/fracdens_6.py\u001b[0m in \u001b[0;36mcount\u001b[0;34m(source)\u001b[0m\n\u001b[1;32m 36\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mcount\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 37\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 38\u001b[0;31m \u001b[0;32mfor\u001b[0m \u001b[0mline\u001b[0m \u001b[0;32min\u001b[0m \u001b[0msource\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 39\u001b[0m \u001b[0mline\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstrip\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 40\u001b[0m \u001b[0mwidth\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mline\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: I/O operation on closed file" ] } ], "prompt_number": 97 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why does Python think we're trying to read from a closed file?\n", "If we take a closer look at `process`,\n", "we see that the loop handling `filenames[1:]` wasn't actually opening any files,\n", "so it was trying to read from the same `source` that was opened and closed for `filenames[0]`.\n", "Let's update `process` one more time to create `fracdens_7.py`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def process(title, filenames):\n", "    if filenames is None:\n", "        number = 1\n", "        width, filled = count(sys.stdin)\n", "    else:\n", "        number = len(filenames)\n", "        with open(filenames[0], 'r') as source:\n", "            width, filled = count(source)\n", "        for f in filenames[1:]:\n", "            with open(f, 'r') as source:\n", "                new_width, new_filled = count(source)\n", "            assert new_width == width, 'File widths are not the same'\n", "            filled = combine(filled, new_filled)\n", "    display(title, filled, number * width)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 98 }, { "cell_type": "markdown", "metadata": {}, "source": [ "and run that:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run fracdens_7.py -m fractal_1.txt fractal_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "all\n", "0.111111111111\n", "0.444444444444\n", "0.444444444444\n", "0.333333333333\n", "0.222222222222\n", "0.333333333333\n" ] } ], "prompt_number": 99 }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"Good:\n", "that's the same answer that we had before.\n", "Let's try merging all three files:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run fracdens_7.py -m fractal_1.txt fractal_2.txt fractal_3.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "all\n", "0.111111111111\n", "0.481481481481\n", "0.407407407407\n", "0.407407407407\n", "0.259259259259\n", "0.37037037037\n" ] } ], "prompt_number": 100 }, { "cell_type": "markdown", "metadata": {}, "source": [ "That *might* be right—at least,\n", "it isn't obviously wrong—but how can we be sure?\n", "Let's create three simplified input files:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat test_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "010\r\n", "010\r\n", "010\r\n" ] } ], "prompt_number": 102 }, { "cell_type": "code", "collapsed": false, "input": [ "!cat test_2.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "011\r\n", "011\r\n", "011\r\n" ] } ], "prompt_number": 103 }, { "cell_type": "code", "collapsed": false, "input": [ "!cat test_3.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "111\r\n", "111\r\n", "111\r\n" ] } ], "prompt_number": 104 }, { "cell_type": "markdown", "metadata": {}, "source": [ "and try merging those:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run fracdens_7.py -m test_1.txt test_2.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "all\n", "0.5\n", "0.5\n", "0.5\n" ] } ], "prompt_number": 105 }, { "cell_type": "code", "collapsed": false, "input": [ "%run fracdens_7.py -m test_1.txt test_2.txt test_3.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", 
"text": [ "all\n", "0.666666666667\n", "0.666666666667\n", "0.666666666667\n" ] } ], "prompt_number": 106 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Those values are much easier to check:\n", "the average of 1/3 and 2/3 is 1/2,\n", "and the average of 1/3, 2/3, and 3/3 is 2/3." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "### *That Was Unexpected*\n", "\n", "\n", "Just for fun,\n", "let's get our program to process standard input\n", "with merging turned on:\n", "" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!ipython fracdens_7.py -m < test_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "usage: ipython [-h] [--profile TERMINALIPYTHONAPP.PROFILE] [-c TERMINALIPYTHONAPP.CODE_TO_RUN]\r\n", " [--logappend TERMINALINTERACTIVESHELL.LOGAPPEND] [--autocall TERMINALINTERACTIVESHELL.AUTOCALL]\r\n", " [--ipython-dir TERMINALIPYTHONAPP.IPYTHON_DIR] [--gui TERMINALIPYTHONAPP.GUI] [--pylab [TERMINALIPYTHONAPP.PYLAB]]\r\n", " [-m TERMINALIPYTHONAPP.MODULE_TO_RUN] [--colors TERMINALINTERACTIVESHELL.COLORS]\r\n", " [--log-level TERMINALIPYTHONAPP.LOG_LEVEL] [--ext TERMINALIPYTHONAPP.EXTRA_EXTENSION]\r\n", " [--matplotlib [TERMINALIPYTHONAPP.MATPLOTLIB]] [--cache-size TERMINALINTERACTIVESHELL.CACHE_SIZE]\r\n", " [--logfile TERMINALINTERACTIVESHELL.LOGFILE] [--config TERMINALIPYTHONAPP.EXTRA_CONFIG_FILE] [--no-autoindent]\r\n", " [--deep-reload] [--classic] [--term-title] [--no-confirm-exit] [--autoindent] [--no-term-title] [--pprint] [--color-info]\r\n", " [--init] [--pydb] [--no-color-info] [--autoedit-syntax] [--confirm-exit] [--no-autoedit-syntax] [--quick] [--banner]\r\n", " [--automagic] [--no-automagic] [--nosep] [-i] [--quiet] [--no-deep-reload] [--no-pdb] [--debug] [--pdb] [--no-pprint]\r\n", " [--no-banner]\r\n", "ipython: error: argument -m/--m: expected one argument\r\n" ] } ], "prompt_number": 107 }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "That's certainly not what we expect.\n", "After a bit of digging,\n", "it turns out that `ipython` assumes the `-m` flag belongs to it,\n", "rather than to our script.\n", "This doesn't show up when we run the program with `%run`\n", "because we're not launching a separate command-line instance of IPython in that case.\n", "To make sure 
that arguments are actually passed to our script,\n", "we need to put them all after a double dash (`--`)\n", "so that IPython can tell which are its and which are ours:\n", "" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!ipython fracdens_7.py -- -m < test_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "stdin\r\n", "0.333333333333\r\n", "0.333333333333\r\n", "0.333333333333\r\n" ] } ], "prompt_number": 108 }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "A More Advanced Solution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the rules in the [previous lesson](python-08-numpy.ipynb) was,\n", "\"If you're writing a loop, you're probably doing it wrong.\"\n", "Let's take a look at how we could eliminate most of the loops in our density calculator using NumPy arrays\n", "and another feature of Python we haven't encountered yet:\n", "[list comprehensions](glossary.html#list_comprehension).\n", "Suppose we have a list of numbers:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "nums = [2, 5, 9]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of writing a loop to create a list of their squares,\n", "we can do this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "squares = [x ** 2 for x in nums]\n", "print squares" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[4, 25, 81]\n" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This doesn't change our original list:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print nums" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[2, 5, 9]\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "and the name of the temporary variable inside the comprehension doesn't matter:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print [something_else ** 2 for something_else in nums]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[4, 25, 81]\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Used sparingly,\n", "list comprehensions make a lot of programs more 
readable: just compare\n", "the list comprehension form of this calculation with the loop-and-conditional form:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from math import sqrt\n", "signal = [-0.7, -0.3, -0.1, 0.2, 0.3, 0.5]\n", "\n", "# The easy way\n", "pos_roots = [sqrt(s) for s in signal if s >= 0]\n", "\n", "# The hard way\n", "pos_roots = []\n", "for s in signal:\n", "    if s >= 0:\n", "        pos_roots.append(sqrt(s))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use list comprehension to clean up our code.\n", "First,\n", "though,\n", "let's clean up our data.\n", "It's simple to represent our fractals using tightly-packed 1's and 0's,\n", "but nobody else's software will recognize that data format.\n", "If we use something standard,\n", "like comma-separated values (CSV),\n", "we can get rid of our own parsing code.\n", "This makes our data files somewhat larger:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat csv_1.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0,0,0,1,0,0,0,0,0\r\n", "0,0,1,1,0,0,1,1,0\r\n", "0,0,0,1,1,1,1,0,0\r\n", "0,0,1,1,1,0,0,0,0\r\n", "0,0,1,1,0,0,0,0,0\r\n", "0,0,1,1,1,0,0,0,0\r\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "but we can now read our data with a single statement:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "print np.loadtxt('csv_1.txt', delimiter=',')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[ 0. 0. 0. 1. 0. 0. 0. 0. 0.]\n", " [ 0. 0. 1. 1. 0. 0. 1. 1. 0.]\n", " [ 0. 0. 0. 1. 1. 1. 1. 0. 0.]\n", " [ 0. 0. 1. 1. 1. 0. 0. 0. 0.]\n", " [ 0. 0. 1. 1. 0. 0. 0. 0. 0.]\n", " [ 0. 0. 1. 1. 1. 0. 0. 0. 
0.]]\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using this,\n", "we can rewrite `process` as follows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def process(title, filenames):\n", "    if filenames is None:\n", "        data = np.loadtxt(sys.stdin, delimiter=',')\n", "        display(title, data, 1)\n", "    else:\n", "        results = [np.loadtxt(f, delimiter=',') for f in filenames]\n", "        assert all([x.shape == results[0].shape for x in results]), 'File sizes differ'\n", "        for r in results[1:]:\n", "            results[0] += r\n", "        display(title, results[0], len(results))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's go through this line by line:\n", "\n", "* If no filenames have been given, `process` reads data from standard input using `np.loadtxt`, then passes that two-dimensional array to `display`, along with a `1` to indicate that we've only got one data set.\n", "* Otherwise, if we *do* have filenames, we use a list comprehension to load data from all of the files with a single statement.\n", "* Then, on line 7, we check that the shapes of all of the arrays match the shape of the first array.\n", "* Assuming they do, we add the second and following arrays to array 0 to get the total number of times each cell in the grid has been filled across all files.\n", "* Finally, we call `display` with the merged data, passing in the totals and the number of arrays we read." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `display` function now needs to be rewritten to do two things:\n", "scale the counts per row,\n", "and show them.\n", "This isn't a great design,\n", "as it violates our \"one purpose per function\" rule,\n", "but it's good enough for now:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def display(title, data, number):\n", "    print title\n", "    scaling = float(number * data.shape[1])\n", "    densities = data.sum(1) / scaling\n", "    for d in densities:\n", "        print d" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The program we have now produced runs exactly the same way as the previous versions:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run fracdens_8.py csv_1.txt csv_2.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "csv_1.txt\n", "0.111111111111\n", "0.444444444444\n", "0.444444444444\n", "0.333333333333\n", "0.222222222222\n", "0.333333333333\n", "csv_2.txt\n", "0.111111111111\n", "0.444444444444\n", "0.444444444444\n", "0.444444444444\n", "0.222222222222\n", "0.444444444444\n" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "but it's only about 70% of the size:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!wc fracdens_7.py fracdens_8.py" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " 59 185 1690 fracdens_7.py\r\n", " 42 133 1238 fracdens_8.py\r\n", " 101 318 2928 total\r\n" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new program isn't noticeably faster than the old one,\n", "though,\n", "since the time to do the calculations is dwarfed by\n", "the time needed to read the data into the program.\n", "\n", "Is the NumPy version better than the list-based version?\n", 
"The answer depends on the audience we have in mind.\n", "Programmers who know Python well,\n", "and who are used to thinking in terms of applying operations to entire data sets at once,\n", "would probably have written something like the final version right from the start.\n", "Programmers who aren't that familiar with Python's features,\n", "on the other hand,\n", "would probably find the loop-based version easier to understand\n", "because it spells out the steps the program is taking,\n", "rather than the results it's producing.\n", "\n", "These differences highlight one of the fundamental problems in programming\n", "(and indeed in any other activity that requires expertise):\n", "things that are comprehensible to a novice are painfully slow for an expert to read,\n", "while things that are natural for an expert are often opaque to novices.\n", "This doesn't mean that either is right or wrong:\n", "it's just one manifestation of the cognitive changes that occur\n", "as a task goes from being a mystery to being possible to being easy.\n", "\n", "The same thing is true of documentation:\n", "a tutorial aimed at novices can be infuriating for experts to read,\n", "since the information they want is scattered so thinly,\n", "while a manual page for experts can seem like gibberish to people\n", "who don't yet have a mental model of the domain." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Key Points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "FIXME" ] } ], "metadata": {} } ] }