{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python for Data Science" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Joe McCarthy](http://interrelativity.com/joe), \n", "*Data Scientist*, [Indeed](http://www.indeed.com/)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from IPython.display import display, Image, HTML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Navigation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notebooks in this primer:\n", "\n", "* [1. Introduction](1_Introduction.ipynb)\n", "* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", "* **3. Python: Basic Concepts** (*you are here*)\n", "* [4. Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", "* [5. Next Steps](5_Next_Steps.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Python: Basic Concepts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *A note on Python 2 vs. Python 3*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 2 major versions of Python in widespread use: [Python 2](https://docs.python.org/2/) and [Python 3](https://docs.python.org/3/). Python 3 has some features that are not backward compatible with Python 2, and some Python 2 libraries have not been updated to work with Python 3. 
I have been using Python 2, primarily because I use some of those Python 2[-only] libraries, but an increasing proportion of them are migrating to Python 3, and I anticipate shifting to Python 3 in the near future.\n", "\n", "For more on the topic, I recommend a very well documented IPython Notebook, which includes numerous helpful examples and links, by [Sebastian Raschka](http://sebastianraschka.com/), [Key differences between Python 2.7.x and Python 3.x](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/key_differences_between_python_2_and_3.ipynb), the [Cheat Sheet: Writing Python 2-3 compatible code](http://python-future.org/compatible_idioms.html) by Ed Schofield ... or [googling Python 2 vs 3](https://www.google.com/q=python%202%20vs%203).\n", "\n", "[Nick Coghlan](https://twitter.com/ncoghlan_dev), a CPython core developer, sent me an email suggesting that relatively minor changes in this notebook would enable it to run with Python 2 *or* Python 3: importing the `print_function` from the [**`__future__`**](https://docs.python.org/2/library/__future__.html) module, and changing my [`print` *statements* (Python 2)](https://docs.python.org/2/reference/simple_stmts.html#print) to [`print` *function calls* (Python 3)](https://docs.python.org/3/library/functions.html#print). Although a relatively minor conceptual change, it necessitated the changing of many individual cells to reflect the Python 3 `print` syntax. \n", "\n", "I also decided to import the `division` feature from the `__future__` module, as I find [the use of `/` for \"true division\"](https://www.python.org/dev/peps/pep-0238/) - and the use of `//` for \"floor division\" - to be more aligned with my intuition. I also needed to replace a few functions that are no longer available in Python 3 with related functions that are available in both versions; I've added notes in nearby cells where the incompatible functions were removed explaining why they are related ... 
and no longer available. \n", "\n", "The differences are briefly illustrated below, with print statements or function calls before and after the importing of the Python 3 versions of the print function and division operator." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 / 2 = 0\n" ] } ], "source": [ "print 1, \"/\", 2, \"=\", 1 / 2" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1, '/', 2, '=', 0)\n" ] } ], "source": [ "print(1, \"/\", 2, \"=\", 1 / 2)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from __future__ import print_function, division" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "ename": "SyntaxError", "evalue": "invalid syntax (, line 1)", "output_type": "error", "traceback": [ "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m print 1, \"/\", 2, \"=\", 1 / 2\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n" ] } ], "source": [ "print 1, \"/\", 2, \"=\", 1 / 2" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 / 2 = 0.5\n" ] } ], "source": [ "print(1, \"/\", 2, \"=\", 1 / 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Names (identifiers), strings & binding values to names (assignment)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sample instance of a mushroom shown above can be represented as a string. 
\n", "\n", "A Python ***string* ([`str`](http://docs.python.org/2/tutorial/introduction.html#strings))** is a sequence of 0 or more characters enclosed within a pair of single quotes (`'`) or a pair of double quotes (`\"`). " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python [*identifiers*](http://docs.python.org/2/reference/lexical_analysis.html#identifiers) (or [*names*](https://docs.python.org/2/reference/executionmodel.html#naming-and-binding)) are composed of letters, digits and/or underscores ('`_`'), starting with a letter or underscore. Python identifiers are case sensitive. Although camelCase identifiers can be used, it is generally considered more [pythonic](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html) to use underscores. Python variables and functions typically start with lowercase letters; Python classes start with uppercase letters.\n", "\n", "The following [assignment statement](http://docs.python.org/2/reference/simple_stmts.html#assignment-statements) binds the value of the string shown above to the name `single_instance_str`. Typing the name on the subsequent line will cause the interpreter to print the value bound to that name."
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "single_instance_str = 'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'\n", "single_instance_str" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Printing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [**`print`**](https://docs.python.org/3/library/functions.html#print) function writes the value of its comma-delimited arguments to [**`sys.stdout`**](http://docs.python.org/2/library/sys.html#sys.stdout) (typically the console). Each value in the output is separated by a single blank space. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A B C 1 2 3\n", "Instance 1: p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n" ] } ], "source": [ "print('A', 'B', 'C', 1, 2, 3)\n", "print('Instance 1:', single_instance_str)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The print function has an optional keyword argument, **`end`**. When this argument is used and its value does not include `'\\n'` (newline character), the output cursor will not advance to the next line." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A B\n", "C\n", "A B...\n", "C\n", "A B C\n" ] } ], "source": [ "print('A', 'B') # no end argument\n", "print('C')\n", "print('A', 'B', end='...\\n') # end includes '\\n' --> output cursor advances to next line\n", "print('C')\n", "print('A', 'B', end=' ') # end=' ' --> use a space rather than newline at the end of the line\n", "print('C') # so that subsequent printed output will appear on same line" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Python ***comment*** character is **`'#'`**: anything after `'#'` on the line is ignored by the Python interpreter. PEP8 style guidelines recommend using at least 2 blank spaces before an inline comment that appears on the same line as any code.\n", "\n", "***Multi-line strings*** can be used within code blocks to provide multi-line comments.\n", "\n", "Multi-line strings are delimited by pairs of triple quotes (**`'''`** or **`\"\"\"`**). Any newlines in the string will be represented as `'\\n'` characters in the string."
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'\\nThis is\\na multi-line\\nstring'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'''\n", "This is\n", "a multi-line\n", "string'''" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before comment\n", "After comment\n" ] } ], "source": [ "print('Before comment') # this is an inline comment\n", "'''\n", "This is\n", "a multi-line\n", "comment\n", "'''\n", "print('After comment')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multi-line strings can be printed, in which case the embedded newline (`'\\n'`) characters will be converted to newlines in the output." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "This is\n", "a multi-line\n", "string\n" ] } ], "source": [ "print('''\n", "This is\n", "a multi-line\n", "string''')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lists" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A [**`list`**](http://docs.python.org/2/tutorial/introduction.html#lists) is an ordered ***sequence*** of 0 or more comma-delimited elements enclosed within square brackets ('`[`', '`]`'). The Python [**`str.split(sep)`**](http://docs.python.org/2/library/stdtypes.html#str.split) method can be used to split a `sep`-delimited string into a corresponding list of elements.\n", "\n", "In the following example, a comma-delimited string is split using `sep=','`."
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['p', 'k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v', 'd']\n" ] } ], "source": [ "single_instance_list = single_instance_str.split(',')\n", "print(single_instance_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python lists are *heterogeneous*, i.e., they can contain elements of different types." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['a', 1, 2.3, True, [1, 'b']]\n" ] } ], "source": [ "mixed_list = ['a', 1, 2.3, True, [1, 'b']]\n", "print(mixed_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Python **`+`** operator can be used for addition, and also to concatenate strings and lists." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6\n", "abc\n", "['a', 1, 2.3, True, [1, 'b']]\n" ] } ], "source": [ "print(1 + 2 + 3)\n", "print('a' + 'b' + 'c')\n", "print(['a', 1] + [2.3, True] + [[1, 'b']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accessing sequence elements & subsequences " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Individual elements of [*sequences*](http://docs.python.org/2/library/stdtypes.html#typesseq) (e.g., lists and strings) can be accessed by specifying their *zero-based index position* within square brackets ('`[`', '`]`').\n", "\n", "The following statements print out the 3rd element - at zero-based index position 2 - of `single_instance_str` and `single_instance_list`.\n", "\n", "Note that the 3rd elements are not the same, as commas count as elements in the string, but not in the list created by splitting a comma-delimited string." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", "k\n", "['p', 'k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v', 'd']\n", "f\n" ] } ], "source": [ "print(single_instance_str)\n", "print(single_instance_str[2])\n", "print(single_instance_list)\n", "print(single_instance_list[2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Negative index values* can be used to specify a position offset from the end of the sequence.\n", "\n", "It is often useful to use a `-1` index value to access the last element of a sequence." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", "d\n", ",\n" ] } ], "source": [ "print(single_instance_str)\n", "print(single_instance_str[-1])\n", "print(single_instance_str[-2])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['p', 'k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v', 'd']\n", "d\n", "v\n" ] } ], "source": [ "print(single_instance_list)\n", "print(single_instance_list[-1])\n", "print(single_instance_list[-2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Python ***slice notation*** can be used to access subsequences by specifying two index positions separated by a colon (':'); `seq[start:stop]` returns all the elements in `seq` between `start` and `stop - 1` (inclusive)." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "k,\n", "['f', 'n']\n" ] } ], "source": [ "print(single_instance_str[2:4])\n", "print(single_instance_list[2:4])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Slice index values can be negative." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ",v\n", "['e', 'w']\n" ] } ], "source": [ "print(single_instance_str[-4:-2])\n", "print(single_instance_list[-4:-2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `start` and/or `stop` index can be omitted. A common use of slices with a single index value is to access all but the first element or all but the last element of a sequence." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", "p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,\n", "p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v\n", ",k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", "k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n" ] } ], "source": [ "print(single_instance_str)\n", "print(single_instance_str[:-1]) # all but the last \n", "print(single_instance_str[:-2]) # all but the last 2 \n", "print(single_instance_str[1:]) # all but the first\n", "print(single_instance_str[2:]) # all but the first 2" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['p', 'k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v', 'd']\n", "['p', 'k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v']\n", "['k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 
'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v', 'd']\n" ] } ], "source": [ "print(single_instance_list)\n", "print(single_instance_list[:-1])\n", "print(single_instance_list[1:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Slice notation includes an optional third element, `step`, as in `seq[start:stop:step]`, that specifies the increment by which elements are retrieved from `seq` between `start` and `stop - 1`:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", "pkfnfnfcnwe?kywnpwoewvd\n", ",,,,,,,,,,,,,,,,,,,,,,\n", "d,v,w,e,o,w,p,n,w,y,k,?,e,w,n,c,f,n,f,n,f,k,p\n" ] } ], "source": [ "print(single_instance_str)\n", "print(single_instance_str[::2]) # print elements in even-numbered positions\n", "print(single_instance_str[1::2]) # print elements in odd-numbered positions\n", "print(single_instance_str[::-1]) # print elements in reverse order" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [Python tutorial](http://docs.python.org/2/tutorial/introduction.html) offers a helpful ASCII art representation to show how positive and negative indexes are interpreted:\n", "\n", "
<pre>\n",
    " +---+---+---+---+---+\n",
    " | H | e | l | p | A |\n",
    " +---+---+---+---+---+\n",
    " 0   1   2   3   4   5\n",
    "-5  -4  -3  -2  -1\n",
    "</pre>
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Splitting / separating statements" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python statements are typically separated by newlines (rather than, say, the semi-colon in Java). Statements can extend over more than one line; it is generally best to break the lines after commas, parentheses, braces or brackets. Inserting a backslash character ('\\\\') at the end of a line will also enable continuation of the statement on the next line, but it is generally best to look for other alternatives." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']\n" ] } ], "source": [ "attribute_names = ['class', \n", " 'cap-shape', 'cap-surface', 'cap-color', \n", " 'bruises?', \n", " 'odor', \n", " 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', \n", " 'stalk-shape', 'stalk-root', \n", " 'stalk-surface-above-ring', 'stalk-surface-below-ring', \n", " 'stalk-color-above-ring', 'stalk-color-below-ring',\n", " 'veil-type', 'veil-color', \n", " 'ring-number', 'ring-type', \n", " 'spore-print-color', \n", " 'population', \n", " 'habitat']\n", "print(attribute_names)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a b c 1 2 3\n" ] } ], "source": [ "print('a', 'b', 'c', # no '\\' needed when breaking after comma\n", " 1, 2, 3)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": 
[ { "name": "stdout", "output_type": "stream", "text": [ "a b c 1 2 3\n" ] } ], "source": [ "print( # no '\' needed when breaking after parenthesis, brace or bracket\n", " 'a', 'b', 'c',\n", " 1, 2, 3)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6\n" ] } ], "source": [ "print(1 + 2 \\\n", " + 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Processing strings & other sequences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [**`str.strip([chars])`**](http://docs.python.org/2/library/stdtypes.html#str.strip) method returns a copy of `str` in which any leading or trailing `chars` are removed. If no `chars` are specified, it removes all leading and trailing whitespace. [*Whitespace* is any sequence of spaces, tabs (`'\\t'`) and/or newline (`'\\n'`) characters.] \n", "\n", "Note that since `print` inserts a blank space between each of its comma-delimited arguments, the second asterisk below is printed after a leading blank space on the new line."
] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* \tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", " *\n" ] } ], "source": [ "print('*', '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n', '*')" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d *\n" ] } ], "source": [ "print('*', '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n'.strip(), '*')" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* p,k,f,n,f,n,f,c,n,w,e, ?,k,y\t,w,n,p,w\n", ",o,e,w,v,d *\n" ] } ], "source": [ "print('*', '\\tp,k,f,n,f,n,f,c,n,w,e, ?,k,y\\t,w,n,p,w\\n,o,e,w,v,d\\n'.strip(), '*')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common programming pattern when dealing with CSV (comma-separated value) files, such as the mushroom dataset file mentioned above, is to repeatedly:\n", "\n", "1. read a line from a file\n", "2. strip off any leading and trailing whitespace\n", "3. split the values separated by commas into a list\n", "\n", "We will get to repetition control structures (loops) and file input and output shortly, but here is an example of how `str.strip()` and `str.split()` can be chained together in a single instruction for processing a line representing a single instance from the mushroom dataset file. 
Note that chained methods are executed in left-to-right order.\n", "\n", "*\\[Python provides a **[`csv`](https://docs.python.org/2/library/csv.html)** module to facilitate the processing of CSV files, but we will not use that module here\\]*" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", "\n", "['p', 'k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v', 'd']\n" ] } ], "source": [ "single_instance_str = 'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n'\n", "print(single_instance_str)\n", "# first strip leading & trailing whitespace, then split on commas\n", "single_instance_list = single_instance_str.strip().split(',') \n", "print(single_instance_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [**`str.join(words)`**](http://docs.python.org/2/library/string.html#string.join) method is the inverse of `str.split()`, returning a single string in which each string in the sequence of `words` is separated by `str`." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['p', 'k', 'f', 'n', 'f', 'n', 'f', 'c', 'n', 'w', 'e', '?', 'k', 'y', 'w', 'n', 'p', 'w', 'o', 'e', 'w', 'v', 'd']\n", "p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n" ] } ], "source": [ "print(single_instance_list)\n", "print(','.join(single_instance_list))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A number of Python methods can be used on strings, lists and other sequences.\n", "\n", "The [**`len(s)`**](http://docs.python.org/2/library/functions.html#len) function can be used to find the length of (number of items in) a sequence `s`. It will also return the number of items in a *dictionary*, a data structure we will cover further below."
] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "46\n", "23\n" ] } ], "source": [ "print(len(single_instance_str))\n", "print(len(single_instance_list))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **`in`** operator can be used to determine whether a sequence contains a value. \n", "\n", "Boolean values in Python are **`True`** and **`False`** (note the capitalization)." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "False\n" ] } ], "source": [ "print(',' in single_instance_str)\n", "print(',' in single_instance_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [**`s.count(x)`**](http://docs.python.org/2/library/stdtypes.html#str.count) method can be used to count the number of occurrences of item `x` in sequence `s`." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "22\n", "3\n" ] } ], "source": [ "print(single_instance_str.count(','))\n", "print(single_instance_list.count('f'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [**`s.index(x)`**](http://docs.python.org/2/library/stdtypes.html#str.index) method can be used to find the first zero-based index of item `x` in sequence `s`. " ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "2\n" ] } ], "source": [ "print(single_instance_str.index(','))\n", "print(single_instance_list.index('f'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that a [`ValueError`](https://docs.python.org/2/library/exceptions.html#exceptions.ValueError) exception will be raised if item `x` is not found in sequence `s`."
] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "ename": "ValueError", "evalue": "',' is not in list", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msingle_instance_list\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m','\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mValueError\u001b[0m: ',' is not in list" ] } ], "source": [ "print(single_instance_list.index(','))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mutability" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One important distinction between strings and lists has to do with their [*mutability*](http://docs.python.org/2/reference/datamodel.html).\n", "\n", "Python strings are *immutable*, i.e., they cannot be modified. Most string methods (like `str.strip()`) return modified *copies* of the strings on which they are used.\n", "\n", "Python lists are *mutable*, i.e., they can be modified. \n", "\n", "The examples below illustrate a number of [`list`](http://docs.python.org/2/tutorial/datastructures.html#more-on-lists) methods that modify lists." 
] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "list_1: [1, 2, 3, 5, 1]\n", "list_2: [1, 2, 3, 5, 1]\n", "\n", "list_1.remove(1): [2, 3, 5, 1]\n", "\n", "list_1.pop(2): [2, 3, 1]\n", "\n", "list_1.append(6): [2, 3, 1, 6]\n", "\n", "list_1.insert(0, 7): [7, 2, 3, 1, 6]\n", "\n", "list_1.sort(): [1, 2, 3, 6, 7]\n", "\n", "list_1.reverse(): [7, 6, 3, 2, 1]\n" ] } ], "source": [ "list_1 = [1, 2, 3, 5, 1]\n", "list_2 = list_1 # list_2 now references the same object as list_1\n", "\n", "print('list_1: ', list_1)\n", "print('list_2: ', list_2)\n", "print()\n", "\n", "list_1.remove(1) # remove [only] the first occurrence of 1 in list_1\n", "print('list_1.remove(1): ', list_1)\n", "print()\n", "\n", "list_1.pop(2) # remove the element in position 2\n", "print('list_1.pop(2): ', list_1)\n", "print()\n", "\n", "list_1.append(6) # add 6 to the end of list_1\n", "print('list_1.append(6): ', list_1)\n", "print()\n", "\n", "list_1.insert(0, 7) # add 7 to the beginning of list_1 (before the element in position 0)\n", "print('list_1.insert(0, 7):', list_1)\n", "print()\n", "\n", "list_1.sort()\n", "print('list_1.sort(): ', list_1)\n", "print()\n", "\n", "list_1.reverse()\n", "print('list_1.reverse(): ', list_1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When more than one name (e.g., a variable) is bound to the same mutable object, changes made to that object are reflected in all names bound to that object. For example, in the second statement above, `list_2` is bound to the same object that is bound to `list_1`. All changes made to the object bound to `list_1` will thus be reflected in `list_2` (since they both reference the same object)."
] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "list_1: [7, 6, 3, 2, 1]\n", "list_2: [7, 6, 3, 2, 1]\n" ] } ], "source": [ "print('list_1: ', list_1)\n", "print('list_2: ', list_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create a copy of a list by using slice notation without specifying a `start` or `stop` index, i.e., `[:]`. If we assign that copy to another variable, the two variables will be bound to different objects, so changes to one do not affect the other." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "list_1: [1, 2, 3, 5, 1]\n", "list_2: [1, 2, 3, 5, 1]\n", "\n", "list_1.remove(1): [2, 3, 5, 1]\n", "\n", "list_1: [2, 3, 5, 1]\n", "list_2: [1, 2, 3, 5, 1]\n" ] } ], "source": [ "list_1 = [1, 2, 3, 5, 1]\n", "list_2 = list_1[:] # list_1[:] returns a copy of the entire contents of list_1\n", "\n", "print('list_1: ', list_1)\n", "print('list_2: ', list_2)\n", "print()\n", "\n", "list_1.remove(1) # remove [only] the first occurrence of 1 in list_1\n", "print('list_1.remove(1): ', list_1)\n", "print()\n", "\n", "print('list_1: ', list_1)\n", "print('list_2: ', list_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [`dir()`](https://docs.python.org/2/library/functions.html#dir) function returns all the attributes associated with a Python name (e.g., a variable) in alphabetical order. \n", "\n", "When invoked with a name bound to a `list` object, it will return the methods that can be invoked on a list. The attributes with leading and trailing underscores should be treated as protected (i.e., they should not be used); we'll discuss this further below."
] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['__add__',\n", " '__class__',\n", " '__contains__',\n", " '__delattr__',\n", " '__delitem__',\n", " '__delslice__',\n", " '__doc__',\n", " '__eq__',\n", " '__format__',\n", " '__ge__',\n", " '__getattribute__',\n", " '__getitem__',\n", " '__getslice__',\n", " '__gt__',\n", " '__hash__',\n", " '__iadd__',\n", " '__imul__',\n", " '__init__',\n", " '__iter__',\n", " '__le__',\n", " '__len__',\n", " '__lt__',\n", " '__mul__',\n", " '__ne__',\n", " '__new__',\n", " '__reduce__',\n", " '__reduce_ex__',\n", " '__repr__',\n", " '__reversed__',\n", " '__rmul__',\n", " '__setattr__',\n", " '__setitem__',\n", " '__setslice__',\n", " '__sizeof__',\n", " '__str__',\n", " '__subclasshook__',\n", " 'append',\n", " 'count',\n", " 'extend',\n", " 'index',\n", " 'insert',\n", " 'pop',\n", " 'remove',\n", " 'reverse',\n", " 'sort']" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir(list_1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are sorting and reversing functions, **[`sorted()`](https://docs.python.org/2.7/library/functions.html#sorted)** and **[`reversed()`](https://docs.python.org/2.7/library/functions.html#reversed)**, that do *not* modify their arguments, and can thus be used on mutable or immutable objects. \n", "\n", "Note that `sorted()` always returns a sorted *list* of each element in its argument, regardless of which type of sequence it is passed. Thus, invoking `sorted()` on a *string* returns a *list* of sorted characters from the string, rather than a sorted string." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sorted(list_1): [1, 2, 3, 5]\n", "list_1: [2, 3, 5, 1]\n", "\n", "sorted(single_instance_str): ['\\n', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', '?', 'c', 'd', 'e', 'e', 'f', 'f', 'f', 'k', 'k', 'n', 'n', 'n', 'n', 'o', 'p', 'p', 'v', 'w', 'w', 'w', 'w', 'y']\n", "single_instance_str: p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", "\n" ] } ], "source": [ "print('sorted(list_1):', sorted(list_1)) \n", "print('list_1: ', list_1)\n", "print()\n", "print('sorted(single_instance_str):', sorted(single_instance_str)) \n", "print('single_instance_str: ', single_instance_str)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `sorted()` function sorts its argument in ascending order by default. \n", "\n", "An optional ***[keyword argument](http://docs.python.org/2/tutorial/controlflow.html#keyword-arguments)***, `reverse`, can be used to sort in descending order. The default value of this optional parameter is `False`; to get non-default behavior of an optional argument, we must specify the name and value of the argument, in this case, `reverse=True`." 
] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['\\n', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', '?', 'c', 'd', 'e', 'e', 'f', 'f', 'f', 'k', 'k', 'n', 'n', 'n', 'n', 'o', 'p', 'p', 'v', 'w', 'w', 'w', 'w', 'y']\n", "['y', 'w', 'w', 'w', 'w', 'v', 'p', 'p', 'o', 'n', 'n', 'n', 'n', 'k', 'k', 'f', 'f', 'f', 'e', 'e', 'd', 'c', '?', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', '\\n']\n" ] } ], "source": [ "print(sorted(single_instance_str)) \n", "print(sorted(single_instance_str, reverse=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuples (immutable list-like sequences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A [*tuple*](http://docs.python.org/2/tutorial/datastructures.html#tuples-and-sequences) is an ordered, immutable sequence of 0 or more comma-delimited values enclosed in parentheses (`'('`, `')'`). Many of the functions and methods that operate on strings and lists also operate on tuples." 
] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x = (5, 4, 3, 2, 1)\n", "len(x) = 5\n", "x.index(3) = 2\n", "x[2:4] = (3, 2)\n", "x[4:2:-1] = (1, 2)\n", "sorted(x): [1, 2, 3, 4, 5]\n" ] } ], "source": [ "x = (5, 4, 3, 2, 1) # a tuple\n", "print('x =', x)\n", "print('len(x) =', len(x))\n", "print('x.index(3) =', x.index(3))\n", "print('x[2:4] = ', x[2:4])\n", "print('x[4:2:-1] = ', x[4:2:-1])\n", "print('sorted(x):', sorted(x)) # note: sorted() always returns a list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the methods that modify lists (e.g., `append()`, `remove()`, `reverse()`, `sort()`) are not defined for immutable sequences such as tuples (or strings). Invoking one of these sequence modification methods on an immutable sequence will raise an [`AttributeError`](https://docs.python.org/2/library/exceptions.html#exceptions.AttributeError) exception." ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "ename": "AttributeError", "evalue": "'tuple' object has no attribute 'append'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m6\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m: 'tuple' object has no attribute 'append'" ] } ], "source": [ "x.append(6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, one can approximate these modifications by creating modified copies of an immutable sequence and then re-assigning it to a name." 
] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(5, 4, 3, 2, 1, 6)" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = x + (6,) # need to include a comma to differentiate tuple from numeric expression\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that Python has a **`+=`** operator which is a shortcut for the *`name = name + new_value`* pattern. This can be used for addition (e.g., `x += 1` is shorthand for `x = x + 1`) or concatenation (e.g., `x += (7,)` is shorthand for `x = x + (7,)`)." ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(5, 4, 3, 2, 1, 6, 7)" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x += (7,)\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A tuple of one element must include a trailing comma to differentiate it from a parenthesized expression." ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'a'" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "('a')" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "('a',)" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "('a',)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conditionals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One common approach to handling errors is to *look before you leap (LBYL)*, i.e., test for potential [exceptions](http://docs.python.org/2/tutorial/errors.html) before executing instructions that might raise those exceptions. 
\n", "\n", "This approach can be implemented using the [**`if`**](http://docs.python.org/2/tutorial/controlflow.html#if-statements) statement (which may optionally include an **`else`** and any number of **`elif`** clauses).\n", "\n", "The following is a simple example of an `if` statement:" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "unknown\n" ] } ], "source": [ "class_value = 'x' # try changing this to 'e' or 'p'\n", "\n", "if class_value == 'e':\n", " print('edible')\n", "elif class_value == 'p':\n", " print('poisonous')\n", "else:\n", " print('unknown')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that \n", "\n", "* a colon ('`:`') is used at the end of the lines with `if`, `else` or `elif`\n", "* no parentheses are required to enclose the boolean condition (it is presumed to include everything between `if` or `elif` and the colon)\n", "* the statements below each `if`, `elif` and `else` line are all indented\n", "\n", "Python does not have special characters to delimit statement blocks (like the '{' and '}' delimiters in Java); instead, sequences of statements with the same *indentation level* are treated as a statement block. The [Python Style Guide](http://legacy.python.org/dev/peps/pep-0008/) recommends using 4 spaces for each indentation level.\n", "\n", "An `if` statement can be used to follow the LBYL paradigm in preventing the `ValueError` that occurred in an earlier example:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bruises? is in position 4\n" ] } ], "source": [ "attribute = 'bruises?' # try substituting 'bruises'
for 'bruises?' and re-running this code\n", "\n", "if attribute in attribute_names:\n", " i = attribute_names.index(attribute)\n", " print(attribute, 'is in position', i)\n", "else:\n", " print(attribute, 'is not in', attribute_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seeking forgiveness vs. asking for permission (EAFP vs. LBYL)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another perspective on handling errors championed by some pythonistas is that it is [*easier to ask forgiveness than permission (EAFP)*](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#eafp-vs-lbyl).\n", "\n", "As in many practical applications of philosophy, religion or dogma, it is helpful to *think before you choose (TBYC)*. There are a number of factors to consider in deciding whether to follow the EAFP or LBYL paradigm, including code readability and the anticipated likelihood and relative severity of encountering an exception. For those who are interested, Oran Looney wrote a blog post providing a nice overview of the debate over [LBYL vs. EAFP](http://oranlooney.com/lbyl-vs-eafp/).\n", "\n", "In keeping with practices most commonly used with other languages, we will follow the LBYL paradigm throughout most of this primer. \n", "\n", "However, as a brief illustration of the EAFP paradigm in Python, here is an alternate implementation of the functionality of the code above, using a [**`try/except`**](http://docs.python.org/2/tutorial/errors.html#handling-exceptions) statement." ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bruises? is in position 4\n" ] } ], "source": [ "attribute = 'bruises?' 
# try substituting 'bruises' for 'bruises?' and re-running this code\n", "\n", "i = -1\n", "try:\n", " i = attribute_names.index(attribute)\n", " print(attribute, 'is in position', i)\n", "except ValueError:\n", " print(attribute, 'is not found')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no local scoping inside a `try`, so the value of `i` persists after the `try/except` statement." ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "i" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Python *null object* is **`None`** (note the capitalization)." ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bruises = None\n" ] } ], "source": [ "attribute = 'bruises' # try substituting 'bruises?' for 'bruises' and re-running this code\n", "\n", "if attribute not in attribute_names: # equivalent to 'not attribute in attribute_names'\n", " value = None\n", "else:\n", " i = attribute_names.index(attribute)\n", " value = single_instance_list[i]\n", " \n", "print(attribute, '=', value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Defining and calling functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python [*function definitions*](http://docs.python.org/2/tutorial/controlflow.html#defining-functions) start with the **`def`** keyword followed by a function name, a list of 0 or more comma-delimited *parameters* (aka 'formal parameters') enclosed within parentheses, and then a colon ('`:`'). \n", "\n", "A function definition may include one or more [**`return`**](http://docs.python.org/2/reference/simple_stmts.html#the-return-statement) statements to indicate the value(s) returned to where the function is called. 
It is good practice to include a short [docstring](http://docs.python.org/2/tutorial/controlflow.html#tut-docstrings) to briefly describe the behavior of the function and the value(s) it returns." ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def attribute_value(instance, attribute, attribute_names):\n", " '''Returns the value of attribute in instance, based on its position in attribute_names'''\n", " if attribute not in attribute_names:\n", " return None\n", " else:\n", " i = attribute_names.index(attribute)\n", " return instance[i] # using the parameter name here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A *function call* starts with the function name, followed by a list of 0 or more comma-delimited *arguments* (aka 'actual parameters') enclosed within parentheses. A function call can be used as a statement or within an expression." ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cap-shape = k\n" ] } ], "source": [ "attribute = 'cap-shape' # try substituting any of the other attribute names shown above\n", "print(attribute, '=', attribute_value(single_instance_list, attribute, attribute_names))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that Python does not distinguish between names used for *variables* and names used for *functions*. An assignment statement binds a value to a name; a function definition also binds a value to a name. At any given time, the value most recently bound to a name is the one that is used. \n", "\n", "This can be demonstrated using the [**`type(object)`**](http://docs.python.org/2.7/library/functions.html#type) function, which returns the `type` of `object`."
] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x used as a variable: 0 \n", "x used as a function: \n" ] } ], "source": [ "x = 0\n", "print('x used as a variable:', x, type(x))\n", "\n", "def x():\n", " print('x')\n", " \n", "print('x used as a function:', x, type(x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to determine the `type` of an object is to use [**`isinstance(object, class)`**](https://docs.python.org/2/library/functions.html#isinstance). This is generally [preferable](http://stackoverflow.com/questions/1549801/differences-between-isinstance-and-type-in-python), as it takes into account [class inheritance](https://docs.python.org/2/tutorial/classes.html#inheritance). There is a larger issue of [*duck typing*](https://en.wikipedia.org/wiki/Duck_typing), and whether code should ever explicitly check for the type of an object, but we will omit further discussion of the topic in this primer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Call by sharing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An important feature of Python functions is that arguments are passed using [*call by sharing*](https://en.wikipedia.org/wiki/Evaluation_strategy#Call_by_sharing). \n", "\n", "If a *mutable* object is passed as an argument to a function parameter, assignment statements using that parameter do not affect the passed argument, however other modifications to the parameter (e.g., modifications to a list using methods such as `append()`, `remove()`, `reverse()` or `sort()`) do affect the passed argument.\n", "\n", "Not being aware of - or forgetting - this important distinction can lead to challenging debugging sessions. 
\n", "\n", "The example below demonstrates this difference and introduces another [list method](https://docs.python.org/2/tutorial/datastructures.html#more-on-lists), `list.insert(i, x)`, which inserts `x` into `list` at position `i`." ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "argument1, before calling modify_parameters: [1, 2, 3]\n", "argument2, before calling modify_parameters: [4, 5, 6]\n", "\n", "parameter1, after inserting \"x\": ['x', 1, 2, 3]\n", "parameter2, after assigning [7, 8, 9]: [7, 8, 9]\n", "\n", "argument1, after calling modify_parameters: ['x', 1, 2, 3]\n", "argument2, after calling modify_parameters: [4, 5, 6]\n" ] } ], "source": [ "def modify_parameters(parameter1, parameter2):\n", " '''Inserts \"x\" at the head of parameter1, assigns [7, 8, 9] to parameter2'''\n", " parameter1.insert(0, 'x') # insert() WILL affect argument passed as parameter1\n", " print('parameter1, after inserting \"x\":', parameter1)\n", " parameter2 = [7, 8, 9] # assignment WILL NOT affect argument passed as parameter2\n", " print('parameter2, after assigning [7, 8, 9]:', parameter2)\n", " return\n", "\n", "argument1 = [1, 2, 3] \n", "argument2 = [4, 5, 6]\n", "print('argument1, before calling modify_parameters:', argument1)\n", "print('argument2, before calling modify_parameters:', argument2)\n", "print()\n", "modify_parameters(argument1, argument2)\n", "print()\n", "print('argument1, after calling modify_parameters:', argument1)\n", "print('argument2, after calling modify_parameters:', argument2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way of preventing functions from modifying mutable objects passed as parameters is to make a copy of those objects inside the function. Here is another version of the function above that makes a shallow copy of the list parameter using the slice operator. 
\n", "\n", "*\\[Note: the Python [copy](http://docs.python.org/2/library/copy.html) module provides both [shallow] [`copy()`](http://docs.python.org/2/library/copy.html#copy.copy) and [`deepcopy()`](http://docs.python.org/2/library/copy.html#copy.deepcopy) methods; we will cover modules further below.\\]*" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before: [1, 2, 3]\n", "Inserted \"x\": ['x', 1, 2, 3]\n", "After: [1, 2, 3]\n" ] } ], "source": [ "def modify_parameter_copy(parameter_1):\n", " '''Inserts \"x\" at the head of parameter_1, without modifying the list argument'''\n", " parameter_1_copy = parameter_1[:] # list[:] returns a copy of list\n", " parameter_1_copy.insert(0, 'x')\n", " print('Inserted \"x\":', parameter_1_copy)\n", " return\n", "\n", "argument_1 = [1, 2, 3] # passing a named object will not affect the object bound to that name\n", "print('Before:', argument_1)\n", "modify_parameter_copy(argument_1)\n", "print('After:', argument_1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to avoid modifying parameters is to use assignment statements which do not modify the parameter objects but return a new object that is bound to the name (locally)." 
] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Before: [1, 2, 3]\n", "Inserted \"x\": ['x', 1, 2, 3]\n", "After: [1, 2, 3]\n" ] } ], "source": [ "def modify_parameter_assignment(parameter_1):\n", " '''Inserts \"x\" at the head of parameter_1, without modifying the list argument'''\n", " parameter_1 = ['x'] + parameter_1 # using assignment rather than list.insert()\n", " print('Inserted \"x\":', parameter_1)\n", " return\n", "\n", "argument_1 = [1, 2, 3] # passing a named object will not affect the object bound to that name\n", "print('Before:', argument_1)\n", "modify_parameter_assignment(argument_1)\n", "print('After:', argument_1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple return values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python functions can return more than one value by separating those return values with commas in the **return** statement. Multiple values are returned as a tuple. \n", "\n", "If the function-invoking expression is an assignment statement, multiple variables can be assigned the multiple values returned by the function in a single statement. This combining of values and subsequent separation is known as tuple ***packing*** and ***unpacking***." 
] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "min and max of [3, 1, 4, 2, 5] : (1, 5)\n", "min and max of [3, 1, 4, 2, 5] : (1, 5)\n", "min and max of [3, 1, 4, 2, 5] : 1 , 5\n" ] } ], "source": [ "def min_and_max(list_of_values):\n", " '''Returns a tuple containing the min and max values in the list_of_values'''\n", " return min(list_of_values), max(list_of_values)\n", "\n", "list_1 = [3, 1, 4, 2, 5]\n", "print('min and max of', list_1, ':', min_and_max(list_1))\n", "\n", "# a single variable is assigned the two-element tuple\n", "min_and_max_list_1 = min_and_max(list_1) \n", "print('min and max of', list_1, ':', min_and_max_list_1)\n", "\n", "# the 1st variable is assigned the 1st value, the 2nd variable is assigned the 2nd value\n", "min_list_1, max_list_1 = min_and_max(list_1) \n", "print('min and max of', list_1, ':', min_list_1, ',', max_list_1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Iteration: for, range" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [**`for`**](http://docs.python.org/2/tutorial/controlflow.html#for-statements) statement iterates over the elements of a sequence or other [*iterable*](http://docs.python.org/2/glossary.html#term-iterable) object." 
] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1\n", "2\n" ] } ], "source": [ "for i in [0, 1, 2]:\n", " print(i)" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a\n", "b\n", "c\n" ] } ], "source": [ "for c in 'abc':\n", " print(c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The value of the variable used to iterate in a `for` statement persists after the `for` statement." ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(2, 'c')" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "i, c" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Python 2, the [**`range(stop)`**](http://docs.python.org/2/tutorial/controlflow.html#the-range-function) function returns a list of values from 0 up to `stop - 1` (inclusive). It is often used in the context of a `for` loop that iterates over the list of values." ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Values for the 23 attributes:\n", "\n", "class = p\n", "cap-shape = k\n", "cap-surface = f\n", "cap-color = n\n", "bruises? 
= f\n", "odor = n\n", "gill-attachment = f\n", "gill-spacing = c\n", "gill-size = n\n", "gill-color = w\n", "stalk-shape = e\n", "stalk-root = ?\n", "stalk-surface-above-ring = k\n", "stalk-surface-below-ring = y\n", "stalk-color-above-ring = w\n", "stalk-color-below-ring = n\n", "veil-type = p\n", "veil-color = w\n", "ring-number = o\n", "ring-type = e\n", "spore-print-color = w\n", "population = v\n", "habitat = d\n" ] } ], "source": [ "print('Values for the', len(attribute_names), 'attributes:', end='\\n\\n') # adds a blank line\n", "for i in range(len(attribute_names)):\n", " print(attribute_names[i], '=', \n", " attribute_value(single_instance_list, attribute_names[i], attribute_names))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The more general form of the function, [**`range(start, stop[, step])`**](http://docs.python.org/2/library/functions.html#range), returns a list of values from `start` to `stop - 1` (inclusive) increasing by `step` (which defaults to `1`), or from `start` down to `stop + 1` (inclusive) decreasing by `step` if `step` is negative." ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3\n", "2\n", "1\n" ] } ], "source": [ "for i in range(3, 0, -1):\n", " print(i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Python 2, the [**`xrange([start,] stop[, step])`**](http://docs.python.org/2/library/functions.html#xrange) function is an [*iterable*](http://docs.python.org/2/glossary.html#term-iterable) version of the `range()` function. In the context of a `for` loop, it returns the *next* item of the sequence for each iteration of the loop rather than creating *all* the elements of the sequence before the first iteration. 
This can reduce memory consumption in cases where iteration over all the items is not required.\n", "\n", "In Python 3, the `range()` function behaves the same way as the `xrange()` function does in Python 2, and so the `xrange()` function is not defined in Python 3. \n", "\n", "To maximize compatibility, we will use `range()` throughout this notebook; however, note that it is generally more efficient to use `xrange()` rather than `range()` in Python 2." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modules, namespaces and dotted notation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Python [***module***](http://docs.python.org/2/tutorial/modules.html) is a file containing related definitions (e.g., of functions and variables). Modules are used to help organize a Python [***namespace***](http://docs.python.org/2/tutorial/classes.html#python-scopes-and-namespaces), the set of identifiers accessible in a particular context. All of the functions and variables we define in this IPython Notebook are in the `__main__` namespace, so accessing them does not require any specification of a module.\n", "\n", "A Python module named **`simple_ml`** (in the file `simple_ml.py`), contains a set of solutions to the exercises in this IPython Notebook. 
*\\[The learning opportunity provided by this primer will be maximized by not looking at that file, or waiting as long as possible to do so.\\]*\n", "\n", "Accessing functions in an external module requires that we first **[`import`](http://docs.python.org/2/reference/simple_stmts.html#the-import-statement)** the module, and then prefix the function names with the module name followed by a dot (this is known as ***dotted notation***).\n", "\n", "For example, the following function call in Exercise 1 below: \n", "\n", "`simple_ml.print_attribute_names_and_values(single_instance_list, attribute_names)`\n", "\n", "uses dotted notation to reference the `print_attribute_names_and_values()` function in the `simple_ml` module.\n", "\n", "After you have defined your own function for Exercise 1, you can test your function by deleting the `simple_ml` module specification, so that the statement becomes\n", "\n", "`print_attribute_names_and_values(single_instance_list, attribute_names)`\n", "\n", "This will reference the `print_attribute_names_and_values()` function in the current namespace (`__main__`), i.e., the top-level interpreter environment. The `simple_ml.print_attribute_names_and_values()` function will still be accessible in the `simple_ml` namespace by using the \"`simple_ml.`\" prefix (so you can easily toggle back and forth between your own definition and that provided in the solutions file)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1: define `print_attribute_names_and_values()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Complete the following function definition, `print_attribute_names_and_values(instance, attribute_names)`, so that it generates exactly the same output as the code above." 
] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Values for the 23 attributes:\n", "\n", "class = p\n", "cap-shape = k\n", "cap-surface = f\n", "cap-color = n\n", "bruises? = f\n", "odor = n\n", "gill-attachment = f\n", "gill-spacing = c\n", "gill-size = n\n", "gill-color = w\n", "stalk-shape = e\n", "stalk-root = ?\n", "stalk-surface-above-ring = k\n", "stalk-surface-below-ring = y\n", "stalk-color-above-ring = w\n", "stalk-color-below-ring = n\n", "veil-type = p\n", "veil-color = w\n", "ring-number = o\n", "ring-type = e\n", "spore-print-color = w\n", "population = v\n", "habitat = d\n" ] } ], "source": [ "def print_attribute_names_and_values(instance, attribute_names):\n", " '''Prints the attribute names and values for an instance'''\n", " # your code here\n", " return\n", "\n", "import simple_ml # this module contains my solutions to exercises\n", "\n", "# delete 'simple_ml.' in the function call below to test your function\n", "simple_ml.print_attribute_names_and_values(single_instance_list, attribute_names)\n", "print_attribute_names_and_values(single_instance_list, attribute_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### File I/O" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python [file input and output](http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files) is done through [file](http://docs.python.org/2/library/stdtypes.html#file-objects) objects. 
A file object is created with the [`open(name[, mode])`](http://docs.python.org/2/library/functions.html#open) function, where `name` is a string representing the name of the file, and `mode` is `'r'` (read), `'w'` (write) or `'a'` (append); if no second argument is provided, the mode defaults to `'r'`.\n", "\n", "A common Python programming pattern for processing an input text file is to \n", "\n", "* [**`open`**](http://docs.python.org/2/library/functions.html#open) the file using a [**`with`**](http://docs.python.org/2/reference/compound_stmts.html#the-with-statement) statement (which will automatically [**`close`**](http://docs.python.org/2/library/stdtypes.html#file.close) the file after the statements inside the `with` block have been executed)\n", "* iterate over each line in the file using a **`for`** statement\n", "\n", "The following code creates a list of instances, where each instance is a list of attribute values (like `single_instance_list` above). " ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Read 8124 instances from agaricus-lepiota.data\n", "First instance: ['p', 'x', 's', 'n', 't', 'p', 'f', 'c', 'n', 'k', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u']\n" ] } ], "source": [ "all_instances = [] # initialize instances to an empty list\n", "data_filename = 'agaricus-lepiota.data'\n", "\n", "with open(data_filename, 'r') as f:\n", " for line in f: # 'line' will be bound to the next line in f in each for loop iteration\n", " all_instances.append(line.strip().split(','))\n", " \n", "print('Read', len(all_instances), 'instances from', data_filename)\n", "# we don't want to print all the instances, so we'll just print the first one to verify\n", "print('First instance:', all_instances[0]) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2: define load_instances()" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "Define a function, `load_instances(filename)`, that returns a list of instances in a text file. The function definition is started for you below. The function should exhibit the same behavior as the code above." ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Read 8124 instances from agaricus-lepiota.data\n", "First instance: ['p', 'x', 's', 'n', 't', 'p', 'f', 'c', 'n', 'k', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u']\n" ] } ], "source": [ "def load_instances(filename):\n", " '''Returns a list of instances stored in a file.\n", " \n", " filename is expected to have a series of comma-separated attribute values per line, e.g.,\n", " p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\n", " '''\n", " instances = []\n", " # your code goes here\n", " return instances\n", "\n", "data_filename = 'agaricus-lepiota.data'\n", "# delete 'simple_ml.' in the function call below to test your function\n", "all_instances_2 = simple_ml.load_instances(data_filename)\n", "print('Read', len(all_instances_2), 'instances from', data_filename)\n", "print('First instance:', all_instances_2[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Output can be written to a text file via the [**`file.write(str)`**](http://docs.python.org/2/library/stdtypes.html#file.write) method.\n", "\n", "As we saw earlier, the [`str.join(words)`](http://docs.python.org/2/library/stdtypes.html#str.join) method returns a single `str`-delimited string containing each of the strings in the `words` list.\n", "\n", "SQL and Hive database tables sometimes use a pipe ('|') delimiter to separate column values for each row when they are stored as flat files. 
The following code creates a new data file using pipes rather than commas to separate the attribute values.\n", "\n", "To help maintain internal consistency, it is generally a good practice to define a variable such as `DELIMITER` or `SEPARATOR`, bind it to the intended delimiter string, and then use it as a named constant. The Python language does not support named constants, so the use of variables as named constants depends on conventions (e.g., using ALL-CAPS)." ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Converting to |-delimited strings, e.g., p|x|s|n|t|p|f|c|n|k|e|e|s|s|w|w|p|w|o|p|k|s|u\n", "Read 8124 instances from agaricus-lepiota-2.data\n", "First instance: ['p', 'x', 's', 'n', 't', 'p', 'f', 'c', 'n', 'k', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u']\n" ] } ], "source": [ "DELIMITER = '|'\n", "\n", "print('Converting to {}-delimited strings, e.g.,'.format(DELIMITER), \n", "      DELIMITER.join(all_instances[0]))\n", "\n", "datafile2 = 'agaricus-lepiota-2.data'\n", "with open(datafile2, 'w') as f: # 'w' = open file for writing (output)\n", "    for instance in all_instances:\n", "        f.write(DELIMITER.join(instance) + '\\n') # write each instance on a separate line\n", "\n", "all_instances_3 = []\n", "with open(datafile2, 'r') as f:\n", "    for line in f:\n", "        all_instances_3.append(line.strip().split(DELIMITER)) # note: changed ',' to '|'\n", "\n", "print('Read', len(all_instances_3), 'instances from', datafile2)\n", "print('First instance:', all_instances_3[0]) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List comprehensions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python provides a powerful [*list comprehension*](http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions) construct to simplify the creation of a list by specifying a formula in a single expression.\n", "\n", 
"Some programmers find list comprehensions confusing, and avoid their use. We won't rely on list comprehensions here, but we will offer several examples with and without list comprehensions to highlight the power of the construct.\n", "\n", "One common use of list comprehensions is in the context of the [`str.join(words)`](http://docs.python.org/2/library/string.html#string.join) method we saw earlier.\n", "\n", "If we wanted to construct a pipe-delimited string containing elements of the list, we could use a `for` loop to iteratively add list elements and pipe delimiters to a string for all but the last element, and then manually add the last element." ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'a|b|c'" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create pipe-delimited string without using list comprehension\n", "DELIMITER = '|'\n", "delimited_string = ''\n", "token_list = ['a', 'b', 'c']\n", "\n", "for token in token_list[:-1]: # add all but the last token + DELIMITER\n", " delimited_string += token + DELIMITER\n", "delimited_string += token_list[-1] # add the last token (with no trailing DELIMITER)\n", "delimited_string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This process is much simpler using a list comprehension." 
] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'a|b|c'" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "delimited_string = DELIMITER.join([token for token in token_list])\n", "delimited_string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Missing values & \"clean\" instances" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As noted in the initial description of the UCI mushroom set above, 2480 of the 8124 instances have missing attribute values (denoted by `'?'`). \n", "\n", "There are several techniques for dealing with instances that include missing attribute values, but to simplify things in the context of this primer - and following the example in the [Data Science for Business](http://www.data-science-for-biz.com/) book - we will simply ignore any such instances and restrict our focus to only the *clean* instances (with no missing values).\n", "\n", "We could use several lines of code - with an `if` statement inside a `for` loop - to create a `clean_instances` list from the `all_instances` list. Or we could use a list comprehension that includes an `if` statement.\n", "\n", "We will show both approaches to creating `clean_instances` below." 
] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5644 clean instances\n" ] } ], "source": [ "# version 1: using an if statement nested within a for statement\n", "UNKNOWN_VALUE = '?'\n", "\n", "clean_instances = []\n", "for instance in all_instances:\n", "    if UNKNOWN_VALUE not in instance:\n", "        clean_instances.append(instance)\n", "\n", "print(len(clean_instances), 'clean instances')" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5644 clean instances\n" ] } ], "source": [ "# version 2: using an equivalent list comprehension\n", "clean_instances = [instance\n", "                   for instance in all_instances\n", "                   if UNKNOWN_VALUE not in instance]\n", "\n", "print(len(clean_instances), 'clean instances')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that line breaks can be used before a `for` or `if` keyword in a list comprehension." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dictionaries (dicts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although single character abbreviations of attribute values (e.g., 'x') allow for more compact data files, they are not as easy for human readers to understand as the longer attribute value descriptions (e.g., 'convex').\n", "\n", "A Python [dictionary (or **`dict`**)](http://docs.python.org/2/tutorial/datastructures.html#dictionaries) is an unordered, comma-delimited collection of ***key: value*** pairs, serving a similar function to a hash table or hashmap in other programming languages.\n", "\n", "We could create a dictionary for the `cap-type` attribute values shown above:\n", "\n", "> bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", "\n", "Since we will want to look up the value using the abbreviation (which is the representation of the value stored in the file), we will use the abbreviations as *keys* and the descriptions as *values*.\n", "\n", "A Python dictionary can be created by specifying all `key: value` pairs (with colons separating each *key* and *value*), or by adding them iteratively. We will show the first method in the cell below, and use the second method in a subsequent cell. \n", "\n", "Note that a *value* in a Python dictionary (`dict`) can be accessed by specifying its *key* using the general form `dict[key]` (or `dict.get(key[, default])`, which allows the specification of a `default` value to use if `key` is not in `dict`)." 
] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x = convex\n" ] } ], "source": [ "attribute_values_cap_type = {'b': 'bell', \n", " 'c': 'conical', \n", " 'x': 'convex', \n", " 'f': 'flat', \n", " 'k': 'knobbed', \n", " 's': 'sunken'}\n", "\n", "attribute_value_abbrev = 'x'\n", "print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Python dictionary is an *iterable* container, so we can iterate over the keys in a dictionary using a `for` loop.\n", "\n", "Note that since a dictionary is an *unordered* collection, the sequence of abbreviations and associated values is not guaranteed to appear in any particular order. " ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "c = conical\n", "b = bell\n", "f = flat\n", "k = knobbed\n", "s = sunken\n", "x = convex\n" ] } ], "source": [ "for attribute_value_abbrev in attribute_values_cap_type:\n", " print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python supports *dictionary comprehensions*, which have a similar form as the *list comprehensions* described above, except that both a key and a value have to be specified for each iteration.\n", "\n", "For example, if we provisionally omit the 'convex' cap-type (whose abbreviation is the last letter rather than first letter in the attribute name), we could construct a dictionary of abbreviations and descriptions using the following expression." 
] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'s': 'sunken', 'c': 'conical', 'b': 'bell', 'k': 'knobbed', 'f': 'flat'}\n" ] } ], "source": [ "attribute_values_cap_type_2 = {x[0]: x \n", " for x in ['bell', 'conical', 'flat', 'knobbed', 'sunken']}\n", "print(attribute_values_cap_type_2)" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[['b', 'bell'], ['c', 'conical'], ['f', 'flat'], ['k', 'knobbed'], ['s', 'sunken']]\n" ] } ], "source": [ "attribute_values_cap_type_2 = [[x[0], x ]\n", " for x in ['bell', 'conical', 'flat', 'knobbed', 'sunken']]\n", "print(attribute_values_cap_type_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While it's useful to have a dictionary of values for the `cap-type` attribute, it would be even more useful to have a dictionary of values for *every* attribute. Earlier, we created a list of `attribute_names`; we will now expand this to create a list of `attribute_values` wherein each list element is a dictionary.\n", "\n", "Rather than explicitly type in each dictionary entry in the Python interpreter, we'll define a function to read a file containing the list of attribute names, values and value abbreviations in the format shown above:\n", "\n", "* class: edible=e, poisonous=p\n", "* cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", "* cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s\n", "* ..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can make calls to [shell commands](https://ipython.org/ipython-doc/dev/interactive/tutorial.html#system-shell-commands) from a Python cell by using the bang (exclamation point). 
*\\[There are a large number of [cell magics](https://ipython.org/ipython-doc/dev/interactive/magics.html) that extend the capability of IPython Notebooks (which we will not explore further in this notebook).\\]*\n", "\n", "For example, the following cell will show the contents of the `agaricus-lepiota.attributes` file on OSX or Linux (for Windows, substitute `type` for `cat`)." ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "class: edible=e, poisonous=p\r\n", "cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\r\n", "cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s\r\n", "cap-color: brown=n ,buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y\r\n", "bruises?: bruises=t, no=f\r\n", "odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s\r\n", "gill-attachment: attached=a, descending=d, free=f, notched=n\r\n", "gill-spacing: close=c, crowded=w, distant=d\r\n", "gill-size: broad=b, narrow=n\r\n", "gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y\r\n", "stalk-shape: enlarging=e, tapering=t\r\n", "stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?\r\n", "stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s\r\n", "stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s\r\n", "stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\r\n", "stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\r\n", "veil-type: partial=p, universal=u\r\n", "veil-color: brown=n, orange=o, white=w, yellow=y\r\n", "ring-number: none=n, one=o, two=t\r\n", "ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z\r\n", "spore-print-color: black=k, 
brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y\r\n", "population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y\r\n", "habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d" ] } ], "source": [ "! cat agaricus-lepiota.attributes " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3: define `load_attribute_values()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We earlier created the `attribute_names` list manually. The `load_attribute_values()` function above creates the `attribute_values` list from the contents of a file, each line of which starts with the name of an attribute. Unfortunately, the function discards the name of each attribute.\n", "\n", "It would be nice to retain the name as well as the value abbreviations and descriptions. One way to do this would be to create a list of dictionaries, in which each dictionary has 2 keys, a `name`, the value of which is the attribute name (a string), and `values`, the value of which is yet another dictionary (with abbreviation keys and description values, as in `load_attribute_values()`).\n", "\n", "Complete the following function definition so that the code implements this functionality." 
] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Read 23 attribute values from agaricus-lepiota.attributes\n", "First attribute name: class ; values: {'p': 'poisonous', 'e': 'edible'}\n" ] } ], "source": [ "def load_attribute_names_and_values(filename):\n", "    '''Returns a list of attribute names and values in a file.\n", "    \n", "    This list contains dictionaries, each with two keys: 'name' (the attribute name) \n", "    and 'values' (a value description dictionary).\n", "    \n", "    Each value description sub-dictionary will use \n", "    the attribute value abbreviations as its keys \n", "    and the attribute descriptions as the values.\n", "    \n", "    filename is expected to have one attribute name and set of values per line, \n", "    with the following format:\n", "        name: value_description=value_abbreviation[,value_description=value_abbreviation]*\n", "    for example\n", "        cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\n", "    The attribute name and values dictionary created from this line would be the following:\n", "        {'name': 'cap-shape', \n", "         'values': {'c': 'conical', \n", "                    'b': 'bell', \n", "                    'f': 'flat', \n", "                    'k': 'knobbed', \n", "                    's': 'sunken', \n", "                    'x': 'convex'}}\n", "    '''\n", "    attribute_names_and_values = [] # this will be a list of dicts\n", "    # your code goes here\n", "    return attribute_names_and_values\n", "\n", "attribute_filename = 'agaricus-lepiota.attributes'\n", "# delete 'simple_ml.' 
in the function call below to test your function\n", "attribute_names_and_values = simple_ml.load_attribute_names_and_values(attribute_filename)\n", "print('Read', len(attribute_names_and_values), 'attribute values from', attribute_filename)\n", "print('First attribute name:', attribute_names_and_values[0]['name'], \n", "      '; values:', attribute_names_and_values[0]['values'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Counters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data scientists often need to count things. For example, we might want to count the numbers of edible and poisonous mushrooms in the `clean_instances` list we created earlier." ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 3488 edible mushrooms among the 5644 clean instances\n" ] } ], "source": [ "edible_count = 0\n", "for instance in clean_instances:\n", "    if instance[0] == 'e':\n", "        edible_count += 1 # this is shorthand for edible_count = edible_count + 1\n", "\n", "print('There are', edible_count, 'edible mushrooms among the', \n", "      len(clean_instances), 'clean instances')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More generally, we often want to count the number of occurrences (frequencies) of each possible value for an attribute. One way to do so is to create a dictionary where each dictionary key is an attribute value and each dictionary value is the count of instances with that attribute value.\n", "\n", "Using an ordinary dictionary, we must be careful to create a new dictionary entry the first time we see a new attribute value (that is not already contained in the dictionary)." 
] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counts for each value of cap-state:\n", "c : 4\n", "b : 300\n", "f : 2432\n", "k : 36\n", "s : 32\n", "x : 2840\n" ] } ], "source": [ "cap_state_value_counts = {}\n", "for instance in clean_instances:\n", "    cap_state_value = instance[1] # cap-state is the 2nd attribute\n", "    if cap_state_value not in cap_state_value_counts:\n", "        # first occurrence, must explicitly initialize counter for this cap_state_value\n", "        cap_state_value_counts[cap_state_value] = 0\n", "    cap_state_value_counts[cap_state_value] += 1\n", "\n", "print('Counts for each value of cap-state:')\n", "for value in cap_state_value_counts:\n", "    print(value, ':', cap_state_value_counts[value])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Python [**`collections`**](http://docs.python.org/2/library/collections.html) module provides a number of high performance container datatypes. A frequently useful datatype is a [**`Counter`**](http://docs.python.org/2/library/collections.html#collections.Counter), a specialized dictionary in which each *key* is a unique element found in a list or some other container, and each *value* is the number of occurrences of that element in the source container. 
The default value for each newly created key is zero.\n", "\n", "A `Counter` includes a method, [**`most_common([n])`**](http://docs.python.org/2/library/collections.html#collections.Counter.most_common), that returns a list of 2-element tuples representing the values and their associated counts for the most common `n` values in descending order of the counts; if `n` is omitted, the method returns all tuples.\n", "\n", "Note that we can either use\n", "\n", "`import collections`\n", "\n", "and then use `collections.Counter()` in our code, or use\n", "\n", "`from collections import Counter`\n", "\n", "and then use `Counter()` (with no module specification) in our code." ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counts for each value of cap-state:\n", "c : 4\n", "b : 300\n", "f : 2432\n", "k : 36\n", "s : 32\n", "x : 2840\n" ] } ], "source": [ "from collections import Counter\n", "\n", "cap_state_value_counts = Counter()\n", "for instance in clean_instances:\n", "    cap_state_value = instance[1]\n", "    # no need to explicitly initialize counters for cap_state_value; all start at zero\n", "    cap_state_value_counts[cap_state_value] += 1\n", "\n", "print('Counts for each value of cap-state:')\n", "for value in cap_state_value_counts:\n", "    print(value, ':', cap_state_value_counts[value])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When a `Counter` object is instantiated with a list of items, it returns a dictionary-like container in which the *keys* are the unique items in the list, and the *values* are the counts of each unique item in that list. 
" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counter({'a': 3, 'b': 2, 'c': 1})\n", "[('a', 3), ('b', 2), ('c', 1)]\n" ] } ], "source": [ "counts = Counter(['a', 'b', 'c', 'a', 'b', 'a'])\n", "print(counts)\n", "print(counts.most_common())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This allows us to count the number of values for `cap-state` in a very compact way." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use a `Counter` initialized with a list comprehension to collect all the values of the 2nd attribute, `cap-state`.\n", "\n", "The following shows the first 10 instances; the second element in each sublist is the value of `cap-state` for that instance." ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[['p', 'x', 's', 'n', 't', 'p', 'f', 'c', 'n', 'k', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u'], ['e', 'x', 's', 'y', 't', 'a', 'f', 'c', 'b', 'k', 'e', 'c', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'n', 'n', 'g'], ['e', 'b', 's', 'w', 't', 'l', 'f', 'c', 'b', 'n', 'e', 'c', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'n', 'n', 'm'], ['p', 'x', 'y', 'w', 't', 'p', 'f', 'c', 'n', 'n', 'e', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'u'], ['e', 'x', 's', 'g', 'f', 'n', 'f', 'w', 'b', 'k', 't', 'e', 's', 's', 'w', 'w', 'p', 'w', 'o', 'e', 'n', 'a', 'g'], ['e', 'x', 'y', 'y', 't', 'a', 'f', 'c', 'b', 'n', 'e', 'c', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 'n', 'g'], ['e', 'b', 's', 'w', 't', 'a', 'f', 'c', 'b', 'g', 'e', 'c', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 'n', 'm'], ['e', 'b', 'y', 'w', 't', 'l', 'f', 'c', 'b', 'n', 'e', 'c', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'n', 's', 'm'], ['p', 'x', 'y', 'w', 't', 'p', 'f', 'c', 'n', 'p', 'e', 'e', 's', 's', 'w', 'w', 'p', 
'w', 'o', 'p', 'k', 'v', 'g'], ['e', 'b', 's', 'y', 't', 'a', 'f', 'c', 'b', 'g', 'e', 'c', 's', 's', 'w', 'w', 'p', 'w', 'o', 'p', 'k', 's', 'm']]\n" ] } ], "source": [ "print(clean_instances[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following list comprehension gathers the 2nd attribute of each of the first 10 sublists (note the slice notation)." ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['x', 'x', 'b', 'x', 'x', 'x', 'b', 'b', 'x', 'b']" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[instance[1] for instance in clean_instances][:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will gather all of the values for the 2nd attribute into a list and create a `Counter` for that list." ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counts for each value of cap-state:\n", "c : 4\n", "b : 300\n", "f : 2432\n", "k : 36\n", "s : 32\n", "x : 2840\n" ] } ], "source": [ "cap_state_value_counts = Counter([instance[1] for instance in clean_instances])\n", "\n", "print('Counts for each value of cap-state:')\n", "for value in cap_state_value_counts:\n", " print(value, ':', cap_state_value_counts[value])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4: define `attribute_value_counts()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a function, `attribute_value_counts(instances, attribute, attribute_names)`, that returns a `Counter` containing the counts of occurrences of each value of `attribute` in the list of `instances`. 
`attribute_names` is the list we created above, where each element is the name of an attribute.\n", "\n", "This exercise is designed to generalize the solution shown in the code directly above (which handles only the `cap-state` attribute)." ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counts for each value of cap-shape :\n", "c : 4\n", "b : 300\n", "f : 2432\n", "k : 36\n", "s : 32\n", "x : 2840\n" ] } ], "source": [ "# your definition goes here\n", "\n", "attribute = 'cap-shape'\n", "# delete 'simple_ml.' in the function call below to test your function\n", "attribute_value_counts = simple_ml.attribute_value_counts(clean_instances, \n", "                                                          attribute, \n", "                                                          attribute_names)\n", "\n", "print('Counts for each value of', attribute, ':')\n", "for value in attribute_value_counts:\n", "    print(value, ':', attribute_value_counts[value])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More on sorting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Earlier, we saw that there is a `list.sort()` method that will sort a list in-place, i.e., by replacing the original value of `list` with a sorted version of the elements in `list`. \n", "\n", "We also saw that the [**`sorted(iterable[, cmp[, key[, reverse]]])`**](http://docs.python.org/2/library/functions.html#sorted) function can be used to return a new, sorted *list* of the elements in a list, dictionary or any other [*iterable*](http://docs.python.org/2/glossary.html#term-iterable) container it is passed, in ascending order." 
] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[3, 1, 4, 2, 5]\n", "[1, 2, 3, 4, 5]\n" ] } ], "source": [ "original_list = [3, 1, 4, 2, 5]\n", "sorted_list = sorted(original_list)\n", "\n", "print(original_list)\n", "print(sorted_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`sorted()` can also be used with dictionaries (it returns a sorted list of the dictionary *keys*)." ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['b', 'c', 'f', 'k', 's', 'x']\n" ] } ], "source": [ "print(sorted(attribute_values_cap_type))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the sorted *keys* to access the *values* of a dictionary in ascending order of the keys." ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b = bell\n", "c = conical\n", "f = flat\n", "k = knobbed\n", "s = sunken\n", "x = convex\n" ] } ], "source": [ "for attribute_value_abbrev in sorted(attribute_values_cap_type):\n", "    print(attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev])" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counts for each value of cap-shape :\n", "b : 300\n", "c : 4\n", "f : 2432\n", "k : 36\n", "s : 32\n", "x : 2840\n" ] } ], "source": [ "attribute = 'cap-shape'\n", "attribute_value_counts = simple_ml.attribute_value_counts(clean_instances, \n", "                                                          attribute, \n", "                                                          attribute_names)\n", "\n", "print('Counts for each value of', attribute, ':')\n", "for value in sorted(attribute_value_counts):\n", "    print(value, ':', attribute_value_counts[value])" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "### Sorting a dictionary by values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is often useful to sort a dictionary by its *values* rather than its *keys*. \n", "\n", "For example, when we printed out the counts of the attribute values for `cap-shape` above, the counts appeared in ascending alphabetical order of the attribute value abbreviations. It is often more helpful to show the attribute value counts in descending order of the counts (which are the values in that dictionary).\n", "\n", "There are a [variety of ways to sort a dictionary by values](http://writeonly.wordpress.com/2008/08/30/sorting-dictionaries-by-value-in-python-improved/), but the approach described in [PEP 265](http://legacy.python.org/dev/peps/pep-0265/) is generally considered the most efficient.\n", "\n", "In order to understand the components used in this approach, we will revisit and elaborate on a few concepts involving *dictionaries*, *iterators* and *modules*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [**`dict.items()`**](http://docs.python.org/2/library/stdtypes.html#dict.items) method returns an unordered list of `(key, value)` tuples in `dict`." 
] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('c', 'conical'),\n", " ('b', 'bell'),\n", " ('f', 'flat'),\n", " ('k', 'knobbed'),\n", " ('s', 'sunken'),\n", " ('x', 'convex')]" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "attribute_values_cap_type.items()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Python 2, a related method, [**`dict.iteritems()`**](http://docs.python.org/2/library/stdtypes.html#dict.iteritems), returns an [**`iterator`**](http://docs.python.org/2/library/stdtypes.html#iterator-types): an object that produces the *next* item in a sequence each time one is requested (e.g., during each iteration of a `for` loop). This can be more efficient than generating *all* the items in the sequence before any are used, and so it is generally preferable to `items()` in Python 2.\n", "\n", "This is similar to the distinction between `xrange()` and `range()` described above. Also similarly, in Python 3 `dict.items()` returns a lazily evaluated *view* of the items rather than a list, so `dict.iteritems()` is no longer needed (nor defined). We will use only `dict.items()` in this notebook, but it is generally more efficient to use `dict.iteritems()` in Python 2." 
] }, { "cell_type": "code", "execution_count": 95, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "c : conical\n", "b : bell\n", "f : flat\n", "k : knobbed\n", "s : sunken\n", "x : convex\n" ] } ], "source": [ "for key, value in attribute_values_cap_type.items():\n", "    print(key, ':', value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Python [**`operator`**](http://docs.python.org/2/library/operator.html) module contains a number of functions that perform object comparisons, logical operations, mathematical operations, sequence operations, and abstract type tests.\n", "\n", "To facilitate sorting a dictionary by values, we will use the [**`operator.itemgetter(i)`**](http://docs.python.org/2/library/operator.html#operator.itemgetter) function, which returns a function that retrieves the item at index `i` of a tuple (such as a `(key, value)` pair returned by `[iter]items()`).\n", "\n", "We can use `operator.itemgetter(1)` to reference the *value* - the 2nd item in each `(key, value)` tuple (at zero-based index position 1) - rather than the *key* - the first item in each `(key, value)` tuple (at index position 0).\n", "\n", "We will use the optional keyword argument **`key`** in [`sorted(iterable[, cmp[, key[, reverse]]])`](http://docs.python.org/2/library/functions.html#sorted) to specify a *sorting* key that is not the same as the `dict` key (recall that the `dict` key is the default *sorting* key for `sorted()`)."
] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('b', 'bell'),\n", " ('c', 'conical'),\n", " ('x', 'convex'),\n", " ('f', 'flat'),\n", " ('k', 'knobbed'),\n", " ('s', 'sunken')]" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import operator\n", "\n", "sorted(attribute_values_cap_type.items(), \n", " key=operator.itemgetter(1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now sort the counts of attribute values in descending frequency of occurrence, and print them out using tuple unpacking." ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Counts for each value of cap-shape (sorted by count):\n", "x : 2840\n", "f : 2432\n", "b : 300\n", "k : 36\n", "s : 32\n", "c : 4\n" ] } ], "source": [ "attribute = 'cap-shape'\n", "value_counts = simple_ml.attribute_value_counts(clean_instances, \n", " attribute, \n", " attribute_names)\n", "\n", "print('Counts for each value of', attribute, '(sorted by count):')\n", "for value, count in sorted(value_counts.items(), \n", " key=operator.itemgetter(1), \n", " reverse=True):\n", " print(value, ':', count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this example is rather contrived, as it is generally easiest to use a `Counter` and its associated `most_common()` method when sorting a dictionary wherein the values are all counts. The need to sort other kinds of dictionaries by their values is rather common. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### String formatting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is often helpful to use [fancier output formatting](http://docs.python.org/2/tutorial/inputoutput.html#fancier-output-formatting) than simply printing comma-delimited lists of items. 
\n", "\n", "Examples of the **[`str.format()`](https://docs.python.org/2/library/stdtypes.html#str.format)** method used in conjunction with `print()` are shown below. \n", "\n", "More details can be found in the Python documentation on [format string syntax](http://docs.python.org/2/library/string.html#format-string-syntax)." ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.100\n", "  0.100\n", "000.100\n", "  1\n", "001\n", "hello     \n", "     hello\n" ] } ], "source": [ "print('{:5.3f}'.format(0.1)) # fieldwidth = 5; precision = 3; f = float\n", "print('{:7.3f}'.format(0.1)) # if fieldwidth is larger than needed, left pad with spaces\n", "print('{:07.3f}'.format(0.1)) # use leading zero to left pad with leading zeros\n", "print('{:3d}'.format(1)) # d = int\n", "print('{:03d}'.format(1))\n", "print('{:10s}'.format('hello')) # s = string, left-justified\n", "print('{:>10s}'.format('hello')) # use '>' to right-justify within fieldwidth" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following example illustrates the use of `str.format()` on data associated with the mushroom dataset." ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "class: e = 3488 (0.618), p = 2156 (0.382) " ] } ], "source": [ "print('class: {} = {} ({:5.3f}), {} = {} ({:5.3f})'.format(\n", "    'e', 3488, 3488 / 5644, \n", "    'p', 2156, 2156 / 5644), end=' ')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following variation - splitting off the printing of the attribute name from the printing of the values and counts of values for that attribute - may be more useful in developing a solution to the following exercise."
] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "class: e = 3488 (0.618), p = 2156 (0.382) \n" ] } ], "source": [ "print('class:', end=' ') # keeps cursor on the same line for subsequent print statements\n", "print('{} = {} ({:5.3f}),'.format('e', 3488, 3488 / 5644), end=' ')\n", "print('{} = {} ({:5.3f})'.format('p', 2156, 2156 / 5644), end=' ')\n", "print() # advance the cursor to the beginning of the next line" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 5: define `print_all_attribute_value_counts()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a function, `print_all_attribute_value_counts(instances, attribute_names)`, that prints each attribute name in `attribute_names`, and then for each attribute value, prints the value abbreviation, the count of occurrences of that value and the proportion of instances that have that attribute value." 
] }, { "cell_type": "code", "execution_count": 101, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Counts for all attributes and values:\n", "\n", "class: e = 3488 (0.618), p = 2156 (0.382), \n", "cap-shape: x = 2840 (0.503), f = 2432 (0.431), b = 300 (0.053), k = 36 (0.006), s = 32 (0.006), c = 4 (0.001), \n", "cap-surface: y = 2220 (0.393), f = 2160 (0.383), s = 1260 (0.223), g = 4 (0.001), \n", "cap-color: g = 1696 (0.300), n = 1164 (0.206), y = 1056 (0.187), w = 880 (0.156), e = 588 (0.104), b = 120 (0.021), p = 96 (0.017), c = 44 (0.008), \n", "bruises?: t = 3184 (0.564), f = 2460 (0.436), \n", "odor: n = 2776 (0.492), f = 1584 (0.281), a = 400 (0.071), l = 400 (0.071), p = 256 (0.045), c = 192 (0.034), m = 36 (0.006), \n", "gill-attachment: f = 5626 (0.997), a = 18 (0.003), \n", "gill-spacing: c = 4620 (0.819), w = 1024 (0.181), \n", "gill-size: b = 4940 (0.875), n = 704 (0.125), \n", "gill-color: p = 1384 (0.245), n = 984 (0.174), w = 966 (0.171), h = 720 (0.128), g = 656 (0.116), u = 480 (0.085), k = 408 (0.072), r = 24 (0.004), y = 22 (0.004), \n", "stalk-shape: t = 2880 (0.510), e = 2764 (0.490), \n", "stalk-root: b = 3776 (0.669), e = 1120 (0.198), c = 556 (0.099), r = 192 (0.034), \n", "stalk-surface-above-ring: s = 3736 (0.662), k = 1332 (0.236), f = 552 (0.098), y = 24 (0.004), \n", "stalk-surface-below-ring: s = 3544 (0.628), k = 1296 (0.230), f = 552 (0.098), y = 252 (0.045), \n", "stalk-color-above-ring: w = 3136 (0.556), p = 1008 (0.179), g = 576 (0.102), n = 448 (0.079), b = 432 (0.077), c = 36 (0.006), y = 8 (0.001), \n", "stalk-color-below-ring: w = 3088 (0.547), p = 1008 (0.179), g = 576 (0.102), n = 496 (0.088), b = 432 (0.077), c = 36 (0.006), y = 8 (0.001), \n", "veil-type: p = 5644 (1.000), \n", "veil-color: w = 5636 (0.999), y = 8 (0.001), \n", "ring-number: o = 5488 (0.972), t = 120 (0.021), n = 36 (0.006), \n", "ring-type: p = 3488 (0.618), l = 1296 (0.230), e = 824 
(0.146), n = 36 (0.006), \n", "spore-print-color: n = 1920 (0.340), k = 1872 (0.332), h = 1584 (0.281), w = 148 (0.026), r = 72 (0.013), u = 48 (0.009), \n", "population: v = 2160 (0.383), y = 1688 (0.299), s = 1104 (0.196), a = 384 (0.068), n = 256 (0.045), c = 52 (0.009), \n", "habitat: d = 2492 (0.442), g = 1860 (0.330), p = 568 (0.101), u = 368 (0.065), m = 292 (0.052), l = 64 (0.011), \n" ] } ], "source": [ "# your function definition goes here\n", "\n", "print('\\nCounts for all attributes and values:\\n')\n", "# delete 'simple_ml.' in the function call below to test your function\n", "simple_ml.print_all_attribute_value_counts(clean_instances, attribute_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Navigation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notebooks in this primer:\n", "\n", "* [1. Introduction](1_Introduction.ipynb)\n", "* [2. Data Science: Basic Concepts](2_Data_Science_Basic_Concepts.ipynb)\n", "* **3. Python: Basic Concepts** (*you are here*)\n", "* [4. Using Python to Build and Use a Simple Decision Tree Classifier](4_Python_Simple_Decision_Tree.ipynb)\n", "* [5. Next Steps](5_Next_Steps.ipynb)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }