{ "metadata": { "name": "", "signature": "sha256:02775337c0e69cf375098fcccc963d2fdd0f3890830c8c9753bc4fb84efbee49" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Chapter 4: Programming Principles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-- *A Python Course for the Humanities by Folgert Karsdorp and Maarten van Gompel*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By now you will have a sense of how to use Python for some basic programming. There is, however, still quite some ground to cover before you will have full control over the language. Don't worry, you will get there. This chapter will help you in writing more structured programs, provide you with some best practices and will teach you about some more general programming strategies." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Data Structures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perhaps the most important aspect of writing good programs is developing and designing appropriate data structures for the problem at hand. Data structures should feel natural, be flexible, and should not be unnecessarily difficult or hard to read. As a rule of thumb, most of the time the least complex data structure is the one to go with. Let's have a look at a real example to see what different design choices we can make regarding data structures.\n", "\n", "In the file `data/twitter.txt` we constructed a fictional network of twitter users. Each line represents an edge in the network between two users separated by a semicolon:\n", "\n", "`@Fox;@Judie`\n", "\n", "`@Tristan;@Jermain`\n", "\n", "`@Allyn;@Winfred`\n", "\n", "`@Dennis;@Randolph`\n", "\n", "`@Wallie;@Venkat`\n", "\n", "The first name represents the follower; the second name the followee. \n", "\n", "One seemingly natural data structure to represent this network in Python is a list of tuples each consisting of two names: `[(name1, name2), (name1, name3), ..., (name300, name41)]`. We construct the network in this format as follows: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "edges = []\n", "for line in open(\"data/twitter.txt\"):\n", " name_a, name_b = line.strip().split(';')\n", " edges.append((name_a, name_b))\n", "print(edges[:10])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In itself nothing is wrong with this data structure. For example, we might use it as follows to find all people user @Fox follows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def following(user, edges):\n", " \"Return a list of all users USERS is following.\"\n", " followees = []\n", " for follower, followee in edges:\n", " if follower == user:\n", " followees.append(followee)\n", " return followees\n", "\n", "print(following(\"@Fox\", edges))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One particular downside of this datastructure, especially when our network grows, is that it can become slow, painstakingly slow... For each search query, we have to go through the entire network, check for each node whether it is equal to the one we're looking for and then append the followee to the list of followees. To give you an impression of how long Python is taking, execute the following cell:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit following(\"@Fox\", edges)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It depends a little on your computer, but it will probably be around 500 microseconds per loop or 0.000592 seconds. \"That's quite fast!\", you might think. Just wait and see what happens if we adjust our data structure to the following. This time we represent our network as a dictionary with followers as keys and a list of followees as value. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "edge_dict = {}\n", "for line in open(\"data/twitter.txt\"):\n", " name_a, name_b = line.strip().split(';')\n", " if name_a in edge_dict:\n", " edge_dict[name_a].append(name_b)\n", " else:\n", " edge_dict[name_a] = [name_b]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is not really necessary to define the following function, but we'll do it just so that it is easier to compare with the previous one." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def following2(user, edges):\n", " return edges[user]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's test this function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit following2(\"@Fox\", edge_dict)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function approximately takes about 160 nanoseconds per loop, which is around 0.000000165 seconds. Now *that* is fast! You're probably thinking that it is not really a big deal, since you can't really tell the difference. And you're right: for this particular case it doesn't really matter. But what if our network was ten times as big? Or 100 times, 1000 or a million times? Then what would happen? To convince you of my point, I'll will expand our network, making it 1000 times bigger." ] }, { "cell_type": "code", "collapsed": false, "input": [ "edges = []\n", "for line in open(\"data/twitter.txt\"):\n", " name_a, name_b = line.strip().split(';')\n", " # repeatedly add edges to the network (1000 times)\n", " for i in range(1000):\n", " edges.append((name_a, name_b))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit following(\"@Fox\", edges)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function has become quite a bit slower. It will be somewhere around 570 milliseconds, or 0.57 seconds. Now lets do the same thing with the dictionary network" ] }, { "cell_type": "code", "collapsed": false, "input": [ "edge_dict = {}\n", "for line in open(\"data/twitter.txt\"):\n", " name_a, name_b = line.strip().split(';')\n", " for i in range(1000):\n", " if name_a in edge_dict:\n", " edge_dict[name_a].append(name_b)\n", " else:\n", " edge_dict[name_a] = [name_b]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%timeit following2(\"@Fox\", edge_dict)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The timing is still about the same! What's going on?! It would take too long to explain to you exactly what the difference is between a dictionary and a list (you can read more about it [here](http://en.wikipedia.org/wiki/Hash_table)), but for now you should remember that you can access keys in a dictionary in constant time (meaning it does not matter how big the dictionary is), whereas for a list, it depends on the size of the list. This example makes it clear that the choice for a given data structure can make a real difference." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Breaking up your code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another important aspect of good programs is reusability. It is good practice to write general functions that can be applied to a range of problems which you can combine into a complex of functions for a particular task. Again, this can best be shown with an example. \n", "\n", "We will write two versions of a program to transform an English word into [Pig Latin](https://en.wikipedia.org/wiki/Pig_Latin). To translate an English word into Pig Latin we apply the following rules:\n", "\n", "- If a word begins with a consonant or consonant cluster, remove that part from the beginning of the word and add it to the end of the word. Now to make it really latinish, add \"ay\" to this. E.g. *duck* - *uckday* and *bush* - *ushbay*.\n", "- If a word starts with a vowel, simply add \"ay\" to the end of the word. E.g. *egg* - *eggay* and *inbox* - *inboxay*.\n", "\n", "We will first give you a rather verbose function that does it all in one shot:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def translate(word):\n", " \"Convert a word to latin.\"\n", " vowels = 'aeiouAEIOU'\n", " start = 0\n", " end = ''\n", " # loop over all characters in word\n", " for i, char in enumerate(word):\n", " # if this character is not a vowel\n", " if char not in vowels:\n", " # it is a consonant, so add it to the end.\n", " end += char\n", " # if it is a vowel\n", " else:\n", " # we set the starting position to \n", " # the position of this character\n", " start = i\n", " break\n", " return word[start:] + end + 'ay'" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From just looking at this function it is hard to see what it is doing exactly. We'll try to explain it. It loops over each character in `word` and for each character it asks whether it is not a vowel (i.e. a consonant). If it is a consonant, we add it to the `end` variable which we will later append to the remaining word. While this is `True` we carry on to the next character. If we find a vowel, we store the index of that vowel and `break` from the loop. Now we return the word starting from the first vowel found, add the consonant or conconant cluster to it and add *ay* to that word.\n", "\n", "Okay, so this works, but it is not very readable, now is it? If you have a particular problem, it is good practice to divide it into subproblems and solve them separately. Let's break this function up into more comprehensible parts. First we will write a function called `starts_with_vowel`, which takes as argument a string and returns if the first character is a vowel." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a function called `starts_with_vowel` which takes as argument a word represented by a string. The function should return `True` if the word starts with a vowel; `False` otherwise." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# write your code here\n", "\n", "# do not modify the code below, it is for testing your answer only!\n", "# it should output True if you did well\n", "print(starts_with_vowel(\"egg\") == True)\n", "print(starts_with_vowel(\"bacon\") == False)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One possible way to write the function `starts_with_vowel` is the following:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def starts_with_vowel(word):\n", " \"Return True if WORD starts with a vowel, False otherwise.\"\n", " vowels = ('a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U')\n", " return word.startswith(vowels)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the function `starts_with_vowel` we define a variable `vowels` which is a tuple containing all vowels, both uppercase and lowercase. We then call the built in function `startswith` which operates on strings and takes as argument either a string or a tuple of strings. It checks for each item in the tuple whether the string starts with that item. As soon as it finds one, it returns `True`. If the string does not start with any of the items in the tuple, it will return `False`.\n", "\n", "Perhaps someday we would like add different Latin endings to words. Having to change *ay* directly in our code for each suffix we choose is rather inconvenient. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write a function called `add_suffix`. It takes as argument two strings: a word and the string we would like to attach to that word." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def add_suffix(word, suffix):\n", " \"Return WORD with SUFFIX attached.\"\n", " # insert your code here\n", " \n", "# do not modify the code below, it is for testing your answer only!\n", "# it should output True if you did well\n", "print(add_suffix(\"egg\", \"ay\") == \"eggay\")\n", "print(add_suffix(\"egg\", \"oing\") == \"eggoing\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we really want a separate function to add *ay* to words, we can now define it as follows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def add_ay(word):\n", " \"Return WORD with 'ay' attached.\"\n", " return add_suffix(word, \"ay\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a small and a little silly example of how we can recombine and reuse functions into other more specific functions. Now that we have defined these small helper functions, have a look at the following implementation of `translate`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def translate(word, suffix):\n", " if starts_with_vowel(word):\n", " return add_suffix(word, suffix)\n", " return translate(word[1:] + word[0], suffix)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first thing we notice is that this definition is much shorter. This function provides a recursive solution to our problem and can be read as follows: if the input word starts with a vowel, return the word and add *ay* to it. If it does not start with a vowel, move the first character to the end of the word and try to convert it once more. In our - perhaps subjective - opinion, the recursive solution provides a much more elegant solution to our problem.\n", "\n", "The function `translate` is a reusable function, independent of the suffix we choose. The following function makes use of the generic nature of `translate` and defines a translation function specific to pig latin:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def pig_latinize(word):\n", " \"Pig latinize WORD.\"\n", " return translate(word, \"ay\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Testing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Always test your code, things never go right the first time, not even for experienced programmers. Errors are part of the progress and no sign of failure! Always seek to clearly understand the message behind an error and then find a solution to it. It is also recommended that you write little tests for functions, to assure they give the desired output given certain input.\n", "\n", "To debug your code, it often helps to insert some temporary explicit print statements along the way, outputting the value of certain variables. This can help you spot if things are really progressing the way they should." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Use library references, 3rd party modules, and google errors if you're stuck" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Programmers never know all functions in their language by heart. Part of the art of programming is knowing where to look and how to read library references. Library references describe what functions, etc., exist in what modules, what parameters they take, and what values they return. Use these references! For Python, just start at http://www.python.org !\n", "\n", "Chances are also high that someone has already done what you want, or at least solved one of your subproblems, and made a module for it which you can use! Here is a small list of modules that might be interesting for you:\n", "\n", "- [NLTK](http://www.nltk.org). Provides a nice introduction into programming with Python and an introduction into Natural Language Processing.\n", "\n", "- [PyNLPl](http://github.com/proycon/pynlpl). Library for Natural Language Processing, written by Maarten van Gompel. Contains a lot of modules for dealing with text, specific formats encountered in the field, and interfacing with other external NLP software. All of the things we did in chapter three are readily available in PyNLPl. \n", "\n", "- [Scikit-learn](http://www.scikit-learn.org): Easy-to-use and general-purpose machine learning in Python. Provides many implementations of machine learning algorithms, such as Support Vector Machines, Naive Bayes Classifiers, supervised and unsupervised cluster analysis and many more.\n", "\n", "- [Networkx](http://networkx.github.com) (not yet for Python 3): NetworkX is a Python language software package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. Supports many output formats that can be imported by e.g. gephi.\n", "\n", "- [gensim](http://radimrehurek.com/gensim/) (in development for Python 3 at https://github.com/samantp/gensimPy3): Topic modelling for humans. Good and easy to use library for many topic modelling algorithms, such as LSA and LDA.\n", "\n", "- [Pattern](http://www.clips.ua.ac.be/pages/pattern) (not yet for Python 3): general Natural Language Processing library with support for Dutch and English part of speech tagging.\n", "\n", "- [Numpy](http://www.numpy.org): fundamental package for scientific computing with Python. It contains among other things:\n", "\n", " - a powerful N-dimensional array object\n", " - sophisticated (broadcasting) functions\n", " - tools for integrating C/C++ and Fortran code\n", " - useful linear algebra, Fourier transform, and random number capabilities\n", "\n", "- [Scipy](http://www.scipy.org/): open-source software for mathematics, science, and engineering. Very good and fast library for all kinds of scientific computations, cluster analysis etc.\n", "\n", "- [Matplotlib](http://matplotlib.org/): Python plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. \n", "\n", "- [lxml](http://www.lxml.de/): Extensive library for processing XML files in Python.\n", "\n", "- [Django](http://www.djangoproject.com/): Interested in full fledged web-development? Django is an amazing open-source framework for Python that allows you to make complex web applications." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Moreover, we would like to encourage you to always open your source code and share it with others. This principle is called *open source* and allows others to use and learn from your code. Especially in the sciences, such a transparency is vital for reproducibility and for building upon each other's work. The Python language itself is fully open-source, as is the Linux operating system, and tens of thousands of other software packages!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ignore the following, it's just here to make the page pretty:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.core.display import HTML\n", "def css_styling():\n", " styles = open(\"styles/custom.css\", \"r\").read()\n", " return HTML(styles)\n", "css_styling()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "/*\n", "Placeholder for custom user CSS\n", "\n", "mainly to be overridden in profile/static/custom/custom.css\n", "\n", "This will always be an empty file in IPython\n", "*/\n", "\n", "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 1, "text": [ "" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\"Creative
Python Programming for the Humanities by http://fbkarsdorp.github.io/python-course is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https://github.com/fbkarsdorp/python-course.

" ] } ], "metadata": {} } ] }