{ "metadata": { "name": "", "signature": "sha256:6976dae8a9cd1c10ec18a62d6827041e8d0f3014c359220e7c44c22335aba40f" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Chapter 3: Text Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-- *A Python Course for the Humanities by Folgert Karsdorp and Maarten van Gompel*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this chapter we will introduce you to the task of text analysis in Python. You will learn how to read an entire corpus into Python, clean it and how to perform certain data analyses on those texts. We will also briefly introduce you to using Python's plotting library *matplotlib*, with which you can visualize your data.\n", "\n", "Before we delve into the main subject of this chapter, text analysis, we will first write a couple of utility functions that build upon the things you learnt in the previous chapter. Often we don't work with a single text file stored at our computer, but with multiple text files or entire corpora. We would like to have a way to load a corpus into Python.\n", "\n", "Remember how to read files? Each time we had to open a file, read the contents and then close the file. Since this is a series of steps we will often need to do, we can write a single function that does all that for us. We write a small utility function `read_file(filename)` that reads the specified file and simply returns all contents as a single string." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def read_file(filename):\n", " \"Read the contents of FILENAME and return as a string.\"\n", " infile = open(filename) # windows users should use codecs.open after importing codecs\n", " contents = infile.read()\n", " infile.close()\n", " return contents" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, instead of having to open a file, read the contents and close the file, we can just call the function `read_file` to do all that:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "text = read_file(\"data/austen-emma-excerpt.txt\")\n", "print(text)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the directory `data/gutenberg/training` we have a corpus consisting of multiple files with the extension `.txt`. This corpus is a collection of English novels which we downloaded for you from the [Gutenberg](http://www.gutenberg.org) project. We want to iterate over all these files. You can do this using the `listdir` function from the `os` module. We import this function as follows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from os import listdir" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After that, the `listdir` function is available to use. This function takes as argument the path to a directory and returns all the files and subdirectories present in that directory:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "listdir(\"data\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that `listdir` returns a list and we can iterate over that list. 
Now, consider the following function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def list_textfiles(directory):\n", "    \"Return a list of filenames ending in '.txt' in DIRECTORY.\"\n", "    textfiles = []\n", "    for filename in listdir(directory):\n", "        if filename.endswith(\".txt\"):\n", "            textfiles.append(directory + \"/\" + filename)\n", "    return textfiles" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function `listdir` takes as argument the name of a directory and lists all filenames in that directory. We iterate over this list and append each filename that ends with the extension `.txt` to a new list, `textfiles`. Using the `list_textfiles` function, the following code reads all text files in the directory `data/gutenberg/training` and outputs the length (in characters) of each:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for filepath in list_textfiles(\"data/gutenberg/training\"):\n", "    text = read_file(filepath)\n", "    print(filepath + \" has \" + str(len(text)) + \" characters.\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Sentence tokenization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous chapter we wrote a function to tokenize or split a text string into a list of words. However, using this function we lose information about where sentences start and end in the text. We will develop a function `split_sentences` that performs some very simple sentence splitting when passed a text string. Each sentence will be represented as a new string, so the function as a whole returns a list of sentence strings. We assume that any occurrence of either `.` or `!` or `?` marks the end of a sentence. In reality, this is of course more ambiguous. Consider, for example, the use of the period both as an end-of-sentence marker and in abbreviations and initials!\n", "\n", "How should we tackle this problem? Have a look at the following picture:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![caption](files/images/indexing.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first sentence *Hello there!* spans from index 0 to index 11. The second sentence spans from index 13 to 26. If we come up with a way to extract those indexes, we could slice the text into separate sentences. First we define a utility function `end_of_sentence_marker` that takes as argument a character and returns `True` if it is an end-of-sentence marker, otherwise it returns `False`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write the function `end_of_sentence_marker` below:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def end_of_sentence_marker(character):\n", "    # insert your code here\n", "\n", "# these tests should return True if your code is correct\n", "print(end_of_sentence_marker(\"?\") == True)\n", "print(end_of_sentence_marker(\"a\") == False)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An important function we will use is the built-in `enumerate`, which takes as argument any iterable (a string, a list, etc.).
Let's see it in action:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for element in enumerate(\"Python\"):\n", "    print(element)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, `enumerate` allows you to iterate over an iterable and, for each element in that iterable, gives you its corresponding index. A slightly more convenient way of using `enumerate` is the following:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for index, character in enumerate(\"Python\"):\n", "    print(index, character)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This way, we have easy access to both the index and the original item in the iterable. Now we know enough to write our `split_sentences` function. We will walk you through it, step by step, but first try to read the function and think about what it possibly does at each step:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def split_sentences(text):\n", "    \"Split a text string into a list of sentences.\"\n", "    sentences = []\n", "    start = 0\n", "    for end, character in enumerate(text):\n", "        if end_of_sentence_marker(character):\n", "            sentence = text[start: end + 1]\n", "            sentences.append(sentence)\n", "            start = end + 1\n", "    return sentences" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function `split_sentences` takes as argument a text represented by a simple string. Within the function we define a variable `sentences` in which we will store the individual sentences. We need to extract both the start position and the end position of each sentence. We know that the first sentence will always start at position 0. Therefore we define a variable `start` and set it to zero.\n", "\n", "Next we use `enumerate` to loop over all individual characters in the text. Remember that `enumerate` returns pairs of indexes and their corresponding elements (here characters). For each character we check whether it is an end-of-sentence marker. If it is, the variable `end` marks the position in `text` where a sentence ends. We can now slice the text from the starting position to the end position and obtain our sentence. Notice that we add 1 to the end position. Why would that be? This is because, as you might remember from the first chapter, slices are non-inclusive, so `text[start:end]` would return the text starting at `start` and ending one position before `end`. Since we have reached the end of a sentence, we know that the next sentence will start at least one position later than our last end point. Therefore, we update the `start` variable to `end + 1`. Let's check whether our function works as promised:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(split_sentences(\"This is a sentence. Should we separate it from this one?\"))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It does! (Note that the second sentence still starts with a space; that space will disappear once we split each sentence into words.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To conclude this section, you will write a wrapper function `tokenize` that takes as input a text represented by a string and tokenizes this string into sentences.
After that, we clean each sentence by lowercasing all words and removing punctuation. The final step is to tokenize each sentence into a list of words. The file `preprocessing.py` contains a function called `clean_text` which removes all punctuation from a text and converts all characters to lowercase. We import that function using the following line:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pyhum.preprocessing import clean_text" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def tokenize(text):\n", "    \"\"\"Transform TEXT into a list of sentences. Lowercase\n", "    each sentence and remove all punctuation. Finally split each\n", "    sentence into a list of words.\"\"\"\n", "    # insert your code here\n", "\n", "# these tests should return True if your code is correct\n", "print(tokenize(\"This is a sentence. So, what!\") == \n", "      [[\"this\", \"is\", \"a\", \"sentence\"], [\"so\", \"what\"]])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "General Text Statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> When the next night came, Dinarazad said to her sister Shahrazad: \u2018In God\u2019s name, sister, if you are not asleep, then tell us one of your stories!\u2019 Shahrazad answered: \u2018With great pleasure! I have heard tell, honoured King, that\u2026\u2019" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Alf Laylah Wa Laylah*, *the Stories of One Thousand and One Nights*, is a collection of folk tales collected over many centuries by various authors, translators, and scholars across West, Central and South Asia and North Africa. It forms a huge narrative wheel with an overarching plot, created by the frame story of Shahrazad.\n", "\n", "The stories begin with the tale of King Shahryar and his brother, who, having both been deceived by their respective Sultanas, leave their kingdom, only to return when they have found someone who \u2014 in their view \u2014 was wronged even more. On their journey the two brothers encounter a huge jinn who carries a glass box containing a beautiful young woman. The two brothers hide as quickly as they can in a tree. The jinn lays his head on the girl\u2019s lap and as soon as he is asleep, the girl demands that the two kings make love to her, or else she will wake her \u2018husband\u2019. They reluctantly give in, and the brothers soon discover that the girl has already betrayed the jinn ninety-eight times before. This exemplar of lust and treachery strengthens the Sultan\u2019s opinion that all women are wicked and not to be trusted.\n", "\n", "When King Shahryar returns home, his wrath against women has grown to an unprecedented level. To temper his anger, each night the king sleeps with a virgin only to execute her the next morning. In order to put an end to this cruelty and save womanhood from a \"virgin scarcity\", Shahrazad offers herself as the king\u2019s next bride. On the first night, Shahrazad begins to tell the king a story, but she does not end it. The king\u2019s curiosity to know how the story ends prevents him from executing Shahrazad. The next night Shahrazad finishes her story, and begins a new one. The king, eager to know the ending of this tale as well, postpones her execution once more.
Using this strategy for one thousand and one nights, weaving a labyrinth of stories-within-stories-within-stories, Shahrazad attempts to gradually move the king\u2019s cynical stance against women towards a politics of love and justice (see Marina Warner\u2019s *Stranger Magic* (2013) in case you're interested).\n", "\n", "The first European version of the Nights was translated into French by Antoine Galland. Many translations (in different languages) followed, such as the (heavily criticized) English translation by Sir Richard Francis Burton entitled *The Book of the Thousand Nights and a Night* (1885). This version is freely available from the Gutenberg project (see [here](http://www.gutenberg.org)), and will be the one we explore here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the directory `data/arabian_nights` you will find 999 files. This is because in Burton's translation some nights are missing. The name of each file represents the corresponding night of storytelling in *Alf Laylah Wa Laylah*. Go have a look. Use the `tokenize` function and the corpus-reading functions we defined above to read, tokenize, and clean each night. Store the result in the variable named `corpus`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# insert your code here" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great job! You should now have a corpus containing 999 texts. It is always important to check whether our code actually produces the desired results. Let's check whether we indeed have 999 texts:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(len(corpus))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, that seems to be correct. It would be convenient for further processing to have the corpus in chronological order. Let's have a look at the first 20 files returned by `list_textfiles`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "list_textfiles(\"data/arabian_nights\")[:20]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the files are sorted as strings, not by their numbers. To be able to sort the files by their numbers we must first remove the extension `.txt` as well as the directory `data/arabian_nights/`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1)** Write a function `remove_ext` that takes as argument a filename and returns that filename without its extension. Tip: use the function `splitext` from the `os.path` module. Look up the documentation [here](http://docs.python.org/3.4/library/os.path.html#os.path.splitext)."
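] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before you write `remove_ext`, it may help to see what `splitext` actually returns; a quick check (the expected output is shown as a comment):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from os.path import splitext\n", "\n", "# splitext splits a path into a (root, extension) pair\n", "print(splitext(\"data/arabian_nights/1.txt\"))  # ('data/arabian_nights/1', '.txt')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The extension is the second element of the pair, so the part you want to keep is the first."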
] }, { "cell_type": "code", "collapsed": false, "input": [ "from os.path import splitext\n", "\n", "def remove_ext(filename):\n", " # insert your code here\n", " \n", "# these tests should return True if your code is correct\n", "print(remove_ext(\"data/arabian_nights/1.txt\") == \"data/arabian_nights/1\")\n", "print(remove_ext(\"ridiculous_selfie.jpg\") == \"ridiculous_selfie\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2)** Write a function `remove_dir` that takes as argument a filepath and removes the directory from a filepath. Tip: use the function `basename` from the `os.path` module. Look up the document [here](http://docs.python.org/3.4/library/os.path.html#os.path.basename)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from os.path import basename\n", "\n", "def remove_dir(filepath):\n", " # insert your code here\n", " \n", "# these tests should return True if your code is correct\n", "print(remove_dir(\"data/arabian_nights/1.txt\") == \"1.txt\")\n", "print(remove_dir(\"/a/kind/of/funny/filepath/to/file.txt\") == \"file.txt\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3)** Combine the two functions `remove_ext` and `remove_dir` into one function `get_filename`. This function takes as argument a filepath and returns the name (without the extensions) of the file." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def get_filename(filepath):\n", " # insert your code here\n", " \n", "# these tests should return True if your code is correct\n", "print(get_filename(\"data/arabian_nights/1.txt\") == '1')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The final step is to convert numbers represented as string (e.g. \"1\" and \"10\") to a number. This can be achieved by using the function `int`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x_as_string = \"1\"\n", "x_as_int = int(x_as_string)\n", "print(x_as_int)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The process of converting a string into an integer, is called *type casting*. Strings are different types than integers. To see this, have a look at the following:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = \"1\"\n", "y = \"2\"\n", "print(x + y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "12? Yes, 12. This is because, as you might remember from the first chapter, we can use the `+` operator to concatenate two strings. If we apply the same operation to integers, as in:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = 1\n", "y = 2\n", "print(x + y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "we get the expected result of 3." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combine the functions `int` and `get_filename` into the function `get_night` to obtain the integer corresponding to a night." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "def get_night(filepath):\n", " # insert your code here\n", "\n", "# these tests should return True if your code is correct\n", "print(get_night(\"data/arabian_nights/1.txt\") == 1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, so now we can convert the filepaths to integers corresponding to the nights of storytelling. But how will we use that to sort the corpus in chronological order? In chapter 1 we briefly discussed how to sort your collection of good reads. In combination with our `get_night` function, we can use `sort` to obtain a nicely chronologically ordered list of stories. Prepare yourself for some real Python magic, because the following lines of code might be a little dazzling...\n", "\n", "First we list all files using `list_textfiles` and store it in the variable `filenames`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "filenames = list_textfiles('data/arabian_nights')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we call the function `.sort()` on this list and supply as keyword our function `get_night`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "filenames.sort(key=get_night)\n", "filenames[:20]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, we now have a perfectly chronologically ordered list of filenames. But how, **HOW!** did that work? As you might have guessed, the argument of `sort`: `key=get_night`, has something to do with all this magic. Without this argument, Python would just sort the filenames alphabeticaly:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "filenames = list_textfiles('data/arabian_nights')\n", "filenames.sort()\n", "print(filenames[:20])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, if we supply a function to `key`, Python will internally first apply that function to all items we want to sort. In our case this means Python converts all filepaths to integers. After that Python sorts the list. Then for each converted item it returns the corresponding item in the original list. (Technically this is not an accurate description, but it basically comes down to this.)\n", "\n", "If you still feel a little dizzy after all this, don't be afraid. Sometimes it is good enough to use a particular piece of code even if you don't completely understand it. We can now use these functions to reload the corpus, this time in chronological order:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "corpus = []\n", "filenames = list_textfiles(\"data/arabian_nights\")\n", "filenames.sort(key=get_night)\n", "for filename in filenames:\n", " text = read_file(filename)\n", " corpus.append(tokenize(text))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Exploratory data analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a first exploratory data analysis, we are going to compute for each night how many sentences it contains and how many words. 
It is quite easy to count the number of sentences per night, since each night is represented by a list of sentences." ] }, { "cell_type": "code", "collapsed": false, "input": [ "sentences_per_night = []\n", "for night in corpus:\n", "    sentences_per_night.append(len(night))\n", "print(sentences_per_night[:10])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the function `max` we can find out what the highest number of sentences is:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "max(sentences_per_night)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, if we would like to know what the lowest number of sentences is, we use the function `min`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "min(sentences_per_night)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function `sum` takes a list of numbers as input and returns the sum:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(sum([1, 3, 3, 4]))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use this function to compute the average number of sentences per night. Note that if you use Python 2.7, you will need to convert the result of `sum`, which will be an integer, to a `float` using `float(some_number)`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# if you use Python 3.x, both print statements will print\n", "# the same thing and you don't need to worry.\n", "number = 1\n", "print(number)\n", "number = float(number)\n", "print(number)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# insert your code here" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given our data structure, in which each night is a list of sentences that are themselves lists of words, it is a little trickier to count how many words each night contains. One possible way is the following:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "words_per_night = []\n", "for night in corpus:\n", "    n_words = 0\n", "    for sentence in night:\n", "        n_words += len(sentence)\n", "    words_per_night.append(n_words)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make sure you really understand these lines of code, as you will need them in the next quiz." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The suspense created by Shahrazad\u2019s story-telling skills is intriguing, especially the \u201ccliff-hanger\u201d at the end of each night, which she uses to avert her own execution (and possibly that of womanhood). Every night she tells the Sultan a story, only to stop at dawn; she picks up the thread the following night. But does it really take the whole night to tell a particular story?\n", "\n", "I am not aware of any exact figures for how many words people speak per minute. Averages seem to fluctuate between 100 and 200 words per minute. Narrators are advised to use approximately 150 words per minute in audiobooks.
I suspect that this number is a little lower for live storytelling and assume it lies around 130 words per minute (including pauses). Using this information, we can compute the time it takes to tell a particular story as follows:\n", "\n", "$$\\textrm{story time}(\\textrm{text}) = \\frac{\\textrm{number of words in text}}{\\textrm{number of words per minute}}$$\n", "\n", "For example, a night of 1,300 words would take 1300 / 130 = 10 minutes to tell." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1)** Write a function called `story_time` that takes as input a text. Given a speed of 130 words per minute, compute how long it takes to tell that text." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def story_time(text):\n", "    # insert your code here\n", "\n", "# these tests should return True if your code is correct\n", "print(story_time([[\"story\", \"story\"]]) * 130 == 2.0)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2)** Compute the `story_time` for each night in our corpus. Assign the result to the variable `story_time_per_night`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "story_time_per_night = []\n", "# insert your code here\n", "print(story_time_per_night[:10])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3)** Compute the average, minimum and maximum storytelling time." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# insert your code here" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Visualizing general statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have computed a range of general statistics for our corpus, it would be nice to visualize them. Python's plotting library *matplotlib* (see [here](http://matplotlib.org)) allows us to produce all kinds of graphs. We could, for example, plot for each night how many sentences it contains:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# the magic command below makes matplotlib render plots inside the notebook\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "\n", "plt.plot(sentences_per_night)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1)** Can you do the same for `words_per_night`?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# insert your code here" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2)** And can you do the same for `story_time_per_night`?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# insert your code here" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3)** In this final exercise we will put together everything we have learnt so far. We want you to write a function `positions_of` that returns for a given word all sentence positions in the *Arabian Nights* where that word occurs. We are not interested in the positions relative to a particular night, but in the positions relative to the corpus as a whole.
Use that function to find all occurrences of the name *Shahrazad* and store the corresponding indexes in the variable `positions_of_shahrazad`. Do the same thing for the name *Ali*. Store the result in `positions_of_ali`. Finally, find all occurrences of *Egypt* and store the indexes in `positions_of_egypt`. Tip: (1) remember that we lowercased the entire corpus! (2) remember that indexes start at 0." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def positions_of(word):\n", "    # insert your code here\n", "\n", "positions_of_shahrazad = positions_of(\"shahrazad\")\n", "positions_of_ali = positions_of(\"ali\")\n", "positions_of_egypt = positions_of(\"egypt\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If everything went well, the following lines of code should produce a nice dispersion plot of all sentence occurrences of Shahrazad, Ali and Egypt in the corpus." ] }, { "cell_type": "code", "collapsed": false, "input": [ "plt.figure(figsize=(20, 8))\n", "# the y positions below are: Egypt at 0, Shahrazad at 1, Ali at 2,\n", "# so the tick labels must be listed in that order\n", "names = [\"Egypt\", \"Shahrazad\", \"Ali\"]\n", "plt.plot(positions_of_shahrazad, [1]*len(positions_of_shahrazad), \"|\", markersize=100)\n", "plt.plot(positions_of_ali, [2]*len(positions_of_ali), \"|\", markersize=100)\n", "plt.plot(positions_of_egypt, [0]*len(positions_of_egypt), \"|\", markersize=100)\n", "plt.yticks(range(len(names)), names)\n", "_ = plt.ylim(-1, 3)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Then Shahrazad reached the morning, and fell silent in the telling of her tale\u2026" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ignore the following; it's just here to make the page pretty:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.core.display import HTML\n", "def css_styling():\n", "    styles = open(\"styles/custom.css\", \"r\").read()\n", "    return HTML(styles)\n", "css_styling()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "/*\n", "Placeholder for custom user CSS\n", "\n", "mainly to be overridden in profile/static/custom/custom.css\n", "\n", "This will always be an empty file in IPython\n", "*/\n", "\n", "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\"Creative
Python Programming for the Humanities by http://fbkarsdorp.github.io/python-course is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https://github.com/fbkarsdorp/python-course.

" ] } ], "metadata": {} } ] }