{ "metadata": { "name": "", "signature": "sha256:9277f1cecba1f31d42c3f073259c08ec9fdada90e3777ca8ad87b169b15b1d8e" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Chapter 5: Building NLP applications" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-- *A Python Course for the Humanities by Folgert Karsdorp and Maarten van Gompel*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Chapter is still in DRAFT stage

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the last chapter we made some tools to prepare corpora for further processing. To be able to tokenise a text is nice, but from a humanities perspective not very interesting. So, what are we going to do with it? In this chapter, you'll implement two major applications that build upon the tools you developed. The first will be a relatively simple program that scores each text in a corpus according to its *Automatic Readability Index*. In the second application we will build a system that can predict who wrote a certain text. Again, we'll need to cover a lot of ground and things are becoming increasingly difficult now. So, let's get started!" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Automatic Readability Index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *Automatic Readability Index* is a readability test designed to gauge the understandability of a text. The formula for calculating the *Automated Readability Index* is as follows:\n", "\n", "$$ 4.71 \\cdot \\frac{nchars}{nwords} + 0.5 \\cdot \\frac{nwords}{nsents} - 21.43 $$\n", "\n", "Let's apply some wishful thinking. If we had all the information needed to compute this formula, we could start with writing a function that does it for us. Let's do so." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--------" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write a function `automatic_readability_index` that takes three arguments `n_chars`, `n_words` and `n_sents` and returns the ARI given those arguments." ] }, { "cell_type": "code", "collapsed": false, "input": [ "def automatic_readability_index(n_chars, n_words, n_sents):\n", " # insert your code here\n", "\n", "# do not modify the code below, it is for testing your answer only!\n", "# it should output True if you did well\n", "print(abs(automatic_readability_index(300, 40, 10) - 15.895) < 0.001)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to write some code to obtain the numbers we so wishfully assumed to have. We will use the code we wrote in earlier chapters to read and tokenise texts. In the file `preprocessing.py` we defined a function `read_corpus` which reads all files with the extension `.txt` in the given directory. It tokenizes each text into a list of sentences, each of which is represented by a list of words. All words are lowercased and we remove all punctuation. We import the function using the following line of code:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from pyhum.preprocessing import read_corpus" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's write a function `extract_counts` that takes a list of sentences as input and returns the number of characters, the number of words and the number of sentences as a tuple." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----------" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write the function `extract_counts`." 
{ "cell_type": "markdown", "metadata": {}, "source": [ "---------" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to write some code to obtain the numbers we so wishfully assumed to have. We will use the code we wrote in earlier chapters to read and tokenise texts. In the file `preprocessing.py` we defined a function `read_corpus` which reads all files with the extension `.txt` in the given directory. It tokenises each text into a list of sentences, each of which is represented by a list of words. All words are lowercased and all punctuation is removed. We import the function using the following line of code:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "from pyhum.preprocessing import read_corpus" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's write a function `extract_counts` that takes a list of sentences as input and returns the number of characters, the number of words and the number of sentences as a tuple." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "----------" ] },
{ "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Write the function `extract_counts`." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "def extract_counts(sentences):\n", "    # insert your code here\n", "\n", "# do not modify the code below, for testing only!\n", "print(extract_counts(\n", "    [[\"this\", \"was\", \"rather\", \"easy\"], \n", "     [\"please\", \"give\", \"me\", \"something\", \"more\", \"challenging\"]]) == (53, 10, 2))" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---------" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Well done! We're almost there. We can use our two functions to compute the ARI for a given text as follows:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "sentences = [[\"this\", \"was\", \"rather\", \"easy\"], \n", "             [\"Please\", \"give\", \"me\", \"something\", \"more\", \"challenging\"]]\n", "\n", "n_chars, n_words, n_sents = extract_counts(sentences)\n", "print(automatic_readability_index(n_chars, n_words, n_sents))" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "However, it would be nice to have a little more abstraction." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---------" ] },
{ "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Write the function `compute_ARI` that takes as argument a list of sentences (represented by lists of words) and returns the *Automatic Readability Index* for that input." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "def compute_ARI(sentences):\n", "    # insert your code here\n", "\n", "# do not modify the code below, it is for testing your answer only!\n", "# it should output True if you did well\n", "print(abs(compute_ARI(sentences) - 6.033) < 0.001)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "--------------" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Finally, it would be nice to compare the readability of a number of texts. We need a function that iterates over the files in a directory and prints the Automatic Readability Index for each text." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---------" ] },
{ "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Write a function `compute_ARIs` that takes the name of a directory as input and prints the Automatic Readability Index for each document in that directory." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "def compute_ARIs(directory):\n", "    # insert your code here" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Remember that in Chapter 3, we plotted some basic statistics using the Python plotting library matplotlib. Can you do the same for all ARIs? One possible approach is sketched after the empty cell below." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import matplotlib.pyplot as plt\n", "\n", "# insert your code here" ], "language": "python", "metadata": {}, "outputs": [] },
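{ "cell_type": "markdown", "metadata": {}, "source": [ "In case you are stuck, here is one possible plotting sketch (a sketch, not the only solution). It assumes you have collected the scores in a dictionary `aris` mapping filenames to ARI values; the variable name and the numbers below are made up for this example:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import matplotlib.pyplot as plt\n", "\n", "# assumption: `aris` maps filenames to ARI scores (made-up numbers!)\n", "aris = {\"austen-emma.txt\": 10.2, \"milton-poetical.txt\": 14.8}\n", "\n", "plt.bar(range(len(aris)), list(aris.values()))\n", "plt.xticks(range(len(aris)), list(aris.keys()), rotation=90)\n", "plt.ylabel(\"Automatic Readability Index\")\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] },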
{ "cell_type": "markdown", "metadata": {}, "source": [ "----------" ] },
{ "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Authorship attribution" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In this section you will implement the core of an authorship attribution application. You won't build a full stand-alone application; rather, you will focus on the core functions for classifying new texts by their authors.\n", "\n", "The core of our application will be a naive Bayes classifier. Following good programming principles, we will try to make this classifier as generic as possible. This allows us to use the classifier in contexts other than authorship attribution, such as text classification and classification in general." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The naive Bayes classifier is a probabilistic classifier that, given a set of features, tries to find the class with the highest probability. It is based on applying Bayes' theorem and is called naive because of its strong independence assumption: the presence or absence of each feature is assumed to be independent of the presence or absence of every other feature, given the class. The posterior probability of a class is proportional to the prior probability of that class multiplied by the joint probability of all features given that class:\n", "\n", "$$ P(y|x_1,\\ldots,x_n) \\propto P(y) \\prod^n_{i=1} P(x_i|y)$$" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Classification is based on the *maximum a posteriori* or MAP decision rule, which simply picks the class (or author in our case) that is most probable:\n", "\n", "$$ classify(x_1, \\ldots, x_n) = \\arg\\max_y P(y) \\prod^n_{i=1} P(x_i|y) $$" ] },
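{ "cell_type": "markdown", "metadata": {}, "source": [ "To see the MAP rule in action, here is a toy example with two authors; all probabilities below are invented purely for illustration:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# toy MAP example: two authors with equal priors P(y) = 0.5\n", "# and invented word probabilities P(word|author)\n", "p_words = {\"A\": {\"the\": 0.5, \"book\": 0.1},\n", "           \"B\": {\"the\": 0.3, \"book\": 0.4}}\n", "text = [\"the\", \"book\"]\n", "\n", "posteriors = {}\n", "for author in p_words:\n", "    p = 0.5  # the prior P(y)\n", "    for word in text:\n", "        p = p * p_words[author][word]  # multiply in P(word|author)\n", "    posteriors[author] = p\n", "\n", "print(posteriors)  # {'A': 0.025, 'B': 0.06}: 'B' is most probable, so MAP picks 'B'" ], "language": "python", "metadata": {}, "outputs": [] },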
{ "cell_type": "markdown", "metadata": {}, "source": [ "The main function we will implement has a simple job: take a text as an argument and classify it as being written by one of the authors. Let's again apply some wishful thinking and implement the function as follows:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "def predict_author(text, feature_database):\n", "    \"Predict who wrote this text.\"\n", "    return classify(score(extract_features(text), feature_database))" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "This function takes two arguments: the text to classify and the training data against which we want to classify the text. The function is basically an abstraction layer on top of `classify`, `extract_features` and `score`. `classify` is a simple function that takes a dictionary of {$author_i$: $P(author_i|text)$} and returns the author that is most probable. Let's implement this function." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "-----------" ] },
{ "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Implement the function `classify`. It takes one argument, `scores`, which is a dictionary with authors as keys and the probability of each author as value. Return the author with maximum probability. (Tip: use the built-in function `max`, see [the documentation](http://docs.python.org/3/library/functions.html#max))" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "scores = {\"Hermans\": 0.15, \"Voskuil\": 0.55, \"Reve\": 0.2, \"Mulisch\": 0.18, \"Claus\": 0.02}\n", "\n", "def classify(scores):\n", "    # insert your code here\n", "\n", "print(classify(scores) == \"Voskuil\")" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---------" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The function `extract_features` is rather straightforward as well. We'll build this function on top of the functions we defined in the previous chapters. For the moment we will assume that our model is a bag-of-words (BOW) model in which the only features are individual words. We define `extract_features` as an abstraction layer on top of `read_corpus_file` and `tokenise` as follows:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "from pyhum.preprocessing import read_corpus_file, tokenise\n", "\n", "def extract_features(filename):\n", "    return tokenise(read_corpus_file(filename))" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now for our training data, we need to store for each word how often it occurs with a particular author. As our data structure we will use a nested dictionary with the structure `author -> word -> count`. We'll store the counts in the variable `feature_database`.\n", "\n", "(For more information about `defaultdict`, see [the defaultdict documentation](http://docs.python.org/3/library/collections.html#collections.defaultdict). For more information about `lambda` expressions, see [the lambda documentation](http://docs.python.org/3/tutorial/controlflow.html?highlight=lambda#lambda-expressions).)" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "from collections import defaultdict\n", "\n", "feature_database = defaultdict(lambda: defaultdict(int))" ], "language": "python", "metadata": {}, "outputs": [] },
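{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick demonstration of how such a nested `defaultdict` behaves: missing authors and words are created on the fly with a count of 0, so we can increment counts without checking for their existence first (the author and word below are just examples):" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "from collections import defaultdict\n", "\n", "counts = defaultdict(lambda: defaultdict(int))\n", "counts[\"austen\"][\"the\"] += 1\n", "counts[\"austen\"][\"the\"] += 1\n", "\n", "print(counts[\"austen\"][\"the\"])   # 2\n", "print(counts[\"austen\"][\"emma\"])  # 0: created on first access, no KeyError" ], "language": "python", "metadata": {}, "outputs": [] },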
{ "cell_type": "markdown", "metadata": {}, "source": [ "--------" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "To fill our `feature_database` we will need a couple of functions. First we need a function that returns the author of a particular text. To make things a little easier, we named our training files with the author's name followed by the title of the book, e.g. `austen-emma.txt`, or, when the path is part of the filename, `/path/to/austen-emma.txt`." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "-----" ] },
{ "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Write a function `extract_author` that takes a filename as input and returns the name of the author." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "def extract_author(filename):\n", "    # insert your code here\n", "\n", "# do not modify the code below, it is for testing your answer only!\n", "# it should output True if you did well\n", "print(extract_author(\"Austen-emma.txt\") == \"Austen\")\n", "print(extract_author(\"/path/to/Austen-emma.txt\") == \"Austen\")" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next we'll need a function, `update_counts`, that takes as arguments the name of an author and the words extracted using `extract_features`, and adds these to our `feature_database`. The function should return a new, updated version of the `feature_database`." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "from pyhum.preprocessing import tokenise\n", "\n", "def update_counts(author, text, feature_database):\n", "    # insert your code here\n", "    return feature_database\n", "\n", "# do not modify the code below, for testing only!\n", "feature_database = defaultdict(lambda: defaultdict(int))\n", "feature_database = update_counts(\"Anonymous\", \"This was written with a lack of inspiration\", \n", "                                 feature_database)\n", "test_database = defaultdict(lambda: defaultdict(int))\n", "for word in \"This was written with a lack of inspiration\".split():\n", "    test_database[\"Anonymous\"][word] += 1\n", "print(sorted(feature_database.items()) == sorted(test_database.items()))" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "-------" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we define a function `add_file_to_database` that takes a filename and the `feature_database` as input, extracts the author from the filename and adds the feature counts to the `feature_database`. We define it as follows:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "def add_file_to_database(filename, feature_database):\n", "    return update_counts(extract_author(filename), \n", "                         extract_features(filename), \n", "                         feature_database)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "----------" ] },
{ "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a function to add one file to the `feature_database`, we need a function that adds an entire corpus to the database. Write a function that takes the name of a directory as input and adds all files in this directory to the `feature_database`. Again, the function should return an updated version of our database." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import os\n", "\n", "def add_directory_to_database(directory, feature_database):\n", "    # insert your code here\n", "    return feature_database" ], "language": "python", "metadata": {}, "outputs": [] },
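{ "cell_type": "markdown", "metadata": {}, "source": [ "In case you need a starting point: this is one way to iterate over the `.txt` files in a directory. It is only a sketch; the directory name below is an example:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import os\n", "\n", "# list all .txt files in a directory (example path; adapt as needed)\n", "directory = \"data/gutenberg/training\"\n", "for filename in os.listdir(directory):\n", "    if filename.endswith(\".txt\"):\n", "        print(os.path.join(directory, filename))" ], "language": "python", "metadata": {}, "outputs": [] },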
{ "cell_type": "markdown", "metadata": {}, "source": [ "---------" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We now have a function to extract the features of a document, and we have defined a couple of functions to add those features to our `feature_database`. It is now time to implement the core function of our authorship attribution application.\n", "\n", "Before we implement the `score` function, we will first implement a function to compute the probability of one feature given an author. There are two things to note with regard to this function.\n", "\n", "First, if a given author and a word never occur together in the `feature_database`, the probability of that word given the author will be zero, and because the posterior is a product of feature probabilities, a single zero wipes out the probability of the entire class. Needless to say this is rather problematic. A common strategy to circumvent this problem is to add pseudocounts to the observed counts, normally 1. The pseudocounts need to be incorporated in both the numerator and the denominator.\n", "\n", "Second, the probability of one feature will normally be quite small. If we multiply all probabilities of our features given an author, we will get a very small number, possibly too small to be adequately represented by Python. Consider the code below:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "x = 0.00000000000000001\n", "for i in range(30):\n", "    x = x * 0.000000000000001\n", "    print(x)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "After fewer than 30 multiplications, the values become too small for Python to represent and underflow to zero. And as we know, multiplying by zero returns zero, so from that point on all our probabilities would be zero! We therefore take the log of the individual feature probabilities and sum them to obtain our final score; since $\\log(ab) = \\log(a) + \\log(b)$, summing logs is equivalent to multiplying probabilities, but it keeps the numbers in a range Python can represent. Let's implement the `log_probability` function first." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "from math import log\n", "\n", "def log_probability(feature_counts, features_sum, n_features):\n", "    return log((feature_counts + 1.0) / (features_sum + n_features))" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "`feature_counts` is the number of times the given feature occurs with a particular author. `features_sum` is the sum of all feature counts for that author. `n_features` is the number of unique features in our `feature_database`; it is added to the denominator because we add one pseudocount for every unique feature. Now that we have defined this crucial function, we are ready to implement our `score` function." ] },
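{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check, here is the calculation for one feature, using counts that also appear in the test of the `score` quiz below: author A has seen the word *the* twice (`feature_counts = 2`), eight feature tokens in total (`features_sum = 8`), and the database contains three unique features (`n_features = 3`):" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "from math import exp\n", "\n", "# log((2 + 1) / (8 + 3)) = log(3/11)\n", "print(log_probability(2, 8, 3))       # approximately -1.299\n", "\n", "# exponentiating recovers the smoothed probability 3/11\n", "print(exp(log_probability(2, 8, 3)))  # approximately 0.273\n", "\n", "# summing logs equals multiplying probabilities: 3/11 * 6/11\n", "print(exp(log_probability(2, 8, 3) + log_probability(5, 8, 3)))" ], "language": "python", "metadata": {}, "outputs": [] },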
{ "cell_type": "markdown", "metadata": {}, "source": [ "-----------" ] },
{ "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The function `score` takes as input a list of features and the `feature_database`. It should return a dictionary with authors as keys and their log probabilities given the list of features as values. We'll provide the basic frame for this function below and ask you to fill in the details. This is without doubt the most challenging Quiz! you have seen so far and we will be very impressed if you get it right." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "def score(features, feature_database):\n", "    \"Predict who wrote the document on the basis of the corpus.\"\n", "    scores = defaultdict(float)\n", "    # compute the number of features in the feature database here\n", "    for author in feature_database:\n", "        # compute the probability of features given that author here\n", "    return scores\n", "\n", "# do not modify the code below, for testing your answer only! \n", "# It should return True if you did well!\n", "features = [\"the\", \"a\", \"the\", \"be\", \"book\"]\n", "feature_database = defaultdict(lambda: defaultdict(int))\n", "feature_database[\"A\"][\"the\"] = 2\n", "feature_database[\"A\"][\"a\"] = 5\n", "feature_database[\"A\"][\"book\"] = 1\n", "feature_database[\"B\"][\"the\"] = 5\n", "feature_database[\"B\"][\"a\"] = 1\n", "feature_database[\"B\"][\"book\"] = 6\n", "print(abs(dict(score(features, feature_database))[\"A\"] - -7.30734) < 0.001)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---------" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Wow! You have really done a great job. Almost entirely by yourself, you have implemented a complete naive Bayes classifier that can be used for all kinds of classification problems, such as document classification and authorship attribution.\n", "\n", "Now we should put all the pieces together and test how well our system works. In the folder `data/gutenberg/training` we have provided a couple of training documents by different authors. The folder `data/gutenberg/testing` provides three test documents. Let's test our classifier on one of those documents!" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# first define the feature_database\n", "feature_database = defaultdict(lambda: defaultdict(int))\n", "feature_database = add_directory_to_database(\"data/gutenberg/training\", feature_database)\n", "print(predict_author(\"data/gutenberg/testing/milton-poetical.txt\", feature_database))" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "It would be nice to evaluate our classifier on more than one document and to obtain some sort of score of how well it performs. We will implement two functions: `test_from_corpus` and `analyze_results`." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---------" ] },
{ "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "`test_from_corpus` takes as input the name of a directory and a trained feature database. It then tries to predict the author of each file in the given directory. The function should return a list of `(ground-truth-author, predicted-author)` tuples." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "def test_from_corpus(directory, feature_database):\n", "    results = []\n", "    # insert your code here\n", "    return results" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---------" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we will implement the function `analyze_results`, which takes a list of `(ground-truth-author, predicted-author)` tuples as input and returns the accuracy of the classifier, defined as:\n", "\n", "$$ accuracy(X) = \\frac{\\textrm{number of correct predictions}}{\\textrm{total number of predictions}}$$" ] },
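{ "cell_type": "markdown", "metadata": {}, "source": [ "For example, if three out of five predictions are correct, the accuracy is $3/5 = 0.6$; that is exactly what the test in the quiz below checks." ] },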
{ "cell_type": "markdown", "metadata": {}, "source": [ "------" ] },
{ "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Quiz!" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Implement the function `analyze_results` and test your classifier on the test corpus in `data/gutenberg/testing`:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "def analyze_results(results):\n", "    # insert your code here\n", "\n", "# do not modify the code below, for testing only!\n", "print(analyze_results([(\"A\", \"A\"), (\"A\", \"B\"), (\"C\", \"C\"), (\"D\", \"C\"), (\"E\", \"E\")]) == 0.6)" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---------" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Ignore this, it's only here to make the page pretty:" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "from IPython.core.display import HTML\n", "def css_styling():\n", "    styles = open(\"styles/custom.css\", \"r\").read()\n", "    return HTML(styles)\n", "css_styling()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "/*\n", "Placeholder for custom user CSS\n", "\n", "mainly to be overridden in profile/static/custom/custom.css\n", "\n", "This will always be an empty file in IPython\n", "*/\n", "\n", "" ], "metadata": {}, "output_type": "pyout", "prompt_number": 1, "text": [ "" ] } ], "prompt_number": 1 },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Python Programming for the Humanities by http://fbkarsdorp.github.io/python-course is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https://github.com/fbkarsdorp/python-course.
" ] } ], "metadata": {} } ] }