{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "79d7e25a", "metadata": { "collapsed": false }, "outputs": [], "source": [ "import nltk\n", "from nltk.corpus import names\n", "from pylab import *\n", "import random as pyrandom" ] }, { "cell_type": "markdown", "id": "f41b10d3", "metadata": {}, "source": [ "# Text Classification" ] }, { "cell_type": "markdown", "id": "bac6d9ff", "metadata": {}, "source": [ "Let's start with a very simple text classification problem: guessing the gender of a name from the name itself.\n", "\n", "You can probably make a pretty good guess about the gender of names like \"Bilama\" or \"Telek\".\n", "\n", "To generalize to new names, we need to extract properties that also occur in previously unseen names. We call these properties _features_.\n", "\n", "In NLTK, features are represented as hash tables (Python dictionaries) that map feature names to values." ] }, { "cell_type": "code", "execution_count": 2, "id": "49de10bb", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'first_letter': 'P', 'last_letter': 'a', 'length': 5}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def gender_features(w):\n", "    return dict(last_letter=w[-1],\n", "                first_letter=w[0],\n", "                length=len(w))\n", "gender_features(\"Petra\")" ] }, { "cell_type": "markdown", "id": "f50dc584", "metadata": {}, "source": [ "For the training data, we read male and female names from the NLTK `names` corpus." 
] }, { "cell_type": "code", "execution_count": 3, "id": "f1f49fa8", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7944 7579\n" ] } ], "source": [ "male = [(name,'male') for name in names.words('male.txt')]\n", "female = [(name,'female') for name in names.words('female.txt')]\n", "nlist = male+female\n", "print len(nlist),len(set([x for x,y in nlist]))" ] }, { "cell_type": "code", "execution_count": 4, "id": "6533ca2e", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('Pier', 'female'),\n", " ('Laird', 'male'),\n", " ('Ramsay', 'male'),\n", " ('Wilburt', 'male'),\n", " ('Roy', 'male')]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyrandom.shuffle(nlist)\n", "nlist[:5]" ] }, { "cell_type": "markdown", "id": "c0bb1f9f", "metadata": {}, "source": [ "We extract features and then split the data into a training set and a test set." ] }, { "cell_type": "code", "execution_count": 5, "id": "4e1a878b", "metadata": { "collapsed": false }, "outputs": [], "source": [ "featuresets = [(gender_features(n),g) for n,g in nlist]\n", "training_set = featuresets[500:]\n", "test_set = featuresets[:500]" ] }, { "cell_type": "code", "execution_count": 6, "id": "9728a1db", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[({'first_letter': 'P', 'last_letter': 'r', 'length': 4}, 'female'),\n", " ({'first_letter': 'L', 'last_letter': 'd', 'length': 5}, 'male'),\n", " ({'first_letter': 'R', 'last_letter': 'y', 'length': 6}, 'male'),\n", " ({'first_letter': 'W', 'last_letter': 't', 'length': 7}, 'male'),\n", " ({'first_letter': 'R', 'last_letter': 'y', 'length': 3}, 'male')]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "featuresets[:5]" ] }, { "cell_type": "markdown", "id": "b21795ac", "metadata": {}, "source": [ "Once we have features and corresponding labels, we can train a 
classifier..." ] }, { "cell_type": "code", "execution_count": 7, "id": "53a40960", "metadata": { "collapsed": false }, "outputs": [], "source": [ "classifier = nltk.NaiveBayesClassifier.train(training_set)" ] }, { "cell_type": "markdown", "id": "b07273fa", "metadata": {}, "source": [ "... and evaluate its performance on the test set." ] }, { "cell_type": "code", "execution_count": 8, "id": "49d401c5", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.788" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.classify.accuracy(classifier,test_set)" ] }, { "cell_type": "markdown", "id": "7dcbdaaf", "metadata": {}, "source": [ "Classifiers can also tell us how informative each feature is." ] }, { "cell_type": "code", "execution_count": 9, "id": "43ed4306", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Most Informative Features\n", " last_letter = 'a' female : male = 35.9 : 1.0\n", " last_letter = 'k' male : female = 32.7 : 1.0\n", " last_letter = 'f' male : female = 15.9 : 1.0\n", " last_letter = 'p' male : female = 12.6 : 1.0\n", " last_letter = 'v' male : female = 10.5 : 1.0\n" ] } ], "source": [ "classifier.show_most_informative_features(5)" ] }, { "cell_type": "markdown", "id": "dbb5c30c", "metadata": {}, "source": [ "Naive Bayesian classifiers assume a very simple statistical model of the posterior probability $P(c|x)$ for input features $x = (x_1,...,x_n)$:\n", "\n", "- We assume that the features are generated independently given the class: $P(x|c) = \\prod_i P(x_i|c)$\n", "- We use Bayes' formula to turn that likelihood into a posterior probability: $P(c|x) \\propto P(c) \\prod_i P(x_i|c)$\n", "\n", "Here, the different $P(x_i|c)$ are modeled via empirical distributions; that is, we count\n", "how often feature $x_i$ takes a given value among the training examples of class $c$." 
] }, { "cell_type": "markdown", "id": "0cf9c813", "metadata": {}, "source": [ "# Another Classifier" ] }, { "cell_type": "markdown", "id": "c013258e", "metadata": {}, "source": [ "There are many different classifiers available in NLTK; they give different performance on different\n", "tasks. There is no single best classifier, so you need a bit of experimentation." ] }, { "cell_type": "code", "execution_count": 10, "id": "40168516", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " ==> Training (10 iterations)\n", "\n", " Iteration Log Likelihood Accuracy\n", " ---------------------------------------\n", " 1 -0.69315 0.370\n", " 2 -0.47845 0.733\n", " 3 -0.42183 0.774\n", " 4 -0.39399 0.780\n", " 5 -0.37861 0.783\n", " 6 -0.36935 0.783\n", " 7 -0.36341 0.782\n", " 8 -0.35944 0.782\n", " 9 -0.35671 0.781\n", " Final -0.35477 0.781\n" ] }, { "data": { "text/plain": [ "0.798" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "classifier = nltk.MaxentClassifier.train(training_set,algorithm=\"IIS\",max_iter=10)\n", "nltk.classify.accuracy(classifier,test_set)" ] }, { "cell_type": "markdown", "id": "ca880319", "metadata": {}, "source": [ "The maximum entropy classifier can be derived in different ways. In practice, it amounts to the same classifier as logistic regression:\n", "\n", "$P(c|x) = \\sigma(w \\cdot x)$\n", "\n", "where $\\sigma(x) = \\frac{1}{1+e^{-x}}$\n", "\n", "The term \"MaxEnt\" is frequently used in NLP. 
It refers to the fact that we are thinking of the problem as follows:\n", "\n", "- assume that we have a set of binary feature functions of documents $x_i(d)$\n", "- each binary feature function is true or false\n", "- for each binary feature function, we have a posterior $P(c|x_i)$\n", "- we want to find an overall posterior probability $P(c|d)$ that is...\n", "    - consistent with the individual posteriors\n", "    - otherwise a \"maximum entropy distribution\"" ] }, { "cell_type": "markdown", "id": "b34784ac", "metadata": {}, "source": [ "# \"Traditional\" Classifier" ] }, { "cell_type": "markdown", "id": "22d07c7f", "metadata": {}, "source": [ "The NLTK classifiers take features in the form of hash tables; this is convenient for NLP tasks, but somewhat inefficient.\n", "\n", "Classifiers in other machine learning libraries tend to take input data in a different format.\n", "\n", "A common format is two matrices, one for inputs (each row representing an input vector), and one for outputs (containing integer classes or indicator functions)." ] }, { "cell_type": "code", "execution_count": 11, "id": "1b8af837", "metadata": { "collapsed": false }, "outputs": [], "source": [ "xs = zeros((len(training_set),26))\n", "ys = zeros(len(training_set))" ] }, { "cell_type": "markdown", "id": "17d66d68", "metadata": {}, "source": [ "To encode the inputs, we use a \"unary code\" (also called one-hot encoding): each possible last letter gets its own column, set to 1 for that letter and 0 otherwise." ] }, { "cell_type": "code", "execution_count": 12, "id": "01de5756", "metadata": { "collapsed": false }, "outputs": [], "source": [ "for i,(f,c) in enumerate(training_set):\n", "    ll = f[\"last_letter\"].lower()\n", "    if ll==\" \" : continue\n", "    xs[i,ord(ll)-ord(\"a\")] = 1\n", "    if c==\"female\": ys[i] = 1" ] }, { "cell_type": "markdown", "id": "2f9788d0", "metadata": {}, "source": [ "sklearn's *LogisticRegression* fits the same kind of model as *MaxentClassifier*, but the implementation is much faster." 
] }, { "cell_type": "code", "execution_count": 13, "id": "99b8f82d", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, penalty='l2', tol=0.0001)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "lr = LogisticRegression()\n", "lr.fit(xs,ys)" ] }, { "cell_type": "code", "execution_count": 14, "id": "6ec18dbe", "metadata": { "collapsed": false }, "outputs": [], "source": [ "xs = zeros((len(test_set),26))\n", "ys = zeros(len(test_set))\n", "for i,(f,c) in enumerate(test_set):\n", "    ll = f[\"last_letter\"].lower()\n", "    if ll==\" \" : continue\n", "    xs[i,ord(ll)-ord(\"a\")] = 1\n", "    if c==\"female\": ys[i] = 1" ] }, { "cell_type": "code", "execution_count": 15, "id": "c6049f02", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.754" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1.0-sum(lr.predict(xs)!=ys)*1.0/len(ys)" ] }, { "cell_type": "markdown", "id": "3ea38b79", "metadata": {}, "source": [ "# Bigger Feature Set" ] }, { "cell_type": "markdown", "id": "00496fa9", "metadata": {}, "source": [ "There is no single right feature set; the same feature set can work well for one classifier and poorly for another." 
] }, { "cell_type": "code", "execution_count": 16, "id": "e6bdd4ff", "metadata": { "collapsed": false }, "outputs": [], "source": [ "def more_features(w):\n", "    features = {}\n", "    features[\"first\"] = w[0].lower()\n", "    features[\"last\"] = w[-1].lower()\n", "    features[\"last2\"] = w[-2:].lower()\n", "    # per-letter count and presence, computed on the name being featurized\n", "    for c in [chr(i) for i in range(ord(\"a\"),ord(\"z\")+1)]:\n", "        features[\"nr_\"+c] = w.lower().count(c)\n", "        features[\"has_\"+c] = (c in w.lower())\n", "    return features" ] }, { "cell_type": "code", "execution_count": 17, "id": "3298aa6a", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.786" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "featuresets = [(more_features(n),g) for n,g in nlist]\n", "training_set = featuresets[500:]\n", "test_set = featuresets[:500]\n", "classifier = nltk.NaiveBayesClassifier.train(training_set)\n", "nltk.classify.accuracy(classifier,test_set)" ] }, { "cell_type": "markdown", "id": "1d073119", "metadata": {}, "source": [ "Usually, you should split the data into three sets:\n", "\n", "- the training set\n", "- the feature evaluation set\n", "- the test set\n", "\n", "If you select features by repeatedly checking them against the test set, you risk getting a good test result by accident, one that doesn't generalize.\n", "\n", "Other approaches are resampling methods and cross-validation." ] }, { "cell_type": "markdown", "id": "22b7e6e9", "metadata": {}, "source": [ "What are some of the tradeoffs in choosing a feature set?" ] }, { "cell_type": "markdown", "id": "942bdf3c", "metadata": {}, "source": [ "# Decision Trees" ] }, { "cell_type": "markdown", "id": "81e9ad87", "metadata": {}, "source": [ "Decision trees are another common classifier." 
] }, { "cell_type": "code", "execution_count": 18, "id": "a2dba1f1", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'fl': 'p', 'l': 5, 'll': 'a'}" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def simple_features(w):\n", " return {'fl':w[0].lower(),'ll': w[-1].lower(),'l':len(w)}\n", "simple_features(\"Petra\")" ] }, { "cell_type": "code", "execution_count": 19, "id": "677d67d0", "metadata": { "collapsed": false }, "outputs": [], "source": [ "featuresets = [(simple_features(n),g) for n,g in nlist]\n", "training_set = featuresets[500:]\n", "test_set = featuresets[:500]" ] }, { "cell_type": "code", "execution_count": 20, "id": "aaa44045", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.77" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "classifier = nltk.DecisionTreeClassifier.train(training_set,depth_cutoff=2)\n", "nltk.classify.accuracy(classifier,test_set)" ] }, { "cell_type": "code", "execution_count": 21, "id": "740c86a0", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "if ll == 'a': return 'female'\n", "if ll == 'b': \n", " if fl == 'a': return 'male'\n", " if fl == 'b': return 'female'\n", " if fl == 'c': return 'male'\n", " if fl == 'd': return 'female'\n", " if fl == 'h': return 'male'\n", " if fl == 'j': return 'male'\n", " if fl == 'k': return 'male'\n", " if fl == 'l': return 'female'\n", " if fl == 'm': return 'female'\n", " if fl == 'r': return 'male'\n", " if fl == 't': return 'male'\n", " if fl == 'w': return 'male'\n", " if fl == 'z': return 'male'\n", "if ll == 'c': return 'male'\n", "if ll == 'd': return 'male'\n", "if ll == 'e': \n", " if fl == 'a': return 'female'\n", " if fl == 'b': return 'female'\n", " if fl == 'c': return 'female'\n", " if fl == 'd': return 'female'\n", " if fl == 'e': return 'female'\n", " if fl == 'f': return 
'female'\n", " if fl == 'g': return 'female'\n", " if fl == 'h': return 'female'\n", " if fl == 'i': return 'female'\n", " if fl == 'j': return 'female'\n", " if fl == 'k': return 'female'\n", " if fl == 'l': return 'female'\n", " if fl == 'm': return 'female'\n", " if fl == 'n': return 'female'\n", " if fl == 'o': return 'female'\n", " if fl == 'p': return 'female'\n", " if fl == 'q': return 'female'\n", " if fl == 'r': return 'female'\n", " if fl == 's': return 'female'\n", " if fl == 't': return 'male'\n", " if fl == 'u': return 'female'\n", " if fl == 'v': return 'female'\n", " if fl == 'w': return 'male'\n", " if fl == 'y': return 'female'\n", " if fl == 'z': return 'male'\n", "if ll == 'f': \n", " if fl == 'a': return 'male'\n", " if fl == 'b': return 'male'\n", " if fl == 'c': return 'male'\n", " if fl == 'g': return 'male'\n", " if fl == 'h': return 'male'\n", " if fl == 'j': return 'male'\n", " if fl == 'o': return 'male'\n", " if fl == 'p': return 'male'\n", " if fl == 'r': return 'male'\n", " if fl == 's': return 'male'\n", " if fl == 't': return 'female'\n", " if fl == 'w': return 'male'\n", "if ll == 'g': \n", " if fl == 'a': return 'male'\n", " if fl == 'b': return 'male'\n", " if fl == 'c': return 'male'\n", " if fl == 'd': return 'male'\n", " if fl == 'f': return 'male'\n", " if fl == 'g': return 'male'\n", " if fl == 'h': return 'female'\n", " if fl == 'i': return 'female'\n", " if fl == 'k': return 'male'\n", " if fl == 'l': return 'male'\n", " if fl == 'm': return 'female'\n", " if fl == 'o': return 'male'\n", " if fl == 'p': return 'female'\n", " if fl == 'r': return 'male'\n", " if fl == 's': return 'male'\n", " if fl == 'w': return 'male'\n", " if fl == 'z': return 'male'\n", "if ll == 'h': \n", " if fl == 'a': return 'female'\n", " if fl == 'b': return 'female'\n", " if fl == 'c': return 'male'\n", " if fl == 'd': return 'female'\n", " if fl == 'e': return 'female'\n", " if fl == 'f': return 'female'\n", " if fl == 'g': return 'female'\n", " 
if fl == 'h': return 'male'\n", " if fl == 'i': return 'male'\n", " if fl == 'j': return 'male'\n", " if fl == 'k': return 'female'\n", " if fl == 'l': return 'female'\n", " if fl == 'm': return 'female'\n", " if fl == 'n': return 'female'\n", " if fl == 'o': return 'male'\n", " if fl == 'p': return 'female'\n", " if fl == 'r': return 'male'\n", " if fl == 's': return 'female'\n", " if fl == 't': return 'female'\n", " if fl == 'u': return 'male'\n", " if fl == 'w': return 'male'\n", " if fl == 'z': return 'male'\n", "if ll == 'i': \n", " if fl == 'a': return 'female'\n", " if fl == 'b': return 'female'\n", " if fl == 'c': return 'female'\n", " if fl == 'd': return 'female'\n", " if fl == 'e': return 'female'\n", " if fl == 'f': return 'female'\n", " if fl == 'g': return 'female'\n", " if fl == 'h': return 'female'\n", " if fl == 'i': return 'female'\n", " if fl == 'j': return 'female'\n", " if fl == 'k': return 'female'\n", " if fl == 'l': return 'female'\n", " if fl == 'm': return 'female'\n", " if fl == 'n': return 'female'\n", " if fl == 'p': return 'female'\n", " if fl == 'r': return 'female'\n", " if fl == 's': return 'female'\n", " if fl == 't': return 'female'\n", " if fl == 'u': return 'male'\n", " if fl == 'v': return 'female'\n", " if fl == 'w': return 'female'\n", " if fl == 'y': return 'male'\n", "if ll == 'j': \n", " if fl == 'a': return 'male'\n", " if fl == 'm': return 'female'\n", " if fl == 'r': return 'male'\n", "if ll == 'k': \n", " if fl == 'a': return 'male'\n", " if fl == 'b': return 'male'\n", " if fl == 'c': return 'male'\n", " if fl == 'd': return 'male'\n", " if fl == 'e': return 'male'\n", " if fl == 'f': return 'male'\n", " if fl == 'h': return 'male'\n", " if fl == 'i': return 'male'\n", " if fl == 'j': return 'male'\n", " if fl == 'k': return 'male'\n", " if fl == 'l': return 'female'\n", " if fl == 'm': return 'male'\n", " if fl == 'n': return 'male'\n", " if fl == 'p': return 'male'\n", " if fl == 'r': return 'male'\n", " if fl == 
's': return 'male'\n", " if fl == 't': return 'male'\n", " if fl == 'u': return 'male'\n", " if fl == 'v': return 'male'\n", " if fl == 'w': return 'male'\n", " if fl == 'z': return 'male'\n", "if ll == 'l': \n", " if fl == 'a': return 'female'\n", " if fl == 'b': return 'female'\n", " if fl == 'c': return 'female'\n", " if fl == 'd': return 'male'\n", " if fl == 'e': return 'male'\n", " if fl == 'f': return 'male'\n", " if fl == 'g': return 'female'\n", " if fl == 'h': return 'male'\n", " if fl == 'i': return 'female'\n", " if fl == 'j': return 'female'\n", " if fl == 'k': return 'female'\n", " if fl == 'l': return 'female'\n", " if fl == 'm': return 'male'\n", " if fl == 'n': return 'male'\n", " if fl == 'o': return 'male'\n", " if fl == 'p': return 'male'\n", " if fl == 'q': return 'male'\n", " if fl == 'r': return 'male'\n", " if fl == 's': return 'female'\n", " if fl == 't': return 'male'\n", " if fl == 'u': return 'male'\n", " if fl == 'v': return 'male'\n", " if fl == 'w': return 'male'\n", " if fl == 'y': return 'male'\n", "if ll == 'm': \n", " if fl == 'a': return 'male'\n", " if fl == 'b': return 'male'\n", " if fl == 'c': return 'male'\n", " if fl == 'd': return 'male'\n", " if fl == 'e': return 'male'\n", " if fl == 'f': return 'male'\n", " if fl == 'g': return 'male'\n", " if fl == 'h': return 'male'\n", " if fl == 'i': return 'male'\n", " if fl == 'j': return 'male'\n", " if fl == 'k': return 'male'\n", " if fl == 'l': return 'male'\n", " if fl == 'm': return 'female'\n", " if fl == 'n': return 'male'\n", " if fl == 'p': return 'female'\n", " if fl == 'r': return 'male'\n", " if fl == 's': return 'male'\n", " if fl == 't': return 'male'\n", " if fl == 'v': return 'male'\n", " if fl == 'w': return 'male'\n", "if ll == 'n': \n", " if fl == 'a': return 'male'\n", " if fl == 'b': return 'male'\n", " if fl == 'c': return 'female'\n", " if fl == 'd': return 'male'\n", " if fl == 'e': return 'female'\n", " if fl == 'f': return 'male'\n", " if fl == 'g': 
return 'male'\n", " if fl == 'h': return 'male'\n", " if fl == 'i': return 'male'\n", " if fl == 'j': return 'female'\n", " if fl == 'k': return 'female'\n", " if fl == 'l': return 'male'\n", " if fl == 'm': return 'female'\n", " if fl == 'n': return 'male'\n", " if fl == 'o': return 'male'\n", " if fl == 'p': return 'male'\n", " if fl == 'q': return 'male'\n", " if fl == 'r': return 'female'\n", " if fl == 's': return 'male'\n", " if fl == 't': return 'male'\n", " if fl == 'u': return 'male'\n", " if fl == 'v': return 'male'\n", " if fl == 'w': return 'male'\n", " if fl == 'y': return 'male'\n", " if fl == 'z': return 'male'\n", "if ll == 'o': \n", " if fl == 'a': return 'male'\n", " if fl == 'b': return 'male'\n", " if fl == 'c': return 'female'\n", " if fl == 'd': return 'male'\n", " if fl == 'e': return 'male'\n", " if fl == 'f': return 'male'\n", " if fl == 'g': return 'male'\n", " if fl == 'h': return 'male'\n", " if fl == 'i': return 'male'\n", " if fl == 'j': return 'male'\n", " if fl == 'k': return 'female'\n", " if fl == 'l': return 'male'\n", " if fl == 'm': return 'male'\n", " if fl == 'n': return 'male'\n", " if fl == 'o': return 'male'\n", " if fl == 'p': return 'male'\n", " if fl == 'r': return 'male'\n", " if fl == 's': return 'male'\n", " if fl == 't': return 'male'\n", " if fl == 'u': return 'male'\n", " if fl == 'v': return 'male'\n", " if fl == 'w': return 'male'\n", " if fl == 'y': return 'female'\n", " if fl == 'z': return 'male'\n", "if ll == 'p': \n", " if fl == 'b': return 'male'\n", " if fl == 'c': return 'male'\n", " if fl == 'f': return 'male'\n", " if fl == 'g': return 'male'\n", " if fl == 'k': return 'female'\n", " if fl == 'n': return 'male'\n", " if fl == 'p': return 'male'\n", " if fl == 's': return 'male'\n", " if fl == 't': return 'male'\n", " if fl == 'w': return 'male'\n", "if ll == 'r': \n", " if fl == 'a': return 'male'\n", " if fl == 'b': return 'male'\n", " if fl == 'c': return 'male'\n", " if fl == 'd': return 'female'\n", 
" if fl == 'e': return 'female'\n", " if fl == 'f': return 'male'\n", " if fl == 'g': return 'male'\n", " if fl == 'h': return 'female'\n", " if fl == 'i': return 'male'\n", " if fl == 'j': return 'female'\n", " if fl == 'k': return 'male'\n", " if fl == 'l': return 'male'\n", " if fl == 'm': return 'male'\n", " if fl == 'n': return 'male'\n", " if fl == 'o': return 'male'\n", " if fl == 'p': return 'male'\n", " if fl == 'r': return 'male'\n", " if fl == 's': return 'male'\n", " if fl == 't': return 'male'\n", " if fl == 'v': return 'male'\n", " if fl == 'w': return 'male'\n", " if fl == 'x': return 'male'\n", "if ll == 's': \n", " if fl == 'a': return 'male'\n", " if fl == 'b': return 'male'\n", " if fl == 'c': return 'male'\n", " if fl == 'd': return 'male'\n", " if fl == 'e': return 'male'\n", " if fl == 'f': return 'male'\n", " if fl == 'g': return 'female'\n", " if fl == 'h': return 'male'\n", " if fl == 'i': return 'female'\n", " if fl == 'j': return 'male'\n", " if fl == 'k': return 'male'\n", " if fl == 'l': return 'male'\n", " if fl == 'm': return 'male'\n", " if fl == 'n': return 'male'\n", " if fl == 'o': return 'male'\n", " if fl == 'p': return 'female'\n", " if fl == 'q': return 'male'\n", " if fl == 'r': return 'male'\n", " if fl == 's': return 'male'\n", " if fl == 't': return 'male'\n", " if fl == 'u': return 'male'\n", " if fl == 'v': return 'female'\n", " if fl == 'w': return 'male'\n", " if fl == 'x': return 'male'\n", " if fl == 'y': return 'male'\n", " if fl == 'z': return 'male'\n", "if ll == 't': \n", " if fl == 'a': return 'male'\n", " if fl == 'b': return 'male'\n", " if fl == 'c': return 'male'\n", " if fl == 'd': return 'male'\n", " if fl == 'e': return 'male'\n", " if fl == 'f': return 'male'\n", " if fl == 'g': return 'male'\n", " if fl == 'h': return 'male'\n", " if fl == 'i': return 'male'\n", " if fl == 'j': return 'female'\n", " if fl == 'k': return 'male'\n", " if fl == 'l': return 'female'\n", " if fl == 'm': return 'female'\n", " 
if fl == 'n': return 'female'\n", " if fl == 'o': return 'male'\n", " if fl == 'p': return 'male'\n", " if fl == 'q': return 'male'\n", " if fl == 'r': return 'male'\n", " if fl == 's': return 'male'\n", " if fl == 't': return 'male'\n", " if fl == 'v': return 'female'\n", " if fl == 'w': return 'male'\n", " if fl == 'y': return 'female'\n", "if ll == 'u': \n", " if fl == 'b': return 'female'\n", " if fl == 'd': return 'male'\n", " if fl == 'e': return 'male'\n", " if fl == 'i': return 'female'\n", " if fl == 'j': return 'male'\n", " if fl == 'l': return 'female'\n", " if fl == 'm': return 'male'\n", " if fl == 'p': return 'female'\n", " if fl == 'v': return 'male'\n", "if ll == 'v': \n", " if fl == 'b': return 'female'\n", " if fl == 'c': return 'male'\n", " if fl == 'd': return 'male'\n", " if fl == 'e': return 'male'\n", " if fl == 'g': return 'male'\n", " if fl == 'h': return 'male'\n", " if fl == 'l': return 'male'\n", " if fl == 'm': return 'male'\n", " if fl == 'n': return 'male'\n", " if fl == 'r': return 'male'\n", " if fl == 't': return 'male'\n", " if fl == 'v': return 'female'\n", " if fl == 'y': return 'male'\n", "if ll == 'w': \n", " if fl == 'a': return 'male'\n", " if fl == 'b': return 'male'\n", " if fl == 'd': return 'female'\n", " if fl == 'h': return 'male'\n", " if fl == 'l': return 'male'\n", " if fl == 'm': return 'male'\n", " if fl == 'r': return 'female'\n", " if fl == 's': return 'male'\n", " if fl == 'w': return 'female'\n", "if ll == 'x': \n", " if fl == 'a': return 'female'\n", " if fl == 'b': return 'female'\n", " if fl == 'd': return 'female'\n", " if fl == 'f': return 'male'\n", " if fl == 'k': return 'male'\n", " if fl == 'l': return 'male'\n", " if fl == 'm': return 'female'\n", " if fl == 'r': return 'male'\n", " if fl == 's': return 'female'\n", " if fl == 't': return 'female'\n", "if ll == 'y': \n", " if fl == 'a': return 'female'\n", " if fl == 'b': return 'female'\n", " if fl == 'c': return 'female'\n", " if fl == 'd': return 
'female'\n", " if fl == 'e': return 'female'\n", " if fl == 'f': return 'female'\n", " if fl == 'g': return 'male'\n", " if fl == 'h': return 'male'\n", " if fl == 'i': return 'female'\n", " if fl == 'j': return 'female'\n", " if fl == 'k': return 'female'\n", " if fl == 'l': return 'female'\n", " if fl == 'm': return 'female'\n", " if fl == 'n': return 'female'\n", " if fl == 'o': return 'female'\n", " if fl == 'p': return 'female'\n", " if fl == 'q': return 'male'\n", " if fl == 'r': return 'male'\n", " if fl == 's': return 'female'\n", " if fl == 't': return 'female'\n", " if fl == 'v': return 'female'\n", " if fl == 'w': return 'male'\n", " if fl == 'y': return 'male'\n", " if fl == 'z': return 'male'\n", "if ll == 'z': \n", " if fl == 'a': return 'male'\n", " if fl == 'b': return 'female'\n", " if fl == 'e': return 'male'\n", " if fl == 'f': return 'male'\n", " if fl == 'h': return 'male'\n", " if fl == 'i': return 'female'\n", " if fl == 'j': return 'male'\n", " if fl == 'l': return 'female'\n", " if fl == 'm': return 'male'\n", " if fl == 'r': return 'female'\n", " if fl == 'x': return 'male'\n", "\n" ] } ], "source": [ "print classifier.pseudocode()" ] }, { "cell_type": "markdown", "id": "b32f7a2f", "metadata": {}, "source": [ "Decision trees are classifiers that classify as a nested sequence of if-then statements.\n", "\n", "Variables can be binary, categorical, or numeric.\n", "\n", "For numerical variables, they divide the feature space into axis-parallel rectangles and associated probabilities.\n", "\n", "Decision trees are generally grown as follows:\n", "\n", "- take a set of data\n", "- consider splits along every possible feature and value\n", "- pick the best split according to the minimal impurity of the corresponding label set\n", "- split according to that feature and value\n", "- repeat the process on each subset (branch)\n", "- stop if a minimum impurity or set size is reached\n", "\n", "A better way of doing this is to split like the above, 
into small terminal nodes (deliberate overfitting), then start merging terminal nodes back together again, based on cross-validated error (\"pruning\"). This is what CART does, and leads to better overall performance." ] }, { "cell_type": "markdown", "id": "22edbba6", "metadata": {}, "source": [ "# Document Classification" ] }, { "cell_type": "code", "execution_count": 22, "id": "ac4d5c47", "metadata": { "collapsed": false }, "outputs": [], "source": [ "from nltk.corpus import movie_reviews" ] }, { "cell_type": "code", "execution_count": 23, "id": "e951ac97", "metadata": { "collapsed": false }, "outputs": [], "source": [ "documents = [(list(movie_reviews.words(fileid)),category)\n", " for category in movie_reviews.categories()\n", " for fileid in movie_reviews.fileids(category)]" ] }, { "cell_type": "code", "execution_count": 24, "id": "23255f76", "metadata": { "collapsed": false }, "outputs": [], "source": [ "pyrandom.shuffle(documents)" ] }, { "cell_type": "code", "execution_count": 25, "id": "926ae80f", "metadata": { "collapsed": false }, "outputs": [], "source": [ "all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())" ] }, { "cell_type": "code", "execution_count": 26, "id": "5830031f", "metadata": { "collapsed": false }, "outputs": [], "source": [ "word_features = all_words.keys()[:2000]" ] }, { "cell_type": "code", "execution_count": 27, "id": "be9dac06", "metadata": { "collapsed": false }, "outputs": [], "source": [ "def document_features(document):\n", " document_words = set(document)\n", " features = {}\n", " for w in word_features:\n", " features[w] = (w in document_words)\n", " return features" ] }, { "cell_type": "code", "execution_count": 28, "id": "b47958f4", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'limited': False, 'four': False, 'woods': True, 'woody': False, 'captain': False, 'hate': False, 'consider': False, 'relationships': False, 'whose': False, 'buddy': False, 
'themes': False, 'presents': False, 'edward': False, 'under': False, 'lord': False, 'worth': False, 'rescue': False, 'every': False, 'jack': False, 'bringing': False, 'school': False, 'skills': False, 'ups': False, 'enjoy': False, 'force': False, 'tired': False, 'miller': False, 'direct': False, 'second': False, 'street': False, 'even': True, '+': False, 'above': False, 'new': True, 'poorly': False, 'ever': False, 'disney': False, 'told': False, 'hero': False, 'mel': False, 'human': False, 'men': False, 'here': False, 'studio': False, 'cult': False, '100': False, 'kids': True, 'daughter': False, 'leaves': False, 'changed': False, 'credit': False, 'military': False, 'changes': False, 'fantastic': False, 'julie': False, 'explained': False, 'julia': False, 'highly': False, 'brought': False, 'moral': False, 'actions': False, 'total': False, 'sarah': False, 'plot': False, 'would': False, 'army': False, 'hospital': True, 'music': False, 'therefore': False, 'recommend': False, 'strike': False, 'survive': False, 'type': True, 'until': False, 'speaking': False, 'successful': False, 'brings': False, 'wars': False, 'award': False, 'hurt': False, 'phone': False, 'adult': True, 'excellent': False, '90': False, 'hold': False, 'must': False, 'shoot': False, 'word': False, 'room': False, '1997': False, '1996': False, '1999': False, '1998': False, 'blade': False, 'movies': False, 'era': False, 'ms': False, 'mr': False, 'my': True, 'example': False, 'give': False, 'climax': False, 'laughs': False, 'want': False, 'times': False, 'end': False, 'thing': False, 'provide': False, 'travel': False, 'sitting': False, 'feature': True, 'machine': False, 'how': False, 'amazing': False, 'writers': True, 'answer': False, 'beach': False, 'badly': False, 'elizabeth': False, 'beauty': False, 'mess': False, 'after': True, 'wrong': False, 'president': False, 'law': False, 'danny': False, 'attempt': False, 'third': False, 'appreciate': False, 'lost': True, 'green': False, 'ultimate': False, 'keeps': 
False, 'worst': True, 'order': False, 'office': False, 'over': True, 'before': True, 'fit': False, 'personal': False, ',': True, 'writing': True, 'better': True, 'production': False, 'compelling': False, 'hidden': False, 'then': False, 'them': True, 'safe': False, 'break': False, 'band': False, 'effects': False, 'they': True, 'one': True, 'alex': False, 'rocky': False, 'debut': False, 'l': False, 'grows': False, 'each': False, 'went': True, 'side': False, 'mean': False, 'meets': False, 'series': True, 'truman': False, 'sounds': False, 'driving': False, 'god': False, 'cheesy': False, 'content': False, 're': False, 'got': False, 'turning': False, 'little': True, 'free': False, 'standard': False, 'masterpiece': False, 'struggle': False, 'wanted': False, 'created': False, 'starts': True, 'days': False, 'creates': False, 'isn': True, 'uses': False, 'onto': False, 'already': False, 'features': False, 'fantasy': False, 'another': False, 'wasn': True, 'comic': False, 'toy': False, 'top': False, 'girls': False, 'fiction': False, 'needed': False, 'master': False, 'too': False, 'tom': False, 'hollywood': False, 'john': False, 'carrey': False, 'urban': False, 'murder': False, 'serve': False, 'took': False, 'japanese': False, 'predictable': False, 'somewhat': False, 'helen': False, 'wasted': False, 'begins': False, 'trek': False, 'target': False, 'roles': False, 'likely': False, 'project': False, 'matter': False, 'silly': False, 'williams': False, 'feeling': False, 'powers': False, 'screenplay': False, 'fashion': False, 'sees': False, 'modern': False, 'mind': False, 'talking': False, 'manner': False, 'seen': False, 'seem': False, 'tells': False, 'ray': False, 'forced': False, 'strength': False, 'genuine': False, 'thoroughly': False, 'latter': False, 'responsible': False, '-': True, 'amusing': False, 'forces': False, 'blue': False, 'nobody': False, 'though': False, 'bruce': False, 'cinematic': False, 'involving': False, 'mouth': False, 'plenty': False, 'the': True, 'thriller': 
False, 'singer': False, 'don': False, 'cops': False, 'professor': False, 'camp': False, 'm': False, 'dog': False, 'definitely': False, 'entertaining': False, 'scream': False, 'came': False, 'saying': False, 'queen': False, 'ending': False, 'attempts': False, 'radio': False, 'subtle': False, 'enjoyed': False, 'situations': False, 'explain': False, 'theme': False, 'rich': False, 'folks': False, 'personality': False, 'do': True, 'godzilla': False, 'wide': False, 'de': False, 'stop': False, '13': False, 'christopher': False, 'despite': False, 'nasty': False, 'dr': False, 'hall': False, 'runs': False, 'bar': False, 'nicely': False, 'guns': False, 'twice': False, 'bad': False, 'shots': False, 'release': False, 'headed': False, 'decides': False, 'disaster': False, 'fair': False, 'decided': False, 'result': False, 'fail': False, 'disturbing': False, 'best': False, 'subject': False, 'said': True, 'lots': False, 'away': True, 'portrayed': False, 'r': False, 'discovers': False, 'drawn': False, 'approach': False, 'we': False, 'never': False, 'terms': False, 'nature': False, 'weak': False, 'however': False, 'boss': False, 'news': False, 'cop': False, 'accident': False, 'country': False, 'ill': False, 'against': False, 'players': False, 'faces': False, 'asked': False, 'tough': False, 'character': False, 'tony': False, 'trust': False, 'speak': False, 'puts': False, 'three': True, 'been': True, '.': True, 'much': False, 'interest': False, 'basic': False, 'expected': False, 'board': False, 'life': False, 'easy': False, 'drugs': False, 'filmmaker': False, 'substance': False, 'child': False, 'catch': False, 'worked': False, 'exception': False, 'has': True, 'air': False, 'ugly': False, 'near': False, 'suppose': False, 'voice': False, 'spawn': False, 'seven': False, 'played': False, 'is': True, 'it': True, 'ii': False, 'shame': False, 'creepy': False, 'in': True, 'if': True, 'hong': False, 'things': True, 'make': True, 'jerry': False, 'complex': False, 'several': False, 'showing': 
False, 'fairly': False, 'pick': False, 'evil': False, 'hand': False, 'characters': False, 'opportunity': False, 'kid': False, 'kept': False, 'lucas': False, 'charlie': False, 'contact': False, 'greatest': False, 'mother': True, 'jean': False, 'musical': False, 'left': False, 'just': False, 'aside': False, 'thanks': False, 'victim': False, 'campbell': False, 'yes': False, 'yet': False, 'previous': False, 'terrific': False, 'thankfully': False, 'instance': False, 'had': True, 'animation': True, 'ok': False, 'innocent': False, 'prison': False, 'save': False, 'humanity': False, 'gave': False, 'casting': False, 'breaks': False, 'possible': False, 'possibly': False, 'tarzan': False, 'background': False, 'destroy': False, 'unique': False, 'dreams': False, 'apart': False, 'desire': False, 'hunt': False, 'officer': False, 'night': False, 'security': False, 'right': False, 'old': True, 'deal': True, 'people': False, 'somehow': False, 'dead': False, 'jennifer': False, 'born': False, 'escape': False, 'confusing': False, 'humor': False, 'for': True, 'bottom': True, 'opposite': False, 'fox': False, '/': False, 'creative': False, 'witty': False, 'everything': False, 'shakespeare': False, 'christmas': False, 'core': False, 'protagonist': False, 'starring': False, 'memorable': False, 'post': False, 'super': False, 'dollars': False, 'months': False, 'o': False, 'plus': False, 'slightly': False, 'presence': False, 'son': False, 'down': False, 'lies': False, 'crowd': False, 'support': False, 'constantly': False, 'fight': False, 'stuck': False, 'anderson': False, 'way': True, 'jane': False, 'call': False, 'was': True, 'war': False, 'happy': False, 'head': True, 'form': False, 'offer': False, 'batman': False, 'becoming': False, 'ford': False, 'taken': False, 'talents': False, 'hear': False, 'true': False, 'inside': False, 'tell': False, 'crystal': False, 'emotional': False, 'moore': False, 'classic': False, 'nudity': False, 'proves': False, 'exist': False, 'believable': False, 'ship': 
False, 'annoying': False, 'trip': False, 'physical': False, 'dying': False, 'no': True, 'when': True, 'actor': False, 'reality': False, 'tim': False, 'interested': False, 'role': False, 'oscar': False, 'test': False, 'picture': True, 'brothers': False, 'dies': False, 'felt': False, 'convincing': False, 'journey': False, 'died': False, 'jones': False, 'younger': False, 'longer': False, 'together': False, 'time': False, 'serious': False, 'paced': False, 'songs': False, 'concept': False, 'managed': False, 'vampires': False, 'remarkable': False, 'impression': False, 'dance': False, 'rob': False, 'focus': False, 'leads': False, 'theaters': False, 'manages': False, 'ice': False, 'battle': False, 'certainly': False, 'father': False, 'finally': False, 'me': True, 'keeping': False, 'seemingly': False, 'choice': False, 'vegas': False, 'lynch': False, 'join': False, 'trouble': False, 'minute': False, 'cool': False, 'slapstick': False, 'impressive': False, 'presented': False, 'did': True, 'turns': False, 'brother': False, 'leave': False, 'team': False, 'quick': False, 'guy': False, 'work': False, 'says': False, 'porn': False, 'dealing': False, 'discover': False, 'sign': False, 'adds': False, 'appear': False, 'current': False, 'suspect': False, 'goes': True, 'falling': False, 'appeal': False, 'filled': False, 'supporting': False, 'henry': False, 'french': False, 'travolta': False, 'water': False, 'witch': False, 'alone': False, 'along': True, 'appears': False, 'change': False, 'wait': False, 'box': False, 'boy': False, 'brilliant': False, 'guilty': False, 'usually': False, 'bob': False, 'myers': False, 'teenage': False, 'love': False, 'humour': False, 'extra': False, 'merely': False, 'bloody': False, 'fake': False, 'fbi': False, 'rarely': False, 'sympathetic': False, 'russell': False, 'working': False, 'prove': False, 'angry': False, 'dude': False, 'intense': False, 'live': False, 'wonderfully': False, 'wondering': False, 'films': False, 'angels': False, 'today': False, 
'loving': False, 'club': False, 'apparent': False, 'visual': False, 'effort': False, 'fly': False, 'graphic': False, 'car': False, 'originally': False, 'soul': False, 'believes': False, 'reviews': False, 'can': False, 'following': False, 'making': False, 'heart': False, 'crazy': False, 'figure': False, 'confused': False, 'agent': False, 'heard': False, 'critic': False, 'sharp': False, 'dennis': False, 'steve': False, 'means': False, '1': False, 'pure': False, 'spielberg': False, 'information': False, 'may': True, 'max': False, 'mad': False, 'date': False, 'such': False, 'heroes': False, 'guys': False, 'man': False, 'natural': False, 'remember': False, 'maybe': False, 'tale': False, 'so': True, 'talk': False, 'typical': False, 'cute': True, 'breaking': False, 'indeed': False, 'years': False, 'course': True, 'cold': False, 'still': False, 'apes': False, 'group': True, 'interesting': False, 'jim': False, 'hot': False, 'window': False, 'offers': False, 'hanks': False, 'main': False, 'into': True, 'happened': False, 'non': False, 'killer': False, 'touching': False, 'half': False, 'not': True, 'now': False, 'killed': False, 'nor': True, 'name': False, 'james': False, 'didn': False, 'realistic': False, 'rock': False, 'entirely': False, 'creature': False, 'ed': False, 'directing': False, 'yeah': False, 'ex': False, 'year': True, 'girl': False, 'surprise': False, 'intended': False, 'living': False, 'ultimately': False, 'jackson': False, 'enjoyable': False, 'space': False, 'looking': False, 'seriously': False, 'formula': False, 'shows': False, 'earlier': False, 'monster': False, 'million': False, 'brooks': False, 'quite': False, 'inevitable': False, 'care': False, 'couldn': False, 'language': False, 'british': False, 'honest': False, 'motion': False, 'turn': False, 'place': False, 'think': True, 'first': True, 'emotion': False, 'flying': False, 'saving': False, 'yourself': False, 'long': False, 'directly': False, 'carry': False, 'impossible': False, 'message': False, 'open': 
False, 'george': False, 'city': False, 'given': False, 'silent': False, 'caught': False, 'anyone': False, 'returns': False, '2': False, 'white': False, 'friend': False, 'gives': False, 'subplot': False, 'slasher': False, 'eyes': False, 'mostly': False, 'that': True, 'season': False, 'alan': False, 'viewing': False, 'released': False, 'ridiculous': False, 'than': True, 'boyfriend': False, '10': False, 'television': True, 'bored': False, 'unfortunately': False, 'rival': False, 'future': False, 'were': True, 'and': True, 'sam': False, 'turned': False, 'sad': False, 'say': False, 'suspense': False, 'rent': False, 'allen': False, 'saw': True, 'any': True, 'kevin': False, 'ideas': False, 'note': False, 'potential': False, 'take': True, 'performance': False, 'begin': False, 'sure': False, 'pain': False, 'normal': False, 'track': False, 'price': False, 'knew': False, 'suspects': False, 'falls': False, 'pair': False, 'america': False, ']': False, 'forever': False, 'especially': True, 'surprising': False, 'failure': False, 'considered': False, 'average': False, 'later': False, 'drive': False, 'mrs': False, 'professional': False, 'detective': False, 'rating': False, 'walking': False, 'shot': False, 'show': True, 'cheap': False, 'bright': True, 'ground': False, 'slow': False, 'title': False, 'written': False, 'crime': False, 'only': False, 'going': False, 'black': False, 'get': True, 'routine': False, 'truly': False, 'cannot': False, 'nearly': True, 'flicks': False, 'reveal': False, 'artist': False, 'naked': False, 'jokes': False, 'stupid': False, 'roger': False, 'scott': False, 'where': True, 'husband': False, 'seat': False, 'gangster': False, 'college': False, 'sean': False, 'wonder': True, 'fails': False, 'gags': False, 'ways': False, 'review': True, '3': True, 'between': False, 'reading': False, 'across': False, 'notice': False, 'screen': False, 'jr': False, 'killing': False, 'blame': False, 'come': False, 'talented': False, 'mob': False, 'many': True, 'quiet': False, 
'somewhere': False, 's': True, 'remake': False, 'comes': True, 'among': True, 'characterization': False, 'color': False, 'effective': False, 'period': False, 'pop': False, 'expectations': False, 'anti': False, 'bizarre': False, 'crew': False, 'boat': False, 'considering': False, 'arts': False, 'cares': False, 'west': False, 'filmmakers': False, 'mark': False, 'mars': False, 'featuring': False, 'hardly': False, 'mary': False, 'wants': False, 'direction': False, 'dramatic': False, 'former': False, 'those': False, 'case': False, 'myself': False, 'these': False, 'cash': False, 'arnold': False, 'cast': False, 'situation': False, 'eventually': False, 'twists': False, 'middle': False, 'someone': False, 'technology': False, 'mediocre': False, 'different': True, 'doctor': False, 'pay': False, 'same': False, 'check': False, 'speech': False, 'damon': False, 'events': False, 'week': False, 'visually': False, 'driver': False, 'director': False, 'running': True, 'driven': False, 'totally': False, 'cartoon': False, 'theater': False, 'floor': False, 'without': False, 'relief': False, 'model': False, 'parody': False, 'summer': False, 'being': True, 'money': False, 'actress': False, 'violent': False, 'kill': False, 'aspect': False, 'touch': False, 'baldwin': False, 'death': False, 'thinking': False, 'rose': False, 'seems': False, 'except': False, 'setting': False, '4': False, 'real': False, 'aspects': False, 'around': True, 'spectacular': False, 'read': False, 'early': False, 'world': False, 'lady': False, 'adams': False, 'patrick': False, 'audience': True, 'laughing': False, 'fully': False, 'eddie': False, 'images': False, 'joan': False, 'thinks': False, 'provided': False, 'mood': False, 'willis': False, 'overly': False, 'provides': False, 'welcome': False, 'business': False, 'seconds': False, 'credits': False, 'starship': False, 'exciting': False, 'throw': False, 'on': True, 'stone': False, 'central': False, 'oh': False, 'island': False, 'industry': False, 'violence': False, 
'favorite': False, 'generated': False, 'stand': False, 'act': False, 'bond': False, 'or': True, 'road': False, 'image': False, 'rising': False, 'gary': False, 'burton': False, 'portrayal': False, 'happening': False, 'your': True, 'her': False, 'aren': False, 'there': True, 'hey': False, 'start': True, 'low': False, 'stars': False, 'complete': False, 'enough': True, 'seagal': False, 'ended': False, 'gore': False, 'trying': False, 'with': True, 'spice': False, 'football': False, 'pull': False, 'rush': False, 'romantic': False, 'pulp': False, 'gone': False, 'taylor': False, 'johnny': False, 'cinema': False, 'taste': False, 'certain': False, 'am': True, 'al': False, 'deep': False, 'fellow': False, 'imagination': False, 'as': True, 'at': True, 'girlfriend': False, 'watched': False, 'moves': False, 'spends': False, 'film': True, 'again': False, 'comedies': False, 'cinematography': False, '5': False, 'you': True, 'poor': False, 'flaws': False, 'finale': False, 'carpenter': False, 'includes': False, 'important': False, 'chris': False, 'building': False, 'calls': False, 'wife': False, 'directors': False, 'u': False, 'original': False, 'all': True, 'sci': False, 'forget': False, 'chinese': False, 'lack': False, 'cameron': False, 'follow': False, 'children': True, 'apartment': False, 'hunting': False, 'tv': False, 'spirit': False, 'to': True, 'smile': False, 'sound': False, 'woman': False, 'worse': False, 'song': False, 'very': True, 'horror': False, 'fat': False, 'fan': False, 'decide': False, 'fall': False, 'awful': True, 'difference': False, '`': False, 'heaven': False, '--': False, 'list': False, 'large': False, 'harry': False, 'small': False, 'flick': False, 'ten': False, 'streets': False, 'ted': False, 'past': False, 'rate': False, 'design': False, 'lawyer': False, 'pass': False, 'further': False, 'what': True, 'sub': False, 'richard': False, 'brief': False, 'emotions': False, 'jedi': False, 'version': False, 'public': False, 'hasn': False, 'full': False, 'supposedly': 
False, 'revenge': False, 'hours': False, 'shouldn': False, 'strong': False, 'legend': False, 'search': False, 'ahead': False, 'inspired': False, 'allows': False, 'jackie': False, 'experience': False, 'amount': False, 'social': False, 'action': False, 'followed': False, 'family': True, 'suddenly': False, 'put': False, 'eye': False, 'takes': False, 'contains': True, 'two': True, '6': False, 'mysterious': False, 'minor': False, 'more': True, 'teen': False, 'flat': False, 'door': False, 'knows': False, 'company': False, 'excuse': False, 'basically': False, 'particular': False, 'known': False, 'town': False, 'none': False, 'hour': False, 'science': False, 'learn': False, 'male': False, 'history': False, 'beautiful': False, 'brown': False, 'share': False, 'accept': False, 'states': False, 'sense': False, 'station': False, 'lacks': False, 'species': False, '!': True, 'huge': False, 'needs': False, 'court': False, 'goal': False, 'rather': False, 'plans': False, 'acts': False, 'fame': False, 'occasionally': False, 'either': False, 'okay': False, 'tried': False, 'soundtrack': False, 'tries': False, 'plane': False, 'blood': False, 'coming': False, 'fi': False, 'a': True, 'contrived': False, 'short': False, 'loud': False, 'perhaps': False, 'media': False, 'pleasure': False, 'dream': False, 'playing': False, 'help': False, 'developed': False, 'soon': False, 'attitude': True, 'scientist': False, 'through': True, 'hell': False, 'its': True, 'romance': False, 'style': True, 'comedic': False, '20': False, 'actually': False, 'late': False, 'parts': False, 'stephen': False, 'damme': False, 'might': False, 'wouldn': False, 'good': True, 'return': False, 'food': False, 'viewers': False, 'mystery': False, 'easily': False, 'always': True, 'level': False, 'goofy': False, 'die': False, 'found': False, 'trailer': True, 'heavy': False, 'everyone': False, 'generation': False, 'house': False, 'energy': False, 'hard': False, 'idea': True, 'gun': False, 'engaging': False, 'expect': False, 
'beyond': False, 'event': False, 'really': False, 'deals': False, 'robert': False, 'since': True, 'douglas': False, 'acting': False, 'hill': False, '7': True, 'ass': False, 'pathetic': False, 'story': True, 'reason': False, 'rated': False, 'members': False, 'imagine': False, 'ask': False, 'horse': False, 'laughable': False, 'beginning': False, 'thrown': False, 'producers': False, 'terrible': False, 'american': False, 'expecting': False, 'twenty': False, 'major': False, 'feel': False, 'number': False, 'feet': False, 'done': True, 'bland': False, 'miss': False, 'guess': False, '\"': True, 'heads': False, 'script': False, 'leading': False, 'least': False, 'player': False, 'wonderful': False, 'store': False, 'relationship': False, 'behind': False, 'hotel': False, 'park': False, 'part': False, 'believe': False, 'grace': False, 'king': False, 'kind': False, 'b': False, 'double': False, 'determined': False, 'marriage': False, 'supposed': False, 'toward': False, 'outstanding': False, 'nights': False, 'built': False, 'zero': False, 'self': False, 'also': False, 'superior': False, 'finding': False, 'play': False, 'towards': False, 'english': False, 'reach': False, 'most': True, 'virus': False, 'charm': True, 'plan': False, 'nothing': False, 'extremely': False, 'screenwriter': False, 'stands': False, 'performances': False, 'clear': False, 'sometimes': True, 'cover': False, 'storyline': False, 'latest': False, 'thomas': False, 'particularly': False, 'gold': False, 'fine': False, 'find': False, 'impact': False, 'giant': False, 'writer': False, 'failed': False, 'wayne': False, 'pretty': False, '8': False, 'his': True, 'hit': False, 'meanwhile': False, 'famous': False, 'feels': False, 'rest': False, 'during': False, 'frightening': False, 'him': False, 'generally': False, 'common': False, 'x': False, 'vincent': False, 'wrote': False, 'set': True, 'art': False, 'intelligence': False, 'scenes': False, 'culture': False, 'see': True, 'individual': False, 'dumb': False, 'are': False, 
'close': False, 'learns': False, 'pictures': False, 'please': False, 'fans': False, 'won': False, 'various': False, 'probably': False, 'numerous': False, 'available': False, 'recently': False, 'creating': False, 'missing': True, 'attention': True, 'premise': False, 'genre': False, 'both': False, 'c': False, 'last': False, 'barely': False, 'became': False, 'annie': False, 'forgotten': False, 'whole': False, 'finds': False, 'liked': False, 'point': False, 'simple': False, 'sweet': False, 'acted': False, 'whatever': False, 'hollow': False, 'dimensional': False, 'simply': True, 'likes': False, 'throughout': False, 'mission': False, '[': False, 'devil': False, 'humorous': False, 'create': False, 'political': False, 'due': False, 'teacher': False, 'whom': False, 'secret': False, 'damn': False, 'pg': False, 'meeting': False, 'dialogue': False, 'gay': False, 'fire': False, 'else': False, 'anthony': False, 'lives': False, 'wedding': False, 'intriguing': False, 'look': True, 'solid': False, 'straight': False, 'bill': False, 'budget': False, 'pace': False, 'while': False, 'match': False, 'fun': True, 'animated': True, 'robin': False, 'hoping': False, 'century': False, 'disappointing': False, 'itself': False, 'ready': False, 'chase': True, 'funny': False, 'kills': False, 'grant': False, 'rules': False, 'virtually': False, 'grand': False, '9': False, 'conflict': False, 'development': False, 'used': False, 'blair': False, 'moment': False, '000': False, 'moving': False, 'purpose': False, 'haunting': False, 'weird': False, 'recent': True, 'dark': False, 'task': False, 'older': True, 'spent': False, 'obviously': False, 'person': False, 'edge': False, 'kelly': False, 'chemistry': False, 'spend': False, 'questions': False, 'using': False, 'cut': False, '$': False, 'surprises': False, 'parents': True, 'surprised': False, 'build': False, 'big': False, 'couple': False, 'matters': False, 'game': False, 'bit': True, 'd': False, 'follows': False, 'continue': False, 'popular': False, 
'often': False, 'absolutely': True, 'some': False, 'back': True, 'martial': False, 'added': False, 'sight': False, 'scale': False, 'decision': False, 'anne': False, 'epic': False, 't': True, 'be': True, 'run': False, 'lose': False, 'costumes': False, 'feelings': False, 'step': False, 'nowhere': False, 'crap': False, 'by': True, 'faith': False, 'anything': True, 'drama': False, 'deserves': False, 'stuart': False, 'seeing': True, 'jimmy': False, 'aliens': False, 'within': False, 'appropriate': False, 'steven': False, 'question': False, 'fast': False, 'doubt': False, 'forward': False, ':': True, 'opens': False, 'files': False, 'himself': False, 'an': True, 'murphy': False, 'boys': False, 'larry': False, 'hopes': False, 'episode': False, 'line': False, 'dull': False, 'directed': False, 'up': True, 'us': False, 'planet': False, 'similar': False, 'called': False, 'constant': False, 'adults': True, 'chan': False, 'doesn': False, 'single': False, 'phantom': False, 'nick': False, 'points': False, 'actors': False, 'nice': False, 'elements': False, 'problems': False, 'liners': False, 'william': False, 'meaning': False, 'utterly': False, 'ago': False, 'land': False, 'e': True, 'dvd': False, 'age': False, 'depth': False, 'narrative': False, 'far': False, 'fresh': False, 'menace': False, 'having': False, 'once': False, 'jason': False, 'results': False, 'alien': False, 'gang': False, 'go': True, 'kate': False, 'issues': False, 'seemed': False, 'simon': False, 'young': True, 'send': False, 'helps': False, 'include': False, 'sent': False, 'outside': False, 'continues': False, 'telling': False, 'entire': False, 'magic': True, 'shock': False, 'michael': False, 'ryan': False, 'try': False, 'race': False, 'carter': False, 'visuals': False, 'video': False, 'earth': False, 'odd': False, 'clich': False, 'plays': False, 'power': False, 'giving': False, 'vampire': False, 'waiting': False, ';': False, 'body': False, 'led': False, 'lee': False, 'growing': False, 'let': False, 'others': False, 
'sexy': False, 'extreme': False, 'great': True, 'talent': False, 'broken': False, 'technical': False, 'involved': False, '30': False, 'titanic': False, 'leaving': False, 'opinion': False, 'makes': False, 'involves': False, 'named': True, 'win': False, 'manage': False, 'private': False, 'wit': True, 'names': False, 'singing': False, 'standing': False, 'use': False, 'from': True, '&': False, 'remains': False, 'next': False, 'few': False, 'camera': False, 'themselves': False, 'sort': False, 'clever': False, 'babe': False, 'comparison': True, 'started': False, 'becomes': True, 'about': True, 'charming': False, 'train': False, 'baby': True, 'rare': False, 'women': False, 'animals': False, 'f': False, 'this': True, 'ride': False, 'hopkins': False, 'obvious': False, 'thin': False, 'of': True, 'meet': False, 'control': False, 'tarantino': False, 'process': False, 'pieces': False, 'high': False, 'villain': False, 'something': False, 'united': False, 'mulan': False, 'sit': True, 'cliches': False, 'six': False, 'brian': False, 'animal': False, 'instead': False, 'comedy': False, 'intelligent': False, 'stock': False, 'fare': True, 'tension': False, 'watch': False, 'realized': False, 'light': False, 'lame': False, 'lines': False, 'element': False, 'claire': False, 'realizes': False, 'allow': False, 'holds': False, 'producer': False, 'move': False, 'produced': False, 'including': False, 'looks': False, 'cruise': False, 'mentioned': False, 'bunch': False, 'perfect': False, 'write': False, 'affleck': False, 'la': False, 'chosen': False, 'll': False, 'winning': False, 'willing': False, 'criminal': False, 'dad': False, 'freeman': False, 'crash': True, 'catherine': False, 'material': False, 'mention': False, 'snake': False, 'kiss': False, 'hands': False, 'front': False, 'cage': False, 'day': False, 'billy': False, 'truth': False, 'doing': False, 'adventure': False, 'product': False, 'society': False, 'books': False, 'filmmaking': False, 'our': False, 'patch': False, 'sexual': False, 
'special': False, 'out': False, 'matt': False, 'matrix': False, \"'\": True, 'entertainment': False, 'shallow': False, 'critics': False, 'hilarious': False, 'cause': False, 'red': False, 'frank': False, 'completely': True, 'york': False, 'flynt': False, 'princess': False, 'scary': False, 'g': True, 'could': True, 'david': False, 'length': False, 'austin': False, 'south': False, 'succeeds': False, 'powerful': False, 'scene': True, 'owner': False, 'quality': True, 'sadly': False, 'fascinating': False, 'accent': False, 'system': False, 'their': True, 'attack': False, 'perfectly': False, 'final': False, 'lot': False, 'academy': False, 'exactly': False, 'sex': False, 'herself': False, 'haven': False, 'loved': False, 'ben': False, 'boring': True, 'tommy': True, 'roberts': False, 'appealing': False, 'loves': False, 'lover': False, 'viewer': True, 'teenagers': False, 'williamson': False, 'have': True, 'need': False, 'apparently': False, 'clearly': False, 'able': False, 'mix': False, 'which': True, 'jail': False, '=': False, 'unless': False, 'who': False, 'eight': False, 'device': False, 'why': True, 'face': False, 'looked': True, 'movie': True, 'fact': False, 'affair': False, 'atmosphere': False, 'charles': False, 'anyway': False, 'bring': False, 'soldiers': False, 'fear': False, 'decade': False, 'filmed': False, 'based': False, 'jay': False, '(': True, 'winner': False, 'woo': False, 'should': False, 'score': False, 'local': False, 'hope': False, 'meant': False, 'watching': False, 'beat': False, 'familiar': False, 'overall': False, 'lucky': False, 'community': False, 'ones': False, 'words': True, 'kong': False, 'brain': False, 'married': False, 'stuff': False, 'she': False, 'dangerous': False, 'gibson': False, 'view': False, 'frame': False, 'humans': False, 'computer': False, 'desperate': False, 'nuclear': False, 'superb': False, 'genius': False, 'state': False, 'horrible': False, 'neither': False, 'speed': False, 'ends': False, 'ability': False, 'opening': False, 
'deliver': False, 'job': False, 'joe': False, 'key': False, 'police': False, 'hits': False, 'career': False, 'joke': False, 'taking': True, 'drug': False, 'etc': False, 'admit': False, 'figures': False, 'otherwise': False, 'co': False, 'wall': False, 'walk': False, 'laugh': True, 'sequences': False, 'respect': False, 'addition': False, 'decent': False, 'slowly': False, 'treat': False, 'waste': False, 'troopers': False, 'mike': False, 'general': False, 'reeves': False, 'present': False, 'novel': False, 'unlike': False, 'plain': False, 'appearance': False, 'will': False, 'stunning': False, 'fault': False, 'wild': False, 'likable': False, 've': False, 'almost': True, 'thus': False, 'surprisingly': False, 'surface': False, 'audiences': False, 'partner': False, ')': True, 'began': False, 'cross': False, 'member': False, 'matthew': False, 'strange': False, 'party': False, 'gets': False, 'difficult': False, 'emotionally': False, 'eccentric': False, 'upon': False, 'effect': False, 'beast': False, 'cinematographer': False, 'student': False, 'identity': False, 'keep': False, 'off': False, 'center': False, 'i': True, 'well': True, 'fighting': False, 'thought': False, 'sheer': False, 'sets': False, 'position': False, 'usual': False, 'cameo': False, 'less': False, 'moments': False, 'tone': False, 'paul': False, 'field': False, 'jeff': False, 'smith': False, 'realize': False, 'reasons': False, 'add': False, 'other': True, 'attractive': False, 'smart': False, 'fate': False, 'government': False, 'five': False, 'know': False, 'press': False, 'immediately': False, 'necessary': False, 'like': True, 'success': False, 'loses': False, 'martin': False, 'become': True, 'funniest': False, 'works': False, 'adaptation': False, 'because': False, 'sequence': False, 'footage': False, 'alive': False, 'hair': False, 'home': False, 'peter': False, 'happens': False, 'unfunny': False, 'lead': False, 'literally': False, 'avoid': False, 'does': False, 'leader': False, '?': True, 'although': True, 
'worthy': False, 'amy': False, 'stage': False, 'sister': False, 'actual': False, 'asks': False, 'getting': False, 'introduced': False, 'documentary': False, 'equally': False, 'own': False, 'satire': False, 'washington': False, 'guard': False, 'promise': False, 'female': False, 'quickly': False, 'pointless': False, 'van': False, '*': False, 'biggest': False, 'buy': False, 'bus': False, 'sequel': False, 'but': True, 'delivers': False, 'editing': False, 'bug': False, 'he': False, 'count': False, 'made': False, 'wise': False, 'places': False, 'whether': False, 'wish': False, 'j': False, 'placed': False, 'stories': False, 'problem': False, 'piece': True, 'minutes': False, 'twist': False, 'shooting': False, 'happen': False, 'compared': False, 'incredible': False, 'offensive': False, 'detail': False, 'book': False, 'details': False, 'sick': False, 'incredibly': False, 'conclusion': False, 'star': False, 'class': False, 'shown': False, 'stay': True, 'chance': False, 'sandler': False, 'friends': True, 'ghost': False, 'understand': False}\n" ] } ], "source": [ "print document_features(documents[0][0])" ] }, { "cell_type": "code", "execution_count": 29, "id": "427d6adf", "metadata": { "collapsed": false }, "outputs": [], "source": [ "featuresets = [(document_features(d),c) for d,c in documents]\n", "training_set = featuresets[:100]\n", "test_set = featuresets[100:]" ] }, { "cell_type": "code", "execution_count": 30, "id": "0d25bbf0", "metadata": { "collapsed": false }, "outputs": [], "source": [ "classifier = nltk.NaiveBayesClassifier.train(training_set)" ] }, { "cell_type": "code", "execution_count": 31, "id": "76ec8568", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.7347368421052631" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.classify.accuracy(classifier,test_set)" ] }, { "cell_type": "code", "execution_count": 32, "id": "314e1030", "metadata": { "collapsed": false }, "outputs": [ { 
"name": "stdout", "output_type": "stream", "text": [ "Most Informative Features\n", " hilarious = True pos : neg = 6.1 : 1.0\n", " writers = True neg : pos = 5.9 : 1.0\n", " leaving = True pos : neg = 5.4 : 1.0\n", " choice = True pos : neg = 5.4 : 1.0\n", " superb = True pos : neg = 5.4 : 1.0\n" ] } ], "source": [ "classifier.show_most_informative_features(5)" ] },
{ "cell_type": "markdown", "id": "0ae0cab0", "metadata": {}, "source": [ "# Part-of-Speech Tagging" ] },
{ "cell_type": "code", "execution_count": 33, "id": "438201fb", "metadata": { "collapsed": false }, "outputs": [], "source": [ "from nltk.corpus import brown" ] },
{ "cell_type": "code", "execution_count": 34, "id": "e3450b18", "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Count how often each 1-, 2-, and 3-character suffix occurs in the Brown corpus.\n", "suffixes = nltk.FreqDist()\n", "for word in brown.words():\n", "    word = word.lower()\n", "    suffixes[word[-1:]] += 1\n", "    suffixes[word[-2:]] += 1\n", "    suffixes[word[-3:]] += 1" ] },
{ "cell_type": "code", "execution_count": 35, "id": "6ff0f25f", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of', 'the', 'y', 'r', 'to', 'in', 'f', 'o', 'ed', 'nd', 'is', 'on', 'l', 'g', 'and', 'ng', 'er', 'as', 'ing', 'h', 'at', 'es', 'or', 're', 'it', '``', 'an', \"''\", 'm', ';', 'i', 'ly', 'ion', 'en', 'al', '?', 'nt', 'be', 'hat', 'st', 'his', 'th', 'll', 'le', 'ce', 'by', 'ts', 'me', 've', \"'\", 'se', 'ut', 'was', 'for', 'ent', 'ch', 'k', 'w', 'ld', '`', 'rs', 'ted', 'ere', 'her', 'ne', 'ns', 'ith', 'ad', 'ry', ')', '(', 'te', '--', 'ay', 'ty', 'ot', 'p', 'nce', \"'s\", 'ter', 'om', 'ss', ':', 'we', 'are', 'c', 'ers', 'uld', 'had', 'so', 'ey']\n" ] } ], "source": [ "# Keep the 100 most frequent suffixes, most common first.\n", "common = [s for s, _ in suffixes.most_common(100)]\n", "print(common)" ] },
{ "cell_type": "code", "execution_count": 36, "id": "8ac20819", "metadata": { "collapsed": false }, "outputs": [], "source": [ "def pos_features(w):\n", "    # One boolean feature per common suffix: does the word end with it?\n", "    features = {}\n", "    for s in common:\n", "        features[s] = w.lower().endswith(s)\n", "    return features" ] },
{ "cell_type": "code", "execution_count": 37, "id": "7b406086", "metadata": { "collapsed": false }, "outputs": [], "source": [ "tagged_words = brown.tagged_words(categories='news')" ] },
{ "cell_type": "code", "execution_count": 38, "id": "70216e95", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100554\n" ] } ], "source": [ "featuresets = [(pos_features(w),c) for w,c in tagged_words]\n", "n = len(featuresets)\n", "print(n)" ] },
{ "cell_type": "code", "execution_count": 39, "id": "c835bf9f", "metadata": { "collapsed": false }, "outputs": [], "source": [ "training_set = featuresets[n//10:]\n", "test_set = featuresets[:n//10]" ] },
{ "cell_type": "code", "execution_count": 40, "id": "8e14de69", "metadata": { "collapsed": false }, "outputs": [], "source": [ "classifier = nltk.DecisionTreeClassifier.train(training_set)" ] },
{ "cell_type": "code", "execution_count": 41, "id": "e16dd746", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.6270512182993535" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.classify.accuracy(classifier,test_set)" ] },
{ "cell_type": "code", "execution_count": 42, "id": "e23a4bec", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'NNS'" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "classifier.classify(pos_features('cats'))" ] },
{ "cell_type": "code", "execution_count": 43, "id": "4fd9613c", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "if the == False: \n", "  if , == False: \n", "    if s == False: \n", "      if . == False: return '.'\n", "      if .
== True: return '.'\n", "    if s == True: \n", "      if is == False: return 'PP$'\n", "      if is == True: return 'BEZ'\n", "  if , == True: return ','\n", "if the == True: return 'AT'\n", "\n" ] } ], "source": [ "print(classifier.pseudocode(depth=4))" ] }, { "cell_type": "code", "execution_count": 43, "id": "99132fc6", "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }