{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## NLP for Task Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hypothesis**: Part of Speech (POS) tagging and syntactic dependency parsing provide valuable information for classifying imperative phrases. The thinking is that being able to detect imperative phrases will transfer well to detecting tasks and to-dos.\n", "\n", "#### Some Terminology\n", "- [_Imperative mood_](https://en.wikipedia.org/wiki/Imperative_mood) is \"used principally for ordering, requesting or advising the listener to do (or not to do) something... also often used for giving instructions as to how to perform a task.\"\n", "- _Part of speech (POS)_ is a way of categorizing a word based on its syntactic function.\n", " - The POS tagger from Spacy.io that is used in this notebook differentiates between [*pos_* and *tag_*](https://spacy.io/docs/api/annotation#pos-tagging-english) - *POS (pos_)* refers to \"coarse-grained part-of-speech\" like `VERB`, `ADJ`, or `PUNCT`; and *POSTAG (tag_)* refers to \"fine-grained part-of-speech\" like `VB`, `JJ`, or `.`.\n", "- _Syntactic dependency parsing_ is a way of connecting words based on syntactic relationships, [such as](https://spacy.io/docs/api/annotation#dependency-parsing-english) `DOBJ` (direct object), `PREP` (prepositional modifier), or `POBJ` (object of preposition).\n", " - Check out the dependency parse of the phrase [\"Send the report to Kyle by tomorrow\"](https://demos.explosion.ai/displacy/?text=Send%20the%20report%20to%20Kyle%20by%20tomorrow&model=en&cpu=1&cph=1) as an example.\n", "\n", "### Proposed Features\n", "The imperative mood centers around _actions_, and actions are generally represented in English using verbs. So the features are engineered to also center on the VERB:\n", "1. `FeatureName.VERB`: Does the phrase contain `VERB`(s) of the tag form `VB*`?\n", "2. 
`FeatureName.FOLLOWING_POS`: Are the words following the `VERB`(s) of certain parts of speech?\n", "3. `FeatureName.FOLLOWING_POSTAG`: Are the words following the `VERB`(s) of certain POS tags?\n", "4. `FeatureName.CHILD_DEP`: Are the `VERB`(s) parents of certain syntactic dependencies?\n", "5. `FeatureName.PARENT_DEP`: Are the `VERB`(s) children of certain syntactic dependencies?\n", "6. `FeatureName.CHILD_POS`: Are the tokens that the `VERB`(s) are parents of (their children) of certain parts of speech?\n", "7. `FeatureName.CHILD_POSTAG`: Are the tokens that the `VERB`(s) are parents of (their children) of certain POS tags?\n", "8. `FeatureName.PARENT_POS`: Are the tokens that the `VERB`(s) are children of (their heads) of certain parts of speech?\n", "9. `FeatureName.PARENT_POSTAG`: Are the tokens that the `VERB`(s) are children of (their heads) of certain POS tags?\n", "\n", "**Notes:**\n", "- Features 2-9 all depend on feature 1 being `True`; if `False`, phrase vectorization will result in all zeroes.\n", "- When features 2-9 are applied to actual phrases, they will append identifying information about the feature in the form of `_*` (e.g., `FeatureName.FOLLOWING_POSTAG_WRB`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data and Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Building a recipe corpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I wrote and ran `epicurious_recipes.py`\\* to scrape Epicurious.com for recipe instructions and descriptions. I then performed some manual cleanup of the script results. 
Output is in `epicurious-pos.txt` and `epicurious-neg.txt`.\n", "\n", "\\* _script (very) loosely based off of https://github.com/benosment/hrecipe-parse_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note** that deriving all negative examples in the training set from Epicurious recipe descriptions would result in negative examples that are longer and syntactically more complicated than the positive examples. This is a form of bias.\n", "\n", "To (hopefully?) correct for this a bit, I will add the short movie reviews found at https://pythonprogramming.net/static/downloads/short_reviews/ as more negative examples.\n", "\n", "This still feels weird because we're selecting negative examples only from specific categories of text (recipe descriptions, short movie reviews) - just because they're readily available. Further, most positive examples are recipe instructions - also a specific (and not necessarily related to the main \"task\" category) category of text.\n", "\n", "Ultimately though, this recipe corpus is a **stopgap/proof of concept** for a corpus more relevant to tasks later on, so I won't worry further about this for now." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "from pandas import read_csv\n", "from numpy import random" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "BASE_DIR = os.getcwd()\n", "data_path = BASE_DIR + '/data.tsv'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
 "  <thead>\n",
 "    <tr><th></th><th>Text</th><th>Label</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>Be kind</td><td>pos</td></tr>\n",
 "    <tr><th>1</th><td>Get out of here</td><td>pos</td></tr>\n",
 "    <tr><th>2</th><td>Look this over</td><td>pos</td></tr>\n",
 "    <tr><th>3</th><td>Paul, do your homework now</td><td>pos</td></tr>\n",
 "    <tr><th>4</th><td>Do not clean soot off the window</td><td>pos</td></tr>\n",
 "  </tbody>\n",
 "</table>
" ], "text/plain": [ " Text Label\n", "0 Be kind pos\n", "1 Get out of here pos\n", "2 Look this over pos\n", "3 Paul, do your homework now pos\n", "4 Do not clean soot off the window pos" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = read_csv(data_path, sep='\\t', header=None, names=['Text', 'Label'])\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "pos_data_split = list(df.loc[df.Label == 'pos'].Text)\n", "neg_data_split = list(df.loc[df.Label == 'neg'].Text)\n", "\n", "num_pos = len(pos_data_split)\n", "num_neg = len(neg_data_split)\n", "\n", "# 50/50 split between the number of positive and negative samples\n", "num_per_class = num_pos if num_pos < num_neg else num_neg\n", "\n", "# shuffle samples\n", "random.shuffle(pos_data_split)\n", "random.shuffle(neg_data_split)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "lines = []\n", "for l in pos_data_split[:num_per_class]:\n", " lines.append((l, 'pos'))\n", "for l in neg_data_split[:num_per_class]:\n", " lines.append((l, 'neg'))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Features as defined in the introduction\n", "from enum import Enum, auto\n", "class FeatureName(Enum):\n", " VERB = auto()\n", " FOLLOWING_POS = auto()\n", " FOLLOWING_POSTAG = auto()\n", " CHILD_DEP = auto()\n", " PARENT_DEP = auto()\n", " CHILD_POS = auto()\n", " CHILD_POSTAG = auto()\n", " PARENT_POS = auto()\n", " PARENT_POSTAG = auto()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [spaCy.io](https://spacy.io/) for NLP\n", "_Because Stanford CoreNLP is hard to install for Python_\n", "\n", "Found Spacy through an article on [\"Training a Classifier for Relation Extraction from Medical 
Literature\"](https://www.microsoft.com/developerblog/2016/09/13/training-a-classifier-for-relation-extraction-from-medical-literature/) ([GitHub](https://github.com/CatalystCode/corpus-to-graph-ml))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"NLTK" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#!conda config --add channels conda-forge\n", "#!conda install spacy\n", "#!python -m spacy download en" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using the Spacy Data Model for NLP" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import spacy\n", "# slow\n", "nlp = spacy.load('en')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Spacy's sentence segmentation is lacking... https://github.com/explosion/spaCy/issues/235. So each '\\n' will start a new Spacy Doc.\n", "\n", "**TODO**: Improvement to Doc.sents? \"To improve accuracy on informal texts, spaCy calculates sentence boundaries from the syntactic dependency parse.\"" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def create_spacy_docs(ll):\n", " dd = [(nlp(l[0]), l[1]) for l in ll]\n", " # collapse noun phrases into single compounds\n", " for d in dd:\n", " for np in d[0].noun_chunks:\n", " np.merge(tag=np.root.tag_, ent_type=np.root.ent_type_, lemma=np.root.lemma_)\n", " return dd" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# slower\n", "docs = create_spacy_docs(lines)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NLP output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tokenization, POS tagging, and dependency parsing happened automatically with the `nlp(line)` calls above! 
So let's look at the outputs.\n", "\n", "https://spacy.io/docs/usage/data-model and https://spacy.io/docs/api/doc will be useful going forward" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[Toss cherry tomatoes with 1 Tbsp., oil in a small bowl; season with salt.]\n", "[Transfer to a platter and top with remaining 1/2 tsp., lime zest; season with salt and pepper.]\n", "[Serve with a crunchy green salad.]\n", "[Immediately pour entire contents of pot into a large colander to drain, then spread out corn, potatoes, and shrimp on a large rimmed baking sheet or sheets of newspaper; discard lemon halves.]\n", "[Cook wings, moving to a cooler section of grill or reducing heat if they start to burn, until cooked through, an instant-read thermometer inserted into the flesh but not touching the bone registers 165°F, and skin is crisp and lightly charred, 5–10 minutes.]\n", "[Line 24 muffin cups with paper liners.]\n", "[Cut into thin slices crosswise.]\n", "[Place 2 small plates in freezer to chill.]\n", "[Bring mixture to boil, stirring often, over medium-high heat.]\n", "[Don’t omit the milk, however, as this will change the balance of liquid to dry ingredients in the recipe.]\n" ] } ], "source": [ "for doc in docs[:10]:\n", " print(list(doc[0].sents))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[Toss cherry tomatoes, 1 Tbsp, oil, a small bowl, season, salt]\n", "[a platter, top, remaining 1/2 tsp, lime zest, season, salt, pepper]\n", "[a crunchy green salad]\n", "[entire contents, pot, a large colander, corn, potatoes, a large rimmed baking sheet, sheets, newspaper, discard lemon halves]\n", "[Cook wings, a cooler section, grill, heat, they, an instant-read thermometer, the flesh, 165°F, skin]\n", "[Line 24 muffin cups, paper liners]\n", "[thin slices]\n", "[Place 2 small plates, 
freezer]\n", "[mixture, medium-high heat]\n", "[the milk, the balance, liquid, ingredients, the recipe]\n" ] } ], "source": [ "for doc in docs[:10]:\n", " print(list(doc[0].noun_chunks))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Spacy's dependency graph visualization](https://demos.explosion.ai/displacy)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Toss cherry tomatoes ROOT tomato NOUN NNS Toss cherry tomatoes [with, .]\n", "with prep with ADP IN Toss cherry tomatoes [1 Tbsp]\n", "1 Tbsp pobj tbsp PROPN NNP with []\n", ". punct . PUNCT . Toss cherry tomatoes []\n", "oil ROOT oil NOUN NN oil [in, ;, season, .]\n", "in prep in ADP IN oil [a small bowl]\n", "a small bowl pobj bowl NOUN NN in []\n", "; punct ; PUNCT : oil []\n", "season conj season NOUN NN oil [with]\n", "with prep with ADP IN season [salt]\n", "salt pobj salt NOUN NN with []\n", ". punct . PUNCT . oil []\n", "Transfer ROOT transfer VERB VB Transfer [to, with, .]\n", "to prep to ADP IN Transfer [a platter]\n", "a platter pobj platter NOUN NN to [and, top]\n", "and cc and CCONJ CC a platter []\n", "top conj top NOUN NN a platter []\n", "with prep with ADP IN Transfer [remaining 1/2 tsp]\n", "remaining 1/2 tsp pobj tsp NOUN NN with []\n", ". punct . PUNCT . Transfer []\n", "lime zest ROOT zest NOUN NN lime zest [;, season, .]\n", "; punct ; PUNCT : lime zest []\n", "season appos season NOUN NN lime zest [with]\n", "with prep with ADP IN season [salt]\n", "salt pobj salt NOUN NN with [and, pepper]\n", "and cc and CCONJ CC salt []\n", "pepper conj pepper NOUN NN salt []\n", ". punct . PUNCT . lime zest []\n", "Serve ROOT serve VERB VB Serve [with, .]\n", "with prep with ADP IN Serve [a crunchy green salad]\n", "a crunchy green salad pobj salad NOUN NN with []\n", ". punct . PUNCT . 
Serve []\n", "Immediately advmod immediately ADV RB pour []\n", "pour ROOT pour VERB VBP pour [Immediately, entire contents, into, ,, spread, .]\n", "entire contents dobj content NOUN NNS pour [of]\n", "of prep of ADP IN entire contents [pot]\n", "pot pobj pot NOUN NN of []\n", "into prep into ADP IN pour [a large colander]\n", "a large colander pobj colander NOUN NN into [drain]\n", "to aux to PART TO drain []\n", "drain relcl drain VERB VB a large colander [to]\n", ", punct , PUNCT , pour []\n", "then advmod then ADV RB spread []\n", "spread dep spread VERB VB pour [then, out, corn]\n", "out prt out PART RP spread []\n", "corn dobj corn NOUN NN spread [,, potatoes, ;, discard lemon halves]\n", ", punct , PUNCT , corn []\n", "potatoes conj potato NOUN NNS corn [,, and, shrimp]\n", ", punct , PUNCT , potatoes []\n", "and cc and CCONJ CC potatoes []\n", "shrimp conj shrimp VERB VB potatoes [on]\n", "on prep on ADP IN shrimp [a large rimmed baking sheet]\n", "a large rimmed baking sheet pobj sheet NOUN NN on [or, sheets]\n", "or cc or CCONJ CC a large rimmed baking sheet []\n", "sheets conj sheet NOUN NNS a large rimmed baking sheet [of]\n", "of prep of ADP IN sheets [newspaper]\n", "newspaper pobj newspaper NOUN NN of []\n", "; punct ; PUNCT : corn []\n", "discard lemon halves appos half NOUN NNS corn []\n", ". punct . PUNCT . 
pour []\n", "Cook wings nsubj wing NOUN NNS inserted []\n", ", punct , PUNCT , inserted []\n", "moving advcl move VERB VBG inserted [to, or, reducing]\n", "to prep to ADP IN moving [a cooler section]\n", "a cooler section pobj section NOUN NN to [of]\n", "of prep of ADP IN a cooler section [grill]\n", "grill pobj grill NOUN NN of []\n", "or cc or CCONJ CC moving []\n", "reducing conj reduce VERB VBG moving [heat, start]\n", "heat dobj heat NOUN NN reducing []\n", "if mark if ADP IN start []\n", "they nsubj -PRON- PRON PRP start []\n", "start advcl start VERB VBP reducing [if, they, burn]\n", "to aux to PART TO burn []\n", "burn xcomp burn VERB VB start [to]\n", ", punct , PUNCT , inserted []\n", "until mark until ADP IN cooked []\n", "cooked advcl cook VERB VBN inserted [until, through]\n", "through prt through PART RP cooked []\n", ", punct , PUNCT , inserted []\n", "an instant-read thermometer nsubj thermometer NOUN NN inserted []\n", "inserted ROOT insert VERB VBN inserted [Cook wings, ,, moving, ,, cooked, ,, an instant-read thermometer, into, but, touching, ,, and, is]\n", "into prep into ADP IN inserted [the flesh]\n", "the flesh pobj flesh NOUN NN into []\n", "but cc but CCONJ CC inserted []\n", "not neg not ADV RB touching []\n", "touching conj touch VERB VBG inserted [not, registers, 165°F]\n", "the det the DET DT registers []\n", "bone compound bone NOUN NN registers []\n", "registers dobj register VERB VBZ touching [the, bone]\n", "165°F dobj f PROPN NNP touching []\n", ", punct , PUNCT , inserted []\n", "and cc and CCONJ CC inserted []\n", "skin nsubj skin NOUN NN is []\n", "is conj be VERB VBZ inserted [skin, crisp, minutes, .]\n", "crisp acomp crisp ADJ JJ is [and, charred, ,]\n", "and cc and CCONJ CC crisp []\n", "lightly advmod lightly ADV RB charred []\n", "charred conj char VERB VBN crisp [lightly]\n", ", punct , PUNCT , crisp []\n", "5–10 nummod 5–10 NUM CD minutes []\n", "minutes npadvmod minute NOUN NNS is [5–10]\n", ". punct . PUNCT . 
is []\n" ] } ], "source": [ "for doc in docs[:5]:\n", " for token in doc[0]:\n", " print(token.text, token.dep_, token.lemma_, token.pos_, token.tag_, token.head, list(token.children))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Featurization" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re\n", "from collections import defaultdict\n", "\n", "def featurize(d):\n", " s_features = defaultdict(int)\n", " for idx, token in enumerate(d):\n", " if re.match(r'VB.?', token.tag_) is not None: # note: not using token.pos == VERB because this also includes BES, HVS, MD tags \n", " s_features[FeatureName.VERB.name] += 1\n", " # FOLLOWING_POS\n", " # FOLLOWING_POSTAG\n", " next_idx = idx + 1;\n", " if next_idx < len(d):\n", " s_features[f'{FeatureName.FOLLOWING_POS.name}_{d[next_idx].pos_}'] += 1\n", " s_features[f'{FeatureName.FOLLOWING_POSTAG.name}_{d[next_idx].tag_}'] += 1\n", " # PARENT_DEP\n", " # PARENT_POS\n", " # PARENT_POSTAG\n", " '''\n", " \"Because the syntactic relations form a tree, every word has exactly one head.\n", " You can therefore iterate over the arcs in the tree by iterating over the words in the sentence.\"\n", " https://spacy.io/docs/usage/dependency-parse#navigating\n", " '''\n", " if (token.head is not token):\n", " s_features[f'{FeatureName.PARENT_DEP.name}_{token.head.dep_.upper()}'] += 1\n", " s_features[f'{FeatureName.PARENT_POS.name}_{token.head.pos_}'] += 1\n", " s_features[f'{FeatureName.PARENT_POSTAG.name}_{token.head.tag_}'] += 1\n", " # CHILD_DEP\n", " # CHILD_POS\n", " # CHILD_POSTAG\n", " for child in token.children:\n", " s_features[f'{FeatureName.CHILD_DEP.name}_{child.dep_.upper()}'] += 1\n", " s_features[f'{FeatureName.CHILD_POS.name}_{child.pos_}'] += 1\n", " s_features[f'{FeatureName.CHILD_POSTAG.name}_{child.tag_}'] += 1\n", " return dict(s_features)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, 
"outputs": [], "source": [ "featuresets = [(doc[0], (featurize(doc[0]), doc[1])) for doc in docs]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Stats on number of features per example:\n", "mean: 24.562363715656346\n", "stdev: 14.281121014936279\n", "median: 24.0\n", "mode: 0\n", "max: 73\n", "min: 0\n" ] } ], "source": [ "from statistics import mean, median, mode, stdev\n", "f_lengths = [len(fs[1][0]) for fs in featuresets]\n", "\n", "print('Stats on number of features per example:')\n", "print(f'mean: {mean(f_lengths)}')\n", "print(f'stdev: {stdev(f_lengths)}')\n", "print(f'median: {median(f_lengths)}')\n", "print(f'mode: {mode(f_lengths)}')\n", "print(f'max: {max(f_lengths)}')\n", "print(f'min: {min(f_lengths)}')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(Toss cherry tomatoes with 1 Tbsp. oil in a small bowl; season with salt.,\n", " ({}, 'pos')),\n", " (Transfer to a platter and top with remaining 1/2 tsp. 
lime zest; season with salt and pepper.,\n", " ({'CHILD_DEP_PREP': 2,\n", " 'CHILD_DEP_PUNCT': 1,\n", " 'CHILD_POSTAG_.': 1,\n", " 'CHILD_POSTAG_IN': 2,\n", " 'CHILD_POS_ADP': 2,\n", " 'CHILD_POS_PUNCT': 1,\n", " 'FOLLOWING_POSTAG_IN': 1,\n", " 'FOLLOWING_POS_ADP': 1,\n", " 'PARENT_DEP_ROOT': 1,\n", " 'PARENT_POSTAG_VB': 1,\n", " 'PARENT_POS_VERB': 1,\n", " 'VERB': 1},\n", " 'pos'))]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "featuresets[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On one run, the above line printed the following featureset:\n", "`(Gather foil loosely on top and bake for 1 1/2 hours., ({}, 'pos'))`\n", "\n", "This is because the Spacy.io POS tagger provided this:\n", " `Gather/NNP foil/NN loosely/RB on/IN top/NN and/CC bake/NN for/IN 1 1/2 hours./NNS`\n", "\n", "...with no VERBs tagged, which is incorrect.\n", "\n", "\"Voting - POS taggers and classifiers\" in the _Next Steps/Improvements_ section below is meant to improve on this.\n", "\n", "---\n", "Compare to [Stanford CoreNLP POS tagger](http://nlp.stanford.edu:8080/corenlp/process):\n", " `Gather/VB foil/NN loosely/RB on/IN top/JJ and/CC bake/VB for/IN 1 1/2/CD hours/NNS ./.`\n", "\n", "And [Stanford Parser](http://nlp.stanford.edu:8080/parser/index.jsp):\n", " `Gather/NNP foil/VB loosely/RB on/IN top/NN and/CC bake/VB for/IN 1 1/2/CD hours/NNS ./.`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classification" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# training samples: 3669\n", "# test samples: 917\n" ] } ], "source": [ "random.shuffle(featuresets)\n", "\n", "num_classes = 2\n", "split_num = round(num_per_class*num_classes / 5)\n", "\n", "# train and test sets\n", "testing_set = [fs[1] for i, fs in enumerate(featuresets[:split_num])]\n", "training_set = [fs[1] for i, fs in 
enumerate(featuresets[split_num:])]\n", "\n", "print(f'# training samples: {len(training_set)}')\n", "print(f'# test samples: {len(testing_set)}')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# decoupling the functionality of nltk.classify.accuracy\n", "def predict(classifier, gold, prob=True):\n", " if (prob is True):\n", " predictions = classifier.prob_classify_many([fs for (fs, ll) in gold])\n", " else:\n", " predictions = classifier.classify_many([fs for (fs, ll) in gold])\n", " return list(zip(predictions, [ll for (fs, ll) in gold]))\n", "\n", "def accuracy(predicts, prob=True):\n", " if (prob is True):\n", " correct = [label == prediction.max() for (prediction, label) in predicts]\n", " else:\n", " correct = [label == prediction for (prediction, label) in predicts]\n", " \n", " if correct:\n", " return sum(correct) / len(correct)\n", " else:\n", " return 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note below the use of `DummyClassifier` to provide a simple sanity check, a baseline of random predictions. `stratified` means it \"generates random predictions by respecting the training set class distribution.\" (http://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators)\n", "\n", "> More generally, when the accuracy of a classifier is too close to random, it probably means that something went wrong: features are not helpful, a hyperparameter is not correctly tuned, the classifier is suffering from class imbalance, etc…\n", "\n", "If a classifier can beat the `DummyClassifier`, it is at least learning something valuable! How valuable is another question..." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dummy classifier accuracy percent: 49.836423118865866\n", "NaiveBayes classifier accuracy percent: 68.92039258451473\n", "MultinomialNB classifier accuracy percent: 79.17121046892039\n", "BernoulliNB classifier accuracy percent: 78.08069792802618\n", "LogisticRegressionCV classifier accuracy percent: 83.31515812431843\n", "SGD classifier accuracy percent: 80.91603053435115\n", "SVC classifier accuracy percent: 82.11559432933478\n", "LinearSVC classifier accuracy percent: 83.20610687022901\n", "DecisionTree classifier accuracy percent: 77.42639040348965\n", "RandomForest classifier accuracy percent: 83.53326063249727\n" ] } ], "source": [ "from nltk import NaiveBayesClassifier\n", "from nltk.classify.decisiontree import DecisionTreeClassifier\n", "from nltk.classify.scikitlearn import SklearnClassifier\n", "\n", "from sklearn.dummy import DummyClassifier\n", "from sklearn.naive_bayes import MultinomialNB, BernoulliNB\n", "from sklearn.linear_model import LogisticRegressionCV, SGDClassifier\n", "from sklearn.svm import SVC, LinearSVC\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.calibration import CalibratedClassifierCV\n", "\n", "dummy = SklearnClassifier(DummyClassifier(strategy='stratified', random_state=0))\n", "dummy.train(training_set)\n", "dummy_predict = predict(dummy, testing_set)\n", "dummy_accuracy = accuracy(dummy_predict)\n", "print(\"Dummy classifier accuracy percent:\", dummy_accuracy*100)\n", "\n", "nb = NaiveBayesClassifier.train(training_set)\n", "nb_predict = predict(nb, testing_set)\n", "nb_accuracy = accuracy(nb_predict)\n", "print(\"NaiveBayes classifier accuracy percent:\", nb_accuracy*100)\n", "\n", "multinomial_nb = SklearnClassifier(MultinomialNB())\n", "multinomial_nb.train(training_set)\n", "mnb_predict = predict(multinomial_nb, testing_set)\n", "mnb_accuracy 
= accuracy(mnb_predict)\n", "print(\"MultinomialNB classifier accuracy percent:\", mnb_accuracy*100)\n", "\n", "bernoulli_nb = SklearnClassifier(BernoulliNB())\n", "bernoulli_nb.train(training_set)\n", "bnb_predict = predict(bernoulli_nb, testing_set)\n", "bnb_accuracy = accuracy(bnb_predict)\n", "print(\"BernoulliNB classifier accuracy percent:\", bnb_accuracy*100)\n", "\n", "# ??logistic_regression._clf\n", "# sklearn.svm.LinearSVC : learns SVM models using the same algorithm.\n", "logistic_regression = SklearnClassifier(LogisticRegressionCV())\n", "logistic_regression.train(training_set)\n", "lr_predict = predict(logistic_regression, testing_set)\n", "lr_accuracy = accuracy(lr_predict)\n", "print(\"LogisticRegressionCV classifier accuracy percent:\", lr_accuracy*100)\n", "\n", "# ??sgd._clf\n", "# The 'log' loss gives logistic regression, a probabilistic classifier.\n", "# ??linear_svc._clf\n", "# can optimize the same cost function as LinearSVC\n", "# by adjusting the penalty and loss parameters. 
In addition it requires\n", "# less memory, allows incremental (online) learning, and implements\n", "# various loss functions and regularization regimes.\n", "sgd = SklearnClassifier(SGDClassifier(loss='log'))\n", "sgd.train(training_set)\n", "sgd_predict = predict(sgd, testing_set)\n", "sgd_accuracy = accuracy(sgd_predict)\n", "print(\"SGD classifier accuracy percent:\", sgd_accuracy*100)\n", "\n", "# slow\n", "# using libsvm with kernel 'rbf' (radial basis function)\n", "svc = SklearnClassifier(SVC(probability=True))\n", "svc.train(training_set)\n", "svc_predict = predict(svc, testing_set)\n", "svc_accuracy = accuracy(svc_predict)\n", "print(\"SVC classifier accuracy percent:\", svc_accuracy*100)\n", "\n", "# ??linear_svc._clf\n", "# Similar to SVC with parameter kernel='linear', but implemented in terms of\n", "# liblinear rather than libsvm, so it has more flexibility in the choice of\n", "# penalties and loss functions and should scale better to large numbers of\n", "# samples.\n", "# Prefer dual=False when n_samples > n_features.\n", "# Using CalibratedClassifierCV as wrapper to get predict probabilities (https://stackoverflow.com/a/39712590)\n", "linear_svc = SklearnClassifier(CalibratedClassifierCV(LinearSVC(dual=False)))\n", "linear_svc.train(training_set)\n", "linear_svc_predict = predict(linear_svc, testing_set)\n", "linear_svc_accuracy = accuracy(linear_svc_predict)\n", "print(\"LinearSVC classifier accuracy percent:\", linear_svc_accuracy*100)\n", "\n", "# slower\n", "dt = DecisionTreeClassifier.train(training_set)\n", "dt_predict = predict(dt, testing_set, False)\n", "dt_accuracy = accuracy(dt_predict, False)\n", "print(\"DecisionTree classifier accuracy percent:\", dt_accuracy*100)\n", "\n", "random_forest = SklearnClassifier(RandomForestClassifier(n_estimators = 100))\n", "random_forest.train(training_set)\n", "rf_predict = predict(random_forest, testing_set)\n", "rf_accuracy = accuracy(rf_predict)\n", "print(\"RandomForest classifier accuracy 
percent:\", rf_accuracy*100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SGD: Multiple Epochs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `sgd` classifier improves with more epochs. `??sgd._clf` tells us that the default number of epochs `n_iter` is 5. So let's run more epochs. Also note that `SGDClassifier` shuffles the training data after each epoch by default (`shuffle=True`)." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SGDClassifier classifier accuracy percent (epochs: 1000): 83.96946564885496\n" ] } ], "source": [ "num_epochs = 1000\n", "sgd = SklearnClassifier(SGDClassifier(loss='log', n_iter=num_epochs))\n", "sgd.train(training_set)\n", "sgd_predict = predict(sgd, testing_set)\n", "sgd_accuracy = accuracy(sgd_predict)\n", "print(f\"SGDClassifier classifier accuracy percent (epochs: {num_epochs}):\", sgd_accuracy*100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fortunately, 1000 epochs run very quickly! 
And `SGDClassifier` performance has improved with more iterations.\n", "\n", "*Also note that we can set `warm_start` to `True` if we want to take advantage of online learning and reuse the solution of the previous call.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### GridSearch and Cross-Validation\n", "\n", "Next we perform 1) grid search to find optimal hyperparameters, and 2) cross-validation to evaluate performance over multiple folds of the data (to avoid overfitting).\n", "\n", "http://scikit-learn.org/stable/modules/grid_search.html#grid-search\n", "\n", "http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from nltk.classify.scikitlearn import SklearnClassifier\n", "\n", "from sklearn.linear_model import SGDClassifier\n", "from sklearn.svm import LinearSVC\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.calibration import CalibratedClassifierCV" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# https://stackoverflow.com/a/16388804\n", "from sklearn.model_selection import KFold\n", "from sklearn.base import clone\n", "from numpy import zeros\n", "\n", "def cross_val(name, model, debug=True):\n", " num_splits = 3\n", " original_clf = clone(model._clf)\n", " cvidx = KFold(n_splits=num_splits, shuffle=True).split(training_set)\n", " \n", " nested_acc = zeros(num_splits)\n", " i=0\n", " for trainidx, testidx in cvidx:\n", " model._clf = clone(original_clf) # we clone the estimator to make sure that all the folds are independent\n", " # with shuffle=True the fold indices are not contiguous, so select each sample by index\n", " # (slicing from the first to the last index would mix test samples into the training fold)\n", " classifier = model.train([training_set[j] for j in trainidx])\n", " pred = predict(classifier, [training_set[j] for j in testidx])\n", " nested_acc[i] = accuracy(pred)\n", " i += 1\n", " \n", " if debug:\n", " print(f\"{name} 
CV accuracies:\", nested_acc)\n", " \n", " return nested_acc.mean()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# LinearSVC: Tuning hyper-parameters for roc_auc\n", "\n", "Best parameters set found on development set:\n", "\n", "{'C': 0.01, 'dual': False, 'loss': 'squared_hinge', 'max_iter': 1000, 'penalty': 'l2', 'tol': 0.001}\n", "roc_auc: 0.934 (+/-0.019)\n", "\n", "LinearSVC (calibrated) classifier accuracy percent: 83.20610687022901\n", "LinearSVC (raw) classifier accuracy percent: 82.87895310796074\n", "LinearSVC CV accuracies: [ 0.85550082 0.85566166 0.85496183]\n", "LinearSVC (calibrated) CV classifier avg accuracy percent: 85.5374772491\n", "\n", "# LogisticRegression: Tuning hyper-parameters for roc_auc\n", "\n", "Best parameters set found on development set:\n", "\n", "{'C': 1.0, 'dual': False, 'max_iter': 100, 'penalty': 'l2', 'solver': 'liblinear', 'tol': 0.001}\n", "roc_auc: 0.937 (+/-0.019)\n", "\n", "LogisticRegression classifier accuracy percent: 83.53326063249727\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\narho_000\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\sag.py:286: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge\n", " \"the coef_ did not converge\", ConvergenceWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "LogisticRegression CV accuracies: [ 0.86619334 0.86637578 0.86630286]\n", "LogisticRegression CV classifier avg accuracy percent: 86.6290661978\n", "\n", "# SGD: Tuning hyper-parameters for roc_auc\n", "\n", "Best parameters set found on development set:\n", "\n", "{'alpha': 0.0001, 'average': True, 'n_iter': 100, 'penalty': 'l2'}\n", "roc_auc: 0.937 (+/-0.018)\n", "\n", "SGD classifier accuracy percent: 82.76990185387132\n", "SGD CV accuracies: [ 0.86724939 0.86743044 0.8672301 ]\n", "SGD CV classifier avg accuracy percent: 86.7303308486\n", 
"\n", "# RandomForest: Tuning hyper-parameters for roc_auc\n", "\n", "Best parameters set found on development set:\n", "\n", "{'criterion': 'entropy', 'max_features': 'auto', 'n_estimators': 1000, 'oob_score': True}\n", "roc_auc: 0.936 (+/-0.016)\n", "\n", "RandomForest classifier accuracy percent: 83.86041439476554\n", "RandomForest CV accuracies: [ 0.94050218 0.94011485 0.94055086]\n", "RandomForest CV classifier avg accuracy percent: 94.0389296885\n", "\n" ] } ], "source": [ "# http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py\n", "\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "# Set the parameters by cross-validation\n", "model_parameters = [{'LinearSVC': [{\n", " 'loss': ['hinge'],\n", " 'dual': [True],\n", " 'penalty': ['l2'],\n", " 'tol': [1e-3, 1e-4, 1e-5],\n", " 'max_iter': [1000, 10000],\n", " 'C': [100.0, 1.0, 0.01]\n", " },\n", " {\n", " 'loss': ['squared_hinge'],\n", " 'dual': [False, True],\n", " 'penalty': ['l2'],\n", " 'tol': [1e-3, 1e-4, 1e-5],\n", " 'max_iter': [1000, 10000],\n", " 'C': [100.0, 1.0, 0.01]\n", " }]},\n", " {'LogisticRegression': [{\n", " 'penalty': ['l1'],\n", " 'dual': [False],\n", " 'C': [100.0, 1.0, 0.01],\n", " 'solver': ['liblinear']\n", " },\n", " {\n", " 'penalty': ['l2'],\n", " 'dual': [False, True],\n", " 'C': [100.0, 1.0, 0.01],\n", " 'max_iter': [100, 1000],\n", " 'solver': ['liblinear'],\n", " 'tol': [1e-3, 1e-4, 1e-5]\n", " },\n", " {\n", " 'penalty': ['l2'],\n", " 'dual': [False],\n", " 'C': [100.0, 1.0, 0.01],\n", " 'max_iter': [100, 1000],\n", " 'solver': ['newton-cg', 'lbfgs', 'sag'],\n", " 'tol': [1e-3, 1e-4, 1e-5]\n", " }]},\n", " {'SGD': [{\n", " 'penalty': ['l1', 'l2', 'elasticnet'],\n", " 'alpha': [1e-3, 1e-4, 1e-5],\n", " 'average': [True, False],\n", " 'n_iter': [100, 1000, 10000]\n", " }]},\n", " {'RandomForest': [{\n", " 
'n_estimators': [10, 100, 1000],\n", " 'criterion': ['gini', 'entropy'],\n", " 'max_features': ['auto', 'log2', None],\n", " 'oob_score': [True, False]\n", " }]}]\n", "\n", "score = 'roc_auc'\n", "\n", "for i, model_param in enumerate(model_parameters):\n", " model = [key for i, key in enumerate(model_param)][0]\n", " \n", " print(f\"# {model}: Tuning hyper-parameters for {score}\")\n", " print()\n", " \n", " if model == 'LinearSVC': \n", " clf = LinearSVC()\n", " elif model == 'LogisticRegression':\n", " clf = LogisticRegression()\n", " elif model == 'SGD':\n", " clf = SGDClassifier(loss='log')\n", " elif model == 'RandomForest':\n", " clf = RandomForestClassifier()\n", " else:\n", " raise Exception('%s model needs to be added to the if-block' % model)\n", "\n", " grid = SklearnClassifier(GridSearchCV(clf, model_param[model], cv=5,\n", " scoring=score, n_jobs=-1))\n", " grid.train(training_set)\n", "\n", " print(\"Best parameters set found on development set:\")\n", " print()\n", " print(grid._clf.best_params_)\n", " mean = grid._clf.cv_results_['mean_test_score'][grid._clf.best_index_]\n", " std = grid._clf.cv_results_['std_test_score'][grid._clf.best_index_]\n", " print(\"roc_auc: %0.3f (+/-%0.03f)\" % (mean, std * 2))\n", " print()\n", " \n", " if model == 'LinearSVC':\n", " # Wrapping LinearSVC in CalibratedClassifierCV to add support for probability prediction\n", " # Note that there is a difference in accuracies between raw GridSearchCV and calibrated GridSearchCV\n", " # However, I'm willing to sacrifice the potential 'best' result from raw in order to output probabilities\n", " grid_calibrated = SklearnClassifier(CalibratedClassifierCV(grid._clf.best_estimator_, cv=None))\n", " grid_calibrated.train(training_set)\n", " gridc_predict = predict(grid_calibrated, testing_set)\n", " gridc_accuracy = accuracy(gridc_predict)\n", " print(f\"{model} (calibrated) classifier accuracy percent:\", gridc_accuracy*100)\n", " \n", " grid_predict = predict(grid, 
testing_set, False)\n", " grid_accuracy = accuracy(grid_predict, False)\n", " print(f\"{model} (raw) classifier accuracy percent:\", grid_accuracy*100)\n", " \n", " # CV after parameter optimization\n", " cv_acc = cross_val(model, grid_calibrated)\n", " print(f\"{model} (calibrated) CV classifier avg accuracy percent:\", cv_acc*100)\n", " \n", " linear_svc_opt = grid_calibrated\n", " linear_svc_predict = gridc_predict\n", " else:\n", " grid_predict = predict(grid, testing_set)\n", " grid_accuracy = accuracy(grid_predict)\n", " print(f\"{model} classifier accuracy percent:\", grid_accuracy*100)\n", " \n", " # CV after parameter optimization\n", " cv_acc = cross_val(model, grid)\n", " print(f\"{model} CV classifier avg accuracy percent:\", cv_acc*100)\n", " \n", " if model == 'LogisticRegression':\n", " logistic_regression_opt = grid\n", " lr_predict = grid_predict\n", " elif model == 'SGD':\n", " sgd_opt = grid\n", " sgd_predict = grid_predict\n", " elif model == 'RandomForest':\n", " random_forest_opt = grid\n", " rf_predict = grid_predict\n", " else:\n", " raise Exception('%s model was not run through Grid Search' % model)\n", " \n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### VotingClassifier\n", "\n", "We're going to create an ensemble classifier by letting our top-performing classifiers vote on each prediction: `LogisticRegression`, `LinearSVC`, `SGD`, and `RandomForest`, each of which consistently scores above 80% accuracy. `SVC` is excluded because it is slow. The vote is soft (it averages the predicted probabilities), with `RandomForest` given three times the weight of each other model." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Soft voting classifier accuracy percent: 83.96946564885496\n", "Soft voting CV accuracies: [ 0.90834697 0.90854491 0.90673575]\n", "Soft voting CV classifier accuracy percent: 90.7875877339\n" ] } ], "source": [ "from sklearn.ensemble import VotingClassifier\n", "\n", "# soft voting averages the predicted probabilities; weights follow the estimator order, so RandomForest counts 3x\n", "voting = SklearnClassifier(VotingClassifier(estimators=[\n", " ('lr', logistic_regression_opt._clf),\n", " ('linear_svc', linear_svc_opt._clf),\n", " ('sgd', sgd_opt._clf),\n", " ('rf', random_forest_opt._clf)\n", "], voting='soft', weights=[1,1,1,3], n_jobs=-1))\n", "voting.train(training_set)\n", "\n", "voting_predict = predict(voting, testing_set)\n", "voting_accuracy = accuracy(voting_predict)\n", "print(\"Soft voting classifier accuracy percent:\", voting_accuracy*100)\n", "\n", "# CV after parameter optimization\n", "voting_acc = cross_val(\"Soft voting\", voting)\n", "print(\"Soft voting CV classifier accuracy percent:\", voting_acc*100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As with the voting model, we'll limit the analysis to our top-performing classifiers, plus the `Voting` model itself and `Dummy` as a baseline." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Most Informative Features" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# https://stackoverflow.com/a/11140887\n", "def show_most_informative_features(vectorizer, clf, n=20):\n", " feature_names = vectorizer.get_feature_names()\n", " coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))\n", " top = zip(coefs_with_fns[:round(n/2)], coefs_with_fns[:-(round(n/2) + 1):-1])\n", " for (coef_1, fn_1), (coef_2, fn_2) in top:\n", " print(\"\\t%.4f\\t%-15s\\t\\t%.4f\\t%-15s\" % (coef_1, fn_1, coef_2, fn_2))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SGD\n", "\t-3.3820\tFOLLOWING_POSTAG_WRB\t\t2.9712\tCHILD_POSTAG_-LRB-\n", "\t-2.1901\tCHILD_DEP_AGENT\t\t2.5766\tFOLLOWING_POSTAG_JJS\n", "\t-2.0411\tCHILD_DEP_DET \t\t2.1984\tFOLLOWING_POSTAG_VBZ\n", "\t-2.0337\tFOLLOWING_POSTAG_JJR\t\t1.8884\tCHILD_DEP_NPADVMOD\n", "\t-2.0008\tCHILD_DEP_NEG \t\t1.8787\tPARENT_DEP_PARATAXIS\n", "\t-1.7942\tCHILD_DEP_RELCL\t\t1.7963\tCHILD_DEP_APPOS\n", "\t-1.7393\tFOLLOWING_POSTAG_VB\t\t1.7699\tFOLLOWING_POSTAG_RB\n", "\t-1.7151\tCHILD_POSTAG_HYPH\t\t1.7124\tCHILD_POSTAG_PDT\n", "\n", "Logistic Regression\n", "\t-1.4486\tCHILD_DEP_NEG \t\t1.4545\tCHILD_POSTAG_-LRB-\n", "\t-1.4322\tCHILD_DEP_AGENT\t\t1.3721\tCHILD_DEP_NPADVMOD\n", "\t-1.0650\tPARENT_POSTAG_VBZ\t\t1.1594\tPARENT_POSTAG_VB\n", "\t-1.0267\tCHILD_POSTAG_WDT\t\t1.0527\tCHILD_POSTAG_-RRB-\n", "\t-1.0153\tCHILD_DEP_NSUBJ\t\t1.0491\tCHILD_DEP_MARK \n", "\t-0.9563\tPARENT_DEP_XCOMP\t\t0.9183\tCHILD_POSTAG_VB\n", "\t-0.9126\tCHILD_DEP_DET \t\t0.9078\tCHILD_POS_PROPN\n", "\t-0.8719\tCHILD_DEP_AUX \t\t0.9077\tCHILD_POSTAG_NNP\n" ] } ], "source": [ "print('SGD')\n", "show_most_informative_features(sgd._vectorizer, sgd._clf, 15)\n", "print()\n", "print('Logistic Regression')\n", 
"show_most_informative_features(logistic_regression._vectorizer, logistic_regression._clf, 15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note: Because `CalibratedClassifierCV` has no attribute `coef_`, we cannot show the most informative features for `LinearSVC` while it's wrapped. `Random Forest` and `Voting` also lack `coef_`.*" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'adjective, superlative'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spacy.explain(\"JJS\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Negative coefficients**:\n", "- VERB parents [`AGENT`](http://universaldependencies.org/docs/sv/dep/nmod-agent.html): \"used for agents of passive verbs\" - interpreting this to mean that _existence of passive verbs (i.e., the opposite of active verbs) means negative correlation with it being imperative_\n", "- VERB followed by a `WRB`: \"wh-adverb\" (where, when)\n", "- VERB is a child of [`AMOD`](http://universaldependencies.org/en/dep/amod.html): \"any adjective or adjectival phrase that serves to modify the meaning\" of the verb\n", "\n", "\n", "**Positive coefficients**:\n", "- VERB parents a `-RRB-`: \"right round bracket\"\n", "- VERB is a child of `PROPN`: \"proper noun\"\n", "- VERB is a child of `NNP`: \"noun, proper singular\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Scikit Learn metrics: Confusion matrix, Classification report, F1 score, Log loss\n", "\n", "http://scikit-learn.org/stable/modules/model_evaluation.html" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn import metrics\n", "\n", "def classification_report(predict, prob=True):\n", " predictions, labels = zip(*predict)\n", " if prob is True:\n", " return metrics.classification_report(labels, [p.max() for p in predictions])\n", " 
else:\n", " return metrics.classification_report(labels, predictions)\n", "\n", "def confusion_matrix(predict, prob=True, print_layout=False):\n", " predictions, labels = zip(*predict)\n", " if print_layout is True:\n", " print('Layout\\n[[tn fp]\\n [fn tp]]\\n')\n", " if prob is True:\n", " return metrics.confusion_matrix(labels, [p.max() for p in predictions])\n", " else:\n", " return metrics.confusion_matrix(labels, predictions)\n", " \n", "def log_loss(predict):\n", " predictions, labels = zip(*predict)\n", " return metrics.log_loss(labels, [p.prob('pos') for p in predictions])\n", "\n", "def roc_auc_score(predict):\n", " predictions, labels = zip(*predict)\n", " # need to convert labels to binary classification of 0 or 1\n", " return metrics.roc_auc_score([1 if l == 'pos' else 0 for l in labels], [p.prob('pos') for p in predictions], average='weighted')\n", "\n", "def precision_recall_curve(predict):\n", " predictions, labels = zip(*predict)\n", " return metrics.precision_recall_curve(labels, [p.prob('pos') for p in predictions], pos_label='pos')\n", "\n", "def average_precision_score(predict):\n", " predictions, labels = zip(*predict)\n", " return metrics.average_precision_score([1 if l == 'pos' else 0 for l in labels], [p.prob('pos') for p in predictions])\n", "\n", "def roc_curve(predict):\n", " predictions, labels = zip(*predict)\n", " return metrics.roc_curve(labels, [p.prob('pos') for p in predictions], pos_label='pos')" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SGD\n", " precision recall f1-score support\n", "\n", " neg 0.87 0.79 0.83 480\n", " pos 0.79 0.87 0.83 437\n", "\n", "avg / total 0.83 0.83 0.83 917\n", "\n", "\n", "Logistic Regression\n", " precision recall f1-score support\n", "\n", " neg 0.88 0.79 0.83 480\n", " pos 0.79 0.89 0.84 437\n", "\n", "avg / total 0.84 0.84 0.84 917\n", "\n", "\n", "LinearSVC\n", " precision recall f1-score support\n", 
"\n", " neg 0.88 0.78 0.83 480\n", " pos 0.79 0.89 0.83 437\n", "\n", "avg / total 0.84 0.83 0.83 917\n", "\n", "\n", "Random Forest\n", " precision recall f1-score support\n", "\n", " neg 0.88 0.80 0.84 480\n", " pos 0.80 0.89 0.84 437\n", "\n", "avg / total 0.84 0.84 0.84 917\n", "\n" ] } ], "source": [ "print('SGD')\n", "print(classification_report(sgd_predict))\n", "print()\n", "print('Logistic Regression')\n", "print(classification_report(lr_predict))\n", "print()\n", "print('LinearSVC')\n", "print(classification_report(linear_svc_predict))\n", "print()\n", "print('Random Forest')\n", "print(classification_report(rf_predict))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Voting\n", " precision recall f1-score support\n", "\n", " neg 0.88 0.80 0.84 480\n", " pos 0.80 0.88 0.84 437\n", "\n", "avg / total 0.84 0.84 0.84 917\n", "\n" ] } ], "source": [ "print('Voting')\n", "print(classification_report(voting_predict))" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Layout\n", "[[tn fp]\n", " [fn tp]]\n", "\n", "SGD\n", "[[379 101]\n", " [ 57 380]]\n", "\n", "Logistic Regression\n", "[[379 101]\n", " [ 50 387]]\n", "\n", "LinearSVC\n", "[[375 105]\n", " [ 49 388]]\n", "\n", "Random Forest\n", "[[382 98]\n", " [ 50 387]]\n" ] } ], "source": [ "print('Layout\\n[[tn fp]\\n [fn tp]]\\n')\n", "\n", "print('SGD')\n", "print(confusion_matrix(sgd_predict))\n", "print()\n", "print('Logistic Regression')\n", "print(confusion_matrix(lr_predict))\n", "print()\n", "print('LinearSVC')\n", "print(confusion_matrix(linear_svc_predict))\n", "print()\n", "print('Random Forest')\n", "print(confusion_matrix(rf_predict))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Voting\n", "[[384 96]\n", " [ 51 386]]\n" ] } ], 
"source": [ "print('Voting')\n", "print(confusion_matrix(voting_predict))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The lower the better for `log_loss`..." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SGD: 0.5408885923854805\n", "Logistic Regression: 0.3269831271366439\n", "LinearSVC: 0.34151481635171227\n", "Random Forest: 0.3557548099256462\n" ] } ], "source": [ "print(f'SGD: {log_loss(sgd_predict)}')\n", "print(f'Logistic Regression: {log_loss(lr_predict)}')\n", "print(f'LinearSVC: {log_loss(linear_svc_predict)}')\n", "print(f'Random Forest: {log_loss(rf_predict)}')" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Voting: 0.3058827183576675\n" ] } ], "source": [ "print(f'Voting: {log_loss(voting_predict)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The higher the better for `roc_auc_score`..." 
] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SGD: 0.9333333333333333\n", "Logistic Regression: 0.9354405034324942\n", "LinearSVC: 0.9284610983981694\n", "Random Forest: 0.9428751906941265\n" ] } ], "source": [ "print(f'SGD: {roc_auc_score(sgd_predict)}')\n", "print(f'Logistic Regression: {roc_auc_score(lr_predict)}')\n", "print(f'LinearSVC: {roc_auc_score(linear_svc_predict)}')\n", "print(f'Random Forest: {roc_auc_score(rf_predict)}')" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Voting: 0.9454185736079329\n" ] } ], "source": [ "print(f'Voting: {roc_auc_score(voting_predict)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Performance on sample tasks" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dummy: [('neg', 0.0), ('neg', 0.0), ('neg', 0.0), ('neg', 0.0), ('pos', 1.0), ('neg', 0.0), ('pos', 1.0)]\n", "LogisticRegression: [('pos', 0.5177528042129399), ('pos', 0.5177528042129399), ('pos', 0.95041384668781259), ('pos', 0.9042204368988711), ('pos', 0.77613792093402312), ('pos', 0.74783727994957849), ('pos', 0.90181743756911736)]\n", "LinearSVC: [('pos', 0.56492618977808062), ('pos', 0.56492618977808062), ('pos', 0.92860840939215505), ('pos', 0.8805809957950842), ('pos', 0.82874640690443713), ('pos', 0.83449212505086656), ('pos', 0.8889786555784075)]\n", "SGD: [('pos', 0.53547937378473365), ('pos', 0.53547937378473365), ('pos', 0.99992355013586687), ('pos', 0.99902040085679145), ('pos', 0.99105878886231658), ('pos', 0.98526269258531807), ('pos', 0.99895146721659678)]\n", "Random Forest: [('pos', 0.50561516177492483), ('pos', 0.50561516177492483), ('pos', 0.96279172244353028), ('pos', 0.97799999999999998), ('pos', 0.95699999999999996), ('pos', 0.94999999999999996), ('pos', 
0.96999999999999997)]\n", "\n", "Voting: [('pos', 0.52202910464797891), ('pos', 0.52202910464797891), ('pos', 0.95976101963659921), ('pos', 0.95395871789375353), ('pos', 0.9032303318255851), ('pos', 0.89501175259253529), ('pos', 0.95289944311959662)]\n" ] } ], "source": [ "sample_tasks = [\"Mow lawn\", \"Mow the lawn\", \"Buy new shoes\", \"Feed the dog\", \"Send report to Kyle\", \"Send the report to Kyle\", \"Peel the potatoes\"]\n", "features = [featurize(nlp(task)) for task in sample_tasks]\n", "\n", "tasks_dummy = [(l, p.prob('pos')*1.0) for l, p in zip(dummy.classify_many(features), dummy.prob_classify_many(features))]\n", "tasks_logistic = [(l, p.prob('pos')) for l,p in zip(logistic_regression_opt.classify_many(features), logistic_regression_opt.prob_classify_many(features))]\n", "tasks_linear_svc = [(l, p.prob('pos')) for l,p in zip(linear_svc_opt.classify_many(features), linear_svc_opt.prob_classify_many(features))]\n", "tasks_sgd = [(l, p.prob('pos')) for l,p in zip(sgd_opt.classify_many(features), sgd_opt.prob_classify_many(features))]\n", "tasks_rf = [(l, p.prob('pos')) for l,p in zip(random_forest_opt.classify_many(features), random_forest_opt.prob_classify_many(features))]\n", "tasks_voting = [(l, p.prob('pos')) for l,p in zip(voting.classify_many(features), voting.prob_classify_many(features))]\n", "\n", "print(f'Dummy: {tasks_dummy}')\n", "print(f'LogisticRegression: {tasks_logistic}')\n", "print(f'LinearSVC: {tasks_linear_svc}')\n", "print(f'SGD: {tasks_sgd}')\n", "print(f'Random Forest: {tasks_rf}')\n", "print()\n", "print(f'Voting: {tasks_voting}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Voting Model: Curves" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYoAAAEWCAYAAAB42tAoAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XucXHV9//HXO7ubhJCQDeQC5E4IJuEShAhqqaAoECri\nXS6iUC2llVZ/1Yrto9V4v1V/4g8Q+QGlCCUVpRowiFJFUEAT7iQQCAmQEG4JhFwISTb59I/vGWYy\n7J6d3eyZnd28n4/HPHbOZc75zNndeZ/v+c45RxGBmZlZRwb0dgFmZtbYHBRmZpbLQWFmZrkcFGZm\nlstBYWZmuRwUZmaWy0HRh0g6U9Lve7uOniZpkaRjOplngqQNkprqVFbhJD0u6e3Z8zmSrurtmsza\n46AomKRBki6T9ISk9ZLulTS7t+uqRfZBtin7gH5W0hWShvb0eiLiwIi4pZN5noyIoRGxrafXn31I\nb83e51pJt0t6U0+vZ1eR/Z20SdqnnfFfqRo3SVJIaq4Yd5qkhdnv42lJN0o6qht1/B9Jz0haJ+ly\nSYNy5j1J0oPZOm+XNKNi2pmStmXTSo9julpPX+agKF4zsAI4GhgO/AvwY0mTerGmrjgpIoYChwGz\nSPXvQElf/1v6r+x9jgR+C1zby/X0uMoP4wLXsTvwPuAl4MPdeP0/AN8DvgaMASYAFwLv6uJyjgc+\nBxwLTAT2A77YwbxTgauBc4BW4HpgXtX2uiPbUSk9bulKPX1dX//nbngRsTEi5kTE4xGxPSJuAJYD\nh3f0GknjJV0n6XlJayRd0MF850take0x3SXpzyumHZHtla3LWgPfzcYPlnRVtty1khZIGlPD+3gK\nuBE4KFvOLZK+KukPwMvAfpKGZ62npyU9JekrlYeKJP2VpIeyltViSYdl4ysPwXRU9w57npL2lTRP\n0guSlkr6q4r1zJH0Y0lXZutaJGlWZ+8xe59tpA+NsZJGVSzznVlrsNTiOKRiWru/L0lTJP0mG7da\n0tWSWmupo5qkk7P1r5P0mKQTqrddxXu/qmqbfUzSk8Bvsr3zc6uWfZ+k92bPp0n6dbZdl0j6YBdL\nfR+wFvgS8NEuvsfh2es+ERHXZf87WyPihoj4bBfr+ChwWUQsiogXs+We2cG8xwO/j4jfZ7//bwJj\nSTt3hoOi7rIP5QOARR1MbwJuAJ4AJpH+YOd2sLgFwKHAnsB/AtdKGpxNOx84PyL2AKYAP87Gf5TU\nshkP7EXai9pUQ93jgROBeypGnwGcDQzL6r0CaAP2B14PHAd8PHv9B4A5wEeAPUh7iGvaWVVHdVeb\nC6wE9gXeD3xN0tsqpr8rm6cVmAe0G7btvM+BWY1rgBezca8HLgf+mrTNfkja4xzUye9LwNezGqeT\ntvmcWuqoqukI4ErgH7P38xbg8S4s4uhs/ccD1wCnVix7BmmP+xdZa+DXpL+l0cApwEWlwzBKh4Tu\n72RdH83WMReYJqnDHaJ2vAkYDPx3RzNkNazNeUzIZj0QuK/ipfcBYyTtVUMdyh4HVYx7fRb2j0j6\nV9WhddZQIsKPOj2AFuBm4Ic587wJeB5obmfamaQ9n45e+yIwM3t+K6mpPbJqnr8EbgcOqaHex4EN\npD3EJ4CLgN2yabcAX6qYdwywuTQ9G3cq8Nvs+U3AJ3PW8/ZO6p4EBOlQ3nhgGzCsYvrXgSuy53OA\nmyumzQA25bzPOcCW7H1uI4XEMRXTfwB8ueo1S0gfwB3+vtpZz7uBezp433OAqzp43Q+B/9vZtqte\nTsU2269i+jBgIzAxG/4qcHn2/EPAbe2s+ws1/n1PALYDh1b8zs+vmH4F8JWc3+vpwDM99L/2GHBC\n1f9eAJPamXdatk2OAQYC/5q9j3/Kpu8HTCbtWB8MLC5N21UeblHUidIx/B+RPpDOrRh/o8odZKeT\nPgSfiNQE7myZn8kO5bwkaS2ppTAym/wxUsvl4ezw0juz8T8i/
QPPlbRK0rckteSs5t0R0RoREyPi\nbyOisvWxouL5RNI/49OlvTvSh8zobPp40j9vZzqqu9K+wAsRsb5i3BOkvfmSZyqevwwMltQs6fSK\n7X1jxTw/johWUuA9yI6HBicCn67cc83ez77k/L4kjZE0NzsMtw64ivLvpytq3XYdefX3lG2zX5Ba\nC5DC/Ors+UTgyKr3eTqwd43rOQN4KCLuzYavBk6r+PtqI/2NVGohfShvJwX0yB7aW99AarmWDM9+\nrq+eMSIeJrWELgCeJv2OFpNarETEsohYHunQ8QOkw1jv74Ea+wwHRR1IEnAZ6UPofRGxtTQtImZH\nuYPsatI/9YTO/lmU+iM+C3wQGJF9yL1EajITEY9GxKmkD+pvAj+RtHukY75fjIgZwJuBd5IOtXRH\n5aWHV5BaFCOzYGmNiD0i4sCK6VM6XWAHdVfNtgrYU9KwinETgKdqWP7VFdv7Nd8+i4jVpMNpc1T+\n1s4K4KsV76s1IoZExDXk/76+RtpGB0c6lPZhst9PF+Vtu43AkIrh9j7Uqy8RfQ1wqtI3uwaTOu9L\n6/ld1fscGhF/U2OdHyH1VT0j6Rngu6QP3ROz6U+SWhCVJgMrImI7cAfpb+jdHa2gKujbe5QOPS0C\nZla8dCbwbES0d7iTiPhJRBwUEXsBX8jqXNBBGUH3fo99loOiPn5AOkZ8UtUeeXv+RNqr+Yak3ZU6\nn/+snfmGkfbQngeaJX2eij0oSR+WNCr7B1ybjd4u6a2SDs6Ora8DtpL25nZKRDwN/Ar4jqQ9JA3I\nOnNLHYKXAp+RdLiS/SVNrF5OR3VXrWsF6fDZ17PtcwipJdIj5yFExBJSq6vUgfr/gXMkHZnVvruk\nv8iCKu/3NYy0Z/uSpLGkPobuuAw4S9Kx2XYdK2laNu1e4BRJLUod9rXs6c4ntR6+RPq2V2n73gAc\nIOmMbHktkt4gaXpnC8xCZwpwBKnf7FDSMf7/pLwj8lPgLyQdJ6lJ0r6kb9HNBYiIl4DPAxdKerek\nIVkNsyV9K5unMujbezyZretK4GOSZkgaQTqcdEVO/YdnNY0CLgHmZS0NsvWPyZ5Py5b18862Sb/S\n28e++vuD9A8ZwCukD43S4/Sc10wAfkZqiq8Gvp+NP5OsjwJoInWwriN9UH2WHY95XwU8l61rEekQ\nEqRDDUtIe6LPAt+ng+PrVB3/rpp2C/DxqnHDSaG4ktS6uQc4pWL6Odm6N5AO77y+ej05dU/KtmNz\nNjyO9MH2AumwzDkV65lDxfH+6te28152mD8bd2S2jUZnwyeQ9jDXZtv7WrI+kpzf14HAXdl7uRf4\nNLCyve3bXg1V9bwHuJ906GQpcHw2fj/gj9k6fpH9Pqv7KNrr77osm/aGqvGvy5bzfPZ+fkO5z+F0\nYFEH9V0M/LSd8UeQWgl7ZsMnZdvkJdLhwm9T0a9VsZ6F2fZ/Jqvnzd343/sH0t/4OuDfgUEV024E\n/rli+PfZtn2BdMh094pp/5YtZyOwjBSwLb392VLPh7INYWZm1i4fejIzs1wOCjMzy+WgMDOzXA4K\nMzPL1edOQx85cmRMmjSpt8swM+tT7rrrrtURMarzOV+rzwXFpEmTWLhwYW+XYWbWp0h6oruv9aEn\nMzPL5aAwM7NcDgozM8vloDAzs1wOCjMzy+WgMDOzXIUFhaTLJT0n6cEOpkvS95Xud3y/svsnm5lZ\nYynyPIorSHeMurKD6bOBqdnjSNLlqY+sZcHbd/ruCbYrG+B2tFmXFBYUEXGrpEk5s5wMXBnpOud3\nSmqVtE+kG+B0aMMGuO22HizUdjmjR8P0Tm7FU3n1/ernAwaAdqn7m9murjfPzB7LjvdcXpmNe01Q\nSDqbdHtKRo+exIoV3iu07lm3DhYvhjVrygFQaqFWfvhH7NhyrQyLYcNg5swdx7f3c9AgaGrq+fdg\nVm994hIeEXEJ6faETJ8+K
yZPhsGDe7ko65M2boRly+Cxx9Lw9u3pw7zUUii1FiJg27byDklpvtWr\n0881a6C5ueNAkdL0ww4rj68OkxEjHCTWN/RmUDwFjK8YHpeNMyvM7rvDwQd3//Uvvwx33w0vvFAO\nFKkcKKXAeTprF69dm4a3b98xKCJSUBxwwI7jSvM0N8P48a89xFUZRGb10ptBMQ84V9JcUif2S531\nT5j1tiFD4KijOp9v4sTUcmluTh/qTU3lFksEPP54apU8+WQ5bEo/N29OP/fbD4YOTcsrtXBKz3fb\nDaZOLS+v8jFsWDrsZdZTCgsKSdcAxwAjJa0EvgC0AETExcB84ETSjeJfBs4qqhazemtuLrcW2jNm\nTAqEklILYcCANP4Pf4Dly1MLqHRobOvW8mGvgQPhmWfS67Zvh7a29PpSiEybVg6XrVvLrZpt29K8\ngwalIKoMn9JjyBAHje2oyG89ndrJ9AA+UdT6zRqZ1HE/W0sLHH98x6/dsgWWLCkHh5Q+2KXUUtm8\nGZ59dsfOeqn8WL8+zb98ebmfZdu2cn/LnnvCn/95ebiyNVMKm8pllwJm+PC0POt//Gs162MGDuy4\nn2Wffcqti1LHfHV/xubN8MAD8MorO3bit7TAypXpUdnS6Oh5W1s5TLZtgz32gLFj03LGj08tE+sf\nHBRm/Uxne/WDBsGsWe1PGzIE7r8/9Z2UltPUlMJgwIByCwbScHNzCobly9O4JUvS8seNg0MOSYFh\nfZ+DwsxeNWIEHH1011+3777p5+bNcN99sGhRuZVSanUMHgx77bXjoTDrGxwUZtZjBg2Cww+He+5J\nne2//W3qU4F0eGrChHLLZOZMaG3t3XqtNg4KM+tRpRDYsKF8ouKWLXDvval1sWFD6t9Ytgze9a7U\nivGVFhqbg8LMelxLSwqASscdV35+993w0ktw882pVTF8ePoW16hRqUN++PD61mv5HBRmVnczZ6YO\n84cfTv0Wzc3pOlzDhqVzRyZNSoepSueJtLamb3tFpL4O92/Ul4PCzOquqQkmT06Pkq1bU3gsXZpa\nG8uXp36NV15J4bDvvik0xo2DAw/svdp3RQ4KM2sILS0wZUo6Y/zZZ1M4NDWlTvHnn0/Tn346hYiD\nor4cFGbWUCTYe+/y8LBh6bpWkL5+u3596hBvakrfsnJHePEcFGbWZ2zfnkLi+uvTpUZaWtIZ4dOm\nlS+gaD3PQWFmfcahh6ZvTK1fD6tWpXGDBqWbUU2enL5ptc8+qYPceo6Dwsz6lNLNoCB9C+rhh2HF\ninRV3SFDUsvive9NV9G1nuGgMLM+S0r3P58+PR2WevTR1BG+ebODoie5G8jM+oUBA8rnZFjPclCY\nmVkuB4WZmeVyUJiZWS4HhZmZ5XJQmJlZLgeFmZnlclCYmVkuB4WZmeVyUJiZWS4HhZmZ5XJQmJlZ\nLgeFmZnlclCYmVkuB4WZmeVyUJiZWS4HhZmZ5XJQmJlZrkKDQtIJkpZIWirpc+1MHy7pekn3SVok\n6awi6zEzs64r7KaBkpqAC4F3ACuBBZLmRcTiitk+ASyOiJMkjQKWSLo6IrYUVZeZ9X8bN+54S9SI\n9ufraHxXpw0YAIMHQ0tLuo93f1Pk3WWPAJZGxDIASXOBk4HKoAhgmCQBQ4EXgLYCazKzfmxAdozk\n7rth4MD6rLOtDbZsSesbPBiGDoVDD4VBg+qz/nooMijGAisqhlcCR1bNcwEwD1gFDAM+FBHbqxck\n6WzgbIC9955QSLFm1veNGAFTpsCmTenDO2/vvrvTqm3bBg89lFoTAwakx9NPw1FHwV579Y8WRpFB\nUYvjgXuBtwFTgF9Lui0i1lXOFBGXAJcATJ8+K6dBaGa7MgnGjKn/eidNSj+3bYPbboPHH4eXXoLJ\nk2H0aBg+HPbeu1xjX1NkUDwFjK8YHpeNq3QW8I2ICGCppOXANOBPBdZlZlaIpiY4+mhYvhy
WLIF1\n61IwDBwIU6em5wceCKNG9XalXVNkUCwApkqaTAqIU4DTquZ5EjgWuE3SGOB1wLICazIzK5QE++2X\nWhMbN6aWxeLFqd8kApYtgyOPhCFD0vDmzTB2LOyxR29X3rHCgiIi2iSdC9wENAGXR8QiSedk0y8G\nvgxcIekBQMB5EbG6qJrMzOpFSh3bQ4emIABYtAieeQZuvz19K+uVV1KfxtChcOKJqY+lERXaRxER\n84H5VeMurni+CjiuyBrMzBrFgQfCxImwZk0KhYEDU0f48uWpxXHssb1dYft8ZraZWR0NHZrCYo89\n0tdpZ8xIfRtLlsALL/R2de1zUJiZ9aJBg1JH94YNcOONqbXRaBwUZma9bOLE1NJYuRJ++ct0HkZE\n/tnh9dTb51GYme3yBgxI34RauBBefBGuvz6Fx+DB6dtRI0bAyJHQ2to79TkozMwaxOGHp5P1Hn4Y\ntm9P52HssUc6kW/vveF97+uduhwUZmYNQkrnX0yenIbb2lLfxeOPw+rV6eeECeVrWtWLg8LMrEE1\nN6fDTc3NqXUxf346JDV6NOy/f+rXaG4u/rIgDgozswZ30EGwdi0sWJAOSy1fns72Hj06XdvqsMOK\nXb+DwsysD2hthXe8Iz1fuhQefTSd5b377unrtcOGFbdufz3WzKyP2X9/mD0bxo1LlwFZsKDY9blF\nYWbWR02blvouHn00nbjX3AzTp6fnPXnjJAeFmVkfNnQorFqVzsFoaUn9FyNGpBsn7b57z6zDQWFm\n1ofNmJEemzenmyZt2ABPPJHGOSjMzOxVgwbB29+e7n+xaFHPLtud2WZmlstBYWbWD/XkBQUdFGZm\n/cimTbB1a/nWqz3BQWFm1o+MHJkCYsWK1MHdExwUZmb9SHNzOr9CcovCzMw68OKLqTXxyCM9szwH\nhZlZP1O6tMcdd6RLle8sB4WZWT8zbBgccABs2ZKuNruzHBRmZv3QxImwfj08//zOL8tBYWbWD0kw\nalTP3NTIQWFmZrkcFGZmlstBYWbWj61aBc8+u3PLcFCYmfVDAwakK8o+8wzccAPsTG+Fg8LMrJ86\n6CCYMiWdgAcDHBRmZvZagwbB4ME7twwHhZmZ5ar5DneSxgITK18TEbcWUZSZmTWOmoJC0jeBDwGL\ngW3Z6AByg0LSCcD5QBNwaUR8o515jgG+B7QAqyPi6FqLNzOz4tXaong38LqIqPnq5pKagAuBdwAr\ngQWS5kXE4op5WoGLgBMi4klJo2sv3czM6qHWPoplpD3+rjgCWBoRyyJiCzAXOLlqntOA6yLiSYCI\neK6L6zAzs4LV2qJ4GbhX0v8Ar7YqIuLvc14zFlhRMbwSOLJqngOAFkm3AMOA8yPiyhprMjOzOqg1\nKOZljyLWfzhwLLAbcIekOyNih9ttSDobOBtg770nFFCGmZl1pKagiIj/kDSQ1AIAWBIRWzt52VPA\n+Irhcdm4SiuBNRGxEdgo6VZgJrBDUETEJcAlANOnz+qhm/uZmVktauqjyL6Z9Cipc/oi4BFJb+nk\nZQuAqZImZyFzCq9tlfwcOEpSs6QhpENTD3WhfjMzK1ith56+AxwXEUsAJB0AXEM6bNSuiGiTdC5w\nE+nrsZdHxCJJ52TTL46IhyT9Ergf2E76Cu2D3X87ZmbW02oNipZSSABExCOSOv0WVETMB+ZXjbu4\navjbwLdrrMPMzLqgrQ1iJw/Y1xoUCyVdClyVDZ8OLNy5VZuZWdFefhk2bYJ01afuqTUo/gb4BFD6\nOuxtpL4KMzNrYJMnQ3MzwMCungv3qlq/9bQZ+G72MDOzPmLAAJg4ceeWkRsUkn4cER+U9ADp2k47\niIhDdm71ZmbW6DprUXwy+/nOogsxM7PGlHseRUQ8nT1dDayIiCeAQaST4lYVXJuZmTWAWi8KeCsw\nOLsnxa+AM4AriirKzMwaR61BoYh4GXgvcFFEfAA4sLi
yzMysUdQcFJLeRDp/4hfZuKZiSjIzs0ZS\na1B8Cvgn4L+zy3DsB/y2uLLMzKxR1Hoexe+A31UML6N88p2ZmfVjnZ1H8b2I+JSk62n/PIp3FVaZ\nmZk1hM5aFD/Kfv5b0YWYmVljyg2KiLgre7oQ2BQR2wEkNZHOpzAzs36u1s7s/wGGVAzvBtzc8+WY\nmVmjqTUoBkfEhtJA9nxIzvxmZtZP1BoUGyUdVhqQdDiwqZiSzMyskdR6P4pPAddKWgUI2Bv4UGFV\nmZlZw6j1PIoFkqYBr8tGLYmIrcWVZWZmjaKmQ0+ShgDnAZ+MiAeBSZJ86XEzs11ArX0U/w5sAd6U\nDT8FfKWQiszMrKHUGhRTIuJbwFaA7EqyKqwqMzNrGLUGxRZJu5FdxkPSFGBzYVWZmVnDqPVbT18A\nfgmMl3Q18GfAmUUVZWZmjaPToJAk4GHSTYveSDrk9MmIWF1wbWZm1gA6DYqICEnzI+JgyjctMjOz\nXUStfRR3S3pDoZWYmVlDqrWP4kjgw5IeBzaSDj9FRBxSVGFmZtYYag2K4wutwszMGlZnd7gbDJwD\n7A88AFwWEW31KMzMzBpDZ30U/wHMIoXEbOA7hVdkZmYNpbNDTzOybzsh6TLgT8WXZGZmjaSzFsWr\nV4j1ISczs11TZ0ExU9K67LEeOKT0XNK6zhYu6QRJSyQtlfS5nPneIKlN0vu7+gbMzKxYuYeeIqKp\nuwuW1ARcCLwDWAkskDQvIha3M983gV91d11mZlacWk+4644jgKURsSwitgBzgZPbme/vgJ8CzxVY\ni5mZdVORQTEWWFExvDIb9ypJY4H3AD/IW5CksyUtlLRw7drne7xQMzPrWJFBUYvvAedFxPa8mSLi\nkoiYFRGzWltH1ak0MzOD2s/M7o6ngPEVw+OycZVmAXPTBWoZCZwoqS0iflZgXWZm1gVFBsUCYKqk\nyaSAOAU4rXKGiJhcei7pCuAGh4SZWWMpLCgiok3SucBNQBNweUQsknRONv3iotZtZmY9p8gWBREx\nH5hfNa7dgIiIM4usxczMuqe3O7PNzKzBOSjMzCyXg8LMzHI5KMzMLJeDwszMcjkozMwsl4PCzMxy\nOSjMzCyXg8LMzHI5KMzMLJeDwszMcjkozMwsl4PCzMxyOSjMzCyXg8LMzHI5KMzMLJeDwszMcjko\nzMwsl4PCzMxyOSjMzCyXg8LMzHI5KMzMLJeDwszMcjkozMwsl4PCzMxyOSjMzCyXg8LMzHI5KMzM\nLJeDwszMcjkozMwsl4PCzMxyOSjMzCxXoUEh6QRJSyQtlfS5dqafLul+SQ9Iul3SzCLrMTOzriss\nKCQ1ARcCs4EZwKmSZlTNthw4OiIOBr4MXFJUPWZm1j1FtiiOAJZGxLKI2ALMBU6unCEibo+IF7PB\nO4FxBdZjZmbdUGRQjAVWVAyvzMZ15GPAje1NkHS2pIWSFq5d+3wPlmhmZp1piM5sSW8lBcV57U2P\niEsiYlZEzGptHVXf4szMdnHNBS77KWB8xfC4bNwOJB0CXArMjog1BdZjZmbdUGSLYgEwVdJkSQOB\nU4B5lTNImgBcB5wREY8UWIuZmXVTYS2KiGiTdC5wE9AEXB4RiySdk02/GPg8sBdwkSSAtoiYVVRN\nZmbWdUUeeiIi5gPzq8ZdXPH848DHi6zBzMx2TkN0ZpuZWeNyUJiZWS4HhZmZ5XJQmJlZLgeFmZnl\nclCYmVkuB4WZmeVyUJiZWS4HhZmZ5XJQmJlZLgeFmZnlclCYmVkuB4WZmeVyUJiZWS4HhZmZ5XJQ\nmJlZLgeFmZnlclCYmVkuB4WZmeVyUJiZWS4HhZmZ5XJQmJlZLgeFmZnlclCYmVkuB4WZmeVyUJiZ\nWS4HhZmZ5XJQmJlZLgeFmZnlclCYmVkuB4WZmeVyUJiZWS4HhZmZ5So0KCSdIGmJpKWSPtfOdEn6\nfjb9fkmHFVmPmZl
1XWFBIakJuBCYDcwATpU0o2q22cDU7HE28IOi6jEzs+5pLnDZRwBLI2IZgKS5\nwMnA4op5TgaujIgA7pTUKmmfiHg6b8GbNxdVspmZVSsyKMYCKyqGVwJH1jDPWGCHoJB0NqnFAbDl\n2GOHPdazpfZVW0dAy4u9XUVj8LYo87Yo87Yoe3lid19ZZFD0mIi4BLgEQNLCiPWzermkhpC2xSve\nFnhbVPK2KPO2KJO0sLuvLbIz+ylgfMXwuGxcV+cxM7NeVGRQLACmSposaSBwCjCvap55wEeybz+9\nEXips/4JMzOrr8IOPUVEm6RzgZuAJuDyiFgk6Zxs+sXAfOBEYCnwMnBWDYu+pKCS+yJvizJvizJv\nizJvi7JubwulLxyZmZm1z2dmm5lZLgeFmZnlatig8OU/ymrYFqdn2+ABSbdLmtkbddZDZ9uiYr43\nSGqT9P561ldPtWwLScdIulfSIkm/q3eN9VLD/8hwSddLui/bFrX0h/Y5ki6X9JykBzuY3r3PzYho\nuAep8/sxYD9gIHAfMKNqnhOBGwEBbwT+2Nt19+K2eDMwIns+e1feFhXz/Yb0ZYn393bdvfh30Uq6\nEsKEbHh0b9fdi9vin4FvZs9HAS8AA3u79gK2xVuAw4AHO5jerc/NRm1RvHr5j4jYApQu/1Hp1ct/\nRMSdQKukfepdaB10ui0i4vaIKJ19eifpfJT+qJa/C4C/A34KPFfP4uqslm1xGnBdRDwJEBH9dXvU\nsi0CGCZJwFBSULTVt8ziRcStpPfWkW59bjZqUHR0aY+uztMfdPV9foy0x9AfdbotJI0F3kP/v8Bk\nLX8XBwAjJN0i6S5JH6lbdfVVy7a4AJgOrAIeAD4ZEdvrU15D6dbnZp+4hIfVRtJbSUFxVG/X0ou+\nB5wXEdvTzuMurRk4HDgW2A24Q9KdEfFI75bVK44H7gXeBkwBfi3ptohY17tl9Q2NGhS+/EdZTe9T\n0iHApcDsiFhTp9rqrZZtMQuYm4XESOBESW0R8bP6lFg3tWyLlcCaiNgIbJR0KzAT6G9BUcu2OAv4\nRqQD9UslLQemAX+qT4kNo1ufm4166MmX/yjrdFtImgBcB5zRz/cWO90WETE5IiZFxCTgJ8Df9sOQ\ngNr+R34OHCWpWdIQ0tWbH6pznfVQy7Z4ktSyQtIY4HXAsrpW2Ri69bnZkC2KKO7yH31Ojdvi88Be\nwEXZnnRbRPS7K2bWuC12CbVsi4h4SNIvgfuB7cClEdHu1yb7shr/Lr4MXCHpAdI3fs6LiNW9VnRB\nJF0DHAPJrBceAAABxUlEQVSMlLQS+ALQAjv3uelLeJiZWa5GPfRkZmYNwkFhZma5HBRmZpbLQWFm\nZrkcFGZmlstBYVZF0rbsiqsPZlccbe3h5Z8p6YLs+RxJn+nJ5Zv1NAeF2WttiohDI+Ig0gXWPtHb\nBZn1JgeFWb47qLhomqR/lLQgu5b/FyvGfyQbd5+kH2XjTpL0R0n3SLo5OyPYrM9pyDOzzRqBpCbS\nZR8uy4aPA6aSLmstYJ6ktwBrgH8B3hwRqyXtmS3i98AbIyIkfRz4LPDpOr8Ns53moDB7rd0k3Utq\nSTwE/Dobf1z2uCcbHkoKjpnAtaVLQkRE6X4A44D/yq73PxBYXp/yzXqWDz2ZvdamiDgUmEhqOZT6\nKAR8Peu/ODQi9o+Iy3KW8/+ACyLiYOCvgcGFVm1WEAeFWQci4mXg74FPS2omXXTuLyUNhXSTJEmj\nSbdd/YCkvbLxpUNPwylfwvmjdS3erAf50JNZjoi4R9L9wKkR8SNJ00k3AALYAHw4u1LpV4HfSdpG\nOjR1JjAHuFbSi6Qwmdwb78FsZ/nqsWZmlsuHnszMLJeDwszMcjkozMwsl4PCzMxyOSjMzCyXg8LM\nzHI5KMzMLNf/AkbvrLJF4vQvAAAAAElFTkSuQmCC\n", "text/plain": [ "" 
] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "precision, recall, prc_thresholds = precision_recall_curve(voting_predict)\n", "# average precision (AP) summarizes the precision-recall curve in a single number\n", "average_precision = average_precision_score(voting_predict)\n", "\n", "plt.figure()\n", "plt.step(recall, precision, color='b', alpha=0.2,\n", " where='post')\n", "plt.fill_between(recall, precision, step='post', alpha=0.2,\n", " color='b')\n", "\n", "plt.xlabel('Recall')\n", "plt.ylabel('Precision')\n", "plt.ylim([0.0, 1.05])\n", "plt.xlim([0.0, 1.0])\n", "plt.title('2-class Precision-Recall curve: AP={0:0.2f}'.format(\n", " average_precision))\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "image/png": 
TgDG9vG3/7y8\nrQmFcRcDR7SYfzkwqzD8CeCSVseSofYZ9AA6+cO6B+LJwPXANwvTjwPOzj+QCcCvgGPztFmkA8ir\n8w9tErBTnnYm8F1gE+CZwKLuAwPrJoqXkw6c3QfmzfMP9blN4juGdMDYDxgNHAtckaeNJR3g3w9s\nABxAOvC2ShRXAJ9tMn5aPjDsmIcDuCDv/xTgVnoO3PsDS4Cd8w/7U8BlhXUF6WC2BTm5Am8lnb2N\nAT4M/J2eg+8xNBzYeXqi+Afw73n/30M6YHWX3eWkJDIWeCnpgNsqUSwAfljhu7GIdMDagnSQPyJP\n2xJ4E+kMcwIpMZ/VEPedpCuWMfn/5LWkA6aAvYA1wB4VvktPlUEe3iR/Zw7L696ddBCbnqfPz+t6\nSV7XhqybKI4lnfBskD8vK5ThX1n3xGkq6yaK/yMlzM3zsnsV5n0AeGmLstw+79s4YCvSAfkbDWV9\nDbA1sFGO+yrg0/n/c1tSct03z/8C4EV5/6fm/5sPFNZ3XY6n2efEPM8bgZsb4vwW8K0W+9CYKD4J\nrGrYh3tJie23wIzBPr716Vg42AF08if/5z5MOssLUnXEZnmaSGdk2xXmfzE9Z8ffBY5rss5nkc5U\nilcehwAX5L/fTk+iEOmA8vI8/O/A7xviKyaK8wvTpgOP5r9fTrpyUWH6H2idKJbQ5MyJdFAJ4CV5\nOMhXSXn4P4Df5b9/DbyzMG0U6eC3TWHZV/ZS/qu6f1BUSxRLCtM2ztt4NimJdQEbF6b/pHF9hWnn\nAV+q8N14a2H4K8C8FvPu1nDQuBD4XC/rPwt4f9l3qbEM8vCbKZzJFpb/TP57PvCjhunz6UkUnyNd\ntW7fYp+bJgrgOaQrjs0H4Hf3BuDqhu2+ozC8J3BnwzIfB37QYn0fAM7sYwyHkk+0CuO+CMxvMf9P\nSFc4E0iJ7zbg8cL0l5CS3MY51r9TuGrq9I/bKHr3hoiYAOxNuqTvbszcivSffpWkByQ9APwmj4d0\n9nNbk/VtQzrbuqew3HdJVxbriPQNW0BKJABvIV2Wt/L3wt9rgA1zne5zgbvz+rrdVbKe5aQffqPn\nFKY3W88deVuQ9vObhX1cSUp8k1rFIOkjkm6W9GBeZlN6yruKp/Y/ItbkP8fnmFYWxj1t2w1W0Hz/\nW26PVN7jASRtLOm7ku6Q9BDpDHmzhvaAxn2fI+kKSSvzvu9Hz763+i41sw2wZ3e553XNJSXMpttu\n8FXSicJvJS2V9LGK292aVMarKs7/FEnPkrRA0t25vH7C0//fizFvAzy3YR8/QToJQ9LzJJ2Tb0B5\nCPjvJuvrzcOktqmiTUknjc0cRbqi/wsp0Z4KLOueGBGXRsSjEbEmIo4lXb28rI8xDRoniooi4iLS\nmdfX8qjlpGqgXSJis/zZNFLDN6Qv9nZNVnUX6YpiYmG5Z0TELi02fSpwoKRtSGdSp/cj/HuASZJU\nGLd1yfznAwc0aWQ9KMd/a4v1TCFV95Dne3dhHzeLiI0i4rLC/E8lLkkvI9V7H0Q6K92MVEWixnn7\n4R5gC0kbt4i70fnAvpI26ef2PgzsCOwZEc8gXdFBz77Auvs+jvT/+jXgWXnfFxbmb/VdWmc9hXkv\naij38RHxnpJleiZErI6ID0fEtsC/AB+StE9vy+XtbiFps5J5WvnvvO7n5/J6K+uWVeO27yJduRf3\ncUJE7Jenfwf4M7BDXt8niuvLt5U/3OIzL892I7CtpAmF7c7I458mIlZGxNyIeHb+LY8iVU22Ek32\nsWM5UfTNN4BXS5oREWuB/wWOk/RMAEmTJO2b5/0+cJikfSSNytN2ioh7SHWUX8+3YI6StJ2kvZpt\nMCKuJiWl7wHnRsQD/Yj7clKj55GSxkjan/K7d44jN7hLerakDSUdQqp3/WjDlclHJW0uaWtSG8hp\nefw84OOSdgGQtKmkfy3Z5gRS9dD9wBhJn2bd
M7p7gan9uUMoIu4AFgPHSBor6cXA60sW+THpYHS6\npJ3y/9GWkj4hab+S5Yr78ijwQL6D7DO9zD+WVD9/P9AlaQ5QvCum6XcpT7uXVEff7RzgeZIOlbRB\n/rxQ0s4V4kbS6yRtn08qHiR9b9a22NZT8vf618CJ+fuwgaSXN5u3iQmkM/gHJU0CPtrL/IuA1ZKO\nlrSRpNGS/knSCwvrewh4OJdTMUkS6bby8S0+R+R5biW1i3wmf/8PAJ5PixO1/BveMscyBzicdJNK\n9/MmL8nfvQ0lfZR0hXNpxfIZdE4UfRAR95PuKvp0HnU06TL9inyJez7pTJKIWERqUDyO9IO7iHTJ\nDPA20sHhJlI9/C8or+r4KenulJ/2M+4nSA3Y7yRd8r6VdEB5vMX8K0gNvhvmGFcAHwIOjYjTGmb/\nJalh8RpSY+b38zrOBL4MLMhlcwMwpyTMc0lVd7eSqrAeY93qhp/nf1dI+lOvO/10c0ltSCtIP+DT\naL3/j5PK+8+k9oqHSAenicAfK2zrG6T66OWkGwN+UzZzRKwmVV38jPR9eAvpJonu6WXfpW+SrjhX\nSTo+r+s1wMGkq7u/k/4fxlWIG2AH0vf4YdIJxokRcUGedizwqVzd85Emyx5KuqHgz8B9pLYBAPLZ\nequqls+S7sp7kPQdOqMswEjPbbyO1PZzOz0nUpvmWT5CKsPVpJO5xu9sVQcDM0n/J8cCB+ZjAJJe\nJunhwrwvIN3ssjrPOzciuq8+JpCuclaR2gpnk+7IW9HPuNpO654c2kgh6Y+kxtcfrMc6gnR5v2Tg\nImsPSacBf46I3s72zUY8X1GMEJL2ytVIY5S6vdiVXs50h5Nc/bJdrrqZTbp996zBjstsKOioJ2Kt\nVjuSqjY2Id1zfmCuVx4pnk2q0tiSdDfKe3L7j5n1wlVPZmZWylVPZmZWashVPU2cODGmTp062GGY\nmQ0pV1111fKIaNWpYakhlyimTp3K4sWLBzsMM7MhRdId/V3WVU9mZlbKicLMzEo5UZiZWSknCjMz\nK+VEYWZmpZwozMysVG2JQtLJku6TdEOL6ZJ0vKQlkq6TtEddsZiZWf/VeUUxn9SdbitzSF0a70Dq\nu/07NcZiZmb9VNsDdxFxsaSpJbPsT3p3b5De57CZpOeMsI7qbJhavhxWrhzsKMwGxmA+mT2JdV9M\nsyyPe1qikHQ46aqDKVOmtCU4GznqOKivXg3XXAPuc9M6x4T+vtp3aHThEREnAScBzJw50z89e5r1\nOdjXdVAfNw6mTx/YdZr13+jR/V1yMBPF3az7gvvJeZyNQOt7Vr++B3sf1M1aG8xEcTZwpKQFwJ7A\ng26fGHm6E8RAnNX7YG9Wj9oShaRTgb2BiZKWAZ8BNgCIiHnAQmA/YAmwhvTyeBumWl0xFBOED/Rm\nnanOu54O6WV6AO+ta/s2uBoTQ9kVgxOEWWcbEo3ZNnSUVSU5IZgNTU4UNiCaJQgnBrPhwYnC1osT\nhNnw50RhlTVrkHaCMBv+nCisV73dwuoEYTa8OVFYr1auhMsug64uJwWzkciJwtbRrHppzZqUJHbf\nfXBiMrPB5UQxgvXW5lA0blz74jKzzuJEMYJUfQjO1UtmVuREMUIsXw533OGH4Mys75woRoiVK1OS\nGDvWicHM+saJYhgrVjWtWZOuJJwkzKyvnCiGqWZVTW6QNrP+cKIYZhofjnNVk5mtLyeKYcYPx5nZ\nQHOiGOIab3n1w3FmNtCcKIao3t77YGY2UJwohqDGhmpXMZlZnZwohiA/E2Fm7TRqsAOw/vEzEWbW\nLk4UZmZWyonCzMxKOVGYmVkpJwozMyvlRGFmZqWcKMzMrJQThZmZlXKiMDOzUk4UZmZWyonCzMxK\nOVGYmVkpJwozMytVa6KQNFvSLZKWSPpYk+mbSvqVpGsl3SjpsDrjMTOzvqstUUgaDXwbmANMBw6R\n1Njf6XuB
myJiBrA38HVJY+uKyczM+q7OK4pZwJKIWBoRTwALgP0b5glggiQB44GVQFeNMZmZWR/V\nmSgmAXcVhpflcUUnADsDfwOuB94fEWsbVyTpcEmLJS2+//7764rXzMyaGOw33O0LXAO8EtgOOE/S\nJRHxUHGmiDgJOAlg5syZ8bS1DJDu91B3ujVrBjsCMxtJ6kwUdwNbF4Yn53FFhwFfiogAlki6HdgJ\nWFRjXE01voe6040bN9gRmNlIUWeiuBLYQdI0UoI4GHhLwzx3AvsAl0h6FrAjsLTGmJoqJgm/h9rM\nbF21JYqI6JJ0JHAuMBo4OSJulHREnj4P+DwwX9L1gICjI2J5XTE14yRhZlau1jaKiFgILGwYN6/w\n99+A19QZQ29WrnSSMDMrM9iN2W3X2GC9Zk1qk3CSMDNrbsQlipUr4bLLoKvwtIYbhs3MWhtRiWL5\ncli9OiWJ3Xcf7GjMzIaGEdUpYHd7hK8gzMyqG1GJAtweYWbWVyMmUXRXO5mZWd+MiERRfFbC1U5m\nZn1TKVFIGitp+7qDqYuflTAz679eE4Wk15J6dj0vD+8m6cy6Axso3VVObpswM+ufKrfHfg7YE7gA\nICKuGQpXF90P1q1e7SonM7P1USVR/CMiHkjvFnpKx/evWnywbtw4X02YmfVXlURxs6SDgFG5J9ij\ngCvqDWtg+ME6M7P1V6Ux+0jgBcBa4AzgceD9dQZlZmado8oVxb4RcTRwdPcISQeQkoaZmQ1zVa4o\nPtVk3CcHOpCBsnw53HqrXxdqZjZQWl5RSNoXmA1MkvQ/hUnPIFVDdZzG15n6Ticzs/VXVvV0H3AD\n8BhwY2H8auBjdQbVH35TnZlZPVomioi4Grha0ikR8VgbY+oXP31tZlaPKo3ZkyR9EZgObNg9MiKe\nV1tU/eSnr83MBl6Vxuz5wA8AAXOAnwGn1RiTmZl1kCqJYuOIOBcgIm6LiE+REoaZmY0AVaqeHpc0\nCrhN0hHA3cCEesMyM7NOUSVRfBDYhNR1xxeBTYF31BmUmZl1jl4TRUT8Mf+5GjgUQNKkOoMyM7PO\nUdpGIemFkt4gaWIe3kXSj4A/li1nZmbDR8tEIelY4BRgLvAbSceQ3klxLdBRt8b6fdhmZvUpq3ra\nH5gREY9K2gK4C3h+RCxtT2jVdT9s5y47zMwGXlnV02MR8ShARKwEbu3EJNHND9uZmdWj7IpiW0nd\nXYkLmFYYJiIOqDUyMzPrCGWJ4k0NwyfUGUh/uX3CzKxeZZ0C/q6dgfSX2yfMzOpVpQuPjuf2CTOz\n+tSaKCTNlnSLpCWSmr7DQtLekq6RdKOki+qMx8zM+q5KFx4ASBoXEY/3Yf7RwLeBVwPLgCslnR0R\nNxXm2Qw4EZgdEXdKemb10M3MrB16vaKQNEvS9cBf8vAMSd+qsO5ZwJKIWBoRTwALSM9mFL0FOCMi\n7gSIiPv6FL2ZmdWuStXT8cDrgBUAEXEt8IoKy00iPaTXbVkeV/Q8YHNJF0q6StLbKqz3Kb7jycys\nflWqnkZFxB2SiuOeHMDtvwDYB9gIuFzSFRFxa3EmSYcDhwNMmTLlqfG+48nMrH5VrijukjQLCEmj\nJX0AuLW3hUjvrdi6MDw5jytaBpwbEY9ExHLgYmBG44oi4qSImBkRM7faaquGab7jycysTlUSxXuA\nDwFTgHuBF+VxvbkS2EHSNEljgYOBsxvm+SXwUkljJG0M7AncXDV4MzOrX5Wqp66IOLivK46ILklH\nAucCo4GTI+LG/JY8ImJeRNws6TfAdcBa4HsRcUNft2VmZvWpkiiulHQLcBrpDqXKzccRsRBY2DBu\nXsPwV4GvVl2nmZm1V69VTxGxHfAFUqPz9ZLOktTnKwwzMxuaKj2ZHRGXRcRRwB7AQ6QXGpmZ2QhQ\n5YG78ZLmSvoVsAi4H/jn2iMzM7OOUKWN4gbgV8BXIuKSmuOpzA/bmZm1R5
VEsW1ErK09kj7yw3Zm\nZu3RMlFI+npEfBg4XVI0Tu+EN9z5YTszs/qVXVGclv/tyDfbmZlZe5S94W5R/nPniFgnWeQH6Qbt\nDXhunzAza58qt8e+o8m4dw50IH3h9gkzs/Ypa6N4M6l/pmmSzihMmgA8UHdgvXH7hJlZe5S1USwi\nvYNiMulNdd1WA1fXGZSZmXWOsjaK24HbgfPbF46ZmXWasqqniyJiL0mrgOLtsQIiIraoPTozMxt0\nZVVP3a87ndiOQMzMrDO1vOup8DT21sDoiHgSeDHwbmCTNsRmZmYdoMrtsWeRXoO6HfADYAfgp7VG\nZWZmHaNKolgbEf8ADgC+FREfBCbVG5aZmXWKKomiS9K/AocC5+RxG9QXkpmZdZKqT2a/gtTN+FJJ\n04BT6w2rta4ud99hZtZOvXYzHhE3SDoK2F7STsCSiPhi/aE119Xl7jvMzNqp10Qh6WXAj4G7Sc9Q\nPFvSoRFxad3BteLuO8zM2qfKi4uOA/aLiJsAJO1MShwz6wzMzMw6Q5U2irHdSQIgIm4GxtYXkpmZ\ndZIqVxR/kjQP+Ekenos7BTQzGzGqJIojgKOA/8zDlwDfqi0iMzPrKKWJQtLzge2AMyPiK+0JyczM\nOknLNgpJnyB13zEXOE9SszfdmZnZMFd2RTEX2DUiHpG0FbAQOLk9YZmZWacou+vp8Yh4BCAi7u9l\n3rZZu7b3eczMbOCUXVFsW3hXtoDtiu/OjogDao2shTVr/FS2mVk7lSWKNzUMn1BnIH3hp7LNzNqn\n7J3Zv2tnIGZm1pk6ot3BzMw6V62JQtJsSbdIWiLpYyXzvVBSl6QD64zHzMz6rnKikNSnJmRJo4Fv\nA3OA6cAhkp7WupDn+zLw276s38zM2qPXRCFplqTrgb/k4RmSqnThMYv07oqlEfEEsADYv8l87wNO\nB+6rHraZmbVLlSuK44HXASsAIuJa0hvvejMJuKswvIyGd21LmgS8EfhO2YokHS5psaTFq1evqrBp\nMzMbKFUSxaiIuKNh3JMDtP1vAEdHROljdBFxUkTMjIiZEyZsPkCbNjOzKqr0HnuXpFlA5PaE9wG3\nVljubmDrwvDkPK5oJrBAEsBEYD9JXRFxVoX1m5lZG1RJFO8hVT9NAe4Fzs/jenMlsIOkaaQEcTDw\nluIMETGt+29J84FznCTMzDpLr4kiIu4jHeT7JCK6JB0JnAuMBk6OiBslHZGnz+vrOs3MrP16TRSS\n/heIxvERcXhvy0bEQlKvs8VxTRNERLy9t/WZmVn7Val6Or/w94aku5TuajGvmZkNM1Wqnk4rDkv6\nMfCH2iIyM7OO0p8uPKYBzxroQMzMrDNVaaNYRU8bxShgJdCy3yYzMxteShOF0gMOM+h5/mFtRDyt\nYdvMzIav0qqnnBQWRsST+eMkYWY2wlRpo7hG0u61R2JmZh2pZdWTpDER0QXsDlwp6TbgEdL7syMi\n9mhTjGZmNojK2igWAXsA/9KmWMzMrAOVJQoBRMRtbYrFzMw6UFmi2ErSh1pNjIj/qSEeMzPrMGWJ\nYjQwnnxlYWZmI1NZorgnIj7XtkjMzKwjld0e6ysJMzMrTRT7tC0KMzPrWC0TRUSsbGcgZmbWmfrT\ne6yZmY0gThRmZlbKicLMzEo5UZiZWSknCjMzK+VEYWZmpZwozMyslBOFmZmVcqIwM7NSThRmZlbK\nicLMzEo5UZiZWSknCjMzK+VEYWZmpZwozMyslBOFmZmVqjVRSJot6RZJSyR9rMn0uZKuk3S9pMsk\nzagzHjMz67vaEoWk0cC3gTnAdOAQSdMbZrsd2Csing98HjiprnjMzKx/6ryimAUsiYilEfEEsADY\nvzhDRFwWEavy4BXA5BrjMTOzfqgzUUwC7ioML8vjWnkn8OtmEyQdLmmxpMWrV69qNouZmdWkIxqz\nJb2ClCiObjY9Ik6KiJkRMXPChM3bG5
yZ2Qg3psZ13w1sXRienMetQ9KuwPeAORGxosZ4zMysH+q8\norgS2EHSNEljgYOBs4szSJoCnAEcGhG31hiLmZn1U21XFBHRJelI4FxgNHByRNwo6Yg8fR7waWBL\n4ERJAF0RMbOumMzMrO/qrHoiIhYCCxvGzSv8/S7gXXXGYGZm66cjGrPNzKxzOVGYmVkpJwozMyvl\nRGFmZqWcKMzMrJQThZmZlXKiMDOzUk4UZmZWyonCzMxKOVGYmVkpJwozMyvlRGFmZqWcKMzMrJQT\nhZmZlXKiMDOzUk4UZmZWyonCzMxKOVGYmVkpJwozMyvlRGFmZqWcKMzMrJQThZmZlXKiMDOzUk4U\nZmZWyonCzMxKOVGYmVkpJwozMyvlRGFmZqWcKMzMrJQThZmZlXKiMDOzUk4UZmZWyonCzMxK1Zoo\nJM2WdIukJZI+1mS6JB2fp18naY864zEzs76rLVFIGg18G5gDTAcOkTS9YbY5wA75czjwnbriMTOz\n/qnzimIWsCQilkbEE8ACYP+GefYHfhTJFcBmkp5TtlKpnmDNzKy5MTWuexJwV2F4GbBnhXkmAfcU\nZ5J0OOmKA9A/Zs7c/K8DG+pQ9fimMO7BwY6iM7gsergsergsejw0ub9L1pkoBkxEnAScBCBpccSq\nmYMcUkdIZbHGZYHLoshl0cNl0UPS4v4uW2fV093A1oXhyXlcX+cxM7NBVGeiuBLYQdI0SWOBg4Gz\nG+Y5G3hbvvvpRcCDEXFP44rMzGzw1Fb1FBFdko4EzgVGAydHxI2SjsjT5wELgf2AJcAa4LAKqz6p\nppCHIpdFD5dFD5dFD5dFj36XhSJiIAMxM7Nhxk9mm5lZKScKMzMr1bGJwt1/9KhQFnNzGVwv6TJJ\nMwYjznborSwK871QUpekA9sZXztVKQtJe0u6RtKNki5qd4ztUuE3sqmkX0m6NpdFlfbQIUfSyZLu\nk3RDi+n9O25GRMd9SI3ftwHbAmOBa4HpDfPsB/waEPAi4I+DHfcglsU/A5vnv+eM5LIozPd70s0S\nBw523IP4vdgMuAmYkoefOdhxD2JZfAL4cv57K2AlMHawY6+hLF4O7AHc0GJ6v46bnXpFUUv3H0NU\nr2UREZdFxKo8eAXpeZThqMr3AuB9wOnAfe0Mrs2qlMVbgDMi4k6AiBiu5VGlLAKYIEnAeFKi6Gpv\nmPWLiItJ+9ZKv46bnZooWnXt0dd5hoO+7uc7SWcMw1GvZSFpEvBGhn8Hk1W+F88DNpd0oaSrJL2t\nbdG1V5WyOAHYGfgbcD3w/ohY257wOkq/jptDogsPq0bSK0iJ4qWDHcsg+gZwdESslXuQHAO8ANgH\n2Ai4XNIVEXHr4IY1KPYFrgFeCWwHnCfpkoh4aHDDGho6NVG4+48elfZT0q7A94A5EbGiTbG1W5Wy\nmAksyEliIrCfpK6IOKs9IbZNlbJYBqyIiEeARyRdDMwAhluiqFIWhwFfilRRv0TS7cBOwKL2hNgx\n+nXc7NSqJ3f/0aPXspA0BTgDOHSYny32WhYRMS0ipkbEVOAXwH8MwyQB1X4jvwReKmmMpI1JvTff\n3OY426FKWdxJurJC0rOAHYGlbY2yM/TruNmRVxRRX/cfQ07Fsvg0sCVwYj6T7oqIYddjZsWyGBGq\nlEVE3CzpN8B1wFrgexHR9LbJoazi9+LzwHxJ15Pu+Dk6IpYPWtA1kXQqsDcwUdIy4DPABrB+x013\n4WFmZqU6terJzMw6hBOFmZmVcqIwM7NSThRmZlbKicLMzEo5UVjHkfRk7vG0+zO1ZN6prXrK7OM2\nL8y9j14r6VJJO/ZjHUd0d5Mh6e2SnluY9j1J0wc4zisl7VZhmQ/k5yjM+sWJwjrRoxGxW+Hz1zZt\nd25EzAB+CHy1rwvnZxd+lAffDjy3MO1dEXHTgETZE+eJVIvzA4AThfWbE4UNCfnK4RJJf8qff24y\nzy
6SFuWrkOsk7ZDHv7Uw/ruSRveyuYuB7fOy+0i6WuldHydLGpfHf0nSTXk7X8vjjpH0EaV3YMwE\nTsnb3ChkOp3QAAAC4klEQVRfCczMVx1PHdzzlccJ/Yzzcgodukn6jqTFSu9b+GwedxQpYV0g6YI8\n7jWSLs/l+HNJ43vZjo1wThTWiTYqVDudmcfdB7w6IvYA3gwc32S5I4BvRsRupAP1Mkk75/lfksc/\nCcztZfuvB66XtCEwH3hzRDyf1JPBeyRtSeqhdpeI2BX4QnHhiPgFsJh05r9bRDxamHx6Xrbbm0l9\nU/UnztlAsXuST+Yn8ncF9pK0a0QcT+ox9RUR8QpJE4FPAa/KZbkY+FAv27ERriO78LAR79F8sCza\nADgh18k/SepCu9HlwCclTSa9h+EvkvYh9aB6Ze7eZCNav6fiFEmPAn8lvdNiR+D2Qv9ZPwTeS+qy\n+jHg+5LOAc6pumMRcb+kpbmfnb+QOqa7NK+3L3GOJb1XoVhOB0k6nPS7fg4wndR9R9GL8vhL83bG\nksrNrCUnChsqPgjcS+r9dBTpQL2OiPippD8CrwUWSno3qV+fH0bExytsY25ELO4ekLRFs5ly30Kz\nSJ3MHQgcSeq+uqoFwEHAn4EzIyKUjtqV4wSuIrVPfAs4QNI04CPACyNilaT5wIZNlhVwXkQc0od4\nbYRz1ZMNFZsC9+SXzRxK6vxtHZK2BZbm6pZfkqpgfgccKOmZeZ4tJG1TcZu3AFMlbZ+HDwUuynX6\nm0bEQlICa/aO8tXAhBbrPZP0prFDSEmDvsaZu8v+L+BFknYCngE8Ajyo1DvqnBaxXAG8pHufJG0i\nqdnVmdlTnChsqDgR+DdJ15Kqax5pMs9BwA2SrgH+ifTKx5tIdfK/lXQdcB6pWqZXEfEYqXfNn+de\nR9cC80gH3XPy+v5A8zr++cC87sbshvWuInX3vU1ELMrj+hxnbvv4OvDRiLgWuJp0lfJTUnVWt5OA\n30i6ICLuJ92RdWrezuWk8jRryb3HmplZKV9RmJlZKScKMzMr5URhZmalnCjMzKyUE4WZmZVyojAz\ns1JOFGZmVur/AYzPb786SFobAAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fpr, tpr, roc_thresholds = roc_curve(voting_predict)\n", "area = roc_auc_score(voting_predict)\n", "\n", "plt.figure()\n", "plt.step(fpr, tpr, color='b', alpha=0.2,\n", " where='post')\n", "plt.fill_between(fpr, tpr, step='post', alpha=0.2,\n", " color='b')\n", "\n", "plt.xlabel('False Positive Rate')\n", "plt.ylabel('True Positive Rate')\n", "plt.ylim([0.0, 1.05])\n", "plt.xlim([0.0, 1.0])\n", "plt.title('Receiving Operating Characteristic: area={0:0.2f}'.format(\n", " area))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Considered a bad idea to actually adjust predictions based on optimal `Threshold` from holdout test data curves - it's a form of overfitting on the test set: 
https://stackoverflow.com/questions/32627926/scikit-changing-the-threshold-to-create-multiple-confusion-matrixes (though picking the threshold from the ROC curve, or from cross-validated training data, might be OK: https://stackoverflow.com/a/35300649)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pickling the Voting Model" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Exporting the voting model to model.v2.pkl\n" ] } ], "source": [ "import pickle\n", "\n", "print(\"Exporting the voting model to model.v2.pkl\")\n", "with open('model.v2.pkl', 'wb') as f:\n", "    pickle.dump(voting, f)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Importing the model from model.v2.pkl\n", "New sample: Buy ice cream\n", "Predicted class is ('pos', 0.96540602930541619)\n" ] } ], "source": [ "# load the model back into memory\n", "print(\"Importing the model from model.v2.pkl\")\n", "with open('model.v2.pkl', 'rb') as f:\n", "    loaded_clf = pickle.load(f)\n", "\n", "# predict on a new sample\n", "task_new = 'Buy ice cream'\n", "print('New sample: {}'.format(task_new))\n", "\n", "# score on the new sample: pair each predicted label with its 'pos' probability\n", "features = featurize(nlp(task_new))\n", "predict = [(label, p.prob('pos')) for label, p in zip(loaded_clf.classify_many(features), loaded_clf.prob_classify_many(features))]\n", "print('Predicted class is {}'.format(predict[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next Steps and Improvements\n", "\n", "1. The training set may be too specific or not relevant enough (recipe instructions for the positive dataset; recipe descriptions and short movie reviews for the negative dataset)\n", "2. 
Throwing features into a blender - need to understand the value of each\n", "    - What feature \"classes\" tend to perform the best/worst?\n", "    - [PCA](http://jotterbach.github.io/2016/03/24/Principal_Component_Analysis/): reducing dimensionality while keeping the most informative feature information\n", "3. Phrase vectorizations of all 0s - how problematic is this?\n", "4. Varying feature vector lengths - does this matter?\n", "5. Voting - POS taggers\n", "    - [scikit-learn: Ensembles](http://scikit-learn.org/stable/modules/ensemble.html)\n", "    - [Kaggle Ensembling Guide](https://mlwave.com/kaggle-ensembling-guide/)\n", "6. Combining verb phrases\n", "7. Look at examples from different quadrants of the confusion matrix - is there something we can learn?\n", "    - Same idea with the classification report" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Things abandoned" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## NLTK" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I needed a library that supports dependency parsing, which NLTK does not... so I thought I'd add the [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) toolkit and [its associated software](https://nlp.stanford.edu/software/) to NLTK. However, there are many conflicting instructions for installing the Java-based project, depending on the NLTK version used. By the time I figured this out, the installation had become a time sink. So I abandoned this effort in favor of Spacy.io.\n", "\n", "I might return to this approach later if I want to improve results or implement a voter system between the various linguistic and classification methods." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import nltk\n", "from nltk.tokenize import sent_tokenize, word_tokenize" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "nltk.download('punkt')\n", "nltk.download('averaged_perceptron_tagger')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sentences = [s for line in lines for s in sent_tokenize(line)]  # punkt\n", "sentences" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tagged_sentences = []\n", "for s in sentences:\n", "    words = word_tokenize(s)\n", "    tagged = nltk.pos_tag(words)  # averaged_perceptron_tagger\n", "    tagged_sentences.append(tagged)\n", "print(tagged_sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Note: POS accuracy\n", "\n", "`Run down to the shop, will you, Peter` is tagged unexpectedly by `nltk.pos_tag`:\n", "> `[('Run', 'NNP'), ('down', 'RB'), ('to', 'TO'), ('the', 'DT'), ('shop', 'NN'), (',', ','), ('will', 'MD'), ('you', 'PRP'), (',', ','), ('Peter', 'NNP')]`\n", "\n", "`Run` is tagged as an `NNP (proper noun, singular)`\n", "\n", "I expected an output more like what the [Stanford Parser](http://nlp.stanford.edu:8080/parser/) provides:\n", "> `Run/VBG down/RP to/TO the/DT shop/NN ,/, will/MD you/PRP ,/, Peter/NNP`\n", "\n", "`Run` is tagged as a `VBG (verb, gerund/present participle)` - still not quite the `VB` I want, but at least it's a `V*`\n", "\n", "_MEANWHILE..._\n", "\n", "`nltk.pos_tag` did better with:\n", "> `[('Do', 'VB'), ('not', 'RB'), ('clean', 'VB'), ('soot', 'NN'), ('off', 'IN'), ('the', 'DT'), ('window', 'NN')]`\n", "\n", "Compared to [Stanford CoreNLP](http://nlp.stanford.edu:8080/corenlp/process) (note that 
this is different from what [Stanford Parser](http://nlp.stanford.edu:8080/parser/) outputs):\n", "> `(ROOT (S (VP (VB Do) (NP (RB not) (JJ clean) (NN soot)) (PP (IN off) (NP (DT the) (NN window))))))`\n", "\n", "Concern: _clean_ as `VB (verb, base form)` vs `JJ (adjective)` \n", "\n", "**IMPROVE** POS taggers should vote: nltk.pos_tag (averaged_perceptron_tagger), Stanford Parser, CoreNLP, etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note what the Spacy POS tagger did with `Run down to the shop, will you, Peter`:\n", "\n", "`Run/VB down/RP to/IN the shop/NN ,/, will/MD you/PRP ,/, Peter/NNP`\n", "\n", " where `Run` is the `VB` I expected from POS tagging (compared to the `nltk.pos_tag` result of `NNP`). Also note that Spacy collapses `the shop` into a single unit, which should be helpful during featurization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Featurization" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re\n", "from collections import defaultdict\n", "\n", "featuresets = []\n", "for ts in tagged_sentences:\n", "    s_features = defaultdict(int)\n", "    for idx, tup in enumerate(ts):\n", "        #print(tup)\n", "        pos = tup[1]\n", "        # FeatureName.VERB\n", "        is_verb = re.match(r'VB.?', pos) is not None\n", "        print(tup, is_verb)\n", "        if is_verb:\n", "            s_features[FeatureName.VERB] += 1\n", "            # FOLLOWING_POS\n", "            next_idx = idx + 1\n", "            if next_idx < len(ts):\n", "                s_features[f'{FeatureName.FOLLOWING}_{ts[next_idx][1]}'] += 1\n", "            # VERB_MODIFIER\n", "            # VERB_MODIFYING\n", "        else:\n", "            # touch the key so it exists, without resetting an earlier verb count\n", "            s_features[FeatureName.VERB] += 0\n", "    featuresets.append(dict(s_features))\n", "\n", "print()\n", "print(featuresets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Stanford NLP](https://nlp.stanford.edu/software/)\n", "Setup guide used: https://stackoverflow.com/a/34112695" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": 
true }, "outputs": [], "source": [ "# Get dependency parser, NER, POS tagger\n", "!wget https://nlp.stanford.edu/software/stanford-parser-full-2017-06-09.zip\n", "!wget https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip\n", "!wget https://nlp.stanford.edu/software/stanford-postagger-full-2017-06-09.zip\n", "!unzip stanford-parser-full-2017-06-09.zip\n", "!unzip stanford-ner-2017-06-09.zip\n", "!unzip stanford-postagger-full-2017-06-09.zip" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from nltk.parse.stanford import StanfordParser\n", "from nltk.parse.stanford import StanfordDependencyParser\n", "from nltk.parse.stanford import StanfordNeuralDependencyParser\n", "from nltk.tag.stanford import StanfordPOSTagger, StanfordNERTagger\n", "from nltk.tokenize.stanford import StanfordTokenizer" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }