{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## NLP for Task Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hypothesis**: Part of Speech (POS) tagging and syntactic dependency parsing provides valuable information for classifying imperative phrases. The thinking is that being able to detect imperative phrases will transfer well to detecting tasks and to-dos.\n", "\n", "#### Some Terminology\n", "- [_Imperative mood_](https://en.wikipedia.org/wiki/Imperative_mood) is \"used principally for ordering, requesting or advising the listener to do (or not to do) something... also often used for giving instructions as to how to perform a task.\"\n", "- _Part of speech (POS)_ is a way of categorizing a word based on its syntactic function.\n", " - The POS tagger from Spacy.io that is used in this notebook differentiates between [*pos_* and *tag_*](https://spacy.io/docs/api/annotation#pos-tagging-english) - *POS (pos_)* refers to \"coarse-grained part-of-speech\" like `VERB`, `ADJ`, or `PUNCT`; and *POSTAG (tag_)* refers to \"fine-grained part-of-speech\" like `VB`, `JJ`, or `.`.\n", "- _Syntactic dependency parsing_ is a way of connecting words based on syntactic relationships, [such as](https://spacy.io/docs/api/annotation#dependency-parsing-english) `DOBJ` (direct object), `PREP` (prepositional modifier), or `POBJ` (object of preposition).\n", " - Check out the dependency parse of the phrase [\"Send the report to Kyle by tomorrow\"](https://demos.explosion.ai/displacy/?text=Send%20the%20report%20to%20Kyle%20by%20tomorrow&model=en&cpu=1&cph=1) as an example.\n", "\n", "### Proposed Features\n", "The imperative mood centers around _actions_, and actions are generally represented in English using verbs. So the features are engineered to also center on the VERB:\n", "1. `FeatureName.VERB`: Does the phrase contain `VERB`(s) of the tag form `VB*`?\n", "2. `FeatureName.FOLLOWING_POS`: Are the words following the `VERB`(s) of certain parts of speech?\n", "3. `FeatureName.FOLLOWING_POSTAG`: Are the words following the `VERB`(s) of certain POS tags?\n", "4. `FeatureName.CHILD_DEP`: Are the `VERB`(s) parents of certain syntactic dependencies?\n", "5. `FeatureName.PARENT_DEP`: Are the `VERB`(s) children of certain syntactic dependencies?\n", "6. `FeatureName.CHILD_POS`: Are the syntactic dependencies that the `VERB`(s) are children of of certain parts of speech?\n", "7. `FeatureName.CHILD_POSTAG`: Are the syntactic dependencies that the `VERB`(s) are children of of certain POS tags?\n", "8. `FeatureName.PARENT_POS`: Are the syntactic dependencies that the `VERB`(s) parent of certain parts of speech?\n", "9. `FeatureName.PARENT_POSTAG`: Are the syntactic dependencies that the `VERB`(s) parent of certain POS tags?\n", "\n", "**Notes:**\n", "- Features 2-9 all depend on feature 1 between `True`; if `False`, phrase vectorization will result in all zeroes.\n", "- When features 2-9 are applied to actual phrases, they will append identifying informating about the feature in the form of `_*` (e.g., `FeatureName.FOLLOWING_POSTAG_WRB`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data and Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Building a recipe corpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I wrote and ran `epicurious_recipes.py`\\* to scrape Epicurious.com for recipe instructions and descriptions. I then performed some manual cleanup of the script results. Output is in `epicurious-pos.txt` and `epicurious-neg.txt`.\n", "\n", "\\* _script (very) loosely based off of https://github.com/benosment/hrecipe-parse_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note** that deriving all negative examples in the training set from Epicurious recipe descriptions would result in negative examples that are longer and syntactically more complicated than the positive examples. This is a form of bias.\n", "\n", "To (hopefully?) correct for this a bit, I will add the short movie reviews found at https://pythonprogramming.net/static/downloads/short_reviews/ as more negative examples.\n", "\n", "This still feels weird because we're selecting negative examples only from specific categories of text (recipe descriptions, short movie reviews) - just because they're readily available. Further, most positive examples are recipe instructions - also a specific (and not necessarily related to the main \"task\" category) category of text.\n", "\n", "Ultimately though, this recipe corpus is a **stopgap/proof of concept** for a corpus more relevant to tasks later on, so I won't worry further about this for now." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "from pandas import read_csv\n", "from numpy import random" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "BASE_DIR = os.getcwd()\n", "data_path = BASE_DIR + '/data.tsv'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | Text | \n", "Label | \n", "
---|---|---|
0 | \n", "Be kind | \n", "pos | \n", "
1 | \n", "Get out of here | \n", "pos | \n", "
2 | \n", "Look this over | \n", "pos | \n", "
3 | \n", "Paul, do your homework now | \n", "pos | \n", "
4 | \n", "Do not clean soot off the window | \n", "pos | \n", "