{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Chunking With The NLTK" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## An introduction and guide for linguists who aren't programmers, and programmers who aren't linguists." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "by Luke Petschauer ([@lukewrites](https://twitter.com/lukewrites) | [LinkedIn](https://www.linkedin.com/in/lukepetschauer) | [blog](http://lukewrites.com))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Table of Contents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[What is the NLTK? How do I use this guide?](#What-is-the-NLTK?-How-do-I-use-this-guide?)\n", "\n", "[What is 'Chunking', and why should I do it?](#What-is-'Chunking',-and-why-should-I-do-it?)\n", "\n", "[Note for Non-Linguists: Noun Phrases](#Note-for-Non-Linguists:-Noun-Phrases)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This guide is intended for two audiences:\n", " 1. **Linguists** who have awesome ideas to realize, but aren't comfortable using Python. Code is broken up into small chunks that are thoroughly explained both in the text and via in-line comments.\n", " 2. **Programmers** who want to analyze language with Python, but aren't familiar enough with linguistics terminology to fully exploit the NLTK. Linguistic terms are explained, examples are provided, and code samples are given.\n", " \n", "The code for this notebook [is on GitHub](https://github.com/lukewrites/NP_chunking_with_nltk)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is the NLTK? How do I use this guide?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Natural Language Toolkit ([NLTK](http://www.nltk.org/)) is an open-source library of tools for natural language processing with [Python](http://python.org). 
A number of the tools included in the NLTK have direct applications in linguistics, and have the potential to be of great use to linguists.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is 'Chunking', and why should I do it?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Chunking_ breaks a text up into user-defined units ('_chunks_') that contain certain types of words (nouns, adjectives, verbs) or phrases (noun phrases, verb phrases, prepositional phrases). What makes chunking with the NLTK different from using a built-in string method like `split` is the NLTK's ability to analyze the text and tag each word with its part of speech. \n", "\n", "Chunking can be very useful when undertaking analysis of text, though it is more computationally intensive than preparing text for frequency analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Note for Non-Linguists: Noun Phrases" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A noun phrase is \"a single word, a part of a word, or a chain of words\" ([Wikipedia](http://en.wikipedia.org/wiki/Lexical_item)) that has a noun as its _head_ (main) _word_. \"Everton football club's victory\" has the noun 'victory' as its head word; within it, \"Everton football club\" is itself a noun phrase (headed by 'club'), since the three words (which in this case are all nouns) together describe a single thing/idea/entity. \n", "\n", "Most _word classes_ (nouns, verbs, adjectives, etc.) 
can form phrases; articles/determiners (_the_, _a_, etc.) and pronouns are examples of word classes that do not.\n", "\n", "For example:\n", "\n", "| Phrase | Head Word | Phrase Type |\n", "| ------ | ------ | ------ |\n", "| super happy | happy (adjective) | adjective phrase (AP) |\n", "| kick the bucket | kick (verb) | verb phrase (VP) |\n", "| an extremely difficult problem | problem (noun) | noun phrase (NP) |\n", "| over the river | over (preposition) | prepositional phrase (PP) |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why Chunk?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Chunking is useful for selecting and extracting meaningful information from texts for analysis. Chunking allows us to pull out groups of words with set characteristics rather than selecting text by frequency.\n", "\n", "In this tutorial we will perform an analysis of the text of the etiquette book [_Beadle's Dime Book of Practical Etiquette for Ladies and Gentlemen_](http://www.gutenberg.org/ebooks/45591?msg=welcome_stranger). I've chosen this book more or less at random from Project Gutenberg because we can predict that it will use lots of domain-specific vocabulary. We'd hope that our chunker will be able to automatically pull out such language. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chunking vs `split`ing\n", "\n", "We could use the Python `split` string method on the text, resulting in a big list of words. 
Here is the result of splitting just the first sentence:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['If',\n", " 'you',\n", " 'wish',\n", " 'to',\n", " 'make',\n", " 'yourself',\n", " 'agreeable',\n", " 'to',\n", " 'a',\n", " 'lady,',\n", " 'turn',\n", " 'the',\n", " 'conversation',\n", " 'adroitly',\n", " 'upon',\n", " 'taste,',\n", " 'or',\n", " 'art,',\n", " 'or',\n", " 'books,',\n", " 'or',\n", " 'persons,',\n", " 'or',\n", " 'events',\n", " 'of',\n", " 'the',\n", " 'day.']" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "etiquette_excerpt = \"\"\"If you wish to make yourself agreeable to a lady, turn the conversation adroitly upon taste, or art, or books, or persons, or events of the day.\n", "\"\"\"\n", "\n", "etiquette_excerpt.split()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we did this to the whole text, we could do a frequency analysis to see which words are most common. This is not particularly helpful, since additional processing would be required to remove [stop words](http://en.wikipedia.org/wiki/Stop_words). 
Also, notice that `lady,` and `day.` (punctuation appended) are assumed to be words; we need to make sure that punctuation is stripped away.\n", "\n", "## Chunking based on part of speech (POS)\n", "\n", "Below we will look at how extracting chunks of text can allow us to gain insight into a text that word frequency does not.\n", "\n", "If we were to extract the nouns from our sample sentence, we would get a list of words including the following:\n", "\n", "```python\n", "[lady, conversation, taste, art, books, persons, events, day, …]\n", "```\n", "\n", "This is a perfectly fine list, and one from which we probably could make sound assumptions about the nature of the text, but we can get a better sense of and learn more about the text if we are able to see the noun phrases (NP) in it.\n", "\n", "To illustrate the difference between the two: while extracting nouns alone would return\n", "\n", "```python\n", "[…, events, day, …]\n", "```\n", "\n", "extracting NPs could return\n", "\n", "```python\n", "[…, events of the day, …]\n", "```\n", "\n", "This is arguably a more meaningful chunk of language since it gives us a specific concept that the etiquette book mentions, rather than just a list of the topic's constituent nouns. Automatically being able to extract a number of NPs from a text can allow us to make good guesses about what the text is about, among other uses.\n", "\n", "We can easily extract NPs using the NLTK. To do so, we need to define what language patterns we want the NLTK to identify as NPs. The NLTK uses regular expressions to set these definitions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Workflow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To extract NPs from our `sample_text`, we will need to do the following:\n", "> 1. Set up our environment.\n", "> 2. Define the patterns that we want the NLTK to identify as being NPs.\n", "> 3. 
Identify and store a text that we want to analyze.\n", "> 4. Prepare the text by _tokenizing_ it and _tagging_ each word. (Descriptions of _tokenizing_ and _tagging_ come below.)\n", "> 5. Have the NLTK identify NPs in the tagged text and, finally, show us a list of these NPs.\n", "\n", "Each of these steps will give us the opportunity to learn more about programming with Python and linguistics." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Set Up Our Environment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To go through this tutorial you need to have installed the NLTK and numpy. You can find installation instructions on the [NLTK website](http://www.nltk.org/).\n", "\n", "At the beginning of your Python script, you need to import `nltk`, `re` (regular expressions, which will be used in step 2), `pprint` (a pretty printer, handy for displaying the trees that are an intermediary step in our chunking process), and `Tree` from the `nltk` library. \n", "\n", "The beginning of our script will look like this:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import nltk\n", "import re\n", "import pprint\n", "from nltk import Tree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above we are importing the libraries necessary to make the code run. These libraries include the NLTK (`nltk`), regular expressions (`re`), and the data pretty printer (`pprint`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Define our NPs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The NLTK can find NPs, but we have to tell the NLTK what chunks of language it should identify as noun phrases. To do this we need to know two things:\n", "\n", "> 1. What notation does the NLTK use for parts of speech?\n", "\n", "> 2. How can we write NP definitions that allow for ambiguity?\n", "\n", "To answer (1), we will look at how the NLTK tags words for part of speech (POS). 
To answer (2), we will need to gain a basic understanding of _regular expressions_…the reason that we had to \n", "\n", "```python\n", "import re\n", "```\n", "\n", "at the beginning of our Python script." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1 Parts of Speech in the NLTK" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The NLTK provides a function, `pos_tag()`, that tags POS using the [Penn Treebank Tag Set](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). Here is a partial list of its tags:\n", "\n", "> 1.\tCC\tCoordinating conjunction\n", "2.\tCD\tCardinal number\n", "3.\tDT\tDeterminer\n", "4.\tEX\tExistential there\n", "5.\tFW\tForeign word\n", "6.\tIN\tPreposition or subordinating conjunction\n", "7.\tJJ\tAdjective\n", "8.\tJJR\tAdjective, comparative\n", "9.\tJJS\tAdjective, superlative\n", "10.\tLS\tList item marker\n", "11.\tMD\tModal\n", "12.\tNN\tNoun, singular or mass\n", "13.\tNNS\tNoun, plural\n", "14.\tNNP\tProper noun, singular\n", "15.\tNNPS\tProper noun, plural\n", "\n", "Notice that the Penn Treebank Tag Set differentiates between four different types of nouns: `NN`, `NNS`, `NNP`, `NNPS`. In this case, we will want to consider all types of nouns: proper and common, singular/mass and plural. Rather than writing separate rules for each case, we can use regular expressions to include them all." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2 Regular Expressions and chunk patterns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can identify and tag chunks based upon _morphological structure_ using **_regular expressions_**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regular expressions have a reputation for being complicated and difficult. 
Click on the following to [read more about regular expressions in Python](https://docs.python.org/2/library/re.html).\n", "\n", "We are going to use regex to define NPs to be certain patterns:\n", "* any number of adjectives (optional) + one or more nouns of any type\n", "* any number of adjectives (optional) + one noun of any type + coordinating conjunction (optional) + one or more nouns of any type\n", "\n", "The way we would write these patterns is:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Caution: among the Penn Treebank tags, <NN*> matches only NN;\n", "# <NN.*> would also match NNS, NNP, and NNPS.\n", "patterns = \"\"\"\n", "    NP: {<JJ>*<NN*>+}\n", "        {<JJ>*<NN*><CC>*<NN*>+}\n", "    \"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the patterns span more than one line, we enclose them with triple quotes `\"\"\"`.\n", "\n", "Notice that the patterns are stored in a variable. We need this variable because the patterns will be passed to the NLTK's regular expression parser:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": true }, "outputs": [], "source": [ "NPChunker = nltk.RegexpParser(patterns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we create another variable, `NPChunker`, which holds a `RegexpParser` object built from our `patterns`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Create a text sample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to save the text as a _variable_. 
In Python, we _assign variables_ by entering the variable name followed by an `=` sign:\n", "\n", "```python\n", "sample_text = # our variable name is sample_text\n", "```\n", "\n", "Whatever data we are assigning to the variable goes immediately to the right of the `=` sign.\n", "\n", "Since we are using multiple lines of text, we will surround the text with triple quotes:\n", "```python\n", "\"\"\"text\n", "\"\"\"\n", "```\n", "\n", "Our final variable assignment will look like this:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sample_text = \"\"\"Good behavior upon the street, or public promenade, marks the gentleman\n", "most effectually; rudeness, incivility, disregard of \"what the world\n", "says,\" marks the person of low breeding. We always know, in walking a\n", "square with a man, if he is a gentleman or not. A real gentility never\n", "does the following things on the street, in presence of observers:--\n", "\n", "Never picks the teeth, nor scratches the head.\n", "\n", "Never swears or talks uproariously.\n", "\n", "Never picks the nose with the finger.\n", "\n", "Never smokes, or spits upon the walk, to the exceeding annoyance of\n", "those who are always disgusted with tobacco in any shape.\n", "\n", "Never stares at any one, man or woman, in a marked manner.\n", "\n", "Never scans a lady's dress impertinently, and makes no rude remarks\n", "about her.\n", "\n", "Never crowds before promenaders in a rough or hurried way.\n", "\n", "Never jostles a lady or gentleman without an \"excuse me.\"\n", "\n", "Never treads upon a lady's dress without begging pardon.\n", "\n", "Never loses temper, nor attracts attention by excited conversation.\n", "\n", "Never dresses in an odd or singular manner, so as to create remark.\n", "\n", "Never fails to raise his hat politely to a lady acquaintance; nor to\n", "a male friend who may be walking with a lady--it is a courtesy to the\n", "lady.\n", 
"\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Preparing our text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to identify and extract NPs, we need to perform four steps:\n", "\n", "> 1. _Tokenize_ the text into sentences.\n", "\n", "> 2. _Tokenize_ each sentence into words.\n", "\n", "> 3. Tag the words in each sentence for POS.\n", "\n", "> 4. Go through each sentence and _chunk_ NPs.\n", "\n", "The NLTK has corresponding functions and methods that can be used for each of these steps:\n", "\n", "> 1. `nltk.sent_tokenize()`\n", "\n", "> 2. `nltk.word_tokenize()`\n", "\n", "> 3. `nltk.pos_tag()`\n", "\n", "> 4. `NPChunker.parse()` (a method of the parser we created in step 2)\n", "\n", "Let's play with each in turn to see what they do." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['If you wish to make yourself agreeable to a lady, turn the conversation adroitly upon taste, or art, or books, or persons, or events of the day.\\n']" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.sent_tokenize(etiquette_excerpt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `nltk.sent_tokenize()` function takes a text and breaks it up into a list of sentences. Our excerpt is a single sentence, so the list contains one item.\n", "\n", "Now we can break the sentence into words:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[['If',\n", " 'you',\n", " 'wish',\n", " 'to',\n", " 'make',\n", " 'yourself',\n", " 'agreeable',\n", " 'to',\n", " 'a',\n", " 'lady',\n", " ',',\n", " 'turn',\n", " 'the',\n", " 'conversation',\n", " 'adroitly',\n", " 'upon',\n", " 'taste',\n", " ',',\n", " 'or',\n", " 'art',\n", " ',',\n", " 'or',\n", " 'books',\n", " ',',\n", " 'or',\n", " 'persons',\n", " ',',\n", " 'or',\n", " 'events',\n", " 'of',\n", " 'the',\n", " 'day',\n", " '.']]" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"tokenized_sentence = nltk.sent_tokenize(etiquette_excerpt)\n", "\n", "[nltk.word_tokenize(sentence) for sentence in tokenized_sentence]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an improvement upon the `str.split()` method we tried up above. Notice that punctuation is now split off into tokens of its own rather than left attached to the words. We have written the call as a list comprehension, so the result is a list of sentences, each of which is itself a list of tokens. This will be more useful when we are tokenizing more than one sentence at a time.\n", "\n", "Our next step is to try some part of speech (POS) tagging, using `nltk.pos_tag()`." ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[[('If', 'IN'),\n", " ('you', 'PRP'),\n", " ('wish', 'VBP'),\n", " ('to', 'TO'),\n", " ('make', 'VB'),\n", " ('yourself', 'PRP'),\n", " ('agreeable', 'JJ'),\n", " ('to', 'TO'),\n", " ('a', 'DT'),\n", " ('lady', 'NN'),\n", " (',', ','),\n", " ('turn', 'VB'),\n", " ('the', 'DT'),\n", " ('conversation', 'NN'),\n", " ('adroitly', 'RB'),\n", " ('upon', 'IN'),\n", " ('taste', 'NN'),\n", " (',', ','),\n", " ('or', 'CC'),\n", " ('art', 'NN'),\n", " (',', ','),\n", " ('or', 'CC'),\n", " ('books', 'NNS'),\n", " (',', ','),\n", " ('or', 'CC'),\n", " ('persons', 'NNS'),\n", " (',', ','),\n", " ('or', 'CC'),\n", " ('events', 'NNS'),\n", " ('of', 'IN'),\n", " ('the', 'DT'),\n", " ('day', 'NN'),\n", " ('.', '.')]]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized_sentence = nltk.sent_tokenize(etiquette_excerpt)\n", "\n", "tokenized_words = [nltk.word_tokenize(sentence) for sentence in tokenized_sentence]\n", "\n", "[nltk.pos_tag(word) for word in tokenized_words]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we're cooking!\n", "\n", "In three lines of code, we've gone from a complete sentence to a sentence in which each word is tagged for part of speech. 
(The accuracy of the POS tagging is another matter for another tutorial.)\n", "\n", "Let's move on to the last step, looking for NPs in the processed sentence. We will do this by using the `NPChunker` that we defined above." ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[Tree('S', [('If', 'IN'), ('you', 'PRP'), ('wish', 'VBP'), ('to', 'TO'), ('make', 'VB'), ('yourself', 'PRP'), ('agreeable', 'JJ'), ('to', 'TO'), ('a', 'DT'), Tree('NP', [('lady', 'NN')]), (',', ','), ('turn', 'VB'), ('the', 'DT'), Tree('NP', [('conversation', 'NN')]), ('adroitly', 'RB'), ('upon', 'IN'), Tree('NP', [('taste', 'NN')]), (',', ','), ('or', 'CC'), Tree('NP', [('art', 'NN')]), (',', ','), ('or', 'CC'), ('books', 'NNS'), (',', ','), ('or', 'CC'), ('persons', 'NNS'), (',', ','), ('or', 'CC'), ('events', 'NNS'), ('of', 'IN'), ('the', 'DT'), Tree('NP', [('day', 'NN')]), ('.', '.')])]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenized_sentence = nltk.sent_tokenize(etiquette_excerpt)\n", "tokenized_words = [nltk.word_tokenize(sentence) for sentence in tokenized_sentence]\n", "tagged_words = [nltk.pos_tag(word) for word in tokenized_words]\n", "[NPChunker.parse(word) for word in tagged_words]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fantastic! At first glance, this output is a mess, and in real-world applications we probably wouldn't ever see it. However, it's worthwhile to take a look at it now to see how the NLTK is using our `patterns` to organize the text and to preview what data will (or should be) extracted by the rest of our code." 
] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def prepare_text(text):\n", "    tokenized_sentence = nltk.sent_tokenize(text)  # Tokenize the text into sentences.\n", "    tokenized_words = [nltk.word_tokenize(sentence) for sentence in tokenized_sentence]  # Tokenize words in sentences.\n", "    tagged_words = [nltk.pos_tag(word) for word in tokenized_words]  # Tag words for POS in each sentence.\n", "    word_tree = [NPChunker.parse(word) for word in tagged_words]  # Identify NP chunks.\n", "    return word_tree  # Return the tagged & chunked sentences." ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[Tree('S', [('Good', 'NNP'), Tree('NP', [('behavior', 'NN')]), ('upon', 'IN'), ('the', 'DT'), Tree('NP', [('street', 'NN')]), (',', ','), ('or', 'CC'), Tree('NP', [('public', 'JJ'), ('promenade', 'NN')]), (',', ','), ('marks', 'VBZ'), ('the', 'DT'), Tree('NP', [('gentleman', 'NN')]), ('most', 'JJS'), ('effectually', 'RB'), (';', ':'), Tree('NP', [('rudeness', 'NN')]), (',', ','), Tree('NP', [('incivility', 'NN')]), (',', ','), Tree('NP', [('disregard', 'NN')]), ('of', 'IN'), ('``', '``'), ('what', 'WP'), ('the', 'DT'), Tree('NP', [('world', 'NN')]), ('says', 'VBZ'), (',', ','), (\"''\", \"''\"), ('marks', 'VBZ'), ('the', 'DT'), Tree('NP', [('person', 'NN')]), ('of', 'IN'), Tree('NP', [('low', 'JJ'), ('breeding', 'NN')]), ('.', '.')]),\n", " Tree('S', [('We', 'PRP'), ('always', 'RB'), ('know', 'VBP'), (',', ','), ('in', 'IN'), Tree('NP', [('walking', 'NN')]), ('a', 'DT'), Tree('NP', [('square', 'NN')]), ('with', 'IN'), ('a', 'DT'), Tree('NP', [('man', 'NN')]), (',', ','), ('if', 'IN'), ('he', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), Tree('NP', [('gentleman', 'NN')]), ('or', 'CC'), ('not', 'RB'), ('.', '.')]),\n", " Tree('S', [('A', 'DT'), Tree('NP', [('real', 'JJ'), ('gentility', 'NN')]), ('never', 'RB'), ('does', 'VBZ'), ('the', 'DT'), 
('following', 'VBG'), ('things', 'NNS'), ('on', 'IN'), ('the', 'DT'), Tree('NP', [('street', 'NN')]), (',', ','), ('in', 'IN'), Tree('NP', [('presence', 'NN')]), ('of', 'IN'), ('observers', 'NNS'), (':', ':'), ('--', ':'), ('Never', 'RB'), ('picks', 'VBZ'), ('the', 'DT'), Tree('NP', [('teeth', 'NN')]), (',', ','), ('nor', 'CC'), ('scratches', 'NNS'), ('the', 'DT'), Tree('NP', [('head', 'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('swears', 'VBZ'), ('or', 'CC'), ('talks', 'NNS'), ('uproariously', 'RB'), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('picks', 'VBZ'), ('the', 'DT'), Tree('NP', [('nose', 'NN')]), ('with', 'IN'), ('the', 'DT'), Tree('NP', [('finger', 'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('smokes', 'VBZ'), (',', ','), ('or', 'CC'), ('spits', 'NNS'), ('upon', 'IN'), ('the', 'DT'), Tree('NP', [('walk', 'NN')]), (',', ','), ('to', 'TO'), ('the', 'DT'), Tree('NP', [('exceeding', 'NN'), ('annoyance', 'NN')]), ('of', 'IN'), ('those', 'DT'), ('who', 'WP'), ('are', 'VBP'), ('always', 'RB'), ('disgusted', 'VBN'), ('with', 'IN'), Tree('NP', [('tobacco', 'NN')]), ('in', 'IN'), ('any', 'DT'), Tree('NP', [('shape', 'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('stares', 'VBZ'), ('at', 'IN'), ('any', 'DT'), ('one', 'CD'), (',', ','), Tree('NP', [('man', 'NN')]), ('or', 'CC'), Tree('NP', [('woman', 'NN')]), (',', ','), ('in', 'IN'), ('a', 'DT'), Tree('NP', [('marked', 'JJ'), ('manner', 'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('scans', 'VBZ'), ('a', 'DT'), Tree('NP', [('lady', 'NN')]), (\"'s\", 'POS'), Tree('NP', [('dress', 'NN')]), ('impertinently', 'RB'), (',', ','), ('and', 'CC'), ('makes', 'VBZ'), ('no', 'DT'), Tree('NP', [('rude', 'NN')]), ('remarks', 'VBZ'), ('about', 'IN'), ('her', 'PRP$'), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('crowds', 'VBZ'), ('before', 'IN'), ('promenaders', 'NNS'), ('in', 'IN'), ('a', 'DT'), ('rough', 'JJ'), ('or', 'CC'), Tree('NP', [('hurried', 'JJ'), ('way', 
'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('jostles', 'VBZ'), ('a', 'DT'), Tree('NP', [('lady', 'NN')]), ('or', 'CC'), Tree('NP', [('gentleman', 'NN')]), ('without', 'IN'), ('an', 'DT'), ('``', '``'), Tree('NP', [('excuse', 'NN')]), ('me', 'PRP'), ('.', '.')]),\n", " Tree('S', [('``', '``'), ('Never', 'RB'), ('treads', 'VBZ'), ('upon', 'IN'), ('a', 'DT'), Tree('NP', [('lady', 'NN')]), (\"'s\", 'POS'), Tree('NP', [('dress', 'NN')]), ('without', 'IN'), Tree('NP', [('begging', 'NN'), ('pardon', 'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('loses', 'VBZ'), ('temper', 'JJR'), (',', ','), ('nor', 'CC'), ('attracts', 'NNS'), Tree('NP', [('attention', 'NN')]), ('by', 'IN'), ('excited', 'VBN'), Tree('NP', [('conversation', 'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('dresses', 'VBZ'), ('in', 'IN'), ('an', 'DT'), ('odd', 'JJ'), ('or', 'CC'), Tree('NP', [('singular', 'JJ'), ('manner', 'NN')]), (',', ','), ('so', 'RB'), ('as', 'IN'), ('to', 'TO'), ('create', 'VB'), Tree('NP', [('remark', 'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('fails', 'VBZ'), ('to', 'TO'), ('raise', 'VB'), ('his', 'PRP$'), Tree('NP', [('hat', 'NN')]), ('politely', 'RB'), ('to', 'TO'), ('a', 'DT'), Tree('NP', [('lady', 'NN'), ('acquaintance', 'NN')]), (';', ':'), ('nor', 'CC'), ('to', 'TO'), ('a', 'DT'), Tree('NP', [('male', 'JJ'), ('friend', 'NN')]), ('who', 'WP'), ('may', 'MD'), ('be', 'VB'), ('walking', 'VBG'), ('with', 'IN'), ('a', 'DT'), Tree('NP', [('lady', 'NN')]), ('--', ':'), ('it', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), Tree('NP', [('courtesy', 'NN')]), ('to', 'TO'), ('the', 'DT'), Tree('NP', [('lady', 'NN')]), ('.', '.')])]" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prepare_text(sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At first glance, this output is a mess, and in real-world applications we probably wouldn't ever see it. 
However, it's worthwhile to take a look at it now to see how the NLTK is using our `patterns` to organize the text and to preview what data will (or should) be extracted by the rest of our code.\n", "\n", "Consider the following sentence:\n", "\n", ">Never smokes, or spits upon the walk, to the exceeding annoyance of\n", "those who are always disgusted with tobacco in any shape.\n", "\n", "Once processed and organized into a Tree, the sentence, indicated by `'S'`, is divided into `(word, part of speech)` tuples. The NLTK's output contains the following NPs:\n", "\n", "```python\n", "Tree('NP', [('walk', 'NN')]),\n", "Tree('NP', [('exceeding', 'NN'), ('annoyance', 'NN')]),\n", "Tree('NP', [('tobacco', 'NN')]),\n", "Tree('NP', [('shape', 'NN')])\n", "```\n", "\n", "Notice that each NP is identified as its own Tree, while no other part of speech in the sentence is organized into trees. This is the power of our `patterns`; `nltk.RegexpParser(patterns)` looks for chunks of text we defined as NPs and organizes them as NPs, each of which gets its own tree (or subtree) within the greater Tree that makes up a sentence.\n", "\n", "You may also have noticed that `'those who are always disgusted'` was not recognized as an NP, meaning that we would need to adjust our `patterns`." ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [], "source": [ "new_patterns = \"\"\"\n", "    NP: {<DT><WP><VBP>*<RB>*<VBN><IN><NN>}\n", "        {<NN.*><IN>*<NN.*>+}\n", "        {<JJ>*<NN.*><CC>*<NN.*>+}\n", "        {<JJ>*<NN.*>+}\n", "    \n", "    \"\"\"\n", "\n", "new_NPChunker = nltk.RegexpParser(new_patterns)\n", "\n", "def prepare_text(text):\n", "    tokenized_sentence = nltk.sent_tokenize(text)  # Tokenize the text into sentences.\n", "    tokenized_words = [nltk.word_tokenize(sentence) for sentence in tokenized_sentence]  # Tokenize words in sentences.\n", "    tagged_words = [nltk.pos_tag(word) for word in tokenized_words]  # Tag words for POS in each sentence.\n", "    word_tree = [new_NPChunker.parse(word) for word in tagged_words]  # Identify NP chunks.\n", "    return word_tree  # Return the tagged & chunked sentences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that our new patterns have been added at the start of our list of NP rules. The order matters: the NLTK applies the rules one after another, and a word that an earlier rule has already placed in a chunk is no longer available to later rules.\n", "\n", "Let's try running our function again to see if this worked." ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[Tree('S', [Tree('NP', [('Good', 'NNP'), ('behavior', 'NN')]), ('upon', 'IN'), ('the', 'DT'), Tree('NP', [('street', 'NN')]), (',', ','), ('or', 'CC'), Tree('NP', [('public', 'JJ'), ('promenade', 'NN')]), (',', ','), ('marks', 'VBZ'), ('the', 'DT'), Tree('NP', [('gentleman', 'NN')]), ('most', 'JJS'), ('effectually', 'RB'), (';', ':'), Tree('NP', [('rudeness', 'NN')]), (',', ','), Tree('NP', [('incivility', 'NN')]), (',', ','), Tree('NP', [('disregard', 'NN')]), ('of', 'IN'), ('``', '``'), ('what', 'WP'), ('the', 'DT'), Tree('NP', [('world', 'NN')]), ('says', 'VBZ'), (',', ','), (\"''\", \"''\"), ('marks', 'VBZ'), ('the', 'DT'), Tree('NP', [('person', 'NN')]), ('of', 'IN'), Tree('NP', [('low', 'JJ'), ('breeding', 'NN')]), ('.', '.')]),\n", " Tree('S', [('We', 'PRP'), ('always', 'RB'), ('know', 'VBP'), (',', ','), ('in', 'IN'), Tree('NP', [('walking', 'NN')]), ('a', 'DT'), Tree('NP', [('square', 'NN')]), 
('with', 'IN'), ('a', 'DT'), Tree('NP', [('man', 'NN')]), (',', ','), ('if', 'IN'), ('he', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), Tree('NP', [('gentleman', 'NN')]), ('or', 'CC'), ('not', 'RB'), ('.', '.')]),\n", " Tree('S', [('A', 'DT'), Tree('NP', [('real', 'JJ'), ('gentility', 'NN')]), ('never', 'RB'), ('does', 'VBZ'), ('the', 'DT'), ('following', 'VBG'), Tree('NP', [('things', 'NNS')]), ('on', 'IN'), ('the', 'DT'), Tree('NP', [('street', 'NN')]), (',', ','), ('in', 'IN'), Tree('NP', [('presence', 'NN'), ('of', 'IN'), ('observers', 'NNS')]), (':', ':'), ('--', ':'), ('Never', 'RB'), ('picks', 'VBZ'), ('the', 'DT'), Tree('NP', [('teeth', 'NN')]), (',', ','), ('nor', 'CC'), Tree('NP', [('scratches', 'NNS')]), ('the', 'DT'), Tree('NP', [('head', 'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('swears', 'VBZ'), ('or', 'CC'), Tree('NP', [('talks', 'NNS')]), ('uproariously', 'RB'), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('picks', 'VBZ'), ('the', 'DT'), Tree('NP', [('nose', 'NN')]), ('with', 'IN'), ('the', 'DT'), Tree('NP', [('finger', 'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('smokes', 'VBZ'), (',', ','), ('or', 'CC'), Tree('NP', [('spits', 'NNS')]), ('upon', 'IN'), ('the', 'DT'), Tree('NP', [('walk', 'NN')]), (',', ','), ('to', 'TO'), ('the', 'DT'), Tree('NP', [('exceeding', 'NN'), ('annoyance', 'NN')]), ('of', 'IN'), Tree('NP', [('those', 'DT'), ('who', 'WP'), ('are', 'VBP'), ('always', 'RB'), ('disgusted', 'VBN'), ('with', 'IN'), ('tobacco', 'NN')]), ('in', 'IN'), ('any', 'DT'), Tree('NP', [('shape', 'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('stares', 'VBZ'), ('at', 'IN'), ('any', 'DT'), ('one', 'CD'), (',', ','), Tree('NP', [('man', 'NN'), ('or', 'CC'), ('woman', 'NN')]), (',', ','), ('in', 'IN'), ('a', 'DT'), Tree('NP', [('marked', 'JJ'), ('manner', 'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('scans', 'VBZ'), ('a', 'DT'), Tree('NP', [('lady', 'NN')]), (\"'s\", 'POS'), Tree('NP', [('dress', 'NN')]), 
('impertinently', 'RB'), (',', ','), ('and', 'CC'), ('makes', 'VBZ'), ('no', 'DT'), Tree('NP', [('rude', 'NN')]), ('remarks', 'VBZ'), ('about', 'IN'), ('her', 'PRP$'), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('crowds', 'VBZ'), ('before', 'IN'), Tree('NP', [('promenaders', 'NNS')]), ('in', 'IN'), ('a', 'DT'), ('rough', 'JJ'), ('or', 'CC'), Tree('NP', [('hurried', 'JJ'), ('way', 'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('jostles', 'VBZ'), ('a', 'DT'), Tree('NP', [('lady', 'NN'), ('or', 'CC'), ('gentleman', 'NN')]), ('without', 'IN'), ('an', 'DT'), ('``', '``'), Tree('NP', [('excuse', 'NN')]), ('me', 'PRP'), ('.', '.')]),\n", " Tree('S', [('``', '``'), ('Never', 'RB'), ('treads', 'VBZ'), ('upon', 'IN'), ('a', 'DT'), Tree('NP', [('lady', 'NN')]), (\"'s\", 'POS'), Tree('NP', [('dress', 'NN'), ('without', 'IN'), ('begging', 'NN'), ('pardon', 'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('loses', 'VBZ'), ('temper', 'JJR'), (',', ','), ('nor', 'CC'), Tree('NP', [('attracts', 'NNS'), ('attention', 'NN')]), ('by', 'IN'), ('excited', 'VBN'), Tree('NP', [('conversation', 'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('dresses', 'VBZ'), ('in', 'IN'), ('an', 'DT'), ('odd', 'JJ'), ('or', 'CC'), Tree('NP', [('singular', 'JJ'), ('manner', 'NN')]), (',', ','), ('so', 'RB'), ('as', 'IN'), ('to', 'TO'), ('create', 'VB'), Tree('NP', [('remark', 'NN')]), ('.', '.')]),\n", " Tree('S', [('Never', 'RB'), ('fails', 'VBZ'), ('to', 'TO'), ('raise', 'VB'), ('his', 'PRP$'), Tree('NP', [('hat', 'NN')]), ('politely', 'RB'), ('to', 'TO'), ('a', 'DT'), Tree('NP', [('lady', 'NN'), ('acquaintance', 'NN')]), (';', ':'), ('nor', 'CC'), ('to', 'TO'), ('a', 'DT'), Tree('NP', [('male', 'JJ'), ('friend', 'NN')]), ('who', 'WP'), ('may', 'MD'), ('be', 'VB'), ('walking', 'VBG'), ('with', 'IN'), ('a', 'DT'), Tree('NP', [('lady', 'NN')]), ('--', ':'), ('it', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), Tree('NP', [('courtesy', 'NN')]), ('to', 'TO'), ('the', 'DT'), 
Tree('NP', [('lady', 'NN')]), ('.', '.')])]" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prepare_text(sample_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Converting Nouns to NPs" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sentences = prepare_text(sample_text)\n", "\n", "def return_a_list_of_NPs(sentences):\n", "    nps = []  # An empty list in which the NPs will be stored.\n", "    for sent in sentences:\n", "        tree = NPChunker.parse(sent)\n", "        for subtree in tree.subtrees():\n", "            if subtree.label() == 'NP':  # NLTK 3 uses .label() in place of the old .node attribute.\n", "                t = subtree\n", "                t = ' '.join(word for word, tag in t.leaves())  # Join the chunk's words into one string.\n", "                nps.append(t)\n", "    return nps" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "['Good behavior',\n", " 'street',\n", " 'public promenade',\n", " 'gentleman',\n", " 'rudeness',\n", " 'incivility',\n", " 'disregard',\n", " 'world',\n", " 'person',\n", " 'low breeding',\n", " 'walking',\n", " 'square',\n", " 'man',\n", " 'gentleman',\n", " 'real gentility',\n", " 'things',\n", " 'street',\n", " 'presence of observers',\n", " 'teeth',\n", " 'scratches',\n", " 'head',\n", " 'talks',\n", " 'nose',\n", " 'finger',\n", " 'spits',\n", " 'walk',\n", " 'exceeding annoyance',\n", " 'those who are always disgusted with tobacco',\n", " 'shape',\n", " 'man or woman',\n", " 'marked manner',\n", " 'lady',\n", " 'dress',\n", " 'rude',\n", " 'promenaders',\n", " 'hurried way',\n", " 'lady or gentleman',\n", " 'excuse',\n", " 'lady',\n", " 'dress without begging pardon',\n", " 'attracts attention',\n", " 'conversation',\n", " 'singular manner',\n", " 'remark',\n", " 'hat',\n", " 'lady acquaintance',\n", " 'male friend',\n", " 'lady',\n", " 'courtesy',\n", " 'lady']" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [
"return_a_list_of_NPs(sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a pretty good list of NPs from the text. Let's look at `sample_text` again…this time, see if there are any NPs that _should_ have appeared above but didn't. How could add to or revise our `patterns` to make sure they were included?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Final Code:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "import nltk\n", "import re\n", "import pprint\n", "from nltk import Tree\n", "\n", "patterns = \"\"\"\n", " NP: {*+}\n", " {**+}\n", " \"\"\"\n", "\n", "NPChunker = nltk.RegexpParser(patterns)\n", "\n", "def prepare_text(input):\n", " sentences = nltk.sent_tokenize(input)\n", " sentences = [nltk.word_tokenize(sent) for sent in sentences]\n", " sentences = [nltk.pos_tag(sent) for sent in sentences]\n", " sentences = [NPChunker.parse(sent) for sent in sentences]\n", " return sentences\n", "\n", "\n", "def parsed_text_to_NP(sentences):\n", " nps = []\n", " for sent in sentences:\n", " tree = NPChunker.parse(sent)\n", " for subtree in tree.subtrees():\n", " if subtree.node == 'NP':\n", " t = subtree\n", " t = ' '.join(word for word, tag in t.leaves())\n", " nps.append(t)\n", " return nps\n", "\n", "\n", "def sent_parse(input):\n", " sentences = prepare_text(input)\n", " nps = parsed_text_to_NP(sentences)\n", " return nps\n", " \n", "def find_nps(text):\n", " prepared = prepare_text(text)\n", " parsed = parsed_text_to_NP(prepared)\n", " final = sent_parse(parsed)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I hope the above has given you a clear idea of how the NLTK works and how it can be useful to look at chunks of language instead of single words. If you've got any questions, please don't hesitate to get in touch via [twitter](http://twitter.com/lukewrites/)." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }