{ "metadata": { "name": "", "signature": "sha256:c66209ccfab921587d237e57fa68c3d27cbe5b9f603617d40b431b963cf5cd66" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Good morning!\n", "=============\n", "\n", "PRELIMINARIES\n", "=============\n", "\n", "I have (I think/hope) about 90 minutes of material. This session is supposed to go for about 110 minutes. I thought we could break it up as follows:\n", "\n", "* 30 minutes content\n", "* 5 minutes\n", "* 30 minutes content\n", "* 5 minutes\n", "* 30 minutes content\n", "* 10 minutes questions/discussion\n", "\n", "About me\n", "--------\n", "\n", "* Douglas Starnes\n", "* Pythonic polyglot 'ninja'\n", "* Memphis, TN\n", "* Programming languages, mobile, 'extreme web', data\n", "* Speaker\n", "* Co-director of Memphis Python User Group (MEMpy)\n", " * [@MemphisPython](http://twitter.com/MemphisPython)\n", " * [MEMPy](http://mempy.org)\n", "* PyTennessee staff\n", " * [@pytennessee](http://twitter.com/pytennessee)\n", " * [PyTennessee](http://pytennessee.org)\n", "\n", "Contact\n", "-------\n", "* douglas@douglasstarnes.com (douglas.a.starnes@gmail.com)\n", "* http://douglasstarnes.com\n", "* @poweredbyaltnet\n", "\n", "This talk is about introductory natural langauge techniques using Python. Most of the talk will focus on NLTK but towards the end we'll look at a few other frameworks such as scikit-learn, TextBlob and BeautifulSoup.\n", "\n", "NLTK is the Natural Language ToolKit. It is a set of software, data and documentation to perform Natural Language Processing (NLP) in Python.\n", "\n", "Natural language is spoken and/or written language used by humans for communication. We will be focusing on English (because today's host only speaks English) but NLTK has some facilities for other languages as well.\n", "\n", "Natural language processing is the automated (usually by a computer) analysis of natural language.\n", "\n", "Uses of natural language processing:\n", "------------------------------------\n", "\n", "* essay evaluation\n", "* information retrieval\n", "* question answering\n", "* sentiment analysis\n", "* spam detection\n", "* optical character recognition\n", "* speech recognition\n", "\n", "Getting NLTK\n", "------------\n", "\n", "You must have Python version 2.6+ or 3.2+ (for NLTK 3) After that installation is simple:\n", "\n", " pip install nltk\n", "\n", "NLTK has soft dependencies on numpy and matplotlib.\n", "\n", "You'll want to have some data, or corpora to experiment with:\n", "\n", " nltk.download_gui()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CORPORA\n", "=======\n", "\n", "NLTK comes with over over 40 corpora or bodies of text that we can experiment with. The corpora have a variety of features. Here are a few of the more popular ones:\n", "\n", "* Gutenberg\n", "* Brown\n", "* Stopwords\n", "* Penn TreeBank\n", "* WordNet\n", "\n", "Corpora are imported as modules:\n", "\n", "Let's take a look at a few of them in code.\n", "\n", "Gutenberg\n", "---------" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import nltk\n", "from nltk.corpus import gutenberg\n", "gutenberg" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Gutenberg corpus is a **`PlaintextCorpusReader`** and contains a collection of text documents with no metadata." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "for fileid in gutenberg.fileids():\n", " print fileid" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "fileid = 'shakespeare-macbeth.txt'\n", "words = gutenberg.words(fileid)\n", "words[:20]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": true, "input": [ "raw = gutenberg.raw(fileid)\n", "raw[:1000]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "len(words)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "sents = gutenberg.sents(fileid)\n", "sents" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "len(sents)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": true, "input": [ "len(raw)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stopwords\n", "---------\n", "A 'stopword' is a word that usually is filtered out. This includes words such as 'the', 'and', 'or', etc. Note that there is no definitive list but NLTK has a fairly complete one." ] }, { "cell_type": "code", "collapsed": true, "input": [ "from nltk.corpus import stopwords\n", "words = stopwords.words('english')\n", "for word in words[:25]: print word," ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NLTK also has lists of stopwords in many languages." ] }, { "cell_type": "code", "collapsed": true, "input": [ "languages = stopwords.fileids()\n", "for lang in languages:\n", " words = stopwords.words(lang)\n", " print lang\n", " for word in words[:25]: print word,\n", " print '\\n'" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Penn Treebank \n", "-------------\n", "\n", "The Penn Treebank organizes sentences into trees. [Example](http://en.wikipedia.org/wiki/Treebank#mediaviewer/File:The_house_at_the_end_of_the_street.jpg)\n", "\n", "Another valuable feature of the Penn Treebank is extensive part of speech tagging. [List](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)" ] }, { "cell_type": "code", "collapsed": true, "input": [ "from nltk.corpus import treebank\n", "words = treebank.tagged_words()[:25]\n", "print ', '.join([\"('%s', '%s')\" % (word[0], word[1]) for word in words])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": true, "input": [ "sent = treebank.tagged_sents()[0]\n", "sent" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use the Penn Treebank again to tag our own text." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wordnet\n", "-------\n", "\n", "Wordnet is a different animal. This corpus contains what are called 'synsets' or synonym rings. 
To get the synset for a particular word:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from nltk.corpus import wordnet\n", "good_synsets = wordnet.synsets('good')\n", "for synset in good_synsets[:20]:\n", "    print synset.name\n", "    print synset.pos\n", "    print synset.definition\n", "    print '-' * 40" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that each synset has a definition and a part of speech or POS tag.\n", "\n", "#### Wordnet Part Of Speech (POS) Tags:\n", "\n", "* n - noun\n", "* v - verb\n", "* a - adjective\n", "* s - adjective satellite\n", "* r - adverb\n", "\n", "Suppose we are only interested in nouns." ] }, { "cell_type": "code", "collapsed": true, "input": [ "good_nouns = wordnet.synsets('good', pos='n')\n", "for noun in good_nouns:\n", "    print noun.name\n", "    print noun.pos\n", "    print noun.definition\n", "    print '-' * 40" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each synset also (potentially) has examples of usage." ] }, { "cell_type": "code", "collapsed": true, "input": [ "for noun in good_nouns:\n", "    print noun.definition\n", "    if len(noun.examples) == 0:\n", "        print '**NO EXAMPLES'\n", "    else:\n", "        for example in noun.examples:\n", "            print example\n", "    print '-' * 40" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each synset can also have a set of lemmas, and each lemma can have a set of antonyms. An antonym is itself a lemma, with a structure like that of a synset." ] }, { "cell_type": "code", "collapsed": true, "input": [ "noun = good_nouns[1]\n", "\n", "for lemma in noun.lemmas:\n", "    print lemma.name\n", "    antonyms = lemma.antonyms()\n", "    for antonym in antonyms: print '-', antonym.name" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since you'll probably have your own domain-specific documents that you want to work with, NLTK lets you create custom corpora; the **`PlaintextCorpusReader`** we saw earlier is one way to do it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "STRING PROCESSING\n", "=================\n", "\n", "Tokenizing\n", "----------\n", "\n", "Tokenization is the process of breaking a body of text into meaningful pieces. This is similar to parsing a math equation into tokens such as operators and operands. With natural language we might want sentences or words. NLTK provides many ways to tokenize text. [Gutenberg](http://gutenberg.org/catalog)" ] }, { "cell_type": "code", "collapsed": true, "input": [ "url = 'http://www.gutenberg.org/cache/epub/1661/pg1661.txt' # The Adventures of Sherlock Holmes\n", "import urllib\n", "book = urllib.urlopen(url).read()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Obviously, we could try a naive method." ] }, { "cell_type": "code", "collapsed": true, "input": [ "tokens = book.split(' ')\n", "print tokens[:100]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This has some drawbacks: punctuation and line breaks stay stuck to the words. The NLTK tokenizers eliminate a lot of that noise for us.\n", "\n", "We can tokenize by sentences."
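] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One note first: **`sent_tokenize`** relies on the pre-trained Punkt model that is part of the downloadable NLTK data. If you skipped the download step earlier, the sketch below (assuming you have network access) fetches just that package without opening the GUI." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import nltk\n", "\n", "# fetch only the Punkt sentence tokenizer models\n", "nltk.download('punkt')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the model in place, here are the sentences:"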
] }, { "cell_type": "code", "collapsed": true, "input": [ "from nltk import sent_tokenize\n", "sents = sent_tokenize(book)\n", "print 'There are', len(sents), 'sentences'\n", "print sents[1002:1004]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And words." ] }, { "cell_type": "code", "collapsed": true, "input": [ "from nltk import word_tokenize\n", "words = word_tokenize(book)\n", "print 'There are', len(words), 'words'\n", "print word_tokenize(''.join(sents[1002:1004]))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **`word_tokenize`** function uses a **`TreeBankTokenizer`** to generate tokens. As you can see it separates contractions. To avoid this we could use a **`WhitespaceTokenizer`**." ] }, { "cell_type": "code", "collapsed": true, "input": [ "from nltk.tokenize import WhitespaceTokenizer\n", "tokenizer = WhitespaceTokenizer()\n", "print tokenizer.tokenize(''.join(sents[1002:1004]))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **`WhitespaceTokenizer`** only looks at whitespace characters and does not tokenize punctuation. The **`WordPunctTokenizer`** will create a token." ] }, { "cell_type": "code", "collapsed": true, "input": [ "from nltk.tokenize import WordPunctTokenizer\n", "tokenizer = WordPunctTokenizer()\n", "print tokenizer.tokenize(''.join(sents[1002:1004]))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stemming\n", "--------\n", "\n", "Stemming is the process of reducing inflected words to their root or stem. Inflection is the modification of a word to express tense, gender, number and others. For example, the stem of 'running' is 'run'. The following is a naive approach." ] }, { "cell_type": "code", "collapsed": true, "input": [ "def stem(word):\n", " for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:\n", " if word.endswith(suffix):\n", " return word[:-len(suffix)]\n", " return word\n", "\n", "print 'runs:', stem('runs')\n", "print 'played:', stem('played')\n", "print 'playing:', stem('playing')\n", "print 'running:', stem('running')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This does not work for all situations, especially those where the stem is modified. \n", "\n", "NLTK gives us several stemmer classes to accomplish this. The most common is the **`PorterStemmer`**. [Python](http://tartarus.org/martin/PorterStemmer/python.txt)" ] }, { "cell_type": "code", "collapsed": true, "input": [ "from nltk.stem import PorterStemmer\n", "stemmer = PorterStemmer()\n", "print 'talking:', stemmer.stem('talking')\n", "print 'talks:', stemmer.stem('talks')\n", "print 'talked:', stemmer.stem('talked')\n", "print 'running:', stemmer.stem('running')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NLTK also provides a **`RegexpStemmer`** which is more of a brute force method." 
] }, { "cell_type": "code", "collapsed": true, "input": [ "from nltk.stem import RegexpStemmer\n", "stemmer = RegexpStemmer('ing')\n", "print 'talking:', stemmer.stem('talking')\n", "print 'talks:', stemmer.stem('talks')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This has the limitation of working only with the specified suffix." ] }, { "cell_type": "code", "collapsed": true, "input": [ "stemmer = RegexpStemmer('ed')\n", "print 'talked:', stemmer.stem('talked')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And in instances where the stem itself is modified before adding a suffix:" ] }, { "cell_type": "code", "collapsed": true, "input": [ "stemmer = RegexpStemmer('ing')\n", "print 'running:', stemmer.stem('running')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **`SnowballStemmer`** is another place that NLTK supports multiple languages. [Conjugation](http://www.spanishdict.com/conjugation)" ] }, { "cell_type": "code", "collapsed": true, "input": [ "from nltk.stem import SnowballStemmer\n", "stemmer = SnowballStemmer('spanish')\n", "print 'hablar:', stemmer.stem('hablar') # to speak\n", "print 'hablo:', stemmer.stem('hablo') # I speak" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "COLLOCATION\n", "===========\n", "\n", "Words that appear in proximity are collocated. A collocation of two words is a bigram. NLTK makes this easy." ] }, { "cell_type": "code", "collapsed": true, "input": [ "from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures\n", "finder = BigramCollocationFinder.from_words(gutenberg.words('shakespeare-macbeth.txt'))\n", "bigram = BigramAssocMeasures()\n", "finder.nbest(bigram.pmi, 25) # pmi => pointwise mutual information, scoring method" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": true, "input": [ "from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder\n", "tri_finder = TrigramCollocationFinder.from_words(gutenberg.words('shakespeare-macbeth.txt'))\n", "trigram = TrigramAssocMeasures()\n", "tri_finder.nbest(trigram.pmi, 25)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": true, "input": [ "finder.apply_freq_filter(5)\n", "finder.nbest(bigram.pmi, 25)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": true, "input": [ "def remove_my(word):\n", " return word == 'my'\n", "\n", "finder.apply_word_filter(remove_my)\n", "finder.nbest(bigram.pmi, 25)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "POS TAGGING\n", "===========\n", "\n", "The easy way:\n", "-------------" ] }, { "cell_type": "code", "collapsed": false, "input": [ "nltk.pos_tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want more control:\n", "-------------------------" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from nltk.tag import DefaultTagger\n", "default_tagger = DefaultTagger('NN')\n", "sent = default_tagger.tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))\n", "sent" ], "language": 
"python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the DefaultTagger simple tags each token with the given tag. This is not very useful by itself (but we'll come back to it later). Let's try training a tagger based on the Penn treebank." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from nltk.tag import UnigramTagger\n", "tagger = UnigramTagger(treebank.tagged_sents())\n", "sent = tagger.tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))\n", "sent" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Penn treebank has about 3900 sentences so it's not going to have everything. That it didn't tag 'PyOhio' is not really a surprise but it also missed 'awesome'. \n", "\n", "Let's try a more well known set of words:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sent = tagger.tag(nltk.word_tokenize('The quick brown fox jumped over the lazy dog.'))\n", "sent" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And it didn't get all of these either.\n", "\n", "The Brown corpus is the first public corpus with over 1 million tagged words. Let's see if it can do any better. With 57K sentences, it will take a little longer to train." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from nltk.corpus import brown\n", "tagger = UnigramTagger(brown.tagged_sents(brown.fileids()))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "sent = tagger.tag(nltk.word_tokenize('The quick brown fox jumped over the lazy dog.'))\n", "sent" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's more like it! Let's try our other sentence:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sent = tagger.tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))\n", "sent" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not any better. \n", "\n", "The taggers we've seen are called **backoff taggers**. One benefit of these is they can be chained together and words not tagged in one tagger will be handled by the next tagger in the chain. Let's put our default tagger at the end of the chain and then any words missed will eventually be tagged 'NN'." ] }, { "cell_type": "code", "collapsed": false, "input": [ "tagger._taggers = [tagger, default_tagger]\n", "sent = tagger.tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))\n", "sent" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Taggers can also be pickled." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "import pickle\n", "f = open('/Users/douglasstarnes/nltk_data/taggers/dumbtagger.pickle', 'w')\n", "pickle.dump(tagger, f)\n", "f.close()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reason I stored the pickle in the nltk_data directory is so I could load it using the `nltk.tag.load` method:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "dumbtagger = nltk.tag.load('taggers/dumbtagger.pickle')\n", "sent = dumbtagger.tag(nltk.word_tokenize('PyOhio is an awesome software development conference.'))\n", "sent" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also remove tags with the `untag` function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "nltk.tag.untag(sent)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fortunately, we don't have to do that. The community (via Github) has provided us with scripts to automate the training process.\n", "[Github](https://github.com/japerk/nltk-trainer)\n", "\n", "`python train_tagger.py treebank`\n", "\n", "`loading treebank`\n", "\n", "`3914 tagged sents, training on 3914`\n", "\n", "`training AffixTagger with affix -3 and backoff `\n", "\n", "`training tagger with backoff `\n", "\n", "`training tagger with backoff `\n", "\n", "`training tagger with backoff `\n", "\n", "`evaluating TrigramTagger`\n", "\n", "`accuracy: 0.992362`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training the Brown corpus this way takes forever so I'm going to use the Penn treebank. It will save to the nltk_data so we can use the nltk library to load the pickle." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import nltk.data\n", "tagger = nltk.data.load('taggers/treebank_aubt.pickle')\n", "sent = tagger.tag(treebank.sents()[0])\n", "sent" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "words = nltk.word_tokenize('PyOhio is an awesome software development conference.')\n", "sent = tagger.tag(words)\n", "sent" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "words = nltk.word_tokenize('The quick brown fox jumped over the lazy dog.')\n", "sent = tagger.tag(words)\n", "sent" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CLASSIFICATION\n", "==============" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NLTK has several different classifiers. The most common one is the `NaiveBayesClassifier`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from nltk.corpus import movie_reviews\n", "import nltk.classify.util\n", "from nltk.classify import NaiveBayesClassifier\n", "\n", "p_f = [] # positive features\n", "n_f = [] # negative features" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to extract the features using the 'bag of words' method which just indicates word presence. Since the classifier requires a dict, we'll just make the value of each dict True." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "for fileid in movie_reviews.fileids('pos'):\n", " words = movie_reviews.words(fileid)\n", " words = dict([(word, True) for word in words])\n", " p_f.append((words, 'pos'))\n", " \n", "for fileid in movie_reviews.fileids('neg'):\n", " words = movie_reviews.words(fileid)\n", " words = dict([(word, True) for word in words])\n", " n_f.append((words, 'neg'))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use 90% to train the classifier and then 10% to test it." ] }, { "cell_type": "code", "collapsed": false, "input": [ "training_set = p_f[:900] + n_f[:900]\n", "test_set = p_f[900:] + n_f[900:]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we'll train the set and test the accuracy." ] }, { "cell_type": "code", "collapsed": false, "input": [ "classifier = NaiveBayesClassifier.train(training_set)\n", "nltk.classify.util.accuracy(classifier, test_set)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DOCUMENT SIMILARITY\n", "===================\n", "\n", "For this we can use a formula known as TFIDF or 'term frequency inverse document frequency'. This tells us the importance of a word in a document in a corpus.\n", "\n", "The term frequency is the number of times a word appears in a document and thus how important that word is to all others.\n", "\n", "The inverse document frequency is how important a word is in the corpus.\n", "\n", "The product of the two is TFIDF." ] }, { "cell_type": "code", "collapsed": true, "input": [ "import math\n", "from nltk.tokenize import WhitespaceTokenizer\n", "\n", "documents = [\n", " 'I like to play golf and tennis', \n", " 'The local court is a place I like', \n", " 'I do not like to play tennis but I like to play golf', \n", " 'My neighbor went bowling yesterday'\n", "]\n", "\n", "tokenizer = WhitespaceTokenizer()\n", "\n", "def tf(term, doc):\n", " words = tokenizer.tokenize(doc)\n", " terms = sum([1 for t in words if t == term])\n", " return float(terms) / float(len(words))\n", "\n", "def idf(term):\n", " docs = sum([1 for doc in documents if term in tokenizer.tokenize(doc)])\n", " return math.log(float(len(documents)) / float(1 + docs))\n", "\n", "def tfidf(term, doc):\n", " return tf(term, doc) * idf(term)\n", "\n", "for doc in documents: \n", " print tfidf('bowling', doc) # golf, play, tennis, bowling" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a production situation we would have a much larger corpus, filter out stopwords, normalize the text, etc. But this example fits on a screen.\n", "\n", "To compute document similarity, scikit-learn provides a nice implementation in **`TfidfVectorizer`**." 
] }, { "cell_type": "code", "collapsed": true, "input": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "vectorizer = TfidfVectorizer(min_df=1) # consider terms that occur one or more times\n", "doc_matrix = vectorizer.fit_transform(['happy dog', 'sad dog', 'happy cat', 'sad cat'])\n", "(doc_matrix * doc_matrix.T).A\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": true, "input": [ "vectorizer = TfidfVectorizer(min_df=1)\n", "doc_matrix = vectorizer.fit_transform(documents)\n", "(doc_matrix * doc_matrix.T).A" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "CLEANING UP YOUR CORPUS\n", "=======================" ] }, { "cell_type": "code", "collapsed": false, "input": [ "url_osu_home = 'http://www.osu.edu/'\n", "html = urllib.urlopen(url_osu_home).read()\n", "print html[:1000]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "html = nltk.util.clean_html(html)\n", "nltk.word_tokenize(html)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from bs4 import BeautifulSoup\n", "html = urllib.urlopen(url_osu_home).read()\n", "osu_home = BeautifulSoup(html)\n", "osu_home.title" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "metas = osu_home.findAll('meta')\n", "for tag in metas:\n", " try:\n", " print tag.attrs['content']\n", " except KeyError:\n", " pass" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TEXTBLOB\n", "========\n", "\n", "TextBlob is an NLP library built on top of NLTK and Pattern (see below). It has a simplified API." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from textblob import TextBlob\n", "text = 'the quick brown fox jumped over the lazy dog'\n", "blob = TextBlob(text)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "POS tagging is a breeze" ] }, { "cell_type": "code", "collapsed": false, "input": [ "blob.tags" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Translating between languages is a snap!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "fr = TextBlob(text)\n", "fr = fr.translate(to=\"fr\")\n", "print fr.string" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And it works with non-Western/European languages:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "cn = TextBlob(text)\n", "cn = cn.translate(to=\"zh-CN\")\n", "print cn.string" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And right to left languages:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "iw = TextBlob(text)\n", "iw = iw.translate(to=\"IW\")\n", "print iw.string" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Misspelled words? No problem!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "text = 'I spell like an amatur. I beleive it is not a serious issue. 
But ignorence is bliss.'\n", "blob = TextBlob(text)\n", "print text\n", "print blob.correct().string" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And remember all the trouble we went through with sentiment analysis before?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "blobs = [\n", " 'Today is a beautiful day.',\n", " 'Today is a terrible day.',\n", " 'Today is a rainy day.',\n", " 'I love rainy days.',\n", " 'I think rainy days are beautiful.',\n", " 'Today is a sunny day.',\n", " 'I think sunny days are awful.'\n", "]\n", "for blob in blobs:\n", " print '****',blob\n", " print TextBlob(blob).sentiment" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I WANT MORE!!!!\n", "===============\n", "\n", "* [NLTK Site](http://nltk.org)\n", "* [NLTK book -- FREE](http://nltk.org/book)\n", "* [NLTK book 1st ed. -- also free](http://www.nltk.org/book_1ed/)\n", "* [Chinese POS tagger](http://www.infochimps.com/datasets/chinese-part-of-speech-tagger-for-python-nltk)\n", "* [scikit-learn (TfidfVectorizer)](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)\n", "* [TextBlob](http://textblob.readthedocs.org/)\n", "* [Pattern](http://www.clips.ua.ac.be/pattern)\n", "* [Memory-Based Shallow Parsing](http://www.clips.ua.ac.be/pages/MBSP) [Demo](http://www.clips.ua.ac.be/cgi-bin/webdemo/MBSP-instant-webdemo.cgi)\n", "* Google and StackOverflow are your friends" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "THANK YOU!\n", "==========\n", "\n", "* douglas.a.starnes@gmail.com\n", "* http://douglasstarnes.com\n", "* @poweredbyaltnet" ] }, { "cell_type": "code", "collapsed": true, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }