{ "metadata": { "name": "", "signature": "sha256:cdad2565742d364a55a7872cccdf56899a8068f2f61efb6ee365ea523f1eb312" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Naive Bayes Teaching Example\n", "_Inpsired by [Krishna Sankar's](https://github.com/xsankar) pydata lecture _" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# This example is a quick and fun way to show applications of NB and how it works with NLTK.\n", "\n", "#!/usr/bin/env python\n", "#\n", "# pydata Tutorial March 18, 2013\n", "# Thanks to StreamHacker a.k.a. Jacob Perkins\n", "# Thanks to Prof.Todd Ebert\n", "#\n", "import nltk.classify.util\n", "from nltk.classify import NaiveBayesClassifier\n", "from nltk import tokenize" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "#\n", "# Data\n", "#\n", "label_1 = \"Bieber\"\n", "\n", "train_text_1 = \"When I met you girl my heart went knock knock \\\n", "Now them butterflies in my stomach won't stop stop \\\n", "And even though it's a struggle love is all we got \\\n", "So we goin keep keep climbin' to the mountain top\\\n", "\\\n", "Your world is my world \\\n", "And my fight is your fight \\\n", "My breath is your breath \\\n", "When you're hurt, I'm not right\"\n", "\n", "test_text_1 = \"You look so deep\\\n", "You know that it humbles me\\\n", "You're by my side and troubles -- they don't trouble me\\\n", "Many have called but the chosen is you\\\n", "Whatever you want shawty I'll give it to you\\\n", "\"\n", "\n", "\n", "label_2 = \"Obama\"\n", "\n", "\n", "# First try\n", "train_text_2 = \"Hi, everybody. In my State of the Union Address,\\\n", "I talked about pizza. More specifically, \\\n", "I talked about a pizza chain in Minneapolis \u2013 Punch Pizza \u2013 whose owner, \\\n", "John Soranno, made the business decision to give his employees a raise to ten bucks an hour.\\\n", "\\\n", "And while not all of us always see eye to eye politically, one thing we overwhelmingly agree on is that nobody who works full-time should ever have to live in poverty. That\u2019s why nearly three in four Americans support raising the minimum wage. The problem is, Republicans in Congress don\u2019t support raising the minimum wage. Some even want to get rid of it entirely. In Oklahoma, for example, the Republican governor just signed a law prohibiting cities from establishing their own minimum wage. \"\n", "\n", "test_text_2 = \"Republicans have voted more than 50 times to undermine or repeal health care for millions of Americans. They should vote at least once to raise the minimum wage for millions of working families. \"" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "#\n", "# For testing classification\n", "#\n", "classify_bieber = \"It's like an angel came by and took me to heaven, 'cause when i stare in your eyes it couldn't be better\"\n", "classify_obama = \"one thing we overwhelmingly agree on is that nobody who works full-time should ever have to live in poverty. .\"\n", "classify_other = \"What do you call a lazy baby kangaroo? A pouch potato.\"" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "#\n", "# Take a list of words and turn it into a Bag Of Words\n", "#\n", "def bag_of_words(words):\n", " return dict([(word, True) for word in words])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "#\n", "# Program Starts Here\n", "#\n", "#\n", "# Step 1 : Feature Extraction\n", "#\n", "train_words_1 = tokenize.word_tokenize(train_text_1)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "# Let's take a look at what tokenize does:\n", "train_words_1" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ "['When',\n", " 'I',\n", " 'met',\n", " 'you',\n", " 'girl',\n", " 'my',\n", " 'heart',\n", " 'went',\n", " 'knock',\n", " 'knock',\n", " 'Now',\n", " 'them',\n", " 'butterflies',\n", " 'in',\n", " 'my',\n", " 'stomach',\n", " 'wo',\n", " \"n't\",\n", " 'stop',\n", " 'stop',\n", " 'And',\n", " 'even',\n", " 'though',\n", " 'it',\n", " \"'s\",\n", " 'a',\n", " 'struggle',\n", " 'love',\n", " 'is',\n", " 'all',\n", " 'we',\n", " 'got',\n", " 'So',\n", " 'we',\n", " 'goin',\n", " 'keep',\n", " 'keep',\n", " 'climbin',\n", " \"'\",\n", " 'to',\n", " 'the',\n", " 'mountain',\n", " 'topYour',\n", " 'world',\n", " 'is',\n", " 'my',\n", " 'world',\n", " 'And',\n", " 'my',\n", " 'fight',\n", " 'is',\n", " 'your',\n", " 'fight',\n", " 'My',\n", " 'breath',\n", " 'is',\n", " 'your',\n", " 'breath',\n", " 'When',\n", " 'you',\n", " \"'re\",\n", " 'hurt',\n", " ',',\n", " 'I',\n", " \"'m\",\n", " 'not',\n", " 'right']" ] } ], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "train_words_2 = tokenize.word_tokenize(train_text_2)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "train_bieber = [(bag_of_words(train_words_1), label_1)]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 24 }, { "cell_type": "code", "collapsed": true, "input": [ "# Let's take a look at what bag_of_words does...\n", "train_bieber" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 25, "text": [ "[({\"'\": True,\n", " \"'m\": True,\n", " \"'re\": True,\n", " \"'s\": True,\n", " ',': True,\n", " 'And': True,\n", " 'I': True,\n", " 'My': True,\n", " 'Now': True,\n", " 'So': True,\n", " 'When': True,\n", " 'a': True,\n", " 'all': True,\n", " 'breath': True,\n", " 'butterflies': True,\n", " 'climbin': True,\n", " 'even': True,\n", " 'fight': True,\n", " 'girl': True,\n", " 'goin': True,\n", " 'got': True,\n", " 'heart': True,\n", " 'hurt': True,\n", " 'in': True,\n", " 'is': True,\n", " 'it': True,\n", " 'keep': True,\n", " 'knock': True,\n", " 'love': True,\n", " 'met': True,\n", " 'mountain': True,\n", " 'my': True,\n", " \"n't\": True,\n", " 'not': True,\n", " 'right': True,\n", " 'stomach': True,\n", " 'stop': True,\n", " 'struggle': True,\n", " 'the': True,\n", " 'them': True,\n", " 'though': True,\n", " 'to': True,\n", " 'topYour': True,\n", " 'we': True,\n", " 'went': True,\n", " 'wo': True,\n", " 'world': True,\n", " 'you': True,\n", " 'your': True},\n", " 'Bieber')]" ] } ], "prompt_number": 25 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "# Bag_of_words creates our Bernoulli word vector, and then we include our condition at the end.\n", "# We can thus use this for our probabilities" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "train_obama = [(bag_of_words(train_words_2), label_2)]\n", "\n", "test_words_1 = tokenize.word_tokenize(test_text_1)\n", "test_words_2 = tokenize.word_tokenize(test_text_2)\n", "test_bieber = [(bag_of_words(test_words_1), label_1)]\n", "test_obama = [(bag_of_words(test_words_2), label_2)]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 26 }, { "cell_type": "code", "collapsed": false, "input": [ "# We give NLTK all of our training features. And all of our test features\n", "train_features = train_bieber + train_obama\n", "test_features = test_bieber + test_obama" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 27 }, { "cell_type": "code", "collapsed": false, "input": [ "#\n", "# Step 2: Train the classifier \n", "#\n", "# \n", "# We use the test set to measure accuracy\n", "classifier = NaiveBayesClassifier.train(train_features)\n", "#print 'Accuracy : %d' % nltk.classify.util.accuracy(classifier, test_features)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 28 }, { "cell_type": "code", "collapsed": false, "input": [ "#\n", "# Step 3 :Test Classification\n", "#\n", "print classifier.classify(bag_of_words(tokenize.word_tokenize(classify_bieber)))\n", "print classifier.classify(bag_of_words(tokenize.word_tokenize(classify_obama)))\n", "print classifier.classify(bag_of_words(tokenize.word_tokenize(classify_other)))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Bieber\n", "Obama\n", "Obama\n" ] } ], "prompt_number": 29 }, { "cell_type": "code", "collapsed": false, "input": [ "#\n", "# Testing with random words\n", "#\n", "def BiebOrBam(wordz):\n", " print classifier.classify(bag_of_words(tokenize.word_tokenize(wordz)))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 30 }, { "cell_type": "code", "collapsed": false, "input": [ "#\n", "# Testing it out\n", "#\n", "#\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "BiebOrBam('Peace out, ladies')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Obama\n" ] } ], "prompt_number": 33 }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "BiebOrBam('I love you')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Bieber\n" ] } ], "prompt_number": 31 }, { "cell_type": "code", "collapsed": false, "input": [ "BiebOrBam('let\\'s dance')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "BiebOrBam('Let\\'s dance poverty away')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "BiebOrBam('Let\\'s dance poverty away for good America')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "BiebOrBam('Let\\'s dance poverty away for good America. I love you')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#\n", "# The End\n", "#" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }