{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We aren't all happy at the same time \n", "\n", "![](https://raw.github.com/nealcaren/workshop_2014/master/notebooks/images/twitter_day.png) \n", "\n", "A list-based sentiment analysis by Scott Golder and Michael Macy\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Winning makes us happy. \n", "![](https://raw.github.com/nealcaren/workshop_2014/master/notebooks/images/fb.png) \n", "by Sean J. Taylor (@seanjtaylor)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![](https://raw.github.com/nealcaren/workshop_2014/master/notebooks/images/blank.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Data types\n", "\n", "Strings\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Open a notebook and copy this text over.\n", "\n", "Then press `shift-enter` or select `Cell-Run` from the pull-down menu to `run` it." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "What's in there?\n", "\n", "In a new cell, type `tweet` and then run the cell." ] }, { "cell_type": "code", "collapsed": false, "input": [ "tweet" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "'We have some delightful new food in the cafeteria. 
Awesome!!!'" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can also `print` out the contents." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print tweet" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "We have some delightful new food in the cafeteria. Awesome!!!\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Python doesn't care if you use `'`, `\"`, or even `'''` for your strings." ] }, { "cell_type": "code", "collapsed": false, "input": [ "tweet = \"We have some delightful new food in the cafeteria. Awesome!!!\"\n", "tweet" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "'We have some delightful new food in the cafeteria. Awesome!!!'" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Will this work?\n", "\n", "`tweet = Does anyone call New Haven \"NeHa\"?`\n", "\n", "Guess? Try it!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "tweet = Does anyone call New Haven \"NeHa\"?" 
], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "ename": "SyntaxError", "evalue": "invalid syntax (, line 1)", "output_type": "pyerr", "traceback": [ "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m1\u001b[0m\n\u001b[0;31m tweet = Does anyone call New Haven \"NeHa\"?\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n" ] } ], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "tweet = '''Does anyone call New Haven \"NeHa\"?'''\n", "\n", "print tweet" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Does anyone call New Haven \"NeHa\"?\n" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "tweet = 'Does anyone call New Haven \"NeHa\"?'\n", "\n", "print tweet" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Does anyone call New Haven \"NeHa\"?\n" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Lists - another way to store data\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "['everything','in','brackets','separated','by','commas.']" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "['everything', 'in', 'brackets', 'separated', 'by', 'commas.']" ] } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Think of these like variables." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "positive_words = ['awesome', 'good', 'nice', 'super', 'fun']" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "print positive_words" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['awesome', 'good', 'nice', 'super', 'fun']\n" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can add things to the list with `append`. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "positive_words.append('delightful')" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "print positive_words" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['awesome', 'good', 'nice', 'super', 'fun', 'delightful']\n" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Note that we didn't write `positive_words = positive_words.append('delightful')`. \n", "\n", "`.append()` modifies the list in place and returns `None`, so assigning its result would throw the list away." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "positive_words.append(like)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "ename": "NameError", "evalue": "name 'like' is not defined", "output_type": "pyerr", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mpositive_words\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlike\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mNameError\u001b[0m: name 'like' is not defined" ] } ], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "new_word_to_add = 'like'\n", "positive_words.append(new_word_to_add)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 22 }, { "cell_type": "code", "collapsed": false, "input": [ "print positive_words" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['awesome', 'good', 'nice', 'super', 'fun', 'delightful', 'like']\n" ] } ], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Your turn.\n", "\n", "Make a list called `negative_words` that includes awful, lame, horrible and bad. `print` out the contents." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "negative_words = ['awful','lame','horrible','bad']\n", "print negative_words" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['awful', 'lame', 'horrible', 'bad']\n" ] } ], "prompt_number": 24 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Combining lists" ] }, { "cell_type": "code", "collapsed": false, "input": [ "emotional_words = negative_words + positive_words\n", "print emotional_words" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['awful', 'lame', 'horrible', 'bad', 'awesome', 'good', 'nice', 'super', 'fun', 'delightful', 'like']\n" ] } ], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Strings can be split to create lists. I do this a lot." ] }, { "cell_type": "code", "collapsed": false, "input": [ "tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'\n", "\n", "words = tweet.split()\n", "print words" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['We', 'have', 'some', 'delightful', 'new', 'food', 'in', 'the', 'cafeteria.', 'Awesome!!!']\n" ] } ], "prompt_number": 26 }, { "cell_type": "code", "collapsed": false, "input": [ "tweet = 'We have some delightful new food in the cafeteria. 
Awesome!!!'\n", "print tweet.split('.')" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['We have some delightful new food in the cafeteria', ' Awesome!!!']\n" ] } ], "prompt_number": 27 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Unlike `.append()`, `.split()` doesn't alter the original string. Strings are immutable." ] }, { "cell_type": "code", "collapsed": false, "input": [ "tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'\n", "print tweet.split()\n", "print tweet" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['We', 'have', 'some', 'delightful', 'new', 'food', 'in', 'the', 'cafeteria.', 'Awesome!!!']\n", "We have some delightful new food in the cafeteria. Awesome!!!\n" ] } ], "prompt_number": 28 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "So when you transform a string, make sure you store the result somewhere." ] }, { "cell_type": "code", "collapsed": false, "input": [ "tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'\n", "words = tweet.split()\n", "print words" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['We', 'have', 'some', 'delightful', 'new', 'food', 'in', 'the', 'cafeteria.', 'Awesome!!!']\n" ] } ], "prompt_number": 29 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Most of the fun math is in `numpy` but we can count the length of objects." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "print words\n", "print len(words)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['We', 'have', 'some', 'delightful', 'new', 'food', 'in', 'the', 'cafeteria.', 'Awesome!!!']\n", "10\n" ] } ], "prompt_number": 30 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "How long is `tweet`?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 91 }, { "cell_type": "code", "collapsed": false, "input": [ "print tweet \n", "len(tweet)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "RT @SoCalConservtiv: His failed budget proposal that was voted down 414-0 in Congress... Oh wait that's Professor Obama #WhatsRomneyHiding #tcot\n" ] }, { "metadata": {}, "output_type": "pyout", "prompt_number": 92, "text": [ "144" ] } ], "prompt_number": 92 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "With `len()`, Python counts the number of items in a list and the number of characters in a string." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "There are a couple more data types that you might need:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#tuple\n", "row = (1,3,'fish')\n", "\n", "print row" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "(1, 3, 'fish')\n" ] } ], "prompt_number": 88 }, { "cell_type": "code", "collapsed": false, "input": [ "#sets\n", "set([3,4,5,5])" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 89, "text": [ "{3, 4, 5}" ] } ], "prompt_number": 89 }, { "cell_type": "code", "collapsed": false, "input": [ "#Dictionary\n", "\n", "article_1 = {'title': 'Cat in the Hat', 'author': 'Dr. Seuss', 'Year': 1957}" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 90 }, { "cell_type": "code", "collapsed": false, "input": [ "#And a list of dictionaries is awfully close to JSON.\n", "article_2 = {'title': 'Go, Dog. Go!', 'author': 'PD Eastman', 'Year': 1961}\n", "\n", "articles = [article_1, article_2]\n", "\n", "articles" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 91, "text": [ "[{'Year': 1957, 'author': 'Dr. Seuss', 'title': 'Cat in the Hat'},\n", " {'Year': 1961, 'author': 'PD Eastman', 'title': 'Go, Dog. Go!'}]" ] } ], "prompt_number": 91 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Loops\n", "---\n", "Were any of the words in our sentence in the positive word list?" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "for word in words:\n", "print word" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "ename": "IndentationError", "evalue": "expected an indented block (, line 2)", "output_type": "pyerr", "traceback": [ "\u001b[0;36m File \u001b[0;32m\"\"\u001b[0;36m, line \u001b[0;32m2\u001b[0m\n\u001b[0;31m print word\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mIndentationError\u001b[0m\u001b[0;31m:\u001b[0m expected an indented block\n" ] } ], "prompt_number": 32 }, { "cell_type": "code", "collapsed": false, "input": [ "for word in words:\n", " print word" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "We\n", "have\n", "some\n", "delightful\n", "new\n", "food\n", "in\n", "the\n", "cafeteria.\n", "Awesome!!!\n" ] } ], "prompt_number": 33 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Note the colon at the end of the first line. Python will expect the next line to be indented. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can also add conditionals, like `if` and `else` or `elif`. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "for word in words:\n", " if word in positive_words:\n", " print word" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "delightful\n" ] } ], "prompt_number": 34 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Your turn. Take the following tweet and print out a plus sign for each positive word:\n", "\n", "`tweet_2 = \"Food is lame today. 
I don't like it at all.\"`\n", "\n", "Don't peek.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 34 }, { "cell_type": "code", "collapsed": false, "input": [ "tweet_2 = \"Food is lame today. I don't like it at all.\"\n", "words_2 = tweet_2.split()\n", "\n", "for word in words_2:\n", " if word in positive_words:\n", " print '+'\n", " " ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "+\n" ] } ], "prompt_number": 35 }, { "cell_type": "code", "collapsed": false, "input": [ "tweet_2 = \"Food is lame today. I don't like it at all.\"\n", "words_2 = tweet_2.split()\n", "\n", "for word in words_2:\n", " if word in positive_words:\n", " print '+'\n", " elif word in negative_words:\n", " print '-'" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "-\n", "+\n" ] } ], "prompt_number": 36 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Like lists, we can combine strings with a `+`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "for word in words:\n", " if word in positive_words:\n", " print word + ' is a positive word.'" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "delightful is a positive word.\n" ] } ], "prompt_number": 37 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Why doesn't this work?" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "print 3 + ' is a number.'\n" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "ename": "TypeError", "evalue": "unsupported operand type(s) for +: 'int' and 'str'", "output_type": "pyerr", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mprint\u001b[0m \u001b[0;36m3\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;34m' is a number.'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'int' and 'str'" ] } ], "prompt_number": 38 }, { "cell_type": "code", "collapsed": false, "input": [ "print ['puppies','dogs'] + 'are pets.'\n" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "ename": "TypeError", "evalue": "can only concatenate list (not \"str\") to list", "output_type": "pyerr", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mprint\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;34m'puppies'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m'dogs'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;34m'are pets.'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mTypeError\u001b[0m: can only concatenate list (not \"str\") to list" ] } ], "prompt_number": 39 }, { "cell_type": "code", "collapsed": false, "input": [ "print '3' + ' is a number.'\n", "\n", "print str(3) + ' is a number.'\n", "\n", "print '%s is a number.' 
% 3" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "3 is a number.\n", "3 is a number.\n", "3 is a number.\n" ] } ], "prompt_number": 40 }, { "cell_type": "code", "collapsed": false, "input": [ "for some_number in [1,2,4,9]:\n", " sentence = str(some_number) + ' is a number.'\n", " print sentence" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 is a number.\n", "2 is a number.\n", "4 is a number.\n", "9 is a number.\n" ] } ], "prompt_number": 41 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Text cleaning\n", "\n", "Or why Awesome!!! wasn't a positive word" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print tweet.lower()" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "we have some delightful new food in the cafeteria. awesome!!!\n" ] } ], "prompt_number": 42 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "But we can't do it with a list of things." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "print words.lower()" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "ename": "AttributeError", "evalue": "'list' object has no attribute 'lower'", "output_type": "pyerr", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mprint\u001b[0m \u001b[0mwords\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlower\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m: 'list' object has no attribute 'lower'" ] } ], "prompt_number": 43 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "So you'll need to clean either the whole sentence, each word, or both." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "for word in words:\n", " word_lower = word.lower()\n", " if word_lower in positive_words:\n", " print word_lower + ' is a positive word.'" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "delightful is a positive word.\n" ] } ], "prompt_number": 44 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Updating our loop, we still don\u2019t find `awesome!!!` yet.\n", "\n", "Why?\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print 'awesome!!!'.strip('!')" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "awesome\n" ] } ], "prompt_number": 45 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Getting rid of `'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from string import punctuation" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 46 }, { "cell_type": "code", "collapsed": false, "input": [ "punctuation" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 47, "text": [ "'!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'" ] } ], "prompt_number": 47 }, { "cell_type": "code", "collapsed": false, "input": [ "'awesome!!!!'.strip(punctuation)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 48, "text": [ "'awesome'" ] } ], "prompt_number": 48 }, { "cell_type": "code", "collapsed": false, "input": [ "'awesome?!?!'.strip(punctuation)" ], "language": "python", "metadata": { 
"slideshow": { "slide_type": "slide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 49, "text": [ "'awesome'" ] } ], "prompt_number": 49 }, { "cell_type": "code", "collapsed": false, "input": [ "'awesome!!!! party'.strip(punctuation)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 50, "text": [ "'awesome!!!! party'" ] } ], "prompt_number": 50 }, { "cell_type": "code", "collapsed": false, "input": [ "print 'awesome!!! party'\n", "print 'awesome!!! party'.replace('!','')" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "awesome!!! party\n", "awesome party\n" ] } ], "prompt_number": 51 }, { "cell_type": "code", "collapsed": false, "input": [ "word = 'awesome!!!'\n", "\n", "word_processed = word.strip('!')\n", "word_processed = word_processed.lower()\n", "\n", "print word_processed\n" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "awesome\n" ] } ], "prompt_number": 52 }, { "cell_type": "code", "collapsed": false, "input": [ "word = 'awesome!!!'\n", "word_processed = word.strip('!').lower()\n", "print word_processed" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "awesome\n" ] } ], "prompt_number": 53 }, { "cell_type": "code", "collapsed": false, "input": [ "for word in words:\n", " word_processed = word.lower()\n", " word_processed = word_processed.strip(punctuation)\n", " if word_processed in positive_words:\n", " print word + ' is a positive word'" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "delightful is a 
positive word\n", "Awesome!!! is a positive word\n" ] } ], "prompt_number": 54 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "It worked!!!\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "But what we really care about is the count of positive words." ] }, { "cell_type": "code", "collapsed": false, "input": [ "positive_counter = 0\n", "\n", "for word in words:\n", " word_processed = word.lower()\n", " word_processed = word_processed.strip(punctuation)\n", " if word_processed in positive_words:\n", " positive_counter = positive_counter + 1\n", "print positive_counter" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "2\n" ] } ], "prompt_number": 55 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's do this for real.\n", "\n", "- Get a real list of affect words.\n", "- Get a real list of tweets.\n", "- Output the results to a CSV file for additional analysis." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "LIWC is what all the cool kids use, but their list is copyrighted. So we'll use lists of positive and negative words from:\n", "\n", "Theresa Wilson, Janyce Wiebe and Paul Hoffmann (2005). \"Recognizing Contextual \n", "Polarity in Phrase-Level Sentiment Analysis.\" Proceedings of HLT/EMNLP 2005,\n", "Vancouver, Canada." ] }, { "cell_type": "code", "collapsed": false, "input": [ "negative_file = open('negative.txt', 'r').read()" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "prompt_number": 95 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the basic way to open and read a text file. \n", "\n", "`r` tells the operating system you want permission to read it. 
\n", "`.read()` tells Python to import all the text" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's take a look at `negative_file` by slicing it up." ] }, { "cell_type": "code", "collapsed": false, "input": [ "negative_file[:50]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 57, "text": [ "'abandoned\\nabandonment\\naberration\\naberration\\nabhorr'" ] } ], "prompt_number": 57 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "`\\n` is an End of Line character\n", "\n", "`[:50]` took the first 50 characters" ] }, { "cell_type": "code", "collapsed": false, "input": [ "negative_list = negative_file.splitlines()" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 58 }, { "cell_type": "code", "collapsed": false, "input": [ "negative_list[:5]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 59, "text": [ "['abandoned', 'abandonment', 'aberration', 'aberration', 'abhorred']" ] } ], "prompt_number": 59 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Python starts at 0" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print negative_list[0]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "abandoned\n" ] } ], "prompt_number": 60 }, { "cell_type": "code", "collapsed": false, "input": [ "print negative_list[0:1]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['abandoned']\n" ] } ], "prompt_number": 61 }, { "cell_type": "code", "collapsed": false, "input": 
[ "print negative_list[-5:]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['wrought', 'yawn', 'zealot', 'zealous', 'zealously']\n" ] } ], "prompt_number": 62 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do it again for the positive words." ] }, { "cell_type": "code", "collapsed": false, "input": [ "postive_file = open('positive.txt', 'r').read()\n", "postive_list = postive_file.splitlines()\n" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "prompt_number": 63 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Quiz: How many words are in the two lists combined?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 63 }, { "cell_type": "code", "collapsed": false, "input": [ "print len(postive_list) + len(negative_list)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "6135\n" ] } ], "prompt_number": 64 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "A while back, I used the Twitter API to get some tweets, or status updates, that mentioned President Obama." ] }, { "cell_type": "code", "collapsed": false, "input": [ "obama_tweets = open('obama_tweets.txt', 'r').read()\n", "obama_tweets = obama_tweets.splitlines()" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 65 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Avoid copying and pasting the same code! Make a function! 
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def open_list(filename):\n", "    list_file = open(filename, 'r').read()\n", "    list_file = list_file.splitlines()\n", "    return list_file\n", "\n", "obama_tweets = open_list('obama_tweets.txt')" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 66 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "More slicing" ] }, { "cell_type": "code", "collapsed": false, "input": [ "obama_tweets[:5]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 67, "text": [ "['Obama has called the GOP budget social Darwinism. Nice try, but they believe in social creationism.',\n", " 'In his teen years, Obama has been known to use marijuana and cocaine.',\n", " 'IPA Congratulates President Barack Obama for Leadership Regarding JOBS Act: WASHINGTON, Apr 05, 2012 (BUSINESS W... http://t.co/8le3DC8E',\n", " 'RT @Professor_Why: #WhatsRomneyHiding - his connection to supporters of Critical Race Theory.... Oh wait, that was Obama, not Romney...',\n", " 'RT @wardollarshome: Obama has approved more targeted assassinations than any modern US prez; READ & RT: http://t.co/bfC4gbBW']" ] } ], "prompt_number": 67 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Some from the middle?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print obama_tweets[52:55]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[\"Barack Obama President Ronald Reagan's Initial Actions Project: President Ronald Reagan was also facing an econo... http://t.co/8Go8oCpf\", 'RT @TXGaryM: Yes #WHFail RT @jltho: This #WhatsRomneyHiding hashtag is entertaining. 
Is this another social media backfire from the Obama administration?', 'Barack Obama LONGBOARD Package CORE 7\" TRUCKS 76mm BIGFOOT WHEELS: The newest addition to the Bigfoot Collection... http://t.co/cnHRuUBZ']\n" ] } ], "prompt_number": 68 }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's ugly." ] }, { "cell_type": "code", "collapsed": false, "input": [ "for tweet in obama_tweets[52:55]:\n", "    print tweet" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Barack Obama President Ronald Reagan's Initial Actions Project: President Ronald Reagan was also facing an econo... http://t.co/8Go8oCpf\n", "RT @TXGaryM: Yes #WHFail RT @jltho: This #WhatsRomneyHiding hashtag is entertaining. Is this another social media backfire from the Obama administration?\n", "Barack Obama LONGBOARD Package CORE 7\" TRUCKS 76mm BIGFOOT WHEELS: The newest addition to the Bigfoot Collection... http://t.co/cnHRuUBZ\n" ] } ], "prompt_number": 69 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's get going!!!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#loop, but don't go through everything yet\n", "for tweet in obama_tweets[:5]:\n", "    print tweet\n", "    positive_counter=0\n", "    #Lower case everything\n", "    tweet_processed=tweet.lower()\n", "    \n", "    #split on whitespace into a list of words\n", "    words = tweet_processed.split()\n", "    \n", "    #Loop through each word in the tweet\n", "    for word in words:\n", "        clean_word = word.strip(punctuation)\n", "        \n", "        if clean_word in postive_list:\n", "            print clean_word\n", "            positive_counter=positive_counter+1\n", "    \n", "    print positive_counter,len(words)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Obama has called the GOP budget social Darwinism. 
Nice try, but they believe in social creationism.\n", "nice\n", "1 16\n", "In his teen years, Obama has been known to use marijuana and cocaine.\n", "0 13\n", "IPA Congratulates President Barack Obama for Leadership Regarding JOBS Act: WASHINGTON, Apr 05, 2012 (BUSINESS W... http://t.co/8le3DC8E\n", "0 17\n", "RT @Professor_Why: #WhatsRomneyHiding - his connection to supporters of Critical Race Theory.... Oh wait, that was Obama, not Romney...\n", "0 19\n", "RT @wardollarshome: Obama has approved more targeted assassinations than any modern US prez; READ & RT: http://t.co/bfC4gbBW\n", "modern\n", "1 17\n" ] } ], "prompt_number": 70 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Your turn. Add a negative_counter!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 70 }, { "cell_type": "code", "collapsed": false, "input": [ "#loop, but don't go through everything yet\n", "for tweet in obama_tweets[:5]:\n", "\n", "    positive_counter = 0\n", "    negative_counter = 0\n", "    #Lower case everything\n", "    tweet_processed = tweet.lower()\n", "    \n", "    #split on whitespace into a list of words\n", "    words = tweet_processed.split()\n", "    \n", "    #Loop through each word in the tweet\n", "    for word in words:\n", "        clean_word = word.strip(punctuation)\n", "        \n", "        if clean_word in postive_list:\n", "            positive_counter = positive_counter + 1\n", "        elif clean_word in negative_list:\n", "            negative_counter = negative_counter + 1\n", "    \n", "    print positive_counter, negative_counter, len(words)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 0 16\n", "0 0 13\n", "0 0 17\n", "0 0 19\n", "1 0 17\n" ] } ], "prompt_number": 71 }, { 
"cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Now let's get this out of Python" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import csv" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "prompt_number": 72 }, { "cell_type": "code", "collapsed": false, "input": [ "csv_file = open('tweet_sentiment.csv','w')\n", "csv_writer = csv.writer( csv_file )" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 73 }, { "cell_type": "code", "collapsed": false, "input": [ "#loop, but don't go through everything yet\n", "for tweet in obama_tweets[:5]:\n", "\n", "    positive_counter = 0\n", "    negative_counter = 0\n", "    #Lower case everything\n", "    tweet_processed = tweet.lower()\n", "    \n", "    #split on whitespace into a list of words\n", "    words = tweet_processed.split()\n", "    \n", "    #Loop through each word in the tweet\n", "    for word in words:\n", "        clean_word = word.strip(punctuation)\n", "        \n", "        if clean_word in postive_list:\n", "            positive_counter = positive_counter + 1\n", "        elif clean_word in negative_list:\n", "            negative_counter = negative_counter + 1\n", "    \n", "    csv_writer.writerow( [positive_counter, negative_counter, len(words)] )\n", "\n", "csv_file.close()" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "prompt_number": 74 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's look at the results. If `!cat` doesn't work, try `!type`, which is the Windows equivalent." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "!cat tweet_sentiment.csv" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1,0,16\r", "\r\n", "0,0,13\r", "\r\n", "0,0,17\r", "\r\n", "0,0,19\r", "\r\n", "1,0,17\r", "\r\n" ] } ], "prompt_number": 75 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Why only 5 rows?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "csv_file = open('tweet_sentiment.csv','w')\n", "csv_writer = csv.writer( csv_file )\n", "\n", "#loop\n", "for tweet in obama_tweets[:]:\n", "    positive_counter = 0\n", "    negative_counter = 0\n", "    #Lower case everything\n", "    tweet_processed = tweet.lower()\n", "    \n", "    #split on whitespace into a list of words\n", "    words = tweet_processed.split()\n", "\n", "    #Loop through each word in the tweet\n", "    for word in words:\n", "        clean_word = word.strip(punctuation)\n", "        \n", "        if clean_word in postive_list:\n", "            positive_counter = positive_counter + 1\n", "        \n", "        if clean_word in negative_list:\n", "            negative_counter = negative_counter + 1\n", "\n", "    csv_writer.writerow( [ positive_counter, negative_counter, len(words)] )\n", "\n", "csv_file.close()" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "prompt_number": 76 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The `csv` file can be read in other programs." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "![](files/stata7.png)\n", "\n", "That's not Python" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "If you want a header row, add \n", "\n", "`csv_writer.writerow( [ 'positive', 'negative', 'length'] )`\n", "\n", "before you start writing the values." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "In case you're wondering, I would probably write my code a little differently" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def clean_split(tweet):\n", "    ''' Take sentence and return cleaned list of words '''\n", "    return [word.strip(punctuation) for word in tweet.lower().split()]\n", "\n", "#turn the lists into sets so we can do intersections\n", "postive_set = set(postive_list)\n", "negative_set = set(negative_list)\n", "\n", "\n", "sentiment = []\n", "for tweet in obama_tweets:\n", "    words = clean_split(tweet)\n", "    postive_counter = len( postive_set.intersection(words) )\n", "    negative_counter = len( negative_set.intersection(words) )\n", "    sentiment.append ( [ postive_counter , negative_counter, len(words)] )\n", "\n", "with open('tweet_sentiment_2.csv','w') as csv_file:\n", "    csv_writer = csv.writer( csv_file )\n", "    csv_writer.writerows(sentiment)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 94 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Some takeaways\n", "\n", "1. That's most of what you need to know in Python. \n", "    1. List comprehension\n", "    1. Functions/Classes\n", "2. Start simple.\n", "    1. Develop a solution for one case\n", "    2. Scale up (so make sure your 2A is generalizable.)" ] } ], "metadata": {} } ] }