{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "Classification with scikit learn\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "#### Faking it as a Data Scientist\n",
      "\n",
      "1. Supervised learning / Classification\n",
      "   1. Logistic Regression\n",
      "   2. Naive Bayes\n",
      "   3. SVM\n",
      "   4. Ensemble/Random Forest\n",
      "\n",
      "2. Unsupervised learning\n",
      "   1. PCA\n",
      "   2. Topic Modeling\n",
      "3. Features\n",
      "4. Train/test"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "source": [
      "1. Load the data\n",
      "2. Tranform variables\n",
      "3. Select model and parameters\n",
      "3. Train the model\n",
      "4. Test the model\n",
      "5. GOTO 2"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "Input our data from before"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "upworthy_titles = open('upworthy_titles.txt', \"r\").read()"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "-"
      }
     },
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "upworthy_titles[:500]"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 5,
       "text": [
        "\"5 Reasons Why My Girlfriend Thinks She's Not Beautiful Enough, No Matter What Anyone Tells Her\\nThe Perfect Reply A Girl Can Give To The Question 'What's Your Favorite Position?'\\nWhen A Simple Idea Like This Actually Works And It Helps People Get By, Everybody Wins\\nOne Of The Biggest Lies We\\xe2\\x80\\x99re Encouraged To Tell Ourselves About Our Value To Society Is Right Here\\nHe Was About To Take His Own Life \\xe2\\x80\\x94 Until A Man Stopped Him. Here He Meets Him Face To Face Again.\\nWhen I Was A Kid, An Ad Aired On\""
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "source": [
      "I hate unicode"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "20 minutes of Googling later"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "upworthy_titles = open('upworthy_titles.txt', \"r\").read()\n",
      "\n",
      "upworthy_titles = upworthy_titles.decode('utf-8')\n",
      "upworthy_titles = upworthy_titles.encode('ascii', 'ignore')"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print len(upworthy_titles)\n",
      "\n",
      "upworthy_titles[:500]"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "56210\n"
       ]
      },
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "\"5 Reasons Why My Girlfriend Thinks She's Not Beautiful Enough, No Matter What Anyone Tells Her\\nThe Perfect Reply A Girl Can Give To The Question 'What's Your Favorite Position?'\\nWhen A Simple Idea Like This Actually Works And It Helps People Get By, Everybody Wins\\nOne Of The Biggest Lies Were Encouraged To Tell Ourselves About Our Value To Society Is Right Here\\nHe Was About To Take His Own Life  Until A Man Stopped Him. Here He Meets Him Face To Face Again.\\nWhen I Was A Kid, An Ad Aired On TV Th\""
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "upworthy_titles = upworthy_titles.splitlines()\n",
      "\n",
      "print len(upworthy_titles)\n",
      "\n",
      "upworthy_titles[:20]"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "693\n"
       ]
      },
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 8,
       "text": [
        "[\"5 Reasons Why My Girlfriend Thinks She's Not Beautiful Enough, No Matter What Anyone Tells Her\",\n",
        " \"The Perfect Reply A Girl Can Give To The Question 'What's Your Favorite Position?'\",\n",
        " 'When A Simple Idea Like This Actually Works And It Helps People Get By, Everybody Wins',\n",
        " 'One Of The Biggest Lies Were Encouraged To Tell Ourselves About Our Value To Society Is Right Here',\n",
        " 'He Was About To Take His Own Life  Until A Man Stopped Him. Here He Meets Him Face To Face Again.',\n",
        " \"When I Was A Kid, An Ad Aired On TV That I Didn't Fully Get. Now, I Want Us All To Watch It Again.\",\n",
        " \"These Parents Think It Might Be A Phase, But They Just Don't Understand What It Means\",\n",
        " 'How China Deals With Internet-Addicted Teens Is Kind Of Shocking. And Maybe A Good Idea?',\n",
        " 'If You Want A Successful Long-Term Relationship (Of Any Kind), Here Are 3 Invaluable Things To Know',\n",
        " 'How Do You Know If You Have Depression? Hear This Woman Explain How She Found Out.',\n",
        " \"If You Ever Wanted The Great Novels Explained To You By Your Thug Friend, Now's Your Chance\",\n",
        " \"Seems Like Any Other High School, Right? So What's The Major Thing Missing From These Pictures?\",\n",
        " 'Every Person On Earth Is Supposed To Have These. But What Are They Exactly?',\n",
        " 'Jon Stewart Delivers One Of The Best Interviews In Recent Memory',\n",
        " 'If You Give A Bride A Beautiful Set Of Bone China, You Set Her Table For A Day',\n",
        " 'A High School In A Poor Neighborhood Closed Down. These Folks Reopened It And Kicked Some Butt. How?',\n",
        " 'The Super Bowl Ad That You Should See If You Think Little Girls Can Become Epic Rocket Scientists',\n",
        " \"The NFL Would Never Let This Ad Air On The Super Bowl, So We're Gonna Show You It. It's Important.\",\n",
        " 'The Simple Way A Developing Country Managed To Decrease The Number Of Babies Who Died By Almost 30%',\n",
        " 'When Theres Nothing Scarier Than A Room Full Of White Men Laughing']"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "times_titles = open('times_titles.txt', 'rb').read()\n",
      "\n",
      "times_titles = times_titles.decode('utf-8')\n",
      "times_titles = times_titles.encode('ascii', 'ignore')\n",
      "\n",
      "times_titles = times_titles.splitlines()"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "source": [
      "I might think about wrapping this in a function"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def import_titles(filename):\n",
      "    ''' Imports text file, \n",
      "        cleans up some unicode mess, and \n",
      "        splits by line. '''\n",
      "    titles = open(filename, 'rb').read()\n",
      "\n",
      "    titles = titles.decode('utf-8')\n",
      "    titles = titles.encode('ascii', 'ignore')\n",
      "\n",
      "    return times_titles.splitlines()"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "What we want the data to look like next\n",
      "\n",
      "| upworthy | titles\n",
      "|:-----------|:------------:|\n",
      "| 1       | He Was About To Take His Own Life \u2014 Until A Man Stopped Him. Here He Meets Him Face To Face Again     \n",
      "| 0     | CVS Pharmacy to Stop Selling Tobacco Goods by October  \n",
      "| 0 | Twitter\u2019s Share Price Falls After It Reports a Loss \n",
      "| 1 | A 16-Year-Old Explains Why Everything You Thought You Knew About Beauty May Be Wrong. With Math.\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Set up placehold lists\n",
      "upworthy = []\n",
      "titles   = []\n",
      "\n",
      "#Go through all the upworthy articles\n",
      "for title in upworthy_titles:\n",
      "    #add title to title list\n",
      "    titles.append(title)\n",
      "    #add 1 to Y list\n",
      "    upworthy.append(1)\n",
      "\n",
      "for title in times_titles:\n",
      "    titles.append(title)\n",
      "    upworthy.append(0)   "
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print len(upworthy)\n",
      "print upworthy[:5]"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1193\n",
        "[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print len(titles)\n",
      "print titles[:5]"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1193\n",
        "[\"5 Reasons Why My Girlfriend Thinks She's Not Beautiful Enough, No Matter What Anyone Tells Her\", \"The Perfect Reply A Girl Can Give To The Question 'What's Your Favorite Position?'\", 'When A Simple Idea Like This Actually Works And It Helps People Get By, Everybody Wins', 'One Of The Biggest Lies Were Encouraged To Tell Ourselves About Our Value To Society Is Right Here', 'He Was About To Take His Own Life  Until A Man Stopped Him. Here He Meets Him Face To Face Again.', \"When I Was A Kid, An Ad Aired On TV That I Didn't Fully Get. Now, I Want Us All To Watch It Again.\", \"These Parents Think It Might Be A Phase, But They Just Don't Understand What It Means\", 'How China Deals With Internet-Addicted Teens Is Kind Of Shocking. And Maybe A Good Idea?', 'If You Want A Successful Long-Term Relationship (Of Any Kind), Here Are 3 Invaluable Things To Know', 'How Do You Know If You Have Depression? Hear This Woman Explain How She Found Out.']\n"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "source": [
      "I don't like that the X and the Y are in two unrelated lists. Matches cases depends on the sort order, so don't mess that up."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "What we want the data to look like next\n",
      "\n",
      "| upworthy | stop   | man  | obama  | explain  | everything  | you  | nato  | debate | industry | believe  \n",
      "|:-----------|:----:| :----:| :----:| :----:| :----:| :----:| :----:| :----:| :----:| :----:| :----:| \n",
      "| 1       | 0 | 1  | 0 | 0  | 0 | 2  | 0 | 0  | 0 | 1  \n",
      "| 0       | 1 | 0  | 0 | 0  | 0 | 0  | 1| 1 | 0 | 0  \n",
      "| 0       | 0 | 0  |1 | 1  | 0 | 0  | 0 | 0  | 1 | 0  \n",
      "| 1       | 0 | 0  | 0 | 0  | 0 | 1  | 0 | 1   | 0 | 0  \n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.feature_extraction.text import CountVectorizer"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "outputs": [],
     "prompt_number": 14
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "vectorizer = CountVectorizer(lowercase     = True ,\n",
      "                             min_df        = 2 ,\n",
      "                             max_df        = .5 ,\n",
      "                             ngram_range   = (1, 1),\n",
      "                             stop_words    = 'english', \n",
      "                             strip_accents = 'unicode')\n"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [],
     "prompt_number": 83
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "vectorizer.fit(titles)"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 84,
       "text": [
        "CountVectorizer(analyzer=u'word', binary=False, charset=None,\n",
        "        charset_error=None, decode_error=u'strict',\n",
        "        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',\n",
        "        lowercase=True, max_df=0.5, max_features=None, min_df=2,\n",
        "        ngram_range=(1, 1), preprocessor=None, stop_words='english',\n",
        "        strip_accents='unicode', token_pattern=u'(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
        "        tokenizer=None, vocabulary=None)"
       ]
      }
     ],
     "prompt_number": 84
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "X = vectorizer.transform(titles)"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [],
     "prompt_number": 85
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.naive_bayes import MultinomialNB"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "outputs": [],
     "prompt_number": 86
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clf = MultinomialNB()"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [],
     "prompt_number": 87
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clf.fit(X,upworthy)"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 88,
       "text": [
        "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
       ]
      }
     ],
     "prompt_number": 88
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "y_hat = clf.predict(X)"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "outputs": [],
     "prompt_number": 89
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "y_hat[:20]"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 90,
       "text": [
        "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1])"
       ]
      }
     ],
     "prompt_number": 90
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "source": [
      "What % were correctly predicted?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clf.score(X, upworthy)"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 91,
       "text": [
        "0.91701592623637884"
       ]
      }
     ],
     "prompt_number": 91
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Better fit statistics"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn import metrics"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "-"
      }
     },
     "outputs": [],
     "prompt_number": 92
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "metrics.confusion_matrix(upworthy, y_hat)"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 93,
       "text": [
        "array([[416,  84],\n",
        "       [ 15, 678]])"
       ]
      }
     ],
     "prompt_number": 93
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "Sociologists care about coeficients, but they are a tucked away."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clf.coef_"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 94,
       "text": [
        "array([[-6.71921442, -6.31374931, -7.81782671, ..., -7.12467953,\n",
        "        -7.4123616 , -7.4123616 ]])"
       ]
      }
     ],
     "prompt_number": 94
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(clf.coef_[0])"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 95,
       "text": [
        "1192"
       ]
      }
     ],
     "prompt_number": 95
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(vectorizer.vocabulary_)"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 96,
       "text": [
        "1192"
       ]
      }
     ],
     "prompt_number": 96
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "source": [
      "I know what we can do."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "coefficients = zip(vectorizer.vocabulary_,clf.coef_[0])\n",
      "coefficients = {h[0] : h[1] for h in coefficients}\n",
      "\n",
      "for word in sorted(coefficients, key = coefficients.get, reverse=True)[:20]:\n",
      "    print word,coefficients[word]"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "nocturnalist -4.55973017302\n",
        "rest -4.63977288069\n",
        "columnist -4.70431140183\n",
        "debate -5.01446633014\n",
        "drunk -5.01446633014\n",
        "biggest -5.01446633014\n",
        "industry -5.14367806162\n",
        "nato -5.17876938143\n",
        "severe -5.17876938143\n",
        "need -5.2151370256\n",
        "rebels -5.2151370256\n",
        "disposing -5.29209806673\n",
        "weeks -5.33292006125\n",
        "learning -5.33292006125\n",
        "head -5.33292006125\n",
        "bump -5.37547967567\n",
        "abortions -5.46645145388\n",
        "world -5.46645145388\n",
        "million -5.51524161805\n",
        "000 -5.51524161805\n"
       ]
      }
     ],
     "prompt_number": 97
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "See anything different in this code?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "coefficients = zip(vectorizer.vocabulary_,clf.coef_[0])\n",
      "coefficients = {h[0] : h[1] for h in coefficients}\n",
      "\n",
      "for word in sorted(coefficients, key = coefficients.get, reverse=False)[:20]:\n",
      "    print word,coefficients[word]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "ron -8.5109738916\n",
        "leads -8.5109738916\n",
        "glimpse -8.5109738916\n",
        "young -8.5109738916\n",
        "james -8.5109738916\n",
        "detention -8.5109738916\n",
        "fat -8.5109738916\n",
        "cool -8.5109738916\n",
        "dating -8.5109738916\n",
        "immediately -8.5109738916\n",
        "die -8.5109738916\n",
        "lapse -8.5109738916\n",
        "pretend -8.5109738916\n",
        "second -8.5109738916\n",
        "street -8.5109738916\n",
        "7th -8.5109738916\n",
        "saved -8.5109738916\n",
        "goes -8.5109738916\n",
        "seeks -8.5109738916\n",
        "box -8.5109738916\n"
       ]
      }
     ],
     "prompt_number": 98
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "Data scientists care about over fitting. We probably should too."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.cross_validation import train_test_split\n",
      "\n",
      "X_train, X_test, y_train, y_test = train_test_split(X, upworthy, test_size=0.3)"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [],
     "prompt_number": 99
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clf.fit(X_train, y_train)\n",
      "\n",
      "y_test_pred = clf.predict(X_test)\n"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [],
     "prompt_number": 100
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "y_train_pred = clf.predict(X_train)\n",
      "\n",
      "print \"classification accuracy:\", metrics.accuracy_score(y_train, y_train_pred)\n",
      "metrics.confusion_matrix(y_train, y_train_pred)"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "classification accuracy: 0.925748502994\n"
       ]
      },
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 101,
       "text": [
        "array([[303,  57],\n",
        "       [  5, 470]])"
       ]
      }
     ],
     "prompt_number": 101
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "y_test_pred = clf.predict(X_test)\n",
      "\n",
      "print \"classification accuracy:\", metrics.accuracy_score(y_test, y_test_pred)\n",
      "metrics.confusion_matrix(y_test, y_test_pred)"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "classification accuracy: 0.821229050279\n"
       ]
      },
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 102,
       "text": [
        "array([[ 99,  41],\n",
        "       [ 23, 195]])"
       ]
      }
     ],
     "prompt_number": 102
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn import cross_validation\n",
      "import numpy as np"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "outputs": [],
     "prompt_number": 103
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cross_validation.cross_val_score(clf, X, np.array(upworthy),  cv=10)"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 104,
       "text": [
        "array([ 0.81666667,  0.81666667,  0.86666667,  0.79831933,  0.81512605,\n",
        "        0.83193277,  0.84033613,  0.79831933,  0.77310924,  0.85714286])"
       ]
      }
     ],
     "prompt_number": 104
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "np.mean(cross_validation.cross_val_score(clf, X, np.array(upworthy),  cv=10))"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 105,
       "text": [
        "0.8214285714285714"
       ]
      }
     ],
     "prompt_number": 105
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "Do we do any better by looking at bigrams? Go back up and change the values for CountVectorizer."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "What about a new headline?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "soc_title = 'Educational Assortative Mating and Earnings Inequality in the United States'"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [],
     "prompt_number": 106
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "soc_title_vector = vectorizer.transform([soc_title])"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [],
     "prompt_number": 107
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clf.predict_proba(soc_title_vector)"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 110,
       "text": [
        "array([[ 0.94490128,  0.05509872]])"
       ]
      }
     ],
     "prompt_number": 110
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "gawker_title = 'Shocking Footage Aired of Police Shooting Face-Eating Nude Man'\n",
      "gawker_title_vector = vectorizer.transform([gawker_title])\n",
      "clf.predict_proba(gawker_title_vector)"
     ],
     "language": "python",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 111,
       "text": [
        "array([[ 0.38060745,  0.61939255]])"
       ]
      }
     ],
     "prompt_number": 111
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "slide"
      }
     },
     "source": [
      "Your turn. Take the file `abstracts.csv` has 3,232 abstracts, divided evenly between those that were published in sociology journals and those that were presented at the ASAs. \n",
      "\n",
      "After importing the data, develop a model that categorizes each, without overfitting.   \n",
      "\n",
      "Note: Copy and pasting code is your friend!!!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {
      "slideshow": {
       "slide_type": "fragment"
      }
     },
     "source": [
      "Make a prediction for this out of sample abstract:\n",
      "\n",
      "This article investigates how changes in educational assortative mating affected the growth in earnings inequality among households in the United States between the late 1970s and early 2000s. The authors find that these changes had a small, negative effect on inequality: there would have been more inequality in earnings in the early 2000s if educational assortative mating patterns had remained as they were in the 1970s. Given the educational distribution of men and women in the United States, educational assortative mating can have only a weak impact on inequality, and educational sorting among partners is a poor proxy for sorting on earnings.\n"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}