{ "metadata": { "name": "", "signature": "sha256:313a543e6843b06c804ad48ce5e91708cf3824b30fa30a3dd0ad949169be1c58" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "##5. Naive Bayes in Scikit-Learn: A Brief Intro\n", "\n", "### Lynn Cherny, 2/15, arnicas@gmail\n", "\n", "Doing the same operation using a more flexible classification tool... \n", "Full repo here: https://github.com/arnicas/NLP-in-Python\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For scikit-learn, it's convenient to have the data files in separate directories based on their classification.\n", "So there's a \"yes\" and \"no\" subdirectory for the labeled files for my training and test data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "ls data/fiftyshades/" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u001b[34mno\u001b[m\u001b[m/ \u001b[34myes\u001b[m\u001b[m/\r\n" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_files" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "# If you want to load data into sklearn, put your text into directories labeled by the category \n", "# 'labels'\n", "\n", "bunchf = load_files('data/fiftyshades/', categories=['yes','no'])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "# This is what sklearn creates:\n", "\n", "bunchf.keys()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "['target_names', 'data', 'target', 'DESCR', 'filenames']" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "len(bunchf.data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "382" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "bunchf.filenames[0:10]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "array(['data/fiftyshades/no/no_fifty_500_77',\n", " 'data/fiftyshades/no/no_fifty_500_239',\n", " 'data/fiftyshades/yes/yes_fifty_500_328',\n", " 'data/fiftyshades/yes/yes_fifty_500_208',\n", " 'data/fiftyshades/no/no_fifty_500_30',\n", " 'data/fiftyshades/no/no_fifty_500_183',\n", " 'data/fiftyshades/yes/yes_fifty_500_114',\n", " 'data/fiftyshades/no/no_fifty_500_17',\n", " 'data/fiftyshades/yes/yes_fifty_500_92',\n", " 'data/fiftyshades/no/no_fifty_500_15'], \n", " dtype='|S38')" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "# Try uncommenting each of the lines to see what's what\n", "\n", "bunchf.filenames[0]\n", "#bunchf.target[0]\n", "#bunchf.target_names[0]\n", "#bunchf.data[0]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ "'data/fiftyshades/no/no_fifty_500_77'" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Split up the data into train and test using sklearn's cross-validation..." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "# instead of a simple true/false for a feature (word), we'll use the TF-IDF weight\n", "from sklearn.feature_extraction.text import TfidfVectorizer" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import cross_validation\n", "# try changing the random_state and % of test data - interesting differences in results.\n", "Xf_train, Xf_test, yf_train, yf_test = cross_validation.train_test_split(bunchf.data, \n", " bunchf.target, \n", " test_size=0.30,\n", " random_state=4)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "from nltk import word_tokenize, wordpunct_tokenize\n", "from nltk.stem import WordNetLemmatizer\n", "from nltk.stem import PorterStemmer\n", "import string\n", "\n", "class LemmaTokenizer(object):\n", " \"\"\" You can use either the lemmatizer or the stemmer here - try both? \"\"\"\n", " def __init__(self):\n", " self.wnl = WordNetLemmatizer()\n", " self.port = PorterStemmer()\n", " def __call__(self, doc):\n", " # Comment/uncomment whichever one you want to try\n", " #return [self.wnl.lemmatize(t.lower()) for t in word_tokenize(doc) if t not in string.punctuation and len(t) > 2]\n", " return [self.port.stem(t.lower()) for t in word_tokenize(doc) if t not in string.punctuation and len(t) > 2]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "# The sklearn vectorizer for TF-IDF has the stopwords as an option.\n", "\n", "tfidfvec = TfidfVectorizer(tokenizer=LemmaTokenizer(), stop_words='english')\n", "# we create the tf-idf model from the training data:\n", "vectors_train = tfidfvec.fit_transform(Xf_train)\n", "\n", "# Depending on whether you stemmed or lemmatized, you'll get different column numbers here!\n", "vectors_train.shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "(267, 5405)" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.naive_bayes import MultinomialNB\n", "import sklearn.metrics as metrics\n", "\n", "\n", "# We pick our classifier\n", "clf = MultinomialNB(alpha=.01)\n", "# We train the classifier on the training data and target classes (yes/no)\n", "clf.fit(vectors_train, yf_train)\n", "\n", "# We use the model on the test data:\n", "vectors_test = tfidfvec.transform(Xf_test)\n", "# We get a prediction from the test data \n", "pred = clf.predict(vectors_test)\n", "# We check the accuracy against the \"truth\" in the yf_test var\n", "metrics.accuracy_score(yf_test, pred)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "0.93043478260869561" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notice the accuracy is (probably) better than the base NLTK Naive Bayes (your results may vary a little due to random shuffling and seeds).\n", "\n", "*Unfortunately, getting most informative features out of sklearn is a little more ugly:*" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# code for binary classification case posted here: \n", "# http://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers\n", "\n", "def 
{ "cell_type": "code", "collapsed": false, "input": [ "# Code for the binary classification case posted here: \n", "# http://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers\n", "\n", "def show_most_informative_features(vectorizer, clf, n=20):\n", "    feature_names = vectorizer.get_feature_names()\n", "    # Pair each feature's weight in the model with its name, sorted low to high:\n", "    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))\n", "    # Zip the n lowest-weighted features with the n highest-weighted ones:\n", "    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])\n", "    for (coef_1, fn_1), (coef_2, fn_2) in top:\n", "        print \"\\t%.4f\\t%-15s\\t\\t%.4f\\t%-15s\" % (coef_1, fn_1, coef_2, fn_2)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "show_most_informative_features(tfidfvec, clf, n=15)  # highest-weighted features to the right, lowest to the left." ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\t-11.1539\t-nine-tail \t\t-3.9677\thi \n", "\t-11.1539\t146963 \t\t-5.1706\thand \n", "\t-11.1539\t15.1 \t\t-5.3202\teye \n", "\t-11.1539\t15.10 \t\t-5.3398\tthi \n", "\t-11.1539\t15.14 \t\t-5.4032\twant \n", "\t-11.1539\t15.19 \t\t-5.4302\tbreath \n", "\t-11.1539\t15.2 \t\t-5.5105\tkiss \n", "\t-11.1539\t15.21 \t\t-5.5657\tfeel \n", "\t-11.1539\t15.22 \t\t-5.5670\tfinger \n", "\t-11.1539\t15.24 \t\t-5.5917\tmouth \n", "\t-11.1539\t15.3 \t\t-5.6324\tslowli \n", "\t-11.1539\t15.4 \t\t-5.6421\twhisper \n", "\t-11.1539\t15.5 \t\t-5.6838\tpull \n", "\t-11.1539\t15.6 \t\t-5.6875\tbodi \n", "\t-11.1539\t15.7 \t\t-5.6961\tgroan \n" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Why do you think the features differ? What's different about how we created the inputs?*\n", "\n", "A few more references:\n", "\n", "* YHat post on Naive Bayes: http://blog.yhathq.com/posts/naive-bayes-in-python.html\n", "* A much better tutorial on text and sklearn: http://radimrehurek.com/data_science_python/\n", "\n", "And for more fun:\n", "* A tutorial on sentiment analysis with scikit-learn: https://marcobonzanini.wordpress.com/2015/01/19/sentiment-analysis-with-python-and-scikit-learn/" ] } ], "metadata": {} } ] }