{ "metadata": { "name": "", "signature": "sha256:313a543e6843b06c804ad48ce5e91708cf3824b30fa30a3dd0ad949169be1c58" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "##5. Naive Bayes in Scikit-Learn: A Brief Intro\n", "\n", "### Lynn Cherny, 2/15, arnicas@gmail\n", "\n", "Doing the same operation using a more flexible classification tool... \n", "Full repo here: https://github.com/arnicas/NLP-in-Python\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For scikit-learn, it's convenient to have the data files in separate directories based on their classification.\n", "So there's a \"yes\" and \"no\" subdirectory for the labeled files for my training and test data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "ls data/fiftyshades/" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\u001b[34mno\u001b[m\u001b[m/ \u001b[34myes\u001b[m\u001b[m/\r\n" ] } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_files" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "# If you want to load data into sklearn, put your text into directories labeled by the category \n", "# 'labels'\n", "\n", "bunchf = load_files('data/fiftyshades/', categories=['yes','no'])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "# This is what sklearn creates:\n", "\n", "bunchf.keys()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "['target_names', 'data', 'target', 'DESCR', 'filenames']" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "len(bunchf.data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "382" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "bunchf.filenames[0:10]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "array(['data/fiftyshades/no/no_fifty_500_77',\n", " 'data/fiftyshades/no/no_fifty_500_239',\n", " 'data/fiftyshades/yes/yes_fifty_500_328',\n", " 'data/fiftyshades/yes/yes_fifty_500_208',\n", " 'data/fiftyshades/no/no_fifty_500_30',\n", " 'data/fiftyshades/no/no_fifty_500_183',\n", " 'data/fiftyshades/yes/yes_fifty_500_114',\n", " 'data/fiftyshades/no/no_fifty_500_17',\n", " 'data/fiftyshades/yes/yes_fifty_500_92',\n", " 'data/fiftyshades/no/no_fifty_500_15'], \n", " dtype='|S38')" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "# Try uncommenting each of the lines to see what's what\n", "\n", "bunchf.filenames[0]\n", "#bunchf.target[0]\n", "#bunchf.target_names[0]\n", "#bunchf.data[0]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ "'data/fiftyshades/no/no_fifty_500_77'" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Split up the data into train and test using sklearn's cross-validation..." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "# instead of a simple true/false for a feature (word), we'll use the TF-IDF weight\n", "from sklearn.feature_extraction.text import TfidfVectorizer" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import cross_validation\n", "# try changing the random_state and % of test data - interesting differences in results.\n", "Xf_train, Xf_test, yf_train, yf_test = cross_validation.train_test_split(bunchf.data, \n", " bunchf.target, \n", " test_size=0.30,\n", " random_state=4)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "from nltk import word_tokenize, wordpunct_tokenize\n", "from nltk.stem import WordNetLemmatizer\n", "from nltk.stem import PorterStemmer\n", "import string\n", "\n", "class LemmaTokenizer(object):\n", " \"\"\" You can use either the lemmatizer or the stemmer here - try both? \"\"\"\n", " def __init__(self):\n", " self.wnl = WordNetLemmatizer()\n", " self.port = PorterStemmer()\n", " def __call__(self, doc):\n", " # Comment/uncomment whichever one you want to try\n", " #return [self.wnl.lemmatize(t.lower()) for t in word_tokenize(doc) if t not in string.punctuation and len(t) > 2]\n", " return [self.port.stem(t.lower()) for t in word_tokenize(doc) if t not in string.punctuation and len(t) > 2]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "# The sklearn vectorizer for TF-IDF has the stopwords as an option.\n", "\n", "tfidfvec = TfidfVectorizer(tokenizer=LemmaTokenizer(), stop_words='english')\n", "# we create the tf-idf model from the training data:\n", "vectors_train = tfidfvec.fit_transform(Xf_train)\n", "\n", "# Depending on whether you stemmed or lemmatized, you'll get different column numbers here!\n", "vectors_train.shape" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "(267, 5405)" ] } ], "prompt_number": 10 }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.naive_bayes import MultinomialNB\n", "import sklearn.metrics as metrics\n", "\n", "\n", "# We pick our classifier\n", "clf = MultinomialNB(alpha=.01)\n", "# We train the classifier on the training data and target classes (yes/no)\n", "clf.fit(vectors_train, yf_train)\n", "\n", "# We use the model on the test data:\n", "vectors_test = tfidfvec.transform(Xf_test)\n", "# We get a prediction from the test data \n", "pred = clf.predict(vectors_test)\n", "# We check the accuracy against the \"truth\" in the yf_test var\n", "metrics.accuracy_score(yf_test, pred)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ "0.93043478260869561" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notice the accuracy is (probably) better than the base NLTK Naive Bayes (your results may vary a little due to random shuffling and seeds).\n", "\n", "*Unfortunately, getting most informative features out of sklearn is a little more ugly:*" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# code for binary classification case posted here: \n", "# http://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers\n", "\n", "def 
{ "cell_type": "code", "collapsed": false, "input": [ "# Code for the binary classification case posted here: \n", "# http://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers\n", "\n", "def show_most_informative_features(vectorizer, clf, n=20):\n", "    feature_names = vectorizer.get_feature_names()\n", "    # Pair each feature's weight in the model with its name, sorted low to high:\n", "    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))\n", "    # Zip the n lowest-weighted features with the n highest-weighted ones:\n", "    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])\n", "    for (coef_1, fn_1), (coef_2, fn_2) in top:\n", "        print \"\\t%.4f\\t%-15s\\t\\t%.4f\\t%-15s\" % (coef_1, fn_1, coef_2, fn_2)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "show_most_informative_features(tfidfvec, clf, n=15)  # highest-weighted features to the right, lowest to the left." ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\t-11.1539\t-nine-tail \t\t-3.9677\thi \n", "\t-11.1539\t146963 \t\t-5.1706\thand \n", "\t-11.1539\t15.1 \t\t-5.3202\teye \n", "\t-11.1539\t15.10 \t\t-5.3398\tthi \n", "\t-11.1539\t15.14 \t\t-5.4032\twant \n", "\t-11.1539\t15.19 \t\t-5.4302\tbreath \n", "\t-11.1539\t15.2 \t\t-5.5105\tkiss \n", "\t-11.1539\t15.21 \t\t-5.5657\tfeel \n", "\t-11.1539\t15.22 \t\t-5.5670\tfinger \n", "\t-11.1539\t15.24 \t\t-5.5917\tmouth \n", "\t-11.1539\t15.3 \t\t-5.6324\tslowli \n", "\t-11.1539\t15.4 \t\t-5.6421\twhisper \n", "\t-11.1539\t15.5 \t\t-5.6838\tpull \n", "\t-11.1539\t15.6 \t\t-5.6875\tbodi \n", "\t-11.1539\t15.7 \t\t-5.6961\tgroan \n" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Why do you think the features differ? What's different about how we created the inputs?*\n", "\n", "A few more references:\n", "\n", "* YHat post on Naive Bayes: http://blog.yhathq.com/posts/naive-bayes-in-python.html\n", "* A much better tutorial on text and sklearn: http://radimrehurek.com/data_science_python/\n", "\n", "And for more fun:\n", "* A tutorial on sentiment analysis with scikit-learn: https://marcobonzanini.wordpress.com/2015/01/19/sentiment-analysis-with-python-and-scikit-learn/" ] } ], "metadata": {} } ] }