{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Naive Bayes Model for Newsgroups Data\n", "\n", "For an explanation of the Naive Bayes model, see [our course notes](https://jennselby.github.io/MachineLearningCourseNotes/#naive-bayes).\n", "\n", "This notebook uses code from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html.\n", "\n", "## Instructions\n", "0. If you haven't already, follow [the setup instructions here](https://jennselby.github.io/MachineLearningCourseNotes/#setting-up-python3) to get all necessary software installed.\n", "0. Read through the code in the following sections:\n", " * [Newgroups Data](#Newgroups-Data)\n", " * [Model Training](#Model-Training)\n", " * [Prediction](#Prediction)\n", "0. Complete at least one of the following exercises:\n", " * [Exercise Option #1 - Standard Difficulty](#Exercise-Option-#1---Standard-Difficulty)\n", " * [Exercise Option #2 - Advanced Difficulty](#Exercise-Option-#2---Advanced-Difficulty)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups # the 20 newgroups set is included in scikit-learn\n", "from sklearn.naive_bayes import MultinomialNB # we need this for our Naive Bayes model\n", "\n", "# These next two are about processing the data. We'll look into this more later in the semester.\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.feature_extraction.text import TfidfTransformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Newgroups Data\n", "\n", "Back in the day, [Usenet](https://en.wikipedia.org/wiki/Usenet_newsgroup) was a popular discussion system where people could discuss topics in relevant newsgroups (think Slack channel or subreddit). At some point, someone pulled together messages sent to 20 different newsgroups, to use as [a dataset for doing text processing](http://qwone.com/~jason/20Newsgroups/).\n", "\n", "We are going to pull out messages from just a few different groups to try out a Naive Bayes model.\n", "\n", "Examine the newsgroups dictionary, to make sure you understand the dataset.\n", "\n", "**Note**: If you get an error about SSL certificates, you can fix this with the following:\n", "1. In Finder, click on Applications in the list on the left panel\n", "1. Double click to go into the Python folder (it will be called something like Python 3.7)\n", "1. Double click on the Install Certificates command in that folder\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# which newsgroups we want to download\n", "newsgroup_names = ['comp.graphics', 'rec.sport.hockey', 'sci.electronics', 'sci.space']\n", "\n", "# get the newsgroup data (organized much like the iris data)\n", "newsgroups = fetch_20newsgroups(categories=newsgroup_names, shuffle=True, random_state=265)\n", "newsgroups.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This next part does some processing of the data, because the scikit-learn Naive Bayes module is expecting numerical data rather than text data. We will talk more about what this code is doing later in the semester. For now, you can ignore it." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Convert the text into numbers that represent each word (bag of words method)\n", "word_vector = CountVectorizer()\n", "word_vector_counts = word_vector.fit_transform(newsgroups.data)\n", "\n", "# Account for the length of the documents:\n", "# get the frequency with which the word occurs instead of the raw number of times\n", "term_freq_transformer = TfidfTransformer()\n", "term_freq = term_freq_transformer.fit_transform(word_vector_counts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Training\n", "\n", "Now we fit the Naive Bayes model to the subset of the 20 newsgroups data that we've pulled out." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Train the Naive Bayes model\n", "model = MultinomialNB().fit(term_freq, newsgroups.target)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction\n", "\n", "Let's see how the model does on some (very short) documents that we made up to fit into the specific categories our model is trained on." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Predictions:\n", "\tThat GPU has amazing performance with a lot of shaders => comp.graphics\n", "\tThe player had a wicked slap shot => rec.sport.hockey\n", "\tI spent all day yesterday soldering banks of capacitors => sci.space\n", "\tToday I have to solder a bank of capacitors => sci.electronics\n", "\tNASA has rovers on Mars => sci.space\n", "Probabilities:\n", "comp.graphics rec.sport.hockey sci.electronics sci.space \n", "0.29466149 0.22895149 0.24926344 0.22712357 \n", "0.12948055 0.51155698 0.18248712 0.17647535 \n", "0.18604814 0.24117771 0.27540452 0.29736963 \n", "0.21285086 0.21081302 0.3486507 0.22768541 \n", "0.079185633 0.066225915 0.10236622 0.75222223 \n" ] } ], "source": [ "# Predict some new fake documents\n", "fake_docs = [\n", " 'That GPU has amazing performance with a lot of shaders',\n", " 'The player had a wicked slap shot',\n", " 'I spent all day yesterday soldering banks of capacitors',\n", " 'Today I have to solder a bank of capacitors',\n", " 'NASA has rovers on Mars']\n", "fake_counts = word_vector.transform(fake_docs)\n", "fake_term_freq = term_freq_transformer.transform(fake_counts)\n", "\n", "predicted = model.predict(fake_term_freq)\n", "print('Predictions:')\n", "for doc, group in zip(fake_docs, predicted):\n", " print('\\t{0} => {1}'.format(doc, newsgroups.target_names[group]))\n", "\n", "probabilities = model.predict_proba(fake_term_freq)\n", "print('Probabilities:')\n", "print(''.join(['{:17}'.format(name) for name in newsgroups.target_names]))\n", "for probs in probabilities:\n", " print(''.join(['{:<17.8}'.format(prob) for prob in probs]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise Option #1 - Standard Difficulty\n", "\n", "Modify the fake documents and add some new documents of your own. \n", "\n", "What words in your documents have particularly large effects on the model probabilities? 
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise Option #1 - Standard Difficulty\n", "\n", "Modify the fake documents and add some new documents of your own.\n", "\n", "What words in your documents have particularly large effects on the model probabilities? Note that we're not looking for documents that consist of a single word, but for words that, when included in or excluded from a document, tend to change the model's output.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise Option #2 - Advanced Difficulty\n", "\n", "Write some code to count up how often the words you found in the exercise above appear in each category in the training dataset. Does this match up with your intuition? (If you would like a nudge, there is a starter sketch after the empty cell below.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] },
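{ "cell_type": "markdown", "metadata": {}, "source": [ "If you get stuck on Exercise Option #2, the cell below is one possible starting point that we have added; it is a sketch, not the only correct approach. It tallies how often a single example word appears in each category, using the word_vector_counts matrix and newsgroups.target from earlier. The word 'capacitors' is just an example; substitute a word you found, and note that the vocabulary lookup raises a KeyError for any word that never appears in the training data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# One possible starting point for Exercise Option #2 (our addition)\n", "word = 'capacitors' # an example word; substitute one of your own\n", "\n", "# column of the bag-of-words count matrix that corresponds to this word\n", "column = word_vector.vocabulary_[word]\n", "\n", "for category_index, name in enumerate(newsgroups.target_names):\n", " # select the rows (documents) that belong to this category\n", " rows = word_vector_counts[newsgroups.target == category_index]\n", " # total number of times the word appears in those documents\n", " print('{}: {}'.format(name, rows[:, column].sum()))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 2 }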