{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Name:** \\_\\_\\_\\_\\_\n",
"\n",
"**EID:** \\_\\_\\_\\_\\_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial 1: Text Document Classification with KNN and Naive Bayes\n",
"\n",
"In this tutorial you will classify text documents using Naive Bayes classifers. We will be working with the dataset called \"20 Newsgroups\", which is a collection of 20,000 newsgroup posts organized into 20 categories."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Loading the 20 Newsgroups Dataset\n",
"The dataset is called “20 Newsgroups”. Here is the official description, quoted from the [website](http://qwone.com/~jason/20Newsgroups/)\n",
">The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"#First we need to initialize Python. Run the below cell.\n",
"%matplotlib inline\n",
"import IPython.core.display \n",
"# setup output image format (Chrome works best)\n",
"#IPython.core.display.set_matplotlib_formats(\"svg\")\n",
"%config InlineBackend.figure_format = 'svg'\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib\n",
"from numpy import *\n",
"from sklearn import *\n",
"from scipy import stats\n",
"random.seed(4487)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Put the file \"20news-bydate_py3.pkz' into the same directory as this ipynb file. **Do not unzip the file**."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# downloaded already at the same path"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Extract 4 classes ('alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space') from the dataset. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# here it use sklearn.datasets method, more details in Lecture2.3_3.1"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# strip away headers/footers/quotes from the text\n",
"removeset = ('headers', 'footers', 'quotes')\n",
"\n",
"# only use 4 categories\n",
"cats = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']\n",
"\n",
"# load the training and testing sets\n",
"newsgroups_train = datasets.fetch_20newsgroups(subset='train',\n",
" remove=removeset, categories=cats, data_home='./')\n",
"newsgroups_test = datasets.fetch_20newsgroups(subset='test', \n",
" remove=removeset, categories=cats, data_home='./')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Check if we got all the data. The training set should have 2034 documents, and the test set should have 1353 documents."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"training set size: 2034\n",
"testing set size: 1353\n",
"['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']\n"
]
}
],
"source": [
"print(\"training set size:\", len(newsgroups_train.data))\n",
"print(\"testing set size: \", len(newsgroups_test.data))\n",
"print(newsgroups_train.target_names)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Count the number examples in each class. `newsgroups_train.target` is an array of class values (0 through 3), and `newsgroups_train.target[i]` is the class of the i-th document."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"class counts\n",
"alt.atheism : 480\n",
"comp.graphics : 584\n",
"sci.space : 593\n",
"talk.religion.misc : 377\n"
]
}
],
"source": [
"print(\"class counts\")\n",
"for i in [0, 1, 2, 3]:\n",
" print(\"{:20s}: {}\".format(newsgroups_train.target_names[i], sum(newsgroups_train.target == i)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Show the documents. `newsgroups_train.data` is a list of strings, and `newsgroups_train.data[i]` is the i-th document."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- document 0 (class=comp.graphics) ---\n",
"Hi,\n",
"\n",
"I've noticed that if you only save a model (with all your mapping planes\n",
"positioned carefully) to a .3DS file that when you reload it after restarting\n",
"3DS, they are given a default position and orientation. But if you save\n",
"to a .PRJ file their positions/orientation are preserved. Does anyone\n",
"know why this information is not stored in the .3DS file? Nothing is\n",
"explicitly said in the manual about saving texture rules in the .PRJ file. \n",
"I'd like to be able to read the texture rule information, does anyone have \n",
"the format for the .PRJ file?\n",
"\n",
"Is the .CEL file format available from somewhere?\n",
"\n",
"Rych\n",
"--- document 1 (class=talk.religion.misc) ---\n",
"\n",
"\n",
"Seems to be, barring evidence to the contrary, that Koresh was simply\n",
"another deranged fanatic who thought it neccessary to take a whole bunch of\n",
"folks with him, children and all, to satisfy his delusional mania. Jim\n",
"Jones, circa 1993.\n",
"\n",
"\n",
"Nope - fruitcakes like Koresh have been demonstrating such evil corruption\n",
"for centuries.\n",
"--- document 2 (class=sci.space) ---\n",
"\n",
" >In article <1993Apr19.020359.26996@sq.sq.com>, msb@sq.sq.com (Mark Brader) \n",
"\n",
"MB> So the\n",
"MB> 1970 figure seems unlikely to actually be anything but a perijove.\n",
"\n",
"JG>Sorry, _perijoves_...I'm not used to talking this language.\n",
"\n",
"Couldn't we just say periapsis or apoapsis?\n",
"\n",
" \n",
"--- document 3 (class=alt.atheism) ---\n",
"I have a request for those who would like to see Charley Wingate\n",
"respond to the \"Charley Challenges\" (and judging from my e-mail, there\n",
"appear to be quite a few of you.) \n",
"\n",
"It is clear that Mr. Wingate intends to continue to post tangential or\n",
"unrelated articles while ingoring the Challenges themselves. Between\n",
"the last two re-postings of the Challenges, I noted perhaps a dozen or\n",
"more posts by Mr. Wingate, none of which answered a single Challenge. \n",
"\n",
"It seems unmistakable to me that Mr. Wingate hopes that the questions\n",
"will just go away, and he is doing his level best to change the\n",
"subject. Given that this seems a rather common net.theist tactic, I\n",
"would like to suggest that we impress upon him our desire for answers,\n",
"in the following manner:\n",
"\n",
"1. Ignore any future articles by Mr. Wingate that do not address the\n",
"Challenges, until he answers them or explictly announces that he\n",
"refuses to do so.\n",
"\n",
"--or--\n",
"\n",
"2. If you must respond to one of his articles, include within it\n",
"something similar to the following:\n",
"\n",
" \"Please answer the questions posed to you in the Charley Challenges.\"\n",
"\n",
"Really, I'm not looking to humiliate anyone here, I just want some\n",
"honest answers. You wouldn't think that honesty would be too much to\n",
"ask from a devout Christian, would you? \n",
"\n",
"Nevermind, that was a rhetorical question.\n"
]
}
],
"source": [
"for i in [0, 1, 2 ,3]:\n",
" print(\"--- document {} (class={}) ---\".format(\n",
" i, newsgroups_train.target_names[newsgroups_train.target[i]]))\n",
" print(newsgroups_train.data[i])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Tip:** while you do the tutorial, it is okay to make additional code cells in the file. This will allow you to avoid re-running code (like training a classifier, then testing a classifier)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Extracting Features from Text Files\n",
"In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.\n",
"\n",
"Next, we will introduce two basic text representation methods: One-hot encoding, Bag of words, and TF-IDF. More feature vector extraction functions, please refer to https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### one-hot encoding\n",
"- Each word is coded with an index, which is represented by one-hot."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> John likes to watch movies. Mary likes too.\n",
"\n",
"> John also likes to watch football games."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we need to represent the words in the above two sentences, you can encode the words as following:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> {\"John\": 1, \"likes\": 2, \"to\": 3, \"watch\": 4, \"movies\": 5, \"also\":6, \"football\": 7, \"games\": 8, \"Mary\": 9, \"too\": 10}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can encode each word using one-hot method"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
">John: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n",
"\n",
">likes: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]\n",
"\n",
">..."
]
},
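{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a small illustrative sketch (not part of the tutorial exercises) of how such one-hot vectors could be built in plain Python. The `vocab` dictionary and `one_hot` helper are hypothetical names; note that Python indices start at 0 rather than 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch of one-hot encoding (hypothetical helper, 0-based indices)\n",
"vocab = {\"John\": 0, \"likes\": 1, \"to\": 2, \"watch\": 3, \"movies\": 4,\n",
"         \"also\": 5, \"football\": 6, \"games\": 7, \"Mary\": 8, \"too\": 9}\n",
"\n",
"def one_hot(word, vocab):\n",
"    # vector of zeros with a single 1 at the word's index\n",
"    vec = [0] * len(vocab)\n",
"    vec[vocab[word]] = 1\n",
"    return vec\n",
"\n",
"print(\"John :\", one_hot(\"John\", vocab))\n",
"print(\"likes:\", one_hot(\"likes\", vocab))"
]
},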
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### However, this text representation method is impractical when the scale of corpus(语料库) becomes large."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bag of Words\n",
"- The index value of a word in the vocabulary is linked to its frequency in the whole training corpus."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> John likes to watch movies. Mary likes too. -->> [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]\n",
"\n",
"> John also likes to watch football games. -->> [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The **sklearn.feature_extraction.text.CountVectorizer** implement the `Bag of Words` method that converts a collection of text documents to a matrix of token counts. This implementation produces a sparse(稀疏的) representation of the counts using **scipy.sparse.coo_matrix** to save memory by only storing the non-zero parts of the feature vectors in memory."
]
},
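{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (a minimal sketch using the two example sentences above, not part of the exercises), `CountVectorizer` produces such count vectors directly. Note that it lowercases the text and orders its vocabulary alphabetically, so the columns differ from the hand-built indexing above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: bag-of-words counts for the two example sentences\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"sentences = [\"John likes to watch movies. Mary likes too.\",\n",
"             \"John also likes to watch football games.\"]\n",
"\n",
"vec = CountVectorizer()\n",
"counts = vec.fit_transform(sentences)   # a scipy.sparse matrix of token counts\n",
"\n",
"print(vec.get_feature_names())          # vocabulary (use get_feature_names_out() in newer scikit-learn)\n",
"print(counts.toarray())                 # dense view of the per-document counts\n",
"print(type(counts))                     # sparse storage type"
]
},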
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Term Frequency - Inverse Document Frequency (TF-IDF)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the word bag model, we can get the vector representation of this text.However, in the face of the diversity of text, each word has different weight to the content of text in practical application, so we introduce tf-idf model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### TF (Term Frequency)\n",
"\n",
"In the case of the term frequency $tf(t, d)$, the simplest choice is to use the raw count of a term in a document, i.e., the number of times that term $t$ occurs in document $d$. If we denote the raw count by $f_{t, d}$, then the simplest tf scheme is $tf(t,d) = f_{t, d}$. \n",
"\n",
"$tf_{t, d} = n_{t, d}/\\sum_kn_{t, d}$\n",
"\n",
"The numerator in the above formula is the number of occurrences of the word in the document $d$, and the denominator is the sum of the occurrences of all words in the document $d$.\n",
"\n",
"##### IDF (Inverse Document Frequency) \n",
"\n",
"The inverse document frequency is a measure of how much information the word provides, i.e., if it's common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient): \n",
"\n",
"$idf(t ,D) = log\\frac{N}{|\\{ d\\in D:t \\in d \\}|}$\n",
"\n",
"with \n",
"- $N$: total number of documents in the corpus $N=|D|$\n",
"- $|\\{ d\\in D:t \\in d \\}|$: number of documents where the term $t$ appears. If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to adjust the denominator to $1+|\\{ d\\in D:t \\in d \\}|$\n",
"\n",
"Then tf-idf is calculated as: \n",
"$tfidf(t, d, D) = tf(t, d) * idf(t, D)$\n",
"\n",
"Both tf and tf–idf can be computed as follows using **sklearn.feature_extraction.text.TfidfTransformer**."
]
},
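{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small worked example (hypothetical numbers, just to illustrate the formulas): suppose the term *graphics* appears 3 times in a 100-word document, so $tf = 3/100 = 0.03$, and it appears in 10 out of $N = 1000$ documents, so $idf = \\log(1000/10) = \\log(100) \\approx 4.61$ (natural log). The tf-idf weight is then $0.03 \\times 4.61 \\approx 0.14$."
]
},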
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']\n",
"[[0 1 1 1 0 0 1 0 1]\n",
" [0 2 0 1 0 1 1 0 1]\n",
" [1 0 0 1 1 0 1 1 1]\n",
" [0 1 1 1 0 0 1 0 1]]\n"
]
}
],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
" \n",
"corpus = ['This is the first document.',\n",
"\t'This document is the second document.',\n",
"\t'And this is the third one.',\n",
"\t'Is this the first document?']\n",
" \n",
"vectorizer = CountVectorizer()\n",
"X = vectorizer.fit_transform(corpus)\n",
" \n",
"print(vectorizer.get_feature_names())\n",
"print(X.toarray())\n"
]
},
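{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a follow-up sketch (reusing the counts matrix `X` from the cell above), `TfidfTransformer` converts the raw counts into tf-idf weights:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: turn the raw counts X into tf-idf weights\n",
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"\n",
"tfidf = TfidfTransformer(use_idf=True)   # norm='l2' and smooth_idf=True by default\n",
"X_tfidf = tfidf.fit_transform(X)         # same shape as X, but tf-idf weighted\n",
"\n",
"print(X_tfidf.shape)\n",
"print(X_tfidf.toarray().round(2))        # rows are documents, columns follow the vocabulary above"
]
},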
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create the vocabulary from the training data. Then use **sklearn.feature_extraction.text.CountVectorizer** to build the document vectors for the training and testing sets. You can decide how many words you want in the vocabulary"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# feature_extraction.text.CountVectorizer(stop_words='english')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. K Nearest Neighbor (KNN)\n",
"Let's train a K Nearest Neighbor (KNN) model. Using cross-validation to select the best K parameter. Then, showing the accuracy of training and testing set."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# 1. C: paramgrid = {'n_neighbors': [3,5,7]}\n",
"# 2. cross-validation: clf = model_selection.GridSearchCV(neighbors.KNeighborsClassifier(), param_grid=paramgrid, cv=10, n_jobs=-1)\n",
"# 3. To find the best K: print(clf.best_params_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Bernoulli Naive Bayes \n",
"Learn a Bernoulli Naive Bayes model from the training set. What is the prediction accuracy on the test set? Try different parameters (alpha, max_features, etc) to get the best performance."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# 1. naive_bayes.BernoulliNB(alpha=0.1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are the most informative words for each category? Run the below code.\n",
"\n",
"Note: `model.coef_[i]` will index the scores for the i-th class"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"scrolled": false
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'cntvect' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# get the word names\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mfnames\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0masarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcntvect\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_feature_names\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mc\u001b[0m \u001b[0;32min\u001b[0m \u001b[0menumerate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnewsgroups_train\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtarget_names\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mtmp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0margsort\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcoef_\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"class\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mNameError\u001b[0m: name 'cntvect' is not defined"
]
}
],
"source": [
"# get the word names\n",
"fnames = asarray(cntvect.get_feature_names())\n",
"for i,c in enumerate(newsgroups_train.target_names):\n",
" tmp = argsort(bmodel.coef_[i])[-10:]\n",
" print(\"class\", c)\n",
" for t in tmp:\n",
" print(\" {:9s} ({:.5f})\".format(fnames[t], bmodel.coef_[i][t]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Multinomial Naive Bayes model\n",
"Now learn a multinomial Naive Bayes model using the TF-IDF representation for the documents. Again try different parameter values to improve the test accuracy."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT \n",
"# 1. feature_extraction.text.TfidfTransformer(use_idf=True, norm= )\n",
"# 2. naive_bayes.MultinomialNB(alpha= )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are the most informative features for Multinomial model? Run the below code."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'cntvect' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# get the word names\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mfnames\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0masarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcntvect\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_feature_names\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mc\u001b[0m \u001b[0;32min\u001b[0m \u001b[0menumerate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnewsgroups_train\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtarget_names\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mtmp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0margsort\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmmodel_tf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcoef_\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"class\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mNameError\u001b[0m: name 'cntvect' is not defined"
]
}
],
"source": [
"# get the word names\n",
"fnames = asarray(cntvect.get_feature_names())\n",
"for i,c in enumerate(newsgroups_train.target_names):\n",
" tmp = argsort(mmodel_tf.coef_[i])[-10:]\n",
" print(\"class\", c)\n",
" for t in tmp:\n",
" print(\" {:9s} ({:.5f})\".format(fnames[t], mmodel_tf.coef_[i][t]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How do the most informative words differ between the TF-IDF multinomial model and the Bernoulli model?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **INSERT YOUR ANSWER HERE**"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## 6. Effect of smoothing\n",
"The smoothing (regularization) parameter has a big effect on the performance. Using the Multinomial TF-IDF models, make a plot of accuracy versus different values of alpha. For each alpha, you need to train a new model. Which alpha value yields the best result?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# 1. Iterating: alphas = logspace(-5,0,50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Effect of vocabulary size\n",
"The vocabulary size also affects the accuracy. Make another plot of accuracy versus vocabulary size. Which vocabulary size yields the best result?"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# 1. Iterating: maxfeatures = linspace(100,26577,20)"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
}