{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Name:** \\_\\_\\_\\_\\_\n",
"\n",
"**EID:** \\_\\_\\_\\_\\_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial 1: Text Document Classification with KNN and Naive Bayes\n",
"\n",
"In this tutorial you will classify text documents using Naive Bayes classifers. We will be working with the dataset called \"20 Newsgroups\", which is a collection of 20,000 newsgroup posts organized into 20 categories."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Loading the 20 Newsgroups Dataset\n",
"The dataset is called “20 Newsgroups”. Here is the official description, quoted from the [website](http://qwone.com/~jason/20Newsgroups/)\n",
">The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"#First we need to initialize Python. Run the below cell.\n",
"%matplotlib inline\n",
"import IPython.core.display \n",
"# setup output image format (Chrome works best)\n",
"IPython.core.display.set_matplotlib_formats(\"svg\")\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib\n",
"from numpy import *\n",
"from sklearn import *\n",
"from scipy import stats\n",
"random.seed(4487)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Put the file \"20news-bydate_py3.pkz' into the same directory as this ipynb file. **Do not unzip the file**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Extract 4 classes ('alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space') from the dataset. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# strip away headers/footers/quotes from the text\n",
"removeset = ('headers', 'footers', 'quotes')\n",
"\n",
"# only use 4 categories\n",
"cats = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']\n",
"\n",
"# load the training and testing sets\n",
"newsgroups_train = datasets.fetch_20newsgroups(subset='train',\n",
" remove=removeset, categories=cats, data_home='./')\n",
"newsgroups_test = datasets.fetch_20newsgroups(subset='test', \n",
" remove=removeset, categories=cats, data_home='./')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Check if we got all the data. The training set should have 2034 documents, and the test set should have 1353 documents."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"training set size: 2034\n",
"testing set size: 1353\n",
"['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']\n"
]
}
],
"source": [
"print(\"training set size:\", len(newsgroups_train.data))\n",
"print(\"testing set size: \", len(newsgroups_test.data))\n",
"print(newsgroups_train.target_names)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Count the number examples in each class. `newsgroups_train.target` is an array of class values (0 through 3), and `newsgroups_train.target[i]` is the class of the i-th document."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"class counts\n",
"alt.atheism : 480\n",
"comp.graphics : 584\n",
"sci.space : 593\n",
"talk.religion.misc : 377\n"
]
}
],
"source": [
"print(\"class counts\")\n",
"for i in [0, 1, 2, 3]:\n",
" print(\"{:20s}: {}\".format(newsgroups_train.target_names[i], sum(newsgroups_train.target == i)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Show the documents. `newsgroups_train.data` is a list of strings, and `newsgroups_train.data[i]` is the i-th document."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- document 0 (class=comp.graphics) ---\n",
"Hi,\n",
"\n",
"I've noticed that if you only save a model (with all your mapping planes\n",
"positioned carefully) to a .3DS file that when you reload it after restarting\n",
"3DS, they are given a default position and orientation. But if you save\n",
"to a .PRJ file their positions/orientation are preserved. Does anyone\n",
"know why this information is not stored in the .3DS file? Nothing is\n",
"explicitly said in the manual about saving texture rules in the .PRJ file. \n",
"I'd like to be able to read the texture rule information, does anyone have \n",
"the format for the .PRJ file?\n",
"\n",
"Is the .CEL file format available from somewhere?\n",
"\n",
"Rych\n",
"--- document 1 (class=talk.religion.misc) ---\n",
"\n",
"\n",
"Seems to be, barring evidence to the contrary, that Koresh was simply\n",
"another deranged fanatic who thought it neccessary to take a whole bunch of\n",
"folks with him, children and all, to satisfy his delusional mania. Jim\n",
"Jones, circa 1993.\n",
"\n",
"\n",
"Nope - fruitcakes like Koresh have been demonstrating such evil corruption\n",
"for centuries.\n",
"--- document 2 (class=sci.space) ---\n",
"\n",
" >In article <1993Apr19.020359.26996@sq.sq.com>, msb@sq.sq.com (Mark Brader) \n",
"\n",
"MB> So the\n",
"MB> 1970 figure seems unlikely to actually be anything but a perijove.\n",
"\n",
"JG>Sorry, _perijoves_...I'm not used to talking this language.\n",
"\n",
"Couldn't we just say periapsis or apoapsis?\n",
"\n",
" \n",
"--- document 3 (class=alt.atheism) ---\n",
"I have a request for those who would like to see Charley Wingate\n",
"respond to the \"Charley Challenges\" (and judging from my e-mail, there\n",
"appear to be quite a few of you.) \n",
"\n",
"It is clear that Mr. Wingate intends to continue to post tangential or\n",
"unrelated articles while ingoring the Challenges themselves. Between\n",
"the last two re-postings of the Challenges, I noted perhaps a dozen or\n",
"more posts by Mr. Wingate, none of which answered a single Challenge. \n",
"\n",
"It seems unmistakable to me that Mr. Wingate hopes that the questions\n",
"will just go away, and he is doing his level best to change the\n",
"subject. Given that this seems a rather common net.theist tactic, I\n",
"would like to suggest that we impress upon him our desire for answers,\n",
"in the following manner:\n",
"\n",
"1. Ignore any future articles by Mr. Wingate that do not address the\n",
"Challenges, until he answers them or explictly announces that he\n",
"refuses to do so.\n",
"\n",
"--or--\n",
"\n",
"2. If you must respond to one of his articles, include within it\n",
"something similar to the following:\n",
"\n",
" \"Please answer the questions posed to you in the Charley Challenges.\"\n",
"\n",
"Really, I'm not looking to humiliate anyone here, I just want some\n",
"honest answers. You wouldn't think that honesty would be too much to\n",
"ask from a devout Christian, would you? \n",
"\n",
"Nevermind, that was a rhetorical question.\n"
]
}
],
"source": [
"for i in [0, 1, 2 ,3]:\n",
" print(\"--- document {} (class={}) ---\".format(\n",
" i, newsgroups_train.target_names[newsgroups_train.target[i]]))\n",
" print(newsgroups_train.data[i])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Tip:** while you do the tutorial, it is okay to make additional code cells in the file. This will allow you to avoid re-running code (like training a classifier, then testing a classifier)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Extracting Features from Text Files\n",
"In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.\n",
"\n",
"Next, we will introduce two basic text representation methods: One-hot encoding, Bag of words, and TF-IDF. More feature vector extraction functions, please refer to https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### One-hot encoding\n",
"- Each word is coded with an index, which is represented by one-hot."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> John likes to watch movies. Mary likes too.\n",
"\n",
"> John also likes to watch football games."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we need to represent the words in the above two sentences, you can encode the words as following:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> {\"John\": 1, \"likes\": 2, \"to\": 3, \"watch\": 4, \"movies\": 5, \"also\":6, \"football\": 7, \"games\": 8, \"Mary\": 9, \"too\": 10}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can encode each word using one-hot method"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
">John: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n",
"\n",
">likes: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]\n",
"\n",
">..."
]
},
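  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For concreteness, the cell below gives a minimal illustrative sketch of this encoding, assuming a small hand-made `vocab` dictionary that mirrors the word-to-index mapping above (shifted to 0-based indices)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: one-hot encoding with a hypothetical hand-made vocabulary\n",
    "import numpy as np\n",
    "\n",
    "# word -> index mapping (0-based), mirroring the table above\n",
    "vocab = {\"John\": 0, \"likes\": 1, \"to\": 2, \"watch\": 3, \"movies\": 4,\n",
    "         \"also\": 5, \"football\": 6, \"games\": 7, \"Mary\": 8, \"too\": 9}\n",
    "\n",
    "def one_hot(word, vocab):\n",
    "    # vector of zeros with a single 1 at the word's index\n",
    "    vec = np.zeros(len(vocab), dtype=int)\n",
    "    vec[vocab[word]] = 1\n",
    "    return vec\n",
    "\n",
    "print(\"John :\", one_hot(\"John\", vocab))\n",
    "print(\"likes:\", one_hot(\"likes\", vocab))"
   ]
  },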
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### However, this text representation method is impractical when the scale of corpus becomes large."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bag of Words\n",
"- The index value of a word in the vocabulary is linked to its frequency in the whole training corpus."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> John likes to watch movies. Mary likes too. -->> [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]\n",
"\n",
"> John also likes to watch football games. -->> [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The **sklearn.feature_extraction.text.CountVectorizer** implement the `Bag of Words` method that converts a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using **scipy.sparse.coo_matrix** to save memory by only storing the non-zero parts of the feature vectors in memory."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Term Frequency - Inverse Document Frequency (TF-IDF)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the word bag model, we can get the vector representation of this text. However, in the face of the diversity of text, each word has different weight to the content of text in practical application, so we introduce tf-idf model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### TF (Term Frequency)\n",
"\n",
"In the case of the term frequency $tf(t, d)$, the simplest choice is to use the raw count of a term in a document, i.e., the number of times that term $t$ occurs in document $d$. If we denote the raw count by $f_{t, d}$, then the simplest tf scheme is $tf(t,d) = f_{t, d}$. \n",
"\n",
"$tf_{t, d} = n_{t, d}/\\sum_kn_{t, d}$\n",
"\n",
"The numerator in the above formula is the number of occurrences of the word in the document $d$, and the denominator is the sum of the occurrences of all words in the document $d$.\n",
"\n",
"##### IDF (Inverse Document Frequency) \n",
"\n",
"The inverse document frequency is a measure of how much information the word provides, i.e., if it's common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient): \n",
"\n",
"$idf(t ,D) = log\\frac{N}{|\\{ d\\in D:t \\in d \\}|}$\n",
"\n",
"with \n",
"- $N$: total number of documents in the corpus $N=|D|$\n",
"- $|\\{ d\\in D:t \\in d \\}|$: number of documents where the term $t$ appears. If the term is not in the corpus, this will lead to a division-by-zero. It is therefore common to adjust the denominator to $1+|\\{ d\\in D:t \\in d \\}|$\n",
"\n",
"Then tf-idf is calculated as: \n",
"$tfidf(t, d, D) = tf(t, d) * idf(t, D)$\n",
"\n",
"Both tf and tf–idf can be computed as follows using **sklearn.feature_extraction.text.TfidfTransformer**."
]
},
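  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick worked example, using the two example sentences from the one-hot section and the formulas above: the corpus has $N = 2$ documents. The word \"likes\" appears in both documents, so $idf(\\text{likes}, D) = \\log\\frac{2}{2} = 0$, and its tf-idf weight is 0 in every document even though it occurs twice in the first sentence. The word \"movies\" appears only in the first document, which has 8 words in total, so $tf = \\frac{1}{8}$, $idf = \\log\\frac{2}{1} = \\log 2$, and its tf-idf weight is $\\frac{1}{8}\\log 2 > 0$. Words that appear in every document are therefore down-weighted, while rarer, more distinctive words keep a positive weight."
   ]
  },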
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']\n",
"[[0 1 1 1 0 0 1 0 1]\n",
" [0 2 0 1 0 1 1 0 1]\n",
" [1 0 0 1 1 0 1 1 1]\n",
" [0 1 1 1 0 0 1 0 1]]\n"
]
}
],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
" \n",
"corpus = ['This is the first document.',\n",
"\t'This document is the second document.',\n",
"\t'And this is the third one.',\n",
"\t'Is this the first document?']\n",
" \n",
"vectorizer = CountVectorizer()\n",
"X = vectorizer.fit_transform(corpus)\n",
" \n",
"print(vectorizer.get_feature_names())\n",
"print(X.toarray())\n"
]
},
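  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the TF-IDF reweighting on the same toy corpus, the sketch below applies **sklearn.feature_extraction.text.TfidfTransformer** to the count matrix `X` from the previous cell. Note that scikit-learn uses a smoothed variant of the idf formula above, so the exact numbers will differ slightly from a hand computation, but the overall effect (down-weighting words that appear in every document) is the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfTransformer\n",
    "\n",
    "# reweight the raw counts X (from the CountVectorizer cell above) with tf-idf\n",
    "tfidf_trans = TfidfTransformer(use_idf=True)\n",
    "Xtfidf = tfidf_trans.fit_transform(X)\n",
    "\n",
    "# per-term idf weights learned from the toy corpus\n",
    "print(tfidf_trans.idf_)\n",
    "\n",
    "# tf-idf document vectors (rounded for readability)\n",
    "print(Xtfidf.toarray().round(2))"
   ]
  },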
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create the vocabulary from the training data. Then use **sklearn.feature_extraction.text.CountVectorizer** to build the document vectors for the training and testing sets. You can decide how many words you want in the vocabulary"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# feature_extraction.text.CountVectorizer(stop_words='english')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(2034, 26576)\n",
"(1353, 26576)\n"
]
}
],
"source": [
"### INSERT YOUR CODE HERE\n",
"\n",
"# setup the document vectorizer\n",
"# - use english stop words\n",
"cntvect = feature_extraction.text.CountVectorizer(stop_words='english')\n",
"\n",
"# create the vocabulary, and return the document vectors\n",
"trainX = cntvect.fit_transform(newsgroups_train.data)\n",
"trainY = newsgroups_train.target\n",
"\n",
"# convert the test data\n",
"testX = cntvect.transform(newsgroups_test.data)\n",
"testY = newsgroups_test.target\n",
"\n",
"print(trainX.shape)\n",
"print(testX.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. K Nearest Neighbor (KNN)\n",
"Let's train a K Nearest Neighbor (KNN) model. Using cross-validation to select the best K parameter. Then, showing the accuracy of training and testing set."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# 1. C: paramgrid = {'n_neighbors': [3,5,7]}\n",
"# 2. cross-validation: clf = model_selection.GridSearchCV(neighbors.KNeighborsClassifier(), param_grid=paramgrid, cv=10, n_jobs=-1)\n",
"# 3. To find the best K: print(clf.best_params_)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'n_neighbors': [3, 5, 7]}\n",
"train accuracy = 0.5585054080629301\n",
"test accuracy = 0.4050258684405026\n",
"{'n_neighbors': 3}\n"
]
}
],
"source": [
"paramgrid = {'n_neighbors': [3,5,7]}\n",
"print(paramgrid)\n",
"clf = model_selection.GridSearchCV(neighbors.KNeighborsClassifier(), param_grid=paramgrid, cv=10, n_jobs=-1)\n",
"# train the model\n",
"clf.fit(trainX, trainY)\n",
"# predict from the model\n",
"predYtrain = clf.predict(trainX)\n",
"predYtest = clf.predict(testX)\n",
"\n",
"# calculate accuracy\n",
"acc = metrics.accuracy_score(trainY, predYtrain)\n",
"print(\"train accuracy =\", acc)\n",
"\n",
"# calculate accuracy\n",
"acc = metrics.accuracy_score(testY, predYtest)\n",
"print(\"test accuracy =\", acc)\n",
"\n",
"print(clf.best_params_)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Bernoulli Naive Bayes \n",
"Learn a Bernoulli Naive Bayes model from the training set. What is the prediction accuracy on the test set? Try different parameters (alpha, max_features, etc) to get the best performance."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# 1. naive_bayes.BernoulliNB(alpha=0.1)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.696969696969697\n"
]
}
],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# fit the NB Bernoulli model.\n",
"# the model automatically converts count vector into binary vector\n",
"bmodel = naive_bayes.BernoulliNB(alpha=0.1)\n",
"bmodel.fit(trainX, trainY)\n",
"\n",
"# prediction\n",
"predY = bmodel.predict(testX)\n",
"\n",
"# calculate accuracy\n",
"acc = metrics.accuracy_score(testY, predY)\n",
"print(acc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are the most informative words for each category? Run the below code.\n",
"\n",
"Note: `model.coef_[i]` will index the scores for the i-th class"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"class alt.atheism\n",
" time (-1.80349)\n",
" know (-1.60881)\n",
" does (-1.60881)\n",
" god (-1.58822)\n",
" like (-1.54825)\n",
" say (-1.52885)\n",
" just (-1.45481)\n",
" think (-1.39424)\n",
" people (-1.29824)\n",
" don (-1.18991)\n",
"class comp.graphics\n",
" just (-1.95020)\n",
" don (-1.91473)\n",
" program (-1.88048)\n",
" need (-1.85829)\n",
" does (-1.74429)\n",
" use (-1.73454)\n",
" like (-1.60722)\n",
" know (-1.50966)\n",
" graphics (-1.49428)\n",
" thanks (-1.47166)\n",
"class sci.space\n",
" earth (-1.90706)\n",
" use (-1.88461)\n",
" time (-1.76942)\n",
" know (-1.73062)\n",
" think (-1.73062)\n",
" nasa (-1.73062)\n",
" don (-1.69327)\n",
" just (-1.47214)\n",
" like (-1.41502)\n",
" space (-1.01909)\n",
"class talk.religion.misc\n",
" say (-1.65472)\n",
" way (-1.62736)\n",
" like (-1.61395)\n",
" does (-1.53709)\n",
" know (-1.48895)\n",
" think (-1.42082)\n",
" god (-1.37785)\n",
" don (-1.35703)\n",
" just (-1.34679)\n",
" people (-1.31667)\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/zhw/opt/anaconda3/envs/DNN/lib/python3.6/site-packages/sklearn/utils/deprecation.py:101: FutureWarning: Attribute coef_ was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).\n",
" warnings.warn(msg, category=FutureWarning)\n"
]
}
],
"source": [
"# get the word names\n",
"fnames = asarray(cntvect.get_feature_names())\n",
"for i,c in enumerate(newsgroups_train.target_names):\n",
" tmp = argsort(bmodel.coef_[i])[-10:]\n",
" print(\"class\", c)\n",
" for t in tmp:\n",
" print(\" {:9s} ({:.5f})\".format(fnames[t], bmodel.coef_[i][t]))"
]
},
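  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: if you are running a newer scikit-learn release where `coef_` has been removed from the naive Bayes classes, the sketch below obtains the same ranking from `feature_log_prob_` (for this 4-class problem, `coef_[i]` equals `feature_log_prob_[i]`). It also uses `get_feature_names_out()`, which replaces `get_feature_names()` in recent versions; keep the original cell above if your installation matches the one used for this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Alternative for newer scikit-learn versions (coef_ removed for naive Bayes)\n",
    "# For this 4-class problem, bmodel.coef_[i] matches bmodel.feature_log_prob_[i].\n",
    "fnames = asarray(cntvect.get_feature_names_out())   # replaces get_feature_names()\n",
    "for i, c in enumerate(newsgroups_train.target_names):\n",
    "    top10 = argsort(bmodel.feature_log_prob_[i])[-10:]\n",
    "    print(\"class\", c)\n",
    "    for t in top10:\n",
    "        print(\"  {:9s} ({:.5f})\".format(fnames[t], bmodel.feature_log_prob_[i][t]))"
   ]
  },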
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Multinomial Naive Bayes model\n",
"Now learn a multinomial Naive Bayes model using the TF-IDF representation for the documents. Again try different parameter values to improve the test accuracy."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT \n",
"# 1. feature_extraction.text.TfidfTransformer(use_idf=True, norm= )\n",
"# 2. naive_bayes.MultinomialNB(alpha= )"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2 1 1 ... 2 1 1]\n",
"[2 1 1 ... 3 1 1]\n",
"0.754619364375462\n"
]
}
],
"source": [
"### INSERT YOUR CODE HERE\n",
"\n",
"# TF-IDF representation\n",
"# (For TF, set use_idf=False)\n",
"tf_trans = feature_extraction.text.TfidfTransformer(use_idf=True, norm='l1')\n",
"\n",
"# setup the TF-IDF representation, and transform the training set\n",
"trainXtf = tf_trans.fit_transform(trainX)\n",
"\n",
"# transform the test set\n",
"testXtf = tf_trans.transform(testX)\n",
"\n",
"# fit a multinomial model (with smoothing)\n",
"mmodel_tf = naive_bayes.MultinomialNB(alpha=0.01)\n",
"mmodel_tf.fit(trainXtf, trainY)\n",
"\n",
"# prediction\n",
"predYtf = mmodel_tf.predict(testXtf)\n",
"print(predYtf)\n",
"print(testY)\n",
"\n",
"# calculate accuracy\n",
"acc = metrics.accuracy_score(testY, predYtf)\n",
"print(acc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are the most informative features for Multinomial model? Run the below code."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"class alt.atheism\n",
" religion (-6.27527)\n",
" objective (-6.22645)\n",
" does (-6.18746)\n",
" say (-6.09804)\n",
" think (-5.98203)\n",
" people (-5.96589)\n",
" don (-5.87817)\n",
" deletion (-5.73142)\n",
" god (-5.65913)\n",
" just (-5.63866)\n",
"class comp.graphics\n",
" program (-6.08537)\n",
" hi (-6.08070)\n",
" does (-6.02576)\n",
" image (-5.98895)\n",
" looking (-5.98739)\n",
" know (-5.91146)\n",
" file (-5.85131)\n",
" files (-5.85033)\n",
" graphics (-5.44781)\n",
" thanks (-5.42243)\n",
"class sci.space\n",
" real (-6.55010)\n",
" launch (-6.47681)\n",
" moon (-6.45843)\n",
" think (-6.45771)\n",
" orbit (-6.40969)\n",
" thanks (-6.37048)\n",
" just (-6.26170)\n",
" like (-6.16102)\n",
" nasa (-6.14133)\n",
" space (-5.35748)\n",
"class talk.religion.misc\n",
" think (-6.41554)\n",
" just (-6.41541)\n",
" wrong (-6.41521)\n",
" don (-6.39721)\n",
" objective (-6.38006)\n",
" people (-6.30357)\n",
" christian (-6.24435)\n",
" christians (-6.17716)\n",
" jesus (-5.97433)\n",
" god (-5.63915)\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/zhw/opt/anaconda3/envs/DNN/lib/python3.6/site-packages/sklearn/utils/deprecation.py:101: FutureWarning: Attribute coef_ was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).\n",
" warnings.warn(msg, category=FutureWarning)\n"
]
}
],
"source": [
"# get the word names\n",
"fnames = asarray(cntvect.get_feature_names())\n",
"for i,c in enumerate(newsgroups_train.target_names):\n",
" tmp = argsort(mmodel_tf.coef_[i])[-10:]\n",
" print(\"class\", c)\n",
" for t in tmp:\n",
" print(\" {:9s} ({:.5f})\".format(fnames[t], mmodel_tf.coef_[i][t]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How do the most informative words differ between the TF-IDF multinomial model and the Bernoulli model?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **INSERT YOUR ANSWER HERE**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **INSERT YOUR ANSWER HERE**\n",
"- the TF-IDF words are more unique, e.g., for religion.misc, {christians, jesus, god} for TF-IDF religion, compared {people, just, don} for Bernoulli"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## 6. Effect of smoothing\n",
"The smoothing (regularization) parameter has a big effect on the performance. Using the Multinomial TF-IDF models, make a plot of accuracy versus different values of alpha. For each alpha, you need to train a new model. Which alpha value yields the best result?"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# 1. Iterating: alphas = logspace(-5,0,50)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"best alpha= 0.000339322177189533 \n",
"best acc= 0.7812269031781227\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"### INSERT YOUR CODE HERE\n",
"\n",
"alphas = logspace(-5,0,50)\n",
"accs = []\n",
"\n",
"# TF-IDF feature extraction\n",
"tf_trans = feature_extraction.text.TfidfTransformer(use_idf=True, norm='l1')\n",
"trainXtf = tf_trans.fit_transform(trainX)\n",
"testXtf = tf_trans.transform(testX)\n",
"\n",
"for myalpha in alphas:\n",
" # fit a multinomial model (with smoothing)\n",
" mmodel_tf = naive_bayes.MultinomialNB(alpha=myalpha)\n",
" mmodel_tf.fit(trainXtf, trainY)\n",
"\n",
" # prediction\n",
" predYtf = mmodel_tf.predict(testXtf)\n",
" # calculate accuracy\n",
" acc = metrics.accuracy_score(testY, predYtf)\n",
"\n",
" accs.append(acc)\n",
"\n",
"# get best accuracy\n",
"bestalphai = argmax(accs)\n",
"bestalpha = alphas[bestalphai]\n",
"bestacc = accs[bestalphai]\n",
"print(\"best alpha=\", bestalpha, \"\\nbest acc=\", bestacc)\n",
"\n",
"# make a plot\n",
"plt.figure()\n",
"plt.semilogx(alphas, accs)\n",
"plt.semilogx(bestalpha, bestacc, 'kx')\n",
"plt.xlabel('alpha'); plt.ylabel('accuracy') \n",
"plt.grid(True) \n",
"plt.title('accuracy versus alpha');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Effect of vocabulary size\n",
"The vocabulary size also affects the accuracy. Make another plot of accuracy versus vocabulary size. Which vocabulary size yields the best result?"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"### INSERT YOUR CODE HERE\n",
"## HINT\n",
"# 1. Iterating: maxfeatures = linspace(100,26577,20)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"best maxf= 23789.947368421053 \n",
"best acc= 0.7812269031781227\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"### INSERT YOUR CODE HERE\n",
"\n",
"alpha = 0.001\n",
"maxfeatures = linspace(100,26577,20)\n",
"accs = []\n",
"\n",
"for mf in maxfeatures:\n",
" # create vocabulary, and extract counts\n",
" cntvect = feature_extraction.text.CountVectorizer(stop_words='english', max_features=int(mf))\n",
" trainX = cntvect.fit_transform(newsgroups_train.data)\n",
" trainY = newsgroups_train.target\n",
" testX = cntvect.transform(newsgroups_test.data)\n",
" testY = newsgroups_test.target\n",
" \n",
" # TF-IDF feature extraction\n",
" tf_trans = feature_extraction.text.TfidfTransformer(use_idf=True, norm='l1')\n",
" trainXtf = tf_trans.fit_transform(trainX)\n",
" testXtf = tf_trans.transform(testX)\n",
"\n",
" # fit a multinomial model (with smoothing)\n",
" mmodel_tf = naive_bayes.MultinomialNB(alpha=alpha)\n",
" mmodel_tf.fit(trainXtf, trainY)\n",
"\n",
" # prediction\n",
" predYtf = mmodel_tf.predict(testXtf)\n",
" # calculate accuracy\n",
" acc = mean(predYtf==testY)\n",
" \n",
" accs.append(acc)\n",
"\n",
"# get best accuracy\n",
"bestmfi = argmax(accs)\n",
"bestmf = maxfeatures[bestmfi]\n",
"bestacc = accs[bestmfi]\n",
"print(\"best maxf=\", bestmf, \"\\nbest acc=\", bestacc)\n",
"\n",
"# make a plot\n",
"plt.figure\n",
"plt.plot(maxfeatures, accs)\n",
"plt.plot(bestmf, bestacc, 'kx')\n",
"plt.xlabel('vocab size'); plt.ylabel('accuracy') \n",
"plt.grid(True) \n",
"plt.title('accuracy versus vocabulary size');"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 1
}