{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "##
Tutorial on text classification

Analyzing Amazon product reviews\n", "
Yury Kashnitskiy, Data Science Lab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll be analyzing Amazon product [reviews](http://jmcauley.ucsd.edu/data/amazon/), using a sample of 100k grocery reviews. The prepared zipped `.csv` file is [here](https://drive.google.com/file/d/1fUSV3GrFzKkpY7tvbp3-tNHqb-Veo4xY/view?usp=sharing).\n", "\n", "**Outline:**\n", "\n", "1. Simple text features
\n", " 1.1. Bag of Words
\n", " 1.2. Tf-Idf vectorization
\n", "2. Simple text classification
\n", "3. Understanding the model
\n", " 3.1. Confusion matrix
\n", " 3.2. Visualizing coefficients
\n", " 3.3. ELI5 (\"Explain Like I'm 5\")
\n", "4. Hierarchical text classification" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "PATH_TO_DATA = '/home/yorko/Documents/data/amazon_reviews_sample100k_grocery.csv.zip'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# some necessary imports\n", "import os\n", "import pickle\n", "import json\n", "from pprint import pprint\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import accuracy_score, confusion_matrix\n", "import seaborn as sns\n", "from matplotlib import pyplot as plt" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(PATH_TO_DATA)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " productId Title userId \\\n", "0 B0000DF3IX Paprika Hungarian Sweet A244MHL2UN2EYL \n", "1 B0002QF1LK Quaker Honey Graham Oh's 10.5 oz - (6 pack) A3FL7SXVYMC5NR \n", "2 B0002QF1LK Quaker Honey Graham Oh's 10.5 oz - (6 pack) A12IDQSS4OW33B \n", "3 B0002QF1LK Quaker Honey Graham Oh's 10.5 oz - (6 pack) A2GZKHC1M4PKF4 \n", "4 B0002QF1LK Quaker Honey Graham Oh's 10.5 oz - (6 pack) AUGT2DOGKLHIN \n", "\n", " Helpfulness Score Time \\\n", "0 0/0 5.0 1127088000 \n", "1 3/3 5.0 1138147200 \n", "2 3/3 5.0 1118016000 \n", "3 2/2 3.0 1206489600 \n", "4 2/2 5.0 1177545600 \n", "\n", " Text Cat1 \\\n", "0 While in Hungary we were given a recipe for Hu... grocery gourmet food \n", "1 Without a doubt, I would recommend this wholes... grocery gourmet food \n", "2 This cereal is so sweet....yet so good for you... grocery gourmet food \n", "3 Man I love Oh's cereal. It is really great to ... grocery gourmet food \n", "4 And I've tried alot of cereals. This is by far... grocery gourmet food \n", "\n", " Cat2 Cat3 \n", "0 herbs spices seasonings \n", "1 breakfast foods cereals \n", "2 breakfast foods cereals \n", "3 breakfast foods cereals \n", "4 breakfast foods cereals " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(99982, 10)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['productId', 'Title', 'userId', 'Helpfulness', 'Score', 'Time', 'Text',\n", " 'Cat1', 'Cat2', 'Cat3'],\n", " dtype='object')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From these 10 columns we'll use only 3 now:\n", " - Text - review on 
the product\n", " - Cat2 - label of category 2 for this product\n", " - Cat3 - label of category 3 for this product\n", " \n", "There's a taxonomy (hierarchical catalog) of all products with 3 levels of categories. Based on its review, we're going to classify each product into one of the level 2 categories (i.e. predict `Cat2`) and one of the level 3 categories (i.e. predict `Cat3`). \n", "\n", "We're no longer interested in `Cat1` because this sample contains only grocery products. That leaves 16 `Cat2` categories and 157 `Cat3` categories." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([' grocery gourmet food'], dtype=object)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Cat1'].unique()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "pantry staples 27291\n", "beverages 23440\n", "snack food 12724\n", "candy chocolate 11433\n", "breakfast foods 6248\n", "breads bakery 4240\n", "cooking baking supplies 2444\n", "herbs 2069\n", "gourmet gifts 1939\n", "fresh flowers live indoor plants 1811\n", "baby food 1270\n", "meat poultry 1268\n", "meat seafood 1250\n", "produce 1196\n", "sauces dips 845\n", "dairy eggs 514\n", "Name: Cat2, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Cat2'].value_counts()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "157" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Cat3'].nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Simple text features\n", "#### 1.1. 
Bag of Words \n", "\n", "*The following explanation of Bag of Words and Tf-Idf is based on [this](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection) notebook from our course [mlcourse.ai](https://mlcourse.ai).*\n", "\n", "\n", "\n", "The easiest way to convert text to features is called Bag of Words: we create a vector with the length of the vocabulary, count the occurrences of each word in the text, and place each count at the position of that word in the vector. The process looks simpler in code:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Vocabulary: [(0, 'i'), (1, 'dog'), (2, 'and'), (3, 'a'), (4, 'cat'), (5, 'you'), (6, 'have')]\n", "Vectors:\n", "[1. 0. 0. 1. 1. 0. 1.]\n", "[0. 1. 0. 1. 0. 1. 1.]\n", "[1. 1. 2. 2. 1. 1. 1.]\n" ] } ], "source": [ "texts = ['i have a cat',\n", "         'you have a dog',\n", "         'you and i have a cat and a dog']\n", "\n", "vocabulary = list(enumerate(set([word for sentence in texts\n", "                                 for word in sentence.split()])))\n", "print('Vocabulary:', vocabulary)\n", "\n", "def vectorize(text):\n", "    vector = np.zeros(len(vocabulary))\n", "    for i, word in vocabulary:\n", "        num = 0\n", "        for w in text:\n", "            if w == word:\n", "                num += 1\n", "        if num:\n", "            vector[i] = num\n", "    return vector\n", "\n", "print('Vectors:')\n", "for sentence in texts:\n", "    print(vectorize(sentence.split()))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature matrix:\n", " [[0 1 0 1 0]\n", " [0 0 1 1 1]\n", " [2 1 1 1 1]]\n", "Vocabulary\n", "{'and': 0, 'cat': 1, 'dog': 2, 'have': 3, 'you': 4}\n" ] } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "vect = CountVectorizer()\n", "print('Feature matrix:\\n {}'.format(vect.fit_transform(texts).toarray()))\n", 
"print('Vocabulary')\n", "pprint(vect.vocabulary_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that `CountVectorizer`'s default tokenizer drops single-character tokens, which is why `i` and `a` are missing from its vocabulary.\n", "\n", "When using algorithms like Bag of Words, we lose the order of the words in the text, which means that the texts \"i have no cows\" and \"no, i have cows\" will appear identical after vectorization when, in fact, they have the opposite meaning. To avoid this problem, we can revisit our tokenization step and use N-grams (sequences of N consecutive tokens) instead." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature matrix:\n", " [[0 0 0 1 0 0 1 1 0 0 0 0]\n", " [0 0 0 0 0 1 1 0 1 1 0 1]\n", " [2 1 1 1 1 1 1 1 0 1 1 0]]\n", "Vocabulary\n", "{'and': 0,\n", " 'and dog': 1,\n", " 'and have': 2,\n", " 'cat': 3,\n", " 'cat and': 4,\n", " 'dog': 5,\n", " 'have': 6,\n", " 'have cat': 7,\n", " 'have dog': 8,\n", " 'you': 9,\n", " 'you and': 10,\n", " 'you have': 11}\n" ] } ], "source": [ "# the same, but with unigrams and bigrams\n", "vect2 = CountVectorizer(ngram_range=(1, 2))\n", "print('Feature matrix:\\n {}'.format(vect2.fit_transform(texts).toarray()))\n", "print('Vocabulary')\n", "pprint(vect2.vocabulary_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2. Tf-Idf vectorization\n", "Building on the Bag of Words idea: words that are rare across the corpus (all the documents of this dataset) but present in a particular document are likely to be more informative about it. It therefore makes sense to increase the weight of such domain-specific words relative to common ones. This approach is called TF-IDF (term frequency-inverse document frequency). It takes more than a few lines to explain in full, so see references such as [this wiki](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) for the details. 
The basic variant is as follows, where $\\mid D\\mid$ is the number of documents in the corpus $D$ and $df(t,D)$ is the number of documents that contain the term $t$:\n", "\n", "$$ \\large idf(t,D) = \\log\\frac{\\mid D\\mid}{df(t,D)+1} $$\n", "\n", "$$ \\large tfidf(t,d,D) = tf(t,d) \\times idf(t,D) $$\n", "\n", "Note that scikit-learn's `TfidfVectorizer` uses a slightly different, smoothed variant by default (`smooth_idf=True`) and L2-normalizes each document vector." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Simple text classification\n", "\n", "For now, we'll only look at the 16 level 2 categories. We'll do 16-class classification with logistic regression and Tf-Idf vectorization, combined in a scikit-learn pipeline." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# use unigrams and bigrams, cap the vocabulary at 50k features,\n", "# and drop terms that occur in fewer than 2 documents\n", "tf_idf = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)\n", "# multinomial logistic regression a.k.a. softmax classifier\n", "logit = LogisticRegression(C=1e2, n_jobs=4, solver='lbfgs',\n", "                           random_state=17, multi_class='multinomial',\n", "                           verbose=1)\n", "# sklearn's pipeline\n", "tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf),\n", "                                 ('logit', logit)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For now, we only use review text. " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "texts, y = df['Text'], df['Cat2']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We split data into training and validation parts." 
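, "\n", "\n", "To make the role of `stratify` in the split that follows concrete, here is a minimal standard-library sketch of the idea behind a stratified split. The `stratified_split` helper and the toy labels are hypothetical and only illustrate the principle; the tutorial itself relies on scikit-learn's `train_test_split` with `stratify=y`.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, valid_share=0.25, seed=17):
    '''Return (train_idx, valid_idx) so that every class keeps roughly
    the same share in both parts. Hypothetical helper, for illustration.'''
    rng = random.Random(seed)
    # group example indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, valid_idx = [], []
    for indices in by_class.values():
        # shuffle within the class, then cut off the validation share
        rng.shuffle(indices)
        n_valid = round(len(indices) * valid_share)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


# toy, made-up labels: one frequent class and one rare class
toy_labels = ['common'] * 80 + ['rare'] * 20
train_idx, valid_idx = stratified_split(toy_labels)
train_counts = Counter(toy_labels[i] for i in train_idx)
valid_counts = Counter(toy_labels[i] for i in valid_idx)
print('train:', dict(train_counts))
print('valid:', dict(valid_counts))
```

Each class contributes a quarter of its examples to the validation part, so the class proportions are the same in both parts. For this dataset that matters because small `Cat2` categories such as `dairy eggs` could otherwise end up under-represented in one of the parts by chance."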
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "train_texts, valid_texts, y_train, y_valid = \\\n", " train_test_split(texts, y, random_state=17,\n", " stratify=y, shuffle=True)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 10.1 s, sys: 300 ms, total: 10.4 s\n", "Wall time: 46.3 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=4)]: Done 1 out of 1 | elapsed: 36.0s finished\n" ] }, { "data": { "text/plain": [ "Pipeline(memory=None,\n", " steps=[('tf_idf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n", " dtype=, encoding='utf-8', input='content',\n", " lowercase=True, max_df=1.0, max_features=50000, min_df=2,\n", " ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=Tru... penalty='l2', random_state=17, solver='lbfgs',\n", " tol=0.0001, verbose=1, warm_start=False))])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "tfidf_logit_pipeline.fit(train_texts, y_train)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.42 s, sys: 7.79 ms, total: 2.43 s\n", "Wall time: 2.43 s\n" ] } ], "source": [ "%%time\n", "valid_pred = tfidf_logit_pipeline.predict(valid_texts)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7564810369659145" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accuracy_score(y_valid, valid_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Understanding the model\n", "#### 3.1. 
Confusion matrix" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def plot_confusion_matrix(actual, predicted, classes,\n", "                          normalize=False,\n", "                          title='Confusion matrix', figsize=(7,7),\n", "                          cmap=plt.cm.Blues, path_to_save_fig=None):\n", "    \"\"\"\n", "    This function prints and plots the confusion matrix.\n", "    Normalization can be applied by setting `normalize=True`.\n", "    \"\"\"\n", "    import itertools\n", "    from sklearn.metrics import confusion_matrix\n", "    cm = confusion_matrix(actual, predicted).T\n", "    if normalize:\n", "        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", "\n", "    plt.figure(figsize=figsize)\n", "    plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", "    plt.title(title)\n", "    plt.colorbar()\n", "    tick_marks = np.arange(len(classes))\n", "    plt.xticks(tick_marks, classes, rotation=90)\n", "    plt.yticks(tick_marks, classes)\n", "\n", "    fmt = '.2f' if normalize else 'd'\n", "    thresh = cm.max() / 2.\n", "    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", "        plt.text(j, i, format(cm[i, j], fmt),\n", "                 horizontalalignment=\"center\",\n", "                 color=\"white\" if cm[i, j] > thresh else \"black\")\n", "\n", "    plt.tight_layout()\n", "    plt.ylabel('Predicted label')\n", "    plt.xlabel('True label')\n", "\n", "    if path_to_save_fig:\n", "        plt.savefig(path_to_save_fig, dpi=300, bbox_inches='tight')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['baby food', 'beverages', 'breads bakery', 'breakfast foods',\n", " 'candy chocolate', 'cooking baking supplies', 'dairy eggs',\n", " 'fresh flowers live indoor plants', 'gourmet gifts', 'herbs',\n", " 'meat poultry', 'meat seafood', 'pantry staples', 'produce',\n", " 'sauces dips', 'snack food'], dtype=object)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "category2_classes = 
tfidf_logit_pipeline.named_steps['logit'].classes_\n", "category2_classes" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plot_confusion_matrix(y_valid, valid_pred,\n", "                      category2_classes, figsize=(8, 8))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2. Visualizing coefficients" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def visualize_coefficients(classifier_coefs, feature_names,\n", "                           n_top_features=25, title='Coefs',\n", "                           save_path=None):\n", "    # get coefficients with large absolute values\n", "    coef = classifier_coefs.ravel()\n", "    positive_coefficients = np.argsort(coef)[-n_top_features:]\n", "    negative_coefficients = np.argsort(coef)[:n_top_features]\n", "    interesting_coefficients = np.hstack([negative_coefficients,\n", "                                          positive_coefficients])\n", "    # plot them\n", "    plt.figure(figsize=(15, 5))\n", "    colors = [\"red\" if c < 0 else \"blue\"\n", "              for c in coef[interesting_coefficients]]\n", "    plt.bar(np.arange(2 * n_top_features),\n", "            coef[interesting_coefficients], color=colors)\n", "    feature_names = np.array(feature_names)\n", "    plt.xticks(np.arange(1, 1 + 2 * n_top_features),\n", "               feature_names[interesting_coefficients],\n", "               rotation=90, ha=\"right\")\n", "    plt.title(title);\n", "    if save_path:\n", "        plt.savefig(save_path, dpi=300);" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "visualize_coefficients(tfidf_logit_pipeline.named_steps['logit'].coef_[0, :],\n", "                       tfidf_logit_pipeline.named_steps['tf_idf'].get_feature_names(),\n", "                       title=category2_classes[0])" ] }, { "cell_type": "code", "execution_count": 24, 
"metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "visualize_coefficients(tfidf_logit_pipeline.named_steps['logit'].coef_[1, :], \n", " tfidf_logit_pipeline.named_steps['tf_idf'].get_feature_names(),\n", " title=category2_classes[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3. ELI5 (\"Explain Like I'm 5\")\n", "\n", "[GitHub](https://github.com/TeamHG-Memex/eli5). \n", "ELI5 is a Python package which helps to debug machine learning classifiers and explain their predictions. It supports Sklearn, Xgboost, LightGBM and others. " ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# pip install eli5\n", "import eli5" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", "\n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " \n", " \n", " y=baby food\n", " \n", "\n", "\n", "top features\n", " \n", " \n", " \n", " y=beverages\n", " \n", "\n", "\n", "top features\n", " \n", " \n", " \n", " y=breads bakery\n", " \n", "\n", "\n", "top features\n", " \n", " \n", " \n", " y=breakfast foods\n", " \n", "\n", "\n", "top features\n", " \n", " \n", " \n", " y=candy chocolate\n", " \n", "\n", "\n", "top features\n", " \n", " \n", " \n", " y=cooking baking supplies\n", " \n", "\n", "\n", "top features\n", " \n", " \n", " \n", " y=dairy eggs\n", " \n", "\n", "\n", "top features\n", " \n", " \n", " \n", " y=fresh flowers live indoor plants\n", " \n", "\n", "\n", "top features\n", " \n", " \n", " \n", " y=gourmet gifts\n", " \n", "\n", "\n", "top features\n", " \n", " \n", " \n", " y=herbs\n", " \n", "\n", "\n", "top features\n", " \n", " \n", " \n", " y=meat poultry\n", " \n", "\n", "\n", "top features\n", " \n", " \n", " \n", " y=meat seafood\n", " \n", "\n", "\n", "top features\n", " \n", " \n", " \n", " y=pantry staples\n", " \n", "\n", "\n", "top features\n", " \n", " \n", " \n", " y=produce\n", " \n", "\n", "\n", "top features\n", " \n", " \n", " \n", " y=sauces dips\n", " \n", "\n", "\n", "top features\n", " \n", " \n", " \n", " y=snack food\n", " \n", "\n", "\n", "top features\n", "
\n", "\n", "y=baby food top features\n", "
\n", " Weight?\n", " Feature
\n", " +44.371\n", " \n", " baby\n", "
\n", " +39.677\n", " \n", " formula\n", "
\n", " +22.257\n", " \n", " gerber\n", "
\n", " +16.179\n", " \n", " similac\n", "
\n", " +16.148\n", " \n", " this formula\n", "
\n", " +15.863\n", " \n", " earth best\n", "
\n", " +15.532\n", " \n", " cereal\n", "
\n", " +15.284\n", " \n", " babies\n", "
\n", " +13.853\n", " \n", " my baby\n", "
\n", " +13.091\n", " \n", " daughter\n", "
\n", " +12.813\n", " \n", " month old\n", "
\n", " +12.559\n", " \n", " earth\n", "
\n", " +12.135\n", " \n", " baby food\n", "
\n", " +11.852\n", " \n", " food\n", "
\n", " +11.834\n", " \n", " old\n", "
\n", " +11.623\n", " \n", " toddler\n", "
\n", " +11.065\n", " \n", " son\n", "
\n", " +10.936\n", " \n", " months\n", "
\n", " +10.730\n", " \n", " month\n", "
\n", " +10.385\n", " \n", " child\n", "
\n", " … 10790 more positive …\n", "
\n", " … 39191 more negative …\n", "
\n", "\n", "y=beverages top features\n", "
\n", " Weight?\n", " Feature
\n", " +71.027\n", " \n", " tea\n", "
\n", " +48.800\n", " \n", " this tea\n", "
\n", " +40.395\n", " \n", " drink\n", "
\n", " +34.729\n", " \n", " teas\n", "
\n", " +34.680\n", " \n", " pods\n", "
\n", " +31.821\n", " \n", " coffee\n", "
\n", " +31.040\n", " \n", " coconut water\n", "
\n", " +28.885\n", " \n", " movie\n", "
\n", " +28.634\n", " \n", " drinking\n", "
\n", " +27.084\n", " \n", " zico\n", "
\n", " +26.962\n", " \n", " chai\n", "
\n", " +26.894\n", " \n", " hot chocolate\n", "
\n", " +25.221\n", " \n", " soda\n", "
\n", " +23.568\n", " \n", " senseo\n", "
\n", " +23.189\n", " \n", " water\n", "
\n", " +22.381\n", " \n", " espresso\n", "
\n", " … 20526 more positive …\n", "
\n", " … 29455 more negative …\n", "
\n", " -21.645\n", " \n", " salt\n", "
\n", " -22.586\n", " \n", " popcorn\n", "
\n", " -23.213\n", " \n", " sauce\n", "
\n", " -28.835\n", " \n", " eat\n", "
\n", "\n", "y=breads bakery top features\n", "
\n", " Weight?\n", " Feature
\n", " +25.276\n", " \n", " cookies\n", "
\n", " +22.181\n", " \n", " cookie\n", "
\n", " +20.022\n", " \n", " cake\n", "
\n", " +19.781\n", " \n", " fruitcake\n", "
\n", " +19.179\n", " \n", " pocky\n", "
\n", " +18.776\n", " \n", " biscotti\n", "
\n", " +16.200\n", " \n", " breadsticks\n", "
\n", " +15.999\n", " \n", " oreos\n", "
\n", " +15.511\n", " \n", " pizza\n", "
\n", " +15.289\n", " \n", " bread\n", "
\n", " +15.138\n", " \n", " baklava\n", "
\n", " +14.979\n", " \n", " cakes\n", "
\n", " +14.884\n", " \n", " wafers\n", "
\n", " +14.655\n", " \n", " wafer\n", "
\n", " +13.962\n", " \n", " mallomars\n", "
\n", " +12.839\n", " \n", " crust\n", "
\n", " +11.086\n", " \n", " shells\n", "
\n", " +11.055\n", " \n", " oreo\n", "
\n", " +10.833\n", " \n", " wraps\n", "
\n", " … 17490 more positive …\n", "
\n", " … 32491 more negative …\n", "
\n", " -14.362\n", " \n", " mix\n", "
\n", "\n", "y=breakfast foods top features\n", "
\n", " Weight?\n", " Feature
\n", " +34.379\n", " \n", " cereal\n", "
\n", " +30.655\n", " \n", " bars\n", "
\n", " +29.471\n", " \n", " bar\n", "
\n", " +29.235\n", " \n", " oatmeal\n", "
\n", " +28.824\n", " \n", " granola\n", "
\n", " +22.415\n", " \n", " breakfast\n", "
\n", " +21.553\n", " \n", " cereals\n", "
\n", " +20.874\n", " \n", " tarts\n", "
\n", " +20.341\n", " \n", " pop tarts\n", "
\n", " +18.485\n", " \n", " puffed\n", "
\n", " +18.082\n", " \n", " oats\n", "
\n", " +17.246\n", " \n", " blueberry\n", "
\n", " +16.701\n", " \n", " pop\n", "
\n", " +16.352\n", " \n", " these bars\n", "
\n", " +16.051\n", " \n", " this cereal\n", "
\n", " +15.865\n", " \n", " frosted\n", "
\n", " +15.545\n", " \n", " toaster\n", "
\n", " +13.876\n", " \n", " filling\n", "
\n", " +13.804\n", " \n", " this bar\n", "
\n", " … 17193 more positive …\n", "
\n", " … 32788 more negative …\n", "
\n", " -15.753\n", " \n", " tea\n", "
\n", "\n", "y=candy chocolate top features\n", "
\n", " Weight?\n", " Feature
\n", " +51.312\n", " \n", " licorice\n", "
\n", " +47.510\n", " \n", " gum\n", "
\n", " +37.105\n", " \n", " mints\n", "
\n", " +34.616\n", " \n", " candy\n", "
\n", " +32.241\n", " \n", " altoids\n", "
\n", " +29.186\n", " \n", " haribo\n", "
\n", " +27.022\n", " \n", " candies\n", "
\n", " +25.932\n", " \n", " chocolate\n", "
\n", " +24.702\n", " \n", " bears\n", "
\n", " +21.744\n", " \n", " gummi\n", "
\n", " +21.503\n", " \n", " gummy\n", "
\n", " +19.630\n", " \n", " bar\n", "
\n", " +19.497\n", " \n", " gummies\n", "
\n", " +19.185\n", " \n", " chocolates\n", "
\n", " +18.243\n", " \n", " liquorice\n", "
\n", " +18.102\n", " \n", " jelly\n", "
\n", " +15.880\n", " \n", " this gum\n", "
\n", " +15.691\n", " \n", " belly\n", "
\n", " … 19000 more positive …\n", "
\n", " … 30981 more negative …\n", "
\n", " -22.117\n", " \n", " tea\n", "
\n", " -24.629\n", " \n", " cookies\n", "
\n", "\n", "y=cooking baking supplies top features\n", "
\n", " Weight?\n", " Feature
\n", " +17.945\n", " \n", " bread\n", "
\n", " +14.719\n", " \n", " cake\n", "
\n", " +13.839\n", " \n", " vanilla\n", "
\n", " +13.691\n", " \n", " almonds\n", "
\n", " +13.063\n", " \n", " syrup\n", "
\n", " +13.024\n", " \n", " baking\n", "
\n", " +12.991\n", " \n", " mincemeat\n", "
\n", " +12.832\n", " \n", " mix\n", "
\n", " +12.764\n", " \n", " flour\n", "
\n", " +12.728\n", " \n", " muffins\n", "
\n", " +12.081\n", " \n", " cocoa\n", "
\n", " +12.003\n", " \n", " sugar\n", "
\n", " +11.901\n", " \n", " nuts\n", "
\n", " +11.820\n", " \n", " peanuts\n", "
\n", " +11.633\n", " \n", " spoon\n", "
\n", " +11.012\n", " \n", " pancakes\n", "
\n", " +10.687\n", " \n", " salt\n", "
\n", " +10.680\n", " \n", " wasabi\n", "
\n", " +10.302\n", " \n", " splenda\n", "
\n", " +10.260\n", " \n", " chocolate\n", "
\n", " … 16830 more positive …\n", "
\n", " … 33151 more negative …\n", "
\n", "\n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Weight?\n", " Feature
\n", " +28.332\n", " \n", " cheese\n", "
\n", " +18.710\n", " \n", " this cheese\n", "
\n", " +13.140\n", " \n", " milk\n", "
\n", " +13.119\n", " \n", " cheeses\n", "
\n", " +11.565\n", " \n", " coffee\n", "
\n", " +11.039\n", " \n", " creamer\n", "
\n", " +9.343\n", " \n", " cream\n", "
\n", " +9.145\n", " \n", " creamy\n", "
\n", " +8.321\n", " \n", " blue\n", "
\n", " +7.280\n", " \n", " cheese is\n", "
\n", " +7.036\n", " \n", " butter\n", "
\n", " +6.992\n", " \n", " egg\n", "
\n", " +6.720\n", " \n", " creamers\n", "
\n", " +6.423\n", " \n", " lurpak\n", "
\n", " +6.325\n", " \n", " igourmet\n", "
\n", " +5.795\n", " \n", " it\n", "
\n", " +5.710\n", " \n", " ice\n", "
\n", " +5.665\n", " \n", " coffee mate\n", "
\n", " +5.339\n", " \n", " blue cheese\n", "
\n", " … 8247 more positive …\n", "
\n", " … 41734 more negative …\n", "
\n", " -7.579\n", " \n", " these\n", "
\n", "\n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Weight?\n", " Feature
\n", " +45.917\n", " \n", " plant\n", "
\n", " +42.770\n", " \n", " tree\n", "
\n", " +38.306\n", " \n", " bonsai\n", "
\n", " +32.808\n", " \n", " plants\n", "
\n", " +31.475\n", " \n", " herbs\n", "
\n", " +29.085\n", " \n", " grow\n", "
\n", " +25.124\n", " \n", " aerogarden\n", "
\n", " +24.322\n", " \n", " flowers\n", "
\n", " +22.405\n", " \n", " garden\n", "
\n", " +22.146\n", " \n", " growing\n", "
\n", " +19.220\n", " \n", " leaves\n", "
\n", " +18.608\n", " \n", " kit\n", "
\n", " +18.017\n", " \n", " the plant\n", "
\n", " +15.479\n", " \n", " the tree\n", "
\n", " +14.450\n", " \n", " seed\n", "
\n", " +14.393\n", " \n", " pods\n", "
\n", " +13.557\n", " \n", " basil\n", "
\n", " +13.258\n", " \n", " weeks\n", "
\n", " +12.672\n", " \n", " lettuce\n", "
\n", " … 10332 more positive …\n", "
\n", " … 39649 more negative …\n", "
\n", " -14.176\n", " \n", " taste\n", "
\n", "\n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
\n", " Weight?\n", " Feature
\n", " +47.095\n", " \n", " tea\n", "
\n", " +29.819\n", " \n", " candy\n", "
\n", " +27.930\n", " \n", " basket\n", "
\n", " +23.935\n", " \n", " sushi\n", "
\n", " +21.573\n", " \n", " the tea\n", "
\n", " +18.165\n", " \n", " gift\n", "
\n", " +16.264\n", " \n", " chocolates\n", "
\n", " +15.424\n", " \n", " set\n", "
\n", " +14.591\n", " \n", " flowering\n", "
\n", " +13.823\n", " \n", " teapot\n", "
\n", " +13.475\n", " \n", " this gift\n", "
\n", " +13.392\n", " \n", " kit\n", "
\n", " +13.051\n", " \n", " pot\n", "
\n", " +12.905\n", " \n", " candies\n", "
\n", " +12.401\n", " \n", " teas\n", "
\n", " +12.350\n", " \n", " bamboo\n", "
\n", " +12.204\n", " \n", " coffee\n", "
\n", " +11.800\n", " \n", " hot\n", "
\n", " +11.167\n", " \n", " flower\n", "
\n", " +10.973\n", " \n", " fun\n", "
\n", " … 13033 more positive …\n", "
\n", " … 36948 more negative …\n", "
\n", "\n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Weight?\n", " Feature
\n", " +21.926\n", " \n", " popcorn\n", "
\n", " +21.711\n", " \n", " salt\n", "
\n", " +21.467\n", " \n", " beans\n", "
\n", " +20.799\n", " \n", " seasoning\n", "
\n", " +15.792\n", " \n", " peppercorns\n", "
\n", " +15.629\n", " \n", " cinnamon\n", "
\n", " +15.521\n", " \n", " vanilla\n", "
\n", " +14.036\n", " \n", " spice\n", "
\n", " +14.001\n", " \n", " pepper\n", "
\n", " +13.502\n", " \n", " vanilla beans\n", "
\n", " +13.081\n", " \n", " chili\n", "
\n", " +12.501\n", " \n", " ginger\n", "
\n", " +12.035\n", " \n", " spices\n", "
\n", " +11.083\n", " \n", " seeds\n", "
\n", " +10.727\n", " \n", " curry\n", "
\n", " +10.309\n", " \n", " rub\n", "
\n", " +10.208\n", " \n", " powder\n", "
\n", " +10.129\n", " \n", " this salt\n", "
\n", " +10.093\n", " \n", " used\n", "
\n", " … 14181 more positive …\n", "
\n", " … 35800 more negative …\n", "
\n", " -11.357\n", " \n", " chocolate\n", "
\n", "\n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Weight?\n", " Feature
\n", " +32.038\n", " \n", " jerky\n", "
\n", " +21.365\n", " \n", " slim\n", "
\n", " +14.512\n", " \n", " snack\n", "
\n", " +14.437\n", " \n", " sausage\n", "
\n", " +12.727\n", " \n", " sticks\n", "
\n", " +12.551\n", " \n", " jims\n", "
\n", " +12.551\n", " \n", " slim jims\n", "
\n", " +12.185\n", " \n", " meat\n", "
\n", " +11.911\n", " \n", " chicken\n", "
\n", " +11.400\n", " \n", " salty\n", "
\n", " +10.772\n", " \n", " slim jim\n", "
\n", " +10.697\n", " \n", " jim\n", "
\n", " +10.616\n", " \n", " bacon\n", "
\n", " +10.613\n", " \n", " salami\n", "
\n", " +10.374\n", " \n", " sardines\n", "
\n", " +10.177\n", " \n", " beef\n", "
\n", " +9.305\n", " \n", " pate\n", "
\n", " +9.155\n", " \n", " teriyaki\n", "
\n", " +8.716\n", " \n", " duck\n", "
\n", " … 11580 more positive …\n", "
\n", " … 38401 more negative …\n", "
\n", " -9.532\n", " \n", " chocolate\n", "
\n", "\n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
\n", " Weight?\n", " Feature
\n", " +44.230\n", " \n", " sardines\n", "
\n", " +28.676\n", " \n", " jerky\n", "
\n", " +23.535\n", " \n", " tuna\n", "
\n", " +18.209\n", " \n", " anchovies\n", "
\n", " +16.566\n", " \n", " lobster\n", "
\n", " +14.608\n", " \n", " salmon\n", "
\n", " +13.983\n", " \n", " meat\n", "
\n", " +13.665\n", " \n", " crab\n", "
\n", " +13.569\n", " \n", " fish\n", "
\n", " +13.443\n", " \n", " clams\n", "
\n", " +13.291\n", " \n", " smoked\n", "
\n", " +12.194\n", " \n", " kippers\n", "
\n", " +12.156\n", " \n", " beef\n", "
\n", " +11.561\n", " \n", " oysters\n", "
\n", " +11.532\n", " \n", " can\n", "
\n", " +11.367\n", " \n", " packed\n", "
\n", " +10.932\n", " \n", " these sardines\n", "
\n", " +10.344\n", " \n", " canned\n", "
\n", " +10.219\n", " \n", " bones\n", "
\n", " +10.192\n", " \n", " prince\n", "
\n", " … 12081 more positive …\n", "
\n", " … 37900 more negative …\n", "
\n", "\n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Weight?\n", " Feature
\n", " +30.934\n", " \n", " soup\n", "
\n", " +30.484\n", " \n", " noodles\n", "
\n", " +22.171\n", " \n", " pasta\n", "
\n", " +20.552\n", " \n", " olives\n", "
\n", " +20.147\n", " \n", " sauce\n", "
\n", " +19.369\n", " \n", " seasoning\n", "
\n", " +17.525\n", " \n", " splenda\n", "
\n", " +17.464\n", " \n", " mac\n", "
\n", " +17.173\n", " \n", " kraft\n", "
\n", " +16.808\n", " \n", " cake\n", "
\n", " +16.366\n", " \n", " beans\n", "
\n", " +16.338\n", " \n", " dressing\n", "
\n", " +15.891\n", " \n", " this soup\n", "
\n", " … 23590 more positive …\n", "
\n", " … 26391 more negative …\n", "
\n", " -15.874\n", " \n", " cereal\n", "
\n", " -15.931\n", " \n", " fruit\n", "
\n", " -16.100\n", " \n", " licorice\n", "
\n", " -17.908\n", " \n", " bar\n", "
\n", " -18.101\n", " \n", " this tea\n", "
\n", " -23.085\n", " \n", " tea\n", "
\n", " -26.895\n", " \n", " jerky\n", "
\n", "\n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Weight?\n", " Feature
\n", " +43.953\n", " \n", " cherries\n", "
\n", " +33.261\n", " \n", " pumpkin\n", "
\n", " +27.775\n", " \n", " dried\n", "
\n", " +21.389\n", " \n", " seaweed\n", "
\n", " +16.916\n", " \n", " fruit\n", "
\n", " +16.438\n", " \n", " dried cherries\n", "
\n", " +14.810\n", " \n", " tart\n", "
\n", " +12.109\n", " \n", " canned\n", "
\n", " +11.898\n", " \n", " these cherries\n", "
\n", " +11.486\n", " \n", " cans\n", "
\n", " +11.283\n", " \n", " dented\n", "
\n", " +9.935\n", " \n", " snack\n", "
\n", " +9.902\n", " \n", " truffles\n", "
\n", " +9.443\n", " \n", " traverse\n", "
\n", " +9.200\n", " \n", " plums\n", "
\n", " +9.066\n", " \n", " mushrooms\n", "
\n", " +9.056\n", " \n", " cherries are\n", "
\n", " +8.943\n", " \n", " canned pumpkin\n", "
\n", " +8.856\n", " \n", " apricots\n", "
\n", " … 11622 more positive …\n", "
\n", " … 38359 more negative …\n", "
\n", " -8.532\n", " \n", " chocolate\n", "
\n", "\n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Weight?\n", " Feature
\n", " +30.654\n", " \n", " sauce\n", "
\n", " +13.080\n", " \n", " salsa\n", "
\n", " +12.011\n", " \n", " paste\n", "
\n", " +11.276\n", " \n", " marinade\n", "
\n", " +11.229\n", " \n", " hot\n", "
\n", " +11.223\n", " \n", " sauces\n", "
\n", " +11.131\n", " \n", " use\n", "
\n", " +10.508\n", " \n", " marmite\n", "
\n", " +10.208\n", " \n", " bottle\n", "
\n", " +9.444\n", " \n", " gravy\n", "
\n", " +8.884\n", " \n", " this sauce\n", "
\n", " +8.149\n", " \n", " chicken\n", "
\n", " +8.089\n", " \n", " thai\n", "
\n", " +7.877\n", " \n", " use it\n", "
\n", " +7.637\n", " \n", " tapatio\n", "
\n", " +6.823\n", " \n", " spicy\n", "
\n", " +6.664\n", " \n", " bottles\n", "
\n", " +6.491\n", " \n", " curry\n", "
\n", " … 10636 more positive …\n", "
\n", " … 39345 more negative …\n", "
\n", " -8.777\n", " \n", " they\n", "
\n", " -10.002\n", " \n", " these\n", "
\n", "\n", " \n", " \n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Weight?\n", " Feature
\n", " +38.187\n", " \n", " popcorn\n", "
\n", " +34.106\n", " \n", " chips\n", "
\n", " +29.419\n", " \n", " pretzels\n", "
\n", " +27.025\n", " \n", " jerky\n", "
\n", " +24.860\n", " \n", " crackers\n", "
\n", " +22.564\n", " \n", " cracker\n", "
\n", " +22.472\n", " \n", " this popcorn\n", "
\n", " +21.880\n", " \n", " cookies\n", "
\n", " +18.115\n", " \n", " bloks\n", "
\n", " +17.453\n", " \n", " cookie\n", "
\n", " +17.322\n", " \n", " rice cakes\n", "
\n", " +16.863\n", " \n", " chip\n", "
\n", " +16.522\n", " \n", " pretzel\n", "
\n", " +16.463\n", " \n", " raisins\n", "
\n", " +16.040\n", " \n", " hummus\n", "
\n", " +15.033\n", " \n", " sahale\n", "
\n", " +15.015\n", " \n", " snack\n", "
\n", " +14.812\n", " \n", " granola\n", "
\n", " … 20111 more positive …\n", "
\n", " … 29870 more negative …\n", "
\n", " -15.215\n", " \n", " beans\n", "
\n", " -22.005\n", " \n", " tea\n", "
\n", "\n", " \n", " \n", "
\n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eli5.show_weights(estimator=tfidf_logit_pipeline.named_steps['logit'],\n", " vec=tfidf_logit_pipeline.named_steps['tf_idf'])" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('While in Hungary we were given a recipe for Hungarian Goulash. It needs sweet paprika. This was terrific in that dish and others. I will purchase it again when I need more.',\n", " 'herbs')" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_texts[0], y_train[0]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", "\n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", " \n", "\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=baby food\n", " \n", "\n", "\n", " \n", " (probability 0.000, score -2.327)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " -0.697\n", " \n", " Highlighted in text (sum)\n", "
\n", " -1.630\n", " \n", " <BIAS>\n", "
\n", "\n", " \n", "\n", "\n", "\n", "

\n", " while in hungary we were given a recipe for hungarian goulash. it needs sweet paprika. this was terrific in that dish and others. i will purchase it again when i need more.\n", "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=beverages\n", " \n", "\n", "\n", " \n", " (probability 0.000, score -1.968)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " +2.623\n", " \n", " <BIAS>\n", "
\n", " -4.591\n", " \n", " Highlighted in text (sum)\n", "
\n", "\n", " \n", "\n", "\n", "\n", "

\n", " while in hungary we were given a recipe for hungarian goulash. it needs sweet paprika. this was terrific in that dish and others. i will purchase it again when i need more.\n", "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=breads bakery\n", " \n", "\n", "\n", " \n", " (probability 0.001, score 0.055)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " +0.835\n", " \n", " <BIAS>\n", "
\n", " -0.780\n", " \n", " Highlighted in text (sum)\n", "
\n", "\n", " \n", "\n", "\n", "\n", "

\n", " while in hungary we were given a recipe for hungarian goulash. it needs sweet paprika. this was terrific in that dish and others. i will purchase it again when i need more.\n", "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=breakfast foods\n", " \n", "\n", "\n", " \n", " (probability 0.000, score -1.159)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " +0.279\n", " \n", " <BIAS>\n", "
\n", " -1.438\n", " \n", " Highlighted in text (sum)\n", "
\n", "\n", " \n", "\n", "\n", "\n", "

\n", " while in hungary we were given a recipe for hungarian goulash. it needs sweet paprika. this was terrific in that dish and others. i will purchase it again when i need more.\n", "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=candy chocolate\n", " \n", "\n", "\n", " \n", " (probability 0.001, score 0.212)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " +1.973\n", " \n", " <BIAS>\n", "
\n", " -1.760\n", " \n", " Highlighted in text (sum)\n", "
\n", "\n", " \n", "\n", "\n", "\n", "

\n", " while in hungary we were given a recipe for hungarian goulash. it needs sweet paprika. this was terrific in that dish and others. i will purchase it again when i need more.\n", "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=cooking baking supplies\n", " \n", "\n", "\n", " \n", " (probability 0.004, score 1.170)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " +0.860\n", " \n", " <BIAS>\n", "
\n", " +0.310\n", " \n", " Highlighted in text (sum)\n", "
\n", "\n", " \n", "\n", "\n", "\n", "

\n", " while in hungary we were given a recipe for hungarian goulash. it needs sweet paprika. this was terrific in that dish and others. i will purchase it again when i need more.\n", "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=dairy eggs\n", " \n", "\n", "\n", " \n", " (probability 0.001, score -0.053)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " +0.437\n", " \n", " Highlighted in text (sum)\n", "
\n", " -0.491\n", " \n", " <BIAS>\n", "
\n", "\n", " \n", "\n", "\n", "\n", "

\n", " while in hungary we were given a recipe for hungarian goulash. it needs sweet paprika. this was terrific in that dish and others. i will purchase it again when i need more.\n", "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=fresh flowers live indoor plants\n", " \n", "\n", "\n", " \n", " (probability 0.000, score -6.405)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " -0.509\n", " \n", " Highlighted in text (sum)\n", "
\n", " -5.897\n", " \n", " <BIAS>\n", "
\n", "\n", " \n", "\n", "\n", "\n", "

\n", " while in hungary we were given a recipe for hungarian goulash. it needs sweet paprika. this was terrific in that dish and others. i will purchase it again when i need more.\n", "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=gourmet gifts\n", " \n", "\n", "\n", " \n", " (probability 0.000, score -2.366)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " -0.537\n", " \n", " <BIAS>\n", "
\n", " -1.828\n", " \n", " Highlighted in text (sum)\n", "
\n", "\n", " \n", "\n", "\n", "\n", "

\n", " while in hungary we were given a recipe for hungarian goulash. it needs sweet paprika. this was terrific in that dish and others. i will purchase it again when i need more.\n", "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=herbs\n", " \n", "\n", "\n", " \n", " (probability 0.154, score 4.945)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " +4.215\n", " \n", " Highlighted in text (sum)\n", "
\n", " +0.730\n", " \n", " <BIAS>\n", "
\n", "\n", " \n", "\n", "\n", "\n", "

\n", " while in hungary we were given a recipe for hungarian goulash. it needs sweet paprika. this was terrific in that dish and others. i will purchase it again when i need more.\n", "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=meat poultry\n", " \n", "\n", "\n", " \n", " (probability 0.004, score 1.199)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " +1.214\n", " \n", " Highlighted in text (sum)\n", "
\n", " -0.015\n", " \n", " <BIAS>\n", "
\n", "\n", " \n", "\n", "\n", "\n", "

\n", " while in hungary we were given a recipe for hungarian goulash. it needs sweet paprika. this was terrific in that dish and others. i will purchase it again when i need more.\n", "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=meat seafood\n", " \n", "\n", "\n", " \n", " (probability 0.000, score -0.847)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " +0.606\n", " \n", " Highlighted in text (sum)\n", "
\n", " -1.453\n", " \n", " <BIAS>\n", "
\n", "\n", " \n", "\n", "\n", "\n", "

\n", " while in hungary we were given a recipe for hungarian goulash. it needs sweet paprika. this was terrific in that dish and others. i will purchase it again when i need more.\n", "

\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=pantry staples\n", " \n", "\n", "\n", " \n", " (probability 0.820, score 6.619)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " +3.366\n", " \n", " Highlighted in text (sum)\n", "
\n", " +3.253\n", " \n", " <BIAS>\n", "
\n", "\n", " \n", "\n", "\n", "\n", "


\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=produce\n", " \n", "\n", "\n", " \n", " (probability 0.013, score 2.489)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " +3.578\n", " \n", " Highlighted in text (sum)\n", "
\n", " -1.089\n", " \n", " <BIAS>\n", "
\n", "\n", " \n", "\n", "\n", "\n", "


\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=sauces dips\n", " \n", "\n", "\n", " \n", " (probability 0.001, score -0.255)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " +0.471\n", " \n", " Highlighted in text (sum)\n", "
\n", " -0.726\n", " \n", " <BIAS>\n", "
\n", "\n", " \n", "\n", "\n", "\n", "


\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "

\n", " \n", " \n", " y=snack food\n", " \n", "\n", "\n", " \n", " (probability 0.000, score -1.309)\n", "\n", "top features\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "\n", " \n", "
\n", " Contribution?\n", " Feature
\n", " +1.284\n", " \n", " <BIAS>\n", "
\n", " -2.593\n", " \n", " Highlighted in text (sum)\n", "
\n", "\n", " \n", "\n", "\n", "\n", "


\n" ], "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eli5.show_prediction(estimator=tfidf_logit_pipeline.named_steps['logit'],\n", " vec=tfidf_logit_pipeline.named_steps['tf_idf'],\n", " doc=train_texts[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Hierarchical text classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are going to predict categories 2 and 3 at the same time. It's not straightforward to make category 3 predictions consistent with category 2 predictions. For example, if the model predicts \"breakfast foods\" as category 2, then its category 3 prediction must be a subcategory of \"breakfast foods\", for instance \"cereals\", but not \"spices seasonings\". Formally, this is called hierarchical text classification. Here we take a simple approach: we join the two labels into a single target, so every predicted label decodes into a (category 2, category 3) pair that occurs in the data." 
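] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
The consistency requirement can be made concrete with a small sketch. It assumes a toy list of (category 2, category 3) pairs standing in for the real `Cat2` and `Cat3` columns. The names `train_pairs`, `build_hierarchy` and `is_consistent` are hypothetical and not part of this notebook, and 'breakfast cereal bars' is a made-up label. The sketch contrasts a pair that two independent classifiers might produce with pairs decoded from joined labels.

```python
from collections import defaultdict

# Toy (category 2, category 3) pairs: hypothetical stand-ins for df['Cat2'], df['Cat3']
train_pairs = [
    ('breakfast foods', 'cereals'),
    ('breakfast foods', 'breakfast cereal bars'),
    ('herbs', 'spices seasonings'),
]


def build_hierarchy(pairs):
    '''Map each category 2 label to the set of category 3 labels seen under it.'''
    hierarchy = defaultdict(set)
    for cat2, cat3 in pairs:
        hierarchy[cat2].add(cat3)
    return dict(hierarchy)


def is_consistent(cat2, cat3, hierarchy):
    '''Return True if cat3 is a known subcategory of cat2.'''
    return cat3 in hierarchy.get(cat2, set())


hierarchy = build_hierarchy(train_pairs)

# Two independent classifiers could output a pair that violates the hierarchy
independent_prediction = ('breakfast foods', 'spices seasonings')

# A single classifier over joined labels can only output labels seen in training;
# splitting on the first '/' recovers the original pair
joint_labels = sorted({cat2 + '/' + cat3 for cat2, cat3 in train_pairs})
decoded = [tuple(label.split('/', 1)) for label in joint_labels]

print(is_consistent(*independent_prediction, hierarchy))
print(all(is_consistent(c2, c3, hierarchy) for c2, c3 in decoded))
```

The first check prints `False` and the second prints `True`. The trade-off of joining labels is a larger number of classes, and the model can never predict a pair that was absent from the training data.
"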
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "# combine categories 2 and 3 into a single 'Cat2/Cat3' label\n", "df['Cat2_Cat3'] = df['Cat2'] + '/' + df['Cat3']" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "y_cat2_and_cat3 = df['Cat2_Cat3']" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 herbs/spices seasonings\n", "1 breakfast foods/cereals\n", "2 breakfast foods/cereals\n", "3 breakfast foods/cereals\n", "4 breakfast foods/cereals\n", "Name: Cat2_Cat3, dtype: object" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_cat2_and_cat3.head()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "# train/validation split, stratified by the combined label\n", "train_texts, valid_texts, y_train_cat2_and_cat3, y_valid_cat2_and_cat3 = \\\n", " train_test_split(texts, y_cat2_and_cat3,\n", " random_state=17,\n", " stratify=y_cat2_and_cat3,\n", " shuffle=True)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 9.69 s, sys: 257 ms, total: 9.95 s\n", "Wall time: 5min 29s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=4)]: Done 1 out of 1 | elapsed: 5.3min finished\n" ] }, { "data": { "text/plain": [ "Pipeline(memory=None,\n", " steps=[('tf_idf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n", " dtype=, encoding='utf-8', input='content',\n", " lowercase=True, max_df=1.0, max_features=50000, min_df=2,\n", " ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=Tru... 
penalty='l2', random_state=17, solver='lbfgs',\n", " tol=0.0001, verbose=1, warm_start=False))])" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "# refit the same pipeline, now on the combined labels\n", "tfidf_logit_pipeline.fit(train_texts, y_train_cat2_and_cat3)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.52 s, sys: 36.2 ms, total: 2.56 s\n", "Wall time: 2.57 s\n" ] } ], "source": [ "%%time\n", "valid_pred_cat2_and_cat3 = tfidf_logit_pipeline.predict(valid_texts)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "# split the combined predictions back into category 2 and category 3\n", "cat2_pred = pd.Series(valid_pred_cat2_and_cat3).apply(lambda s:\n", " s.split('/')[0])\n", "cat3_pred = pd.Series(valid_pred_cat2_and_cat3).apply(lambda s:\n", " s.split('/')[1])" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "# do the same for the true validation labels\n", "y_valid_cat2 = pd.Series(y_valid_cat2_and_cat3).apply(lambda s:\n", " s.split('/')[0])\n", "y_valid_cat3 = pd.Series(y_valid_cat2_and_cat3).apply(lambda s:\n", " s.split('/')[1])" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6370619299087854" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# validation accuracy for category 3\n", "accuracy_score(y_valid_cat3, cat3_pred)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.758801408225316" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# validation accuracy for category 2\n", "accuracy_score(y_valid_cat2, cat2_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Links:\n", " - Machine learning library [Scikit-learn](https://scikit-learn.org/stable/index.html) (a.k.a. 
sklearn)\n", " - Open ML course [mlcourse.ai](https://mlcourse.ai), also available as a [Kaggle Dataset](https://www.kaggle.com/kashnitsky/mlcourse)\n", " - Kernels on [logistic regression](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification) and its application to [text classification](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit), plus a [Kernel](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering-and-feature-selection) on feature engineering and feature selection\n", " - [Kaggle Kernel](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle) \"Approaching (Almost) Any NLP Problem on Kaggle\"\n", " - [ELI5](https://github.com/TeamHG-Memex/eli5), a library for explaining model predictions" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }