{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Supervised sentiment: hand-built feature functions" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "__author__ = \"Christopher Potts\"\n", "__version__ = \"CS224u, Stanford, Spring 2022\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "1. [Overview](#Overview)\n", "1. [Set-up](#Set-up)\n", "1. [Feature functions](#Feature-functions)\n", " 1. [Unigrams](#Unigrams)\n", " 1. [Bigrams](#Bigrams)\n", " 1. [A note on DictVectorizer](#A-note-on-DictVectorizer)\n", "1. [Building datasets for experiments](#Building-datasets-for-experiments)\n", "1. [Basic optimization](#Basic-optimization)\n", " 1. [Wrapper for SGDClassifier](#Wrapper-for-SGDClassifier)\n", " 1. [Wrapper for LogisticRegression](#Wrapper-for-LogisticRegression)\n", " 1. [Wrapper for TorchShallowNeuralClassifier](#Wrapper-for-TorchShallowNeuralClassifier)\n", " 1. [A softmax classifier in PyTorch](#A-softmax-classifier-in-PyTorch)\n", " 1. [Using sklearn Pipelines](#Using-sklearn-Pipelines)\n", "1. [Hyperparameter search](#Hyperparameter-search)\n", " 1. [utils.fit_classifier_with_hyperparameter_search](#utils.fit_classifier_with_hyperparameter_search)\n", " 1. [Example using LogisticRegression](#Example-using-LogisticRegression)\n", "1. [Reproducing baselines from Socher et al. 2013](#Reproducing--baselines-from-Socher-et-al.-2013)\n", " 1. [Reproducing the Unigram NaiveBayes results](#Reproducing-the-Unigram-NaiveBayes-results)\n", " 1. [Reproducing the Bigrams NaiveBayes results](#Reproducing-the-Bigrams-NaiveBayes-results)\n", " 1. [Reproducing the SVM results](#Reproducing-the-SVM-results)\n", "1. [Statistical comparison of classifier models](#Statistical-comparison-of-classifier-models)\n", " 1. [Comparison with the Wilcoxon signed-rank test](#Comparison-with-the-Wilcoxon-signed-rank-test)\n", " 1. [Comparison with McNemar's test](#Comparison-with-McNemar's-test)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Overview\n", "\n", "* The focus of this notebook is __building feature representations__ for use with (mostly linear) classifiers (though you're encouraged to try out some non-linear ones as well!).\n", "\n", "* The core characteristics of the feature functions we'll build here:\n", " * They represent examples in __very large, very sparse feature spaces__.\n", " * The individual feature functions can be __highly refined__, drawing on expert human knowledge of the domain. \n", " * Taken together, these representations don't comprehensively represent the input examples. They just identify aspects of the inputs that the classifier model can make good use of (we hope).\n", " \n", "* These classifiers tend to be __highly competitive__. We'll look at more powerful deep learning models in the next notebook, and it will immediately become apparent that it is very difficult to get them to measure up to well-built classifiers based in sparse feature representations. It can be done, but it tends to require a lot of attention to optimization details (and potentially a lot of compute resources).\n", "\n", "* For this notebook, we look in detail at just two very general strategies for featurization: unigram-based and bigram-based. This gives us a chance to introduce core concepts in optimization. 
The [associated homework](hw_sst.ipynb) is oriented towards designing more specialized, linguistically intricate feature functions." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Set-up\n", "\n", "See [the previous notebook](sst_01_overview.ipynb#Set-up) for set-up instructions." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "import os\n", "import pandas as pd\n", "from sklearn.feature_extraction import DictVectorizer\n", "from sklearn.feature_extraction.text import TfidfTransformer\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import classification_report\n", "from sklearn.model_selection import PredefinedSplit\n", "from sklearn.naive_bayes import MultinomialNB\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.svm import LinearSVC\n", "import scipy.stats\n", "import torch.nn as nn\n", "\n", "from np_sgd_classifier import BasicSGDClassifier\n", "from torch_shallow_neural_classifier import TorchShallowNeuralClassifier\n", "import sst\n", "import utils" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "utils.fix_random_seeds()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "SST_HOME = os.path.join('data', 'sentiment')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Feature functions\n", "\n", "* Feature representation is arguably __the most important step in any machine learning task__. As you experiment with the SST, you'll come to appreciate this fact, since your choice of feature function will have a far greater impact on the effectiveness of your models than any other choice you make. This is especially true if you are careful to optimize the hyperparameters of your models.\n", "\n", "* We will define our feature functions as `dict`s mapping feature names (which can be any object that can be a `dict` key) to their values (which must be `bool`, `int`, or `float`). \n", "\n", "* To prepare for optimization, we will use `sklearn`'s [DictVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) class to turn these into matrices of features. \n", "\n", "* The `dict`-based approach gives us a lot of flexibility and frees us from having to worry about the underlying feature matrix." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Unigrams" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "A typical baseline or default feature representation in NLP or NLU is built from unigrams. Here, those are the leaf nodes of the tree:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def unigrams_phi(text):\n", " \"\"\"\n", " The basis for a unigrams feature function. Downcases all tokens.\n", "\n", " Parameters\n", " ----------\n", " text : str\n", " The example to represent.\n", "\n", " Returns\n", " -------\n", " defaultdict\n", " A map from strings to their counts in `text`. 
(Counter maps a\n", " list to a dict of counts of the elements in that list.)\n", "\n", " \"\"\"\n", " return Counter(text.lower().split())" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "example_text = \"NLU is enlightening !\"" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({'nlu': 1, 'is': 1, 'enlightening': 1, '!': 1})" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unigrams_phi(example_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bigrams" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def bigrams_phi(text):\n", " \"\"\"\n", " The basis for a bigrams feature function. Downcases all tokens.\n", "\n", " Parameters\n", " ----------\n", " text : str\n", " The example to represent.\n", "\n", " Returns\n", " -------\n", " defaultdict\n", " A map from tuples to their counts in `text`.\n", "\n", " \"\"\"\n", " toks = text.lower().split()\n", " left = [utils.START_SYMBOL] + toks\n", " right = toks + [utils.END_SYMBOL]\n", " grams = list(zip(left, right))\n", " return Counter(grams)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({('', 'nlu'): 1,\n", " ('nlu', 'is'): 1,\n", " ('is', 'enlightening'): 1,\n", " ('enlightening', '!'): 1,\n", " ('!', ''): 1})" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bigrams_phi(example_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's generally good design to __write lots of atomic feature functions__ and then bring them together into a single function when running experiments. This will lead to reusable parts that you can assess independently and in sub-groups as part of development." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A note on DictVectorizer\n", "\n", "I've tried to be careful above to say that the above functions are just the __basis__ for feature representations. In truth, our models typically don't represent examples as dictionaries, but rather as vectors embedded in a matrix. In general, to manage the translation from dictionaries to vectors, we use `sklearn.feature_extraction.DictVectorizer` instances. Here's a brief overview of how these work:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To start, suppose that we had just two examples to represent, and our feature function mapped them to the following list of dictionaries:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "train_feats = [\n", " {'a': 1, 'b': 1},\n", " {'b': 1, 'c': 2}]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we create a `DictVectorizer`. So that we can more easily inspect the resulting matrix, I've set `sparse=False`, so that the return value is a dense matrix. For real problems, you'll probably want to use `sparse=True`, as it will be vastly more efficient for the very sparse feature matrices that you are likely to be creating." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "vec = DictVectorizer(sparse=False) # Use `sparse=True` for real problems!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `fit_transform` method maps our list of dictionaries to a matrix:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "X_train = vec.fit_transform(train_feats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here I'll create a `pd.Datafame` just to help us inspect `X_train`:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
01.01.00.0
10.01.02.0
\n", "
" ], "text/plain": [ " a b c\n", "0 1.0 1.0 0.0\n", "1 0.0 1.0 2.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(X_train, columns=vec.get_feature_names_out())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can see that, intuitively, the feature called \"a\" is embedded in the first column, \"b\" in the second column, and \"c\" in the third." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now suppose we have some new test examples:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "test_feats = [\n", " {'a': 2, 'c': 1},\n", " {'a': 4, 'b': 2, 'd': 1}]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we have trained a model on `X_train`, then it will not have any way to deal with this new feature \"d\". This shows that we need to embed `test_feats` in the same space as `X_train`. To do this, one just calls `transform` on the existing vectorizer:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "X_test = vec.transform(test_feats) # Not `fit_transform`!" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abc
02.00.01.0
14.02.00.0
\n", "
" ], "text/plain": [ " a b c\n", "0 2.0 0.0 1.0\n", "1 4.0 2.0 0.0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(X_test, columns=vec.get_feature_names_out())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most common mistake with `DictVectorizer` is calling `fit_transform` on test examples. This will wipe out the existing representation scheme, replacing it with one that matches the test examples. That will happen silently, but then you'll find that the new representations are incompatible with the model you fit. This is likely to manifest itself as a `ValueError` relating to feature counts. Here's an example that might help you spot this if and when it arises in your own work: " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ValueError: X has 4 features, but LogisticRegression is expecting 3 features as input.\n" ] } ], "source": [ "toy_mod = LogisticRegression()\n", "\n", "vec = DictVectorizer(sparse=False)\n", "\n", "X_train = vec.fit_transform(train_feats)\n", "\n", "toy_mod.fit(X_train, [0, 1])\n", "\n", "# Here's the error! Don't use `fit_transform` again! Use `transform`!\n", "X_test = vec.fit_transform(test_feats)\n", "\n", "try:\n", " toy_mod.predict(X_test)\n", "except ValueError as err:\n", " print(\"ValueError: {}\".format(err))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is actually the lucky case. If your train and test sets have the same number of features (columns), then no error will arise, and you might not even notice the misalignment between the train and test feature matrices." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In what follows, all these steps will be taken care of \"under the hood\", but it's good to be aware of what is happening. I think this also helps show the value of writing general experiment code, so that you don't have to check each experiment individually to make sure that you called the `DictVectorizer` methods (among other things!) correctly." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Building datasets for experiments\n", "\n", "The second major phase for our analysis is a kind of set-up phase. Ingredients:\n", "\n", "* A dataset from a function like `sst.train_reader`\n", "* A feature function like `unigrams_phi`\n", "\n", "The convenience function `sst.build_dataset` uses these to build a dataset for training and assessing a model. See its documentation for details on how it works. Much of this is about taking advantage of `sklearn`'s many functions for model building." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "train_dataset = sst.build_dataset(\n", " sst.train_reader(SST_HOME),\n", " phi=unigrams_phi,\n", " vectorizer=None)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train dataset with unigram features has 8,544 examples and 16,579 features.\n" ] } ], "source": [ "print(\"Train dataset with unigram features has {:,} examples and \"\n", " \"{:,} features.\".format(*train_dataset['X'].shape))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that `sst.build_dataset` has an optional argument `vectorizer`:\n", "\n", "* If it is `None`, then a new vectorizer is used and returned as `dataset['vectorizer']`. 
This is the usual scenario when training. \n", "\n", "* For evaluation, one wants to represent examples exactly as they were represented during training. To ensure that this happens, pass the training `vectorizer` to this function, so that `transform` is used, [as discussed just above](#A-note-on-DictVectorizer)." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "dev_dataset = sst.build_dataset(\n", " sst.dev_reader(SST_HOME),\n", " phi=unigrams_phi,\n", " vectorizer=train_dataset['vectorizer'])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dev dataset with unigram features has 1,101 examples and 16,579 features\n" ] } ], "source": [ "print(\"Dev dataset with unigram features has {:,} examples \"\n", " \"and {:,} features\".format(*dev_dataset['X'].shape))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Basic optimization\n", "\n", "We're now in a position to begin training supervised models!\n", "\n", "For the most part, in this course, we will not study the theoretical aspects of machine learning optimization, concentrating instead on how to optimize systems effectively in practice. That is, this isn't a theory course, but rather an experimental, project-oriented one.\n", "\n", "Nonetheless, we do want to avoid treating our optimizers as black boxes that work their magic and give us some assessment figures for whatever we feed into them. That seems irresponsible from a scientific and engineering perspective, and it also sends the false signal that the optimization process is inherently mysterious. So we do want to take a minute to demystify it with some simple code.\n", "\n", "The module `np_sgd_classifier` contains a complete optimization framework, as `BasicSGDClassifier`. Well, it's complete in the sense that it achieves our full task of supervised learning. It's incomplete in the sense that it is very basic. You probably wouldn't want to use it in experiments. Rather, we're going to encourage you to rely on `sklearn` for your experiments (see below). Still, this is a good basic picture of what's happening under the hood.\n", "\n", "So what is `BasicSGDClassifier` doing? The heart of it is the `fit` function (reflecting the usual `sklearn` naming system). This method implements a hinge-loss stochastic sub-gradient descent optimization. Intuitively, it works as follows:\n", "\n", "1. Start by assuming that all the feature weights are `0`.\n", "1. Move through the dataset instance-by-instance in random order.\n", "1. For each instance, classify it using the current weights. \n", "1. If the classification is incorrect, move the weights in the direction of the correct classification\n", "\n", "This process repeats for a user-specified number of iterations (default `10` below), and the weight movement is tempered by a learning-rate parameter `eta` (default `0.1`). 
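In code, one pass of this loop might look like the following (an illustrative NumPy sketch of steps 1-4, using integer class labels and a dense weight matrix `W`; the actual `BasicSGDClassifier` differs in details, including the cost-augmented prediction described below):\n", "\n", "```python\n", "import numpy as np\n", "\n", "def basic_sgd_epoch(X, y, W, eta=0.1):\n", "    # One random pass over the data, following steps 1-4 above.\n", "    for i in np.random.permutation(len(y)):\n", "        prediction = np.argmax(X[i] @ W)    # classify with the current weights\n", "        if prediction != y[i]:              # if incorrect, move the weights:\n", "            W[:, y[i]] += eta * X[i]        # toward the correct class\n", "            W[:, prediction] -= eta * X[i]  # away from the wrong prediction\n", "    return W\n", "```\n", "\n", "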
The output is a set of weights that can be used to make predictions about new (properly featurized) examples.\n", "\n", "In more technical terms, the objective function is \n", "\n", "$$\n", " \\min_{\\mathbf{w} \\in \\mathbb{R}^{d}}\n", " \\sum_{(x,y)\\in\\mathcal{D}} \n", " \\max_{y'\\in\\mathbf{Y}}\n", " \\left[\\mathbf{Score}_{\\textbf{w}, \\phi}(x,y') + \\mathbf{cost}(y,y')\\right] - \\mathbf{Score}_{\\textbf{w}, \\phi}(x,y)\n", "$$\n", "\n", "where $\\mathbf{w}$ is the set of weights to be learned, $\\mathcal{D}$ is the training set of example–label pairs, $\\mathbf{Y}$ is the set of labels, $\\mathbf{cost}(y,y') = 0$ if $y=y'$, else $1$, and $\\mathbf{Score}_{\\textbf{w}, \\phi}(x,y')$ is the inner product of the weights \n", "$\\mathbf{w}$ and the example as featurized according to $\\phi$.\n", "\n", "The `fit` method is then calculating the sub-gradient of this objective. In succinct pseudo-code:\n", "\n", "* Initialize $\\mathbf{w} = \\mathbf{0}$\n", "* Repeat $T$ times:\n", " * for each $(x,y) \\in \\mathcal{D}$ (in random order):\n", " * $\\tilde{y} = \\text{argmax}_{y'\\in \\mathcal{Y}} \\mathbf{Score}_{\\textbf{w}, \\phi}(x,y') + \\mathbf{cost}(y,y')$\n", " * $\\mathbf{w} = \\mathbf{w} + \\eta(\\phi(x,y) - \\phi(x,\\tilde{y}))$\n", " \n", "This is very intuitive – push the weights in the direction of the positive cases. It doesn't require any probability theory. And such loss functions have proven highly effective in many settings. For a more powerful version of this classifier, see [sklearn.linear_model.SGDClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier). With `loss='hinge'`, it should behave much like `BasicSGDClassifier` (but faster!).\n", "\n", "For the most part, the classifiers that we use in this course have a softmax objective function. The module [np_shallow_neural_classifier.py](np_shallow_neural_classifier.py) is a straightforward example. The precise calculations are a bit less transparent than those for `BasicSGDClassifier`, but the general logic is the same for both." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Wrapper for SGDClassifier\n", "\n", "For the sake of our experimental framework, a simple wrapper for `SGDClassifier`:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def fit_basic_sgd_classifier(X, y):\n", " \"\"\"\n", " Wrapper for `BasicSGDClassifier`.\n", "\n", " Parameters\n", " ----------\n", " X : np.array, shape `(n_examples, n_features)`\n", " The matrix of features, one example per row.\n", "\n", " y : list\n", " The list of labels for rows in `X`.\n", "\n", " Returns\n", " -------\n", " BasicSGDClassifier\n", " A trained `BasicSGDClassifier` instance.\n", "\n", " \"\"\"\n", " mod = BasicSGDClassifier()\n", " mod.fit(X, y)\n", " return mod" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This might look like a roundabout way of just calling `fit`. We'll see shortly that having a \"wrapper\" like this creates space for us to include a lot of other modeling steps.\n", "\n", "We now have all the pieces needed to run experiments. And __we're going to want to run a lot of experiments__, trying out different feature functions, taking different perspectives on the data and labels, and using different models. \n", "\n", "To make that process efficient and regimented, `sst` contains a function `experiment`. 
All it does is pull together these pieces and use them for training and assessment. It's complicated, but the flexibility will turn out to be an asset. Here's an example with the optional keyword arguments spelled out:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.666 0.558 0.607 428\n", " neutral 0.352 0.109 0.167 229\n", " positive 0.562 0.849 0.676 444\n", "\n", " accuracy 0.582 1101\n", " macro avg 0.527 0.506 0.483 1101\n", "weighted avg 0.559 0.582 0.543 1101\n", "\n" ] } ], "source": [ "_ = sst.experiment(\n", " sst.train_reader(SST_HOME),\n", " unigrams_phi,\n", " fit_basic_sgd_classifier,\n", " assess_dataframes=sst.dev_reader(SST_HOME),\n", " train_size=0.7,\n", " score_func=utils.safe_macro_f1,\n", " verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few notes on this function call:\n", " \n", "* Since `assess_dataframes=sst.dev_reader(SST_HOME)`, the model is assessed against the `dev` set. If you instead pass `assess_dataframes=None`, the function reports performance on a random train–test split of the dataframes given by the first argument.\n", "\n", "* `unigrams_phi` is the function we defined above. By changing/expanding this function, you can start to improve on the above baseline, perhaps periodically seeing how you do on the dev set.\n", "\n", "* `fit_basic_sgd_classifier` is the wrapper we defined above. To assess new models, simply define more functions like this one. Such functions just need to consume an `(X, y)` pair constituting a dataset and return a model." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Wrapper for LogisticRegression\n", "\n", "As I said above, we likely don't want to rely on `BasicSGDClassifier` (though it does a good job with SST!). Instead, we want to rely on `sklearn` and our `torch_*` models. \n", "\n", "Here's a simple wrapper for [sklearn.linear_model.LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html):" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def fit_softmax_classifier(X, y):\n", " \"\"\"\n", " Wrapper for `sklearn.linear_model.LogisticRegression`. 
This is\n", " also called a Maximum Entropy (MaxEnt) Classifier, which is more\n", " fitting for the multiclass case.\n", "\n", " Parameters\n", " ----------\n", " X : np.array, shape `(n_examples, n_features)`\n", " The matrix of features, one example per row.\n", "\n", " y : list\n", " The list of labels for rows in `X`.\n", "\n", " Returns\n", " -------\n", " sklearn.linear.model.LogisticRegression\n", " A trained `LogisticRegression` instance.\n", "\n", " \"\"\"\n", " mod = LogisticRegression(\n", " fit_intercept=True,\n", " solver='liblinear',\n", " multi_class='auto')\n", " mod.fit(X, y)\n", " return mod" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And an experiment using `fit_softmax_classifier` and `unigrams_phi`:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.628 0.689 0.657 996\n", " neutral 0.336 0.157 0.214 479\n", " positive 0.671 0.770 0.717 1089\n", "\n", " accuracy 0.624 2564\n", " macro avg 0.545 0.538 0.529 2564\n", "weighted avg 0.592 0.624 0.600 2564\n", "\n" ] } ], "source": [ "_ = sst.experiment(\n", " sst.train_reader(SST_HOME),\n", " unigrams_phi,\n", " fit_softmax_classifier)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wrapper for TorchShallowNeuralClassifier\n", "\n", "While we're at it, we might as well start to get a sense for whether adding a hidden layer to our softmax classifier yields any benefits. Whereas `LogisticRegression` is, at its core, computing\n", "\n", "$$\\begin{align*}\n", "y &= \\textbf{softmax}(xW_{xy} + b_{y})\n", "\\end{align*}$$\n", "\n", "the shallow neural network inserts a hidden layer with a non-linear activation applied to it:\n", "\n", "$$\\begin{align*}\n", "h &= \\tanh(xW_{xh} + b_{h}) \\\\\n", "y &= \\textbf{softmax}(hW_{hy} + b_{y})\n", "\\end{align*}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's an illustrative example using `TorchShallowNeuralClassifier`, which is in the course repo's `torch_shallow_neural_classifier.py`:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "def fit_nn_classifier(X, y):\n", " mod = TorchShallowNeuralClassifier(\n", " hidden_dim=100,\n", " early_stopping=True, # A basic early stopping set-up.\n", " validation_fraction=0.1, # If no improvement on the\n", " tol=1e-5, # validation set is seen within\n", " n_iter_no_change=10) # `n_iter_no_change`, we stop.\n", " mod.fit(X, y)\n", " return mod" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A noteworthy feature of this `fit_nn_classifier` is that it sets `early_stopping=True`. This instructs the optimizer to hold out a small fraction (see `validation_fraction`) of the training data to use as a dev set at the end of each epoch. Optimization will stop if improvements of at least `tol` on this dev set aren't seen within `n_iter_no_change` epochs. If that condition is triggered, the parameters from the top-scoring model are used for the final model. (For additional discussion, see [the section on model convergence in the evaluation methods notebook](#Assessing-models-without-convergence).)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another quick experiment:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping after epoch 21. 
Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.8948389664292336" ] }, { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.621 0.721 0.667 977\n", " neutral 0.303 0.093 0.142 497\n", " positive 0.659 0.773 0.712 1090\n", "\n", " accuracy 0.621 2564\n", " macro avg 0.528 0.529 0.507 2564\n", "weighted avg 0.576 0.621 0.584 2564\n", "\n" ] } ], "source": [ "_ = sst.experiment(\n", " sst.train_reader(SST_HOME),\n", " unigrams_phi,\n", " fit_nn_classifier)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### A softmax classifier in PyTorch\n", "\n", "Our PyTorch modules should support easy modification as discussed in [tutorial_pytorch_models.ipynb](tutorial_pytorch_models.ipynb). Perhaps the simplest modification from that notebook uses `TorchShallowNeuralClassifier` to define a `TorchSoftmaxClassifier`. All you need to do for this is write a new `build_graph` method:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "class TorchSoftmaxClassifier(TorchShallowNeuralClassifier):\n", "\n", " def build_graph(self):\n", " return nn.Linear(self.input_dim, self.n_classes_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this function call, I added an L2 regularization term to help prevent overfitting:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "def fit_torch_softmax(X, y):\n", " mod = TorchSoftmaxClassifier(l2_strength=0.0001)\n", " mod.fit(X, y)\n", " return mod" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping after epoch 763. Training loss did not improve more than tol=1e-05. Final error is 1.1415197104215622." ] }, { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.630 0.686 0.657 969\n", " neutral 0.318 0.178 0.228 528\n", " positive 0.661 0.752 0.704 1067\n", "\n", " accuracy 0.609 2564\n", " macro avg 0.536 0.539 0.530 2564\n", "weighted avg 0.579 0.609 0.588 2564\n", "\n" ] } ], "source": [ "_ = sst.experiment(\n", " sst.train_reader(SST_HOME),\n", " unigrams_phi,\n", " fit_torch_softmax)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using sklearn Pipelines\n", "\n", "The `sklearn.pipeline` module defines `Pipeline` objects, which let you chain together different transformations and estimators. `Pipeline` objects are fully compatible with `sst.experiment`. 
Here's a basic example using `TfidfTransformer` followed by `LogisticRegression`:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "def fit_pipeline_softmax(X, y):\n", " rescaler = TfidfTransformer()\n", " mod = LogisticRegression(max_iter=2000)\n", " pipeline = Pipeline([\n", " ('scaler', rescaler),\n", " ('model', mod)])\n", " pipeline.fit(X, y)\n", " return pipeline" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.611 0.726 0.664 965\n", " neutral 0.361 0.051 0.089 511\n", " positive 0.646 0.798 0.714 1088\n", "\n", " accuracy 0.622 2564\n", " macro avg 0.539 0.525 0.489 2564\n", "weighted avg 0.576 0.622 0.570 2564\n", "\n" ] } ], "source": [ "_ = sst.experiment(\n", " sst.train_reader(SST_HOME),\n", " unigrams_phi,\n", " fit_pipeline_softmax)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pipelines can also include the models from the course repo. The one gotcha here is that some `sklearn` transformers return sparse matrices, which are likely to clash with the requirements of these other models. To get around this, just add `utils.DenseTransformer()` where you need to transition from a sparse matrix to a dense one. Here's an example using `TorchShallowNeuralClassifier` with early stopping:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "def fit_pipeline_classifier(X, y):\n", " rescaler = TfidfTransformer()\n", " mod = TorchShallowNeuralClassifier(early_stopping=True)\n", " pipeline = Pipeline([\n", " ('scaler', rescaler),\n", " # We need this little bridge to go from\n", " # sparse matrices to dense ones:\n", " ('densify', utils.DenseTransformer()),\n", " ('model', mod)])\n", " pipeline.fit(X, y)\n", " return pipeline" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping after epoch 18. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 3.631275475025177" ] }, { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.632 0.766 0.693 997\n", " neutral 0.000 0.000 0.000 456\n", " positive 0.663 0.809 0.729 1111\n", "\n", " accuracy 0.649 2564\n", " macro avg 0.432 0.525 0.474 2564\n", "weighted avg 0.533 0.649 0.585 2564\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Applications/anaconda3/envs/nlu/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n", " _warn_prf(average, modifier, msg_start, len(result))\n", "/Applications/anaconda3/envs/nlu/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n", " _warn_prf(average, modifier, msg_start, len(result))\n", "/Applications/anaconda3/envs/nlu/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. 
Use `zero_division` parameter to control this behavior.\n", " _warn_prf(average, modifier, msg_start, len(result))\n" ] } ], "source": [ "_ = sst.experiment(\n", " sst.train_reader(SST_HOME),\n", " unigrams_phi,\n", " fit_pipeline_classifier)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Hyperparameter search\n", "\n", "The training process learns __parameters__ — the weights. There are typically lots of other parameters that need to be set. For instance, our `BasicSGDClassifier` has a learning rate parameter and a training iteration parameter. These are called __hyperparameters__. The more powerful `sklearn` classifiers and our `torch_*` models have many more such hyperparameters. These are outside of the explicitly stated objective, hence the \"hyper\" part. \n", "\n", "So far, we have just set the hyperparameters by hand. However, their optimal values can vary widely between datasets, and choices here can dramatically impact performance, so we would like to set them as part of the overall experimental framework." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### utils.fit_classifier_with_hyperparameter_search\n", "\n", "Luckily, `sklearn` provides a lot of functionality for setting hyperparameters via cross-validation. The function `utils.fit_classifier_with_hyperparameter_search` implements a basic framework for taking advantage of these options. It's really just a lightweight wrapper around `sklearn.model_selection.GridSearchCV`.\n", "\n", "The corresponding model wrappers have the same basic shape as `fit_softmax_classifier` above: they take a dataset as input and return a trained model. However, to find the best model, they explore a space of hyperparameters supplied by the user, seeking the optimal combination of settings. \n", "\n", "Only the training data is used to perform this search; that data is split into multiple train–test splits, and the best hyperparameter settings are the ones that do best on average across these splits. Once those settings are found, a model is trained with those settings on all the available data and finally evaluated on the assessment data."
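, "\n", "In outline, the wrapper behaves like a thin layer over `GridSearchCV`; here is a minimal sketch (not the actual implementation in `utils`, but the argument names mirror the calls shown below):\n", "\n", "```python\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "def fit_classifier_with_search(X, y, basemod, cv, param_grid):\n", "    # Cross-validate over param_grid using only the training data.\n", "    searcher = GridSearchCV(basemod, param_grid, cv=cv, scoring='f1_macro')\n", "    searcher.fit(X, y)\n", "    # With the default refit=True, the best settings are refit on all of X, y.\n", "    return searcher.best_estimator_\n", "```"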
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Example using LogisticRegression\n", "\n", "Here's a fairly full-featured use of the above for the `LogisticRegression` model family:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "def fit_softmax_with_hyperparameter_search(X, y):\n", " \"\"\"\n", " A MaxEnt model of dataset with hyperparameter cross-validation.\n", "\n", " Some notes:\n", "\n", " * 'fit_intercept': whether to include the class bias feature.\n", " * 'C': weight for the regularization term (smaller is more regularized).\n", " * 'penalty': type of regularization -- roughly, 'l1' ecourages small\n", " sparse models, and 'l2' encourages the weights to conform to a\n", " gaussian prior distribution.\n", " * 'class_weight': 'balanced' adjusts the weights to simulate a\n", " balanced class distribution, whereas None makes no adjustment.\n", "\n", " Other arguments can be cross-validated; see\n", " http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\n", "\n", " Parameters\n", " ----------\n", " X : 2d np.array\n", " The matrix of features, one example per row.\n", "\n", " y : list\n", " The list of labels for rows in `X`.\n", "\n", " Returns\n", " -------\n", " sklearn.linear_model.LogisticRegression\n", " A trained model instance, the best model found.\n", "\n", " \"\"\"\n", " basemod = LogisticRegression(\n", " fit_intercept=True,\n", " solver='liblinear',\n", " multi_class='auto')\n", " cv = 5\n", " param_grid = {\n", " 'C': [0.6, 0.8, 1.0, 2.0],\n", " 'penalty': ['l1', 'l2'],\n", " 'class_weight': ['balanced', None]}\n", " bestmod = utils.fit_classifier_with_hyperparameter_search(\n", " X, y, basemod, cv, param_grid)\n", " return bestmod" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best params: {'C': 1.0, 'class_weight': 'balanced', 'penalty': 'l2'}\n", "Best score: 0.541\n", " precision recall f1-score support\n", "\n", " negative 0.635 0.659 0.647 428\n", " neutral 0.313 0.223 0.260 229\n", " positive 0.652 0.725 0.687 444\n", "\n", " accuracy 0.595 1101\n", " macro avg 0.533 0.536 0.531 1101\n", "weighted avg 0.575 0.595 0.582 1101\n", "\n" ] } ], "source": [ "softmax_experiment = sst.experiment(\n", " sst.train_reader(SST_HOME),\n", " unigrams_phi,\n", " fit_softmax_with_hyperparameter_search,\n", " assess_dataframes=sst.dev_reader(SST_HOME))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that the \"Best params\" are found via evaluations only on the training data. The `assess_reader` is held out from that process, so it's giving us an estimate of how we will do on a final test set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reproducing baselines from Socher et al. 2013\n", "\n", "The goal of this section is to bring together ideas from the above to reproduce some of the the non-neural baselines from [Socher et al., Table 1](http://www.aclweb.org/anthology/D/D13/D13-1170.pdf). 
More specifically, we'll shoot for the root-level binary numbers:\n", "\n", "| Model | Accuracy | \n", "|--------------------|-----------|\n", "| Unigram NaiveBayes | 81.8 |\n", "| Bigram NaiveBayes | 83.1 |\n", "| SVM | 79.4 |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following reduces the dataset to the binary task:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "train_df = sst.train_reader(SST_HOME)\n", "\n", "train_bin_df = train_df[train_df.label != 'neutral']" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "dev_df = sst.dev_reader(SST_HOME)\n", "\n", "dev_bin_df = dev_df[dev_df.label != 'neutral']" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "test_df = sst.sentiment_reader(os.path.join(SST_HOME, \"sst3-test-labeled.csv\"))\n", "\n", "test_bin_df = test_df[test_df.label != 'neutral']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: we will continue to train on just the full examples, so that the experiments do not require a lot of time and computational resources. However, there are probably gains to be had from training on the subtrees as well. In that case, one needs to be careful in cross-validation: the test set needs to be the root-only dev set rather than slices of the train set. To achieve this, one can use a `PredefinedSplit`:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "full_train_df = sst.train_reader(SST_HOME, include_subtrees=True)\n", "full_train_bin_df = full_train_df[full_train_df.label != 'neutral']\n", "\n", "split_indices = [0] * full_train_bin_df.shape[0]\n", "split_indices += [-1] * dev_bin_df.shape[0]\n", "sst_train_dev_splitter = PredefinedSplit(split_indices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This would be used in place of `cv=5` in the model wrappers below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reproducing the Unigram NaiveBayes results\n", "\n", "To start, we might just use `MultinomialNB` with default parameters:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "def fit_unigram_nb_classifier(X, y):\n", " mod = MultinomialNB()\n", " mod.fit(X, y)\n", " return mod" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.802 0.778 0.790 428\n", " positive 0.792 0.815 0.804 444\n", "\n", " accuracy 0.797 872\n", " macro avg 0.797 0.797 0.797 872\n", "weighted avg 0.797 0.797 0.797 872\n", "\n" ] } ], "source": [ "_ = sst.experiment(\n", " train_bin_df,\n", " unigrams_phi,\n", " fit_unigram_nb_classifier,\n", " assess_dataframes=dev_bin_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This falls slightly short of our goal, which is not encouraging about how we would do on the test set. However, `MultinomialNB` has a regularization term `alpha` that might have a significant impact given the very large, sparse feature matrices we are creating with `unigrams_phi`. In addition, it might help to transform the raw feature counts, for the same reason that reweighting was so powerful in our VSM module. The best way to try out all these ideas is to do a wide hyperparameter search. 
The following model wrapper function implements these steps: " ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "def fit_nb_classifier_with_hyperparameter_search(X, y):\n", " rescaler = TfidfTransformer()\n", " mod = MultinomialNB()\n", "\n", " pipeline = Pipeline([('scaler', rescaler), ('model', mod)])\n", "\n", " # Access the alpha and fit_prior parameters of `mod` with\n", " # `model__alpha` and `model__fit_prior`, where \"model\" is the\n", " # name from the Pipeline. Use 'passthrough' to optionally\n", " # skip TF-IDF.\n", " param_grid = {\n", " 'model__fit_prior': [True, False],\n", " 'scaler': ['passthrough', rescaler],\n", " 'model__alpha': [0.1, 0.2, 0.4, 0.8, 1.0, 1.2]}\n", "\n", " bestmod = utils.fit_classifier_with_hyperparameter_search(\n", " X, y, pipeline,\n", " param_grid=param_grid,\n", " cv=5)\n", " return bestmod" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we run the experiment:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best params: {'model__alpha': 1.2, 'model__fit_prior': False, 'scaler': TfidfTransformer()}\n", "Best score: 0.798\n", " precision recall f1-score support\n", "\n", " negative 0.852 0.798 0.824 912\n", " positive 0.810 0.861 0.835 909\n", "\n", " accuracy 0.830 1821\n", " macro avg 0.831 0.830 0.830 1821\n", "weighted avg 0.831 0.830 0.830 1821\n", "\n" ] } ], "source": [ "unigram_nb_experiment_xval = sst.experiment(\n", " [train_bin_df, dev_bin_df],\n", " unigrams_phi,\n", " fit_nb_classifier_with_hyperparameter_search,\n", " assess_dataframes=test_bin_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're above the target of 81.8, so we can say that we reproduced the paper's result." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reproducing the Bigrams NaiveBayes results\n", "\n", "For the bigram NaiveBayes mode, we can continue to use `fit_nb_classifier_with_hyperparameter_search`, but now the experiment is done with `bigrams_phi`:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best params: {'model__alpha': 0.2, 'model__fit_prior': False, 'scaler': TfidfTransformer()}\n", "Best score: 0.764\n", " precision recall f1-score support\n", "\n", " negative 0.796 0.754 0.775 912\n", " positive 0.766 0.806 0.786 909\n", "\n", " accuracy 0.780 1821\n", " macro avg 0.781 0.780 0.780 1821\n", "weighted avg 0.781 0.780 0.780 1821\n", "\n" ] } ], "source": [ "bigram_nb_experiment_xval = sst.experiment(\n", " [train_bin_df, dev_bin_df],\n", " bigrams_phi,\n", " fit_nb_classifier_with_hyperparameter_search,\n", " assess_dataframes=test_bin_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is below the target of 83.1, so we've failed to reproduce the paper's result. I am not sure where the implementation difference between our model and that of Socher et al. lies!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reproducing the SVM results" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "def fit_svm_classifier_with_hyperparameter_search(X, y):\n", " rescaler = TfidfTransformer()\n", " mod = LinearSVC(loss='squared_hinge', penalty='l2')\n", "\n", " pipeline = Pipeline([('scaler', rescaler), ('model', mod)])\n", "\n", " # Access the alpha parameter of `mod` with `mod__alpha`,\n", " # where \"model\" is the name from the Pipeline. Use\n", " # 'passthrough' to optionally skip TF-IDF.\n", " param_grid = {\n", " 'scaler': ['passthrough', rescaler],\n", " 'model__C': [0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4]}\n", "\n", " bestmod = utils.fit_classifier_with_hyperparameter_search(\n", " X, y, pipeline,\n", " param_grid=param_grid,\n", " cv=5)\n", " return bestmod" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best params: {'model__C': 0.4, 'scaler': TfidfTransformer()}\n", "Best score: 0.797\n", " precision recall f1-score support\n", "\n", " negative 0.844 0.795 0.819 912\n", " positive 0.806 0.853 0.828 909\n", "\n", " accuracy 0.824 1821\n", " macro avg 0.825 0.824 0.824 1821\n", "weighted avg 0.825 0.824 0.824 1821\n", "\n" ] } ], "source": [ "svm_experiment_xval = sst.experiment(\n", " [train_bin_df, dev_bin_df],\n", " unigrams_phi,\n", " fit_svm_classifier_with_hyperparameter_search,\n", " assess_dataframes=test_bin_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Right on! This is quite a ways above the target of 79.4, so we can say that we successfully reproduced this result." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Statistical comparison of classifier models\n", "\n", "Suppose two classifiers differ according to an effectiveness measure like F1 or accuracy. Are they meaningfully different?\n", "\n", "* For very large datasets, the answer might be clear: if performance is very stable across different train/assess splits and the difference in terms of correct predictions has practical importance, then you can clearly say yes. \n", "\n", "* With smaller datasets, or models whose performance is closer together, it can be harder to determine whether the two models are different. We can address this question in a basic way with repeated runs and basic null-hypothesis testing on the resulting score vectors.\n", "\n", "In general, one wants to compare __two feature functions against the same model__, or one wants to compare __two models with the same feature function used for both__. If both are changed at the same time, then it will be hard to figure out what is causing any differences you see." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Comparison with the Wilcoxon signed-rank test\n", "\n", "The function `sst.compare_models` is designed for such testing. The default set-up uses the non-parametric [Wilcoxon signed-rank test](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) to make the comparisons, which is relatively conservative and recommended by [Demšar 2006](http://www.jmlr.org/papers/v7/demsar06a.html) for cases where one can afford to do multiple assessments. 
For discussion, see [the evaluation methods notebook](evaluation_methods.ipynb#Wilcoxon-signed-rank-test).\n", "\n", "Here's an example showing the default parameters values and comparing `LogisticRegression` and `BasicSGDClassifier`:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model 1 mean: 0.522\n", "Model 2 mean: 0.509\n", "p = 0.006\n" ] } ], "source": [ "_ = sst.compare_models(\n", " sst.train_reader(SST_HOME),\n", " unigrams_phi,\n", " fit_softmax_classifier,\n", " stats_test=scipy.stats.wilcoxon,\n", " trials=10,\n", " phi2=None, # Defaults to same as first argument.\n", " train_func2=fit_basic_sgd_classifier, # Defaults to same as second argument.\n", " train_size=0.7,\n", " score_func=utils.safe_macro_f1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Comparison with McNemar's test\n", "\n", "[McNemar's test](https://en.wikipedia.org/wiki/McNemar%27s_test) operates directly on the vectors of predictions for the two models being compared. As such, it doesn't require repeated runs, which is good where optimization is expensive. For discussion, see [the evaluation methods notebook](evaluation_methods.ipynb#McNemar's-test)." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "m = utils.mcnemar(\n", " unigram_nb_experiment_xval['assess_datasets'][0]['y'],\n", " unigram_nb_experiment_xval['predictions'][0],\n", " bigram_nb_experiment_xval['predictions'][0])" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "McNemar's test: 22.38 (p < 0.0001)\n" ] } ], "source": [ "p = \"p < 0.0001\" if m[1] < 0.0001 else m[1]\n", "\n", "print(\"McNemar's test: {0:0.02f} ({1:})\".format(m[0], p))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 4 }