{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Supervised sentiment: hand-built feature functions" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "__author__ = \"Christopher Potts\"\n", "__version__ = \"CS224u, Stanford, Spring 2020\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "1. [Overview](#Overview)\n", "1. [Set-up](#Set-up)\n", "1. [Feature functions](#Feature-functions)\n", "1. [Building datasets for experiments](#Building-datasets-for-experiments)\n", "1. [Basic optimization](#Basic-optimization)\n", " 1. [Wrapper for SGDClassifier](#Wrapper-for-SGDClassifier)\n", " 1. [Wrapper for LogisticRegression](#Wrapper-for-LogisticRegression)\n", " 1. [Other scikit-learn models](#Other-scikit-learn-models)\n", "1. [Experiments](#Experiments)\n", " 1. [Experiment with default values](#Experiment-with-default-values)\n", " 1. [A dev set run](#A-dev-set-run)\n", " 1. [Assessing BasicSGDClassifier](#Assessing-BasicSGDClassifier)\n", " 1. [Comparison with the baselines from Socher et al. 2013](#Comparison-with-the-baselines-from-Socher-et-al.-2013)\n", " 1. [A shallow neural network classifier](#A-shallow-neural-network-classifier)\n", " 1. [A softmax classifier in PyTorch](#A-softmax-classifier-in-PyTorch)\n", "1. [Hyperparameter search](#Hyperparameter-search)\n", " 1. [utils.fit_classifier_with_crossvalidation](#utils.fit_classifier_with_crossvalidation)\n", " 1. [Example using LogisticRegression](#Example-using-LogisticRegression)\n", " 1. [Example using BasicSGDClassifier](#Example-using-BasicSGDClassifier)\n", "1. [Statistical comparison of classifier models](#Statistical-comparison-of-classifier-models)\n", " 1. [Comparison with the Wilcoxon signed-rank test](#Comparison-with-the-Wilcoxon-signed-rank-test)\n", " 1. [Comparison with McNemar's test](#Comparison-with-McNemar's-test)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Overview\n", "\n", "* The focus of this notebook is __building feature representations__ for use with (mostly linear) classifiers (though you're encouraged to try out some non-linear ones as well!).\n", "\n", "* The core characteristics of the feature functions we'll build here:\n", " * They represent examples in __very large, very sparse feature spaces__.\n", " * The individual feature functions can be __highly refined__, drawing on expert human knowledge of the domain. \n", " * Taken together, these representations don't comprehensively represent the input examples. They just identify aspects of the inputs that the classifier model can make good use of (we hope).\n", " \n", "* These classifiers tend to be __highly competitive__. We'll look at more powerful deep learning models in the next notebook, and it will immediately become apparent that it is very difficult to get them to measure up to well-built classifiers based in sparse feature representations." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Set-up\n", "\n", "See [the previous notebook](sst_01_overview.ipynb#Set-up) for set-up instructions." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "import os\n", "from sklearn.linear_model import LogisticRegression\n", "import scipy.stats\n", "from np_sgd_classifier import BasicSGDClassifier\n", "import torch.nn as nn\n", "from torch_shallow_neural_classifier import TorchShallowNeuralClassifier\n", "import sst\n", "import utils" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Set all the random seeds for reproducibility. Only the\n", "# system and torch seeds are relevant for this notebook.\n", "\n", "utils.fix_random_seeds()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "SST_HOME = os.path.join('data', 'trees')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Feature functions\n", "\n", "* Feature representation is arguably __the most important step in any machine learning task__. As you experiment with the SST, you'll come to appreciate this fact, since your choice of feature function will have a far greater impact on the effectiveness of your models than any other choice you make.\n", "\n", "* We will define our feature functions as `dict`s mapping feature names (which can be any object that can be a `dict` key) to their values (which must be `bool`, `int`, or `float`). \n", "\n", "* To prepare for optimization, we will use `sklearn`'s [DictVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) class to turn these into matrices of features. \n", "\n", "* The `dict`-based approach gives us a lot of flexibility and frees us from having to worry about the underlying feature matrix." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "A typical baseline or default feature representation in NLP or NLU is built from unigrams. Here, those are the leaf nodes of the tree:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def unigrams_phi(tree):\n", " \"\"\"The basis for a unigrams feature function.\n", " \n", " Parameters\n", " ----------\n", " tree : nltk.tree\n", " The tree to represent.\n", " \n", " Returns\n", " ------- \n", " defaultdict\n", " A map from strings to their counts in `tree`. (Counter maps a \n", " list to a dict of counts of the elements in that list.)\n", " \n", " \"\"\"\n", " return Counter(tree.leaves())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "In the docstring for `sst.sentiment_treebank_reader`, I pointed out that the labels on the subtrees can be used in a way that feels like cheating. Here's the most dramatic instance of this: `root_daughter_scores_phi` uses just the labels on the daughters of the root to predict the root (label). This will result in performance well north of 90% F1, but that's hardly worth reporting. (Interestingly, using the labels on the leaf nodes is much less powerful.) Anyway, don't use this function!" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def root_daughter_scores_phi(tree): \n", " \"\"\"The best way we've found to cheat without literally using the \n", " labels as part of the feature representations. 
\n", " \n", " Don't use this for any real experiments!\n", " \n", " \"\"\"\n", " return Counter([child.label() for child in tree])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's generally good design to __write lots of atomic feature functions__ and then bring them together into a single function when running experiments. This will lead to reusable parts that you can assess independently and in sub-groups as part of development." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Building datasets for experiments\n", "\n", "The second major phase for our analysis is a kind of set-up phase. Ingredients:\n", "\n", "* A reader like `train_reader`\n", "* A feature function like `unigrams_phi`\n", "* A class function like `binary_class_func`\n", "\n", "The convenience function `sst.build_dataset` uses these to build a dataset for training and assessing a model. See its documentation for details on how it works. Much of this is about taking advantage of `sklearn`'s many functions for model building." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "train_dataset = sst.build_dataset(\n", " SST_HOME,\n", " reader=sst.train_reader,\n", " phi=unigrams_phi,\n", " class_func=sst.binary_class_func,\n", " vectorizer=None)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train dataset with unigram features has 6,920 examples and 16,282 features\n" ] } ], "source": [ "print(\"Train dataset with unigram features has {:,} examples and {:,} features\".format(\n", " *train_dataset['X'].shape))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that `sst.build_dataset` has an optional argument `vectorizer`:\n", "\n", "* If it is `None`, then a new vectorizer is used and returned as `dataset['vectorizer']`. This is the usual scenario when training. \n", "\n", "* For evaluation, one wants to represent examples exactly as they were represented during training. To ensure that this happens, pass the training `vectorizer` to this function:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "dev_dataset = sst.build_dataset(\n", " SST_HOME,\n", " reader=sst.dev_reader,\n", " phi=unigrams_phi,\n", " class_func=sst.binary_class_func,\n", " vectorizer=train_dataset['vectorizer'])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dev dataset with unigram features has 872 examples and 16,282 features\n" ] } ], "source": [ "print(\"Dev dataset with unigram features has {:,} examples \"\n", " \"and {:,} features\".format(*dev_dataset['X'].shape))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Basic optimization\n", "\n", "We're now in a position to begin training supervised models!\n", "\n", "For the most part, in this course, we will not study the theoretical aspects of machine learning optimization, concentrating instead on how to optimize systems effectively in practice. That is, this isn't a theory course, but rather an experimental, project-oriented one.\n", "\n", "Nonetheless, we do want to avoid treating our optimizers as black boxes that work their magic and give us some assessment figures for whatever we feed into them. 
That seems irresponsible from a scientific and engineering perspective, and it also sends the false signal that the optimization process is inherently mysterious. So we do want to take a minute to demystify it with some simple code.\n", "\n", "The module `np_sgd_classifier` contains a complete optimization framework, as `BasicSGDClassifier`. Well, it's complete in the sense that it achieves our full task of supervised learning. It's incomplete in the sense that it is very basic. You probably wouldn't want to use it in experiments. Rather, we're going to encourage you to rely on `sklearn` for your experiments (see below). Still, this is a good basic picture of what's happening under the hood.\n", "\n", "So what is `BasicSGDClassifier` doing? The heart of it is the `fit` function (reflecting the usual `sklearn` naming system). This method implements a hinge-loss stochastic sub-gradient descent optimization. Intuitively, it works as follows:\n", "\n", "1. Start by assuming that all the feature weights are `0`.\n", "1. Move through the dataset instance-by-instance in random order.\n", "1. For each instance, classify it using the current weights. \n", "1. If the classification is incorrect, move the weights in the direction of the correct classification\n", "\n", "This process repeats for a user-specified number of iterations (default `10` below), and the weight movement is tempered by a learning-rate parameter `eta` (default `0.1`). The output is a set of weights that can be used to make predictions about new (properly featurized) examples.\n", "\n", "In more technical terms, the objective function is \n", "\n", "$$\n", " \\min_{\\mathbf{w} \\in \\mathbb{R}^{d}}\n", " \\sum_{(x,y)\\in\\mathcal{D}} \n", " \\max_{y'\\in\\mathbf{Y}}\n", " \\left[\\mathbf{Score}_{\\textbf{w}, \\phi}(x,y') + \\mathbf{cost}(y,y')\\right] - \\mathbf{Score}_{\\textbf{w}, \\phi}(x,y)\n", "$$\n", "\n", "where $\\mathbf{w}$ is the set of weights to be learned, $\\mathcal{D}$ is the training set of example–label pairs, $\\mathbf{Y}$ is the set of labels, $\\mathbf{cost}(y,y') = 0$ if $y=y'$, else $1$, and $\\mathbf{Score}_{\\textbf{w}, \\phi}(x,y')$ is the inner product of the weights \n", "$\\mathbf{w}$ and the example as featurized according to $\\phi$.\n", "\n", "The `fit` method is then calculating the sub-gradient of this objective. In succinct pseudo-code:\n", "\n", "* Initialize $\\mathbf{w} = \\mathbf{0}$\n", "* Repeat $T$ times:\n", " * for each $(x,y) \\in \\mathcal{D}$ (in random order):\n", " * $\\tilde{y} = \\text{argmax}_{y'\\in \\mathcal{Y}} \\mathbf{Score}_{\\textbf{w}, \\phi}(x,y') + \\mathbf{cost}(y,y')$\n", " * $\\mathbf{w} = \\mathbf{w} + \\eta(\\phi(x,y) - \\phi(x,\\tilde{y}))$\n", " \n", "This is very intuitive – push the weights in the direction of the positive cases. It doesn't require any probability theory. And such loss functions have proven highly effective in many settings. For a more powerful version of this classifier, see [sklearn.linear_model.SGDClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier). With `loss='hinge'`, it should behave much like `BasicSGDClassifier` (but faster!)." 
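] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the pseudo-code above concrete, here is a minimal NumPy sketch of the cost-augmented update loop, with the default values `eta=0.1` and `max_iter=10` mentioned above. It is purely illustrative: the function and variable names (`hinge_sgd_sketch`, `W`) are local to this sketch, and `BasicSGDClassifier` in `np_sgd_classifier.py` remains the implementation to study and use." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def hinge_sgd_sketch(X, y, n_classes, eta=0.1, max_iter=10):\n", "    \"\"\"Illustrative sketch only. `X` is a dense 2d np.array of featurized\n", "    examples, and `y` holds integer labels in [0, n_classes).\"\"\"\n", "    n_examples, n_features = X.shape\n", "    W = np.zeros((n_features, n_classes))        # one weight column per class\n", "    for _ in range(max_iter):\n", "        for i in np.random.permutation(n_examples):\n", "            scores = X[i].dot(W)                 # Score for every candidate label\n", "            costs = np.ones(n_classes)           # cost(y, y') = 1 if y != y'\n", "            costs[y[i]] = 0.0                    # ... and 0 if y == y'\n", "            y_tilde = (scores + costs).argmax()  # cost-augmented prediction\n", "            if y_tilde != y[i]:                  # sub-gradient step\n", "                W[:, y[i]] += eta * X[i]\n", "                W[:, y_tilde] -= eta * X[i]\n", "    return W"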
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Wrapper for SGDClassifier\n", "\n", "For the sake of our experimental framework, a simple wrapper for `SGDClassifier`:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def fit_basic_sgd_classifier(X, y): \n", " \"\"\"Wrapper for `BasicSGDClassifier`.\n", " \n", " Parameters\n", " ----------\n", " X : 2d np.array\n", " The matrix of features, one example per row. \n", " y : list\n", " The list of labels for rows in `X`.\n", " \n", " Returns\n", " -------\n", " BasicSGDClassifier\n", " A trained `BasicSGDClassifier` instance.\n", " \n", " \"\"\" \n", " mod = BasicSGDClassifier()\n", " mod.fit(X, y)\n", " return mod" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Wrapper for LogisticRegression\n", "\n", "As I said above, we likely don't want to rely on `BasicSGDClassifier` (though it does a good job with SST!). Instead, we want to rely on `sklearn`. Here's a simple wrapper for [sklearn.linear.model.LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) using our \n", "`build_dataset` paradigm." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def fit_softmax_classifier(X, y): \n", " \"\"\"Wrapper for `sklearn.linear.model.LogisticRegression`. This is \n", " also called a Maximum Entropy (MaxEnt) Classifier, which is more \n", " fitting for the multiclass case.\n", " \n", " Parameters\n", " ----------\n", " X : 2d np.array\n", " The matrix of features, one example per row.\n", " y : list\n", " The list of labels for rows in `X`.\n", " \n", " Returns\n", " -------\n", " sklearn.linear.model.LogisticRegression\n", " A trained `LogisticRegression` instance.\n", " \n", " \"\"\"\n", " mod = LogisticRegression(\n", " fit_intercept=True, \n", " solver='liblinear', \n", " multi_class='auto')\n", " mod.fit(X, y)\n", " return mod" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Other scikit-learn models\n", "\n", "* The [sklearn.linear_model](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) package has a number of other classifier models that could be effective for SST.\n", "\n", "* The [sklearn.ensemble](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble) package contains powerful classifiers as well. The theme that runs through all of them is that one can get better results by averaging the predictions of a bunch of more basic classifiers. A [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) will bring some of the power of deep learning models without the optimization challenges (though see [this blog post on some limitations of the current sklearn implementation](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)).\n", "\n", "* The [sklearn.svm](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm) contains variations on Support Vector Machines (SVMs)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Experiments\n", "\n", "We now have all the pieces needed to run experiments. 
And __we're going to want to run a lot of experiments__, trying out different feature functions, taking different perspectives on the data and labels, and using different models. \n", "\n", "To make that process efficient and regimented, `sst` contains a function `experiment`. All it does is pull together these pieces and use them for training and assessment. It's complicated, but the flexibility will turn out to be an asset." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Experiment with default values" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.612 0.672 0.640 996\n", " neutral 0.299 0.140 0.191 479\n", " positive 0.658 0.753 0.702 1089\n", "\n", " accuracy 0.607 2564\n", " macro avg 0.523 0.522 0.511 2564\n", "weighted avg 0.573 0.607 0.583 2564\n", "\n" ] } ], "source": [ "_ = sst.experiment(\n", " SST_HOME,\n", " unigrams_phi,\n", " fit_softmax_classifier,\n", " train_reader=sst.train_reader, \n", " assess_reader=None, \n", " train_size=0.7,\n", " class_func=sst.ternary_class_func,\n", " score_func=utils.safe_macro_f1,\n", " verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few notes on this function call:\n", " \n", "* Since `assess_reader=None`, the function reports performance on a random train–test split. Give `sst.dev_reader` as the argument to assess against the `dev` set.\n", "\n", "* `unigrams_phi` is the function we defined above. By changing/expanding this function, you can start to improve on the above baseline, perhaps periodically seeing how you do on the dev set.\n", "\n", "* `fit_softmax_classifier` is the wrapper we defined above. To assess new models, simply define more functions like this one. Such functions just need to consume an `(X, y)` constituting a dataset and return a model." 
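] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a sketch of that last point, here is a wrapper for `sklearn.ensemble.RandomForestClassifier`, one of the models mentioned above. The hyperparameter value is an illustrative guess rather than a tuned recommendation, and the function could be handed to `sst.experiment` exactly as `fit_softmax_classifier` was:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "def fit_random_forest_classifier(X, y):\n", "    \"\"\"Illustrative wrapper with the same shape as `fit_softmax_classifier`:\n", "    consume a dataset `(X, y)` and return a trained model.\"\"\"\n", "    # `n_estimators` here is just a placeholder value, not a tuned choice.\n", "    mod = RandomForestClassifier(n_estimators=100)\n", "    mod.fit(X, y)\n", "    return mod"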
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### A dev set run" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.628 0.689 0.657 428\n", " neutral 0.343 0.153 0.211 229\n", " positive 0.629 0.750 0.684 444\n", "\n", " accuracy 0.602 1101\n", " macro avg 0.533 0.531 0.518 1101\n", "weighted avg 0.569 0.602 0.575 1101\n", "\n" ] } ], "source": [ "_ = sst.experiment(\n", " SST_HOME,\n", " unigrams_phi,\n", " fit_softmax_classifier,\n", " class_func=sst.ternary_class_func,\n", " assess_reader=sst.dev_reader)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Assessing BasicSGDClassifier" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.602 0.593 0.598 428\n", " neutral 0.285 0.240 0.261 229\n", " positive 0.632 0.691 0.660 444\n", "\n", " accuracy 0.559 1101\n", " macro avg 0.506 0.508 0.506 1101\n", "weighted avg 0.548 0.559 0.553 1101\n", "\n" ] } ], "source": [ "_ = sst.experiment(\n", " SST_HOME,\n", " unigrams_phi,\n", " fit_basic_sgd_classifier,\n", " class_func=sst.ternary_class_func,\n", " assess_reader=sst.dev_reader)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Comparison with the baselines from Socher et al. 2013\n", "\n", "Where does our default set-up sit with regard to published baselines for the binary problem? (Compare [Socher et al., Table 1](http://www.aclweb.org/anthology/D/D13/D13-1170.pdf).)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.783 0.741 0.761 428\n", " positive 0.762 0.802 0.782 444\n", "\n", " accuracy 0.772 872\n", " macro avg 0.773 0.771 0.771 872\n", "weighted avg 0.772 0.772 0.772 872\n", "\n" ] } ], "source": [ "_ = sst.experiment(\n", " SST_HOME,\n", " unigrams_phi,\n", " fit_softmax_classifier,\n", " class_func=sst.binary_class_func,\n", " assess_reader=sst.dev_reader)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A shallow neural network classifier\n", "\n", "While we're at it, we might as well see whether adding a hidden layer to our softmax classifier yields any benefits. 
Whereas `LogisticRegression` is, at its core, computing\n", "\n", "$$\\begin{align*}\n", "y &= \\textbf{softmax}(xW_{xy} + b_{y})\n", "\\end{align*}$$\n", "\n", "the shallow neural network inserts a hidden layer with a non-linear activation applied to it:\n", "\n", "$$\\begin{align*}\n", "h &= \\tanh(xW_{xh} + b_{h}) \\\\\n", "y &= \\textbf{softmax}(hW_{hy} + b_{y})\n", "\\end{align*}$$" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def fit_nn_classifier(X, y):\n", " mod = TorchShallowNeuralClassifier(\n", " hidden_dim=50, max_iter=100)\n", " mod.fit(X, y)\n", " return mod" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Finished epoch 100 of 100; error is 0.00026185671231360175" ] }, { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.728 0.704 0.716 999\n", " positive 0.733 0.756 0.744 1077\n", "\n", " accuracy 0.731 2076\n", " macro avg 0.731 0.730 0.730 2076\n", "weighted avg 0.731 0.731 0.731 2076\n", "\n" ] } ], "source": [ "_ = sst.experiment(\n", " SST_HOME,\n", " unigrams_phi, \n", " fit_nn_classifier, \n", " class_func=sst.binary_class_func)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like, with enough iterations (and perhaps some fiddling with the activation function and hidden dimensionality), this classifier would meet or exceed the baseline set up by `LogisticRegression`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### A softmax classifier in PyTorch\n", "\n", "Our PyTorch modules should support easy modification. For example, to turn `TorchShallowNeuralClassifier` into a `TorchSoftmaxClassifier`, one need only write a new `define_graph` method:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "class TorchSoftmaxClassifier(TorchShallowNeuralClassifier):\n", " \n", " def define_graph(self):\n", " return nn.Linear(self.input_dim, self.n_classes_)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def fit_torch_softmax(X, y):\n", " mod = TorchSoftmaxClassifier(max_iter=100)\n", " mod.fit(X, y)\n", " return mod" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Finished epoch 100 of 100; error is 0.07902006432414055" ] }, { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " negative 0.762 0.726 0.744 1011\n", " positive 0.751 0.785 0.768 1065\n", "\n", " accuracy 0.756 2076\n", " macro avg 0.757 0.755 0.756 2076\n", "weighted avg 0.757 0.756 0.756 2076\n", "\n" ] } ], "source": [ "_ = sst.experiment(\n", " SST_HOME,\n", " unigrams_phi, \n", " fit_torch_softmax, \n", " class_func=sst.binary_class_func)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Hyperparameter search\n", "\n", "The training process learns __parameters__ — the weights. There are typically lots of other parameters that need to be set. For instance, our `BasicSGDClassifier` has a learning rate parameter and a training iteration parameter. These are called __hyperparameters__. The more powerful `sklearn` classifiers often have many more such hyperparameters. These are outside of the explicitly stated objective, hence the \"hyper\" part. 
\n", "\n", "So far, we have just set the hyperparameters by hand. However, their optimal values can vary widely between datasets, and choices here can dramatically impact performance, so we would like to set them as part of the overall experimental framework." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### utils.fit_classifier_with_crossvalidation\n", "\n", "Luckily, `sklearn` provides a lot of functionality for setting hyperparameters via cross-validation. The function `utils.fit_classifier_with_crossvalidation` implements a basic framework for taking advantage of these options. \n", "\n", "This method has the same basic shape as `fit_softmax_classifier` above: it takes a dataset as input and returns a trained model. However, to find its favored model, it explores a space of hyperparameters supplied by the user, seeking the optimal combination of settings.\n", "\n", "__Note__: this kind of search seems not to have a large impact for SST as we're using it. However, it can matter a lot for other data sets, and it's also an important step to take when trying to publish, since __reviewers are likely to want to check that your comparisons aren't based in part on opportunistic or ill-considered choices for the hyperparameters__." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Example using LogisticRegression\n", "\n", "Here's a fairly full-featured use of the above for the `LogisticRegression` model family:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def fit_softmax_with_crossvalidation(X, y):\n", " \"\"\"A MaxEnt model of dataset with hyperparameter \n", " cross-validation. Some notes:\n", " \n", " * 'fit_intercept': whether to include the class bias feature.\n", " * 'C': weight for the regularization term (smaller is more regularized).\n", " * 'penalty': type of regularization -- roughly, 'l1' ecourages small \n", " sparse models, and 'l2' encourages the weights to conform to a \n", " gaussian prior distribution.\n", " \n", " Other arguments can be cross-validated; see \n", " http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\n", " \n", " Parameters\n", " ----------\n", " X : 2d np.array\n", " The matrix of features, one example per row.\n", " \n", " y : list\n", " The list of labels for rows in `X`. 
\n", " \n", " Returns\n", " -------\n", " sklearn.linear_model.LogisticRegression\n", " A trained model instance, the best model found.\n", " \n", " \"\"\" \n", " basemod = LogisticRegression(\n", " fit_intercept=True, \n", " solver='liblinear', \n", " multi_class='auto')\n", " cv = 5\n", " param_grid = {'fit_intercept': [True, False], \n", " 'C': [0.4, 0.6, 0.8, 1.0, 2.0, 3.0],\n", " 'penalty': ['l1','l2']} \n", " best_mod = utils.fit_classifier_with_crossvalidation(\n", " X, y, basemod, cv, param_grid)\n", " return best_mod" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best params: {'C': 3.0, 'fit_intercept': False, 'penalty': 'l2'}\n", "Best score: 0.511\n", " precision recall f1-score support\n", "\n", " negative 0.608 0.681 0.642 980\n", " neutral 0.317 0.160 0.213 536\n", " positive 0.666 0.760 0.709 1048\n", "\n", " accuracy 0.604 2564\n", " macro avg 0.530 0.534 0.522 2564\n", "weighted avg 0.571 0.604 0.580 2564\n", "\n" ] } ], "source": [ "softmax_experiment = sst.experiment(\n", " SST_HOME,\n", " unigrams_phi,\n", " fit_softmax_with_crossvalidation, \n", " class_func=sst.ternary_class_func)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Example using BasicSGDClassifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The models written for this course are also compatible with this framework. They [\"duck type\"](https://en.wikipedia.org/wiki/Duck_typing) the `sklearn` models by having methods `fit`, `predict`, `get_params`, and `set_params`, and an attribute `params`." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def fit_basic_sgd_classifier_with_crossvalidation(X, y):\n", " basemod = BasicSGDClassifier()\n", " cv = 5\n", " param_grid = {'eta': [0.01, 0.1, 1.0], 'max_iter': [10]}\n", " best_mod = utils.fit_classifier_with_crossvalidation(\n", " X, y, basemod, cv, param_grid)\n", " return best_mod" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best params: {'eta': 0.1, 'max_iter': 10}\n", "Best score: 0.504\n", " precision recall f1-score support\n", "\n", " negative 0.606 0.544 0.573 976\n", " neutral 0.262 0.286 0.274 510\n", " positive 0.633 0.664 0.648 1078\n", "\n", " accuracy 0.543 2564\n", " macro avg 0.500 0.498 0.498 2564\n", "weighted avg 0.549 0.543 0.545 2564\n", "\n" ] } ], "source": [ "sgd_experiment = sst.experiment(\n", " SST_HOME,\n", " unigrams_phi,\n", " fit_basic_sgd_classifier_with_crossvalidation, \n", " class_func=sst.ternary_class_func)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Statistical comparison of classifier models\n", "\n", "Suppose two classifiers differ according to an effectiveness measure like F1 or accuracy. Are they meaningfully different?\n", "\n", "* For very large datasets, the answer might be clear: if performance is very stable across different train/assess splits and the difference in terms of correct predictions has practical importance, then you can clearly say yes. \n", "\n", "* With smaller datasets, or models whose performance is closer together, it can be harder to determine whether the two models are different. 
We can address this question in a basic way with repeated runs and basic null-hypothesis testing on the resulting score vectors.\n", "\n", "In general, one wants to compare __two feature functions against the same model__, or one wants to compare __two models with the same feature function used for both__. If both are changed at the same time, then it will be hard to figure out what is causing any differences you see." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Comparison with the Wilcoxon signed-rank test\n", "\n", "The function `sst.compare_models` is designed for such testing. The default set-up uses the non-parametric [Wilcoxon signed-rank test](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) to make the comparisons, which is relatively conservative and recommended by [Demšar 2006](http://www.jmlr.org/papers/v7/demsar06a.html) for cases where one can afford to do multiple assessments. For discussion, see [the evaluation methods notebook](evaluation_methods.ipynb#Wilcoxon-signed-rank-test).\n", "\n", "Here's an example showing the default parameters values and comparing `LogisticRegression` and `BasicSGDClassifier`:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model 1 mean: 0.513\n", "Model 2 mean: 0.508\n", "p = 0.139\n" ] } ], "source": [ "_ = sst.compare_models(\n", " SST_HOME,\n", " unigrams_phi,\n", " fit_softmax_classifier,\n", " stats_test=scipy.stats.wilcoxon,\n", " trials=10,\n", " phi2=None, # Defaults to same as first required argument.\n", " train_func2=fit_basic_sgd_classifier, # Defaults to same as second required argument.\n", " reader=sst.train_reader, \n", " train_size=0.7, \n", " class_func=sst.ternary_class_func, \n", " score_func=utils.safe_macro_f1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Comparison with McNemar's test\n", "\n", "[McNemar's test](https://en.wikipedia.org/wiki/McNemar%27s_test) operates directly on the vectors of predictions for the two models being compared. As such, it doesn't require repeated runs, which is good where optimization is expensive. For discussion, see [the evaluation methods notebook](evaluation_methods.ipynb#McNemar's-test)." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "m = utils.mcnemar(\n", " softmax_experiment['assess_dataset']['y'], \n", " sgd_experiment['predictions'],\n", " softmax_experiment['predictions'])" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "McNemar's test: 295.68 (p < 0.0001)\n" ] } ], "source": [ "p = \"p < 0.0001\" if m[1] < 0.0001 else m[1]\n", "\n", "print(\"McNemar's test: {0:0.02f} ({1:})\".format(m[0], p))" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 1 }