{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to supervised sentiment analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "__author__ = \"Christopher Potts\"\n",
    "__version__ = \"CS224u, Stanford, Spring 2016 term\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Contents\n",
    "\n",
    "0. [Overview](#Overview)\n",
    "0. [Set-up](#Set-up)\n",
    "0. [Data readers](#Data-readers)\n",
    "0. [Feature functions](#Feature-functions)\n",
    "0. [Modeling the labels](#Modeling-the-labels)\n",
    "0. [Building datasets for experiments](#Building-datasets-for-experiments)\n",
    "0. [Basic optimization](#Basic-optimization)\n",
    "0. [Experiments](#Experiments)\n",
    "0. [Hyperparameter search](#Hyperparameter-search)\n",
    "0. [Statistical comparison of classifier models](#Statistical-comparison-of-classifier-models)\n",
    "0. [Distributed representations as features](#Distributed-representations-as-features)\n",
    "0. [Additional sentiment resources](#Additional-sentiment-resources)\n",
    "0. [ In-class exploration](#In-class-exploration)\n",
    "0. [In-class bake-off](#In-class-bake-off)\n",
    "0. [Homework 2](#Homework-2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "The goal of this notebook is to provide a basic introduction to supervised learning in the context of a problem that has long been central to academic research and industry applications: __sentiment analysis__. \n",
    "\n",
    "The notebook is built around the [Stanford Sentiment Treebank](http://nlp.stanford.edu/sentiment/) (SST), a widely-used resource for evaluating supervised NLU models, and one that provides rich linguistic representations.\n",
    "\n",
    "If you're relatively new to supervised learning, we suggest you study the details of this notebook closely and follow the links to additional resources. If you're familiar with supervied learning, then you can focus right away on innovative feature representations and modeling. As of this writing, the state-of-the-art for the SST seems to be around 88% accuracy for the binary problem and 48% accuracy for the five-class problem. Perhaps you can best these numbers!\n",
    "\n",
    "Sentiment analysis seems simple at first but turns out to exhibit all of the complexity of full natural language understanding. To see this, consider how your intuitions about the sentiment of the following sentences can change depending on perspective, social relationships, tone of voice, and other aspects of the context of utterance:\n",
    "\n",
    "* There was an earthquake in LA.\n",
    "* The team failed the physical challenge. (We win/lose!)\n",
    "* They said it would be great. They were right/wrong.\n",
    "* Many consider the masterpiece bewildering, boring, slow-moving or annoying.\n",
    "* The party fat-cats are sipping their expensive, imported wines.\n",
    "* Oh, you’re terrible!\n",
    "\n",
    "SST mostly steers around these challenges by including only focused, evaluative texts (sentences from movie reviews), but you should have them in mind if you consider new domains and applications for the ideas."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set-up"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "0. Make sure your environment includes all the requirements for [the cs224u repository](https://github.com/cgpotts/cs224u).\n",
    "0. Download [the train/dev/test Stanford Sentiment Treebank distribution](http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip), unzip it, and put the resulting folder in the same directory as this notebook. (If you want to put it somewhere else, change `sst_home` below.)\n",
    "0. Download [the Wikipedia 2014 + Gigaword 5 distribution](http://nlp.stanford.edu/data/glove.6B.zip) of the pretrained GloVe vectors, unzip it, and put the resulting folder in the the same directory as this notebook. (If you want to put it somewhere else, change `glove_home` below.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sst_home = 'trees'\n",
    "glove_home = 'glove.6B'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import codecs\n",
    "import random\n",
    "import unicodecsv\n",
    "from collections import Counter\n",
    "import numpy as np\n",
    "from scipy.sparse import csr_matrix\n",
    "from nltk.tree import Tree\n",
    "from sklearn.cross_validation import train_test_split\n",
    "from sklearn.feature_extraction import DictVectorizer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.grid_search import GridSearchCV\n",
    "from sklearn.metrics import classification_report, accuracy_score, f1_score\n",
    "import scipy.stats\n",
    "import utils"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data readers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The train/dev/test SST distribution contains files that are lists of trees where the part-of-speech tags have been replaced with sentiment scores `0...4`, where `0` and `1` are negative labels, `2` is a neutral label, and `3` and `4` are positive labels. Our readers yield `(tree, label)` pairs, where `tree` is an [NLTK Tree instance](http://www.nltk.org/_modules/nltk/tree.html) and `score` is a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def sentiment_treebank_reader(src_filename, include_subtrees=False, replace_root_score=True):\n",
    "    \"\"\"Iterator for the Penn-style distribution of the Stanford \n",
    "    Sentiment Treebank. The iterator yields (tree, label) pairs. \n",
    "\n",
    "    The root node of the tree is the label, so the root node itself is \n",
    "    replaced with a string to ensure that it doesn't get used as a \n",
    "    predictor. The subtree labels are retained. If they are used, it can \n",
    "    feel like cheating (see `root_daughter_scores_phis` below), so take \n",
    "    care!\n",
    "    \n",
    "    The labels are strings. They do not make sense as a linear order\n",
    "    because negative ('0', '1'), neutral ('2'), and positive ('3','4')\n",
    "    do not form a linear order conceptually, and because '0' is \n",
    "    stronger than '1' but '4' is stronger than '3'.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    src_filename : str\n",
    "        Full path to the file to be read.\n",
    "    \n",
    "    include_subtrees : boolean (default: False)\n",
    "        Whether to yield all the subtrees with labels or just the full \n",
    "        tree. In both cases, the label is the root of the subtree.\n",
    "        \n",
    "    replace_root_score : boolean (default: True)\n",
    "        The root node of the tree is the label, so, by default, the root \n",
    "        node itself is replaced with a string to ensure that it doesn't \n",
    "        get used as a predictor.\n",
    "\n",
    "    Yields\n",
    "    ------\n",
    "    (tree, label)\n",
    "        nltk.Tree, str in {'0','1','2','3','4'}\n",
    "    \n",
    "    \"\"\"\n",
    "    for line in codecs.open(src_filename, 'r', 'utf8'):\n",
    "        tree = Tree.fromstring(line)\n",
    "        if include_subtrees:\n",
    "            for subtree in tree.subtrees():\n",
    "                label = subtree.label()\n",
    "                if replace_root_score:\n",
    "                    subtree.set_label(\"X\")\n",
    "                yield (subtree, label) \n",
    "        else:\n",
    "            label = tree.label()\n",
    "            if replace_root_score:\n",
    "                tree.set_label(\"S\")\n",
    "            yield (tree, label) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following are convenience functions for reading `train.txt` and `dev.txt`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def train_reader():\n",
    "    \"\"\"Convenience function for reading the train file, full-trees only.\"\"\"\n",
    "    src = os.path.join(sst_home, 'train.txt')\n",
    "    return sentiment_treebank_reader(src, include_subtrees=False)\n",
    "\n",
    "def dev_reader():\n",
    "    \"\"\"Convenience function for reading the dev file, full-trees only.\"\"\"\n",
    "    src = os.path.join(sst_home, 'dev.txt')\n",
    "    return sentiment_treebank_reader(src, include_subtrees=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In SST parlance, the __all-nodes__ task trains and assesses, not just with the full sentence, but also with all the labeled subtrees. The following are convenience readers for such experiments:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def allnodes_train_reader():\n",
    "    \"\"\"Convenience function for reading the train file, all nodes.\"\"\"\n",
    "    src = os.path.join(sst_home, 'train.txt')\n",
    "    return sentiment_treebank_reader(src, include_subtrees=True)\n",
    "\n",
    "def allnodes_dev_reader():\n",
    "    \"\"\"Convenience function for reading the dev file, all nodes.\"\"\"\n",
    "    src = os.path.join(sst_home, 'dev.txt')\n",
    "    return sentiment_treebank_reader(src, include_subtrees=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Important notes__: \n",
    "\n",
    "* We've deliberately left out `test` readers. We urge you not to use the `test` set until and unless you are running experiments for a final project or similar. Overuse of test-sets corrupts them, since even subtle lessons learned from those runs can be incorporated back into model-building efforts.\n",
    "\n",
    "* We actually have mixed feelings about the overuse of  `dev` that might result from working with this notebook! We've tried to encourage using just splits of the training data for assessment most of the time, with only occasionally use of `dev`. This will give you a clearer picture of how you will ultimately do on `test`; over-use of `dev` can lead to over-fitting on that particular dataset with a resulting loss of performance of `test`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook interface is good about displaying the [NLTK Tree objects](http://www.nltk.org/_modules/nltk/tree.html), which is handy for understanding their structure:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAALgAAACMCAIAAABAuvQrAAAACXBIWXMAAA3XAAAN1wFCKJt4AAAAHXRFWHRTb2Z0d2FyZQBHUEwgR2hvc3RzY3JpcHQgOS4xNnO9PXQAAAnWSURBVHic7Z0/bONGFsZnD1cE62tYeNsA3M4ueVtrA5CN3Jpq7TQUkkW6BGSbjoT6AGSTTXcgXR2wbsjCbrMaIM2q88DbWoAGCE5ODsFFVwxujktL1IgSNTPU+1Wi+EdvOJ/eDIfDj88WiwUCgHX8RXYAgB6AUAAhQCiAEHoIJQiCIAhkR3HQaCCUJEksy8IYyw7koFFdKJRSjLHrurIDOXRUF0oURZ7nyY4CUFsohBCEkGVZsgMB0F9lB1BHEARxHMuOAkBIZaFgjCmlURSxRUJIlmXQWZGFukIxTdP3fb5YFIVpmhLjOXCe6XKvx3GcPM9lR3G4KN2ZZWRZ5jgOxhjG3CSiTUYB5KJBRgFUAIQCCAFCAYQAoQBCgFAAITQQCn18/Ocvv5DpVHYgB426I7MIIfr4GF1fJzc3//nzz19/+817/drv983jY9lxHSKKjqNwidD53Hv9+qsvvvjHzz/zRZDL/lFOKBWJlDVRswpoG4WEIqgDkIsUlBBKg7oHuewZyULZsr5BLntDmlB2WMcglz0gQSgt1SvIpVX2KpQ91CXIpSX2JJQ91x/IZee0LhSJdQZy2SEtCkWRelIkDN1pRSgK1o2CIenFjoWieH0oHp7K7EwoGtWBRqGqww6Eoul51zRsWWwllA6c6w4UYT80FErHzm/HitMGTYRCptO/f/99985pWS7p11+7r17JjkghGmaU4OrK6/U6I5EyTC5+v288fy47FoVQYj4KoD4azMIHVGCzWfgY4yiKKKUIIdd1O+yuxpwTwjCUHYgqbJBRKKXMKyvP8zzPKaVZlrUXmUTAsPQpGwiFEOL7vmEYbNH3/U4KBQxLl7JB01NxZ8QYd9IrCwxLl9KwM0spTZKk7LHWDcCwdBVNhEIpHQwGnufxZqgzBEHQPfXvhI2fPWZd2jAMu/e3A8PSGjYTCiEkiqIwDLuXSxAYltazEGY8Htu2PZvN+DdhGIrvrh22bcsOQSE2GMJ3HIdSWsklnfR+zbIsSRKMsed5MObGgHs9gBBwrwcQAoQCCAFCAYQAoQBCgFAAIZR2hZQCmwr57z/++Ma2OznXsxlNLo+Dqyt8f59/+20bAcklub0NsozO53/77LN//f57x2aPb0PDjNI9e+Dk9ja6viYPD/bpaXxxYRwdsRn5yc2Nf3YGc60bZpTo3bvFjz+2EdD+KSaTIMvw/b19eur3+/bJCV/FmqHo3Tvj6Ihll4OVy0H3UYrJJLq+Lj58MF+8yL/7riwRhvH8eXh+7vV6TC7Jzc3ByuVAMwqZTlnLYr544ff7Xq8nvsthZpeDyyjl+vbPzsLzc8EdzePj+OLC7/dZdsnevxdUWDdoklGKycQZjbTLKPyJUYTQlikBf/wYZBlrsw5ELocilODqij+DHrruTlqNchen83LpftPDr3t3Pihin5zYJydMLsO3b6Pr6/D8vKuPtjcXSjGZPL1MUIrs/fvg6ooPjbQUbVkugx9+eHqN3Q26mVF4o9CqRMowubDs5YxG3ZNL14RCptPhTz+xfkN8ebnnfoPX63m9Xifl0h2hlK979y+RMk/lEl9caH/DqMGE7LuHB3R5mX/4sLs53lsxm8/9LEOXl8abN36WzeZz2RH9n/jmxnjzBl1eem/f3j08yA6nOU0yijp/jrKZlpq37rxez331it9f1Pd2dMNZ+M++/HLpzZF9wqcEaHH2dfcTbC4Uif2AypQAjc64+ilwFc2FstGNkl1RMyVAI3ScvdDwqsc+Pd3//5hMp85otGpKgEZUZi/Q+Ty+uJAd1Bo0e1IQf/xoff657Ch2CZsrqH7rqZlQAFnA4xqAEJ/0UQghhBDTNLkvSFEUCCHDMJhLIjMORQiVt3m65dPFbei2Z6kuPqWfCAVjXBQFxjjPc8MwCCFs0TTNOI6ZByQzITJNs1xhzJwI/U8ZfEfLsrYUCjN4StOU2W1EUdQlFyTmU5okiexABKiM1OZ57vu+7/v8G9/38zzni6vsZZj5bPmb8kEaMx6PK4d1XXf7w6rAbDbzPG+hiWPPkj6KbduUUuaPKB3Lsmzb5otd8izVy6d0eWc2DEPWdipFlzxLtfMpXS4U1tVQypi6Y56l2vmUrrw8Zg7m/DJnFftJPB3zLOU+pUEQBEHALwVUpm4I3/M8brq6ivKbBSrdmrUiE6R7nqU6+pTWDbixXq1gfVuWxcZOGOI71oMxHg6HZZWs1a76GIZhl9jJaFPbfDKETwgZDAYIIdM00zRFCFFKX758maapbduDwYC9eaJSKu4gWhRFFEVsLcaYD35sQ7c9SzXyKd39vZ6iKLT4iwAbATcFASHgpiAgBAgFEAKEAggBQgGE0EkoxWRCHx9lR7FjislEC+dEnYTijEb4/l52FDvGGY2S21vZUaxHJ6EAEgGhSMY+PZUdghAgFEAIEAogBAhFPlr00EEogBAgFEAIEAogBAhFMuo/ns4AoUjGODqSHYIQIBRACBCKfOCmICAEeXiQHcJ6QCiAECAUQAgQCiAECEUyuthb6iQU/+xMl+EpcczjY//sTHYU64EHwAAhdMoogETUFUrZhFJBFA+vwvbRqiuUIAjK5iv7JEmSIAiGw2FNABuFRwhhnjk127Qqu+1PprpCYSalUn6amVAYhlFTeRuFZ5pmGIb1VTUcDjeLchO2P5nqvipOccfEnYfXakbZPloVhcLz5FLTNowxy+GGYbB/yVILmizLmAcd28z3fe7GMxgMmPM2c05jf3dxz5+a8JIk4cbMzASPEMIsifgGRVFQSss/SghhzZzjOGwzwzDKe21Tlppo6/ctlyVN0ybvFNwPFSNkjmVZs9mMfR6Px0vdfNM0ZV6/fLOKjbFhGHEc87XljdcGsGptmqYVI+fKYet/dJUtcatlWbVvpSyLxUI/obium6YpX7y7u1u6TeWbOI7Le1VqZWklbSqUykFms1n9r4jEsGi5LKv2fXoQFZueeuI4ZlclhBDDMJa2vkVR8DTO2bODPm8Zt0SFsiA1+yg1sB4f996klDqOMx6PK5vZtl1u4/dDpZfDXlWy/WFVKAtS+fJ4KRjj8rsoVvVAXdetuIzuqtpq8DxvOBwyKTMH5Y0cD03TLF/48M8qlAUpeK+HOWoihFjLwqQQxzFL40VRJEnCvyeE2La9tPWJooi/YYEVlQ+NDAaDsmNnEARJkvDF+gBEwmNV7vt+kiTsmE9/1HEcjLHrunEcs4DZ1RzTFguYr2qjLGv3LZcljmPlhCICs7tFCJVfvLEU5pFsWZYU1+vhcMgrWwRerqUByy2LlkLRgqIoiqJQ3GZYHM06s4rD8jn7bFmW1ioplyXPc8gogBCaXfUAsgChAEKAUAAhQCiAECAUQIj/AjI3ZtWKJzOnAAAAAElFTkSuQmCC",
      "text/plain": [
       "Tree('4', [Tree('2', ['NLU']), Tree('4', [Tree('2', ['is']), Tree('4', ['enlightening'])])])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Tree.fromstring(\"\"\"(4 (2 NLU) (4 (2 is) (4 enlightening)))\"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Feature representation is arguably the most important step in any machine learning task. As you experiment with the SST, you'll come to appreciate this fact, since your choice of feature function will have a far greater impact on the effectiveness of your models than any other choice you make.\n",
    "\n",
    "We will define our feature functions as `dict`s mapping feature names (which can be any object that can be a `dict` key) to their values (which must be booleans, ints, or floats). For optimization, we will use `sklearn`'s `DictVectorizer` class to turn these into matrices of features. The `dict`-based approach gives us a lot of flexibility and frees us from having to worry about the underlying feature matrix."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A typical baseline or default feature representation in NLP or NLU is built from unigrams. Here, those are the leaf nodes of the tree:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def unigrams_phi(tree):\n",
    "    \"\"\"The basis for a unigrams feature function.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    tree : nltk.tree\n",
    "        The tree to represent.\n",
    "    \n",
    "    Returns\n",
    "    -------    \n",
    "    defaultdict\n",
    "        A map from strings to their counts in `tree`. (Counter maps a \n",
    "        list to a dict of counts of the elements in that list.)\n",
    "    \n",
    "    \"\"\"\n",
    "    return Counter(tree.leaves())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the docstring for `sentiment_treebank_reader`, I pointed out that the labels on the subtrees can be used in a way that feels like cheating. Here's the most dramatic instance of this: `root_daughter_scores_phi` uses just the labels on the daughters of the root to predict the root (label). This will result in performance well north of 90% F1, but that's hardly worth reporting. (Interestingly, using the labels on the leaf nodes is much less powerful.) Anyway, don't use this function!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def root_daughter_scores_phi(tree):    \n",
    "    \"\"\"The best way we've found to cheat without literally using the \n",
    "    labels as part of the feature representations. Don't use this for \n",
    "    any real experiments!\"\"\"\n",
    "    return Counter([child.label() for child in tree])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Modeling the labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Working with the SST involves making decisions about how to handle the raw SST labels. The interpretation of these labels is as follows ([Socher et al., sec. 3](http://www.aclweb.org/anthology/D/D13/D13-1170.pdf)):\n",
    "\n",
    "* `'0'`: very negative\n",
    "* `'1'`: negative\n",
    "* `'2'`: neutral\n",
    "* `'3'`: positive\n",
    "* `'4'`: very positive\n",
    "\n",
    "The labels look like they could be treated as totally ordered, even continuous. However, conceptually, they do not form such an order. Rather, they consist of three separate classes, with the negative and positive classes being totally ordered:\n",
    "\n",
    "* `'0' > '1'`: negative\n",
    "* `'2'`: neutral\n",
    "* `'4' > '3'`: positive\n",
    "\n",
    "Thus, in this notebook, we'll look mainly at binary (positive/negative) and ternary tasks.\n",
    "\n",
    "A related note: the above shows that the __fine-grained sentiment task__ for the SST is particularly punishing as usually formulated, since it ignores the partial-order structure in the categories completely. As a result, mistaking `'0'` for `'1'` is as bad as mistaking `'0'` for `'4'`, though the first error is clearly less severe than the second."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following functions will be used to define the labels for our experiments:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def binary_class_func(y):\n",
    "    \"\"\"Define a binary SST task.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    y : str\n",
    "        Assumed to be one of the SST labels.\n",
    "          \n",
    "    Returns\n",
    "    ------- \n",
    "    str or None    \n",
    "        None values are ignored by `build_dataset` and thus left out of \n",
    "        the experiments.\n",
    "    \n",
    "    \"\"\"\n",
    "    if y in (\"0\", \"1\"):\n",
    "        return \"negative\"\n",
    "    elif y in (\"3\", \"4\"):\n",
    "        return \"positive\"\n",
    "    else:\n",
    "        return None\n",
    "    \n",
    "def ternary_class_func(y):    \n",
    "    \"\"\"Define a binary SST task. Just like `binary_class_func` except \n",
    "    input '2' returns 'neutral'.\"\"\"    \n",
    "    if y in (\"0\", \"1\"):\n",
    "        return \"negative\"\n",
    "    elif y in (\"3\", \"4\"):\n",
    "        return \"positive\"\n",
    "    else:\n",
    "        return \"neutral\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you do want to run fine-grained sentiment, just define an identity function over the SST labels."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building datasets for experiments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next phase for our analysis is a kind of set-up phase: we need to use\n",
    "\n",
    "* a reader like `train_reader`,\n",
    "* a feature function like `unigrams_phi`, and\n",
    "* a class function like `binary_class_func`\n",
    "\n",
    "to build a dataset that can be used for training and assessing a model. The heart of this is `build_dataset`. See its documentation for details on how it works. Much of this is about taking advantage of `sklearn`'s many functions for model building."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def build_dataset(reader, phi, class_func, vectorizer=None):\n",
    "    \"\"\"Core general function for building experimental datasets.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    reader : iterator\n",
    "       Should be `train_reader`, `dev_reader`, or another function\n",
    "       defined in those terms. This is the dataset we'll be \n",
    "       featurizing.\n",
    "       \n",
    "    phi : feature function\n",
    "       Any function that takes an `nltk.Tree` instance as input \n",
    "       and returns a bool/int/float-valued dict as output.\n",
    "       \n",
    "    class_func : function on the SST labels\n",
    "       Any function like `binary_class_func` or `ternary_class_func`. \n",
    "       This modifies the SST labels based on the experimental \n",
    "       design. If `class_func` returns None for a label, then that \n",
    "       item is ignored.\n",
    "       \n",
    "    vectorizer : sklearn.feature_extraction.DictVectorizer    \n",
    "       If this is None, then a new `DictVectorizer` is created and\n",
    "       used to turn the list of dicts created by `phi` into a \n",
    "       feature matrix. This happens when we are training.\n",
    "              \n",
    "       If this is not None, then it's assumed to be a `DictVectorizer` \n",
    "       and used to transform the list of dicts. This happens in \n",
    "       assessment, when we take in new instances and need to \n",
    "       featurize them as we did in training.\n",
    "       \n",
    "    Returns\n",
    "    -------\n",
    "    dict\n",
    "        A dict with keys 'X' (the feature matrix), 'y' (the list of\n",
    "        labels), 'vectorizer' (the `DictVectorizer`), and \n",
    "        'raw_examples' (the `nltk.Tree` objects, for error analysis).\n",
    "    \n",
    "    \"\"\"    \n",
    "    labels = []\n",
    "    feat_dicts = []\n",
    "    raw_examples = []\n",
    "    for tree, label in reader():\n",
    "        cls = class_func(label)\n",
    "        # None values are ignored -- these are instances we've\n",
    "        # decided not to include.\n",
    "        if cls != None:\n",
    "            labels.append(cls)\n",
    "            feat_dicts.append(phi(tree))\n",
    "            raw_examples.append(tree)\n",
    "    feat_matrix = None\n",
    "    # In training, we want a new vectorizer:    \n",
    "    if vectorizer == None:\n",
    "        vectorizer = DictVectorizer(sparse=True)\n",
    "        feat_matrix = vectorizer.fit_transform(feat_dicts)\n",
    "    # In assessment, we featurize using the existing vectorizer:\n",
    "    else:\n",
    "        feat_matrix = vectorizer.transform(feat_dicts)\n",
    "    return {'X': feat_matrix, \n",
    "            'y': labels, \n",
    "            'vectorizer': vectorizer, \n",
    "            'raw_examples': raw_examples}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic optimization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're now in a position to begin training supervised models!\n",
    "\n",
    "For the most part, in this course, we will not study the theoretical aspects of machine learning optimization, concentrating instead on how to optimize systems effectively in practice. That is, this isn't a theory course, but rather an experimental, project-oriented one.\n",
    "\n",
    "Nonetheless, we do want to avoid treating our optimizers as black boxes that work their magic and give us some assessment figures for whatever we feed into them. That seems irresponsible from a scientific and engineering perspective, and it also sends the false signal that the optimization process is inherently mysterious. So we do want to take a minute to demystify it with some simple code.\n",
    "\n",
    "The following `BasicSGDClassifier` is a complete optimization framework. Well, it's complete in the sense that it achieves our full task of supervised learning. It's incomplete in the sense that it is very basic. You probably wouldn't want to use it in experiments. Rather, we're going to encourage you to rely on `sklearn` for your experiments (see below). Still, this is a good basic picture of what's happening under the hood.\n",
    "\n",
    "\n",
    "So what is `BasicSGDClassifier` doing? The heart of it is the `fit` function (reflecting the usual `sklearn` naming system). This method implements a hinge-loss stochastic sub-gradient descent optimization. Intuitively, it works as follows:\n",
    "\n",
    "0. Start by assuming that all the feature weights are `0`.\n",
    "0. Move through the dataset instance-by-instance in random order.\n",
    "0. For each instance, classify it using the current weights. \n",
    "0. If the classification is incorrect, move the weights in the direction of the correct classification\n",
    "\n",
    "This process repeats for a user-specified number of iterations (default `10` below), and the weight movement is tempered by a learning-rate parameter `eta` (default `0.1`). The output is a set of weights that can be used to make predictions about new (properly featurized) examples.\n",
    "\n",
    "In more technical terms, the objective function is \n",
    "\n",
    "$$\n",
    "  \\min_{\\mathbf{w} \\in \\mathbb{R}^{d}}\n",
    "  \\sum_{(x,y)\\in\\mathcal{D}} \n",
    "  \\max_{y'\\in\\mathbf{Y}}\n",
    "  \\left[\\mathbf{Score}_{\\textbf{w}, \\phi}(x,y') + \\mathbf{cost}(y,y')\\right] - \\mathbf{Score}_{\\textbf{w}, \\phi}(x,y)\n",
    "$$\n",
    "\n",
    "where $\\mathbf{w}$ is the set of weights to be learned, $\\mathcal{D}$ is the training set of example&ndash;label pairs, $\\mathbf{Y}$ is the set of labels, $\\mathbf{cost}(y,y') = 0$ if $y=y'$, else $1$, and $\\mathbf{Score}_{\\textbf{w}, \\phi}(x,y')$ is the inner product of the weights \n",
    "$\\mathbf{w}$ and the example as featurized according to $\\phi$.\n",
    "\n",
    "The `fit` method is then calculating the sub-gradient of this objective. In succinct pseudo-code:\n",
    "\n",
    "* Initialize $\\mathbf{w} = \\mathbf{0}$\n",
    "* Repeat $T$ times:\n",
    "    * for each $(x,y)$ in $\\mathcal{D}$ (in random order):\n",
    "        * $\\tilde{y} = \\text{argmax}_{y'\\in \\mathcal{Y}} \\mathbf{Score}_{\\textbf{w}, \\phi}(x,y') + \\mathbf{cost}(y,y')$\n",
    "        * $\\mathbf{w} =  \\mathbf{w} + \\eta(\\phi(x,y) - \\phi(x,\\tilde{y}))$\n",
    "      \n",
    "We'll use this basic and widely applicable framework throughout the term, to optimize neural networks, structured prediction models, and all sorts of basic classification tasks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "class BasicSGDClassifier:    \n",
    "    \"\"\"Basic implementation hinge-loss stochastic sub-gradient descent \n",
    "    optimization, intended to illustrate the basic concepts of classifier \n",
    "    optimization in code.\"\"\"\n",
    "    def __init__(self, iterations=10, eta=0.1):\n",
    "        \"\"\"\n",
    "        Parameters\n",
    "        ----------\n",
    "        iterations : int (default: 10)\n",
    "            Number of training epochs (full runs through shuffled data).\n",
    "        \n",
    "        eta : float (default: 0.1)\n",
    "            Learning rate parameter.\n",
    "            \n",
    "        \"\"\"        \n",
    "        self.iterations = iterations\n",
    "        self.eta = eta\n",
    "\n",
    "    def fit(self, feat_matrix, labels): \n",
    "        \"\"\"Core optimization function.\n",
    "        \n",
    "        Parameters\n",
    "        ----------        \n",
    "        feat_matrix : 2d matrix (np.array or any scipy.sparse type)\n",
    "            The design matrix, one row per example. Hence, the row \n",
    "            dimensionality is the example count and the column \n",
    "            dimensionality is number of features.\n",
    "            \n",
    "        labels : list\n",
    "            The labels for each example, hence assumed to have the\n",
    "            same length as, and be aligned with, `feat_matrix`.\n",
    "            \n",
    "        For attributes, we follow the `sklearn` style of using a \n",
    "        final `_` for attributes that are created by `fit` methods: \n",
    "            \n",
    "        Attributes\n",
    "        ----------\n",
    "        self.classes_ : list\n",
    "            The set of class labels in sorted order.\n",
    "            \n",
    "        self.n_classes_ : int\n",
    "            Length of `self.classes_`\n",
    "            \n",
    "        self.coef_ : np.array of dimension (class count, feature count)\n",
    "            These are the weights, named as in `sklearn`. They are \n",
    "            organized so that each row represents the feature weights \n",
    "            for a given class, as is typical in `sklearn`.\n",
    "            \n",
    "        \"\"\"        \n",
    "        # We'll deal with the labels via their indices into self.classes_:\n",
    "        self.classes_ = sorted(set(labels))\n",
    "        self.n_classes_ = len(self.classes_)\n",
    "        # Useful dimensions to store:\n",
    "        examplecount, featcount = feat_matrix.shape\n",
    "        # The weight matrix -- classes by row:\n",
    "        self.coef_ = np.zeros((self.n_classes_, featcount))\n",
    "        # Indices for shuffling the data at the start of each epoch:\n",
    "        indices = list(range(examplecount))\n",
    "        for _ in range(self.iterations):\n",
    "            random.shuffle(indices)\n",
    "            for i in indices:\n",
    "                # Training instance as a feature rep and a label index:\n",
    "                rep = feat_matrix[i]                 \n",
    "                label_index = self.classes_.index(labels[i])\n",
    "                # Costs are 1.0 except for the true label:\n",
    "                costs = np.ones(self.n_classes_)\n",
    "                costs[label_index] = 0.0\n",
    "                # Make a prediction:\n",
    "                predicted_index = self.predict_one(rep, costs=costs)\n",
    "                # Weight update if it's an incorrect prediction:\n",
    "                if predicted_index != label_index:\n",
    "                    self.coef_[label_index] += self.eta * rep                                  \n",
    "    \n",
    "    def predict_one(self, rep, costs=0.0):\n",
    "        \"\"\"The core prediction function. This is computed as\n",
    "        \n",
    "        (rep * self.coef_.T) + costs\n",
    "        \n",
    "        which corresponds to taking the inner product of `rep`\n",
    "        with each row in `self.weights` and adding `costs`.\n",
    "        \n",
    "        After that, the code just needs to figure out which\n",
    "        class is highest scoring and make a random choice \n",
    "        from that set (in case of ties).\n",
    "\n",
    "        Parameters\n",
    "        ----------\n",
    "        rep : np.array of dimension featcount or \n",
    "              `scipy.sparse` matrix of dimension (1 x `featcount`)\n",
    "        \n",
    "        costs : float or np.array of dimension self.classcount\n",
    "            Where this is 0.0, we're doing prediction. Where it\n",
    "            is an array, we expect a 0.0 at the coordinate \n",
    "            corresponding to the true label and a 1.0 in all \n",
    "            other positions.\n",
    "        \n",
    "        Returns\n",
    "        -------\n",
    "        int\n",
    "            The index of the correct class. This is for the \n",
    "            sake of the `fit` method. `predict` returns the class\n",
    "            names themselves.\n",
    "        \n",
    "        \"\"\"\n",
    "        scores = rep.dot(self.coef_.T) + costs\n",
    "        # Manage the difference between scipy and numpy 1d matrices:\n",
    "        scores = scores.reshape(self.n_classes_)\n",
    "        # Set of highest scoring label indices (in case of ties):\n",
    "        candidates = np.argwhere(scores==np.max(scores)).flatten()    \n",
    "        return random.choice(candidates)\n",
    "    \n",
    "    def predict(self, reps):\n",
    "        \"\"\"Batch prediction function for experiments. \n",
    "        \n",
    "        Parameters\n",
    "        ----------\n",
    "        reps : list or feature matrix\n",
    "           A featurized set of examples to make predictions about.\n",
    "           \n",
    "        Returns\n",
    "        -------        \n",
    "        list of str\n",
    "            A list of class names -- the predictions. Unlike `predict_one`, \n",
    "            it returns the class name rather than its index.\n",
    "                \n",
    "        \"\"\"\n",
    "        return [self.classes_[self.predict_one(rep)] for rep in reps]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the sake of our experimental framework, a simple wrapper for the above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def fit_basic_sgd_classifier(X, y):    \n",
    "    \"\"\"Wrapper for `BasicSGDClassifier`.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    X : 2d np.array\n",
    "        The matrix of features, one example per row.\n",
    "        \n",
    "    y : list\n",
    "        The list of labels for rows in `X`.\n",
    "    \n",
    "    Returns\n",
    "    -------\n",
    "    BasicSGDClassifier\n",
    "        A trained `BasicSGDClassifier` instance.\n",
    "    \n",
    "    \"\"\"    \n",
    "    mod = BasicSGDClassifier()\n",
    "    mod.fit(X, y)\n",
    "    return mod"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As I said above, we likely don't want to rely on `BasicSGDClassifier` (though it does a good job with SST!). Instead, we want to rely on `sklearn`. Here's a simple wrapper for [sklearn.linear.model.LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) using our \n",
    "`build_dataset` paradigm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def fit_maxent_classifier(X, y):    \n",
    "    \"\"\"Wrapper for `sklearn.linear.model.LogisticRegression`. This is also \n",
    "    called a Maximum Entropy (MaxEnt) Classifier, which is more fitting \n",
    "    for the multiclass case.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    X : 2d np.array\n",
    "        The matrix of features, one example per row.\n",
    "        \n",
    "    y : list\n",
    "        The list of labels for rows in `X`.\n",
    "    \n",
    "    Returns\n",
    "    -------\n",
    "    sklearn.linear.model.LogisticRegression\n",
    "        A trained `LogisticRegression` instance.\n",
    "    \n",
    "    \"\"\"\n",
    "    mod = LogisticRegression(fit_intercept=True)\n",
    "    mod.fit(X, y)\n",
    "    return mod"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Experiments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have all the pieces needed to run experiments. And we're going to want to run a lot of experiments, trying out different feature functions, taking different perspectives on the data and labels, and using different models. To make that process efficient and regimented, we now define a function `experiment`. All it does is pull together these pieces and use them to train and assess. It's complicated, but the flexibility will turn out to be an asset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def experiment(\n",
    "        train_reader=train_reader, \n",
    "        assess_reader=None, \n",
    "        train_size=0.7,\n",
    "        phi=unigrams_phi, \n",
    "        class_func=ternary_class_func,\n",
    "        train_func=fit_maxent_classifier,\n",
    "        score_func=utils.safe_macro_f1,\n",
    "        verbose=True):\n",
    "    \"\"\"Generic experimental framework for SST. Either assesses with a \n",
    "    random train/test split of `train_reader` or with `assess_reader` if \n",
    "    it is given.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    train_reader : SST iterator (default: `train_reader`)\n",
    "        Iterator for training data.\n",
    "       \n",
    "    assess_reader : iterator or None (default: None)\n",
    "        If None, then the data from `train_reader` are split into \n",
    "        a random train/test split, with the the train percentage \n",
    "        determined by `train_size`. If not None, then this should \n",
    "        be an iterator for assessment data (e.g., `dev_reader`).\n",
    "        \n",
    "    train_size : float (default: 0.7)\n",
    "        If `assess_reader` is None, then this is the percentage of\n",
    "        `train_reader` devoted to training. If `assess_reader` is\n",
    "        not None, then this value is ignored.\n",
    "       \n",
    "    phi : feature function (default: `unigrams_phi`)\n",
    "        Any function that takes an `nltk.Tree` instance as input \n",
    "        and returns a bool/int/float-valued dict as output.\n",
    "       \n",
    "    class_func : function on the SST labels\n",
    "        Any function like `binary_class_func` or `ternary_class_func`. \n",
    "        This modifies the SST labels based on the experimental \n",
    "        design. If `class_func` returns None for a label, then that \n",
    "        item is ignored.\n",
    "       \n",
    "    train_func : model wrapper (default: `fit_maxent_classifier`)\n",
    "        Any function that takes a feature matrix and a label list\n",
    "        as its values and returns a fitted model with a `predict`\n",
    "        function that operates on feature matrices.\n",
    "    \n",
    "    score_metric : function name (default: `utils.safe_macro_f1`)\n",
    "        This should be an `sklearn.metrics` scoring function. The \n",
    "        default is weighted average F1 (macro-averaged F1). For \n",
    "        comparison with the SST literature, `accuracy_score` might\n",
    "        be used instead. For micro-averaged F1, use\n",
    "        \n",
    "        (lambda y, y_pred : f1_score(y, y_pred, average='micro', pos_label=None))\n",
    "                \n",
    "        For other metrics that can be used here, see\n",
    "        see http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics\n",
    "        \n",
    "    verbose : bool (default: True)\n",
    "        Whether to print out the model assessment to standard output.\n",
    "        Set to False for statistical testing via repeated runs.\n",
    "       \n",
    "    Prints\n",
    "    -------    \n",
    "    To standard output, if `verbose=True`\n",
    "        Model accuracy and a model precision/recall/F1 report. Accuracy is \n",
    "        reported because many SST papers report that figure, but the \n",
    "        precision/recall/F1 is better given the class imbalances and the \n",
    "        fact that performance across the classes can be highly variable.\n",
    "        \n",
    "    Returns\n",
    "    -------\n",
    "    float\n",
    "        The overall scoring metric as determined by `score_metric`.\n",
    "    \n",
    "    \"\"\"        \n",
    "    # Train dataset:\n",
    "    train = build_dataset(train_reader, phi, class_func, vectorizer=None) \n",
    "    # Manage the assessment set-up:\n",
    "    X_train = train['X']\n",
    "    y_train = train['y']\n",
    "    X_assess = None \n",
    "    y_assess = None\n",
    "    if assess_reader == None:\n",
    "         X_train, X_assess, y_train, y_assess = train_test_split(\n",
    "                X_train, y_train, train_size=train_size)\n",
    "    else:\n",
    "        # Assessment dataset using the training vectorizer:\n",
    "        assess = build_dataset(\n",
    "            assess_reader, \n",
    "            phi, \n",
    "            class_func, \n",
    "            vectorizer=train['vectorizer'])\n",
    "        X_assess, y_assess = assess['X'], assess['y']\n",
    "    # Train:      \n",
    "    mod = train_func(X_train, y_train)    \n",
    "    # Predictions:\n",
    "    predictions = mod.predict(X_assess)\n",
    "    # Report:\n",
    "    if verbose:\n",
    "        print('Accuracy: %0.03f' % accuracy_score(y_assess, predictions))\n",
    "        print(classification_report(y_assess, predictions, digits=3))\n",
    "    # Return the overall score:\n",
    "    return score_func(y_assess, predictions)    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's an experiment with all default values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy: 0.601\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "   negative      0.600     0.687     0.641       997\n",
      "    neutral      0.318     0.130     0.184       516\n",
      "   positive      0.650     0.750     0.696      1051\n",
      "\n",
      "avg / total      0.564     0.601     0.572      2564\n",
      "\n"
     ]
    }
   ],
   "source": [
    "_ = experiment()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's a run on the dev set (use sparingly!):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy: 0.602\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "   negative      0.628     0.689     0.657       428\n",
      "    neutral      0.343     0.153     0.211       229\n",
      "   positive      0.629     0.750     0.684       444\n",
      "\n",
      "avg / total      0.569     0.602     0.575      1101\n",
      "\n"
     ]
    }
   ],
   "source": [
    "_ = experiment(assess_reader=dev_reader)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see how our reference optimizer does on the same task:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy: 0.586\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "   negative      0.606     0.693     0.647      1051\n",
      "    neutral      0.269     0.203     0.231       478\n",
      "   positive      0.677     0.655     0.666      1035\n",
      "\n",
      "avg / total      0.572     0.586     0.577      2564\n",
      "\n"
     ]
    }
   ],
   "source": [
    "_ = experiment(train_func=fit_basic_sgd_classifier)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Where does our default set-up sit with regard to published baselines for the binary problem? (Compare  [Socher et al., Table 1](http://www.aclweb.org/anthology/D/D13/D13-1170.pdf).)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy: 0.769\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "   negative      0.784     0.731     0.757      1020\n",
      "   positive      0.756     0.806     0.780      1056\n",
      "\n",
      "avg / total      0.770     0.769     0.769      2076\n",
      "\n"
     ]
    }
   ],
   "source": [
    "_ = experiment(class_func=binary_class_func)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hyperparameter search"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training process learns a set of parameters &mdash; the weights. There are typically lots of other parameters that need to be set. For instance, our `BasicSGDClassifier` has a learning rate parameter and a training iteration parameter. These are called __hyperparameters__. The more powerful `sklearn` classifiers often have many more such hyperparameters. These are outside of the explicitly stated objective, hence the \"hyper\" part. \n",
    "\n",
    "So far, we have just set the hyperparameters by hand. However, their optimal values can vary widely between datasets, and choices here can dramatically impact performance, so we would like to set them as part of the overall experimental framework.\n",
    "\n",
    "Luckily, `sklearn` provides a lot of functionality for setting hyperparameters via cross-validation. The following function implements a basic framework for taking advantage of these options. This method has the same basic shape as `fit_maxent_classifier` above: it takes a dataset as input and returns a trained model. However, to find its favored model, it explores a space of hyperparameters supplied by the user, seeking the optimal combination of settings.\n",
    "\n",
    "__Note__: this kind of search seems not to have a large impact for SST as we're using it. However, it can matter a lot for other data sets, and it's also an important step to take when trying to publish, since reviewers are likely to want to check that your comparisons aren't based in part on opportunistic or ill-considered choices for the hyperparameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def fit_classifier_with_crossvalidation(X, y, basemod, cv, param_grid, scoring='f1_macro'): \n",
    "    \"\"\"Fit a classifier with hyperparmaters set via cross-validation.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    X : 2d np.array\n",
    "        The matrix of features, one example per row.\n",
    "        \n",
    "    y : list\n",
    "        The list of labels for rows in `X`.  \n",
    "    \n",
    "    basemod : an sklearn model class instance\n",
    "        This is the basic model-type we'll be optimizing.\n",
    "    \n",
    "    cv : int\n",
    "        Number of cross-validation folds.\n",
    "        \n",
    "    param_grid : dict\n",
    "        A dict whose keys name appropriate parameters for `basemod` and \n",
    "        whose values are lists of values to try.\n",
    "        \n",
    "    scoring : value to optimize for (default: f1_macro)\n",
    "        Other options include 'accuracy' and 'f1_micro'. See\n",
    "        http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter\n",
    "            \n",
    "    Prints\n",
    "    ------\n",
    "    To standard output:\n",
    "        The best parameters found.\n",
    "        The best macro F1 score obtained.\n",
    "        \n",
    "    Returns\n",
    "    -------\n",
    "    An instance of the same class as `basemod`.\n",
    "        A trained model instance, the best model found.\n",
    "        \n",
    "    \"\"\"    \n",
    "    # Find the best model within param_grid:\n",
    "    crossvalidator = GridSearchCV(basemod, param_grid, cv=cv, scoring=scoring)\n",
    "    crossvalidator.fit(X, y)\n",
    "    # Report some information:\n",
    "    print(\"Best params\", crossvalidator.best_params_)\n",
    "    print(\"Best score: %0.03f\" % crossvalidator.best_score_)\n",
    "    # Return the best model found:\n",
    "    return crossvalidator.best_estimator_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's a fairly full-featured use of the above for the MaxEnt model family:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def fit_maxent_with_crossvalidation(X, y):\n",
    "    \"\"\"A MaxEnt model of dataset with hyperparameter \n",
    "    cross-validation. Some notes:\n",
    "        \n",
    "    * 'fit_intercept': whether to include the class bias feature.\n",
    "    * 'C': weight for the regularization term (smaller is more regularized).\n",
    "    * 'penalty': type of regularization -- roughly, 'l1' ecourages small \n",
    "      sparse models, and 'l2' encourages the weights to conform to a \n",
    "      gaussian prior distribution.\n",
    "    \n",
    "    Other arguments can be cross-validated; see \n",
    "    http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    X : 2d np.array\n",
    "        The matrix of features, one example per row.\n",
    "        \n",
    "    y : list\n",
    "        The list of labels for rows in `X`.   \n",
    "    \n",
    "    Returns\n",
    "    -------\n",
    "    sklearn.linear_model.LogisticRegression\n",
    "        A trained model instance, the best model found.\n",
    "    \n",
    "    \"\"\"    \n",
    "    basemod = LogisticRegression()\n",
    "    cv = 5\n",
    "    param_grid = {'fit_intercept': [True, False], \n",
    "                  'C': [0.4, 0.6, 0.8, 1.0, 2.0, 3.0],\n",
    "                  'penalty': ['l1','l2']}    \n",
    "    return fit_classifier_with_crossvalidation(X, y, basemod, cv, param_grid)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An example run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best params {'C': 2.0, 'penalty': 'l2', 'fit_intercept': False}\n",
      "Best score: 0.768\n",
      "Accuracy: 0.770\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "   negative      0.767     0.743     0.755       987\n",
      "   positive      0.773     0.795     0.784      1089\n",
      "\n",
      "avg / total      0.770     0.770     0.770      2076\n",
      "\n"
     ]
    }
   ],
   "source": [
    "_ = experiment(\n",
    "        train_func=fit_maxent_with_crossvalidation, \n",
    "        class_func=binary_class_func)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So there's no real change for this basic experiment. Still, I'm glad to have checked. For real experiments, I would likely explore a much larger space of parameters. The small size of SST makes this feasible, so why not!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Statistical comparison of classifier models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Suppose two classifiers differ according to an effectiveness measure like F1 or accuracy. Are they meaningfully different?\n",
    "\n",
    "For very large datasets, the answer might be clear: if performance is very stable across different train/assess splits and the difference in terms of correct predictions has practical import, then you can clearly say yes. \n",
    "\n",
    "With smaller datasets, or models whose performance is closer together, it can be harder to determine whether the two models are different. We can address this question in a basic way with repeated runs and basic null-hypothesis testing on the resulting score vectors.\n",
    "\n",
    "The following is a basic function for doing such testing. The default set-up uses the non-parametric [Wilcoxon signed-rank test](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test) to make the comparisons."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def compare_models(\n",
    "        stats_test=scipy.stats.wilcoxon,\n",
    "        trials=10,\n",
    "        phi1=unigrams_phi,\n",
    "        phi2=None,\n",
    "        train_func1=fit_maxent_classifier,         \n",
    "        train_func2=None,\n",
    "        reader=train_reader, \n",
    "        train_size=0.7, \n",
    "        class_func=ternary_class_func, \n",
    "        score_func=utils.safe_macro_f1):    \n",
    "    \"\"\"Wrapper for comparing models. The parameters are like those of \n",
    "    `experiment`, with the same defaults, except\n",
    "    \n",
    "    Parameters\n",
    "    ----------    \n",
    "    stats_test : scipy.stats function\n",
    "        Defaults to `scipy.stats.wilcoxon`, a non-parametric version \n",
    "        of the paired t-test. \n",
    "        \n",
    "    trials : int (default: 10)\n",
    "        Number of runs on random train/test splits of `reader`,\n",
    "        with `train_size` controlling the amount of training data.\n",
    "            \n",
    "    train_func1, train_func2\n",
    "        Just like `train_func` for `experiment`. `train_func1`\n",
    "        defaults to `fit_maxent_classifier`. If `train_func2`\n",
    "        is None, then it is set equal to `train_func`.\n",
    "        \n",
    "    phi1, phi2\n",
    "        Just like `phi` for `experiment`. `phi1` defaults to \n",
    "        `unigrams_phi`. If `phi2` is None, then it is set equal \n",
    "        to `phi1`.\n",
    "        \n",
    "    Prints\n",
    "    ------\n",
    "    To standard output\n",
    "        A report of the assessment.\n",
    "        \n",
    "    Returns\n",
    "    -------\n",
    "    (np.array, np.array, float)\n",
    "        The first two are the scores from each model (length `trials`),\n",
    "        and the third is the p-value returned by stats_test.\n",
    "        \n",
    "    TODO\n",
    "    ----\n",
    "    This function can easily be parallelized. The ParallelPython\n",
    "    makes this easy:http://www.parallelpython.com\n",
    "    \n",
    "    \"\"\"    \n",
    "    if phi2 == None:\n",
    "        phi2 = phi1\n",
    "    if train_func2 == None:\n",
    "        train_func2 = train_func1    \n",
    "    scores1 = np.array([experiment(train_reader=reader, \n",
    "        phi=phi1,\n",
    "        train_func=train_func1,\n",
    "        class_func=class_func,\n",
    "        score_func=score_func,\n",
    "        verbose=False) for _ in range(trials)])    \n",
    "    scores2 = np.array([experiment(train_reader=reader,\n",
    "        phi=phi2,\n",
    "        train_func=train_func2,\n",
    "        class_func=class_func,\n",
    "        score_func=score_func,\n",
    "        verbose=False) for _ in range(trials)])\n",
    "    # stats_test returns (test_statistic, p-value). We keep just the p-value:\n",
    "    pval = stats_test(scores1, scores2)[1]\n",
    "    # Report:\n",
    "    print('Model 1 mean: %0.03f' % scores1.mean())\n",
    "    print('Model 2 mean: %0.03f' % scores2.mean())\n",
    "    print('p = %0.03f' % pval if pval >= 0.001 else 'p < 0.001')\n",
    "    # Return the scores for later analysis, and the p value:\n",
    "    return (scores1, scores2, pval)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's a comparison of our SGD classifier with `sklearn`'s MaxEnt classifier, using `unigrams_phi` to featurize:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model 1 mean: 0.510\n",
      "Model 2 mean: 0.510\n",
      "p = 0.959\n"
     ]
    }
   ],
   "source": [
    "_ = compare_models(\n",
    "        train_func1=fit_basic_sgd_classifier, \n",
    "        train_func2=fit_maxent_classifier)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Distributed representations as features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal of this final content section is to make connections between our study of distributed representations and supervised learning. \n",
    "\n",
    "Arguably, more than any specific model architecture, this is the major innovation of __deep learning__: rather than designing feature functions by hand, we use dense, distributed representations, often derived from unsupervised models.\n",
    "\n",
    "To illustrate this process, we'll use GloVe vectors. The approach is very simple; problem 4 encourages you to trying more advanced, creative approaches. \n",
    "\n",
    "To start, we need a simple GloVe reader. The one in `utils` just creates a mapping from strings to their GloVe vectors. Here we use the `100d` version, but the others are worth trying:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "GLOVE = utils.glove2dict(os.path.join(glove_home, 'glove.6B.100d.txt'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have the GloVe data, the only additional code we need to write is a new feature function that makes use of it. The feature function  `glove_leaves_phi` does the most basic reasonable thing: it gets GloVe vectors for all the words it can and just sums them into a single vector.\n",
    "\n",
    "__Note__: because we want to use the above framework, we have to take the step of turning each vector into a dict with nonce key names. This is silly because it all just gets turned back into a vector by `build_dataset`, but the costs are small in practice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def glove_leaves_phi(tree, np_func=np.sum):\n",
    "    \"\"\"Represent tree as a combination of the GloVe vector of its words.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    tree : nltk.Tree\n",
    "    \n",
    "    np_func : function (default: np.sum)\n",
    "        A numpy matrix operation that can be applied columnwise, \n",
    "        like `np.mean`, `np.sum`, or `np.prod`. The requirement is that \n",
    "        the function take `axis=0` as one of its arguments (to ensure\n",
    "        columnwise combination) and that it return a vector of a \n",
    "        fixed length, no matter what the size of the tree is.\n",
    "    \n",
    "    Returns\n",
    "    -------\n",
    "    dict \n",
    "        A map from column index names to GloVe values. (The dict\n",
    "        structure is for the sake of conforming to our general \n",
    "        framework for feature functions.) \n",
    "    \n",
    "    \"\"\"    \n",
    "    allvecs = np.array([GLOVE[w] for w in tree.leaves() if w in GLOVE])\n",
    "    feats = {}    \n",
    "    if len(allvecs) > 0:\n",
    "        combo = np_func(allvecs, axis=0)\n",
    "        names = list(range(len(combo)))\n",
    "        feats = dict(zip(names, combo))        \n",
    "    return feats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Performance of this basic approach on the binary problem, using our default training function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy: 0.756\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "   negative      0.734     0.741     0.737       960\n",
      "   positive      0.775     0.769     0.772      1116\n",
      "\n",
      "avg / total      0.756     0.756     0.756      2076\n",
      "\n"
     ]
    }
   ],
   "source": [
    "_ = experiment(phi=glove_leaves_phi, class_func=binary_class_func)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Additional sentiment resources"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are a few publicly available datasets and other resources; if you decide to work on sentiment analysis, get in touch with the teaching staff &mdash; we have a number of other resources that we can point you to.\n",
    "\n",
    "* Sentiment lexica: http://sentiment.christopherpotts.net/lexicons.html\n",
    "* NLTK now has a SentiWordNet module: http://www.nltk.org/api/nltk.corpus.reader.html#module-nltk.corpus.reader.sentiwordnet\n",
    "* Stanford Large Movie Review Dataset: http://ai.stanford.edu/~amaas/data/sentiment/index.html\n",
    "* SemEval-2013: Sentiment Analysis in Twitter: https://www.cs.york.ac.uk/semeval-2013/task2/\n",
    "* Starter code for a sentiment-aware tokenizer: http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## In-class exploration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to get a feel for the codebase and prepare for the in-class bake-off, we'll use class time for the following:\n",
    "\n",
    "0. Write a new feature function. We recommend starting with something simple.\n",
    "0. Use `experiment` to evaluate your new feature function on the binary and ternary versons of SST, with at least `fit_basic_sgd_classifier` and `fit_maxent_classifier`.\n",
    "0. If you have time, compare your feature function with `unigrams_phi` using `compare_models`.\n",
    "\n",
    "__Submit__: At the end of class, bring a summary of your feature function (e.g., \"bigrams\") to one of the teaching team. We'll summarize the approaches taken in the class discussion forum."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## In-class bake-off"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal of the in-class bake-off is to achieve the highest average F1 score on the development set as reported by `experiment`, with the ternary class function. \n",
    "\n",
    "The only restriction: the feature functions cannot make any use of the subtree labels.\n",
    "\n",
    "Here's the baseline model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy: 0.602\n",
      "             precision    recall  f1-score   support\n",
      "\n",
      "   negative      0.628     0.689     0.657       428\n",
      "    neutral      0.343     0.153     0.211       229\n",
      "   positive      0.629     0.750     0.684       444\n",
      "\n",
      "avg / total      0.569     0.602     0.575      1101\n",
      "\n"
     ]
    }
   ],
   "source": [
    "_ = experiment(\n",
    "        train_reader=train_reader,         # Fixed by the competition.\n",
    "        assess_reader=dev_reader,          # Fixed.\n",
    "        class_func=ternary_class_func,     # Fixed.\n",
    "        train_func=fit_maxent_classifier,  # Free to write your own!\n",
    "        phi=unigrams_phi)                  # Free to write your own!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So the baseline is 0.575. For the top scores (bake-off winners), we'll want to see the function call and any new code (especially feature functions). At the end of class, bring your score to one of the teaching team. We'll report the results in the class discussion forum."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Homework 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Error analysis [3 points]\n",
    "\n",
    "Error analysis is one of the most important methods for steadily improving a system, as it facilitates a kind of human-powered hill-climbing on your ultimate objective. Often, it takes a careful human analyst just a few examples to spot a major pattern that can lead to a beneficial change to the feature representations.\n",
    "\n",
    "__Your task__: improve `experiment` above by adding a keyword argument `view_errors` with default value `0`. Where the  value is `n`, the function prints out a random selection of `n` errors: the underlying tree, the correct label, and the predicted label.\n",
    "\n",
    "__Note__: the error printing need only work where `assess_reader` is specified (not None). If no `assess_reader` is given, then you can just change `view_errors=0`. (I realized that printing errors for random train/test splits requires more code-rewriting than I originally anticipated. It's doable but a bit cumbersome.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Reproducing Socher et al.'s NaiveBayes baselines [4 points]\n",
    "\n",
    "[Socher et al.](http://www.aclweb.org/anthology/D/D13/D13-1170.pdf) compare against two NaiveBayes baselines: one with unigram features and one with bigram features. See how close you can come to reproducing the performance of those models on the binary, root-only problem (values in the rightmost column of their Table 1, rows 1 and 3). \n",
    "\n",
    "__Specific tasks__:\n",
    "\n",
    "0. Write a bigrams feature function on the model of `unigrams_phi`.\n",
    "0. Write a function `fit_nb_classifier` that serves as a wrapper for [sklearn.naive_bayes.MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB).\n",
    "0. Use `experiment` to run the experiments, assessing against `dev_reader`.\n",
    "0. Use `compare_models` to compare the feature_functions statistically with `fit_nb_classifier` as the shared `train_func` value.\n",
    "\n",
    "Submit all the code you write for this, including any new `import` statements, as well as the output from running the code in steps 3 and 4.\n",
    "\n",
    "__A note on performance__: in our experience, the unigrams NaiveBayes model achieves about 0.79, and the bigrams NaiveBayes model under-performs, at around 0.75. It's fine to submit answers with comparable numbers; the Socher et al. baselines are very strong."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Sentiment shifters [3 points]\n",
    "\n",
    "Some words have greater power than others to shift sentiment around. Because the SST has sentiment labels on all of its subconstituents, it provides an opportunity to study these shifts in detail. This question takes a first step in that direction by asking you to identify some of these sentiment shifters automatically.\n",
    "\n",
    "More specifically, the task is to identify words that effect a particularly large shift between the value of their sibling node and the value of their mother node. For instance, in the tree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAQwAAACzCAIAAADzBbXQAAAACXBIWXMAAA3XAAAN1wFCKJt4AAAAHXRFWHRTb2Z0d2FyZQBHUEwgR2hvc3RzY3JpcHQgOS4xNnO9PXQAAA3USURBVHic7d2/b9tGHwbwS9olDVqAL2APWWrw3ZylAOF3KuAM5OIA71RqbbqQQLt0aED+CSKcuQBvStFNDN4tXsTBXl/rxmjr1V4jwAcUsNBN73Bvr4Rk8ahfPJJ6PpNlmeTR5sO7o2h+H81mMwIAyz023QCApkNIADQQEgANhARAY99DwjmnlPZ6vTzPTbcFGmqvQyKESJLEsizbtk23BZrrU9MNMMmyrDRNCSGMMdNtgeba654EoAqEBEADIQHQQEgANBASAA2EBEDj0T7fBSyE6PV6hBDOuWVZ8gMTeVEYQNnrkABUgeEWgAZCAqCBkABoICQAGggJIYSI6VRMp6ZbAQ2113cBE0Ly8Ti7vqaXl188efIv2/ZPToLTU9ONgmbZ00vAfDKhV1fZ9TX/+NE+PPz3V18dfP75fxhjNzfW06f+yUnw4oXz5ZemmwmNsHchoVdX+YcP2fU1ISR48cI9PvZPTtS77PZWdizi/t45OpJpsT77zFx7wbx9CcmqR395lmCvdDwkYjqV2VhvHDU3KpMzFvvgYKdthqbpbEjUjJwQ4j5/vuGMfLtrg3bpWkh2eu7fsF+ClupOSOqcRWB+v1daHxKzxyvm9/ugrSFp1MgH8/tua19ImjyHbnLbYG2tCUmLztaN6uVgcy0ISXvH/Zjfd0NzQ9KlI6y9OQfSwJB0eKzSohEjFDUoJPsz692fPe0G8yHZ2/Nrh/vMjjEZEozUpS7NvjrJTEjid+9wTCyaO2v0fR+/liYwFhJxf4/RxYPk+JPd3Ax/+sl0W4CQJsxJABoOT0sB0Kg1JIyxXq/neZ7neZTSOjfdFqgG3ED1hUQIEcdxmqbD4XA4HAohsiyrbeutgGrAzVTfnIQxJoRwXVd9p9frDQaDerbeLnEcu65b/F2BQfU9nM5xnOJLxhjOl9AKZibuQghKaRRFRrYOsBIDIZH1pYIgsCyr/q0DrKrukMjpe7/fnxt9ATRWrQ/M5pwnSdLv99GHQIvU15MwxsIwLCYkSZLatg6wtvouAXueJ4SY60OGw2E9W28FVANuJty7BaCBe7cANBASAA2EBEADIQHQQEgANBCShsrH43w8Nt0KIMRIiWp2extnGf6Bexl6dZVcXPCPHwkh7vPn0dmZe3xsulF7zcQNjvf3+YcPYjqtf9MNl4/H/4yi8O1b++Bg+Pp1+uoVn0y883PvzRv0KgYZ6EkkdnODE6SSj8fJxUX+4YNzdDR8/Vr9ZoLTU9mxeOfn6FVMMRYSkFQ87MPD9NWrxeedBqeniIpZCIkx2ngUISoGISQG8MkkzrLs+rpKPIoQFSMM3OAoptN//PBDceS9P/hkklxc0MtL6+nT6OwsOjtbe1XqIhiismtm7gJ+9N13+xaSYjyCFy+is7OtPOcXUakBQrJzO4pHEaKyU8ZC0vf9TQYbrSCmUxkPQsiO4lGEqOyIsYl7tz9MVPGQD8+vp4gCpvU7gqtbWzYXj+jsrOaqXYjK1iEk26SKExmJRxGiskVmQmI9fWpku7uj5gPG41GEqGyFmYm79+aNc3TU/+ab+je9dW2ZLrelnQ2EkKyvjYddG9tsHEKyjnw8Dn/5pb2HGqKyEmMhIYS08f+uive0932/1YcXolIRQrKC+N275P17+/AwOjurfldiw6moRC9ftrdv3yk8wXEF+XjMJ5POxKOIXl3ZBwfoTB6EkABo4GkpABqffP3115ZlPXnypPoyQoiVfr4iSmmWZVmWPXv27NmzZ1tf/9oYYz/++COl9Ndff/3zzz+7VH6Ic55lGUr+lnucJEkcxystE4bhLpoSBIGsXiKE2MX619PhytqoiF3R4yAIVj0oG3UQ7xrnPIoiVVYliqLOhMSyrDRNfd833ZCm+9T3fVkLNwgC9V3GmOxe1Gmm3+8TQjjnYRgyxjzPkz9pWVaxFrv8AULIcDiUYydCiOu6auWMMUqpKlJTvTRc+YJyqCbbI49jznmaprImThAE8lCI45gx5jiO3J0qUFkbyGw2m81mQRDMChzHubu7k1+PRiPXdYvvzr1c5LpuFEX9fl++HA6HxVXNrVm9lKIoUj+vlC84GAyiKCquQe3Ob7/9Nrdrvu/PbbG6u7u7IAjWXryxHvydg/L31S3Oufratu08z+XXjuOsUZHMtm1Vpt11XfkFpXQwGKgewHEc3/erjF7KF6SUFnuGKIrUvsizvnqZ57lt2+uVNUVl7b312PM8z/MYY8WDNU1Tznkcx71eLwzDNSYhxcGbIkdKxe/4vl8M5zIrLTg3E42iSFUwpZSq6K4ElbX32aeqtKfnefIAkpFQB5MQwvO80Wi0+cYWw8YYq3JiLl9wbg2c87leUX5TCLFeN4LK2nvu7+GW67pyiCWnyOr7i0eGbdvFo7Z6P+O6bvFys7wEWeXqSvmCQRCo7k6e8ufO97IzybJsjW4ElbXh/7elxHGc57kQIooi27YppWqEwzkvXp4if137kgeiPDTVpCVJkjzP5RUk+Z25IQqlVE4M5HrUu1mWyWSq61dytWrgtGxBKc9zSqlMbxRFc7MUQkgYhvKa2Kq/oA5X1kZF7IqW3rslhGCMkcK0+8F3HcdZdRBSvuatLBiG4dwfO47j4scdANUt/R93y7JKjsXyd8utvWzFBfM8X5ylkIfGjQBVdORpKWrkQAgpflbY6/WEEHI4kWUZPl2GNeBWeQAN3CoPoIGQAGh0ZE4Cm8jH4//+/vsnjx4FL17U8Mzi1sGcpKr43Tt2c9O6h1eUUw9/Ofjii8kff+yuOESroSfZU4sVG2UdleT9e3p5iagUoSepqjM9STEei89GqqHkUOsgJFV1ICT06opeXrKbG+2jw4pR8U9OmvMIcCMQkqpaHZLiwxqD01P/5KTKUnwykblqQjEJgxCSqloaks2fZWq8LJFxCElVrQvJdh/1u89RQUiqaktIikfz1p+EvdOVNxZCUlXzQ1LnyX6vnkiPz0m6oP6x0F4VmkNI2s3sBajFqFS/dNYiCElbNeejDBUVennZ+/nnjtVvIZiTVNecOUmTPxQv/zi/pdCTtEkxHtHLl42Kh+QeH7vHxzIq4du3ycVFB6KCkLRDPh5n19dNjkfRYlSC09P23oePkDRdcQDT9/0WHWoqKtn1dZxlycVF0waHFWFOUlX9c5Iuje+bPI/SQkiqqjMkXYpHUUujguFWVbVdYGW3t975ufpfqHo2Wg/74CD99tvo7Ez+dxf/+HHw/femG6WHnqSJ8vG4q59eK3wyITWeejaBkABo4JFCSzHG9qo6JCyDkCwlCyzWuUXGWK/Xk2WVitUvuiqO41UrPxuBkCzl+36dNUQ7XAv7QZRSx3FqPg2tqf4yjfCg0Wg0V93T931Tjdk1WaJ1VqFIbRPgEvAD1EBrsUjisuLdm9urWthJkjxYVbOZEJIHyOM+juPFiXsYhsPhUJY6UYHZOiHEYrGuzpDlYlpUohVzktVsXrxbq/O1sGXVMdOtWAF6ktWkaUopjeNYFgba+pih87Ww5YV1VZyVc9784koIyQp2V7xb2oda2LZtF7sRVSy2yRCSFTDGGGPqb7zdQ1nOcAaDQbEWdruGJVXMFb60LKv5fSZuS5lXUixbFsIuKd69iQ7Xwn6Q/D0zxoIgaPglCoRkZWuX2IaWQkgANHAJGEADIQHQQEgANBASAA2EpHH4ZMJub023YufEdJqPx2I6Nd0QPYSkcejVVdzp/ySR2M2Nd37Obm5MN0QPIQHQQEjAjBY9DgYhAdBASAA0EBIwKR+PTTdBDyEB0EBIADQQEgANhASMsQ8PTTehEoQEjGnFI+UJQgKghZCASeL+3nQT9BASMEmW8mk4hARAAyEB0EBIADQQEjDGOToy3YRK8JjTxnGPj9vyAcKG2rKneDgdgAaGWwAaCAmspnWVuzdvMELSDrUdl7JEURiGy+rirlq5m3OuLUW9073bvNQ4QtIOYRjWsyFZCMGyrGUH7qqVu23b7vf75YfpTvdu81LjuLrVDs0Z4eyiau5O927zBiMk9en1erZt27adZRn56xRbrNrDGKOUquJB8l3OuRz8eJ4nf8yyrMFgULKhLMuyLJMlgWT5NbUVbRtKlFTuJoRQSuU6LcuKoijLMs55sZ2U0jzPhRDFjWr3bpN9WdZg7YLFfRkMBsRsGfl9Y1lWmqby69FoFASBems0Grmue3d39+BL13UrbmIwGMyt1vf9im1QoigaDocPrv/BtwaDQRRFxZ+ZW235Rpft3e72pWTBuX2ZzWYISa3mjobiyyAIVCSkNE3VH7J6SOYOI7mewWBQpQ3KqiGZW8nd3V35Vqq0YbbLfSlZcHElGG41hSrFqPi+r0o5V5fnuRq6FFe1UeNWJIdGm6+nCftCMCdpjsXJK2NsjQK/ruuWz1h2Ya6dnHPO+earbcK+EFwCbg7XdYsfJgghkiRRZ03btospKrkctNj/bOuQLREEQRiGslVCiDiOVyo8vWzvmrAvBPdu1UYI0ev1ihWZ4zimlBYLNMvrP3KgwhgrXpORVd7lS/mXS9N02baSJGGMyfXIH1YffZS3oaQ8d8lbcqOyfrc83KMoopTKdS5u1PM8xpjv+2oXSvZu6/tiWZb2D1HclzRNEZJmKal/rd5yHKfKMCzP8+o/vHVhGJbEeFH53pndF4QEti/P8zzP1Ym57TBxh+2Q4x/5teM4rU5IcV+GwyF6EgANXN0C0EBIADQQEgANhARAAyEB0Pgf/hqFrUkIIMIAAAAASUVORK5CYII=",
      "text/plain": [
       "Tree('1', [Tree('2', ['Astrology']), Tree('1', [Tree('2', ['is']), Tree('1', [Tree('2', ['not']), Tree('4', ['enlightening'])])])])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t = Tree.fromstring(\"\"\"(1 (2 Astrology) (1 (2 is) (1 (2 not) (4 enlightening))))\"\"\")\n",
    "t"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "we have the shifter scores:\n",
    "\n",
    "* _not_: `1-4=-3`\n",
    "* _enlightening_: `1-2=-1`\n",
    "* _is_: `1-1=0`\n",
    "* _Astrology_: `1-1=0`.\n",
    "\n",
    "__Your task__: write a function that calculates the mean shifter scores for all the words in the training data and prints out the top 10 and bottom 10 as ranked by those mean scores, _limiting attention to words with at least 100 scores to reduce noise_. \n",
    "\n",
    "__Tips__:\n",
    "\n",
    "* You'll probably want to use `tree.subtrees()` to inspect all of the subtrees in each `tree`.\n",
    "* `len(tree)` counts the number of children (immediate descendents) of `tree`.\n",
    "* Use `from six import string_types` and then `isinstance(subtree[0][0], string_types)` will test whether the left daughter of `subtree` has a lexical child.\n",
    "* `tree.label()` gives the label for any tree or subtree.\n",
    "* Your reader should use `replace_root_score=False` so that you keep the root node label."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Extra credit: Toward compositional distributed features [up to 2 points]\n",
    "\n",
    "In [Distributed representations as features](#Distributed-representations-as-features), we just averaged together all of the leaf-node GloVe vectors to obtain a fixed-dimensional representation for all sentences. This ignores all of the tree structure. See if you can do better by paying attention to the binary tree structure.\n",
    "\n",
    "__Your tasks__:\n",
    "0. Write a function `glove_subtree_phi` that obtains a vector representation for each subtree by combining the vectors of its daughters, with the leaf nodes again given by GloVe (any dimension you like) and the full representation of the sentence given by the final vector obtained by this recursive process. You can decide on how you combine the vectors. As usual, the requirement is just that we get representations of the same dimensionality for all trees (a basic requirement of supervised learning in our sense). Submit this function.\n",
    "0. Use `experiment` to evaluate this on the binary-class, root-only problem, with `fit_maxent_with_crossvalidation` as the training function. Use `dev_reader` for the evaluation. Submit the code for this function call as well. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}