{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Bake-off: Word analogies\n",
    "\n",
    "__Important__: This isn't being run as a bake-off this year. It's included in the repository in case people want to do additional exploration or incorporate this kind of evaluation into a project."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "__author__ = \"Christopher Potts\"\n",
    "__version__ = \"CS224u, Stanford, Spring 2018 term\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Contents\n",
    "\n",
    "0. [Contents](#Contents)\n",
    "0. [Overview](#Overview)\n",
    "0. [Set-up](#Set-up)\n",
    "0. [Analogy completion](#Analogy-completion)\n",
    "0. [Evaluation](#Evaluation)\n",
    "0. [Baseline](#Baseline)\n",
    "0. [Bake-off submission](#Bake-off-submission)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Word analogies provide another kind of evaluation for distributed representations. Here, we are given three vectors A, B, and C, in the relationship\n",
    "\n",
    "*A is to B as C is to __*\n",
    "\n",
    "and asked to identify the fourth that completes the analogy. This section conducts such analyses using a large, automatically collected analogies dataset from Google. These analogies are by and large substantially easier than the classic brain-teaser analogies that used to appear on tests like the SAT, but it's still an interesting, demanding task.\n",
    "\n",
    "The core idea  is that we make predictions by creating the vector\n",
    "\n",
    "$$(A−B)+C$$\n",
    "\n",
    "and then ranking all vectors based on their distance from this new vector, choosing the closest as our prediction."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set-up"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "import csv\n",
    "import numpy as np\n",
    "import os\n",
    "import pandas as pd\n",
    "from scipy.stats import spearmanr\n",
    "import vsm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_home = 'vsmdata'\n",
    "\n",
    "analogies_home = os.path.join(data_home, 'question-data')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analogy completion\n",
    "\n",
    "The function `analogy_completion` implements the analogy calculation on VSMs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def analogy_completion(a, b, c, df, distfunc=vsm.cosine):\n",
    "    \"\"\"a is to be as c is to predicted, where predicted is the \n",
    "    closest to (b-a) + c\"\"\"\n",
    "    for x in (a, b, c):\n",
    "        if x not in df.index:\n",
    "            raise ValueError('{} is not in this VSM'.format(x))\n",
    "    avec = df.loc[a]\n",
    "    bvec = df.loc[b]\n",
    "    cvec = df.loc[c]\n",
    "    newvec = (bvec - avec) + cvec\n",
    "    dists = df.apply(lambda row: distfunc(newvec, row))\n",
    "    dists = dists.drop([a,b,c])\n",
    "    return pd.Series(dists).sort_values()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "imdb20 = pd.read_csv(\n",
    "    os.path.join(data_home, \"imdb_window20-flat.csv.gz\"), index_col=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "imdb_ppmi = vsm.pmi(imdb20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = analogy_completion(\"good\", \"great\", \"bad\", imdb_ppmi)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "awful       0.589598\n",
       "terrible    0.603033\n",
       "acting      0.604288\n",
       "horrible    0.634886\n",
       "worst       0.648661\n",
       "dtype: float64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation\n",
    "\n",
    "The function `analogy_evaluation` evaluates a VSM against one of the files in `analogies_home`. The default is to use `gram1-adjective-to-adverb.txt`, but there are a lot to choose from. The calculations are somewhat slow, so you can use the `verbose=True` option to see what predictions are being made incrementally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def analogy_evaluation(\n",
    "        df, \n",
    "        src_filename='gram1-adjective-to-adverb.txt', \n",
    "        distfunc=vsm.cosine,\n",
    "        verbose=True):\n",
    "    \"\"\"Basic analogies evaluation for a file `src_filename `\n",
    "    in `question-data/`.\n",
    "    \n",
    "    Parameters\n",
    "    ----------    \n",
    "    df : pd.DataFrame\n",
    "        The VSM being evaluated.\n",
    "    src_filename : str\n",
    "        Basename of the file to be evaluated. It's assumed to be in\n",
    "        `analogies_home`.        \n",
    "    distfunc : function mapping vector pairs to floats (default: `cosine`)\n",
    "        The measure of distance between vectors. Can also be `euclidean`, \n",
    "        `matching`, `jaccard`, as well as any other distance measure \n",
    "        between 1d vectors.\n",
    "    \n",
    "    Returns\n",
    "    -------\n",
    "    (float, float)\n",
    "        The first is the mean reciprocal rank of the predictions and \n",
    "        the second is the accuracy of the predictions.\n",
    "    \n",
    "    \"\"\"\n",
    "    src_filename = os.path.join(analogies_home, src_filename)\n",
    "    # Read in the data and restrict to problems we can solve:\n",
    "    with open(src_filename) as f:    \n",
    "        data = [line.split() for line in f.read().splitlines()]\n",
    "    data = [prob for prob in data if set(prob) <= set(df.index)]\n",
    "    # Run the evaluation, collecting accuracy and rankings:\n",
    "    results = defaultdict(int)\n",
    "    ranks = []\n",
    "    for a, b, c, d in data:\n",
    "        ranking = analogy_completion(a, b, c, df=df, distfunc=distfunc)       \n",
    "        predicted = ranking.index[0]\n",
    "        # Accuracy:\n",
    "        results[predicted == d] += 1  \n",
    "        # Rank of actual, starting at 1:\n",
    "        rank = ranking.index.get_loc(d) + 1\n",
    "        ranks.append(rank)        \n",
    "        if verbose:\n",
    "            print(\"{} is to {} as {} is to {} (gold: {} at rank {})\".format(\n",
    "                a, b, c, predicted, d, rank))        \n",
    "    # Return the mean reciprocal rank and the accuracy results:\n",
    "    mrr = np.mean(1.0 / (np.array(ranks)))\n",
    "    return (mrr, results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Baseline\n",
    "\n",
    "My baseline is PPMI on `imdb20` as loaded above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.5446665665352782, defaultdict(int, {False: 297, True: 209}))"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "analogy_evaluation(imdb_ppmi, src_filename='family.txt', verbose=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.14621311421118469, defaultdict(int, {False: 904, True: 88}))"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "analogy_evaluation(imdb_ppmi, src_filename='gram1-adjective-to-adverb.txt', verbose=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.11918977012507004, defaultdict(int, {False: 756, True: 56}))"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "analogy_evaluation(imdb_ppmi, src_filename='gram2-opposite.txt', verbose=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bake-off submission\n",
    "\n",
    "1. The name of the count matrix you started with (must be one in `vsmdata`).\n",
    "1. A description of the steps you took to create your bake-off VSM – must be different from the above baseline.\n",
    "1. Your mean reciprocal rank scores for the following files in `analogies_home`:\n",
    "  * 'family.txt'\n",
    "  * 'gram1-adjective-to-adverb.txt'\n",
    "  * 'gram2-opposite.txt'"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}