{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Semantic date parsing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "__author__ = \"Christopher Potts\"\n", "__version__ = \"CS224u, Stanford, Spring 2016 term\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "0. [Set-up](#Set-up)\n", "0. [Goals](#Goals)\n", "0. [In-class bake-off](#In-class-bake-off)\n", "0. [Data set](#Data-set)\n", "0. [Core interpretive functionality](#Core-interpretive-functionality)\n", "0. [Build a grammar](#Build-a-grammar)\n", "0. [Check oracle accuracy](#Check-oracle-accuracy)\n", "0. [Define features](#Define-features)\n", "0. [Train a model](#Train-a-model)\n", "0. [Inspect the trained model](#Inspect-the-trained-model)\n", "0. [Evaluate the trained model](#Evaluate-the-trained-model)\n", "0. [Appendix: a rule-based baseline](#Appendix:-a-rule-based-baseline)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set-up\n", "\n", "0. Make sure the `sys.path.append` value is the path to your local [SippyCup repository](https://github.com/wcmac/sippycup). (Alternatively, you can add SippyCup to your Python path; see one of the teaching team if you'd like to do that but aren't sure how.)\n", "\n", "0. Make sure that `semparse_dateparse_data.pickle` is in the current directory (or available via your Python path)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import sys\n", "sys.path.append('../sippycup')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import sys\n", "import re\n", "import calendar\n", "import itertools\n", "import pickle\n", "import random\n", "from copy import copy\n", "from collections import defaultdict\n", "from datetime import date\n", "from dateutil import parser" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Goals\n", "\n", "The goal of this bake-off is to train a semantic parser to interpret date strings. Although this requires only simple syntactic structures, it turns out to be a deep understanding challenge involving contextual inference. Date strings are often ambiguous, and there are conventions for how to resolve the ambiguities. For example, _1/2/11_ is ambiguous between `Jan 2` and `Feb 1`, and the _11_ could pick out `1211`, `1911`, `2011`, and so forth.\n", "\n", "The codelab is structured basically as our last one was: the first goal is to create a grammar that gets perfect oracle accuracy, and the second goal is to design features that resolve ambiguities accurately and so predict the correct denotations.\n", "\n", "In this case, it's possible to write a high-quality grammar right from the start. We know (approximately) what most of the words mean, so we really just need to handle the variation in the forms people use. The bigger challenge comes in defining features that capture the subtler interpretive conventions.\n", "\n", "For reference, I've included at the bottom of the codelab a baseline: the rule-based `dateutil.parser.parse`. It's good, achieving around 70% on the train and dev sets. However, there are some strings that it just can't parse, and, since it isn't learning-based, it can't learn the conventions latent in the data. You'll be able to do much better!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"## In-class bake-off\n",
"\n",
"Your bake-off entry is your __denotation accuracy__ result for the dev set after training on the train set."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"The data set has `train` and `dev` portions, with 1000 and 500 examples, respectively."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"from example import Example\n",
"\n",
"data = pickle.load(open('semparse_dateparse_data.pickle', 'rb'))\n",
"train = data['train']\n",
"dev = data['dev']"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"Here's a look at the data via random samples. Running this cell a few times will give you a feel for the data.\n",
"\n",
"* All the punctuation has been removed from the strings for convenience.\n",
"\n",
"* All of the examples end with one of two informal timezone flags: `US` if the string comes from text produced in the U.S., and `non-US` otherwise. This is an important piece of information that can be used in learning, since (I assume) only Americans use the jumbled `month/day/year` order."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"# Display a random sample of training examples:\n",
"for ex in random.sample(train, k=5):\n",
"    print(\"=\" * 70)\n",
"    print(\"Input:\", ex.input)\n",
"    print(\"Denotation:\", repr(ex.denotation))"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Core interpretive functionality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"Here's the semantics you'll want to use. You won't need to change it.\n",
"\n",
"__Important__: `date_semantics` expects its arguments in international order: `year, month, day`."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"def date_semantics(year, month, day):\n",
"    \"\"\"Interpret the arguments as a date object if possible. Returns\n",
"    None if the datetime module won't treat the arguments as a date\n",
"    (say, because it would be a Feb 31).\"\"\"\n",
"    try:\n",
"        return date(year=year, month=month, day=day)\n",
"    except ValueError:\n",
"        return None\n",
"\n",
"# The key 'dt' is arbitrary, but the grammar rules need to hook into it.\n",
"ops = {'dt': date_semantics}\n",
"\n",
"def execute(semantics):\n",
"    \"\"\"The usual SippyCup interpretation function\"\"\"\n",
"    if isinstance(semantics, tuple):\n",
"        op = ops[semantics[0]]\n",
"        args = [execute(arg) for arg in semantics[1:]]\n",
"        return op(*args)\n",
"    else:\n",
"        return semantics"
] },
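{ "cell_type": "markdown", "metadata": {}, "source": [
"As a quick sanity check (a usage sketch of the functions above, not required for the bake-off): `execute` recursively evaluates a semantics tuple by looking up its first element in `ops`:"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"# ('dt', year, month, day) is interpreted by date_semantics:\n",
"print(execute(('dt', 2015, 5, 4)))   # 2015-05-04\n",
"\n",
"# Impossible dates denote None:\n",
"print(execute(('dt', 2015, 2, 31)))  # None"
] },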
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Build a grammar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"__TO DO__: Fill out this starter code to build your grammar. The following is a high-level summary of the grammar you're trying to build. (You might end up making slightly different assumptions, but this represents a good default goal.) [This image is available in higher resolution here](fig/dateparse-grammar-summary.pdf). Rules in green have been completed for you. Rules in orange still need to be written.\n",
"\n",
"![dateparse-grammar-summary.png](fig/dateparse-grammar-summary.png)"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"from parsing import Grammar, Rule, add_rule\n",
"\n",
"# Use $DATE as the root node of full date expressions:\n",
"gram = Grammar(start_symbol=\"$DATE\")\n",
"\n",
"# Fill this with Rule objects:\n",
"rules = []\n",
"\n",
"# Make sure the grammar can handle final timezone information:\n",
"timezone_rules = [\n",
"    Rule('$DATE', '$DATE $TZ', (lambda sems: sems[0])),\n",
"    Rule('$TZ', 'US'),\n",
"    Rule('$TZ', 'non-US') ]\n",
"\n",
"rules += timezone_rules\n",
"\n",
"######################################################################\n",
"#################### RULES FOR MONTHS (COMPLETED) ####################\n",
"#\n",
"# As a reminder about the notation, concepts, etc., here are lexical\n",
"# rules for the months. They all determine trees\n",
"#\n",
"#    $M : i\n",
"#     |\n",
"#  monthname\n",
"#\n",
"# where monthname is the ith month of the year.\n",
"\n",
"# Full month names with \"\" in the 0th position so that 'January' has\n",
"# index 1, 'February' index 2, ...\n",
"months = [str(s) for s in calendar.month_name]\n",
"# Add the rules:\n",
"rules += [Rule('$M', m, i) for i, m in enumerate(months) if m]\n",
"\n",
"# 3-letter month names like \"Jan\", \"Feb\", with \"\" in 0th position:\n",
"mos = [str(s) for s in calendar.month_abbr]\n",
"# Add the rules:\n",
"rules += [Rule('$M', m, i) for i, m in enumerate(mos) if m]\n",
"\n",
"# Numerical months:\n",
"num_months = [str(i) for i in range(1, 13)]\n",
"# Add the rules:\n",
"rules += [Rule('$M', m, int(m)) for m in num_months]\n",
"\n",
"# 1-digit months with an initial zero:\n",
"num_months_padded = [str(i).zfill(2) for i in range(1, 10)]\n",
"# Add the rules:\n",
"rules += [Rule('$M', m, int(m)) for m in num_months_padded]\n",
"\n",
"######################################################################\n",
"####################### ADD RULES FOR THE DAYS #######################\n",
"#\n",
"# Add lexical rules for the days. Again, these will be structures\n",
"#\n",
"#  $D : i\n",
"#   |\n",
"#  day\n",
"#\n",
"# where day is a 1- or 2-digit string and i is its corresponding int.\n",
"\n",
"# Here are days of the month without and with two-digit padding. Each\n",
"# day string d can be interpreted semantically as int(d).\n",
"\n",
"days = [str(d) for d in range(1, 32)]\n",
"\n",
"days_padded = [str(i).zfill(2) for i in range(1, 10)]\n",
"\n",
"# Add to the rules list here:\n",
"\n",
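"# One possible completion (a hedged sketch, exactly parallel to the\n",
"# month rules above; adjust it if you make different assumptions):\n",
"rules += [Rule('$D', d, int(d)) for d in days]\n",
"rules += [Rule('$D', d, int(d)) for d in days_padded]\n",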
"\n",
"######################################################################\n",
"################## ADD RULES FOR THE ORDINAL DAYS ####################\n",
"#\n",
"# The data contain day names like \"2nd\". Use int2ordinal to create\n",
"# rules for these too:\n",
"#\n",
"#  $D : int(i)\n",
"#   |\n",
"#  int2ordinal(i)\n",
"\n",
"def int2ordinal(s):\n",
"    \"\"\"Forms numerals like \"1st\" from int strs like \"1\".\"\"\"\n",
"    suffixes = {\"1\": \"st\", \"2\": \"nd\", \"3\": \"rd\"}\n",
"    if len(s) == 2 and s[0] == '1':  # teens\n",
"        suff = \"th\"\n",
"    else:  # all others use suffixes or default to \"th\"\n",
"        suff = suffixes.get(s[-1], \"th\")\n",
"    return s + suff\n",
"\n",
"# Add to the rules list here:\n",
"\n",
"\n",
"\n",
"\n",
"######################################################################\n",
"##################### ADD RULES TO HANDLE 'the' ######################\n",
"#\n",
"# The data contain expressions like \"the 3rd\". Add a lexical rule to\n",
"# include 'the' and a binary combination rule so that we have\n",
"# structures like\n",
"#\n",
"#     $D\n",
"#    /  \\\n",
"#  $Det  $D\n",
"#   |    |\n",
"#  the  3rd\n",
"\n",
"# Add to the rules list here:\n",
"\n",
"\n",
"\n",
"\n",
"######################################################################\n",
"################### ADD RULES FOR THE 4-DIGIT YEARS ##################\n",
"#\n",
"# Long (4-digit) years are easy: they can be interpreted as themselves.\n",
"# So these are rules of the form\n",
"#\n",
"#  $Y : int(year)\n",
"#   |\n",
"#  year\n",
"#\n",
"# where year is a 4-digit string.\n",
"\n",
"# This range will be fine for our data:\n",
"years_long = [str(y) for y in range(1900, 2100)]\n",
"\n",
"# Add to the rules list here:\n",
"\n",
"\n",
"\n",
"\n",
"######################################################################\n",
"################## ADD RULES FOR THE 2-DIGIT YEARS ###################\n",
"#\n",
"# 2-digit years are ambiguous about their century. Intuitively, for\n",
"# each 2-digit year, we need a rule for interpreting it in all\n",
"# potential centuries:\n",
"\n",
"# All 2-digit years:\n",
"years_short = [str(x).zfill(2) for x in range(0, 100)]\n",
"# A suitable range of century multipliers given our data:\n",
"centuries = range(1700, 2500, 100)\n",
"\n",
"# Add to the rules list here:\n",
"\n",
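"# One possible completion of the preceding TO DO sections (a hedged\n",
"# sketch; your own rules may well differ):\n",
"\n",
"# Ordinal day names like '1st' and '22nd':\n",
"rules += [Rule('$D', int2ordinal(d), int(d)) for d in days]\n",
"\n",
"# 'the', as in 'the 3rd'; $Det contributes no semantics:\n",
"rules += [Rule('$Det', 'the'),\n",
"          Rule('$D', '$Det $D', (lambda sems: sems[1]))]\n",
"\n",
"# 4-digit years denote themselves:\n",
"rules += [Rule('$Y', y, int(y)) for y in years_long]\n",
"\n",
"# 2-digit years get one rule per candidate century; the trained model\n",
"# will have to learn which century to favor:\n",
"rules += [Rule('$Y', y, c + int(y)) for y in years_short for c in centuries]\n",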
"\n",
"######################################################################\n",
"####################### ADD COMPOSITION RULES ########################\n",
"#\n",
"# Here are some initial composition rules. Together, these rules\n",
"# create structures like this:\n",
"#\n",
"#       $DATE\n",
"#      /     \\\n",
"#  $MonthDay  $Y\n",
"#    /  \\\n",
"#   $M  $D\n",
"#\n",
"# and the 'dt' key connects with the ops dictionary we defined above.\n",
"# The semantics is creating a tuple (dt year month day) for the\n",
"# sake of date_semantics. You'll clearly need rules for at least\n",
"#\n",
"# * $MonthDay in the reverse order\n",
"# * $Y before $MonthDay\n",
"\n",
"composition_rules = [\n",
"    # $M $D\n",
"    Rule('$MonthDay', '$M $D', lambda sems: ('dt', sems[0], sems[1])),\n",
"    # dt $Y $M $D\n",
"    Rule('$DATE', '$MonthDay $Y', lambda sems: (sems[0][0], sems[1], sems[0][1], sems[0][2]))\n",
"]\n",
"\n",
"# Add to composition_rules here:\n",
"\n",
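"# One possible completion (a hedged sketch; it assumes day-first and\n",
"# year-first orders both occur in the data, as the samples suggest):\n",
"composition_rules += [\n",
"    # $D $M, as in '4 May 2015':\n",
"    Rule('$MonthDay', '$D $M', lambda sems: ('dt', sems[1], sems[0])),\n",
"    # $Y $MonthDay, as in '2015 May 4':\n",
"    Rule('$DATE', '$Y $MonthDay', lambda sems: (sems[1][0], sems[0], sems[1][1], sems[1][2]))\n",
"]\n",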
"\n",
"rules += composition_rules\n",
"\n",
"# Add all the rules to the grammar:\n",
"for rule in rules:\n",
"    add_rule(gram, rule)\n",
"\n",
"print('Grammar now has %s lexical rules, %s unary rules, and %s binary rules' % \\\n",
"    (len(gram.lexical_rules), len(gram.unary_rules), len(gram.binary_rules)))"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"It can be informative to enter your own date strings and see what happens to them:"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"from parsing import parse_to_pretty_string\n",
"\n",
"def parse_and_interpret(s, grammar=gram):\n",
"    for i, parse in enumerate(grammar.parse_input(s)):\n",
"        print(\"=\" * 70)\n",
"        print(\"Parse %s:\" % (i+1), parse_to_pretty_string(parse))\n",
"        print(\"Denotation:\", execute(parse.semantics))"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "parse_and_interpret(\"5 4 2015 US\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check oracle accuracy" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [
"You should be able to get 100% oracle accuracy with your grammar:"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"def check_oracle_accuracy(grammar=None, examples=train, verbose=True):\n",
"    \"\"\"Use verbose=True to print out cases where your grammar can't\n",
"    find a correct denotation among all of its parses.\"\"\"\n",
"    oracle = 0\n",
"    for ex in examples:\n",
"        # All the denotations for all the parses:\n",
"        dens = [execute(parse.semantics) for parse in grammar.parse_input(ex.input)]\n",
"        if ex.denotation in dens:\n",
"            oracle += 1\n",
"        elif verbose:\n",
"            print(\"=\" * 70)\n",
"            print(ex.input)\n",
"            print(set(dens))\n",
"            print(repr(ex.denotation))\n",
"    percent_correct = (oracle/float(len(examples)))*100\n",
"    print(\"Oracle accuracy: %s / %s (%0.1f%%)\" % (oracle, len(examples), percent_correct))"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "check_oracle_accuracy(grammar=gram, examples=train, verbose=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"If you now have perfect (or at least high) oracle accuracy, then you can begin training effective date-interpreting models."
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"Here's some space for inspecting parses to see how to get good features from them:"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"parse = gram.parse_input(\"2015 05 04 US\")[0]\n",
"\n",
"print(parse)\n",
"\n",
"# Here's how to get the timezone string:\n",
"print(parse.children[1].rule.rhs[0])\n",
"\n",
"# Here's how to get the top-level $Y vs. $MonthDay ordering:\n",
"print(parse.children[0].rule.rhs)\n",
"\n",
"# Here's how to get the $MonthDay child names:\n",
"print(parse.children[0].children[1].rule.rhs)\n",
"\n",
"def lemmas(parse):\n",
"    \"\"\"Returns a list of (category, word) pairs\"\"\"\n",
"    labs = []\n",
"    for t in parse.children:\n",
"        if len(t.rule.rhs) == 1:\n",
"            labs.append((t.rule.lhs, t.rule.rhs[0]))\n",
"        else:\n",
"            labs += lemmas(t)\n",
"    return labs\n",
"\n",
"print(lemmas(parse))"
] }, { "cell_type": "markdown", "metadata": {}, "source": [
"__TO DO__: extend `date_rule_features` to capture more properties of the input, the denotation, and/or how they relate to each other."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"from scoring import Model\n",
"\n",
"\n",
"def date_rule_features(parse):\n",
"    features = defaultdict(float)\n",
"\n",
"    # This is the timezone string:\n",
"    tz = parse.children[1].rule.rhs[0]\n",
"\n",
"    return features\n",
"\n",
"\n",
"model = Model(grammar=gram, feature_fn=date_rule_features, executor=execute)"
] },
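{ "cell_type": "markdown", "metadata": {}, "source": [
"For concreteness, here's one hedged sketch of the kinds of features that can help. `example_date_features` is a hypothetical name for illustration; it isn't wired into `model` above, so fold whatever you like into `date_rule_features` instead. The idea: conjoin the timezone flag with the left-to-right category sequence of the parse (so the model can learn, say, that `US` favors `$M $D` order), and register the century of the predicted denotation (so the model can learn the conventions for 2-digit years):"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"def example_date_features(parse):\n",
"    \"\"\"Illustrative only: a sketch of input- and denotation-based features.\"\"\"\n",
"    features = defaultdict(float)\n",
"    # Timezone flag conjoined with the category sequence:\n",
"    tz = parse.children[1].rule.rhs[0]\n",
"    cats = tuple(cat for cat, word in lemmas(parse))\n",
"    features[(tz, cats)] += 1.0\n",
"    # Century of the denotation, if the parse denotes a real date:\n",
"    den = execute(parse.semantics)\n",
"    if den is not None:\n",
"        features[('century', den.year // 100)] += 1.0\n",
"    return features\n",
"\n",
"print(example_date_features(parse))"
] },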
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Train a model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"The training step (feel free to fiddle: longer optimization, lower learning rate):"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"from learning import latent_sgd\n",
"from metrics import DenotationAccuracyMetric\n",
"\n",
"trained_model = latent_sgd(\n",
"    model,\n",
"    train,\n",
"    training_metric=DenotationAccuracyMetric(),\n",
"    T=10,\n",
"    eta=0.1)"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspect the trained model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which examples are you getting wrong?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"def sample_errors(model=None, dataset=None, k=5):\n",
"    data = copy(dataset)\n",
"    random.shuffle(data)\n",
"    errors = 0\n",
"    for ex in data:\n",
"        parses = model.parse_input(ex.input)\n",
"        if not parses or (parses[0].denotation != ex.denotation):\n",
"            best_parse = parses[0] if parses else None\n",
"            prediction = parses[0].denotation if parses else None\n",
"            print(\"=\" * 70)\n",
"            print('Input:', ex.input)\n",
"            print('Best parse:', best_parse)\n",
"            print('Prediction:', prediction)\n",
"            print('Actual:', ex.denotation)\n",
"            errors += 1\n",
"        if errors >= k:\n",
"            return"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "sample_errors(model=trained_model, dataset=train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate the trained model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"Now evaluate your model on the held-out data. Turn in your 'denotation accuracy' value for the bake-off."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"from experiment import evaluate_model\n",
"from metrics import denotation_match_metrics\n",
"\n",
"evaluate_model(\n",
"    model=trained_model,\n",
"    examples=dev,\n",
"    examples_label=\"Dev\",\n",
"    metrics=denotation_match_metrics(),\n",
"    print_examples=False)"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Appendix: a rule-based baseline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"Was it worth the trouble? I think it was, in that you probably soundly beat the high-performance `dateutil` parser."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"def evaluate_dateutil(dataset=data['dev'], verbose=False):\n",
"    \"\"\"dataset should be one of data['train'] or data['dev']. Use\n",
"    verbose=True to see information about the errors.\"\"\"\n",
"    results = defaultdict(int)\n",
"    for ex in dataset:\n",
"        datestr = re.sub(r\"(non-)?US$\", \"\", ex.input)\n",
"        prediction = None\n",
"        try:\n",
"            prediction = parser.parse(datestr).date()\n",
"        except ValueError:\n",
"            if verbose:\n",
"                print(\"dateutil can't parse '%s'\" % datestr)\n",
"        results[prediction == ex.denotation] += 1\n",
"        if prediction and prediction != ex.denotation and verbose:\n",
"            print(\"dateutil predicted %s to mean %s\" % (datestr, repr(prediction)))\n",
"    acc = (results[True] / float(len(dataset)))*100\n",
"    return \"accuracy: %s / %s (%0.1f%%)\" % (results[True], len(dataset), acc)\n",
"\n",
"# Use verbose=True to see where and how dateutil stumbles:\n",
"for key in ('train', 'dev'):\n",
"    print(key, evaluate_dateutil(dataset=data[key], verbose=False))"
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }