{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "0a68f3d1",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import nltk\n",
    "import tagutils; reload(tagutils)\n",
    "from tagutils import *\n",
    "from IPython.core.display import HTML\n",
    "from nltk.corpus import brown\n",
    "import random as pyrand\n",
    "from tagutils import *"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f36a27e",
   "metadata": {},
   "source": [
    "# Evaluation Framework"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "5325e07c",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "sents = list(brown.tagged_sents())\n",
    "n = len(sents)\n",
    "test = sorted(list(set(range(0,n,10))))\n",
    "training = sorted(list(set(range(n))-set(test)))\n",
    "training_set = [sents[i] for i in training]\n",
    "test_set = [sents[i] for i in test]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "3ee2d098",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "51606\n",
      "5734\n"
     ]
    }
   ],
   "source": [
    "print len(training_set)\n",
    "print len(test_set)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "290c50e0",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9236947791164659"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t0 = nltk.DefaultTagger('NN')\n",
    "t1 = nltk.UnigramTagger(training_set, backoff=t0)\n",
    "t2 = nltk.BigramTagger(training_set, backoff=t1)\n",
    "t2.evaluate(test_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e909843a",
   "metadata": {},
   "source": [
    "# Wordnet-Based Improvements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "79bafa77",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import nltk.tag.api\n",
    "# help(nltk.tag.api.TaggerI)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b913049e",
   "metadata": {},
   "source": [
    "Your homework consists of implementing new taggers based on Wordnet.\n",
    "With regular taggers, we have a problem of sparsity; that is, we don't know\n",
    "what tag to assign to a word if we have never seen it in a context.\n",
    "\n",
    "However, for many words, Wordnet may give us useful information to help with\n",
    "tagging. You need to work out some ideas, implement them, and test them.\n",
    "\n",
    "There are different implementation strategies, but a simple one might be:\n",
    "\n",
    "- write classes that map token sequences to other token sequences using\n",
    "  WordNet; for example, you might map an input sentence to some collection\n",
    "  of hyponyms\n",
    "- then, apply the regular NLTK n-gram taggers to the modified output sequences\n",
    "- use backoff (as above) when the WordNet mapping fails for some reason\n",
    "  (you can't find the word, or maybe the mapping would be ambiguous and\n",
    "  you don't know how to handle it)\n",
    "\n",
    "This may not be the best strategy, but it's a good way of getting started.\n",
    "\n",
    "Another strategy is to use use WordNet to generate a cloud of related words\n",
    "around a given word, and then see whether you can find bigrams in an existing\n",
    "model for any of the related words.\n",
    "\n",
    "Implement your model(s) so that they conform to the NLTK tagging APIs,\n",
    "perform evaluations on the training and test sets defined above,\n",
    "and be ready to present your results (idea, evaluation, results) in \n",
    "the exercises.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49438688",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}