{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Spanish POS-tagger with nltk\n", "\n", "NLTK comes with an english POS-tagger trained by default. For others languanges you will need to train it. Depending on the language, this is not a problem because, as you may know, NLTK has a lot of corpus that you can download and use.\n", "\n", "In this example I'm going to use a spanish corpus called cess_esp. As I want to test if the tagger performs good or not I need to evaluate it so I'll need some test data. For this reason I'm going to divide the corpus in two sets: 90% for training and 10% for testing." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from nltk.corpus import cess_esp\n", "\n", "sents = cess_esp.tagged_sents()\n", "training = []\n", "test = []\n", "for i in range(len(sents)):\n", " if i % 10:\n", " training.append(sents[i])\n", " else:\n", " test.append(sents[i])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "NLTK provides different types of taggers so, I'm going to train a few taggers and after that I will choose the most accurate one." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from nltk import UnigramTagger, BigramTagger, TrigramTagger\n", "from nltk.tag.hmm import HiddenMarkovModelTagger\n", "\n", "unigram_tagger = UnigramTagger(training)\n", "bigram_tagger = BigramTagger(training, backoff=unigram_tagger) # uses unigram tagger in case it can't tag a word\n", "trigram_tagger = TrigramTagger(training, backoff=unigram_tagger)\n", "hmm_tagger = HiddenMarkovModelTagger.train(training)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's evaluate each tagger with our test data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print 'UnigramTagger: %.1f %%' % (unigram_tagger.evaluate(test) * 100)\n", "print 'BigramTagger: %.1f %%' % (bigram_tagger.evaluate(test) * 100)\n", "print 'TrigramTagger: %.1f %%' % (trigram_tagger.evaluate(test) * 100)\n", "print 'HMM: %.1f %%' % (hmm_tagger.evaluate(test) * 100)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "UnigramTagger: 87.6 %\n", "BigramTagger: 89.4 %" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "TrigramTagger: 89.0 %" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "HMM: 89.9 %" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case _HiddenMarkovModelTagger_ is the best. Besides, this tagger can be tuned a bit more, so maybe you find useful this [link](http://nltk.googlecode.com/svn/trunk/doc/howto/probability.html).\n", "\n", "Depending on the size of the corpus, training a tagger can take long (especially the hmm tagger) and maybe you don't want a decrease of performance in your application in production. You can use the same technique that nltk uses with its default POS-tagger, you can dump a trained tagger with pickle and load it whenever you want." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "import pickle\n", "\n", "# Dump trained tagger\n", "with open('unigram_spanish.pickle', 'w') as fd:\n", " pickle.dump(unigram_tagger, fd)\n", "\n", "# Load tagger\n", "with open('unigram_spanish.pickle', 'r') as fd:\n", " tagger = pickle.load(fd)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you save *unigram_spanish.pickle* in one of the directories of the _nltk.data.path_ you will be able to load your tagger with _nltk.data.load_ function which is more clean because you won\u2019t have to use absolute paths or non intuitive relative paths." ] } ], "metadata": {} } ] }