{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Spanish POS-tagger with nltk\n", "\n", "NLTK comes with an english POS-tagger trained by default. For others languanges you will need to train it. Depending on the language, this is not a problem because, as you may know, NLTK has a lot of corpus that you can download and use.\n", "\n", "In this example I'm going to use a spanish corpus called cess_esp. As I want to test if the tagger performs good or not I need to evaluate it so I'll need some test data. For this reason I'm going to divide the corpus in two sets: 90% for training and 10% for testing." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from nltk.corpus import cess_esp\n", "\n", "sents = cess_esp.tagged_sents()\n", "training = []\n", "test = []\n", "for i in range(len(sents)):\n", " if i % 10:\n", " training.append(sents[i])\n", " else:\n", " test.append(sents[i])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "NLTK provides different types of taggers so, I'm going to train a few taggers and after that I will choose the most accurate one." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from nltk import UnigramTagger, BigramTagger, TrigramTagger\n", "from nltk.tag.hmm import HiddenMarkovModelTagger\n", "\n", "unigram_tagger = UnigramTagger(training)\n", "bigram_tagger = BigramTagger(training, backoff=unigram_tagger) # uses unigram tagger in case it can't tag a word\n", "trigram_tagger = TrigramTagger(training, backoff=unigram_tagger)\n", "hmm_tagger = HiddenMarkovModelTagger.train(training)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's evaluate each tagger with our test data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print 'UnigramTagger: %.1f %%' % (unigram_tagger.evaluate(test) * 100)\n", "print 'BigramTagger: %.1f %%' % (bigram_tagger.evaluate(test) * 100)\n", "print 'TrigramTagger: %.1f %%' % (trigram_tagger.evaluate(test) * 100)\n", "print 'HMM: %.1f %%' % (hmm_tagger.evaluate(test) * 100)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "UnigramTagger: 87.6 %\n", "BigramTagger: 89.4 %" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "TrigramTagger: 89.0 %" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "HMM: 89.9 %" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case _HiddenMarkovModelTagger_ is the best. Besides, this tagger can be tuned a bit more, so maybe you find useful this [link](http://nltk.googlecode.com/svn/trunk/doc/howto/probability.html).\n", "\n", "Depending on the size of the corpus, training a tagger can take long (especially the hmm tagger) and maybe you don't want a decrease of performance in your application in production. You can use the same technique that nltk uses with its default POS-tagger, you can dump a trained tagger with pickle and load it whenever you want." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "import pickle\n", "\n", "# Dump trained tagger\n", "with open('unigram_spanish.pickle', 'w') as fd:\n", " pickle.dump(unigram_tagger, fd)\n", "\n", "# Load tagger\n", "with open('unigram_spanish.pickle', 'r') as fd:\n", " tagger = pickle.load(fd)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you save *unigram_spanish.pickle* in one of the directories of the _nltk.data.path_ you will be able to load your tagger with _nltk.data.load_ function which is more clean because you won\u2019t have to use absolute paths or non intuitive relative paths." ] } ], "metadata": {} } ] }