{
 "metadata": {
  "name": "",
  "signature": "sha256:adc300dd5179069e1aa8b48456c9defb238a87dc30efd855a4f2f46f3da7b0e7"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from itertools import chain\n",
      "import nltk\n",
      "from sklearn.metrics import classification_report, confusion_matrix\n",
      "from sklearn.preprocessing import LabelBinarizer\n",
      "import sklearn\n",
      "import pycrfsuite\n",
      "\n",
      "print(sklearn.__version__)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "0.15-git\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Let's use CoNLL 2002 data to build a NER system\n",
      "\n",
      "CoNLL2002 corpus is available in NLTK. We use Spanish data."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "nltk.corpus.conll2002.fileids()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 2,
       "text": [
        "['esp.testa', 'esp.testb', 'esp.train', 'ned.testa', 'ned.testb', 'ned.train']"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%time\n",
      "train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))\n",
      "test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "CPU times: user 4.4 s, sys: 172 ms, total: 4.57 s\n",
        "Wall time: 4.59 s\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Data format:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "train_sents[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 4,
       "text": [
        "[('Melbourne', 'NP', 'B-LOC'),\n",
        " ('(', 'Fpa', 'O'),\n",
        " ('Australia', 'NP', 'B-LOC'),\n",
        " (')', 'Fpt', 'O'),\n",
        " (',', 'Fc', 'O'),\n",
        " ('25', 'Z', 'O'),\n",
        " ('may', 'NC', 'O'),\n",
        " ('(', 'Fpa', 'O'),\n",
        " ('EFE', 'NC', 'B-ORG'),\n",
        " (')', 'Fpt', 'O'),\n",
        " ('.', 'Fp', 'O')]"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Features\n",
      "\n",
      "Next, define some features. In this example we use word identity, word suffix, word shape and word POS tag; also, some information from nearby words is used. \n",
      "\n",
      "This makes a simple baseline, but you certainly can add and remove some features to get (much?) better results - experiment with it."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def word2features(sent, i):\n",
      "    word = sent[i][0]\n",
      "    postag = sent[i][1]\n",
      "    features = [\n",
      "        'bias',\n",
      "        'word.lower=' + word.lower(),\n",
      "        'word[-3:]=' + word[-3:],\n",
      "        'word[-2:]=' + word[-2:],\n",
      "        'word.isupper=%s' % word.isupper(),\n",
      "        'word.istitle=%s' % word.istitle(),\n",
      "        'word.isdigit=%s' % word.isdigit(),\n",
      "        'postag=' + postag,\n",
      "        'postag[:2]=' + postag[:2],\n",
      "    ]\n",
      "    if i > 0:\n",
      "        word1 = sent[i-1][0]\n",
      "        postag1 = sent[i-1][1]\n",
      "        features.extend([\n",
      "            '-1:word.lower=' + word1.lower(),\n",
      "            '-1:word.istitle=%s' % word1.istitle(),\n",
      "            '-1:word.isupper=%s' % word1.isupper(),\n",
      "            '-1:postag=' + postag1,\n",
      "            '-1:postag[:2]=' + postag1[:2],\n",
      "        ])\n",
      "    else:\n",
      "        features.append('BOS')\n",
      "        \n",
      "    if i < len(sent)-1:\n",
      "        word1 = sent[i+1][0]\n",
      "        postag1 = sent[i+1][1]\n",
      "        features.extend([\n",
      "            '+1:word.lower=' + word1.lower(),\n",
      "            '+1:word.istitle=%s' % word1.istitle(),\n",
      "            '+1:word.isupper=%s' % word1.isupper(),\n",
      "            '+1:postag=' + postag1,\n",
      "            '+1:postag[:2]=' + postag1[:2],\n",
      "        ])\n",
      "    else:\n",
      "        features.append('EOS')\n",
      "                \n",
      "    return features\n",
      "\n",
      "\n",
      "def sent2features(sent):\n",
      "    return [word2features(sent, i) for i in range(len(sent))]\n",
      "\n",
      "def sent2labels(sent):\n",
      "    return [label for token, postag, label in sent]\n",
      "\n",
      "def sent2tokens(sent):\n",
      "    return [token for token, postag, label in sent]    "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This is what word2features extracts:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "sent2features(train_sents[0])[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 6,
       "text": [
        "['bias',\n",
        " 'word.lower=melbourne',\n",
        " 'word[-3:]=rne',\n",
        " 'word[-2:]=ne',\n",
        " 'word.isupper=False',\n",
        " 'word.istitle=True',\n",
        " 'word.isdigit=False',\n",
        " 'postag=NP',\n",
        " 'postag[:2]=NP',\n",
        " 'BOS',\n",
        " '+1:word.lower=(',\n",
        " '+1:word.istitle=False',\n",
        " '+1:word.isupper=False',\n",
        " '+1:postag=Fpa',\n",
        " '+1:postag[:2]=Fp']"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Extract the features from the data:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%time\n",
      "X_train = [sent2features(s) for s in train_sents]\n",
      "y_train = [sent2labels(s) for s in train_sents]\n",
      "\n",
      "X_test = [sent2features(s) for s in test_sents]\n",
      "y_test = [sent2labels(s) for s in test_sents]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "CPU times: user 3.8 s, sys: 353 ms, total: 4.15 s\n",
        "Wall time: 4.16 s\n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Train the model\n",
      "\n",
      "To train the model, we create pycrfsuite.Trainer, load the training data and call 'train' method. \n",
      "First, create pycrfsuite.Trainer and load the training data to CRFsuite:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%time\n",
      "trainer = pycrfsuite.Trainer(verbose=False)\n",
      "\n",
      "for xseq, yseq in zip(X_train, y_train):\n",
      "    trainer.append(xseq, yseq)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "CPU times: user 4.14 s, sys: 72.4 ms, total: 4.21 s\n",
        "Wall time: 4.21 s\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Set training parameters. We will use L-BFGS training algorithm (it is default) with Elastic Net (L1 + L2) regularization."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "trainer.set_params({\n",
      "    'c1': 1.0,   # coefficient for L1 penalty\n",
      "    'c2': 1e-3,  # coefficient for L2 penalty\n",
      "    'max_iterations': 50,  # stop earlier\n",
      "\n",
      "    # include transitions that are possible, but not observed\n",
      "    'feature.possible_transitions': True\n",
      "})"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Possible parameters for the default training algorithm:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "trainer.params()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 10,
       "text": [
        "['feature.minfreq',\n",
        " 'feature.possible_states',\n",
        " 'feature.possible_transitions',\n",
        " 'c1',\n",
        " 'c2',\n",
        " 'max_iterations',\n",
        " 'num_memories',\n",
        " 'epsilon',\n",
        " 'period',\n",
        " 'delta',\n",
        " 'linesearch',\n",
        " 'max_linesearch']"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Train the model:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%time\n",
      "trainer.train('conll2002-esp.crfsuite')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "CPU times: user 21.4 s, sys: 237 ms, total: 21.6 s\n",
        "Wall time: 21.5 s\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "trainer.train saves model to a file:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!ls -lh ./conll2002-esp.crfsuite"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "-rw-r--r--  1 kmike  staff   600K 16 \u043c\u0430\u0439 01:12 ./conll2002-esp.crfsuite\r\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Make predictions\n",
      "\n",
      "To use the trained model, create pycrfsuite.Tagger, open the model and use \"tag\" method:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "tagger = pycrfsuite.Tagger()\n",
      "tagger.open('conll2002-esp.crfsuite')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 13,
       "text": [
        "<contextlib.closing at 0x10b517dd8>"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's tag a sentence to see how it works:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "example_sent = test_sents[0]\n",
      "print(' '.join(sent2tokens(example_sent)), end='\\n\\n')\n",
      "\n",
      "print(\"Predicted:\", ' '.join(tagger.tag(sent2features(example_sent))))\n",
      "print(\"Correct:  \", ' '.join(sent2labels(example_sent)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "La Coru\u00f1a , 23 may ( EFECOM ) .\n",
        "\n",
        "Predicted: B-LOC I-LOC O O O O B-ORG O O\n",
        "Correct:   B-LOC I-LOC O O O O B-ORG O O\n"
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Evaluate the model"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def bio_classification_report(y_true, y_pred):\n",
      "    \"\"\"\n",
      "    Classification report for a list of BIO-encoded sequences.\n",
      "    It computes token-level metrics and discards \"O\" labels.\n",
      "    \n",
      "    Note that it requires scikit-learn 0.15+ (or a version from github master)\n",
      "    to calculate averages properly!\n",
      "    \"\"\"\n",
      "    lb = LabelBinarizer()\n",
      "    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))\n",
      "    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))\n",
      "        \n",
      "    tagset = set(lb.classes_) - {'O'}\n",
      "    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])\n",
      "    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}\n",
      "    \n",
      "    return classification_report(\n",
      "        y_true_combined,\n",
      "        y_pred_combined,\n",
      "        labels = [class_indices[cls] for cls in tagset],\n",
      "        target_names = tagset,\n",
      "    )"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 15
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Predict entity labels for all sentences in our testing set ('testb' Spanish data):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%time\n",
      "y_pred = [tagger.tag(xseq) for xseq in X_test]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "CPU times: user 598 ms, sys: 17.4 ms, total: 616 ms\n",
        "Wall time: 615 ms\n"
       ]
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "..and check the result. Note this report is not comparable to results in CONLL2002 papers because here we check per-token results (not per-entity). Per-entity numbers will be worse.  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(bio_classification_report(y_test, y_pred))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "             precision    recall  f1-score   support\n",
        "\n",
        "      B-LOC       0.78      0.75      0.76      1084\n",
        "      I-LOC       0.87      0.93      0.90       634\n",
        "     B-MISC       0.69      0.47      0.56       339\n",
        "     I-MISC       0.87      0.93      0.90       634\n",
        "      B-ORG       0.82      0.87      0.84       735\n",
        "      I-ORG       0.87      0.93      0.90       634\n",
        "      B-PER       0.61      0.49      0.54       557\n",
        "      I-PER       0.87      0.93      0.90       634\n",
        "\n",
        "avg / total       0.81      0.81      0.80      5251\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Let's check what classifier learned"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from collections import Counter\n",
      "info = tagger.info()\n",
      "\n",
      "def print_transitions(trans_features):\n",
      "    for (label_from, label_to), weight in trans_features:\n",
      "        print(\"%-6s -> %-7s %0.6f\" % (label_from, label_to, weight))\n",
      "\n",
      "print(\"Top likely transitions:\")\n",
      "print_transitions(Counter(info.transitions).most_common(15))\n",
      "\n",
      "print(\"\\nTop unlikely transitions:\")\n",
      "print_transitions(Counter(info.transitions).most_common()[-15:])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Top likely transitions:\n",
        "B-ORG  -> I-ORG   8.631963\n",
        "I-ORG  -> I-ORG   7.833706\n",
        "B-PER  -> I-PER   6.998706\n",
        "B-LOC  -> I-LOC   6.913675\n",
        "I-MISC -> I-MISC  6.129735\n",
        "B-MISC -> I-MISC  5.538291\n",
        "I-LOC  -> I-LOC   4.983567\n",
        "I-PER  -> I-PER   3.748358\n",
        "B-ORG  -> B-LOC   1.727090\n",
        "B-PER  -> B-LOC   1.388267\n",
        "B-LOC  -> B-LOC   1.240278\n",
        "O      -> O       1.197929\n",
        "O      -> B-ORG   1.097062\n",
        "I-PER  -> B-LOC   1.083332\n",
        "O      -> B-MISC  1.046113\n",
        "\n",
        "Top unlikely transitions:\n",
        "I-PER  -> B-ORG   -2.056130\n",
        "I-LOC  -> I-ORG   -2.143940\n",
        "B-ORG  -> I-MISC  -2.167501\n",
        "I-PER  -> I-ORG   -2.369380\n",
        "B-ORG  -> I-PER   -2.378110\n",
        "I-MISC -> I-PER   -2.458788\n",
        "B-LOC  -> I-PER   -2.516414\n",
        "I-ORG  -> I-MISC  -2.571973\n",
        "I-LOC  -> B-PER   -2.697791\n",
        "I-LOC  -> I-PER   -3.065950\n",
        "I-ORG  -> I-PER   -3.364434\n",
        "O      -> I-PER   -7.322841\n",
        "O      -> I-MISC  -7.648246\n",
        "O      -> I-ORG   -8.024126\n",
        "O      -> I-LOC   -8.333815\n"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can see that, for example, it is very likely that the beginning of an organization name (B-ORG) will be followed by a token inside organization name (I-ORG), but transitions to I-ORG from tokens with other labels are penalized. Also note I-PER -> B-LOC transition: a positive weight means that model thinks that a person name is often followed by a location.\n",
      "\n",
      "Check the state features:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def print_state_features(state_features):\n",
      "    for (attr, label), weight in state_features:\n",
      "        print(\"%0.6f %-6s %s\" % (weight, label, attr))    \n",
      "\n",
      "print(\"Top positive:\")\n",
      "print_state_features(Counter(info.state_features).most_common(20))\n",
      "\n",
      "print(\"\\nTop negative:\")\n",
      "print_state_features(Counter(info.state_features).most_common()[-20:])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Top positive:\n",
        "8.886516 B-ORG  word.lower=efe-cantabria\n",
        "8.743642 B-ORG  word.lower=psoe-progresistas\n",
        "5.769032 B-LOC  -1:word.lower=cantabria\n",
        "5.195429 I-LOC  -1:word.lower=calle\n",
        "5.116821 O      word.lower=mayo\n",
        "4.990871 O      -1:word.lower=d\u00eda\n",
        "4.910915 I-ORG  -1:word.lower=l\n",
        "4.721572 B-MISC word.lower=diversia\n",
        "4.676259 B-ORG  word.lower=telef\u00f3nica\n",
        "4.334354 B-ORG  word[-2:]=-e\n",
        "4.149862 B-ORG  word.lower=amena\n",
        "4.141370 B-ORG  word.lower=terra\n",
        "3.942852 O      word.istitle=False\n",
        "3.926397 B-ORG  word.lower=continente\n",
        "3.924672 B-ORG  word.lower=acesa\n",
        "3.888706 O      word.lower=euro\n",
        "3.856445 B-PER  -1:word.lower=seg\u00fan\n",
        "3.812373 B-MISC word.lower=exteriores\n",
        "3.807582 I-MISC -1:word.lower=1.9\n",
        "3.807098 B-MISC word.lower=sanidad\n",
        "\n",
        "Top negative:\n",
        "-1.965379 O      word.lower=fundaci\u00f3n\n",
        "-1.981541 O      -1:word.lower=brit\u00e1nica\n",
        "-2.118347 O      word.lower=061\n",
        "-2.190653 B-PER  word[-3:]=nes\n",
        "-2.226373 B-ORG  postag=SP\n",
        "-2.226373 B-ORG  postag[:2]=SP\n",
        "-2.260972 O      word[-3:]=uia\n",
        "-2.384920 O      -1:word.lower=secci\u00f3n\n",
        "-2.483009 O      word[-2:]=s.\n",
        "-2.535050 I-LOC  BOS\n",
        "-2.583123 O      -1:word.lower=s\u00e1nchez\n",
        "-2.585756 O      postag[:2]=NP\n",
        "-2.585756 O      postag=NP\n",
        "-2.588899 O      word[-2:]=om\n",
        "-2.738583 O      -1:word.lower=carretera\n",
        "-2.913103 O      word.istitle=True\n",
        "-2.926560 O      word[-2:]=nd\n",
        "-2.946862 I-PER  -1:word.lower=san\n",
        "-2.954094 B-PER  -1:word.lower=del\n",
        "-3.529449 O      word.isupper=True\n"
       ]
      }
     ],
     "prompt_number": 19
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Some observations:\n",
      "\n",
      "* **8.743642 B-ORG  word.lower=psoe-progresistas** - the model remembered names of some entities - maybe it is overfit, or maybe our features are not adequate, or maybe remembering is indeed helpful;\n",
      "* **5.195429 I-LOC  -1:word.lower=calle**: \"calle\" is a street in Spanish; model learns that if a previous word was \"calle\" then the token is likely a part of location;\n",
      "* **-3.529449 O      word.isupper=True**, ** -2.913103 O      word.istitle=True **: UPPERCASED or TitleCased words are likely entities of some kind;\n",
      "* **-2.585756 O      postag=NP** - proper nouns (NP is a proper noun in the Spanish tagset) are often entities."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## What to do next\n",
      "\n",
      "1. Load 'testa' Spanish data.\n",
      "2. Use it to develop better features and to find best model parameters.\n",
      "3. Apply the model to 'testb' data again.\n",
      "\n",
      "The model in this notebook is just a starting point; you certainly can do better!"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}