{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "242cdce7-2d43-4944-b207-3d59a07d6fc7"
    }
   },
   "source": [
    "# Python Tutorial 1: Part-of-Speech Tagging 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "dc82295e-32bd-42db-ba74-90a6a5ec5328"
    }
   },
   "source": [
    "**(C) 2016-2024 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "f38c160f-227a-4ed1-b097-259ef6e8d592"
    }
   },
   "source": [
    "**Version:** 1.6, January 2024"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "adfaa63c-e346-4efa-97f8-f2b019ce90e2"
    }
   },
   "source": [
    "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-notebooks)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Prerequisites:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -U nltk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -U conllu"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "faa12275-e06d-41a6-a463-9ac596b9f97f"
    }
   },
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "9adb2cba-d0ad-4128-8789-d89740954f65"
    }
   },
   "source": [
    "This is a tutorial about developing simple [Part-of-Speech taggers](https://en.wikipedia.org/wiki/Part-of-speech_tagging) using Python 3.x and the [NLTK](http://nltk.org/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "84d90485-5d7e-4458-8b7e-34dbae2b60b5"
    }
   },
   "source": [
    "This tutorial was developed as part of the course material for Advanced Natural Language Processing classes at [Indiana University](https://www.indiana.edu/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "42f6a01f-1953-421c-91e7-6c13456abb2b"
    }
   },
   "source": [
    "The [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) is distributed as part of the [NLTK Data](http://www.nltk.org/data.html). To be able to use the [NLTK Data](http://www.nltk.org/data.html) and the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) on your local machine, you need to install the data as described on [the Installing NLTK Data page](http://www.nltk.org/data.html). If you want to use IPython on your local machine, I recommend installing a Python 3.x distribution, for example the most recent [Anaconda release](https://www.continuum.io/downloads), and reading the instructions on how to run [IPython on Anaconda](http://jupyter.readthedocs.io/en/latest/install.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "c661ca33-2c25-4738-ad7d-5aa38ec431f6"
    }
   },
   "source": [
    "## Part-of-Speech Tagging"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "d6e4c56f-d256-4ac0-8a80-e0486df4854c"
    }
   },
   "source": [
    "We refer to Part-of-Speech (PoS) tagging as the task of assigning class information to individual words (tokens) in some text. The tags are defined in tagsets that specify character sequences representing sets of, for example, lexical, morphological, syntactic, or semantic features. For more details, see the [Categorizing and Tagging Words chapter](http://www.nltk.org/book/ch05.html) of the [NLTK book](http://www.nltk.org/book/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "92aaaa7a-75f5-4a1d-b325-dd93533695bf"
    }
   },
   "source": [
    "## Using the Brown Corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "2b7566b0-f2f7-4e6c-bf85-d1bab4a5fb0b"
    }
   },
   "source": [
    "The documentation of the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) design and properties can be found on [this page](http://clu.uni.no/icame/brown/bcm.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "d8d4ad25-0df1-4db3-bb5c-dee7a5d5cc6f"
    }
   },
   "source": [
    "Using the following line of code we are importing the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) into the running Python instance. This will make the tokens and PoS-tags from the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) available for further processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "nbpresent": {
     "id": "0aa4658a-a396-434e-b836-9bb92d56e0d6"
    }
   },
   "outputs": [],
   "source": [
    "from nltk.corpus import brown"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "99fa6dff-cf35-493f-8ca4-ad5f83e60dcb"
    }
   },
   "source": [
    "Our goal is to assign PoS-tags to a sequence of words that represent a phrase, utterance, or sentence."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "f05be013-9d8d-483f-ad2f-01e789ea1bee"
    }
   },
   "source": [
    "Let us assume that the probability of a sequence of 5 tags $t_1\\ t_2\\ t_3\\ t_4\\ t_5$ given a sequence of 5 tokens $w_1\\ w_2\\ w_3\\ w_4\\ w_5$ is $P(t_1\\ t_2\\ t_3\\ t_4\\ t_5\\ |\\ w_1\\ w_2\\ w_3\\ w_4\\ w_5)$ and can be computed as the product of two kinds of probabilities: the probability of one tag given the preceding tag, e.g. the probability of tag 2 given that tag 1 occurred: $P(t_2\\ |\\ t_1)$, and the probability of a word given a specific tag, e.g. the probability of word 2 given that tag 2 occurred: $P(w_2\\ |\\ t_2)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "a9e508b1-07f3-4650-b851-ab697c896346"
    }
   },
   "source": [
    "Let us assume that we use two extra symbols *S* and *E*. *S* stands for sentence beginning and *E* for sentence end. We use these symbols to keep track of different distributions of tags and tokens relative to sentence positions. The token *the* for example is very unlikely to occur in sentence final and more likely to occur in sentence initial position."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "8932f0f7-5571-4a5f-af97-bf25832fbe78"
    }
   },
   "source": [
    "$$P(t_1 \\dots t_5\\ |\\ w_1 \\dots w_5) = P(t_1|S)\\ P(w_1|t_1)\\ P(t_2|t_1)\\ P(w_2|t_2)\\ P(t_3|t_2)\\ P(w_3|t_3)\\ P(t_4|t_3)\\ P(w_4|t_4)\\ P(t_5|t_4)\\ P(w_5|t_5)\\ P(E|t_5)$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "008190c1-2626-4fab-8f0c-fe8b3627f6a0"
    }
   },
   "source": [
    "This equation can be abbreviated as follows:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "f4042ad3-3854-427b-ae8a-5e629a92f532"
    }
   },
   "source": [
    "$$P(t_1 \\dots t_5\\ |\\ w_1 \\dots w_5) = P(t_1\\ |\\ S)\\ P(w_1\\ |\\ t_1)\\ P(E\\ |\\ t_5)\\ \\prod_{i=1}^{4} P(t_{i+1}\\ |\\ t_i)\\ P(w_{i+1}\\ |\\ t_{i+1})$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "d50f24e1-6ce5-4541-a2f9-5831b97e5fcf"
    }
   },
   "source": [
    "We extract the probabilities for a word (or token) given that a certain tag occurred, that is $P(w_1\\ |\\ t_1)$, from the frequency profile for tags and tokens from the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus). The necessary data-structure should be loaded and in memory after executing the code cell above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "67f24dcd-f017-4893-90fd-c5f4a3c421e9"
    }
   },
   "source": [
    "Since we loaded the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) into memory, we can now use specific methods to access tokens and PoS-tags from the corpus. The function *brown.tagged_words()* returns a list of tuples *(token, tag)* in the order in which they occur in the corpus. The code below first displays this list and then unzips it, storing the tokens in the variable *tokens* and the tags in the variable *tags*. Note that the \\* operator is used here to unpack the list into separate arguments for the *zip*-function, which then returns two sequences that are assigned to the variables *tokens* and *tags* respectively. For more details, see the [documentation of the Python zip-function](https://docs.python.org/3.5/library/functions.html#zip)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('The', 'AT'), ('Fulton', 'NP-TL'), ...]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "brown.tagged_words()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "nbpresent": {
     "id": "d964006f-65f3-4d16-b997-405275447e18"
    }
   },
   "outputs": [],
   "source": [
    "tokens, tags = zip(*brown.tagged_words())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "c7580aa2-cde5-40e9-ac4b-3b2e989c2d45"
    }
   },
   "source": [
    "You can inspect the resulting list of *tokens* by printing it out:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "nbpresent": {
     "id": "295275f9-1bec-4e3d-b365-e86369c4be26"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('The',\n",
       " 'Fulton',\n",
       " 'County',\n",
       " 'Grand',\n",
       " 'Jury',\n",
       " 'said',\n",
       " 'Friday',\n",
       " 'an',\n",
       " 'investigation',\n",
       " 'of',\n",
       " \"Atlanta's\",\n",
       " 'recent',\n",
       " 'primary',\n",
       " 'election',\n",
       " 'produced',\n",
       " '``',\n",
       " 'no',\n",
       " 'evidence',\n",
       " \"''\",\n",
       " 'that')"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokens[:20]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "26842f69-cc6e-46d5-9838-d626c32a9c31"
    }
   },
   "source": [
    "You can print the *tags* as well:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "nbpresent": {
     "id": "35384dca-0220-473a-a2c0-8ddc674eb5f9"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('AT',\n",
       " 'NP-TL',\n",
       " 'NN-TL',\n",
       " 'JJ-TL',\n",
       " 'NN-TL',\n",
       " 'VBD',\n",
       " 'NR',\n",
       " 'AT',\n",
       " 'NN',\n",
       " 'IN',\n",
       " 'NP$',\n",
       " 'JJ',\n",
       " 'NN',\n",
       " 'NN',\n",
       " 'VBD',\n",
       " '``',\n",
       " 'AT',\n",
       " 'NN',\n",
       " \"''\",\n",
       " 'CS')"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tags[:20]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "3a11bf72-293e-4f70-a90a-9756ff2dfbf6"
    }
   },
   "source": [
    "The sequences of *tokens* and *tags* are aligned, that is, the first tag in the *tags* list belongs to the first token in the *tokens* list. You can print a token-tag pair out in the following way:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "nbpresent": {
     "id": "d620c3ae-eebd-434c-b209-3a115b694c33"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Token: The Tag: AT\n"
     ]
    }
   ],
   "source": [
    "print(\"Token:\", tokens[0], \"Tag:\", tags[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "186dd9d5-1e7a-49ce-8406-f4e4822a03f3"
    }
   },
   "source": [
    "To create a frequency profile of tags for example, we can make use of the [*Counter* container datatype](http://docs.python.org/3/library/collections.html#collections.Counter) from the [*collections* module](http://docs.python.org/3/library/collections.html). We import the [*Counter* datatype](http://docs.python.org/3/library/collections.html#collections.Counter) with the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "nbpresent": {
     "id": "c5bf9e25-c6ec-484b-b127-1dbca8c1842e"
    }
   },
   "outputs": [],
   "source": [
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "d76c4246-be56-47c4-8f65-bd07f7ecf033"
    }
   },
   "source": [
    "We can create a frequency profile of the tags from the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) and store it in the variable *tagCounter* using the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "nbpresent": {
     "id": "0cc89377-598c-41cd-8446-a040c8d22cb0"
    }
   },
   "outputs": [],
   "source": [
    "tagCounter = Counter(tags)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "0703707c-8e25-4413-b171-2739c6e246fc"
    }
   },
   "source": [
    "The *tagCounter* object is now a hash-table with *tags* as keys and their frequencies as values. Accessing the frequency of a specific *tag* can be achieved using the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "nbpresent": {
     "id": "eee58a53-5c33-4be1-b5c8-b2450f8babf8"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "55110"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tagCounter[\"NNS\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "1d7e6206-b7e0-4893-a1f5-f6e08c3015aa"
    }
   },
   "source": [
    "The frequency of a specific *token* can be accessed by generating a frequency profile from the *token*-list in the same way as for *tags*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "nbpresent": {
     "id": "a3cbf5f1-bd72-4961-9af0-83995147c7dc"
    }
   },
   "outputs": [],
   "source": [
    "tokenCounter = Counter(tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "0336f08a-95e2-48bb-ac96-c2af8d4dd185"
    }
   },
   "source": [
    "We access the *token* frequency in the same way as for *tags*:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "nbpresent": {
     "id": "3444d95e-b21a-43ab-ab1e-7821a560368b"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "62713"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenCounter[\"the\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "e832f5fa-ded5-44b9-8779-8b8becfb3fbd"
    }
   },
   "source": [
    "Since one type (or word) in the [Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus) can have more than one corresponding tag with a specific frequency, we need to store this information in a specific datastructure. Here we use a *defaultdict* from the *collections* module that maps each token to a *Counter* of its tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "nbpresent": {
     "id": "d9458745-f60f-48b5-812f-970e0113e9f6"
    }
   },
   "outputs": [],
   "source": [
    "from collections import defaultdict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "d1ce5c03-1d28-445f-96db-12b1a42c7f7b"
    }
   },
   "source": [
    "The following loop reads the individual *token* and *tag* pairs from the list of *token-tag*-tuples returned by *brown.tagged_words()* and increments their count in the *dictionary* of *Counter* datastructures."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "nbpresent": {
     "id": "d94e1423-9978-4505-a62a-98b306d09b0f"
    }
   },
   "outputs": [],
   "source": [
    "tokenTags = defaultdict(Counter)\n",
    "for token, tag in brown.tagged_words():\n",
    "    tokenTags[token][tag] += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "3e29cc56-4ac4-4f9c-8a26-51b5daee862b"
    }
   },
   "source": [
    "We can now ask for the *Counter* datastructure for a specific key, for example *loves*. The *Counter* datastructure is a hash-table with tags as keys and the corresponding frequency as values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "nbpresent": {
     "id": "733452ca-de40-441e-bb41-2c2a43179770"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({'VBZ': 17, 'NNS': 2})"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenTags[\"loves\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "nbpresent": {
     "id": "425e5caf-3ef8-4e36-a21b-2acf7b4802b5"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "62288"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenTags[\"the\"][\"AT\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1161192"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "0a966380-63e5-43cf-926f-7ac0655ebcc9"
    }
   },
   "source": [
    "For the calculation of the probability of a $tag_2$ given that a $tag_1$ occurred, that is $P(tag_2\\ |\\ tag_1)$, we will need to count the bigrams from the *tags* list. The *ngrams* function from the NLTK *util* module provides a convenient way to achieve this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "nbpresent": {
     "id": "c7b78047-d907-4ee1-aede-ecb624c7ba04"
    }
   },
   "outputs": [],
   "source": [
    "from nltk.util import ngrams"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "5623d2f0-bd87-4098-a87b-a87639ff1992"
    }
   },
   "source": [
    "As for the *tokenTags* datatype above, we can create a *tags* bigram model using a dictionary of *Counter* datatypes. The dictionary keys will be the first tag of the tag-bigram. The value will contain a Counter datatype with the second tag of the tag-bigram as the key and the frequency of the bigram as value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "nbpresent": {
     "id": "c6e8d3d7-b974-46c6-9a04-25f236cda1e9"
    }
   },
   "outputs": [],
   "source": [
    "tagTags = defaultdict(Counter)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "78642c69-3b32-44e0-9209-7a1f5d51a600"
    }
   },
   "source": [
    "Using the *ngrams* function we generate the list of bigrams from the tags list and store it in the variable *posBigrams* using the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "nbpresent": {
     "id": "d2df08e5-fa00-49c3-89e4-691aa4201457"
    }
   },
   "outputs": [],
   "source": [
    "posBigrams = list(ngrams(tags, 2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "f2089b44-60a8-4ca1-92d1-adb1a15b2aa8"
    }
   },
   "source": [
    "The following loop goes through the list of bigram tuples, assigns the left bigram tag to the variable *tag1* and the right bigram tag to the variable *tag2*, and increments the count of the bigram in the *tagTags* datastructure:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "nbpresent": {
     "id": "ff4e5bf1-5615-4031-aa88-45d39bc81942"
    }
   },
   "outputs": [],
   "source": [
    "for tag1, tag2 in posBigrams:\n",
    "    tagTags[tag1][tag2] += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "ed464fb4-3460-4ef0-a950-8578ec85148f"
    }
   },
   "source": [
    "We can now list all *tags* that follow the *AT* tag with the corresponding frequency:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "nbpresent": {
     "id": "9dd4b2f2-80af-46c6-98af-ad710e82621c"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({'NN': 48376,\n",
       "         'JJ': 19488,\n",
       "         'NNS': 7215,\n",
       "         'AP': 3007,\n",
       "         'NN-TL': 2565,\n",
       "         'NP': 2230,\n",
       "         'VBG': 1568,\n",
       "         'VBN': 1468,\n",
       "         'JJ-TL': 1414,\n",
       "         'QL': 1377,\n",
       "         'OD': 1251,\n",
       "         'CD': 981,\n",
       "         'NN$': 907,\n",
       "         'NP-TL': 809,\n",
       "         'JJT': 675,\n",
       "         'JJR': 630,\n",
       "         '``': 620,\n",
       "         'NPS': 588,\n",
       "         'VBN-TL': 390,\n",
       "         'RB': 350,\n",
       "         'NNS-TL': 284,\n",
       "         'NR': 218,\n",
       "         'NR-TL': 208,\n",
       "         'JJS': 206,\n",
       "         'NN$-TL': 162,\n",
       "         'PN': 149,\n",
       "         'OD-TL': 98,\n",
       "         'NNS$': 97,\n",
       "         'FW-NN': 76,\n",
       "         'NP$': 62,\n",
       "         'FW-NN-TL': 52,\n",
       "         'ABN': 42,\n",
       "         'RBR': 40,\n",
       "         'VBG-TL': 34,\n",
       "         'NPS$': 30,\n",
       "         'CD-TL': 29,\n",
       "         'NNS$-TL': 28,\n",
       "         \"'\": 24,\n",
       "         'NP$-TL': 17,\n",
       "         'VB': 16,\n",
       "         '(': 15,\n",
       "         'NN+BEZ': 13,\n",
       "         ',': 12,\n",
       "         'RBT': 11,\n",
       "         'FW-NNS': 11,\n",
       "         'FW-JJ-TL': 8,\n",
       "         'NR$-TL': 8,\n",
       "         'FW-IN': 7,\n",
       "         \"''\": 7,\n",
       "         'FW-NNS-TL': 5,\n",
       "         '*': 4,\n",
       "         'CC': 4,\n",
       "         '--': 4,\n",
       "         'FW-IN-TL': 4,\n",
       "         '.': 4,\n",
       "         'JJR-TL': 3,\n",
       "         'NN+HVZ': 3,\n",
       "         'IN': 3,\n",
       "         'JJT-TL': 3,\n",
       "         'NN-HL': 3,\n",
       "         'VB-TL': 3,\n",
       "         'JJS-TL': 2,\n",
       "         'VBD': 2,\n",
       "         'NPS-TL': 2,\n",
       "         'NN+MD': 2,\n",
       "         'FW-JJ': 2,\n",
       "         'FW-VB': 2,\n",
       "         'WDT': 2,\n",
       "         'UH': 2,\n",
       "         'AP-TL': 2,\n",
       "         'RB-TL': 1,\n",
       "         'RB-NC': 1,\n",
       "         'VBZ': 1,\n",
       "         'RP': 1,\n",
       "         'PN$': 1,\n",
       "         'HV': 1,\n",
       "         'MD': 1,\n",
       "         'FW-RB': 1,\n",
       "         'AT': 1,\n",
       "         'PPO': 1,\n",
       "         'NIL': 1,\n",
       "         'FW-VBN': 1,\n",
       "         'FW-NN-TL-NC': 1,\n",
       "         ')': 1,\n",
       "         'BEZ-NC': 1,\n",
       "         'JJ-HL': 1,\n",
       "         'FW-JJT': 1,\n",
       "         'NN+IN': 1,\n",
       "         'FW-NN$': 1,\n",
       "         'AP$': 1,\n",
       "         'FW-CC': 1,\n",
       "         'PN+HVZ': 1,\n",
       "         'RBR+CS': 1,\n",
       "         'PPS': 1,\n",
       "         'IN-TL': 1})"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tagTags[\"AT\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "a3e00db0-678e-478f-a503-7bb6f1f17d9c"
    }
   },
   "source": [
    "We can request the frequency of a specific tag-bigram, for example *NP NNS*, using the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "nbpresent": {
     "id": "4baf913a-0b9e-4e62-982b-a4b3d912c76c"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "489"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tagTags[\"NP\"][\"NNS\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({'VBZ': 17, 'NNS': 2})"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenTags[\"loves\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can calculate the total number of bigrams and relativize the count of any particular bigram:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1161192.0\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.00012228823681892126"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "total = float(len(tags))\n",
    "print(total)\n",
    "tagTags[\"NNS\"][\"NNS\"]/(total-1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we want to know how many times a certain tag occurs in sentence initial position, for example to estimate the initial probabilities for the start states in a Hidden Markov Model, we can loop through the sentences and count the tags in initial position."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Example:\n",
      "AT: 8297\n"
     ]
    }
   ],
   "source": [
    "offset = 0\n",
    "initialTags = Counter()\n",
    "for x in brown.sents():\n",
    "    initTag = tags[offset]\n",
    "    initialTags[initTag] += 1\n",
    "    offset += len(x)\n",
    "print(\"Example:\")\n",
    "print(\"AT:\", initialTags[\"AT\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
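    "Note that the code above accesses the sentence-initial tag indirectly, by keeping an offset into the *tags* list. The NLTK corpus reader also provides *brown.tagged_sents()*, which returns every sentence as a list of *(token, tag)* tuples, so the initial tag can alternatively be read from the first tuple of each sentence."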
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
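    "We can now estimate how likely it is to find a given tag in sentence initial position. Note that the following code divides by the total number of tokens, which gives the relative frequency of sentence-initial *AT* among all tokens. To obtain the conditional probability $P(AT\\ |\\ S)$ used in the equation above, we would divide by the number of sentences, that is *sum(initialTags.values())*, instead:"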
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.007145243852868432"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "initialTags[\"AT\"]/total"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can estimate the relative frequency of a specific tag bigram, that is of one tag being directly followed by another, in the following way:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.04166067425600095"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tagTags[\"AT\"][\"NN\"]/(total-1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note, we are dividing by *total - 1*, since the number of bigrams in the *tagTags* data structure is exactly this. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can estimate the likelihood of a tag token combination using the *tokenTags* data-structure:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenTags[\"John\"][\"NN\"]/total"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "397c051d-fc65-42d7-8216-0a332d23542e"
    }
   },
   "source": [
    "Given the data structures *tokenTags*, *tagTags*, and *tagCounter* we can now estimate the probability of a word given a specific tag, or intuitively, how likely it is that a specific tag is realized as a specific word. For the token *cat* and the tag *NN*, for example, we estimate $P(cat\\ |\\ NN)$ using the following equation and corresponding code (with $C(cat\\ NN)$ as the absolute frequency or count of the *cat NN* tuple, and $C(NN)$ the count of the *NN*-tag):"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "f4c3907d-8932-4333-b5b8-648f00699345"
    }
   },
   "source": [
    "$$P(w_n\\ |\\ t_n) = \\frac{C(w_n\\ t_n)}{C(t_n)}$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "nbpresent": {
     "id": "6446d406-5bfb-4c72-bc2e-31f8f0007051"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.00013117334557617892"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenTags[\"cat\"][\"NN\"] / tagCounter[\"NN\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "8ea9b5e1-fd11-41e1-b2ca-d2ea2448a650"
    }
   },
   "source": [
    "We can estimate the probability of a $tag_2$ following a $tag_1$ using a similar approach:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "0e7350df-2ae2-4574-a2d4-4718fdacf42a"
    }
   },
   "source": [
    "$$P(t_n\\ |\\ t_{n-1}) = \\frac{C(t_{n-1}\\ t_n)}{C(t_{n-1})}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "ae3f2d65-17dc-4f3d-b37d-1b212d14c518"
    }
   },
   "source": [
    "Here $C(t_{n-1}\\ t_n)$ is the count of the bigram of these two tags in sequence. $C(t_{n-1})$ is the count or absolute frequency of the first or left tag in the bigram. Let us assume that the input sequence was *the cat ...* and that the most likely initial tag for *the* was *AT*, then the probability of the tag *NN* given that a tag *AT* occurred can be estimated as:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "nbpresent": {
     "id": "1ab47b9d-79ba-47b6-9388-45624a867a43"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.4938392592819445"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tagTags[\"AT\"][\"NN\"] / tagCounter[\"AT\"]"
   ]
  },
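  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the difference between the two estimates for the tag bigram *AT NN* concrete, here is a short worked comparison that only uses the outputs shown above, with $N = 1161192$ as the number of tags and $N - 1$ as the number of bigrams. Dividing the bigram count by the number of all bigrams gives the relative frequency of the bigram in the corpus, while dividing it by the count of the left tag gives the conditional probability needed for the tagger:\n",
    "\n",
    "$$\\frac{C(AT\\ NN)}{N - 1} = \\frac{48376}{1161191} \\approx 0.0417 \\qquad P(NN\\ |\\ AT) = \\frac{C(AT\\ NN)}{C(AT)} \\approx 0.4938$$\n",
    "\n",
    "In other words, roughly every second *AT* tag in the corpus is directly followed by an *NN* tag, although the bigram *AT NN* accounts for only about 4% of all tag bigrams."
   ]
  },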
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "d989bcba-3896-4719-8a65-aca6281622a4"
    }
   },
   "source": [
    "The product of the two probabilities $P(w_n\\ |\\ t_n)\\ P(t_n\\ |\\ t_{n-1})$ for the tokens *the cat* and the possible tags *AT NN* should be:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "nbpresent": {
     "id": "6af6e8e8-ef9e-46f1-9d59-66e68b2554b5"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6.477854781687473e-05"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(tokenTags[\"cat\"][\"NN\"] / tagCounter[\"NN\"]) * (tagTags[\"AT\"][\"NN\"] / tagCounter[\"AT\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "2b0e540c-93e7-49a9-b96e-a816a8d33c02"
    }
   },
   "source": [
    "If we would want to calculate this for any sequence of words, we should wrap this code in some function and a loop over all tokens. To avoid an underflow from the product of many probabilities, we can sum up the log-likelihoods of these probabilities. We would calculate the probabilities for all possible tag combinations assigned to the sequence of words or tokens and select the largest one as the best."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Processing CoNLL-U Files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the following code to install the `conllu` module. See for more details the [PiPy site](https://pypi.org/project/conllu/) and the [GitHub page](https://github.com/EmilStenstrom/conllu/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "!pip install conllu"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the `conllu` module and other necessary modules:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "import conllu\n",
    "import io\n",
    "import os"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the data from the CoNLL-U file and collect the token and the part-of-speech tag:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_file = io.open(os.path.join(\"data\", \"hi_hdtb-ud-dev.conllu\"), \"r\", encoding=\"utf-8\")\n",
    "for tokenlist in conllu.parse_incr(data_file):\n",
    "    # print(t)\n",
    "    data = [ (t[\"form\"], t[\"upos\"]) for t in tokenlist ]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can find more data at this site:\n",
    "\n",
    "- [Universal Dependencies](https://universaldependencies.org/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now process the sequence of tokens and tags:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('मंगलवार', 'PROPN'), ('को', 'ADP')]\n"
     ]
    }
   ],
   "source": [
    "print(data[:2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('PROPN', 'ADP') 2\n"
     ]
    }
   ],
   "source": [
    "tags = [ x[1] for x in data ]\n",
    "tag_bigrams = list(ngrams(tags, 2))\n",
    "tag_bi_fp = Counter(tag_bigrams)\n",
    "# total = float(len(tag_bigrams))\n",
    "tagTags = defaultdict(Counter)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "27a27e36-dd12-4296-ab63-6b1f2a1fe217"
    }
   },
   "source": [
    "In the next section we will discuss Hidden Markov Models (HMMs) for Part-of-Speech Tagging."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Manning, Chris and Hinrich Schütze (1999) *[Foundations of Statistical Natural Language Processing](http://nlp.stanford.edu/fsnlp/)*, MIT Press. Cambridge, MA."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbpresent": {
     "id": "466335b8-7064-41bd-b154-2327abefd9a4"
    }
   },
   "source": [
    "(C) 2016-2024 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {
   "environment": null,
   "summary": "A Python tutorial for Part-of-Speech Tagging",
   "url": "https://anaconda.org/dcavar/python-tutorial-pos-tagging"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.3"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autoclose": false,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  },
  "nbpresent": {
   "slides": {},
   "themes": {
    "default": "e2ad077c-80e1-4c08-addc-201005506eb1",
    "theme": {
     "510cdff0-402c-4eba-bd09-b848f6dfd289": {
      "backgrounds": {
       "dc7afa04-bf90-40b1-82a5-726e3cff5267": {
        "background-color": "31af15d2-7e15-44c5-ab5e-e04b16a89eff",
        "id": "dc7afa04-bf90-40b1-82a5-726e3cff5267"
       }
      },
      "id": "510cdff0-402c-4eba-bd09-b848f6dfd289",
      "palette": {
       "19cc588f-0593-49c9-9f4b-e4d7cc113b1c": {
        "id": "19cc588f-0593-49c9-9f4b-e4d7cc113b1c",
        "rgb": [
         252,
         252,
         252
        ]
       },
       "31af15d2-7e15-44c5-ab5e-e04b16a89eff": {
        "id": "31af15d2-7e15-44c5-ab5e-e04b16a89eff",
        "rgb": [
         68,
         68,
         68
        ]
       },
       "50f92c45-a630-455b-aec3-788680ec7410": {
        "id": "50f92c45-a630-455b-aec3-788680ec7410",
        "rgb": [
         197,
         226,
         245
        ]
       },
       "c5cc3653-2ee1-402a-aba2-7caae1da4f6c": {
        "id": "c5cc3653-2ee1-402a-aba2-7caae1da4f6c",
        "rgb": [
         43,
         126,
         184
        ]
       },
       "efa7f048-9acb-414c-8b04-a26811511a21": {
        "id": "efa7f048-9acb-414c-8b04-a26811511a21",
        "rgb": [
         25.118061674008803,
         73.60176211453744,
         107.4819383259912
        ]
       }
      },
      "rules": {
       "a": {
        "color": "19cc588f-0593-49c9-9f4b-e4d7cc113b1c"
       },
       "blockquote": {
        "color": "50f92c45-a630-455b-aec3-788680ec7410",
        "font-size": 3
       },
       "code": {
        "font-family": "Anonymous Pro"
       },
       "h1": {
        "color": "19cc588f-0593-49c9-9f4b-e4d7cc113b1c",
        "font-family": "Merriweather",
        "font-size": 8
       },
       "h2": {
        "color": "19cc588f-0593-49c9-9f4b-e4d7cc113b1c",
        "font-family": "Merriweather",
        "font-size": 6
       },
       "h3": {
        "color": "50f92c45-a630-455b-aec3-788680ec7410",
        "font-family": "Lato",
        "font-size": 5.5
       },
       "h4": {
        "color": "c5cc3653-2ee1-402a-aba2-7caae1da4f6c",
        "font-family": "Lato",
        "font-size": 5
       },
       "h5": {
        "font-family": "Lato"
       },
       "h6": {
        "font-family": "Lato"
       },
       "h7": {
        "font-family": "Lato"
       },
       "li": {
        "color": "50f92c45-a630-455b-aec3-788680ec7410",
        "font-size": 3.25
       },
       "pre": {
        "font-family": "Anonymous Pro",
        "font-size": 4
       }
      },
      "text-base": {
       "color": "19cc588f-0593-49c9-9f4b-e4d7cc113b1c",
       "font-family": "Lato",
       "font-size": 4
      }
     }
    }
   }
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {
    "height": "121px",
    "width": "252px"
   },
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}