{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python Tutorial 1: Smoothing 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2016-2024 by [Damir Cavar](http://cavar.me/damir/) <>**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.1, January 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CA BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a tutorial about developing simple [Part-of-Speech taggers](https://en.wikipedia.org/wiki/Part-of-speech_tagging) using Python 3.x and the [NLTK](http://nltk.org/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial was developed as part of the course material for the course Advanced Natural Language Processing in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Smoothing](https://en.wikipedia.org/wiki/Smoothing) is re-evaluating for example an n-gram frequency profile and zero and small probabilities in it, and assigning very small probabilities for zero n-grams, that is for n-grams that have not been seen in a training corpus." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, if we extract a frequency profile using the tuples tokens and Part-of-Speech tags from the Brown corpus, we will surely not find any occurrence of a tuple like (*iPhone NN*)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are various smoothing techniques:\n", "\n", "* Additive smoothing\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Add-One Smoothing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To estimate the probability of an n-gram, a token (or word) $w_n$ occuring, given that the words $w_1\\dots{}w_{n-1}$ occured, we can use the Maximum Likelihood Estimation (MLE) as described below. The conditional probability of an n-gram like *the cat* as the conditional probability $P(cat|the)$, for example, is the probability of the n-gram $P(the\\ cat)$ divided by the probability of the token *the*, i.e. $P(the)$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$P(w_n\\ |\\ w_1\\ \\dots{}\\ w_{n-1}) = \\frac{P(w_1\\ \\dots{}\\ w_n)}{P(w_1\\ \\dots{}\\ w_{n-1})}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us assume that $C(w_1\\dots{}w_n)$ is the count or frequency of a particular n-gram with words $w_1\\dots{}w_n$. $C(w_1\\dots{}w_{n-1})$ is the count of the $(n-1)$-gram of words, i.e. the words preceding $w_n$ in the n-gram. $C(t)$ is the total count of tokens (or words), and $C(t)-(N-1)$ is the total count of n-grams. If we have a text with 3 words, i.e. $t=3$, and with $N=2$ n-grams, we will have $C(t)-(N-1)=3-(2-1)=2$ bigrams. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "The general idea in Add-One Smoothing is to act as if every n-gram occurred once more than it actually did, so that an unseen n-gram is assumed to have occurred at least once." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We differentiate between the number of types and the number of tokens. The number of tokens in the following text is 12: *the, black, cat, is, chasing, the, white, mouse, with, the, black, tail*. The number of types, i.e. of distinct words, is 9: *the, black, cat, is, chasing, white, mouse, with, tail*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*The black cat is chasing the white mouse with the black tail.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With $C$ the count of a particular n-gram in the corpus, $N$ the total count of all n-grams (a count here, not the n-gram size as above), and $V$ the vocabulary size, i.e. the number of types, the Add-One smoothed probability of an n-gram can be computed as:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$P=\\frac{C+1}{N+V}$$" ] }
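, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of this formula, the following code cell applies Add-One Smoothing to the bigrams of the example sentence above and compares a seen with an unseen bigram; the function name `add_one` is our own illustration:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "\n", "# the example sentence from above, lowercased and without punctuation\n", "tokens = \"the black cat is chasing the white mouse with the black tail\".split()\n", "\n", "bigram_counts = Counter(zip(tokens, tokens[1:]))\n", "N = sum(bigram_counts.values())  # total count of all bigrams: 11\n", "V = len(set(tokens))             # vocabulary size, i.e. number of types: 9\n", "\n", "def add_one(bigram):\n", "    # P = (C + 1) / (N + V)\n", "    return (bigram_counts[bigram] + 1) / (N + V)\n", "\n", "print(add_one((\"the\", \"black\")))    # seen bigram:   (2 + 1) / (11 + 9) = 0.15\n", "print(add_one((\"black\", \"mouse\")))  # unseen bigram: (0 + 1) / (11 + 9) = 0.05\n" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.7" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 4 }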