{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python Tutorial 1: Smoothing 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2016-2024 by [Damir Cavar](http://cavar.me/damir/) <>**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.1, January 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CA BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a tutorial about developing simple [Part-of-Speech taggers](https://en.wikipedia.org/wiki/Part-of-speech_tagging) using Python 3.x and the [NLTK](http://nltk.org/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial was developed as part of the course material for the course Advanced Natural Language Processing in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Smoothing](https://en.wikipedia.org/wiki/Smoothing) is re-evaluating for example an n-gram frequency profile and zero and small probabilities in it, and assigning very small probabilities for zero n-grams, that is for n-grams that have not been seen in a training corpus." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, if we extract a frequency profile using the tuples tokens and Part-of-Speech tags from the Brown corpus, we will surely not find any occurrence of a tuple like (*iPhone NN*)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are various smoothing techniques:\n", "\n", "* Additive smoothing\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Add-One Smoothing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To estimate the probability of an n-gram, a token (or word) $w_n$ occuring, given that the words $w_1\\dots{}w_{n-1}$ occured, we can use the Maximum Likelihood Estimation (MLE) as described below. The conditional probability of an n-gram like *the cat* as the conditional probability $P(cat|the)$, for example, is the probability of the n-gram $P(the\\ cat)$ divided by the probability of the token *the*, i.e. $P(the)$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$P(w_n\\ |\\ w_1\\ \\dots{}\\ w_{n-1}) = \\frac{P(w_1\\ \\dots{}\\ w_n)}{P(w_1\\ \\dots{}\\ w_{n-1})}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us assume that $C(w_1\\dots{}w_n)$ is the count or frequency of a particular n-gram with words $w_1\\dots{}w_n$. $C(w_1\\dots{}w_{n-1})$ is the count of the $(n-1)$-gram of words, i.e. the words preceding $w_n$ in the n-gram. $C(t)$ is the total count of tokens (or words), and $C(t)-(N-1)$ is the total count of n-grams. If we have a text with 3 words, i.e. $t=3$, and with $N=2$ n-grams, we will have $C(t)-(N-1)=3-(2-1)=2$ bigrams. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "The general idea in Add-One Smoothing is to act as if every n-gram occurred once more than it actually did, so that an unseen n-gram is assumed to have occurred at least once." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We differentiate between the number of types and the number of tokens. The number of tokens in the following text is 12: *the, black, cat, is, chasing, the, white, mouse, with, the, black, tail*. The number of types, i.e. of distinct words, is 9: *the, black, cat, is, chasing, white, mouse, with, tail*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*The black cat is chasing the white mouse with the black tail.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With $C$ the count of a particular n-gram in the corpus, $N$ the total count of all n-grams (a count here, not the n-gram size as above), and $V$ the vocabulary size, i.e. the number of types, the Add-One smoothed probability of an n-gram can be computed as:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$P=\\frac{C+1}{N+V}$$" ] }
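, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of this formula, the following code cell applies Add-One Smoothing to the bigrams of the example sentence above and compares a seen with an unseen bigram; the function name `add_one` is our own illustration:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "\n", "# the example sentence from above, lowercased and without punctuation\n", "tokens = \"the black cat is chasing the white mouse with the black tail\".split()\n", "\n", "bigram_counts = Counter(zip(tokens, tokens[1:]))\n", "N = sum(bigram_counts.values())  # total count of all bigrams: 11\n", "V = len(set(tokens))             # vocabulary size, i.e. number of types: 9\n", "\n", "def add_one(bigram):\n", "    # P = (C + 1) / (N + V)\n", "    return (bigram_counts[bigram] + 1) / (N + V)\n", "\n", "print(add_one((\"the\", \"black\")))    # seen bigram:   (2 + 1) / (11 + 9) = 0.15\n", "print(add_one((\"black\", \"mouse\")))  # unseen bigram: (0 + 1) / (11 + 9) = 0.05\n" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.7" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 4 }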