{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Language Models: Auto-Complete\n",
"\n",
"In this assignment, you will build an auto-complete system. Auto-complete system is something you may see every day\n",
"- When you google something, you often have suggestions to help you complete your search. \n",
"- When you are writing an email, you get suggestions telling you possible endings to your sentence. \n",
"\n",
"By the end of this assignment, you will develop a prototype of such a system.\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Outline\n",
"- [1 Load and Preprocess Data](#1)\n",
"- [1.1: Load the data](#1.1)\n",
"- [1.2 Pre-process the data](#1.2)\n",
" - [Exercise 01](#ex-01)\n",
" - [Exercise 02](#ex-02)\n",
" - [Exercise 03](#ex-03)\n",
" - [Exercise 04](#ex-04)\n",
" - [Exercise 05](#ex-05)\n",
" - [Exercise 06](#ex-06)\n",
" - [Exercise 07](#ex-07)\n",
"- [2 Develop n-gram based language models](#2)\n",
" - [Exercise 08](#ex-08)\n",
" - [Exercise 09](#ex-09) \n",
"- [3 Perplexity](#3)\n",
" - [Exercise 10](#ex-10)\n",
"- [4 Build an auto-complete system](#4)\n",
" - [Exercise 11](#ex-11)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A key building block for an auto-complete system is a language model.\n",
"A language model assigns the probability to a sequence of words, in a way that more \"likely\" sequences receive higher scores. For example, \n",
">\"I have a pen\" \n",
"is expected to have a higher probability than \n",
">\"I am a pen\"\n",
"since the first one seems to be a more natural sentence in the real world.\n",
"\n",
"You can take advantage of this probability calculation to develop an auto-complete system. \n",
"Suppose the user typed \n",
">\"I eat scrambled\"\n",
"Then you can find a word `x` such that \"I eat scrambled x\" receives the highest probability. If x = \"eggs\", the sentence would be\n",
">\"I eat scrambled eggs\"\n",
"\n",
"While a variety of language models have been developed, this assignment uses **N-grams**, a simple but powerful method for language modeling.\n",
"- N-grams are also used in machine translation and speech recognition. \n",
"\n",
"\n",
"Here are the steps of this assignment:\n",
"\n",
"1. Load and preprocess data\n",
" - Load and tokenize data.\n",
" - Split the sentences into train and test sets.\n",
" - Replace words with a low frequency by an unknown marker `
\n", "
\n", "
\n", "
['a'] * 3
to get ['a','a','a']
[0,1,2,3,4]
. The largest index 'i' where a bigram can start is at position i=3, because the word tokens at position 3 and 4 will form the bigram. range()
function excludes the value that is used for the maximum of the range. range(3)
produces (0,1,2) but excludes 3. \n", "
('apple',)
is a tuple containing a single string 'apple' \n", " | dog | \n", "this | \n", "cat | \n", "is | \n", "like | \n", "a | \n", "i | \n", "<e> | \n", "<unk> | \n", "
---|---|---|---|---|---|---|---|---|---|
(<s>,) | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "
(dog,) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
(cat,) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "2.0 | \n", "0.0 | \n", "
(this,) | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
(is,) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
(like,) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "2.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
(a,) | \n", "0.0 | \n", "0.0 | \n", "2.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
(i,) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
\n", " | dog | \n", "this | \n", "cat | \n", "is | \n", "like | \n", "a | \n", "i | \n", "<e> | \n", "<unk> | \n", "
---|---|---|---|---|---|---|---|---|---|
(like, a) | \n", "0.0 | \n", "0.0 | \n", "2.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
(<s>, <s>) | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "
(i, like) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
(a, cat) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "2.0 | \n", "0.0 | \n", "
(dog, is) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
(cat,) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "2.0 | \n", "0.0 | \n", "
(<s>, this) | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
(this, dog) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
(is, like) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
(<s>, i) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
\n", " | dog | \n", "this | \n", "cat | \n", "is | \n", "like | \n", "a | \n", "i | \n", "<e> | \n", "<unk> | \n", "
---|---|---|---|---|---|---|---|---|---|
(<s>,) | \n", "0.090909 | \n", "0.181818 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.181818 | \n", "0.090909 | \n", "0.090909 | \n", "
(dog,) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
(cat,) | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.272727 | \n", "0.090909 | \n", "
(this,) | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
(is,) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
(like,) | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.272727 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "
(a,) | \n", "0.090909 | \n", "0.090909 | \n", "0.272727 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "
(i,) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
\n", " | dog | \n", "this | \n", "cat | \n", "is | \n", "like | \n", "a | \n", "i | \n", "<e> | \n", "<unk> | \n", "
---|---|---|---|---|---|---|---|---|---|
(like, a) | \n", "0.090909 | \n", "0.090909 | \n", "0.272727 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "
(<s>, <s>) | \n", "0.090909 | \n", "0.181818 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.181818 | \n", "0.090909 | \n", "0.090909 | \n", "
(i, like) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
(a, cat) | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.272727 | \n", "0.090909 | \n", "
(dog, is) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
(cat,) | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.272727 | \n", "0.090909 | \n", "
(<s>, this) | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
(this, dog) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
(is, like) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
(<s>, i) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
\n", "
range(2,4)
produces the integers [2, 3] (and excludes 4).\n", "
estimate_probabilities
returns a dictionary where the key is a word and the value is the word's probability.str1.startswith(str2)
to determine if a string starts with the letters of another string. For example, 'learning'.startswith('lea')
returns True, whereas 'learning'.startswith('ear')
returns False. There are two additional parameters in str.startswith()
, but you can use the default values for those parameters in this case.