{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Language Models: Auto-Complete\n",
"\n",
"In this assignment, you will build an auto-complete system. Auto-complete system is something you may see every day\n",
"- When you google something, you often have suggestions to help you complete your search. \n",
"- When you are writing an email, you get suggestions telling you possible endings to your sentence. \n",
"\n",
"By the end of this assignment, you will develop a prototype of such a system.\n",
"\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Outline\n",
"- [1 Load and Preprocess Data](#1)\n",
"- [1.1: Load the data](#1.1)\n",
"- [1.2 Pre-process the data](#1.2)\n",
" - [Exercise 01](#ex-01)\n",
" - [Exercise 02](#ex-02)\n",
" - [Exercise 03](#ex-03)\n",
" - [Exercise 04](#ex-04)\n",
" - [Exercise 05](#ex-05)\n",
" - [Exercise 06](#ex-06)\n",
" - [Exercise 07](#ex-07)\n",
"- [2 Develop n-gram based language models](#2)\n",
" - [Exercise 08](#ex-08)\n",
" - [Exercise 09](#ex-09) \n",
"- [3 Perplexity](#3)\n",
" - [Exercise 10](#ex-10)\n",
"- [4 Build an auto-complete system](#4)\n",
" - [Exercise 11](#ex-11)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A key building block for an auto-complete system is a language model.\n",
"A language model assigns the probability to a sequence of words, in a way that more \"likely\" sequences receive higher scores. For example, \n",
">\"I have a pen\" \n",
"is expected to have a higher probability than \n",
">\"I am a pen\"\n",
"since the first one seems to be a more natural sentence in the real world.\n",
"\n",
"You can take advantage of this probability calculation to develop an auto-complete system. \n",
"Suppose the user typed \n",
">\"I eat scrambled\"\n",
"Then you can find a word `x` such that \"I eat scrambled x\" receives the highest probability. If x = \"eggs\", the sentence would be\n",
">\"I eat scrambled eggs\"\n",
"\n",
"While a variety of language models have been developed, this assignment uses **N-grams**, a simple but powerful method for language modeling.\n",
"- N-grams are also used in machine translation and speech recognition. \n",
"\n",
"\n",
"Here are the steps of this assignment:\n",
"\n",
"1. Load and preprocess data\n",
" - Load and tokenize data.\n",
" - Split the sentences into train and test sets.\n",
" - Replace words with a low frequency by an unknown marker `
\n", "
\n", "
\n", "
['a'] * 3 to get ['a','a','a'] [0,1,2,3,4]. The largest index 'i' where a bigram can start is at position i=3, because the word tokens at position 3 and 4 will form the bigram. range() function excludes the value that is used for the maximum of the range. range(3) produces (0,1,2) but excludes 3. \n", "
('apple',) is a tuple containing a single string 'apple' | \n", " | dog | \n", "this | \n", "cat | \n", "is | \n", "like | \n", "a | \n", "i | \n", "<e> | \n", "<unk> | \n", "
|---|---|---|---|---|---|---|---|---|---|
| (<s>,) | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "
| (dog,) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| (cat,) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "2.0 | \n", "0.0 | \n", "
| (this,) | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| (is,) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| (like,) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "2.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| (a,) | \n", "0.0 | \n", "0.0 | \n", "2.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| (i,) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| \n", " | dog | \n", "this | \n", "cat | \n", "is | \n", "like | \n", "a | \n", "i | \n", "<e> | \n", "<unk> | \n", "
|---|---|---|---|---|---|---|---|---|---|
| (like, a) | \n", "0.0 | \n", "0.0 | \n", "2.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| (<s>, <s>) | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "
| (i, like) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| (a, cat) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "2.0 | \n", "0.0 | \n", "
| (dog, is) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| (cat,) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "2.0 | \n", "0.0 | \n", "
| (<s>, this) | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| (this, dog) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| (is, like) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| (<s>, i) | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
| \n", " | dog | \n", "this | \n", "cat | \n", "is | \n", "like | \n", "a | \n", "i | \n", "<e> | \n", "<unk> | \n", "
|---|---|---|---|---|---|---|---|---|---|
| (<s>,) | \n", "0.090909 | \n", "0.181818 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.181818 | \n", "0.090909 | \n", "0.090909 | \n", "
| (dog,) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
| (cat,) | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.272727 | \n", "0.090909 | \n", "
| (this,) | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
| (is,) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
| (like,) | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.272727 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "
| (a,) | \n", "0.090909 | \n", "0.090909 | \n", "0.272727 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "
| (i,) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
| \n", " | dog | \n", "this | \n", "cat | \n", "is | \n", "like | \n", "a | \n", "i | \n", "<e> | \n", "<unk> | \n", "
|---|---|---|---|---|---|---|---|---|---|
| (like, a) | \n", "0.090909 | \n", "0.090909 | \n", "0.272727 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "
| (<s>, <s>) | \n", "0.090909 | \n", "0.181818 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.181818 | \n", "0.090909 | \n", "0.090909 | \n", "
| (i, like) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
| (a, cat) | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.272727 | \n", "0.090909 | \n", "
| (dog, is) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
| (cat,) | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.090909 | \n", "0.272727 | \n", "0.090909 | \n", "
| (<s>, this) | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
| (this, dog) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
| (is, like) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
| (<s>, i) | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.200000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "0.100000 | \n", "
\n", "
range(2,4) produces the integers [2, 3] (and excludes 4).\n", "
estimate_probabilities returns a dictionary where the key is a word and the value is the word's probability.str1.startswith(str2) to determine if a string starts with the letters of another string. For example, 'learning'.startswith('lea') returns True, whereas 'learning'.startswith('ear') returns False. There are two additional parameters in str.startswith(), but you can use the default values for those parameters in this case.