{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Machine Translation Exercises\n",
"In these exercises you will develop a machine translation system that translates modern English into Shakespearean English."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup 1: Load Libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2016-10-25T14:37:53.142489",
"start_time": "2016-10-25T14:37:52.140810"
},
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2\n",
"%matplotlib inline\n",
"import sys, os\n",
"# Make the statnlpbook package, one directory up, importable\n",
"_snlp_book_dir = \"..\"\n",
"sys.path.append(_snlp_book_dir)\n",
"import statnlpbook.word_mt as word_mt\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)\n",
"from collections import defaultdict\n",
"import statnlpbook.util as util\n",
"from statnlpbook.lm import *\n",
"from statnlpbook.util import safe_log as log\n",
"import statnlpbook.mt as mt"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2016-10-21T14:01:00.919981",
"start_time": "2016-10-21T14:01:00.912871"
}
},
"source": [
"\n",
"$$\n",
"\\newcommand{\\Xs}{\\mathcal{X}}\n",
"\\newcommand{\\Ys}{\\mathcal{Y}}\n",
"\\newcommand{\\y}{\\mathbf{y}}\n",
"\\newcommand{\\balpha}{\\boldsymbol{\\alpha}}\n",
"\\newcommand{\\bbeta}{\\boldsymbol{\\beta}}\n",
"\\newcommand{\\aligns}{\\mathbf{a}}\n",
"\\newcommand{\\align}{a}\n",
"\\newcommand{\\source}{\\mathbf{s}}\n",
"\\newcommand{\\target}{\\mathbf{t}}\n",
"\\newcommand{\\ssource}{s}\n",
"\\newcommand{\\starget}{t}\n",
"\\newcommand{\\repr}{\\mathbf{f}}\n",
"\\newcommand{\\repry}{\\mathbf{g}}\n",
"\\newcommand{\\x}{\\mathbf{x}}\n",
"\\newcommand{\\prob}{p}\n",
"\\newcommand{\\vocab}{V}\n",
"\\newcommand{\\params}{\\boldsymbol{\\theta}}\n",
"\\newcommand{\\param}{\\theta}\n",
"\\DeclareMathOperator{\\perplexity}{PP}\n",
"\\DeclareMathOperator{\\argmax}{argmax}\n",
"\\DeclareMathOperator{\\argmin}{argmin}\n",
"\\newcommand{\\train}{\\mathcal{D}}\n",
"\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n",
"\\newcommand{\\length}[1]{\\text{length}(#1) }\n",
"\\newcommand{\\indi}{\\mathbb{I}}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup 2: Download Data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2016-10-25T14:37:53.180877",
"start_time": "2016-10-25T14:37:53.144067"
},
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I have half a mind to hit you before you speak again.\n",
"I have a mind to strike thee ere thou speak’st.\n"
]
}
],
"source": [
"%%sh\n",
"cd ../data\n",
"if [ ! -d \"shakespeare\" ]; then\n",
" git clone https://github.com/tokestermw/tensorflow-shakespeare.git shakespeare \n",
" cd shakespeare\n",
" cat ./data/shakespeare/sparknotes/merged/*_modern.snt.aligned > modern.txt\n",
" cat ./data/shakespeare/sparknotes/merged/*_original.snt.aligned > original.txt\n",
" cd ..\n",
"fi\n",
"head -n 1 shakespeare/modern.txt\n",
"head -n 1 shakespeare/original.txt "
]
},
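{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 1: Preprocessing the Aligned Corpus\n",
"Write functions for loading and tokenizing the aligned corpus. The `tokenize` and `pre_process` stubs below return empty lists until you implement them, which is why the sample output shows only `NULL` on the modern side."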
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2016-10-25T14:38:09.784552",
"start_time": "2016-10-25T14:38:09.636153"
},
"run_control": {
"frozen": false,
"read_only": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"NULL | \n",
"NULL | \n",
"NULL | \n",
"NULL | \n",
"NULL | \n",
"NULL | \n",
"NULL | \n",
"NULL | \n",
"NULL | \n",
"NULL | \n",
"\n",
"Total number of aligned sentence pairs 21079\n"
]
}
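],
"source": [
"import re\n",
"\n",
"NULL = \"NULL\"  # placeholder token for words that align to nothing\n",
"\n",
"def tokenize(sentence):\n",
"    \"\"\"Split a raw sentence string into a list of tokens.\"\"\"\n",
"    return []  # todo\n",
"\n",
"def pre_process(sentence):\n",
"    \"\"\"Normalize and tokenize a raw sentence; returns a list of tokens.\"\"\"\n",
"    return []  # todo\n",
"\n",
"def load_shakespeare(corpus):\n",
"    \"\"\"Load ../data/shakespeare/<corpus>.txt as a list of pre-processed sentences.\"\"\"\n",
"    # The corpus contains typographic apostrophes; it is assumed here to be UTF-8 encoded.\n",
"    with open(\"../data/shakespeare/%s.txt\" % corpus, \"r\", encoding=\"utf-8\") as f:\n",
"        return [pre_process(line.rstrip('\\n')) for line in f]\n",
"\n",
"modern = load_shakespeare(\"modern\")\n",
"original = load_shakespeare(\"original\")\n",
"\n",
"MAX_LENGTH = 6  # keep only pairs where both sentences have at most this many tokens\n",
"\n",
"def create_wordmt_pairs(modern, original):\n",
"    \"\"\"Pair up aligned sentences, prepending NULL to the modern side.\"\"\"\n",
"    alignments = []\n",
"    for mod, org in zip(modern, original):\n",
"        if len(mod) <= MAX_LENGTH and len(org) <= MAX_LENGTH:\n",
"            alignments.append(([NULL] + mod, org))\n",
"    return alignments\n",
"\n",
"train = create_wordmt_pairs(modern, original)\n",
"\n",
"# Show the first ten training pairs\n",
"for mod, org in train[:10]:\n",
"    print(\" \".join(mod), \"|\", \" \".join(org))\n",
"\n",
"print(\"\\nTotal number of aligned sentence pairs\", len(train))"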
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 2: Train IBM Model 2\n",
"- Train an IBM Model 2 that translates modern English into Shakespearean English.\n",
"- Visualize the alignments of the sentence pairs before and after training with EM.\n",
"- Do you find any interesting cases?\n",
"- Which words is \"killed\" most likely to be translated to?\n",
"- Test your translation system using a beam-search decoder.\n",
"    - How does the beam size change the quality of the translation?\n",
"    - Give examples of good and bad translations."
]
},
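{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reminder, and assuming the standard formulation of IBM Model 2 (the symbols below are chosen for illustration and may differ from the lecture notes), the model scores a generated sentence $\\mathbf{y} = y_1 \\ldots y_m$ together with an alignment $\\mathbf{a} = a_1 \\ldots a_m$, given a conditioning sentence $\\mathbf{x} = x_0 \\ldots x_l$ with $x_0 = \\text{NULL}$:\n",
"\n",
"$$\n",
"p(\\mathbf{y}, \\mathbf{a} \\mid \\mathbf{x}) = \\prod_{i=1}^{m} \\alpha(y_i \\mid x_{a_i}) \\, \\beta(a_i \\mid i, l, m)\n",
"$$\n",
"\n",
"Here $\\alpha$ is the word translation table and $\\beta$ is the distortion table over alignment positions $a_i \\in \\{0, \\ldots, l\\}$. In the training pairs created in Task 1, the modern sentence carries the NULL token, so it plays the role of $\\mathbf{x}$."
]
},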
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2016-10-25T14:46:50.330225",
"start_time": "2016-10-25T14:46:50.312486"
}
},
"outputs": [],
"source": [
"# todo"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 3: Better Language Model\n",
"Try a better language model for machine translation. How does the translation quality change for the examples you found earlier?"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2016-10-25T14:46:56.628954",
"start_time": "2016-10-25T14:46:56.616732"
}
},
"outputs": [],
"source": [
"# todo"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Task 4: Better Decoding\n",
"How can you change the decoder so that it can produce target sequences that are shorter or longer than the source?"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# todo"
]
}
],
"metadata": {
"hide_input": false,
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}