{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Translation Exercises\n", "In these exercises you will develop a machine translation system that translates modern English into Shakespearean English.\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup 1: Load Libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2016-10-25T14:37:53.142489", "start_time": "2016-10-25T14:37:52.140810" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import sys, os\n", "_snlp_book_dir = \"..\"\n", "sys.path.append(_snlp_book_dir)\n", "import statnlpbook.word_mt as word_mt\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)\n", "from collections import defaultdict\n", "import statnlpbook.util as util\n", "from statnlpbook.lm import *\n", "from statnlpbook.util import safe_log as log\n", "import statnlpbook.mt as mt\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2016-10-21T14:01:00.919981", "start_time": "2016-10-21T14:01:00.912871" } }, "source": [ "\n", "$$\n", "\\newcommand{\\Xs}{\\mathcal{X}}\n", "\\newcommand{\\Ys}{\\mathcal{Y}}\n", "\\newcommand{\\y}{\\mathbf{y}}\n", "\\newcommand{\\balpha}{\\boldsymbol{\\alpha}}\n", "\\newcommand{\\bbeta}{\\boldsymbol{\\beta}}\n", "\\newcommand{\\aligns}{\\mathbf{a}}\n", "\\newcommand{\\align}{a}\n", "\\newcommand{\\source}{\\mathbf{s}}\n", "\\newcommand{\\target}{\\mathbf{t}}\n", "\\newcommand{\\ssource}{s}\n", "\\newcommand{\\starget}{t}\n", "\\newcommand{\\repr}{\\mathbf{f}}\n", "\\newcommand{\\repry}{\\mathbf{g}}\n", "\\newcommand{\\x}{\\mathbf{x}}\n", "\\newcommand{\\prob}{p}\n", "\\newcommand{\\vocab}{V}\n", 
"\\newcommand{\\params}{\\boldsymbol{\\theta}}\n", "\\newcommand{\\param}{\\theta}\n", "\\DeclareMathOperator{\\perplexity}{PP}\n", "\\DeclareMathOperator{\\argmax}{argmax}\n", "\\DeclareMathOperator{\\argmin}{argmin}\n", "\\newcommand{\\train}{\\mathcal{D}}\n", "\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n", "\\newcommand{\\length}[1]{\\text{length}(#1) }\n", "\\newcommand{\\indi}{\\mathbb{I}}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup 2: Download Data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2016-10-25T14:37:53.180877", "start_time": "2016-10-25T14:37:53.144067" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I have half a mind to hit you before you speak again.\n", "I have a mind to strike thee ere thou speak’st.\n" ] } ], "source": [ "%%sh\n", "cd ../data\n", "if [ ! -d \"shakespeare\" ]; then\n", " git clone https://github.com/tokestermw/tensorflow-shakespeare.git shakespeare \n", " cd shakespeare\n", " cat ./data/shakespeare/sparknotes/merged/*_modern.snt.aligned > modern.txt\n", " cat ./data/shakespeare/sparknotes/merged/*_original.snt.aligned > original.txt\n", " cd ..\n", "fi\n", "head -n 1 shakespeare/modern.txt\n", "head -n 1 shakespeare/original.txt " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 1: Preprocessing Aligned Corpus\n", "Write methods for loading and tokenizing the aligned corpus." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2016-10-25T14:38:09.784552", "start_time": "2016-10-25T14:38:09.636153" }, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NULL | \n", "NULL | \n", "NULL | \n", "NULL | \n", "NULL | \n", "NULL | \n", "NULL | \n", "NULL | \n", "NULL | \n", "NULL | \n", "\n", "Total number of aligned sentence pairs 21079\n" ] } ], "source": [ "import re\n", "\n", "NULL = \"NULL\"\n", "\n", "def tokenize(sentence):\n", "    return [] # todo\n", "\n", "def pre_process(sentence):\n", "    return [] # todo\n", "\n", "\n", "def load_shakespeare(corpus):\n", "    with open(\"../data/shakespeare/%s.txt\" % corpus, \"r\") as f:\n", "        return [pre_process(x.rstrip('\\n')) for x in f.readlines()]\n", "\n", "modern = load_shakespeare(\"modern\")\n", "original = load_shakespeare(\"original\")\n", "\n", "MAX_LENGTH = 6\n", "\n", "def create_wordmt_pairs(modern, original):\n", "    alignments = []\n", "    for i in range(len(modern)):\n", "        if len(modern[i]) <= MAX_LENGTH and len(original[i]) <= MAX_LENGTH:\n", "            alignments.append(([NULL] + modern[i], original[i]))\n", "    return alignments\n", "\n", "train = create_wordmt_pairs(modern, original)\n", "\n", "for i in range(10):\n", "    (mod, org) = train[i]\n", "    print(\" \".join(mod), \"|\", \" \".join(org))\n", "\n", "print(\"\\nTotal number of aligned sentence pairs\", len(train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 2: Train IBM Model 2\n", "- Train an IBM Model 2 that translates modern English to Shakespeare\n", "- Visualize alignments of the sentence pairs before and after training using EM\n", "- Do you find interesting cases?\n", "- What are likely words that \"killed\" can be translated to?\n", "- Test your translation system using a beam-search decoder\n", "    - How does the beam size change the quality of the translation?\n", "    - Give examples of good 
and bad translations" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2016-10-25T14:46:50.330225", "start_time": "2016-10-25T14:46:50.312486" } }, "outputs": [], "source": [ "# todo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 3: Better Language Model\n", "Try a better language model for machine translation. How does the translation quality change for the examples you found earlier?" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2016-10-25T14:46:56.628954", "start_time": "2016-10-25T14:46:56.616732" } }, "outputs": [], "source": [ "# todo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 4: Better Decoding\n", "How can you change the decoder so that it can produce target sequences that are shorter or longer than the source?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# todo" ] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 1 }