{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Neural Translation Model in PyTorch\n", "\n", "by Mac Brennan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Translation Model Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This project will be broken up into several parts as follows:\n", "\n", "__Part 1:__ Preparing the Words\n", "\n", "+ Inspecting the Dataset\n", "+ Using Word Embeddings\n", "+ Organizing the Data\n", "\n", "__Part 2:__ Building the Model\n", "\n", "+ Bi-Directional Encoder\n", "+ Building Attention\n", "+ Decoder with Attention\n", "\n", "__Part 3:__ Training the Model\n", "\n", "+ Training Function\n", "+ Training Loop\n", "\n", "__Part 4:__ Evaluation\n", "\n", "\n", "This project closely follows the [PyTorch Sequence to Sequence tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html), while attempting to go into more depth on both the model implementation and the explanation. Thanks to [Sean Robertson](https://github.com/spro/practical-pytorch) and [PyTorch](https://pytorch.org/tutorials/) for providing such great tutorials.\n", "\n", "If you are working through this notebook, it is strongly recommended that you install [Jupyter Notebook Extensions](https://github.com/ipython-contrib/jupyter_contrib_nbextensions) so you can turn on collapsible headings. They make the notebook much easier to navigate." 
] }, { "cell_type": "code", "execution_count": 250, "metadata": {}, "outputs": [], "source": [ "# Before we get started, load all the packages we will need\n", "\n", "# PyTorch\n", "import torch\n", "import torch.nn as nn\n", "import torch.nn.functional as F\n", "import torch.optim as optim\n", "from torch.utils.data import Dataset, DataLoader\n", "\n", "import numpy as np\n", "import os.path\n", "import time\n", "import math\n", "import random\n", "import matplotlib.pyplot as plt\n", "import string\n", "\n", "# Use the GPU if available\n", "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")" ] }, { "cell_type": "code", "execution_count": 251, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "device(type='cuda')" ] }, "execution_count": 251, "metadata": {}, "output_type": "execute_result" } ], "source": [ "device" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Part 1: Preparing the Words" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Inspecting the Dataset" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "The dataset is a pair of text files: one of English sentences and one of the corresponding French sentences.\n", "\n", "Each sentence is on its own line. The sentences will be split into a list." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "#### Load the data\n", "The data will be stored in two lists where each item is a sentence. The lists are:\n", "+ english_sentences\n", "+ french_sentences\n", "\n", "Download the first dataset from the project's GitHub repo. Place the files in a `data` folder inside the notebook's folder, since the code below reads from `data/`." 
] }, { "cell_type": "code", "execution_count": 252, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Read with an explicit encoding so the accented French characters do not depend on the platform default\n", "with open('data/small_vocab_en', \"r\", encoding='utf-8') as f:\n", "    data1 = f.read()\n", "with open('data/small_vocab_fr', \"r\", encoding='utf-8') as f:\n", "    data2 = f.read()\n", "\n", "# Each file is plain text with one sentence per line\n", "english_sentences = data1.split('\\n')\n", "french_sentences = data2.split('\\n')" ] }, { "cell_type": "code", "execution_count": 253, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of English sentences: 137861 \n", "Number of French sentences: 137861 \n", "\n", "Example/Target pair:\n", "\n", " california is usually quiet during march , and it is usually hot in june .\n", " california est généralement calme en mars , et il est généralement chaud en juin .\n" ] } ], "source": [ "print('Number of English sentences:', len(english_sentences), \n", "      '\\nNumber of French sentences:', len(french_sentences),'\\n')\n", "print('Example/Target pair:\\n')\n", "print(' '+english_sentences[2])\n", "print(' '+french_sentences[2])" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "#### Vocabulary\n", "Let's take a closer look at the dataset.\n" ] }, { "cell_type": "code", "execution_count": 254, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "['california',\n", " 'is',\n", " 'usually',\n", " 'quiet',\n", " 'during',\n", " 'march',\n", " ',',\n", " 'and',\n", " 'it',\n", " 'is',\n", " 'usually',\n", " 'hot',\n", " 'in',\n", " 'june',\n", " '.']" ] }, "execution_count": 254, "metadata": {}, "output_type": "execute_result" } ], "source": [ "english_sentences[2].split()" ] }, { "cell_type": "code", "execution_count": 255, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The longest english sentence in our dataset is: 17\n" ] } ], "source": [ "max_en_length = 
0\n", "for sentence in english_sentences:\n", "    length = len(sentence.split())\n", "    max_en_length = max(max_en_length, length)\n", "print(\"The longest english sentence in our dataset is:\", max_en_length) " ] }, { "cell_type": "code", "execution_count": 256, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The longest french sentence in our dataset is: 23\n" ] } ], "source": [ "max_fr_length = 0\n", "for sentence in french_sentences:\n", "    length = len(sentence.split())\n", "    max_fr_length = max(max_fr_length, length)\n", "print(\"The longest french sentence in our dataset is:\", max_fr_length)" ] }, { "cell_type": "code", "execution_count": 257, "metadata": { "hidden": true }, "outputs": [], "source": [ "# +1 leaves room for the end of sentence token that is appended to each sentence later\n", "max_seq_length = max(max_fr_length, max_en_length) + 1\n", "seq_length = max_seq_length" ] }, { "cell_type": "code", "execution_count": 258, "metadata": { "hidden": true }, "outputs": [], "source": [ "en_word_count = {}\n", "fr_word_count = {}\n", "\n", "for sentence in english_sentences:\n", "    for word in sentence.split():\n", "        if word in en_word_count:\n", "            en_word_count[word] += 1\n", "        else:\n", "            en_word_count[word] = 1\n", "\n", "for sentence in french_sentences:\n", "    for word in sentence.split():\n", "        if word in fr_word_count:\n", "            fr_word_count[word] += 1\n", "        else:\n", "            fr_word_count[word] = 1\n" ] }, { "cell_type": "code", "execution_count": 259, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Add end of sentence token to word count dict (one per sentence)\n", "en_word_count['</s>'] = len(english_sentences)\n", "fr_word_count['</s>'] = len(french_sentences)" ] }, { "cell_type": "code", "execution_count": 260, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of unique English words: 228\n", "Number of unique French words: 356\n" ] } ], "source": [ "print('Number of unique English words:', len(en_word_count))\n", "print('Number of unique French 
words:', len(fr_word_count))" ] }, { "cell_type": "code", "execution_count": 261, "metadata": { "hidden": true }, "outputs": [], "source": [ "def get_value(items_tuple):\n", "    return items_tuple[1]\n", "\n", "# Sort the word counts to see which words are most/least common\n", "sorted_en_words = sorted(en_word_count.items(), key=get_value, reverse=True)" ] }, { "cell_type": "code", "execution_count": 262, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "[('is', 205858),\n", " (',', 140897),\n", " ('</s>', 137861),\n", " ('.', 129039),\n", " ('in', 75525),\n", " ('it', 75137),\n", " ('during', 74933),\n", " ('the', 67628),\n", " ('but', 63987),\n", " ('and', 59850)]" ] }, "execution_count": 262, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted_en_words[:10]" ] }, { "cell_type": "code", "execution_count": 263, "metadata": { "hidden": true }, "outputs": [], "source": [ "sorted_fr_words = sorted(fr_word_count.items(), key=get_value, reverse=True)" ] }, { "cell_type": "code", "execution_count": 264, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "[('est', 196809),\n", " ('</s>', 137861),\n", " ('.', 135619),\n", " (',', 123135),\n", " ('en', 105768),\n", " ('il', 84079),\n", " ('les', 65255),\n", " ('mais', 63987),\n", " ('et', 59851),\n", " ('la', 49861)]" ] }, "execution_count": 264, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted_fr_words[:10]" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "With only a few hundred unique words per language, this dataset is quite small. We may want a bigger dataset later, but we'll see how this one does." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "#### Alternate Dataset\n", "Skip this section for now. You can come back and try training on this second dataset later. 
It is more diverse, so it takes longer to train.\n", "\n", "Download the French-English dataset from [here](http://www.manythings.org/anki/). You could also train the model on any of the other language pairs, but you would then need different word embeddings, or the embeddings would need to be trained from scratch." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "hidden": true }, "outputs": [], "source": [ "with open('data/fra.txt', \"r\", encoding='utf-8') as f:\n", "    data1 = f.read()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "hidden": true }, "outputs": [], "source": [ "pairs = data1.split('\\n')\n", "english_sentences = []\n", "french_sentences = []\n", "\n", "# Used to remove punctuation and limit sentence length\n", "max_sent_length = 10\n", "punctuation_table = str.maketrans({char: None for char in string.punctuation})\n", "\n", "for pair in pairs:\n", "    # Each line holds an English sentence and a French sentence separated by a tab\n", "    pair_split = pair.split('\\t')\n", "    if len(pair_split) != 2:\n", "        continue\n", "    english = pair_split[0].lower().translate(punctuation_table)\n", "    french = pair_split[1].lower().translate(punctuation_table)\n", "\n", "    if len(english.split()) > max_sent_length or len(french.split()) > max_sent_length:\n", "        continue\n", "\n", "    english_sentences.append(english)\n", "    french_sentences.append(french)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "139692 139692\n" ] }, { "data": { "text/plain": [ "['i', 'have', 'to', 'fight']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(len(english_sentences), len(french_sentences))\n", "english_sentences[10000].split()\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "['il', 'me', 'faut', 'me', 'battre']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "french_sentences[10000].split()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['would', 'you', 'consider', 'taking', 'care', 'of', 'my', 'children', 'next', 'saturday']\n" ] }, { "data": { "text/plain": [ "['pourriezvous',\n", " 'réfléchir',\n", " 'à',\n", " 'vous',\n", " 'occuper',\n", " 'de',\n", " 'mes',\n", " 'enfants',\n", " 'samedi',\n", " 'prochain']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(english_sentences[-100].split())\n", "french_sentences[-100].split()\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The longest english sentence in our dataset is: 10\n" ] } ], "source": [ "max_en_length = 0\n", "for sentence in english_sentences:\n", "    length = len(sentence.split())\n", "    max_en_length = max(max_en_length, length)\n", "print(\"The longest english sentence in our dataset is:\", max_en_length) " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The longest french sentence in our dataset is: 10\n" ] } ], "source": [ "max_fr_length = 0\n", "for sentence in french_sentences:\n", "    length = len(sentence.split())\n", "    max_fr_length = max(max_fr_length, length)\n", "print(\"The longest french sentence in our dataset is:\", max_fr_length) " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "hidden": true }, "outputs": [], "source": [ "max_seq_length = max(max_fr_length, max_en_length) + 1\n", "seq_length = max_seq_length" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "hidden": true }, "outputs": [], "source": [ "en_word_count = {}\n", "fr_word_count = {}\n", "\n", "for sentence in english_sentences:\n", "    for word in sentence.split():\n", "        if word 
in en_word_count:\n", "            en_word_count[word] += 1\n", "        else:\n", "            en_word_count[word] = 1\n", "\n", "for sentence in french_sentences:\n", "    for word in sentence.split():\n", "        if word in fr_word_count:\n", "            fr_word_count[word] += 1\n", "        else:\n", "            fr_word_count[word] = 1\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Add end of sentence token to word count dict (one per sentence)\n", "en_word_count['</s>'] = len(english_sentences)\n", "fr_word_count['</s>'] = len(french_sentences)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of unique English words: 12603\n", "Number of unique French words: 25809\n" ] } ], "source": [ "print('Number of unique English words:', len(en_word_count))\n", "print('Number of unique French words:', len(fr_word_count))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Indices 0-2 are reserved for the special tokens added in the next cell\n", "fr_word2idx = {k:v+3 for v, k in enumerate(fr_word_count.keys())}\n", "en_word2idx = {k:v+3 for v, k in enumerate(en_word_count.keys())}" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "hidden": true }, "outputs": [], "source": [ "fr_word2idx['<pad>'] = 0\n", "fr_word2idx['<s>'] = 1\n", "fr_word2idx['<unk>'] = 2\n", "\n", "en_word2idx['<pad>'] = 0\n", "en_word2idx['<s>'] = 1\n", "en_word2idx['<unk>'] = 2" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "25812" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(fr_word2idx)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "hidden": true }, "outputs": [], "source": [ "def get_value(items_tuple):\n", "    return items_tuple[1]\n", "\n", "sorted_en_words = sorted(en_word_count.items(), key=get_value, reverse=True)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ 
"[('impossibilities', 1),\n", " ('offers', 1),\n", " ('profound', 1),\n", " ('insights', 1),\n", " ('promoting', 1),\n", " ('domestically', 1),\n", " ('feat', 1),\n", " ('hummer', 1),\n", " ('limousines', 1),\n", " ('imprison', 1)]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted_en_words[-10:]" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Using Word Embeddings" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Here we build an embedding matrix of pretrained word vectors. The word embeddings used here were downloaded from the [fastText repository](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md). These embeddings have 300 dimensions. To start, we will add a few special token embeddings for our use case: a token to signal the start of a sentence, a token for words that we do not have an embedding for, and a token to pad sentences so that all the sentences we use have the same length. Padding allows us to train the model on batches of sentences of different lengths, rather than one sentence at a time.\n", "\n", "After this step we will have a dictionary and an embedding matrix for each language. The dictionary maps each word to the index in the embedding matrix where its corresponding embedding vector is stored." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "#### Load Embeddings for the English data" ] }, { "cell_type": "code", "execution_count": 265, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Embeddings load from .npy file\n" ] } ], "source": [ "# The data file containing the embeddings is very large so once we have the embeddings we want\n", "# we will save them as a numpy array. 
This way we can load them much faster than having to re-read\n", "# the large embedding file\n", "if os.path.exists('data/en_words.npy') and os.path.exists('data/en_vectors.npy'):\n", "    en_words = np.load('data/en_words.npy')\n", "    en_vectors = np.load('data/en_vectors.npy')\n", "    print('Embeddings load from .npy file')\n", "else:\n", "    # build a list with the top 100,000 words, starting with the special tokens\n", "    en_words = ['<pad>', # Padding token\n", "                '<s>', # Start of sentence token\n", "                '<unk>' # Unknown word token\n", "               ]\n", "\n", "    en_vectors = list(np.random.uniform(-0.1, 0.1, (3, 300)))\n", "    en_vectors[0] *= 0 # make the padding vector zeros\n", "\n", "    with open('data/wiki.en.vec', \"r\", encoding='utf-8') as f:\n", "        f.readline() # skip the header line\n", "        for _ in range(100000):\n", "            en_vecs = f.readline()\n", "            word = en_vecs.split()[0]\n", "            vector = np.float32(en_vecs.split()[1:])\n", "\n", "            # skip lines that don't have 300 dim\n", "            if len(vector) != 300:\n", "                continue\n", "\n", "            if word not in en_words:\n", "                en_words.append(word)\n", "                en_vectors.append(vector)\n", "        print(word, vector[:10]) # Last word embedding read from the file\n", "    en_words = np.array(en_words)\n", "    en_vectors = np.array(en_vectors)\n", "    # Save the arrays so we don't have to load the full word embedding file\n", "    np.save('data/en_words.npy', en_words)\n", "    np.save('data/en_vectors.npy', en_vectors)" ] }, { "cell_type": "code", "execution_count": 266, "metadata": { "hidden": true }, "outputs": [], "source": [ "en_word2idx = {word:index for index, word in enumerate(en_words)}" ] }, { "cell_type": "code", "execution_count": 267, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "index for word hemophilia: 99996 \n", "vector for word hemophilia:\n", " [ 0.16189 -0.056121 -0.65560001 0.21569 -0.11878 -0.02066\n", " 0.37613001 -0.24117 -0.098989 -0.010058 ]\n" ] } ], "source": [ "hemophilia_idx = en_word2idx['hemophilia']\n", "print('index for word hemophilia:', hemophilia_idx, \n", "      '\\nvector 
for word hemophilia:\\n',en_vectors[hemophilia_idx][:10])" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "The word embedding for hemophilia matches the one read from the file, so it looks like everything worked properly." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "#### Load Embeddings for the French data" ] }, { "cell_type": "code", "execution_count": 268, "metadata": { "hidden": true, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Embeddings load from .npy file\n" ] } ], "source": [ "if os.path.exists('data/fr_words.npy') and os.path.exists('data/fr_vectors.npy'):\n", "    fr_words = np.load('data/fr_words.npy')\n", "    fr_vectors = np.load('data/fr_vectors.npy')\n", "    print('Embeddings load from .npy file')\n", "else:\n", "    # build a list with the top 100,000 words, starting with the special tokens\n", "    fr_words = ['<pad>', # Padding token\n", "                '<s>', # Start of sentence token\n", "                '<unk>'] # Unknown word token\n", "\n", "    fr_vectors = list(np.random.uniform(-0.1, 0.1, (3, 300)))\n", "    fr_vectors[0] = np.zeros(300) # make the padding vector zeros\n", "\n", "    with open('data/wiki.fr.vec', \"r\", encoding='utf-8') as f:\n", "        f.readline() # skip the header line\n", "        for _ in range(100000):\n", "            fr_vecs = f.readline()\n", "            word = fr_vecs.split()[0]\n", "            try:\n", "                vector = np.float32(fr_vecs.split()[1:])\n", "            except ValueError:\n", "                # skip lines whose values cannot be parsed as floats\n", "                continue\n", "\n", "            # skip lines that don't have 300 dim\n", "            if len(vector) != 300:\n", "                continue\n", "\n", "            if word not in fr_words:\n", "                fr_words.append(word)\n", "                fr_vectors.append(vector)\n", "        print(word, vector[:10]) # Last word embedding read from the file\n", "    fr_words = np.array(fr_words)\n", "    fr_vectors = np.array(fr_vectors)\n", "    # Save the arrays so we don't have to load the full word embedding file\n", "    np.save('data/fr_words.npy', fr_words)\n", "    np.save('data/fr_vectors.npy', fr_vectors)" ] }, { "cell_type": "code", "execution_count": 269, "metadata": { "hidden": true }, "outputs": [], "source": [ "fr_word2idx = {word:index for index, word in enumerate(fr_words)}" ] 
}, { "cell_type": "code", "execution_count": 270, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "index for word chabeuil: 99783 \n", "vector for word chabeuil:\n", " [-0.18058001 -0.24758001 0.075607 0.17299999 0.24116001 -0.11223\n", " -0.28173 0.27373999 0.37997001 0.48008999]\n" ] } ], "source": [ "chabeuil_idx = fr_word2idx['chabeuil']\n", "print('index for word chabeuil:', chabeuil_idx, \n", "      '\\nvector for word chabeuil:\\n',fr_vectors[chabeuil_idx][:10])" ] }, { "cell_type": "code", "execution_count": 271, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "99783" ] }, "execution_count": 271, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fr_word2idx[\"chabeuil\"]" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "The word embedding for chabeuil matches as well, so everything worked correctly for the French vocabulary." ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "We now have all the pieces needed to convert words into word embeddings. Because we loaded pre-trained embeddings, they already carry a lot of useful information about how words relate to each other. Now we can build the translation model with the embedding matrices built in." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Setting up PyTorch Dataset and Dataloader" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Rather than organizing all the data from a file and storing it in a list or some other data structure, PyTorch allows us to create a dataset object. To get an example from a dataset, we index the dataset object just like we would a list. All of our processing can be contained in the object's initialization or indexing methods.\n", "\n", "This will also make training easier when we want to iterate through batches." 
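, "\n", "\n", "As a minimal, self-contained sketch of that idea (the `ToyDataset` class, the toy vocabulary, and the special token spellings below are made up for illustration and are not part of this project), a dataset can pad every sentence to a fixed length so that a `DataLoader` can stack the results into batches:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Illustrative sketch only: every name in this cell is hypothetical\n", "import torch\n", "from torch.utils.data import Dataset, DataLoader\n", "\n", "toy_word2idx = {'<pad>': 0, '<s>': 1, '<unk>': 2, '</s>': 3, 'le': 4, 'chat': 5, 'dort': 6}\n", "\n", "class ToyDataset(Dataset):\n", "    def __init__(self, sentences, word2idx, seq_length):\n", "        self.sentences = sentences\n", "        self.word2idx = word2idx\n", "        self.seq_length = seq_length\n", "\n", "    def __len__(self):\n", "        return len(self.sentences)\n", "\n", "    def __getitem__(self, idx):\n", "        # Start from all padding (index 0), then fill in the word indices\n", "        tensor = torch.zeros(self.seq_length, dtype=torch.long)\n", "        words = self.sentences[idx].split() + ['</s>']\n", "        for i, word in enumerate(words):\n", "            # Words missing from the vocabulary map to the unknown token\n", "            tensor[i] = self.word2idx.get(word, self.word2idx['<unk>'])\n", "        return tensor\n", "\n", "toy_dataset = ToyDataset(['le chat dort', 'le chien'], toy_word2idx, seq_length=5)\n", "print(toy_dataset[0])  # indices 4, 5, 6, 3 followed by one padding 0\n", "print(toy_dataset[1])  # 'chien' is not in the toy vocabulary, so it becomes 2\n", "\n", "# Because every item has the same length, the DataLoader can stack items into a batch\n", "toy_loader = DataLoader(toy_dataset, batch_size=2)\n", "print(next(iter(toy_loader)).shape)  # (batch_size, seq_length)" 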
] }, { "cell_type": "code", "execution_count": 272, "metadata": { "hidden": true }, "outputs": [], "source": [ "class French2EnglishDataset(Dataset):\n", "    '''\n", "    French and associated English sentences.\n", "    '''\n", "\n", "    def __init__(self, fr_sentences, en_sentences, fr_word2idx, en_word2idx, seq_length):\n", "        self.fr_sentences = fr_sentences\n", "        self.en_sentences = en_sentences\n", "        self.fr_word2idx = fr_word2idx\n", "        self.en_word2idx = en_word2idx\n", "        self.seq_length = seq_length\n", "        self.unk_en = set()\n", "        self.unk_fr = set()\n", "\n", "    def __len__(self):\n", "        return len(self.fr_sentences)\n", "\n", "    def __getitem__(self, idx):\n", "        '''\n", "        Returns a pair of tensors containing word indices\n", "        for the specified sentence pair in the dataset.\n", "        '''\n", "\n", "        # init torch tensors, note that 0 is the padding index\n", "        french_tensor = torch.zeros(self.seq_length, dtype=torch.long)\n", "        english_tensor = torch.zeros(self.seq_length, dtype=torch.long)\n", "\n", "        # Get sentence pair\n", "        french_sentence = self.fr_sentences[idx].split()\n", "        english_sentence = self.en_sentences[idx].split()\n", "\n", "        # Add end of sentence tags\n", "        french_sentence.append('</s>')\n", "        english_sentence.append('</s>')\n", "\n", "        # Load word indices; rare or unseen words map to the unknown token.\n", "        # Note: fr_word_count and en_word_count are the global count dicts built earlier.\n", "        for i, word in enumerate(french_sentence):\n", "            if word in self.fr_word2idx and fr_word_count[word] > 5:\n", "                french_tensor[i] = self.fr_word2idx[word]\n", "            else:\n", "                french_tensor[i] = self.fr_word2idx['<unk>']\n", "                self.unk_fr.add(word)\n", "\n", "        for i, word in enumerate(english_sentence):\n", "            if word in self.en_word2idx and en_word_count[word] > 5:\n", "                english_tensor[i] = self.en_word2idx[word]\n", "            else:\n", "                english_tensor[i] = self.en_word2idx['<unk>']\n", "                self.unk_en.add(word)\n", "\n", "        sample = {'french_tensor': french_tensor, 'french_sentence': self.fr_sentences[idx],\n", "                  'english_tensor': english_tensor, 'english_sentence': self.en_sentences[idx]}\n", "        return sample" ] }, { "cell_type": "code", "execution_count": 273, 
"metadata": { "hidden": true }, "outputs": [], "source": [ "french_english_dataset = French2EnglishDataset(french_sentences,\n", "                                               english_sentences,\n", "                                               fr_word2idx,\n", "                                               en_word2idx,\n", "                                               seq_length = seq_length)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "#### Example output of dataset" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "hidden": true }, "outputs": [], "source": [ "test_sample = french_english_dataset[-10] # get 10th to last item in dataset" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input example:\n", "Sentence: la station spatiale internationale est une étonnante prouesse technologique\n", "Tensor: tensor([ 9, 787, 2, 730, 21, 23, 2, 2, 2, 3,\n", " 0])\n", "\n", "Target example:\n", "Sentence: the international space station is an amazing feat of engineering\n", "Tensor: tensor([ 5, 214, 657, 309, 16, 32, 6425, 2, 7,\n", " 2, 6])\n" ] } ], "source": [ "print('Input example:')\n", "print('Sentence:', test_sample['french_sentence'])\n", "print('Tensor:', test_sample['french_tensor'])\n", "\n", "print('\\nTarget example:')\n", "print('Sentence:', test_sample['english_sentence'])\n", "print('Tensor:', test_sample['english_tensor'])" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3\n" ] }, { "data": { "text/plain": [ "6" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check that both tensors end with the end of sentence token\n", "print(fr_word2idx['</s>'])\n", "en_word2idx['</s>']" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Build dataloader to check how the batching works\n", "dataloader = DataLoader(french_english_dataset, batch_size=5,\n", "                        shuffle=True, num_workers=4)" ] }, 
{ "cell_type": "code", "execution_count": 32, "metadata": { "hidden": true, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 torch.Size([5, 11]) torch.Size([5, 11])\n", "1 torch.Size([5, 11]) torch.Size([5, 11])\n", "2 torch.Size([5, 11]) torch.Size([5, 11])\n", "3 torch.Size([5, 11]) torch.Size([5, 11])\n" ] } ], "source": [ "# Print the tensor shapes of the first 4 batches from the dataloader\n", "for i_batch, sample_batched in enumerate(dataloader):\n", "    print(i_batch, sample_batched['french_tensor'].shape,\n", "          sample_batched['english_tensor'].shape)\n", "    if i_batch == 3:\n", "        break" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "French Sentence: plus nous vieillissons plus notre mémoire faiblit\n", "English Sentence: the older we get the weaker our memory becomes \n", "\n", "French Sentence: personne ne peut aider\n", "English Sentence: no one can help \n", "\n", "French Sentence: cest très gentil de ta part\n", "English Sentence: thats very sweet of you \n", "\n", "French Sentence: quand avezvous commencé à apprendre lallemand \n", "English Sentence: when did you start learning german \n", "\n", "French Sentence: passezle trente secondes au microondes\n", "English Sentence: zap it in the microwave for thirty seconds \n", "\n" ] } ], "source": [ "# Grab a single batch and print its sentence pairs\n", "batch = next(iter(dataloader))\n", "\n", "for i in range(5):\n", "    print('French Sentence:', batch['french_sentence'][i])\n", "    print('English Sentence:', batch['english_sentence'][i],'\\n')" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Part 2: Building the Model" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Bi-Directional Encoder" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "hidden": true }, "outputs": [], "source": [ "class 
EncoderBiLSTM(nn.Module):\n", "    def __init__(self, hidden_size, pretrained_embeddings):\n", "        super(EncoderBiLSTM, self).__init__()\n", "\n", "        # Model Parameters\n", "        self.hidden_size = hidden_size\n", "        self.embedding_dim = pretrained_embeddings.shape[1]\n", "        self.vocab_size = pretrained_embeddings.shape[0]\n", "        self.num_layers = 2\n", "        self.dropout = 0.1 if self.num_layers > 1 else 0\n", "        self.bidirectional = True\n", "\n", "\n", "        # Construct the layers\n", "        self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)\n", "\n", "        self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings)) # Load the pretrained embeddings\n", "        self.embedding.weight.requires_grad = False # Freeze embedding layer\n", "\n", "        self.lstm = nn.LSTM(self.embedding_dim,\n", "                            self.hidden_size,\n", "                            self.num_layers,\n", "                            batch_first = True,\n", "                            dropout=self.dropout,\n", "                            bidirectional=self.bidirectional)\n", "\n", "        # Initialize the hidden-to-hidden weights in the LSTM to the identity matrix.\n", "        # This is intended to help training and reduce the risk of exploding gradients.\n", "        # PyTorch LSTM has the 4 different hidden-to-hidden weights stacked in one matrix\n", "        identity_init = torch.eye(self.hidden_size)\n", "        self.lstm.weight_hh_l0.data.copy_(torch.cat([identity_init]*4, dim=0))\n", "        self.lstm.weight_hh_l0_reverse.data.copy_(torch.cat([identity_init]*4, dim=0))\n", "        self.lstm.weight_hh_l1.data.copy_(torch.cat([identity_init]*4, dim=0))\n", "        self.lstm.weight_hh_l1_reverse.data.copy_(torch.cat([identity_init]*4, dim=0))\n", "\n", "    def forward(self, input, hidden):\n", "        embedded = self.embedding(input)\n", "        # nn.LSTM returns a tuple: (outputs, (hidden_state, cell_state))\n", "        output = self.lstm(embedded, hidden)\n", "        return output\n", "\n", "    def initHidden(self, batch_size):\n", "\n", "        hidden_state = torch.zeros(self.num_layers*(2 if self.bidirectional else 1),\n", "                                   batch_size,\n", "                                   self.hidden_size, \n", "                                   device=device)\n", "\n", "        cell_state = torch.zeros(self.num_layers*(2 if self.bidirectional else 1),\n", "                                 batch_size,\n", "                                 
self.hidden_size, \n", "                                 device=device)\n", "\n", "        return (hidden_state, cell_state)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "hidden": true }, "outputs": [], "source": [ "class EncoderBiGRU(nn.Module):\n", "    def __init__(self, hidden_size, pretrained_embeddings):\n", "        super(EncoderBiGRU, self).__init__()\n", "\n", "        # Model parameters\n", "        self.hidden_size = hidden_size\n", "        self.embedding_dim = pretrained_embeddings.shape[1]\n", "        self.vocab_size = pretrained_embeddings.shape[0]\n", "        self.num_layers = 2\n", "        self.dropout = 0.1 if self.num_layers > 1 else 0\n", "        self.bidirectional = True\n", "\n", "\n", "        # Construct the layers\n", "        self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)\n", "        self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings))\n", "        self.embedding.weight.requires_grad = False\n", "\n", "        self.gru = nn.GRU(self.embedding_dim,\n", "                          self.hidden_size,\n", "                          self.num_layers,\n", "                          batch_first = True,\n", "                          dropout=self.dropout,\n", "                          bidirectional=self.bidirectional)\n", "\n", "        # Initialize the hidden-to-hidden weights in the GRU to the identity matrix\n", "        # PyTorch GRU has 3 different hidden-to-hidden weights stacked in one matrix\n", "        identity_init = torch.eye(self.hidden_size)\n", "        self.gru.weight_hh_l0.data.copy_(torch.cat([identity_init]*3, dim=0))\n", "        self.gru.weight_hh_l0_reverse.data.copy_(torch.cat([identity_init]*3, dim=0))\n", "        self.gru.weight_hh_l1.data.copy_(torch.cat([identity_init]*3, dim=0))\n", "        self.gru.weight_hh_l1_reverse.data.copy_(torch.cat([identity_init]*3, dim=0))\n", "\n", "    def forward(self, input, hidden):\n", "        embedded = self.embedding(input)\n", "        output = self.gru(embedded, hidden)\n", "        return output\n", "\n", "    def initHidden(self, batch_size):\n", "\n", "        hidden_state = torch.zeros(self.num_layers*(2 if self.bidirectional else 1),\n", "                                   batch_size,\n", "                                   self.hidden_size, \n", "                                   device=device)\n", "\n", "        return hidden_state" ] }, { 
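"cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "To make the shapes used by the encoders above more concrete, here is a small sketch on a toy bidirectional LSTM (all `toy_` names and sizes are illustrative assumptions, not values used by this project). It checks that four stacked identity matrices match the shape of the hidden-to-hidden weights, and shows the shapes of the outputs and hidden states:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Illustrative sketch only: the toy_ names and sizes are hypothetical\n", "import torch\n", "import torch.nn as nn\n", "\n", "toy_hidden_size = 5\n", "toy_lstm = nn.LSTM(input_size=8, hidden_size=toy_hidden_size, num_layers=2,\n", "                   batch_first=True, bidirectional=True)\n", "\n", "# The 4 gate matrices are stacked along dim 0: (4 * hidden_size, hidden_size)\n", "toy_identity = torch.cat([torch.eye(toy_hidden_size)] * 4, dim=0)\n", "print(toy_lstm.weight_hh_l0.shape, toy_identity.shape)\n", "\n", "# Input has shape (batch_size, seq_length, input_size) because batch_first=True\n", "toy_input = torch.randn(1, 3, 8)\n", "toy_output, (toy_h, toy_c) = toy_lstm(toy_input)\n", "\n", "# Outputs concatenate both directions: (batch_size, seq_length, 2 * hidden_size)\n", "print(toy_output.shape)\n", "# Hidden states: (num_layers * num_directions, batch_size, hidden_size)\n", "print(toy_h.shape, toy_c.shape)" ] }, { 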
"cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "#### Testing the Encoder" ] }, { "cell_type": "code", "execution_count": 229, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The final output of the BiLSTM Encoder on our test input is: \n", "\n", " torch.Size([1, 3, 10])\n", "\n", "\n", "Encoder output tensor: \n", "\n", " tensor([[[ 0.1696, -0.0685, -0.1059, 0.0245, -0.0932, -0.1734, 0.0031,\n", " 0.0233, -0.0100, 0.1628],\n", " [ 0.2719, -0.1025, -0.1429, 0.0170, -0.1392, -0.1159, -0.0073,\n", " 0.0053, 0.0103, 0.1306],\n", " [ 0.3649, -0.1420, -0.1676, 0.0222, -0.1329, -0.0704, -0.0404,\n", " 0.0054, -0.0080, 0.0976]]], device='cuda:0')\n" ] } ], "source": [ "# Test the encoder on a sample input, input tensor has dimensions (batch_size, seq_length)\n", "# all the variable have test_ in front of them so they don't reassign variables needed later on with the real models\n", "\n", "test_batch_size = 1\n", "test_seq_length = 3\n", "test_hidden_size = 5\n", "test_encoder = EncoderBiLSTM(test_hidden_size, fr_vectors).to(device)\n", "test_hidden = test_encoder.initHidden(test_batch_size)\n", "\n", "# Create an input tensor of random indices\n", "test_inputs = torch.randint(0, 50, (test_batch_size, test_seq_length), dtype=torch.long, device=device)\n", "\n", "test_encoder_output, test_encoder_hidden = test_encoder.forward(test_inputs, test_hidden)\n", "\n", "print(\"The final output of the BiLSTM Encoder on our test input is: \\n\\n\", test_encoder_output.shape)\n", "\n", "print('\\n\\nEncoder output tensor: \\n\\n', test_encoder_output)" ] }, { "cell_type": "code", "execution_count": 230, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "(tensor([[[-0.1083, -0.2036, 0.1844, -0.1568, -0.0570]],\n", " \n", " [[ 0.1193, -0.1132, -0.1703, -0.1203, 0.0673]],\n", " \n", " [[ 0.3649, -0.1420, -0.1676, 0.0222, -0.1329]],\n", " \n", " [[-0.1734, 0.0031, 
0.0233, -0.0100, 0.1628]]], device='cuda:0'),\n", " tensor([[[-0.3360, -0.5497, 0.4430, -0.2667, -0.1674]],\n", " \n", " [[ 0.1950, -0.2336, -0.3140, -0.3467, 0.1642]],\n", " \n", " [[ 0.7378, -0.2309, -0.2565, 0.0606, -0.3088]],\n", " \n", " [[-0.3480, 0.0054, 0.0424, -0.0174, 0.4835]]], device='cuda:0'))" ] }, "execution_count": 230, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Tuple: the first item holds the hidden states, the second item holds the cell states.\n", "# The LSTM has 2 layers, each with a forward and a backward direction, giving 4 states of each kind.\n", "test_encoder_hidden" ] }, { "cell_type": "code", "execution_count": 231, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "tensor([[[-0.1083, -0.2036, 0.1844, -0.1568, -0.0570]],\n", "\n", " [[ 0.3649, -0.1420, -0.1676, 0.0222, -0.1329]]], device='cuda:0')" ] }, "execution_count": 231, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_encoder_hidden[0][::2] # Hidden states from the forward direction of both LSTM layers." 
] }, { "cell_type": "code", "execution_count": 232, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Repeat the test with the GRU encoder: o is the encoder output, h is the final hidden state\n", "test_encoder_gru = EncoderBiGRU(test_hidden_size, fr_vectors).to(device)\n", "test_hidden = test_encoder_gru.initHidden(test_batch_size)\n", "o, h = test_encoder_gru(test_inputs, test_hidden)" ] }, { "cell_type": "code", "execution_count": 233, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "tensor([[[-0.2638, 0.0444, 0.0588, -0.1062, -0.0780, -0.1639, 0.1332,\n", " -0.5315, 0.0886, 0.1702],\n", " [-0.4556, 0.1273, 0.0350, -0.3055, -0.0166, -0.2639, 0.0371,\n", " -0.4623, 0.0706, 0.1357],\n", " [-0.5962, 0.2580, 0.1032, -0.4162, -0.0136, -0.1962, -0.0306,\n", " -0.2940, -0.0043, 0.1003]]], device='cuda:0')" ] }, "execution_count": 233, "metadata": {}, "output_type": "execute_result" } ], "source": [ "o" ] }, { "cell_type": "code", "execution_count": 234, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([[[-0.5236, -0.1632, -0.1012, -0.1132, -0.3256]],\n", "\n", " [[ 0.2437, -0.0019, -0.1655, 0.0607, 0.2190]],\n", "\n", " [[-0.5962, 0.2580, 0.1032, -0.4162, -0.0136]],\n", "\n", " [[-0.1639, 0.1332, -0.5315, 0.0886, 0.1702]]], device='cuda:0')\n" ] }, { "data": { "text/plain": [ "tensor([[[ 0.2437, -0.0019, -0.1655, 0.0607, 0.2190]],\n", "\n", " [[-0.1639, 0.1332, -0.5315, 0.0886, 0.1702]]], device='cuda:0')" ] }, "execution_count": 234, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(h)\n", "h[1::2] # Hidden states from the backward direction of both GRU layers" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Attention\n", "Let's take a moment to test how attention is modeled: the attention output is a weighted sum of the encoder outputs over the items in the sequence." 
] }, { "cell_type": "code", "execution_count": 235, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Attention weights:\n", " tensor([[[ 1., 0., 0.]]], device='cuda:0')\n", "\n", "First sequence item in Encoder output: \n", " tensor([[ 0.1696, -0.0685, -0.1059, 0.0245, -0.0932, -0.1734, 0.0031,\n", " 0.0233, -0.0100, 0.1628]], device='cuda:0')\n", "\n", "Encoder Output after attention is applied: \n", " tensor([ 0.1696, -0.0685, -0.1059, 0.0245, -0.0932, -0.1734, 0.0031,\n", " 0.0233, -0.0100, 0.1628], device='cuda:0')\n", "\n", " torch.Size([10])\n" ] } ], "source": [ "# Initialize attention weights to one, note the dimensions\n", "attn_weights = torch.ones((test_batch_size, test_seq_length), device=device)\n", "\n", "# Set all weights except the weights associated with the first sequence item equal to zero\n", "# This would represent full attention on the first word in the sequence\n", "attn_weights[:, 1:] = 0\n", "\n", "attn_weights.unsqueeze_(1) # Add dimension for batch matrix multiplication\n", "\n", "# BMM (batch matrix multiply) multiplies the [1 x seq_length] matrix by the [seq_length x encoder_hidden_size] matrix for\n", "# each batch. 
This produces a single vector(for each batch) of length(encoder_hidden_size) that is the weighted\n", "# sum of the encoder hidden vectors for each item in the sequence.\n", "attn_applied = torch.bmm(attn_weights, test_encoder_output)\n", "attn_applied.squeeze_() # Remove extra dimension\n", "\n", "print('Attention weights:\\n', attn_weights)\n", "print('\\nFirst sequence item in Encoder output: \\n', test_encoder_output[:,0,:])\n", "print('\\nEncoder Output after attention is applied: \\n', attn_applied)\n", "print('\\n', attn_applied.shape)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Decoder with Attention" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "hidden": true }, "outputs": [], "source": [ "class AttnDecoderLSTM(nn.Module):\n", " def __init__(self, decoder_hidden_size, pretrained_embeddings, seq_length):\n", " super(AttnDecoderLSTM, self).__init__()\n", " # Embedding parameters\n", " self.embedding_dim = pretrained_embeddings.shape[1]\n", " self.output_vocab_size = pretrained_embeddings.shape[0]\n", " \n", " # LSTM parameters\n", " self.decoder_hidden_size = decoder_hidden_size\n", " self.num_layers = 2 # Potentially add more layers to LSTM later\n", " self.dropout = 0.1 if self.num_layers > 1 else 0 # Potentially add dropout later\n", " \n", " # Attention parameters\n", " self.seq_length = seq_length\n", " self.encoder_hidden_dim = 2*decoder_hidden_size\n", " \n", " # Construct embedding layer for output language\n", " self.embedding = nn.Embedding(self.output_vocab_size, self.embedding_dim)\n", " self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings))\n", " self.embedding.weight.requires_grad = False # we don't want to train the embedding weights\n", " \n", " # Construct layer that calculates attentional weights\n", " self.attn = nn.Linear((self.decoder_hidden_size + self.embedding_dim), self.seq_length)\n", " \n", " # Construct layer that compresses the 
combined matrix of the input embeddings\n", " # and the encoder inputs after attention has been applied\n", " self.attn_with_input = nn.Linear(self.embedding_dim + self.encoder_hidden_dim, self.embedding_dim)\n", " \n", " # LSTM for Decoder\n", " self.lstm = nn.LSTM(self.embedding_dim,\n", " self.decoder_hidden_size,\n", " self.num_layers,\n", " dropout=self.dropout)\n", " \n", " # Initialize hidden to hidden weights in LSTM to the Identity matrix\n", " # PyTorch LSTM has 4 different hidden to hidden weights stacked in one matrix\n", " identity_init = torch.eye(self.decoder_hidden_size)\n", " self.lstm.weight_hh_l0.data.copy_(torch.cat([identity_init]*4, dim=0))\n", " self.lstm.weight_hh_l1.data.copy_(torch.cat([identity_init]*4, dim=0))\n", " \n", " # Output layer\n", " self.out = nn.Linear(self.decoder_hidden_size, self.output_vocab_size)\n", " \n", " def forward(self, input, hidden, encoder_output):\n", " # Input word indices, should have dim(1, batch_size), output will be (1, batch_size, embedding_dim)\n", " embedded = self.embedding(input)\n", " \n", " # Calculate Attention weights\n", " attn_weights = F.softmax(self.attn(torch.cat((hidden[0][1], embedded[0]), 1)), dim=1)\n", " attn_weights = attn_weights.unsqueeze(1) # Add dimension for batch matrix multiplication\n", " \n", " # Apply Attention weights\n", " attn_applied = torch.bmm(attn_weights, encoder_output)\n", " attn_applied = attn_applied.squeeze(1) # Remove extra dimension, dim are now (batch_size, encoder_hidden_size)\n", " \n", " # Prepare LSTM input tensor\n", " attn_combined = torch.cat((embedded[0], attn_applied), 1) # Combine embedding input and attn_applied,\n", " lstm_input = F.relu(self.attn_with_input(attn_combined)) # pass through fully connected with ReLU\n", " lstm_input = lstm_input.unsqueeze(0) # Add seq dimension so tensor has expected dimensions for lstm\n", " \n", " output, hidden = self.lstm(lstm_input, hidden) # Output dim = (1, batch_size, decoder_hidden_size)\n", " output = 
F.log_softmax(self.out(output[0]), dim=1) # softmax over all words in vocab\n", " \n", " \n", " return output, hidden, attn_weights" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "hidden": true }, "outputs": [], "source": [ "class AttnDecoderGRU(nn.Module):\n", " def __init__(self, decoder_hidden_size, pretrained_embeddings, seq_length):\n", " super(AttnDecoderGRU, self).__init__()\n", " # Embedding parameters\n", " self.embedding_dim = pretrained_embeddings.shape[1]\n", " self.output_vocab_size = pretrained_embeddings.shape[0]\n", " \n", " # GRU parameters\n", " self.decoder_hidden_size = decoder_hidden_size\n", " self.num_layers = 2 # Potentially add more layers to LSTM later\n", " self.dropout = 0.1 if self.num_layers > 1 else 0 # Potentially add dropout later\n", " \n", " # Attention parameters\n", " self.seq_length = seq_length\n", " self.encoder_hidden_dim = 2*decoder_hidden_size\n", " \n", " # Construct embedding layer for output language\n", " self.embedding = nn.Embedding(self.output_vocab_size, self.embedding_dim)\n", " self.embedding.weight.data.copy_(torch.from_numpy(pretrained_embeddings))\n", " self.embedding.weight.requires_grad = False # we don't want to train the embedding weights\n", " \n", " # Construct layer that calculates attentional weights\n", " self.attn = nn.Linear(self.decoder_hidden_size + self.embedding_dim, self.seq_length)\n", " \n", " # Construct layer that compresses the combined matrix of the input embeddings\n", " # and the encoder inputs after attention has been applied\n", " self.attn_with_input = nn.Linear(self.embedding_dim + self.encoder_hidden_dim, self.embedding_dim)\n", " \n", " # gru for Decoder\n", " self.gru = nn.GRU(self.embedding_dim,\n", " self.decoder_hidden_size,\n", " self.num_layers,\n", " dropout=self.dropout)\n", " \n", " # Initialize hidden to hidden weights in GRU to the Identity matrix\n", " # PyTorch GRU has 3 different hidden to hidden weights stacked in one matrix\n", " identity_init = 
torch.eye(self.decoder_hidden_size)\n", " self.gru.weight_hh_l0.data.copy_(torch.cat([identity_init]*3, dim=0))\n", " self.gru.weight_hh_l1.data.copy_(torch.cat([identity_init]*3, dim=0))\n", " \n", " # Output layer\n", " self.out = nn.Linear(self.decoder_hidden_size, self.output_vocab_size)\n", " \n", " def forward(self, input, hidden, encoder_output):\n", " # Input word indices should have dim (1, batch_size); the embedded output will be (1, batch_size, embedding_dim)\n", " embedded = self.embedding(input)\n", " \n", " # Calculate attention weights; hidden[0] is the hidden state of the first GRU layer\n", " attn_weights = F.softmax(self.attn(torch.cat((hidden[0], embedded[0]), 1)), dim=1)\n", " attn_weights = attn_weights.unsqueeze(1) # Add dimension for batch matrix multiplication\n", " \n", " # Apply attention weights\n", " attn_applied = torch.bmm(attn_weights, encoder_output)\n", " attn_applied = attn_applied.squeeze(1) # Remove extra dimension, dims are now (batch_size, encoder_hidden_size)\n", " \n", " # Prepare GRU input tensor\n", " attn_combined = torch.cat((embedded[0], attn_applied), 1) # Combine embedding input and attn_applied,\n", " gru_input = F.relu(self.attn_with_input(attn_combined)) # pass through fully connected layer with ReLU\n", " gru_input = gru_input.unsqueeze(0) # Add seq dimension so tensor has the dimensions the GRU expects\n", " \n", " output, hidden = self.gru(gru_input, hidden) # Output dim = (1, batch_size, decoder_hidden_size)\n", " output = F.log_softmax(self.out(output[0]), dim=1) # log-softmax over all words in vocab\n", " \n", " return output, hidden, attn_weights" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "#### Testing the Decoder" ] }, { "cell_type": "code", "execution_count": 238, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Test the decoder on sample inputs to check that all the dimensions are correct\n", "test_decoder_hidden_size = 5\n", "\n", "test_decoder = AttnDecoderLSTM(test_decoder_hidden_size, 
en_vectors, test_seq_length).to(device)" ] }, { "cell_type": "code", "execution_count": 239, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Start-of-sentence index for each item in the batch.\n", "# The decoder embeds English words, so the index is looked up in en_word2idx.\n", "input_idx = torch.tensor([en_word2idx['']]*test_batch_size, dtype=torch.long, device=device)" ] }, { "cell_type": "code", "execution_count": 240, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "torch.Size([1])" ] }, "execution_count": 240, "metadata": {}, "output_type": "execute_result" } ], "source": [ "input_idx.shape" ] }, { "cell_type": "code", "execution_count": 241, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Add the sequence dimension so the input has shape (1, batch_size)\n", "input_idx = input_idx.unsqueeze(0)\n", "# Initialize the decoder with the backward-direction states of both encoder layers\n", "test_decoder_hidden = (test_encoder_hidden[0][1::2].contiguous(), test_encoder_hidden[1][1::2].contiguous())" ] }, { "cell_type": "code", "execution_count": 242, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "torch.Size([1, 1])" ] }, "execution_count": 242, "metadata": {}, "output_type": "execute_result" } ], "source": [ "input_idx.shape" ] }, { "cell_type": "code", "execution_count": 243, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "torch.Size([1, 99997])\n" ] } ], "source": [ "output, hidden, attention = test_decoder(input_idx, test_decoder_hidden, test_encoder_output)\n", "print(output.shape)" ] }, { "cell_type": "code", "execution_count": 244, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "torch.Size([2, 1, 5])" ] }, "execution_count": 244, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_decoder_hidden[0].shape" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Part 3: Training the Model" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Training Function" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "hidden": true }, "outputs": [], "source": [ "def 
train(input_tensor, target_tensor, encoder, decoder,\n", " encoder_optimizer, decoder_optimizer, criterion):\n", " \n", " # Initialize encoder hidden state\n", " encoder_hidden = encoder.initHidden(input_tensor.shape[0])\n", " \n", " # Clear the gradients in the optimizers\n", " encoder_optimizer.zero_grad()\n", " decoder_optimizer.zero_grad()\n", " \n", " # Run forward pass through encoder on entire sequence\n", " encoder_output, encoder_hidden = encoder(input_tensor, encoder_hidden)\n", " \n", " # Initialize decoder input (start-of-sentence tag) and hidden state from encoder\n", " decoder_input = torch.tensor([en_word2idx['']]*input_tensor.shape[0], dtype=torch.long, device=device).unsqueeze(0)\n", " \n", " # Initialize the decoder with the backward-direction states of each encoder layer.\n", " # An LSTM encoder returns a (hidden, cell) tuple; a GRU encoder returns a single tensor.\n", " if hasattr(encoder, 'lstm'):\n", " decoder_hidden = (encoder_hidden[0][1::2].contiguous(), encoder_hidden[1][1::2].contiguous())\n", " else:\n", " decoder_hidden = encoder_hidden[1::2].contiguous()\n", " \n", " # Initialize loss\n", " loss = 0\n", " \n", " # Use teacher forcing for roughly half of the batches\n", " use_teacher_forcing = random.random() < 0.5\n", "\n", " if use_teacher_forcing:\n", " # Step through target output sequence\n", " for di in range(seq_length):\n", " output, decoder_hidden, attn_weights = decoder(decoder_input,\n", " decoder_hidden,\n", " encoder_output)\n", " \n", " # Feed target as input to next item in the sequence\n", " decoder_input = target_tensor[di].unsqueeze(0)\n", " loss += criterion(output, target_tensor[di])\n", " else:\n", " # Step through target output sequence\n", " for di in range(seq_length):\n", " \n", " # Forward pass through decoder\n", " output, decoder_hidden, attn_weights = decoder(decoder_input,\n", " decoder_hidden,\n", " encoder_output)\n", " \n", " # Feed the decoder's own prediction as input to next item in the sequence\n", " decoder_input = output.topk(1)[1].view(1,-1).detach()\n", " \n", " # Calculate loss\n", " loss += criterion(output, 
target_tensor[di])\n", " \n", " # Compute the gradients\n", " loss.backward()\n", " \n", " # Clip the gradients\n", " nn.utils.clip_grad_norm_(encoder.parameters(), 25)\n", " nn.utils.clip_grad_norm_(decoder.parameters(), 25)\n", " \n", " # Update the weights\n", " encoder_optimizer.step()\n", " decoder_optimizer.step()\n", " \n", " return loss.item()" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Training Loop" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "hidden": true }, "outputs": [], "source": [ "def trainIters(encoder, decoder, dataloader, epochs, print_every_n_batches=100, learning_rate=0.01):\n", " \n", " # keep track of losses\n", " plot_losses = []\n", "\n", " # Initialize Encoder Optimizer\n", " encoder_parameters = filter(lambda p: p.requires_grad, encoder.parameters())\n", " encoder_optimizer = optim.Adam(encoder_parameters, lr=learning_rate)\n", " \n", " # Initialize Decoder Optimizer\n", " decoder_parameters = filter(lambda p: p.requires_grad, decoder.parameters())\n", " decoder_optimizer = optim.Adam(decoder_parameters, lr=learning_rate)\n", "\n", " # Specify loss function, ignore the token index so it does not contribute to loss.\n", " criterion = nn.NLLLoss(ignore_index=0)\n", " \n", " # Cycle through epochs\n", " for epoch in range(epochs):\n", " loss_avg = 0\n", " print(f'Epoch {epoch + 1}/{epochs}')\n", " # Cycle through batches\n", " for i, batch in enumerate(dataloader):\n", " \n", " input_tensor = batch['french_tensor'].to(device)\n", " target_tensor = batch['english_tensor'].transpose(1,0).to(device)\n", " \n", "\n", " loss = train(input_tensor, target_tensor, encoder, decoder,\n", " encoder_optimizer, decoder_optimizer, criterion)\n", " \n", " loss_avg += loss\n", " if i % print_every_n_batches == 0 and i != 0:\n", " loss_avg /= print_every_n_batches\n", " print(f'After {i} batches, average loss/{print_every_n_batches} batches: {loss_avg}')\n", " 
plot_losses.append(loss) # Record the loss of the most recent batch\n", " loss_avg = 0\n", " return plot_losses" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "### Training the Model" ] }, { "cell_type": "code", "execution_count": 274, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Set hyperparameters and construct dataloader\n", "hidden_size = 256\n", "batch_size = 16\n", "dataloader = DataLoader(french_english_dataset, batch_size=batch_size,\n", " shuffle=True, num_workers=4) " ] }, { "cell_type": "code", "execution_count": 275, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Construct encoder and decoder instances\n", "encoder_lstm = EncoderBiLSTM(hidden_size, fr_vectors).to(device)\n", "decoder_lstm = AttnDecoderLSTM(hidden_size, en_vectors, seq_length).to(device)\n", "\n", "encoder_gru = EncoderBiGRU(hidden_size, fr_vectors).to(device)\n", "decoder_gru = AttnDecoderGRU(hidden_size, en_vectors, seq_length).to(device)" ] }, { "cell_type": "code", "execution_count": 276, "metadata": { "hidden": true }, "outputs": [], "source": [ "from_scratch = True # Set to False if you have saved weights and want to load them\n", "\n", "if not from_scratch:\n", " # Load LSTM weights from an earlier run\n", " encoder_lstm_state_dict = torch.load('models/encoder1_lstm.pth')\n", " decoder_lstm_state_dict = torch.load('models/decoder1_lstm.pth')\n", "\n", " encoder_lstm.load_state_dict(encoder_lstm_state_dict)\n", " decoder_lstm.load_state_dict(decoder_lstm_state_dict)\n", " \n", " # Load GRU weights from an earlier run\n", " encoder_gru_state_dict = torch.load('models/encoder1_gru.pth')\n", " decoder_gru_state_dict = torch.load('models/decoder1_gru.pth')\n", "\n", " encoder_gru.load_state_dict(encoder_gru_state_dict)\n", " decoder_gru.load_state_dict(decoder_gru_state_dict)\n", "else:\n", " print('Training model from scratch.')" ] }, { "cell_type": "code", "execution_count": 361, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "\n" ] } ], "source": [ "# For dataset 1, models were trained for 3 epochs\n", "# For dataset 2, models were trained for 50 epochs\n", "\n", "learning_rate = 0.0001\n", "encoder_lstm.train() # Set model to training mode\n", "decoder_lstm.train() # Set model to training mode\n", "\n", "lstm_losses = trainIters(encoder_lstm, decoder_lstm, dataloader, epochs=50, learning_rate=learning_rate)\n", "\n", "\n", "# Train the GRU-based network with the same settings\n", "print('Training GRU based network.')\n", "encoder_gru.train() # Set model to training mode\n", "decoder_gru.train() # Set model to training mode\n", "\n", "gru_losses = trainIters(encoder_gru, decoder_gru, dataloader, epochs=50, learning_rate=learning_rate)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "hidden": true }, "outputs": [], "source": [ "np.save('data/lstm2_losses.npy', lstm_losses)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "hidden": true }, "outputs": [], "source": [ "np.save('data/gru2_losses.npy', gru_losses)" ] }, { "cell_type": "code", "execution_count": 277, "metadata": { "hidden": true }, "outputs": [], "source": [ "lstm_losses = np.load('data/lstm1_losses.npy')\n", "gru_losses = np.load('data/gru1_losses.npy')" ] }, { "cell_type": "code", "execution_count": 294, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 294, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(lstm_losses)\n", "plt.plot(gru_losses)\n", "\n", "plt.title('Loss Plots for Dataset 1; Trained on 1 Epoch')\n", "plt.xlabel('Batches')\n", "plt.xticks([0,20,40,60,80],[0,2000,4000,6000,8000])\n", "plt.ylabel('Loss per Batch, NLL')\n", "plt.legend(['LSTM', 'GRU'])" ] }, { "cell_type": "code", "execution_count": 170, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Save the model weights to continue later\n", "torch.save(encoder_lstm.state_dict(), 'models/encoder2_lstm.pth')\n", "torch.save(decoder_lstm.state_dict(), 'models/decoder2_lstm.pth')" ] }, { "cell_type": "code", "execution_count": 171, "metadata": { "hidden": true }, "outputs": [], "source": [ "torch.save(encoder_gru.state_dict(), 'models/encoder2_gru.pth')\n", "torch.save(decoder_gru.state_dict(), 'models/decoder2_gru.pth')" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## Part 4: Using the Model for Evaluation" ] }, { "cell_type": "code", "execution_count": 295, "metadata": { "hidden": true }, "outputs": [], "source": [ "# Build the idx to word dictionaries to convert predicted indices to words\n", "en_idx2word = {idx: word for word, idx in en_word2idx.items()}\n", "fr_idx2word = {idx: word for word, idx in fr_word2idx.items()}" ] }, { "cell_type": "code", "execution_count": 309, "metadata": { "hidden": true }, "outputs": [], "source": [ "def get_batch(dataloader):\n", " # Return the first batch produced by the dataloader\n", " for batch in dataloader:\n", " return batch" ] }, { "cell_type": "code", "execution_count": 310, "metadata": { "hidden": true }, "outputs": [], "source": [ "def evaluate(input_tensor, encoder, decoder):\n", " with torch.no_grad():\n", " encoder_hidden = encoder.initHidden(1)\n", " encoder.eval()\n", " decoder.eval()\n", "\n", " encoder_output, encoder_hidden = encoder(input_tensor.to(device), encoder_hidden)\n", "\n", " # The decoder embeds English words, so its start index comes from en_word2idx, as in train()\n", " decoder_input = torch.tensor([en_word2idx['']]*input_tensor.shape[0], dtype=torch.long, 
device=device).unsqueeze(0)\n", " try:\n", " encoder.lstm\n", " decoder_hidden = (encoder_hidden[0][1::2].contiguous(), encoder_hidden[1][1::2].contiguous())\n", " except AttributeError:\n", " decoder_hidden = encoder_hidden[1::2].contiguous()\n", "\n", " output_list = []\n", " attn_weight_list = np.zeros((seq_length, seq_length))\n", " for di in range(seq_length):\n", " output, decoder_hidden, attn_weights = decoder(decoder_input,\n", " decoder_hidden,\n", " encoder_output)\n", "\n", " decoder_input = output.topk(1)[1].detach()\n", " output_list.append(output.topk(1)[1])\n", " word = en_idx2word[output.topk(1)[1].item()]\n", "\n", " attn_weight_list[di] += attn_weights[0,0,:].cpu().numpy()\n", " return output_list, attn_weight_list" ] }, { "cell_type": "code", "execution_count": 357, "metadata": { "hidden": true }, "outputs": [], "source": [ "batch = get_batch(dataloader)\n", "input_tensor = batch['french_tensor'][11].unsqueeze_(0)\n", "output_list, attn = evaluate(input_tensor, encoder_lstm, decoder_lstm)\n", "gru_output_list, gru_attn = evaluate(input_tensor, encoder_gru, decoder_gru)" ] }, { "cell_type": "code", "execution_count": 358, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input Sentence:\n", " elle déteste les poires , les fraises et les oranges . \n", "\n", "Target Sentence:\n", " she dislikes pears , strawberries , and oranges .\n", "\n", "LSTM model output:\n", " she dislikes pears , strawberries , and oranges . \n", "\n", "GRU model output:\n", " she dislikes pears , strawberries , and oranges . \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/mac/anaconda3/envs/pytorch/lib/python3.6/site-packages/matplotlib/cbook/deprecation.py:107: MatplotlibDeprecationWarning: Adding an axes using the same arguments as a previous axes currently reuses the earlier instance. In a future version, a new instance will always be created and returned. 
Meanwhile, this warning can be suppressed, and the future behavior ensured, by passing a unique label to each axes instance.\n", " warnings.warn(message, mplDeprecation, stacklevel=1)\n" ] }, { "data": { "text/plain": [ "[Text(0,0,'she'),\n", " Text(0,0,'dislikes'),\n", " Text(0,0,'pears'),\n", " Text(0,0,','),\n", " Text(0,0,'strawberries'),\n", " Text(0,0,','),\n", " Text(0,0,'and'),\n", " Text(0,0,'oranges'),\n", " Text(0,0,'.'),\n", " Text(0,0,'')]" ] }, "execution_count": 358, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAUYAAAFRCAYAAAAIBATTAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3Xm4HGWdxfHvSUgIW8QQREAhgBo2JUBQEFEUREUd0WFABFEUEGVUREYHwQWXcUNmGEaRyMjmuDCi484ioijKEgKBAG7DIkoYjGwhCVnP/FHVoVPc5HZyq9Ld957P8/ST21XVv3670vfct9ZXtomIiCeM6nYDIiJ6TYIxIqIiwRgRUZFgjIioSDBGRFQkGCMiKhKMMexJmiTJktbpYNm3SvrV2mhXXSR9WdKHu92O4STBGCsl6Y2SrpM0T9ID5c/vkqRy/vmSFkl6TNKDkq6QtH3b6z8m6WsD1LWkZ63kPe8ua06sTL+5fN2kej/l6pO0QfmZfzzAvLsl7d/2vONQ7vC9nxTcto+z/Yk66kchwRgDkvR+4Ezg88DTgc2A44C9gbFti37O9obAlsBfgP+s4e3vAg5ra8tzgfVqqFuXg4GFwAGSNu92Y6J+CcZ4EklPAT4OvMv2t23PdeEm24fbXlh9je0FwMXAlBqacBFwZNvztwAXVtso6UJJf5V0j6RTJY0q542WdLqkOZLuBF49wGv/U9JsSX+R9ElJo1ejfW8BvgzcAhzeVvciYCvgB2WP8gPA1eXsh8tpe5XLvk3SHZIeknSZpK3b6ljScZL+UM7/ogo7lO+7V1nr4XL58yV9su31x0j6Y9mL/76kLQarvRqffURIMMZA9gLWBb7X6QskbUDRy/tjDe9/LTBe0g5lYB0KVDfJzwKeAmwLvIQiSI8q5x0DvAbYFZhK0cNrdwGwBHhWucwBwNGdNEzSVsC+wH+Vj+UBbvvNwJ+A19re0PbngBeXszcup/1G0kHAh4A3AJsCvwS+UXmr1wB7ALsAhwCvsH0HRa/9N2WtjQdo38uAT5ev2Ry4B/jmYLU7+ewjSYIxBjIRmGN7SWuCpF9LeljSAkkvblv2pLLnMhd4EfDmmtrQ6jW+HPgtxWZ6qy2tsDy57M3eDXyh7b0PAf7N9r22H6QIitZrNwNeBZxge57tB4B/Bd7YYbuOBG6xfTtFmO0kadfV/GzvAD5t+45yHf8LMKW91wh8xvbDtv8EXEXnPfHDga/anlH27E+m6GFOqqH2iJFgjIH8DZjYfsDA9gvLHsrfWPF7c3o5fRKwAJjcNm8JMKa9sKTW88WDtOEi4E3AW6lsRlME91iK3lDLPRT7OQG2AO6tzGvZumzT7DLoHwbOAZ42SHtajqToKWL7PuAXFJvWq2Nr4My2938QUFv7Ae5v+3k+sGGHtbeg7fPafozi/6yO2iNGgjEG8huKgwuv6/QFZe/jvRS/8K0DJX+iCMx22wBLaesBrqTePRQHYQ4EvlOZPYciWNt7WFu11ZwNPLMyr+Veis820fbG5WO87Z1W1R4ASS8Eng2cLOl+SfcDLwAOa/sjUr1d1UC3r7oXeEfb+29sez3bvx6sDSup1+4+2tZLuYtjEwZZ37GiBGM8ie2HgdOAL0k6WNKGkkZJmgJssIrXXUHxi3lsOelSYLKkN0saI2kCxWbjt9s301fh7cDLbM+rvM9SigM9n5K0UbkJeiJP7Ie8GHiPpGdIeirwz22vnQ1cDnxB0vjyc20n6SUdtOctwBXAjhSbn1OAnYH1KTbPAf6PYr9ny1+BZZVpX6YI151g+cGgf+jg/Vv1nyFp7Ermfx04StIUSetSrO/ryt0N0aEEYwyoPHBwIvAB4AGKX8hzgA8Cq+rZfB74gKR1y/13B1LsU3sAmAU8Aryzwz
\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print('Input Sentence:')\n", "output = ''\n", "for index in input_tensor[0]:\n", " word = fr_idx2word[index.item()]\n", " if word != '':\n", " output += ' ' + word\n", " else:\n", " output += ' ' + word\n", " print(output)\n", " break\n", "\n", "print('\\nTarget Sentence:')\n", "print(' ' + batch['english_sentence'][11] + '')\n", "input_len = len(batch['french_sentence'][11].split())\n", "\n", "print('\\nLSTM model output:')\n", "output = ''\n", "for index in output_list:\n", " word = en_idx2word[index.item()]\n", " if word != '':\n", " output += ' ' + word\n", " else:\n", " output += ' ' + word\n", " print(output)\n", " break\n", "\n", "fig = plt.figure()\n", "plt.title('LSTM Model Attention\\n\\n\\n\\n\\n')\n", "ax = fig.add_subplot(111)\n", "ax.matshow(attn[:len(output.split()), :input_len])\n", "ax.set_xticks(np.arange(0,input_len, step=1))\n", "ax.set_yticks(np.arange(0,len(output.split())))\n", "ax.set_xticklabels(batch['french_sentence'][11].split(), rotation=90)\n", "ax.set_yticklabels(output.split()+[''])\n", "\n", "\n", "output = ''\n", "print('\\nGRU model output:')\n", "for index in gru_output_list:\n", " word = en_idx2word[index.item()]\n", " if word != '':\n", " output += ' ' + word\n", " else:\n", " output += ' ' + word\n", " print(output)\n", " break\n", " \n", "fig = plt.figure()\n", "plt.title('GRU Model Attention\\n\\n\\n\\n\\n')\n", "ax2 = fig.add_subplot(111)\n", "ax2.matshow(gru_attn[:len(output.split()), :input_len])\n", "ax2.set_xticks(np.arange(0,input_len, step=1))\n", "ax2.set_yticks(np.arange(0,len(output.split())))\n", "ax2.set_xticklabels(batch['french_sentence'][11].split(), rotation=90)\n", "ax2.set_yticklabels(output.split()+[''])\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [] } ], 
"metadata": { "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }