{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assignment 4 - Naive Machine Translation and LSH\n", "\n", "You will now implement your first machine translation system and then you\n", "will see how locality sensitive hashing works. Let's get started by importing\n", "the required functions!\n", "\n", "If you are running this notebook in your local computer, don't forget to\n", "download the twitter samples and stopwords from nltk.\n", "\n", "```\n", "nltk.download('stopwords')\n", "nltk.download('twitter_samples')\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pdb\n", "import pickle\n", "import string\n", "\n", "import time\n", "\n", "import gensim\n", "import matplotlib.pyplot as plt\n", "import nltk\n", "import numpy as np\n", "import scipy\n", "import sklearn\n", "from gensim.models import KeyedVectors\n", "from nltk.corpus import stopwords, twitter_samples\n", "from nltk.tokenize import TweetTokenizer\n", "\n", "from utils import (cosine_similarity, get_dict,\n", " process_tweet)\n", "from os import getcwd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# add folder, tmp2, from our local workspace containing pre-downloaded corpora files to nltk's data path\n", "filePath = f\"{getcwd()}/../tmp2/\"\n", "nltk.data.path.append(filePath)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. The word embeddings data for English and French words\n", "\n", "Write a program that translates English to French.\n", "\n", "## The data\n", "\n", "The full dataset for English embeddings is about 3.64 gigabytes, and the French\n", "embeddings are about 629 megabytes. 
To prevent the Coursera workspace from\n", "crashing, we've extracted a subset of the embeddings for the words that you'll\n", "use in this assignment.\n", "\n", "If you want to run this on your local computer with the full dataset, you can download:\n", "* the English embeddings from the Google code archive word2vec project\n", "[look for GoogleNews-vectors-negative300.bin.gz](https://code.google.com/archive/p/word2vec/)\n", " * You'll need to unzip the file first.\n", "* and the French embeddings from\n", "[cross_lingual_text_classification](https://github.com/vjstark/crosslingual_text_classification).\n", " * in the terminal, type (in one line)\n", " `curl -o ./wiki.multi.fr.vec https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.fr.vec`\n", "\n", "Then copy-paste the code below and run it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "# Use this code to download and process the full dataset on your local computer\n", "\n", "import pickle\n", "\n", "from gensim.models import KeyedVectors\n", "\n", "from utils import get_dict\n", "\n", "en_embeddings = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)\n", "fr_embeddings = KeyedVectors.load_word2vec_format('./wiki.multi.fr.vec')\n", "\n", "\n", "# load the English to French dictionaries\n", "en_fr_train = get_dict('en-fr.train.txt')\n", "print('The length of the English to French training dictionary is', len(en_fr_train))\n", "en_fr_test = get_dict('en-fr.test.txt')\n", "print('The length of the English to French test dictionary is', len(en_fr_test))\n", "\n", "english_set = set(en_embeddings.vocab)\n", "french_set = set(fr_embeddings.vocab)\n", "en_embeddings_subset = {}\n", "fr_embeddings_subset = {}\n", "french_words = set(en_fr_train.values())\n", "\n", "for en_word in en_fr_train.keys():\n", "    fr_word = en_fr_train[en_word]\n", "    if fr_word in french_set and en_word in english_set:\n", "        en_embeddings_subset[en_word] = en_embeddings[en_word]\n", "        fr_embeddings_subset[fr_word] = 
fr_embeddings[fr_word]\n", "\n", "\n", "for en_word in en_fr_test.keys():\n", "    fr_word = en_fr_test[en_word]\n", "    if fr_word in french_set and en_word in english_set:\n", "        en_embeddings_subset[en_word] = en_embeddings[en_word]\n", "        fr_embeddings_subset[fr_word] = fr_embeddings[fr_word]\n", "\n", "\n", "pickle.dump(en_embeddings_subset, open(\"en_embeddings.p\", \"wb\"))\n", "pickle.dump(fr_embeddings_subset, open(\"fr_embeddings.p\", \"wb\"))\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The subset of data\n", "\n", "To do the assignment on the Coursera workspace, we'll use the subset of word embeddings." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "en_embeddings_subset = pickle.load(open(\"en_embeddings.p\", \"rb\"))\n", "fr_embeddings_subset = pickle.load(open(\"fr_embeddings.p\", \"rb\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Look at the data\n", "\n", "* `en_embeddings_subset`: the key is an English word, and the value is a\n", "300-dimensional array, which is the embedding for that word.\n", "```\n", "'the': array([ 0.08007812, 0.10498047, 0.04980469, 0.0534668 , -0.06738281, ....\n", "```\n", "\n", "* `fr_embeddings_subset`: the key is a French word, and the value is a\n", "300-dimensional array, which is the embedding for that word.\n", "```\n", "'la': array([-6.18250e-03, -9.43867e-04, -8.82648e-03, 3.24623e-02,...\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Load two dictionaries mapping English to French words\n", "* A training dictionary\n", "* and a testing dictionary."
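, "\n", "\n", "`get_dict` returns a plain Python `dict` mapping each English word to its French translation. A toy stand-in (hypothetical pairs, not the contents of the real files) behaves the same way:\n", "\n", "```python\n", "# toy English-to-French dictionary (illustrative pairs only)\n", "en_fr_toy = {'the': 'la', 'and': 'et', 'for': 'pour'}\n", "\n", "print(en_fr_toy['and'])  # et\n", "print(len(en_fr_toy))  # 3\n", "```"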
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The length of the English to French training dictionary is 5000\n", "The length of the English to French test dictionary is 5000\n" ] } ], "source": [ "# loading the english to french dictionaries\n", "en_fr_train = get_dict('en-fr.train.txt')\n", "print('The length of the English to French training dictionary is', len(en_fr_train))\n", "en_fr_test = get_dict('en-fr.test.txt')\n", "print('The length of the English to French test dictionary is', len(en_fr_train))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Looking at the English French dictionary\n", "\n", "* `en_fr_train` is a dictionary where the key is the English word and the value\n", "is the French translation of that English word.\n", "```\n", "{'the': 'la',\n", " 'and': 'et',\n", " 'was': 'Γ©tait',\n", " 'for': 'pour',\n", "```\n", "\n", "* `en_fr_test` is similar to `en_fr_train`, but is a test set. We won't look at it\n", "until we get to testing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1.1 Generate embedding and transform matrices\n", "\n", "#### Exercise: Translating English dictionary to French by using embeddings\n", "\n", "You will now implement a function `get_matrices`, which takes the loaded data\n", "and returns matrices `X` and `Y`.\n", "\n", "Inputs:\n", "- `en_fr` : English to French dictionary\n", "- `en_embeddings` : English to embeddings dictionary\n", "- `fr_embeddings` : French to embeddings dictionary\n", "\n", "Returns:\n", "- Matrix `X` and matrix `Y`, where each row in X is the word embedding for an\n", "english word, and the same row in Y is the word embedding for the French\n", "version of that English word.\n", "\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "