{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "I completed [Assignment 1](http://web.stanford.edu/class/cs224n/assignments/a1_preview/exploring_word_vectors.html) from the Stanford CS224N course.\n", "# Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package reuters to\n", "[nltk_data] /Users/nwams/anaconda3/lib/nltk_data...\n", "[nltk_data] Package reuters is already up-to-date!\n" ] } ], "source": [ "# All Import Statements Defined Here\n", "# ----------------\n", "import sys\n", "assert sys.version_info[0]==3\n", "assert sys.version_info[1] >= 5\n", "\n", "from gensim.models import KeyedVectors\n", "from gensim.test.utils import datapath\n", "import pprint\n", "import matplotlib.pyplot as plt\n", "plt.rcParams['figure.figsize'] = [10, 5]\n", "import nltk\n", "nltk.download('reuters')\n", "from nltk.corpus import reuters\n", "import numpy as np\n", "import random\n", "import scipy as sp\n", "from sklearn.decomposition import TruncatedSVD\n", "from sklearn.decomposition import PCA\n", "\n", "START_TOKEN = '<START>'\n", "END_TOKEN = '<END>'\n", "\n", "np.random.seed(0)\n", "random.seed(0)\n", "# ----------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Word Vectors\n", "Word vectors are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, translation, etc., so it is important to build some intuitions as to their strengths and weaknesses. Here, you will explore two types of word vectors: those derived from co-occurrence matrices, and those derived via word2vec.\n", "\n", "**Note on Terminology**: The terms \"word vectors\" and \"word embeddings\" are often used interchangeably. The term \"embedding\" refers to the fact that we are encoding aspects of a word's meaning in a lower-dimensional space. 
As [Wikipedia](https://en.wikipedia.org/wiki/Word_embedding) states, \"conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 1: Count-Based Word Vectors\n", "Most word vector models start from the following idea:\n", "\n", "You shall know a word by the company it keeps (Firth, J. R. 1957:11)\n", "\n", "Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, will be used in similar contexts. As a result, similar words will often be spoken or written along with a shared subset of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words. With this intuition in mind, many \"old school\" approaches to constructing word vectors relied on word counts. Here we elaborate upon one of those strategies, co-occurrence matrices (for more information, see here or here)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Co-Occurrence\n", "\n", "A co-occurrence matrix counts how often things co-occur in some environment. Given some word $w_i$ occurring in the document, we consider the *context window* surrounding $w_i$. Supposing our fixed window size is $n$, then this is the $n$ preceding and $n$ subsequent words in that document, i.e. words $w_{i-n} \\dots w_{i-1}$ and $w_{i+1} \\dots w_{i+n}$. 
We build a *co-occurrence matrix* $M$, which is a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window.\n", "\n", "**Example: Co-Occurrence with Fixed Window of n=1**:\n", "\n", "Document 1: \"all that glitters is not gold\"\n", "\n", "Document 2: \"all is well that ends well\"\n", "\n", "\n", "| * | START | all | that | glitters | is | not | gold | well | ends | END |\n", "|----------|-------|-----|------|----------|------|------|-------|------|------|-----|\n", "| START | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |\n", "| all | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |\n", "| that | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |\n", "| glitters | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |\n", "| is | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |\n", "| not | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |\n", "| gold | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |\n", "| well | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |\n", "| ends | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |\n", "| END | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |\n", "\n", "**Note:** In NLP, we often add START and END tokens to represent the beginning and end of sentences, paragraphs or documents. In this case we imagine START and END tokens encapsulating each document, e.g., \"START All that glitters is not gold END\", and include these tokens in our co-occurrence counts.\n", "\n", "The rows (or columns) of this matrix provide one type of word vectors (those based on word-word co-occurrence), but the vectors will be large in general (linear in the number of distinct words in a corpus). Thus, our next step is to run *dimensionality reduction*. In particular, we will run *SVD (Singular Value Decomposition)*, which is a kind of generalized *PCA (Principal Components Analysis)* to select the top $k$ principal components. Here's a visualization of dimensionality reduction with SVD. In this picture our co-occurrence matrix is $A$ with $n$ rows corresponding to $n$ words. 
We obtain a full matrix decomposition, with the singular values ordered in the diagonal $S$ matrix, and our new, shorter length-$k$ word vectors in $U_k$.\n", "\n", "![Picture of an SVD](imgs/svd.png \"SVD\")\n", "\n", "This reduced-dimensionality co-occurrence representation preserves semantic relationships between words, e.g. *doctor* and *hospital* will be closer than *doctor* and *dog*. \n", "\n", "**Notes**\n", "I personally found that this YouTube video [Lecture 47 — Singular Value Decomposition | Stanford University](https://www.youtube.com/watch?v=P5mlg91as1c) gives a fantastic high-level explanation of SVD. For the purpose of this class, **you only need to know how to extract the k-dimensional embeddings by utilizing pre-programmed implementations of these algorithms from the numpy, scipy, or sklearn Python packages**. In practice, it is challenging to apply full SVD to large corpora because of the memory needed to perform PCA or SVD. However, if you only want the top $k$ vector components for relatively small $k$ (known as *[Truncated SVD](https://en.wikipedia.org/wiki/Singular_value_decomposition#Truncated_SVD)*), then there are reasonably scalable techniques to compute those iteratively." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting Co-Occurrence Word Embeddings\n", "Here, we will be using the Reuters (business and financial news) corpus. If you haven't run the import cell at the top of this page, please run it now (click it and press SHIFT-RETURN). The corpus consists of 10,788 news documents totaling 1.3 million words. These documents span 90 categories and are split into train and test. For more details, please see https://www.nltk.org/book/ch02.html. We provide a `read_corpus` function below that pulls out only articles from the \"crude\" (i.e. news articles about oil, gas, etc.) category. The function also adds START and END tokens to each of the documents, and lowercases words." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def read_corpus(category=\"crude\"):\n", "    \"\"\" Read files from the specified Reuters category.\n", "        Params:\n", "            category (string): category name\n", "        Return:\n", "            list of lists, with words from each of the processed files\n", "    \"\"\"\n", "    files = reuters.fileids(category)\n", "    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at what these documents are like…" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[['', 'japan', 'to', 'revise', 'long', '-', 'term', 'energy', 'demand', 'downwards', 'the',\n", " 'ministry', 'of', 'international', 'trade', 'and', 'industry', '(', 'miti', ')', 'will', 'revise',\n", " 'its', 'long', '-', 'term', 'energy', 'supply', '/', 'demand', 'outlook', 'by', 'august', 'to',\n", " 'meet', 'a', 'forecast', 'downtrend', 'in', 'japanese', 'energy', 'demand', ',', 'ministry',\n", " 'officials', 'said', '.', 'miti', 'is', 'expected', 'to', 'lower', 'the', 'projection', 'for',\n", " 'primary', 'energy', 'supplies', 'in', 'the', 'year', '2000', 'to', '550', 'mln', 'kilolitres',\n", " '(', 'kl', ')', 'from', '600', 'mln', ',', 'they', 'said', '.', 'the', 'decision', 'follows',\n", " 'the', 'emergence', 'of', 'structural', 'changes', 'in', 'japanese', 'industry', 'following',\n", " 'the', 'rise', 'in', 'the', 'value', 'of', 'the', 'yen', 'and', 'a', 'decline', 'in', 'domestic',\n", " 'electric', 'power', 'demand', '.', 'miti', 'is', 'planning', 'to', 'work', 'out', 'a', 'revised',\n", " 'energy', 'supply', '/', 'demand', 'outlook', 'through', 'deliberations', 'of', 'committee',\n", " 'meetings', 'of', 'the', 'agency', 'of', 'natural', 'resources', 'and', 'energy', ',', 'the',\n", " 'officials', 'said', '.', 'they', 'said', 'miti', 
'will', 'also', 'review', 'the', 'breakdown',\n", " 'of', 'energy', 'supply', 'sources', ',', 'including', 'oil', ',', 'nuclear', ',', 'coal', 'and',\n", " 'natural', 'gas', '.', 'nuclear', 'energy', 'provided', 'the', 'bulk', 'of', 'japan', \"'\", 's',\n", " 'electric', 'power', 'in', 'the', 'fiscal', 'year', 'ended', 'march', '31', ',', 'supplying',\n", " 'an', 'estimated', '27', 'pct', 'on', 'a', 'kilowatt', '/', 'hour', 'basis', ',', 'followed',\n", " 'by', 'oil', '(', '23', 'pct', ')', 'and', 'liquefied', 'natural', 'gas', '(', '21', 'pct', '),',\n", " 'they', 'noted', '.', ''],\n", " ['', 'energy', '/', 'u', '.', 's', '.', 'petrochemical', 'industry', 'cheap', 'oil',\n", " 'feedstocks', ',', 'the', 'weakened', 'u', '.', 's', '.', 'dollar', 'and', 'a', 'plant',\n", " 'utilization', 'rate', 'approaching', '90', 'pct', 'will', 'propel', 'the', 'streamlined', 'u',\n", " '.', 's', '.', 'petrochemical', 'industry', 'to', 'record', 'profits', 'this', 'year', ',',\n", " 'with', 'growth', 'expected', 'through', 'at', 'least', '1990', ',', 'major', 'company',\n", " 'executives', 'predicted', '.', 'this', 'bullish', 'outlook', 'for', 'chemical', 'manufacturing',\n", " 'and', 'an', 'industrywide', 'move', 'to', 'shed', 'unrelated', 'businesses', 'has', 'prompted',\n", " 'gaf', 'corp', '&', 'lt', ';', 'gaf', '>,', 'privately', '-', 'held', 'cain', 'chemical', 'inc',\n", " ',', 'and', 'other', 'firms', 'to', 'aggressively', 'seek', 'acquisitions', 'of', 'petrochemical',\n", " 'plants', '.', 'oil', 'companies', 'such', 'as', 'ashland', 'oil', 'inc', '&', 'lt', ';', 'ash',\n", " '>,', 'the', 'kentucky', '-', 'based', 'oil', 'refiner', 'and', 'marketer', ',', 'are', 'also',\n", " 'shopping', 'for', 'money', '-', 'making', 'petrochemical', 'businesses', 'to', 'buy', '.', '\"',\n", " 'i', 'see', 'us', 'poised', 'at', 'the', 'threshold', 'of', 'a', 'golden', 'period', ',\"', 'said',\n", " 'paul', 'oreffice', ',', 'chairman', 'of', 'giant', 'dow', 'chemical', 'co', '&', 'lt', 
';',\n", " 'dow', '>,', 'adding', ',', '\"', 'there', \"'\", 's', 'no', 'major', 'plant', 'capacity', 'being',\n", " 'added', 'around', 'the', 'world', 'now', '.', 'the', 'whole', 'game', 'is', 'bringing', 'out',\n", " 'new', 'products', 'and', 'improving', 'the', 'old', 'ones', '.\"', 'analysts', 'say', 'the',\n", " 'chemical', 'industry', \"'\", 's', 'biggest', 'customers', ',', 'automobile', 'manufacturers',\n", " 'and', 'home', 'builders', 'that', 'use', 'a', 'lot', 'of', 'paints', 'and', 'plastics', ',',\n", " 'are', 'expected', 'to', 'buy', 'quantities', 'this', 'year', '.', 'u', '.', 's', '.',\n", " 'petrochemical', 'plants', 'are', 'currently', 'operating', 'at', 'about', '90', 'pct',\n", " 'capacity', ',', 'reflecting', 'tighter', 'supply', 'that', 'could', 'hike', 'product', 'prices',\n", " 'by', '30', 'to', '40', 'pct', 'this', 'year', ',', 'said', 'john', 'dosher', ',', 'managing',\n", " 'director', 'of', 'pace', 'consultants', 'inc', 'of', 'houston', '.', 'demand', 'for', 'some',\n", " 'products', 'such', 'as', 'styrene', 'could', 'push', 'profit', 'margins', 'up', 'by', 'as',\n", " 'much', 'as', '300', 'pct', ',', 'he', 'said', '.', 'oreffice', ',', 'speaking', 'at', 'a',\n", " 'meeting', 'of', 'chemical', 'engineers', 'in', 'houston', ',', 'said', 'dow', 'would', 'easily',\n", " 'top', 'the', '741', 'mln', 'dlrs', 'it', 'earned', 'last', 'year', 'and', 'predicted', 'it',\n", " 'would', 'have', 'the', 'best', 'year', 'in', 'its', 'history', '.', 'in', '1985', ',', 'when',\n", " 'oil', 'prices', 'were', 'still', 'above', '25', 'dlrs', 'a', 'barrel', 'and', 'chemical',\n", " 'exports', 'were', 'adversely', 'affected', 'by', 'the', 'strong', 'u', '.', 's', '.', 'dollar',\n", " ',', 'dow', 'had', 'profits', 'of', '58', 'mln', 'dlrs', '.', '\"', 'i', 'believe', 'the',\n", " 'entire', 'chemical', 'industry', 'is', 'headed', 'for', 'a', 'record', 'year', 'or', 'close',\n", " 'to', 'it', ',\"', 'oreffice', 'said', '.', 'gaf', 'chairman', 'samuel', 'heyman', 
'estimated',\n", " 'that', 'the', 'u', '.', 's', '.', 'chemical', 'industry', 'would', 'report', 'a', '20', 'pct',\n", " 'gain', 'in', 'profits', 'during', '1987', '.', 'last', 'year', ',', 'the', 'domestic',\n", " 'industry', 'earned', 'a', 'total', 'of', '13', 'billion', 'dlrs', ',', 'a', '54', 'pct', 'leap',\n", " 'from', '1985', '.', 'the', 'turn', 'in', 'the', 'fortunes', 'of', 'the', 'once', '-', 'sickly',\n", " 'chemical', 'industry', 'has', 'been', 'brought', 'about', 'by', 'a', 'combination', 'of', 'luck',\n", " 'and', 'planning', ',', 'said', 'pace', \"'\", 's', 'john', 'dosher', '.', 'dosher', 'said', 'last',\n", " 'year', \"'\", 's', 'fall', 'in', 'oil', 'prices', 'made', 'feedstocks', 'dramatically', 'cheaper',\n", " 'and', 'at', 'the', 'same', 'time', 'the', 'american', 'dollar', 'was', 'weakening', 'against',\n", " 'foreign', 'currencies', '.', 'that', 'helped', 'boost', 'u', '.', 's', '.', 'chemical',\n", " 'exports', '.', 'also', 'helping', 'to', 'bring', 'supply', 'and', 'demand', 'into', 'balance',\n", " 'has', 'been', 'the', 'gradual', 'market', 'absorption', 'of', 'the', 'extra', 'chemical',\n", " 'manufacturing', 'capacity', 'created', 'by', 'middle', 'eastern', 'oil', 'producers', 'in',\n", " 'the', 'early', '1980s', '.', 'finally', ',', 'virtually', 'all', 'major', 'u', '.', 's', '.',\n", " 'chemical', 'manufacturers', 'have', 'embarked', 'on', 'an', 'extensive', 'corporate',\n", " 'restructuring', 'program', 'to', 'mothball', 'inefficient', 'plants', ',', 'trim', 'the',\n", " 'payroll', 'and', 'eliminate', 'unrelated', 'businesses', '.', 'the', 'restructuring', 'touched',\n", " 'off', 'a', 'flurry', 'of', 'friendly', 'and', 'hostile', 'takeover', 'attempts', '.', 'gaf', ',',\n", " 'which', 'made', 'an', 'unsuccessful', 'attempt', 'in', '1985', 'to', 'acquire', 'union',\n", " 'carbide', 'corp', '&', 'lt', ';', 'uk', '>,', 'recently', 'offered', 'three', 'billion', 'dlrs',\n", " 'for', 'borg', 'warner', 'corp', '&', 'lt', ';', 'bor', '>,', 
'a', 'chicago', 'manufacturer',\n", " 'of', 'plastics', 'and', 'chemicals', '.', 'another', 'industry', 'powerhouse', ',', 'w', '.',\n", " 'r', '.', 'grace', '&', 'lt', ';', 'gra', '>', 'has', 'divested', 'its', 'retailing', ',',\n", " 'restaurant', 'and', 'fertilizer', 'businesses', 'to', 'raise', 'cash', 'for', 'chemical',\n", " 'acquisitions', '.', 'but', 'some', 'experts', 'worry', 'that', 'the', 'chemical', 'industry',\n", " 'may', 'be', 'headed', 'for', 'trouble', 'if', 'companies', 'continue', 'turning', 'their',\n", " 'back', 'on', 'the', 'manufacturing', 'of', 'staple', 'petrochemical', 'commodities', ',', 'such',\n", " 'as', 'ethylene', ',', 'in', 'favor', 'of', 'more', 'profitable', 'specialty', 'chemicals',\n", " 'that', 'are', 'custom', '-', 'designed', 'for', 'a', 'small', 'group', 'of', 'buyers', '.', '\"',\n", " 'companies', 'like', 'dupont', '&', 'lt', ';', 'dd', '>', 'and', 'monsanto', 'co', '&', 'lt', ';',\n", " 'mtc', '>', 'spent', 'the', 'past', 'two', 'or', 'three', 'years', 'trying', 'to', 'get', 'out',\n", " 'of', 'the', 'commodity', 'chemical', 'business', 'in', 'reaction', 'to', 'how', 'badly', 'the',\n", " 'market', 'had', 'deteriorated', ',\"', 'dosher', 'said', '.', '\"', 'but', 'i', 'think', 'they',\n", " 'will', 'eventually', 'kill', 'the', 'margins', 'on', 'the', 'profitable', 'chemicals', 'in',\n", " 'the', 'niche', 'market', '.\"', 'some', 'top', 'chemical', 'executives', 'share', 'the',\n", " 'concern', '.', '\"', 'the', 'challenge', 'for', 'our', 'industry', 'is', 'to', 'keep', 'from',\n", " 'getting', 'carried', 'away', 'and', 'repeating', 'past', 'mistakes', ',\"', 'gaf', \"'\", 's',\n", " 'heyman', 'cautioned', '.', '\"', 'the', 'shift', 'from', 'commodity', 'chemicals', 'may', 'be',\n", " 'ill', '-', 'advised', '.', 'specialty', 'businesses', 'do', 'not', 'stay', 'special', 'long',\n", " '.\"', 'houston', '-', 'based', 'cain', 'chemical', ',', 'created', 'this', 'month', 'by', 'the',\n", " 'sterling', 'investment', 'banking', 
'group', ',', 'believes', 'it', 'can', 'generate', '700',\n", " 'mln', 'dlrs', 'in', 'annual', 'sales', 'by', 'bucking', 'the', 'industry', 'trend', '.',\n", " 'chairman', 'gordon', 'cain', ',', 'who', 'previously', 'led', 'a', 'leveraged', 'buyout', 'of',\n", " 'dupont', \"'\", 's', 'conoco', 'inc', \"'\", 's', 'chemical', 'business', ',', 'has', 'spent', '1',\n", " '.', '1', 'billion', 'dlrs', 'since', 'january', 'to', 'buy', 'seven', 'petrochemical', 'plants',\n", " 'along', 'the', 'texas', 'gulf', 'coast', '.', 'the', 'plants', 'produce', 'only', 'basic',\n", " 'commodity', 'petrochemicals', 'that', 'are', 'the', 'building', 'blocks', 'of', 'specialty',\n", " 'products', '.', '\"', 'this', 'kind', 'of', 'commodity', 'chemical', 'business', 'will', 'never',\n", " 'be', 'a', 'glamorous', ',', 'high', '-', 'margin', 'business', ',\"', 'cain', 'said', ',',\n", " 'adding', 'that', 'demand', 'is', 'expected', 'to', 'grow', 'by', 'about', 'three', 'pct',\n", " 'annually', '.', 'garo', 'armen', ',', 'an', 'analyst', 'with', 'dean', 'witter', 'reynolds', ',',\n", " 'said', 'chemical', 'makers', 'have', 'also', 'benefitted', 'by', 'increasing', 'demand', 'for',\n", " 'plastics', 'as', 'prices', 'become', 'more', 'competitive', 'with', 'aluminum', ',', 'wood',\n", " 'and', 'steel', 'products', '.', 'armen', 'estimated', 'the', 'upturn', 'in', 'the', 'chemical',\n", " 'business', 'could', 'last', 'as', 'long', 'as', 'four', 'or', 'five', 'years', ',', 'provided',\n", " 'the', 'u', '.', 's', '.', 'economy', 'continues', 'its', 'modest', 'rate', 'of', 'growth', '.',\n", " ''],\n", " ['', 'turkey', 'calls', 'for', 'dialogue', 'to', 'solve', 'dispute', 'turkey', 'said',\n", " 'today', 'its', 'disputes', 'with', 'greece', ',', 'including', 'rights', 'on', 'the',\n", " 'continental', 'shelf', 'in', 'the', 'aegean', 'sea', ',', 'should', 'be', 'solved', 'through',\n", " 'negotiations', '.', 'a', 'foreign', 'ministry', 'statement', 'said', 'the', 'latest', 'crisis',\n", " 
'between', 'the', 'two', 'nato', 'members', 'stemmed', 'from', 'the', 'continental', 'shelf',\n", " 'dispute', 'and', 'an', 'agreement', 'on', 'this', 'issue', 'would', 'effect', 'the', 'security',\n", " ',', 'economy', 'and', 'other', 'rights', 'of', 'both', 'countries', '.', '\"', 'as', 'the',\n", " 'issue', 'is', 'basicly', 'political', ',', 'a', 'solution', 'can', 'only', 'be', 'found', 'by',\n", " 'bilateral', 'negotiations', ',\"', 'the', 'statement', 'said', '.', 'greece', 'has', 'repeatedly',\n", " 'said', 'the', 'issue', 'was', 'legal', 'and', 'could', 'be', 'solved', 'at', 'the',\n", " 'international', 'court', 'of', 'justice', '.', 'the', 'two', 'countries', 'approached', 'armed',\n", " 'confrontation', 'last', 'month', 'after', 'greece', 'announced', 'it', 'planned', 'oil',\n", " 'exploration', 'work', 'in', 'the', 'aegean', 'and', 'turkey', 'said', 'it', 'would', 'also',\n", " 'search', 'for', 'oil', '.', 'a', 'face', '-', 'off', 'was', 'averted', 'when', 'turkey',\n", " 'confined', 'its', 'research', 'to', 'territorrial', 'waters', '.', '\"', 'the', 'latest',\n", " 'crises', 'created', 'an', 'historic', 'opportunity', 'to', 'solve', 'the', 'disputes', 'between',\n", " 'the', 'two', 'countries', ',\"', 'the', 'foreign', 'ministry', 'statement', 'said', '.', 'turkey',\n", " \"'\", 's', 'ambassador', 'in', 'athens', ',', 'nazmi', 'akiman', ',', 'was', 'due', 'to', 'meet',\n", " 'prime', 'minister', 'andreas', 'papandreou', 'today', 'for', 'the', 'greek', 'reply', 'to', 'a',\n", " 'message', 'sent', 'last', 'week', 'by', 'turkish', 'prime', 'minister', 'turgut', 'ozal', '.',\n", " 'the', 'contents', 'of', 'the', 'message', 'were', 'not', 'disclosed', '.', '']]\n" ] } ], "source": [ "reuters_corpus = read_corpus()\n", "pprint.pprint(reuters_corpus[:3], compact=True, width=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 1.1: Implement `distinct_words` [code] (2 points)\n", "\n", "Write a method to work out the distinct words 
(word types) that occur in the corpus. You can do this with `for` loops, but it's more efficient to do it with Python list comprehensions. In particular, [this](https://coderwall.com/p/rcmaea/flatten-a-list-of-lists-in-one-line-in-python) may be useful to flatten a list of lists. If you're not familiar with Python list comprehensions in general, here's [more information](https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html).\n", "\n", "You may find it useful to use [Python sets](https://www.w3schools.com/python/python_sets.asp) to remove duplicate words." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def distinct_words(corpus):\n", "    \"\"\" Determine a list of distinct words for the corpus.\n", "        Params:\n", "            corpus (list of list of strings): corpus of documents\n", "        Return:\n", "            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)\n", "            num_corpus_words (integer): number of distinct words across the corpus\n", "    \"\"\"\n", "    corpus_words = []\n", "    num_corpus_words = -1\n", "    \n", "    # ------------------\n", "    # Write your implementation here.\n", "    flattened_list = [word for article in corpus for word in article]\n", "    unique_words_set = set(flattened_list)  # keep unique words only\n", "    corpus_words = sorted(unique_words_set)  # sorted() accepts any iterable and returns a sorted list\n", "    \n", "    num_corpus_words = len(corpus_words)\n", "    # ------------------\n", "    \n", "    return corpus_words, num_corpus_words" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------------------------------------------------------------------------\n", "Passed All Tests!\n", "--------------------------------------------------------------------------------\n" ] } ], 
"source": [ "# ---------------------\n", "# Run this sanity check\n", "# Note that this is not an exhaustive check for correctness.\n", "# ---------------------\n", "\n", "# Define a dummy/\"toy\" corpus\n", "test_corpus = [\"START All that glitters isn't gold END\".split(\" \"), \"START All's well that ends well END\".split(\" \")]\n", "test_corpus_words, num_corpus_words = distinct_words(test_corpus)\n", "\n", "# Correct answers\n", "ans_test_corpus_words = sorted(list(set([\"START\", \"All\", \"ends\", \"that\", \"gold\", \"All's\", \"glitters\", \"isn't\", \"well\", \"END\"])))\n", "ans_num_corpus_words = len(ans_test_corpus_words)\n", "\n", "# Test correct number of words\n", "assert(num_corpus_words == ans_num_corpus_words), \"Incorrect number of distinct words. Correct: {}. Yours: {}\".format(ans_num_corpus_words, num_corpus_words)\n", "\n", "# Test correct words\n", "assert (test_corpus_words == ans_test_corpus_words), \"Incorrect corpus_words.\\nCorrect: {}\\nYours: {}\".format(str(ans_test_corpus_words), str(test_corpus_words))\n", "\n", "# Print Success\n", "print (\"-\" * 80)\n", "print(\"Passed All Tests!\")\n", "print (\"-\" * 80)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 1.2: Implement `compute_co_occurrence_matrix` [code]\n", "Write a method that constructs a co-occurrence matrix for a certain window-size n (with a default of 4), considering words n before and n after the word in the center of the window. Here, we start to use `numpy (np)` to represent vectors, matrices, and tensors. If you're not familiar with NumPy, there's a NumPy tutorial in the second half of this cs231n [Python NumPy tutorial](http://cs231n.github.io/python-numpy-tutorial/)." 
] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "def compute_co_occurrence_matrix(corpus, window_size=4):\n", "    \"\"\" Compute co-occurrence matrix for the given corpus and window_size (default of 4).\n", "\n", "        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller\n", "              number of co-occurring words.\n", "\n", "              For example, if we take the document \"START All that glitters is not gold END\" with window size of 4,\n", "              \"All\" will co-occur with \"START\", \"that\", \"glitters\", \"is\", and \"not\".\n", "\n", "        Params:\n", "            corpus (list of list of strings): corpus of documents\n", "            window_size (int): size of context window\n", "        Return:\n", "            M (numpy matrix of shape (number of corpus words, number of corpus words)): \n", "                Co-occurrence matrix of word counts. \n", "                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.\n", "            word2Ind (dict): dictionary that maps word to index (i.e. 
row/column number) for matrix M.\n", "    \"\"\"\n", "    words, num_words = distinct_words(corpus)\n", "    M = None\n", "    word2Ind = {}\n", "    \n", "    # ------------------\n", "    # Write your implementation here.\n", "    for i in range(num_words):\n", "        word2Ind[words[i]] = i\n", "\n", "    M = np.zeros((num_words, num_words))\n", "    \n", "    for line in corpus:\n", "        for i in range(len(line)):\n", "            target = line[i]\n", "            target_index = word2Ind[target]\n", "            \n", "            # Only look left: each pair inside the window is visited once,\n", "            # and both symmetric cells are updated together.\n", "            left = max(i - window_size, 0)\n", "\n", "            for j in range(left, i):\n", "                window_index = word2Ind[line[j]]\n", "                M[target_index, window_index] += 1\n", "                M[window_index, target_index] += 1\n", "    # ------------------\n", "\n", "    return M, word2Ind" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Size of matrix M = (8185, 8185)\n" ] }, { "data": { "text/plain": [ "array([[78., 0., 1., ..., 0., 0., 0.],\n", " [ 0., 0., 0., ..., 0., 0., 0.],\n", " [ 1., 0., 0., ..., 0., 0., 0.],\n", " ...,\n", " [ 0., 0., 0., ..., 0., 0., 0.],\n", " [ 0., 0., 0., ..., 0., 0., 0.],\n", " [ 0., 0., 0., ..., 0., 0., 0.]])" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "M, word2Ind = compute_co_occurrence_matrix(reuters_corpus)\n", "print(\"Size of matrix M = \", M.shape)\n", "M" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------------------------------------------------------------------------\n", "Passed All Tests!\n", "--------------------------------------------------------------------------------\n" ] } ], "source": [ "# ---------------------\n", "# Run this sanity check\n", "# Note that this is not an exhaustive check for correctness.\n", "# ---------------------\n", "\n", "# Define toy corpus and get student's co-occurrence matrix\n", "test_corpus = 
[\"START All that glitters isn't gold END\".split(\" \"), \"START All's well that ends well END\".split(\" \")]\n", "M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)\n", "\n", "# Correct M and word2Ind\n", "M_test_ans = np.array( \n", "    [[0., 0., 0., 1., 0., 0., 0., 0., 1., 0.,],\n", "     [0., 0., 0., 1., 0., 0., 0., 0., 0., 1.,],\n", "     [0., 0., 0., 0., 0., 0., 1., 0., 0., 1.,],\n", "     [1., 1., 0., 0., 0., 0., 0., 0., 0., 0.,],\n", "     [0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,],\n", "     [0., 0., 0., 0., 0., 0., 0., 1., 1., 0.,],\n", "     [0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,],\n", "     [0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,],\n", "     [1., 0., 0., 0., 1., 1., 0., 0., 0., 1.,],\n", "     [0., 1., 1., 0., 1., 0., 0., 0., 1., 0.,]]\n", ")\n", "word2Ind_ans = {'All': 0, \"All's\": 1, 'END': 2, 'START': 3, 'ends': 4, 'glitters': 5, 'gold': 6, \"isn't\": 7, 'that': 8, 'well': 9}\n", "\n", "# Test correct word2Ind\n", "assert (word2Ind_ans == word2Ind_test), \"Your word2Ind is incorrect:\\nCorrect: {}\\nYours: {}\".format(word2Ind_ans, word2Ind_test)\n", "\n", "# Test correct M shape\n", "assert (M_test.shape == M_test_ans.shape), \"M matrix has incorrect shape.\\nCorrect: {}\\nYours: {}\".format(M_test_ans.shape, M_test.shape)\n", "\n", "# Test correct M values\n", "for w1 in word2Ind_ans.keys():\n", "    idx1 = word2Ind_ans[w1]\n", "    for w2 in word2Ind_ans.keys():\n", "        idx2 = word2Ind_ans[w2]\n", "        student = M_test[idx1, idx2]\n", "        correct = M_test_ans[idx1, idx2]\n", "        if student != correct:\n", "            print(\"Correct M:\")\n", "            print(M_test_ans)\n", "            print(\"Your M: \")\n", "            print(M_test)\n", "            raise AssertionError(\"Incorrect count at index ({}, {})=({}, {}) in matrix M. 
Yours has {} but should have {}.\".format(idx1, idx2, w1, w2, student, correct))\n", "\n", "# Print Success\n", "print (\"-\" * 80)\n", "print(\"Passed All Tests!\")\n", "print (\"-\" * 80)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 1.3: Implement `reduce_to_k_dim` [code] (1 point)\n", "\n", "Construct a method that performs dimensionality reduction on the matrix to produce k-dimensional embeddings. Use SVD to take the top k components and produce a new matrix of k-dimensional embeddings. \n", "\n", "**Note:** All of numpy, scipy, and scikit-learn (`sklearn`) provide *some* implementation of SVD, but only scipy and sklearn provide an implementation of Truncated SVD, and only sklearn provides an efficient randomized algorithm for calculating large-scale Truncated SVD. So please use [sklearn.decomposition.TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html)." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def reduce_to_k_dim(M, k=2):\n", "    \"\"\" Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)\n", "        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:\n", "            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html\n", "    \n", "        Params:\n", "            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurrence matrix of word counts\n", "            k (int): embedding size of each word after dimension reduction\n", "        Return:\n", "            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.\n", "                    In terms of the SVD from math class, this actually returns U * S\n", "    \"\"\" \n", "    n_iters = 10 # Use this parameter in your call to `TruncatedSVD`\n", "    M_reduced = None\n", "    print(\"Running Truncated SVD over %i words...\" % (M.shape[0]))\n", "    \n", "    # 
------------------\n", "    # Write your implementation here.\n", "    svd = TruncatedSVD(n_components = k, n_iter = n_iters, random_state = 123, tol = 0.0)\n", "    M_reduced = svd.fit_transform(M)\n", "    print(M_reduced.shape)\n", "    # ------------------\n", "\n", "    print(\"Done.\")\n", "    return M_reduced" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running Truncated SVD over 8185 words...\n", "(8185, 2)\n", "Done.\n" ] }, { "data": { "text/plain": [ "array([[ 7.32630060e+02, -1.16894192e+02],\n", " [ 1.26000427e+00, -1.61923588e-01],\n", " [ 2.80304332e-01, 6.47334603e-02],\n", " ...,\n", " [ 1.04145879e+00, -3.06320300e-01],\n", " [ 6.19972477e-01, -1.25537234e-01],\n", " [ 2.42230659e+00, 2.28089719e-01]])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reduce_to_k_dim(M)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running Truncated SVD over 10 words...\n", "(10, 2)\n", "Done.\n", "--------------------------------------------------------------------------------\n", "Passed All Tests!\n", "--------------------------------------------------------------------------------\n" ] } ], "source": [ "# ---------------------\n", "# Run this sanity check\n", "# Note that this is not an exhaustive check for correctness \n", "# In fact we only check that your M_reduced has the right dimensions.\n", "# ---------------------\n", "\n", "# Define toy corpus and run student code\n", "test_corpus = [\"START All that glitters isn't gold END\".split(\" \"), \"START All's well that ends well END\".split(\" \")]\n", "M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)\n", "M_test_reduced = reduce_to_k_dim(M_test, k=2)\n", "\n", "# Test proper dimensions\n", "assert (M_test_reduced.shape[0] == 10), \"M_reduced has {} rows; should have 
{}\".format(M_test_reduced.shape[0], 10)\n", "assert (M_test_reduced.shape[1] == 2), \"M_reduced has {} columns; should have {}\".format(M_test_reduced.shape[1], 2)\n", "\n", "# Print Success\n", "print (\"-\" * 80)\n", "print(\"Passed All Tests!\")\n", "print (\"-\" * 80)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 1.4: Implement `plot_embeddings` [code] (1 point)\n", "\n", "Here you will write a function to plot a set of 2D vectors in 2D space. For graphs, we will use Matplotlib (`plt`).\n", "\n", "For this example, you may find it useful to adapt [this code](https://www.pythonmembers.club/2018/05/08/matplotlib-scatter-plot-annotate-set-text-at-label-each-point/). In the future, a good way to make a plot is to look at [the Matplotlib gallery](https://matplotlib.org/gallery/index.html), find a plot that looks somewhat like what you want, and adapt the code they give." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "def plot_embeddings(M_reduced, word2Ind, words):\n", " \"\"\" Plot in a scatterplot the embeddings of the words specified in the list \"words\".\n", " NOTE: do not plot all the words listed in M_reduced / word2Ind.\n", " Include a label next to each point.\n", " \n", " Params:\n", " M_reduced (numpy matrix of shape (number of unique words in the corpus, k)): matrix of k-dimensional word embeddings\n", " word2Ind (dict): dictionary that maps word to indices for matrix M\n", " words (list of strings): words whose embeddings we want to visualize\n", " \"\"\"\n", "\n", " # ------------------\n", " # Write your implementation here.\n", " words_index = [word2Ind[word] for word in words]\n", " print(words_index)\n", " x_coords = [M_reduced[word_index][0] for word_index in words_index]\n", " y_coords = [M_reduced[word_index][1] for word_index in words_index]\n", " \n", " for i, word in enumerate(words):\n", " x = x_coords[i]\n", " y = y_coords[i]\n", " plt.scatter(x, y, marker = 'x', 
color = 'red')\n", " plt.text(x + 0.0003, y + 0.0003, word, fontsize = 9)\n", " plt.show()\n", "\n", " # ------------------" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------------------------------------------------------------------------\n", "Outputted Plot:\n", "[0, 1, 2, 3, 4]\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYEAAAD8CAYAAACRkhiPAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAFyJJREFUeJzt3X+QXWWd5/H3l2TA6NZOIMkoCpmgZF3DuGXoXqRWC3oIKpCqwAAuSVVGkFCUsyIxMNYEslYChbVoFZ1qRmcZxnXpZKtEBxdohrEoExKtXQNLd5WKoAkJgTIbkKDTrlQwsdPf/eOexJtOd/om9/aP8LxfVbf6nnOe85zvfe7N87n3nNvpyEwkSWU6aaILkCRNHENAkgpmCEhSwQwBSSqYISBJBTMEJKlghoAkFcwQkKSCGQKSVLCpE13ASGbOnJlz5syZ6DIk6YTS19f3embOarT9pA2BOXPm0NvbO9FlSNIJJSJePpb2ng6SpIIZApJUMENAkgpmCEhSwQwBSSpYESHQ39/PunXrjmmfl156iZ6eniPWX3vttVx88cWtKk2SgNbMUw8//DDAORHxu0b7MARGMFwIPPvss/T397eyNEkCWjNPXXDBBQDPA7sa7eOtHQLVn87s7Oykr6+Pjo4Ouru7WbhwIRdddBELFy5kz5497N27l0svvZQLL7yQjo4Otm3bRmdnJ48//jgdHR309fUBcOedd3L77bdP5COS9FbTwnlqxowZAMf2N4Mzs+kb8A3gNeCnI2wP4F5gO/AT4NzR+mxra8umrF6duXx55uBg7ty5MxcsWJC5fHlec845uWXLlszMfOSRR/LWW2/Nvr6+XLJkyaFdDxw4kJs2bcply5YdWrdp06a8/fbb/9CXJDWrxfNUZibQC2zPBufvVv3G8APAV4GRPstcCsytbh8G/mv1c2xkQn8/dHXVlpcvh23bYONGnj3tNFauXAnAwMAAZ599NvPnz6etrY2lS5cyY8YM7rjjjiO6vPvuu3nwwQc9HSSpNcZgnjoeLQmBzPxBRMw5SpPLgXVVSj0VEdMj4vTMfKUVxz9CBKxdW7vf1cXJXV0MACxfzjm7d3Pbbbcxf/58APbv38++ffu45ZZbiAjuuusu1q9fT1tbGwMDAwD89re/5dVXX2Xx4sW8+eabPPfcc3zpS19i1apVY1K+pAK0eJ467jIyj+300Ygd1ULgnzLzz4bZ9k/A3Zn5v6rljcDfZGbvkHY3AjcCzJ49u+3ll4/pv8A4UiacdBKDwELg7VdeyaJFi/jOd77DG2+8AcD111/PvHnzuPnmm5k6dSqDg4N0d3czc+ZMLrvsMt75zneyevVqPvjBDwK1CzE33HADGzZsaK42SYKWzlP9/f1ccMEFv6X2Bv+HwN9l5v8c5fjNXxOogmQOI18TeBz4aN3yRqDtaP01fU1gcLB2rq02xLVbde5NkiaFMZingN48hrl7vL4dtAs4s275DGD3mB0tE1asqJ1rW74cBgdrP7u6autb9OlHko7bJJmnxuu/ku4BboqIB6ldEP5NjtX1AKida5s+vTaga9
cefu5t+vTasiRNpEkyT7XkmkBEfBPoAGYCvwRWA38EkJn3RURQ+/bQJcBe4NM55HrAUO3t7dn03xPIPHwghy5L0kRr8TwVEX2Z2d5o+1Z9O2jJKNsT+GwrjnVMhg6kASBpspngeeqt/RvDkqSjMgQkqWCGgCQVzBCQpIIZApJUMENAkgpmCEhSwQwBSSqYISBJBTMEJKlghoAkFcwQkKSCGQKSVDBDQJIKZghIUsEMAUkqmCEgSQUzBCSpYIaAJBXMEJCkghkCklQwQ0CSCmYISFLBDAFJKpghIEkFMwQkqWCGgCQVzBCQpIIZApJUMENAkgpmCEhSwQwBSSqYISBJBTMEJKlghoAkFcwQkKSCGQKSVDBDQJIKZghIUsFaEgIRcUlEbI2I7RGxcpjt10XEnoj4UXW7oRXHlSQ1Z2qzHUTEFOBrwMeAXcAzEdGTmc8PafqtzLyp2eNJklqnFZ8EzgO2Z+aLmbkfeBC4vAX9SpLGWCtC4D3AL+qWd1XrhroqIn4SEQ9FxJnDdRQRN0ZEb0T07tmzpwWlSZKOphUhEMOsyyHLjwFzMvPfARuA7uE6ysz7M7M9M9tnzZrVgtIkSUfTihDYBdS/sz8D2F3fIDN/lZn7qsV/ANpacFxJUpNaEQLPAHMj4qyIOBlYDPTUN4iI0+sWFwE/a8FxJUlNavrbQZk5EBE3AU8AU4BvZOZzEXEn0JuZPcDNEbEIGAB+DVzX7HElSc2LzKGn7yeH9vb27O3tnegyJOmEEhF9mdneaHt/Y1iSCmYISFLBDAFJKpghIEkFMwQkqWCGgCQVzBCQpIIZApJUMENAkgpmCEhSwQwBSSqYISBJBTMEJKlghoAkFcwQkKSCGQKSVDBDQJIKZghIUsEMAUkqmCEgSQUzBCSpYIaAJBXMEJCkghkCklQwQ0CSCmYISFLBDAFJKpghIEkFMwQkqWCGgCQVzBCQpIIZApJUMENAkgpmCEhSwQwBSSqYISAdh/7+ftatW3dM+7z00kv09PQcWl6zZg0f+MAH6OjooKOjgwMHDrS6TGlUhoB0HFoRAgCrVq1i8+bNbN68mSlTprSyRKkhhoB0HDo7O+nr66Ojo4Pu7m4WLlzIRRddxMKFC9mzZw979+7l0ksv5cILL6Sjo4Nt27bR2dnJ448/TkdHB319fQB85Stf4aMf/Sj33nvvBD8iFSszm74BlwBbge3AymG2nwJ8q9r+NDBntD7b2tpSmnQGBzMzc+fOnblgwYLMwcG85pprcsuWLZmZ+cgjj+Stt96afX19uWTJkkO7HThwIDdt2pTLli07tO7111/PwcHB3Lt3by5YsCC///3vj+9j0VsS0JvHMH9PbTZEImIK8DXgY8Au4JmI6MnM5+uaLQP+JTPPjojFwJeBa5o9tjSu1qyB/n5Yu/YP61as4Nknn2Tlq68CMDAwwNlnn838+fNpa2tj6dKlzJgxgzvuuOOI7mbMmAHAtGnTuPLKK+nr6+OCCy4Yj0ciHdJ0CADnAdsz80WAiHgQuByoD4HLgTXV/YeAr0ZEVKklTX6ZtQDo6gLg5C98gYGf/xw2buScuXO5rbOT+eeeC8D+/fvZt28ft9xyCxHBXXfdxfr162lra2NgYOBQl/39/UyfPp3MZPPmzVx33XUT8chUuFaEwHuAX9Qt7wI+PFKbzByIiN8AM4DXW3B8aexF/OETQFcX7+rqYhpw1fvex6Lbb2f1mjW88cYbAFx//fXMmzePm2++malTpzI4OEh3dzczZ85kx44dXH311axevZp77rmHrVu3kpl0dHRw2WWXTdzjU7Gi2TfjEfFJ4BOZeUO1/JfAeZn5ubo2z1VtdlXLO6o2vxrS143AjQCzZ89ue/nll5uqTWq5TDip7vsUg4O1gJAmiYjoy8z2Rtu34ttBu4Az65bPAHaP1CYipgJ/DPx6aEeZeX9mtmdm+6xZs1pQmtRCmbBixeHrVqyorZdOUK0IgWeAuRFxVkScDCwGeoa06Q
Gure5fDTzp9QCdUA4GQFcXLF9e+wSwfHlt2SDQCazpawLVOf6bgCeAKcA3MvO5iLiT2leVeoD/BqyPiO3UPgEsbva40riKgOnTaxP/2rWHXyOYPt1TQjphNX1NYKy0t7dnb2/vRJchHS7z8Al/6LI0wSbimoBUjqETvgGgE5whIEkFMwQkqWCGgCQVzBCQpIIZApJUMENAkgpmCEhSwQwBSSqYISBJBTMEJKlghoAkFcwQkKSCGQKSVDBDQJIKZghIUsEMAUkqmCEgSQUzBCSpYIaAJBXMEJCkghkCklQwQ0CSCmYISFLBDAFJKpghIEkFMwQkqWCGgCQVzBCQpIIZApJUMENAkgpmCEhSwQwBSSqYISBJBTMEJKlghoAkFcwQkKSCGQKSVDBDQJIK1lQIRMRpEfG9iHih+nnqCO0ORMSPqltPM8eUJLVOs58EVgIbM3MusLFaHs6bmfmh6raoyWNKklqk2RC4HOiu7ncDVzTZnyRpHDUbAu/MzFcAqp9/MkK7t0VEb0Q8FREGhSRNElNHaxARG4B3DbNp1TEcZ3Zm7o6I9wJPRsSzmbljmGPdCNwIMHv27GPoXpJ0PEYNgcy8eKRtEfHLiDg9M1+JiNOB10boY3f188WI2AzMB44Igcy8H7gfoL29PRt6BJKk49bs6aAe4Nrq/rXAo0MbRMSpEXFKdX8m8BHg+SaPK0lqgWZD4G7gYxHxAvCxapmIaI+Ir1dtPgD0RsSPgU3A3ZlpCEjSJDDq6aCjycxfAQuGWd8L3FDd/yHwwWaOI0kaG/7GsCQVzBCQpIIZApJUMENAkgpmCEhSwQwBSSqYISBJBTMEJKlghoAkFcwQkKSCGQKSVDBDQJIKZghIUsEMAUkqmCEgSQUzBCSpYIaAJBXMEJCkghkCklQwQ0CSCmYISFLBDAFJKpghIEkFMwQkqWCGgCQVzBCQpIIZApJUMENAkgpmCEhSwQwBSSqYISBJBTMEJKlghoAkFcwQkKSCGQKSVDBDQJIKVkQI9Pf3s27dumPa56WXXqKnp+fQ8uc//3nOP/98zj//fO6+++5WlyipcK2Ypzo7OwHeHxH/OyLWRcQfjdaHITCCoYP72c9+lqeeeoof/vCHPProo+zYsaPVZUoqWCvmqZtuuglga2Z+pFr18dH6mHpMRzxBdXZ20tfXR0dHB5/+9Kf59re/zZtvvsm0adN44IEHeMc73sFVV13F3r17iQjuv/9+Ojs7eeaZZ+jo6OCee+6hra0NgJNOOokpU6YwZcqUCX5Ukt5KWjlPRURQe5O/fdQDZ+Zx34BPAs8Bg0D7UdpdAmytClrZSN9tbW3ZtMHBzMzcuXNnLliwIHNwMK+55prcsmVLZmY+8sgjeeutt2ZfX18uWbLk0G4HDhzITZs25bJly47oct26dfmpT32q+dokKbPl8xSwC3gB+Gfg7TnKXNvsJ4GfAlcCfz9Sg4iYAnwN+FhV3DMR0ZOZzzd57KNbswb6+2Ht2j+sW7GCZ598kpWvvgrAwMAAZ599NvPnz6etrY2lS5cyY8YM7rjjjmG73LBhA93d3Tz22GNjWrqkQozBPAW8Cvx74KvAdcDfHa2EpkIgM38GUPvkMaLzgO2Z+WLV9kHgcmDsQiCzNrBdXQCc/IUvMPDzn8PGjZwzdy63dXYy/9xzAdi/fz/79u3jlltuISK46667WL9+PW1tbQwMDBzq8umnn+aLX/wi3/3ud5k2bdqYlS6pEGMwT/3ud7+rus6MiN8Ae0crYzyuCbwH+EXd8i7gw2N6xIg/JGtXF+/q6mIacNX73sei229n9Zo1vPHGGwBcf/31zJs3j5tvvpmpU6cyODhId3c3M2fOZMeOHVx99dWsXr2aZcuWAXDFFVcAHHb+TZKO2RjMU/fddx/Uvh30A2qn31ePWkbtFNLR6owNwLuG2bQqMx+t2mwG/joze4fZ/5PAJzLzhmr5L4HzMvNzw7S9EbgRYPbs2W0vv/zyaPUfXSacVP
cFqMHB2sBL0mTR4nkqIvoys73R9qN+RTQzL87MPxvm9miDx9gFnFm3fAawe4Rj3Z+Z7ZnZPmvWrAa7H7FwWLHi8HUrVtTWS9JkMAnmqfH4PYFngLkRcVZEnAwsBnpG2ac5Bwe2qwuWL68l6/LltWWDQNJkMEnmqaauCUTEXwB/C8wCHo+IH2XmJyLi3cDXM/OyzByIiJuAJ4ApwDcy87mmKz96YTB9em1A1649/Nzb9OmeEpI08SbJPDXqNYGJ0t7enr29R1xiODaZhw/k0GVJmmgtnqdafk3ghDZ0IA0ASZPNBM9Tb+0QkCQdlSEgSQUzBCSpYIaAJBXMEJCkghkCklQwQ0CSCjZpf1ksIvYATf4PcofMBF5vUV+tYk2Nm4x1WVNjJmNNMDnralVNf5qZDf/na5M2BFopInqP5TfoxoM1NW4y1mVNjZmMNcHkrGuiavJ0kCQVzBCQpIKVEgL3T3QBw7Cmxk3GuqypMZOxJpicdU1ITUVcE5AkDa+UTwKSpGG8JUIgIj4ZEc9FxGBEjHh1PSIuiYitEbE9IlbWrT8rIp6OiBci4lvVX0BrRV2nRcT3qn6/FxGnDtPmzyPiR3W330XEFdW2ByJiZ922D41HTVW7A3XH7alb3/KxanCcPhQRW6rn+ScRcU3dtpaN00ivkbrtp1SPe3s1DnPqtt1Wrd8aEZ843hqOs65bIuL5amw2RsSf1m0b9rkch5qui4g9dce+oW7btdXz/UJEXDuONa2tq2dbRPTXbRurcfpGRLwWET8dYXtExL1VzT+JiHPrto3JOB0mM0/4G/AB4P3AZqB9hDZTgB3Ae4GTgR8D86pt3wYWV/fvA/6qRXV9BVhZ3V8JfHmU9qcBvwbeXi0/AFzd4rFqqCbgjRHWt3ysGqkJ+DfA3Or+u4FXgOmtHKejvUbq2vwn4L7q/mLgW9X9eVX7U4Czqn6mtOg5a6SuP6973fzVwbqO9lyOQ03XAV8d4XX+YvXz1Or+qeNR05D2n6P2lw7HbJyqfi8AzgV+OsL2y4DvAgGcDzw9luM09PaW+CSQmT/LzK2jNDsP2J6ZL2bmfuBB4PKICOAi4KGqXTdwRYtKu7zqr9F+rwa+m5l7W3T8VtR0yBiO1ag1Zea2zHyhur8beI3anzVtpWFfI0ep9SFgQTUulwMPZua+zNwJbK/6G5e6MnNT3evmKeCMFh37uGs6ik8A38vMX2fmvwDfAy6ZgJqWAN9swXGPKjN/QO3N3UguB9ZlzVPA9Ig4nbEbp8O8JUKgQe8BflG3vKtaNwPoz8yBIetb4Z2Z+QpA9fNPRmm/mCNflF+qPiKujYhTxrGmt0VEb0Q8dfD0FGM3Vsc0ThFxHrV3ejvqVrdinEZ6jQzbphqH31Abl0b2PV7H2vcyau8sDxruuRyvmq6qnpeHIuLMY9x3rGqiOl12FvBk3eqxGKdGjFT3WL6mDmnqD82Pp4jYALxrmE2rMvPRRroYZl0eZX3TdTXaR9XP6cAHgSfqVt8GvEptwrsf+BvgznGqaXZm7o6I9wJPRsSzwP8bpl1DY9XicVoPXJuZg9Xq4xqn4bofZt3Qxzcmr6NRNNx3RCwF2oEL61Yf8Vxm5o7h9m9xTY8B38zMfRHxGWqfoC5qcN+xqumgxcBDmXmgbt1YjFMjJuI1dcgJEwKZeXGTXewCzqxbPgPYTe3/6pgeEVOrd3YH1zddV0T8MiJOz8xXqsnrtaN09R+BhzPz93V9v1Ld3RcR/x346/GqqTrlQma+GBGbgfnAdzjOsWpFTRHxr4HHgf9cfWw+2PdxjdMwRnqNDNdmV0RMBf6Y2kf9RvY9Xg31HREXUwvVCzNz38H1IzyXzU5uo9aUmb+qW/wH4Mt1+3YM2Xdzk/U0VFOdxcBn61eM0Tg1YqS6x2qcDlPS6aBngLlR+3bLydReBD1ZuwKzidr5eIBrgUY+WTSip+qvkX6POD9ZTYgHz8VfAQz77YJW1xQRpx48pRIRM4GPAM+P4V
g1UtPJwMPUzp3+45BtrRqnYV8jR6n1auDJalx6gMVR+/bQWcBc4P8cZx3HXFdEzAf+HliUma/VrR/2uRynmk6vW1wE/Ky6/wTw8aq2U4GPc/gn4DGrqarr/dQutG6pWzdW49SIHuBT1beEzgd+U72xGatxOtxYXA0f7xvwF9RScx/wS+CJav27gX+ua3cZsI1auq+qW/9eav9gtwP/CJzSorpmABuBF6qfp1Xr24Gv17WbA/xf4KQh+z8JPEttUvsfwL8aj5qA/1Ad98fVz2VjOVYN1rQU+D3wo7rbh1o9TsO9RqidWlpU3X9b9bi3V+Pw3rp9V1X7bQUubfFrfLS6NlSv/YNj0zPaczkONf0X4Lnq2JuAf1u37/XVGG4HPj1eNVXLa4C7h+w3luP0TWrfZvs9tXlqGfAZ4DPV9gC+VtX8LHXfcByrcaq/+RvDklSwkk4HSZKGMAQkqWCGgCQVzBCQpIIZApJUMENAkgpmCEhSwQwBSSrY/wfkpK37CAsQ5gAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "--------------------------------------------------------------------------------\n" ] } ], "source": [ "# ---------------------\n", "# Run this sanity check\n", "# Note that this is not an exhaustive check for correctness.\n", "# The plot produced should look like the \"test solution plot\" depicted below. \n", "# ---------------------\n", "\n", "print (\"-\" * 80)\n", "print (\"Outputted Plot:\")\n", "\n", "M_reduced_plot_test = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0, 0]])\n", "word2Ind_plot_test = {'test1': 0, 'test2': 1, 'test3': 2, 'test4': 3, 'test5': 4}\n", "words = ['test1', 'test2', 'test3', 'test4', 'test5']\n", "plot_embeddings(M_reduced_plot_test, word2Ind_plot_test, words)\n", "\n", "print (\"-\" * 80)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 1.5: Co-Occurrence Plot Analysis [written] (3 points)\n", "\n", "Now we will put together all the parts you have written! We will compute the co-occurrence matrix with a fixed window of 4 over the Reuters \"crude\" corpus. Then we will use TruncatedSVD to compute 2-dimensional embeddings of each word. TruncatedSVD returns U\\*S, so we normalize the returned vectors so that they all lie on the unit circle (closeness then means directional closeness). **Note**: The line of code below that does the normalizing uses the NumPy concept of *broadcasting*. If you don't know about broadcasting, check out\n", "[Computation on Arrays: Broadcasting by Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html).\n", "\n", "Run the cell below to produce the plot. It'll probably take a few seconds to run. What clusters together in 2-dimensional embedding space? What doesn't cluster together that you might think should have? 
**Note:** \"bpd\" stands for \"barrels per day\" and is a commonly used abbreviation in crude oil topic articles." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running Truncated SVD over 8185 words...\n", "(8185, 2)\n", "Done.\n", "[1252, 1454, 2729, 2840, 3961, 4285, 5165, 5298, 5517, 7862]\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZIAAAD9CAYAAACWV/HBAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAIABJREFUeJzt3Xt0FdXd//H31wBSFAwQxQsiKAoC2igRJY0SEBQvBRVRK8sGGkD7VEnRVvChpSnFan9eaLxURbTQ8nitLWZhxXILStFqqFQQQRARAoioXKS13PL9/TFz4kk4gYQ5uQCf11pnnbnsmdkzi5UPs/eZ2ebuiIiIHKgj6roCIiJycFOQiIhIJAoSERGJREEiIiKRKEhERCQSBYmIiESSlCAxs75mttzMVprZ6ATrLzKzf5rZbjO7tsK6HDNbEX5yklEfERGpPRb1ORIzSwE+BPoAJcA7wPfcfWlcmbZAM+AnQKG7/ylc3gIoBjIABxYCXd19c6RKiYhIrUnGHUk3YKW7r3L3ncBzQP/4Au6+2t3fA0orbHspMNPdvwzDYybQNwl1EhGRWpKMIDkJWBs3XxIuq+ltRUSkHmiQhH1YgmVVbS+r8rZmNhwYDnDUUUd17dixYxUPISIiAAsXLvzc3Y9N9n6TESQlwMlx862B9dXYNrvCtkWJCrr7RGAiQEZGhhcXF1e3niIihzUz+6Qm9puMpq13gNPNrJ2ZNQJuAAqruO1rwCVm1tzMmgOXhMtEROQgETlI3H03cCtBAHwAvODu75vZODPrB2Bm55lZCTAQeMLM3g+3/RL4FUEYvQOMC5eJiMhBIvLPf+uCmrZERKrPzBa6e0ay96sn20VEJBIFiYhIPTF58mS2bdtWrW3at29fQ7WpOgWJiEg9UVmQ7Nmzpw5qU3UKEhGRGrR69WrOO+88Bg0aREZGBgUFBWzdupXrrruOiy++mF69erFy5UrmzJnDokWLGDhwILfddlvZdjfddBPDhg1j48aNXHbZZfTo0YPLL7+cTZs2lTvOrl27GDp0KD179iQrK4u3334bgOzsbEpKSmLFTjCzwQDhuxF/ZWZvmtl9ZjbGzF43s5fNLNEzfpVz94Pu07VrVxcRqXdKS/ea//jjjz0tLc23bdvmO3fu9G9/+9s+ZMgQf/bZZ93dfdGiRT5gwAB3d+/Ro4evXbvW3b1su61bt7q7e15enk+ZMsXd3adMmeIjR450d/fTTjvN3d0fe+wxv+eee9zd/dNPP/XMzMy99gmsAwYHk6wGziZ4MPwD4Jpw+TTgHK/G3+RkPJAoIiL5+bBlC0yYAGbgDiNHAtCxY0eaNm0KQJcuXdiwYQMFBQU8/vjjADRokPhPcZcuXWjWrBkAy5cv59ZbbwUgMzOT5557rlzZxYsXs2DBAmbMmAHA1q1bAdjHzcVuD96BiJmtA94Nl5cALapz6goSEZGo3IMQKSgI5idMCEKkoACGDGHZsmVs376dxo0bs2TJEs4991yGDx/O1VdfDcDOnTsBaNSoEbt37y7bbUpKStl0hw4dWLBgAe3bt2fBggV06NChXB
U6d+5M+/btGRmGV2yfLVq0oKSkhNatWwM02ddZxE1Xq2lLQSIiEpVZEB4QhEcsUPLyIC+PtosXM2zYMFasWEFOTg4/+MEPuOWWW3j44Ydxd6688kruuOMOrrnmGnJzc8nMzCQ3N7fcIUaPHk1OTg6TJk2iSZMm/OEPfyi3ftiwYdx222307NkTgIyMDO677z5GjBjB0KFDOeOMM6Dq70Gs3um7HkgUEUkOdzgi7jdMpaWs/uQThg4dyqxZs+quXiE9kCgiUp/F9YmUGTkyWH6IU5CIiEQVC5GCgqA5q7Q0+C4ooG1BAbNmzqzrGtYo9ZGIiERlBqmpQXjEfrUV6zNJTQ3mD2HqIxERSRb38qFRcb6OqY9ERKS+qxga1QiRTz/9lDvuuKNKZYcOHUpRUVE1KgbTpk0DaFStjapIQSIiUg8cf/zxPPDAAzW2/8qCxMxS9i5dPQoSEZF6YPXq1fTu3Zv8/Hxyc3Pp168f6enpLFu2DIAXX3yR9PR0BgwYwNq1a8ttExN7E3BRURHdunWjZ8+eDBkyhKVLl8aeeG9jZi9CMOyumf0OeNnMnjezc8Llp5hZtX4doCAREaltFfumK8w3bdqUwsJC7rzzTiZNmsSePXsYM2YMb7zxBs8991xZkFTmz3/+M+PHj2fu3Lk89dRTdOrUib59+wKscfeBYbETgHvd/UpgIhB7AnII8FR1TkdBIiJSm/Lzyz9fEvvp8G9/W1aka9euALRp04YvvviCzz//nFatWtG0aVMaNmzIueeeC+z9Hq3Yj6d++tOfUlhYyKBBg/j9739fWU3WufuacHoO0M3MmgDfBf5SnVNKSpCYWV8zWx6+lnh0gvVHhrdOK83sH2bWNlze1sy+NrNF4efxZNRHRKRein8nVyxMYs+fxI1DEh8Q7k5aWhobN25k+/bt7N69m0WLFgHQvHlz1q9fj7vz6aefsm7dOgBatmzJI488wtSpU7n33nvZtm0bjRo1gvLv0NoTdwwHXgJ+B7zu7juqc1qRnyMJO2oeBfoQvDXyHTMrdPelccVygc3u3t7MbgB+A1wfrvvI3dOj1kNEpN7bzzu5GDYs4WYpKSmMGzeOrKws2rVrx0knnQRAs2bN6Nu3L927d6dbt260atUKgAcffJC//e1vlJaW0qdPH5o1a8aVV17Jk08+eaKZPeHuNyc4zO8J/oafU+3TivociZl1B/Ld/dJw/i4Ad78nrsxrYZk3zawB8ClwLHAKMN3du1TnmHqOREQOagneyVUbz5vs6zkSM2sFPOvuvaq732Q0bZ0ExPf8lITLEpZx993AVqBluK6dmb1rZvPM7MIk1EdEpP6qh+/kMrM+QCEw/kC2T0aQJIrRileksjIbgDbufg5wO/CMmTVLeBCz4WZWbGbFFYeYFBE5KOzjnVx1GSbuPtPdz3f3OQeyfTLetVUCnBw33xpYX0mZkrBp6xjgy7CDZweAuy80s4+AM4C92q3cfSLBT9TIyMg4+N7rIiJyiL6TKxlB8g5wupm1IxgP+AbgxgplCoEc4E3gWmCOu7uZHUsQKHvM7FTgdGBVEuokIlI/5eeXfwdXLEwO0hCBJASJu+82s1uB14AU4Gl3f9/MxgHF7l5I8HDLH81sJfAlQdgAXASMM7PdBD9Fu8Xdv4xaJxGRei3CO7nqI739V0TkMKG3/4qISL2kIBERkUgUJCIiEomCREREIlGQiIhIJAoSERGJREEiIiKRKEhERCQSBYmIiESiIBERkUgUJCIiEomCREREIlGQiIhIJAoSERGJREEiIiKRKEhERCQSBYmIiESSlCAxs75mttzMVprZ6ATrjzSz58P1/zCztnHr7gqXLzezS5NRHxERqT2Rg8TMUoBHgcuATsD3zKxThWK5wGZ3bw9MAH4TbtuJYPz2zkBf4Hfh/kRE5CCRjDuSbsBKd1/l7juB54D+Fcr0B6aE038CLjYzC5c/5+473P1jYGW4PxERqUHJ/E97gyTs4yRgbdx8CX
B+ZWXcfbeZbQVahsvfqrDtSUmok4jIIemuu+5iwYIF7Ny5kzFjxlBcXMzatWvZtGkTa9as4bnnnqNjx47MmzePsWPHYmZ07NiRxx57DKCRmb0DLAN2mdnPgWeB/wCfAEcCI4FX3f0CADMbC3zs7n+srE7JuCOxBMu8imWqsm2wA7PhZlZsZsWbNm2qZhVFRA5CXv7P4YxXX2Xz5s3MmzeP2bNnM2bMGNydpk2bUlhYyJ133smkSZNwd3784x9TWFhIUVER3/rWt3jllVdiu2kL/MjdfwCMAn7n7n2BNcEhfTOwwswy4lqO/rSvaibjjqQEODluvjWwvpIyJWbWADgG+LKK2wLg7hOBiQAZGRkJw0ZE5JCRnw9btsCECWAG7iz+9a+Zt2wZ2cuWAbBjxw6++OILzj8/aARq06YNM2fO5PPPP2f16tX07x/0Mmzfvp0OHTrE9rzE3beF06cDBeH0P8J5CP7WDgWaAW+6+9f7qmoy7kjeAU43s3Zm1oig87ywQplCICecvhaY4+4eLr8h/FVXu/Ak3k5CnUREDl7uQYgUFMDIkcH8yJF0nj+fS1q3pmjuXIqKinjvvfdIS0sjuHGIbeqkpaVx6qmnMn36dIqKiiguLiY3NzdWZE/ckVYCGeH0eXH7eAP4NnAbMGl/1Y18RxL2edwKvAakAE+7+/tmNg4odvdC4Cngj2a2kuBO5IZw2/fN7AVgKbCb4HZrT8IDiYgcLsyCOxEIwqQguGm4PC+PN48+muyePTEzWrduzWmnnZZgc+PBBx+kX79+uDtHHHEEE2L7K+83wLNm9gOC1qCdceteAG5090X7ra77wddKlJGR4cXFxXVdDRGRmuUOR8Q1HJWWBiFzgMxsobtnxM2nAKXu7mY2Btjh7veH634M/Nvdn9zffvVku4hIfRQ2Z5UTa+ZKnlbA62Y2H8gCngQws98A/YCpVdlJMjrbRUQkmWIhUlAAeXlBM1dsHr7pgI98GF8PXJhg+ajq7EdBIiJS35hBauo3IRLfZ5KampQQSSb1kYiI1Ffu5UOj4nw1VewjSRb1kYiI1FcVQ6Oe3YnEKEhERCQSBYmIiESiIBERkUgUJCIiEomCREREIlGQiIhIJAoSERGJREEiIiKRKEhERCQSBYmISD127733snjxYgDat29fx7VJTC9tFBGpx0aPHl3XVdgv3ZGIiNQT7s7NN99MVlYWmZmZvP322wwePJj58+fXddX2SXckIiJ1pcLbfF+eNo1du3Yxf/58Vq1axQ033ECnTp3qsIJVE+mOxMxamNlMM1sRfjevpFxOWGaFmeXELS8ys+Vmtij8HBelPiIiB438/PIjHrqz/IEHyPz8cwBOPfVUNm/eXHf1q4aoTVujgdnufjowO5wvx8xaAL8Azge6Ab+oEDiD3D09/HwWsT4iIvWfO2zZEox4GAuTkSPp8Pe/s+Cjj8CdVatWkZqaWtc1rZKoTVv9gexwegpQBFQcovFSYKa7fwlgZjOBvsCzEY8tInJwih/xsKCgbAjdfiNG8Mq//03WhReyZ88eHn74YR5//PE6rGjVRBoh0cy2uHtq3Pxmd29eocxPgMbuPj6c/znwtbvfb2ZFQEtgD/ASMN6rUCGNkCgihwR3OCKuYai0tEYHr6qzERLNbJaZLUnw6V/FYyS6KrGwGOTuZxEMPn8hcNM+6jHczIrNrHjTpk1VPLSISD0VNmeVE99nchDZb5C4e29375Lg8zKw0cxOAAi/E/VxlAAnx823BtaH+14Xfn8FPEPQh1JZPSa6e4a7Zxx77LFVPT8RkfonFiIFBZCXF9yJ5OWV7zM5iETtbC8EYr/CygFeTlDmNeASM2sedrJfArxmZg3MLA3AzBoCVwJLItZHRKT+M4PU1CA8Jkz4ps8kLy9YXk/HZq9M1D6SlsALQBtgDTDQ3b80swzgFncfGpb7AfC/4WZ3u/vvzewo4HWgIZACzAJud/c9+zuu+khE5JBQ4TmSveaTrKb6SC
IFSV1RkIiIVF+ddbaLiIjsi4JEREQiUZCIiEgkChIREYlEQSIiIpEoSEREJBIFiYiIRKIgERGRSBQkIiISiYJEREQiUZCIiNRDJSUlZGdn13U1qkRBIiJykNuzZ7/vuq1RChIRkQN011130aNHD7p378706dNZs2YNffv2pUePHlx88cWUlpYyePBg5s+fD8DUqVPJz88HYNSoUfTs2ZNzzz2XiRMnArB9+3auuOIKevfuzYMPPlh2nA8//JDs7Gx69OjB9ddfz9dffw3AKaecwv/8z//Qv39VxxmsGQoSEZGqqPCm9BmvvsrmzZuZN28es2fPZsyYMdxxxx3cfvvtzJs3j5kzZ3LEEZX/iR07dixz587lzTff5P7772fXrl08+eSTZGVlMWvWLLp27VpW9s4772TcuHHMmzePzp078+STTwKwYcMGRo8eTVpaWllY1QUFiYjI/uTnlx+50J3Fv/418156iezsbC6//HJ27NjB0qVL6dmzJ0BZiFjc+CLxw3Y89thjZGVlcckll/DZZ5/x2Wef8eGHH9KtWzBQ7Pnnn19W9sMPPyQzMxOAzMxMli1bBsBJJ51EmzZtauy0q6pBXVdARKRec4ctW6CggDtef50ev/gF/ebOpf38+XzeuDGP5+cz9he/oFWrVpSUlDB37lzOOOMMrrvuOrp06cJf//pX9uzZQ1ZWFgsWLOD1119n7ty5FBcX889//pMGDRrQuXNnrr32Wj7++GPeeOMNpk+fzlVXXUVaWhoAa9asYcGCBXTv3p1Ro0axbds2srKy2LVr117VvfTSS9mxYwf/+c9/KCgooHv37jV+iRQkIiKJxEYrjA2D607OQw/xy6uuoh+wo29fOn71FVd+97ukp6fTtm1bTj/9dEaPHk3Dhg3517/+xZw5cxgxYgTnn38+W7ZsYfXq1Zx55pm8+OKLXHLJJWRmZtK/f386derEU089xbBhwzAzBg0axLZt28qCpGXLlvz85z9n/fr1NG7cmPfee49t27bRtm3bvar95z//maOOOooPPviAH/3oR8yZM6fGL1WkIDGzFsDzQFtgNXCdu29OUG4GcAEw392vjFveDngOaAH8E7jJ3XdGqZOISGT5+cFdSGw89dDZQAnwJTA1JYUHJ0zgkksuISUlhbVr17J9+3aGDx9O3759yc3N5eijjyY9PZ2TTz6ZwsJCrrjiCtatW1fW/HXOOefw9NNPs2PHDq666iruueceLrroIj755BNyc3OZNWsWAA0aNGDevHn86Ec/YsGCBVx22WUAnHbaaeWq/fXXX5OXl8fy5ctJSUlh3bp1tXCxot+RjAZmu/u9ZjY6nB+VoNx9QBPg5grLfwNMcPfnzOxxIBd4LGKdREQOXFxTFhCEyY9/DA89BMD1wMPA9kWLyOjalVNPPZXp06dz9NFHA7Br1y7WrVtXrm8kpnPnznTv3p2rr74agJ07d+LuDBkyhNzcXC666CIAmjdvzvr163F3Nm7cWBYInTt3pn379owcObJs+3gzZswgJSWFN954g6VLl9KvX79kX52Eona29wemhNNTgKsSFXL32cBX8cssuMq9gD/tb3sRkVoTa8rKywvC5IgjykKEESMYtGED96Sk8L1167Dbb+fBBx6gX79+9OzZk4svvpgPPvig0l2PGTOGF154gV69etGzZ08efvhh5s+fzyuvvMIjjzxCdnY2P/vZz2jWrBl9+/ale/fu/PrXv6ZVq1YADBs2jOXLl9OzZ0969uzJmDFjyu2/e/fuvPvuu/Tu3Zvnn3++xi5RReYVftJWrY3Ntrh7atz8ZndvXknZbOAnsaYtM0sD3nL39uH8ycCr7t5lf8fNyMjw4uLiA663iMh+uQchEjNiBPz2t0HQuAe/4kpNDZrBDhJmttDdM5K93/02bZnZLOD4BKvGJFhWHXvf90GlqWZmw4HhQL34uZuIHMJiQVGZ2F1Lguarw9F+m7bcvbe7d0nweRnYaGYnAITfn1Xj2J
8DqWYWC7PWwPp91GOiu2e4e8axxx5bjcOIiOxt9erV9O7de+8VsRApKAiat0pLg++HHir/LMk+QmTRokXcd999AEybNo01a9bUxCnUG1E72wuBHODe8Pvlqm7o7m5mc4FrCX65Va3tRURqhFnQZJWX981dx4QJwbrU1CrdhaSnp5Oeng4EQZKWlnZIt6RE7Wy/F+hjZiuAPuE8ZpZhZpNihczsDeBF4GIzKzGzS8NVo4DbzWwl0BJ4KmJ9RESq7ZFHHuGHP/wh7dq1Cxbk59N7yRJWf/IJv/zlL5n28sv4gw9y7KOPMmPGDPbs2UNGRtDVkOidWUVFRQwdOpSlS5cyY8YMbrvtNgYOHFhXp1fjIt2RuPsXwMUJlhcDQ+PmL6xk+1VAtyh1EBHZr9jDhfHzof/93/+lUaNGPPbYY7Rv336vTXv16sULL7xAu3btyMzMZPbs2TRv3rzsXVhjx47lqKOOYseOHZx11lkMGTKkbNtOnTrRt29fhg4dSlZWVs2dXx3Tk+0icmir+HBhXEf6+++/z5dffslbb72112axX7RecMEF3HHHHZx22mnceuutPPTQQ8ydO5devXoBwTuzpk2bRkpKStk7sw43emmjiBy64h8ujHWUxzrSt22jc+fOjBkzhuuuu47//ve/lJaWlr2nKvY8SMOGDWnZsiUvvfQS3/nOd2jRogUvhS9r3Lx5M08//TTz5s3jtdde45hjjqHiIxWNGjVi9+7ddXH2tUZ3JCJy6IrvKC8o+OZp9by84DNsGAMGDKBhw4YMHDiQ3NxcLrjgAtLT02ndunXZbnr16sX06dNp0qQJ2dnZLFy4kFatWuHudO7cmaysLM4880xatmy5VxWuvPJKxo4dy5lnnskTTzxRG2dd6yI9kFhX9ECiiFRLxYcLS0sPy2dAauqBRDVticihLdHDhfHPg0hkChIROXRV9nBhfJ+JRKY+EhE5dCXh4ULZP/WRiMihL9FzJIdhiKiPRETkQFUMjcMwRGqSgkRERCJRkIiISCQKEhERiURBIiIikShIREQkEgWJiIhEoiAREZFIFCQiIhKJgkRERCKJFCRm1sLMZprZivC7eSXlZpjZFjObXmH5ZDP72MwWhZ/0KPUREZHaF/WOZDQw291PB2aH84ncB9xUybqfunt6+FkUsT4iIlLLogZJf2BKOD0FuCpRIXefDXwV8VgiIlIPRQ2SVu6+ASD8Pu4A9nG3mb1nZhPM7MiI9RERkVq23/FIzGwWcHyCVWOScPy7gE+BRsBEYBQwrpJ6DAeGA7Rp0yYJhxYRkWTYb5C4e+/K1pnZRjM7wd03mNkJwGfVOXjsbgbYYWa/B36yj7ITCcKGjIyMg28QFRGRQ1TUpq1CICeczgFers7GYfhgZkbQv7IkYn1ERKSWRQ2Se4E+ZrYC6BPOY2YZZjYpVsjM3gBeBC42sxIzuzRc9X9mthhYDKQB4yPWR0REalmkMdvd/Qvg4gTLi4GhcfMXVrJ9ryjHFxGRuqcn20VEJBIFiYiIRKIgERGRSBQkIiISiYJERPby0EMPHfC2kydPZtu2bUmsjdR3ChIR2YuCRKpDQSJymHB3br75ZrKyssjMzOTtt98mOzubkpISAMaPH8/kyZN55plnWLduHdnZ2dx9990UFRVx6aWXMmDAANLT03nxxRcBGDx4MPPnzwdg6tSp5OfnM2fOHBYtWsTAgQO57bbb6uxcpXZFeo5EROoxdzArm3152jR27drF/PnzWbVqFTfccANNmjTZa7Mbb7yRsWPHUlRUBEBRURHr1q3j3Xff5euvvyYjI4MBAwYkPGSvXr1IT09n6tSptG7dukZOS+of3ZGIHIry82HkyCBMANxZ/sADZH7+OQCnnnoqmzdvxuKCxr3yV9idc845NGzYkGbNmnHcccexadOmKm8rhz4Ficihxh22bIGCgm/CZORIOvz97yz46CNwZ9WqVaSmptKiRYuypq2FCxeW7aJBgwaUlpaWzS9atI
jdu3fz1VdfsXHjRtLS0irdtlGjRuzevbuWTlbqAzVtiRxqzGDChGC6oCD4AP1GjOCVf/+brAsvZM+ePTz88MPs2LGDoUOHcsYZZ3Dkkd8MB3TttddyxRVXcNlll3H22Wdz4oknMnDgQD7++GPGjx9PSkoKQ4cO5Xvf+x7PPPMMaWlppKamAnDNNdeQm5tLZmYmv/rVr2r99KX22cF4S5qRkeHFxcV1XQ2R+s0djohrdCgtLddnUlVFRUVMnTqVSZMm7b+w1GtmttDdM5K9XzVtiRyKwuascuL7TESSSEEicqiJhUhBAeTlBXcieXnl+0yqITs7W3cjsk/qIxE51JhBamoQHhMmlO8zSU09oOYtkX1RH4nIoarCcyR7zcthR30kIlI9FUNDISI1REEiIiKRRAoSM2thZjPNbEX43TxBmXQze9PM3jez98zs+rh17czsH+H2z5tZoyj1ERGR2hf1jmQ0MNvdTwdmh/MV/Qf4vrt3BvoCvzWz1HDdb4AJ4fabgdyI9RERkVoWNUj6A1PC6SnAVRULuPuH7r4inF4PfAYca8GLenoBf9rX9iIiUr9FDZJW7r4BIPw+bl+Fzawb0Aj4CGgJbHH32Et5SoCT9rHtcDMrNrPiTZs2Ray2iIgky36fIzGzWcDxCVaNqc6BzOwE4I9AjruXmiX8CUmlv0V294nARAh+/ludY4uISM3Zb5C4e+/K1pnZRjM7wd03hEHxWSXlmgGvAD9z97fCxZ8DqWbWILwraQ2sr/YZiIhInYratFUI5ITTOcDLFQuEv8T6C/AHd38xttyDJyHnAtfua3uRurJ69Wp69670/1GRxY9OKHIwixok9wJ9zGwF0Cecx8wyzCz2cp7rgIuAwWa2KPykh+tGAbeb2UqCPpOnItZHpE7Fj+EBsGfPnjqqiUjtifSuLXf/Arg4wfJiYGg4PRWYWsn2q4BuUeogUpO2bt3KoEGDWL58OTfddBNnn30248aNY/fu3bRo0YLnn3+exo0b0759e6677jrefPNNHn30UXJycujYsSMNGzZkwoQJDBs2jC+++AJ3Z+LEibRv377sGO+//z5Dhw6lcePGNG7cmFdffbUOz1jkALj7Qffp2rWriyRVaele8x9//LGnpaX5tm3bfOfOnf7tb3/bV69eXVbkzjvv9ClTpri7+ymnnOILFixwdy/bbuvWre7uPmrUKH/22Wfd3X3RokU+YMAAd3fv0aOHr1271h944AF/4okn3N19z549NXqacngDir0G/ibr7b8i+fnB0LSxN+XGjeXRsWNHmjZtCkCXLl349NNPGTZsGDt27GDjxo00a9YMgJSUFC644IKyXXbp0qVs3eLFi5k3bx6PP/44EAxjG2/IkCHcfffdDBo0iLPPPptRo0bV9BmLJJWCRA5v8eObQxAmsbE8hgxh2bJlbN++ncaNG7NkyRLy8/P55S9/Sffu3bnzzjvx8O3ZZkb8L9pTUlLKpjt37kz37t25+uqrAdi5c2e5Khx55JHcf//9APTu3ZvLL7+cs846qyYqas0gAAAJ7klEQVTPWiSpFCRyeKtkfHPy8iAvj7aLFzNs2DBWrFhBTk4Oxx9/PLm5uXTo0IFjjjmm7K5jX8aMGcMtt9zCww8/jLtz5ZVXcscdd5Stf/bZZ5k8eTJmxvHHH0+HDh1q4kxFaozGIxGBpI1vLlKfaTwSkZqi8c1FIlGQyOEtyeObixyO1EcihzeNby4SmfpIREDjm8thQX0kIjVJ45uLHDAFiYiIRKIgERGRSBQkIiISiYJEREQiUZCIiEgkChIREYlEQSIiIpFEChIza2FmM81sRfjdPEGZdDN708zeN7P3zOz6uHWTzezjBEPwiojIQSLqHcloYLa7nw7MDucr+g/wfXfvDPQFfmtmqXHrf+ru6eFnUcT6iIhILYsaJP2BKeH0FOCqigXc/UN3XxFOrwc+A46NeFwREaknogZJK3ffABB+H7evwmbWDWgEfBS3+O6wyW
uCmR0ZsT4iIlLL9vv2XzObBRyfYNWY6hzIzE4A/gjkuHtpuPgu4FOCcJkIjALGVbL9cGA4QJs2bapzaBERqUH7DRJ3713ZOjPbaGYnuPuGMCg+q6RcM+AV4Gfu/lbcvjeEkzvM7PfAT/ZRj4kEYUNGRsbB98piEZFDVNSmrUIgJ5zOAV6uWMDMGgF/Af7g7i9WWHdC+G0E/StLItZHRERqWdQguRfoY2YrgD7hPGaWYWaTwjLXARcBgxP8zPf/zGwxsBhIA8ZHrI+IiNQyDWwlInKY0MBWIiJSLylIREQkEgWJiIhEoiAREZFIFCQiIhKJgkRERCJRkIiISCQKEhERiURBIiIikShIREQkksM+SFavXk3v3pW+4LhS48ePZ/LkycmvkIjIQeawDxIREYlmv+ORHA62bt3KoEGDWL58OTfddBPHHHMMr7zyCv/9738pKSnhoYce4sILL+T1119nxIgRtGnThiOPPJLWrVvXddVFROrc4RUk7mBWfp6geWvOnDk0btyY8847jxtvvJGvvvqKGTNmsHr1aq699lqKi4u5/fbbKSws5OSTT+bSSy+to5MQEalfDp8gyc+HLVtgwoQgTNxh5EgAOnbsSNOmTQHo0qUL7s55550HQNu2bdm6dSsA27ZtKxvmt1u3brV/DiIi9dDh0UfiHoRIQUEQHrEQKSiAbdtYtmwZ27dvZ/fu3SxZsgQzY+HChQCsWbOGZs2aAdC0aVNKSkoAeOedd+rsdERE6pPD447ELLgTgSA8CgqC6bw8yMuj7eLFDBs2jBUrVpCTk0Pz5s1p0qQJV1xxBevXr2dCuO0DDzzAd7/7XU488cSyOxgRkcPd4TVCojscEXcTVlpavs8kNHnyZEpKSvjZz34WoZYiIvVLvR0h0cxamNlMM1sRfjdPUOYUM1sYjtf+vpndEreuq5ktNrOVZvaQWYK/7MkQ1ydSJtbMJSIiBywZfSSjgdnufjowO5yvaAOQ6e7pwPnAaDM7MVz3GDAcOD389E1CncqL7xPJywvuRPLyyveZxBk8eLDuRkREqigZfST9gexwegpQBIyKL+DuO+NmjyQMMDM7AWjm7m+G838ArgJeTUK9vmEGqalBeMR+tRXrM0lNTdi8JSIiVZOMIGnl7hsA3H2DmR2XqJCZnQy8ArQHfuru680sAyiJK1YCnJSEOu0tP7/8cySxMFGIiIhEUqUgMbNZwPEJVo2p6oHcfS1wdtikNc3M/gQk+iuesNPCzIYTNIGVPctRbRVDQyEiIhJZlYLE3St9q6GZbTSzE8K7kROAz/azr/Vm9j5wIfB3IP49I62B9ZVsNxGYCMGvtqpSbxERqXnJ6GwvBHLC6Rzg5YoFzKy1mX0rnG4OfAdYHjaJfWVmF4S/1vp+ou1FRKT+SkaQ3Av0MbMVQJ9wHjPLMLNJYZkzgX+Y2b+AecD97r44XPdDYBKwEviIZHe0i4hIjTq8HkgUETmM1dQDiQdlkJjZJuCTuq5HJdKAz+u6EvWMrkliui570zVJLFnX5RR3PzYJ+ynnoAyS+szMimsi8Q9muiaJ6brsTdcksfp+XQ6Pt/+KiEiNUZCIiEgkCpLkm1jXFaiHdE0S03XZm65JYvX6uqiPREREItEdiYiIRKIgqSIz62tmy8NxU/Z6VX445spsM3vPzIrMrHXcujZm9jcz+8DMlppZ29qse0060OtiZj3D8Wlin/+a2VW1fwbJF/Hfyv8Lx+z5oEbH56kDEa/Lb8xsSfi5vnZrXnPM7Gkz+8zMllSy3sJ/ByvD63Ju3LqccByoFWaWk2j7WuPu+uznA6QQPHV/KtAI+BfQqUKZF4GccLoX8Me4dUVAn3D6aKBJXZ9TfbgucWVaAF8eCtclyjUBMgneP5cSft4Esuv6nOrBdbkCmEnwbsCjgGKC4Sfq/LyScF0uAs4FllSy/nKCt30YcAHwj3B5C2BV+N08nG5eV+ehO5Kq6QasdPdVHoyt8hzBOCzxOhEM7AUwN7
bezDoBDdx9JoC7b3f3/9ROtWvcAV+XCq4FXj1ErkuUa+JAY4I/tEcCDYGNNV7j2hHlunQC5rn7bnf/N0EIJX8AvDrg7q8T/CeqMv2BP3jgLSA1fDnupcBMd//S3TcTBG2dXRMFSdWcBKyNm080bsq/gAHh9NVAUzNrCZwBbDGzP5vZu2Z2n5ml1HiNa0eU6xLvBuDZGqlh7Tvga+LBAG9zCUYU3QC85u4f1HB9a0uUfyv/Ai4zsyZmlgb0BE6u4frWF5Vdt6pcz1qjIKmaqoyb8hOgh5m9C/QA1gG7CW7HLwzXn0dwaz+4xmpau6Jcl2AHwf+uzgJeq6lK1rIDviZm1p7gBaetCf4o9DKzi2qysrXogK+Lu/8N+CuwgOA/HG8S92/oEFfZdavyWE61QUFSNSWU/x/QXuOmuPt6d7/G3c8hHPDL3beG274b3tLvBqYRtIkeCqJcl5jrgL+4+66armwtiXJNrgbeCps/txO0jV9QO9WucZH+rbj73e6e7u59CP6Irqidate5yq7bfq9nbVKQVM07wOlm1s7MGhE0xRTGFzCzNDOLXc+7gKfjtm1uZrEXpfUCltZCnWtDlOsS8z0OnWYtiHZN1hD8j7yBmTUk+F/5odK0dcDXxcxSYs2hZnY2cDbwt1qred0qBL4f/nrrAmCrB+M4vQZcYmbNLRjj6RLq8q6+rn+1cLB8CH498SHBL0/GhMvGAf3C6WsJ/pf0IcH4KkfGbdsHeA9YDEwGGtX1+dST69KWoPniiLo+j/pwTQh+2fQEQXgsBR6s63OpJ9elcXg9lgJvAel1fS5JvCbPEvSH7SK4y8gFbgFuCdcb8Gh4zRYDGXHb/oBgHKeVwJC6PA892S4iIpGoaUtERCJRkIiISCQKEhERiURBIiIikShIREQkEgWJiIhEoiAREZFIFCQiIhLJ/wdFuimsXZXdAwAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# -----------------------------\n", "# Run This Cell to Produce Your Plot\n", "# ------------------------------\n", "reuters_corpus = read_corpus()\n", "M_co_occurrence, word2Ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)\n", "M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)\n", "\n", "# Rescale (normalize) the rows to make them each of unit-length\n", "M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)\n", "M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting\n", "\n", "words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']\n", "plot_embeddings(M_normalized, word2Ind_co_occurrence, words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Prediction-Based Word Vectors (15 points)\n", "\n", "As discussed in class, more recently prediction-based word vectors have come into fashion, e.g. word2vec. Here, we shall explore the embeddings produced by word2vec. Please revisit the class notes and lecture slides for more details on the word2vec algorithm. If you're feeling adventurous, challenge yourself and try reading the [original paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf).\n", "\n", "Then run the following cells to load the word2vec vectors into memory. **Note**: This might take several minutes." 
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "def load_word2vec():\n", " \"\"\" Load Word2Vec Vectors\n", " Return:\n", " wv_from_bin: All 3 million embeddings, each length 300\n", " \"\"\"\n", " import gensim.downloader as api\n", " wv_from_bin = api.load(\"word2vec-google-news-300\")\n", " # Note: `.vocab` is the gensim 3.x API; gensim 4.0+ replaces it with `key_to_index`.\n", " vocab = list(wv_from_bin.vocab.keys())\n", " print(\"Loaded vocab size %i\" % len(vocab))\n", " return wv_from_bin" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[==================================================] 100.0% 1662.8/1662.8MB downloaded\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/nwams/anaconda3/lib/python3.7/site-packages/smart_open/smart_open_lib.py:398: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function\n", " 'See the migration notes for details: %s' % _MIGRATION_NOTES_URL\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Loaded vocab size 3000000\n" ] } ], "source": [ "# -----------------------------------\n", "# Run Cell to Load Word Vectors\n", "# Note: This may take several minutes\n", "# -----------------------------------\n", "wv_from_bin = load_word2vec()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reducing dimensionality of Word2Vec Word Embeddings\n", "Let's directly compare the word2vec embeddings to those of the co-occurrence matrix. Run the following cells to:\n", "\n", "1. Put a random sample of 10,000 of the 3 million word2vec vectors, plus the ten words we want to plot, into a matrix M\n", "2. Run `reduce_to_k_dim` (your Truncated SVD function) to reduce the vectors from 300-dimensional to 2-dimensional." 
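, "\n", "\n", "As a rough sketch of what step 2 computes (the symbols $N$, $U_k$, $S_k$ and $V_k$ are notation introduced here for illustration, not names from the assignment): write the matrix built in step 1 as $M \\in \\mathbb{R}^{N \\times 300}$, with one row per word. A rank-$k$ truncated SVD keeps only the $k$ largest singular values,\n", "\n", "$$M \\approx U_k S_k V_k^\\top, \\qquad U_k \\in \\mathbb{R}^{N \\times k}, \\quad S_k \\in \\mathbb{R}^{k \\times k}, \\quad V_k \\in \\mathbb{R}^{300 \\times k},$$\n", "\n", "and, going by its docstring, `reduce_to_k_dim` returns\n", "\n", "$$M_{\\text{reduced}} = U_k S_k \\in \\mathbb{R}^{N \\times k}.$$\n", "\n", "For an exact SVD this equals $M V_k$, i.e. each word vector projected onto the top $k$ right singular directions. With $k = 2$, each row of $M_{\\text{reduced}}$ is the 2-dimensional embedding of one word. Because sklearn's randomized solver is approximate, the result should be close to, but not necessarily identical to, an exact SVD."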
] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "def get_matrix_of_vectors(wv_from_bin, required_words=['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']):\n", "    \"\"\" Put a random sample of 10,000 word2vec vectors, plus the required words, into a matrix M.\n", "        Param:\n", "            wv_from_bin: KeyedVectors object; the 3 million word2vec vectors loaded from file\n", "        Return:\n", "            M: numpy matrix shape (num words, 300) containing the vectors\n", "            word2Ind: dictionary mapping each word to its row number in M\n", "    \"\"\"\n", "    import random\n", "    # Note: gensim 4.0+ replaces `.vocab` with `key_to_index` and deprecates `word_vec()` in favor of `get_vector()`.\n", "    words = list(wv_from_bin.vocab.keys())\n", "    print(\"Shuffling words ...\")\n", "    random.shuffle(words)\n", "    words = words[:10000]\n", "    print(\"Putting %i words into word2Ind and matrix M...\" % len(words))\n", "    word2Ind = {}\n", "    M = []\n", "    curInd = 0\n", "    for w in words:\n", "        try:\n", "            M.append(wv_from_bin.word_vec(w))\n", "            word2Ind[w] = curInd\n", "            curInd += 1\n", "        except KeyError:\n", "            continue\n", "    for w in required_words:\n", "        try:\n", "            M.append(wv_from_bin.word_vec(w))\n", "            word2Ind[w] = curInd\n", "            curInd += 1\n", "        except KeyError:\n", "            continue\n", "    M = np.stack(M)\n", "    print(\"Done.\")\n", "    return M, word2Ind" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shuffling words ...\n", "Putting 10000 words into word2Ind and matrix M...\n", "Done.\n", "Running Truncated SVD over 10010 words...\n", "(10010, 2)\n", "Done.\n" ] } ], "source": [ "# -----------------------------------------------------------------\n", "# Run Cell to Reduce 300-Dimensional Word Embeddings to k Dimensions\n", "# Note: This may take several minutes\n", "# -----------------------------------------------------------------\n", "M, word2Ind = get_matrix_of_vectors(wv_from_bin)\n", "M_reduced = reduce_to_k_dim(M, k=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 
2.1: Word2Vec Plot Analysis [written] (4 points)\n", "\n", "Run the cell below to plot the 2D word2vec embeddings for `['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']`.\n", "\n", "What clusters together in 2-dimensional embedding space? What doesn't cluster together that you might think should have? How is the plot different from the one generated earlier from the co-occurrence matrix?" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[10000, 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10009]\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX8AAAD8CAYAAACfF6SlAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAIABJREFUeJzt3Xt4VNXd9vHvjwBSVAgaPJRwLOegIobUxCjhIIJaaAUUPDyGcvDwihF9qih9KAX6YEs1jeKLRYraWrVa3wsiKlQREIpWg6IICiJGCOciYC1n8nv/mCQdQoDATDJJ9v25rlzMnlmz11oz4d4ra8/sZe6OiIgES61YN0BERCqfwl9EJIAU/iIiAaTwFxEJIIW/iEgAKfxFRAJI4S8iEkAKfxGRAFL4i4gEUO1YN+BYEhISvEWLFrFuhohItbJs2bJ/unvjE5WrsuHfokUL8vLyYt0MEZFqxcy+Lk85TfuIiASQwl9EJICiEv5m1sfMVpvZWjMbU8bjmWa23cyWF/0Mj0a9IiJyaiKe8zezOOAJ4EqgAPjAzHLdfVWpon9x97sirU9ERCIXjZF/CrDW3de5+wHgRaB/FPYrNUhmZiZLliyJdTNEpEg0wr8JsCFsu6DovtIGmNknZvZXM2ta1o7MbKSZ5ZlZ3vbt26PQNBERKUs0wt/KuK/08mCvAi3c/ULgLeDZsnbk7tPdPdndkxs3PuHHVCVc0Yps9913H7m5ueDO3r17ueiii1i0aBHdunUjIyOD22+/HXcnPz+flJQUfvrTn9KlSxd+97vfAbB7926uv/56evbsSY8ePVi7di0bN24kIyODjIwMkpKSGDBgAPn5+fTq1auk+tatWwNw8OBBhg8fTvfu3UlPT+f9998/qqlXXXUVGRkZpKSk8O6771bCiyMiR3H3iH6AVGBe2PaDwIPHKR8H7D7Rfi+55BKXcvrFL9yzstwLC/3jjz/26667zj0ry18YMMB//vOfe+fOnX3Xrl3u7n7PPff4q6++6l999ZWff/75/u9//9v37t3rLVq0cHf3Bx54wF944QV3d1++fLkPGDCgpJpdu3b55Zdf7p9++ql/9dVX3rNnz5LHfvCDH7i7+7Rp03zy5Mnu7r5lyxZPS0tzd/dbb73VFy9e7O7u3333nbu7r1q1yrt3716BL4xI8AB5Xo7sjsaXvD4A2phZS2AjMBi4MbyAmZ3v7puLNvsBn0WhXoHQiH/XLsjJAeDC7GwK/v53vtm6ledatOAX99/P1KlT6d8/dBrmu+++o127dnTq1IkOHTpQ
v359AOLi4gBYsWIFixYt4sknnwSgdu3Qr8j+/fsZPHgwkyZNIikpia+//rpUM7zk+UuXLmXu3LlA6C+JcHv37iUrK4vVq1cTFxfHxo0bK+JVEZETiDj83f2Qmd0FzCM0qp/p7ivNbAKhI1AucLeZ9QMOAd8AmZHWK0XMIDs7dDsnB3JyuAF4/Ic/5Lt69Uju2pVWrVoxZ84czjjjDCA0NbNx40bMjp6xS0pKIjU1lZ/85CcAHDhwAHdn6NChDBs2jCuuuAKARo0asWnTJtydrVu3loR4UlISrVu3ZvTo0SXPDzd37lzi4uJYvHgxq1atol+/fhXxqojICUTl8g7u/jrweqn7xoXdfpDQdJBUhOIDQNHo/yag+fLl5OTkYGY8+uij9OvXD3enVq1aZGdn06BBgzJ3NXbsWG6//XYef/xx3J1rr72WlJQUXnvtNTZt2sTUqVNJT09n0qRJ9OnTh9TUVFJSUjj33HMBGDFiBKNGjaJ79+4AJCcnM2XKlJL9p6amMnnyZHr16sVll11Wsa+LiByTFf+5XtUkJye7ru1TTu4wenRJ+AOQlRU6IJQxuheRmsvMlrl78onK6fIO1V148GdlQWFh6N+cnND9VfTgLiKxVWWv6inlZAbx8UeO9IvPAcTHa+QvImXStE9N4X5k0JfeFpFA0LRP0JQOegW/iByHwl9EJIAU/iIiAaTwFxEJIIW/iEgAKfxFRAJI4S8iEkAKfxGRAFL4i4gEkMJfRCSAFP4iIgGk8BcRCSCFv4hIACn8RUQCSOEvIhJACn8RkQBS+IuIBJDCX0QkgKIS/mbWx8xWm9laMxtznHIDzczN7IRLjImISMWJOPzNLA54AugLdASGmFnHMsqdCdwN/CPSOkVEJDLRGPmnAGvdfZ27HwBeBPqXUW4i8BtgXxTqFBGRCEQj/JsAG8K2C4ruK2FmFwNN3X3O8XZkZiPNLM/M8rZv3x6FpomISFmiEf5Wxn1e8qBZLSAbuO9EO3L36e6e7O7JjRs3jkLTRESkLNEI/wKgadh2IrApbPtMoBOw0MzygUuBXJ30FRGJnWiE/wdAGzNraWZ1gcFAbvGD7r7b3RPcvYW7twDeA/q5e14U6hYRkVMQcfi7+yHgLmAe8BnwkruvNLMJZtYv0v2LiEj01Y7GTtz9deD1UveNO0bZjGjUKSIip07f8BURCSCFv4hIACn8RUQCSOEvIhJACn8RkQBS+IuIBJDCX0QkgBT+IiIBpPAXEQkghb+ISAAp/EVEAkjhLyISQAp/EZEAUviLiASQwl9EJIAU/iIiAaTwFxEJIIV/BAoKCsjIyIh1M0RETprCv5IcPnw41k0QESlR48P/wQcfpFu3bqSmpjJnzhzWr19Pnz596NatGz179qSwsJDMzEyWLFkCwHPPPcf48eMBeOCBB+jevTtdunRh+vTpAHz33Xdcc8019OrVi0cffbSknjVr1pCRkUG3bt244YYb2Lt3LwDNmzfnzjvvpH///pXbcRGR44jKAu5VjjuYMXfuXHbu3MmihQvZs3cvqamptG3blnvvvZfevXtTWFhIrVrHPv6NGzeO008/nf3793PBBRcwdOhQnnrqKdLT03nwwQf585//zIcffgjA/fffz4QJE7jiiiuYMGECTz31FHfffTebN29mzJgxNGvWrLJ6LyJyQlEZ+ZtZHzNbbWZrzWxMGY/fbmYrzGy5mS0xs47RqLdM48fD6NHgzooVK1i0aBEZTZtydceO7N+/n1WrVtG9e3eAkuA3s5Knu3vJ7WnTppGenk7v3r3Ztm0b27ZtY82aNaSkpADwwx/+sKTsmjVrSEtLAyAtLY3PP/8cgCZNmij4RaTKiTj8zSwOeALoC3QEhpQR7s+7+wXu3hn4DfAoFcEddu2CnBwYPZqkjh3pXa8eCzduZOGPf8wnH39MUlISCxcuBKCwsBCAs846i4KCAgCWLVsGwM6dO5k5cyaLFi1i3rx5NGzYEHenTZs25OXlAfDB
Bx+UVN22bVuWLl0KwNKlS2nXrh0AcXFxFdJVEZFIRGPaJwVY6+7rAMzsRaA/sKq4gLt/G1b+dMCpCGaQnR26nZPD1Tk5vAtkNGmCffwxicOH89vf/pYRI0YwadIk6tSpw9/+9jeGDx/OkCFDeP7550lISCA+Pp74+HiSkpJIT0+nQ4cOnH322QCMGDGC66+/njfffJNOnTqVVP3www9z22234e6cc845/OlPf6qQLoqIRIOFT3Oc0g7MBgJ93H140fYtwA/d/a5S5f4PcC9QF+jh7l8cb7/JyclePMI+ae4QPpdfWBg6MIiI1HBmtszdk09ULhpz/mWl6lFHFHd/wt1/ADwA/LzMHZmNNLM8M8vbvn37qbXGPTTnH67oHICIiIREI/wLgKZh24nApuOUfxH4cVkPuPt0d0929+TGjRuffEuKgz8nB7KyQiP+rKyScwA6AIiIhERjzv8DoI2ZtQQ2AoOBG8MLmFmbsGmea4DjTvmcMjOIjw8Ffnb2kecA4uM19SMiUiTi8Hf3Q2Z2FzAPiANmuvtKM5sA5Ll7LnCXmfUCDgI7gVsjrfeYxo8v+Zw/8J8DgIJfRKREVL7k5e6vA6+Xum9c2O2saNRTbqWDXsEvInKEGn95BxEROZrCX0QkgBT+IiIBpPAXEQkghb+ISAAp/EVEAkjhL9XCY489dsrPfeaZZ/j2229PXFAkQBT+Ui0o/EWiS+EvMePu3HbbbaSnp5OWlsb7779PRkZGydoKkyZN4plnnuH5559n48aNZGRk8Ktf/YqFCxdy1VVXMWDAADp37szLL78MUOZynG+//TbLly9n0KBBjBo1KmZ9FalqauYyjlJ1hV16Y/bs2Rw8cIAlS5awbt06Bg8eTP369Y96yo033si4ceNKFuFZuHAhGzdu5KOPPmLv3r0kJyczYMCAMqvr0aMHnTt35rnnniMxMbHCuiVS3WjkL5UnbIlNgNWff07ahg0wfjytWrVi586dx1xSs7SLL76YOnXq0KBBA8455xy2b99e7ueKiMJfKkupJTZxp93f/87S+fNh1y7Wffkl8fHxZS6pCVC7du2SZTcBli9fzqFDh/jXv/7F1q1bSUhIOOZz69aty6FDhyqpoyLVg6Z9pHKUWmKTnBz6Aa8lJZGel8fhm2/m8ccfZ//+/QwfPpy2bdty2mmnlTx94MCBXHPNNfTt25cLL7yQ73//+wwaNIivvvqKSZMmERcXV+ZynADXXXcdw4YNIy0tjYkTJ8ag8yJVT8TLOFaUiJZxlKorCktsLly4kOeee44ZM2ZEuXEi1V9lLuMoUj5aYlOkylD4S+WI4hKbGRkZGvWLREhz/lI5tMSmSJWiOX+pXOFLbJa1LSIR0Zy/VE1aYlOkSlD4i4gEkMJfRCSAFP4iIgEUlfA3sz5mttrM1prZmDIev9fMVpnZJ2Y238yaR6NeERE5NRGHv5nFAU8AfYGOwBAz61iq2EdAsrtfCPwV+E2k9YqIyKmLxsg/BVjr7uvc/QDwItA/vIC7L3D3PUWb7wG6tq6ISAxFI/ybABvCtguK7juWYcAbUahXREROUTS+4VvWB7XL/OaYmd0MJAPdjvH4SGAkQLNmzaLQNBERKUs0Rv4FQNOw7URgU+lCZtYLGAv0c/f9Ze3I3ae7e7K7Jzdu3DgKTRMRkbJEI/w/ANqYWUszqwsMBnLDC5jZxcDvCQX/tijUKSIiEYg4/N39EHAXMA/4DHjJ3Vea2QQz61dUbApwBvCymS03s9xj7E5ERCpBVK7q6e6vA6+Xum9c2O1e0ahHRESiQ9/wFREJIIW/iEgAKfxFRAJI4S8iEkAKfxGRAFL4i4gEkMJfRCSAFP4iIgGk8BcRCSCFv4hIACn8RUQCSOEvIhJACn8RkQBS+IuIBJDCX0QkgBT+IiIBpPAXEQkghb+ISAAp/EVEAkjhLyISQAp/EZEAUviLiARQVMLfzPqY2WozW2tmY8p4/Aoz+9DMDpnZwGjUKSIipy7i
8DezOOAJoC/QERhiZh1LFVsPZALPR1qfiIhErnYU9pECrHX3dQBm9iLQH1hVXMDd84seK4xCfSIiEqFoTPs0ATaEbRcU3SciIlVUNMLfyrjPT2lHZiPNLM/M8rZv3x5hs0RE5FiiEf4FQNOw7URg06nsyN2nu3uyuyc3btw4Ck0TEZGyRCP8PwDamFlLM6sLDAZyo7BfERGpIBGHv7sfAu4C5gGfAS+5+0ozm2Bm/QDMrKuZFQCDgN+b2cpI6xURkVMXjU/74O6vA6+Xum9c2O0PCE0HiYhIFaBv+IqIBJDCX0QkgBT+IiIBpPAXEQkghb+ISAAp/EVEAkjhLyISQAp/EZEAUviLiASQwl9EJIAU/iIiAaTwFxEJIIW/iEgAKfxFRAJI4S8iEkAKfxGRAFL4i4gEkMJfRCSAFP4iIgGk8BepAPn5+fTq1atC9r18+XKmTJkCwKxZs1i/fn2F1CM1W1QWcBeRytO5c2c6d+4MhMI/ISGBZs2axbhVUt1o5F/DHD58ONZNkFKmTp3KHXfcQcuWLUvu69WrF/n5+fzyl79k1qxZuDuNGzdm7ty5HD58mOTkZAAeeOABunfvTpcuXZg+fToACxcuZPjw4axatYq5c+cyatQoBg0aFJO+SfUVlZG/mfUBcoA4YIa7P1zq8dOAPwKXADuAG9w9Pxp11xQPPvggS5cu5cCBA4wdO5a8vDw2bNjA9u3bWb9+PS+++CLt27dn0aJFjBs3DjOjffv2TJs2ja+//ppBgwbRvn176tSpw8SJExkyZAj169enefPm7N+/n+zsbPr27ct7770HwIQJE2jZsiW33HJLjHteg7iD2RF3PfTQQ9StW5dp06bRunXro57So0cPXnrpJVq2bElaWhrz58+nUaNGXHLJJQCMGzeO008/nf3793PBBRcwdOjQkud27NiRPn36MHz4cNLT0yu2b1LjRDzyN7M44AmgL9ARGGJmHUsVGwbsdPfWQDbw60jrrfbcS27OnTuXnd98w6JFi5g/fz5jx47F3TnzzDPJzc3l/vvvZ8aMGbg799xzD7m5uSxcuJDvfe97vPbaa0BojvmJJ55g5syZ/PrXv+bOO+9k7ty5JdMBjRo1ok2bNuTl5eHuzJ49m4EDB8ak6zXS+PEwevR/3ld3Vr77LnOffpqHHnroqOJeVO7SSy/lH//4BwsWLOCuu+7i888/Z8GCBfTo0QOAadOmkZ6eTu/evdm2bRvbtm2rrB5JDReNaZ8UYK27r3P3A8CLQP9SZfoDzxbd/ivQ06zUEClIioJiy+bN3Hfffaz45BMWvfwyGS1acPXVV7N//3527NhRMvpr1qwZubm5zJ49m/z8fPr3709GRgaLFy+moKAAgE6dOtGgQQMAvvjiC7p27cqsWbOOmGoYOXIkM2bMYMGCBaSmpvK9732v0rteI7nDrl2Qk/OfA8DEiSTt2cPYLl24/vrr2bdvH4WFhezfv589e/bw2WefAVCnTh3OPvtsXnnlFS677DLOOussXnnlFTIyMti5cyczZ85k0aJFzJs3j4YNG5YcNIrVrVuXQ4cOxaLXUs1FY9qnCbAhbLsA+OGxyrj7ITPbDZwN/DMK9VcvYUFxHvBIdjav//jH9N65k5z/+i/IzubAwYP87//+L6WPjw0bNqRVq1bMmTOHM844A4CDBw+yceNG4uLiSsq1bt2avLw83njjDU477bSS+y+//HLuv/9+tmzZwvjx4yujt8FgBtnZods5OaEfgKZNGTBnDnVefZVBgwYxbNgwLr30Ujp37kxiYmLJ03v06MGcOXOoX78+GRkZLFu2jHPPPRd3JykpifT0dDp06MDZZ599VNXXXnst48aNo0OHDvz+97+vjN5KDWGlRxInvQOzQcBV7j68aPsWIMXdR4WVWVlUpqBo+8uiMjtK7WskMBKgWbNml3z99dcRta1KKn69R48mPyeH4UA6MCshgQJ39u3bR48ePejSpQs7duxg8eLFxMfH88UX
X/DnP/+ZLVu2cOedd3LRRRdRq1YtvvjiC9555x0GDBhAXFwcp59+Oo0bN+bLL7/k008/5YwzzqBhw4asW7eO5s2bk5iYyMqVK7nqqqsYM2YMF198MV9//TXDhw/nzTffjOUrU/25Q62wP6YLC486ByBS0cxsmbsnn6hcNEb+BUDTsO1EYNMxyhSYWW2gIfBN6R25+3RgOkBycnJkR6WqaPz40Kg/Ozv0UzxCBDJuvJHf5eTw/PPP8+GHH/I///M/dOjQgWXLllGvXj0uuugiIDRH3KVLF9566y0gNMpv0aIFl112Gddeey29e/fm4MGD1K5dm6FDh1KnTh3atWsHwObNm8nMzCQzM5NWrVrxhz/8galTp/L0008zbNiwSn85ahT30JRPuNGjQ++zDgBSBUVjzv8DoI2ZtTSzusBgILdUmVzg1qLbA4G3PdI/Oaqb8Hnhe+4J/YS55LPPwJ1mzZqxY8cO/vnPf3Luuedy5plnUqdOHbp06QJw1FRQ8cv4s5/9jNzcXG666SYee+wxrrjiCt544w1WrFjBiBEjAKhfvz6LFy/m5ptvpkePHrz//vvs2bOHV199lZ/85CeV8CLUUMXBn5MDWVmhEX9W1pHnAESqmIhH/kVz+HcB8wh91HOmu680swlAnrvnAn8A/mRmawmN+AdHWm+1Uzwv7A6PPfaf+5s2hfPPx958MxQUAwbg7iQkJLB161a+++476tWrx/Lly4HQp3Y2bdqEu7N161Y2btwIwNlnn83UqVNxd9q2bcuyZcv47//+b2688UYaNmwIQEJCAm+//XZJ1QMGDODOO+/kiiuuOOLcgJwkM4iPDwV+8Ui/+BxAfLxG/lIlReVz/u7+OvB6qfvGhd3eB+hbKGbwu98dGf5t28Jll0GjRkcERVxcHBMmTCA9PZ2WLVvSpEkTABo0aECfPn1ITU0lJSWFc889F4BHH32Uv/3tbxQWFnLllVfSoEGDE54MHDp0KImJiXz00UcV3/eabvz4Iz/nX3wAUPBLFRXxCd+Kkpyc7Hl5ebFuRnSFTw8UKx4tQqUHxdatWxkyZMgRfw2ISPVW3hO+urxDZTnRvHAle/PNN+nXrx8///nPK71uEYk9XditslSxeeErr7ySK6+8slLrFJGqQ9M+la309V/KuB6MiMip0rRPVVU66BX8IhIDCn8RkQBS+IuIBJDCX0QkgBT+IiIBpPCXmHn44YdZsWIFQJmrXIlIxdHn/CVmxowZE+smiASWRv5SKdyd2267jfT0dNLS0nj//ffJzMxkyZIlsW6aRCg/P59evXpV2P4zMjJKVqyT6NHIXypO2BfYZs+ezcEDB1iyZAnr1q1j8ODBdOxYeqlnCYrCwkJqhS18c/jw4SNWo5OKp5G/VIxSC5qv/vxz0jZsgPHjadWqFTt37oxt+ySqdu/ezU033URycjI5OTksWLCA7t27c/nll9O/f3/27dsHhM7tPPTQQ/Ts2ZNVq1bRtWtXbrnlFkaMGMHu3bu5/vrr6dmzJz169GDt2rVH1LFy5UpSU1Pp3r07ffv2jUU3axSFv0RfGQuat/v731k6fz7s2sW6L78kPj4+1q2USJS6LEx+fj5PPvkk7777Lk8//TStWrViwYIFLF68mPbt2/PSSy8BcOjQIX70ox+xYMEC6tevT35+Pk888QQzZ85k8uTJXHfddcyfP5/s7OyjzgnNmzePoUOHsmDBAl577bVK62pNpfCX6Cu+aF3xVUtr1aLfnDnEJSWRnpfHTTffzOOPPx7rVsqpKvVXHe60r1uXMx95hDp16tCpUye2bNlC79696datG7Nnz2bDhg1AaJ2KSy+9tGRXnTp1okGDBgCsWLGCnJwcMjIyyMrKYteuXUdUO3ToUNasWcNNN93ElClTKqWrNZnm/KViFB8AitYuqAU8tWLFEdcyCg+B0n/iSxUV/lcdhN7jiRP5fNMmvtu2jXoHD/Lpp58yfvx4fvnLX5Kamsr9999fstyomR2x
FGn4PH9SUhKpqaklS4oeOHDgiKpPO+00fvvb3wLQq1cvrr76ai644IKK7G2NpvCXiqEFzWum8EuR5+SUHARanHMOI3bu5IvUVG699VbOO+88hg0bRrt27WjYsGHJ6P54xo4dy+23387jjz+Ou3Pttddy3333lTz+wgsv8Mwzz2BmnHfeebRr165CuhgUuqSzRF/phWuys4/e1gGgenOHsE/rUFio97SK0CWdJXaOtXBNVpYWNK8JjvVXXRUdSErZNO0jFUMLmtdMx/urDvQeVyMKf6k4Wrim5qliy5HKqYtozt/MzgL+ArQA8oHr3f2ob++Y2VzgUmCJu19bnn1rzl+kCtNypFVWZc35jwHmu3sbYH7RdlmmALdEWJeIVBX6q67aizT8+wPPFt1+FvhxWYXcfT7wrwjrEhGRKIk0/M91980ARf+eE8nOzGykmeWZWd727dsjbJqIiBzLCU/4mtlbwHllPDQ22o1x9+nAdAjN+Ud7/yIiEnLC8Hf3Y16o28y2mtn57r7ZzM4HtkW1dSIiUiEinfbJBW4tun0rMDvC/YmICBEtknO+mWWeqFCk4f8wcKWZfQFcWbSNmSWb2YziQma2GHgZ6GlmBWZ2VYT1iohIBCL6kpe77wB6lnF/HjA8bPvySOoREQmi4kVyVq9ezS233ELDhg157bXX2LdvHwUFBTz22GNcfvnlvPPOO9x99900a9YM4Hvl2be+4SsiUlWU+rJcfn4+b7/9NvXq1aNr167ceOON/Otf/2Lu3Lnk5+czcOBA8vLyuPfee8nNzaVp06bUqlWrXOth6sJuUqM988wzfPvttyf1nNatW1dQa0SOoxyL5Lg7Xbt2BaBFixbs3r0bgG+//ZZmzZoVr5Xw7/JUp/CXGu1Y4X/48OEYtEbkGMpY+jR8kZxDRYvkmBnLli0DYP369SXrJJx55pkUFBQU7+308lSpaR+pdvLz8xk0aBBt27YtmQvNzMxkxIgR7NixA3dn+vTprF+/nuXLlzNo0CCSk5O57777GDRoEO3bt6dOnTpMnjyZzMxM9uzZw+mnn86zzz5L48aNS+o5ePAgd9xxB19++SUHDx7k0UcfJSUlhYyMDJ577jkSExOZNGkSiYmJZGZm0rp1a4YMGcJbb71Feno68fHxzJs3j0aNGjFr1qwjVrASOUI5F8lp1KgR9evX55prrmHTpk1kFz3nkUce4Uc/+hHf//73Aco3snH3KvlzySWXuEiJwsKSm1999ZUnJCT4t99+6wcOHPCLLrrIhw4d6i+88IK7uy9fvtwHDBjg7u7dunXzDRs2HPG83bt3u7t7VlaWP/vss+7u/uyzz/ro0aPd3f0HP/iBu7tPmzbNJ0+e7O7uW7Zs8bS0tKP2OXHiRH/66afd3b158+b+8ccfe2Fhobdv395feeUVd3fv37+/f/jhhxXzukjNUljoHhr3h37Cfu/d3Z9++mmfOHHicXcB5Hk5MlbTPlL1lWMudPPmzcdd/LtY+ILhq1evJi0tDYC0tDQ+//zzI8quWLGCv/zlL2RkZHDDDTeUzK+Gj+A97Kq4tWvX5sILL8TMaNKkCRdffDEAiYmJfPPNN1F5KaQGq+RFcjTtI1VbORcM79KlCyNHjjxq8e+6dety6NChkt2FLxjerl07li5dSuvWrVm6dOlRa8ImJSXRunVrRhf9hyze51lnnUVBQQGJiYksW7aMpk2bltn0Yx0kRI5SzkVyMjMzo1alwl+qtnLOhf70pz8tc/Hv6667jmHDhpGWlsawYcOO2PWYMWO49dZbmTFjBvXr1+ePf/zjEY+s2Sr5AAAFEUlEQVSPGDGCUaNG0b17dwCSk5OZMmUKd999N8OHD6dt27acdtppFf8aSM0Xg0VytIC7VA9hC4bnA8N79uStt96KaZNEoi4Ki+RoAXepOcqaC12zRguGS81TiYvkKPylais9F1pYSIusLN7asKFCT4aJ1HSa85eqTQuGi1QIzflL9aAFw0XKRXP+
UrNowXCRqFL4i4gEkMJfRCSAFP4iIgGk8BcRCSCFv4hIACn8RUQCqMp+zt/MthNajuyfsW5LlCRQc/oC6k9VVpP6AurPyWru7o1PVKjKhj+AmeWV58sK1UFN6guoP1VZTeoLqD8VRdM+IiIBpPAXEQmgqh7+02PdgCiqSX0B9acqq0l9AfWnQlTpOX8REakYVX3kLyIiFSDm4W9mfcxstZmtNbMxZTx+r5mtMrNPzGy+mTWPRTvL60T9CSs30MzczGJ+1v94ytMfM7u+6D1aaWbPV3Yby6scv2vNzGyBmX1U9Pt2dSzaWR5mNtPMtpnZp8d43MzssaK+fmJmXSq7jSejHP25qagfn5jZUjO7qLLbeDJO1J+wcl3N7LCZDaystpVw95j9AHHAl0AroC7wMdCxVJnuQP2i23cAf4llmyPtT1G5M4F3gPeA5Fi3O8L3pw3wEdCoaPucWLc7gr5MB+4out0RyI91u4/TnyuALsCnx3j8auANwIBLgX/Eus0R9ict7Hesb3XvT1GZOOBt4HVgYGW3MdYj/xRgrbuvc/cDwItA//AC7r7A3fcUbb4HJFZyG0/GCftTZCLwG2BfZTbuFJSnPyOAJ9x9J4C7b6vkNpZXefriQIOi2w2BTZXYvpPi7u8A3xynSH/gjx7yHhBvZudXTutO3on64+5Li3/HqPo5UJ73B2AU8AoQk/8zsQ7/JsCGsO2CovuOZRih0UxVdcL+mNnFQFN3n1OZDTtF5Xl/2gJtzezvZvaemfWptNadnPL0ZTxws5kVEBqNjaqcplWIk/2/VZ1U9Rw4ITNrAvwEeDJWbYj1Gr5lLcdU5sePzOxmIBnoVqEtisxx+2NmtYBsILOyGhSh8rw/tQlN/WQQGo0tNrNO7r6rgtt2ssrTlyHAM+7+iJmlAn8q6kthxTcv6sr9f6s6MbPuhMI/PdZtidDvgAfc/bDFaFW6WId/AdA0bDuRMv7UNrNewFigm7vvr6S2nYoT9edMoBOwsOgNPw/INbN+7l4VFywuz/tTALzn7geBr8xsNaGDwQeV08RyK09fhgF9ANz9XTOrR+g6LFV1Kut4yvV/qzoxswuBGUBfd98R6/ZEKBl4sSgHEoCrzeyQu8+qrAbEetrnA6CNmbU0s7rAYCA3vEDRNMnvgX5VeD652HH74+673T3B3Vu4ewtCc5dVNfihHO8PMIvQSXnMLIHQNNC6Sm1l+ZSnL+uBngBm1gGoB2yv1FZGTy7wX0Wf+rkU2O3um2PdqFNlZs2A/wfc4u5rYt2eSLl7y7Ac+CtwZ2UGP8R45O/uh8zsLmAeoTPfM919pZlNAPLcPReYApwBvFx0lFzv7v1i1ujjKGd/qo1y9mce0NvMVgGHgZ9VxVFZOftyH/CUmY0mNEWS6UUfy6hqzOwFQlNtCUXnKH4B1AFw9ycJnbO4GlgL7AGGxqal5VOO/owDzgb+b1EOHPIqcHG0YylHf2JO3/AVEQmgWE/7iIhIDCj8RUQCSOEvIhJACn8RkQBS+IuIBJDCX0QkgBT+IiIBpPAXEQmg/w8feIvh2p9EdAAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']\n", "plot_embeddings(M_reduced, word2Ind, words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cosine Similarity\n", "Now that we have word vectors, we need a way to quantify the similarity between individual words, according to these vectors. One such metric is cosine-similarity. We will be using this to find words that are \"close\" and \"far\" from one another.\n", "\n", "We can think of n-dimensional vectors as points in n-dimensional space. If we take this perspective L1 and L2 Distances help quantify the amount of space \"we must travel\" to get between these two points. Another approach is to examine the angle between two vectors. From trigonometry we know that:\n", "\n", "\n", "\n", "Instead of computing the actual angle, we can leave the similarity in terms of $similarity = cos(\\Theta)$. Formally the [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) $s$ between two vectors $p$ and $q$ is defined as:\n", "\n", "$$s = \\frac{p \\cdot q}{||p|| ||q||}, \\textrm{ where } s \\in [-1, 1] $$ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2.2: Polysemous Words (2 points) [code + written] \n", "Find a [polysemous](https://en.wikipedia.org/wiki/Polysemy) word (for example, \"leaves\" or \"scoop\") such that the top-10 most similar words (according to cosine similarity) contains related words from *both* meanings. For example, \"leaves\" has both \"vanishes\" and \"stalks\" in the top 10, and \"scoop\" has both \"handed_waffle_cone\" and \"lowdown\". You will probably need to try several polysemous words before you find one. Please state the polysemous word you discover and the multiple meanings that occur in the top 10. 
Why do you think many of the polysemous words you tried didn't work?\n", "\n", "**Note**: You should use the `wv_from_bin.most_similar(word)` function to get the top 10 similar words. This function ranks all other words in the vocabulary with respect to their cosine similarity to the given word. For further assistance please check the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.most_similar)__." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('man', 0.7664012312889099), ('girl', 0.7494640946388245), ('teenage_girl', 0.7336829900741577), ('teenager', 0.631708562374115), ('lady', 0.6288785934448242), ('teenaged_girl', 0.6141784191131592), ('mother', 0.607630729675293), ('policewoman', 0.6069462299346924), ('boy', 0.5975908041000366), ('Woman', 0.5770983099937439)]\n", "-------------\n", "[('kings', 0.7138045430183411), ('queen', 0.6510956883430481), ('monarch', 0.6413194537162781), ('crown_prince', 0.6204219460487366), ('prince', 0.6159993410110474), ('sultan', 0.5864823460578918), ('ruler', 0.5797567367553711), ('princes', 0.5646552443504333), ('Prince_Paras', 0.5432944297790527), ('throne', 0.5422105193138123)]\n", "-------------\n", "[('queues', 0.7822983860969543), ('queuing', 0.7479887008666992), ('queued', 0.6569256782531738), ('Queues', 0.6135598421096802), ('snaking_queue', 0.6109408140182495), ('snaking_queues', 0.5706708431243896), ('serpentine_queues', 0.5586966276168823), ('queing', 0.5505384802818298), ('serpentine_queue', 0.5351817011833191), ('Queue', 0.5160154700279236)]\n", "-------------\n", "[('man', 0.5752460956573486), ('queen', 0.5433590412139893), ('prince', 0.5224438905715942), ('monarch', 0.5148977041244507), ('princess', 0.5129660367965698), ('lady', 0.5018671154975891), ('girl', 0.5011837482452393), ('wellwisher', 0.5011758804321289), ('deposed_monarch', 
0.4940492510795593), ('befits_newly_minted', 0.48958051204681396)]\n" ] } ], "source": [ "# ------------------\n", "# Write your polysemous word exploration code here.\n", "\n", "print(wv_from_bin.most_similar([\"woman\"]))\n", "print('-------------')\n", "print(wv_from_bin.most_similar(['king']))\n", "print('-------------')\n", "print(wv_from_bin.most_similar(['queue']))\n", "print('-------------')\n", "print(wv_from_bin.most_similar(['woman', 'king', 'queue']))\n", "\n", "# ------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2.3: Synonyms & Antonyms (2 points) [code + written] \n", "\n", "When considering Cosine Similarity, it's often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.\n", "\n", "Find three words (w1,w2,w3) where w1 and w2 are synonyms and w1 and w3 are antonyms, but Cosine Distance(w1,w3) < Cosine Distance(w1,w2). For example, w1=\"happy\" is closer to w3=\"sad\" than to w2=\"cheerful\". \n", "\n", "Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.\n", "\n", "You should use the `wv_from_bin.distance(w1, w2)` function here in order to compute the cosine distance between two words. Please see the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.distance)__ for further assistance."
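, "\n", "\n", "As a small worked example of the relationship, using two made-up 2-dimensional vectors $p = (1, 0)$ and $q = (1, 1)$ (illustrative values, not actual word vectors):\n", "\n", "$$s = \\frac{p \\cdot q}{||p|| \\, ||q||} = \\frac{1}{1 \\cdot \\sqrt{2}} \\approx 0.707, \\qquad d = 1 - s \\approx 0.293$$\n", "\n", "Since $s \\in [-1, 1]$, the cosine distance $d$ lies in $[0, 2]$, with smaller values meaning more similar directions."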
] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Synonyms fat, large have cosine distance: 0.8237101936789306\n", "Antonyms fat, skinny have cosine distance: 0.49759130423428566\n" ] } ], "source": [ "# ------------------\n", "# Write your synonym & antonym exploration code here.\n", "\n", "w1 = \"fat\"\n", "w2 = \"large\"\n", "w3 = \"skinny\"\n", "w1_w2_dist = wv_from_bin.distance(w1, w2)\n", "w1_w3_dist = wv_from_bin.distance(w1, w3)\n", "\n", "print(\"Synonyms {}, {} have cosine distance: {}\".format(w1, w2, w1_w2_dist))\n", "print(\"Antonyms {}, {} have cosine distance: {}\".format(w1, w3, w1_w3_dist))\n", "\n", "# ------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This counter-intuitive result likely arises because word2vec places words close together when they appear in similar contexts. \"fat\" and \"skinny\" are both typically used to describe body size, so they tend to share neighboring words, whereas \"large\" is used in a much wider range of contexts and may rarely occur in the same neighborhood as \"skinny\". As a result, the cosine distance between \"fat\" and \"skinny\" (about 0.50) is smaller than the cosine distance between \"fat\" and \"large\" (about 0.82)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solving Analogies with Word Vectors\n", "Word2Vec vectors have been shown to *sometimes* exhibit the ability to solve analogies. \n", "\n", "As an example, for the analogy \"man : king :: woman : x\", what is x?\n", "\n", "In the cell below, we show you how to use word vectors to find x. The `most_similar` function finds words that are most similar to the words in the `positive` list and most dissimilar from the words in the `negative` list. 
The answer to the analogy will be the word ranked most similar (largest numerical value).\n", "\n", "**Note:** Further Documentation on the `most_similar` function can be found within the __[GenSim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.FastTextKeyedVectors.most_similar)__." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('queen', 0.7118192911148071),\n", " ('monarch', 0.6189674139022827),\n", " ('princess', 0.5902431607246399),\n", " ('crown_prince', 0.5499460697174072),\n", " ('prince', 0.5377321243286133),\n", " ('kings', 0.5236844420433044),\n", " ('Queen_Consort', 0.5235945582389832),\n", " ('queens', 0.5181134343147278),\n", " ('sultan', 0.5098593235015869),\n", " ('monarchy', 0.5087411999702454)]\n" ] } ], "source": [ "# Run this cell to answer the analogy -- man : king :: woman : x\n", "pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2.4: Finding Analogies [code + written] (2 Points)\n", "Find an example of analogy that holds according to these vectors (i.e. the intended word is ranked top). In your solution please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain why the analogy holds in one or two sentences.\n", "\n", "**Note**: You may have to try many analogies to find one that works!" 
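, "\n", "\n", "One hedged way to read these queries: for an analogy x:y :: a:b, calling `most_similar(positive=[a, y], negative=[x])` roughly searches for the word whose vector is closest in cosine similarity to $v_y - v_x + v_a$, where $v_w$ (notation introduced here) is the length-normalized vector of word $w$:\n", "\n", "$$b = \\arg\\max_{w \\notin \\{x, y, a\\}} \\cos(v_w, \\; v_y - v_x + v_a)$$\n", "\n", "For the example man : king :: woman : queen above, this is $v_{king} - v_{man} + v_{woman}$, whose nearest neighbor is `queen`."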
] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('Texas', 0.5915002226829529),\n", " ('Denton', 0.4873083531856537),\n", " ('Dripping_Springs', 0.48517969250679016),\n", " ('Lampasas', 0.48042112588882446),\n", " ('New_Braunfels', 0.477350115776062),\n", " ('Tarleton', 0.4719838500022888),\n", " ('Lubbock', 0.46612784266471863),\n", " ('Spicewood', 0.46407443284988403),\n", " ('Nacogdoches', 0.463761568069458),\n", " ('Austin_Ladyjack', 0.460454523563385)]\n" ] } ], "source": [ "# ------------------\n", "# Write your analogy exploration code here.\n", "\n", "pprint.pprint(wv_from_bin.most_similar(positive=['Austin', 'Georgia'], negative=['Atlanta']))\n", "\n", "# ------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Atlanta : Georgia :: Austin : Texas\n", "\n", "The analogy holds: `Texas` is ranked first. Atlanta is the capital of Georgia, and Austin is the capital of Texas." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2.5: Incorrect Analogy [code + written] (1 point)\n", "Find an example of analogy that does *not* hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the (incorrect) value of b according to the word vectors."
] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('migratory_birds', 0.583236038684845),\n", " ('migrating_birds', 0.5757449269294739),\n", " ('geese', 0.5467339754104614),\n", " ('raptors', 0.5410414934158325),\n", " ('bird_species', 0.536007285118103),\n", " ('pelicans', 0.5209925174713135),\n", " ('migratory_waterfowl', 0.5119693875312805),\n", " ('crows', 0.5089749693870544),\n", " ('seabirds', 0.5001636743545532),\n", " ('animals', 0.5000354051589966)]\n" ] } ], "source": [ "# ------------------\n", "# Write your incorrect analogy exploration code here.\n", "\n", "pprint.pprint(wv_from_bin.most_similar(positive=[\"people\", \"birds\"], negative=[\"house\"]))\n", "\n", "# ------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Intended analogy: person : house :: bird : tree\n", "\n", "The intended answer is \"tree\", but the top-ranked word returned by the vectors is `migratory_birds`. Note that the query above passes `people` and `birds` as `positive` and `house` as `negative`, so strictly it evaluates house : people :: birds : x rather than the intended analogy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2.6: Guided Analysis of Bias in Word Vectors [written] (1 point)\n", "\n", "It's important to be cognizant of the biases (gender, race, sexual orientation, etc.) implicit to our word embeddings.\n", "\n", "Run the cell below, to examine (a) which terms are most similar to \"woman\" and \"boss\" and most dissimilar to \"man\", and (b) which terms are most similar to \"man\" and \"boss\" and most dissimilar to \"woman\". What do you find in the top 10?"
] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('bosses', 0.5522644519805908),\n", " ('manageress', 0.49151360988616943),\n", " ('exec', 0.45940813422203064),\n", " ('Manageress', 0.45598435401916504),\n", " ('receptionist', 0.4474116563796997),\n", " ('Jane_Danson', 0.44480544328689575),\n", " ('Fiz_Jennie_McAlpine', 0.44275766611099243),\n", " ('Coronation_Street_actress', 0.44275566935539246),\n", " ('supremo', 0.4409853219985962),\n", " ('coworker', 0.43986251950263977)]\n", "\n", "[('supremo', 0.6097398400306702),\n", " ('MOTHERWELL_boss', 0.5489562153816223),\n", " ('CARETAKER_boss', 0.5375303626060486),\n", " ('Bully_Wee_boss', 0.5333974361419678),\n", " ('YEOVIL_Town_boss', 0.5321705341339111),\n", " ('head_honcho', 0.5281980037689209),\n", " ('manager_Stan_Ternent', 0.525971531867981),\n", " ('Viv_Busby', 0.5256162881851196),\n", " ('striker_Gabby_Agbonlahor', 0.5250812768936157),\n", " ('BARNSLEY_boss', 0.5238943099975586)]\n" ] } ], "source": [ "# Run this cell\n", "# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be\n", "# most dissimilar from.\n", "pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'boss'], negative=['man']))\n", "print()\n", "pprint.pprint(wv_from_bin.most_similar(positive=['man', 'boss'], negative=['woman']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2.7: Independent Analysis of Bias in Word Vectors [code + written]\n", "Use the `most_similar` function to find another case where some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover." 
] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('nurse', 0.7127889394760132),\n", " ('doctors', 0.6593285799026489),\n", " ('gynecologist', 0.6454397439956665),\n", " ('physician', 0.6408007144927979),\n", " ('nurse_practitioner', 0.6387196779251099),\n", " ('pediatrician', 0.609344482421875),\n", " ('midwife', 0.5823134183883667),\n", " ('pharmacist', 0.5700446367263794),\n", " ('oncologist', 0.5668959617614746),\n", " ('obstetrician', 0.5636125206947327)]\n", "\n", "[('physician', 0.6708430051803589),\n", " ('surgeon', 0.6096619367599487),\n", " ('doctors', 0.5943056344985962),\n", " ('orthopedic_surgeon', 0.5938172340393066),\n", " ('dentist', 0.5839548707008362),\n", " ('urologist', 0.5734840631484985),\n", " ('cardiologist', 0.5721563100814819),\n", " ('orthopedist', 0.565475344657898),\n", " ('ophthalmologist', 0.5647045373916626),\n", " ('neurologist', 0.551352858543396)]\n" ] } ], "source": [ "# ------------------\n", "# Write your bias exploration code here.\n", "\n", "pprint.pprint(wv_from_bin.most_similar(positive=['mother','doctor'], negative=['father']))\n", "print()\n", "pprint.pprint(wv_from_bin.most_similar(positive=['father','doctor'], negative=['mother']))\n", "\n", "# ------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The vectors show a gender bias in medical occupations. Combining \"mother\" with \"doctor\" (and moving away from \"father\") ranks \"nurse\" first, with \"gynecologist\" and \"midwife\" also in the top 10, whereas combining \"father\" with \"doctor\" (and moving away from \"mother\") ranks \"physician\" and \"surgeon\" first."
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "position": { "height": "145px", "left": "883px", "right": "20px", "top": "120px", "width": "377px" }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }