{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Vector-space models: designs, distances, basic reweighting" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "__author__ = \"Christopher Potts\"\n", "__version__ = \"CS224u, Stanford, Spring 2022\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Contents\n", "\n", "1. [Overview](#Overview)\n", " 1. [Motivation](#Motivation)\n", " 1. [Terminological notes](#Terminological-notes)\n", "1. [Set-up](#Set-up)\n", "1. [Matrix designs](#Matrix-designs)\n", "1. [Pre-computed example matrices](#Pre-computed-example-matrices)\n", "1. [Vector comparison](#Vector-comparison)\n", " 1. [Euclidean](#Euclidean)\n", " 1. [Length normalization](#Length-normalization)\n", " 1. [Cosine distance](#Cosine-distance)\n", " 1. [Cosine distance that's really a distance metric](#Cosine-distance-that's-really-a-distance-metric)\n", " 1. [Matching-based methods](#Matching-based-methods)\n", " 1. [Summary](#Summary)\n", "1. [Distributional neighbors](#Distributional-neighbors)\n", "1. [Matrix reweighting](#Matrix-reweighting)\n", " 1. [Normalization](#Normalization)\n", " 1. [Observed/Expected](#Observed/Expected)\n", " 1. [Pointwise Mutual Information](#Pointwise-Mutual-Information)\n", " 1. [TF-IDF](#TF-IDF)\n", "1. [Subword information](#Subword-information)\n", "1. [Visualization](#Visualization)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Overview\n", "\n", "This notebook is the first in our series about creating effective __distributed representations__. The focus is on matrix designs, assessing similarity, and methods for matrix reweighting.\n", "\n", "The central idea (which takes some getting used to!) is that we can represent words and phrases as dense vectors of real numbers. These take on meaning by being __embedded__ in a larger matrix of representations with comparable structure." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Motivation\n", "\n", "Why build distributed representations? There are potentially many reasons. The two we will emphasize in this course:\n", "\n", "1. __Understanding words in context__: There is value to linguists in seeing what these data-rich approaches can teach us about natural language lexicons, and there is value for social scientists in understanding how words are being used.\n", "\n", "1. __Feature representations for other models__: As we will see, many models can benefit from representing examples as distributed representations." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Terminological notes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The distributed representations we build will always be vectors of real numbers. The models are often called __vector space models__ (VSMs).\n", "\n", "* __Distributional representations__ are the special case where the data come entirely from co-occurrence counts in corpora. \n", "\n", "* We'll look at models that use supervised labels to obtain vector-based word representations. 
These aren't purely distributional, in that they take advantage of more than just co-occurrence patterns among items in the vocabulary, but they share the idea that words can be modeled with vectors.\n", "\n", "* If a neural network is used to train the representations, then they might be called __neural representations__.\n", "\n", "* The term __word embedding__ is also used for distributed representations, including distributional ones. This term is a reminder that vector representations are meaningful only when embedded in and compared with others in a unified space (usually a matrix) of representations of the same type.\n", "\n", "* In any case, __distributed representation__ seems like the most general cover term for what we're trying to achieve, and its only downside is that sometimes people think it has something to do with distributed databases." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Set-up\n", "\n", "* Make sure your environment meets all the requirements for [the cs224u repository](https://github.com/cgpotts/cs224u/). For help getting set up, see [setup.ipynb](setup.ipynb).\n", "\n", "* Download [the course data](http://web.stanford.edu/class/cs224u/data/data.tgz), unpack it, and place it in the directory containing the course repository – the same directory as this notebook. (If you want to put it somewhere else, change `DATA_HOME` below.)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import os\n", "import pandas as pd\n", "\n", "import vsm\n", "import utils" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Set all the random seeds for reproducibility:\n", "\n", "utils.fix_random_seeds()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "DATA_HOME = os.path.join('data', 'vsmdata')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Matrix designs\n", "\n", "There are many, many ways to define distributional matrices. Here's a schematic overview that highlights the major decisions for building a word $\\times$ word matrix:\n", "\n", "1. Define a notion of __co-occurrence context__. This could be an entire document, a paragraph, a sentence, a clause, an NP — whatever domain seems likely to capture the associations you care about.\n", "\n", "1. Define a __count scaling method__. The simplest method just counts everything in the context window, giving equal weight to everything inside it. A common alternative is to scale the weights based on proximity to the target word – e.g., $1/d$, where $d$ is the distance in tokens from the target.\n", "\n", "1. Scan through your corpus building a dictionary $d$ mapping word-pairs to co-occurrence values. Every time a pair of words $w$ and $w'$ occurs in the same context (as you defined it in 1), increment $d[(w, w')]$ by whatever value is determined by your weighting scheme. You'd increment by $1$ with the weighting scheme that simply counts co-occurrences.\n", "\n", "1. Using the count dictionary $d$ that you collected in 3, establish your full vocabulary $V$, an ordered list of word types. \n", " 1. For large collections of documents, $|V|$ will typically be huge. You will probably want to winnow (narrow down) the vocabulary at this point. \n", " 1. You might do this by filtering to a specific subset, or just imposing a minimum count threshold. \n", " 1. 
You might impose a minimum count threshold even if $|V|$ is small — for words with very low counts, you simply don't have enough evidence to support good representations.\n", " 1. For words outside the vocabulary you choose, you could ignore them entirely or accumulate all their values into a designated _UNK_ vector.\n", "\n", "1. Now build a matrix $M$ of dimension $|V| \\times |V|$. Both the rows and the columns of $M$ represent words. Each cell $M[i, j]$ is filled with the value $d[(w_{i}, w_{j})]$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pre-computed example matrices\n", "\n", "The data distribution includes four matrices that we'll use for hands-on exploration. All of them were designed in the same basic way:\n", "\n", "* They are word $\\times$ word matrices with 5K rows and 5K columns. \n", "\n", "* The vocabulary is the top 5K most frequent unigrams.\n", "\n", "Two come from Yelp user-supplied reviews of products and services, and two come from Gigaword, a collection of newswire and newspaper texts. Further details:\n", "\n", "|filename | source | window size| count weighting |\n", "|---------|--------|------------|-----------------|\n", "|yelp_window5-scaled.csv.gz | Yelp reviews | 5| 1/d |\n", "|yelp_window20-flat.csv.gz | Yelp reviews | 20| 1 |\n", "|giga_window5-scaled.csv.gz | Gigaword | 5 | 1/d |\n", "|giga_window20-flat.csv.gz | Gigaword | 20 | 1 |\n", "\n", "Any hunches about how these matrices might differ from each other?" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "yelp5 = pd.read_csv(\n", " os.path.join(DATA_HOME, 'yelp_window5-scaled.csv.gz'), index_col=0)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "yelp20 = pd.read_csv(\n", " os.path.join(DATA_HOME, 'yelp_window20-flat.csv.gz'), index_col=0)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "giga5 = pd.read_csv(\n", " os.path.join(DATA_HOME, 'giga_window5-scaled.csv.gz'), index_col=0)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "giga20 = pd.read_csv(\n", " os.path.join(DATA_HOME, 'giga_window20-flat.csv.gz'), index_col=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These count matrices all have the same vocabulary/index, which you can extract from their indices:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['):', ');', '..', '...', ':(', ':)', ':/', ':D', ':|', ';p', '___', 'abandon']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(yelp5.index[: 12])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Vector comparison\n", "\n", "Vector comparisons form the heart of our analyses in this context. \n", "\n", "* For the most part, we are interested in measuring the __distance__ between vectors. The guiding idea is that semantically related words should be close together in the vector spaces we build, and semantically unrelated words should be far apart.\n", "\n", "* The [scipy.spatial.distance](http://docs.scipy.org/doc/scipy-0.14.0/reference/spatial.distance.html) module has a lot of vector comparison methods, so you might check them out if you want to go beyond the functions defined and explored here. 
Read the documentation closely, though: many of those methods are defined only for binary vectors, whereas the VSMs we'll use allow all float values." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Euclidean\n", "\n", "The most basic and intuitive distance measure between vectors is __euclidean distance__. The euclidean distance between two vectors $u$ and $v$ of dimension $n$ is\n", "\n", "$$\\textbf{euclidean}(u, v) = \n", "\\sqrt{\\sum_{i=1}^{n}|u_{i} - v_{i}|^{2}}$$\n", "\n", "In two dimensions, this corresponds to the length of the most direct line between the two points.\n", "\n", "In `vsm.py`, the function `euclidean` just uses the corresponding [scipy.spatial.distance](https://docs.scipy.org/doc/scipy/reference/spatial.distance.html) method to define it." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Here's the tiny vector space from the screencast on vector comparisons associated with this notebook:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "ABC = pd.DataFrame([\n", " [ 2.0, 4.0],\n", " [10.0, 15.0],\n", " [14.0, 10.0]],\n", " index=['A', 'B', 'C'],\n", " columns=['x', 'y'])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | x | \n", "y | \n", "
---|---|---|
A | \n", "2.0 | \n", "4.0 | \n", "
B | \n", "10.0 | \n", "15.0 | \n", "
C | \n", "14.0 | \n", "10.0 | \n", "
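To make the euclidean definition above concrete, here is a minimal sketch (not one of the notebook's original cells) that computes the pairwise distances for the `ABC` rows. It calls `scipy.spatial.distance.euclidean` directly; the course helper `vsm.euclidean` described above is expected to give the same values.

```python
from itertools import combinations

import pandas as pd
from scipy.spatial.distance import euclidean

# Recreate the tiny example space from above so the sketch is self-contained:
ABC = pd.DataFrame([
    [ 2.0,  4.0],
    [10.0, 15.0],
    [14.0, 10.0]],
    index=['A', 'B', 'C'],
    columns=['x', 'y'])

# Pairwise euclidean distances between the row vectors:
for w1, w2 in combinations(ABC.index, 2):
    dist = euclidean(ABC.loc[w1], ABC.loc[w2])
    print(f"euclidean({w1}, {w2}) = {dist:.2f}")
```

On these raw counts, B and C come out as the closest pair ($\sqrt{41} \approx 6.40$), even though A and B point in more similar directions; this is part of the motivation for the length normalization and cosine measures discussed next.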