{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Vector-space models: dimensionality reduction" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "__author__ = \"Christopher Potts\"\n", "__version__ = \"CS224u, Stanford, Spring 2018 term\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Contents\n", "\n", "0. [Overview](#Overview)\n", "0. [Set-up](#Set-up)\n", "0. [Latent Semantic Analysis](#Latent-Semantic-Analysis)\n", " 0. [Overview of the LSA method](#Overview-of-the-LSA-method)\n", " 0. [Motivating example for LSA](#Motivating-example-for-LSA)\n", " 0. [Applying LSA to real VSMs](#Applying-LSA-to-real-VSMs)\n", " 0. [Other resources for matrix factorization](#Other-resources-for-matrix-factorization)\n", "0. [GloVe](#GloVe)\n", " 0. [Overview of the GloVe method](#Overview-of-the-GloVe-method)\n", " 0. [GloVe implementation notes](#GloVe-implementation-notes)\n", " 0. [Applying GloVe to our motivating example](#Applying-GloVe-to-our-motivating-example)\n", " 0. [Testing the GloVe implementation](#Testing-the-GloVe-implementation)\n", " 0. [Applying GloVe to real VSMs](#Applying-GloVe-to-real-VSMs)\n", "0. [Autoencoders](#Autoencoders)\n", " 0. [Overview of the autoencoder method](#Overview-of-the-autoencoder-method)\n", " 0. [Testing the autoencoder implementation](#Testing-the-autoencoder-implementation)\n", " 0. [Applying autoencoders to real VSMs](#Applying-autoencoders-to-real-VSMs)\n", "0. [word2vec](#word2vec)\n", " 0. [Training data](#Training-data)\n", " 0. [Basic skip-gram](#Basic-skip-gram)\n", " 0. [Skip-gram with noise contrastive estimation ](#Skip-gram-with-noise-contrastive-estimation-)\n", " 0. [word2vec resources](#word2vec-resources)\n", "0. [Other methods](#Other-methods)\n", "0. [Exploratory exercises](#Exploratory-exercises)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Overview\n", "\n", "The matrix weighting schemes reviewed in the first notebook for this unit deliver solid results. However, they are not capable of capturing higher-order associations in the data. \n", "\n", "With dimensionality reduction, the goal is to eliminate correlations in the input VSM and capture such higher-order notions of co-occurrence, thereby improving the overall space.\n", "\n", "As a motivating example, consider the adjectives _gnarly_ and _wicked_ used as slang positive adjectives. Since both are positive, we expect them to be similar in a good VSM. However, at least stereotypically, _gnarly_ is Californian and _wicked_ is Bostonian. Thus, they are unlikely to occur often in the same texts, and so the methods we've reviewed so far will not be able to model their similarity. \n", "\n", "Dimensionality reduction techniques are often capable of capturing such semantic similarities (and have the added advantage of shrinking the size of our data structures)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Set-up\n", "\n", "* Make sure your environment meets all the requirements for [the cs224u repository](https://github.com/cgpotts/cs224u/). 
For help getting set up, see [setup.ipynb](setup.ipynb).\n", "\n", "* Make sure you've downloaded [the data distribution for this unit](http://web.stanford.edu/class/cs224u/data/vsmdata.zip), unpacked it, and placed it in the current directory (or wherever you point `data_home` to below)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Applications/anaconda/envs/nlu/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", " from ._conv import register_converters as _register_converters\n" ] } ], "source": [ "from mittens import GloVe\n", "import numpy as np\n", "import os\n", "import pandas as pd\n", "import scipy.stats\n", "from tf_autoencoder import TfAutoencoder\n", "import utils\n", "import vsm" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "data_home = 'vsmdata'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "imdb5 = pd.read_csv(\n", "    os.path.join(data_home, 'imdb_window5-scaled.csv.gz'), index_col=0)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "imdb20 = pd.read_csv(\n", "    os.path.join(data_home, 'imdb_window20-flat.csv.gz'), index_col=0)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "giga5 = pd.read_csv(\n", "    os.path.join(data_home, 'gigaword_window5-scaled.csv.gz'), index_col=0)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "giga20 = pd.read_csv(\n", "    os.path.join(data_home, 'gigaword_window20-flat.csv.gz'), index_col=0)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Latent Semantic Analysis\n", "\n", "Latent Semantic Analysis (LSA) is a prominent dimensionality reduction technique. It is an application of __truncated singular value decomposition__ (SVD) and so uses only techniques from linear algebra (no machine learning needed)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Overview of the LSA method\n", "\n", "The central mathematical result is that, for any matrix of real numbers $X$ of dimension $m \\times n$, there is a factorization of $X$ into matrices $T$, $S$, and $D$ such that\n", "\n", "$$X_{m \\times n} = T_{m \\times m}S_{m \\times m}D_{n \\times m}^{\\top}$$\n", "\n", "The matrices $T$ and $D$ are __orthonormal__: their columns are length-normalized and orthogonal to one another (that is, each pair of distinct columns has cosine distance of $1$). The singular-value matrix $S$ is diagonal, with its values sorted from largest to smallest, so that the first dimension corresponds to the greatest source of variability in the data, followed by the second, and so on.\n", "\n", "Of course, we don't want to factorize and rebuild the original matrix, as that wouldn't get us anywhere. The __truncation__ part means that we include only the top $k$ dimensions of $S$. Given our row-oriented perspective on these matrices, this means using\n", "\n", "$$T[1{:}m, 1{:}k]S[1{:}k, 1{:}k]$$\n", "\n", "which gives us a version of $T$ that includes only the top $k$ dimensions of variation. 
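\n", "\n", "As a minimal sketch of this computation, assuming only `numpy` (the helper name `lsa_sketch` and the default `k=2` are illustrative, not part of the course code):\n", "\n", "```python\n", "import numpy as np\n", "\n", "def lsa_sketch(X, k=2):\n", "    # Full SVD: X = T diag(s) D^T, with the singular values in s\n", "    # already sorted from largest to smallest.\n", "    T, s, Dt = np.linalg.svd(X, full_matrices=False)\n", "    # Truncation: keep the top-k columns of T, each scaled by its\n", "    # singular value, i.e., T[1:m, 1:k]S[1:k, 1:k] from above.\n", "    return T[:, :k] * s[:k]\n", "```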
\n", "\n", "To build up intuitions, imagine that everyone on the Stanford campus is associated with a 3d point representing their position: $x$ is east–west, $y$ is north–south, and $z$ is zenith–nadir. Since the campus is spread out and has relatively few deep basements and tall buildings, the top two dimensions of variation will be $x$ and $y$, and the 2d truncated SVD of this space will leave $z$ out. This will, for example, capture the sense in which someone at the top of Hoover Tower is close to someone at its base." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Motivating example for LSA\n", "\n", "We can also return to our original motivating example of _wicked_ and _gnarly_. Here is a matrix reflecting those assumptions:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | 0 | \n", "1 | \n", "2 | \n", "3 | \n", "4 | \n", "5 | \n", "
---|---|---|---|---|---|---|
gnarly | \n", "1.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "
wicked | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "
awesome | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "
lame | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "1.0 | \n", "
terrible | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "1.0 | \n", "