{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Vector-space models: dimensionality reduction" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "__author__ = \"Christopher Potts\"\n", "__version__ = \"CS224u, Stanford, Spring 2018 term\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Contents\n", "\n", "0. [Overview](#Overview)\n", "0. [Set-up](#Set-up)\n", "0. [Latent Semantic Analysis](#Latent-Semantic-Analysis)\n", " 0. [Overview of the LSA method](#Overview-of-the-LSA-method)\n", " 0. [Motivating example for LSA](#Motivating-example-for-LSA)\n", " 0. [Applying LSA to real VSMs](#Applying-LSA-to-real-VSMs)\n", " 0. [Other resources for matrix factorization](#Other-resources-for-matrix-factorization)\n", "0. [GloVe](#GloVe)\n", " 0. [Overview of the GloVe method](#Overview-of-the-GloVe-method)\n", " 0. [GloVe implementation notes](#GloVe-implementation-notes)\n", " 0. [Applying GloVe to our motivating example](#Applying-GloVe-to-our-motivating-example)\n", " 0. [Testing the GloVe implementation](#Testing-the-GloVe-implementation)\n", " 0. [Applying GloVe to real VSMs](#Applying-GloVe-to-real-VSMs)\n", "0. [Autoencoders](#Autoencoders)\n", " 0. [Overview of the autoencoder method](#Overview-of-the-autoencoder-method)\n", " 0. [Testing the autoencoder implementation](#Testing-the-autoencoder-implementation)\n", " 0. [Applying autoencoders to real VSMs](#Applying-autoencoders-to-real-VSMs)\n", "0. [word2vec](#word2vec)\n", " 0. [Training data](#Training-data)\n", " 0. [Basic skip-gram](#Basic-skip-gram)\n", " 0. [Skip-gram with noise contrastive estimation ](#Skip-gram-with-noise-contrastive-estimation-)\n", " 0. [word2vec resources](#word2vec-resources)\n", "0. [Other methods](#Other-methods)\n", "0. [Exploratory exercises](#Exploratory-exercises)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Overview\n", "\n", "The matrix weighting schemes reviewed in the first notebook for this unit deliver solid results. However, they are not capable of capturing higher-order associations in the data. \n", "\n", "With dimensionality reduction, the goal is to eliminate correlations in the input VSM and capture such higher-order notions of co-occurrence, thereby improving the overall space.\n", "\n", "As a motivating example, consider the adjectives _gnarly_ and _wicked_ used as slang positive adjectives. Since both are positive, we expect them to be similar in a good VSM. However, at least stereotypically, _gnarly_ is Californian and _wicked_ is Bostonian. Thus, they are unlikely to occur often in the same texts, and so the methods we've reviewed so far will not be able to model their similarity. \n", "\n", "Dimensionality reduction techniques are often capable of capturing such semantic similarities (and have the added advantage of shrinking the size of our data structures)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Set-up\n", "\n", "* Make sure your environment meets all the requirements for [the cs224u repository](https://github.com/cgpotts/cs224u/). 
For help getting set-up, see [setup.ipynb](setup.ipynb).\n", "\n", "* Make sure you've downloaded [the data distribution for this unit](http://web.stanford.edu/class/cs224u/data/vsmdata.zip), unpacked it, and placed it in the current directory (or wherever you point `data_home` to below)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Applications/anaconda/envs/nlu/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", " from ._conv import register_converters as _register_converters\n" ] } ], "source": [ "from mittens import GloVe\n", "import numpy as np\n", "import os\n", "import pandas as pd\n", "import scipy.stats\n", "from tf_autoencoder import TfAutoencoder\n", "import utils\n", "import vsm" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "data_home = 'vsmdata'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "imdb5 = pd.read_csv(\n", " os.path.join(data_home, 'imdb_window5-scaled.csv.gz'), index_col=0)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "imdb20 = pd.read_csv(\n", " os.path.join(data_home, 'imdb_window20-flat.csv.gz'), index_col=0)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "giga5 = pd.read_csv(\n", " os.path.join(data_home, 'gigaword_window5-scaled.csv.gz'), index_col=0)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "giga20 = pd.read_csv(\n", " os.path.join(data_home, 'gigaword_window20-flat.csv.gz'), index_col=0)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Latent Semantic Analysis\n", "\n", "Latent Semantic Analysis (LSA) is a prominent dimensionality reduction technique. It is an application of __truncated singular value decomposition__ (SVD) and so uses only techniques from linear algebra (no machine learning needed)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Overview of the LSA method\n", "\n", "The central mathematical result is that, for any matrix of real numbers $X$ of dimension $m \\times n$, there is a factorization of $X$ into matrices $T$, $S$, and $D$ such that\n", "\n", "$$X_{m \\times n} = T_{m \\times m}S_{m\\times m}D_{n \\times m}^{\\top}$$\n", "\n", "The matrices $T$ and $D$ are __orthonormal__ – their columns are length-normalized and orthogonal to one another (that is, they each have cosine distance of $1$ from each other). The singular-value matrix $S$ is a diagonal matrix arranged by size, so that the first dimension corresponds to the greatest source of variability in the data, followed by the second, and so on.\n", "\n", "Of course, we don't want to factorize and rebuild the original matrix, as that wouldn't get us anywhere. The __truncation__ part means that we include only the top $k$ dimensions of $S$. Given our row-oriented perspective on these matrices, this means using\n", "\n", "$$T[1{:}m, 1{:}k]S[1{:}k, 1{:}k]$$\n", "\n", "which gives us a version of $T$ that includes only the top $k$ dimensions of variation. 
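\n",
"\n",
"As a quick illustration (a sketch, not the `vsm.lsa` implementation itself), the truncation can be computed directly with NumPy, assuming `X` is a dense count matrix stored as a NumPy array and `lsa_sketch` is just a hypothetical helper name:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def lsa_sketch(X, k=100):\n",
"    # Full SVD: X = T @ diag(S) @ D.T; NumPy returns D.T directly as Vt.\n",
"    T, S, Vt = np.linalg.svd(X, full_matrices=False)\n",
"    # Keep only the top-k singular dimensions, rescaling T by them:\n",
"    return T[:, :k] * S[:k]\n",
"```\n",
"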
\n", "\n", "To build up intuitions, imagine that everyone on the Stanford campus is associated with a 3d point representing their position: $x$ is east–west, $y$ is north–south, and $z$ is zenith–nadir. Since the campus is spread out and has relatively few deep basements and tall buildings, the top two dimensions of variation will be $x$ and $y$, and the 2d truncated SVD of this space will leave $z$ out. This will, for example, capture the sense in which someone at the top of Hoover Tower is close to someone at its base." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Motivating example for LSA\n", "\n", "We can also return to our original motivating example of _wicked_ and _gnarly_. Here is a matrix reflecting those assumptions:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012345
gnarly1.00.01.00.00.00.0
wicked0.01.00.01.00.00.0
awesome1.01.01.01.00.00.0
lame0.00.00.00.01.01.0
terrible0.00.00.00.00.01.0
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5\n", "gnarly 1.0 0.0 1.0 0.0 0.0 0.0\n", "wicked 0.0 1.0 0.0 1.0 0.0 0.0\n", "awesome 1.0 1.0 1.0 1.0 0.0 0.0\n", "lame 0.0 0.0 0.0 0.0 1.0 1.0\n", "terrible 0.0 0.0 0.0 0.0 0.0 1.0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gnarly_df = pd.DataFrame(\n", " np.array([\n", " [1,0,1,0,0,0],\n", " [0,1,0,1,0,0],\n", " [1,1,1,1,0,0],\n", " [0,0,0,0,1,1],\n", " [0,0,0,0,0,1]], dtype='float64'),\n", " index=['gnarly', 'wicked', 'awesome', 'lame', 'terrible'])\n", "\n", "gnarly_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "No column context includes both _gnarly_ and _wicked_ together so our count matrix places them far apart:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gnarly 0.000000\n", "awesome 0.292893\n", "wicked 1.000000\n", "lame 1.000000\n", "terrible 1.000000\n", "dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vsm.neighbors('gnarly', gnarly_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reweighting doesn't help. For example, here is the attempt with Positive PMI:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gnarly 0.000000\n", "awesome 0.292893\n", "wicked 1.000000\n", "lame 1.000000\n", "terrible 1.000000\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vsm.neighbors('gnarly', vsm.pmi(gnarly_df))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, both words tend to occur with _awesome_ and not with _lame_ or _terrible_, so there is an important sense in which they are similar. LSA to the rescue:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "gnarly_lsa_df = vsm.lsa(gnarly_df, k=2)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gnarly 0.0\n", "wicked 0.0\n", "awesome 0.0\n", "lame 1.0\n", "terrible 1.0\n", "dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vsm.neighbors('gnarly', gnarly_lsa_df)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Applying LSA to real VSMs\n", "\n", "Here's an example that begins to convey the effect that this can have empirically.\n", "\n", "First, the original count matrix:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "superb 0.000000\n", "the 0.984231\n", ". 
0.986942\n", "is 0.988014\n", ", 0.988344\n", "dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vsm.neighbors('superb', imdb5).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then LSA with $k=100$:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "imdb5_svd = vsm.lsa(imdb5, k=100)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "superb 0.000000\n", "stunning 0.008543\n", "breathtaking 0.011966\n", "magnificent 0.013147\n", "notch 0.013776\n", "dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vsm.neighbors('superb', imdb5_svd).head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "A common pattern in the literature is to apply PMI first. The PMI values tend to give the count matrix a normal (Gaussian) distribution that better satisfies the assumptions underlying SVD:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "imdb5_pmi = vsm.pmi(imdb5, positive=False)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "imdb5_pmi_svd = vsm.lsa(imdb5_pmi, k=100)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "superb 0.000000\n", "terrific 0.012701\n", "outstanding 0.014119\n", "incredible 0.018908\n", "fantastic 0.022780\n", "dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vsm.neighbors('superb', imdb5_pmi_svd).head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Other resources for matrix factorization\n", "\n", "The [sklearn.decomposition](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition) module contains an implementation of LSA ([TruncatedSVD](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD)) that you might want to switch to for real experiments:\n", "\n", "* The `sklearn` version is more flexible than the above in that it can operate on both dense matrices (Numpy arrays) and sparse matrices (from Scipy).\n", "\n", "* The `sklearn` version will make it easy to try out other dimensionality reduction methods in your own code; [Principal Component Analysis (PCA)](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA) and [Non-Negative Matrix Factorization (NMF)](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF) are closely related methods that are worth a look." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## GloVe\n", "\n", "### Overview of the GloVe method\n", "\n", "[Pennington et al. (2014)](http://www.aclweb.org/anthology/D/D14/D14-1162.pdf) introduce an objective function for semantic word representations. Roughly speaking, the objective is to learn vectors for words $w_{i}$ and $w_{j}$ such that their dot product is proportional to their probability of co-occurrence:\n", "\n", "$$w_{i}^{\\top}\\widetilde{w}_{k} + b_{i} + \\widetilde{b}_{k} = \\log(X_{ik})$$\n", "\n", "The paper is exceptionally good at motivating this objective from first principles. 
In their equation (6), they define \n", "\n", "$$w_{i}^{\\top}\\widetilde{w}_{k} = \\log(P_{ik}) = \\log(X_{ik}) - \\log(X_{i})$$\n", "\n", "If we allow that the rows and columns can be different, then we would do\n", "\n", "$$w_{i}^{\\top}\\widetilde{w}_{k} = \\log(P_{ik}) = \\log(X_{ik}) - \\log(X_{i} \\cdot X_{*k})$$\n", "\n", "where, as in the paper, $X_{i}$ is the sum of the values in row $i$, and $X_{*k}$ is the sum of the values in column $k$.\n", "\n", "The rightmost expression is PMI by the equivalence $\\log(\\frac{x}{y}) = \\log(x) - \\log(y)$, and hence we can see GloVe as aiming to make the dot product of two learned vectors equal to the PMI!\n", "\n", "The full model is a weighting of this objective:\n", "\n", "$$\\sum_{i, j=1}^{|V|} f\\left(X_{ij}\\right)\n", " \\left(w_i^\\top \\widetilde{w}_j + b_i + \\widetilde{b}_j - \\log X_{ij}\\right)^2$$\n", "\n", "where $V$ is the vocabulary and $f$ is a scaling factor designed to diminish the impact of very large co-occurrence counts:\n", "\n", "$$f(x) \n", "\\begin{cases}\n", "(x/x_{\\max})^{\\alpha} & \\textrm{if } x < x_{\\max} \\\\\n", "1 & \\textrm{otherwise}\n", "\\end{cases}$$\n", "\n", "Typically, $\\alpha$ is set to $0.75$ and $x_{\\max}$ to $100$ (though it is worth assessing how many of your non-zero counts are above this; in dense word $\\times$ word matrices, you could be flattening more than you want to)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### GloVe implementation notes\n", "\n", "* The implementation in `vsm.glove` is the most stripped-down, bare-bones version of the GloVe method I could think of. As such, it is quite slow. \n", "\n", "* The required [mittens](https://github.com/roamanalytics/mittens) package includes a vectorized implementation that is much, much faster, so we'll mainly use that. \n", "\n", "* For really large jobs, [the official C implementation released by the GloVe team](http://nlp.stanford.edu/projects/glove/) is probably the best choice." 
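,
"\n",
"As a point of reference for the objective above, here is a minimal sketch of the weighting function $f$ (an illustration with a hypothetical name, not code from `vsm.glove`, `mittens`, or the official release):\n",
"\n",
"```python\n",
"def glove_weight(x, xmax=100, alpha=0.75):\n",
"    # Diminish the influence of very large co-occurrence counts,\n",
"    # capping the weight at 1 once the count reaches xmax.\n",
"    return (x / xmax)**alpha if x < xmax else 1.0\n",
"```\n"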
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Applying GloVe to our motivating example\n", "\n", "GloVe should do well on our _gnarly_/_wicked_ evaluation, though you will see a lot variation due to the small size of this VSM:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping at iteration 547 with error 9.999229386453145e-05\n" ] } ], "source": [ "gnarly_glove = vsm.glove(gnarly_df, n=5, max_iter=1000)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gnarly 0.000000\n", "awesome 0.219802\n", "lame 0.604375\n", "terrible 0.895709\n", "wicked 1.574378\n", "dtype: float64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vsm.neighbors('gnarly', gnarly_glove)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Testing the GloVe implementation\n", "\n", "It is not easy analyze GloVe values derived from real data, but the following little simulation suggests that `vsm.glove` is working as advertised: it does seem to reliably deliver vectors whose dot products are proportional to the co-occurrence probability:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "glove_test_count_df = pd.DataFrame(\n", " np.array([\n", " [10.0, 2.0, 3.0, 4.0],\n", " [ 2.0, 10.0, 4.0, 1.0],\n", " [ 3.0, 4.0, 10.0, 2.0],\n", " [ 4.0, 1.0, 2.0, 10.0]]),\n", " index=['A', 'B', 'C', 'D'],\n", " columns=['A', 'B', 'C', 'D'])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Stopping at iteration 679 with error 9.996347120791212e-05\n" ] } ], "source": [ "glove_test_df = vsm.glove(glove_test_count_df, max_iter=1000, n=4)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "def correlation_test(true, pred): \n", " mask = true > 0\n", " M = pred.dot(pred.T)\n", " with np.errstate(divide='ignore'):\n", " log_cooccur = np.log(true)\n", " log_cooccur[np.isinf(log_cooccur)] = 0.0\n", " row_prob = np.log(true.sum(axis=1))\n", " row_log_prob = np.log(np.outer(row_prob, np.ones(true.shape[1])))\n", " prob = log_cooccur - row_log_prob\n", " return np.corrcoef(prob[mask], M[mask])[0, 1]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9886773305299698" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "correlation_test(glove_test_count_df.values, glove_test_df.values)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Applying GloVe to real VSMs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `vsm.glove` implementation is too slow to use on real matrices. 
The distribution in the `mittens` package is significantly faster, making its use possible even without a GPU (and it will be very fast indeed on a GPU machine):" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Iteration 100: loss: 133194.53125" ] } ], "source": [ "glove_model = GloVe()\n", "\n", "imdb5_glv = glove_model.fit(imdb5.values)\n", "\n", "imdb5_glv = pd.DataFrame(imdb5_glv, index=imdb5.index)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "superb 0.000000\n", "brilliant 0.113144\n", "excellent 0.116295\n", "fine 0.117888\n", "performances 0.127362\n", "dtype: float64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vsm.neighbors('superb', imdb5_glv).head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Autoencoders\n", "\n", "An autoencoder is a machine learning model that seeks to learn parameters that predict its own input. This is meaningful when there are intermediate representations that have lower dimensionality than the inputs. These provide a reduced-dimensional view of the data akin to those learned by LSA, but now we have a lot more design choices and a lot more potential to learn higher-order associations in the underyling data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Overview of the autoencoder method\n", "\n", "The module `tf_autoencoder` uses TensorFlow to implement a simple one-layer autoencoder:\n", "\n", "$$\n", "\\begin{align}\n", "h &= \\mathbf{f}(xW + b_{h}) \\\\\n", "\\widehat{x} &= hW^{\\top} + b_{x}\n", "\\end{align}$$\n", "\n", "Here, we assume that the hidden representation $h$ has a low dimensionality like 100, and that $\\mathbf{f}$ is a non-linear activation function (the default for `tf_autoencoder` is `tanh`). These are the major design choices internal to the network. It might also be meaningful to assume that there are two matrices of weights $W_{xh}$ and $W_{hx}$, rather than using $W^{\\top}$ for the output step.\n", "\n", "The objective function for autoencoders will implement some kind of assessment of the distance between the inputs and their predicted outputs. For example, one could use the one-half mean squared error:\n", "\n", "$$\\frac{1}{m}\\sum_{i=1}^{m} \\frac{1}{2}(\\widehat{X[i]} - X[i])^{2}$$\n", "\n", "where $X$ is the input matrix of examples (dimension $m \\times n$) and $X[i]$ corresponds to the $i$th example.\n", "\n", "When you call the `fit` method of `tf_autoencoder.TfAutoencoder`, it returns the matrix of hidden representations $h$, which is the new embedding space: same row count as the input, but with the column count set by the `hidden_dim` parameter.\n", "\n", "For much more on autoencoders, see the 'Autoencoders' chapter of [Goodfellow et al. 2016](http://www.deeplearningbook.org)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Testing the autoencoder implementation\n", "\n", "Here's an evaluation that is meant to test the autoencoder implementation – we expect it to be able to full encode the input matrix because we know its rank is equal to the dimensionality of the hidden representation." 
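,
"\n",
"For orientation, here is a minimal NumPy sketch of the forward pass described above (`autoencoder_forward` is just a hypothetical helper, not the `tf_autoencoder` code itself):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def autoencoder_forward(X, W, b_h, b_x, f=np.tanh):\n",
"    # Encoder: non-linear projection down to the hidden dimensionality.\n",
"    H = f(X @ W + b_h)\n",
"    # Decoder: reconstruct the inputs with the tied (transposed) weights.\n",
"    X_hat = H @ W.T + b_x\n",
"    return H, X_hat\n",
"```\n"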
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def randmatrix(m, n, sigma=0.1, mu=0):\n", " return sigma * np.random.randn(m, n) + mu\n", "\n", "def autoenoder_evaluation(nrow=1000, ncol=100, rank=20, max_iter=20000):\n", " \"\"\"This an evaluation in which `TfAutoencoder` should be able\n", " to perfectly reconstruct the input data, because the\n", " hidden representations have the same dimensionality as\n", " the rank of the input matrix.\n", " \"\"\"\n", " X = randmatrix(nrow, rank).dot(randmatrix(rank, ncol))\n", " ae = TfAutoencoder(hidden_dim=rank, max_iter=max_iter)\n", " ae.fit(X)\n", " X_pred = ae.predict(X)\n", " mse = (0.5 * (X_pred - X)**2).mean()\n", " return(X, X_pred, mse)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Iteration 100: loss: 0.0010361233726143837" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Autoencoder evaluation MSE after 100 evaluations: 0.0010\n" ] } ], "source": [ "ae_max_iter = 100\n", "\n", "_, _, ae = autoenoder_evaluation(max_iter=ae_max_iter)\n", "\n", "print(\"Autoencoder evaluation MSE after {0} evaluations: {1:0.04f}\".format(ae_max_iter, ae))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Applying autoencoders to real VSMs\n", "\n", "You can apply the autoencoder directly to the count matrix, but this could interact very badly with the internal activation function: if the counts are all very high or very low, then everything might get pushed irrevocably towards the extreme values of the activation.\n", "\n", "Thus, it's a good idea to first normalize the values somehow. Here, I use `vsm.length_norm`:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "imdb5_l2 = imdb5.apply(vsm.length_norm, axis=1)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", "Iteration 1: stopping with loss < self.tol" ] } ], "source": [ "imdb5_l2_ae = TfAutoencoder(\n", " max_iter=100, hidden_dim=50, eta=0.001).fit(imdb5_l2)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "superb 0.000000\n", "newer 0.496989\n", "trip 0.504418\n", "fault 0.529206\n", "arlington 0.541104\n", "dtype: float64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vsm.neighbors('superb', imdb5_l2_ae).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is very slow and seems not to work all that well. 
To speed things up, one can first apply LSA or similar:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "imdb5_l2_svd100 = vsm.lsa(imdb5_l2, k=100)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Iteration 1000: loss: 0.00011613927927101031" ] } ], "source": [ "imdb_l2_svd100_ae = TfAutoencoder(\n", " max_iter=1000, hidden_dim=50, eta=0.01).fit(imdb5_l2_svd100)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "superb 0.000000\n", "brilliant 0.103738\n", "direction 0.115500\n", "cinematography 0.130195\n", "flawless 0.138378\n", "dtype: float64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vsm.neighbors('superb', imdb_l2_svd100_ae).head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## word2vec\n", "\n", "The label __word2vec__ picks out a family of models in which the embedding for a word $w$ is trained to predict the words that co-occur with $w$. This intuition can be cashed out in numerous ways. Here, we review just the __skip-gram model__, due to [Mikolov et al. 2013](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Training data\n", "\n", "The most natural starting point is to transform a corpus into a supervised data set by mapping each word to a subset (maybe all) of the words that it occurs with in a given window. Schematically:\n", "\n", "__Corpus__: `it was the best of times, it was the worst of times, ...`\n", "\n", "With window size 2:\n", "\n", "```\n", "(it, was)\n", "(it, the)\n", "(was, it)\n", "(was, the)\n", "(was, best)\n", "(the, was)\n", "(the, it)\n", "(the, best)\n", "(the, of)\n", "...\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Basic skip-gram\n", "\n", "The basic skip-gram model estimates the probability of an input–output pair $(a, b)$ as\n", "\n", "$$P(b \\mid a) = \\frac{\\exp(x_{a}w_{b})}{\\sum_{b'\\in V}\\exp(x_{a}w_{b'})}$$\n", "\n", "where $x_{a}$ is the row-vector representation of word $a$ and $w_{b}$ is the column vector representation of word $b$. The objective is to minimize the following quantity:\n", "\n", "$$\n", "-\\sum_{i=1}^{m}\\sum_{k=1}^{|V|}\n", "\\textbf{1}\\{c_{i}=k\\} \n", "\\log\n", "\\frac{\n", " \\exp(x_{i}w_{k})\n", "}{\n", " \\sum_{j=1}^{|V|}\\exp(x_{i}w_{j})\n", "}$$\n", "\n", "where $V$ is the vocabulary.\n", "\n", "The inputs $x_{i}$ are the word representations, which get updated during training, and the outputs are one-hot vectors $c$. For example, if `was` is the 560th element in the vocab, then the output $c$ for the first example in the corpus above would be a vector of all $0$s except for a $1$ in the 560th position. $x$ would be the representation of `it` in the embedding space. 
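\n",
"\n",
"To make these pieces concrete, here is a small sketch (hypothetical helper name and shapes; not from any particular word2vec implementation) of the predicted distribution for a single input word:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def skipgram_predict(x_a, W):\n",
"    # x_a: embedding of the input word, shape (d,).\n",
"    # W: context (output) matrix, shape (d, |V|).\n",
"    scores = x_a @ W\n",
"    # Softmax over the vocabulary gives P(b | a) for every candidate b:\n",
"    exp_scores = np.exp(scores - scores.max())\n",
"    return exp_scores / exp_scores.sum()\n",
"```\n",
"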
\n", "\n", "The distribution over the entire output space for a given input word $a$ is thus a standard softmax classifier; here we add a bias term for good measure:\n", "\n", "$$c = \\textbf{softmax}(x_{a}W + b)$$\n", "\n", "If we think of this model as taking the entire matrix $X$ as input all at once, then it becomes\n", "\n", "$$c = \\textbf{softmax}(XW + b)$$\n", "\n", "and it is now very clear that we are back to the core insight that runs through all of our reweighting and dimensionality reduction methods: we have a word matrix $X$ and a context matrix $W$, and we are trying to push the dot products of these two embeddings in a specific direction: here, to maximize the likelihood of the observed co-occurrences in the corpus." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Skip-gram with noise contrastive estimation \n", "\n", "Training the basic skip-gram model directly is extremely expensive for large vocabularies, because $W$, $b$, and the outputs $c$ get so large. \n", "\n", "A straightforward way to address this is to change the objective to use __noise contrastive estimation__ (negative sampling). Where $\\mathcal{D}$ is the original training corpus and $\\mathcal{D}'$ is a sample of pairs not in the corpus, we minimize\n", "\n", "$$\\sum_{a, b \\in \\mathcal{D}}-\\log\\sigma(x_{a}w_{b}) + \\sum_{a, b \\in \\mathcal{D}'}\\log\\sigma(x_{a}w_{b})$$\n", "\n", "with $\\sigma$ the sigmoid activation function $\\frac{1}{1 + \\exp(-x)}$.\n", "\n", "The advice of Mikolov et al. is to sample $\\mathcal{D}'$ proportional to a scaling of the frequency distribution of the underlying vocabulary in the corpus:\n", "\n", "$$P(w) = \\frac{\\textbf{count}(w)^{0.75}}{\\sum_{w'\\in V} \\textbf{count}(w')}$$\n", "\n", "where $V$ is the vocabulary.\n", "\n", "Although this new objective function is a substantively different objective than the previous one, Mikolov et al. (2013) say that it should approximate it, and it is building on the same insight about words and their contexts. See [Levy and Golberg 2014](http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization) for a proof that this objective reduces to PMI shifted by a constant value. See also [Cotterell et al. 2017](https://aclanthology.coli.uni-saarland.de/papers/E17-2028/e17-2028) for an interpretation of this model as a variant of PCA." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### word2vec resources\n", "\n", "* In the usual presentation, word2vec training involves looping repeatedly over the sequence of tokens in the corpus, sampling from the context window from each word to create the positive training pairs. I assume that this same process could be modeled by sampling (row, column) index pairs from our count matrices proportional to their cell values. However, I couldn't get this to work well. I'd be grateful if someone got it work or figured out why it won't!\n", "\n", "* Luckily, there are numerous excellent resources for word2vec. [The TensorFlow tutorial Vector representations of words](https://www.tensorflow.org/tutorials/word2vec) is very clear and links to code that is easy to work with. 
Because TensorFlow has a built in loss function called `tf.nn.nce_loss`, it is especially simple to define these models – one pretty much just sets up an embedding $X$, a context matrix $W$, and a bias $b$, and then feeds them plus a training batch to the loss function.\n", "\n", "* The excellent [Gensim package](https://radimrehurek.com/gensim/) has an implementation that handles the scalability issues related to word2vec." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Other methods\n", "\n", "Learning word representations is one of the most active areas in NLP right now, so I can't hope to offer a comprehensive summary. I'll settle instead for identifying some overall trends and methods:\n", "\n", "* The LexVec model of [Salle et al. 2016](https://aclanthology.coli.uni-saarland.de/papers/P16-2068/p16-2068) combines the core insight of GloVe (learn vectors that approximate PMI) with the insight from word2vec that we should additionally try to push words that don't appear together farther apart in the VSM. (GloVe simply ignores 0 count cells and so can't do this.)\n", "\n", "* There is growing awareness that many apparently diverse models can be expressed as matrix factorization methods like SVD/LSA. See especially \n", "[Singh and Gordon 2008](http://www.cs.cmu.edu/~ggordon/singh-gordon-unified-factorization-ecml.pdf),\n", "[Levy and Golberg 2014](http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization), [Cotterell et al.](https://aclanthology.coli.uni-saarland.de/papers/E17-2028/e17-2028). Also Ryan Cotterell tweeted [this very helpful annotated chart](https://twitter.com/_shrdlu_/status/927278909040873472).\n", "\n", "* The ELMo embeddings of [Peters et al. 2018](https://arxiv.org/abs/1802.05365) are derived from various combinations of the internal states of deep, bidirectional language models with character-level convolutions. The authors report outstanding results when using these representations as the inputs to a wide range of tasks. For additional details, see http://allennlp.org/elmo.\n", "\n", "* Subword modeling ([reviewed briefly in the previous notebook](vsm_01_distributional.ipynb#Subword-information)) is increasingly yielding dividends. (It would already be central if most of NLP focused on languages with complex morphology!) Check out the papers at the Subword and Character-Level Models for NLP Workshops: [SCLeM 2017](https://sites.google.com/view/sclem2017/home), [SCLeM 2018](https://sites.google.com/view/sclem2018/home).\n", "\n", "See [this blog post by Sebastian Ruder](http://ruder.io/word-embeddings-2017/index.html) for even more ideas!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Exploratory exercises\n", "\n", "These are largely meant to give you a feel for the material, but some of them could lead to projects and help you with future work for the course. These are not for credit.\n", "\n", "1. Try out some pipelines of reweighting, `vsm.lsa` at various dimensions, and `TfAutoencoder` to see which seems best according to your sampling around with `vsm.neighbors` and high-level visualization with `vsm.tsne_viz`. Feel free to use other factorization methods defined in [sklearn.decomposition](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition) as well.\n", "\n", "1. What happens if you set `k=1` using `vsm.lsa`? What do the results look like then? 
What do you think this first (and now only) dimension is capturing?\n", "\n", "1. Modify `vsm.glove` so that it uses [the AdaGrad optimization method](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) as in the original paper. It's fine to use [the authors' implementation](http://nlp.stanford.edu/projects/glove/), [Jon Gauthier's implementation](http://www.foldl.me/2014/glove-python/), or the [mittens Numpy implementation](https://github.com/roamanalytics/mittens/blob/master/mittens/np_mittens.py) as references, but you might enjoy the challenge of doing this with no peeking at their code." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 2 }