{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Word2Vec Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2018-2024 by [Damir Cavar](http://damir.cavar.me/)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.2, January 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CA BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Prerequisites:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U numpy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U gensim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a tutorial related to the L665 course on Machine Learning for NLP focusing on Deep Learning, Spring and Fall 2018 at Indiana University. This material is based on Chris Manning's lecture 2 [Word Vector Representations: word2vec](https://www.youtube.com/watch?v=ERibwqs9p38) and additional sources with extended annotations and explanations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we will discuss briefly the necessary methods to understand the Word2Vec algorithm. We will use Numpy for the basic computations." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using One-Hot Vectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create a one-hot vector that selects the 3rd row:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 1, 0])" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = np.array([0, 0, 1, 0])\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us create a matrix $A$ of four rows:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 1, 2, 3, 4],\n", " [ 5, 6, 7, 8],\n", " [ 9, 10, 11, 12],\n", " [13, 14, 15, 16]])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = np.array([[1, 2, 3, 4],\n", " [5, 6, 7, 8],\n", " [9, 10, 11, 12],\n", " [13, 14, 15, 16]])\n", "A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the column vector $x$ to select a row in matrix $A$:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 9, 10, 11, 12])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x.dot(A)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing the Dot-Product" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us simplify and assume that the dot-product of two vectors is the sum of the products of the scalars in the particular dimension of each vector, e.g.:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "u^T v = u \\cdot v = \\sum_{i=1}^n{u_i v_i}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The more similar two vectors $u$ and $v$ are in 
terms of *directionality*, the larger the dot-product is." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could think of the dot product $u \\cdot v$ as projecting one vector on the other. If the two vectors point in the same direction, the dot-product is largest. If one vector is orthogonal to the other, the dot-product is $0$. If the vectors point in opposite directions, the dot-product is negative." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing Softmax" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assume that we have some data or results of mutually exclusive variables that represent scores for an observation being of class $C = [ c_1, c_2, c_3 ]$, as represented by the components of the vector $y$ below:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "y = np.array([4.0, 2.5, 1.1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to convert the scores into a probability distribution that represents the likelihood, on the basis of these scores, that the observation belongs to one of the classes, we can compute the Softmax of the vector:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "p(C_n) = \\frac{\\exp(\\theta \\cdot X_n)}{\\sum_{i=1}^N{\\exp(\\theta \\cdot X_i)}}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parameter $\\theta$ allows us to scale the results to increase the probabilities of lower scalars in the vector. The exponentiation of $X$ makes larger values much larger. If we include a parameter like $\\theta$, we can scale the effect and increase the probabilities assigned to lower values. See for more details the implementation of *softmax* below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Python we can write this using NumPy's *exp* and *sum* functions. The *axis* parameter determines that the sum is computed along the first axis:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def softmax1(y):\n", " return np.exp(y) / np.sum(np.exp(y), axis=0)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.46831053, 0.46831053, 0.06337894])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "softmax1([4.0, 4.0, 2.0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can provide a parameter $\\theta$ to the function, to be able to scale the probability for low values up. The larger $\\theta$, the higher the probability assigned to lower values. 
We set the default for $\\theta$ to $1.0$ in the *softmax* definition:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def softmax(y, t=1.0):\n", " return np.exp(y / t) / np.sum(np.exp(y / t), axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a vector of values $[ 4.0, 4.0, 2.0 ]$, we get the following probability distribution given a default $\\theta$ of $1.0$:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.46831053, 0.46831053, 0.06337894])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "softmax(np.array([4.0, 4.0, 2.0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we double $\\theta$, the probability assigned to the third scalar increases significantly:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.4223188, 0.4223188, 0.1553624])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "softmax(np.array([4.0, 4.0, 2.0]), 2.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing Word2Vec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us first focus on the skip-gram model. For every word $w$ at position $t$ with some window size or radius of $m$ we want to predict the context of $w$. Our *objective function* is to maximize the probability of any context word given the current center word:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$J'(\\theta) = \\prod_{t=1}^T \\prod_{\\substack{-m \\leq j \\leq m\\\\j \\neq 0}} P(w_{t+j} | w_t; \\theta)$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that $\\theta$ here stands for all the parameters of the model, that is, the word vectors that we will optimize. We will get back to it later. There is also a hidden hyper-parameter, namely the radius $m$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can reformulate this equation as a negative average of log-likelihoods:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "J(\\theta) = - \\frac{1}{T} \\sum_{t=1}^T{ \\sum_{\\substack{-m \\leq j \\leq m\\\\ j \\neq 0}} \\log(P(w_{t+j}| w_t)) }\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the equation above we average over all words in the text by dividing by $T$. We take the minus of the sums to get a positive score, since the log of a probability is always negative. Our goal is thus to minimize the loss function $J(\\theta)$, that is, to minimize the negative log-likelihood." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The task can be described as follows: given a specific center word somewhere in the middle of a sentence, and one word picked at random within a radius of $n$ words to its left or right, the model outputs for every word in the vocabulary the probability of it being the selected context word." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model is trained using word pairs extracted from a text. We go word by word through some text, selecting this word as the center word, and pick the context words from a window of size $n$.
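" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a small illustration, the following is a minimal sketch of this pair extraction in plain Python (the toy sentence and the variable names are made up for this illustration):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: extract (center word, context word) training pairs from a toy sentence\n", "sentence = \"the quick brown fox jumps over the lazy dog\".split()\n", "n = 3  # window radius: up to n words to the left and to the right of the center word\n", "\n", "pairs = []\n", "for t, center in enumerate(sentence):\n", "    for j in range(-n, n + 1):\n", "        if j == 0 or not (0 <= t + j < len(sentence)):\n", "            continue  # skip the center word itself and positions outside the sentence\n", "        pairs.append((center, sentence[t + j]))\n", "\n", "pairs[:8]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "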
In the sketch above the radius is $n=3$; that is, we pick up to three words to the left and to the right of the center word as its context words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our model will be based on the distributional properties of the word pairs, that is, the frequency of their co-occurrence. The left word in each of these pairs is the center word and the right word is one of its context words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assume that we have learned 300 features represented in vectors for center words and their context respectively, that is, we have two independent vectors for each word, one that represents its features when it is a context word, and one that represents its features as a center word. This is our model $p(w_{t+j}|w_t)$ for a word at position $t$ in the text, and words in $t+j$ positions in the text relative to $t$. For the purpose of explanation here, we could assume that we have such vector pairs of 300 features for 100,000 such words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following equation we will assume $c$ and $o$ to be indices in the space of the word types that we model. These indices now refer to the list of vocabulary, not to positions in the text, as $t$ does above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will compute the Softmax of two word vectors for a context word (index $o$ in the vocabulary) and a center word (index $c$ in the vocabulary) by taking the dot-product of their 300-dimensional feature vectors. That is, one is the vector of a word in the context $u_o$, and the other is the vector of the center word $v_c$. The $u$ vectors are the context word vectors and the $v$ vectors are the center word vectors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "p(o|c) = \\frac{\\exp(u_o^T v_c)}{\\sum_{w=1}^V{\\exp(u_w^T v_c)}}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We compute the dot-product of the one-hot vector representing one center word $w_t$ with the center word matrix $W$. This matrix is the lookup matrix of word embeddings: multiplying it with the one-hot vector looks up the column that represents the center word. (See Lecture 2, Word Vector Representations: word2vec, Stanford presentation by Chris Manning.) As explained above, this will result in picking a concrete center word feature vector from the corresponding matrix (the dashes stand for the remaining entries of the matrix):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\begin{bmatrix}\n", " 0 \\\\[0.3em]\n", " 0 \\\\[0.3em]\n", " 0 \\\\[0.3em]\n", " 0 \\\\[0.3em]\n", " 1 \\\\[0.3em]\n", " 0 \\\\[0.3em]\n", " 0 \\\\[0.3em]\n", " 0\n", "\\end{bmatrix}\n", "\\begin{bmatrix}\n", " -- & 0.2 & -- \\\\[0.3em]\n", " -- & -1.4 & -- \\\\[0.3em]\n", " -- & 0.3 & -- \\\\[0.3em]\n", " -- & -0.1 & -- \\\\[0.3em]\n", " -- & 0.1 & -- \\\\[0.3em]\n", " -- & -0.3 & -- \n", "\\end{bmatrix} = \n", "\\begin{bmatrix}\n", " 0.2 \\\\[0.3em]\n", " -1.4 \\\\[0.3em]\n", " 0.3 \\\\[0.3em]\n", " -0.1 \\\\[0.3em]\n", " 0.1 \\\\[0.3em]\n", " -0.3 \n", "\\end{bmatrix} = V_c\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take this center word matrix to be the weight matrix of the **hidden layer** of a simple neural network; the looked-up vector $V_c$ is the activation of that hidden layer for the current center word."
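 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick check, here is a minimal NumPy sketch of this lookup with made-up toy dimensions (a vocabulary of three words with six features each); the middle column of $W$ holds the numbers from the matrix above, and the other entries are arbitrary:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: looking up the center word vector by multiplying the embedding matrix\n", "# with a one-hot vector (toy dimensions: 6 features, 3 vocabulary words)\n", "W = np.array([[ 0.5,  0.2,  0.8],\n", "              [-0.2, -1.4,  0.4],\n", "              [ 0.1,  0.3, -0.6],\n", "              [ 0.7, -0.1,  0.2],\n", "              [-0.9,  0.1,  0.5],\n", "              [ 0.3, -0.3, -0.2]])  # each column is the feature vector of one word\n", "\n", "x = np.array([0, 1, 0])  # one-hot vector selecting the second word\n", "\n", "v_c = W.dot(x)  # picks the second column of W\n", "v_c"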
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then take the dot-product of this vector of the center word with the matrix of the context vectors (here the scalars dashed out), or output word representation, for each word in the context (or the $n$-sized radius) around the center word, which gives us a 100,000 dimensional vector of weights for each word from the vocabulary in the context of the center vector:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$V_c \\cdot \\begin{bmatrix}\n", " -- & -- & -- \\\\[0.3em]\n", " -- & -- & -- \\\\[0.3em]\n", " -- & -- & -- \\\\[0.3em]\n", " -- & -- & -- \\\\[0.3em]\n", " -- & -- & -- \\\\[0.3em]\n", " -- & -- & -- \n", "\\end{bmatrix} = u_o \\cdot v_c$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take this context vector matrix to be the **output matrix**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next step we use *Softmax* to compute the probability distribution of this vector:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\mbox{Softmax}(u_o \\cdot v_c) = \\mbox{Softmax}(\n", "\\begin{bmatrix}\n", " 1.7 \\\\[0.3em]\n", " 0.3 \\\\[0.3em]\n", " 0.1 \\\\[0.3em]\n", " -0.7 \\\\[0.3em]\n", " -0.2 \\\\[0.3em]\n", " 0.1 \\\\[0.3em]\n", " 0.7 \n", "\\end{bmatrix}) = \\begin{bmatrix}\n", " 0.44 \\\\[0.3em]\n", " 0.11 \\\\[0.3em]\n", " 0.09 \\\\[0.3em]\n", " 0.04 \\\\[0.3em]\n", " 0.07 \\\\[0.3em]\n", " 0.09 \\\\[0.3em]\n", " 0.16 \n", "\\end{bmatrix}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here once more the softmax function over the vector $u_o \\cdot v_c$:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.44276077, 0.10918346, 0.08939186, 0.04016635, 0.06622312,\n", " 0.08939186, 0.16288258])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ " softmax(np.array([1.7, 0.3, 0.1, -0.7, -0.2, 0.1, 0.7]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The rounded sum results in $1.0$:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum([0.44, 0.11, 0.09, 0.04, 0.07, 0.09, 0.16])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could now compare the resulting probabilities with some ground truth and compute the error rate for example. If the ground truth for the above result would be:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\begin{bmatrix}\n", " 0 \\\\[0.3em]\n", " 0 \\\\[0.3em]\n", " 0 \\\\[0.3em]\n", " 0 \\\\[0.3em]\n", " 0 \\\\[0.3em]\n", " 0 \\\\[0.3em]\n", " 1 \n", "\\end{bmatrix}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "this would mean that the model assigned a probability of $0.16$ to this word, then we could compute the loss." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do we learn the parameters for each word that would maximize the prediction of a context word (or vice versa)?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training the Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assume that all parameters of the model are defined in a long vector $\\theta$. 
This vector concatenates, for every word in the vocabulary, its $d$-dimensional vector as a center word and its $d$-dimensional vector as a context word, so for a vocabulary of size $V$ and vector dimensionality $d$ it has $2dV$ entries:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\theta = \\begin{bmatrix}\n", " v_{a} \\\\[0.3em]\n", " v_{ant} \\\\[0.3em]\n", " \\vdots \\\\[0.3em]\n", " v_{zero} \\\\[0.3em]\n", " u_{a} \\\\[0.3em]\n", " u_{ant} \\\\[0.3em]\n", " \\vdots \\\\[0.3em]\n", " u_{zero}\n", "\\end{bmatrix} \\in \\mathbb{R}^{2dV}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We repeat here the objective function, the negative log-likelihood that we want to minimize:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "J(\\theta) = - \\frac{1}{T} \\sum_{t=1}^T{ \\sum_{\\substack{-m \\leq j \\leq m\\\\ j \\neq 0}} \\log(P(w_{t+j}| w_t)) }\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our softmax function discussed above fits into this equation:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "p(o|c) = \\frac{\\exp(u_o^T v_c)}{\\sum_{w=1}^V{\\exp(u_w^T v_c)}}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "J(\\theta) = - \\frac{1}{T} \\sum_{t=1}^T{ \\sum_{\\substack{-m \\leq j \\leq m\\\\ j \\neq 0}} \\log(p(o|c)) }\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal is to minimize the loss function and maximize the likelihood that we predict the right word in the context for any given center word." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Changing the parameters can be achieved using the gradient. To see how, we compute the partial derivative of the log-probability with respect to the center word vector ($v_c$):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\frac{\\partial }{\\partial v_c} \\log(\\frac{\\exp(u_o^T v_c)}{\\sum_{w=1}^V{\\exp(u_w^T v_c)}})\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that a partial derivative of a function with several variables is its derivative with respect to one of these variables. Here it is $v_c$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The log of a division can be converted into a subtraction:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\frac{\\partial }{\\partial v_c} \\left( \\log(\\exp(u_o^T v_c)) - \\log(\\sum_{w=1}^V{\\exp(u_w^T v_c)}) \\right)\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can simplify the first part of the subtraction, since $\\log$ and $\\exp$ cancel each other out, and the partial derivative with respect to $v_c$ of the simplified expression is simply $u_o$:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\frac{\\partial }{\\partial v_c} \\ \\log(\\exp(u_o^T v_c)) = \\frac{\\partial }{\\partial v_c} \\ u_o^T v_c = u_o\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the second part of the subtraction above, we get:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\frac{\\partial }{\\partial v_c} \\ \\log(\\sum_{w=1}^V{\\exp(u_w^T v_c)})\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the chain rule we can simplify this equation as well. The chain rule expresses the derivative of the composition of two functions, that is, mapping $x$ onto $f(g(x))$, in terms of the derivatives of $f$ and $g$.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us first start with derivatives of simple functions. What is the derivative of a constant $5$ for example? It is $0$. Since the derivative measures the change of a function, and a constant does not change, the derivative of a constant is $0$. The derivative of a variable $x$ is $1$. The derivative of the product of a constant and a variable is the constant. The derivative of $x^n$ is $n \\cdot x^{n-1}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A differentiable function of a variable is a function whose derivative exists at each point in its domain ([see here](https://en.wikipedia.org/wiki/Differentiable_function)).The graph of a differentiable function must have a (non-vertical) tangent line at each point in its domain, and cannot contain any breaks, bends, or cusps. The following function $y = |x|$ (absolute value function) is not differentiable:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function is differentiable:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The chain rule allows one to find derivatives of compositions of functions by knowing the derivative of the elementary functions from the composition. Let $g(x)$ be differentiable at $x$ and $f(x)$ be differentiable at $f(g(x))$. Then, if $y=f(g(x))$ and $u=g(x)$: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\\frac{dy}{dx}=\\frac{dy}{du}\\cdot\\frac{du}{dx}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "see for the derivation Chris Manning's video..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The equation he derives is:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "u_o - \\sum_{x=1}^V p(x|c) u_x\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "He labels the left part ($u_o$) as *observation*. This is the context word vectors that we identified in the texts or data. He labels the second part ($\\sum_{x=1}^V p(x|c) u_x$) as *expectation*. This is the part that we want to tweak such that the loss function is minimized." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For maximizing or minimizing the cost (or objective) function using gradient descent, consider this simple example: Find the local minimum of the function $f(x) = x^4 - 3 x^3 + 2$ with derivative $f'(x) = 4 x^3 - 9 x^2$." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Subtracting a fraction of the gradient moves you to the minimum:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x_old: 6 Local minimum occurs at 0.5999999999999996\n", "x_old: 0.5999999999999996 Local minimum occurs at 0.6237599999999996\n", "x_old: 0.6237599999999996 Local minimum occurs at 0.6490692731402646\n", "x_old: 0.6490692731402646 Local minimum occurs at 0.6760475763767438\n", "x_old: 0.6760475763767438 Local minimum occurs at 0.704821965498881\n", "x_old: 0.704821965498881 Local minimum occurs at 0.7355261366038248\n", "x_old: 0.7355261366038248 Local minimum occurs at 0.7682992721113444\n", "x_old: 0.7682992721113444 Local minimum occurs at 0.8032842278686305\n", "x_old: 0.8032842278686305 Local minimum occurs at 0.840624861847519\n", "x_old: 0.840624861847519 Local minimum occurs at 0.8804622684298664\n", "x_old: 0.8804622684298664 Local minimum occurs at 0.9229296507309586\n", "x_old: 0.9229296507309586 Local minimum occurs at 0.9681455460305634\n", "x_old: 0.9681455460305634 Local minimum occurs at 1.016205130521792\n", "x_old: 1.016205130521792 Local minimum occurs at 1.067169389942697\n", "x_old: 1.067169389942697 Local minimum occurs at 1.1210520795330405\n", "x_old: 1.1210520795330405 Local minimum occurs at 1.1778046421472836\n", "x_old: 1.1778046421472836 Local minimum occurs at 1.237299637824332\n", "x_old: 1.237299637824332 Local minimum occurs at 1.2993137782331108\n", "x_old: 1.2993137782331108 Local minimum occurs at 1.3635123370474889\n", "x_old: 1.3635123370474889 Local minimum occurs at 1.429437442158506\n", "x_old: 1.429437442158506 Local minimum occurs at 1.4965033788967752\n", "x_old: 1.4965033788967752 Local minimum occurs at 1.5640022802344904\n", "x_old: 1.5640022802344904 Local minimum occurs at 1.6311231270849003\n", "x_old: 1.6311231270849003 Local minimum occurs at 1.6969855549473505\n", "x_old: 1.6969855549473505 Local minimum occurs at 1.7606875094969714\n", "x_old: 1.7606875094969714 Local minimum occurs at 1.821362659674955\n", "x_old: 1.821362659674955 Local minimum occurs at 1.8782404675959476\n", "x_old: 1.8782404675959476 Local minimum occurs at 1.930700009196379\n", "x_old: 1.930700009196379 Local minimum occurs at 1.9783089472809865\n", "x_old: 1.9783089472809865 Local minimum occurs at 2.0208417065692057\n", "x_old: 2.0208417065692057 Local minimum occurs at 2.0582751831448975\n", "x_old: 2.0582751831448975 Local minimum occurs at 2.090764845528107\n", "x_old: 2.090764845528107 Local minimum occurs at 2.118607415721545\n", "x_old: 2.118607415721545 Local minimum occurs at 2.1421976265432066\n", "x_old: 2.1421976265432066 Local minimum occurs at 2.1619858762300224\n", "x_old: 2.1619858762300224 Local minimum occurs at 2.178441640823547\n", "x_old: 2.178441640823547 Local minimum occurs at 2.1920251576443675\n", "x_old: 2.1920251576443675 Local minimum occurs at 2.203167862727841\n", "x_old: 2.203167862727841 Local minimum occurs at 2.2122606942724694\n", "x_old: 2.2122606942724694 Local minimum occurs at 2.2196486877629633\n", "x_old: 2.2196486877629633 Local minimum occurs at 2.2256301304909205\n", "x_old: 2.2256301304909205 Local minimum occurs at 2.2304587076907274\n", "x_old: 2.2304587076907274 Local minimum occurs at 2.234347382687595\n", "x_old: 2.234347382687595 Local minimum occurs at 2.2374730902946083\n", "x_old: 2.2374730902946083 Local minimum occurs at 2.239981621916576\n", 
"x_old: 2.239981621916576 Local minimum occurs at 2.2419923174775156\n", "x_old: 2.2419923174775156 Local minimum occurs at 2.243602351591089\n", "x_old: 2.243602351591089 Local minimum occurs at 2.2448905184851697\n", "x_old: 2.2448905184851697 Local minimum occurs at 2.2459204946033684\n", "x_old: 2.2459204946033684 Local minimum occurs at 2.2467436015363202\n", "x_old: 2.2467436015363202 Local minimum occurs at 2.2474011148628947\n", "x_old: 2.2474011148628947 Local minimum occurs at 2.2479261740485823\n", "x_old: 2.2479261740485823 Local minimum occurs at 2.2483453500247714\n", "x_old: 2.2483453500247714 Local minimum occurs at 2.2486799240099864\n", "x_old: 2.2486799240099864 Local minimum occurs at 2.2489469258218673\n", "x_old: 2.2489469258218673 Local minimum occurs at 2.2491599737759116\n", "x_old: 2.2491599737759116 Local minimum occurs at 2.2493299520940697\n", "x_old: 2.2493299520940697 Local minimum occurs at 2.249465555993498\n", "x_old: 2.249465555993498 Local minimum occurs at 2.24957372949745\n", "x_old: 2.24957372949745 Local minimum occurs at 2.249660016570137\n", "x_old: 2.249660016570137 Local minimum occurs at 2.2497288424102844\n", "x_old: 2.2497288424102844 Local minimum occurs at 2.24978373858824\n", "x_old: 2.24978373858824 Local minimum occurs at 2.2498275231061067\n", "x_old: 2.2498275231061067 Local minimum occurs at 2.249862444322635\n", "x_old: 2.249862444322635 Local minimum occurs at 2.249890295941524\n", "x_old: 2.249890295941524 Local minimum occurs at 2.2499125088471215\n", "x_old: 2.2499125088471215 Local minimum occurs at 2.24993022442776\n", "x_old: 2.24993022442776 Local minimum occurs at 2.249944353104799\n", "x_old: 2.249944353104799 Local minimum occurs at 2.2499556210437\n", "x_old: 2.2499556210437 Local minimum occurs at 2.2499646074278457\n" ] } ], "source": [ "x_old = 0\n", "x_new = 6\n", "eps = 0.01 # step size\n", "precision = 0.00001\n", "\n", "def f_derivative(x):\n", " return 4 * x**3 - 9 * x**2\n", "\n", "while abs(x_new - x_old) > precision:\n", " x_old = x_new\n", " x_new = x_old - eps * f_derivative(x_old)\n", " print(\"x_old:\", x_old, \" Local minimum occurs at\", x_new)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "

(Figure: Derivative, Wikipedia)

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the idea is to identify the gradient at point $x$, we subtract a little fraction of the gradient, which moves us downhill towards the minimum. We continue with computing the gradient again at this point and walking down towards the minimum." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Gradient Descent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Minimization of an objective function $J(\\theta)$ over the entire training corpus makes it necessary to compute gradients for all windows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We would need to update for each element of $\\theta$ to identify the derivatives of the objective function with respect to all the parameters:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\theta_j^{new} = \\theta_j^{old} - \\alpha \\frac{\\partial }{\\partial \\theta_j^{old}} J(\\theta)\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$\\alpha$ is the step size in the Gradient Descent algorithm. We have some parameter values ($\\theta_j^{old}$) and we identified the gradient (the $\\frac{\\partial}{\\partial\\theta_j^{old}}$ portion) at this position ($j$). We subtract a fraction of the gradient from the old parameters and get new ones. If this gives us a lower objection value, this takes us towards the minimum." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Matrix notation for all parameters:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\theta^{new} = \\theta^{old} - \\alpha \\frac{\\partial}{\\partial \\theta^{old}} J(\\theta)\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\theta^{new} = \\theta^{old} - \\alpha \\nabla_\\theta J(\\theta)\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generic Gradient Descent code with some stopping condition to be added to it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "while True:\n", " theta_grad = evaluate_gradient(J, corpus, theta)\n", " theta = theta - alpha * theta_grad" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The blue lines show the contour lines of the value of the objective function. Computing the Gradient we identify the direction of the steepest descent. Walking in small steps down this line of the steepest descent takes us to the minimum." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "

(Figure: Gradient Descent, Wikipedia)

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To achieve walking continuously down towards the minimum, $\\alpha$ needs to be small enough so that one does not jump over the minimum to the other side." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem would be that, if we have a billion token corpus, this might include a lot of computations and optimizations. Computing the gradient of the objective function for a very large corpus will take very long for even the first gradient update." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Stochastic Gradient Descent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We compute the gradient and optimize at one position $t$ in the corpus ($t$ is the index of the center word)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\theta^{new} = \\theta^{old} - \\alpha \\nabla_\\theta J_t(\\theta)\n", "$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "while True:\n", " window = sample_window(corpus)\n", " theta_grad = evaluate_gradient(J, window, theta)\n", " theta = theta - alpha * theta_grad" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example using Gensim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Gensim](https://radimrehurek.com/gensim/index.html) provides an [API documentation for word2vec](https://radimrehurek.com/gensim/models/word2vec.html). There is a [GitHub repository](https://github.com/RaRe-Technologies/w2v_server_googlenews) with the code for a high-performance similarity server using [Gensim](https://radimrehurek.com/gensim/index.html) and word2vec. This short intro is based on the [*Rare Technologies* tutorial](https://rare-technologies.com/word2vec-tutorial/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The input for Gensim's word2vec has to be a list of sentences and sentences are a list of tokens. We will import *gensim* and train a model on some sample sentences:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "import gensim\n", "\n", "sentences = [['Tom', 'loves', 'pizza'], ['Peter', 'loves', 'fries']]\n", "\n", "model = gensim.models.Word2Vec(sentences, min_count=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a more efficient implementation that does not hold an entire corpus in RAM, one can specify a file-reader and process the input line by line, as shown by Radim Řehůřek in [his tutorial](https://rare-technologies.com/word2vec-tutorial/):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "class MySentences(object):\n", " def __init__(self, dirname):\n", " self.dirname = dirname\n", " \n", " def __iter__(self):\n", " for fname in os.listdir(self.dirname):\n", " for line in open(os.path.join(self.dirname, fname)):\n", " yield line.split()\n", " \n", "sentences = MySentences('examples') # load Gensim_example_1.txt from folder, a memory-friendly iterator\n", "model = gensim.models.Word2Vec(sentences)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code above might not run in Jupyter Notebooks, due to restrictions over some modules. Copy the code into a Python file and run it in the command line." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gensim's word2vec allows you to prune the internal dictionary. This can eliminate tokens that occur a minimum number of times. 
The *min_count* parameter provides this restriction: words that occur fewer than *min_count* times in the corpus are dropped from the vocabulary." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Gensim offers an API for:\n", "- evaluation using standard data sets and formats (e.g. Google test set)\n", "- storing and loading of models\n", "- online training or resuming of training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compute similarities and analogies using the *wv* attribute of the trained model (these example words require a model trained on a larger corpus):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)\n", "model.wv.doesnt_match(\"breakfast cereal dinner lunch\".split())\n", "model.wv.similarity('woman', 'man')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can access a word vector directly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model.wv['loves']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2024 by [Damir Cavar](http://damir.cavar.me/)**" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false } }, "nbformat": 4, "nbformat_minor": 2 }