{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab : Using Word2Vec pre-trained models\n",
    "We will learn to use widely avilable pre-trained models\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elephantscale/cool-ML-demos/blob/main/nlp/word2vec-2-pre-trained-models.ipynb)\n",
    "\n",
    "### Runtime \n",
    "20 mins"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 - Data\n",
    "\n",
    "https://github.com/RaRe-Technologies/gensim-data is a central repository of data and pre-trained models for Gensim.\n",
    "\n",
    "### Few notable datasets:\n",
    "- '20-news-groups'  : newsgroup postings (size : 13 MB)\n",
    "- 'patent-2017' : US patents (size : ~3G)\n",
    "- 'wiki-english-20171001' : wikipedia dump on 2017  (size : ~6G)\n",
    "\n",
    "\n",
    "### Popular pre-trained word2vec models\n",
    "\n",
    "| model | size | number of vectors | description |\n",
    "|--------------------------|----------|-------------------|------------------------------------------------------|\n",
    "| glove-twitter-25 | 104 MB | 1,193,514 | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) |\n",
    "| word2vec-google-news-300 | 1,662 MB | 3,000,000 | Google News (about 100 billion words) |\n",
    "| glove-wiki-gigaword-300 | 376 MB | 400,000 | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) |\n",
    "\n",
    "\n",
    "**TODO : Inspect some of the models avilable**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 - Tweets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 - Available Models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get a list of models available\n",
    "\n",
    "import gensim.downloader as downloader\n",
    "from pprint import pprint\n",
    "\n",
    "info = downloader.info()\n",
    "# print(info)\n",
    "# pprint (info)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.2 - Download Glove-Twitter-25"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x7f03c5b424a8>\n",
      "CPU times: user 41.7 s, sys: 262 ms, total: 41.9 s\n",
      "Wall time: 42.1 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "# this will download and load the model\n",
    "# data will be downloaded to  ~/gensim-data\n",
    "model_glove_tw25 = downloader.load(\"glove-twitter-25\") \n",
    "print(model_glove_tw25)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 - Query the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('dog', 0.9590819478034973),\n",
       " ('monkey', 0.9203578233718872),\n",
       " ('bear', 0.9143137335777283),\n",
       " ('pet', 0.9108031392097473),\n",
       " ('girl', 0.8880630135536194),\n",
       " ('horse', 0.8872727155685425),\n",
       " ('kitty', 0.8870542049407959),\n",
       " ('puppy', 0.886769711971283),\n",
       " ('hot', 0.8865255117416382),\n",
       " ('lady', 0.8845518827438354)]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model_glove_tw25.most_similar('cat')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "similarity(woman, man) :  0.76541775\n",
      "similarity(girl, boy) :  0.95961404\n",
      "similarity(prince, princess) :  0.87682575\n"
     ]
    }
   ],
   "source": [
    "#printing similarity index\n",
    "print (\"similarity(woman, man) : \",  model_glove_tw25.similarity('woman', 'man'))\n",
    "print (\"similarity(girl, boy) : \",  model_glove_tw25.similarity('girl', 'boy'))\n",
    "print (\"similarity(prince, princess) : \",  model_glove_tw25.similarity('prince', 'princess'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('meets', 0.8841923475265503),\n",
       " ('prince', 0.832163393497467),\n",
       " ('queen', 0.8257461190223694),\n",
       " ('’s', 0.8174097537994385),\n",
       " ('crow', 0.8134994506835938),\n",
       " ('hunter', 0.8131038546562195),\n",
       " ('father', 0.811583399772644),\n",
       " ('soldier', 0.8111359477043152),\n",
       " ('mercy', 0.8082392811775208),\n",
       " ('hero', 0.8082262873649597)]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model_glove_tw25.most_similar(positive=['woman', 'king'], negative=['man'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 - Google News Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 - Download\n",
    "Large download (1.6 GB)  \n",
    "Sometimes you might run out of memory!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x7f0384f14518>\n",
      "CPU times: user 1min 49s, sys: 2.58 s, total: 1min 52s\n",
      "Wall time: 1min 51s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "# this will download and load the model\n",
    "# data will be downloaded to  ~/gensim-data\n",
    "model_google_news = downloader.load(\"word2vec-google-news-300\") \n",
    "print(model_google_news)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 - Query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('cats', 0.8099379539489746),\n",
       " ('dog', 0.7609456777572632),\n",
       " ('kitten', 0.7464985251426697),\n",
       " ('feline', 0.7326233983039856),\n",
       " ('beagle', 0.7150583267211914),\n",
       " ('puppy', 0.7075453996658325),\n",
       " ('pup', 0.6934291124343872),\n",
       " ('pet', 0.6891531348228455),\n",
       " ('felines', 0.6755931377410889),\n",
       " ('chihuahua', 0.6709762215614319)]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model_google_news.most_similar('cat')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "similarity(woman, man) :  0.76640123\n",
      "similarity(girl, boy) :  0.8543272\n",
      "similarity(prince, princess) :  0.69865096\n"
     ]
    }
   ],
   "source": [
    "#printing similarity index\n",
    "print (\"similarity(woman, man) : \",  model_google_news.similarity('woman', 'man'))\n",
    "print (\"similarity(girl, boy) : \",  model_google_news.similarity('girl', 'boy'))\n",
    "print (\"similarity(prince, princess) : \",  model_google_news.similarity('prince', 'princess'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('queen', 0.7118192911148071),\n",
       " ('monarch', 0.6189674139022827),\n",
       " ('princess', 0.5902431011199951),\n",
       " ('crown_prince', 0.5499460697174072),\n",
       " ('prince', 0.5377321243286133),\n",
       " ('kings', 0.5236844420433044),\n",
       " ('Queen_Consort', 0.5235945582389832),\n",
       " ('queens', 0.518113374710083),\n",
       " ('sultan', 0.5098593235015869),\n",
       " ('monarchy', 0.5087411999702454)]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model_google_news.most_similar(positive=['woman', 'king'], negative=['man'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('French', 0.6156964302062988),\n",
       " ('PARIS_AFX_Gaz_de', 0.5759478807449341),\n",
       " ('Villebon_Sur_Yvette', 0.5739676356315613),\n",
       " ('extradites_Noriega', 0.5631396174430847),\n",
       " ('Belgium', 0.5630872845649719),\n",
       " ('Dordogne_region', 0.5518567562103271),\n",
       " ('called_Xynthia_blew', 0.5481809973716736),\n",
       " ('Evian_Les_Bains', 0.5411219000816345),\n",
       " ('Nantes', 0.5385209321975708),\n",
       " ('Anny_Cazenave', 0.5256197452545166)]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model_google_news.most_similar(positive=['Paris', 'France'], negative=['Rome'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('brother', 0.8007099628448486),\n",
       " ('nephew', 0.7814147472381592),\n",
       " ('younger_brother', 0.7780234217643738),\n",
       " ('eldest_son', 0.769737720489502),\n",
       " ('uncle', 0.7542626857757568),\n",
       " ('grandson', 0.7425380349159241),\n",
       " ('sons', 0.7106431722640991),\n",
       " ('grandfather', 0.6886354684829712),\n",
       " ('dad', 0.6858929395675659),\n",
       " ('elder_brother', 0.6622239947319031)]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model_google_news.most_similar(positive=['father', 'son'], negative=['mother'], topn=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 - Compare the results between models\n",
    "\n",
    "Now that we have tried two models, which one seems to give more accurate answers?   \n",
    "Why ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}