{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab : Using Word2Vec pre-trained models\n", "We will learn to use widely available pre-trained models.\n", "\n", "[Open in Colab](https://colab.research.google.com/github/elephantscale/cool-ML-demos/blob/main/nlp/word2vec-2-pre-trained-models.ipynb)\n", "\n", "### Runtime \n", "20 mins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1 - Data\n", "\n", "https://github.com/RaRe-Technologies/gensim-data is a central repository of data and pre-trained models for Gensim.\n", "\n", "### A few notable datasets:\n", "- '20-newsgroups' : newsgroup postings (size : 13 MB)\n", "- 'patent-2017' : US patents (size : ~3 GB)\n", "- 'wiki-english-20171001' : Wikipedia dump from October 2017 (size : ~6 GB)\n", "\n", "\n", "### Popular pre-trained word vector models\n", "\n", "| model | size | number of vectors | description |\n", "|--------------------------|----------|-------------------|------------------------------------------------------|\n", "| glove-twitter-25 | 104 MB | 1,193,514 | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) |\n", "| word2vec-google-news-300 | 1,662 MB | 3,000,000 | Google News (about 100 billion words) |\n", "| glove-wiki-gigaword-300 | 376 MB | 400,000 | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) |\n", "\n", "\n", "**TODO : Inspect some of the models available**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - Tweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 - Available Models" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Get a list of models available\n", "\n", "import gensim.downloader as downloader\n", "from pprint import pprint\n", "\n", "info = downloader.info()\n", "# print(info)\n", "# pprint (info)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 - Download GloVe-Twitter-25" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ 
{ "name": "stdout", "output_type": "stream", "text": [ "<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x7f03c5b424a8>\n", "CPU times: user 41.7 s, sys: 262 ms, total: 41.9 s\n", "Wall time: 42.1 s\n" ] } ], "source": [ "%%time\n", "# this will download and load the model\n", "# data will be downloaded to ~/gensim-data\n", "model_glove_tw25 = downloader.load(\"glove-twitter-25\") \n", "print(model_glove_tw25)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 - Query the model" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('dog', 0.9590819478034973),\n", " ('monkey', 0.9203578233718872),\n", " ('bear', 0.9143137335777283),\n", " ('pet', 0.9108031392097473),\n", " ('girl', 0.8880630135536194),\n", " ('horse', 0.8872727155685425),\n", " ('kitty', 0.8870542049407959),\n", " ('puppy', 0.886769711971283),\n", " ('hot', 0.8865255117416382),\n", " ('lady', 0.8845518827438354)]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_glove_tw25.most_similar('cat')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "similarity(woman, man) : 0.76541775\n", "similarity(girl, boy) : 0.95961404\n", "similarity(prince, princess) : 0.87682575\n" ] } ], "source": [ "#printing similarity index\n", "print (\"similarity(woman, man) : \", model_glove_tw25.similarity('woman', 'man'))\n", "print (\"similarity(girl, boy) : \", model_glove_tw25.similarity('girl', 'boy'))\n", "print (\"similarity(prince, princess) : \", model_glove_tw25.similarity('prince', 'princess'))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('meets', 0.8841923475265503),\n", " ('prince', 0.832163393497467),\n", " ('queen', 0.8257461190223694),\n", " ('’s', 0.8174097537994385),\n", " ('crow', 0.8134994506835938),\n", " ('hunter', 
0.8131038546562195),\n", " ('father', 0.811583399772644),\n", " ('soldier', 0.8111359477043152),\n", " ('mercy', 0.8082392811775208),\n", " ('hero', 0.8082262873649597)]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_glove_tw25.most_similar(positive=['woman', 'king'], negative=['man'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3 - Google News Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 - Download\n", "Large download (1.6 GB) \n", "Sometimes you might run out of memory!" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x7f0384f14518>\n", "CPU times: user 1min 49s, sys: 2.58 s, total: 1min 52s\n", "Wall time: 1min 51s\n" ] } ], "source": [ "%%time\n", "# this will download and load the model\n", "# data will be downloaded to ~/gensim-data\n", "model_google_news = downloader.load(\"word2vec-google-news-300\") \n", "print(model_google_news)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 - Query" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('cats', 0.8099379539489746),\n", " ('dog', 0.7609456777572632),\n", " ('kitten', 0.7464985251426697),\n", " ('feline', 0.7326233983039856),\n", " ('beagle', 0.7150583267211914),\n", " ('puppy', 0.7075453996658325),\n", " ('pup', 0.6934291124343872),\n", " ('pet', 0.6891531348228455),\n", " ('felines', 0.6755931377410889),\n", " ('chihuahua', 0.6709762215614319)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_google_news.most_similar('cat')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "similarity(woman, man) : 0.76640123\n", "similarity(girl, boy) : 
0.8543272\n", "similarity(prince, princess) : 0.69865096\n" ] } ], "source": [ "#printing similarity index\n", "print (\"similarity(woman, man) : \", model_google_news.similarity('woman', 'man'))\n", "print (\"similarity(girl, boy) : \", model_google_news.similarity('girl', 'boy'))\n", "print (\"similarity(prince, princess) : \", model_google_news.similarity('prince', 'princess'))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('queen', 0.7118192911148071),\n", " ('monarch', 0.6189674139022827),\n", " ('princess', 0.5902431011199951),\n", " ('crown_prince', 0.5499460697174072),\n", " ('prince', 0.5377321243286133),\n", " ('kings', 0.5236844420433044),\n", " ('Queen_Consort', 0.5235945582389832),\n", " ('queens', 0.518113374710083),\n", " ('sultan', 0.5098593235015869),\n", " ('monarchy', 0.5087411999702454)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_google_news.most_similar(positive=['woman', 'king'], negative=['man'])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('French', 0.6156964302062988),\n", " ('PARIS_AFX_Gaz_de', 0.5759478807449341),\n", " ('Villebon_Sur_Yvette', 0.5739676356315613),\n", " ('extradites_Noriega', 0.5631396174430847),\n", " ('Belgium', 0.5630872845649719),\n", " ('Dordogne_region', 0.5518567562103271),\n", " ('called_Xynthia_blew', 0.5481809973716736),\n", " ('Evian_Les_Bains', 0.5411219000816345),\n", " ('Nantes', 0.5385209321975708),\n", " ('Anny_Cazenave', 0.5256197452545166)]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_google_news.most_similar(positive=['Paris', 'France'], negative=['Rome'])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('brother', 0.8007099628448486),\n", " ('nephew', 0.7814147472381592),\n", " ('younger_brother', 
0.7780234217643738),\n", " ('eldest_son', 0.769737720489502),\n", " ('uncle', 0.7542626857757568),\n", " ('grandson', 0.7425380349159241),\n", " ('sons', 0.7106431722640991),\n", " ('grandfather', 0.6886354684829712),\n", " ('dad', 0.6858929395675659),\n", " ('elder_brother', 0.6622239947319031)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_google_news.most_similar(positive=['father', 'son'], negative=['mother'], topn=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4 - Compare the results between models\n", "\n", "Now that we have tried two models, which one seems to give more accurate answers? \n", "Why?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }