{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Training_embeddings_using_gensim.ipynb", "provenance": [], "toc_visible": true, "machine_shape": "hm" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "8G1t37lcGSKK" }, "source": [ "# Training Embeddings Using Gensim and FastText\n", "> Word embeddings are an approach to representing text in NLP. In this notebook we will demonstrate how to train embeddings both CBOW and SkipGram methods using Genism and Fasttext.\n", "\n", "- toc: true\n", "- badges: true\n", "- comments: true\n", "- categories: [Concept, Embedding, Gensim, FastText]\n", "- author: \"Quantum Stat\"\n", "- image:" ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:26:40.863650Z", "start_time": "2021-04-05T21:26:40.339123Z" }, "id": "TBw9OCYcYQ_n" }, "source": [ "from gensim.models import Word2Vec\n", "import warnings\n", "warnings.filterwarnings('ignore')" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:26:40.894143Z", "start_time": "2021-04-05T21:26:40.865114Z" }, "id": "5qWptd54ZcfV" }, "source": [ "# define training data\n", "#Genism word2vec requires that a format of ‘list of lists’ be provided for training where every document contained in a list.\n", "#Every list contains lists of tokens of that document.\n", "corpus = [['dog','bites','man'], [\"man\", \"bites\" ,\"dog\"],[\"dog\",\"eats\",\"meat\"],[\"man\", \"eats\",\"food\"]]\n", "\n", "#Training the model\n", "model_cbow = Word2Vec(corpus, min_count=1,sg=0) #using CBOW Architecture for trainnig\n", "model_skipgram = Word2Vec(corpus, min_count=1,sg=1)#using skipGram Architecture for training " ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "0QjSxefPl4mh" }, "source": [ "## Continuous Bag of Words (CBOW) \n", "In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears." 
] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:26:56.724662Z", "start_time": "2021-04-05T21:26:56.712651Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 486 }, "id": "nyZY8ME4lUjd", "outputId": "bd00e825-c11a-4b36-dbf5-80f32c659956" }, "source": [ "#Summarize the loaded model\n", "print(model_cbow)\n", "\n", "#Summarize vocabulary\n", "words = list(model_cbow.wv.vocab)\n", "print(words)\n", "\n", "#Acess vector for one word\n", "print(model_cbow['dog'])" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Word2Vec(vocab=6, size=100, alpha=0.025)\n", "['dog', 'bites', 'man', 'eats', 'meat', 'food']\n", "[-3.1667745e-03 2.5268614e-03 -4.9504861e-03 2.3797194e-03\n", " -3.3511904e-03 1.7659335e-03 -9.6838089e-04 3.6862001e-03\n", " 3.3760078e-03 -1.1944126e-03 -4.7475514e-03 -4.6677454e-03\n", " 4.7231275e-03 2.1875298e-03 4.9989321e-03 -4.7024325e-04\n", " 4.6936749e-03 4.5417100e-03 -4.8383311e-03 4.5522186e-03\n", " 9.4010920e-04 -2.8778350e-03 -2.3938445e-03 7.6240452e-04\n", " 2.8537741e-05 -1.0585956e-03 1.5203804e-03 1.1994856e-04\n", " 4.3881699e-03 3.5755127e-04 1.9964906e-03 -3.3893189e-03\n", " 2.5362791e-03 -3.8559963e-03 -4.6814438e-03 -1.0485576e-03\n", " 1.9576577e-03 -5.4296525e-04 2.5505766e-03 1.4563937e-03\n", " 1.1214090e-03 3.1200200e-03 3.5230191e-03 4.4931062e-03\n", " -5.5389071e-04 1.6268899e-03 -4.6736463e-03 -1.9612674e-04\n", " 1.5486709e-03 -3.5581242e-03 1.5163666e-03 2.2859944e-03\n", " -3.5728619e-03 -3.5505979e-03 7.8282715e-04 -4.8093311e-03\n", " -3.1324120e-03 -3.6213300e-03 -1.4478542e-03 3.4006054e-03\n", " 2.2276146e-03 -4.1698264e-03 -3.6997625e-03 -4.1264743e-03\n", " -4.9103238e-03 -2.2635974e-03 -3.9036905e-03 3.8846405e-03\n", " -7.9726276e-05 -2.0692295e-03 -3.0645117e-04 -3.0288144e-03\n", " -3.4682599e-03 -3.1768843e-03 -1.1148058e-03 -2.8012963e-03\n", " -6.5973290e-04 -2.3705217e-03 4.3961490e-03 3.2166531e-03\n", " 3.6933657e-04 -6.2054797e-04 2.0661615e-04 3.7390803e-04\n", " -3.5061471e-03 3.6587315e-03 2.1328868e-03 -2.5964181e-03\n", " 4.3381471e-03 4.0168604e-03 1.8054987e-03 -1.2192487e-03\n", " 1.5615283e-03 -1.8635839e-03 2.9529419e-03 -3.3825964e-03\n", " -3.2592549e-03 -4.7523994e-04 -5.3210353e-04 -9.8173530e-04]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:26:57.420196Z", "start_time": "2021-04-05T21:26:57.417193Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "id": "gMuHv52GeuoR", "outputId": "b498032d-6f9d-485b-a3cc-5a21300bfb06" }, "source": [ "#Compute similarity \n", "print(\"Similarity between eats and bites:\",model_cbow.similarity('eats', 'bites'))\n", "print(\"Similarity between eats and man:\",model_cbow.similarity('eats', 'man'))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Similarity between eats and bites: -0.09852024\n", "Similarity between eats and man: -0.17088428\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "twhTZfPOezTU" }, "source": [ "From the above similarity scores we can conclude that eats is more similar to bites than man." 
] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:26:59.635831Z", "start_time": "2021-04-05T21:26:59.621818Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 104 }, "id": "5Lv0V7WofmsB", "outputId": "00600b23-d9a6-4f14-bacd-395be85076c8" }, "source": [ "#Most similarity\n", "model_cbow.most_similar('meat')" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('bites', 0.1353721022605896),\n", " ('man', 0.1094527617096901),\n", " ('food', -0.02215239405632019),\n", " ('dog', -0.1444159597158432),\n", " ('eats', -0.16309654712677002)]" ] }, "metadata": { "tags": [] }, "execution_count": 5 } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:26:59.855822Z", "start_time": "2021-04-05T21:26:59.841810Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "id": "WA783nrSalgs", "outputId": "80d6e23f-2bed-47d7-f925-4aaa87ec5f9e" }, "source": [ "# save model\n", "model_cbow.save('model_cbow.bin')\n", "\n", "# load model\n", "new_model_cbow = Word2Vec.load('model_cbow.bin')\n", "print(new_model_cbow)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Word2Vec(vocab=6, size=100, alpha=0.025)\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "deReLSI7mQyr" }, "source": [ "## SkipGram\n", "In skipgram, the task is to predict the context words from the center word." ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:27:00.517046Z", "start_time": "2021-04-05T21:27:00.508038Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 486 }, "id": "9QtUtsLglvY0", "outputId": "6d19902b-66aa-4b0f-9f12-be18f37d40d1" }, "source": [ "#Summarize the loaded model\n", "print(model_skipgram)\n", "\n", "#Summarize vocabulary\n", "words = list(model_skipgram.wv.vocab)\n", "print(words)\n", "\n", "#Acess vector for one word\n", "print(model_skipgram['dog'])" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Word2Vec(vocab=6, size=100, alpha=0.025)\n", "['dog', 'bites', 'man', 'eats', 'meat', 'food']\n", "[-3.1667745e-03 2.5268614e-03 -4.9504861e-03 2.3797194e-03\n", " -3.3511904e-03 1.7659335e-03 -9.6838089e-04 3.6862001e-03\n", " 3.3760078e-03 -1.1944126e-03 -4.7475514e-03 -4.6677454e-03\n", " 4.7231275e-03 2.1875298e-03 4.9989321e-03 -4.7024325e-04\n", " 4.6936749e-03 4.5417100e-03 -4.8383311e-03 4.5522186e-03\n", " 9.4010920e-04 -2.8778350e-03 -2.3938445e-03 7.6240452e-04\n", " 2.8537741e-05 -1.0585956e-03 1.5203804e-03 1.1994856e-04\n", " 4.3881699e-03 3.5755127e-04 1.9964906e-03 -3.3893189e-03\n", " 2.5362791e-03 -3.8559963e-03 -4.6814438e-03 -1.0485576e-03\n", " 1.9576577e-03 -5.4296525e-04 2.5505766e-03 1.4563937e-03\n", " 1.1214090e-03 3.1200200e-03 3.5230191e-03 4.4931062e-03\n", " -5.5389071e-04 1.6268899e-03 -4.6736463e-03 -1.9612674e-04\n", " 1.5486709e-03 -3.5581242e-03 1.5163666e-03 2.2859944e-03\n", " -3.5728619e-03 -3.5505979e-03 7.8282715e-04 -4.8093311e-03\n", " -3.1324120e-03 -3.6213300e-03 -1.4478542e-03 3.4006054e-03\n", " 2.2276146e-03 -4.1698264e-03 -3.6997625e-03 -4.1264743e-03\n", " -4.9103238e-03 -2.2635974e-03 -3.9036905e-03 3.8846405e-03\n", " -7.9726276e-05 -2.0692295e-03 -3.0645117e-04 -3.0288144e-03\n", " -3.4682599e-03 -3.1768843e-03 -1.1148058e-03 -2.8012963e-03\n", " -6.5973290e-04 -2.3705217e-03 4.3961490e-03 3.2166531e-03\n", " 3.6933657e-04 -6.2054797e-04 2.0661615e-04 3.7390803e-04\n", " -3.5061471e-03 
3.6587315e-03 2.1328868e-03 -2.5964181e-03\n", " 4.3381471e-03 4.0168604e-03 1.8054987e-03 -1.2192487e-03\n", " 1.5615283e-03 -1.8635839e-03 2.9529419e-03 -3.3825964e-03\n", " -3.2592549e-03 -4.7523994e-04 -5.3210353e-04 -9.8173530e-04]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:27:02.660747Z", "start_time": "2021-04-05T21:27:02.642866Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "id": "8YUsblEOfFWf", "outputId": "14cd759c-d5fc-465f-ed20-8fd1a1949168" }, "source": [ "#Compute similarity \n", "print(\"Similarity between eats and bites:\",model_skipgram.similarity('eats', 'bites'))\n", "print(\"Similarity between eats and man:\",model_skipgram.similarity('eats', 'man'))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Similarity between eats and bites: -0.09852936\n", "Similarity between eats and man: -0.17089055\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "gdXVDePKnBpv" }, "source": [ "From the above similarity scores we can conclude that eats is more similar to bites than man." ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:27:03.419546Z", "start_time": "2021-04-05T21:27:03.414541Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 104 }, "id": "lpF4qtwpmuM3", "outputId": "f3bc68f6-3768-4a4d-e5bc-bb3dff6f654f" }, "source": [ "#Most similarity\n", "model_skipgram.most_similar('meat')" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('bites', 0.1353721022605896),\n", " ('man', 0.10945276916027069),\n", " ('food', -0.022152386605739594),\n", " ('dog', -0.1444159746170044),\n", " ('eats', -0.16317100822925568)]" ] }, "metadata": { "tags": [] }, "execution_count": 9 } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:27:03.973454Z", "start_time": "2021-04-05T21:27:03.950433Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "id": "aNDCEXRTnAnj", "outputId": "402f77b6-0625-4b37-e135-3650df626007" }, "source": [ "# save model\n", "model_skipgram.save('model_skipgram.bin')\n", "\n", "# load model\n", "new_model_skipgram = Word2Vec.load('model_skipgram.bin')\n", "print(new_model_skipgram)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Word2Vec(vocab=6, size=100, alpha=0.025)\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "b0MiqJ_1M0mX" }, "source": [ "## Training Your Embedding on Wiki Corpus\n", "\n", "##### The corpus download page : https://dumps.wikimedia.org/enwiki/20200120/\n", "The entire wiki corpus as of 28/04/2020 is just over 16GB in size.\n", "We will take a part of this corpus due to computation constraints and train our word2vec and fasttext embeddings.\n", "\n", "The file size is 294MB so it can take a while to download.\n", "\n", "Source for code which downloads files from Google Drive: https://stackoverflow.com/questions/25010369/wget-curl-large-file-from-google-drive/39225039#39225039" ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:27:58.596845Z", "start_time": "2021-04-05T21:27:58.585833Z" }, "id": "60UO41DfGPL0", "outputId": "262cce44-03e5-46c8-861a-c9da76306c23" }, "source": [ "import os\n", "import requests\n", "\n", "os.makedirs('data/en', exist_ok= True)\n", "file_name = \"data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2\"\n", 
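"# file_id is the Google Drive ID of the corpus slice described above\n",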
"file_id = \"11804g0GcWnBIVDahjo5fQyc05nQLXGwF\"\n", "\n", "def download_file_from_google_drive(id, destination):\n", " URL = \"https://docs.google.com/uc?export=download\"\n", "\n", " session = requests.Session()\n", "\n", " response = session.get(URL, params = { 'id' : id }, stream = True)\n", " token = get_confirm_token(response)\n", "\n", " if token:\n", " params = { 'id' : id, 'confirm' : token }\n", " response = session.get(URL, params = params, stream = True)\n", "\n", " save_response_content(response, destination) \n", "\n", "def get_confirm_token(response):\n", " for key, value in response.cookies.items():\n", " if key.startswith('download_warning'):\n", " return value\n", "\n", " return None\n", "\n", "def save_response_content(response, destination):\n", " CHUNK_SIZE = 32768\n", "\n", " with open(destination, \"wb\") as f:\n", " for chunk in response.iter_content(CHUNK_SIZE):\n", " if chunk: # filter out keep-alive new chunks\n", " f.write(chunk)\n", "\n", "if not os.path.exists(file_name):\n", " download_file_from_google_drive(file_id, file_name)\n", "else:\n", " print(\"file already exists, skipping download\")\n", "\n", "print(f\"File at: {file_name}\")" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "file already exists, skipping download\n", "File at: data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T08:59:17.024306Z", "start_time": "2021-04-03T08:59:17.022304Z" }, "id": "wX1kx96JLYvt" }, "source": [ "from gensim.corpora.wikicorpus import WikiCorpus\n", "from gensim.models.word2vec import Word2Vec\n", "from gensim.models.fasttext import FastText\n", "import time" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T09:56:14.722195Z", "start_time": "2021-04-03T09:56:14.705177Z" }, "id": "rJgsEUmRPppc" }, "source": [ "#Preparing the Training data\n", "wiki = WikiCorpus(file_name, lemmatize=False, dictionary={})\n", "sentences = list(wiki.get_texts())\n", "\n", "#if you get a memory error executing the lines above\n", "#comment the lines out and uncomment the lines below. \n", "#loading will be slower, but stable.\n", "# wiki = WikiCorpus(file_name, processes=4, lemmatize=False, dictionary={})\n", "# sentences = list(wiki.get_texts())\n", "\n", "#if you still get a memory error, try settings processes to 1 or 2 and then run it again." ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "xsIrgt_gPQda" }, "source": [ "### Hyperparameters\n", "\n", "\n", "1. sg - Selecting the training algorithm: 1 for skip-gram else its 0 for CBOW. Default is CBOW.\n", "2. min_count- Ignores all words with total frequency lower than this.
\n", "There are many more hyperparamaeters whose list can be found in the official documentation [here.](https://radimrehurek.com/gensim/models/word2vec.html)\n" ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:01:20.065332Z", "start_time": "2021-04-03T09:59:12.350872Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "id": "idmfbr_8LvoN", "outputId": "f505a46e-025d-4169-f996-06c672008f81" }, "source": [ "#CBOW\n", "start = time.time()\n", "word2vec_cbow = Word2Vec(sentences,min_count=10, sg=0)\n", "end = time.time()\n", "\n", "print(\"CBOW Model Training Complete.\\nTime taken for training is:{:.2f} hrs \".format((end-start)/3600.0))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "CBOW Model Training Complete.\n", "Time taken for training is:0.04 hrs \n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:02:10.613551Z", "start_time": "2021-04-03T10:02:10.585535Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 471 }, "id": "mMdGn08-RkhM", "outputId": "efb34148-3fb4-435c-f070-8493708fc07a" }, "source": [ "#Summarize the loaded model\n", "print(word2vec_cbow)\n", "print(\"-\"*30)\n", "\n", "#Summarize vocabulary\n", "words = list(word2vec_cbow.wv.vocab)\n", "print(f\"Length of vocabulary: {len(words)}\")\n", "print(\"Printing the first 30 words.\")\n", "print(words[:30])\n", "print(\"-\"*30)\n", "\n", "#Acess vector for one word\n", "print(f\"Length of vector: {len(word2vec_cbow['film'])}\")\n", "print(word2vec_cbow['film'])\n", "print(\"-\"*30)\n", "\n", "#Compute similarity \n", "print(\"Similarity between film and drama:\",word2vec_cbow.similarity('film', 'drama'))\n", "print(\"Similarity between film and tiger:\",word2vec_cbow.similarity('film', 'tiger'))\n", "print(\"-\"*30)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Word2Vec(vocab=111150, size=100, alpha=0.025)\n", "------------------------------\n", "Length of vocabulary: 111150\n", "Printing the first 30 words.\n", "['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']\n", "------------------------------\n", "Length of vector: 100\n", "[-0.25941572 -1.6287326 2.5331333 -1.5818936 0.9024474 0.8614945\n", " 2.4875445 -0.95802265 -1.3792082 -1.1744157 -4.300686 1.0071316\n", " 0.10418405 4.855032 0.6251962 -0.06472338 0.19993098 -0.7291219\n", " 2.342258 -1.7298651 0.7895099 -2.2819378 0.7158192 -0.62419826\n", " 0.6720258 3.6712303 1.3836899 0.17808275 -3.7205396 0.2529162\n", " 1.0290879 -0.9228959 0.9451632 1.7415334 1.9618814 1.4535053\n", " 2.670452 0.9272077 0.25056183 -0.4078236 0.5795217 0.6316829\n", " 0.50204426 -0.19865237 -2.697352 0.75351495 1.0796617 2.247825\n", " -2.956658 2.6606686 -0.42392135 -0.44319883 -2.9274392 -1.0198026\n", " 1.404833 -0.10840467 0.50829273 1.0767945 -0.65002084 -3.4231277\n", " 4.719826 -1.5996053 0.82882935 1.635043 -0.45730942 -1.3166244\n", " -1.3349417 -2.3565981 1.7141095 -2.6643796 -1.2148786 0.2972199\n", " -2.2865987 -1.6022073 2.0965865 -0.87479544 -1.4143106 -0.9149557\n", " 2.2900226 1.1464663 -2.6113467 -1.5517493 1.3018385 4.1072307\n", " 1.1441547 1.0222696 0.4847384 2.4148073 -2.881392 -0.67044157\n", " -2.482836 -0.417894 3.1442287 -1.6087203 1.865813 
-3.717568\n", " 0.5994761 1.8819104 3.355772 -1.9087372 ]\n", "------------------------------\n", "Similarity between film and drama: 0.4986632\n", "Similarity between film and tiger: 0.15477756\n", "------------------------------\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:02:16.109851Z", "start_time": "2021-04-03T10:02:15.257052Z" }, "id": "rXrDOrKskcHX" }, "source": [ "# save model\n", "from gensim.models import Word2Vec, KeyedVectors \n", "word2vec_cbow.wv.save_word2vec_format('word2vec_cbow.bin', binary=True)\n", "\n", "# load model\n", "# new_modelword2vec_cbow = Word2Vec.load('word2vec_cbow.bin')\n", "# print(word2vec_cbow)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:08:27.736688Z", "start_time": "2021-04-03T10:02:19.197708Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "id": "dX0U0CbQOK30", "outputId": "b9bfcf2b-91cb-40d9-ca92-791ec346aef4" }, "source": [ "#SkipGram\n", "start = time.time()\n", "word2vec_skipgram = Word2Vec(sentences,min_count=10, sg=1)\n", "end = time.time()\n", "\n", "print(\"SkipGram Model Training Complete\\nTime taken for training is:{:.2f} hrs \".format((end-start)/3600.0))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "SkipGram Model Training Complete\n", "Time taken for training is:0.10 hrs \n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:09:06.406929Z", "start_time": "2021-04-03T10:09:06.383908Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 471 }, "id": "LXnY9YInSvnI", "outputId": "26f1dab7-27a6-4655-81c7-ac6f08fe1f9c" }, "source": [ "#Summarize the loaded model\n", "print(word2vec_skipgram)\n", "print(\"-\"*30)\n", "\n", "#Summarize vocabulary\n", "words = list(word2vec_skipgram.wv.vocab)\n", "print(f\"Length of vocabulary: {len(words)}\")\n", "print(\"Printing the first 30 words.\")\n", "print(words[:30])\n", "print(\"-\"*30)\n", "\n", "#Acess vector for one word\n", "print(f\"Length of vector: {len(word2vec_skipgram['film'])}\")\n", "print(word2vec_skipgram['film'])\n", "print(\"-\"*30)\n", "\n", "#Compute similarity \n", "print(\"Similarity between film and drama:\",word2vec_skipgram.similarity('film', 'drama'))\n", "print(\"Similarity between film and tiger:\",word2vec_skipgram.similarity('film', 'tiger'))\n", "print(\"-\"*30)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Word2Vec(vocab=111150, size=100, alpha=0.025)\n", "------------------------------\n", "Length of vocabulary: 111150\n", "Printing the first 30 words.\n", "['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']\n", "------------------------------\n", "Length of vector: 100\n", "[ 1.94889292e-01 -7.88324535e-01 4.66947220e-02 2.57520348e-01\n", " 2.65304267e-01 3.63538593e-01 4.63590741e-01 -1.62654325e-01\n", " 9.11010578e-02 -6.58479631e-02 -6.97350129e-02 -6.56900406e-02\n", " 2.19506964e-01 2.20394313e-01 1.05092540e-01 8.26439075e-03\n", " -9.39796269e-02 5.50851583e-01 7.65753444e-04 -2.22807571e-01\n", " -3.17346871e-01 3.20529372e-01 4.51157093e-02 -1.93709806e-01\n", " 2.07626969e-02 1.69344515e-01 2.77250055e-02 1.10369585e-02\n", " 
-4.75540310e-01 1.10796697e-01 4.28172469e-01 4.06191871e-02\n", " 5.15495241e-01 -6.85295224e-01 -5.06723702e-01 -4.52192919e-03\n", " 1.51265517e-03 -3.84557724e-01 -2.22782314e-01 5.11201501e-01\n", " 1.42252162e-01 -7.73397386e-01 -2.78606623e-01 4.70017433e-01\n", " -2.70037323e-01 5.04850507e-01 -1.48356587e-01 2.26073325e-01\n", " -3.36060971e-01 -1.19667962e-01 -2.59654850e-01 -4.44965392e-01\n", " 1.11614995e-01 1.62986945e-02 4.82374012e-01 -7.87460804e-02\n", " -1.13825299e-01 -2.24003598e-01 4.93353546e-01 -5.57069406e-02\n", " 2.43176505e-01 -1.84876159e-01 2.13489812e-02 3.42909366e-01\n", " 2.02496469e-01 -4.25657362e-01 8.17572057e-01 -2.83644646e-01\n", " -5.23434244e-02 -3.27616245e-01 4.43994589e-02 -3.90237272e-01\n", " 2.12029487e-01 -7.25788534e-01 5.52469850e-01 -4.72590374e-03\n", " -2.02829018e-01 -9.59078223e-03 3.68973225e-01 -2.69762665e-01\n", " -2.85591751e-01 -2.68359333e-01 3.10093671e-01 2.02198789e-01\n", " 5.80960453e-01 -2.47493789e-01 -7.37856887e-03 -3.59723950e-03\n", " 3.14893663e-01 1.12885557e-01 -5.09416103e-01 -7.58459032e-01\n", " 5.30587435e-01 -1.51896626e-01 -3.37440372e-01 4.22841489e-01\n", " -3.34523350e-01 3.21759552e-01 7.44457126e-01 -1.26014173e-01]\n", "------------------------------\n", "Similarity between film and drama: 0.63833964\n", "Similarity between film and tiger: 0.22270091\n", "------------------------------\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:09:09.947695Z", "start_time": "2021-04-03T10:09:09.076901Z" }, "id": "o8U7bfPSVB04" }, "source": [ "# save model\n", "word2vec_skipgram.wv.save_word2vec_format('word2vec_sg.bin', binary=True)\n", "\n", "# load model\n", "# new_model_skipgram = Word2Vec.load('model_skipgram.bin')\n", "# print(model_skipgram)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "kExlA8kfrKml" }, "source": [ "## FastText" ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:16:31.271764Z", "start_time": "2021-04-03T10:09:16.592670Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "id": "JPd2VhMEk8gL", "outputId": "55c44bdd-d7d8-4df2-8140-cdd442bbd68c" }, "source": [ "#CBOW\n", "start = time.time()\n", "fasttext_cbow = FastText(sentences, sg=0, min_count=10)\n", "end = time.time()\n", "\n", "print(\"FastText CBOW Model Training Complete\\nTime taken for training is:{:.2f} hrs \".format((end-start)/3600.0))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "FastText CBOW Model Training Complete\n", "Time taken for training is:0.12 hrs \n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:16:31.287283Z", "start_time": "2021-04-03T10:16:31.273765Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 471 }, "id": "FlQFl8-Zsost", "outputId": "6472e944-e6de-4d64-8c6f-14475ef1eac5" }, "source": [ "#Summarize the loaded model\n", "print(fasttext_cbow)\n", "print(\"-\"*30)\n", "\n", "#Summarize vocabulary\n", "words = list(fasttext_cbow.wv.vocab)\n", "print(f\"Length of vocabulary: {len(words)}\")\n", "print(\"Printing the first 30 words.\")\n", "print(words[:30])\n", "print(\"-\"*30)\n", "\n", "#Acess vector for one word\n", "print(f\"Length of vector: {len(fasttext_cbow['film'])}\")\n", "print(fasttext_cbow['film'])\n", "print(\"-\"*30)\n", "\n", "#Compute similarity \n", "print(\"Similarity between film and drama:\",fasttext_cbow.similarity('film', 
'drama'))\n", "print(\"Similarity between film and tiger:\",fasttext_cbow.similarity('film', 'tiger'))\n", "print(\"-\"*30)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "FastText(vocab=111150, size=100, alpha=0.025)\n", "------------------------------\n", "Length of vocabulary: 111150\n", "Printing the first 30 words.\n", "['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']\n", "------------------------------\n", "Length of vector: 100\n", "[ 0.47473213 1.6783198 -4.766255 -3.2404876 0.80164665 1.993539\n", " 3.4226568 -0.7035685 -3.0426116 1.5137119 3.8207133 1.3821473\n", " -0.7379625 -0.6726444 1.8303355 -2.1288188 1.2368282 -3.0745962\n", " 1.4226121 -2.8884995 7.2847705 -1.564321 2.869352 0.6962616\n", " 4.469778 2.5569658 2.621335 -4.612509 -2.2389078 3.6648748\n", " 0.7189718 1.0702186 -3.175641 2.7648733 0.13811935 -2.441776\n", " -3.9559126 -0.03163956 -1.1257534 -0.64402825 -1.5076644 -0.58919376\n", " -0.14338583 4.2466817 4.3784213 3.0076942 -5.972965 2.2950342\n", " -0.50719374 -3.916504 -2.1366098 -2.661619 2.3540869 2.1862476\n", " 5.1004434 4.1282 -4.164653 1.1288711 -4.001655 -4.051289\n", " 2.5718336 -0.40600455 3.8396242 2.214367 1.8413899 4.5216975\n", " -1.6419586 2.7617378 -2.0902452 2.598776 4.041824 -5.1805005\n", " -2.777213 -0.02546828 -0.07393612 -3.2800605 -2.9874747 -0.6490991\n", " 3.6039045 -1.4168853 3.6110177 -1.0872458 -0.6365031 -1.0161037\n", " 3.7344344 0.29839793 0.421953 -1.811646 1.3730506 7.575645\n", " 3.3998368 5.0335827 -0.2107324 -2.331183 0.19383769 3.0550041\n", " 4.1529713 3.988616 0.04955976 1.3424706 ]\n", "------------------------------\n", "Similarity between film and drama: 0.5669882\n", "Similarity between film and tiger: 0.24975622\n", "------------------------------\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:28:28.771383Z", "start_time": "2021-04-03T10:16:31.289284Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "id": "UgSOxsNklAvh", "outputId": "f491f83c-17b8-42ad-a225-479df8419578" }, "source": [ "#SkipGram\n", "start = time.time()\n", "fasttext_skipgram = FastText(sentences, sg=1, min_count=10)\n", "end = time.time()\n", "\n", "print(\"FastText SkipGram Model Training Complete\\nTime taken for training is:{:.2f} hrs \".format((end-start)/3600.0))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "FastText SkipGram Model Training Complete\n", "Time taken for training is:0.20 hrs \n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:28:28.803412Z", "start_time": "2021-04-03T10:28:28.773386Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 610 }, "id": "vFiTAP0PsQwi", "outputId": "a29ae2e3-5dbc-453a-f66b-ceca255a8652" }, "source": [ "#Summarize the loaded model\n", "print(fasttext_skipgram)\n", "print(\"-\"*30)\n", "\n", "#Summarize vocabulary\n", "words = list(fasttext_skipgram.wv.vocab)\n", "print(f\"Length of vocabulary: {len(words)}\")\n", "print(\"Printing the first 30 words.\")\n", "print(words[:30])\n", "print(\"-\"*30)\n", "\n", "#Acess vector for one word\n", "print(f\"Length of vector: {len(fasttext_skipgram['film'])}\")\n", "print(fasttext_skipgram['film'])\n", 
"print(\"-\"*30)\n", "\n", "#Compute similarity \n", "print(\"Similarity between film and drama:\",fasttext_skipgram.similarity('film', 'drama'))\n", "print(\"Similarity between film and tiger:\",fasttext_skipgram.similarity('film', 'tiger'))\n", "print(\"-\"*30)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "FastText(vocab=111150, size=100, alpha=0.025)\n", "------------------------------\n", "Length of vocabulary: 111150\n", "Printing the first 30 words.\n", "['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']\n", "------------------------------\n", "Length of vector: 100\n", "[-8.4101312e-02 -6.9478154e-04 3.3954462e-01 -3.6973858e-01\n", " 1.6844368e-01 3.4855682e-01 8.0026442e-01 -5.0405812e-01\n", " -6.0389137e-01 2.1694953e-02 4.0937051e-01 -3.5893116e-02\n", " -1.3717794e-01 4.0389201e-01 3.9567137e-01 2.4365921e-01\n", " 5.6551516e-02 -1.5994829e-01 -1.8148309e-01 -2.6480275e-01\n", " -4.8462763e-01 9.5473409e-02 -1.1126036e-02 -1.8805853e-01\n", " 2.4277805e-01 2.4251699e-01 -1.7501226e-01 -4.3078136e-01\n", " -3.6442232e-01 9.1702184e-03 -3.2344624e-01 -1.0232232e-01\n", " -5.2684498e-01 -2.7622378e-01 4.2112619e-01 -4.3196991e-02\n", " 3.1967857e-01 1.7001998e-01 3.3157614e-01 -2.4995559e-01\n", " -1.3239473e-01 -3.4502399e-01 2.1341468e-01 5.8890671e-01\n", " 1.7721146e-01 1.5974782e-01 -3.8579264e-01 -2.8241745e-01\n", " 6.7402735e-02 -7.1903253e-01 1.3665260e-01 -5.9633050e-02\n", " -5.9002697e-01 -6.1173952e-01 -1.0246418e-03 -5.1254374e-01\n", " -1.5101396e-01 1.6967247e-01 2.8125226e-01 -4.6728057e-01\n", " -5.4966863e-02 -1.3736627e-02 -1.5689149e-01 8.3176725e-02\n", " 1.8850440e-02 4.1858605e-01 -1.1376646e-02 -4.0758383e-02\n", " -1.7871203e-01 2.7792713e-01 5.5813068e-01 -3.5465869e-01\n", " 1.3662770e-01 2.5777066e-01 -3.0423281e-01 7.8141141e-01\n", " 1.1446947e-02 -4.0541172e-01 2.9406905e-01 6.0151044e-02\n", " 4.9637925e-02 -3.9679220e-01 4.5333567e-01 1.0888510e-02\n", " 2.7147910e-01 -1.7305572e-01 -2.8098795e-01 -6.1907400e-03\n", " -2.3080334e-01 5.8609635e-01 -1.0097053e-01 6.6119152e-01\n", " 1.8578811e-01 -5.9025098e-02 -5.3886050e-01 2.6664239e-01\n", " -2.2193529e-02 7.0487672e-01 3.9477929e-01 3.7981489e-01]\n", "------------------------------\n", "Similarity between film and drama: 0.626041\n", "Similarity between film and tiger: 0.27831402\n", "------------------------------\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "oArMIJzYOmUR" }, "source": [ "An interesting obeseravtion if you noticed is that CBOW trains faster than SkipGram in both cases.\n", "We will leave it to the user to figure out why. A hint would be to refer the working of CBOW and skipgram." ] } ] }