{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Training_embeddings_using_gensim.ipynb", "provenance": [], "toc_visible": true, "machine_shape": "hm" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "8G1t37lcGSKK" }, "source": [ "# Training Embeddings Using Gensim and FastText\n", "> Word embeddings are an approach to representing text in NLP. In this notebook we will demonstrate how to train embeddings both CBOW and SkipGram methods using Genism and Fasttext.\n", "\n", "- toc: true\n", "- badges: true\n", "- comments: true\n", "- categories: [Concept, Embedding, Gensim, FastText]\n", "- author: \"Quantum Stat\"\n", "- image:" ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:26:40.863650Z", "start_time": "2021-04-05T21:26:40.339123Z" }, "id": "TBw9OCYcYQ_n" }, "source": [ "from gensim.models import Word2Vec\n", "import warnings\n", "warnings.filterwarnings('ignore')" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:26:40.894143Z", "start_time": "2021-04-05T21:26:40.865114Z" }, "id": "5qWptd54ZcfV" }, "source": [ "# define training data\n", "#Genism word2vec requires that a format of ‘list of lists’ be provided for training where every document contained in a list.\n", "#Every list contains lists of tokens of that document.\n", "corpus = [['dog','bites','man'], [\"man\", \"bites\" ,\"dog\"],[\"dog\",\"eats\",\"meat\"],[\"man\", \"eats\",\"food\"]]\n", "\n", "#Training the model\n", "model_cbow = Word2Vec(corpus, min_count=1,sg=0) #using CBOW Architecture for trainnig\n", "model_skipgram = Word2Vec(corpus, min_count=1,sg=1)#using skipGram Architecture for training " ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "0QjSxefPl4mh" }, "source": [ "## Continuous Bag of Words (CBOW) \n", "In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears." 
] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:26:56.724662Z", "start_time": "2021-04-05T21:26:56.712651Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 486 }, "id": "nyZY8ME4lUjd", "outputId": "bd00e825-c11a-4b36-dbf5-80f32c659956" }, "source": [ "#Summarize the loaded model\n", "print(model_cbow)\n", "\n", "#Summarize vocabulary\n", "words = list(model_cbow.wv.vocab)\n", "print(words)\n", "\n", "#Acess vector for one word\n", "print(model_cbow['dog'])" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Word2Vec(vocab=6, size=100, alpha=0.025)\n", "['dog', 'bites', 'man', 'eats', 'meat', 'food']\n", "[-3.1667745e-03 2.5268614e-03 -4.9504861e-03 2.3797194e-03\n", " -3.3511904e-03 1.7659335e-03 -9.6838089e-04 3.6862001e-03\n", " 3.3760078e-03 -1.1944126e-03 -4.7475514e-03 -4.6677454e-03\n", " 4.7231275e-03 2.1875298e-03 4.9989321e-03 -4.7024325e-04\n", " 4.6936749e-03 4.5417100e-03 -4.8383311e-03 4.5522186e-03\n", " 9.4010920e-04 -2.8778350e-03 -2.3938445e-03 7.6240452e-04\n", " 2.8537741e-05 -1.0585956e-03 1.5203804e-03 1.1994856e-04\n", " 4.3881699e-03 3.5755127e-04 1.9964906e-03 -3.3893189e-03\n", " 2.5362791e-03 -3.8559963e-03 -4.6814438e-03 -1.0485576e-03\n", " 1.9576577e-03 -5.4296525e-04 2.5505766e-03 1.4563937e-03\n", " 1.1214090e-03 3.1200200e-03 3.5230191e-03 4.4931062e-03\n", " -5.5389071e-04 1.6268899e-03 -4.6736463e-03 -1.9612674e-04\n", " 1.5486709e-03 -3.5581242e-03 1.5163666e-03 2.2859944e-03\n", " -3.5728619e-03 -3.5505979e-03 7.8282715e-04 -4.8093311e-03\n", " -3.1324120e-03 -3.6213300e-03 -1.4478542e-03 3.4006054e-03\n", " 2.2276146e-03 -4.1698264e-03 -3.6997625e-03 -4.1264743e-03\n", " -4.9103238e-03 -2.2635974e-03 -3.9036905e-03 3.8846405e-03\n", " -7.9726276e-05 -2.0692295e-03 -3.0645117e-04 -3.0288144e-03\n", " -3.4682599e-03 -3.1768843e-03 -1.1148058e-03 -2.8012963e-03\n", " -6.5973290e-04 -2.3705217e-03 4.3961490e-03 3.2166531e-03\n", " 3.6933657e-04 -6.2054797e-04 2.0661615e-04 3.7390803e-04\n", " -3.5061471e-03 3.6587315e-03 2.1328868e-03 -2.5964181e-03\n", " 4.3381471e-03 4.0168604e-03 1.8054987e-03 -1.2192487e-03\n", " 1.5615283e-03 -1.8635839e-03 2.9529419e-03 -3.3825964e-03\n", " -3.2592549e-03 -4.7523994e-04 -5.3210353e-04 -9.8173530e-04]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:26:57.420196Z", "start_time": "2021-04-05T21:26:57.417193Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "id": "gMuHv52GeuoR", "outputId": "b498032d-6f9d-485b-a3cc-5a21300bfb06" }, "source": [ "#Compute similarity \n", "print(\"Similarity between eats and bites:\",model_cbow.similarity('eats', 'bites'))\n", "print(\"Similarity between eats and man:\",model_cbow.similarity('eats', 'man'))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Similarity between eats and bites: -0.09852024\n", "Similarity between eats and man: -0.17088428\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "twhTZfPOezTU" }, "source": [ "From the above similarity scores we can conclude that eats is more similar to bites than man." 
] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:26:59.635831Z", "start_time": "2021-04-05T21:26:59.621818Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 104 }, "id": "5Lv0V7WofmsB", "outputId": "00600b23-d9a6-4f14-bacd-395be85076c8" }, "source": [ "#Most similarity\n", "model_cbow.most_similar('meat')" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('bites', 0.1353721022605896),\n", " ('man', 0.1094527617096901),\n", " ('food', -0.02215239405632019),\n", " ('dog', -0.1444159597158432),\n", " ('eats', -0.16309654712677002)]" ] }, "metadata": { "tags": [] }, "execution_count": 5 } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:26:59.855822Z", "start_time": "2021-04-05T21:26:59.841810Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "id": "WA783nrSalgs", "outputId": "80d6e23f-2bed-47d7-f925-4aaa87ec5f9e" }, "source": [ "# save model\n", "model_cbow.save('model_cbow.bin')\n", "\n", "# load model\n", "new_model_cbow = Word2Vec.load('model_cbow.bin')\n", "print(new_model_cbow)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Word2Vec(vocab=6, size=100, alpha=0.025)\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "deReLSI7mQyr" }, "source": [ "## SkipGram\n", "In skipgram, the task is to predict the context words from the center word." ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:27:00.517046Z", "start_time": "2021-04-05T21:27:00.508038Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 486 }, "id": "9QtUtsLglvY0", "outputId": "6d19902b-66aa-4b0f-9f12-be18f37d40d1" }, "source": [ "#Summarize the loaded model\n", "print(model_skipgram)\n", "\n", "#Summarize vocabulary\n", "words = list(model_skipgram.wv.vocab)\n", "print(words)\n", "\n", "#Acess vector for one word\n", "print(model_skipgram['dog'])" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Word2Vec(vocab=6, size=100, alpha=0.025)\n", "['dog', 'bites', 'man', 'eats', 'meat', 'food']\n", "[-3.1667745e-03 2.5268614e-03 -4.9504861e-03 2.3797194e-03\n", " -3.3511904e-03 1.7659335e-03 -9.6838089e-04 3.6862001e-03\n", " 3.3760078e-03 -1.1944126e-03 -4.7475514e-03 -4.6677454e-03\n", " 4.7231275e-03 2.1875298e-03 4.9989321e-03 -4.7024325e-04\n", " 4.6936749e-03 4.5417100e-03 -4.8383311e-03 4.5522186e-03\n", " 9.4010920e-04 -2.8778350e-03 -2.3938445e-03 7.6240452e-04\n", " 2.8537741e-05 -1.0585956e-03 1.5203804e-03 1.1994856e-04\n", " 4.3881699e-03 3.5755127e-04 1.9964906e-03 -3.3893189e-03\n", " 2.5362791e-03 -3.8559963e-03 -4.6814438e-03 -1.0485576e-03\n", " 1.9576577e-03 -5.4296525e-04 2.5505766e-03 1.4563937e-03\n", " 1.1214090e-03 3.1200200e-03 3.5230191e-03 4.4931062e-03\n", " -5.5389071e-04 1.6268899e-03 -4.6736463e-03 -1.9612674e-04\n", " 1.5486709e-03 -3.5581242e-03 1.5163666e-03 2.2859944e-03\n", " -3.5728619e-03 -3.5505979e-03 7.8282715e-04 -4.8093311e-03\n", " -3.1324120e-03 -3.6213300e-03 -1.4478542e-03 3.4006054e-03\n", " 2.2276146e-03 -4.1698264e-03 -3.6997625e-03 -4.1264743e-03\n", " -4.9103238e-03 -2.2635974e-03 -3.9036905e-03 3.8846405e-03\n", " -7.9726276e-05 -2.0692295e-03 -3.0645117e-04 -3.0288144e-03\n", " -3.4682599e-03 -3.1768843e-03 -1.1148058e-03 -2.8012963e-03\n", " -6.5973290e-04 -2.3705217e-03 4.3961490e-03 3.2166531e-03\n", " 3.6933657e-04 -6.2054797e-04 2.0661615e-04 3.7390803e-04\n", " -3.5061471e-03 
3.6587315e-03 2.1328868e-03 -2.5964181e-03\n", " 4.3381471e-03 4.0168604e-03 1.8054987e-03 -1.2192487e-03\n", " 1.5615283e-03 -1.8635839e-03 2.9529419e-03 -3.3825964e-03\n", " -3.2592549e-03 -4.7523994e-04 -5.3210353e-04 -9.8173530e-04]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:27:02.660747Z", "start_time": "2021-04-05T21:27:02.642866Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "id": "8YUsblEOfFWf", "outputId": "14cd759c-d5fc-465f-ed20-8fd1a1949168" }, "source": [ "#Compute similarity \n", "print(\"Similarity between eats and bites:\",model_skipgram.similarity('eats', 'bites'))\n", "print(\"Similarity between eats and man:\",model_skipgram.similarity('eats', 'man'))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Similarity between eats and bites: -0.09852936\n", "Similarity between eats and man: -0.17089055\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "gdXVDePKnBpv" }, "source": [ "From the above similarity scores we can conclude that eats is more similar to bites than man." ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:27:03.419546Z", "start_time": "2021-04-05T21:27:03.414541Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 104 }, "id": "lpF4qtwpmuM3", "outputId": "f3bc68f6-3768-4a4d-e5bc-bb3dff6f654f" }, "source": [ "#Most similarity\n", "model_skipgram.most_similar('meat')" ], "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[('bites', 0.1353721022605896),\n", " ('man', 0.10945276916027069),\n", " ('food', -0.022152386605739594),\n", " ('dog', -0.1444159746170044),\n", " ('eats', -0.16317100822925568)]" ] }, "metadata": { "tags": [] }, "execution_count": 9 } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:27:03.973454Z", "start_time": "2021-04-05T21:27:03.950433Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "id": "aNDCEXRTnAnj", "outputId": "402f77b6-0625-4b37-e135-3650df626007" }, "source": [ "# save model\n", "model_skipgram.save('model_skipgram.bin')\n", "\n", "# load model\n", "new_model_skipgram = Word2Vec.load('model_skipgram.bin')\n", "print(new_model_skipgram)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Word2Vec(vocab=6, size=100, alpha=0.025)\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "b0MiqJ_1M0mX" }, "source": [ "## Training Your Embedding on Wiki Corpus\n", "\n", "##### The corpus download page : https://dumps.wikimedia.org/enwiki/20200120/\n", "The entire wiki corpus as of 28/04/2020 is just over 16GB in size.\n", "We will take a part of this corpus due to computation constraints and train our word2vec and fasttext embeddings.\n", "\n", "The file size is 294MB so it can take a while to download.\n", "\n", "Source for code which downloads files from Google Drive: https://stackoverflow.com/questions/25010369/wget-curl-large-file-from-google-drive/39225039#39225039" ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-05T21:27:58.596845Z", "start_time": "2021-04-05T21:27:58.585833Z" }, "id": "60UO41DfGPL0", "outputId": "262cce44-03e5-46c8-861a-c9da76306c23" }, "source": [ "import os\n", "import requests\n", "\n", "os.makedirs('data/en', exist_ok= True)\n", "file_name = \"data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2\"\n", 
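"# file_id is the Google Drive ID of the corpus slice described above\n",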
"file_id = \"11804g0GcWnBIVDahjo5fQyc05nQLXGwF\"\n", "\n", "def download_file_from_google_drive(id, destination):\n", " URL = \"https://docs.google.com/uc?export=download\"\n", "\n", " session = requests.Session()\n", "\n", " response = session.get(URL, params = { 'id' : id }, stream = True)\n", " token = get_confirm_token(response)\n", "\n", " if token:\n", " params = { 'id' : id, 'confirm' : token }\n", " response = session.get(URL, params = params, stream = True)\n", "\n", " save_response_content(response, destination) \n", "\n", "def get_confirm_token(response):\n", " for key, value in response.cookies.items():\n", " if key.startswith('download_warning'):\n", " return value\n", "\n", " return None\n", "\n", "def save_response_content(response, destination):\n", " CHUNK_SIZE = 32768\n", "\n", " with open(destination, \"wb\") as f:\n", " for chunk in response.iter_content(CHUNK_SIZE):\n", " if chunk: # filter out keep-alive new chunks\n", " f.write(chunk)\n", "\n", "if not os.path.exists(file_name):\n", " download_file_from_google_drive(file_id, file_name)\n", "else:\n", " print(\"file already exists, skipping download\")\n", "\n", "print(f\"File at: {file_name}\")" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "file already exists, skipping download\n", "File at: data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T08:59:17.024306Z", "start_time": "2021-04-03T08:59:17.022304Z" }, "id": "wX1kx96JLYvt" }, "source": [ "from gensim.corpora.wikicorpus import WikiCorpus\n", "from gensim.models.word2vec import Word2Vec\n", "from gensim.models.fasttext import FastText\n", "import time" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T09:56:14.722195Z", "start_time": "2021-04-03T09:56:14.705177Z" }, "id": "rJgsEUmRPppc" }, "source": [ "#Preparing the Training data\n", "wiki = WikiCorpus(file_name, lemmatize=False, dictionary={})\n", "sentences = list(wiki.get_texts())\n", "\n", "#if you get a memory error executing the lines above\n", "#comment the lines out and uncomment the lines below. \n", "#loading will be slower, but stable.\n", "# wiki = WikiCorpus(file_name, processes=4, lemmatize=False, dictionary={})\n", "# sentences = list(wiki.get_texts())\n", "\n", "#if you still get a memory error, try settings processes to 1 or 2 and then run it again." ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "xsIrgt_gPQda" }, "source": [ "### Hyperparameters\n", "\n", "\n", "1. sg - Selecting the training algorithm: 1 for skip-gram else its 0 for CBOW. Default is CBOW.\n", "2. min_count- Ignores all words with total frequency lower than this.
\n", "There are many more hyperparamaeters whose list can be found in the official documentation [here.](https://radimrehurek.com/gensim/models/word2vec.html)\n" ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:01:20.065332Z", "start_time": "2021-04-03T09:59:12.350872Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "id": "idmfbr_8LvoN", "outputId": "f505a46e-025d-4169-f996-06c672008f81" }, "source": [ "#CBOW\n", "start = time.time()\n", "word2vec_cbow = Word2Vec(sentences,min_count=10, sg=0)\n", "end = time.time()\n", "\n", "print(\"CBOW Model Training Complete.\\nTime taken for training is:{:.2f} hrs \".format((end-start)/3600.0))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "CBOW Model Training Complete.\n", "Time taken for training is:0.04 hrs \n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:02:10.613551Z", "start_time": "2021-04-03T10:02:10.585535Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 471 }, "id": "mMdGn08-RkhM", "outputId": "efb34148-3fb4-435c-f070-8493708fc07a" }, "source": [ "#Summarize the loaded model\n", "print(word2vec_cbow)\n", "print(\"-\"*30)\n", "\n", "#Summarize vocabulary\n", "words = list(word2vec_cbow.wv.vocab)\n", "print(f\"Length of vocabulary: {len(words)}\")\n", "print(\"Printing the first 30 words.\")\n", "print(words[:30])\n", "print(\"-\"*30)\n", "\n", "#Acess vector for one word\n", "print(f\"Length of vector: {len(word2vec_cbow['film'])}\")\n", "print(word2vec_cbow['film'])\n", "print(\"-\"*30)\n", "\n", "#Compute similarity \n", "print(\"Similarity between film and drama:\",word2vec_cbow.similarity('film', 'drama'))\n", "print(\"Similarity between film and tiger:\",word2vec_cbow.similarity('film', 'tiger'))\n", "print(\"-\"*30)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Word2Vec(vocab=111150, size=100, alpha=0.025)\n", "------------------------------\n", "Length of vocabulary: 111150\n", "Printing the first 30 words.\n", "['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']\n", "------------------------------\n", "Length of vector: 100\n", "[-0.25941572 -1.6287326 2.5331333 -1.5818936 0.9024474 0.8614945\n", " 2.4875445 -0.95802265 -1.3792082 -1.1744157 -4.300686 1.0071316\n", " 0.10418405 4.855032 0.6251962 -0.06472338 0.19993098 -0.7291219\n", " 2.342258 -1.7298651 0.7895099 -2.2819378 0.7158192 -0.62419826\n", " 0.6720258 3.6712303 1.3836899 0.17808275 -3.7205396 0.2529162\n", " 1.0290879 -0.9228959 0.9451632 1.7415334 1.9618814 1.4535053\n", " 2.670452 0.9272077 0.25056183 -0.4078236 0.5795217 0.6316829\n", " 0.50204426 -0.19865237 -2.697352 0.75351495 1.0796617 2.247825\n", " -2.956658 2.6606686 -0.42392135 -0.44319883 -2.9274392 -1.0198026\n", " 1.404833 -0.10840467 0.50829273 1.0767945 -0.65002084 -3.4231277\n", " 4.719826 -1.5996053 0.82882935 1.635043 -0.45730942 -1.3166244\n", " -1.3349417 -2.3565981 1.7141095 -2.6643796 -1.2148786 0.2972199\n", " -2.2865987 -1.6022073 2.0965865 -0.87479544 -1.4143106 -0.9149557\n", " 2.2900226 1.1464663 -2.6113467 -1.5517493 1.3018385 4.1072307\n", " 1.1441547 1.0222696 0.4847384 2.4148073 -2.881392 -0.67044157\n", " -2.482836 -0.417894 3.1442287 -1.6087203 1.865813 
-3.717568\n", " 0.5994761 1.8819104 3.355772 -1.9087372 ]\n", "------------------------------\n", "Similarity between film and drama: 0.4986632\n", "Similarity between film and tiger: 0.15477756\n", "------------------------------\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:02:16.109851Z", "start_time": "2021-04-03T10:02:15.257052Z" }, "id": "rXrDOrKskcHX" }, "source": [ "# save model\n", "from gensim.models import Word2Vec, KeyedVectors \n", "word2vec_cbow.wv.save_word2vec_format('word2vec_cbow.bin', binary=True)\n", "\n", "# load model\n", "# new_modelword2vec_cbow = Word2Vec.load('word2vec_cbow.bin')\n", "# print(word2vec_cbow)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:08:27.736688Z", "start_time": "2021-04-03T10:02:19.197708Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "id": "dX0U0CbQOK30", "outputId": "b9bfcf2b-91cb-40d9-ca92-791ec346aef4" }, "source": [ "#SkipGram\n", "start = time.time()\n", "word2vec_skipgram = Word2Vec(sentences,min_count=10, sg=1)\n", "end = time.time()\n", "\n", "print(\"SkipGram Model Training Complete\\nTime taken for training is:{:.2f} hrs \".format((end-start)/3600.0))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "SkipGram Model Training Complete\n", "Time taken for training is:0.10 hrs \n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:09:06.406929Z", "start_time": "2021-04-03T10:09:06.383908Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 471 }, "id": "LXnY9YInSvnI", "outputId": "26f1dab7-27a6-4655-81c7-ac6f08fe1f9c" }, "source": [ "#Summarize the loaded model\n", "print(word2vec_skipgram)\n", "print(\"-\"*30)\n", "\n", "#Summarize vocabulary\n", "words = list(word2vec_skipgram.wv.vocab)\n", "print(f\"Length of vocabulary: {len(words)}\")\n", "print(\"Printing the first 30 words.\")\n", "print(words[:30])\n", "print(\"-\"*30)\n", "\n", "#Acess vector for one word\n", "print(f\"Length of vector: {len(word2vec_skipgram['film'])}\")\n", "print(word2vec_skipgram['film'])\n", "print(\"-\"*30)\n", "\n", "#Compute similarity \n", "print(\"Similarity between film and drama:\",word2vec_skipgram.similarity('film', 'drama'))\n", "print(\"Similarity between film and tiger:\",word2vec_skipgram.similarity('film', 'tiger'))\n", "print(\"-\"*30)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "Word2Vec(vocab=111150, size=100, alpha=0.025)\n", "------------------------------\n", "Length of vocabulary: 111150\n", "Printing the first 30 words.\n", "['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']\n", "------------------------------\n", "Length of vector: 100\n", "[ 1.94889292e-01 -7.88324535e-01 4.66947220e-02 2.57520348e-01\n", " 2.65304267e-01 3.63538593e-01 4.63590741e-01 -1.62654325e-01\n", " 9.11010578e-02 -6.58479631e-02 -6.97350129e-02 -6.56900406e-02\n", " 2.19506964e-01 2.20394313e-01 1.05092540e-01 8.26439075e-03\n", " -9.39796269e-02 5.50851583e-01 7.65753444e-04 -2.22807571e-01\n", " -3.17346871e-01 3.20529372e-01 4.51157093e-02 -1.93709806e-01\n", " 2.07626969e-02 1.69344515e-01 2.77250055e-02 1.10369585e-02\n", " 
-4.75540310e-01 1.10796697e-01 4.28172469e-01 4.06191871e-02\n", " 5.15495241e-01 -6.85295224e-01 -5.06723702e-01 -4.52192919e-03\n", " 1.51265517e-03 -3.84557724e-01 -2.22782314e-01 5.11201501e-01\n", " 1.42252162e-01 -7.73397386e-01 -2.78606623e-01 4.70017433e-01\n", " -2.70037323e-01 5.04850507e-01 -1.48356587e-01 2.26073325e-01\n", " -3.36060971e-01 -1.19667962e-01 -2.59654850e-01 -4.44965392e-01\n", " 1.11614995e-01 1.62986945e-02 4.82374012e-01 -7.87460804e-02\n", " -1.13825299e-01 -2.24003598e-01 4.93353546e-01 -5.57069406e-02\n", " 2.43176505e-01 -1.84876159e-01 2.13489812e-02 3.42909366e-01\n", " 2.02496469e-01 -4.25657362e-01 8.17572057e-01 -2.83644646e-01\n", " -5.23434244e-02 -3.27616245e-01 4.43994589e-02 -3.90237272e-01\n", " 2.12029487e-01 -7.25788534e-01 5.52469850e-01 -4.72590374e-03\n", " -2.02829018e-01 -9.59078223e-03 3.68973225e-01 -2.69762665e-01\n", " -2.85591751e-01 -2.68359333e-01 3.10093671e-01 2.02198789e-01\n", " 5.80960453e-01 -2.47493789e-01 -7.37856887e-03 -3.59723950e-03\n", " 3.14893663e-01 1.12885557e-01 -5.09416103e-01 -7.58459032e-01\n", " 5.30587435e-01 -1.51896626e-01 -3.37440372e-01 4.22841489e-01\n", " -3.34523350e-01 3.21759552e-01 7.44457126e-01 -1.26014173e-01]\n", "------------------------------\n", "Similarity between film and drama: 0.63833964\n", "Similarity between film and tiger: 0.22270091\n", "------------------------------\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:09:09.947695Z", "start_time": "2021-04-03T10:09:09.076901Z" }, "id": "o8U7bfPSVB04" }, "source": [ "# save model\n", "word2vec_skipgram.wv.save_word2vec_format('word2vec_sg.bin', binary=True)\n", "\n", "# load model\n", "# new_model_skipgram = Word2Vec.load('model_skipgram.bin')\n", "# print(model_skipgram)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "kExlA8kfrKml" }, "source": [ "## FastText" ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:16:31.271764Z", "start_time": "2021-04-03T10:09:16.592670Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "id": "JPd2VhMEk8gL", "outputId": "55c44bdd-d7d8-4df2-8140-cdd442bbd68c" }, "source": [ "#CBOW\n", "start = time.time()\n", "fasttext_cbow = FastText(sentences, sg=0, min_count=10)\n", "end = time.time()\n", "\n", "print(\"FastText CBOW Model Training Complete\\nTime taken for training is:{:.2f} hrs \".format((end-start)/3600.0))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "FastText CBOW Model Training Complete\n", "Time taken for training is:0.12 hrs \n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:16:31.287283Z", "start_time": "2021-04-03T10:16:31.273765Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 471 }, "id": "FlQFl8-Zsost", "outputId": "6472e944-e6de-4d64-8c6f-14475ef1eac5" }, "source": [ "#Summarize the loaded model\n", "print(fasttext_cbow)\n", "print(\"-\"*30)\n", "\n", "#Summarize vocabulary\n", "words = list(fasttext_cbow.wv.vocab)\n", "print(f\"Length of vocabulary: {len(words)}\")\n", "print(\"Printing the first 30 words.\")\n", "print(words[:30])\n", "print(\"-\"*30)\n", "\n", "#Acess vector for one word\n", "print(f\"Length of vector: {len(fasttext_cbow['film'])}\")\n", "print(fasttext_cbow['film'])\n", "print(\"-\"*30)\n", "\n", "#Compute similarity \n", "print(\"Similarity between film and drama:\",fasttext_cbow.similarity('film', 
'drama'))\n", "print(\"Similarity between film and tiger:\",fasttext_cbow.similarity('film', 'tiger'))\n", "print(\"-\"*30)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "FastText(vocab=111150, size=100, alpha=0.025)\n", "------------------------------\n", "Length of vocabulary: 111150\n", "Printing the first 30 words.\n", "['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']\n", "------------------------------\n", "Length of vector: 100\n", "[ 0.47473213 1.6783198 -4.766255 -3.2404876 0.80164665 1.993539\n", " 3.4226568 -0.7035685 -3.0426116 1.5137119 3.8207133 1.3821473\n", " -0.7379625 -0.6726444 1.8303355 -2.1288188 1.2368282 -3.0745962\n", " 1.4226121 -2.8884995 7.2847705 -1.564321 2.869352 0.6962616\n", " 4.469778 2.5569658 2.621335 -4.612509 -2.2389078 3.6648748\n", " 0.7189718 1.0702186 -3.175641 2.7648733 0.13811935 -2.441776\n", " -3.9559126 -0.03163956 -1.1257534 -0.64402825 -1.5076644 -0.58919376\n", " -0.14338583 4.2466817 4.3784213 3.0076942 -5.972965 2.2950342\n", " -0.50719374 -3.916504 -2.1366098 -2.661619 2.3540869 2.1862476\n", " 5.1004434 4.1282 -4.164653 1.1288711 -4.001655 -4.051289\n", " 2.5718336 -0.40600455 3.8396242 2.214367 1.8413899 4.5216975\n", " -1.6419586 2.7617378 -2.0902452 2.598776 4.041824 -5.1805005\n", " -2.777213 -0.02546828 -0.07393612 -3.2800605 -2.9874747 -0.6490991\n", " 3.6039045 -1.4168853 3.6110177 -1.0872458 -0.6365031 -1.0161037\n", " 3.7344344 0.29839793 0.421953 -1.811646 1.3730506 7.575645\n", " 3.3998368 5.0335827 -0.2107324 -2.331183 0.19383769 3.0550041\n", " 4.1529713 3.988616 0.04955976 1.3424706 ]\n", "------------------------------\n", "Similarity between film and drama: 0.5669882\n", "Similarity between film and tiger: 0.24975622\n", "------------------------------\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:28:28.771383Z", "start_time": "2021-04-03T10:16:31.289284Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "id": "UgSOxsNklAvh", "outputId": "f491f83c-17b8-42ad-a225-479df8419578" }, "source": [ "#SkipGram\n", "start = time.time()\n", "fasttext_skipgram = FastText(sentences, sg=1, min_count=10)\n", "end = time.time()\n", "\n", "print(\"FastText SkipGram Model Training Complete\\nTime taken for training is:{:.2f} hrs \".format((end-start)/3600.0))" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "FastText SkipGram Model Training Complete\n", "Time taken for training is:0.20 hrs \n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "ExecuteTime": { "end_time": "2021-04-03T10:28:28.803412Z", "start_time": "2021-04-03T10:28:28.773386Z" }, "colab": { "base_uri": "https://localhost:8080/", "height": 610 }, "id": "vFiTAP0PsQwi", "outputId": "a29ae2e3-5dbc-453a-f66b-ceca255a8652" }, "source": [ "#Summarize the loaded model\n", "print(fasttext_skipgram)\n", "print(\"-\"*30)\n", "\n", "#Summarize vocabulary\n", "words = list(fasttext_skipgram.wv.vocab)\n", "print(f\"Length of vocabulary: {len(words)}\")\n", "print(\"Printing the first 30 words.\")\n", "print(words[:30])\n", "print(\"-\"*30)\n", "\n", "#Acess vector for one word\n", "print(f\"Length of vector: {len(fasttext_skipgram['film'])}\")\n", "print(fasttext_skipgram['film'])\n", 
"print(\"-\"*30)\n", "\n", "#Compute similarity \n", "print(\"Similarity between film and drama:\",fasttext_skipgram.similarity('film', 'drama'))\n", "print(\"Similarity between film and tiger:\",fasttext_skipgram.similarity('film', 'tiger'))\n", "print(\"-\"*30)" ], "execution_count": null, "outputs": [ { "output_type": "stream", "text": [ "FastText(vocab=111150, size=100, alpha=0.025)\n", "------------------------------\n", "Length of vocabulary: 111150\n", "Printing the first 30 words.\n", "['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']\n", "------------------------------\n", "Length of vector: 100\n", "[-8.4101312e-02 -6.9478154e-04 3.3954462e-01 -3.6973858e-01\n", " 1.6844368e-01 3.4855682e-01 8.0026442e-01 -5.0405812e-01\n", " -6.0389137e-01 2.1694953e-02 4.0937051e-01 -3.5893116e-02\n", " -1.3717794e-01 4.0389201e-01 3.9567137e-01 2.4365921e-01\n", " 5.6551516e-02 -1.5994829e-01 -1.8148309e-01 -2.6480275e-01\n", " -4.8462763e-01 9.5473409e-02 -1.1126036e-02 -1.8805853e-01\n", " 2.4277805e-01 2.4251699e-01 -1.7501226e-01 -4.3078136e-01\n", " -3.6442232e-01 9.1702184e-03 -3.2344624e-01 -1.0232232e-01\n", " -5.2684498e-01 -2.7622378e-01 4.2112619e-01 -4.3196991e-02\n", " 3.1967857e-01 1.7001998e-01 3.3157614e-01 -2.4995559e-01\n", " -1.3239473e-01 -3.4502399e-01 2.1341468e-01 5.8890671e-01\n", " 1.7721146e-01 1.5974782e-01 -3.8579264e-01 -2.8241745e-01\n", " 6.7402735e-02 -7.1903253e-01 1.3665260e-01 -5.9633050e-02\n", " -5.9002697e-01 -6.1173952e-01 -1.0246418e-03 -5.1254374e-01\n", " -1.5101396e-01 1.6967247e-01 2.8125226e-01 -4.6728057e-01\n", " -5.4966863e-02 -1.3736627e-02 -1.5689149e-01 8.3176725e-02\n", " 1.8850440e-02 4.1858605e-01 -1.1376646e-02 -4.0758383e-02\n", " -1.7871203e-01 2.7792713e-01 5.5813068e-01 -3.5465869e-01\n", " 1.3662770e-01 2.5777066e-01 -3.0423281e-01 7.8141141e-01\n", " 1.1446947e-02 -4.0541172e-01 2.9406905e-01 6.0151044e-02\n", " 4.9637925e-02 -3.9679220e-01 4.5333567e-01 1.0888510e-02\n", " 2.7147910e-01 -1.7305572e-01 -2.8098795e-01 -6.1907400e-03\n", " -2.3080334e-01 5.8609635e-01 -1.0097053e-01 6.6119152e-01\n", " 1.8578811e-01 -5.9025098e-02 -5.3886050e-01 2.6664239e-01\n", " -2.2193529e-02 7.0487672e-01 3.9477929e-01 3.7981489e-01]\n", "------------------------------\n", "Similarity between film and drama: 0.626041\n", "Similarity between film and tiger: 0.27831402\n", "------------------------------\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "oArMIJzYOmUR" }, "source": [ "An interesting obeseravtion if you noticed is that CBOW trains faster than SkipGram in both cases.\n", "We will leave it to the user to figure out why. A hint would be to refer the working of CBOW and skipgram." ] } ] }