{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Training_embeddings_using_gensim.ipynb",
"provenance": [],
"toc_visible": true,
"machine_shape": "hm"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "8G1t37lcGSKK"
},
"source": [
"# Training Embeddings Using Gensim and FastText\n",
"> Word embeddings are an approach to representing text in NLP. In this notebook we will demonstrate how to train embeddings with both the CBOW and SkipGram methods using Gensim and FastText.\n",
"\n",
"- toc: true\n",
"- badges: true\n",
"- comments: true\n",
"- categories: [Concept, Embedding, Gensim, FastText]\n",
"- author: \"Quantum Stat\"\n",
"- image:"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-05T21:26:40.863650Z",
"start_time": "2021-04-05T21:26:40.339123Z"
},
"id": "TBw9OCYcYQ_n"
},
"source": [
"from gensim.models import Word2Vec\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-05T21:26:40.894143Z",
"start_time": "2021-04-05T21:26:40.865114Z"
},
"id": "5qWptd54ZcfV"
},
"source": [
"# define training data\n",
"#Gensim word2vec requires a 'list of lists' format for training, where every document is contained in a list\n",
"#and every list contains a list of that document's tokens.\n",
"corpus = [['dog','bites','man'], [\"man\", \"bites\" ,\"dog\"],[\"dog\",\"eats\",\"meat\"],[\"man\", \"eats\",\"food\"]]\n",
"\n",
"#Training the model\n",
"model_cbow = Word2Vec(corpus, min_count=1, sg=0) #using the CBOW architecture for training\n",
"model_skipgram = Word2Vec(corpus, min_count=1, sg=1) #using the SkipGram architecture for training"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "0QjSxefPl4mh"
},
"source": [
"## Continuous Bag of Words (CBOW) \n",
"In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears."
]
},
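{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build intuition, the following sketch (not part of gensim) shows the kind of (context, center) training pairs CBOW learns from; `cbow_pairs` is a hypothetical helper written purely for illustration."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"#Illustrative sketch: generate CBOW-style (context -> center) pairs\n",
"def cbow_pairs(tokens, window=1):\n",
"    pairs = []\n",
"    for i, center in enumerate(tokens):\n",
"        #the context is the words within `window` positions on either side\n",
"        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]\n",
"        pairs.append((context, center))\n",
"    return pairs\n",
"\n",
"print(cbow_pairs(['dog', 'bites', 'man']))\n",
"#[(['bites'], 'dog'), (['dog', 'man'], 'bites'), (['bites'], 'man')]"
],
"execution_count": null,
"outputs": []
},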
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-05T21:26:56.724662Z",
"start_time": "2021-04-05T21:26:56.712651Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 486
},
"id": "nyZY8ME4lUjd",
"outputId": "bd00e825-c11a-4b36-dbf5-80f32c659956"
},
"source": [
"#Summarize the loaded model\n",
"print(model_cbow)\n",
"\n",
"#Summarize vocabulary\n",
"words = list(model_cbow.wv.vocab)\n",
"print(words)\n",
"\n",
"#Access the vector for one word\n",
"print(model_cbow['dog'])"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Word2Vec(vocab=6, size=100, alpha=0.025)\n",
"['dog', 'bites', 'man', 'eats', 'meat', 'food']\n",
"[-3.1667745e-03 2.5268614e-03 -4.9504861e-03 2.3797194e-03\n",
" -3.3511904e-03 1.7659335e-03 -9.6838089e-04 3.6862001e-03\n",
" 3.3760078e-03 -1.1944126e-03 -4.7475514e-03 -4.6677454e-03\n",
" 4.7231275e-03 2.1875298e-03 4.9989321e-03 -4.7024325e-04\n",
" 4.6936749e-03 4.5417100e-03 -4.8383311e-03 4.5522186e-03\n",
" 9.4010920e-04 -2.8778350e-03 -2.3938445e-03 7.6240452e-04\n",
" 2.8537741e-05 -1.0585956e-03 1.5203804e-03 1.1994856e-04\n",
" 4.3881699e-03 3.5755127e-04 1.9964906e-03 -3.3893189e-03\n",
" 2.5362791e-03 -3.8559963e-03 -4.6814438e-03 -1.0485576e-03\n",
" 1.9576577e-03 -5.4296525e-04 2.5505766e-03 1.4563937e-03\n",
" 1.1214090e-03 3.1200200e-03 3.5230191e-03 4.4931062e-03\n",
" -5.5389071e-04 1.6268899e-03 -4.6736463e-03 -1.9612674e-04\n",
" 1.5486709e-03 -3.5581242e-03 1.5163666e-03 2.2859944e-03\n",
" -3.5728619e-03 -3.5505979e-03 7.8282715e-04 -4.8093311e-03\n",
" -3.1324120e-03 -3.6213300e-03 -1.4478542e-03 3.4006054e-03\n",
" 2.2276146e-03 -4.1698264e-03 -3.6997625e-03 -4.1264743e-03\n",
" -4.9103238e-03 -2.2635974e-03 -3.9036905e-03 3.8846405e-03\n",
" -7.9726276e-05 -2.0692295e-03 -3.0645117e-04 -3.0288144e-03\n",
" -3.4682599e-03 -3.1768843e-03 -1.1148058e-03 -2.8012963e-03\n",
" -6.5973290e-04 -2.3705217e-03 4.3961490e-03 3.2166531e-03\n",
" 3.6933657e-04 -6.2054797e-04 2.0661615e-04 3.7390803e-04\n",
" -3.5061471e-03 3.6587315e-03 2.1328868e-03 -2.5964181e-03\n",
" 4.3381471e-03 4.0168604e-03 1.8054987e-03 -1.2192487e-03\n",
" 1.5615283e-03 -1.8635839e-03 2.9529419e-03 -3.3825964e-03\n",
" -3.2592549e-03 -4.7523994e-04 -5.3210353e-04 -9.8173530e-04]\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-05T21:26:57.420196Z",
"start_time": "2021-04-05T21:26:57.417193Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
},
"id": "gMuHv52GeuoR",
"outputId": "b498032d-6f9d-485b-a3cc-5a21300bfb06"
},
"source": [
"#Compute similarity \n",
"print(\"Similarity between eats and bites:\",model_cbow.similarity('eats', 'bites'))\n",
"print(\"Similarity between eats and man:\",model_cbow.similarity('eats', 'man'))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Similarity between eats and bites: -0.09852024\n",
"Similarity between eats and man: -0.17088428\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "twhTZfPOezTU"
},
"source": [
"From the above similarity scores we can conclude that 'eats' is more similar to 'bites' than to 'man'."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-05T21:26:59.635831Z",
"start_time": "2021-04-05T21:26:59.621818Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 104
},
"id": "5Lv0V7WofmsB",
"outputId": "00600b23-d9a6-4f14-bacd-395be85076c8"
},
"source": [
"#Most similar words\n",
"model_cbow.most_similar('meat')"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[('bites', 0.1353721022605896),\n",
" ('man', 0.1094527617096901),\n",
" ('food', -0.02215239405632019),\n",
" ('dog', -0.1444159597158432),\n",
" ('eats', -0.16309654712677002)]"
]
},
"metadata": {
"tags": []
},
"execution_count": 5
}
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-05T21:26:59.855822Z",
"start_time": "2021-04-05T21:26:59.841810Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"id": "WA783nrSalgs",
"outputId": "80d6e23f-2bed-47d7-f925-4aaa87ec5f9e"
},
"source": [
"# save model\n",
"model_cbow.save('model_cbow.bin')\n",
"\n",
"# load model\n",
"new_model_cbow = Word2Vec.load('model_cbow.bin')\n",
"print(new_model_cbow)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Word2Vec(vocab=6, size=100, alpha=0.025)\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "deReLSI7mQyr"
},
"source": [
"## SkipGram\n",
"In SkipGram, the task is to predict the context words from the center word."
]
},
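{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following sketch (not part of gensim) shows the kind of (center, context) training pairs SkipGram learns from; `skipgram_pairs` is a hypothetical helper written purely for illustration."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"#Illustrative sketch: generate SkipGram-style (center -> context) pairs\n",
"def skipgram_pairs(tokens, window=1):\n",
"    pairs = []\n",
"    for i, center in enumerate(tokens):\n",
"        #one pair per context word, rather than one pair per center word\n",
"        for context in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:\n",
"            pairs.append((center, context))\n",
"    return pairs\n",
"\n",
"print(skipgram_pairs(['dog', 'bites', 'man']))\n",
"#[('dog', 'bites'), ('bites', 'dog'), ('bites', 'man'), ('man', 'bites')]"
],
"execution_count": null,
"outputs": []
},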
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-05T21:27:00.517046Z",
"start_time": "2021-04-05T21:27:00.508038Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 486
},
"id": "9QtUtsLglvY0",
"outputId": "6d19902b-66aa-4b0f-9f12-be18f37d40d1"
},
"source": [
"#Summarize the loaded model\n",
"print(model_skipgram)\n",
"\n",
"#Summarize vocabulary\n",
"words = list(model_skipgram.wv.vocab)\n",
"print(words)\n",
"\n",
"#Access the vector for one word\n",
"print(model_skipgram['dog'])"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Word2Vec(vocab=6, size=100, alpha=0.025)\n",
"['dog', 'bites', 'man', 'eats', 'meat', 'food']\n",
"[-3.1667745e-03 2.5268614e-03 -4.9504861e-03 2.3797194e-03\n",
" -3.3511904e-03 1.7659335e-03 -9.6838089e-04 3.6862001e-03\n",
" 3.3760078e-03 -1.1944126e-03 -4.7475514e-03 -4.6677454e-03\n",
" 4.7231275e-03 2.1875298e-03 4.9989321e-03 -4.7024325e-04\n",
" 4.6936749e-03 4.5417100e-03 -4.8383311e-03 4.5522186e-03\n",
" 9.4010920e-04 -2.8778350e-03 -2.3938445e-03 7.6240452e-04\n",
" 2.8537741e-05 -1.0585956e-03 1.5203804e-03 1.1994856e-04\n",
" 4.3881699e-03 3.5755127e-04 1.9964906e-03 -3.3893189e-03\n",
" 2.5362791e-03 -3.8559963e-03 -4.6814438e-03 -1.0485576e-03\n",
" 1.9576577e-03 -5.4296525e-04 2.5505766e-03 1.4563937e-03\n",
" 1.1214090e-03 3.1200200e-03 3.5230191e-03 4.4931062e-03\n",
" -5.5389071e-04 1.6268899e-03 -4.6736463e-03 -1.9612674e-04\n",
" 1.5486709e-03 -3.5581242e-03 1.5163666e-03 2.2859944e-03\n",
" -3.5728619e-03 -3.5505979e-03 7.8282715e-04 -4.8093311e-03\n",
" -3.1324120e-03 -3.6213300e-03 -1.4478542e-03 3.4006054e-03\n",
" 2.2276146e-03 -4.1698264e-03 -3.6997625e-03 -4.1264743e-03\n",
" -4.9103238e-03 -2.2635974e-03 -3.9036905e-03 3.8846405e-03\n",
" -7.9726276e-05 -2.0692295e-03 -3.0645117e-04 -3.0288144e-03\n",
" -3.4682599e-03 -3.1768843e-03 -1.1148058e-03 -2.8012963e-03\n",
" -6.5973290e-04 -2.3705217e-03 4.3961490e-03 3.2166531e-03\n",
" 3.6933657e-04 -6.2054797e-04 2.0661615e-04 3.7390803e-04\n",
" -3.5061471e-03 3.6587315e-03 2.1328868e-03 -2.5964181e-03\n",
" 4.3381471e-03 4.0168604e-03 1.8054987e-03 -1.2192487e-03\n",
" 1.5615283e-03 -1.8635839e-03 2.9529419e-03 -3.3825964e-03\n",
" -3.2592549e-03 -4.7523994e-04 -5.3210353e-04 -9.8173530e-04]\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-05T21:27:02.660747Z",
"start_time": "2021-04-05T21:27:02.642866Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
},
"id": "8YUsblEOfFWf",
"outputId": "14cd759c-d5fc-465f-ed20-8fd1a1949168"
},
"source": [
"#Compute similarity \n",
"print(\"Similarity between eats and bites:\",model_skipgram.similarity('eats', 'bites'))\n",
"print(\"Similarity between eats and man:\",model_skipgram.similarity('eats', 'man'))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Similarity between eats and bites: -0.09852936\n",
"Similarity between eats and man: -0.17089055\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gdXVDePKnBpv"
},
"source": [
"From the above similarity scores we can conclude that 'eats' is more similar to 'bites' than to 'man'."
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-05T21:27:03.419546Z",
"start_time": "2021-04-05T21:27:03.414541Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 104
},
"id": "lpF4qtwpmuM3",
"outputId": "f3bc68f6-3768-4a4d-e5bc-bb3dff6f654f"
},
"source": [
"#Most similar words\n",
"model_skipgram.most_similar('meat')"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[('bites', 0.1353721022605896),\n",
" ('man', 0.10945276916027069),\n",
" ('food', -0.022152386605739594),\n",
" ('dog', -0.1444159746170044),\n",
" ('eats', -0.16317100822925568)]"
]
},
"metadata": {
"tags": []
},
"execution_count": 9
}
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-05T21:27:03.973454Z",
"start_time": "2021-04-05T21:27:03.950433Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"id": "aNDCEXRTnAnj",
"outputId": "402f77b6-0625-4b37-e135-3650df626007"
},
"source": [
"# save model\n",
"model_skipgram.save('model_skipgram.bin')\n",
"\n",
"# load model\n",
"new_model_skipgram = Word2Vec.load('model_skipgram.bin')\n",
"print(new_model_skipgram)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Word2Vec(vocab=6, size=100, alpha=0.025)\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b0MiqJ_1M0mX"
},
"source": [
"## Training Your Embedding on Wiki Corpus\n",
"\n",
"##### The corpus download page: https://dumps.wikimedia.org/enwiki/20200120/\n",
"The entire wiki corpus as of 28/04/2020 is just over 16 GB in size.\n",
"Due to computation constraints we will take only a part of this corpus and train our word2vec and FastText embeddings on it.\n",
"\n",
"The file size is 294 MB, so it can take a while to download.\n",
"\n",
"Source for code which downloads files from Google Drive: https://stackoverflow.com/questions/25010369/wget-curl-large-file-from-google-drive/39225039#39225039"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-05T21:27:58.596845Z",
"start_time": "2021-04-05T21:27:58.585833Z"
},
"id": "60UO41DfGPL0",
"outputId": "262cce44-03e5-46c8-861a-c9da76306c23"
},
"source": [
"import os\n",
"import requests\n",
"\n",
"os.makedirs('data/en', exist_ok= True)\n",
"file_name = \"data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2\"\n",
"file_id = \"11804g0GcWnBIVDahjo5fQyc05nQLXGwF\"\n",
"\n",
"def download_file_from_google_drive(id, destination):\n",
" URL = \"https://docs.google.com/uc?export=download\"\n",
"\n",
" session = requests.Session()\n",
"\n",
" response = session.get(URL, params = { 'id' : id }, stream = True)\n",
" token = get_confirm_token(response)\n",
"\n",
" if token:\n",
" params = { 'id' : id, 'confirm' : token }\n",
" response = session.get(URL, params = params, stream = True)\n",
"\n",
" save_response_content(response, destination) \n",
"\n",
"def get_confirm_token(response):\n",
" for key, value in response.cookies.items():\n",
" if key.startswith('download_warning'):\n",
" return value\n",
"\n",
" return None\n",
"\n",
"def save_response_content(response, destination):\n",
" CHUNK_SIZE = 32768\n",
"\n",
" with open(destination, \"wb\") as f:\n",
" for chunk in response.iter_content(CHUNK_SIZE):\n",
" if chunk: # filter out keep-alive new chunks\n",
" f.write(chunk)\n",
"\n",
"if not os.path.exists(file_name):\n",
" download_file_from_google_drive(file_id, file_name)\n",
"else:\n",
" print(\"file already exists, skipping download\")\n",
"\n",
"print(f\"File at: {file_name}\")"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"file already exists, skipping download\n",
"File at: data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-03T08:59:17.024306Z",
"start_time": "2021-04-03T08:59:17.022304Z"
},
"id": "wX1kx96JLYvt"
},
"source": [
"from gensim.corpora.wikicorpus import WikiCorpus\n",
"from gensim.models.word2vec import Word2Vec\n",
"from gensim.models.fasttext import FastText\n",
"import time"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-03T09:56:14.722195Z",
"start_time": "2021-04-03T09:56:14.705177Z"
},
"id": "rJgsEUmRPppc"
},
"source": [
"#Preparing the training data\n",
"wiki = WikiCorpus(file_name, lemmatize=False, dictionary={})\n",
"sentences = list(wiki.get_texts())\n",
"\n",
"#If you get a memory error executing the lines above,\n",
"#comment them out and uncomment the lines below.\n",
"#Capping the number of worker processes makes loading slower, but more stable.\n",
"# wiki = WikiCorpus(file_name, processes=4, lemmatize=False, dictionary={})\n",
"# sentences = list(wiki.get_texts())\n",
"\n",
"#If you still get a memory error, try setting processes to 1 or 2 and run it again."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "xsIrgt_gPQda"
},
"source": [
"### Hyperparameters\n",
"\n",
"\n",
"1. sg - Selects the training algorithm: 1 for SkipGram, 0 for CBOW. The default is CBOW.\n",
"2. min_count - Ignores all words with a total frequency lower than this.\n",
"\n",
"There are many more hyperparameters; the full list can be found in the official documentation [here.](https://radimrehurek.com/gensim/models/word2vec.html)\n"
]
},
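{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, hedged illustration on a toy corpus (parameter names follow gensim 3.x; in gensim 4.x `size` was renamed to `vector_size`):"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"#Illustrative sketch: a few more commonly tuned hyperparameters\n",
"toy_corpus = [['dog', 'bites', 'man'], ['man', 'eats', 'food']]\n",
"model_tuned = Word2Vec(toy_corpus,\n",
"                       size=50,      #dimensionality of the word vectors\n",
"                       window=3,     #maximum distance between center and context word\n",
"                       min_count=1,  #ignore words rarer than this\n",
"                       workers=2,    #number of worker threads\n",
"                       sg=1)         #1 for SkipGram, 0 for CBOW\n",
"print(model_tuned)"
],
"execution_count": null,
"outputs": []
},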
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-03T10:01:20.065332Z",
"start_time": "2021-04-03T09:59:12.350872Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
},
"id": "idmfbr_8LvoN",
"outputId": "f505a46e-025d-4169-f996-06c672008f81"
},
"source": [
"#CBOW\n",
"start = time.time()\n",
"word2vec_cbow = Word2Vec(sentences,min_count=10, sg=0)\n",
"end = time.time()\n",
"\n",
"print(\"CBOW Model Training Complete.\\nTime taken for training is:{:.2f} hrs \".format((end-start)/3600.0))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"CBOW Model Training Complete.\n",
"Time taken for training is:0.04 hrs \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-03T10:02:10.613551Z",
"start_time": "2021-04-03T10:02:10.585535Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 471
},
"id": "mMdGn08-RkhM",
"outputId": "efb34148-3fb4-435c-f070-8493708fc07a"
},
"source": [
"#Summarize the loaded model\n",
"print(word2vec_cbow)\n",
"print(\"-\"*30)\n",
"\n",
"#Summarize vocabulary\n",
"words = list(word2vec_cbow.wv.vocab)\n",
"print(f\"Length of vocabulary: {len(words)}\")\n",
"print(\"Printing the first 30 words.\")\n",
"print(words[:30])\n",
"print(\"-\"*30)\n",
"\n",
"#Access the vector for one word\n",
"print(f\"Length of vector: {len(word2vec_cbow['film'])}\")\n",
"print(word2vec_cbow['film'])\n",
"print(\"-\"*30)\n",
"\n",
"#Compute similarity \n",
"print(\"Similarity between film and drama:\",word2vec_cbow.similarity('film', 'drama'))\n",
"print(\"Similarity between film and tiger:\",word2vec_cbow.similarity('film', 'tiger'))\n",
"print(\"-\"*30)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Word2Vec(vocab=111150, size=100, alpha=0.025)\n",
"------------------------------\n",
"Length of vocabulary: 111150\n",
"Printing the first 30 words.\n",
"['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']\n",
"------------------------------\n",
"Length of vector: 100\n",
"[-0.25941572 -1.6287326 2.5331333 -1.5818936 0.9024474 0.8614945\n",
" 2.4875445 -0.95802265 -1.3792082 -1.1744157 -4.300686 1.0071316\n",
" 0.10418405 4.855032 0.6251962 -0.06472338 0.19993098 -0.7291219\n",
" 2.342258 -1.7298651 0.7895099 -2.2819378 0.7158192 -0.62419826\n",
" 0.6720258 3.6712303 1.3836899 0.17808275 -3.7205396 0.2529162\n",
" 1.0290879 -0.9228959 0.9451632 1.7415334 1.9618814 1.4535053\n",
" 2.670452 0.9272077 0.25056183 -0.4078236 0.5795217 0.6316829\n",
" 0.50204426 -0.19865237 -2.697352 0.75351495 1.0796617 2.247825\n",
" -2.956658 2.6606686 -0.42392135 -0.44319883 -2.9274392 -1.0198026\n",
" 1.404833 -0.10840467 0.50829273 1.0767945 -0.65002084 -3.4231277\n",
" 4.719826 -1.5996053 0.82882935 1.635043 -0.45730942 -1.3166244\n",
" -1.3349417 -2.3565981 1.7141095 -2.6643796 -1.2148786 0.2972199\n",
" -2.2865987 -1.6022073 2.0965865 -0.87479544 -1.4143106 -0.9149557\n",
" 2.2900226 1.1464663 -2.6113467 -1.5517493 1.3018385 4.1072307\n",
" 1.1441547 1.0222696 0.4847384 2.4148073 -2.881392 -0.67044157\n",
" -2.482836 -0.417894 3.1442287 -1.6087203 1.865813 -3.717568\n",
" 0.5994761 1.8819104 3.355772 -1.9087372 ]\n",
"------------------------------\n",
"Similarity between film and drama: 0.4986632\n",
"Similarity between film and tiger: 0.15477756\n",
"------------------------------\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-03T10:02:16.109851Z",
"start_time": "2021-04-03T10:02:15.257052Z"
},
"id": "rXrDOrKskcHX"
},
"source": [
"# save model\n",
"from gensim.models import Word2Vec, KeyedVectors \n",
"word2vec_cbow.wv.save_word2vec_format('word2vec_cbow.bin', binary=True)\n",
"\n",
"# load model (use KeyedVectors for the word2vec binary format)\n",
"# word2vec_cbow_vectors = KeyedVectors.load_word2vec_format('word2vec_cbow.bin', binary=True)\n",
"# print(word2vec_cbow_vectors)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-03T10:08:27.736688Z",
"start_time": "2021-04-03T10:02:19.197708Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
},
"id": "dX0U0CbQOK30",
"outputId": "b9bfcf2b-91cb-40d9-ca92-791ec346aef4"
},
"source": [
"#SkipGram\n",
"start = time.time()\n",
"word2vec_skipgram = Word2Vec(sentences,min_count=10, sg=1)\n",
"end = time.time()\n",
"\n",
"print(\"SkipGram Model Training Complete\\nTime taken for training is:{:.2f} hrs \".format((end-start)/3600.0))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"SkipGram Model Training Complete\n",
"Time taken for training is:0.10 hrs \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-03T10:09:06.406929Z",
"start_time": "2021-04-03T10:09:06.383908Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 471
},
"id": "LXnY9YInSvnI",
"outputId": "26f1dab7-27a6-4655-81c7-ac6f08fe1f9c"
},
"source": [
"#Summarize the loaded model\n",
"print(word2vec_skipgram)\n",
"print(\"-\"*30)\n",
"\n",
"#Summarize vocabulary\n",
"words = list(word2vec_skipgram.wv.vocab)\n",
"print(f\"Length of vocabulary: {len(words)}\")\n",
"print(\"Printing the first 30 words.\")\n",
"print(words[:30])\n",
"print(\"-\"*30)\n",
"\n",
"#Access the vector for one word\n",
"print(f\"Length of vector: {len(word2vec_skipgram['film'])}\")\n",
"print(word2vec_skipgram['film'])\n",
"print(\"-\"*30)\n",
"\n",
"#Compute similarity \n",
"print(\"Similarity between film and drama:\",word2vec_skipgram.similarity('film', 'drama'))\n",
"print(\"Similarity between film and tiger:\",word2vec_skipgram.similarity('film', 'tiger'))\n",
"print(\"-\"*30)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Word2Vec(vocab=111150, size=100, alpha=0.025)\n",
"------------------------------\n",
"Length of vocabulary: 111150\n",
"Printing the first 30 words.\n",
"['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']\n",
"------------------------------\n",
"Length of vector: 100\n",
"[ 1.94889292e-01 -7.88324535e-01 4.66947220e-02 2.57520348e-01\n",
" 2.65304267e-01 3.63538593e-01 4.63590741e-01 -1.62654325e-01\n",
" 9.11010578e-02 -6.58479631e-02 -6.97350129e-02 -6.56900406e-02\n",
" 2.19506964e-01 2.20394313e-01 1.05092540e-01 8.26439075e-03\n",
" -9.39796269e-02 5.50851583e-01 7.65753444e-04 -2.22807571e-01\n",
" -3.17346871e-01 3.20529372e-01 4.51157093e-02 -1.93709806e-01\n",
" 2.07626969e-02 1.69344515e-01 2.77250055e-02 1.10369585e-02\n",
" -4.75540310e-01 1.10796697e-01 4.28172469e-01 4.06191871e-02\n",
" 5.15495241e-01 -6.85295224e-01 -5.06723702e-01 -4.52192919e-03\n",
" 1.51265517e-03 -3.84557724e-01 -2.22782314e-01 5.11201501e-01\n",
" 1.42252162e-01 -7.73397386e-01 -2.78606623e-01 4.70017433e-01\n",
" -2.70037323e-01 5.04850507e-01 -1.48356587e-01 2.26073325e-01\n",
" -3.36060971e-01 -1.19667962e-01 -2.59654850e-01 -4.44965392e-01\n",
" 1.11614995e-01 1.62986945e-02 4.82374012e-01 -7.87460804e-02\n",
" -1.13825299e-01 -2.24003598e-01 4.93353546e-01 -5.57069406e-02\n",
" 2.43176505e-01 -1.84876159e-01 2.13489812e-02 3.42909366e-01\n",
" 2.02496469e-01 -4.25657362e-01 8.17572057e-01 -2.83644646e-01\n",
" -5.23434244e-02 -3.27616245e-01 4.43994589e-02 -3.90237272e-01\n",
" 2.12029487e-01 -7.25788534e-01 5.52469850e-01 -4.72590374e-03\n",
" -2.02829018e-01 -9.59078223e-03 3.68973225e-01 -2.69762665e-01\n",
" -2.85591751e-01 -2.68359333e-01 3.10093671e-01 2.02198789e-01\n",
" 5.80960453e-01 -2.47493789e-01 -7.37856887e-03 -3.59723950e-03\n",
" 3.14893663e-01 1.12885557e-01 -5.09416103e-01 -7.58459032e-01\n",
" 5.30587435e-01 -1.51896626e-01 -3.37440372e-01 4.22841489e-01\n",
" -3.34523350e-01 3.21759552e-01 7.44457126e-01 -1.26014173e-01]\n",
"------------------------------\n",
"Similarity between film and drama: 0.63833964\n",
"Similarity between film and tiger: 0.22270091\n",
"------------------------------\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-03T10:09:09.947695Z",
"start_time": "2021-04-03T10:09:09.076901Z"
},
"id": "o8U7bfPSVB04"
},
"source": [
"# save model\n",
"word2vec_skipgram.wv.save_word2vec_format('word2vec_sg.bin', binary=True)\n",
"\n",
"# load model (use KeyedVectors for the word2vec binary format)\n",
"# word2vec_sg_vectors = KeyedVectors.load_word2vec_format('word2vec_sg.bin', binary=True)\n",
"# print(word2vec_sg_vectors)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "kExlA8kfrKml"
},
"source": [
"## FastText"
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-03T10:16:31.271764Z",
"start_time": "2021-04-03T10:09:16.592670Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
},
"id": "JPd2VhMEk8gL",
"outputId": "55c44bdd-d7d8-4df2-8140-cdd442bbd68c"
},
"source": [
"#CBOW\n",
"start = time.time()\n",
"fasttext_cbow = FastText(sentences, sg=0, min_count=10)\n",
"end = time.time()\n",
"\n",
"print(\"FastText CBOW Model Training Complete\\nTime taken for training is:{:.2f} hrs \".format((end-start)/3600.0))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"FastText CBOW Model Training Complete\n",
"Time taken for training is:0.12 hrs \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-03T10:16:31.287283Z",
"start_time": "2021-04-03T10:16:31.273765Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 471
},
"id": "FlQFl8-Zsost",
"outputId": "6472e944-e6de-4d64-8c6f-14475ef1eac5"
},
"source": [
"#Summarize the loaded model\n",
"print(fasttext_cbow)\n",
"print(\"-\"*30)\n",
"\n",
"#Summarize vocabulary\n",
"words = list(fasttext_cbow.wv.vocab)\n",
"print(f\"Length of vocabulary: {len(words)}\")\n",
"print(\"Printing the first 30 words.\")\n",
"print(words[:30])\n",
"print(\"-\"*30)\n",
"\n",
"#Access the vector for one word\n",
"print(f\"Length of vector: {len(fasttext_cbow['film'])}\")\n",
"print(fasttext_cbow['film'])\n",
"print(\"-\"*30)\n",
"\n",
"#Compute similarity \n",
"print(\"Similarity between film and drama:\",fasttext_cbow.similarity('film', 'drama'))\n",
"print(\"Similarity between film and tiger:\",fasttext_cbow.similarity('film', 'tiger'))\n",
"print(\"-\"*30)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"FastText(vocab=111150, size=100, alpha=0.025)\n",
"------------------------------\n",
"Length of vocabulary: 111150\n",
"Printing the first 30 words.\n",
"['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']\n",
"------------------------------\n",
"Length of vector: 100\n",
"[ 0.47473213 1.6783198 -4.766255 -3.2404876 0.80164665 1.993539\n",
" 3.4226568 -0.7035685 -3.0426116 1.5137119 3.8207133 1.3821473\n",
" -0.7379625 -0.6726444 1.8303355 -2.1288188 1.2368282 -3.0745962\n",
" 1.4226121 -2.8884995 7.2847705 -1.564321 2.869352 0.6962616\n",
" 4.469778 2.5569658 2.621335 -4.612509 -2.2389078 3.6648748\n",
" 0.7189718 1.0702186 -3.175641 2.7648733 0.13811935 -2.441776\n",
" -3.9559126 -0.03163956 -1.1257534 -0.64402825 -1.5076644 -0.58919376\n",
" -0.14338583 4.2466817 4.3784213 3.0076942 -5.972965 2.2950342\n",
" -0.50719374 -3.916504 -2.1366098 -2.661619 2.3540869 2.1862476\n",
" 5.1004434 4.1282 -4.164653 1.1288711 -4.001655 -4.051289\n",
" 2.5718336 -0.40600455 3.8396242 2.214367 1.8413899 4.5216975\n",
" -1.6419586 2.7617378 -2.0902452 2.598776 4.041824 -5.1805005\n",
" -2.777213 -0.02546828 -0.07393612 -3.2800605 -2.9874747 -0.6490991\n",
" 3.6039045 -1.4168853 3.6110177 -1.0872458 -0.6365031 -1.0161037\n",
" 3.7344344 0.29839793 0.421953 -1.811646 1.3730506 7.575645\n",
" 3.3998368 5.0335827 -0.2107324 -2.331183 0.19383769 3.0550041\n",
" 4.1529713 3.988616 0.04955976 1.3424706 ]\n",
"------------------------------\n",
"Similarity between film and drama: 0.5669882\n",
"Similarity between film and tiger: 0.24975622\n",
"------------------------------\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-03T10:28:28.771383Z",
"start_time": "2021-04-03T10:16:31.289284Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 52
},
"id": "UgSOxsNklAvh",
"outputId": "f491f83c-17b8-42ad-a225-479df8419578"
},
"source": [
"#SkipGram\n",
"start = time.time()\n",
"fasttext_skipgram = FastText(sentences, sg=1, min_count=10)\n",
"end = time.time()\n",
"\n",
"print(\"FastText SkipGram Model Training Complete\\nTime taken for training is:{:.2f} hrs \".format((end-start)/3600.0))"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"FastText SkipGram Model Training Complete\n",
"Time taken for training is:0.20 hrs \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"ExecuteTime": {
"end_time": "2021-04-03T10:28:28.803412Z",
"start_time": "2021-04-03T10:28:28.773386Z"
},
"colab": {
"base_uri": "https://localhost:8080/",
"height": 610
},
"id": "vFiTAP0PsQwi",
"outputId": "a29ae2e3-5dbc-453a-f66b-ceca255a8652"
},
"source": [
"#Summarize the loaded model\n",
"print(fasttext_skipgram)\n",
"print(\"-\"*30)\n",
"\n",
"#Summarize vocabulary\n",
"words = list(fasttext_skipgram.wv.vocab)\n",
"print(f\"Length of vocabulary: {len(words)}\")\n",
"print(\"Printing the first 30 words.\")\n",
"print(words[:30])\n",
"print(\"-\"*30)\n",
"\n",
"#Access the vector for one word\n",
"print(f\"Length of vector: {len(fasttext_skipgram['film'])}\")\n",
"print(fasttext_skipgram['film'])\n",
"print(\"-\"*30)\n",
"\n",
"#Compute similarity \n",
"print(\"Similarity between film and drama:\",fasttext_skipgram.similarity('film', 'drama'))\n",
"print(\"Similarity between film and tiger:\",fasttext_skipgram.similarity('film', 'tiger'))\n",
"print(\"-\"*30)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"FastText(vocab=111150, size=100, alpha=0.025)\n",
"------------------------------\n",
"Length of vocabulary: 111150\n",
"Printing the first 30 words.\n",
"['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']\n",
"------------------------------\n",
"Length of vector: 100\n",
"[-8.4101312e-02 -6.9478154e-04 3.3954462e-01 -3.6973858e-01\n",
" 1.6844368e-01 3.4855682e-01 8.0026442e-01 -5.0405812e-01\n",
" -6.0389137e-01 2.1694953e-02 4.0937051e-01 -3.5893116e-02\n",
" -1.3717794e-01 4.0389201e-01 3.9567137e-01 2.4365921e-01\n",
" 5.6551516e-02 -1.5994829e-01 -1.8148309e-01 -2.6480275e-01\n",
" -4.8462763e-01 9.5473409e-02 -1.1126036e-02 -1.8805853e-01\n",
" 2.4277805e-01 2.4251699e-01 -1.7501226e-01 -4.3078136e-01\n",
" -3.6442232e-01 9.1702184e-03 -3.2344624e-01 -1.0232232e-01\n",
" -5.2684498e-01 -2.7622378e-01 4.2112619e-01 -4.3196991e-02\n",
" 3.1967857e-01 1.7001998e-01 3.3157614e-01 -2.4995559e-01\n",
" -1.3239473e-01 -3.4502399e-01 2.1341468e-01 5.8890671e-01\n",
" 1.7721146e-01 1.5974782e-01 -3.8579264e-01 -2.8241745e-01\n",
" 6.7402735e-02 -7.1903253e-01 1.3665260e-01 -5.9633050e-02\n",
" -5.9002697e-01 -6.1173952e-01 -1.0246418e-03 -5.1254374e-01\n",
" -1.5101396e-01 1.6967247e-01 2.8125226e-01 -4.6728057e-01\n",
" -5.4966863e-02 -1.3736627e-02 -1.5689149e-01 8.3176725e-02\n",
" 1.8850440e-02 4.1858605e-01 -1.1376646e-02 -4.0758383e-02\n",
" -1.7871203e-01 2.7792713e-01 5.5813068e-01 -3.5465869e-01\n",
" 1.3662770e-01 2.5777066e-01 -3.0423281e-01 7.8141141e-01\n",
" 1.1446947e-02 -4.0541172e-01 2.9406905e-01 6.0151044e-02\n",
" 4.9637925e-02 -3.9679220e-01 4.5333567e-01 1.0888510e-02\n",
" 2.7147910e-01 -1.7305572e-01 -2.8098795e-01 -6.1907400e-03\n",
" -2.3080334e-01 5.8609635e-01 -1.0097053e-01 6.6119152e-01\n",
" 1.8578811e-01 -5.9025098e-02 -5.3886050e-01 2.6664239e-01\n",
" -2.2193529e-02 7.0487672e-01 3.9477929e-01 3.7981489e-01]\n",
"------------------------------\n",
"Similarity between film and drama: 0.626041\n",
"Similarity between film and tiger: 0.27831402\n",
"------------------------------\n"
],
"name": "stdout"
}
]
},
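{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because FastText represents words as character n-grams, it can also compose a vector for a word that never appeared in the training corpus. A quick sketch (the out-of-vocabulary token below is made up purely for illustration):"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"oov_word = 'filmgoerish'  #hypothetical token, unlikely to be in the vocabulary\n",
"print(oov_word in fasttext_skipgram.wv.vocab)  #expected False for an unseen word\n",
"print(len(fasttext_skipgram.wv[oov_word]))     #a vector is still composed from its n-grams"
],
"execution_count": null,
"outputs": []
},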
{
"cell_type": "markdown",
"metadata": {
"id": "oArMIJzYOmUR"
},
"source": [
"An interesting observation, if you noticed, is that CBOW trains faster than SkipGram in both cases.\n",
"We will leave it to the reader to figure out why. A hint would be to refer to how CBOW and SkipGram generate their training pairs."
]
}
]
}