# Training Embeddings Using Gensim and FastText
> Word embeddings are an approach to representing text in NLP. In this notebook we will demonstrate how to train embeddings with both the CBOW and SkipGram methods using Gensim and FastText.

- toc: true
- badges: true
- comments: true
- categories: [Concept, Embedding, Gensim, FastText]
- author: "<a href='https://notebooks.quantumstat.com/'>Quantum Stat</a>"
- image:

In [None]:
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Define the training data.
# Gensim's Word2Vec expects a 'list of lists': the corpus is a list of documents,
# and each document is a list of that document's tokens.
corpus = [['dog', 'bites', 'man'], ['man', 'bites', 'dog'], ['dog', 'eats', 'meat'], ['man', 'eats', 'food']]

# Train the models
model_cbow = Word2Vec(corpus, min_count=1, sg=0)      # sg=0: CBOW architecture
model_skipgram = Word2Vec(corpus, min_count=1, sg=1)  # sg=1: SkipGram architecture
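
Real text rarely arrives pre-tokenized. As a minimal sketch of how raw strings could be turned into the 'list of lists' format used above, the hypothetical `tokenize` helper below lowercases each document and keeps runs of letters and digits. The helper and the sample sentences are assumptions for illustration, not the preprocessing used in this notebook.

```python
import re

# Hypothetical raw documents, one string per document.
raw_documents = ["Dog bites man.", "Man eats food!"]

def tokenize(text):
    """Lowercase the text and return its runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# One list of tokens per document: the shape Word2Vec expects.
tokenized_corpus = [tokenize(doc) for doc in raw_documents]
print(tokenized_corpus)
```

Gensim also ships `gensim.utils.simple_preprocess`, which may be a more convenient choice for real corpora.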

## Continuous Bag of Words (CBOW) 
In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears.
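
To make the prediction task concrete, here is a rough sketch of how (context words, center word) training examples could be formed from one sentence. The `cbow_pairs` helper is hypothetical and uses a fixed window for simplicity; Gensim builds its examples internally and its details may differ.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, center_word) examples for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        context = left + right
        if context:  # a one-word sentence has no context to predict from
            pairs.append((context, center))
    return pairs

cbow_examples = cbow_pairs(['dog', 'bites', 'man'], window=1)
for context, center in cbow_examples:
    print(context, '->', center)
```

Each example asks the model to predict a single center word from all of its context words at once.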

In [None]:
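#Summarize the trained model
print(model_cbow)

#Summarize vocabulary (in Gensim 4+, use wv.index_to_key instead of wv.vocab)
words = list(model_cbow.wv.vocab)
print(words)

#Access vector for one word (through .wv; indexing the model directly is deprecated)
print(model_cbow.wv['dog'])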

Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[-3.1667745e-03  2.5268614e-03 -4.9504861e-03  2.3797194e-03
 -3.3511904e-03  1.7659335e-03 -9.6838089e-04  3.6862001e-03
  3.3760078e-03 -1.1944126e-03 -4.7475514e-03 -4.6677454e-03
  4.7231275e-03  2.1875298e-03  4.9989321e-03 -4.7024325e-04
  4.6936749e-03  4.5417100e-03 -4.8383311e-03  4.5522186e-03
  9.4010920e-04 -2.8778350e-03 -2.3938445e-03  7.6240452e-04
  2.8537741e-05 -1.0585956e-03  1.5203804e-03  1.1994856e-04
  4.3881699e-03  3.5755127e-04  1.9964906e-03 -3.3893189e-03
  2.5362791e-03 -3.8559963e-03 -4.6814438e-03 -1.0485576e-03
  1.9576577e-03 -5.4296525e-04  2.5505766e-03  1.4563937e-03
  1.1214090e-03  3.1200200e-03  3.5230191e-03  4.4931062e-03
 -5.5389071e-04  1.6268899e-03 -4.6736463e-03 -1.9612674e-04
  1.5486709e-03 -3.5581242e-03  1.5163666e-03  2.2859944e-03
 -3.5728619e-03 -3.5505979e-03  7.8282715e-04 -4.8093311e-03
 -3.1324120e-03 -3.6213300e-03 -1.4478542e-03  3.4006054e

In [None]:
#Compute similarity (through .wv; calling similarity on the model directly is deprecated)
print("Similarity between eats and bites:",model_cbow.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:",model_cbow.wv.similarity('eats', 'man'))

Similarity between eats and bites: -0.09852024
Similarity between eats and man: -0.17088428


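From the above similarity scores we can conclude that eats is more similar to bites than to man: both scores are negative, but -0.099 is higher than -0.171. On a corpus of only four sentences the vectors are barely trained, so the exact values carry little meaning.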
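
The scores printed by `similarity` are cosine similarities between the two word vectors. A minimal pure-Python sketch of that measure, using made-up two-dimensional vectors rather than the trained embeddings, could look like this:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative vectors only (not taken from the model).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and values near -1 mean they point in opposite directions.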

In [None]:
#Most similar words
model_cbow.wv.most_similar('meat')

[('bites', 0.1353721022605896),
 ('man', 0.1094527617096901),
 ('food', -0.02215239405632019),
 ('dog', -0.1444159597158432),
 ('eats', -0.16309654712677002)]

In [None]:
# save model
model_cbow.save('model_cbow.bin')

# load model
new_model_cbow = Word2Vec.load('model_cbow.bin')
print(new_model_cbow)

Word2Vec(vocab=6, size=100, alpha=0.025)


## SkipGram
In skipgram, the task is to predict the context words from the center word.
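
As a rough illustration, the sketch below lists the (center word, context word) pairs that a skip-gram model would be asked to predict for one sentence. The `skipgram_pairs` helper is hypothetical and uses a fixed window; Gensim generates its pairs internally and its details may differ.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

skipgram_examples = skipgram_pairs(['dog', 'bites', 'man'], window=1)
for center, context in skipgram_examples:
    print(center, '->', context)
```

Note that every context word yields its own training pair, so one center word can produce several predictions.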

In [None]:
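#Summarize the trained model
print(model_skipgram)

#Summarize vocabulary (in Gensim 4+, use wv.index_to_key instead of wv.vocab)
words = list(model_skipgram.wv.vocab)
print(words)

#Access vector for one word (through .wv; indexing the model directly is deprecated)
print(model_skipgram.wv['dog'])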

Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[-3.1667745e-03  2.5268614e-03 -4.9504861e-03  2.3797194e-03
 -3.3511904e-03  1.7659335e-03 -9.6838089e-04  3.6862001e-03
  3.3760078e-03 -1.1944126e-03 -4.7475514e-03 -4.6677454e-03
  4.7231275e-03  2.1875298e-03  4.9989321e-03 -4.7024325e-04
  4.6936749e-03  4.5417100e-03 -4.8383311e-03  4.5522186e-03
  9.4010920e-04 -2.8778350e-03 -2.3938445e-03  7.6240452e-04
  2.8537741e-05 -1.0585956e-03  1.5203804e-03  1.1994856e-04
  4.3881699e-03  3.5755127e-04  1.9964906e-03 -3.3893189e-03
  2.5362791e-03 -3.8559963e-03 -4.6814438e-03 -1.0485576e-03
  1.9576577e-03 -5.4296525e-04  2.5505766e-03  1.4563937e-03
  1.1214090e-03  3.1200200e-03  3.5230191e-03  4.4931062e-03
 -5.5389071e-04  1.6268899e-03 -4.6736463e-03 -1.9612674e-04
  1.5486709e-03 -3.5581242e-03  1.5163666e-03  2.2859944e-03
 -3.5728619e-03 -3.5505979e-03  7.8282715e-04 -4.8093311e-03
 -3.1324120e-03 -3.6213300e-03 -1.4478542e-03  3.4006054e

In [None]:
#Compute similarity (through .wv; calling similarity on the model directly is deprecated)
print("Similarity between eats and bites:",model_skipgram.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:",model_skipgram.wv.similarity('eats', 'man'))

Similarity between eats and bites: -0.09852936
Similarity between eats and man: -0.17089055


From the above similarity scores we can again conclude that eats is more similar to bites than to man.

In [None]:
#Most similar words
model_skipgram.wv.most_similar('meat')

[('bites', 0.1353721022605896),
 ('man', 0.10945276916027069),
 ('food', -0.022152386605739594),
 ('dog', -0.1444159746170044),
 ('eats', -0.16317100822925568)]

In [None]:
# save model
model_skipgram.save('model_skipgram.bin')

# load model
new_model_skipgram = Word2Vec.load('model_skipgram.bin')
print(new_model_skipgram)

Word2Vec(vocab=6, size=100, alpha=0.025)


## Training Your Embedding on Wiki Corpus

##### The corpus download page: https://dumps.wikimedia.org/enwiki/20200120/
The entire wiki corpus as of 28/04/2020 is just over 16GB in size.
Because of computational constraints, we will train our Word2Vec and FastText embeddings on only a part of this corpus.

The file we download is 294MB, so it can take a while.

Source for code which downloads files from Google Drive: https://stackoverflow.com/questions/25010369/wget-curl-large-file-from-google-drive/39225039#39225039

In [None]:
import os
import requests

os.makedirs('data/en', exist_ok= True)
file_name = "data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2"
file_id = "11804g0GcWnBIVDahjo5fQyc05nQLXGwF"

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

if not os.path.exists(file_name):
    download_file_from_google_drive(file_id, file_name)
else:
    print("file already exists, skipping download")

print(f"File at: {file_name}")

file already exists, skipping download
File at: data/en/enwiki-latest-pages-articles-multistream14.xml-p13159683p14324602.bz2


In [None]:
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import Word2Vec
from gensim.models.fasttext import FastText
import time

In [None]:
#Preparing the Training data
wiki = WikiCorpus(file_name, lemmatize=False, dictionary={})
sentences = list(wiki.get_texts())

#If you get a memory error executing the lines above,
#comment them out and uncomment the lines below.
#Loading will be slower, but stable.
# wiki = WikiCorpus(file_name, processes=4, lemmatize=False, dictionary={})
# sentences = list(wiki.get_texts())

#If you still get a memory error, try setting processes to 1 or 2 and then run it again.
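
If holding every document in a Python list is still too much for memory, another option is to stream the corpus from disk. Gensim's `Word2Vec` accepts any iterable that can be iterated more than once, which rules out a plain generator. The sketch below assumes the corpus has already been written to a text file with one document per line and tokens separated by spaces; the `LineCorpus` class and the file layout are hypothetical.

```python
import os
import tempfile

class LineCorpus:
    """Restartable iterable that yields one list of tokens per line of a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be read repeatedly.
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens

# Tiny stand-in file so the sketch runs on its own.
tmp_path = os.path.join(tempfile.mkdtemp(), 'corpus.txt')
with open(tmp_path, 'w', encoding='utf-8') as f:
    f.write('dog bites man\nman bites dog\n')

stream = LineCorpus(tmp_path)
first_pass = list(stream)
second_pass = list(stream)  # a second pass works, unlike with a generator
print(first_pass)
```

An object like `stream` could then be passed in place of `sentences`, for example `Word2Vec(stream, min_count=10, sg=0)`.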

### Hyperparameters


1.   sg - Selects the training algorithm: 1 for skip-gram, 0 for CBOW. The default is CBOW.
2.   min_count - Ignores all words with a total frequency lower than this.

There are many more hyperparameters; the full list can be found in the official documentation [here](https://radimrehurek.com/gensim/models/word2vec.html).
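
To see what `min_count` does to the vocabulary, here is a small pure-Python sketch that counts token frequencies on the toy corpus from earlier and keeps only the words that reach the threshold. The `vocabulary` helper is hypothetical and only mimics the filtering idea; it is not Gensim's implementation.

```python
from collections import Counter

corpus = [['dog', 'bites', 'man'], ['man', 'bites', 'dog'],
          ['dog', 'eats', 'meat'], ['man', 'eats', 'food']]

def vocabulary(sentences, min_count):
    """Return the sorted words whose total frequency is at least min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return sorted(word for word, n in counts.items() if n >= min_count)

vocab_all = vocabulary(corpus, min_count=1)
vocab_frequent = vocabulary(corpus, min_count=2)
print(vocab_all)
print(vocab_frequent)
```

With `min_count=2`, the words that occur only once ("meat" and "food") are dropped, which is why the toy models above were trained with `min_count=1`.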


In [None]:
#CBOW
start = time.time()
word2vec_cbow = Word2Vec(sentences,min_count=10, sg=0)
end = time.time()

print("CBOW Model Training Complete.\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

CBOW Model Training Complete.
Time taken for training is:0.04 hrs 


In [None]:
#Summarize the trained model
print(word2vec_cbow)
print("-"*30)

#Summarize vocabulary (in Gensim 4+, use wv.index_to_key instead of wv.vocab)
words = list(word2vec_cbow.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(word2vec_cbow.wv['film'])}")
print(word2vec_cbow.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",word2vec_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",word2vec_cbow.wv.similarity('film', 'tiger'))
print("-"*30)

Word2Vec(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[-0.25941572 -1.6287326   2.5331333  -1.5818936   0.9024474   0.8614945
  2.4875445  -0.95802265 -1.3792082  -1.1744157  -4.300686    1.0071316
  0.10418405  4.855032    0.6251962  -0.06472338  0.19993098 -0.7291219
  2.342258   -1.7298651   0.7895099  -2.2819378   0.7158192  -0.62419826
  0.6720258   3.6712303   1.3836899   0.17808275 -3.7205396   0.2529162
  1.0290879  -0.9228959   0.9451632   1.7415334   1.9618814   1.4535053
  2.670452    0.9272077   0.25056183 -0.4078236   0.5795217   0.6316829
  0.50204426 -0.19865237

In [None]:
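# save the word vectors in the binary word2vec format
from gensim.models import Word2Vec, KeyedVectors
word2vec_cbow.wv.save_word2vec_format('word2vec_cbow.bin', binary=True)

# load the vectors (files written by save_word2vec_format are read with KeyedVectors)
# new_word2vec_cbow = KeyedVectors.load_word2vec_format('word2vec_cbow.bin', binary=True)
# print(new_word2vec_cbow)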

In [None]:
#SkipGram
start = time.time()
word2vec_skipgram = Word2Vec(sentences,min_count=10, sg=1)
end = time.time()

print("SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

SkipGram Model Training Complete
Time taken for training is:0.10 hrs 


In [None]:
#Summarize the trained model
print(word2vec_skipgram)
print("-"*30)

#Summarize vocabulary (in Gensim 4+, use wv.index_to_key instead of wv.vocab)
words = list(word2vec_skipgram.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(word2vec_skipgram.wv['film'])}")
print(word2vec_skipgram.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",word2vec_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",word2vec_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)

Word2Vec(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[ 1.94889292e-01 -7.88324535e-01  4.66947220e-02  2.57520348e-01
  2.65304267e-01  3.63538593e-01  4.63590741e-01 -1.62654325e-01
  9.11010578e-02 -6.58479631e-02 -6.97350129e-02 -6.56900406e-02
  2.19506964e-01  2.20394313e-01  1.05092540e-01  8.26439075e-03
 -9.39796269e-02  5.50851583e-01  7.65753444e-04 -2.22807571e-01
 -3.17346871e-01  3.20529372e-01  4.51157093e-02 -1.93709806e-01
  2.07626969e-02  1.69344515e-01  2.77250055e-02  1.10369585e-02
 -4.75540310e-01  1.10796697e-01  4.28172469e-01  4.06191871e-02
  5.15495

In [None]:
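# save the word vectors in the binary word2vec format
word2vec_skipgram.wv.save_word2vec_format('word2vec_sg.bin', binary=True)

# load the vectors (files written by save_word2vec_format are read with KeyedVectors)
# new_word2vec_skipgram = KeyedVectors.load_word2vec_format('word2vec_sg.bin', binary=True)
# print(new_word2vec_skipgram)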

## FastText
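
FastText differs from Word2Vec in that it also learns vectors for character n-grams, so a word's vector is built from its subword pieces. The sketch below shows how such n-grams could be extracted from a word wrapped in `<` and `>` boundary markers. The `char_ngrams` helper is hypothetical; its default lengths of 3 to 6 mirror Gensim's `min_n` and `max_n` defaults, and Gensim's actual implementation also hashes the n-grams into buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f'<{word}>'
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# Shorter maximum length here just to keep the printed list small.
film_ngrams = char_ngrams('film', min_n=3, max_n=4)
print(film_ngrams)
```

Because related words such as "film" and "films" share most of their n-grams, they end up with related vectors, and vectors can also be assembled for words not seen during training.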

In [None]:
#CBOW
start = time.time()
fasttext_cbow = FastText(sentences, sg=0, min_count=10)
end = time.time()

print("FastText CBOW Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

FastText CBOW Model Training Complete
Time taken for training is:0.12 hrs 


In [None]:
#Summarize the trained model
print(fasttext_cbow)
print("-"*30)

#Summarize vocabulary (in Gensim 4+, use wv.index_to_key instead of wv.vocab)
words = list(fasttext_cbow.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(fasttext_cbow.wv['film'])}")
print(fasttext_cbow.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",fasttext_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",fasttext_cbow.wv.similarity('film', 'tiger'))
print("-"*30)

FastText(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[ 0.47473213  1.6783198  -4.766255   -3.2404876   0.80164665  1.993539
  3.4226568  -0.7035685  -3.0426116   1.5137119   3.8207133   1.3821473
 -0.7379625  -0.6726444   1.8303355  -2.1288188   1.2368282  -3.0745962
  1.4226121  -2.8884995   7.2847705  -1.564321    2.869352    0.6962616
  4.469778    2.5569658   2.621335   -4.612509   -2.2389078   3.6648748
  0.7189718   1.0702186  -3.175641    2.7648733   0.13811935 -2.441776
 -3.9559126  -0.03163956 -1.1257534  -0.64402825 -1.5076644  -0.58919376
 -0.14338583  4.2466817   

In [None]:
#SkipGram
start = time.time()
fasttext_skipgram = FastText(sentences, sg=1, min_count=10)
end = time.time()

print("FastText SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start)/3600.0))

FastText SkipGram Model Training Complete
Time taken for training is:0.20 hrs 


In [None]:
#Summarize the trained model
print(fasttext_skipgram)
print("-"*30)

#Summarize vocabulary (in Gensim 4+, use wv.index_to_key instead of wv.vocab)
words = list(fasttext_skipgram.wv.vocab)
print(f"Length of vocabulary: {len(words)}")
print("Printing the first 30 words.")
print(words[:30])
print("-"*30)

#Access vector for one word
print(f"Length of vector: {len(fasttext_skipgram.wv['film'])}")
print(fasttext_skipgram.wv['film'])
print("-"*30)

#Compute similarity 
print("Similarity between film and drama:",fasttext_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:",fasttext_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)

FastText(vocab=111150, size=100, alpha=0.025)
------------------------------
Length of vocabulary: 111150
Printing the first 30 words.
['the', 'roses', 'registered', 'as', 'is', 'brisbane', 'racing', 'club', 'group', 'thoroughbred', 'horse', 'race', 'for', 'three', 'year', 'old', 'filles', 'run', 'under', 'set', 'weights', 'conditions', 'over', 'distance', 'of', 'metres', 'at', 'racecourse', 'australia', 'during']
------------------------------
Length of vector: 100
[-8.4101312e-02 -6.9478154e-04  3.3954462e-01 -3.6973858e-01
  1.6844368e-01  3.4855682e-01  8.0026442e-01 -5.0405812e-01
 -6.0389137e-01  2.1694953e-02  4.0937051e-01 -3.5893116e-02
 -1.3717794e-01  4.0389201e-01  3.9567137e-01  2.4365921e-01
  5.6551516e-02 -1.5994829e-01 -1.8148309e-01 -2.6480275e-01
 -4.8462763e-01  9.5473409e-02 -1.1126036e-02 -1.8805853e-01
  2.4277805e-01  2.4251699e-01 -1.7501226e-01 -4.3078136e-01
 -3.6442232e-01  9.1702184e-03 -3.2344624e-01 -1.0232232e-01
 -5.2684498e-01 -2.7622378e-01  4.2112619

An interesting observation is that CBOW trains faster than SkipGram in both cases: 0.04 hrs versus 0.10 hrs for Word2Vec, and 0.12 hrs versus 0.20 hrs for FastText.
We will leave it to the reader to figure out why. As a hint, compare how CBOW and SkipGram form their training examples.
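
One way to start on that hint is to count how many predictions each method makes over the same text. The sketch below is a deliberate simplification: the `count_steps` helper is hypothetical, uses a fixed window, and ignores details such as negative sampling, so it only suggests where the extra work comes from.

```python
def count_steps(sentences, window=5):
    """Count prediction steps for CBOW and skip-gram over tokenized sentences."""
    cbow_steps = 0
    skipgram_steps = 0
    for tokens in sentences:
        for i in range(len(tokens)):
            # Number of context words available on the left and on the right.
            n_context = min(i, window) + min(len(tokens) - 1 - i, window)
            if n_context:
                cbow_steps += 1              # one prediction from the combined context
                skipgram_steps += n_context  # one prediction per context word
    return cbow_steps, skipgram_steps

toy_corpus = [['dog', 'bites', 'man'], ['man', 'bites', 'dog'],
              ['dog', 'eats', 'meat'], ['man', 'eats', 'food']]
cbow_steps, skipgram_steps = count_steps(toy_corpus, window=5)
print(cbow_steps, skipgram_steps)
```

On this toy corpus skip-gram makes twice as many predictions as CBOW, and the gap widens as sentences and windows get longer.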