# SMASAC - fastText Embedding

This notebook shows how to use [fastText](https://fasttext.cc) to generate representations of words and tweets in an embedding space.

This notebook is structured as follows:

1. Preprocessing the data
2. Training the fastText embedding model
3. Querying similar words with the embedding model

In [1]:
from pathlib import Path
import fastText
import sklearn
import sklearn.metrics
import numpy as np
import re

# Configuration

Folder structure of this project:

* data: data directory
  - twitter_las_vegas_shooting: text for training, a sample of 50k tweets
  - twitter_las_vegas_shooting.preprocessed: preprocessed training text
  - twitter_las_vegas_shooting.labels: hashtags in the training corpus
  - twitter_las_vegas_shooting.embedding: hashtag embedding vectors
  - twitter_las_vegas_shooting.low_dim_embedding: hashtag embedding vectors in 2D
* model: model directory


We will use `twitter_las_vegas_shooting` for training. It contains 50,000 tweets crawled during the Las Vegas mass shooting.

In [10]:
root_dir = Path("..")
data_dir = root_dir / "data" / "3-entity-extraction"
notebook_dir = root_dir / "notebooks"
model_dir = data_dir / "model" 

# Create the model directory (and any missing parents) if it does not exist
model_dir.mkdir(parents=True, exist_ok=True)

In [11]:
# corpus
data_path = data_dir / "twitter_las_vegas_shooting"
# Training corpus filename
input_filename = str(data_path)
# Model filename
model_filename = str(model_dir / "twitter.bin")

# Preprocessing

Each tweet is preprocessed so that the embedding model is trained on cleaner text:

* Remove hashtags
* Remove mentions
* Remove punctuation
* Remove URLs
* Convert the tweet to lowercase

In [13]:
# Preprocessing Config
preprocess_config = {
 "hashtag": True,
 "mentioned": True,
 "punctuation": True,
 "url": True,
}

# Patterns (raw strings keep backslashes such as \w from being read as string escapes)
hashtag_pattern = r"#\w+"
mentioned_pattern = r"@\w+"
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

trans_str = "!\"$%&\'()*+,-./:;<=>?[\\]^_`{|}~" + "…"
translate_table = str.maketrans(trans_str, " " * len(trans_str))

def preprocess(s):
    s = s.lower()
    if preprocess_config["hashtag"]:
        s = re.sub(hashtag_pattern, "", s)
    if preprocess_config["mentioned"]:
        s = re.sub(mentioned_pattern, "", s)
    if preprocess_config["url"]:
        s = re.sub(url_pattern, "", s)
    if preprocess_config["punctuation"]:
        s = " ".join(s.translate(translate_table).split())
    return s
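
The punctuation step works by mapping every listed character to a space and then collapsing runs of whitespace. A minimal standalone sketch of that idea follows; the names `punctuation`, `table`, and `strip_punctuation` are illustrative only and are not part of this notebook's code, and the character set is deliberately shortened.

```python
# Illustrative, shortened character set (the notebook uses a longer one)
punctuation = "!?.,:;"

# Map each punctuation character to a single space
table = str.maketrans(punctuation, " " * len(punctuation))

def strip_punctuation(text):
    # translate() swaps punctuation for spaces; split()/join() collapses
    # the resulting runs of whitespace into single spaces
    return " ".join(text.translate(table).split())

print(strip_punctuation("rt : hello,  world!"))
```

Replacing with a space rather than deleting keeps words such as `oaks,california` from being glued together.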


**Preprocessing Example** 
Here is an example output of preprocessing. 

In [14]:
# example of preprocessing
example_tweet = "RT @TheLeadCNN: Remembering Keri Lynn Galvan, from Thousand Oaks, California. #LasVegasLost https://t.co/QuvXa6WvlE https://t.co/hDF2d3Owgn"

print("Original Tweet:")
print(example_tweet)
print()
print("Preprocessed Tweet:")
print(preprocess(example_tweet))

Original Tweet:
RT @TheLeadCNN: Remembering Keri Lynn Galvan, from Thousand Oaks, California. #LasVegasLost https://t.co/QuvXa6WvlE https://t.co/hDF2d3Owgn

Preprocessed Tweet:
rt remembering keri lynn galvan from thousand oaks california


**Preprocessing corpus**

In [15]:
# Preprocessing
preprocessed_data_path = data_dir / "twitter_las_vegas_shooting.preprocessed"

with data_path.open() as f:
    lines = [line.strip() for line in f.readlines()]

with preprocessed_data_path.open("w") as f:
    for line in lines:
        f.write(preprocess(line))
        f.write("\n")

# use preprocessed data as input
input_filename = str(preprocessed_data_path)

# Training fastText embedding model

Train a 100-dimensional embedding model on the preprocessed corpus.

In [16]:
# fastText Config
embedding_model = "skipgram"
lr = 0.05
dim = 100
ws = 5
epoch = 5
minCount = 5
minCountLabel = 0
minn = 3
maxn = 6
neg = 5
wordNgrams = 1
loss = "ns"
bucket = 2000000
thread = 12
lrUpdateRate = 100
t = 1e-4
verbose = 2
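
The `minn` and `maxn` settings bound the length of the character n-grams that fastText uses as subword units, which is also why a word missing from the vocabulary can still receive a vector. The sketch below is a simplified illustration of that decomposition, not the library's implementation: the function name `char_ngrams` is hypothetical, and the real library additionally hashes the n-grams into `bucket` slots.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word` with lengths minn..maxn.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams in the middle of the word.
    """
    token = "<" + word + ">"
    ngrams = []
    for start in range(len(token)):
        for n in range(minn, maxn + 1):
            if start + n > len(token):
                break  # longer n-grams from this start would overrun the token
            ngrams.append(token[start:start + n])
    return ngrams

print(char_ngrams("vegas", minn=3, maxn=4))
```

With the notebook's settings (`minn = 3`, `maxn = 6`), related spellings such as "shooting" and "shootings" share most of their n-grams, so their vectors tend to end up close together.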

In [17]:
model = fastText.train_unsupervised(
    input=input_filename,
    model=embedding_model,
    lr=lr,
    dim=dim,
    ws=ws,
    epoch=epoch,
    minCount=minCount,
    minCountLabel=minCountLabel,
    minn=minn,
    maxn=maxn,
    neg=neg,
    wordNgrams=wordNgrams,
    loss=loss,
    bucket=bucket,
    thread=thread,
    lrUpdateRate=lrUpdateRate,
    t=t,
    verbose=verbose,
)

print("Training finished.")
print("Dimension: {}".format(model.get_dimension()))
print("Number of words: {}".format(len(model.get_words())))

# Output model to disk if needed
model.save_model(model_filename)

# Load saved model if needed
model = fastText.load_model(model_filename)

Training finished.
Dimension: 100
Number of words: 6040


# Query

**Get word vectors of corpus**

In [18]:
words = np.array(model.get_words())
word_vectors = np.array([model.get_word_vector(w) for w in words])

**Similarity of word vectors**

In the embedding space, cosine similarity can be used to measure how similar two words are.

In [19]:
# Return the indices of the N rows of X closest to inX (by cosine distance),
# together with the full column of distances
def calc_n_cosine_neighbor(inX, X, N):
    if inX.ndim == 1:
        inX = [inX]
    distances = sklearn.metrics.pairwise.pairwise_distances(
        X, inX, metric="cosine")
    sortedDist = distances.reshape((distances.shape[0],)).argsort()
    return sortedDist[:N], distances

# Calculate nearest neighbours based on cosine similarity
def nn(query, words=words, word_vectors=word_vectors, k=10):
    """
    query: word to look up
    words: numpy array of words
    word_vectors: numpy array of vectors, aligned with `words`
    k: (optional, 10 by default) number of neighbours to return
    """
    v = model.get_word_vector(query)
    idx, _ = calc_n_cosine_neighbor(v, word_vectors, k)
    return words[idx]
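
For readers who want to see what the cosine metric computes, here is a minimal NumPy-only sketch. The helper names `cosine_distance` and `top_n_by_cosine` are hypothetical and are not used elsewhere in this notebook; the sketch also assumes no vector has zero norm, since that would cause a division by zero.

```python
import numpy as np

def cosine_distance(u, v):
    # Cosine distance = 1 - cosine similarity
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def top_n_by_cosine(query, matrix, n):
    # Cosine similarity between `query` and every row of `matrix`
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # Highest similarity first, which matches lowest distance first
    return np.argsort(-sims)[:n]

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_n_by_cosine([1.0, 0.1], toy_vectors, 2))
```

Because cosine similarity ignores vector length, two words count as similar when their vectors point in the same direction, regardless of magnitude.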

## Query nearest words

In [20]:
q = "lasvegasshooting"

# Use the query variable instead of repeating the literal
neighbours = nn(q, k=20)

print("Neighbours of word \"{}\":".format(q))
for word in neighbours:
    print(word)

Neighbours of word "lasvegasshooting":
shooting
lasvegas
vegas”
las
vegas
rt
vega
“shooting”
shootin

shooting”
shooti
shootings
❤
🙏🙏🙏
👍
cc
🙏🏾
😢💔
😓


## Get sentence vector

Use the `get_sentence_vector` API to get a vector representation of a sentence.
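
As far as the fastText documentation describes it, for unsupervised models the sentence vector is the average of the word vectors after each one has been divided by its L2 norm, skipping vectors whose norm is zero. The sketch below illustrates that averaging on toy data under that assumption; `average_normalized_vectors` and `toy_vectors` are hypothetical names, and the library's own tokenization may differ from a plain whitespace split.

```python
import numpy as np

def average_normalized_vectors(tokens, lookup, dim):
    """Average the L2-normalized vectors of `tokens`.

    `lookup` is any callable mapping a token to a vector of length `dim`
    (with a trained model it could be `model.get_word_vector`).
    """
    total = np.zeros(dim)
    count = 0
    for tok in tokens:
        vec = np.asarray(lookup(tok), dtype=float)
        norm = np.linalg.norm(vec)
        if norm > 0:  # zero vectors carry no direction, so skip them
            total += vec / norm
            count += 1
    return total / count if count else total

# Toy 2-dimensional vectors, purely for illustration
toy_vectors = {"las": [3.0, 0.0], "vegas": [0.0, 4.0], "pad": [0.0, 0.0]}
print(average_normalized_vectors("las vegas pad".split(), toy_vectors.get, dim=2))
```

Normalizing before averaging keeps a few long vectors from dominating the sentence representation.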

In [21]:
example_tweet = "RT @TheLeadCNN: Remembering Keri Lynn Galvan, from Thousand Oaks, California. #LasVegasLost https://t.co/QuvXa6WvlE https://t.co/hDF2d3Owgn"

tweet_vector = model.get_sentence_vector(example_tweet)
print("Tweet vector in embedding space:")
print(example_tweet)
print()
print(tweet_vector)

print()
print("Words similar this tweet")
idx, _ = calc_n_cosine_neighbor(tweet_vector, word_vectors, 20)
print([words[i] for i in idx])

Tweet vector in embedding space:
RT @TheLeadCNN: Remembering Keri Lynn Galvan, from Thousand Oaks, California. #LasVegasLost https://t.co/QuvXa6WvlE https://t.co/hDF2d3Owgn

[-0.02461572 0.04784836 -0.05343785 0.00153351 0.04367601 0.10020498
 -0.01127366 -0.00975734 -0.01951972 0.07512145 0.03622668 -0.00580111
 0.08758368 0.031007 -0.00507403 0.07074952 -0.05185707 -0.11242248
 -0.03888126 -0.01926897 0.08175821 -0.01120457 -0.07555435 -0.04022888
 0.00478477 -0.0012044 0.05348494 0.0350855 0.0982817 0.01342872
 0.00545024 0.00250413 0.03077969 -0.0874893 -0.03390906 0.14996992
 -0.01272367 -0.02368226 -0.01887075 -0.02408492 -0.03291685 -0.05095126
 -0.04614896 0.10122891 0.07110424 -0.12804917 -0.05888803 -0.03085945
 -0.01463612 0.11134949 -0.08774657 -0.01715528 -0.08862083 0.00346183
 0.09192748 0.05510866 -0.04465136 -0.0433164 0.02116909 -0.06731256
 -0.00497376 -0.02442945 0.04918417 -0.03386533 0.05390133 0.01210842
 -0.03669443 0.00295777 -0.00802929 0.05568004 0.03773327 0