![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/03.Word_Sentence_Embeddings.ipynb)

# Financial Word and Sentence Embeddings

# Finance Word and Sentence Embeddings visualization using PCA (Principal Component Analysis)

Modern NLP models work with numerical representations of texts and their meaning. Token classification problems (inferring a class for each token, as in Named Entity Recognition) require Word Embeddings. For sentence, paragraph, or document classification, we use Sentence Embeddings.

In this notebook, we get token embeddings from the Spark NLP Finance Word Embeddings model (**bert_embeddings_sec_bert_base**) and then use the Spark NLP SentenceEmbeddings annotator to pool those token embeddings into sentence embeddings, which are numerical representations of the semantics of the texts. Each embedding is a vector of 768 values, which is impossible to interpret by eye.

There are many techniques for visualizing those embeddings. We use one of them: Principal Component Analysis (PCA), a dimensionality reduction technique, carried out by Spark MLlib. Both kinds of embeddings have 768 dimensions, so we will reduce them from **768** to **3** (X, Y, Z) and use color to show the label of each word or sentence.
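To make the reduction from 768 to 3 dimensions concrete, here is a minimal NumPy sketch of PCA on synthetic data. It is an illustration only: the function name `pca_reduce` and the random input are assumptions made for this example, it is not the Spark MLlib implementation used later in the notebook, and its numbers will not match the notebook's output.

```python
import numpy as np

def pca_reduce(embeddings: np.ndarray, k: int = 3) -> np.ndarray:
    """Project the rows of `embeddings` onto their top-k principal components."""
    # Center each dimension so the components describe variance, not the mean.
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Synthetic stand-in for 14 texts with 768-dimensional embeddings (hypothetical data).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(14, 768))

points = pca_reduce(fake_embeddings, k=3)
print(points.shape)  # (14, 3)
```

Each row of `points` can then be plotted as one (X, Y, Z) coordinate, which is what the 3D scatter plots at the end of each section do with the Spark output.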

# Installation

In [None]:
! pip install johnsnowlabs

## Automatic Installation
Using my.johnsnowlabs.com SSO

In [None]:
from johnsnowlabs import nlp, finance

# nlp.install(force_browser=True)

## Manual downloading
If you are not registered in my.johnsnowlabs.com, you received a license via email, or you are using Safari, you may need to upload the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

# Starting

In [None]:
spark = nlp.start()

# Get sample text

In [None]:
! pip install -q plotly

# Downloading sample datasets.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/finance_pca_samples.csv

In [None]:
import pandas as pd

df = pd.read_csv("finance_pca_samples.csv")

In [None]:
# Create spark dataframe
sdf = spark.createDataFrame(df)
sdf.show()

+--------------------+----------------+
|                text|           label|
+--------------------+----------------+
|I called Huntingt...|        Accounts|
|I opened an citi ...|        Accounts|
|I have been a lon...|    Credit Cards|
|My credit limit w...|    Credit Cards|
|I am filing this ...|Credit Reporting|
|The Credit Bureau...|Credit Reporting|
|I noticed an arti...| Debt Collection|
|A bank account wa...| Debt Collection|
|I was contacted v...|           Loans|
|My husband recent...|           Loans|
|I wire transfered...| Money Transfers|
|PayPal holds fund...| Money Transfers|
|We have requested...|        Mortgage|
|I filled out a co...|        Mortgage|
+--------------------+----------------+



# Pipeline with Spark NLP and Spark MLlib

In [None]:
# Define a generic pipeline shared by the word and sentence embeddings

def generic_pipeline():
  document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

  tokenizer = nlp.Tokenizer()\
      .setInputCols("document")\
      .setOutputCol("token")

  word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
      .setInputCols(["document", "token"])\
      .setOutputCol("word_embeddings")

  pipeline = nlp.Pipeline(stages = [
      document_assembler,
      tokenizer,
      word_embeddings
  ])

  return pipeline



## Sentence Embeddings

In [None]:
embeddings_sentence = nlp.SentenceEmbeddings()\
    .setInputCols(["document", "word_embeddings"])\
    .setOutputCol("sentence_embeddings")\
    .setPoolingStrategy("AVERAGE")
# We use the Spark NLP SentenceEmbeddings annotator to get each sentence embedding from the token embeddings
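As a rough illustration of what the "AVERAGE" pooling strategy means, the sketch below takes the element-wise mean of a few toy token vectors in plain Python. The helper `average_pool` and the 3-dimensional vectors are hypothetical; the annotator itself works on Spark NLP annotations with 768-dimensional vectors.

```python
def average_pool(token_vectors):
    """Return the element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    count = len(token_vectors)
    # zip(*...) groups the values of each dimension across all tokens.
    return [sum(dim_values) / count for dim_values in zip(*token_vectors)]

# Two toy token embeddings with 3 dimensions each (made-up values).
toy_tokens = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]

sentence_vector = average_pool(toy_tokens)
print(sentence_vector)  # [2.0, 2.0, 1.0]
```

The result has the same number of dimensions as each token vector, which is why the sentence embeddings in this notebook also have 768 dimensions.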

# Custom transform to retrieve the numerical embeddings from Spark NLP and pass them to Spark MLlib

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

In [None]:
# This class extracts the embeddings from the Spark NLP Annotation object
# from pyspark import ml as ML
class EmbeddingsUDF(
    nlp.Transformer, nlp.ML.param.shared.HasInputCol, nlp.ML.param.shared.HasOutputCol,
    nlp.ML.util.DefaultParamsReadable, nlp.ML.util.DefaultParamsWritable
):
    @keyword_only
    def __init__(self):
        super(EmbeddingsUDF, self).__init__()

        def _sum(r):
            result = 0.0
            for e in r:
                result += e
            return result

        self.udfs = {
            'convertToVectorUDF': F.udf(lambda vs: nlp.ML.linalg.Vectors.dense(vs), nlp.ML.linalg.VectorUDT()),
            'sumUDF': F.udf(lambda r: _sum(r), T.FloatType())
        }

    def _transform(self, dataset):

        results = dataset.select(
            "*", F.explode("sentence_embeddings.embeddings").alias("embeddings")
        )
        results = results.withColumn(
            "features",
            self.udfs['convertToVectorUDF'](F.col("embeddings"))
        )
        results = results.withColumn(
            "emb_sum",
            self.udfs['sumUDF'](F.col("embeddings"))
        )
        # Remove those with embeddings all zeroes (so we can calculate cosine distance)
        results = results.where(F.col("emb_sum")!=0.0)

        return results
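The comment in the transformer says all-zero embeddings are removed so that cosine distance can be calculated. The plain-Python sketch below shows why: cosine similarity divides by the vector norms, so it is undefined for a zero vector. The helper `cosine_similarity` is written for this example only. Note also that the transformer filters on the sum of the components, and a vector can sum to zero without being all zeroes, as the last lines show.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine similarity of two vectors, or None if either norm is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Dividing by a zero norm is undefined, so such vectors must be excluded.
        return None
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # close to 1.0
with_zero_vector = cosine_similarity([0.0, 0.0], [1.0, 1.0])  # None

# A vector whose components cancel out sums to zero but is not all zeroes.
balanced = [1.0, -1.0]
sums_to_zero = sum(balanced) == 0.0
is_all_zero = all(v == 0.0 for v in balanced)
print(sums_to_zero, is_all_zero)  # True False
```

With 768 floating-point components an exact zero sum is unlikely for a real embedding, so the sum check is a cheap approximation of the all-zero test rather than an exact equivalent.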

In [None]:
embeddings_for_pca = EmbeddingsUDF()

In [None]:
DIMENSIONS  = 3

In [None]:
pca = nlp.ML.feature.PCA(k=DIMENSIONS, inputCol="features", outputCol="pca_features")

### Full Spark NLP + Spark MLlib pipeline

In [None]:
# Run the whole process in one pipeline
pipeline = nlp.Pipeline().setStages([generic_pipeline(), embeddings_sentence, embeddings_for_pca, pca])

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [None]:
model = pipeline.fit(sdf)

In [None]:
result = model.transform(sdf)

In [None]:
result.select('pca_features', 'label').show(truncate=False)

+--------------------------------------------------------------+----------------+
|pca_features                                                  |label           |
+--------------------------------------------------------------+----------------+
|[3.39576448119276,-1.060361129782475,-1.568794006399417]      |Accounts        |
|[2.3660850756971623,0.8591941003552866,-0.8066168807669747]   |Accounts        |
|[0.6867735108170906,1.4823947144210112,0.006591220237646302]  |Credit Cards    |
|[-0.28834125177427167,1.0031549697755784,-0.7963810505318434] |Credit Cards    |
|[-0.5037809008469382,-1.3771583372345915,0.4449701036930799]  |Credit Reporting|
|[1.039756950301059,-1.7194174825036457,1.8539366217014026]    |Credit Reporting|
|[2.7731701148109815,1.1680247656394984,1.3949448202984454]    |Debt Collection |
|[-0.45951034017887454,0.833969250052939,0.5051728405912744]   |Debt Collection |
|[0.2703079726928541,1.1069420631113542,-0.4247559623637921]   |Loans           |
|[0.866252306486

In [None]:
df = result.select('pca_features', 'label').toPandas()

df
# As you can see, the dimension values are inside a list

Unnamed: 0,pca_features,label
0,"[3.39576448119276, -1.060361129782475, -1.5687...",Accounts
1,"[2.3660850756971623, 0.8591941003552866, -0.80...",Accounts
2,"[0.6867735108170906, 1.4823947144210112, 0.006...",Credit Cards
3,"[-0.28834125177427167, 1.0031549697755784, -0....",Credit Cards
4,"[-0.5037809008469382, -1.3771583372345915, 0.4...",Credit Reporting
5,"[1.039756950301059, -1.7194174825036457, 1.853...",Credit Reporting
6,"[2.7731701148109815, 1.1680247656394984, 1.394...",Debt Collection
7,"[-0.45951034017887454, 0.833969250052939, 0.50...",Debt Collection
8,"[0.2703079726928541, 1.1069420631113542, -0.42...",Loans
9,"[0.8662523064864315, 1.1435249671794807, 0.870...",Loans


In [None]:
# We extract the dimension values out of the list

df["x"] = df["pca_features"].apply(lambda x: x[0])

df["y"] = df["pca_features"].apply(lambda x: x[1])

df["z"] = df["pca_features"].apply(lambda x: x[2])

df = df[["x", "y", "z", "label"]]

df

Unnamed: 0,x,y,z,label
0,3.395764,-1.060361,-1.568794,Accounts
1,2.366085,0.859194,-0.806617,Accounts
2,0.686774,1.482395,0.006591,Credit Cards
3,-0.288341,1.003155,-0.796381,Credit Cards
4,-0.503781,-1.377158,0.44497,Credit Reporting
5,1.039757,-1.719417,1.853937,Credit Reporting
6,2.77317,1.168025,1.394945,Debt Collection
7,-0.45951,0.833969,0.505173,Debt Collection
8,0.270308,1.106942,-0.424756,Loans
9,0.866252,1.143525,0.870356,Loans


In [None]:
import plotly.express as px

fig = px.scatter_3d(df, x = 'x', y = 'y', z = 'z', color = 'label', width=800, height=600)

fig.show()

## Word Embeddings

We can also visualize the semantics of words, instead of full texts, by using Word Embeddings. The generic pipeline already contains a Tokenizer and a word embeddings model, so we use it to get those embeddings and then apply PCA as before. First, we split the pipeline in two so that we can collect all the token embeddings.

In [None]:
model = generic_pipeline().fit(sdf)

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [None]:
result = model.transform(sdf)

In [None]:
result_df = result.select("label", F.explode(F.arrays_zip(result.token.result, result.word_embeddings.embeddings)).alias("cols"))\
      .select(F.expr("cols['0']").alias("token"),
              "label",
              F.expr("cols['1']").alias("embeddings"))

result_df.show(truncate = 80)


+----------+--------+--------------------------------------------------------------------------------+
|     token|   label|                                                                      embeddings|
+----------+--------+--------------------------------------------------------------------------------+
|         I|Accounts|[-0.29679197, 0.80952483, 0.026026089, 0.08434192, 0.7434629, -0.02694758, -0...|
|    called|Accounts|[0.28905854, -0.29229686, -0.42990392, -0.3833449, 0.026178285, -0.12728442, ...|
|Huntington|Accounts|[0.20684586, -0.010130149, -0.259025, -0.37558293, 0.45792142, 0.3114912, -0....|
|      Bank|Accounts|[-0.034710683, 0.46047488, -0.6221113, -0.011169381, 0.2938512, 0.31341088, -...|
|        to|Accounts|[-0.40457863, -0.3768647, -0.08015404, -0.58909655, -0.33856544, -0.39321256,...|
|     close|Accounts|[0.35089388, 0.9568475, 0.86328286, -0.4334402, 0.11386797, -0.48837784, -0.8...|
|        my|Accounts|[-0.36591864, 0.2655603, -0.32495034, -0.5081896, -0
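The `arrays_zip` plus `explode` step above pairs each token with its embedding and emits one row per pair. The plain-Python sketch below mimics that reshaping on toy data; the function `zip_and_explode`, the dictionary keys, and the 2-dimensional vectors are assumptions for illustration. One difference to keep in mind: Python's `zip` stops at the shorter sequence, while Spark's `arrays_zip` pads the shorter array with nulls.

```python
def zip_and_explode(rows):
    """Yield one record per token, pairing each token with its embedding."""
    for row in rows:
        for token, vector in zip(row["tokens"], row["embeddings"]):
            yield {"token": token, "label": row["label"], "embeddings": vector}

# Toy rows: one document per row, with parallel lists of tokens and vectors.
toy_rows = [
    {"label": "Accounts", "tokens": ["I", "called"],
     "embeddings": [[0.1, 0.2], [0.3, 0.4]]},
    {"label": "Loans", "tokens": ["My"],
     "embeddings": [[0.5, 0.6]]},
]

exploded = list(zip_and_explode(toy_rows))
print(len(exploded))  # 3
```

After this step every record holds a single token and a single vector, so the embeddings column no longer needs to be exploded before PCA.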

In [None]:
# Here we define a subclass of the EmbeddingsUDF class defined previously
class WordEmbeddingsUDF(EmbeddingsUDF):    
    def _transform(self, dataset):
        
        results = dataset.select('token', 'label', 'embeddings') # We changed this line because our embeddings column is already exploded

        results = results.withColumn(
            "features",
            self.udfs['convertToVectorUDF'](F.col("embeddings"))
        )
        results = results.withColumn(
            "emb_sum",
            self.udfs['sumUDF'](F.col("embeddings"))
        )
        # Remove those with embeddings all zeroes (so we can calculate cosine distance)
        results = results.where(F.col("emb_sum")!=0.0)

        return results

In [None]:
embeddings_for_pca = WordEmbeddingsUDF()

In [None]:
DIMENSIONS  = 3

In [None]:
pca = nlp.ML.feature.PCA(k=DIMENSIONS, inputCol="features", outputCol="pca_features")

### Full Spark NLP + Spark MLlib pipeline

In [None]:
# We run the second part of the pipeline. Here the 768 dimensions are reduced to 3 dimensions

pipeline = nlp.Pipeline().setStages([embeddings_for_pca, pca])


In [None]:
model = pipeline.fit(result_df)

In [None]:
result = model.transform(result_df)

In [None]:
result.select("token", "label", "pca_features").show(truncate = 60)

+----------+--------+------------------------------------------------------------+
|     token|   label|                                                pca_features|
+----------+--------+------------------------------------------------------------+
|         I|Accounts|  [9.850468172808704,0.02182025684995559,1.7128883074588641]|
|    called|Accounts|   [0.5703260311955864,0.346658149631252,-2.867726751670609]|
|Huntington|Accounts|  [8.635450770647445,0.8802312004740499,-0.8417105564124523]|
|      Bank|Accounts| [9.391061503515894,0.45066516018168057,-1.2157436459087525]|
|        to|Accounts|  [-2.093784358504493,-1.1261933945050695,4.473374538741789]|
|     close|Accounts| [-2.897764751048121,-0.1633032944974737,2.6316552582800594]|
|        my|Accounts|    [3.542237747747922,-2.721495573008954,2.847896218683586]|
|   account|Accounts|[-1.2533257167247633,0.006480340909400874,1.9023215773218...|
|         ,|Accounts|  [-1.371343619695057,0.16043397738672746,2.236148062116737]|
|   

In [None]:
df = result.select('token', 'label', 'pca_features').toPandas()

df

Unnamed: 0,token,label,pca_features
0,I,Accounts,"[9.850468172808704, 0.02182025684995559, 1.712..."
1,called,Accounts,"[0.5703260311955864, 0.346658149631252, -2.867..."
2,Huntington,Accounts,"[8.635450770647445, 0.8802312004740499, -0.841..."
3,Bank,Accounts,"[9.391061503515894, 0.45066516018168057, -1.21..."
4,to,Accounts,"[-2.093784358504493, -1.1261933945050695, 4.47..."
...,...,...,...
1364,the,Mortgage,"[0.20783178705004846, 1.2121685298369587, 2.34..."
1365,company,Mortgage,"[0.9758784877952482, 1.1525640123640015, 1.548..."
1366,never,Mortgage,"[-0.009449827591906173, -1.360506257943843, -0..."
1367,responds,Mortgage,"[-1.3105360623344586, -0.3952000653886483, -1...."


In [None]:
df["x"] = df["pca_features"].apply(lambda x: x[0])

df["y"] = df["pca_features"].apply(lambda x: x[1])

df["z"] = df["pca_features"].apply(lambda x: x[2])

df = df[["token", "label", "x", "y", "z"]]

df

Unnamed: 0,token,label,x,y,z
0,I,Accounts,9.850468,0.021820,1.712888
1,called,Accounts,0.570326,0.346658,-2.867727
2,Huntington,Accounts,8.635451,0.880231,-0.841711
3,Bank,Accounts,9.391062,0.450665,-1.215744
4,to,Accounts,-2.093784,-1.126193,4.473375
...,...,...,...,...,...
1364,the,Mortgage,0.207832,1.212169,2.345686
1365,company,Mortgage,0.975878,1.152564,1.548878
1366,never,Mortgage,-0.009450,-1.360506,-0.080957
1367,responds,Mortgage,-1.310536,-0.395200,-1.634091


In [None]:
import plotly.express as px

fig = px.scatter_3d(df, x = 'x', y = 'y', z = 'z', color = "label", width=1000, height = 800, hover_data = ["token", "label"])

fig.show()