{
"cells": [
{
"cell_type": "markdown",
"id": "e9KViDTWeMEE",
"metadata": {
"id": "e9KViDTWeMEE"
},
"source": [
"![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)"
]
},
{
"cell_type": "markdown",
"id": "kcSuGqjTeOAy",
"metadata": {
"id": "kcSuGqjTeOAy"
},
"source": [
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/03.Word_Sentence_Embeddings.ipynb)"
]
},
{
"cell_type": "markdown",
"id": "e35e5153",
"metadata": {
"id": "e35e5153"
},
"source": [
"# Financial Word and Sentence Embeddings"
]
},
{
"cell_type": "markdown",
"id": "69a14890-bc41-4765-9557-89a969c04d8f",
"metadata": {
"id": "69a14890-bc41-4765-9557-89a969c04d8f"
},
"source": [
"# Finance Word and Sentence Embeddings visualization using PCA (Principal Component Analysis)"
]
},
{
"cell_type": "markdown",
"id": "5f982ae9-0570-4f90-bd30-c9d55d219e5b",
"metadata": {
"id": "5f982ae9-0570-4f90-bd30-c9d55d219e5b"
},
"source": [
"Modern NLP models work with numerical representations of texts and their meaning. Token classification problems (inferring a class for each token, for example Named Entity Recognition) require Word Embeddings. For sentence, paragraph, or document classification, we use Sentence Embeddings.\n",
"\n",
"In this notebook, we obtain token embeddings with the Spark NLP Finance Word Embeddings model (**bert_embeddings_sec_bert_base**) and then derive sentence embeddings from them with the Spark NLP `SentenceEmbeddings` annotator. These embeddings are the numerical representations of the semantics of the texts. Each embedding is a 768-dimensional vector, impossible to interpret by eye.\n",
"\n",
"There are many techniques for visualizing embeddings. We use one of them: Principal Component Analysis (PCA), a dimensionality reduction technique, carried out by Spark MLlib. Both embeddings have 768 dimensions, so we will reduce them from **768** to **3** (X, Y, Z) and use color to indicate the label of each word / sentence."
]
},
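{
"cell_type": "markdown",
"id": "pca-numpy-sketch",
"metadata": {
"id": "pca-numpy-sketch"
},
"source": [
"As a rough illustration of the dimensionality reduction described above, the following sketch projects vectors onto their first three principal components with plain NumPy. It is not the Spark MLlib implementation used in this notebook: the helper `pca_project` and the random stand-in data are assumptions made for the example, and Spark's results may differ in details such as centering and component sign.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def pca_project(vectors, k=3):\n",
"    # Hypothetical helper: project row vectors onto their first k principal components\n",
"    X = np.asarray(vectors, dtype=float)\n",
"    centered = X - X.mean(axis=0)  # this sketch mean-centers the data first\n",
"    # Rows of vt are the principal directions, ordered by explained variance\n",
"    _, _, vt = np.linalg.svd(centered, full_matrices=False)\n",
"    return centered @ vt[:k].T\n",
"\n",
"rng = np.random.default_rng(0)\n",
"fake_embeddings = rng.normal(size=(14, 768))  # random stand-ins for 14 sentence embeddings\n",
"points_3d = pca_project(fake_embeddings, k=3)\n",
"print(points_3d.shape)  # (14, 3)\n",
"```\n",
"\n",
"Each row of `points_3d` can then be plotted as one (X, Y, Z) point."
]
},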
{
"cell_type": "markdown",
"id": "d07e14ba-b319-4d4a-b38d-5ed93debc5e6",
"metadata": {
"id": "d07e14ba-b319-4d4a-b38d-5ed93debc5e6"
},
"source": [
"# Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "nRcDPOsDqhkN",
"metadata": {
"id": "nRcDPOsDqhkN"
},
"outputs": [],
"source": [
"! pip install johnsnowlabs"
]
},
{
"cell_type": "markdown",
"id": "YPsbAnNoPt0Z",
"metadata": {
"id": "YPsbAnNoPt0Z"
},
"source": [
"## Automatic Installation\n",
"Using my.johnsnowlabs.com SSO"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "_L-7mLYp3pqr",
"metadata": {
"id": "_L-7mLYp3pqr",
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"from johnsnowlabs import nlp, finance\n",
"\n",
"# nlp.install(force_browser=True)"
]
},
{
"cell_type": "markdown",
"id": "hsJvn_WWM2GL",
"metadata": {
"id": "hsJvn_WWM2GL"
},
"source": [
"## Manual downloading\n",
"If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to upload the license manually.\n",
"\n",
"- Go to my.johnsnowlabs.com\n",
"- Download your license\n",
"- Upload it using the following command"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "i57QV3-_P2sQ",
"metadata": {
"id": "i57QV3-_P2sQ"
},
"outputs": [],
"source": [
"from google.colab import files\n",
"print('Please Upload your John Snow Labs License using the button below')\n",
"license_keys = files.upload()"
]
},
{
"cell_type": "markdown",
"id": "xGgNdFzZP_hQ",
"metadata": {
"id": "xGgNdFzZP_hQ"
},
"source": [
"- Install it"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "OfmmPqknP4rR",
"metadata": {
"id": "OfmmPqknP4rR"
},
"outputs": [],
"source": [
"nlp.install()"
]
},
{
"cell_type": "markdown",
"id": "DCl5ErZkNNLk",
"metadata": {
"id": "DCl5ErZkNNLk"
},
"source": [
"# Starting"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "x3jVICoa3pqr",
"metadata": {
"id": "x3jVICoa3pqr"
},
"outputs": [],
"source": [
"spark = nlp.start()"
]
},
{
"cell_type": "markdown",
"id": "4b7c011d-eac8-4eda-b272-027bab6a8895",
"metadata": {
"id": "4b7c011d-eac8-4eda-b272-027bab6a8895"
},
"source": [
"# Get sample text"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "XvEU36EbfUZH",
"metadata": {
"id": "XvEU36EbfUZH"
},
"outputs": [],
"source": [
"! pip install -q plotly\n",
"\n",
"# Downloading sample datasets.\n",
"! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/finance_pca_samples.csv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f411a9b-9744-4f77-bdb9-58702260b5b0",
"metadata": {
"id": "5f411a9b-9744-4f77-bdb9-58702260b5b0"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv(\"finance_pca_samples.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3977898f-24e8-43d6-a338-80f3bf8ad2a1",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3977898f-24e8-43d6-a338-80f3bf8ad2a1",
"outputId": "5f8679a4-78ab-4afa-da11-06290fd7abcb",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+----------------+\n",
"| text| label|\n",
"+--------------------+----------------+\n",
"|I called Huntingt...| Accounts|\n",
"|I opened an citi ...| Accounts|\n",
"|I have been a lon...| Credit Cards|\n",
"|My credit limit w...| Credit Cards|\n",
"|I am filing this ...|Credit Reporting|\n",
"|The Credit Bureau...|Credit Reporting|\n",
"|I noticed an arti...| Debt Collection|\n",
"|A bank account wa...| Debt Collection|\n",
"|I was contacted v...| Loans|\n",
"|My husband recent...| Loans|\n",
"|I wire transfered...| Money Transfers|\n",
"|PayPal holds fund...| Money Transfers|\n",
"|We have requested...| Mortgage|\n",
"|I filled out a co...| Mortgage|\n",
"+--------------------+----------------+\n",
"\n"
]
}
],
"source": [
"# Create spark dataframe\n",
"sdf = spark.createDataFrame(df)\n",
"sdf.show()"
]
},
{
"cell_type": "markdown",
"id": "5de27fdb-3fc1-4d58-8505-4f02238635b8",
"metadata": {
"id": "5de27fdb-3fc1-4d58-8505-4f02238635b8"
},
"source": [
"# Pipeline with Spark NLP and Spark MLlib"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "YwQjfHcM_a4q",
"metadata": {
"id": "YwQjfHcM_a4q"
},
"outputs": [],
"source": [
"# Generic pipeline that produces token-level (word) embeddings.\n",
"# It is reused for both the sentence and the word embeddings below.\n",
"\n",
"def generic_pipeline():\n",
"    document_assembler = nlp.DocumentAssembler()\\\n",
"        .setInputCol(\"text\")\\\n",
"        .setOutputCol(\"document\")\n",
"\n",
"    tokenizer = nlp.Tokenizer()\\\n",
"        .setInputCols(\"document\")\\\n",
"        .setOutputCol(\"token\")\n",
"\n",
"    word_embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\")\\\n",
"        .setInputCols([\"document\", \"token\"])\\\n",
"        .setOutputCol(\"word_embeddings\")\n",
"\n",
"    pipeline = nlp.Pipeline(stages = [\n",
"        document_assembler,\n",
"        tokenizer,\n",
"        word_embeddings\n",
"    ])\n",
"\n",
"    return pipeline"
]
},
{
"cell_type": "markdown",
"id": "d751fd45-9810-4b43-b893-0d373e8d4870",
"metadata": {
"id": "d751fd45-9810-4b43-b893-0d373e8d4870"
},
"source": [
"## Sentence Embeddings"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b990cfd8-576f-4057-a548-0434d39897bd",
"metadata": {
"id": "b990cfd8-576f-4057-a548-0434d39897bd",
"tags": []
},
"outputs": [],
"source": [
"# The Spark NLP SentenceEmbeddings annotator pools the token embeddings\n",
"# of each document into a single sentence embedding (here by averaging).\n",
"embeddings_sentence = nlp.SentenceEmbeddings()\\\n",
"    .setInputCols([\"document\", \"word_embeddings\"])\\\n",
"    .setOutputCol(\"sentence_embeddings\")\\\n",
"    .setPoolingStrategy(\"AVERAGE\")"
]
},
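{
"cell_type": "markdown",
"id": "average-pooling-sketch",
"metadata": {
"id": "average-pooling-sketch"
},
"source": [
"To make the `AVERAGE` pooling strategy concrete, here is a minimal NumPy sketch of the idea: the sentence embedding is the element-wise mean of the token embeddings. The tiny 4-dimensional vectors are made up for the example (the real embeddings have 768 dimensions), and the sketch only illustrates the arithmetic, not the annotator's internals.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical 4-dimensional embeddings for a three-token sentence\n",
"token_embeddings = np.array([\n",
"    [1.0, 0.0, 2.0, 4.0],\n",
"    [3.0, 2.0, 0.0, 0.0],\n",
"    [2.0, 4.0, 1.0, 2.0],\n",
"])\n",
"\n",
"# AVERAGE pooling: element-wise mean over the token axis\n",
"sentence_embedding = token_embeddings.mean(axis=0)\n",
"print(sentence_embedding)  # [2. 2. 1. 2.]\n",
"```\n",
"\n",
"The pooled vector has the same dimensionality as each token vector, which is why both kinds of embeddings can go through the same PCA step."
]
},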
{
"cell_type": "markdown",
"id": "fde3b08f-6187-43fd-8d7b-3fe6da054c18",
"metadata": {
"id": "fde3b08f-6187-43fd-8d7b-3fe6da054c18"
},
"source": [
"# Custom transformer to retrieve the numerical embeddings from Spark NLP and pass them to Spark MLlib"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "Qe4s0BQxkMPU",
"metadata": {
"id": "Qe4s0BQxkMPU"
},
"outputs": [],
"source": [
"from pyspark.sql import DataFrame\n",
"import pyspark.sql.functions as F\n",
"import pyspark.sql.types as T\n",
"import pyspark.sql as SQL\n",
"from pyspark import keyword_only"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "30cfff3f-487b-4808-8b32-f54fc443777b",
"metadata": {
"id": "30cfff3f-487b-4808-8b32-f54fc443777b"
},
"outputs": [],
"source": [
"# This transformer extracts the embeddings from the Spark NLP annotations\n",
"# and exposes them as a Spark ML vector column named \"features\".\n",
"# from pyspark import ml as ML\n",
"class EmbeddingsUDF(\n",
"    nlp.Transformer, nlp.ML.param.shared.HasInputCol, nlp.ML.param.shared.HasOutputCol,\n",
"    nlp.ML.util.DefaultParamsReadable, nlp.ML.util.DefaultParamsWritable\n",
"):\n",
"    @keyword_only\n",
"    def __init__(self):\n",
"        super(EmbeddingsUDF, self).__init__()\n",
"\n",
"        def _sum(r):\n",
"            result = 0.0\n",
"            for e in r:\n",
"                result += e\n",
"            return result\n",
"\n",
"        self.udfs = {\n",
"            'convertToVectorUDF': F.udf(lambda vs: nlp.ML.linalg.Vectors.dense(vs), nlp.ML.linalg.VectorUDT()),\n",
"            'sumUDF': F.udf(lambda r: _sum(r), T.FloatType())\n",
"        }\n",
"\n",
"    def _transform(self, dataset):\n",
"        # One row per sentence embedding\n",
"        results = dataset.select(\n",
"            \"*\", F.explode(\"sentence_embeddings.embeddings\").alias(\"embeddings\")\n",
"        )\n",
"        results = results.withColumn(\n",
"            \"features\",\n",
"            self.udfs['convertToVectorUDF'](F.col(\"embeddings\"))\n",
"        )\n",
"        results = results.withColumn(\n",
"            \"emb_sum\",\n",
"            self.udfs['sumUDF'](F.col(\"embeddings\"))\n",
"        )\n",
"        # Drop rows whose embedding values sum to zero, a proxy for all-zero\n",
"        # embeddings (for which cosine distance cannot be calculated)\n",
"        results = results.where(F.col(\"emb_sum\") != 0.0)\n",
"\n",
"        return results"
]
},
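{
"cell_type": "markdown",
"id": "cosine-zero-vector-sketch",
"metadata": {
"id": "cosine-zero-vector-sketch"
},
"source": [
"The transformer above filters out zero embeddings so that cosine distance can be calculated. The sketch below shows why: cosine distance divides by the product of the vector norms, which is zero for an all-zero vector. The function `cosine_distance` is a hypothetical helper written for this illustration and is not part of Spark NLP or Spark MLlib.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def cosine_distance(a, b):\n",
"    # Return 1 - cos(a, b), or None when either vector has zero norm\n",
"    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)\n",
"    norm_product = np.linalg.norm(a) * np.linalg.norm(b)\n",
"    if norm_product == 0.0:\n",
"        return None  # undefined: would divide by zero\n",
"    return 1.0 - float(a @ b) / norm_product\n",
"\n",
"print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal vectors)\n",
"print(cosine_distance([1.0, 2.0], [0.0, 0.0]))  # None (all-zero vector)\n",
"```\n",
"\n",
"Note that a sum of zero is only a proxy for an all-zero vector: a vector such as `[1.0, -1.0]` also sums to zero, although this is unlikely for real 768-dimensional embeddings."
]
},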
{
"cell_type": "code",
"execution_count": null,
"id": "5f82e139-1cc3-423d-9d53-c0ccfec835c2",
"metadata": {
"id": "5f82e139-1cc3-423d-9d53-c0ccfec835c2"
},
"outputs": [],
"source": [
"embeddings_for_pca = EmbeddingsUDF()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5213dfcf-61bc-4db2-a8c7-797a318a13f4",
"metadata": {
"id": "5213dfcf-61bc-4db2-a8c7-797a318a13f4"
},
"outputs": [],
"source": [
"DIMENSIONS = 3"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "285ba04a-dbe2-4f53-9aa0-fc66ba783e14",
"metadata": {
"id": "285ba04a-dbe2-4f53-9aa0-fc66ba783e14"
},
"outputs": [],
"source": [
"pca = nlp.ML.feature.PCA(k=DIMENSIONS, inputCol=\"features\", outputCol=\"pca_features\")"
]
},
{
"cell_type": "markdown",
"id": "82f95669-0f8e-491b-b6d8-33f590aa06cf",
"metadata": {
"id": "82f95669-0f8e-491b-b6d8-33f590aa06cf"
},
"source": [
"### Full Spark NLP + Spark MLlib pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c279de78-563f-474c-8d3e-3c72243bb04b",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "c279de78-563f-474c-8d3e-3c72243bb04b",
"outputId": "d66eddb3-ba45-42c3-9ade-0f2047c5884c"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"bert_embeddings_sec_bert_base download started this may take some time.\n",
"Approximate size to download 390.4 MB\n",
"[OK!]\n"
]
}
],
"source": [
"# The whole process runs in a single pipeline\n",
"pipeline = nlp.Pipeline().setStages([generic_pipeline(), embeddings_sentence, embeddings_for_pca, pca])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d46c615-d2b6-408a-99c3-bac0f824dcf4",
"metadata": {
"id": "0d46c615-d2b6-408a-99c3-bac0f824dcf4"
},
"outputs": [],
"source": [
"model = pipeline.fit(sdf)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d52ebe0-4257-4354-817f-5b8f2a363b8e",
"metadata": {
"id": "4d52ebe0-4257-4354-817f-5b8f2a363b8e"
},
"outputs": [],
"source": [
"result = model.transform(sdf)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b852cad-e1e7-48bc-94b0-9da7188a9727",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "7b852cad-e1e7-48bc-94b0-9da7188a9727",
"outputId": "4c71cda6-2527-4a36-e6e3-676905a8dc7c"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------------------------------------------------+----------------+\n",
"|pca_features |label |\n",
"+--------------------------------------------------------------+----------------+\n",
"|[3.39576448119276,-1.060361129782475,-1.568794006399417] |Accounts |\n",
"|[2.3660850756971623,0.8591941003552866,-0.8066168807669747] |Accounts |\n",
"|[0.6867735108170906,1.4823947144210112,0.006591220237646302] |Credit Cards |\n",
"|[-0.28834125177427167,1.0031549697755784,-0.7963810505318434] |Credit Cards |\n",
"|[-0.5037809008469382,-1.3771583372345915,0.4449701036930799] |Credit Reporting|\n",
"|[1.039756950301059,-1.7194174825036457,1.8539366217014026] |Credit Reporting|\n",
"|[2.7731701148109815,1.1680247656394984,1.3949448202984454] |Debt Collection |\n",
"|[-0.45951034017887454,0.833969250052939,0.5051728405912744] |Debt Collection |\n",
"|[0.2703079726928541,1.1069420631113542,-0.4247559623637921] |Loans |\n",
"|[0.8662523064864315,1.1435249671794807,0.8703562689970329] |Loans |\n",
"|[-0.7580966795506656,0.6312432474265479,0.6829074622939197] |Money Transfers |\n",
"|[0.38557719496563764,-1.4420990245260328,-0.19825482305628117]|Money Transfers |\n",
"|[2.45690397730987,0.33025601313067965,1.2981705024775965] |Mortgage |\n",
"|[2.3553279838082126,0.8329467950039564,1.390405767602749] |Mortgage |\n",
"+--------------------------------------------------------------+----------------+\n",
"\n"
]
}
],
"source": [
"result.select('pca_features', 'label').show(truncate=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "adbf7811-b04c-4f63-88e4-10e51341ce0f",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 488
},
"id": "adbf7811-b04c-4f63-88e4-10e51341ce0f",
"outputId": "757e141e-3b40-4033-c351-c30597e762ad"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>pca_features</th><th>label</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>[3.39576448119276, -1.060361129782475, -1.5687...</td><td>Accounts</td></tr>\n",
"    <tr><th>1</th><td>[2.3660850756971623, 0.8591941003552866, -0.80...</td><td>Accounts</td></tr>\n",
"    <tr><th>2</th><td>[0.6867735108170906, 1.4823947144210112, 0.006...</td><td>Credit Cards</td></tr>\n",
"    <tr><th>3</th><td>[-0.28834125177427167, 1.0031549697755784, -0....</td><td>Credit Cards</td></tr>\n",
"    <tr><th>4</th><td>[-0.5037809008469382, -1.3771583372345915, 0.4...</td><td>Credit Reporting</td></tr>\n",
"    <tr><th>5</th><td>[1.039756950301059, -1.7194174825036457, 1.853...</td><td>Credit Reporting</td></tr>\n",
"    <tr><th>6</th><td>[2.7731701148109815, 1.1680247656394984, 1.394...</td><td>Debt Collection</td></tr>\n",
"    <tr><th>7</th><td>[-0.45951034017887454, 0.833969250052939, 0.50...</td><td>Debt Collection</td></tr>\n",
"    <tr><th>8</th><td>[0.2703079726928541, 1.1069420631113542, -0.42...</td><td>Loans</td></tr>\n",
"    <tr><th>9</th><td>[0.8662523064864315, 1.1435249671794807, 0.870...</td><td>Loans</td></tr>\n",
"    <tr><th>10</th><td>[-0.7580966795506656, 0.6312432474265479, 0.68...</td><td>Money Transfers</td></tr>\n",
"    <tr><th>11</th><td>[0.38557719496563764, -1.4420990245260328, -0....</td><td>Money Transfers</td></tr>\n",
"    <tr><th>12</th><td>[2.45690397730987, 0.33025601313067965, 1.2981...</td><td>Mortgage</td></tr>\n",
"    <tr><th>13</th><td>[2.3553279838082126, 0.8329467950039564, 1.390...</td><td>Mortgage</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" pca_features label\n",
"0 [3.39576448119276, -1.060361129782475, -1.5687... Accounts\n",
"1 [2.3660850756971623, 0.8591941003552866, -0.80... Accounts\n",
"2 [0.6867735108170906, 1.4823947144210112, 0.006... Credit Cards\n",
"3 [-0.28834125177427167, 1.0031549697755784, -0.... Credit Cards\n",
"4 [-0.5037809008469382, -1.3771583372345915, 0.4... Credit Reporting\n",
"5 [1.039756950301059, -1.7194174825036457, 1.853... Credit Reporting\n",
"6 [2.7731701148109815, 1.1680247656394984, 1.394... Debt Collection\n",
"7 [-0.45951034017887454, 0.833969250052939, 0.50... Debt Collection\n",
"8 [0.2703079726928541, 1.1069420631113542, -0.42... Loans\n",
"9 [0.8662523064864315, 1.1435249671794807, 0.870... Loans\n",
"10 [-0.7580966795506656, 0.6312432474265479, 0.68... Money Transfers\n",
"11 [0.38557719496563764, -1.4420990245260328, -0.... Money Transfers\n",
"12 [2.45690397730987, 0.33025601313067965, 1.2981... Mortgage\n",
"13 [2.3553279838082126, 0.8329467950039564, 1.390... Mortgage"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = result.select('pca_features', 'label').toPandas()\n",
"\n",
"df\n",
"# As you can see, the three dimension values of each row are packed into a single column"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5c3707de-8ff4-49b7-89c2-85e9616bc337",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 488
},
"id": "5c3707de-8ff4-49b7-89c2-85e9616bc337",
"outputId": "8f3f817f-297b-42f9-c914-13f4d0889180"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>x</th><th>y</th><th>z</th><th>label</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>3.395764</td><td>-1.060361</td><td>-1.568794</td><td>Accounts</td></tr>\n",
"    <tr><th>1</th><td>2.366085</td><td>0.859194</td><td>-0.806617</td><td>Accounts</td></tr>\n",
"    <tr><th>2</th><td>0.686774</td><td>1.482395</td><td>0.006591</td><td>Credit Cards</td></tr>\n",
"    <tr><th>3</th><td>-0.288341</td><td>1.003155</td><td>-0.796381</td><td>Credit Cards</td></tr>\n",
"    <tr><th>4</th><td>-0.503781</td><td>-1.377158</td><td>0.444970</td><td>Credit Reporting</td></tr>\n",
"    <tr><th>5</th><td>1.039757</td><td>-1.719417</td><td>1.853937</td><td>Credit Reporting</td></tr>\n",
"    <tr><th>6</th><td>2.773170</td><td>1.168025</td><td>1.394945</td><td>Debt Collection</td></tr>\n",
"    <tr><th>7</th><td>-0.459510</td><td>0.833969</td><td>0.505173</td><td>Debt Collection</td></tr>\n",
"    <tr><th>8</th><td>0.270308</td><td>1.106942</td><td>-0.424756</td><td>Loans</td></tr>\n",
"    <tr><th>9</th><td>0.866252</td><td>1.143525</td><td>0.870356</td><td>Loans</td></tr>\n",
"    <tr><th>10</th><td>-0.758097</td><td>0.631243</td><td>0.682907</td><td>Money Transfers</td></tr>\n",
"    <tr><th>11</th><td>0.385577</td><td>-1.442099</td><td>-0.198255</td><td>Money Transfers</td></tr>\n",
"    <tr><th>12</th><td>2.456904</td><td>0.330256</td><td>1.298171</td><td>Mortgage</td></tr>\n",
"    <tr><th>13</th><td>2.355328</td><td>0.832947</td><td>1.390406</td><td>Mortgage</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" x y z label\n",
"0 3.395764 -1.060361 -1.568794 Accounts\n",
"1 2.366085 0.859194 -0.806617 Accounts\n",
"2 0.686774 1.482395 0.006591 Credit Cards\n",
"3 -0.288341 1.003155 -0.796381 Credit Cards\n",
"4 -0.503781 -1.377158 0.444970 Credit Reporting\n",
"5 1.039757 -1.719417 1.853937 Credit Reporting\n",
"6 2.773170 1.168025 1.394945 Debt Collection\n",
"7 -0.459510 0.833969 0.505173 Debt Collection\n",
"8 0.270308 1.106942 -0.424756 Loans\n",
"9 0.866252 1.143525 0.870356 Loans\n",
"10 -0.758097 0.631243 0.682907 Money Transfers\n",
"11 0.385577 -1.442099 -0.198255 Money Transfers\n",
"12 2.456904 0.330256 1.298171 Mortgage\n",
"13 2.355328 0.832947 1.390406 Mortgage"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Extract the dimension values from the list into separate columns\n",
"\n",
"df[\"x\"] = df[\"pca_features\"].apply(lambda x: x[0])\n",
"\n",
"df[\"y\"] = df[\"pca_features\"].apply(lambda x: x[1])\n",
"\n",
"df[\"z\"] = df[\"pca_features\"].apply(lambda x: x[2])\n",
"\n",
"df = df[[\"x\", \"y\", \"z\", \"label\"]]\n",
"\n",
"df"
]
},
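{
"cell_type": "markdown",
"id": "expand-list-column-sketch",
"metadata": {
"id": "expand-list-column-sketch"
},
"source": [
"The three `apply` calls above can also be written as a single expansion of the list column. The sketch below shows this on a small made-up frame (`toy` and its values are hypothetical). It assumes each cell holds a plain Python list; with Spark vectors you would probably need to convert each value first, for example with `.toArray()`.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical frame shaped like the one above: one 3-element list per row\n",
"toy = pd.DataFrame({\n",
"    'pca_features': [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]],\n",
"    'label': ['Accounts', 'Loans'],\n",
"})\n",
"\n",
"# Build the three coordinate columns in a single step\n",
"coords = pd.DataFrame(toy['pca_features'].tolist(), columns=['x', 'y', 'z'], index=toy.index)\n",
"toy_xyz = pd.concat([coords, toy['label']], axis=1)\n",
"print(toy_xyz.columns.tolist())  # ['x', 'y', 'z', 'label']\n",
"```\n",
"\n",
"Both approaches give the same columns; the single expansion avoids scanning the column three times."
]
},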
{
"cell_type": "code",
"execution_count": null,
"id": "7281d091-9356-41a7-83e7-6b5ee541a09c",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 617
},
"id": "7281d091-9356-41a7-83e7-6b5ee541a09c",
"outputId": "3972e1b8-6144-46aa-85fc-2afc12dad49b",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<p>[Interactive Plotly 3D scatter plot of the sentence embeddings: axes x, y, z; points colored by label]</p>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import plotly.express as px\n",
"\n",
"fig = px.scatter_3d(df, x = 'x', y = 'y', z = 'z', color = 'label', width=800, height=600)\n",
"\n",
"fig.show()"
]
},
{
"cell_type": "markdown",
"id": "602bc911-284b-4e0e-8882-99ac444fbb51",
"metadata": {
"id": "602bc911-284b-4e0e-8882-99ac444fbb51"
},
"source": [
"### Word Embeddings"
]
},
{
"cell_type": "markdown",
"id": "11bfbc63-cb97-4189-8d6c-6629bbc898e8",
"metadata": {
"id": "11bfbc63-cb97-4189-8d6c-6629bbc898e8"
},
"source": [
"We can also visualize the semantics of words, instead of full texts, by using Word Embeddings. We use a Tokenizer and a word embeddings model to get those embeddings, and then apply PCA as before. First, we split the pipeline in two so that we can collect all the token embeddings."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "28f0cced-e804-4415-aa57-6ae05587adf2",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "28f0cced-e804-4415-aa57-6ae05587adf2",
"outputId": "bbfd73e0-624c-407d-cb4d-8c1f46a1fe8f"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"bert_embeddings_sec_bert_base download started this may take some time.\n",
"Approximate size to download 390.4 MB\n",
"[OK!]\n"
]
}
],
"source": [
"model = generic_pipeline().fit(sdf)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "31090faa-0bbe-4e94-8384-e38a338e1989",
"metadata": {
"id": "31090faa-0bbe-4e94-8384-e38a338e1989"
},
"outputs": [],
"source": [
"result = model.transform(sdf)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1815b79d-81c5-45ec-ac80-418e31f9df8e",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1815b79d-81c5-45ec-ac80-418e31f9df8e",
"outputId": "4f50264e-2aad-4fc5-eb4e-175b3b9d4f8a",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----------+--------+--------------------------------------------------------------------------------+\n",
"| token| label| embeddings|\n",
"+----------+--------+--------------------------------------------------------------------------------+\n",
"| I|Accounts|[-0.29679197, 0.80952483, 0.026026089, 0.08434192, 0.7434629, -0.02694758, -0...|\n",
"| called|Accounts|[0.28905854, -0.29229686, -0.42990392, -0.3833449, 0.026178285, -0.12728442, ...|\n",
"|Huntington|Accounts|[0.20684586, -0.010130149, -0.259025, -0.37558293, 0.45792142, 0.3114912, -0....|\n",
"| Bank|Accounts|[-0.034710683, 0.46047488, -0.6221113, -0.011169381, 0.2938512, 0.31341088, -...|\n",
"| to|Accounts|[-0.40457863, -0.3768647, -0.08015404, -0.58909655, -0.33856544, -0.39321256,...|\n",
"| close|Accounts|[0.35089388, 0.9568475, 0.86328286, -0.4334402, 0.11386797, -0.48837784, -0.8...|\n",
"| my|Accounts|[-0.36591864, 0.2655603, -0.32495034, -0.5081896, -0.39623818, -0.63347244, -...|\n",
"| account|Accounts|[0.004639961, 0.5340125, 0.77567977, 0.23316649, -0.4303767, -0.2937901, -0.5...|\n",
"| ,|Accounts|[0.17874305, -0.026907753, 0.19498396, -0.7929611, -0.26044437, -0.3964327, -...|\n",
"| and|Accounts|[0.5011346, 0.6637548, 0.15587743, -0.79522926, -0.8198417, -0.24028614, -0.6...|\n",
"| they|Accounts|[-0.2188998, 0.17353022, -0.3897713, -0.4219988, -0.66089946, -0.6682683, -0....|\n",
"| refused|Accounts|[-0.71534324, 0.4092898, -0.58240926, 0.2768947, -0.7440806, -0.016842518, -0...|\n",
"| to|Accounts|[-0.062417023, -0.30230471, 0.17689183, -0.36983997, 0.22308639, -0.20912732,...|\n",
"| close|Accounts|[0.49901608, 0.93363476, 0.89050376, -0.20053658, 0.47381917, -0.24397722, -0...|\n",
"| my|Accounts|[-0.11864859, 0.068643466, -0.47048938, -0.33866596, -0.1448204, -0.59992373,...|\n",
"| account|Accounts|[-0.045060933, 0.55244875, 0.9458424, 0.3263075, -0.26439214, -0.14597315, -0...|\n",
"| over|Accounts|[0.19230467, -0.47188944, 0.33582675, 0.008950032, 0.3479425, 0.107840315, -0...|\n",
"| the|Accounts|[-0.44176129, -0.17911726, -0.9623183, 0.09716578, 0.19224198, 0.1584882, 0.5...|\n",
"| phone|Accounts|[-0.44973916, -0.9114662, -0.06911273, -0.18094938, 0.10837507, -0.8229777, -...|\n",
"| .|Accounts|[0.0502675, 0.32013232, 0.22356117, -0.6540274, 0.48769465, -0.81690645, -0.6...|\n",
"+----------+--------+--------------------------------------------------------------------------------+\n",
"only showing top 20 rows\n",
"\n"
]
}
],
"source": [
"# Pair each token with its embedding vector, then explode to one row per token\n",
"result_df = result.select(\"label\", F.explode(F.arrays_zip(result.token.result, result.word_embeddings.embeddings)).alias(\"cols\"))\\\n",
"    .select(F.expr(\"cols['0']\").alias(\"token\"),\n",
"            \"label\",\n",
"            F.expr(\"cols['1']\").alias(\"embeddings\"))\n",
"\n",
"result_df.show(truncate = 80)"
]
},
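{
"cell_type": "markdown",
"id": "zip-explode-sketch",
"metadata": {
"id": "zip-explode-sketch"
},
"source": [
"The `arrays_zip` plus `explode` combination above can be pictured in plain Python: zip pairs the i-th token with the i-th embedding, and flattening yields one record per token. The rows below are made-up stand-ins with 2-dimensional embeddings; this is only an analogy for what the Spark expression does, not how Spark executes it.\n",
"\n",
"```python\n",
"# Hypothetical annotated rows: parallel lists of tokens and their embeddings\n",
"rows = [\n",
"    {'label': 'Accounts', 'tokens': ['close', 'account'], 'embeddings': [[0.1, 0.2], [0.3, 0.4]]},\n",
"    {'label': 'Loans', 'tokens': ['loan'], 'embeddings': [[0.5, 0.6]]},\n",
"]\n",
"\n",
"# zip pairs each token with its embedding (like arrays_zip);\n",
"# the nested loop emits one record per pair (like explode)\n",
"exploded = [\n",
"    (token, row['label'], embedding)\n",
"    for row in rows\n",
"    for token, embedding in zip(row['tokens'], row['embeddings'])\n",
"]\n",
"\n",
"for record in exploded:\n",
"    print(record)\n",
"```\n",
"\n",
"Each record keeps the label of the text it came from, which is what later lets the plot color every token by its category."
]
},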
{
"cell_type": "code",
"execution_count": null,
"id": "6186b5d6-fa75-42db-b5f6-febbe0f6c957",
"metadata": {
"id": "6186b5d6-fa75-42db-b5f6-febbe0f6c957"
},
"outputs": [],
"source": [
"# This class inherits from the EmbeddingsUDF class defined earlier\n",
"class WordEmbeddingsUDF(EmbeddingsUDF):\n",
"    def _transform(self, dataset):\n",
"\n",
"        # No explode is needed here because the embeddings column is already exploded\n",
"        results = dataset.select('token', 'label', 'embeddings')\n",
"\n",
"        results = results.withColumn(\n",
"            \"features\",\n",
"            self.udfs['convertToVectorUDF'](F.col(\"embeddings\"))\n",
"        )\n",
"        results = results.withColumn(\n",
"            \"emb_sum\",\n",
"            self.udfs['sumUDF'](F.col(\"embeddings\"))\n",
"        )\n",
"        # Drop rows whose embedding values sum to zero, a proxy for all-zero\n",
"        # embeddings (for which cosine distance cannot be calculated)\n",
"        results = results.where(F.col(\"emb_sum\") != 0.0)\n",
"\n",
"        return results"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e416012e-ba80-4ca8-98de-aa675660ecb1",
"metadata": {
"id": "e416012e-ba80-4ca8-98de-aa675660ecb1"
},
"outputs": [],
"source": [
"embeddings_for_pca = WordEmbeddingsUDF()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0cca3d1f-0261-42de-96f8-f68f7fe6210a",
"metadata": {
"id": "0cca3d1f-0261-42de-96f8-f68f7fe6210a"
},
"outputs": [],
"source": [
"DIMENSIONS = 3"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f9e30a2-cc23-463f-87d1-4846c7a321e7",
"metadata": {
"id": "2f9e30a2-cc23-463f-87d1-4846c7a321e7"
},
"outputs": [],
"source": [
"pca = nlp.ML.feature.PCA(k=DIMENSIONS, inputCol=\"features\", outputCol=\"pca_features\")"
]
},
{
"cell_type": "markdown",
"id": "4ec00af2-8e16-4473-8b43-4f7cff630a3d",
"metadata": {
"id": "4ec00af2-8e16-4473-8b43-4f7cff630a3d"
},
"source": [
"### Full Spark NLP + Spark MLlib pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "69b8590b-2bf4-45b9-924d-d7438b87f08a",
"metadata": {
"id": "69b8590b-2bf4-45b9-924d-d7438b87f08a"
},
"outputs": [],
"source": [
"# Run the second part of the pipeline: the 768 dimensions are reduced to 3\n",
"\n",
"pipeline = nlp.Pipeline().setStages([embeddings_for_pca, pca])\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8207c543-9fb0-4187-a359-61abad891934",
"metadata": {
"id": "8207c543-9fb0-4187-a359-61abad891934",
"tags": []
},
"outputs": [],
"source": [
"model = pipeline.fit(result_df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cf674445-20d2-4ee9-8409-6a3782794aae",
"metadata": {
"id": "cf674445-20d2-4ee9-8409-6a3782794aae"
},
"outputs": [],
"source": [
"result = model.transform(result_df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9093ca0e-e46e-40e8-8958-c47702fc4a43",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "9093ca0e-e46e-40e8-8958-c47702fc4a43",
"outputId": "a697bed3-95af-4997-ba5b-b5c5d894af25",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+----------+--------+------------------------------------------------------------+\n",
"| token| label| pca_features|\n",
"+----------+--------+------------------------------------------------------------+\n",
"| I|Accounts| [9.850468172808704,0.02182025684995559,1.7128883074588641]|\n",
"| called|Accounts| [0.5703260311955864,0.346658149631252,-2.867726751670609]|\n",
"|Huntington|Accounts| [8.635450770647445,0.8802312004740499,-0.8417105564124523]|\n",
"| Bank|Accounts| [9.391061503515894,0.45066516018168057,-1.2157436459087525]|\n",
"| to|Accounts| [-2.093784358504493,-1.1261933945050695,4.473374538741789]|\n",
"| close|Accounts| [-2.897764751048121,-0.1633032944974737,2.6316552582800594]|\n",
"| my|Accounts| [3.542237747747922,-2.721495573008954,2.847896218683586]|\n",
"| account|Accounts|[-1.2533257167247633,0.006480340909400874,1.9023215773218...|\n",
"| ,|Accounts| [-1.371343619695057,0.16043397738672746,2.236148062116737]|\n",
"| and|Accounts| [0.2574783722223581,-0.39882523377542006,4.898577649457495]|\n",
"| they|Accounts| [2.649181792582909,-2.0965602813943836,3.0047699978661027]|\n",
"| refused|Accounts| [-1.447842544994814,-3.120728385057716,1.623718089120733]|\n",
"| to|Accounts| [-3.476136836992586,-0.955126757467589,5.927975835938944]|\n",
"| close|Accounts| [-3.22033000871663,-0.09380183797818464,2.502218213385302]|\n",
"| my|Accounts| [3.0773967959997126,-2.732351171853666,3.4638980333557523]|\n",
"| account|Accounts|[-1.7590956965611935,0.049200510343874765,2.0855752458205...|\n",
"| over|Accounts| [-0.7852937839017823,-1.0837250583596254,2.77524528481479]|\n",
"| the|Accounts| [-1.9152538017789913,-0.5845586090479645,5.449708419677918]|\n",
"| phone|Accounts| [-3.7960455527563335,-2.527243794119921,1.563631667500876]|\n",
"| .|Accounts|[-0.04212375650031434,-0.5960727710945473,0.4870793244043...|\n",
"+----------+--------+------------------------------------------------------------+\n",
"only showing top 20 rows\n",
"\n"
]
}
],
"source": [
"result.select(\"token\", \"label\", \"pca_features\").show(truncate = 60)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2299c9bc-8e8e-4c49-ad29-133df31da543",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"id": "2299c9bc-8e8e-4c49-ad29-133df31da543",
"outputId": "16b3d720-7c81-4010-8e17-447d492f8330"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>token</th><th>label</th><th>pca_features</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>I</td><td>Accounts</td><td>[9.850468172808704, 0.02182025684995559, 1.712...</td></tr>\n",
"    <tr><th>1</th><td>called</td><td>Accounts</td><td>[0.5703260311955864, 0.346658149631252, -2.867...</td></tr>\n",
"    <tr><th>2</th><td>Huntington</td><td>Accounts</td><td>[8.635450770647445, 0.8802312004740499, -0.841...</td></tr>\n",
"    <tr><th>3</th><td>Bank</td><td>Accounts</td><td>[9.391061503515894, 0.45066516018168057, -1.21...</td></tr>\n",
"    <tr><th>4</th><td>to</td><td>Accounts</td><td>[-2.093784358504493, -1.1261933945050695, 4.47...</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>1364</th><td>the</td><td>Mortgage</td><td>[0.20783178705004846, 1.2121685298369587, 2.34...</td></tr>\n",
"    <tr><th>1365</th><td>company</td><td>Mortgage</td><td>[0.9758784877952482, 1.1525640123640015, 1.548...</td></tr>\n",
"    <tr><th>1366</th><td>never</td><td>Mortgage</td><td>[-0.009449827591906173, -1.360506257943843, -0...</td></tr>\n",
"    <tr><th>1367</th><td>responds</td><td>Mortgage</td><td>[-1.3105360623344586, -0.3952000653886483, -1....</td></tr>\n",
"    <tr><th>1368</th><td>.</td><td>Mortgage</td><td>[1.732371824614684, -14.254692680656397, -4.51...</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>1369 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" token label pca_features\n",
"0 I Accounts [9.850468172808704, 0.02182025684995559, 1.712...\n",
"1 called Accounts [0.5703260311955864, 0.346658149631252, -2.867...\n",
"2 Huntington Accounts [8.635450770647445, 0.8802312004740499, -0.841...\n",
"3 Bank Accounts [9.391061503515894, 0.45066516018168057, -1.21...\n",
"4 to Accounts [-2.093784358504493, -1.1261933945050695, 4.47...\n",
"... ... ... ...\n",
"1364 the Mortgage [0.20783178705004846, 1.2121685298369587, 2.34...\n",
"1365 company Mortgage [0.9758784877952482, 1.1525640123640015, 1.548...\n",
"1366 never Mortgage [-0.009449827591906173, -1.360506257943843, -0...\n",
"1367 responds Mortgage [-1.3105360623344586, -0.3952000653886483, -1....\n",
"1368 . Mortgage [1.732371824614684, -14.254692680656397, -4.51...\n",
"\n",
"[1369 rows x 3 columns]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Collect the tokens, their labels and the 3-dimensional PCA vectors into a pandas DataFrame\n",
"df = result.select('token', 'label', 'pca_features').toPandas()\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20443a2f-9e6e-4cd2-885e-b42720bcb840",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"id": "20443a2f-9e6e-4cd2-885e-b42720bcb840",
"outputId": "ac0c709a-df3b-4bc7-9275-924524c4a823"
},
"outputs": [
{
"data": {
"text/plain": [
" token label x y z\n",
"0 I Accounts 9.850468 0.021820 1.712888\n",
"1 called Accounts 0.570326 0.346658 -2.867727\n",
"2 Huntington Accounts 8.635451 0.880231 -0.841711\n",
"3 Bank Accounts 9.391062 0.450665 -1.215744\n",
"4 to Accounts -2.093784 -1.126193 4.473375\n",
"... ... ... ... ... ...\n",
"1364 the Mortgage 0.207832 1.212169 2.345686\n",
"1365 company Mortgage 0.975878 1.152564 1.548878\n",
"1366 never Mortgage -0.009450 -1.360506 -0.080957\n",
"1367 responds Mortgage -1.310536 -0.395200 -1.634091\n",
"1368 . Mortgage 1.732372 -14.254693 -4.517188\n",
"\n",
"[1369 rows x 5 columns]"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
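"# Split each 3-dimensional PCA vector into separate x, y and z columns for plotting\n",
"df[\"x\"] = df[\"pca_features\"].apply(lambda v: v[0])\n",
"\n",
"df[\"y\"] = df[\"pca_features\"].apply(lambda v: v[1])\n",
"\n",
"df[\"z\"] = df[\"pca_features\"].apply(lambda v: v[2])\n",
"\n",
"# Keep only the columns needed for the 3D scatter plot\n",
"df = df[[\"token\", \"label\", \"x\", \"y\", \"z\"]]\n",
"\n",
"df"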
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e96d9799-e838-4cdc-903e-dbd5cdc37aa9",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 817
},
"id": "e96d9799-e838-4cdc-903e-dbd5cdc37aa9",
"outputId": "fc9faccb-52b9-4c50-914a-a83f2177fc66",
"tags": []
},
"outputs": [
],
"source": [
"import plotly.express as px\n",
"\n",
"# 3D scatter plot of the PCA-reduced token embeddings, colored by label\n",
"fig = px.scatter_3d(\n",
"    df, x=\"x\", y=\"y\", z=\"z\", color=\"label\",\n",
"    width=1000, height=800, hover_data=[\"token\", \"label\"],\n",
")\n",
"\n",
"fig.show()"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"gpuClass": "standard",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6 (default, Oct 18 2022, 12:41:40) \n[Clang 14.0.0 (clang-1400.0.29.202)]"
},
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}