{ "cells": [ { "cell_type": "markdown", "id": "e9KViDTWeMEE", "metadata": { "id": "e9KViDTWeMEE" }, "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ] }, { "cell_type": "markdown", "id": "kcSuGqjTeOAy", "metadata": { "id": "kcSuGqjTeOAy" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/03.Word_Sentence_Embeddings.ipynb)" ] }, { "cell_type": "markdown", "id": "e35e5153", "metadata": { "id": "e35e5153" }, "source": [ "# Financial Word and Sentence Embeddings" ] }, { "cell_type": "markdown", "id": "69a14890-bc41-4765-9557-89a969c04d8f", "metadata": { "id": "69a14890-bc41-4765-9557-89a969c04d8f" }, "source": [ "# Finance Word and Sentence Embeddings visualization using PCA (Principal Component Analysis)" ] }, { "cell_type": "markdown", "id": "5f982ae9-0570-4f90-bd30-c9d55d219e5b", "metadata": { "id": "5f982ae9-0570-4f90-bd30-c9d55d219e5b" }, "source": [ "Modern NLP models work with numerical representations of texts and their meaning. Token classification problems (inferring a class for each token, as in Named Entity Recognition) require Word Embeddings. For sentence, paragraph, or document classification we use Sentence Embeddings.\n", "\n", "In this notebook, we obtain token embeddings with the Spark NLP Finance Word Embeddings model (**bert_embeddings_sec_bert_base**) and then pool them into sentence embeddings with the Spark NLP `SentenceEmbeddings` annotator, which gives us numerical representations of the semantics of the texts. Each embedding is a 768-dimensional vector, which is impossible to inspect by eye.\n", "\n", "There are many techniques we can use to visualize those embeddings. We use one of them, Principal Component Analysis (PCA), a dimensionality reduction technique, carried out by Spark MLLib. 
Both embeddings have 768 dimensions, so we will reduce them from **768** to **3** (X, Y, Z) and use color to show the label of each word / sentence in the legend." ] }, { "cell_type": "markdown", "id": "d07e14ba-b319-4d4a-b38d-5ed93debc5e6", "metadata": { "id": "d07e14ba-b319-4d4a-b38d-5ed93debc5e6" }, "source": [ "# Installation" ] }, { "cell_type": "code", "execution_count": null, "id": "nRcDPOsDqhkN", "metadata": { "id": "nRcDPOsDqhkN" }, "outputs": [], "source": [ "! pip install johnsnowlabs" ] }, { "cell_type": "markdown", "id": "YPsbAnNoPt0Z", "metadata": { "id": "YPsbAnNoPt0Z" }, "source": [ "## Automatic Installation\n", "Using my.johnsnowlabs.com SSO" ] }, { "cell_type": "code", "execution_count": null, "id": "_L-7mLYp3pqr", "metadata": { "id": "_L-7mLYp3pqr", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "from johnsnowlabs import nlp, finance\n", "\n", "# nlp.install(force_browser=True)" ] }, { "cell_type": "markdown", "id": "hsJvn_WWM2GL", "metadata": { "id": "hsJvn_WWM2GL" }, "source": [ "## Manual downloading\n", "If you are not registered on my.johnsnowlabs.com, you received a license via email, or you are using Safari, you may need to upload the license manually.\n", "\n", "- Go to my.johnsnowlabs.com\n", "- Download your license\n", "- Upload it using the following command" ] }, { "cell_type": "code", "execution_count": null, "id": "i57QV3-_P2sQ", "metadata": { "id": "i57QV3-_P2sQ" }, "outputs": [], "source": [ "from google.colab import files\n", "print('Please Upload your John Snow Labs License using the button below')\n", "license_keys = files.upload()" ] }, { "cell_type": "markdown", "id": "xGgNdFzZP_hQ", "metadata": { "id": "xGgNdFzZP_hQ" }, "source": [ "- Install it" ] }, { "cell_type": "code", "execution_count": null, "id": "OfmmPqknP4rR", "metadata": { "id": "OfmmPqknP4rR" }, "outputs": [], "source": [ "nlp.install()" ] }, { "cell_type": "markdown", "id": "DCl5ErZkNNLk", "metadata": { "id": "DCl5ErZkNNLk" }, "source": [ 
"# Starting" ] }, { "cell_type": "code", "execution_count": null, "id": "x3jVICoa3pqr", "metadata": { "id": "x3jVICoa3pqr" }, "outputs": [], "source": [ "spark = nlp.start()" ] }, { "cell_type": "markdown", "id": "4b7c011d-eac8-4eda-b272-027bab6a8895", "metadata": { "id": "4b7c011d-eac8-4eda-b272-027bab6a8895" }, "source": [ "# Get sample text" ] }, { "cell_type": "code", "execution_count": null, "id": "XvEU36EbfUZH", "metadata": { "id": "XvEU36EbfUZH" }, "outputs": [], "source": [ "! pip install -q plotly\n", "\n", "# Downloading sample datasets.\n", "! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/finance_pca_samples.csv" ] }, { "cell_type": "code", "execution_count": null, "id": "5f411a9b-9744-4f77-bdb9-58702260b5b0", "metadata": { "id": "5f411a9b-9744-4f77-bdb9-58702260b5b0" }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv(\"finance_pca_samples.csv\")" ] }, { "cell_type": "code", "execution_count": null, "id": "3977898f-24e8-43d6-a338-80f3bf8ad2a1", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3977898f-24e8-43d6-a338-80f3bf8ad2a1", "outputId": "5f8679a4-78ab-4afa-da11-06290fd7abcb", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+----------------+\n", "|                text|           label|\n", "+--------------------+----------------+\n", "|I called Huntingt...|        Accounts|\n", "|I opened an citi ...|        Accounts|\n", "|I have been a lon...|    Credit Cards|\n", "|My credit limit w...|    Credit Cards|\n", "|I am filing this ...|Credit Reporting|\n", "|The Credit Bureau...|Credit Reporting|\n", "|I noticed an arti...| Debt Collection|\n", "|A bank account wa...| Debt Collection|\n", "|I was contacted v...|           Loans|\n", "|My husband recent...|           Loans|\n", "|I wire transfered...| Money Transfers|\n", "|PayPal holds fund...| Money Transfers|\n", "|We have requested...|        Mortgage|\n", "|I filled out a co...|        Mortgage|\n", 
"+--------------------+----------------+\n", "\n" ] } ], "source": [ "# Create spark dataframe\n", "sdf = spark.createDataFrame(df)\n", "sdf.show()" ] }, { "cell_type": "markdown", "id": "5de27fdb-3fc1-4d58-8505-4f02238635b8", "metadata": { "id": "5de27fdb-3fc1-4d58-8505-4f02238635b8" }, "source": [ "# Pipeline with Spark NLP and Spark MLLib" ] }, { "cell_type": "code", "execution_count": null, "id": "YwQjfHcM_a4q", "metadata": { "id": "YwQjfHcM_a4q" }, "outputs": [], "source": [ "# We define a generic pipeline shared by the word and sentence embeddings\n", "\n", "def generic_pipeline():\n", "    document_assembler = nlp.DocumentAssembler()\\\n", "        .setInputCol(\"text\")\\\n", "        .setOutputCol(\"document\")\n", "\n", "    tokenizer = nlp.Tokenizer()\\\n", "        .setInputCols(\"document\")\\\n", "        .setOutputCol(\"token\")\n", "\n", "    word_embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\")\\\n", "        .setInputCols([\"document\", \"token\"])\\\n", "        .setOutputCol(\"word_embeddings\")\n", "\n", "    pipeline = nlp.Pipeline(stages = [\n", "        document_assembler,\n", "        tokenizer,\n", "        word_embeddings\n", "    ])\n", "\n", "    return pipeline\n", "\n" ] }, { "cell_type": "markdown", "id": "d751fd45-9810-4b43-b893-0d373e8d4870", "metadata": { "id": "d751fd45-9810-4b43-b893-0d373e8d4870" }, "source": [ "## Sentence Embeddings" ] }, { "cell_type": "code", "execution_count": null, "id": "b990cfd8-576f-4057-a548-0434d39897bd", "metadata": { "id": "b990cfd8-576f-4057-a548-0434d39897bd", "tags": [] }, "outputs": [], "source": [ "embeddings_sentence = nlp.SentenceEmbeddings()\\\n", "    .setInputCols([\"document\", \"word_embeddings\"])\\\n", "    .setOutputCol(\"sentence_embeddings\")\\\n", "    .setPoolingStrategy(\"AVERAGE\")\n", "# We use the Spark NLP SentenceEmbeddings annotator to average the token embeddings into one embedding per sentence" ] }, { "cell_type": "markdown", "id": "fde3b08f-6187-43fd-8d7b-3fe6da054c18", "metadata": { "id": "fde3b08f-6187-43fd-8d7b-3fe6da054c18" }, "source": [ "# 
Custom transform to retrieve the numerical embeddings from Spark NLP and pass them to Spark MLLib" ] }, { "cell_type": "code", "execution_count": null, "id": "Qe4s0BQxkMPU", "metadata": { "id": "Qe4s0BQxkMPU" }, "outputs": [], "source": [ "from pyspark.sql import DataFrame\n", "import pyspark.sql.functions as F\n", "import pyspark.sql.types as T\n", "import pyspark.sql as SQL\n", "from pyspark import keyword_only" ] }, { "cell_type": "code", "execution_count": null, "id": "30cfff3f-487b-4808-8b32-f54fc443777b", "metadata": { "id": "30cfff3f-487b-4808-8b32-f54fc443777b" }, "outputs": [], "source": [ "# This class extracts the embeddings from the Spark NLP Annotation object\n", "# from pyspark import ml as ML\n", "class EmbeddingsUDF(\n", "    nlp.Transformer, nlp.ML.param.shared.HasInputCol, nlp.ML.param.shared.HasOutputCol,\n", "    nlp.ML.util.DefaultParamsReadable, nlp.ML.util.DefaultParamsWritable\n", "):\n", "    @keyword_only\n", "    def __init__(self):\n", "        super(EmbeddingsUDF, self).__init__()\n", "\n", "        def _sum(r):\n", "            result = 0.0\n", "            for e in r:\n", "                result += e\n", "            return result\n", "\n", "        self.udfs = {\n", "            'convertToVectorUDF': F.udf(lambda vs: nlp.ML.linalg.Vectors.dense(vs), nlp.ML.linalg.VectorUDT()),\n", "            'sumUDF': F.udf(lambda r: _sum(r), T.FloatType())\n", "        }\n", "\n", "    def _transform(self, dataset):\n", "\n", "        results = dataset.select(\n", "            \"*\", F.explode(\"sentence_embeddings.embeddings\").alias(\"embeddings\")\n", "        )\n", "        results = results.withColumn(\n", "            \"features\",\n", "            self.udfs['convertToVectorUDF'](F.col(\"embeddings\"))\n", "        )\n", "        results = results.withColumn(\n", "            \"emb_sum\",\n", "            self.udfs['sumUDF'](F.col(\"embeddings\"))\n", "        )\n", "        # Drop rows whose embedding components sum to zero (a proxy for all-zero vectors, so we can calculate cosine distance)\n", "        results = results.where(F.col(\"emb_sum\")!=0.0)\n", "\n", "        return results" ] }, { "cell_type": "code", "execution_count": null, "id": "5f82e139-1cc3-423d-9d53-c0ccfec835c2", "metadata": { "id": 
"5f82e139-1cc3-423d-9d53-c0ccfec835c2" }, "outputs": [], "source": [ "embeddings_for_pca = EmbeddingsUDF()" ] }, { "cell_type": "code", "execution_count": null, "id": "5213dfcf-61bc-4db2-a8c7-797a318a13f4", "metadata": { "id": "5213dfcf-61bc-4db2-a8c7-797a318a13f4" }, "outputs": [], "source": [ "DIMENSIONS = 3" ] }, { "cell_type": "code", "execution_count": null, "id": "285ba04a-dbe2-4f53-9aa0-fc66ba783e14", "metadata": { "id": "285ba04a-dbe2-4f53-9aa0-fc66ba783e14" }, "outputs": [], "source": [ "pca = nlp.ML.feature.PCA(k=DIMENSIONS, inputCol=\"features\", outputCol=\"pca_features\")" ] }, { "cell_type": "markdown", "id": "82f95669-0f8e-491b-b6d8-33f590aa06cf", "metadata": { "id": "82f95669-0f8e-491b-b6d8-33f590aa06cf" }, "source": [ "### Full Spark NLP + Spark MLLib pipeline" ] }, { "cell_type": "code", "execution_count": null, "id": "c279de78-563f-474c-8d3e-3c72243bb04b", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "c279de78-563f-474c-8d3e-3c72243bb04b", "outputId": "d66eddb3-ba45-42c3-9ade-0f2047c5884c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bert_embeddings_sec_bert_base download started this may take some time.\n", "Approximate size to download 390.4 MB\n", "[OK!]\n" ] } ], "source": [ "# We run the whole process in a single pipeline\n", "pipeline = nlp.Pipeline().setStages([generic_pipeline(), embeddings_sentence, embeddings_for_pca, pca])" ] }, { "cell_type": "code", "execution_count": null, "id": "0d46c615-d2b6-408a-99c3-bac0f824dcf4", "metadata": { "id": "0d46c615-d2b6-408a-99c3-bac0f824dcf4" }, "outputs": [], "source": [ "model = pipeline.fit(sdf)" ] }, { "cell_type": "code", "execution_count": null, "id": "4d52ebe0-4257-4354-817f-5b8f2a363b8e", "metadata": { "id": "4d52ebe0-4257-4354-817f-5b8f2a363b8e" }, "outputs": [], "source": [ "result = model.transform(sdf)" ] }, { "cell_type": "code", "execution_count": null, "id": "7b852cad-e1e7-48bc-94b0-9da7188a9727", "metadata": { "colab": { "base_uri": 
"https://localhost:8080/" }, "id": "7b852cad-e1e7-48bc-94b0-9da7188a9727", "outputId": "4c71cda6-2527-4a36-e6e3-676905a8dc7c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------------------------------------------------+----------------+\n", "|pca_features |label |\n", "+--------------------------------------------------------------+----------------+\n", "|[3.39576448119276,-1.060361129782475,-1.568794006399417] |Accounts |\n", "|[2.3660850756971623,0.8591941003552866,-0.8066168807669747] |Accounts |\n", "|[0.6867735108170906,1.4823947144210112,0.006591220237646302] |Credit Cards |\n", "|[-0.28834125177427167,1.0031549697755784,-0.7963810505318434] |Credit Cards |\n", "|[-0.5037809008469382,-1.3771583372345915,0.4449701036930799] |Credit Reporting|\n", "|[1.039756950301059,-1.7194174825036457,1.8539366217014026] |Credit Reporting|\n", "|[2.7731701148109815,1.1680247656394984,1.3949448202984454] |Debt Collection |\n", "|[-0.45951034017887454,0.833969250052939,0.5051728405912744] |Debt Collection |\n", "|[0.2703079726928541,1.1069420631113542,-0.4247559623637921] |Loans |\n", "|[0.8662523064864315,1.1435249671794807,0.8703562689970329] |Loans |\n", "|[-0.7580966795506656,0.6312432474265479,0.6829074622939197] |Money Transfers |\n", "|[0.38557719496563764,-1.4420990245260328,-0.19825482305628117]|Money Transfers |\n", "|[2.45690397730987,0.33025601313067965,1.2981705024775965] |Mortgage |\n", "|[2.3553279838082126,0.8329467950039564,1.390405767602749] |Mortgage |\n", "+--------------------------------------------------------------+----------------+\n", "\n" ] } ], "source": [ "result.select('pca_features', 'label').show(truncate=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "adbf7811-b04c-4f63-88e4-10e51341ce0f", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 488 }, "id": "adbf7811-b04c-4f63-88e4-10e51341ce0f", "outputId": "757e141e-3b40-4033-c351-c30597e762ad" }, "outputs": [ { 
"data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " pca_features label\n", "0 [3.39576448119276, -1.060361129782475, -1.5687... Accounts\n", "1 [2.3660850756971623, 0.8591941003552866, -0.80... Accounts\n", "2 [0.6867735108170906, 1.4823947144210112, 0.006... Credit Cards\n", "3 [-0.28834125177427167, 1.0031549697755784, -0.... Credit Cards\n", "4 [-0.5037809008469382, -1.3771583372345915, 0.4... Credit Reporting\n", "5 [1.039756950301059, -1.7194174825036457, 1.853... Credit Reporting\n", "6 [2.7731701148109815, 1.1680247656394984, 1.394... Debt Collection\n", "7 [-0.45951034017887454, 0.833969250052939, 0.50... Debt Collection\n", "8 [0.2703079726928541, 1.1069420631113542, -0.42... Loans\n", "9 [0.8662523064864315, 1.1435249671794807, 0.870... Loans\n", "10 [-0.7580966795506656, 0.6312432474265479, 0.68... Money Transfers\n", "11 [0.38557719496563764, -1.4420990245260328, -0.... Money Transfers\n", "12 [2.45690397730987, 0.33025601313067965, 1.2981... Mortgage\n", "13 [2.3553279838082126, 0.8329467950039564, 1.390... Mortgage" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = result.select('pca_features', 'label').toPandas()\n", "\n", "df\n", "# As you see, dimension values are inside a list" ] }, { "cell_type": "code", "execution_count": null, "id": "5c3707de-8ff4-49b7-89c2-85e9616bc337", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 488 }, "id": "5c3707de-8ff4-49b7-89c2-85e9616bc337", "outputId": "8f3f817f-297b-42f9-c914-13f4d0889180" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " x y z label\n", "0 3.395764 -1.060361 -1.568794 Accounts\n", "1 2.366085 0.859194 -0.806617 Accounts\n", "2 0.686774 1.482395 0.006591 Credit Cards\n", "3 -0.288341 1.003155 -0.796381 Credit Cards\n", "4 -0.503781 -1.377158 0.444970 Credit Reporting\n", "5 1.039757 -1.719417 1.853937 Credit Reporting\n", "6 2.773170 1.168025 1.394945 Debt Collection\n", "7 -0.459510 0.833969 0.505173 Debt Collection\n", "8 0.270308 1.106942 -0.424756 Loans\n", "9 0.866252 1.143525 0.870356 Loans\n", "10 -0.758097 0.631243 0.682907 Money Transfers\n", "11 0.385577 -1.442099 -0.198255 Money Transfers\n", "12 2.456904 0.330256 1.298171 Mortgage\n", "13 2.355328 0.832947 1.390406 Mortgage" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We extract the dimension values out of the list into separate x, y, z columns\n", "\n", "df[\"x\"] = df[\"pca_features\"].apply(lambda x: x[0])\n", "\n", "df[\"y\"] = df[\"pca_features\"].apply(lambda x: x[1])\n", "\n", "df[\"z\"] = df[\"pca_features\"].apply(lambda x: x[2])\n", "\n", "df = df[[\"x\", \"y\", \"z\", \"label\"]]\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "7281d091-9356-41a7-83e7-6b5ee541a09c", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 617 }, "id": "7281d091-9356-41a7-83e7-6b5ee541a09c", "outputId": "3972e1b8-6144-46aa-85fc-2afc12dad49b", "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import plotly.express as px\n", "\n", "fig = px.scatter_3d(df, x = 'x', y = 'y', z = 'z', color = 'label', width=800, height=600)\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "id": "602bc911-284b-4e0e-8882-99ac444fbb51", "metadata": { "id": "602bc911-284b-4e0e-8882-99ac444fbb51" }, "source": [ "## Word Embeddings" ] }, { "cell_type": "markdown", "id": "11bfbc63-cb97-4189-8d6c-6629bbc898e8", "metadata": { "id": "11bfbc63-cb97-4189-8d6c-6629bbc898e8" }, "source": [ "We can also visualize the semantics of individual words, instead of full texts, by using Word Embeddings. We use a Tokenizer and a word embeddings model to get those embeddings, and then apply PCA as before. First, we split the pipeline in two so that the first part returns all the token embeddings." ] }, { "cell_type": "code", "execution_count": null, "id": "28f0cced-e804-4415-aa57-6ae05587adf2", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "28f0cced-e804-4415-aa57-6ae05587adf2", "outputId": "bbfd73e0-624c-407d-cb4d-8c1f46a1fe8f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bert_embeddings_sec_bert_base download started this may take some time.\n", "Approximate size to download 390.4 MB\n", "[OK!]\n" ] } ], "source": [ "model = generic_pipeline().fit(sdf)" ] }, { "cell_type": "code", "execution_count": null, "id": "31090faa-0bbe-4e94-8384-e38a338e1989", "metadata": { "id": "31090faa-0bbe-4e94-8384-e38a338e1989" }, "outputs": [], "source": [ "result = model.transform(sdf)" ] }, { "cell_type": "code", "execution_count": null, "id": "1815b79d-81c5-45ec-ac80-418e31f9df8e", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1815b79d-81c5-45ec-ac80-418e31f9df8e", "outputId": "4f50264e-2aad-4fc5-eb4e-175b3b9d4f8a", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"+----------+--------+--------------------------------------------------------------------------------+\n", "| token| label| embeddings|\n", "+----------+--------+--------------------------------------------------------------------------------+\n", "| I|Accounts|[-0.29679197, 0.80952483, 0.026026089, 0.08434192, 0.7434629, -0.02694758, -0...|\n", "| called|Accounts|[0.28905854, -0.29229686, -0.42990392, -0.3833449, 0.026178285, -0.12728442, ...|\n", "|Huntington|Accounts|[0.20684586, -0.010130149, -0.259025, -0.37558293, 0.45792142, 0.3114912, -0....|\n", "| Bank|Accounts|[-0.034710683, 0.46047488, -0.6221113, -0.011169381, 0.2938512, 0.31341088, -...|\n", "| to|Accounts|[-0.40457863, -0.3768647, -0.08015404, -0.58909655, -0.33856544, -0.39321256,...|\n", "| close|Accounts|[0.35089388, 0.9568475, 0.86328286, -0.4334402, 0.11386797, -0.48837784, -0.8...|\n", "| my|Accounts|[-0.36591864, 0.2655603, -0.32495034, -0.5081896, -0.39623818, -0.63347244, -...|\n", "| account|Accounts|[0.004639961, 0.5340125, 0.77567977, 0.23316649, -0.4303767, -0.2937901, -0.5...|\n", "| ,|Accounts|[0.17874305, -0.026907753, 0.19498396, -0.7929611, -0.26044437, -0.3964327, -...|\n", "| and|Accounts|[0.5011346, 0.6637548, 0.15587743, -0.79522926, -0.8198417, -0.24028614, -0.6...|\n", "| they|Accounts|[-0.2188998, 0.17353022, -0.3897713, -0.4219988, -0.66089946, -0.6682683, -0....|\n", "| refused|Accounts|[-0.71534324, 0.4092898, -0.58240926, 0.2768947, -0.7440806, -0.016842518, -0...|\n", "| to|Accounts|[-0.062417023, -0.30230471, 0.17689183, -0.36983997, 0.22308639, -0.20912732,...|\n", "| close|Accounts|[0.49901608, 0.93363476, 0.89050376, -0.20053658, 0.47381917, -0.24397722, -0...|\n", "| my|Accounts|[-0.11864859, 0.068643466, -0.47048938, -0.33866596, -0.1448204, -0.59992373,...|\n", "| account|Accounts|[-0.045060933, 0.55244875, 0.9458424, 0.3263075, -0.26439214, -0.14597315, -0...|\n", "| over|Accounts|[0.19230467, -0.47188944, 0.33582675, 0.008950032, 0.3479425, 0.107840315, 
-0...|\n", "| the|Accounts|[-0.44176129, -0.17911726, -0.9623183, 0.09716578, 0.19224198, 0.1584882, 0.5...|\n", "| phone|Accounts|[-0.44973916, -0.9114662, -0.06911273, -0.18094938, 0.10837507, -0.8229777, -...|\n", "| .|Accounts|[0.0502675, 0.32013232, 0.22356117, -0.6540274, 0.48769465, -0.81690645, -0.6...|\n", "+----------+--------+--------------------------------------------------------------------------------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "result_df = result.select(\"label\", F.explode(F.arrays_zip(result.token.result, result.word_embeddings.embeddings)).alias(\"cols\"))\\\n", "    .select(F.expr(\"cols['0']\").alias(\"token\"),\n", "            \"label\",\n", "            F.expr(\"cols['1']\").alias(\"embeddings\"))\n", "\n", "result_df.show(truncate = 80)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6186b5d6-fa75-42db-b5f6-febbe0f6c957", "metadata": { "id": "6186b5d6-fa75-42db-b5f6-febbe0f6c957" }, "outputs": [], "source": [ "# Subclass of the EmbeddingsUDF class defined above\n", "class WordEmbeddingsUDF(EmbeddingsUDF):\n", "    def _transform(self, dataset):\n", "\n", "        # Changed from the parent class: the embeddings column is already exploded\n", "        results = dataset.select('token', 'label', 'embeddings')\n", "\n", "        results = results.withColumn(\n", "            \"features\",\n", "            self.udfs['convertToVectorUDF'](F.col(\"embeddings\"))\n", "        )\n", "        results = results.withColumn(\n", "            \"emb_sum\",\n", "            self.udfs['sumUDF'](F.col(\"embeddings\"))\n", "        )\n", "        # Drop rows whose embedding components sum to zero (a proxy for all-zero vectors, so we can calculate cosine distance)\n", "        results = results.where(F.col(\"emb_sum\")!=0.0)\n", "\n", "        return results" ] }, { "cell_type": "code", "execution_count": null, "id": "e416012e-ba80-4ca8-98de-aa675660ecb1", "metadata": { "id": "e416012e-ba80-4ca8-98de-aa675660ecb1" }, "outputs": [], "source": [ "embeddings_for_pca = WordEmbeddingsUDF()" ] }, { "cell_type": "code", "execution_count": null, "id": 
"0cca3d1f-0261-42de-96f8-f68f7fe6210a", "metadata": { "id": "0cca3d1f-0261-42de-96f8-f68f7fe6210a" }, "outputs": [], "source": [ "DIMENSIONS = 3" ] }, { "cell_type": "code", "execution_count": null, "id": "2f9e30a2-cc23-463f-87d1-4846c7a321e7", "metadata": { "id": "2f9e30a2-cc23-463f-87d1-4846c7a321e7" }, "outputs": [], "source": [ "pca = nlp.ML.feature.PCA(k=DIMENSIONS, inputCol=\"features\", outputCol=\"pca_features\")" ] }, { "cell_type": "markdown", "id": "4ec00af2-8e16-4473-8b43-4f7cff630a3d", "metadata": { "id": "4ec00af2-8e16-4473-8b43-4f7cff630a3d" }, "source": [ "### Full Spark NLP + Spark MLLib pipeline" ] }, { "cell_type": "code", "execution_count": null, "id": "69b8590b-2bf4-45b9-924d-d7438b87f08a", "metadata": { "id": "69b8590b-2bf4-45b9-924d-d7438b87f08a" }, "outputs": [], "source": [ "# We run the second part of the pipeline. Here the 768 dimensions are reduced to 3\n", "\n", "pipeline = nlp.Pipeline().setStages([embeddings_for_pca, pca])\n" ] }, { "cell_type": "code", "execution_count": null, "id": "8207c543-9fb0-4187-a359-61abad891934", "metadata": { "id": "8207c543-9fb0-4187-a359-61abad891934", "tags": [] }, "outputs": [], "source": [ "model = pipeline.fit(result_df)" ] }, { "cell_type": "code", "execution_count": null, "id": "cf674445-20d2-4ee9-8409-6a3782794aae", "metadata": { "id": "cf674445-20d2-4ee9-8409-6a3782794aae" }, "outputs": [], "source": [ "result = model.transform(result_df)" ] }, { "cell_type": "code", "execution_count": null, "id": "9093ca0e-e46e-40e8-8958-c47702fc4a43", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9093ca0e-e46e-40e8-8958-c47702fc4a43", "outputId": "a697bed3-95af-4997-ba5b-b5c5d894af25", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+--------+------------------------------------------------------------+\n", "| token| label| pca_features|\n", "+----------+--------+------------------------------------------------------------+\n", "| 
I|Accounts| [9.850468172808704,0.02182025684995559,1.7128883074588641]|\n", "| called|Accounts| [0.5703260311955864,0.346658149631252,-2.867726751670609]|\n", "|Huntington|Accounts| [8.635450770647445,0.8802312004740499,-0.8417105564124523]|\n", "| Bank|Accounts| [9.391061503515894,0.45066516018168057,-1.2157436459087525]|\n", "| to|Accounts| [-2.093784358504493,-1.1261933945050695,4.473374538741789]|\n", "| close|Accounts| [-2.897764751048121,-0.1633032944974737,2.6316552582800594]|\n", "| my|Accounts| [3.542237747747922,-2.721495573008954,2.847896218683586]|\n", "| account|Accounts|[-1.2533257167247633,0.006480340909400874,1.9023215773218...|\n", "| ,|Accounts| [-1.371343619695057,0.16043397738672746,2.236148062116737]|\n", "| and|Accounts| [0.2574783722223581,-0.39882523377542006,4.898577649457495]|\n", "| they|Accounts| [2.649181792582909,-2.0965602813943836,3.0047699978661027]|\n", "| refused|Accounts| [-1.447842544994814,-3.120728385057716,1.623718089120733]|\n", "| to|Accounts| [-3.476136836992586,-0.955126757467589,5.927975835938944]|\n", "| close|Accounts| [-3.22033000871663,-0.09380183797818464,2.502218213385302]|\n", "| my|Accounts| [3.0773967959997126,-2.732351171853666,3.4638980333557523]|\n", "| account|Accounts|[-1.7590956965611935,0.049200510343874765,2.0855752458205...|\n", "| over|Accounts| [-0.7852937839017823,-1.0837250583596254,2.77524528481479]|\n", "| the|Accounts| [-1.9152538017789913,-0.5845586090479645,5.449708419677918]|\n", "| phone|Accounts| [-3.7960455527563335,-2.527243794119921,1.563631667500876]|\n", "| .|Accounts|[-0.04212375650031434,-0.5960727710945473,0.4870793244043...|\n", "+----------+--------+------------------------------------------------------------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "result.select(\"token\", \"label\", \"pca_features\").show(truncate = 60)" ] }, { "cell_type": "code", "execution_count": null, "id": "2299c9bc-8e8e-4c49-ad29-133df31da543", "metadata": { "colab": { "base_uri": 
"https://localhost:8080/", "height": 424 }, "id": "2299c9bc-8e8e-4c49-ad29-133df31da543", "outputId": "16b3d720-7c81-4010-8e17-447d492f8330" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " token label pca_features\n", "0 I Accounts [9.850468172808704, 0.02182025684995559, 1.712...\n", "1 called Accounts [0.5703260311955864, 0.346658149631252, -2.867...\n", "2 Huntington Accounts [8.635450770647445, 0.8802312004740499, -0.841...\n", "3 Bank Accounts [9.391061503515894, 0.45066516018168057, -1.21...\n", "4 to Accounts [-2.093784358504493, -1.1261933945050695, 4.47...\n", "... ... ... ...\n", "1364 the Mortgage [0.20783178705004846, 1.2121685298369587, 2.34...\n", "1365 company Mortgage [0.9758784877952482, 1.1525640123640015, 1.548...\n", "1366 never Mortgage [-0.009449827591906173, -1.360506257943843, -0...\n", "1367 responds Mortgage [-1.3105360623344586, -0.3952000653886483, -1....\n", "1368 . Mortgage [1.732371824614684, -14.254692680656397, -4.51...\n", "\n", "[1369 rows x 3 columns]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = result.select('token', 'label', 'pca_features').toPandas()\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "20443a2f-9e6e-4cd2-885e-b42720bcb840", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "20443a2f-9e6e-4cd2-885e-b42720bcb840", "outputId": "ac0c709a-df3b-4bc7-9275-924524c4a823" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "
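|  | token | label | x | y | z |
| --- | --- | --- | --- | --- | --- |
| 0 | I | Accounts | 9.850468 | 0.021820 | 1.712888 |
| 1 | called | Accounts | 0.570326 | 0.346658 | -2.867727 |
| 2 | Huntington | Accounts | 8.635451 | 0.880231 | -0.841711 |
| 3 | Bank | Accounts | 9.391062 | 0.450665 | -1.215744 |
| 4 | to | Accounts | -2.093784 | -1.126193 | 4.473375 |
| ... | ... | ... | ... | ... | ... |
| 1364 | the | Mortgage | 0.207832 | 1.212169 | 2.345686 |
| 1365 | company | Mortgage | 0.975878 | 1.152564 | 1.548878 |
| 1366 | never | Mortgage | -0.009450 | -1.360506 | -0.080957 |
| 1367 | responds | Mortgage | -1.310536 | -0.395200 | -1.634091 |
| 1368 | . | Mortgage | 1.732372 | -14.254693 | -4.517188 |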
\n", "

1369 rows × 5 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " token label x y z\n", "0 I Accounts 9.850468 0.021820 1.712888\n", "1 called Accounts 0.570326 0.346658 -2.867727\n", "2 Huntington Accounts 8.635451 0.880231 -0.841711\n", "3 Bank Accounts 9.391062 0.450665 -1.215744\n", "4 to Accounts -2.093784 -1.126193 4.473375\n", "... ... ... ... ... ...\n", "1364 the Mortgage 0.207832 1.212169 2.345686\n", "1365 company Mortgage 0.975878 1.152564 1.548878\n", "1366 never Mortgage -0.009450 -1.360506 -0.080957\n", "1367 responds Mortgage -1.310536 -0.395200 -1.634091\n", "1368 . Mortgage 1.732372 -14.254693 -4.517188\n", "\n", "[1369 rows x 5 columns]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"x\"] = df[\"pca_features\"].apply(lambda x: x[0])\n", "\n", "df[\"y\"] = df[\"pca_features\"].apply(lambda x: x[1])\n", "\n", "df[\"z\"] = df[\"pca_features\"].apply(lambda x: x[2])\n", "\n", "df = df[[\"token\", \"label\", \"x\", \"y\", \"z\"]]\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": null, "id": "e96d9799-e838-4cdc-903e-dbd5cdc37aa9", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 817 }, "id": "e96d9799-e838-4cdc-903e-dbd5cdc37aa9", "outputId": "fc9faccb-52b9-4c50-914a-a83f2177fc66", "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "
\n", "\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import plotly.express as px\n", "\n", "# 3D scatter of the three PCA components; points are colored by label and show the token on hover\n", "fig = px.scatter_3d(df, x='x', y='y', z='z', color=\"label\", width=1000, height=800, hover_data=[\"token\", \"label\"])\n", "\n", "fig.show()" ] } ], "metadata": { "colab": { "provenance": [] }, "gpuClass": "standard", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6 (default, Oct 18 2022, 12:41:40) \n[Clang 14.0.0 (clang-1400.0.29.202)]" }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 5 }