{ "cells": [ { "cell_type": "markdown", "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea", "metadata": { "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea" }, "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ] }, { "cell_type": "markdown", "id": "982c5188", "metadata": { "id": "982c5188" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/05.0.NER_and_ZeroShotNER.ipynb)" ] }, { "cell_type": "markdown", "id": "6964d2b7", "metadata": { "id": "6964d2b7" }, "source": [ "# Legal Named Entity Recognition (NER) and Zero-shot NER" ] }, { "cell_type": "markdown", "id": "gk3kZHmNj51v", "metadata": { "collapsed": false, "id": "gk3kZHmNj51v" }, "source": [ "#🎬 Installation" ] }, { "cell_type": "code", "execution_count": null, "id": "_914itZsj51v", "metadata": { "id": "_914itZsj51v", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "! pip install -q johnsnowlabs" ] }, { "cell_type": "markdown", "id": "YPsbAnNoPt0Z", "metadata": { "id": "YPsbAnNoPt0Z" }, "source": [ "##🔗 Automatic Installation\n", "Using my.johnsnowlabs.com SSO" ] }, { "cell_type": "code", "execution_count": null, "id": "fY0lcShkj51w", "metadata": { "id": "fY0lcShkj51w", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "from johnsnowlabs import nlp, legal\n", "\n", "# nlp.install(force_browser=True)" ] }, { "cell_type": "markdown", "id": "hsJvn_WWM2GL", "metadata": { "id": "hsJvn_WWM2GL" }, "source": [ "##🔗 Manual downloading\n", "📚If you are not registered in my.johnsnowlabs.com, you received a license via e-email or you are using Safari, you may need to do a manual update of the license.\n", "\n", "- Go to my.johnsnowlabs.com\n", "- Download your license\n", "- Upload it using the following command" ] }, { "cell_type": "code", "execution_count": null, "id": "i57QV3-_P2sQ", "metadata": { "id": "i57QV3-_P2sQ" }, "outputs": [], "source": [ "from google.colab import files\n", "print('Please Upload your John Snow Labs License using the button below')\n", "license_keys = files.upload()" ] }, { "cell_type": "markdown", "id": "xGgNdFzZP_hQ", "metadata": { "id": "xGgNdFzZP_hQ" }, "source": [ "- Install it" ] }, { "cell_type": "code", "execution_count": null, "id": "OfmmPqknP4rR", "metadata": { "id": "OfmmPqknP4rR" }, "outputs": [], "source": [ "nlp.install()" ] }, { "cell_type": "markdown", "id": "DCl5ErZkNNLk", "metadata": { "id": "DCl5ErZkNNLk" }, "source": [ "#📌 Starting" ] }, { "cell_type": "code", "execution_count": null, "id": "wRXTnNl3j51w", "metadata": { "id": "wRXTnNl3j51w" }, "outputs": [], "source": [ "spark = nlp.start()" ] }, { "cell_type": "markdown", "id": "766fe57a-fcd5-4072-99d0-7626c7888493", "metadata": { "id": "766fe57a-fcd5-4072-99d0-7626c7888493", "tags": [] }, "source": [ "##🔎 NER Model Implementation in Spark NLP\n", "\n", " The deep neural network architecture for NER model in Spark NLP is BiLSTM-CNN-Char framework. a slightly modified version of the architecture proposed by Jason PC Chiu and Eric Nichols ([Named Entity Recognition with Bidirectional LSTM-CNNs](https://arxiv.org/abs/1511.08308)). It is a neural network architecture that automatically detects word and character-level features using a hybrid bidirectional LSTM and CNN architecture, eliminating the need for most feature engineering steps.\n", " \n", " In the original framework, the CNN extracts a fixed length feature vector from character-level features. 
For each word, these vectors are concatenated and fed to the BLSTM network and then to the output layers. They employed a stacked bi-directional recurrent neural network with long short-term memory units to transform word features into named entity tag scores. The extracted features of each word are fed into a forward LSTM network and a backward LSTM network. The output of each network at each time step is decoded by a linear layer and a log-softmax layer into log-probabilities for each tag category. These two vectors are then simply added together to produce the final output. In the architecture proposed in the original paper, 50-dimensional pretrained word embeddings are used for word features, 25-dimensional character embeddings are used for char features, and capitalization features (allCaps, upperInitial, lowercase, mixedCaps, noinfo) are used for case features." ] }, { "cell_type": "markdown", "id": "bee4b28c-dda1-4708-9240-edb6fe105013", "metadata": { "id": "bee4b28c-dda1-4708-9240-edb6fe105013" }, "source": [ "###📌 Legal CuadNER Model\n", "\n", "This model uses Named Entity Recognition to extract DOC (Document Type), PARTY (an entity signing a contract), ALIAS (the way a company is named later on in the document) and EFFDATE (Effective Date of the contract)." ] }, { "cell_type": "code", "execution_count": null, "id": "889067cf-a64c-4f3a-b27a-51fdca438599", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "889067cf-a64c-4f3a-b27a-51fdca438599", "outputId": "90f6e90c-a9c9-4b0d-b38f-80cd5a651e85", "scrolled": true, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sentence_detector_dl download started this may take some time.\n", "Approximate size to download 514.9 KB\n", "[OK!]\n", "roberta_embeddings_legal_roberta_base download started this may take some time.\n", "Approximate size to download 447.2 MB\n", "[OK!]\n", "legner_contract_doc_parties download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "documentAssembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", "\n", "sentenceDetector = nlp.SentenceDetectorDLModel.pretrained(\"sentence_detector_dl\",\"xx\")\\\n", " .setInputCols([\"document\"])\\\n", " .setOutputCol(\"sentence\")\n", "\n", "tokenizer = nlp.Tokenizer()\\\n", " .setInputCols([\"sentence\"])\\\n", " .setOutputCol(\"token\")\n", "\n", "embeddings = nlp.RoBertaEmbeddings.pretrained(\"roberta_embeddings_legal_roberta_base\", \"en\") \\\n", " .setInputCols(\"sentence\", \"token\") \\\n", " .setOutputCol(\"embeddings\")\n", "\n", "ner_model = legal.NerModel.pretrained(\"legner_contract_doc_parties\", \"en\", \"legal/models\")\\\n", " .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n", " .setOutputCol(\"ner\")\n", "\n", "ner_converter = nlp.NerConverter()\\\n", " .setInputCols([\"sentence\",\"token\",\"ner\"])\\\n", " .setOutputCol(\"ner_chunk\")\n", "\n", "nlpPipeline = nlp.Pipeline(stages=[\n", " documentAssembler,\n", " sentenceDetector,\n", " tokenizer,\n", " embeddings,\n", " ner_model,\n", " ner_converter])\n", "\n", "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(empty_data)" ] }, { "cell_type": "code", "execution_count": null, "id": "46fa5d8a-a5f0-4173-a21e-1df147d1b2e8", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "46fa5d8a-a5f0-4173-a21e-1df147d1b2e8", "outputId": "b0f70b65-fe99-45c9-d60b-0082826fe129" }, "outputs": [ { "data": {
"text/plain": [ "[DocumentAssembler_d02822bc3d37,\n", " SentenceDetectorDLModel_8aaebf7e098e,\n", " REGEX_TOKENIZER_a8b4485b4dba,\n", " ROBERTA_EMBEDDINGS_b915dff90901,\n", " MedicalNerModel_93f728ff96e5,\n", " NerConverter_c5758600563d]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# you can see pipeline stages with this code\n", "\n", "model.stages" ] }, { "cell_type": "code", "execution_count": null, "id": "af5baafe-793a-4022-ac3c-95c5345ef606", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "af5baafe-793a-4022-ac3c-95c5345ef606", "outputId": "24815617-36fe-46d9-f87a-872f16a23ee0" }, "outputs": [ { "data": { "text/plain": [ "['O',\n", " 'I-DOC',\n", " 'B-EFFDATE',\n", " 'B-ALIAS',\n", " 'I-ALIAS',\n", " 'B-PARTY',\n", " 'I-EFFDATE',\n", " 'I-PARTY',\n", " 'B-DOC']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# With this code, you can see which labels your NER model has.\n", "\n", "ner_model.getClasses()" ] }, { "cell_type": "code", "execution_count": null, "id": "5954047c-ec79-47ec-98fa-44c74b492140", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "5954047c-ec79-47ec-98fa-44c74b492140", "outputId": "57b759a9-4cd3-4dad-a9af-0ec00d63adb6" }, "outputs": [ { "data": { "text/plain": [ "{Param(parent='MedicalNerModel_93f728ff96e5', name='inferenceBatchSize', doc='number of sentences to process in a single batch during inference'): 1,\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='labelCasing', doc='Setting all labels of the NER models upper/lower case. values upper|lower'): '',\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='includeConfidence', doc='whether to include confidence scores in annotation metadata'): True,\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='includeAllConfidenceScores', doc='whether to include all confidence scores in annotation metadata or just the score of the predicted tag'): False,\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='batchSize', doc='Size of every batch'): 256,\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='classes', doc='get the tags used to trained this MedicalNerModel'): ['O',\n", " 'I-DOC',\n", " 'B-EFFDATE',\n", " 'B-ALIAS',\n", " 'I-ALIAS',\n", " 'B-PARTY',\n", " 'I-EFFDATE',\n", " 'I-PARTY',\n", " 'B-DOC'],\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='inputCols', doc='previous annotations columns, if renamed'): ['sentence',\n", " 'token',\n", " 'embeddings'],\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='outputCol', doc='output annotation column. can be left default.'): 'ner',\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='storageRef', doc='unique reference name for identification'): 'roberta_embeddings_legal_roberta_base_en'}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ner_model.extractParamMap()\n", "\n", "# With extractParamMap() function, you can see the parameters of any annotators you are using." 
] }, { "cell_type": "markdown", "id": "9e7d801c-fcc0-458c-9835-b6cbb0149f38", "metadata": { "id": "9e7d801c-fcc0-458c-9835-b6cbb0149f38" }, "source": [ "####✔️ **Sample Text**" ] }, { "cell_type": "code", "execution_count": null, "id": "00d74636-3490-4a24-9dc2-4f3f023c8909", "metadata": { "id": "00d74636-3490-4a24-9dc2-4f3f023c8909" }, "outputs": [], "source": [ "text = \"\"\"EXCLUSIVE DISTRIBUTOR AGREEMENT (\" Agreement \") dated as April 15, 1994 by and between IMRS OPERATIONS INC., a Delaware corporation with its principal place of business at 777 Long Ridge Road, Stamford, Connecticut 06902, U.S.A. (hereinafter referred to as \" Developer \") and Delteq Pte Ltd, a Singapore company (and a subsidiary of Wuthelam Industries (S) Pte LTD ) with its principal place of business at 215 Henderson Road , #101-03 Henderson Industrial Park , Singapore 0315 ( hereinafter referred to as \" Distributor \").\"\"\"\n", "\n", "df = spark.createDataFrame([[text]]).toDF(\"text\")\n", "\n", "result = model.transform(df)" ] }, { "cell_type": "markdown", "id": "4c7c211c-448e-494f-9f83-3274b9ca0aba", "metadata": { "id": "4c7c211c-448e-494f-9f83-3274b9ca0aba" }, "source": [ "####🖨️ **Getting Result**" ] }, { "cell_type": "code", "execution_count": null, "id": "ec9a99c6-4d22-4837-aed9-425b8f9efed6", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ec9a99c6-4d22-4837-aed9-425b8f9efed6", "outputId": "25f414a9-e5fb-45d9-c968-d44d5c1f37aa", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----------+---------+----------+\n", "| token|ner_label|confidence|\n", "+-----------+---------+----------+\n", "| EXCLUSIVE| B-DOC| 0.885|\n", "|DISTRIBUTOR| I-DOC| 0.7397|\n", "| AGREEMENT| I-DOC| 0.9926|\n", "| (\"| O| 0.9998|\n", "| Agreement| O| 0.9964|\n", "| \")| O| 1.0|\n", "| dated| O| 1.0|\n", "| as| O| 0.9985|\n", "| April|B-EFFDATE| 0.9845|\n", "| 15|I-EFFDATE| 0.951|\n", "| ,|I-EFFDATE| 0.9504|\n", "| 1994|I-EFFDATE| 0.8741|\n", "| by| O| 1.0|\n", "| and| O| 1.0|\n", "| between| O| 1.0|\n", "| IMRS| B-PARTY| 0.9898|\n", "| OPERATIONS| I-PARTY| 0.9987|\n", "| INC| I-PARTY| 0.9995|\n", "| .| O| 0.9907|\n", "| ,| O| 0.9983|\n", "| a| O| 1.0|\n", "| Delaware| O| 0.9997|\n", "|corporation| O| 0.9999|\n", "| with| O| 1.0|\n", "| its| O| 1.0|\n", "| principal| O| 1.0|\n", "| place| O| 1.0|\n", "| of| O| 1.0|\n", "| business| O| 1.0|\n", "| at| O| 1.0|\n", "| 777| O| 1.0|\n", "| Long| O| 0.9999|\n", "| Ridge| O| 0.9999|\n", "| Road| O| 1.0|\n", "| ,| O| 1.0|\n", "| Stamford| O| 0.9997|\n", "| ,| O| 1.0|\n", "|Connecticut| O| 0.9998|\n", "| 06902| O| 0.9997|\n", "| ,| O| 0.9998|\n", "| U.S.A| O| 0.9919|\n", "| .| O| 0.9991|\n", "| (| O| 0.9999|\n", "|hereinafter| O| 1.0|\n", "| referred| O| 1.0|\n", "| to| O| 0.9995|\n", "| as| O| 0.9994|\n", "| \"| O| 0.9959|\n", "| Developer| B-ALIAS| 0.9741|\n", "| \")| O| 0.9972|\n", "| and| O| 0.9978|\n", "| Delteq| B-PARTY| 0.9257|\n", "| Pte| I-PARTY| 0.9525|\n", "| Ltd| I-PARTY| 0.9735|\n", "| ,| O| 0.983|\n", "| a| O| 1.0|\n", "| Singapore| O| 0.9984|\n", "| company| O| 0.9977|\n", "| (| O| 1.0|\n", "| and| O| 1.0|\n", "| a| O| 1.0|\n", "| subsidiary| O| 1.0|\n", "| of| O| 0.9999|\n", "| Wuthelam| O| 0.9009|\n", "| Industries| O| 0.9494|\n", "| (| O| 0.9384|\n", "| S| O| 0.9564|\n", "| )| O| 0.9981|\n", "| Pte| O| 0.9911|\n", "| LTD| O| 0.9893|\n", "| )| O| 1.0|\n", "| with| O| 1.0|\n", "| its| O| 1.0|\n", "| principal| O| 1.0|\n", "| place| O| 1.0|\n", "| of| O| 1.0|\n", "| business| O| 1.0|\n", "| at| O| 1.0|\n", "| 215| 
O| 1.0|\n", "| Henderson| O| 1.0|\n", "| Road| O| 1.0|\n", "| ,| O| 1.0|\n", "| #101-03| O| 1.0|\n", "| Henderson| O| 0.9997|\n", "| Industrial| O| 0.9997|\n", "| Park| O| 0.9998|\n", "| ,| O| 1.0|\n", "| Singapore| O| 0.9999|\n", "| 0315| O| 0.9998|\n", "| (| O| 1.0|\n", "|hereinafter| O| 1.0|\n", "| referred| O| 1.0|\n", "| to| O| 0.9999|\n", "| as| O| 0.9999|\n", "| \"| O| 0.999|\n", "|Distributor| B-ALIAS| 0.9814|\n", "| \").| O| 0.9926|\n", "+-----------+---------+----------+\n", "\n" ] } ], "source": [ "from pyspark.sql import functions as F\n", "\n", "result.select(F.explode(F.arrays_zip(result.token.result, \n", " result.ner.result, \n", " result.ner.metadata)).alias(\"cols\"))\\\n", " .select(F.expr(\"cols['0']\").alias(\"token\"),\n", " F.expr(\"cols['1']\").alias(\"ner_label\"),\n", " F.expr(\"cols['2']['confidence']\").alias(\"confidence\")).show(200, truncate=100)" ] }, { "cell_type": "code", "execution_count": null, "id": "865dce29-ece0-45f6-8f5b-9028292523f0", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "865dce29-ece0-45f6-8f5b-9028292523f0", "outputId": "4572f71f-31af-4ee6-e693-563d12a91dad" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------------------------------+---------+----------+\n", "|chunk |ner_label|confidence|\n", "+-------------------------------+---------+----------+\n", "|EXCLUSIVE DISTRIBUTOR AGREEMENT|DOC |0.87243336|\n", "|April 15, 1994 |EFFDATE |0.94 |\n", "|IMRS OPERATIONS INC |PARTY |0.996 |\n", "|Developer |ALIAS |0.9741 |\n", "|Delteq Pte Ltd |PARTY |0.9505667 |\n", "|Distributor |ALIAS |0.9814 |\n", "+-------------------------------+---------+----------+\n", "\n" ] } ], "source": [ "result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias(\"cols\")) \\\n", " .select(F.expr(\"cols['0']\").alias(\"chunk\"),\n", " F.expr(\"cols['1']['entity']\").alias(\"ner_label\"),\n", " F.expr(\"cols['1']['confidence']\").alias(\"confidence\")).show(truncate=False)" ] }, { "cell_type": "markdown", "id": "b47e34e0-3633-4202-afdf-63a0f2475520", "metadata": { "id": "b47e34e0-3633-4202-afdf-63a0f2475520" }, "source": [ "####🖨️ **Getting Result with LightPipeline**\n", "\n", "LightPipelines are Spark NLP specific pipelines, equivalent to Spark ML Pipelines, but meant to deal with smaller amounts of data. They’re useful when working with small datasets, debugging results, or running either training or prediction from an API that serves one-off requests.\n", "\n", "Spark NLP LightPipelines are Spark ML pipelines converted into a single-machine but multi-threaded task, becoming more than 10x faster for smaller amounts of data (small is relative, but 50k sentences is roughly a good maximum). To use them, we simply plug in a fitted pipeline and then annotate plain text. We don't even need to convert the input text to a DataFrame in order to feed it into a pipeline that accepts DataFrame input in the first place. This feature is quite useful when it comes to getting a prediction for a few lines of text from a trained ML model.\n", "\n", "**It is nearly 10x faster than using a Spark ML Pipeline.**\n", "\n", "For more details:\n", "[https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1](https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1)" ] }
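, { "cell_type": "markdown", "id": "lightpipeline-annotate-note", "metadata": {}, "source": [ "As a quick illustration of the point above, `annotate()` accepts a plain Python string and returns a dictionary of result lists keyed by output column — no DataFrame involved. A minimal sketch, reusing the fitted `model` and the sample `text` from above:" ] }, { "cell_type": "code", "execution_count": null, "id": "lightpipeline-annotate-sketch", "metadata": {}, "outputs": [], "source": [ "# annotate() returns string results only (no metadata); use fullAnnotate()\n", "# as in the next cell when you also need begin/end offsets and confidences\n", "light_result_dict = nlp.LightPipeline(model).annotate(text)\n", "print(light_result_dict['ner_chunk'])" ] }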
, { "cell_type": "code", "execution_count": null, "id": "f22dd0c2-c63d-43c8-bc96-2f7cead3553b", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 238 }, "id": "f22dd0c2-c63d-43c8-bc96-2f7cead3553b", "outputId": "4a475990-355c-4e6e-fc9b-f6b89d9d7380" }, "outputs": [ { "data": { "text/plain": [ " chunks begin end sentence_id entities\n", "0 EXCLUSIVE DISTRIBUTOR AGREEMENT 0 30 0 DOC\n", "1 April 15, 1994 57 70 0 EFFDATE\n", "2 IMRS OPERATIONS INC 87 105 0 PARTY\n", "3 Developer 259 267 1 ALIAS\n", "4 Delteq Pte Ltd 276 289 1 PARTY\n", "5 Distributor 510 520 1 ALIAS" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "light_model = nlp.LightPipeline(model)\n", "\n", "light_result = light_model.fullAnnotate(text)\n", "\n", "chunks = []\n", "entities = []\n", "sentence = []\n", "begin = []\n", "end = []\n", "\n", "for n in light_result[0]['ner_chunk']:\n", " begin.append(n.begin)\n", " end.append(n.end)\n", " chunks.append(n.result)\n", " entities.append(n.metadata['entity'])\n", " sentence.append(n.metadata['sentence'])\n", "\n", "df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, \n", " 'sentence_id':sentence, 'entities':entities})\n", "\n", "df.head(20)" ] }, { "cell_type": "markdown", "id": "0e91726a-2fd2-4432-a3ff-fe5238b00e9d", "metadata": { "id": "0e91726a-2fd2-4432-a3ff-fe5238b00e9d" }, "source": [ "####📌 NER Visualizer\n", "\n", "To save the visualization result as HTML, provide the `save_path` parameter in the display function, as sketched after the cell below." ] }, { "cell_type": "code", "execution_count": null, "id": "1f9e05ec-1724-4d53-b4e2-68e454c4e3bb", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 159 }, "id": "1f9e05ec-1724-4d53-b4e2-68e454c4e3bb", "outputId": "be1d8a6b-f93d-4322-fdc7-8bfa38635b49" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " EXCLUSIVE DISTRIBUTOR AGREEMENT DOC (\" Agreement \") dated as April 15, 1994 EFFDATE by and between IMRS OPERATIONS INC PARTY., a Delaware corporation with its principal place of business at 777 Long Ridge Road, Stamford, Connecticut 06902, U.S.A. (hereinafter referred to as \" Developer ALIAS \") and Delteq Pte Ltd PARTY, a Singapore company (and a subsidiary of Wuthelam Industries (S) Pte LTD ) with its principal place of business at 215 Henderson Road , #101-03 Henderson Industrial Park , Singapore 0315 ( hereinafter referred to as \" Distributor ALIAS \")." ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# from sparknlp_display import NerVisualizer\n", "\n", "visualiser = nlp.viz.NerVisualizer()\n", "\n", "visualiser.display(light_result[0], label_col='ner_chunk', document_col='document')" ] }
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# from sparknlp_display import NerVisualizer\n", "\n", "visualiser = nlp.viz.NerVisualizer()\n", "\n", "visualiser.display(light_result[0], label_col='ner_chunk', document_col='document')" ] }, { "cell_type": "markdown", "id": "95645147-7f43-4fc1-b668-3bb2317bb74f", "metadata": { "id": "95645147-7f43-4fc1-b668-3bb2317bb74f" }, "source": [ "##🔎 Create Generic Pipeline for NerDL Models" ] }, { "cell_type": "code", "execution_count": null, "id": "da501a7a-aabd-477d-b19b-1e18b4ee7042", "metadata": { "id": "da501a7a-aabd-477d-b19b-1e18b4ee7042" }, "outputs": [], "source": [ "def base_pipeline():\n", " \n", " documentAssembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", "\n", " textSplitter = legal.TextSplitter()\\\n", " .setInputCols([\"document\"])\\\n", " .setOutputCol(\"sentence\")\n", "\n", " tokenizer = nlp.Tokenizer()\\\n", " .setInputCols([\"sentence\"])\\\n", " .setOutputCol(\"token\")\n", " \n", " pipeline = nlp.Pipeline(stages=[\n", " documentAssembler,\n", " textSplitter,\n", " tokenizer])\n", " \n", " return pipeline" ] }, { "cell_type": "code", "execution_count": null, "id": "6b050243-1a9d-4b9e-a7a8-3fc084e69d08", "metadata": { "id": "6b050243-1a9d-4b9e-a7a8-3fc084e69d08" }, "outputs": [], "source": [ "def generic_ner_pipeline(model_name):\n", " \n", " embeddings = nlp.RoBertaEmbeddings.pretrained(\"roberta_embeddings_legal_roberta_base\", \"en\") \\\n", " .setInputCols(\"sentence\", \"token\") \\\n", " .setOutputCol(\"embeddings\")\\\n", "\n", " ner_model = legal.NerModel.pretrained(model_name, \"en\", \"legal/models\")\\\n", " .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n", " .setOutputCol(\"ner\")\n", "\n", " ner_converter = nlp.NerConverter()\\\n", " .setInputCols([\"sentence\",\"token\",\"ner\"])\\\n", " .setOutputCol(\"ner_chunk\")\n", "\n", " nlpPipeline = nlp.Pipeline(stages=[\n", " base_pipeline(),\n", " embeddings,\n", " ner_model,\n", " ner_converter])\n", "\n", " empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", " model = nlpPipeline.fit(empty_data)\n", " \n", " return model" ] }, { "cell_type": "markdown", "id": "479103f2-1256-4b78-971a-55d81140d030", "metadata": { "id": "479103f2-1256-4b78-971a-55d81140d030" }, "source": [ "##📌 Create Generic Result Function" ] }, { "cell_type": "code", "execution_count": null, "id": "19fac60f-f99a-4bf6-8668-0d40aa50a24d", "metadata": { "id": "19fac60f-f99a-4bf6-8668-0d40aa50a24d" }, "outputs": [], "source": [ "def get_result(result):\n", " result.select(F.explode(F.arrays_zip(result.ner_chunk.result, \n", " result.ner_chunk.metadata)).alias(\"cols\")) \\\n", " .select(F.expr(\"cols['0']\").alias(\"chunk\"),\n", " F.expr(\"cols['1']['entity']\").alias(\"ner_label\")).show(50, truncate=False)" ] }, { "cell_type": "markdown", "id": "817a44c1-52a3-40b3-9bd2-bc7b67c4c7fe", "metadata": { "id": "817a44c1-52a3-40b3-9bd2-bc7b67c4c7fe" }, "source": [ "###✔️ Legal Cuad_NER_Header Model\n", "\n", "This model uses Name Entity Recognition to detect **HEADER** and **SUBHEADER** with aims to detect the different sections of a legal document." 
] }, { "cell_type": "code", "execution_count": null, "id": "3fa0104c-bdb7-4e4a-b3bb-f7e21f912964", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3fa0104c-bdb7-4e4a-b3bb-f7e21f912964", "jupyter": { "outputs_hidden": true }, "outputId": "8a3755a3-d082-418e-f842-3ba40a35bfc5", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "roberta_embeddings_legal_roberta_base download started this may take some time.\n", "Approximate size to download 447.2 MB\n", "[OK!]\n", "legner_headers download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "text = \"\"\"5. GRANT OF PATENT LICENSE\n", "5.1 Arizona Patent Grant. Subject to the terms and conditions of this Agreement, Arizona hereby grants to the Company a perpetual, non-exclusive, royalty-free license in, to and under the Arizona Licensed Patents for use in the Company Field throughout the world.\"\"\"\n", "\n", "model_name = \"legner_headers\"\n", "df = spark.createDataFrame([[text]]).toDF(\"text\")\n", "\n", "result = generic_ner_pipeline(model_name).transform(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "0c143057-ddfc-4823-947f-9e51506e50ce", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0c143057-ddfc-4823-947f-9e51506e50ce", "outputId": "72a7ecb8-9128-48fd-ad0f-6d1b16d80416" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------------+---------+\n", "|chunk |ner_label|\n", "+--------------------------+---------+\n", "|5. GRANT OF PATENT LICENSE|HEADER |\n", "|5.1 Arizona Patent Grant |SUBHEADER|\n", "+--------------------------+---------+\n", "\n" ] } ], "source": [ "get_result(result)" ] }, { "cell_type": "markdown", "id": "aab56faa-0646-4c84-ac4e-e9710c0ba891", "metadata": { "id": "aab56faa-0646-4c84-ac4e-e9710c0ba891" }, "source": [ "###✔️ Legal Cuad_NER_Obligations Model\n", "\n", "📚Entities:\n", " - OBLIGATION_SUBJECT\n", " - OBLIGATION_ACTION\n", " - OBLIGATION\n", " - OBLIGATION_INDIRECT_OBJECT" ] }, { "cell_type": "code", "execution_count": null, "id": "Bzup3rC83o-F", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Bzup3rC83o-F", "outputId": "acbe94fd-cad1-47b1-adc1-7b623481cf2e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "legner_obligations download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "tokenClassifier = legal.BertForTokenClassification.pretrained(\"legner_obligations\", \"en\", \"legal/models\")\\\n", " .setInputCols(\"token\", \"sentence\")\\\n", " .setOutputCol(\"ner\")\\\n", " .setCaseSensitive(True)\n", "\n", "ner_converter = nlp.NerConverter()\\\n", " .setInputCols([\"sentence\",\"token\",\"ner\"])\\\n", " .setOutputCol(\"ner_chunk\")\n", "\n", "pipeline = nlp.Pipeline(stages=[\n", " base_pipeline(), \n", " tokenClassifier,\n", " ner_converter])\n", "\n", "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = pipeline.fit(empty_data)" ] }, { "cell_type": "code", "execution_count": null, "id": "6e48a792-b252-42bd-8d0f-0e543240b289", "metadata": { "id": "6e48a792-b252-42bd-8d0f-0e543240b289", "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [], "source": [ "# Sometimes models work better with lowercase, depending on the vocabulary of the uppercase items\n", "# Sometimes only uncased language models are present.\n", "# This one is mixed but works better with lowercase\n", "text = \"\"\"PPD may engage VS to perform imaging services\"\"\".lower()\n", "\n", 
"df = spark.createDataFrame([[text]]).toDF(\"text\")\n", "\n", "result = model.transform(df)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "56f94eaf-e394-4052-b33b-518ae5876ca1", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "56f94eaf-e394-4052-b33b-518ae5876ca1", "outputId": "cfa33c03-93d1-4299-d0dc-2e8b0ca5c941" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------------+------------------+\n", "|chunk |ner_label |\n", "+------------------+------------------+\n", "|ppd |OBLIGATION_SUBJECT|\n", "|may engage |OBLIGATION_ACTION |\n", "|vs |OBLIGATION |\n", "|to perform imaging|OBLIGATION |\n", "+------------------+------------------+\n", "\n" ] } ], "source": [ "get_result(result)" ] }, { "cell_type": "markdown", "id": "bd3aab70-a7e4-477c-9415-9ed845afb9c1", "metadata": { "id": "bd3aab70-a7e4-477c-9415-9ed845afb9c1" }, "source": [ "###✔️ Legal NER_Law_Money Spanish Model with RoBertaForTokenClassification\n", "\n", "📚Enities\n", " - LAW\n", " - MONEY" ] }, { "cell_type": "code", "execution_count": null, "id": "e350660c-4e30-42d9-8086-7f835036fa19", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "e350660c-4e30-42d9-8086-7f835036fa19", "outputId": "54c545d1-cfc6-47dd-dd60-b525de5bfd5c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "legner_law_money download started this may take some time.\n", "Approximate size to download 395.1 MB\n", "[OK!]\n" ] } ], "source": [ "tokenClassifier = nlp.RoBertaForTokenClassification.pretrained(\"legner_law_money\", \"es\", \"legal/models\") \\\n", " .setInputCols([\"sentence\", \"token\"])\\\n", " .setOutputCol(\"ner\")\n", "ner_converter = nlp.NerConverter()\\\n", " .setInputCols([\"sentence\",\"token\",\"ner\"])\\\n", " .setOutputCol(\"ner_chunk\")\n", "\n", "pipeline = nlp.Pipeline(stages=[\n", " base_pipeline(), \n", " tokenClassifier,\n", " ner_converter])\n", "\n", "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = pipeline.fit(empty_data)" ] }, { "cell_type": "code", "execution_count": null, "id": "ad62ed1d-7972-45dc-adf0-46318bfcdc4f", "metadata": { "id": "ad62ed1d-7972-45dc-adf0-46318bfcdc4f" }, "outputs": [], "source": [ "text = \"\"\"La recaudación del ministerio del interior fue de 20,000,000 euros así constatado por el artículo 24 de la Constitución Española.\"\"\"\n", "\n", "df = spark.createDataFrame([[text]]).toDF(\"text\")\n", "\n", "result = model.transform(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "0cd698ee-0302-427b-bdf8-a43d878468cc", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0cd698ee-0302-427b-bdf8-a43d878468cc", "outputId": "cb5fbea8-2455-4d2b-9a7c-0a6ba8431beb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------------------------------------+---------+\n", "|chunk |ner_label|\n", "+---------------------------------------+---------+\n", "|20,000,000 euros |MONEY |\n", "|artículo 24 de la Constitución Española|LAW |\n", "+---------------------------------------+---------+\n", "\n" ] } ], "source": [ "get_result(result)" ] }, { "cell_type": "markdown", "id": "bf944e21-f7e3-4282-b3c5-2bbeb04fe8a8", "metadata": { "id": "bf944e21-f7e3-4282-b3c5-2bbeb04fe8a8" }, "source": [ "#🔎 Zero-shot Legal Example" ] }, { "cell_type": "markdown", "id": "dd10f5c5", "metadata": { "id": "dd10f5c5" }, "source": [ "📚`Zero-shot` is a new inference paradigm which allows us to use a model for prediction without any previous 
, { "cell_type": "markdown", "id": "dd10f5c5", "metadata": { "id": "dd10f5c5" }, "source": [ "📚`Zero-shot` is an inference paradigm which allows us to use a model for prediction without any task-specific training step.\n", "\n", "To do that, several examples (_hypotheses_) are provided and sent to the language model, which uses `NLI (Natural Language Inference)` to check whether any information found in the text matches the examples (confirms the hypotheses).\n", "\n", "NLI usually works by trying to _confirm or reject a hypothesis_. The _hypotheses_ are the `prompts` or examples we provide. If any piece of information confirms a constructed hypothesis (answers one of the examples we have given), the hypothesis is confirmed and the zero-shot entity is triggered.\n", "\n", "Let's see it in action." ] }, { "cell_type": "code", "execution_count": null, "id": "6610e9d9-0cd6-45ad-9fe4-e0d9ac3314e3", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6610e9d9-0cd6-45ad-9fe4-e0d9ac3314e3", "outputId": "c1a25f57-56f1-433a-cb9c-e037007bb117" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "legner_roberta_zeroshot download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "documentAssembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", "\n", "textSplitter = legal.TextSplitter()\\\n", " .setInputCols([\"document\"])\\\n", " .setOutputCol(\"sentence\")\n", "\n", "sparktokenizer = nlp.Tokenizer()\\\n", " .setInputCols(\"sentence\")\\\n", " .setOutputCol(\"token\")\n", "\n", "zero_shot_ner = legal.ZeroShotNerModel.pretrained(\"legner_roberta_zeroshot\", \"en\", \"legal/models\")\\\n", " .setInputCols([\"sentence\", \"token\"])\\\n", " .setOutputCol(\"zero_shot_ner\")\\\n", " .setEntityDefinitions(\n", " {\n", " \"DATE\": ['When was the company acquisition?', 'When was the company purchase agreement?', \"When was the agreement?\"],\n", " \"ORG\": [\"Which company?\"],\n", " \"STATE\": [\"Which state?\"],\n", " \"AGREEMENT\": [\"What kind of agreement?\"],\n", " \"LICENSE\": [\"What kind of license?\"],\n", " \"LICENSE_RECIPIENT\": [\"To whom the license is granted?\"]\n", " })\n", "\n", "nerconverter = nlp.NerConverter()\\\n", " .setInputCols([\"sentence\", \"token\", \"zero_shot_ner\"])\\\n", " .setOutputCol(\"ner_chunk\")\n", "\n", "pipeline = nlp.Pipeline(stages=[\n", " documentAssembler,\n", " textSplitter,\n", " sparktokenizer,\n", " zero_shot_ner,\n", " nerconverter\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "9f49d722-71aa-488e-af32-6de899aa5b2a", "metadata": { "id": "9f49d722-71aa-488e-af32-6de899aa5b2a" }, "outputs": [], "source": [ "from pyspark.sql.types import StructType,StructField, StringType\n", "\n", "sample_text = [\"\"\"In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.\"\"\",\n", " \"\"\"In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.\"\"\",\n", " \"\"\"This INTELLECTUAL PROPERTY AGREEMENT, dated as of December 31, 2018 (the 'Effective Date') is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ('Seller') and AFI Licensing LLC, a Delaware company (the 'Licensee')\"\"\",\n", " \"\"\"The Company hereby grants to Seller a perpetual, non- exclusive, royalty-free license\"\"\"]\n", "\n", "p_model = pipeline.fit(spark.createDataFrame([[\"\"]]).toDF(\"text\"))\n", "\n", "res = p_model.transform(spark.createDataFrame(sample_text, StringType()).toDF(\"text\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "233fd7d6-c84d-4240-9b57-9912f6256b71", "metadata": { "colab": {
"base_uri": "https://localhost:8080/" }, "id": "233fd7d6-c84d-4240-9b57-9912f6256b71", "outputId": "c402d4e9-07e3-4b1c-e90d-8eb2016df264" }, "outputs": [ { "data": { "text/plain": [ "DataFrame[chunk: string, ner_label: string]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# from pyspark.sql import functions as F\n", "\n", "res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias(\"cols\")) \\\n", " .select(F.expr(\"cols['0']\").alias(\"chunk\"),\n", " F.expr(\"cols['3']['entity']\").alias(\"ner_label\"))\\\n", " .filter(\"ner_label!='O'\")" ] }, { "cell_type": "code", "execution_count": null, "id": "06a23e24-2872-4bb2-8dd3-32f62bdb9423", "metadata": { "id": "06a23e24-2872-4bb2-8dd3-32f62bdb9423" }, "outputs": [], "source": [ "lp = nlp.LightPipeline(p_model)\n", "lp_res_1 = lp.fullAnnotate(sample_text[2])\n", "lp_res_2 = lp.fullAnnotate(sample_text[3])" ] }, { "cell_type": "code", "execution_count": null, "id": "99fc9030-2dc5-4621-b25a-ea2467aa0ccb", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 88 }, "id": "99fc9030-2dc5-4621-b25a-ea2467aa0ccb", "outputId": "24d5c6e1-03ca-4683-86bc-2fb1935f4930" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " This INTELLECTUAL PROPERTY AGREEMENT AGREEMENT, dated as of December 31, 2018 DATE (the 'Effective Date') is entered into by and between Armstrong Flooring LICENSE_RECIPIENT, Inc., a Delaware STATE corporation ('Seller') and AFI Licensing LLC, a Delaware company LICENSE_RECIPIENT (the 'Licensee')" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# from sparknlp_display import NerVisualizer\n", "\n", "visualiser = nlp.viz.NerVisualizer()\n", "\n", "visualiser.display(lp_res_1[0], label_col='ner_chunk', document_col='document')" ] }, { "cell_type": "code", "execution_count": null, "id": "aed220b8-8aba-42bb-8e5a-b2ee3438834e", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 88 }, "id": "aed220b8-8aba-42bb-8e5a-b2ee3438834e", "outputId": "9f6b3be3-07f4-4828-c569-93214a1eba64" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " The Company hereby grants to Seller LICENSE_RECIPIENT a perpetual LICENSE, non- exclusive LICENSE, royalty-free LICENSE license" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "visualiser.display(lp_res_2[0], label_col='ner_chunk', document_col='document')" ] }, { "cell_type": "code", "execution_count": null, "id": "CijixuQO90aU", "metadata": { "id": "CijixuQO90aU" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "machine_shape": "hm", "provenance": [], "toc_visible": true }, "gpuClass": "standard", "kernelspec": { "display_name": "tf-gpu", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]" }, "vscode": { "interpreter": { "hash": "3f47d918ae832c68584484921185f5c85a1760864bf927a683dc6fb56366cc77" } } }, "nbformat": 4, "nbformat_minor": 5 }