{ "cells": [ { "cell_type": "markdown", "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea", "metadata": { "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea" }, "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ] }, { "cell_type": "markdown", "id": "982c5188", "metadata": { "id": "982c5188" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/05.0.NER_and_ZeroShotNER.ipynb)" ] }, { "cell_type": "markdown", "id": "6964d2b7", "metadata": { "id": "6964d2b7" }, "source": [ "# Legal Named Entity Recognition (NER) and Zero-shot NER" ] }, { "cell_type": "markdown", "id": "gk3kZHmNj51v", "metadata": { "collapsed": false, "id": "gk3kZHmNj51v" }, "source": [ "#🎬 Installation" ] }, { "cell_type": "code", "execution_count": null, "id": "_914itZsj51v", "metadata": { "id": "_914itZsj51v", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "! pip install -q johnsnowlabs" ] }, { "cell_type": "markdown", "id": "YPsbAnNoPt0Z", "metadata": { "id": "YPsbAnNoPt0Z" }, "source": [ "##🔗 Automatic Installation\n", "Using my.johnsnowlabs.com SSO" ] }, { "cell_type": "code", "execution_count": null, "id": "fY0lcShkj51w", "metadata": { "id": "fY0lcShkj51w", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "from johnsnowlabs import nlp, legal\n", "\n", "# nlp.install(force_browser=True)" ] }, { "cell_type": "markdown", "id": "hsJvn_WWM2GL", "metadata": { "id": "hsJvn_WWM2GL" }, "source": [ "##🔗 Manual downloading\n", "📚If you are not registered in my.johnsnowlabs.com, you received a license via e-email or you are using Safari, you may need to do a manual update of the license.\n", "\n", "- Go to my.johnsnowlabs.com\n", "- Download your license\n", "- Upload it using the following command" ] }, { "cell_type": "code", "execution_count": null, "id": "i57QV3-_P2sQ", "metadata": { "id": "i57QV3-_P2sQ" }, "outputs": [], "source": [ "from google.colab import files\n", "print('Please Upload your John Snow Labs License using the button below')\n", "license_keys = files.upload()" ] }, { "cell_type": "markdown", "id": "xGgNdFzZP_hQ", "metadata": { "id": "xGgNdFzZP_hQ" }, "source": [ "- Install it" ] }, { "cell_type": "code", "execution_count": null, "id": "OfmmPqknP4rR", "metadata": { "id": "OfmmPqknP4rR" }, "outputs": [], "source": [ "nlp.install()" ] }, { "cell_type": "markdown", "id": "DCl5ErZkNNLk", "metadata": { "id": "DCl5ErZkNNLk" }, "source": [ "#📌 Starting" ] }, { "cell_type": "code", "execution_count": null, "id": "wRXTnNl3j51w", "metadata": { "id": "wRXTnNl3j51w" }, "outputs": [], "source": [ "spark = nlp.start()" ] }, { "cell_type": "markdown", "id": "766fe57a-fcd5-4072-99d0-7626c7888493", "metadata": { "id": "766fe57a-fcd5-4072-99d0-7626c7888493", "tags": [] }, "source": [ "##🔎 NER Model Implementation in Spark NLP\n", "\n", " The deep neural network architecture for NER model in Spark NLP is BiLSTM-CNN-Char framework. a slightly modified version of the architecture proposed by Jason PC Chiu and Eric Nichols ([Named Entity Recognition with Bidirectional LSTM-CNNs](https://arxiv.org/abs/1511.08308)). It is a neural network architecture that automatically detects word and character-level features using a hybrid bidirectional LSTM and CNN architecture, eliminating the need for most feature engineering steps.\n", " \n", " In the original framework, the CNN extracts a fixed length feature vector from character-level features. 
For each word, these vectors are concatenated and fed to the BLSTM network and then to the output layers. They employed a stacked bi-directional recurrent neural network with long short-term memory units to transform word features into named entity tag scores. The extracted features of each word are fed into a forward LSTM network and a backward LSTM network. The output of each network at each time step is decoded by a linear layer and a log-softmax layer into log-probabilities for each tag category. These two vectors are then simply added together to produce the final output. In the architecture proposed in the original paper, 50-dimensional pretrained word embeddings are used for word features, 25-dimensional character embeddings are used for char features, and capitalization features (allCaps, upperInitial, lowercase, mixedCaps, noinfo) are used for case features." ] }, { "cell_type": "markdown", "id": "bee4b28c-dda1-4708-9240-edb6fe105013", "metadata": { "id": "bee4b28c-dda1-4708-9240-edb6fe105013" }, "source": [ "###📌 Legal CuadNER Model\n", "\n", "This model uses Named Entity Recognition to extract DOC (Document Type), PARTY (an entity signing a contract), ALIAS (the way a company is named later on in the document) and EFFDATE (Effective Date of the contract)." ] }, { "cell_type": "code", "execution_count": null, "id": "889067cf-a64c-4f3a-b27a-51fdca438599", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "889067cf-a64c-4f3a-b27a-51fdca438599", "outputId": "90f6e90c-a9c9-4b0d-b38f-80cd5a651e85", "scrolled": true, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sentence_detector_dl download started this may take some time.\n", "Approximate size to download 514.9 KB\n", "[OK!]\n", "roberta_embeddings_legal_roberta_base download started this may take some time.\n", "Approximate size to download 447.2 MB\n", "[OK!]\n", "legner_contract_doc_parties download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "documentAssembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", "\n", "sentenceDetector = nlp.SentenceDetectorDLModel.pretrained(\"sentence_detector_dl\",\"xx\")\\\n", " .setInputCols([\"document\"])\\\n", " .setOutputCol(\"sentence\")\n", "\n", "tokenizer = nlp.Tokenizer()\\\n", " .setInputCols([\"sentence\"])\\\n", " .setOutputCol(\"token\")\n", "\n", "embeddings = nlp.RoBertaEmbeddings.pretrained(\"roberta_embeddings_legal_roberta_base\", \"en\") \\\n", " .setInputCols(\"sentence\", \"token\") \\\n", " .setOutputCol(\"embeddings\")\n", "\n", "ner_model = legal.NerModel.pretrained(\"legner_contract_doc_parties\", \"en\", \"legal/models\")\\\n", " .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n", " .setOutputCol(\"ner\")\n", "\n", "ner_converter = nlp.NerConverter()\\\n", " .setInputCols([\"sentence\",\"token\",\"ner\"])\\\n", " .setOutputCol(\"ner_chunk\")\n", "\n", "nlpPipeline = nlp.Pipeline(stages=[\n", " documentAssembler,\n", " sentenceDetector,\n", " tokenizer,\n", " embeddings,\n", " ner_model,\n", " ner_converter])\n", "\n", "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(empty_data)" ] }, { "cell_type": "code", "execution_count": null, "id": "46fa5d8a-a5f0-4173-a21e-1df147d1b2e8", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "46fa5d8a-a5f0-4173-a21e-1df147d1b2e8", "outputId": "b0f70b65-fe99-45c9-d60b-0082826fe129" }, "outputs": [ { "data": {
"text/plain": [ "[DocumentAssembler_d02822bc3d37,\n", " SentenceDetectorDLModel_8aaebf7e098e,\n", " REGEX_TOKENIZER_a8b4485b4dba,\n", " ROBERTA_EMBEDDINGS_b915dff90901,\n", " MedicalNerModel_93f728ff96e5,\n", " NerConverter_c5758600563d]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# you can see pipeline stages with this code\n", "\n", "model.stages" ] }, { "cell_type": "code", "execution_count": null, "id": "af5baafe-793a-4022-ac3c-95c5345ef606", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "af5baafe-793a-4022-ac3c-95c5345ef606", "outputId": "24815617-36fe-46d9-f87a-872f16a23ee0" }, "outputs": [ { "data": { "text/plain": [ "['O',\n", " 'I-DOC',\n", " 'B-EFFDATE',\n", " 'B-ALIAS',\n", " 'I-ALIAS',\n", " 'B-PARTY',\n", " 'I-EFFDATE',\n", " 'I-PARTY',\n", " 'B-DOC']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# With this code, you can see which labels your NER model has.\n", "\n", "ner_model.getClasses()" ] }, { "cell_type": "code", "execution_count": null, "id": "5954047c-ec79-47ec-98fa-44c74b492140", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "5954047c-ec79-47ec-98fa-44c74b492140", "outputId": "57b759a9-4cd3-4dad-a9af-0ec00d63adb6" }, "outputs": [ { "data": { "text/plain": [ "{Param(parent='MedicalNerModel_93f728ff96e5', name='inferenceBatchSize', doc='number of sentences to process in a single batch during inference'): 1,\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='labelCasing', doc='Setting all labels of the NER models upper/lower case. values upper|lower'): '',\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='includeConfidence', doc='whether to include confidence scores in annotation metadata'): True,\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='includeAllConfidenceScores', doc='whether to include all confidence scores in annotation metadata or just the score of the predicted tag'): False,\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='batchSize', doc='Size of every batch'): 256,\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='classes', doc='get the tags used to trained this MedicalNerModel'): ['O',\n", " 'I-DOC',\n", " 'B-EFFDATE',\n", " 'B-ALIAS',\n", " 'I-ALIAS',\n", " 'B-PARTY',\n", " 'I-EFFDATE',\n", " 'I-PARTY',\n", " 'B-DOC'],\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='inputCols', doc='previous annotations columns, if renamed'): ['sentence',\n", " 'token',\n", " 'embeddings'],\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='outputCol', doc='output annotation column. can be left default.'): 'ner',\n", " Param(parent='MedicalNerModel_93f728ff96e5', name='storageRef', doc='unique reference name for identification'): 'roberta_embeddings_legal_roberta_base_en'}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ner_model.extractParamMap()\n", "\n", "# With extractParamMap() function, you can see the parameters of any annotators you are using." 
] }, { "cell_type": "markdown", "id": "9e7d801c-fcc0-458c-9835-b6cbb0149f38", "metadata": { "id": "9e7d801c-fcc0-458c-9835-b6cbb0149f38" }, "source": [ "####✔️ **Sample Text**" ] }, { "cell_type": "code", "execution_count": null, "id": "00d74636-3490-4a24-9dc2-4f3f023c8909", "metadata": { "id": "00d74636-3490-4a24-9dc2-4f3f023c8909" }, "outputs": [], "source": [ "text = \"\"\"EXCLUSIVE DISTRIBUTOR AGREEMENT (\" Agreement \") dated as April 15, 1994 by and between IMRS OPERATIONS INC., a Delaware corporation with its principal place of business at 777 Long Ridge Road, Stamford, Connecticut 06902, U.S.A. (hereinafter referred to as \" Developer \") and Delteq Pte Ltd, a Singapore company (and a subsidiary of Wuthelam Industries (S) Pte LTD ) with its principal place of business at 215 Henderson Road , #101-03 Henderson Industrial Park , Singapore 0315 ( hereinafter referred to as \" Distributor \").\"\"\"\n", "\n", "df = spark.createDataFrame([[text]]).toDF(\"text\")\n", "\n", "result = model.transform(df)" ] }, { "cell_type": "markdown", "id": "4c7c211c-448e-494f-9f83-3274b9ca0aba", "metadata": { "id": "4c7c211c-448e-494f-9f83-3274b9ca0aba" }, "source": [ "####🖨️ **Getting Result**" ] }, { "cell_type": "code", "execution_count": null, "id": "ec9a99c6-4d22-4837-aed9-425b8f9efed6", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ec9a99c6-4d22-4837-aed9-425b8f9efed6", "outputId": "25f414a9-e5fb-45d9-c968-d44d5c1f37aa", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----------+---------+----------+\n", "| token|ner_label|confidence|\n", "+-----------+---------+----------+\n", "| EXCLUSIVE| B-DOC| 0.885|\n", "|DISTRIBUTOR| I-DOC| 0.7397|\n", "| AGREEMENT| I-DOC| 0.9926|\n", "| (\"| O| 0.9998|\n", "| Agreement| O| 0.9964|\n", "| \")| O| 1.0|\n", "| dated| O| 1.0|\n", "| as| O| 0.9985|\n", "| April|B-EFFDATE| 0.9845|\n", "| 15|I-EFFDATE| 0.951|\n", "| ,|I-EFFDATE| 0.9504|\n", "| 1994|I-EFFDATE| 0.8741|\n", "| by| O| 1.0|\n", "| and| O| 1.0|\n", "| between| O| 1.0|\n", "| IMRS| B-PARTY| 0.9898|\n", "| OPERATIONS| I-PARTY| 0.9987|\n", "| INC| I-PARTY| 0.9995|\n", "| .| O| 0.9907|\n", "| ,| O| 0.9983|\n", "| a| O| 1.0|\n", "| Delaware| O| 0.9997|\n", "|corporation| O| 0.9999|\n", "| with| O| 1.0|\n", "| its| O| 1.0|\n", "| principal| O| 1.0|\n", "| place| O| 1.0|\n", "| of| O| 1.0|\n", "| business| O| 1.0|\n", "| at| O| 1.0|\n", "| 777| O| 1.0|\n", "| Long| O| 0.9999|\n", "| Ridge| O| 0.9999|\n", "| Road| O| 1.0|\n", "| ,| O| 1.0|\n", "| Stamford| O| 0.9997|\n", "| ,| O| 1.0|\n", "|Connecticut| O| 0.9998|\n", "| 06902| O| 0.9997|\n", "| ,| O| 0.9998|\n", "| U.S.A| O| 0.9919|\n", "| .| O| 0.9991|\n", "| (| O| 0.9999|\n", "|hereinafter| O| 1.0|\n", "| referred| O| 1.0|\n", "| to| O| 0.9995|\n", "| as| O| 0.9994|\n", "| \"| O| 0.9959|\n", "| Developer| B-ALIAS| 0.9741|\n", "| \")| O| 0.9972|\n", "| and| O| 0.9978|\n", "| Delteq| B-PARTY| 0.9257|\n", "| Pte| I-PARTY| 0.9525|\n", "| Ltd| I-PARTY| 0.9735|\n", "| ,| O| 0.983|\n", "| a| O| 1.0|\n", "| Singapore| O| 0.9984|\n", "| company| O| 0.9977|\n", "| (| O| 1.0|\n", "| and| O| 1.0|\n", "| a| O| 1.0|\n", "| subsidiary| O| 1.0|\n", "| of| O| 0.9999|\n", "| Wuthelam| O| 0.9009|\n", "| Industries| O| 0.9494|\n", "| (| O| 0.9384|\n", "| S| O| 0.9564|\n", "| )| O| 0.9981|\n", "| Pte| O| 0.9911|\n", "| LTD| O| 0.9893|\n", "| )| O| 1.0|\n", "| with| O| 1.0|\n", "| its| O| 1.0|\n", "| principal| O| 1.0|\n", "| place| O| 1.0|\n", "| of| O| 1.0|\n", "| business| O| 1.0|\n", "| at| O| 1.0|\n", "| 215| 
O| 1.0|\n", "| Henderson| O| 1.0|\n", "| Road| O| 1.0|\n", "| ,| O| 1.0|\n", "| #101-03| O| 1.0|\n", "| Henderson| O| 0.9997|\n", "| Industrial| O| 0.9997|\n", "| Park| O| 0.9998|\n", "| ,| O| 1.0|\n", "| Singapore| O| 0.9999|\n", "| 0315| O| 0.9998|\n", "| (| O| 1.0|\n", "|hereinafter| O| 1.0|\n", "| referred| O| 1.0|\n", "| to| O| 0.9999|\n", "| as| O| 0.9999|\n", "| \"| O| 0.999|\n", "|Distributor| B-ALIAS| 0.9814|\n", "| \").| O| 0.9926|\n", "+-----------+---------+----------+\n", "\n" ] } ], "source": [ "from pyspark.sql import functions as F\n", "\n", "result.select(F.explode(F.arrays_zip(result.token.result, \n", " result.ner.result, \n", " result.ner.metadata)).alias(\"cols\"))\\\n", " .select(F.expr(\"cols['0']\").alias(\"token\"),\n", " F.expr(\"cols['1']\").alias(\"ner_label\"),\n", " F.expr(\"cols['2']['confidence']\").alias(\"confidence\")).show(200, truncate=100)" ] }, { "cell_type": "code", "execution_count": null, "id": "865dce29-ece0-45f6-8f5b-9028292523f0", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "865dce29-ece0-45f6-8f5b-9028292523f0", "outputId": "4572f71f-31af-4ee6-e693-563d12a91dad" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------------------------------+---------+----------+\n", "|chunk |ner_label|confidence|\n", "+-------------------------------+---------+----------+\n", "|EXCLUSIVE DISTRIBUTOR AGREEMENT|DOC |0.87243336|\n", "|April 15, 1994 |EFFDATE |0.94 |\n", "|IMRS OPERATIONS INC |PARTY |0.996 |\n", "|Developer |ALIAS |0.9741 |\n", "|Delteq Pte Ltd |PARTY |0.9505667 |\n", "|Distributor |ALIAS |0.9814 |\n", "+-------------------------------+---------+----------+\n", "\n" ] } ], "source": [ "result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias(\"cols\")) \\\n", " .select(F.expr(\"cols['0']\").alias(\"chunk\"),\n", " F.expr(\"cols['1']['entity']\").alias(\"ner_label\"),\n", " F.expr(\"cols['1']['confidence']\").alias(\"confidence\")).show(truncate=False)" ] }, { "cell_type": "markdown", "id": "b47e34e0-3633-4202-afdf-63a0f2475520", "metadata": { "id": "b47e34e0-3633-4202-afdf-63a0f2475520" }, "source": [ "####🖨️ **Getting Result with LightPipeline**\n", "\n", "LightPipelines are Spark NLP specific pipelines, equivalent to Spark ML Pipelines, but meant to deal with smaller amounts of data. They’re useful when working with small datasets, debugging results, or running either training or prediction from an API that serves one-off requests.\n", "\n", "Spark NLP LightPipelines are Spark ML pipelines converted into a single-machine but multi-threaded task, becoming more than 10x faster for smaller amounts of data (small is relative, but 50k sentences is roughly a good maximum). To use them, we simply plug in a fitted pipeline and then annotate plain text. We don't even need to convert the input text to a DataFrame in order to feed it into a pipeline that accepts DataFrame input in the first place. This feature is quite useful when it comes to getting a prediction for a few lines of text from a trained ML model.\n", "\n", "**It is nearly 10x faster than using a Spark ML Pipeline.**\n", "\n", "For more details:\n", "[https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1](https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1)" ] }
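, { "cell_type": "markdown", "id": "lightpipeline-annotate-note", "metadata": {}, "source": [ "As a quick illustration of the point above, `annotate()` accepts a plain Python string and returns a dictionary of result lists keyed by output column — no DataFrame involved. A minimal sketch, reusing the fitted `model` and the sample `text` from above:" ] }, { "cell_type": "code", "execution_count": null, "id": "lightpipeline-annotate-sketch", "metadata": {}, "outputs": [], "source": [ "# annotate() returns string results only (no metadata); use fullAnnotate()\n", "# as in the next cell when you also need begin/end offsets and confidences\n", "light_result_dict = nlp.LightPipeline(model).annotate(text)\n", "print(light_result_dict['ner_chunk'])" ] }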
, { "cell_type": "code", "execution_count": null, "id": "f22dd0c2-c63d-43c8-bc96-2f7cead3553b", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 238 }, "id": "f22dd0c2-c63d-43c8-bc96-2f7cead3553b", "outputId": "4a475990-355c-4e6e-fc9b-f6b89d9d7380" }, "outputs": [ { "data": { "text/plain": [ " chunks begin end sentence_id entities\n", "0 EXCLUSIVE DISTRIBUTOR AGREEMENT 0 30 0 DOC\n", "1 April 15, 1994 57 70 0 EFFDATE\n", "2 IMRS OPERATIONS INC 87 105 0 PARTY\n", "3 Developer 259 267 1 ALIAS\n", "4 Delteq Pte Ltd 276 289 1 PARTY\n", "5 Distributor 510 520 1 ALIAS" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "light_model = nlp.LightPipeline(model)\n", "\n", "light_result = light_model.fullAnnotate(text)\n", "\n", "chunks = []\n", "entities = []\n", "sentence = []\n", "begin = []\n", "end = []\n", "\n", "for n in light_result[0]['ner_chunk']:\n", " begin.append(n.begin)\n", " end.append(n.end)\n", " chunks.append(n.result)\n", " entities.append(n.metadata['entity'])\n", " sentence.append(n.metadata['sentence'])\n", "\n", "df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, \n", " 'sentence_id':sentence, 'entities':entities})\n", "\n", "df.head(20)" ] }, { "cell_type": "markdown", "id": "0e91726a-2fd2-4432-a3ff-fe5238b00e9d", "metadata": { "id": "0e91726a-2fd2-4432-a3ff-fe5238b00e9d" }, "source": [ "####📌 NER Visualizer\n", "\n", "To save the visualization result as HTML, provide the `save_path` parameter in the display function, as sketched after the cell below." ] }, { "cell_type": "code", "execution_count": null, "id": "1f9e05ec-1724-4d53-b4e2-68e454c4e3bb", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 159 }, "id": "1f9e05ec-1724-4d53-b4e2-68e454c4e3bb", "outputId": "be1d8a6b-f93d-4322-fdc7-8bfa38635b49" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " EXCLUSIVE DISTRIBUTOR AGREEMENT DOC (\" Agreement \") dated as April 15, 1994 EFFDATE by and between IMRS OPERATIONS INC PARTY., a Delaware corporation with its principal place of business at 777 Long Ridge Road, Stamford, Connecticut 06902, U.S.A. (hereinafter referred to as \" Developer ALIAS \") and Delteq Pte Ltd PARTY, a Singapore company (and a subsidiary of Wuthelam Industries (S) Pte LTD ) with its principal place of business at 215 Henderson Road , #101-03 Henderson Industrial Park , Singapore 0315 ( hereinafter referred to as \" Distributor ALIAS \")." ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# from sparknlp_display import NerVisualizer\n", "\n", "visualiser = nlp.viz.NerVisualizer()\n", "\n", "visualiser.display(light_result[0], label_col='ner_chunk', document_col='document')" ] }
], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# from sparknlp_display import NerVisualizer\n", "\n", "visualiser = nlp.viz.NerVisualizer()\n", "\n", "visualiser.display(light_result[0], label_col='ner_chunk', document_col='document')" ] }, { "cell_type": "markdown", "id": "95645147-7f43-4fc1-b668-3bb2317bb74f", "metadata": { "id": "95645147-7f43-4fc1-b668-3bb2317bb74f" }, "source": [ "##🔎 Create Generic Pipeline for NerDL Models" ] }, { "cell_type": "code", "execution_count": null, "id": "da501a7a-aabd-477d-b19b-1e18b4ee7042", "metadata": { "id": "da501a7a-aabd-477d-b19b-1e18b4ee7042" }, "outputs": [], "source": [ "def base_pipeline():\n", " \n", " documentAssembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", "\n", " textSplitter = legal.TextSplitter()\\\n", " .setInputCols([\"document\"])\\\n", " .setOutputCol(\"sentence\")\n", "\n", " tokenizer = nlp.Tokenizer()\\\n", " .setInputCols([\"sentence\"])\\\n", " .setOutputCol(\"token\")\n", " \n", " pipeline = nlp.Pipeline(stages=[\n", " documentAssembler,\n", " textSplitter,\n", " tokenizer])\n", " \n", " return pipeline" ] }, { "cell_type": "code", "execution_count": null, "id": "6b050243-1a9d-4b9e-a7a8-3fc084e69d08", "metadata": { "id": "6b050243-1a9d-4b9e-a7a8-3fc084e69d08" }, "outputs": [], "source": [ "def generic_ner_pipeline(model_name):\n", " \n", " embeddings = nlp.RoBertaEmbeddings.pretrained(\"roberta_embeddings_legal_roberta_base\", \"en\") \\\n", " .setInputCols(\"sentence\", \"token\") \\\n", " .setOutputCol(\"embeddings\")\\\n", "\n", " ner_model = legal.NerModel.pretrained(model_name, \"en\", \"legal/models\")\\\n", " .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n", " .setOutputCol(\"ner\")\n", "\n", " ner_converter = nlp.NerConverter()\\\n", " .setInputCols([\"sentence\",\"token\",\"ner\"])\\\n", " .setOutputCol(\"ner_chunk\")\n", "\n", " nlpPipeline = nlp.Pipeline(stages=[\n", " base_pipeline(),\n", " embeddings,\n", " ner_model,\n", " ner_converter])\n", "\n", " empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", " model = nlpPipeline.fit(empty_data)\n", " \n", " return model" ] }, { "cell_type": "markdown", "id": "479103f2-1256-4b78-971a-55d81140d030", "metadata": { "id": "479103f2-1256-4b78-971a-55d81140d030" }, "source": [ "##📌 Create Generic Result Function" ] }, { "cell_type": "code", "execution_count": null, "id": "19fac60f-f99a-4bf6-8668-0d40aa50a24d", "metadata": { "id": "19fac60f-f99a-4bf6-8668-0d40aa50a24d" }, "outputs": [], "source": [ "def get_result(result):\n", " result.select(F.explode(F.arrays_zip(result.ner_chunk.result, \n", " result.ner_chunk.metadata)).alias(\"cols\")) \\\n", " .select(F.expr(\"cols['0']\").alias(\"chunk\"),\n", " F.expr(\"cols['1']['entity']\").alias(\"ner_label\")).show(50, truncate=False)" ] }, { "cell_type": "markdown", "id": "817a44c1-52a3-40b3-9bd2-bc7b67c4c7fe", "metadata": { "id": "817a44c1-52a3-40b3-9bd2-bc7b67c4c7fe" }, "source": [ "###✔️ Legal Cuad_NER_Header Model\n", "\n", "This model uses Name Entity Recognition to detect **HEADER** and **SUBHEADER** with aims to detect the different sections of a legal document." 
] }, { "cell_type": "code", "execution_count": null, "id": "3fa0104c-bdb7-4e4a-b3bb-f7e21f912964", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3fa0104c-bdb7-4e4a-b3bb-f7e21f912964", "jupyter": { "outputs_hidden": true }, "outputId": "8a3755a3-d082-418e-f842-3ba40a35bfc5", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "roberta_embeddings_legal_roberta_base download started this may take some time.\n", "Approximate size to download 447.2 MB\n", "[OK!]\n", "legner_headers download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "text = \"\"\"5. GRANT OF PATENT LICENSE\n", "5.1 Arizona Patent Grant. Subject to the terms and conditions of this Agreement, Arizona hereby grants to the Company a perpetual, non-exclusive, royalty-free license in, to and under the Arizona Licensed Patents for use in the Company Field throughout the world.\"\"\"\n", "\n", "model_name = \"legner_headers\"\n", "df = spark.createDataFrame([[text]]).toDF(\"text\")\n", "\n", "result = generic_ner_pipeline(model_name).transform(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "0c143057-ddfc-4823-947f-9e51506e50ce", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0c143057-ddfc-4823-947f-9e51506e50ce", "outputId": "72a7ecb8-9128-48fd-ad0f-6d1b16d80416" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------------+---------+\n", "|chunk |ner_label|\n", "+--------------------------+---------+\n", "|5. GRANT OF PATENT LICENSE|HEADER |\n", "|5.1 Arizona Patent Grant |SUBHEADER|\n", "+--------------------------+---------+\n", "\n" ] } ], "source": [ "get_result(result)" ] }, { "cell_type": "markdown", "id": "aab56faa-0646-4c84-ac4e-e9710c0ba891", "metadata": { "id": "aab56faa-0646-4c84-ac4e-e9710c0ba891" }, "source": [ "###✔️ Legal Cuad_NER_Obligations Model\n", "\n", "📚Entities:\n", " - OBLIGATION_SUBJECT\n", " - OBLIGATION_ACTION\n", " - OBLIGATION\n", " - OBLIGATION_INDIRECT_OBJECT" ] }, { "cell_type": "code", "execution_count": null, "id": "Bzup3rC83o-F", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Bzup3rC83o-F", "outputId": "acbe94fd-cad1-47b1-adc1-7b623481cf2e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "legner_obligations download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "tokenClassifier = legal.BertForTokenClassification.pretrained(\"legner_obligations\", \"en\", \"legal/models\")\\\n", " .setInputCols(\"token\", \"sentence\")\\\n", " .setOutputCol(\"ner\")\\\n", " .setCaseSensitive(True)\n", "\n", "ner_converter = nlp.NerConverter()\\\n", " .setInputCols([\"sentence\",\"token\",\"ner\"])\\\n", " .setOutputCol(\"ner_chunk\")\n", "\n", "pipeline = nlp.Pipeline(stages=[\n", " base_pipeline(), \n", " tokenClassifier,\n", " ner_converter])\n", "\n", "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = pipeline.fit(empty_data)" ] }, { "cell_type": "code", "execution_count": null, "id": "6e48a792-b252-42bd-8d0f-0e543240b289", "metadata": { "id": "6e48a792-b252-42bd-8d0f-0e543240b289", "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [], "source": [ "# Sometimes models work better with lowercase, depending on the vocabulary of the uppercase items\n", "# Sometimes only uncased language models are present.\n", "# This one is mixed but works better with lowercase\n", "text = \"\"\"PPD may engage VS to perform imaging services\"\"\".lower()\n", "\n", 
"df = spark.createDataFrame([[text]]).toDF(\"text\")\n", "\n", "result = model.transform(df)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "56f94eaf-e394-4052-b33b-518ae5876ca1", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "56f94eaf-e394-4052-b33b-518ae5876ca1", "outputId": "cfa33c03-93d1-4299-d0dc-2e8b0ca5c941" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------------+------------------+\n", "|chunk |ner_label |\n", "+------------------+------------------+\n", "|ppd |OBLIGATION_SUBJECT|\n", "|may engage |OBLIGATION_ACTION |\n", "|vs |OBLIGATION |\n", "|to perform imaging|OBLIGATION |\n", "+------------------+------------------+\n", "\n" ] } ], "source": [ "get_result(result)" ] }, { "cell_type": "markdown", "id": "bd3aab70-a7e4-477c-9415-9ed845afb9c1", "metadata": { "id": "bd3aab70-a7e4-477c-9415-9ed845afb9c1" }, "source": [ "###✔️ Legal NER_Law_Money Spanish Model with RoBertaForTokenClassification\n", "\n", "📚Enities\n", " - LAW\n", " - MONEY" ] }, { "cell_type": "code", "execution_count": null, "id": "e350660c-4e30-42d9-8086-7f835036fa19", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "e350660c-4e30-42d9-8086-7f835036fa19", "outputId": "54c545d1-cfc6-47dd-dd60-b525de5bfd5c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "legner_law_money download started this may take some time.\n", "Approximate size to download 395.1 MB\n", "[OK!]\n" ] } ], "source": [ "tokenClassifier = nlp.RoBertaForTokenClassification.pretrained(\"legner_law_money\", \"es\", \"legal/models\") \\\n", " .setInputCols([\"sentence\", \"token\"])\\\n", " .setOutputCol(\"ner\")\n", "ner_converter = nlp.NerConverter()\\\n", " .setInputCols([\"sentence\",\"token\",\"ner\"])\\\n", " .setOutputCol(\"ner_chunk\")\n", "\n", "pipeline = nlp.Pipeline(stages=[\n", " base_pipeline(), \n", " tokenClassifier,\n", " ner_converter])\n", "\n", "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = pipeline.fit(empty_data)" ] }, { "cell_type": "code", "execution_count": null, "id": "ad62ed1d-7972-45dc-adf0-46318bfcdc4f", "metadata": { "id": "ad62ed1d-7972-45dc-adf0-46318bfcdc4f" }, "outputs": [], "source": [ "text = \"\"\"La recaudación del ministerio del interior fue de 20,000,000 euros así constatado por el artículo 24 de la Constitución Española.\"\"\"\n", "\n", "df = spark.createDataFrame([[text]]).toDF(\"text\")\n", "\n", "result = model.transform(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "0cd698ee-0302-427b-bdf8-a43d878468cc", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0cd698ee-0302-427b-bdf8-a43d878468cc", "outputId": "cb5fbea8-2455-4d2b-9a7c-0a6ba8431beb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------------------------------------+---------+\n", "|chunk |ner_label|\n", "+---------------------------------------+---------+\n", "|20,000,000 euros |MONEY |\n", "|artículo 24 de la Constitución Española|LAW |\n", "+---------------------------------------+---------+\n", "\n" ] } ], "source": [ "get_result(result)" ] }, { "cell_type": "markdown", "id": "bf944e21-f7e3-4282-b3c5-2bbeb04fe8a8", "metadata": { "id": "bf944e21-f7e3-4282-b3c5-2bbeb04fe8a8" }, "source": [ "#🔎 Zero-shot Legal Example" ] }, { "cell_type": "markdown", "id": "dd10f5c5", "metadata": { "id": "dd10f5c5" }, "source": [ "📚`Zero-shot` is a new inference paradigm which allows us to use a model for prediction without any previous 
, { "cell_type": "markdown", "id": "dd10f5c5", "metadata": { "id": "dd10f5c5" }, "source": [ "📚`Zero-shot` is an inference paradigm which allows us to use a model for prediction without any task-specific training step.\n", "\n", "To do that, several examples (_hypotheses_) are provided and sent to the language model, which uses `NLI (Natural Language Inference)` to check whether any information found in the text matches the examples (confirms the hypotheses).\n", "\n", "NLI usually works by trying to _confirm or reject a hypothesis_. The _hypotheses_ are the `prompts` or examples we provide. If any piece of information confirms a constructed hypothesis (answers one of the examples we have given), the hypothesis is confirmed and the zero-shot entity is triggered.\n", "\n", "Let's see it in action." ] }, { "cell_type": "code", "execution_count": null, "id": "6610e9d9-0cd6-45ad-9fe4-e0d9ac3314e3", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6610e9d9-0cd6-45ad-9fe4-e0d9ac3314e3", "outputId": "c1a25f57-56f1-433a-cb9c-e037007bb117" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "legner_roberta_zeroshot download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "documentAssembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", "\n", "textSplitter = legal.TextSplitter()\\\n", " .setInputCols([\"document\"])\\\n", " .setOutputCol(\"sentence\")\n", "\n", "sparktokenizer = nlp.Tokenizer()\\\n", " .setInputCols(\"sentence\")\\\n", " .setOutputCol(\"token\")\n", "\n", "zero_shot_ner = legal.ZeroShotNerModel.pretrained(\"legner_roberta_zeroshot\", \"en\", \"legal/models\")\\\n", " .setInputCols([\"sentence\", \"token\"])\\\n", " .setOutputCol(\"zero_shot_ner\")\\\n", " .setEntityDefinitions(\n", " {\n", " \"DATE\": ['When was the company acquisition?', 'When was the company purchase agreement?', \"When was the agreement?\"],\n", " \"ORG\": [\"Which company?\"],\n", " \"STATE\": [\"Which state?\"],\n", " \"AGREEMENT\": [\"What kind of agreement?\"],\n", " \"LICENSE\": [\"What kind of license?\"],\n", " \"LICENSE_RECIPIENT\": [\"To whom the license is granted?\"]\n", " })\n", "\n", "nerconverter = nlp.NerConverter()\\\n", " .setInputCols([\"sentence\", \"token\", \"zero_shot_ner\"])\\\n", " .setOutputCol(\"ner_chunk\")\n", "\n", "pipeline = nlp.Pipeline(stages=[\n", " documentAssembler,\n", " textSplitter,\n", " sparktokenizer,\n", " zero_shot_ner,\n", " nerconverter\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "9f49d722-71aa-488e-af32-6de899aa5b2a", "metadata": { "id": "9f49d722-71aa-488e-af32-6de899aa5b2a" }, "outputs": [], "source": [ "from pyspark.sql.types import StructType,StructField, StringType\n", "\n", "sample_text = [\"\"\"In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.\"\"\",\n", " \"\"\"In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.\"\"\",\n", " \"\"\"This INTELLECTUAL PROPERTY AGREEMENT, dated as of December 31, 2018 (the 'Effective Date') is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ('Seller') and AFI Licensing LLC, a Delaware company (the 'Licensee')\"\"\",\n", " \"\"\"The Company hereby grants to Seller a perpetual, non- exclusive, royalty-free license\"\"\"]\n", "\n", "p_model = pipeline.fit(spark.createDataFrame([[\"\"]]).toDF(\"text\"))\n", "\n", "res = p_model.transform(spark.createDataFrame(sample_text, StringType()).toDF(\"text\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "233fd7d6-c84d-4240-9b57-9912f6256b71", "metadata": { "colab": {
"base_uri": "https://localhost:8080/" }, "id": "233fd7d6-c84d-4240-9b57-9912f6256b71", "outputId": "c402d4e9-07e3-4b1c-e90d-8eb2016df264" }, "outputs": [ { "data": { "text/plain": [ "DataFrame[chunk: string, ner_label: string]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# from pyspark.sql import functions as F\n", "\n", "res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias(\"cols\")) \\\n", " .select(F.expr(\"cols['0']\").alias(\"chunk\"),\n", " F.expr(\"cols['3']['entity']\").alias(\"ner_label\"))\\\n", " .filter(\"ner_label!='O'\")" ] }, { "cell_type": "code", "execution_count": null, "id": "06a23e24-2872-4bb2-8dd3-32f62bdb9423", "metadata": { "id": "06a23e24-2872-4bb2-8dd3-32f62bdb9423" }, "outputs": [], "source": [ "lp = nlp.LightPipeline(p_model)\n", "lp_res_1 = lp.fullAnnotate(sample_text[2])\n", "lp_res_2 = lp.fullAnnotate(sample_text[3])" ] }, { "cell_type": "code", "execution_count": null, "id": "99fc9030-2dc5-4621-b25a-ea2467aa0ccb", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 88 }, "id": "99fc9030-2dc5-4621-b25a-ea2467aa0ccb", "outputId": "24d5c6e1-03ca-4683-86bc-2fb1935f4930" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " This INTELLECTUAL PROPERTY AGREEMENT AGREEMENT, dated as of December 31, 2018 DATE (the 'Effective Date') is entered into by and between Armstrong Flooring LICENSE_RECIPIENT, Inc., a Delaware STATE corporation ('Seller') and AFI Licensing LLC, a Delaware company LICENSE_RECIPIENT (the 'Licensee')" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# from sparknlp_display import NerVisualizer\n", "\n", "visualiser = nlp.viz.NerVisualizer()\n", "\n", "visualiser.display(lp_res_1[0], label_col='ner_chunk', document_col='document')" ] }, { "cell_type": "code", "execution_count": null, "id": "aed220b8-8aba-42bb-8e5a-b2ee3438834e", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 88 }, "id": "aed220b8-8aba-42bb-8e5a-b2ee3438834e", "outputId": "9f6b3be3-07f4-4828-c569-93214a1eba64" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " The Company hereby grants to Seller LICENSE_RECIPIENT a perpetual LICENSE, non- exclusive LICENSE, royalty-free LICENSE license" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "visualiser.display(lp_res_2[0], label_col='ner_chunk', document_col='document')" ] }, { "cell_type": "code", "execution_count": null, "id": "CijixuQO90aU", "metadata": { "id": "CijixuQO90aU" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "machine_shape": "hm", "provenance": [], "toc_visible": true }, "gpuClass": "standard", "kernelspec": { "display_name": "tf-gpu", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]" }, "vscode": { "interpreter": { "hash": "3f47d918ae832c68584484921185f5c85a1760864bf927a683dc6fb56366cc77" } } }, "nbformat": 4, "nbformat_minor": 5 }