{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "I08sFJYCxR0Z"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FwJ-P56kq6FU"
},
"source": [
"[Open in Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/05.3.ZeroShot_Legal_NER.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Niy3mZAjoayg"
},
"source": [
"# Zero-Shot Named Entity Recognition in Spark NLP\n",
"\n",
"This notebook shows an example of a Zero-Shot NER model (`legner_roberta_zeroshot`), the first of its kind, which can detect named entities without any annotated dataset for training.\n",
"\n",
"The `ZeroShotNerModel` annotator also allows extracting entities by crafting appropriate prompts to query **any RoBERTa Question Answering model**.\n",
"\n",
"\n",
"You can check the model card here: [Models Hub](https://nlp.johnsnowlabs.com/2022/08/29/zero_shot_ner_roberta_en.html)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"id": "gk3kZHmNj51v"
},
"source": [
"# Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_914itZsj51v",
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"! pip install -q johnsnowlabs"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YPsbAnNoPt0Z"
},
"source": [
"## Automatic Installation\n",
"Using my.johnsnowlabs.com SSO"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fY0lcShkj51w",
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"from johnsnowlabs import nlp, legal\n",
"\n",
"# nlp.install(force_browser=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hsJvn_WWM2GL"
},
"source": [
"## Manual downloading\n",
"If you are not registered at my.johnsnowlabs.com, received your license via e-mail, or are using Safari, you may need to upload the license manually.\n",
"\n",
"- Go to my.johnsnowlabs.com\n",
"- Download your license\n",
"- Upload it using the following command"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "i57QV3-_P2sQ"
},
"outputs": [],
"source": [
"from google.colab import files\n",
"print('Please Upload your John Snow Labs License using the button below')\n",
"license_keys = files.upload()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xGgNdFzZP_hQ"
},
"source": [
"- Install it"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OfmmPqknP4rR"
},
"outputs": [],
"source": [
"nlp.install()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DCl5ErZkNNLk"
},
"source": [
"# Starting"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "wRXTnNl3j51w",
"outputId": "af9ba04b-cf01-4edd-915a-b69fe20c70b2"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7163 (2).json\n",
"Launched \u001b[92mcpu optimized\u001b[39m session with with: Spark-NLP==4.2.4, Spark-Healthcare==4.2.4, running on PySpark==3.1.2\n"
]
}
],
"source": [
"spark = nlp.start()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"id": "_Dk4cMb4ETjU"
},
"source": [
"#### Answering Questions on Legal Texts\n",
"One of the biggest recent advances in NLP is **Language Models** and their ability to answer questions expressed in natural language.\n",
"\n",
"**Question Answering (QA)** uses specific Language Models trained to carry out **Natural Language Inference (NLI)**.\n",
"\n",
"**NLI** works as follows:\n",
"- Given a text as a Premise (P);\n",
"- Given a hypothesis (H) as a question to be solved;\n",
" - Then, we ask the Language Model if H is `entailed`, `contradicted` or `not related` in P.\n",
"\n",
"To do that, several examples (hypotheses) are sent to the Language Model, which uses NLI to check whether any information found in the text matches the examples (confirms the hypotheses).\n",
"\n",
"NLI usually works by trying to confirm or reject a hypothesis. The hypotheses are the `prompts` or examples we provide. If any piece of information confirms a constructed hypothesis (answers the examples we have given), then the hypothesis is confirmed and the Zero-shot model is triggered.\n",
"\n",
"> *In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.\n",
"> ...\n",
"> The Company hereby grants to Seller a perpetual, non-exclusive, royalty-free license.\n",
"> ...\n",
"> On March 12, 2020, we closed a Loan and Security Agreement with Hitachi Capital American Corp (also known as \"Hitachi\")\n",
"> ...*"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"id": "HvBGy5GQETjV"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"id": "wGiqJji4ETja"
},
"source": [
"We have built our `Zero-shot` NER and Relation Extraction models on top of Language Models and Question Answering, applying NLI. Since it's a QA model, Zero-shot does not require any training data, just a context and a series of questions.\n",
"\n",
"\n",
"Let's see it in action."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"id": "lU0XIb4dETja"
},
"source": [
"## Zero-shot Learning: NER\n",
"\n",
"Named Entity Recognition (NER) is the NLP task of tagging chunks of information with specific labels.\n",
"\n",
"NER has historically been carried out using rule-based approaches, machine learning and, more recently, Deep Learning models, including transformers.\n",
"\n",
"If we ignore the traditional rule-based approach, which consisted of having Subject Matter Experts (SMEs) create rules using regular expressions, vocabularies, ontologies, etc., the common steps for the rest of the Machine Learning based NER approaches were:\n",
"\n",
"1. Collect and clean data;\n",
"2. Have SMEs annotate many documents;\n",
"3. Create features (only if using ML approaches, since Deep Learning does that feature extraction for you in most cases);\n",
"4. Train a model on a training dataset;\n",
"5. Evaluate on a test set.\n",
"\n",
"If the model is not accurate, go back to step 1.\n",
"\n",
"This process takes a long time, especially if the use case is complex and requires many examples for the model to learn.\n",
"\n",
"Thankfully, **Zero-shot** comes to help, since it does not require any training data, drastically speeding up the process. A *Zero-shot* model:\n",
"\n",
"- Can be used as a model on its own, with average accuracy;\n",
"- Can be used to pre-annotate documents and speed up annotation by SMEs.\n",
"\n",
"**Legal NLP** includes *Zero-shot* NER, which uses prompts in the form of questions and retrieves the answers to those questions as tagged chunks of information.\n",
"\n",
"*This is an example of Entity labels and some prompts.*\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"id": "NdaYKZCGETja"
},
"source": [
"```\n",
"- What is the type of agreement?\n",
"- What is the type of license?\n",
"- What are the companies in the agreement?\n",
"- What are the different companies also known as?\n",
"- Who is the recipient of a license?\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"id": "iyiX-V0AETjb"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"id": "6_SlCQTDETjd"
},
"source": [
"## How is this achieved?\n",
"- We check if the question/prompt (hypothesis) returns `entailment` for any token in the premise (context/text).\n",
"- If several tokens in a row entail the hypothesis, the tokens are merged and returned as one `ner_chunk`."
]
},
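{
"cell_type": "markdown",
"metadata": {},
"source": [
"The merging step described above can be sketched in plain Python. This is only an illustration of the idea, not the library's actual implementation: `merge_entailed_tokens` is a hypothetical helper, and the per-token labels are written by hand here instead of coming from a model.\n",
"\n",
"```python\n",
"def merge_entailed_tokens(tokens, labels):\n",
"    # Merge runs of consecutive tokens that share a non-'O' label into chunks.\n",
"    chunks = []\n",
"    current_words, current_label = [], None\n",
"    for token, label in zip(tokens, labels):\n",
"        if label != 'O' and label == current_label:\n",
"            # Same entity continues: extend the open chunk.\n",
"            current_words.append(token)\n",
"            continue\n",
"        if current_words:\n",
"            # The label changed: close the open chunk.\n",
"            chunks.append((' '.join(current_words), current_label))\n",
"        if label != 'O':\n",
"            current_words, current_label = [token], label\n",
"        else:\n",
"            current_words, current_label = [], None\n",
"    if current_words:\n",
"        chunks.append((' '.join(current_words), current_label))\n",
"    return chunks\n",
"\n",
"# Hand-written labels (an assumption for illustration only).\n",
"tokens = ['In', 'February', '2017', ',', 'the', 'Company', 'entered']\n",
"labels = ['O', 'DATE', 'DATE', 'O', 'O', 'O', 'O']\n",
"chunks = merge_entailed_tokens(tokens, labels)\n",
"print(chunks)\n",
"```\n",
"\n",
"Here `February` and `2017` both carry the `DATE` label, so they come back as a single chunk, which mirrors how adjacent tokens end up in one `ner_chunk`."
]
},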
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"id": "ofR6Df7VETjd"
},
"source": [
"# Let's start!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "eWa-7CKaETjd",
"outputId": "07e281df-d1be-49f3-cb62-12a977385518"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"legner_roberta_zeroshot download started this may take some time.\n",
"[OK!]\n"
]
}
],
"source": [
"documentAssembler = nlp.DocumentAssembler()\\\n",
" .setInputCol(\"text\")\\\n",
" .setOutputCol(\"document\")\n",
"\n",
"textSplitter = legal.TextSplitter()\\\n",
" .setInputCols([\"document\"])\\\n",
" .setOutputCol(\"sentence\")\n",
"\n",
"sparktokenizer = nlp.Tokenizer()\\\n",
" .setInputCols(\"sentence\")\\\n",
" .setOutputCol(\"token\")\n",
"\n",
"zero_shot_ner = legal.ZeroShotNerModel.pretrained(\"legner_roberta_zeroshot\", \"en\", \"legal/models\")\\\n",
" .setInputCols([\"sentence\", \"token\"])\\\n",
" .setOutputCol(\"zero_shot_ner\")\\\n",
" .setEntityDefinitions(\n",
" {\n",
" \"DATE\": ['When was the company acquisition?', 'When was the company purchase agreement?', \"When was the agreement?\"],\n",
" \"ORG\": [\"Which company?\"],\n",
" \"STATE\": [\"Which state?\"],\n",
" \"AGREEMENT\": [\"What kind of agreement?\"],\n",
" \"LICENSE\": [\"What kind of license?\"],\n",
" \"LICENSE_RECIPIENT\": [\"To whom the license is granted?\"]\n",
" })\n",
"\n",
"\n",
"nerconverter = nlp.NerConverter()\\\n",
" .setInputCols([\"sentence\", \"token\", \"zero_shot_ner\"])\\\n",
" .setOutputCol(\"ner_chunk\")\n",
"\n",
"pipeline = nlp.Pipeline(stages=[\n",
" documentAssembler,\n",
" textSplitter,\n",
" sparktokenizer,\n",
" zero_shot_ner,\n",
" nerconverter\n",
" ]\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cKhHKvFtPsLF"
},
"outputs": [],
"source": [
"from pyspark.sql.types import StringType\n",
"\n",
"sample_text = [\"\"\"In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.\"\"\",\n",
" \"\"\"In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.\"\"\",\n",
" \"\"\"This INTELLECTUAL PROPERTY AGREEMENT, dated as of December 31, 2018 (the 'Effective Date') is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ('Seller') and AFI Licensing LLC, a Delaware company (the 'Licensee')\"\"\",\n",
" \"\"\"The Company hereby grants to Seller a perpetual, non- exclusive, royalty-free license\"\"\"]\n",
"\n",
"p_model = pipeline.fit(spark.createDataFrame([[\"\"]]).toDF(\"text\"))\n",
"\n",
"res = p_model.transform(spark.createDataFrame(sample_text, StringType()).toDF(\"text\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "9_v09TvWQgbg",
"outputId": "23595107-18b7-4a26-e6a0-2a4d82517667"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------------------------------+-----------------+\n",
"|chunk |ner_label |\n",
"+-------------------------------------+-----------------+\n",
"|March 2012 |DATE |\n",
"|Vertro, Inc |ORG |\n",
"|February 2017 |DATE |\n",
"|asset purchase agreement |AGREEMENT |\n",
"|NetSeer |ORG |\n",
"|INTELLECTUAL PROPERTY |AGREEMENT |\n",
"|December 31, 2018 |DATE |\n",
"|Armstrong Flooring |LICENSE_RECIPIENT|\n",
"|Delaware |STATE |\n",
"|AFI Licensing LLC, a Delaware company|LICENSE_RECIPIENT|\n",
"|Seller |LICENSE_RECIPIENT|\n",
"|perpetual |LICENSE |\n",
"|non- exclusive |LICENSE |\n",
"|royalty-free |LICENSE |\n",
"+-------------------------------------+-----------------+\n",
"\n"
]
}
],
"source": [
"from pyspark.sql import functions as F\n",
"\n",
"res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias(\"cols\")) \\\n",
" .select(F.expr(\"cols['0']\").alias(\"chunk\"),\n",
" F.expr(\"cols['3']['entity']\").alias(\"ner_label\"))\\\n",
" .filter(\"ner_label!='O'\")\\\n",
" .show(truncate=False)"
]
},
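{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same chunk and label pairs can also be collected in plain Python, for example from the output of a `LightPipeline`. The sketch below assumes that each chunk annotation exposes `result`, `begin`, `end` and `metadata` attributes, as Spark NLP `Annotation` objects do; `chunks_to_rows` is a hypothetical helper, and the stand-in objects and their offsets are made up for illustration so that the code runs without a Spark session.\n",
"\n",
"```python\n",
"from types import SimpleNamespace\n",
"\n",
"def chunks_to_rows(annotations):\n",
"    # Turn chunk annotations into (text, label, begin, end) rows, skipping 'O'.\n",
"    rows = []\n",
"    for ann in annotations:\n",
"        label = ann.metadata.get('entity', 'O')\n",
"        if label != 'O':\n",
"            rows.append((ann.result, label, ann.begin, ann.end))\n",
"    return rows\n",
"\n",
"# Stand-in objects (assumption: they mimic the attributes of real annotations).\n",
"fake_chunks = [\n",
"    SimpleNamespace(result='March 2012', begin=3, end=12, metadata={'entity': 'DATE'}),\n",
"    SimpleNamespace(result='Vertro, Inc', begin=71, end=81, metadata={'entity': 'ORG'}),\n",
"]\n",
"rows = chunks_to_rows(fake_chunks)\n",
"print(rows)\n",
"```\n",
"\n",
"With a fitted pipeline wrapped in a `LightPipeline` named `lp`, the list to pass would be `lp.fullAnnotate(text)[0]['ner_chunk']`."
]
},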
{
"cell_type": "markdown",
"metadata": {
"id": "WIp_4iFTSjpx"
},
"source": [
"#### We have just seen how simple it is to obtain the output without having to deal with the hassle of model training.\n",
"\n",
"#### Let's now look at how to enhance the model's predictions in scenarios where there may be incorrectly identified labels or fewer labels overall."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yglxuB_pJKoQ"
},
"source": [
"**Let's look at an example where the predictions are incorrect and discuss how to make them better:**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 370
},
"id": "5c4zHTJvQhWL",
"outputId": "c8e43224-0308-410e-ed76-bfa713800716"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"****************************************************************************************** Text Number - 1\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" The maximum penalty for a first-time DUI LICENSE offense in California STATE is up to six months in jail and a fine of up to $1,000 , according to state law. DATE"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"****************************************************************************************** Text Number - 2\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" The company is required to file its annual report with the state of California STATE by March 31st of each year. DATE Failure to do so may result in fines of $1,000 . LICENSE_RECIPIENT"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"****************************************************************************************** Text Number - 3\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" Pursuant to the laws of the State of Delaware STATE, the company ORG XYZ, Inc. ORG was properly organized and is currently in good standing with its principal place of business located at 123 Main Street, Anytown, USA 12345."
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"****************************************************************************************** Text Number - 4\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" The corporation known as DEF Inc. ORG was incorporated under the laws of the State of Illinois STATE on April 1, 2018 DATE and has its principal place of business at 456 Main Street, Anytown, USA STATE 54321 ."
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"lp = nlp.LightPipeline(p_model)\n",
"sample_text = [\"The maximum penalty for a first-time DUI offense in California is up to six months in jail and a fine of up to $1,000 , according to state law.\", \"The company is required to file its annual report with the state of California by March 31st of each year. Failure to do so may result in fines of $1,000 .\", \"Pursuant to the laws of the State of Delaware, the company XYZ, Inc. was properly organized and is currently in good standing with its principal place of business located at 123 Main Street, Anytown, USA 12345.\",\"The corporation known as DEF Inc. was incorporated under the laws of the State of Illinois on April 1, 2018 and has its principal place of business at 456 Main Street, Anytown, USA 54321 .\"]\n",
"\n",
"# from sparknlp_display import NerVisualizer\n",
"for i in range(len(sample_text)):\n",
" print('***'*30,f'Text Number - {i+1}')\n",
" visualiser = nlp.viz.NerVisualizer()\n",
" lp_res_1 = lp.fullAnnotate(sample_text[i])\n",
" visualiser.display(lp_res_1[0], label_col='ner_chunk', document_col='document')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b2y7U_wj2tHS"
},
"source": [
"#### Here, it is clear that several of the entities identified by the model are incorrect.\n",
"\n",
"1. In the first text, the **DATE** label has been assigned incorrectly.\n",
"2. In the second text, the **LICENSE_RECIPIENT** label has been assigned where it does not apply.\n",
"3. In the fourth text, `USA` has been identified as `STATE`, but it should rather be identified as part of an `ADDRESS`."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "paIktnWLTd1p"
},
"source": [
"*Let's try to fix these by modifying and including some prompts:*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "9zPpLPr_RM_V",
"outputId": "b2e13c17-6eca-487e-93e0-53a4dce52f67"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"legner_roberta_zeroshot download started this may take some time.\n",
"[OK!]\n"
]
}
],
"source": [
"zero_shot_ner = legal.ZeroShotNerModel.pretrained(\"legner_roberta_zeroshot\", \"en\", \"legal/models\")\\\n",
" .setInputCols([\"sentence\", \"token\"])\\\n",
" .setOutputCol(\"zero_shot_ner\")\\\n",
" .setEntityDefinitions(\n",
" {\n",
" \"DATE\": ['When was the company acquisition?', 'When was the company purchase agreement?', \"When was the agreement?\"],\n",
" \"ORG\": [\"Which company?\"],\n",
" \"STATE\": [\"Which state?\"],\n",
" \"AGREEMENT\": [\"What kind of agreement?\"],\n",
" \"LICENSE\": [\"What kind of license?\"],\n",
" \"LICENSE_RECIPIENT\": [\"To whom the license is granted?\"],\n",
" \"ADDRESS\": ['What is the address of the company'],\n",
" \"FINE_RECEPIENT\": ['How much fine?']\n",
" })"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jI-VnBlHEmAc"
},
"source": [
"#### To improve the model's predictions, we can add new labels or rephrase the prompts so that the model can better identify the entities in the text.\n",
"\n",
"From the sentences, it is clear that there are other valuable entities that can improve the model's ability to predict: `FINE_RECEPIENT` and `ADDRESS`."
]
},
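{
"cell_type": "markdown",
"metadata": {},
"source": [
"When prompts are refined over several iterations, it can help to keep the base definitions and the additions separate and combine them before configuring the annotator. The sketch below is one possible way to do that in plain Python; `extend_definitions` is a hypothetical helper and is not part of the library.\n",
"\n",
"```python\n",
"def extend_definitions(base, extra):\n",
"    # Return a new label -> prompts dict without modifying the inputs.\n",
"    # Prompts for labels that already exist are appended without duplicates.\n",
"    merged = {label: list(prompts) for label, prompts in base.items()}\n",
"    for label, prompts in extra.items():\n",
"        target = merged.setdefault(label, [])\n",
"        for prompt in prompts:\n",
"            if prompt not in target:\n",
"                target.append(prompt)\n",
"    return merged\n",
"\n",
"base = {'ORG': ['Which company?'], 'STATE': ['Which state?']}\n",
"extra = {'ADDRESS': ['What is the address of the company'], 'STATE': ['Which state?']}\n",
"definitions = extend_definitions(base, extra)\n",
"print(definitions)\n",
"```\n",
"\n",
"The resulting dictionary has the same shape as the one passed to `setEntityDefinitions` in this notebook, so it could be used there directly."
]
},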
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 370
},
"id": "mkWVF3dntnEQ",
"outputId": "373168d7-a6f5-49e6-ff72-09f2920b8647"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"****************************************************************************************** Text Number - 1\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" The maximum penalty for a first-time DUI LICENSE offense in California STATE is up to six months in jail and a fine of up to $1,000 FINE_RECEPIENT , according to state law."
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"****************************************************************************************** Text Number - 2\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" The company is required to file its annual report with the state of California STATE by March 31st of each year. DATE Failure to do so may result in fines of $1,000 FINE_RECEPIENT ."
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"****************************************************************************************** Text Number - 3\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" Pursuant to the laws of the State of Delaware STATE, the company ORG XYZ, Inc. ORG was properly organized and is currently in good standing with its principal place of business located at 123 Main Street, Anytown, USA 12345. ADDRESS"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"****************************************************************************************** Text Number - 4\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" The corporation known as DEF Inc. ORG was incorporated under the laws of the State of Illinois STATE on April 1, 2018 DATE and has its principal place of business at 456 Main Street ADDRESS, Anytown, USA 54321 ADDRESS ."
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pipeline = nlp.Pipeline(stages=[\n",
" documentAssembler,\n",
" textSplitter,\n",
" sparktokenizer,\n",
" zero_shot_ner,\n",
" nerconverter\n",
" ]\n",
")\n",
"\n",
"p_model = pipeline.fit(spark.createDataFrame([[\"\"]]).toDF(\"text\"))\n",
"\n",
"lp = nlp.LightPipeline(p_model)\n",
"\n",
"sample_text = [\"The maximum penalty for a first-time DUI offense in California is up to six months in jail and a fine of up to $1,000 , according to state law.\", \"The company is required to file its annual report with the state of California by March 31st of each year. Failure to do so may result in fines of $1,000 .\", \"Pursuant to the laws of the State of Delaware, the company XYZ, Inc. was properly organized and is currently in good standing with its principal place of business located at 123 Main Street, Anytown, USA 12345.\",\"The corporation known as DEF Inc. was incorporated under the laws of the State of Illinois on April 1, 2018 and has its principal place of business at 456 Main Street, Anytown, USA 54321 .\"]\n",
"\n",
"# from sparknlp_display import NerVisualizer\n",
"for i in range(len(sample_text)):\n",
" print('***'*30,f'Text Number - {i+1}')\n",
" visualiser = nlp.viz.NerVisualizer()\n",
" lp_res_1 = lp.fullAnnotate(sample_text[i])\n",
" visualiser.display(lp_res_1[0], label_col='ner_chunk', document_col='document')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "37bb8ae4"
},
"source": [
"## Automatic Prompt Generation"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"id": "5l1HLTTPETjf"
},
"source": [
"If you are curious whether prompts can be generated automatically, the answer is *Yes!* Please check the **Automatic Question Generation** notebook in this workshop."
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"gpuClass": "standard",
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}