{ "cells": [ { "cell_type": "markdown", "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea", "metadata": { "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea" }, "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "21e9eafb", "metadata": { "id": "21e9eafb" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/08.2.NER_using_Question_Answering.ipynb)" ] }, { "cell_type": "markdown", "id": "9859b3bc-cec4-4189-88ed-37add5484623", "metadata": { "id": "9859b3bc-cec4-4189-88ed-37add5484623" }, "source": [ "# NER using Question Answering models\n", "This approach is useful for extracting long entities or spans that traditional NER models often fail to extract properly." ] }, { "cell_type": "markdown", "id": "gk3kZHmNj51v", "metadata": { "collapsed": false, "id": "gk3kZHmNj51v" }, "source": [ "# Installation" ] }, { "cell_type": "code", "execution_count": null, "id": "_914itZsj51v", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_914itZsj51v", "outputId": "4a0bb6f9-ba64-4e34-e15b-68d2058ff6ce", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "! 
pip install -q johnsnowlabs" ] }, { "cell_type": "markdown", "id": "YPsbAnNoPt0Z", "metadata": { "id": "YPsbAnNoPt0Z" }, "source": [ "## Automatic Installation\n", "Using my.johnsnowlabs.com SSO" ] }, { "cell_type": "code", "execution_count": null, "id": "fY0lcShkj51w", "metadata": { "id": "fY0lcShkj51w", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "from johnsnowlabs import nlp, legal\n", "\n", "# nlp.install(force_browser=True)" ] }, { "cell_type": "markdown", "id": "hsJvn_WWM2GL", "metadata": { "id": "hsJvn_WWM2GL" }, "source": [ "## Manual downloading\n", "If you are not registered at my.johnsnowlabs.com, received your license via email, or are using Safari, you may need to upload the license manually.\n", "\n", "- Go to my.johnsnowlabs.com\n", "- Download your license\n", "- Upload it using the following command" ] }, { "cell_type": "code", "execution_count": null, "id": "i57QV3-_P2sQ", "metadata": { "id": "i57QV3-_P2sQ" }, "outputs": [], "source": [ "from google.colab import files\n", "print('Please Upload your John Snow Labs License using the button below')\n", "license_keys = files.upload()" ] }, { "cell_type": "markdown", "id": "xGgNdFzZP_hQ", "metadata": { "id": "xGgNdFzZP_hQ" }, "source": [ "- Install it" ] }, { "cell_type": "code", "execution_count": null, "id": "OfmmPqknP4rR", "metadata": { "id": "OfmmPqknP4rR" }, "outputs": [], "source": [ "nlp.install()" ] }, { "cell_type": "markdown", "id": "DCl5ErZkNNLk", "metadata": { "id": "DCl5ErZkNNLk" }, "source": [ "# Starting" ] }, { "cell_type": "code", "execution_count": null, "id": "wRXTnNl3j51w", "metadata": { "id": "wRXTnNl3j51w" }, "outputs": [], "source": [ "spark = nlp.start()" ] }, { "cell_type": "markdown", "id": "37bb8ae4", "metadata": { "id": "37bb8ae4" }, "source": [ "# NER, Question Generation and Question Answering for Long-Span extraction" ] }, { "cell_type": "markdown", "id": 
"43420eee-1c29-4148-b1c8-fa7884eff9b3" }, "source": [ "Legal documents are known to be very long. Although you can divide the docuuments into paragraphs or sections, and those into sentences, the resulted sentences are still long.\n", "\n", "Let's take a look at this example:\n", "\n", "`Buyer shall use such materials and supplies only in accordance with the present agreement`\n", "\n", "Not, let's imagine we want to extract three entities:\n", "1) The Subject (`Buyer`)\n", "2) The Action (`shall use`)\n", "3) The Object (what the Buyer shall use? - `such materials and supplies only in accordance with the present agreement`)\n", "\n", "Although Subject and Action can be totally manageable by traditional NER, it usually struggles the longer the spans are. Trying to model the extraction of Object with a simple NER may result in word fading, when some of the initial or ending words fade into `O`.\n", "\n", "We present in this notebook a solution for Long Span Extraction: Using an Automatic Question Generator and a Question Answering model to:\n", "1) First, using NER, detect entities as the `Subject` and the `Action`. \n", "\n", "Example: `Buyer - SUBJECT`, `shall use - OBJECT`\n", "\n", "2) Automatically generate a question to ask for the `Object`, using `Subject` and `Action`;\n", "\n", "Example: `What shall the Buyer use?`\n", "\n", "3) Use the question and the sentence to retrieve `Object`, without the limitations of traditional NER;\n", "\n", "Example: `What shall the Buyer use? 
such materials and supplies only in accordance with the present agreement`\n", "\n", "Last, but not least, it's very important to choose a domain-specific Question Answering model.\n" ] }, { "cell_type": "markdown", "id": "03c6cb7d-b34f-4974-b63b-c04640f6a668", "metadata": { "id": "03c6cb7d-b34f-4974-b63b-c04640f6a668" }, "source": [ "# Answering the question - `What?`\n", "Let's take the example sentence:\n", "\n", "`The Buyer shall use such materials and supplies only in accordance with the present agreement`\n", "\n", "In Spark NLP for Legal, we have a trained NER model which is able to extract Subjects (`Buyer`) and Actions (`shall use`) of agreements / obligations with good accuracy.\n", "\n", "It's also trained for extracting the `Object` using NER, but its usage is limited due to the limitations described above.\n", "\n", "Let's get SUBJECT and ACTION and automatically create a question with them." ] }, { "cell_type": "code", "execution_count": null, "id": "b342ab82", "metadata": { "id": "b342ab82" }, "outputs": [], "source": [ "text = \"\"\"The Buyer shall use such materials and supplies only in accordance with the present agreement\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "id": "2948d346-d522-43b9-9cd7-99430882621f", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "2948d346-d522-43b9-9cd7-99430882621f", "outputId": "0cbcfe55-13db-4c29-f41b-e9ed717f4ccb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "legner_obligations download started this may take some time.\n", "[OK!]\n", "legqa_bert_large download started this may take some time.\n", "Approximate size to download 1.2 GB\n", "[OK!]\n" ] } ], "source": [ "import pandas as pd\n", "\n", "documentAssembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", "\n", "sparktokenizer = nlp.Tokenizer()\\\n", " .setInputCols(\"document\")\\\n", " .setOutputCol(\"token\")\n", "\n", 
"tokenClassifier = legal.BertForTokenClassification.pretrained(\"legner_obligations\", \"en\", \"legal/models\")\\\n", " .setInputCols(\"token\", \"document\")\\\n", " .setOutputCol(\"label\")\\\n", " .setCaseSensitive(True)\n", "\n", "nerconverter = nlp.NerConverter()\\\n", " .setInputCols([\"document\", \"token\", \"label\"])\\\n", " .setOutputCol(\"ner_chunk\")\n", "\n", "# setEntities1 says which entity from NER goes first in the question\n", "# setEntities2 says which entity from NER goes second in the question\n", "# setQuestionMark to True adds a '?' at the end of the sentence (after entity 2)\n", "# To sum up, the pattern is [QUESTIONPRONOUN] [ENTITY1] [ENTITY2] [QUESTIONMARK]\n", "qagenerator = legal.NerQuestionGenerator()\\\n", " .setInputCols([\"ner_chunk\"])\\\n", " .setOutputCol(\"question\")\\\n", " .setQuestionMark(False)\\\n", " .setQuestionPronoun(\"What\")\\\n", " .setEntities1([\"OBLIGATION_SUBJECT\"])\\\n", " .setEntities2([\"OBLIGATION_ACTION\"])\n", "\n", "qa =nlp.BertForQuestionAnswering.pretrained(\"legqa_bert_large\",\"en\", \"legal/models\") \\\n", " .setInputCols([\"question\", \"document\"]) \\\n", " .setOutputCol(\"answer\") \\\n", " .setCaseSensitive(True)\n", " \n", "pipeline = nlp.Pipeline(stages=[\n", " documentAssembler,\n", " sparktokenizer,\n", " tokenClassifier,\n", " nerconverter,\n", " qagenerator,\n", " qa\n", " ]\n", ")\n", "\n", "p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))\n", "\n", "res = p_model.transform(spark.createDataFrame([[text]]).toDF(\"text\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "183fb2db-1cee-4f78-a486-dd6c9f6abd57", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "183fb2db-1cee-4f78-a486-dd6c9f6abd57", "outputId": "7838e677-b497-4bad-b72a-5a906c200620" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----------------------+\n", "|result |\n", "+-----------------------+\n", "|[What Buyer shall use ]|\n", 
"+-----------------------+\n", "\n" ] } ], "source": [ "res.select('question.result').show(truncate=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "5422560c-718e-4606-9054-678371f539b3", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "5422560c-718e-4606-9054-678371f539b3", "outputId": "58794682-2b02-431d-ce45-5b63661788b8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------------------------------------------------+\n", "|result |\n", "+-------------------------------------------------+\n", "|[The Buyer shall use such materials and supplies]|\n", "+-------------------------------------------------+\n", "\n" ] } ], "source": [ "res.select('answer.result').show(truncate=False)" ] }, { "cell_type": "markdown", "id": "a85caa4b", "metadata": { "id": "a85caa4b" }, "source": [ "Let's get 4 additional examples" ] }, { "cell_type": "code", "execution_count": null, "id": "49ee5ee7-208c-463e-9e04-0af46b69dd0e", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "49ee5ee7-208c-463e-9e04-0af46b69dd0e", "outputId": "062b5b57-c308-45c6-d3ff-18989e9096df" }, "outputs": [ { "data": { "text/html": [ "\n", "
<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr><th></th><th>text</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>The Buyer shall use such materials and supplie...</td></tr>\n", " <tr><th>1</th><td>The Provider will notify the Buyer about the r...</td></tr>\n", " <tr><th>2</th><td>Amazon agrees to supply 1-year license without...</td></tr>\n", " <tr><th>3</th><td>The Supplier should ship the product in less t...</td></tr>\n", " </tbody>\n", "</table>\n", "
\n", " " ], "text/plain": [ " text\n", "0 The Buyer shall use such materials and supplie...\n", "1 The Provider will notify the Buyer about the r...\n", "2 Amazon agrees to supply 1-year license without...\n", "3 The Supplier should ship the product in less t..." ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "texts = [\n", " \"\"\"The Buyer shall use such materials and supplies only in accordance with the present agreement\"\"\",\n", " \"\"\"The Provider will notify the Buyer about the release date\"\"\",\n", " \"\"\"Amazon agrees to supply 1-year license without fees\"\"\",\n", " \"\"\"The Supplier should ship the product in less than 1 month\"\"\"\n", "]\n", "\n", "pdf = pd.DataFrame(texts, columns = [\"text\"])\n", "pdf" ] }, { "cell_type": "code", "execution_count": null, "id": "bac0a5ed-2e38-417d-ad65-57bf9da463bd", "metadata": { "id": "bac0a5ed-2e38-417d-ad65-57bf9da463bd" }, "outputs": [], "source": [ "df = spark.createDataFrame(pdf)" ] }, { "cell_type": "code", "execution_count": null, "id": "7dc2e5e5-29a7-48a6-a7cb-1c8d4fb3feb0", "metadata": { "id": "7dc2e5e5-29a7-48a6-a7cb-1c8d4fb3feb0" }, "outputs": [], "source": [ "res = p_model.transform(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "1351eaac-a74a-47e5-9079-44c26abc480d", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1351eaac-a74a-47e5-9079-44c26abc480d", "outputId": "60131f25-447f-48bc-ee5d-ae1275d756ac" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------------------------------+-----------------------------------------------------------+\n", "|result |result |\n", "+-------------------------------+-----------------------------------------------------------+\n", "|[What Buyer shall use ] |[The Buyer shall use such materials and supplies] |\n", "|[What Provider will notify ] |[The Provider will notify the Buyer about the release date]|\n", "|[What Amazon agrees to supply ]|[1 - year 
license without fees] |\n", "|[What Supplier should ship ] |[The Supplier should ship the product in less than 1 month]|\n", "+-------------------------------+-----------------------------------------------------------+\n", "\n" ] } ], "source": [ "res.select('question.result', 'answer.result').show(truncate=False)" ] }, { "cell_type": "markdown", "id": "bbabd12e-619b-480c-8821-6f504ab9a34b", "metadata": { "id": "bbabd12e-619b-480c-8821-6f504ab9a34b" }, "source": [ "# Answering the question - `To whom?`" ] }, { "cell_type": "markdown", "id": "ec574a03", "metadata": { "id": "ec574a03" }, "source": [ "Now let's try to extract the Indirect Object, that is, the recipient of an action. For example, to whom a supplier should send a shipment." ] }, { "cell_type": "code", "execution_count": null, "id": "f8bfbdeb-2c00-4aa6-bcea-7f46c90cd7ce", "metadata": { "id": "f8bfbdeb-2c00-4aa6-bcea-7f46c90cd7ce" }, "outputs": [], "source": [ "qagenerator = legal.NerQuestionGenerator()\\\n", " .setInputCols([\"ner_chunk\"])\\\n", " .setOutputCol(\"question\")\\\n", " .setQuestionMark(False)\\\n", " .setQuestionPronoun(\"To whom\")\\\n", " .setEntities1([\"OBLIGATION_ACTION\"])\\\n", " .setEntities2([\"OBLIGATION_SUBJECT\"])\n", " \n", "pipeline = nlp.Pipeline(stages=[\n", " documentAssembler,\n", " sparktokenizer,\n", " tokenClassifier,\n", " nerconverter,\n", " qagenerator,\n", " qa\n", " ]\n", ")\n", "\n", "p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))\n", "\n", "text = \"\"\"The Provider shall send the shipment to the Buyer\"\"\"\n", "res = p_model.transform(spark.createDataFrame([[text]]).toDF(\"text\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "7b730864-df23-47ff-9e4e-e4dcd123fdba", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7b730864-df23-47ff-9e4e-e4dcd123fdba", "outputId": "dffcdcc3-b5ae-4dff-9e2d-82df4b688eff" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"+-------------------------------------------------+------------------------------+-----------------------------------------------+\n", "|text |result |result |\n", "+-------------------------------------------------+------------------------------+-----------------------------------------------+\n", "|The Provider shall send the shipment to the Buyer|[To whom shall send Provider ]|[Provider shall send the shipment to the Buyer]|\n", "+-------------------------------------------------+------------------------------+-----------------------------------------------+\n", "\n" ] } ], "source": [ "res.select('text', 'question.result', 'answer.result').show(truncate=False)" ] }, { "cell_type": "markdown", "id": "6ef6a289-ef6f-45db-bbe7-caa4f8666bd5", "metadata": { "id": "6ef6a289-ef6f-45db-bbe7-caa4f8666bd5" }, "source": [ "# Other clauses\n", "This approach works very well also with other clauses and phrases, as temporal ones. Let's try to ask for the deadline of a contract" ] }, { "cell_type": "code", "execution_count": null, "id": "ca38094c-52df-445e-9e18-087c85b0a2ee", "metadata": { "id": "ca38094c-52df-445e-9e18-087c85b0a2ee" }, "outputs": [], "source": [ "qagenerator = legal.NerQuestionGenerator()\\\n", " .setInputCols([\"ner_chunk\"])\\\n", " .setOutputCol(\"question\")\\\n", " .setQuestionMark(False)\\\n", " .setQuestionPronoun(\"Before when\")\\\n", " .setEntities1([\"OBLIGATION_ACTION\"])\\\n", " .setEntities2([\"OBLIGATION_SUBJECT\"])\n", " \n", "pipeline = nlp.Pipeline(stages=[\n", " documentAssembler,\n", " sparktokenizer,\n", " tokenClassifier,\n", " nerconverter,\n", " qagenerator,\n", " qa\n", " ]\n", ")\n", "\n", "p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))\n", "\n", "text = \"\"\"The customer should sign the contract before May, 2023\"\"\"\n", "res = p_model.transform(spark.createDataFrame([[text]]).toDF(\"text\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "46c831a5-43f6-4511-86fa-76a076fee510", 
"metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "46c831a5-43f6-4511-86fa-76a076fee510", "outputId": "9b0f40a8-1d39-408e-8010-40a825cd17dd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------------------------------------------------+-----------------------------------+-------------------+\n", "|text |result |result |\n", "+------------------------------------------------------+-----------------------------------+-------------------+\n", "|The customer should sign the contract before May, 2023|[Before when should sign customer ]|[before May , 2023]|\n", "+------------------------------------------------------+-----------------------------------+-------------------+\n", "\n" ] } ], "source": [ "res.select('text', 'question.result', 'answer.result').show(truncate=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "AjLGf3A4Y8Vz", "metadata": { "id": "AjLGf3A4Y8Vz" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "tf-gpu", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]" }, "vscode": { "interpreter": { "hash": "3f47d918ae832c68584484921185f5c85a1760864bf927a683dc6fb56366cc77" } } }, "nbformat": 4, "nbformat_minor": 5 }