{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "3inuQHTI-G-P" }, "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ] }, { "cell_type": "markdown", "metadata": { "id": "0Cd4gz6vxHiQ" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/12.Coreference_Resolution.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "id": "gk3kZHmNj51v" }, "source": [ "#🎬 Installation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_914itZsj51v", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "! pip install -q johnsnowlabs" ] }, { "cell_type": "markdown", "metadata": { "id": "YPsbAnNoPt0Z" }, "source": [ "##🔗 Automatic Installation\n", "Using my.johnsnowlabs.com SSO" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fY0lcShkj51w", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "from johnsnowlabs import nlp, legal\n", "\n", "# nlp.install(force_browser=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "hsJvn_WWM2GL" }, "source": [ "##🔗 Manual downloading\n", "If you are not registered at my.johnsnowlabs.com, you received your license via e-mail, or you are using Safari, you may need to upload the license manually.\n", "\n", "- Go to my.johnsnowlabs.com\n", "- Download your license\n", "- Upload it using the following command" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "i57QV3-_P2sQ" }, "outputs": [], "source": [ "from google.colab import files\n", "print('Please Upload your John Snow Labs License using the button below')\n", "license_keys = files.upload()" ] }, { "cell_type": "markdown", "metadata": { "id": "xGgNdFzZP_hQ" }, "source": [ "- Install it" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OfmmPqknP4rR" }, "outputs": 
[], "source": [ "nlp.install()" ] }, { "cell_type": "markdown", "metadata": { "id": "DCl5ErZkNNLk" }, "source": [ "#📌 Starting" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wRXTnNl3j51w" }, "outputs": [], "source": [ "spark = nlp.start()" ] }, { "cell_type": "markdown", "metadata": { "id": "9PSPOffJm5T8" }, "source": [ "#🔎 Legal Coreference Resolution" ] }, { "cell_type": "markdown", "metadata": { "id": "flBIgvG5LJHe" }, "source": [ "![image.png]()" ] }, { "cell_type": "markdown", "metadata": { "id": "6Q_ivA7kpPB8" }, "source": [ "📜Coreference Resolution is the task of finding all expressions that refer to the same entity in a text.\n", "\n", "This is very important in both Legal and Financial texts, where the official name of a company is given at the beginning of the document, but aliases are used instead of the official name later on.\n", "\n", "Let's take a look at some examples and how to solve them using Coreference Resolution." ] }, { "cell_type": "markdown", "metadata": { "id": "1jD5r4LupoJC" }, "source": [ "📜`'Armstrong Hardwood Flooring Company is a Tennessee corporation (known also as \"Company\"). The Company own certain Copyrights and Know-How which may be used in the Arizona Field, and in connection with the transactions contemplated by the Stock Purchase Agreement, Arizona desires to obtain a license from the Company Entities to use such Intellectual Property on the terms and subject to the conditions set forth herein.'`\n", "\n", "In the previous text, in the second sentence, `Company` refers to `Armstrong Hardwood Flooring Company`.\n", "\n", "📌There are three ways we can accomplish coreference resolution:\n", "1. With a specific `SpanBertCorefModel` annotator;\n", "2. With NER and Relation Extraction;\n", "3. With Question Answering." ] }, { "cell_type": "markdown", "metadata": { "id": "P8xiIA53qAzd" }, "source": [ "#🔎 1. 
SpanBertCoref" ] }, { "cell_type": "markdown", "metadata": { "id": "avpl6hNXqe3i" }, "source": [ "`SpanBertCorefModel` is an annotator for coreference resolution with BERT and SpanBERT models, based on the paper [BERT for Coreference Resolution: Baselines and Analysis](https://arxiv.org/abs/1908.09091).\n", "\n", "Spark NLP includes the `SpanBertCorefModel` annotator as an implementation of this SpanBERT-based coreference resolution model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "qdzTlziOjsL7", "outputId": "be42faab-10f8-4b75-bfbd-7c8cf6d6cf7d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "spanbert_base_coref download started this may take some time.\n", "Approximate size to download 540.1 MB\n", "[OK!]\n" ] } ], "source": [ "document_assembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", "\n", "text_splitter = legal.TextSplitter()\\\n", " .setInputCols([\"document\"])\\\n", " .setOutputCol(\"sentences\")\n", "\n", "tokenizer = nlp.Tokenizer()\\\n", " .setInputCols([\"sentences\"])\\\n", " .setOutputCol(\"tokens\")\n", "\n", "corefResolution = nlp.SpanBertCorefModel()\\\n", " .pretrained(\"spanbert_base_coref\")\\\n", " .setInputCols([\"sentences\", \"tokens\"])\\\n", " .setOutputCol(\"corefs\")\n", "\n", "pipeline = nlp.Pipeline(stages=[document_assembler, text_splitter, tokenizer, corefResolution])" ] }, { "cell_type": "markdown", "metadata": { "id": "_HIbSJbgt3KY" }, "source": [ "###✔️ Who is \"the Company\" in this example❓" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "zl_-oEogthky", "outputId": "35a0d2ef-9260-461f-e4d5-02b2435441f0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"+-----------------------------------+-----------------------------------------------------------------------------------------------------------------+\n", "|token |metadata |\n", "+-----------------------------------+-----------------------------------------------------------------------------------------------------------------+\n", "|Armstrong Hardwood Flooring Company|{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0} |\n", "|The Company |{head.sentence -> 0, head -> Armstrong Hardwood Flooring Company, head.begin -> 0, head.end -> 34, sentence -> 1}|\n", "+-----------------------------------+-----------------------------------------------------------------------------------------------------------------+\n", "\n" ] } ], "source": [ "example1 = 'Armstrong Hardwood Flooring Company is a Tennessee corporation (known also as \"Company\"). The Company own certain Copyrights and Know-How which may be used to the conditions set forth herein.'\n", "\n", "data = spark.createDataFrame([[example1]]).toDF(\"text\")\n", "\n", "model = pipeline.fit(data)\n", "\n", "model.transform(data).selectExpr(\"explode(corefs) AS coref\").selectExpr(\"coref.result as token\", \"coref.metadata\").show(truncate=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "xk69tCV3t5_x" }, "source": [ "###โœ”๏ธ What is \"this Agreement\" in the example? 
And \"it\"❓" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6_H1-luHox_t", "outputId": "789c4f31-d921-4076-b654-71717c6afc68" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------------+-------------------------------------------------------------------------------------------------+\n", "|token |metadata |\n", "+------------------+-------------------------------------------------------------------------------------------------+\n", "|this \" Agreement \"|{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0} |\n", "|It |{head.sentence -> 0, head -> this \" Agreement \", head.begin -> 38, head.end -> 53, sentence -> 1}|\n", "+------------------+-------------------------------------------------------------------------------------------------+\n", "\n" ] } ], "source": [ "example2 = 'This INTELLECTUAL PROPERTY AGREEMENT (this \"Agreement\") is dated as of December 31, 2018 (the \"Effective Date\").It was entered into by and between Armstrong Flooring (the \"Seller\") and AHF Holding (the \"Buyer\"). 
Seller and Buyer have entered into that certain Stock Purchase Agreement, dated November 14, 2018 (the \"Stock Purchase Agreement\")'\n", "\n", "data = spark.createDataFrame([[example2]]).toDF(\"text\")\n", "\n", "model = pipeline.fit(data)\n", "\n", "model.transform(data).selectExpr(\"explode(corefs) AS coref\").selectExpr(\"coref.result as token\", \"coref.metadata\").show(truncate=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "o_t1PbJKuETe" }, "source": [ "###✔️ Which date are we talking about❓" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Hf4H4Cxks32y", "outputId": "ff49aeef-5269-4b09-b313-0e8bb11f72de" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------------+-------------------------------------------------------------------------------------------------+\n", "|token |metadata |\n", "+------------------+-------------------------------------------------------------------------------------------------+\n", "|This Agreement |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0} |\n", "|it |{head.sentence -> 0, head -> This Agreement, head.begin -> 0, head.end -> 13, sentence -> 1} |\n", "|December 31 , 2018|{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0} |\n", "|that date |{head.sentence -> 0, head -> December 31 , 2018, head.begin -> 30, head.end -> 46, sentence -> 1}|\n", "+------------------+-------------------------------------------------------------------------------------------------+\n", "\n" ] } ], "source": [ "example3 = 'This Agreement is dated as of December 31, 2018 (the \"Effective Date\"). 
Seller and Buyer should sign it before the ending of that date.'\n", "\n", "data = spark.createDataFrame([[example3]]).toDF(\"text\")\n", "\n", "model = pipeline.fit(data)\n", "\n", "model.transform(data).selectExpr(\"explode(corefs) AS coref\").selectExpr(\"coref.result as token\", \"coref.metadata\").show(truncate=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "3CHlz5rkuL_2" }, "source": [ "However, in reality legal texts are often much longer and more complex than these examples.\n", "\n", "Let's take a look at the following example:" ] }, { "cell_type": "markdown", "metadata": { "id": "a0L2ECGSu0O1" }, "source": [ "###✔️ Who is Seller? Who is Buyer❓" ] }, { "cell_type": "markdown", "metadata": { "id": "mCcCI8XxvNn1" }, "source": [ "FAIL: We are unable to retrieve that information using SpanBertCoref." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "iGj2-FdlrpgD", "outputId": "8df9a0b9-2d67-47c5-d008-1ee5625eda6a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31m\n", "+--------+----------------------------------------------------------------------------------------+\n", "|token |metadata |\n", "+--------+----------------------------------------------------------------------------------------+\n", "|Seller \"|{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 1} |\n", "|Seller \"|{head.sentence -> 1, head -> Seller \", head.begin -> 95, head.end -> 101, sentence -> 3}|\n", "+--------+----------------------------------------------------------------------------------------+\n", "\n", "\u001b[0m\n" ] } ], "source": [ "example4 = 'This INTELLECTUAL PROPERTY AGREEMENT is entered into by and between Armstrong Flooring, Inc. (\"Seller\") and AHF Holding, Inc. (\"Buyer\"). 
\"Seller\" and \"Buyer\" have entered into that certain Stock Purchase Agreement, dated November 14, 2018 (the \"Stock Purchase Agreement\")'\n", "\n", "data = spark.createDataFrame([[example4]]).toDF(\"text\")\n", "\n", "model = pipeline.fit(data)\n", "\n", "print(\"\\x1b[31m\")\n", "model.transform(data).selectExpr(\"explode(corefs) AS coref\").selectExpr(\"coref.result as token\", \"coref.metadata\").show(truncate=False)\n", "print(\"\\x1b[0m\")" ] }, { "cell_type": "markdown", "metadata": { "id": "RNfgDSYgvYXI" }, "source": [ "Another disadvantage of this method is that you need to send all the text at once to resolve the correferences. If you miss the original lines where the concepts are defined, you will lose the reference.\n", "\n", "As an alternative, we can use NER and Relation Extraction, as shown in the next section.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "99hvTbVtvc7d" }, "source": [ "#๐Ÿ”Ž 2. NER and Relation Extraction" ] }, { "cell_type": "markdown", "metadata": { "id": "VGBOmAeizY8u" }, "source": [ "We have several models trained in Models Hub (Spark NLP for Legal), which are able to detect aliases or secondary names in financial and legal documents.\n", "\n", "We are going to use this NER one:\n", "`https://nlp.johnsnowlabs.com/2022/08/12/legre_contract_doc_parties_en_3_2.html`\n", "\n", "After extracting the aliases, we will check which names they are referring to. To do this, we will use Relation Extraction. For this example, we will use this model:\n", "\n", "`https://nlp.johnsnowlabs.com/2022/08/17/legre_org_prod_alias_en_3_2.html`\n", "\n", "Let's see them in action." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bPATySlKvgDh", "outputId": "df5df8a1-7146-4462-ce10-25d9ba606a9f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bert_embeddings_sec_bert_base download started this may take some time.\n", "Approximate size to download 390.4 MB\n", "[OK!]\n", "legner_orgs_prods_alias download started this may take some time.\n", "[OK!]\n", "legre_org_prod_alias download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "documentAssembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", "\n", "tokenizer = nlp.Tokenizer()\\\n", " .setInputCols([\"document\"])\\\n", " .setOutputCol(\"token\")\n", "\n", "embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\") \\\n", " .setInputCols([\"document\", \"token\"]) \\\n", " .setOutputCol(\"embeddings\")\n", "\n", "ner_model = legal.NerModel.pretrained(\"legner_orgs_prods_alias\", \"en\", \"legal/models\")\\\n", " .setInputCols([\"document\", \"token\", \"embeddings\"])\\\n", " .setOutputCol(\"ner\")\n", "\n", "ner_converter = nlp.NerConverter()\\\n", " .setInputCols([\"document\",\"token\",\"ner\"])\\\n", " .setOutputCol(\"ner_chunk\")\n", "\n", "reDL = legal.RelationExtractionDLModel()\\\n", " .pretrained(\"legre_org_prod_alias\", \"en\", \"legal/models\")\\\n", " .setPredictionThreshold(0.99)\\\n", " .setInputCols([\"ner_chunk\", \"document\"])\\\n", " .setOutputCol(\"relations\")\n", "\n", "nlpPipeline = nlp.Pipeline(stages=[\n", " documentAssembler,\n", " tokenizer,\n", " embeddings,\n", " ner_model,\n", " ner_converter,\n", " reDL])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YJo1E2lX05Kh" }, "outputs": [], "source": [ "example4 = 'This INTELLECTUAL PROPERTY AGREEMENT is entered into by and between Armstrong Flooring, Inc. 
(the \"Seller\") and AHF Holding, Inc., a Delaware Corporation (the \"Buyer\").'\n", "\n", "data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c-HnYlJ60-qk" }, "outputs": [], "source": [ "lmodel = nlp.LightPipeline(model)\n", "res = lmodel.fullAnnotate(example4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "r4XJXvPP3rC5", "outputId": "10dfd877-1ca5-41dc-f2d2-bb5071037d7b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Armstrong Flooring, Inc. - has_alias - Seller (confidence: 0.9938087)\n", "AHF Holding, Inc., a Delaware Corporation - has_alias - Buyer (confidence: 0.9923051)\n" ] } ], "source": [ "aliases = dict()\n", "for r in res:\n", "    for rel in r['relations']:\n", "        if rel.result != 'no_rel':\n", "            aliases.setdefault(rel.metadata['chunk2'], []).append(rel.metadata['chunk1'])\n", "            print(f\"{rel.metadata['chunk1']} - {rel.result} - {rel.metadata['chunk2']} (confidence: {rel.metadata['confidence']})\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QugS7tnp8hs8", "outputId": "b831b085-0f84-478c-9f98-a189059d1be5" }, "outputs": [ { "data": { "text/plain": [ "{'Seller': ['Armstrong Flooring, Inc.'],\n", " 'Buyer': ['AHF Holding, Inc., a Delaware Corporation']}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aliases" ] }, { "cell_type": "markdown", "metadata": { "id": "ukN_yz6r5naW" }, "source": [ "Once that is done, you can process the rest of the document, detecting entities such as Seller, Buyer, etc. with either NER or ContextualParsers, and disambiguate them using the results of the previous model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Kmd0yjEF51Up" }, "outputs": [], "source": 
[ "example5 = '\"Seller\" and \"Buyer\" have entered into that certain Stock Purchase Agreement, dated November 14, 2018 (the \"Stock Purchase Agreement\")'\n", "\n", "nlpPipeline = nlp.Pipeline(stages=[\n", " documentAssembler,\n", " tokenizer,\n", " embeddings,\n", " ner_model,\n", " ner_converter])\n", "\n", "data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(data)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6hFXNZU659Zh" }, "outputs": [], "source": [ "lmodel = nlp.LightPipeline(model)\n", "res = lmodel.fullAnnotate(example5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MFUb4Xk-71yy", "outputId": "78942ecc-3760-434e-86f9-ab300cdc243f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Seller is Armstrong Flooring, Inc.\n", "Buyer is AHF Holding, Inc., a Delaware Corporation\n" ] } ], "source": [ "for r in res:\n", "    for ner_chunk in r['ner_chunk']:\n", "        print(f\"{ner_chunk.result} is {aliases[ner_chunk.result][0]}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "Zcvmnmzx89i9" }, "source": [ "The big advantage of this method is that you don't need to process the whole text to know the coreferences. You can detect the aliases first, store them, and then resolve the coreferences with NER and RE." ] }, { "cell_type": "markdown", "metadata": { "id": "bi-OzrHX82JJ" }, "source": [ "#🔎 3. Question Answering" ] }, { "cell_type": "markdown", "metadata": { "id": "eY40Q_sm9hcm" }, "source": [ "This is the third option for retrieving coreferences. You can detect the aliases and ask questions on the fly about what they refer to.\n", "\n", "Let's see an example." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3WDfLkYm-fQJ" }, "outputs": [], "source": [ "context = 'This INTELLECTUAL PROPERTY AGREEMENT is entered into by and between Armstrong Flooring, Inc. 
(\"Seller\") and AHF Holding, Inc. (\"Buyer\").'.lower()\n", "question1 = 'Which company is the Buyer'.lower()\n", "question2 = 'Which company is the Seller'.lower()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "zNBbwlke84fD", "outputId": "0deed14c-cb51-4b54-9415-16d84d860be3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "legqa_bert_large download started this may take some time.\n", "Approximate size to download 1.2 GB\n", "[OK!]\n" ] } ], "source": [ "document_assembler = nlp.MultiDocumentAssembler()\\\n", " .setInputCols([\"question\", \"context\"]) \\\n", " .setOutputCols([\"document_question\", \"document_context\"])\n", "\n", "spanClassifier = nlp.BertForQuestionAnswering.pretrained(\"legqa_bert_large\",\"en\", \"legal/models\")\\\n", " .setInputCols([\"document_question\", \"document_context\"]) \\\n", " .setOutputCol(\"answer\") \\\n", " .setCaseSensitive(False)\n", "\n", "pipeline = nlp.Pipeline().setStages([\n", " document_assembler,\n", " spanClassifier\n", "])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bJdS9HFR-r6T" }, "outputs": [], "source": [ "qa = [[question1, context], [question2, context]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Xc7_sF3l_3nJ", "outputId": "66f14d3e-e1e8-49e7-c2a6-b21abdead3d6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------------------------+\n", "|result |\n", "+----------------------------+\n", "|[ahf holding , inc .] 
|\n", "|[armstrong flooring , inc .]|\n", "+----------------------------+\n", "\n" ] } ], "source": [ "example = spark.createDataFrame(qa).toDF(\"question\", \"context\")\n", "\n", "result = pipeline.fit(example).transform(example)\n", "\n", "result.select('answer.result').show(truncate=False)" ] } ], "metadata": { "colab": { "machine_shape": "hm", "provenance": [] }, "gpuClass": "standard", "kernelspec": { "display_name": "tf-gpu", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]" }, "vscode": { "interpreter": { "hash": "3f47d918ae832c68584484921185f5c85a1760864bf927a683dc6fb56366cc77" } } }, "nbformat": 4, "nbformat_minor": 0 }