{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "machine_shape": "hm" }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" }, "gpuClass": "standard" }, "cells": [ { "cell_type": "markdown", "source": [ "# **Colab Setup**" ], "metadata": { "id": "gzMj3X9TSeW9" } }, { "cell_type": "markdown", "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ], "metadata": { "id": "4cm9C4uMSkKS" } }, { "cell_type": "markdown", "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/06.3.Classification_NER_RE_on_Parties.ipynb)" ], "metadata": { "id": "4yMO4KAnuFxk" } }, { "cell_type": "markdown", "metadata": { "collapsed": false, "id": "BknLo-nHX9M6" }, "source": [ "##🎬 Installation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": true }, "id": "_914itZsj51v" }, "outputs": [], "source": [ "! pip install -q johnsnowlabs" ] }, { "cell_type": "markdown", "source": [ "##πŸ”— Automatic Installation\n", "Using my.johnsnowlabs.com SSO" ], "metadata": { "id": "YPsbAnNoPt0Z" } }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": true }, "id": "fY0lcShkj51w" }, "outputs": [], "source": [ "from johnsnowlabs import *\n", "\n", "# nlp.install(force_browser=True)" ] }, { "cell_type": "markdown", "source": [ "##πŸ”— Manual downloading\n", "If you are not registered in my.johnsnowlabs.com, you received a license via e-email or you are using Safari, you may need to do a manual update of the license.\n", "\n", "- Go to my.johnsnowlabs.com\n", "- Download your license\n", "- Upload it using the following command" ], "metadata": { "id": "hsJvn_WWM2GL" } }, { "cell_type": "code", "source": [ "from google.colab import files\n", "print('Please Upload your John Snow Labs License using the button below')\n", "license_keys = files.upload()" ], "metadata": { "id": "i57QV3-_P2sQ" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "- Install it" ], "metadata": { "id": "xGgNdFzZP_hQ" } }, { "cell_type": "code", "source": [ "nlp.install()" ], "metadata": { "id": "OfmmPqknP4rR" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "##πŸ“Œ Starting" ], "metadata": { "id": "DCl5ErZkNNLk" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wRXTnNl3j51w" }, "outputs": [], "source": [ "from johnsnowlabs import *\n", "spark = nlp.start()" ] }, { "cell_type": "markdown", "source": [ "# **Loading the data**\n", "β–’β–’β–’β–’β–’β–’β–’β–’β–’β–’ 100% α΄„α΄α΄α΄˜ΚŸα΄‡α΄›α΄‡! 
" ], "metadata": { "id": "xRbTMepEXRIt" } }, { "cell_type": "code", "source": [ "import requests\n", "URL = \"https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/commercial_lease_1.txt\"\n", "URL_2 = \"https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/commercial_lease_2.txt\"\n", "URL_3 = \"https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/credit_agreement_2.txt\"\n", "URL_4 = \"https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/loan_agreement.txt\"\n", "\n", "\n", "response = requests.get(URL)\n", "response2 = requests.get(URL_2)\n", "response3 = requests.get(URL_3)\n", "response4 = requests.get(URL_4)\n", "\n", "\n", "commercial_lease = response.content.decode('utf-8')\n", "commercial_lease_2 = response2.content.decode('utf-8')\n", "credit_agreement = response3.content.decode('utf-8')\n", "loan_agreement = response4.content.decode('utf-8')" ], "metadata": { "id": "NVPZmdk4XUeV" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "#πŸ”Ž **Document Clasification**" ], "metadata": { "id": "R__Q1fx2Tfmb" } }, { "cell_type": "markdown", "source": [ "## Commercial Lease Classification" ], "metadata": { "id": "SA8cKb7Xt2U0" } }, { "cell_type": "markdown", "source": [ "###πŸ“œ **Let's give the commercial lease classification model various types of documents to see if it correctly detects them or not.**\n", "\n", "###πŸ“œ The documents that are being used in the below cells for testing are ***commercial lease***, ***credit agreement***, ***loan agreement*** and another ***commercial lease***." ], "metadata": { "id": "HQPwdQm9Y6Zt" } }, { "cell_type": "code", "source": [ "documents = [commercial_lease,credit_agreement,loan_agreement,commercial_lease_2]\n", "documents = [[i] for i in documents]\n" ], "metadata": { "id": "_8kcCDjacsq1" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "document_assembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", " \n", "embeddings = nlp.BertSentenceEmbeddings.pretrained(\"sent_bert_base_cased\", \"en\")\\\n", " .setInputCols(\"document\")\\\n", " .setOutputCol(\"sentence_embeddings\")\n", " \n", "doc_classifier = legal.ClassifierDLModel.pretrained(\"legclf_commercial_lease\", \"en\", \"legal/models\")\\\n", " .setInputCols([\"sentence_embeddings\"])\\\n", " .setOutputCol(\"category\")\n", " \n", "nlpPipeline = nlp.Pipeline(stages=[\n", " document_assembler, \n", " embeddings,\n", " doc_classifier])\n", "\n", "df = spark.createDataFrame(documents).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(df)\n", "\n", "result = model.transform(df)\n", "\n", "result.select('category.result').show(truncate=False)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_Soy5HMkTh-W", "outputId": "1dbf2e76-1872-4e6c-8d72-a8538436c116" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "sent_bert_base_cased download started this may take some time.\n", "Approximate size to download 389.1 MB\n", "[OK!]\n", "legclf_commercial_lease download started this may take some time.\n", "[OK!]\n", "+------------------+\n", "|result |\n", "+------------------+\n", "|[commercial-lease]|\n", "|[other] |\n", "|[other] |\n", "|[commercial-lease]|\n", "+------------------+\n", "\n" ] } ] }, { "cell_type": "markdown", "source": [ "### **Here, we can see that the classifier 
accurately detected the commercial lease documents.**\n", "\n", "### Among these documents there is also a loan agreement, which we can detect as well. In this case the model was trained using Sentence BERT embeddings." ], "metadata": { "id": "2lzcP4CRkMxc" } }, { "cell_type": "code", "source": [ "\n", "document_assembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", " \n", "embeddings = nlp.BertSentenceEmbeddings.pretrained(\"sent_bert_base_cased\", \"en\")\\\n", " .setInputCols(\"document\")\\\n", " .setOutputCol(\"sentence_embeddings\")\n", " \n", "doc_classifier = legal.ClassifierDLModel.pretrained(\"legclf_loan_agreement_bert\", \"en\", \"legal/models\")\\\n", " .setInputCols([\"sentence_embeddings\"])\\\n", " .setOutputCol(\"category\")\n", " \n", "nlpPipeline = nlp.Pipeline(stages=[\n", " document_assembler, \n", " embeddings,\n", " doc_classifier])\n", " \n", "df = spark.createDataFrame(documents).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(df)\n", "\n", "result = model.transform(df)\n", "\n", "result.select('category.result').show(truncate=False)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "D5c-_V2DYA2z", "outputId": "82ecd12d-3598-4bf1-bfd2-22973082a94d" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "sent_bert_base_cased download started this may take some time.\n", "Approximate size to download 389.1 MB\n", "[OK!]\n", "legclf_loan_agreement_bert download started this may take some time.\n", "[OK!]\n", "+----------------+\n", "|result |\n", "+----------------+\n", "|[other] |\n", "|[other] |\n", "|[loan-agreement]|\n", "|[other] |\n", "+----------------+\n", "\n" ] } ] }, { "cell_type": "markdown", "source": [ "### The classifier has recognised it. You may find more classifiers for various document types on the Models Hub page: https://nlp.johnsnowlabs.com/models?edition=Legal+NLP&task=Text+Classification" ], "metadata": { "id": "bSuQIvHVuI-D" } }, { "cell_type": "markdown", "source": [ "#πŸ”Ž Paragraph Splitting" ], "metadata": { "id": "qsRcBkMcu2VL" } }, { "cell_type": "markdown", "source": [ "### **Reason**: Generally, clause lengths range from one paragraph to several. Splitting into larger sections such as pages can pack in too much information, distorting the meaning, mixing clauses together and confusing the classifiers. On the other hand, sentence-level information is too limited. So the best split we can make for clause extraction is at the paragraph level." ], "metadata": { "id": "BGEGZFAbzayT" } }, { "cell_type": "markdown", "metadata": { "id": "FJD39A1HidOT" }, "source": [ "πŸ“œExplanation:\n", "- `.setCustomBounds([\"\\r\\n\\r\\n \"])` sets an array of regular expression(s) telling the annotator where to split the document. (**Here we are splitting by paragraph, i.e. on blank lines.**)\n", "- `.setUseCustomBoundsOnly(True)`: the default behaviour of the annotator is sentence splitting, so we set this to ignore the default regex ('\\n', ...).\n", "- `.setExplodeSentences(True)` creates one new row in the dataframe per split. A minimal example follows."
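] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before applying this to the lease, here is a minimal sketch of custom-bounds splitting on a tiny inline string. The `###` delimiter is made up purely for illustration; the real documents below are split on blank lines instead." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch: split a toy string on a made-up \"###\" delimiter\n", "demo_pipeline = nlp.Pipeline(stages=[\n", "    nlp.DocumentAssembler().setInputCol(\"text\").setOutputCol(\"document\"),\n", "    legal.TextSplitter()\n", "        .setInputCols([\"document\"])\n", "        .setOutputCol(\"pages\")\n", "        .setCustomBounds([\"###\"])\n", "        .setUseCustomBoundsOnly(True)\n", "        .setExplodeSentences(True)  # one dataframe row per fragment\n", "])\n", "\n", "demo_df = spark.createDataFrame([[\"Clause one.###Clause two.###Clause three.\"]]).toDF(\"text\")\n", "demo_pipeline.fit(demo_df).transform(demo_df).select(\"pages.result\").show(truncate=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's apply the same idea at the paragraph level to the lease agreement."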
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iYgqcMd43pqy" }, "outputs": [], "source": [ "document_assembler = nlp.DocumentAssembler() \\\n", " .setInputCol(\"text\") \\\n", " .setOutputCol(\"document\")\n", "\n", "text_splitter = legal.TextSplitter() \\\n", " .setInputCols([\"document\"]) \\\n", " .setOutputCol(\"pages\")\\\n", " .setCustomBounds([\"\\r\\n\\r\\n \"])\\\n", " .setUseCustomBoundsOnly(True)\\\n", " .setExplodeSentences(True)\n", "\n", "nlp_pipeline = nlp.Pipeline(stages=[\n", " document_assembler,\n", " text_splitter])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rWMysWTfKto7" }, "outputs": [], "source": [ "sdf = spark.createDataFrame([[commercial_lease]]).toDF(\"text\")\n", "\n", "fit = nlp_pipeline.fit(sdf)\n", "\n", "lp = nlp.LightPipeline(fit)\n", "\n", "res = lp.annotate(commercial_lease_2)\n", "pages = res['pages']\n", "pages = [p for p in pages if p.strip() != ''] # We remove empty pages" ] }, { "cell_type": "code", "source": [ "len(pages)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "F1B9dnsNvnuZ", "outputId": "bf929444-afcd-487b-cb2b-ec7b4ef0384f" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "87" ] }, "metadata": {}, "execution_count": 16 } ] }, { "cell_type": "markdown", "source": [ "### Let's now examine these paragraphs and determine which one of them is an **introductory clause**.\n", "\n", "### You may find more **clauses** on the models hub page: https://nlp.johnsnowlabs.com/models?q=clause&edition=Legal+NLP&task=Text+Classification" ], "metadata": { "id": "rzAlKHttwV_s" } }, { "cell_type": "code", "source": [ "embeddings = nlp.BertSentenceEmbeddings.pretrained(\"sent_bert_base_cased\", \"en\")\\\n", " .setInputCols(\"document\")\\\n", " .setOutputCol(\"sentence_embeddings\")\n", " \n", "doc_classifier = legal.ClassifierDLModel.pretrained(\"legclf_introduction_clause\", \"en\", \"legal/models\")\\\n", " .setInputCols([\"sentence_embeddings\"])\\\n", " .setOutputCol(\"category\")\n", "\n", "nlpPipeline = nlp.Pipeline(stages=[\n", " document_assembler, \n", " embeddings,\n", " doc_classifier])\n", "\n", "texts = [[i] for i in pages]\n", "df = spark.createDataFrame(texts).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(df)\n", "\n", "result = model.transform(df)\n", "result.select('category.result').show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "btNrJITbv2f3", "outputId": "6028dd69-e81e-4587-c89f-2557556598b0" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "sent_bert_base_cased download started this may take some time.\n", "Approximate size to download 389.1 MB\n", "[OK!]\n", "legclf_introduction_clause download started this may take some time.\n", "[OK!]\n", "+--------------+\n", "| result|\n", "+--------------+\n", "|[introduction]|\n", "| [other]|\n", "|[introduction]|\n", "|[introduction]|\n", "| [other]|\n", "| [other]|\n", "| [other]|\n", "| [other]|\n", "| [other]|\n", "| [other]|\n", "| [other]|\n", "| [other]|\n", "| [other]|\n", "| [other]|\n", "| [other]|\n", "| [other]|\n", "|[introduction]|\n", "|[introduction]|\n", "| [other]|\n", "| [other]|\n", "+--------------+\n", "only showing top 20 rows\n", "\n" ] } ] }, { "cell_type": "code", "source": [ "introductory_clause = result.select('text').filter(\"category.result[0] != 'other'\").collect()\n" ], "metadata": { "id": "rXzL6HwG0YNX" }, "execution_count": null, "outputs": 
[] }, { "cell_type": "code", "source": [ "print(introductory_clause[1][0])" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "l4_msqaU5SCj", "outputId": "36107e67-e4a4-473f-a592-6e7c376fae85" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "THIS Lease Agreement , is made and entered into this _____day of May, 2006 by and between Global, Inc., (hereinafter called \"Landlord\"), and IMI Global, Inc., with a mailing address of ___, (hereinafter referred as \"Tenant\").\n" ] } ] }, { "cell_type": "markdown", "source": [ "#πŸ”Ž **Pretrained Pipelines**\n", "\n", "Spark NLP provides pre-trained pipelines that have already been fitted with specific annotators and transformers for various use cases, so you don't have to create a pipeline from scratch. If you need to adjust the parameters of the Relation Extraction model, you can utilize the aforementioned Relation Extraction pipeline." ], "metadata": { "id": "uv6a8kXu3633" } }, { "cell_type": "markdown", "source": [ "##πŸ”Ž **Named Entity Recognition**" ], "metadata": { "id": "jIf9TUwSFmJG" } }, { "cell_type": "markdown", "metadata": { "id": "GnvuUoc9IWHi" }, "source": [ "#### Let's use one of the clauses that have been identified as **`introductory`** for detecting the entities(NER) using and **Introductory Clause specific NER** and then mapping the relations between them." ] }, { "cell_type": "markdown", "source": [ "### To learn more about the pipeline being utilized here, please refer to the model's hub page on the Johns Snow Labs NLP website: https://nlp.johnsnowlabs.com/2023/02/02/legpipe_ner_contract_doc_parties_alias_former_en.html" ], "metadata": { "id": "2GNX-n2E-QK1" } }, { "cell_type": "code", "source": [ "legal_pipeline = nlp.PretrainedPipeline(\"legpipe_ner_contract_doc_parties_alias_former\", \"en\", \"legal/models\")\n", "\n", "text = [introductory_clause[1][0]]\n" ], "metadata": { "id": "dCpN8kxq90Nd" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "sdf = spark.createDataFrame([text]).toDF(\"text\")\n" ], "metadata": { "id": "MuYI-6b9R_6Y" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "df = legal_pipeline.transform(sdf)" ], "metadata": { "id": "ElJ9SsHLeecQ" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "result = legal_pipeline.fullAnnotate(text)[0]\n", "result.keys()" ], "metadata": { "id": "-fPMtLPTAkZG" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "\n", "from johnsnowlabs import viz\n", "\n", "ner_viz = viz.NerVisualizer()\n", "\n", "ner_viz.display(result, label_col='ner_chunk')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 88 }, "id": "zTpAQEpfBErT", "outputId": "baca0029-fc13-462c-fdf4-f1a1ec9e1876" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "text/html": [ "\n", "\n", " THIS Lease Agreement DOC , is made and entered into this _____day of May, 2006 EFFDATE by and between Global, Inc PARTY., (hereinafter called \"Landlord ALIAS\"), and IMI Global, Inc PARTY., with a mailing address of ___, (hereinafter referred as \"Tenant ALIAS\")." 
] }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "##πŸ”Ž **Relation Extraction**" ], "metadata": { "id": "PlZ7u0u6Rmv0" } }, { "cell_type": "code", "source": [ "from sparknlp.pretrained import PretrainedPipeline\n", "pipeline = nlp.PretrainedPipeline(\"legpipe_re_contract_doc_parties_alias\", \"en\", \"legal/models\")\n" ], "metadata": { "id": "9ZPaL-LvygLk", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "a4b85d08-4580-4988-adac-675a481e3615" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "legpipe_re_contract_doc_parties_alias download started this may take some time.\n", "Approx size to download 868 MB\n", "[OK!]\n" ] } ] }, { "cell_type": "code", "source": [ "\n", "import pandas as pd\n", "\n", "def get_relations_df (results, col='relations'):\n", " \"\"\"Shows a Dataframe with the relations extracted by Spark NLP\"\"\"\n", " rel_pairs=[]\n", " for rel in results[0][col]:\n", " rel_pairs.append((\n", " rel.result, \n", " rel.metadata['entity1'], \n", " rel.metadata['entity1_begin'],\n", " rel.metadata['entity1_end'],\n", " rel.metadata['chunk1'], \n", " rel.metadata['entity2'],\n", " rel.metadata['entity2_begin'],\n", " rel.metadata['entity2_end'],\n", " rel.metadata['chunk2'], \n", " rel.metadata['confidence']\n", " ))\n", "\n", " rel_df = pd.DataFrame(rel_pairs, columns=['relation','entity1','entity1_begin','entity1_end','chunk1','entity2','entity2_begin','entity2_end','chunk2', 'confidence'])\n", "\n", " return rel_df" ], "metadata": { "id": "piGckxE9y1Kz" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "result = pipeline.fullAnnotate(text)\n" ], "metadata": { "id": "qsMxf9lkzNpK" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "rel_df = get_relations_df(result)\n", "\n", "rel_df[rel_df[\"relation\"] != \"other\"]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "sSBZXPRv0N23", "outputId": "71250052-9995-45f2-88a0-55f8184f8814" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " relation entity1 entity1_begin entity1_end chunk1 entity2 \\\n", "0 dated_as DOC 0 19 THIS Lease Agreement EFFDATE \n", "1 signed_by DOC 0 19 THIS Lease Agreement PARTY \n", "2 has_alias PARTY 90 100 Global, Inc ALIAS \n", "3 has_alias PARTY 141 155 IMI Global, Inc ALIAS \n", "\n", " entity2_begin entity2_end chunk2 confidence \n", "0 62 73 of May, 2006 0.9999546 \n", "1 90 100 Global, Inc 0.9911765 \n", "2 125 132 Landlord 0.9999889 \n", "3 216 221 Tenant 0.9999893 " ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
relationentity1entity1_beginentity1_endchunk1entity2entity2_beginentity2_endchunk2confidence
0dated_asDOC019THIS Lease AgreementEFFDATE6273of May, 20060.9999546
1signed_byDOC019THIS Lease AgreementPARTY90100Global, Inc0.9911765
2has_aliasPARTY90100Global, IncALIAS125132Landlord0.9999889
3has_aliasPARTY141155IMI Global, IncALIAS216221Tenant0.9999893
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 29 } ] }, { "cell_type": "markdown", "source": [ "##πŸ”Ž **Visualizing the results**" ], "metadata": { "id": "2kxnumyw30zT" } }, { "cell_type": "code", "source": [ "from sparknlp_display import RelationExtractionVisualizer\n", "\n", "re_vis = viz.RelationExtractionVisualizer()\n", "\n", "re_vis.display(result = result[0], relation_col = \"relations\", document_col = \"document\", exclude_relations = [\"other\"], show_relations=True)\n", " " ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 396 }, "id": "w878HnCu0Y-t", "outputId": "217b6ee0-f1f9-47d3-84c4-7610739d3132" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "text/html": [ "THIS Lease AgreementDOC,ismadeandenteredintothis_____dayof May, 2006EFFDATEbyandbetweenGlobal, IncPARTY.,(hereinaftercalled\"LandlordALIAS\"),andIMI Global, IncPARTY.,withamailingaddressof___,(hereinafterreferredas\"TenantALIAS\").signed_byhas_aliashas_aliasdated_as" ] }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "### **Let's delve deeper into the Pretrained pipelines we've used earlier and explore their inner workings.**\n", "\n", "### **This is where you can customize the pipelines for Relation Extraction and Named Entity Recognition models to refine the results.**" ], "metadata": { "id": "fdhc6LKQTUN7" } }, { "cell_type": "markdown", "source": [ "##πŸ”Ž 1. Relation Extraction" ], "metadata": { "id": "zzA4YosUJhPN" } }, { "cell_type": "markdown", "metadata": { "id": "URzfsQWD9dRD" }, "source": [ "#### Let's map the `relations` from the entities.\n", "\n", "**For more information look at the models hub page of the model**:\n", "\n", "\n", "\n", "https://nlp.johnsnowlabs.com/models?q=legre_contract_doc_parties&task=Relation+Extraction" ] }, { "cell_type": "code", "source": [ "documentAssembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", "\n", "textSplitter = legal.TextSplitter()\\\n", " .setInputCols([\"document\"])\\\n", " .setOutputCol(\"sentence\")\n", "\n", "tokenizer = nlp.Tokenizer()\\\n", " .setInputCols(\"sentence\")\\\n", " .setOutputCol(\"token\")\n", "\n", "embeddings = nlp.RoBertaEmbeddings.pretrained(\"roberta_embeddings_legal_roberta_base\", \"en\") \\\n", " .setInputCols(\"sentence\", \"token\") \\\n", " .setOutputCol(\"embeddings\")\\\n", "\n", "pos_tagger = nlp.PerceptronModel()\\\n", " .pretrained(\"pos_clinical\", \"en\", \"clinical/models\") \\\n", " .setInputCols([\"sentence\", \"token\"])\\\n", " .setOutputCol(\"pos_tags\")\n", " \n", "dependency_parser = nlp.DependencyParserModel()\\\n", " .pretrained(\"dependency_conllu\", \"en\")\\\n", " .setInputCols([\"sentence\", \"pos_tags\", \"token\"])\\\n", " .setOutputCol(\"dependencies\")\n", "\n", "ner_model = legal.NerModel.pretrained('legner_contract_doc_parties_lg', 'en', 'legal/models')\\\n", " .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n", " .setOutputCol(\"ner1\")\n", "\n", "ner_converter = nlp.NerConverter()\\\n", " .setInputCols([\"sentence\",\"token\",\"ner1\"])\\\n", " .setOutputCol(\"ner_chunks\")\n", "\n", "re_ner_chunk_filter = legal.RENerChunksFilter() \\\n", " .setInputCols([\"ner_chunks\", \"dependencies\"])\\\n", " .setOutputCol(\"re_ner_chunks\")\\\n", " .setMaxSyntacticDistance(7)\\\n", " .setRelationPairs([\"DOC-EFFDATE\", \"DOC-PARTY\", \"PARTY-FORMER_NAME\", \"ALIAS-PARTY\", \"PARTY-ALIAS\"])\n", "\n", "reDL = 
legal.RelationExtractionDLModel().pretrained('legre_contract_doc_parties_lg', 'en', 'legal/models')\\\n", " .setPredictionThreshold(0.5)\\\n", " .setInputCols([\"re_ner_chunks\", \"sentence\"])\\\n", " .setOutputCol(\"relations\")\n", " \n", "\n", "nlpPipeline = nlp.Pipeline(stages=[\n", " documentAssembler,\n", " textSplitter,\n", " tokenizer,\n", " embeddings,\n", " pos_tagger,\n", " dependency_parser,\n", " ner_model,\n", " ner_converter,\n", " re_ner_chunk_filter,\n", " reDL\n", "])" ], "metadata": { "id": "fdnmYNp3dKow" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "text = introductory_clause[1][0]\n", "empty_df = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(empty_df)\n", "sdf = spark.createDataFrame([[text]]).toDF(\"text\")\n", "\n", "res = model.transform(sdf)\n", "res.show(20,truncate=False)" ], "metadata": { "id": "gY600Xx17Yuk" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "import pyspark.sql.functions as F\n", "\n", "result_df = res.select(F.explode(F.arrays_zip(res.relations.result, \n", " res.relations.metadata)).alias(\"cols\")) \\\n", " .select(\n", " F.expr(\"cols['0']\").alias(\"relations\"),\\\n", " F.expr(\"cols['1']['entity1']\").alias(\"relations_entity1\"),\\\n", " F.expr(\"cols['1']['chunk1']\" ).alias(\"relations_chunk1\" ),\\\n", " F.expr(\"cols['1']['entity2']\").alias(\"relations_entity2\"),\\\n", " F.expr(\"cols['1']['chunk2']\" ).alias(\"relations_chunk2\" ),\\\n", " F.expr(\"cols['1']['confidence']\" ).alias(\"confidence\" ),\\\n", " F.expr(\"cols['1']['syntactic_distance']\" ).alias(\"syntactic_distance\" ),\\\n", " ).filter(\"relations!='other'\")\n", "\n", "result_df.show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "m4qIw1JP7bIL", "outputId": "63bab477-eb43-40bd-8301-7108d3f03d43" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "+---------+-----------------+--------------------+-----------------+----------------+----------+------------------+\n", "|relations|relations_entity1| relations_chunk1|relations_entity2|relations_chunk2|confidence|syntactic_distance|\n", "+---------+-----------------+--------------------+-----------------+----------------+----------+------------------+\n", "| dated_as| DOC|THIS Lease Agreement| EFFDATE| of May, 2006| 0.9999546| 6|\n", "|signed_by| DOC|THIS Lease Agreement| PARTY| Global, Inc| 0.9911765| 7|\n", "|has_alias| PARTY| Global, Inc| ALIAS| Landlord| 0.9999889| 4|\n", "|has_alias| PARTY| IMI Global, Inc| ALIAS| Tenant| 0.9999893| 4|\n", "+---------+-----------------+--------------------+-----------------+----------------+----------+------------------+\n", "\n" ] } ] }, { "cell_type": "markdown", "source": [ "##πŸ”Ž Visualizing the results" ], "metadata": { "id": "qZK35fZZ7QXk" } }, { "cell_type": "code", "source": [ "light_model = nlp.LightPipeline(model)\n", "\n", "result = light_model.fullAnnotate(text)" ], "metadata": { "id": "e5p-wKN5gqNP" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# from sparknlp_display import RelationExtractionVisualizer\n", "\n", "re_vis = viz.RelationExtractionVisualizer()\n", "\n", "re_vis.display(result = result[0],\n", " relation_col = \"relations\",\n", " document_col = \"document\",\n", " exclude_relations = [\"no_rel\"],\n", " show_relations=True\n", " )" ], "metadata": { "id": "7Uc-Ivbmqce4", "colab": { "base_uri": "https://localhost:8080/", "height": 396 }, "outputId": 
"3d97c446-5265-44f6-843b-0d15223513d6" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "text/html": [ "THIS Lease AgreementDOC,ismadeandenteredintothis_____dayof May, 2006EFFDATEbyandbetweenGlobal, IncPARTY.,(hereinaftercalled\"LandlordALIAS\"),andIMI Global, IncPARTY.,withamailingaddressof___,(hereinafterreferredas\"TenantALIAS\").signed_byhas_aliashas_aliasdated_as" ] }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "##πŸ”Ž 2. Named Entity Recognition" ], "metadata": { "id": "8XYjtnp2XdaZ" } }, { "cell_type": "code", "source": [ "import json\n", "alias = {\n", " \"entity\": \"ALIAS\",\n", " \"ruleScope\": \"document\", \n", " \"completeMatchRegex\": \"true\",\n", " \"regex\":'\".*?\"',\n", " \"matchScope\": \"sub-token\",\n", " \"contextLength\": 100\n", "}\n", "\n", "with open('alias.json', 'w') as f:\n", " json.dump(alias, f)\n", " \n", "alias_2 = {\n", " \"entity\": \"ALIAS\",\n", " \"ruleScope\": \"document\", \n", " \"completeMatchRegex\": \"true\",\n", " \"regex\":'\\(\"(.*?)\"\\)',\n", " \"matchScope\": \"sub-token\",\n", " \"contextLength\": 100\n", "}\n", "\n", "with open('alias_2.json', 'w') as f:\n", " json.dump(alias_2, f)" ], "metadata": { "id": "ORdABz-qXhla" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "documentAssembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", "\n", "textSplitter = legal.TextSplitter()\\\n", " .setInputCols([\"document\"])\\\n", " .setOutputCol(\"sentence\")\n", "\n", "tokenizer = nlp.Tokenizer()\\\n", " .setInputCols(\"sentence\")\\\n", " .setOutputCol(\"token\")\n", "\n", "alias_parser = legal.ContextualParserApproach() \\\n", " .setInputCols([\"document\", \"token\"]) \\\n", " .setOutputCol(\"subheader\")\\\n", " .setJsonPath(\"alias.json\") \\\n", " .setPrefixAndSuffixMatch(False)\\\n", " .setOptionalContextRules(True)\\\n", " .setCaseSensitive(False)\n", "\n", "alias_parser2 = legal.ContextualParserApproach() \\\n", " .setInputCols([\"document\", \"token\"]) \\\n", " .setOutputCol(\"subheader2\")\\\n", " .setJsonPath(\"alias_2.json\") \\\n", " .setCaseSensitive(True) \\\n", " .setPrefixAndSuffixMatch(False)\\\n", " .setOptionalContextRules(False)\n", "\n", "embeddings = nlp.RoBertaEmbeddings.pretrained(\"roberta_embeddings_legal_roberta_base\", \"en\") \\\n", " .setInputCols(\"sentence\", \"token\") \\\n", " .setOutputCol(\"embeddings\")\\\n", "\n", "ner_model = legal.NerModel.pretrained('legner_contract_doc_parties_lg', 'en', 'legal/models')\\\n", " .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n", " .setOutputCol(\"ner\")\n", "\n", "ner_converter = legal.NerConverterInternal()\\\n", " .setInputCols([\"sentence\",\"token\",\"ner\"])\\\n", " .setThreshold(0.7)\\\n", " .setOutputCol(\"ner_chunk\")\n", "\n", "zero_shot_ner = legal.ZeroShotNerModel.pretrained(\"legner_roberta_zeroshot\", \"en\", \"legal/models\")\\\n", " .setInputCols([\"sentence\", \"token\"])\\\n", " .setOutputCol(\"zero_shot_ner\")\\\n", " .setPredictionThreshold(0.3)\\\n", " .setEntityDefinitions(\n", " {\n", " \n", " \"PARTY\": [\"which Inc?\", \"Which Ltd?\",\"Which company?\",\"Which party?\"],\n", " \"EFFDATE\": [\"What is the date?\"],\n", " \"ALIAS\": [\"Where is the location?\",\"What Aliases are used to refer to the PARTY?\",\"What Aliases are used to refer to the effdate?\",\"What Aliases are used to refer to the DOC?\"],\n", " \"FORMER_NAME\": ['Formerly known as?'],\n", " \"ADDRESS\":[\"What is the 
full location?\",\"where is the address?\",\"Where is the principal location of business?\"],\n", " \"DOC\":[\"What agreement?\"]\n", " \n", " })\n", "\n", "\n", "ner_converter_zeroshot = legal.NerConverterInternal()\\\n", " .setInputCols([\"sentence\", \"token\", \"zero_shot_ner\"])\\\n", " .setOutputCol(\"ner_chunk_zeroshot\")\\\n", " .setGreedyMode(True)\n", "\n", "chunk_merger = legal.ChunkMergeApproach()\\\n", " .setInputCols(\"ner_chunk\", \"ner_chunk_zeroshot\", \"subheader\", \"subheader2\")\\\n", " .setOutputCol('merged_ner_chunks')\n", " \n", "nlpPipeline = nlp.Pipeline(stages=[\n", " documentAssembler,\n", " textSplitter,\n", " tokenizer,\n", " alias_parser,\n", " alias_parser2, \n", " embeddings,\n", " ner_model,\n", " ner_converter,\n", " zero_shot_ner,\n", " ner_converter_zeroshot,\n", " chunk_merger\n", "])" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9m2NrNAbUp92", "outputId": "8a55f1e6-c29b-4666-9e37-b503bb5461da" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "roberta_embeddings_legal_roberta_base download started this may take some time.\n", "Approximate size to download 447.2 MB\n", "[OK!]\n", "legner_contract_doc_parties_lg download started this may take some time.\n", "[OK!]\n", "legner_roberta_zeroshot download started this may take some time.\n", "[OK!]\n" ] } ] }, { "cell_type": "code", "source": [ "from pyspark.sql.types import StructType,StructField, StringType\n", "\n", "p_model = nlpPipeline.fit(spark.createDataFrame([[\"\"]]).toDF(\"text\"))\n", "\n", "lp = nlp.LightPipeline(p_model)\n", "\n", "# from sparknlp_display import NerVisualizer\n", "\n", "visualiser = nlp.viz.NerVisualizer()\n", "lp_res_1 = lp.fullAnnotate(text)\n", "visualiser.display(lp_res_1[0], label_col='merged_ner_chunks', document_col='document')\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 88 }, "id": "hsUrLXI4X8Vm", "outputId": "027513d4-675f-41ed-af41-21867db9ef02" }, "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "text/html": [ "\n", "\n", " THIS Lease Agreement DOC , is made and entered into this _____day of May, 2006 EFFDATE by and between Global, Inc PARTY., (hereinafter called \"Landlord\" ALIAS), and IMI Global, Inc PARTY., with a mailing address of ___ ADDRESS, (hereinafter referred as \"Tenant\" ALIAS)." 
] }, "metadata": {} } ] }, { "cell_type": "code", "source": [ "from pyspark.sql import functions as F\n", "df = spark.createDataFrame([[text]]).toDF(\"text\")\n", "\n", "result = p_model.transform(df)\n", "\n", "result.select(F.explode(F.arrays_zip(result.merged_ner_chunks.result, result.merged_ner_chunks.metadata)).alias(\"cols\")) \\\n", " .select(F.expr(\"cols['0']\").alias(\"chunk\"),\n", " F.expr(\"cols['1']['entity']\").alias(\"ner_label\")).show(truncate=False)\n", "\n", "\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "r-FMDE-1X7xr", "outputId": "e32009b6-cd3e-4c0d-99da-076fa773edd3" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "+----------------------+---------+\n", "|chunk |ner_label|\n", "+----------------------+---------+\n", "|THIS Lease Agreement |DOC |\n", "|of May, 2006 |EFFDATE |\n", "|Global, Inc |PARTY |\n", "|\"Landlord\" |ALIAS |\n", "|IMI Global, Inc |PARTY |\n", "|mailing address of ___|ADDRESS |\n", "|\"Tenant\" |ALIAS |\n", "+----------------------+---------+\n", "\n" ] } ] }, { "cell_type": "code", "source": [], "metadata": { "id": "VMoRCUQQYX6X" }, "execution_count": null, "outputs": [] } ] }