{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "S6jlakvDCElY" }, "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "id": "9SnV8tbpW-9Z" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/05.2.Clause_based_NER.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "CUdOMUOQYJNe" }, "source": [ "#🚀 Legal NLP" ] }, { "cell_type": "markdown", "metadata": { "id": "focTiQUhYJNg" }, "source": [ "In this notebook, you will learn how to use Spark NLP and Legal NLP to identify relevant entities in legal texts using our state-of-the-art Named-Entity Recognition (NER) models and the recent Zero-Shot models.\n", "\n", "We will cover the full analysis cycle, from reading a document in PDF format, extracting its text contents, classifying its sections and applying NER models to specific sections.\n", "\n", "Let's dive in!" ] }, { "cell_type": "markdown", "metadata": { "id": "zRtbHsClVSuO" }, "source": [ "##📜 Introduction" ] }, { "cell_type": "markdown", "metadata": { "id": "f7iaSrydcTtG" }, "source": [ "###🔎 Classification models" ] }, { "cell_type": "markdown", "metadata": { "id": "tm_7Km89WsoY" }, "source": [ "📚For the text classification tasks, we will use two annotators:\n", "\n", "- `ClassifierDL`: uses the state-of-the-art Universal Sentence Encoder as input to a deep learning model (DNN) built with TensorFlow that supports `Binary Classification` and `Multiclass Classification` (up to 100 classes).\n", "- `MultiClassifierDL`: `Multilabel Classification` (can predict more than one class for each text) using a Bidirectional GRU with Convolution architecture built with TensorFlow that supports up to 100 classes. 
The inputs are Sentence Embeddings such as state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings or SentenceEmbeddings.\n", "\n", "In Legal NLP, since the number of classes can be very high (over 250) and the texts could belong to more than one topic at the same time (multilabel problem), we pretrained several binary classifiers (yes / no) for many clause types in legal documents that can be used independently.\n", "\n", "You can select the topics you are interested in (for example, loans and fiscal-year clauses) and create a pipeline with both of them to detect those types of clauses in your paragraphs.\n", "\n", "As a reminder, since the models are independent and the task is multilabel, you may sometimes get positive results for more than one class (e.g., a paragraph that discusses loans and fiscal year at the same time).\n", "\n", "As an alternative, we also have `MultiClassifierDL`, which predicts many clause types in one model. The choice between the binary classifiers and the multilabel model depends on the document types, and you should experiment to verify the accuracy of the models on texts that differ substantially from the training data (CUAD dataset, SEC sample documents, etc.)." ] }, { "cell_type": "markdown", "metadata": { "id": "rOX_Seu9cVk6" }, "source": [ "📚Example Classification models:\n", "\n", "| title | language | predicted_entities | compatible_editions |\n", "|:----------------------------------------------------------|:-----------|:-------------------|:-------------------------------|\n", "| Human Rights Articles Classification | en | ['Artículo 1. Obligación de Respetar los Derechos', 'Artículo 2. Deber de Adoptar Disposiciones de Derecho Interno', 'Artículo 3. Derecho al Reconocimiento de la Personalidad Jurídica', 'Artículo 4. Derecho a la Vida', 'Artículo 5. Derecho a la Integridad Personal', 'Artículo 6. Prohibición de la Esclavitud y Servidumbre', 'Artículo 7. Derecho a la Libertad Personal', 'Artículo 8. 
Garantías Judiciales', 'Artículo 9. Principio de Legalidad y de Retroactividad', 'Artículo 11. Protección de la Honra y de la Dignidad', 'Artículo 12. Libertad de Conciencia y de Religión', 'Artículo 13. Libertad de Pensamiento y de Expresión', 'Artículo 14. Derecho de Rectificación o Respuesta', 'Artículo 15. Derecho de Reunión', 'Artículo 16. Libertad de Asociación', 'Artículo 17. Protección a la Familia', 'Artículo 18. Derecho al Nombre', 'Artículo 19. Derechos del Niño', 'Artículo 20. Derecho a la Nacionalidad', 'Artículo 22. Derecho de Circulación y de Residencia', 'Artículo 23. Derechos Políticos', 'Artículo 24. Igualdad ante la Ley', 'Artículo 25. Protección Judicial', 'Artículo 26. Desarrollo Progresivo', 'Artículo 27. Suspensión de Garantías', 'Artículo 28. Cláusula Federal', 'Artículo 21. Derecho a la Propiedad Privada', 'Artículo 29. Normas de Interpretación', 'Artículo 30. Alcance de las Restricciones', 'Artículo 63.1 Reparaciones'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Legal Absence of certain changes Clause Binary Classifier | en | ['other', 'absence-of-certain-changes'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Legal Acceleration Clause Binary Classifier | en | ['other', 'acceleration'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Legal Access Clause Binary Classifier | en | ['other', 'access'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Legal Accounting terms Clause Binary Classifier | en | ['other', 'accounting-terms'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Legal Adjustments Clause Binary Classifier | en | ['other', 'adjustments'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Legal Agreements Clause Binary Classifier | en | ['other', 'agreements'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Legal Amendments Clause Binary Classifier | en | ['other', 'amendments'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Legal Application of proceeds Clause Binary Classifier | en | ['other', 'application-of-proceeds'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| 
Conventions Classification | es | ['Convención sobre la Eliminación de todas las formas de Discriminación contra la Mujer', 'Convención sobre los Derechos de las Personas con Discapacidad', 'Convención Internacional Sobre la Eliminación de Todas las Formas de Discriminación Racial', 'Convención Internacional sobre la Protección de los Derechos de todos los Trabajadores Migratorios y de sus Familias', 'Convención de los Derechos del Niño', 'Pacto Internacional de Derechos Civiles y Políticos'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "\n", "\n", "For a complete list, check the [NLP Models Hub](https://nlp.johnsnowlabs.com/models?edition=Legal+NLP&type=model&task=Text+Classification)." ] }, { "cell_type": "markdown", "metadata": { "id": "X-N8mjwywzPn" }, "source": [ "###🔎 NER models" ] }, { "cell_type": "markdown", "metadata": { "id": "wzDnEztPYJNh" }, "source": [ "Named-Entity Recognition (NER) is the capability to automatically identify relevant entities in the text. Examples include person names, company names, ticker symbols of public companies, quantities, etc. There are many ways to implement NER, but nowadays the most efficient one is to use models based on deep learning.\n", "\n", "The deep neural network architecture of the NER model in Spark NLP is the BiLSTM-CNN-Char framework, a slightly modified version of the architecture proposed by Jason PC Chiu and Eric Nichols ([Named Entity Recognition with Bidirectional LSTM-CNNs](https://arxiv.org/abs/1511.08308)). It is a neural network architecture that automatically detects word and character-level features using a hybrid bidirectional LSTM and CNN architecture, eliminating the need for most feature engineering steps. This model is implemented in our `NerDL`/`NerModel` annotators that we will experiment with in this section.\n", "\n", "At John Snow Labs, we are proud to have a library of state-of-the-art pretrained, out-of-the-box, NLP models. 
With our newer package Legal NLP it is no different, and we currently support more than 580 models fine-tuned for the legal domain. For NER specifically, we currently have more than 40 models that can identify entities for different business needs.\n", "\n", "📚Example NER models:\n", "\n", "| title | language | predicted_entities | compatible_editions |\n", "|:----------------------------------------------------|:-----------|:----------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------|\n", "| NER on Legal Texts (CUAD, Silver corpus) | en | ['PERSON', 'LAW', 'PARTY', 'EFFDATE', 'LOC', 'DATE', 'DOC', 'ORDINAL', 'ROLE', 'PERCENT', 'ORG'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Generic Deidentification NER | en | ['AGE', 'CITY', 'COUNTRY', 'DATE', 'EMAIL', 'FAX', 'LOCATION-OTHER', 'ORG', 'PERSON', 'PHONE', 'PROFESSION', 'STATE', 'STREET', 'URL', 'ZIP'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Legal NER - License / Permission Clauses (Bert, sm) | en | ['PERMISSION', 'PERMISSION_SUBJECT', 'PERMISSION_OBJECT', 'PERMISSION_INDIRECT_OBJECT'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Legal NER (Headers / Subheaders) | en | ['HEADER', 'SUBHEADER'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Legal NER - Whereas Clauses (sm) | en | ['WHEREAS_SUBJECT', 'WHEREAS_OBJECT', 'WHEREAS_ACTION'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Legal NER (Parties, Dates, Document Type - sm) | en | ['PARTY', 'EFFDATE', 'DOC', 'ALIAS'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Legal NER (Headers / Subheaders) | en | ['SIGNING_TITLE', 'SIGNING_PERSON', 'PARTY'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Legal ORG, PRODUCT and ALIAS NER (small) | en | ['ORG', 'PROD', 'ALIAS'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "| Legal NER Obligations on Agreements | en | ['OBLIGATION_SUBJECT', 'OBLIGATION_ACTION', 'OBLIGATION', 'OBLIGATION_INDIRECT_OBJECT'] | ['Legal NLP 1.0', 'Legal NLP'] |\n", 
"| Legal Zero-shot NER | en | [] | ['Legal NLP 1.0', 'Legal NLP'] |\n", "\n", "\n", "For the complete list, check the [NLP Models Hub](https://nlp.johnsnowlabs.com/models?edition=Legal+NLP&type=model&task=Named+Entity+Recognition)." ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "id": "eJj6G_uGqmvW" }, "source": [ "##🎬 Installation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9sGctuAVqmvW", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "! pip install -q johnsnowlabs" ] }, { "cell_type": "markdown", "metadata": { "id": "70t7zXI2qmvX" }, "source": [ "###🔗 Automatic Installation\n", "Using [my.johnsnowlabs.com](https://my.johnsnowlabs.com/) SSO" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "j3x9f6apqmvX", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "from johnsnowlabs import nlp, legal\n", "\n", "# nlp.install(force_browser=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "5xgiQgV3qmvX" }, "source": [ "###🔗 Manual downloading\n", "If you are not registered in my.johnsnowlabs.com, you received a license via e-email or you are using Safari, you may need to do a manual update of the license.\n", "\n", "- Go to [my.johnsnowlabs.com](https://my.johnsnowlabs.com/)\n", "- Download your license\n", "- Upload it using the following command" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EsX5CDwsqmvX" }, "outputs": [], "source": [ "from google.colab import files\n", "print('Please Upload your John Snow Labs License using the button below')\n", "license_keys = files.upload()" ] }, { "cell_type": "markdown", "metadata": { "id": "K9GLVA35qmvX" }, "source": [ "- Install it" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "B0ZiE5poqmvX" }, "outputs": [], "source": [ "nlp.install()" ] }, { "cell_type": "markdown", "metadata": { "id": "09fafa4b-cf69-4556-ae50-adb9fc6f4368" }, "source": [ "###📌 Start Spark 
Session" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "_uTrkyPv0rk2", "outputId": "8d01a875-8e4c-41de-c546-ff36ba9bcf4a", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7187 (3).json\n", "👌 Launched \u001b[92mcpu optimized\u001b[39m session with with: 🚀Spark-NLP==4.3.0, 💊Spark-Healthcare==4.3.0, running on ⚡ PySpark==3.1.2\n" ] } ], "source": [ "from johnsnowlabs import nlp, legal, viz\n", "# Automatically load license data and start a session with all jars user has access to\n", "spark = nlp.start()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "Xqulsi7i_Tx4" }, "outputs": [], "source": [ "from pyspark.sql import DataFrame\n", "import pyspark.sql.functions as F\n", "import pyspark.sql.types as T\n", "import pyspark.sql as SQL\n", "from pyspark import keyword_only" ] }, { "cell_type": "markdown", "metadata": { "id": "2vTlIcIKiNKs" }, "source": [ "##🚨 Application: Identify Entities in a Credit Agreement Document" ] }, { "cell_type": "markdown", "metadata": { "id": "qdhHHo8vb9Jc" }, "source": [ "Getting an example agreement document, which we will use throughout this notebook to exemplify the real-world usage of our models." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "na1JVtCQ6pN5" }, "outputs": [], "source": [ "! 
wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/credit_agreement.txt" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "yIfRPXss6wfk", "outputId": "44834132-5f64-45f3-f94a-88bcf01e5d65" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "\n", "\n", " Exhibit 10.1\n", "\n", " EXECUTION COPY\n", "\n", " $225,000,000.00 REVOLVING CREDIT FACILITY\n", "\n", " CREDIT AGREEMENT\n", "\n", " by and among\n", "\n", " P.H. GLATFELTER COMPANY\n", "\n", " and\n", "\n", " Certain of its Subsidiaries, as Borrowers\n", "\n", " and\n", "\n", " THE BANKS PARTY HERETO, as Lenders\n", "\n", " and\n", "\n", " PNC BANK, NATIONAL ASSOCIATION, as Administrative Agent\n", "\n", " with\n", "\n", " PNC CAPITAL MARKETS LLC and CITIZENS BANK OF PENNSYLVANIA,\n", "\n", " as Joint Lead Arrangers and Joint Bookrunners\n", "\n", " and\n", "\n", " CITIZENS BANK OF PENNSYLVANIA, as Syndication Agent\n", "\n", " Dated as of April 29, 2010\n", "\n", "\n", "\n", " TABLE OF CONTENTS\n", "\n", "\n", "\n", "Section Page\n", "------- ----\n", " \n", "1. CERTAIN DEFINITIONS.................................................. 1\n", " 1.1 Certain Definitions............................................ 1\n", " 1.2 Construction................................................... 28\n", " 1.2.1 Number; Inclusion...................................... 28\n", " 1.2.2 Determination.......................................... 28\n", " 1.2.3 Administrative Agent's Discretion and Consent.......... 28\n", " 1.2.4 Documents Taken as a Whole............................. 28\n", " 1.2.5 Headings............................................... 29\n", " 1.2.6 Implied References to this Agreement................... 29\n", " 1.2.7 Persons................................................ 29\n", " 1.2.8 Modifications to Documents............................. 
29\n", " 1.2.9 From, To and Through................................... 29\n", " 1.2.10 Shall; Will............................................ 29\n", " 1.2.11 Quebec Matters......................................... 29\n", " 1.3 Accounting Principles.......................................... 30\n", "2. REVOLVING CREDIT AND SWING LOAN FACILITIES........................... 31\n", " 2.1 Revolving Credit Commitments................................... 31\n", " 2.1.1 Revolving Credit Loans................................. 31\n", " 2.1.2 Swing Loan Commitment.................................. 33\n", " 2.2 Nature of Lenders' Obligations with Respect to Revolving Credit\n", " Loans.......................................................... 33\n", " 2.3 Commitment Fees................................................ 33\n", " 2.4 Revolving Credit Loan Requests................................. 34\n", " 2.4.1 Revolving Credit Loan Requests......................... 34\n", " 2.4.2 Swing Loan Requests.................................... 34\n", " 2.5 Making Revolving Credit Loans and Swing Loans; Revolving Credit\n", " Notes and Swing Notes.......................................... 35\n", " 2.5.1 Making Revolving Credit Loans.......................... 35\n", " 2.5.2 Making Swing Loans..................................... 35\n", " 2.6 Revolving Credit Notes......................................... 35\n", " 2.7 Swing Loan Note................................................ 35\n", " 2.8 Borrowings to Repay Swing Loans................................ 36\n", " 2.9 Utilization of Commitments in Optional Currencies.............. 36\n", " 2.9.1 Periodic Computations of Dollar Equivalent Amounts of\n", " Revolving Credit Loans and Letters of Credit\n", " Outstanding............................................ 36\n", " 2.9.2 Notices From Lenders That Optional Currencies Are\n", " Unavailable to Fund New Loans.......................... 
36\n", " 2.9.3 Notices From Lenders That Optional Currencies Are\n", " Unavailable to Fund Renewals of the Euro-Rate Option... 37\n", " 2.9.4 European Monetary Union................................ 37\n", "\n", "\n", "\n", " -i-\n", "\n", "\n", "\n", " \n" ] } ], "source": [ "credit_agreement = open(\"credit_agreement.txt\", \"r\", encoding=\"utf8\").read()\n", "\n", "# First page - note the \"-i-\" at the end\n", "print(credit_agreement[:4650])" ] }, { "cell_type": "markdown", "metadata": { "id": "1yHY_akQCaLY" }, "source": [ "###✔️ Splitting the document by pages" ] }, { "cell_type": "markdown", "metadata": { "id": "EwqF4QWJCcvN" }, "source": [ "Sometimes, pages have patterns which tell you how to split them. In our case, the page number appears at the bottom of each page.\n", "\n", "📚Always look for signals that mark page boundaries. Patterns you can usually find at the bottom of a page:\n", "- Bottom placeholders\n", "- Names of people\n", "- Name of the document\n", "- Other footer information\n", "- etc." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "5GY8C6p97JOy" }, "outputs": [], "source": [ "document_assembler = (\n", " nlp.DocumentAssembler().setInputCol(\"text\").setOutputCol(\"document\")\n", ")\n", "\n", "text_splitter = (\n", " legal.TextSplitter()\n", " .setInputCols([\"document\"])\n", " .setOutputCol(\"pages\")\n", " .setCustomBounds([\"\n+\s*[0-9]+\s*\n+\", \"[-][iv]+[-]\"])\n", " .setUseCustomBoundsOnly(True)\n", " .setExplodeSentences(True)\n", ")\n", "\n", "page_splitting_pipeline = nlp.Pipeline(stages=[document_assembler, text_splitter])" ] }, { "cell_type": "markdown", "metadata": { "id": "YjgcOcLcCjq5" }, "source": [ "📜**Explanation:**\n", "\n", "- `.setCustomBounds([\"\n+\s*[0-9]+\s*\n+\", \"[-][iv]+[-]\"])` sets an array of regular expression(s) to tell the annotator how to split the document. 
The first regular expression matches the page numbers of the document, and the second matches the front-matter page numbers written as Roman numerals enclosed in dashes (-i-, -ii-, etc.). We manually checked that these only go up to `-viii-`; more Roman numeral characters could be added to the pattern if needed.\n", "- `.setUseCustomBoundsOnly(True)` tells the annotator to split only on the custom bounds; by default, TextSplitter splits using its default regex ('\\n', ...), which we want to ignore here.\n", "- `.setExplodeSentences(True)` creates one new row in the dataframe per split, instead of an array containing the splits.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "bDQjXSMHCh7T" }, "outputs": [], "source": [ "sdf = spark.createDataFrame([[ credit_agreement ]]).toDF(\"text\")\n", "\n", "page_splitter_model = page_splitting_pipeline.fit(sdf)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "SFGPz96QCh5b", "outputId": "8172415a-2ecb-479b-c28f-f349cf4d8e71" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "+--------------------+\n", "| pages|\n", "+--------------------+\n", "|[{document, 70, 4...|\n", "|[{document, 4678,...|\n", "|[{document, 8160,...|\n", "|[{document, 11801...|\n", "|[{document, 15298...|\n", "|[{document, 18938...|\n", "|[{document, 22461...|\n", "|[{document, 25295...|\n", "|[{document, 26399...|\n", "|[{document, 31821...|\n", "|[{document, 34560...|\n", "|[{document, 37458...|\n", "|[{document, 40734...|\n", "|[{document, 43945...|\n", "|[{document, 46511...|\n", "|[{document, 49420...|\n", "|[{document, 53065...|\n", "|[{document, 56539...|\n", "|[{document, 59694...|\n", "|[{document, 62708...|\n", "+--------------------+\n", "only showing top 20 rows\n", "\n", "CPU times: user 101 ms, sys: 6.2 ms, total: 107 ms\n", "Wall time: 11.9 s\n" ] } ], "source": [ "%%time\n", "\n", "# transform: runs inference with the fitted pipeline\n", "res = page_splitter_model.transform(sdf)\n", 
"\n", "# by selecting/showing/collecting the operations are performed\n", "res.select('pages').show()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JD5fnr7o7I88", "outputId": "6db1bc39-2abf-4800-d088-2794c1a36ac5" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Exhibit 10.1\n", "\n", " EXECUTION COPY\n", "\n", " $225,000,000.00 REVOLVING CREDIT FACILITY\n", "\n", " CREDIT AGREEMENT\n", "\n", " by and among\n", "\n", " P.H. GLATFELTER COMPANY\n", "\n", " and\n", "\n", " Certain of its Subsidiaries, as Borrowers\n", "\n", " and\n", "\n", " THE BANKS PARTY HERETO, as Lenders\n", "\n", " and\n", "\n", " PNC BANK, NATIONAL ASSOCIATION, as Administrative Agent\n", "\n", " with\n", "\n", " PNC CAPITAL MARKETS LLC and CITIZENS BANK OF PENNSYLVANIA,\n", "\n", " as Joint Lead Arrangers and Joint Bookrunners\n", "\n", " and\n", "\n", " CITIZENS BANK OF PENNSYLVANIA, as Syndication Agent\n", "\n", " Dated as of April 29, 2010\n", "\n", "\n", "\n", " TABLE OF CONTENTS\n", "\n", "\n", "\n", "Section Page\n", "------- ----\n", " \n", "1. CERTAIN DEFINITIONS.................................................. 1\n", " 1.1 Certain Definitions............................................ 1\n", " 1.2 Construction................................................... 28\n", " 1.2.1 Number; Inclusion...................................... 28\n", " 1.2.2 Determination.......................................... 28\n", " 1.2.3 Administrative Agent's Discretion and Consent.......... 28\n", " 1.2.4 Documents Taken as a Whole............................. 28\n", " 1.2.5 Headings............................................... 29\n", " 1.2.6 Implied References to this Agreement................... 29\n", " 1.2.7 Persons................................................ 29\n", " 1.2.8 Modifications to Documents............................. 
29\n", " 1.2.9 From, To and Through................................... 29\n", " 1.2.10 Shall; Will............................................ 29\n", " 1.2.11 Quebec Matters......................................... 29\n", " 1.3 Accounting Principles.......................................... 30\n", "2. REVOLVING CREDIT AND SWING LOAN FACILITIES........................... 31\n", " 2.1 Revolving Credit Commitments................................... 31\n", " 2.1.1 Revolving Credit Loans................................. 31\n", " 2.1.2 Swing Loan Commitment.................................. 33\n", " 2.2 Nature of Lenders' Obligations with Respect to Revolving Credit\n", " Loans.......................................................... 33\n", " 2.3 Commitment Fees................................................ 33\n", " 2.4 Revolving Credit Loan Requests................................. 34\n", " 2.4.1 Revolving Credit Loan Requests......................... 34\n", " 2.4.2 Swing Loan Requests.................................... 34\n", " 2.5 Making Revolving Credit Loans and Swing Loans; Revolving Credit\n", " Notes and Swing Notes.......................................... 35\n", " 2.5.1 Making Revolving Credit Loans.......................... 35\n", " 2.5.2 Making Swing Loans..................................... 35\n", " 2.6 Revolving Credit Notes......................................... 35\n", " 2.7 Swing Loan Note................................................ 35\n", " 2.8 Borrowings to Repay Swing Loans................................ 36\n", " 2.9 Utilization of Commitments in Optional Currencies.............. 36\n", " 2.9.1 Periodic Computations of Dollar Equivalent Amounts of\n", " Revolving Credit Loans and Letters of Credit\n", " Outstanding............................................ 36\n", " 2.9.2 Notices From Lenders That Optional Currencies Are\n", " Unavailable to Fund New Loans.......................... 
36\n", " 2.9.3 Notices From Lenders That Optional Currencies Are\n", " Unavailable to Fund Renewals of the Euro-Rate Option... 37\n", " 2.9.4 European Monetary Union................................ 37\n" ] } ], "source": [ "# Checking the first page\n", "print(res.select('pages.result').take(1)[0].result[0])" ] }, { "cell_type": "markdown", "metadata": { "id": "NUe1ybTz9FMX" }, "source": [ "Let's keep the pages in a new data frame." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "jgqSVtRe9Jv1", "outputId": "52b1d5fa-4f32-4de2-f497-4819448576b3" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "+--------------------+\n", "| page|\n", "+--------------------+\n", "|Exhibit 10.1\n", "\n", " ...|\n", "|TABLE OF CONTENTS...|\n", "|TABLE OF CONTENTS...|\n", "|TABLE OF CONTENTS...|\n", "|TABLE OF CONTENTS...|\n", "|TABLE OF CONTENTS...|\n", "|TABLE OF CONTENTS...|\n", "|LIST OF SCHEDULES...|\n", "|CREDIT AGREEMENT\n", "...|\n", "|AUGMENTING LENDER...|\n", "|BUSINESS DAY shal...|\n", "|COMPLIANCE CERTIF...|\n", "|immediately prece...|\n", "|DECLINED SHARE sh...|\n", "|or directives iss...|\n", "|EURO-RATE shall m...|\n", "|Administrative Ag...|\n", "|EURO-RATE OPTION ...|\n", "|rate as quoted by...|\n", "|XXXXXXX NOTE shal...|\n", "+--------------------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "pages = res.select(F.expr(\"pages.result[0] as page\"))\n", "pages.show()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "zhRdfxhj9mR6", "outputId": "8516cb2d-4d31-4f12-ccb1-56d7400f0930" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "|page |\n", "|Exhibit 10.1\n", "\n", " EXECUTION COPY\n", "\n", " $225,000,000.00 REVOLVING CREDIT FACILITY\n", "\n", " CREDIT AGREEMENT\n", "\n", " by and among\n", "\n", " P.H. 
GLATFELTER COMPANY\n", "\n", " and\n", "\n", " Certain of its Subsidiaries, as Borrowers\n", "\n", " and\n", "\n", " THE BANKS PARTY HERETO, as Lenders\n", "\n", " and\n", "\n", " PNC BANK, NATIONAL ASSOCIATION, as Administrative Agent\n", "\n", " with\n", "\n", " PNC CAPITAL MARKETS LLC and CITIZENS BANK OF PENNSYLVANIA,\n", "\n", " as Joint Lead Arrangers and Joint Bookrunners\n", "\n", " and\n", "\n", " CITIZENS BANK OF PENNSYLVANIA, as Syndication Agent\n", "\n", " Dated as of April 29, 2010\n", "\n", "\n", "\n", " TABLE OF CONTENTS\n", "\n", "\n", "\n", "Section Page\n", "------- ----\n", " \n", "1. CERTAIN DEFINITIONS.................................................. 1\n", " 1.1 Certain Definitions............................................ 1\n", " 1.2 Construction................................................... 28\n", " 1.2.1 Number; Inclusion...................................... 28\n", " 1.2.2 Determination.......................................... 28\n", " 1.2.3 Administrative Agent's Discretion and Consent.......... 28\n", " 1.2.4 Documents Taken as a Whole............................. 28\n", " 1.2.5 Headings............................................... 29\n", " 1.2.6 Implied References to this Agreement................... 29\n", " 1.2.7 Persons................................................ 29\n", " 1.2.8 Modifications to Documents............................. 29\n", " 1.2.9 From, To and Through................................... 29\n", " 1.2.10 Shall; Will............................................ 29\n", " 1.2.11 Quebec Matters......................................... 29\n", " 1.3 Accounting Principles.......................................... 30\n", "2. REVOLVING CREDIT AND SWING LOAN FACILITIES........................... 31\n", " 2.1 Revolving Credit Commitments................................... 31\n", " 2.1.1 Revolving Credit Loans................................. 31\n", " 2.1.2 Swing Loan Commitment.................................. 
33\n", " 2.2 Nature of Lenders' Obligations with Respect to Revolving Credit\n", " Loans.......................................................... 33\n", " 2.3 Commitment Fees................................................ 33\n", " 2.4 Revolving Credit Loan Requests................................. 34\n", " 2.4.1 Revolving Credit Loan Requests......................... 34\n", " 2.4.2 Swing Loan Requests.................................... 34\n", " 2.5 Making Revolving Credit Loans and Swing Loans; Revolving Credit\n", " Notes and Swing Notes.......................................... 35\n", " 2.5.1 Making Revolving Credit Loans.......................... 35\n", " 2.5.2 Making Swing Loans..................................... 35\n", " 2.6 Revolving Credit Notes......................................... 35\n", " 2.7 Swing Loan Note................................................ 35\n", " 2.8 Borrowings to Repay Swing Loans................................ 36\n", " 2.9 Utilization of Commitments in Optional Currencies.............. 36\n", " 2.9.1 Periodic Computations of Dollar Equivalent Amounts of\n", " Revolving Credit Loans and Letters of Credit\n", " Outstanding............................................ 36\n", " 2.9.2 Notices From Lenders That Optional Currencies Are\n", " Unavailable to Fund New Loans.......................... 36\n", " 2.9.3 Notices From Lenders That Optional Currencies Are\n", " Unavailable to Fund Renewals of the Euro-Rate Option... 37\n", " 2.9.4 European Monetary Union................................ 37|\n", "\n" ] } ], "source": [ "pages.limit(1).show(truncate=False)" ] }, { "cell_type": "markdown", "metadata": { "id": "34ewxc2QSVih" }, "source": [ "###✔️ Classifying each page" ] }, { "cell_type": "markdown", "metadata": { "id": "3vlcan-ASV09" }, "source": [ "We will use a few pretrained binary classifier models to try to identify clause types in each page. 
We will also collapse every run of whitespace, including `\\n` (line break) characters, in each page's text into a single space to avoid extra tokenization (we keep the cleaned text in the column `page_clean`)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "-7qvMVqh9Pme", "outputId": "222fc50e-eaa6-4d53-fd2c-7422be41a722" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "+------------------------------------------------------------------------------------------------------------------------------------------------------+\n", "| page_clean|\n", "+------------------------------------------------------------------------------------------------------------------------------------------------------+\n", "|Exhibit 10.1 EXECUTION COPY $225,000,000.00 REVOLVING CREDIT FACILITY CREDIT AGREEMENT by and among P.H. GLATFELTER COMPANY and Certain of its Subs...|\n", "+------------------------------------------------------------------------------------------------------------------------------------------------------+\n", "\n" ] } ], "source": [ "pages = pages.withColumn(\"page_clean\", F.regexp_replace(\"page\", \"\s+\", \" \"))\n", "pages.limit(1).select(\"page_clean\").show(truncate=150)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0f57LpGxShVj", "outputId": "40ea2120-15de-4a8a-b56b-9415f2352b0a" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "sent_bert_base_cased download started this may take some time.\n", "Approximate size to download 389.1 MB\n", "[OK!]\n", "legclf_cuad_whereas_clause download started this may take some time.\n", "[OK!]\n", "legclf_cuad_warranty_clause download started this may take some time.\n", "[OK!]\n", "legclf_termination_md download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "document_assembler = (\n", " nlp.DocumentAssembler().setInputCol(\"page_clean\").setOutputCol(\"document\")\n", 
")\n", "\n", "embeddings = (\n", " nlp.BertSentenceEmbeddings.pretrained(\"sent_bert_base_cased\", \"en\")\n", " .setInputCols(\"document\")\n", " .setOutputCol(\"sentence_embeddings\")\n", ")\n", "\n", "whereas_classifier = (\n", " legal.ClassifierDLModel.pretrained(\n", " \"legclf_cuad_whereas_clause\", \"en\", \"legal/models\"\n", " )\n", " .setInputCols([\"sentence_embeddings\"])\n", " .setOutputCol(\"is_whereas\")\n", ")\n", "\n", "warranty_classifier = (\n", " legal.ClassifierDLModel.pretrained(\n", " \"legclf_cuad_warranty_clause\", \"en\", \"legal/models\"\n", " )\n", " .setInputCols([\"sentence_embeddings\"])\n", " .setOutputCol(\"is_warranty\")\n", ")\n", "\n", "termination_classifier = (\n", " legal.ClassifierDLModel.pretrained(\"legclf_termination_md\", \"en\", \"legal/models\")\n", " .setInputCols([\"sentence_embeddings\"])\n", " .setOutputCol(\"is_termination\")\n", ")\n", "\n", "\n", "pipeline = nlp.Pipeline(\n", " stages=[\n", " document_assembler,\n", " embeddings,\n", " whereas_classifier,\n", " warranty_classifier,\n", " termination_classifier,\n", " ]\n", ")\n", "\n", "model = pipeline.fit(spark.createDataFrame([[\"\"]]).toDF(\"text\"))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LHRhEHslShOx", "outputId": "cabc1d6e-de67-4718-a4b0-088f0d8ef92d" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "+---------+----------+-----------+\n", "| whereas| survival|termination|\n", "+---------+----------+-----------+\n", "|[whereas]| [other]| [other]|\n", "| [other]| [other]| [other]|\n", "| [other]| [other]| [other]|\n", "| [other]|[warranty]| [other]|\n", "| [other]| [other]| [other]|\n", "| [other]| [other]| [other]|\n", "| [other]| [other]| [other]|\n", "| [other]| [other]| [other]|\n", "|[whereas]| [other]| [other]|\n", "| [other]| [other]| [other]|\n", "| [other]| [other]| [other]|\n", "| [other]| [other]| [other]|\n", "| [other]| [other]| 
[other]|\n", "| [other]| [other]| [other]|\n", "| [other]| [other]| [other]|\n", "| [other]| [other]| [other]|\n", "| [other]| [other]| [other]|\n", "| [other]| [other]| [other]|\n", "| [other]| [other]| [other]|\n", "| [other]| [other]| [other]|\n", "+---------+----------+-----------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "result = model.transform(pages)\n", "result.select(\n", " F.expr(\"is_whereas.result as whereas\"),\n", " F.expr(\"is_warranty.result as survival\"),\n", " F.expr(\"is_termination.result as termination\"),\n", ").show()" ] }, { "cell_type": "markdown", "metadata": { "id": "ec1SGDQCCDZQ" }, "source": [ "How many whereas clauses on pages❓" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "GPEboBpQCGtn", "outputId": "ede2c1df-843c-4540-ecf4-145a5ef41800" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "3" ] }, "metadata": {}, "execution_count": 19 } ], "source": [ "result.select(F.expr(\"is_whereas.result[0] as whereas\")).filter(\n", " \"whereas != 'other'\"\n", ").count()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tqiI_0HjShMV", "outputId": "7d7ce146-e05c-4d4c-f39a-ee154d9764b6" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "+--------------------------------------------------------------------------------+-------+\n", "| page_clean|whereas|\n", "+--------------------------------------------------------------------------------+-------+\n", "|Exhibit 10.1 EXECUTION COPY $225,000,000.00 REVOLVING CREDIT FACILITY CREDIT ...|whereas|\n", "|CREDIT AGREEMENT THIS CREDIT AGREEMENT is dated as of April 29, 2010, and is ...|whereas|\n", "|ANY KIND ARISING OUT OF OR RELATED TO THIS AGREEMENT OR ANY OTHER LOAN DOCUME...|whereas|\n", "+--------------------------------------------------------------------------------+-------+\n", "\n" 
] } ], "source": [ "result.filter(\"is_whereas.result[0] == 'whereas'\").select(\n", " \"page_clean\", F.expr(\"is_whereas.result[0] as whereas\")\n", ").show(3, truncate=80)" ] }, { "cell_type": "markdown", "metadata": { "id": "MS4CGxtVucxC" }, "source": [ "How many warranty clauses on pages❓" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "aGzU7GMnuhqd", "outputId": "26397c7f-1902-4915-f172-64450f8eb76e" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "9" ] }, "metadata": {}, "execution_count": 21 } ], "source": [ "result.select(F.expr(\"is_warranty.result[0] as warranty\")).filter(\n", " \"warranty != 'other'\"\n", ").count()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "eMXvU51DxHTE", "outputId": "014e9f33-16b9-4c98-fcc7-54ae794b34fb" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "+--------------------------------------------------------------------------------+--------+\n", "| page_clean|warranty|\n", "+--------------------------------------------------------------------------------+--------+\n", "|TABLE OF CONTENTS Section Page ------- ---- 5.1.14 Patents, Trademarks, Copyr...|warranty|\n", "|Document as a whole and not to any particular provision of this Agreement or ...|warranty|\n", "|(v) the lack of power or authority of any signer of (or any defect in or forg...|warranty|\n", "|such taxes, fees, assessments and other charges are being contested in good f...|warranty|\n", "|5.1.16 COMPLIANCE WITH LAWS. The Loan Parties and their Subsidiaries are in c...|warranty|\n", "|(ii) To the best of the Loan Parties' knowledge, each Multiemployer Plan and ...|warranty|\n", "|9.18 NO RELIANCE ON ADMINISTRATIVE AGENT'S CUSTOMER IDENTIFICATION PROGRAM. 
E...|warranty|\n", "|ANY KIND ARISING OUT OF OR RELATED TO THIS AGREEMENT OR ANY OTHER LOAN DOCUME...|warranty|\n", "|COBANK, ACB, as a Lender By: /s/ Xxxxxxx X. Norte ---------------------------...|warranty|\n", "+--------------------------------------------------------------------------------+--------+\n", "\n" ] } ], "source": [ "result.filter(\"is_warranty.result[0] == 'warranty'\").select(\n", " \"page_clean\", F.expr(\"is_warranty.result[0] as warranty\")\n", ").show(9, truncate=80)" ] }, { "cell_type": "markdown", "metadata": { "id": "adSRWNWouh8P" }, "source": [ "How many termination clauses on pages❓" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7yaQUQGSuiQR", "outputId": "2782cd70-b058-45bf-b01d-45448c746719" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "7" ] }, "metadata": {}, "execution_count": 23 } ], "source": [ "result.select(F.expr(\"is_termination.result[0] as termination\")).filter(\n", " \"termination != 'other'\"\n", ").count()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QZDZC3QXxQD-", "outputId": "02cb8ef9-1ee1-4716-c038-5dda4588db30" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "+--------------------------------------------------------------------------------+-----------+\n", "| page_clean|termination|\n", "+--------------------------------------------------------------------------------+-----------+\n", "|Fees shall be payable quarterly in arrears on the first day of each July, Oct...|termination|\n", "|the calculation of Equivalent Amounts which thereafter are actually in effect...|termination|\n", "|certificate to the other Lenders and the Borrowers. 
Upon such date as shall b...|termination|\n", "|normal banking procedures each Lender could purchase the Original Currency wi...|termination|\n", "|7.1.12 GERMAN AND ENGLISH BORROWERS. On or before the Closing Date, and such ...|termination|\n", "|(H) the Loan Parties shall deliver to the Administrative Agent at least five ...|termination|\n", "|8.1.4 BREACH OF OTHER COVENANTS. Any of the Loan Parties shall default in the...|termination|\n", "+--------------------------------------------------------------------------------+-----------+\n", "\n" ] } ], "source": [ "result.filter(\"is_termination.result[0] == 'termination'\").select(\n", " \"page_clean\", F.expr(\"is_termination.result[0] as termination\")\n", ").show(7, truncate=80)" ] }, { "cell_type": "markdown", "metadata": { "id": "vvvv93e2YJNl" }, "source": [ "###✔️ Identifying Entities in Whereas Clauses" ] }, { "cell_type": "markdown", "metadata": { "id": "rt-7A1iPj2XS" }, "source": [ "Now that we have found the clauses on each page, we select an NER model to identify the entities present in the `whereas` clauses. Our pretrained model can identify the following entities:\n", "\n", "- `WHEREAS_SUBJECT`\n", "- `WHEREAS_OBJECT`\n", "- `WHEREAS_ACTION`\n", "\n", "We will filter the whereas clauses and extract them as raw text, so we can build a new pipeline from scratch. \n", "\n", "> Note: The model was trained with `Roberta Embeddings` instead of `Bert`.\n", "\n", "In addition to these entities, we will also use a second NER model to identify persons, organizations, locations, and dates. We can use `ChunkMergeApproach` to merge the chunks from both NER models into a single unified field containing all the entities." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "VCZa88JZYJNm", "outputId": "2eda7374-a8d5-4d87-97c4-3dd157c82e79" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "w2v_cc_300d download started this may take some time.\n", "Approximate size to download 1.2 GB\n", "[OK!]\n", "roberta_embeddings_legal_roberta_base download started this may take some time.\n", "Approximate size to download 447.2 MB\n", "[OK!]\n", "legner_whereas_md download started this may take some time.\n", "[OK!]\n", "legner_cuad_silver download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "document_assembler = (\n", " nlp.DocumentAssembler().setInputCol(\"page_clean\").setOutputCol(\"document\")\n", ")\n", "\n", "text_splitter = (\n", " legal.TextSplitter().setInputCols([\"document\"]).setOutputCol(\"page_sentence\")\n", ")\n", "\n", "tokenizer = nlp.Tokenizer().setInputCols([\"page_sentence\"]).setOutputCol(\"token\")\n", "\n", "embeddings = (\n", " nlp.WordEmbeddingsModel.pretrained(\"w2v_cc_300d\", \"en\")\n", " .setInputCols([\"page_sentence\", \"token\"])\n", " .setOutputCol(\"embeddings\")\n", ")\n", "\n", "roberta_embeddings = (\n", " nlp.RoBertaEmbeddings.pretrained(\"roberta_embeddings_legal_roberta_base\", \"en\")\n", " .setInputCols([\"page_sentence\", \"token\"])\n", " .setOutputCol(\"roberta\")\n", " .setMaxSentenceLength(512)\n", ")\n", "\n", "ner_whereas = (\n", " legal.NerModel.pretrained(\"legner_whereas_md\", \"en\", \"legal/models\")\n", " .setInputCols([\"page_sentence\", \"token\", \"roberta\"])\n", " .setOutputCol(\"ner_whereas\")\n", ")\n", "\n", "ner_converter_whereas = (\n", " legal.NerConverterInternal()\n", " .setInputCols([\"page_sentence\", \"token\", \"ner_whereas\"])\n", " .setOutputCol(\"ner_chunk_whereas\")\n", ")\n", "\n", "ner_generic = (\n", " legal.NerModel.pretrained(\"legner_cuad_silver\", \"en\", \"legal/models\")\n", 
"\t\t.setInputCols([\"page_sentence\", \"token\", \"embeddings\"])\n", "\t\t.setOutputCol(\"ner_generic\")\n", ")\n", "\n", "ner_converter_generic = (\n", " legal.NerConverterInternal()\n", " .setInputCols([\"page_sentence\", \"token\", \"ner_generic\"])\n", " .setOutputCol(\"ner_chunk_generic\")\n", " .setGreedyMode(True)\n", ")\n", "\n", "chunk_merge = (\n", " legal.ChunkMergeApproach()\n", " .setInputCols(\"ner_chunk_whereas\", \"ner_chunk_generic\")\n", " .setOutputCol(\"merged_chunk\")\n", ")\n", "\n", "ner_pipeline = nlp.Pipeline(\n", " stages=[\n", " document_assembler,\n", " text_splitter,\n", " tokenizer,\n", " embeddings,\n", " roberta_embeddings,\n", " ner_whereas,\n", " ner_converter_whereas,\n", " ner_generic,\n", " ner_converter_generic,\n", " chunk_merge\n", " ]\n", ")\n", "ner_model = ner_pipeline.fit(spark.createDataFrame([[\"\"]]).toDF(\"page_clean\"))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BtLzjRGMcoa8", "outputId": "5e62fd2c-fb49-4259-cc88-a4a977d8ebe5" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "+--------------------+\n", "| page_clean|\n", "+--------------------+\n", "|Exhibit 10.1 EXEC...|\n", "+--------------------+\n", "only showing top 1 row\n", "\n" ] } ], "source": [ "example_clauses = result.filter(\"is_whereas.result[0] == 'whereas'\").select(\n", " \"page_clean\"\n", ")\n", "example_clauses.show(1)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "id": "jwCHOza-dVvo" }, "outputs": [], "source": [ "ner_results = ner_model.transform(example_clauses)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Ye8WHIzOfNJT", "outputId": "a8620581-9f4a-458f-cc05-0d4c53721f91" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ 
"+-----------------------------------------------------------------------+---------------+----------+\n", "|chunk |ner_label |confidence|\n", "+-----------------------------------------------------------------------+---------------+----------+\n", "|EXECUTION COPY |DOC |0.99565 |\n", "|REVOLVING CREDIT FACILITY CREDIT AGREEMENT |DOC |0.8406199 |\n", "|GLATFELTER COMPANY |PARTY |0.52175 |\n", "|PNC BANK, NATIONAL ASSOCIATION, as Administrative Agent |ORG |0.87597775|\n", "|PNC CAPITAL MARKETS LLC |ORG |0.631425 |\n", "|CITIZENS BANK OF PENNSYLVANIA |ORG |0.6161 |\n", "|Joint Lead Arrangers and Joint Bookrunners |ORG |0.8393666 |\n", "|CITIZENS BANK OF PENNSYLVANIA |ORG |0.603675 |\n", "|April 29, 2010 |DATE |0.95145 |\n", "|CERTAIN DEFINITIONS |ORG |0.43734998|\n", "|Construction |ORG |0.9439 |\n", "|28 1.2.1 Number |DATE |0.6659667 |\n", "|Inclusion |ORG |0.9177 |\n", "|28 1.2.2 Determination |DATE |0.6773667 |\n", "|Whole |ORG |0.7297 |\n", "|28 1.2.5 Headings |DATE |0.67876667|\n", "|Agreement |DOC |0.9939 |\n", "|29 1.2.7 |DATE |0.7389 |\n", "|Persons |ORG |0.9189 |\n", "|Documents |ORG |0.5535 |\n", "|29 1.2.9 |DATE |0.90805 |\n", "|29 1.2.10 |DATE |0.7398 |\n", "|Shall |ORG |0.6872 |\n", "|Will |DOC |0.9934 |\n", "|Quebec |LOC |0.793 |\n", "|Accounting Principles |ORG |0.51035 |\n", "|30 |DATE |0.85 |\n", "|REVOLVING CREDIT |DOC |0.74405 |\n", "|Credit |DOC |0.9644 |\n", "|Credit |DOC |0.803 |\n", "|Swing Loan Commitment |ORG |0.72103333|\n", "|Credit |DOC |0.6604 |\n", "|Credit |DOC |0.8468 |\n", "|Credit |DOC |0.8577 |\n", "|Swing Loan Requests |ORG |0.6699 |\n", "|Credit |DOC |0.8247 |\n", "|Credit |DOC |0.9194 |\n", "|Credit |DOC |0.8865 |\n", "|Making Swing Loans |ORG |0.7957333 |\n", "|Credit |DOC |0.9688 |\n", "|Loan Note |DOC |0.66545 |\n", "|Repay Swing Loans |ORG |0.6506 |\n", "|Credit |DOC |0.9285 |\n", "|Letters of Credit |DOC |0.97900003|\n", "|Fund New Loans |ORG |0.5782 |\n", "|European Monetary Union |ORG |0.5827 |\n", "|CREDIT AGREEMENT |DOC 
|0.99310005|\n", "|CREDIT AGREEMENT |DOC |0.9901 |\n", "|April 29, 2010 |DATE |0.8848001 |\n", "|GLATFELTER COMPANY |ORG |0.6135 |\n", "|Pennsylvania |LOC |0.7951 |\n", "|the \"COMPANY\") |ORG |0.78772503|\n", "|BORROWER |ROLE |0.9892 |\n", "|PNC BANK, NATIONAL ASSOCIATION, |ORG |0.7839833 |\n", "|Agreement |DOC |0.9553 |\n", "|PNC CAPITAL MARKETS LLC |PARTY |0.8001 |\n", "|CITIZENS BANK OF PENNSYLVANIA |ORG |0.549325 |\n", "|CITIZENS BANK OF PENNSYLVANIA |ORG |0.631575 |\n", "|the Borrowers |WHEREAS_SUBJECT|0.9951 |\n", "|have requested |WHEREAS_ACTION |0.97775 |\n", "|the Lenders to provide a revolving credit facility |WHEREAS_OBJECT |0.80509996|\n", "|proceeds of the |WHEREAS_SUBJECT|0.84663326|\n", "|credit facility |DOC |0.82975 |\n", "|WHEREAS |ORG |0.9781 |\n", "|the Lenders |WHEREAS_SUBJECT|0.9569 |\n", "|are willing to provide |WHEREAS_ACTION |0.836 |\n", "|such credit |WHEREAS_OBJECT |0.85765004|\n", "|the parties |WHEREAS_SUBJECT|0.96730006|\n", "|DEFINITIONS 1.1 CERTAIN DEFINITIONS |LAW |0.70685 |\n", "|Agreement |DOC |0.9834 |\n", "|2006 |DATE |0.9707 |\n", "|2006 |DATE |0.9988 |\n", "|May 1, 2016 |DATE |0.97282505|\n", "|the Loan Parties |ORG |0.76336664|\n", "|2010 |DATE |0.9851 |\n", "|2010 |DATE |0.9993 |\n", "|May 1, 2016 |DATE |0.97245 |\n", "|the Loan Parties |ORG |0.77570003|\n", "|RECEIVABLE FACILITY |DOC |0.7375 |\n", "|Company |ORG |0.9981 |\n", "|the Receivables Entity |ORG |0.60770005|\n", "|a Permitted Accounts Receivable Program |ORG |0.54974 |\n", "|ADMINISTRATIVE AGENT |PARTY |0.67195 |\n", "|Section 9.15 |LAW |0.99465 |\n", "|AGENT'S LETTER |DOC |0.55605 |\n", "|Section 9.15 |LAW |0.9943 |\n", "|AFFILIATE |ROLE |0.4648 |\n", "|Person |ORG |0.8968 |\n", "|Person |ORG |0.9664 |\n", "|contract |DOC |0.9829 |\n", "|AGREEMENT |DOC |0.8701 |\n", "|Credit Agreement |DOC |0.99915 |\n", "|Executive Order No. 
13224 |LAW |0.70444 |\n", "|the USA Patriot |LAW |0.4611 |\n", "|Act |DOC |0.9427 |\n", "|Act |DOC |0.9735 |\n", "|the United States Treasury Department's Office of Foreign Asset Control|ORG |0.52493 |\n", "|Laws |LAW |0.9754 |\n", "|SCHEDULE 1.1(A |LAW |0.80805004|\n", "|SCHEDULE 1.1(A |LAW |0.98765004|\n", "|Euro-Rate |ORG |0.9422 |\n", "|SCHEDULE 1.1(A |LAW |0.7896 |\n", "|SCHEDULE 1.1(A |LAW |0.9899 |\n", "|ASSIGNMENT AND ASSUMPTION AGREEMENT |DOC |0.74009997|\n", "|Assumption Agreement |DOC |0.8543 |\n", "|EXHIBIT 1.1(A |LAW |0.71099997|\n", "|AGREEMENT |DOC |0.9904 |\n", "|LOAN DOCUMENT |DOC |0.95755005|\n", "|10.17.1 TAX WITHHOLDING |ORG |0.7043667 |\n", "|the United States of America |LOC |0.79712 |\n", "|the Administrative Agent |ORG |0.81193334|\n", "|Company |ORG |0.9975 |\n", "|the Administrative Agent |ORG |0.77356666|\n", "|Section 1.1441-1 |LAW |0.9625 |\n", "|the Income Tax Regulations |LAW |0.64004993|\n", "|U.S. |LOC |0.87049997|\n", "|U.S |LOC |0.8581 |\n", "|treaty |DOC |0.8876 |\n", "|the Internal Revenue Code |LAW |0.58395 |\n", "|CERTIFICATE |DOC |0.9479 |\n", "|Form |DOC |0.999 |\n", "|Form |DOC |0.9962 |\n", "|Form |DOC |0.9922 |\n", "|W-8ECI |ORG |0.6855 |\n", "|Form |DOC |0.9803 |\n", "|Section 1.1441-1 |LAW |0.95815 |\n", "|Regulations |LAW |0.9789 |\n", "|statement |DOC |0.9981 |\n", "|Section 1.871-14 |LAW |0.90325 |\n", "|Regulations |LAW |0.9828 |\n", "|the Internal Revenue Code |LAW |0.63045 |\n", "|owner |ROLE |1.0 |\n", "|U.S. 
|LOC |0.78575 |\n", "|Company |ORG |0.9976 |\n", "|Certificate |DOC |0.9994 |\n", "|Certificate |DOC |0.9994 |\n", "|Certificate |DOC |0.9985 |\n", "|Business Days |DATE |0.87845004|\n", "|first |ORDINAL |0.5447 |\n", "|Certificate |DOC |0.998 |\n", "|Business Days |DATE |0.9174 |\n", "|the Administrative Agent |ORG |0.86280006|\n", "|permit |DOC |0.9784 |\n", "|Certificate |DOC |0.9983 |\n", "|Business Days |DATE |0.89515 |\n", "|the Administrative Agent |ORG |0.8864667 |\n", "|Certificate |DOC |0.9996 |\n", "|Company |ORG |0.9968 |\n", "|the Administrative Agent |ORG |0.7749667 |\n", "|Certificate |DOC |0.9994 |\n", "|Certificate |DOC |0.9994 |\n", "|Certificate |DOC |0.9995 |\n", "|Borrowers |ORG |0.9825 |\n", "|the Administrative Agent |ORG |0.8297 |\n", "|the United States of America |LOC |0.78362 |\n", "|the Administrative Agent |ORG |0.8483667 |\n", "|agreement |DOC |0.9514 |\n", "|U.S |LOC |0.8164 |\n", "|U.S |LOC |0.8927 |\n", "|agreement |DOC |0.9969 |\n", "|agreement |DOC |0.9893 |\n", "|the United States |LOC |0.8441 |\n", "+-----------------------------------------------------------------------+---------------+----------+\n", "\n" ] } ], "source": [ "ner_results.select(\n", " F.explode(F.arrays_zip(ner_results.merged_chunk.result, ner_results.merged_chunk.metadata)).alias(\n", " \"cols\"\n", " )\n", ").select(\n", " F.expr(\"cols['0']\").alias(\"chunk\"),\n", " F.expr(\"cols['1']['entity']\").alias(\"ner_label\"),\n", " F.expr(\"cols['1']['confidence']\").alias(\"confidence\"),\n", ").show(300, truncate=False)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "k0agyCCJemtN" }, "source": [ "Using the visualization package:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "id": "sqR76cqCb9bF" }, "outputs": [], "source": [ "ner_visualizer = viz.NerVisualizer()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "id": "ejJSVcYG2KCx" }, "outputs": [], "source": [ "results_collected = ner_results.collect()" ] }, { 
"cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "ypRwcazjc2Rs", "outputId": "9c3a4e42-fabf-4486-d9e2-08515ea70d61" }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "" ], "text/html": [ "\n", "\n", " CREDIT AGREEMENT DOC THIS CREDIT AGREEMENT DOC is dated as of April 29, 2010 DATE, and is made by and among P.H. GLATFELTER COMPANY ORG, a Pennsylvania LOC corporation ( the \"COMPANY\") ORG and certain of its subsidiaries identified on the signature pages hereto (each a \"BORROWER ROLE\" and collectively, the \"BORROWERS\"), each of the GUARANTORS (as hereinafter defined), the LENDERS (as hereinafter defined), PNC BANK, NATIONAL ASSOCIATION, ORG in its capacity as agent for the Lenders under this Agreement DOC (hereinafter referred to in such capacity as the \"ADMINISTRATIVE AGENT\"), and, for the limited purpose of public identification in trade tables, PNC CAPITAL MARKETS LLC PARTY and CITIZENS BANK OF PENNSYLVANIA ORG, as joint arrangers and joint bookrunners, and CITIZENS BANK OF PENNSYLVANIA ORG, as syndication agent. WITNESSETH: WHEREAS, the Borrowers WHEREAS_SUBJECT have requested WHEREAS_ACTION the Lenders to provide a revolving credit facility WHEREAS_OBJECT to the Borrowers in an aggregate principal amount not to exceed $225,000,000; and WHEREAS, proceeds of the WHEREAS_SUBJECT revolving credit facility DOC shall be used for (1) refinancing existing Indebtedness, and (2) general corporate purposes; and WHEREAS ORG, the Lenders WHEREAS_SUBJECT are willing to provide WHEREAS_ACTION such credit WHEREAS_OBJECT upon the terms and conditions hereinafter set forth; NOW, THEREFORE, the parties WHEREAS_SUBJECT hereto, in consideration of their mutual covenants and agreements hereinafter set forth and intending to be legally bound hereby, covenant and agree as follows: 1. CERTAIN DEFINITIONS 1.1 CERTAIN DEFINITIONS LAW. 
In addition to words and terms defined elsewhere in this Agreement DOC, the following words and terms shall have the following meanings, respectively, unless the context hereof clearly requires otherwise: 2006 DATE SENIOR NOTES shall mean the Company's 7 ?% senior notes, issued in 2006 DATE and due May 1, 2016 DATE, in the aggregate principal amount of $200,000,000, guarantied by certain of the Loan Parties ORG. 2010 DATE SENIOR NOTES shall mean the Company's 7 ?% senior notes, issued in 2010 DATE and due May 1, 2016 DATE, in the aggregate principal amount of $100,000,000, guarantied by certain of the Loan Parties ORG. ACCOUNTS RECEIVABLE FACILITY DOC DOCUMENTS means all documentation entered into by the Company ORG and its Subsidiaries, including, without limitation, the Receivables Entity ORG, in connection with the sale or other transfer of accounts receivable and other related assets pursuant to a Permitted Accounts Receivable Program ORG, as such documentation may be amended, restated, supplemented or otherwise modified from time to time in accordance with the terms hereof and thereof. ADMINISTRATIVE AGENT PARTY shall have the meaning given to such term in the introductory paragraph hereof. ADMINISTRATIVE AGENT'S FEE shall have the meaning assigned to that term in Section 9.15 LAW. ADMINISTRATIVE AGENT'S LETTER DOC shall have the meaning assigned to that term in Section 9.15 LAW. AFFILIATE ROLE as to any Person shall mean any other Person (i) which, directly or indirectly controls, is controlled by, or is under common control with such Person ORG. 
For purposes of this definition, \"control\" (including, with correlative meanings, the term \"controlled by\" and \"under common control with\") shall mean the power, directly or indirectly, either to (a) vote 10% or more of the securities having ordinary voting power for the election of directors of such Person ORG or (b) direct or cause the direction of the management and policies of such Person whether through the ownership of voting securities or by contract DOC or otherwise, including the power to elect a majority of the directors of a corporation. AGREEMENT DOC shall mean this Credit Agreement DOC, as the same may be extended, renewed, amended, supplemented or restated from time to time, including all schedules and exhibits. ANTI-TERRORISM LAWS shall mean any Laws relating to terrorism or money laundering, including Executive Order No. 13224 LAW, the USA Patriot LAW Act DOC, the Laws comprising or implementing the Bank Secrecy Act DOC, and the Laws administered by the United States Treasury Department's Office of Foreign Asset Control ORG (as any of the foregoing Laws LAW may from time to time be amended, renewed, extended, or replaced). APPLICABLE COMMITMENT FEE RATE shall mean the percentage rate per annum at the indicated level of Debt Rating in the pricing grid on SCHEDULE 1.1(A LAW) next to the line titled \"Commitment Fee.\" The Applicable Commitment Fee Rate shall be computed in accordance with the parameters set forth on SCHEDULE 1.1(A LAW). APPLICABLE MARGIN shall mean the percentage spread to be added to Euro-Rate ORG under the Euro-Rate Option or to the Base Rate under the Base Rate Option at the indicated level of Debt Rating in the pricing grid on SCHEDULE 1.1(A LAW) next to the line titled \"Euro-Rate Spread\" or \"Base Rate Spread.\" The Applicable Margin shall be computed in accordance with the parameters set forth on SCHEDULE 1.1(A LAW). 
ASSIGNMENT AND ASSUMPTION AGREEMENT DOC shall mean an Assignment and Assumption Agreement DOC by and among a Purchasing Lender, a Transferor Lender and the Administrative Agent, as Administrative Agent and on behalf of the remaining Lenders, substantially in the form of EXHIBIT 1.1(A LAW)." ] }, "metadata": {} } ], "source": [ "ner_visualizer.display(\n", " results_collected[1], label_col=\"merged_chunk\", document_col=\"document\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "CPSxpGYCj8cF" }, "source": [ "###✔️ Using LightPipeline" ] }, { "cell_type": "markdown", "metadata": { "id": "2v2CYCq6j-i0" }, "source": [ "[LightPipelines](https://nlp.johnsnowlabs.com/docs/en/concepts#using-spark-nlps-lightpipeline) are Spark NLP-specific pipelines, equivalent to Spark ML Pipelines, but meant to deal with smaller amounts of data. They are useful when working with small datasets, debugging results, or running either training or prediction from an API that serves one-off requests.\n", "\n", "A Spark NLP LightPipeline is a Spark ML pipeline converted to run on a single machine as a multi-threaded task, which can make it **more than 10x faster** for smaller amounts of data (small is relative, but 50k sentences is roughly a good maximum). To use one, we simply plug in a trained (fitted) pipeline and then annotate plain text. We don't even need to convert the input text to a DataFrame in order to feed it into a pipeline that accepts a DataFrame as input in the first place. 
This feature is quite useful for getting predictions for a few lines of text from a trained ML model.\n", "\n", "For more details:\n", "[https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1](https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "id": "juiznw0cdgOb" }, "outputs": [], "source": [ "light_model = nlp.LightPipeline(ner_model)" ] }, { "cell_type": "markdown", "metadata": { "id": "kltKN3XEpOHp" }, "source": [ "You can pass a string or a list of strings to the method [.annotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.annotate) to get the results. It returns a `dict` if a single string is given, or a `list` of `dict`s if a `list` is given. To get more metadata in the result, use the method [.fullAnnotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.fullAnnotate) instead.\n", "\n", "To extract the results from the object, you just need to parse the dictionary." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": { "id": "1iwFYyAUnCVX" }, "outputs": [], "source": [ "text = results_collected[1].page_clean" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "a9zQvgZftS18", "outputId": "1161fbfb-7252-4249-822d-d3aa008c9cbd" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "dict_keys(['ner_whereas', 'document', 'merged_chunk', 'ner_chunk_generic', 'token', 'page_sentence', 'embeddings', 'ner_chunk_whereas', 'roberta', 'ner_generic'])" ] }, "metadata": {}, "execution_count": 34 } ], "source": [ "lp_results = light_model.annotate(text)\n", "lp_results.keys()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "CCCX2g_ctS0D", "outputId": "d184fa20-8dbe-470a-b8a7-fcd6820b74de" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['CREDIT AGREEMENT',\n", " 'CREDIT AGREEMENT',\n", " 'April 29, 2010',\n", " 'GLATFELTER COMPANY',\n", " 'Pennsylvania',\n", " 'the \"COMPANY\")',\n", " 'BORROWER',\n", " 'PNC BANK, NATIONAL ASSOCIATION,',\n", " 'Agreement',\n", " 'PNC CAPITAL MARKETS LLC',\n", " 'CITIZENS BANK OF PENNSYLVANIA',\n", " 'CITIZENS BANK OF PENNSYLVANIA',\n", " 'the Borrowers',\n", " 'have requested',\n", " 'the Lenders to provide a revolving credit facility',\n", " 'proceeds of the',\n", " 'credit facility',\n", " 'WHEREAS',\n", " 'the Lenders',\n", " 'are willing to provide',\n", " 'such credit',\n", " 'the parties',\n", " 'DEFINITIONS 1.1 CERTAIN DEFINITIONS',\n", " 'Agreement',\n", " '2006',\n", " '2006',\n", " 'May 1, 2016',\n", " 'the Loan Parties',\n", " '2010',\n", " '2010',\n", " 'May 1, 2016',\n", " 'the Loan Parties',\n", " 'RECEIVABLE FACILITY',\n", " 'Company',\n", " 'the Receivables Entity',\n", " 'a Permitted Accounts Receivable Program',\n", " 'ADMINISTRATIVE AGENT',\n", " 'Section 9.15',\n", " \"AGENT'S 
LETTER\",\n", " 'Section 9.15',\n", " 'AFFILIATE',\n", " 'Person',\n", " 'Person',\n", " 'contract',\n", " 'AGREEMENT',\n", " 'Credit Agreement',\n", " 'Executive Order No. 13224',\n", " 'the USA Patriot',\n", " 'Act',\n", " 'Act',\n", " \"the United States Treasury Department's Office of Foreign Asset Control\",\n", " 'Laws',\n", " 'SCHEDULE 1.1(A',\n", " 'SCHEDULE 1.1(A',\n", " 'Euro-Rate',\n", " 'SCHEDULE 1.1(A',\n", " 'SCHEDULE 1.1(A',\n", " 'ASSIGNMENT AND ASSUMPTION AGREEMENT',\n", " 'Assumption Agreement',\n", " 'EXHIBIT 1.1(A']" ] }, "metadata": {}, "execution_count": 35 } ], "source": [ "# List with all the chunks\n", "lp_results[\"merged_chunk\"]" ] }, { "cell_type": "markdown", "metadata": { "id": "geH7M8XkrBnk" }, "source": [ "We can see that the `.annotate()` did't return the labels in the `ner_chunk` item. How can we obtain them? Using the `.fullAnnotate()` instead. This method always returns a list." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "PLIGPnJntSx8", "outputId": "c6028f7e-1fb6-4f4f-f502-2c7c848f5623" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "dict_keys(['ner_whereas', 'document', 'merged_chunk', 'ner_chunk_generic', 'token', 'page_sentence', 'embeddings', 'ner_chunk_whereas', 'roberta', 'ner_generic'])" ] }, "metadata": {}, "execution_count": 36 } ], "source": [ "lp_results_full = light_model.fullAnnotate(text)\n", "lp_results_full[0].keys()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "HDBT6Cb-tSvu", "outputId": "aef9ec40-0037-4785-99f9-6a66406882a6" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[Annotation(chunk, 0, 15, CREDIT AGREEMENT, {'entity': 'DOC', 'confidence': '0.99310005', 'ner_source': 'ner_chunk_generic', 'chunk': '0', 'sentence': '0'}, []),\n", " Annotation(chunk, 22, 37, CREDIT AGREEMENT, {'entity': 
'DOC', 'confidence': '0.9901', 'ner_source': 'ner_chunk_generic', 'chunk': '1', 'sentence': '0'}, []),\n", " Annotation(chunk, 54, 67, April 29, 2010, {'entity': 'DATE', 'confidence': '0.8848001', 'ner_source': 'ner_chunk_generic', 'chunk': '2', 'sentence': '0'}, []),\n", " Annotation(chunk, 100, 117, GLATFELTER COMPANY, {'entity': 'ORG', 'confidence': '0.6135', 'ner_source': 'ner_chunk_generic', 'chunk': '3', 'sentence': '1'}, []),\n", " Annotation(chunk, 122, 133, Pennsylvania, {'entity': 'LOC', 'confidence': '0.7951', 'ner_source': 'ner_chunk_generic', 'chunk': '4', 'sentence': '1'}, []),\n", " Annotation(chunk, 149, 162, the \"COMPANY\"), {'entity': 'ORG', 'confidence': '0.78772503', 'ner_source': 'ner_chunk_generic', 'chunk': '5', 'sentence': '1'}, []),\n", " Annotation(chunk, 246, 253, BORROWER, {'entity': 'ROLE', 'confidence': '0.9892', 'ner_source': 'ner_chunk_generic', 'chunk': '6', 'sentence': '1'}, []),\n", " Annotation(chunk, 379, 409, PNC BANK, NATIONAL ASSOCIATION,, {'entity': 'ORG', 'confidence': '0.7839833', 'ner_source': 'ner_chunk_generic', 'chunk': '7', 'sentence': '1'}, []),\n", " Annotation(chunk, 463, 471, Agreement, {'entity': 'DOC', 'confidence': '0.9553', 'ner_source': 'ner_chunk_generic', 'chunk': '8', 'sentence': '1'}, []),\n", " Annotation(chunk, 618, 640, PNC CAPITAL MARKETS LLC, {'entity': 'PARTY', 'confidence': '0.8001', 'ner_source': 'ner_chunk_generic', 'chunk': '9', 'sentence': '1'}, []),\n", " Annotation(chunk, 646, 674, CITIZENS BANK OF PENNSYLVANIA, {'entity': 'ORG', 'confidence': '0.549325', 'ner_source': 'ner_chunk_generic', 'chunk': '10', 'sentence': '1'}, []),\n", " Annotation(chunk, 723, 751, CITIZENS BANK OF PENNSYLVANIA, {'entity': 'ORG', 'confidence': '0.631575', 'ner_source': 'ner_chunk_generic', 'chunk': '11', 'sentence': '1'}, []),\n", " Annotation(chunk, 797, 809, the Borrowers, {'entity': 'WHEREAS_SUBJECT', 'confidence': '0.9951', 'ner_source': 'ner_chunk_whereas', 'chunk': '12', 'sentence': '2'}, []),\n", " 
Annotation(chunk, 811, 824, have requested, {'entity': 'WHEREAS_ACTION', 'confidence': '0.97775', 'ner_source': 'ner_chunk_whereas', 'chunk': '13', 'sentence': '2'}, []),\n", " Annotation(chunk, 826, 875, the Lenders to provide a revolving credit facility, {'entity': 'WHEREAS_OBJECT', 'confidence': '0.80509996', 'ner_source': 'ner_chunk_whereas', 'chunk': '14', 'sentence': '2'}, []),\n", " Annotation(chunk, 968, 982, proceeds of the, {'entity': 'WHEREAS_SUBJECT', 'confidence': '0.84663326', 'ner_source': 'ner_chunk_whereas', 'chunk': '15', 'sentence': '3'}, []),\n", " Annotation(chunk, 994, 1008, credit facility, {'entity': 'DOC', 'confidence': '0.82975', 'ner_source': 'ner_chunk_generic', 'chunk': '16', 'sentence': '3'}, []),\n", " Annotation(chunk, 1107, 1113, WHEREAS, {'entity': 'ORG', 'confidence': '0.9781', 'ner_source': 'ner_chunk_generic', 'chunk': '17', 'sentence': '4'}, []),\n", " Annotation(chunk, 1116, 1126, the Lenders, {'entity': 'WHEREAS_SUBJECT', 'confidence': '0.9569', 'ner_source': 'ner_chunk_whereas', 'chunk': '18', 'sentence': '4'}, []),\n", " Annotation(chunk, 1128, 1149, are willing to provide, {'entity': 'WHEREAS_ACTION', 'confidence': '0.836', 'ner_source': 'ner_chunk_whereas', 'chunk': '19', 'sentence': '4'}, []),\n", " Annotation(chunk, 1151, 1161, such credit, {'entity': 'WHEREAS_OBJECT', 'confidence': '0.85765004', 'ner_source': 'ner_chunk_whereas', 'chunk': '20', 'sentence': '4'}, []),\n", " Annotation(chunk, 1232, 1242, the parties, {'entity': 'WHEREAS_SUBJECT', 'confidence': '0.96730006', 'ner_source': 'ner_chunk_whereas', 'chunk': '21', 'sentence': '5'}, []),\n", " Annotation(chunk, 1416, 1450, DEFINITIONS 1.1 CERTAIN DEFINITIONS, {'entity': 'LAW', 'confidence': '0.70685', 'ner_source': 'ner_chunk_generic', 'chunk': '22', 'sentence': '6'}, []),\n", " Annotation(chunk, 1510, 1518, Agreement, {'entity': 'DOC', 'confidence': '0.9834', 'ner_source': 'ner_chunk_generic', 'chunk': '23', 'sentence': '7'}, []),\n", " Annotation(chunk, 1654, 
1657, 2006, {'entity': 'DATE', 'confidence': '0.9707', 'ner_source': 'ner_chunk_generic', 'chunk': '24', 'sentence': '7'}, []),\n", " Annotation(chunk, 1726, 1729, 2006, {'entity': 'DATE', 'confidence': '0.9988', 'ner_source': 'ner_chunk_generic', 'chunk': '25', 'sentence': '7'}, []),\n", " Annotation(chunk, 1739, 1749, May 1, 2016, {'entity': 'DATE', 'confidence': '0.97282505', 'ner_source': 'ner_chunk_generic', 'chunk': '26', 'sentence': '7'}, []),\n", " Annotation(chunk, 1828, 1843, the Loan Parties, {'entity': 'ORG', 'confidence': '0.76336664', 'ner_source': 'ner_chunk_generic', 'chunk': '27', 'sentence': '7'}, []),\n", " Annotation(chunk, 1846, 1849, 2010, {'entity': 'DATE', 'confidence': '0.9851', 'ner_source': 'ner_chunk_generic', 'chunk': '28', 'sentence': '8'}, []),\n", " Annotation(chunk, 1918, 1921, 2010, {'entity': 'DATE', 'confidence': '0.9993', 'ner_source': 'ner_chunk_generic', 'chunk': '29', 'sentence': '8'}, []),\n", " Annotation(chunk, 1931, 1941, May 1, 2016, {'entity': 'DATE', 'confidence': '0.97245', 'ner_source': 'ner_chunk_generic', 'chunk': '30', 'sentence': '8'}, []),\n", " Annotation(chunk, 2020, 2035, the Loan Parties, {'entity': 'ORG', 'confidence': '0.77570003', 'ner_source': 'ner_chunk_generic', 'chunk': '31', 'sentence': '8'}, []),\n", " Annotation(chunk, 2047, 2065, RECEIVABLE FACILITY, {'entity': 'DOC', 'confidence': '0.7375', 'ner_source': 'ner_chunk_generic', 'chunk': '32', 'sentence': '9'}, []),\n", " Annotation(chunk, 2121, 2127, Company, {'entity': 'ORG', 'confidence': '0.9981', 'ner_source': 'ner_chunk_generic', 'chunk': '33', 'sentence': '9'}, []),\n", " Annotation(chunk, 2182, 2203, the Receivables Entity, {'entity': 'ORG', 'confidence': '0.60770005', 'ner_source': 'ner_chunk_generic', 'chunk': '34', 'sentence': '9'}, []),\n", " Annotation(chunk, 2312, 2350, a Permitted Accounts Receivable Program, {'entity': 'ORG', 'confidence': '0.54974', 'ner_source': 'ner_chunk_generic', 'chunk': '35', 'sentence': '9'}, []),\n", " 
Annotation(chunk, 2503, 2522, ADMINISTRATIVE AGENT, {'entity': 'PARTY', 'confidence': '0.67195', 'ner_source': 'ner_chunk_generic', 'chunk': '36', 'sentence': '10'}, []),\n", " Annotation(chunk, 2679, 2690, Section 9.15, {'entity': 'LAW', 'confidence': '0.99465', 'ner_source': 'ner_chunk_generic', 'chunk': '37', 'sentence': '11'}, []),\n", " Annotation(chunk, 2708, 2721, AGENT'S LETTER, {'entity': 'DOC', 'confidence': '0.55605', 'ner_source': 'ner_chunk_generic', 'chunk': '38', 'sentence': '12'}, []),\n", " Annotation(chunk, 2771, 2782, Section 9.15, {'entity': 'LAW', 'confidence': '0.9943', 'ner_source': 'ner_chunk_generic', 'chunk': '39', 'sentence': '12'}, []),\n", " Annotation(chunk, 2785, 2793, AFFILIATE, {'entity': 'ROLE', 'confidence': '0.4648', 'ner_source': 'ner_chunk_generic', 'chunk': '40', 'sentence': '13'}, []),\n", " Annotation(chunk, 2939, 2944, Person, {'entity': 'ORG', 'confidence': '0.8968', 'ner_source': 'ner_chunk_generic', 'chunk': '41', 'sentence': '14'}, []),\n", " Annotation(chunk, 3249, 3254, Person, {'entity': 'ORG', 'confidence': '0.9664', 'ner_source': 'ner_chunk_generic', 'chunk': '42', 'sentence': '16'}, []),\n", " Annotation(chunk, 3396, 3403, contract, {'entity': 'DOC', 'confidence': '0.9829', 'ner_source': 'ner_chunk_generic', 'chunk': '43', 'sentence': '17'}, []),\n", " Annotation(chunk, 3494, 3502, AGREEMENT, {'entity': 'DOC', 'confidence': '0.8701', 'ner_source': 'ner_chunk_generic', 'chunk': '44', 'sentence': '18'}, []),\n", " Annotation(chunk, 3520, 3535, Credit Agreement, {'entity': 'DOC', 'confidence': '0.99915', 'ner_source': 'ner_chunk_generic', 'chunk': '45', 'sentence': '18'}, []),\n", " Annotation(chunk, 3760, 3784, Executive Order No. 
13224, {'entity': 'LAW', 'confidence': '0.70444', 'ner_source': 'ner_chunk_generic', 'chunk': '46', 'sentence': '19'}, []),\n", " Annotation(chunk, 3787, 3801, the USA Patriot, {'entity': 'LAW', 'confidence': '0.4611', 'ner_source': 'ner_chunk_generic', 'chunk': '47', 'sentence': '19'}, []),\n", " Annotation(chunk, 3803, 3805, Act, {'entity': 'DOC', 'confidence': '0.9427', 'ner_source': 'ner_chunk_generic', 'chunk': '48', 'sentence': '19'}, []),\n", " Annotation(chunk, 3861, 3863, Act, {'entity': 'DOC', 'confidence': '0.9735', 'ner_source': 'ner_chunk_generic', 'chunk': '49', 'sentence': '19'}, []),\n", " Annotation(chunk, 3895, 3965, the United States Treasury Department's Office of Foreign Asset Control, {'entity': 'ORG', 'confidence': '0.52493', 'ner_source': 'ner_chunk_generic', 'chunk': '50', 'sentence': '19'}, []),\n", " Annotation(chunk, 3992, 3995, Laws, {'entity': 'LAW', 'confidence': '0.9754', 'ner_source': 'ner_chunk_generic', 'chunk': '51', 'sentence': '19'}, []),\n", " Annotation(chunk, 4197, 4210, SCHEDULE 1.1(A, {'entity': 'LAW', 'confidence': '0.80805004', 'ner_source': 'ner_chunk_generic', 'chunk': '52', 'sentence': '20'}, []),\n", " Annotation(chunk, 4355, 4368, SCHEDULE 1.1(A, {'entity': 'LAW', 'confidence': '0.98765004', 'ner_source': 'ner_chunk_generic', 'chunk': '53', 'sentence': '20'}, []),\n", " Annotation(chunk, 4438, 4446, Euro-Rate, {'entity': 'ORG', 'confidence': '0.9422', 'ner_source': 'ner_chunk_generic', 'chunk': '54', 'sentence': '21'}, []),\n", " Annotation(chunk, 4583, 4596, SCHEDULE 1.1(A, {'entity': 'LAW', 'confidence': '0.7896', 'ner_source': 'ner_chunk_generic', 'chunk': '55', 'sentence': '21'}, []),\n", " Annotation(chunk, 4752, 4765, SCHEDULE 1.1(A, {'entity': 'LAW', 'confidence': '0.9899', 'ner_source': 'ner_chunk_generic', 'chunk': '56', 'sentence': '22'}, []),\n", " Annotation(chunk, 4769, 4803, ASSIGNMENT AND ASSUMPTION AGREEMENT, {'entity': 'DOC', 'confidence': '0.74009997', 'ner_source': 'ner_chunk_generic', 'chunk': 
'57', 'sentence': '23'}, []),\n", " Annotation(chunk, 4834, 4853, Assumption Agreement, {'entity': 'DOC', 'confidence': '0.8543', 'ner_source': 'ner_chunk_generic', 'chunk': '58', 'sentence': '23'}, []),\n", " Annotation(chunk, 5032, 5044, EXHIBIT 1.1(A, {'entity': 'LAW', 'confidence': '0.71099997', 'ner_source': 'ner_chunk_generic', 'chunk': '59', 'sentence': '23'}, [])]" ] }, "metadata": {}, "execution_count": 37 } ], "source": [ "lp_results_full[0][\"merged_chunk\"]" ] }, { "cell_type": "markdown", "metadata": { "id": "lROIxxO_r0zR" }, "source": [ "Now we can see all the metadata in the annotation objects. Let's get the results in a tabular form." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "Q0l4JNsPtStH", "outputId": "6f86897d-4ba7-4097-e2a7-b6adf9dfa1b7" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " begin end chunk \\\n", "0 0 15 CREDIT AGREEMENT \n", "1 22 37 CREDIT AGREEMENT \n", "2 54 67 April 29, 2010 \n", "3 100 117 GLATFELTER COMPANY \n", "4 122 133 Pennsylvania \n", "5 149 162 the \"COMPANY\") \n", "6 246 253 BORROWER \n", "7 379 409 PNC BANK, NATIONAL ASSOCIATION, \n", "8 463 471 Agreement \n", "9 618 640 PNC CAPITAL MARKETS LLC \n", "10 646 674 CITIZENS BANK OF PENNSYLVANIA \n", "11 723 751 CITIZENS BANK OF PENNSYLVANIA \n", "12 797 809 the Borrowers \n", "13 811 824 have requested \n", "14 826 875 the Lenders to provide a revolving credit faci... 
\n", "15 968 982 proceeds of the \n", "16 994 1008 credit facility \n", "17 1107 1113 WHEREAS \n", "18 1116 1126 the Lenders \n", "19 1128 1149 are willing to provide \n", "20 1151 1161 such credit \n", "21 1232 1242 the parties \n", "22 1416 1450 DEFINITIONS 1.1 CERTAIN DEFINITIONS \n", "23 1510 1518 Agreement \n", "24 1654 1657 2006 \n", "25 1726 1729 2006 \n", "26 1739 1749 May 1, 2016 \n", "27 1828 1843 the Loan Parties \n", "28 1846 1849 2010 \n", "29 1918 1921 2010 \n", "30 1931 1941 May 1, 2016 \n", "31 2020 2035 the Loan Parties \n", "32 2047 2065 RECEIVABLE FACILITY \n", "33 2121 2127 Company \n", "34 2182 2203 the Receivables Entity \n", "35 2312 2350 a Permitted Accounts Receivable Program \n", "36 2503 2522 ADMINISTRATIVE AGENT \n", "37 2679 2690 Section 9.15 \n", "38 2708 2721 AGENT'S LETTER \n", "39 2771 2782 Section 9.15 \n", "40 2785 2793 AFFILIATE \n", "41 2939 2944 Person \n", "42 3249 3254 Person \n", "43 3396 3403 contract \n", "44 3494 3502 AGREEMENT \n", "45 3520 3535 Credit Agreement \n", "46 3760 3784 Executive Order No. 13224 \n", "47 3787 3801 the USA Patriot \n", "48 3803 3805 Act \n", "49 3861 3863 Act \n", "50 3895 3965 the United States Treasury Department's Office... 
\n", "51 3992 3995 Laws \n", "52 4197 4210 SCHEDULE 1.1(A \n", "53 4355 4368 SCHEDULE 1.1(A \n", "54 4438 4446 Euro-Rate \n", "55 4583 4596 SCHEDULE 1.1(A \n", "56 4752 4765 SCHEDULE 1.1(A \n", "57 4769 4803 ASSIGNMENT AND ASSUMPTION AGREEMENT \n", "58 4834 4853 Assumption Agreement \n", "59 5032 5044 EXHIBIT 1.1(A \n", "\n", " entity confidence \n", "0 DOC 0.99310005 \n", "1 DOC 0.9901 \n", "2 DATE 0.8848001 \n", "3 ORG 0.6135 \n", "4 LOC 0.7951 \n", "5 ORG 0.78772503 \n", "6 ROLE 0.9892 \n", "7 ORG 0.7839833 \n", "8 DOC 0.9553 \n", "9 PARTY 0.8001 \n", "10 ORG 0.549325 \n", "11 ORG 0.631575 \n", "12 WHEREAS_SUBJECT 0.9951 \n", "13 WHEREAS_ACTION 0.97775 \n", "14 WHEREAS_OBJECT 0.80509996 \n", "15 WHEREAS_SUBJECT 0.84663326 \n", "16 DOC 0.82975 \n", "17 ORG 0.9781 \n", "18 WHEREAS_SUBJECT 0.9569 \n", "19 WHEREAS_ACTION 0.836 \n", "20 WHEREAS_OBJECT 0.85765004 \n", "21 WHEREAS_SUBJECT 0.96730006 \n", "22 LAW 0.70685 \n", "23 DOC 0.9834 \n", "24 DATE 0.9707 \n", "25 DATE 0.9988 \n", "26 DATE 0.97282505 \n", "27 ORG 0.76336664 \n", "28 DATE 0.9851 \n", "29 DATE 0.9993 \n", "30 DATE 0.97245 \n", "31 ORG 0.77570003 \n", "32 DOC 0.7375 \n", "33 ORG 0.9981 \n", "34 ORG 0.60770005 \n", "35 ORG 0.54974 \n", "36 PARTY 0.67195 \n", "37 LAW 0.99465 \n", "38 DOC 0.55605 \n", "39 LAW 0.9943 \n", "40 ROLE 0.4648 \n", "41 ORG 0.8968 \n", "42 ORG 0.9664 \n", "43 DOC 0.9829 \n", "44 DOC 0.8701 \n", "45 DOC 0.99915 \n", "46 LAW 0.70444 \n", "47 LAW 0.4611 \n", "48 DOC 0.9427 \n", "49 DOC 0.9735 \n", "50 ORG 0.52493 \n", "51 LAW 0.9754 \n", "52 LAW 0.80805004 \n", "53 LAW 0.98765004 \n", "54 ORG 0.9422 \n", "55 LAW 0.7896 \n", "56 LAW 0.9899 \n", "57 DOC 0.74009997 \n", "58 DOC 0.8543 \n", "59 LAW 0.71099997 " ], "text/html": [ "\n", "
 \n", " " ] }, "metadata": {}, "execution_count": 38 } ], "source": [ "results_tabular = []\n", "for res in lp_results_full[0][\"merged_chunk\"]:\n", " results_tabular.append(\n", " (\n", " res.begin,\n", " res.end,\n", " res.result,\n", " res.metadata[\"entity\"],\n", " res.metadata[\"confidence\"],\n", " )\n", " )\n", "\n", "import pandas as pd\n", "\n", "pd.DataFrame(results_tabular, columns=[\"begin\", \"end\", \"chunk\", \"entity\", \"confidence\"])\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Ye7wGXTO5KI4" }, "source": [ "##📌 Multilabel classification" ] }, { "cell_type": "markdown", "metadata": { "id": "QS6Tz8iJ5NQI" }, "source": [ "In this section we will use the `MultiClassifierDL` annotator to identify more than one class in a single text." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "nzOhYhp35TD7", "outputId": "9c933fa1-14c2-4a80-a00e-87a29cd5555f" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "sent_bert_base_uncased_legal download started this may take some time.\n", "Approximate size to download 390.8 MB\n", "[OK!]\n", "legmulticlf_edgar download started this may take some time.\n", "Approximate size to download 13.3 MB\n", "[OK!]\n" ] } ], "source": [ "document_assembler = (\n", " nlp.DocumentAssembler().setInputCol(\"text\").setOutputCol(\"document\")\n", ")\n", "embeddings = (\n", " nlp.BertSentenceEmbeddings.pretrained(\"sent_bert_base_uncased_legal\", \"en\")\n", " .setInputCols(\"document\")\n", " .setOutputCol(\"sentence_embeddings\")\n", ")\n", "\n", "multiClassifier = (\n", " nlp.MultiClassifierDLModel.pretrained(\"legmulticlf_edgar\", \"en\", \"legal/models\")\n", " .setInputCols([\"sentence_embeddings\"])\n", " .setOutputCol(\"class\")\n", ")\n", "\n", "clf_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, multiClassifier])\n", "\n", "\n", "light_pipeline = nlp.LightPipeline(\n", " 
clf_pipeline.fit(spark.createDataFrame([[\"\"]]).toDF(\"text\"))\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "his5tnju96Rz" }, "source": [ "We will experiment with simpler texts to showcase its capabilities." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1QdIS2EF5n1Q", "outputId": "980fa383-105a-4835-e452-0602da947efd" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['waivers']" ] }, "metadata": {}, "execution_count": 40 } ], "source": [ "result = light_pipeline.annotate(\n", " \"\"\"No failure or delay by the Administrative Agent or any Lender in exercising any right or power hereunder shall operate as a waiver thereof, nor shall any single or partial exercise of any such right or power, or any abandonment or discontinuance of steps to enforce such a right or power, preclude any other or further exercise thereof or the exercise of any other right or power. The rights and remedies of the Administrative Agent and the Lenders hereunder are cumulative and are not exclusive of any rights or remedies that they would otherwise have. No waiver of any provision of this Agreement or consent to any departure by the Borrower therefrom shall in any event be effective unless the same shall be permitted by paragraph (b) of this Section, and then such waiver or consent shall be effective only in the specific instance and for the purpose for which given. 
Without limiting the generality of the foregoing, the making of a Loan shall not be construed as a waiver of any Default, regardless of whether the Administrative Agent or any Lender may have had notice or knowledge of such Default at the time.\n", " \"\"\".lower()\n", ")\n", "\n", "result[\"class\"]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "VnapppMv5oL2", "outputId": "ed9d666c-b4bb-4c33-f458-0864f29c37e7" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[]" ] }, "metadata": {}, "execution_count": 41 } ], "source": [ "result = light_pipeline.annotate(\n", " \"\"\"The provisions of this Agreement shall be binding upon and inure to the benefit of the parties hereto and their respective successors and assigns permitted hereby (including any Affiliate of the Issuing Bank that issues any Letter of Credit), except that (i) the Borrower may not assign or otherwise transfer any of its rights or obligations hereunder without the prior written consent of each Lender (and any attempted assignment or transfer by the Borrower without such consent shall be null and void) and (ii) no Lender may assign or otherwise transfer its rights or obligations hereunder except in accordance with this Section. 
Nothing in this Agreement, expressed or implied, shall be construed to confer upon any Person (other than the parties hereto, their respective successors and assigns permitted hereby (including any Affiliate of the Issuing Bank that issues any Letter of Credit), Participants (to the extent provided in paragraph (c) of this Section) and, to the extent expressly contemplated hereby, the Related Parties of each of the Administrative Agent, the Issuing Bank and the Lenders) any legal or equitable right, remedy or claim under or by reason of this Agreement.\n", " \"\"\".lower()\n", ")\n", "\n", "result[\"class\"]" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "VNIEIer25pX2", "outputId": "60d0ed5b-cd9c-4fc7-a0f9-b4cda66000cc" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['warranties', 'representations']" ] }, "metadata": {}, "execution_count": 42 } ], "source": [ "result = light_pipeline.annotate(\n", " \"\"\"After the effectiveness of this Amendment, the representations and warranties of the Borrower set forth in the Credit Agreement and in the other Loan Documents are true and correct in all material respects on and as of the date hereof, with the same force and effect as if made on and as of such date, except to the extent that such representations and warranties (i) specifically refer to an earlier date, in which case they shall be true and correct in all material respects as of such earlier date (except to the extent of changes in facts or circumstances that have been disclosed to the Lenders and do not constitute an Event of Default or a Potential Default under the Credit Agreement or any other Loan Document), and (ii) are already qualified by materiality, in which case they shall be true and correct in all respects, and except that for purposes of this Section 4.1 , the representations and warranties contained in Section 7.6 of the Credit Agreement shall 
be deemed to refer to the most recent financial statements furnished pursuant to Section 8.1(a) of the Credit Agreement.\n", " \"\"\".lower()\n", ")\n", "\n", "result[\"class\"] " ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6Jhv3ttw5slM", "outputId": "0d4e3ab1-fdbc-4602-ac13-c24747d8c7a1" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['notices']" ] }, "metadata": {}, "execution_count": 43 } ], "source": [ "result = light_pipeline.annotate(\"\"\"All notices and other communications provided for in this Agreement and the other Loan Documents shall be in writing and may (subject to paragraph (b) below) be telecopied (faxed), mailed by certified mail return receipt requested, or delivered by hand or overnight courier service to the intended recipient at the addresses specified below or at such other address as shall be designated by any party listed below in a notice to the other parties listed below given in accordance with this Section.\"\"\".lower())\n", "\n", "result[\"class\"]" ] }, { "cell_type": "markdown", "metadata": { "id": "8W0XGe8ytTQz" }, "source": [ "##📌 Identify obligations using Transformers (Bert) models" ] }, { "cell_type": "markdown", "metadata": { "id": "j-tu8IoKThug" }, "source": [ "In this section, we will illustrate how to use a different NER annotator that is based on the Transformer architecture." ] }, { "cell_type": "markdown", "metadata": { "id": "ERgcjSNhtV4M" }, "source": [ "The `BertForTokenClassification` annotator can load BERT models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity Recognition (NER) tasks.\n", "\n", "Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. 
To see which models are compatible and how to import them, see [Import Transformers into Spark NLP 🚀](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669).\n", "\n", "Using these models is very similar to the `NerModel` we used before. We adjust the pipeline by adding the `BertForTokenClassification` step instead of `NerModel`, and we don't need a separate `Embeddings` step, since the embeddings are already part of the new annotator.\n", "\n", "Then, the pipeline is just:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "uskAoYzMtVdP" }, "outputs": [], "source": [ "def bert_pipeline(model_name=\"legner_obligations\", language=\"en\"):\n", " documentAssembler = (\n", " nlp.DocumentAssembler().setInputCol(\"text\").setOutputCol(\"document\")\n", " )\n", "\n", " tokenizer = nlp.Tokenizer().setInputCols(\"document\").setOutputCol(\"token\")\n", "\n", " tokenClassifier = (\n", " legal.BertForTokenClassification.pretrained(\n", " model_name, language, \"legal/models\"\n", " )\n", " .setInputCols([\"token\", \"document\"])\n", " .setOutputCol(\"label\")\n", " .setCaseSensitive(True)\n", " )\n", "\n", " ner_converter = (\n", " nlp.NerConverter()\n", " .setInputCols([\"document\", \"token\", \"label\"])\n", " .setOutputCol(\"ner_chunk\")\n", " )\n", "\n", " pipeline = nlp.Pipeline(\n", " stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter]\n", " )\n", "\n", " empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", " model = pipeline.fit(empty_data)\n", " return model" ] }, { "cell_type": "markdown", "metadata": { "id": "M9Y6Kx73u4qD" }, "source": [ "For Legal NLP we currently have Bert models for English, German, Arabic, and (Brazilian) Portuguese available, but we are constantly adding new models with every release. 
The model we will use here is `legner_obligations`, which identifies obligations (what the different parties commit to do) in agreement documents.\n", "\n", "This model extracts the subject (who commits to doing something), the action (the verb: will provide, shall sign…) and the object (what the subject will provide, what the subject shall sign, etc.). Also, if the recipient of the obligation is a third party (a subject will provide to Company X…), then that third party (Company X) will be extracted as an indirect object.\n", "\n", "The model was trained on in-house annotations of documents from the CUAD dataset.\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1XIaZ26gzXzj", "outputId": "dc067981-3593-4979-ba3a-1e264582ab32" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "legner_obligations download started this may take some time.\n", "[OK!]\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "[DocumentAssembler_95ad50798e30,\n", " REGEX_TOKENIZER_67928cc401e1,\n", " BERT_FOR_TOKEN_CLASSIFICATION_d6615a22d5c2,\n", " NerConverter_a3663db7abd4]" ] }, "metadata": {}, "execution_count": 8 } ], "source": [ "bert_model = bert_pipeline(\"legner_obligations\", \"en\")\n", "bert_model.stages" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "qcP_edOLzXw0", "outputId": "cd4f9143-ad44-4ae9-dc55-987c614604f3" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['B-OBLIGATION_ACTION',\n", " 'I-OBLIGATION_INDIRECT_OBJECT',\n", " 'I-OBLIGATION',\n", " 'B-OBLIGATION_INDIRECT_OBJECT',\n", " 'PAD',\n", " 'I-OBLIGATION_SUBJECT',\n", " 'I-OBLIGATION_ACTION',\n", " 'O',\n", " 'B-OBLIGATION_SUBJECT',\n", " 'B-OBLIGATION']" ] }, "metadata": {}, "execution_count": 9 } ], "source": [ "bert_model.stages[-2].getClasses()" ] }, { "cell_type": "code", "execution_count": 10, 
"metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "hHV-y73izXub", "outputId": "e3ba2ee7-93f0-4e7b-9944-f2c573b041bb" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "+----------+--------------------+----------+\n", "|     token|               label|confidence|\n", "+----------+--------------------+----------+\n", "|       The|                   O|0.71809256|\n", "|     Buyer|B-OBLIGATION_SUBJECT|0.86514723|\n", "|     shall| B-OBLIGATION_ACTION|0.99315745|\n", "|       use| I-OBLIGATION_ACTION| 0.9729679|\n", "|      such|        B-OBLIGATION| 0.7499739|\n", "| materials|        I-OBLIGATION| 0.9127689|\n", "|       and|        I-OBLIGATION|0.88955635|\n", "|  supplies|        I-OBLIGATION| 0.9182221|\n", "|      only|        I-OBLIGATION|0.82361615|\n", "|        in|        I-OBLIGATION| 0.8662357|\n", "|accordance|        I-OBLIGATION| 0.9251934|\n", "|      with|        I-OBLIGATION| 0.8835488|\n", "|       the|        I-OBLIGATION|0.53246284|\n", "|   present|        I-OBLIGATION| 0.8670555|\n", "| agreement|        I-OBLIGATION| 0.8018013|\n", "+----------+--------------------+----------+\n", "\n" ] } ], "source": [ "import pyspark.sql.functions as F\n", "\n", "text = \"\"\"The Buyer shall use such materials and supplies only in accordance with the present agreement\"\"\"\n", "\n", "res = bert_model.transform(spark.createDataFrame([[text]]).toDF(\"text\"))\n", "\n", "result_df = res.select(\n", "    F.explode(\n", "        F.arrays_zip(res.token.result, res.label.result, res.label.metadata)\n", "    ).alias(\"cols\")\n", ").select(\n", "    F.expr(\"cols['0']\").alias(\"token\"),\n", "    F.expr(\"cols['1']\").alias(\"label\"),\n", "    F.expr(\"cols['2']['confidence']\").alias(\"confidence\"),\n", ")\n", "\n", "result_df.show(truncate=100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mn2JTTusQVjO" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": 
".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:37:50) [MSC v.1916 64 bit (AMD64)]" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "939480ed579cbcc9bd95c0bb2f0a271d068ec362d36f1415ed941c7dadb52340" } } }, "nbformat": 4, "nbformat_minor": 0 }