{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea",
      "metadata": {
        "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea"
      },
      "source": [
        "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "21e9eafb",
      "metadata": {
        "id": "21e9eafb"
      },
      "source": [
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/10.1.Chunk_Mappers_Training.ipynb)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "gk3kZHmNj51v",
      "metadata": {
        "collapsed": false,
        "id": "gk3kZHmNj51v"
      },
      "source": [
        "# Installation"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "_914itZsj51v",
      "metadata": {
        "id": "_914itZsj51v",
        "pycharm": {
          "is_executing": true
        }
      },
      "outputs": [],
      "source": [
        "! pip install -q johnsnowlabs"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "YPsbAnNoPt0Z",
      "metadata": {
        "id": "YPsbAnNoPt0Z"
      },
      "source": [
        "## Automatic Installation\n",
        "Using my.johnsnowlabs.com SSO"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "fY0lcShkj51w",
      "metadata": {
        "id": "fY0lcShkj51w",
        "pycharm": {
          "is_executing": true
        }
      },
      "outputs": [],
      "source": [
        "from johnsnowlabs import nlp, legal\n",
        "\n",
        "# nlp.install(force_browser=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "hsJvn_WWM2GL",
      "metadata": {
        "id": "hsJvn_WWM2GL"
      },
      "source": [
        "## Manual downloading\n",
        "If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to upload the license manually.\n",
        "\n",
        "- Go to my.johnsnowlabs.com\n",
        "- Download your license\n",
        "- Upload it using the following command"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "i57QV3-_P2sQ",
      "metadata": {
        "id": "i57QV3-_P2sQ"
      },
      "outputs": [],
      "source": [
        "from google.colab import files\n",
        "print('Please Upload your John Snow Labs License using the button below')\n",
        "license_keys = files.upload()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "xGgNdFzZP_hQ",
      "metadata": {
        "id": "xGgNdFzZP_hQ"
      },
      "source": [
        "- Install it"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "OfmmPqknP4rR",
      "metadata": {
        "id": "OfmmPqknP4rR"
      },
      "outputs": [],
      "source": [
        "nlp.install()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "DCl5ErZkNNLk",
      "metadata": {
        "id": "DCl5ErZkNNLk"
      },
      "source": [
        "# Starting"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "wRXTnNl3j51w",
      "metadata": {
        "id": "wRXTnNl3j51w"
      },
      "outputs": [],
      "source": [
        "spark = nlp.start()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "cfbbcfc0-e0b7-4c25-8bd7-c64d90f836d1",
      "metadata": {
        "id": "cfbbcfc0-e0b7-4c25-8bd7-c64d90f836d1"
      },
      "source": [
        "# Legal Data Augmentation with Chunk Mappers"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "d2cd4221-fbca-4ca1-86a9-65e6264c4ad1",
      "metadata": {
        "id": "d2cd4221-fbca-4ca1-86a9-65e6264c4ad1"
      },
      "source": [
        "# About Data Augmentation"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "bf9835fd-9def-44e4-b022-e8db0f045fec",
      "metadata": {
        "id": "bf9835fd-9def-44e4-b022-e8db0f045fec"
      },
      "source": [
        "__Data Augmentation__ is the process of enriching an extracted data point with information from external sources.\n",
        "\n",
        "For example, suppose we are working with a document that mentions the company _Amazon_. The document could be about stock prices, litigation, or a commercial agreement with a provider, among other topics.\n",
        "\n",
        "Using NER, we can extract the company name from the document as an Organization, but that is all the information the document provides about the company.\n",
        "\n",
        "With __Data Augmentation__, we can use external sources, such as _SEC Edgar, Crunchbase, Nasdaq_ or even _Wikipedia_, to enrich the company with much more information, allowing us to make better decisions.\n",
        "\n",
        "Let's see how to do it."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "UP1E0vZOpZ3h",
      "metadata": {
        "id": "UP1E0vZOpZ3h"
      },
      "source": [
        "# Train Your Own ChunkMapper Model"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "34f517bb-adde-4daa-b12d-921b37dd6d38",
      "metadata": {
        "id": "34f517bb-adde-4daa-b12d-921b37dd6d38"
      },
      "source": [
        "Here, we will train a ChunkMapper model with 1000 samples."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "56mbt5eO397E",
      "metadata": {
        "id": "56mbt5eO397E"
      },
      "outputs": [],
      "source": [
        "! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/sample_openedgar.json"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "8be8c43c-ebf2-4d1e-b98b-31c0fe68cabc",
      "metadata": {
        "id": "8be8c43c-ebf2-4d1e-b98b-31c0fe68cabc"
      },
      "outputs": [],
      "source": [
        "import json\n",
        "with open('sample_openedgar.json', 'r') as f:\n",
        "    company_json = json.load(f)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "be07f846-109f-4834-89ff-483bd00c5ab5",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "be07f846-109f-4834-89ff-483bd00c5ab5",
        "outputId": "24b4056f-31f7-4159-f848-ff840f6bca6b"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'key': 'AWA Group LP',\n",
              " 'relations': [{'key': 'name', 'values': ['AWA Group LP']},\n",
              "  {'key': 'sic', 'values': ['INVESTMENT ADVICE [6282]']},\n",
              "  {'key': 'sic_code', 'values': [6282, 0]},\n",
              "  {'key': 'irs_number', 'values': [371785232, 0]},\n",
              "  {'key': 'fiscal_year_end', 'values': [630, 1231, 0]},\n",
              "  {'key': 'state_location', 'values': ['NC']},\n",
              "  {'key': 'state_incorporation', 'values': ['DE']},\n",
              "  {'key': 'business_street', 'values': ['116 SOUTH FRANKLIN STREET']},\n",
              "  {'key': 'business_city', 'values': ['ROCKY MOUNT']},\n",
              "  {'key': 'business_state', 'values': ['NC']},\n",
              "  {'key': 'business_zip', 'values': ['27804']},\n",
              "  {'key': 'business_phone', 'values': ['952-446-6678']},\n",
              "  {'key': 'former_name', 'values': ['']},\n",
              "  {'key': 'former_name_date', 'values': ['']},\n",
              "  {'key': 'date',\n",
              "   'values': ['2017-01-23',\n",
              "    '2017-03-16',\n",
              "    '2016-01-22',\n",
              "    '2016-01-19',\n",
              "    '2015-06-30',\n",
              "    '2016-04-14',\n",
              "    '2016-07-27',\n",
              "    '2016-10-28',\n",
              "    '2015-06-26',\n",
              "    '2015-09-02',\n",
              "    '2015-09-29',\n",
              "    '2015-12-31']},\n",
              "  {'key': 'company_id', 'values': [1645148]}]}"
            ]
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "company_json['mappings'][8]"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "jiN1I0L_vqPK",
      "metadata": {
        "id": "jiN1I0L_vqPK"
      },
      "source": [
        "### Check a sample company"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "80cd73e4-288c-44fc-8ff3-fdded39ba25a",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "80cd73e4-288c-44fc-8ff3-fdded39ba25a",
        "outputId": "4c21c161-6611-4209-b1c9-4a3e939d2e94"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'key': 'Rayton Solar Inc.', 'relations': [{'key': 'name', 'values': ['Rayton Solar Inc.']}, {'key': 'sic', 'values': ['SEMICONDUCTORS & RELATED DEVICES [3674]']}, {'key': 'sic_code', 'values': [3674]}, {'key': 'irs_number', 'values': [0]}, {'key': 'fiscal_year_end', 'values': [1231]}, {'key': 'state_location', 'values': ['CA']}, {'key': 'state_incorporation', 'values': ['DE']}, {'key': 'business_street', 'values': ['920 COLORADO AVE.']}, {'key': 'business_city', 'values': ['SANTA MONICA']}, {'key': 'business_state', 'values': ['CA']}, {'key': 'business_zip', 'values': ['90401']}, {'key': 'business_phone', 'values': ['(661) 259-4786']}, {'key': 'former_name', 'values': ['']}, {'key': 'former_name_date', 'values': ['']}, {'key': 'date', 'values': ['2017-01-10', '2017-01-20', '2017-01-06', '2017-05-15', '2017-09-28', '2016-11-29', '2016-12-20', '2016-12-22', '2022-09-21', '2019-06-27', '2018-03-22', '2018-04-30', '2018-12-10', '2021-09-22', '2020-06-08', '2020-09-28']}, {'key': 'company_id', 'values': [1654124]}]}\n"
          ]
        }
      ],
      "source": [
        "for x in company_json['mappings']:\n",
        "    if 'Rayton Solar Inc.' in x['key']:\n",
        "        print(x)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "loOpl4LqvvPi",
      "metadata": {
        "id": "loOpl4LqvvPi"
      },
      "source": [
        "### Check all keys"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e9fe7709-868d-453f-a162-7fa737f50989",
      "metadata": {
        "id": "e9fe7709-868d-453f-a162-7fa737f50989"
      },
      "outputs": [],
      "source": [
        "all_rels = [x['key'] for x in company_json['mappings'][0]['relations']]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c47604d3-9028-41a1-a0be-a669d105beb3",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "c47604d3-9028-41a1-a0be-a669d105beb3",
        "outputId": "33e97e60-62fe-46fc-c1a9-021125e6ddc3"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "['name',\n",
              " 'sic',\n",
              " 'sic_code',\n",
              " 'irs_number',\n",
              " 'fiscal_year_end',\n",
              " 'state_location',\n",
              " 'state_incorporation',\n",
              " 'business_street',\n",
              " 'business_city',\n",
              " 'business_state',\n",
              " 'business_zip',\n",
              " 'business_phone',\n",
              " 'former_name',\n",
              " 'former_name_date',\n",
              " 'date',\n",
              " 'company_id']"
            ]
          },
          "execution_count": 11,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "all_rels"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "7opshMS1vx1H",
      "metadata": {
        "id": "7opshMS1vx1H"
      },
      "source": [
        "### Create ChunkMapperApproach"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "b5d078b5-bd12-4d4a-ade7-3200a278e061",
      "metadata": {
        "id": "b5d078b5-bd12-4d4a-ade7-3200a278e061",
        "tags": []
      },
      "outputs": [],
      "source": [
        "chunkerMapper = legal.ChunkMapperApproach()\\\n",
        "      .setInputCols([\"ner_chunk\"])\\\n",
        "      .setOutputCol(\"mappings\")\\\n",
        "      .setDictionary(\"sample_openedgar.json\")\\\n",
        "      .setRels(all_rels)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "940ca6b5-a603-4060-a112-d1f407834d4f",
      "metadata": {
        "id": "940ca6b5-a603-4060-a112-d1f407834d4f"
      },
      "outputs": [],
      "source": [
        "empty_dataset = spark.createDataFrame([[\"\"]]).toDF(\"text\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "6bd7485a-6f3c-442a-8d74-a07ae0385d56",
      "metadata": {
        "id": "6bd7485a-6f3c-442a-8d74-a07ae0385d56"
      },
      "outputs": [],
      "source": [
        "fit_CM = chunkerMapper.fit(empty_dataset)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "d3336018-e50c-4cb9-a07f-755bae236d80",
      "metadata": {
        "id": "d3336018-e50c-4cb9-a07f-755bae236d80"
      },
      "outputs": [],
      "source": [
        "# Save model\n",
        "fit_CM.write().overwrite().save('openedgar_2000_2022_company_mapper')"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "Cg0oUxTbv3oE",
      "metadata": {
        "id": "Cg0oUxTbv3oE"
      },
      "source": [
        "### Let's test our ChunkMapper model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "DlxWvIcTsqTd",
      "metadata": {
        "id": "DlxWvIcTsqTd"
      },
      "outputs": [],
      "source": [
        "text = [\"\"\" AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. \"\"\"]"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "R1caDZbF14eq",
      "metadata": {
        "id": "R1caDZbF14eq"
      },
      "source": [
        "We extract the company name from the sample text."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "K0_gi-xF0B56",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "K0_gi-xF0B56",
        "outputId": "5c8d6ecb-4d8e-4f41-985f-235e88947191"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "sentence_detector_dl download started this may take some time.\n",
            "Approximate size to download 514.9 KB\n",
            "[OK!]\n",
            "bert_embeddings_sec_bert_base download started this may take some time.\n",
            "Approximate size to download 390.4 MB\n",
            "[OK!]\n",
            "legner_org_per_role_date download started this may take some time.\n",
            "[OK!]\n"
          ]
        }
      ],
      "source": [
        "documentAssembler = nlp.DocumentAssembler()\\\n",
        "        .setInputCol(\"text\")\\\n",
        "        .setOutputCol(\"document\")\n",
        "        \n",
        "sentenceDetector = nlp.SentenceDetectorDLModel.pretrained(\"sentence_detector_dl\",\"xx\")\\\n",
        "        .setInputCols([\"document\"])\\\n",
        "        .setOutputCol(\"sentence\")\n",
        "\n",
        "tokenizer = nlp.Tokenizer()\\\n",
        "        .setInputCols([\"sentence\"])\\\n",
        "        .setOutputCol(\"token\")\n",
        "\n",
        "embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\") \\\n",
        "        .setInputCols([\"sentence\", \"token\"]) \\\n",
        "        .setOutputCol(\"embeddings\")\n",
        "\n",
        "ner_model = legal.NerModel.pretrained(\"legner_org_per_role_date\", \"en\", \"legal/models\")\\\n",
        "        .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n",
        "        .setOutputCol(\"ner\")\n",
        "\n",
        "ner_converter = nlp.NerConverter()\\\n",
        "        .setInputCols([\"sentence\",\"token\",\"ner\"])\\\n",
        "        .setOutputCol(\"ner_chunk\")\\\n",
        "        .setWhiteList([\"ORG\"]) # Return only ORG entities\n",
        "\n",
        "nlpPipeline = nlp.Pipeline(stages=[\n",
        "        documentAssembler,\n",
        "        sentenceDetector,\n",
        "        tokenizer,\n",
        "        embeddings,\n",
        "        ner_model,\n",
        "        ner_converter])\n",
        "\n",
        "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n",
        "\n",
        "model = nlpPipeline.fit(empty_data)\n",
        "\n",
        "light_model = nlp.LightPipeline(model)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "_E5Szc8ast4n",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "_E5Szc8ast4n",
        "outputId": "5abee18b-5783-4ed4-db2a-3383db71c71f"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[{'document': [Annotation(document, 0, 129,  AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. , {})],\n",
              "  'ner_chunk': [Annotation(chunk, 1, 12, AWA Group LP, {'entity': 'ORG', 'sentence': '0', 'chunk': '0', 'confidence': '0.9788'})],\n",
              "  'token': [Annotation(token, 1, 3, AWA, {'sentence': '0'}),\n",
              "   Annotation(token, 5, 9, Group, {'sentence': '0'}),\n",
              "   Annotation(token, 11, 12, LP, {'sentence': '0'}),\n",
              "   Annotation(token, 14, 20, intends, {'sentence': '0'}),\n",
              "   Annotation(token, 22, 23, to, {'sentence': '0'}),\n",
              "   Annotation(token, 25, 27, pay, {'sentence': '0'}),\n",
              "   Annotation(token, 29, 37, dividends, {'sentence': '0'}),\n",
              "   Annotation(token, 39, 40, on, {'sentence': '0'}),\n",
              "   Annotation(token, 42, 44, the, {'sentence': '0'}),\n",
              "   Annotation(token, 46, 51, Common, {'sentence': '0'}),\n",
              "   Annotation(token, 53, 57, Units, {'sentence': '0'}),\n",
              "   Annotation(token, 59, 60, on, {'sentence': '0'}),\n",
              "   Annotation(token, 62, 62, a, {'sentence': '0'}),\n",
              "   Annotation(token, 64, 72, quarterly, {'sentence': '0'}),\n",
              "   Annotation(token, 74, 78, basis, {'sentence': '0'}),\n",
              "   Annotation(token, 80, 81, at, {'sentence': '0'}),\n",
              "   Annotation(token, 83, 84, an, {'sentence': '0'}),\n",
              "   Annotation(token, 86, 91, annual, {'sentence': '0'}),\n",
              "   Annotation(token, 93, 96, rate, {'sentence': '0'}),\n",
              "   Annotation(token, 98, 99, of, {'sentence': '0'}),\n",
              "   Annotation(token, 101, 105, 8.00%, {'sentence': '0'}),\n",
              "   Annotation(token, 107, 108, of, {'sentence': '0'}),\n",
              "   Annotation(token, 110, 112, the, {'sentence': '0'}),\n",
              "   Annotation(token, 114, 121, Offering, {'sentence': '0'}),\n",
              "   Annotation(token, 123, 127, Price, {'sentence': '0'}),\n",
              "   Annotation(token, 128, 128, ., {'sentence': '0'})],\n",
              "  'ner': [Annotation(named_entity, 1, 3, B-ORG, {'word': 'AWA', 'confidence': '1.0', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 5, 9, I-ORG, {'word': 'Group', 'confidence': '0.9371', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 11, 12, I-ORG, {'word': 'LP', 'confidence': '0.9993', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 14, 20, O, {'word': 'intends', 'confidence': '0.9983', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 22, 23, O, {'word': 'to', 'confidence': '1.0', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 25, 27, O, {'word': 'pay', 'confidence': '0.9992', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 29, 37, O, {'word': 'dividends', 'confidence': '0.9991', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 39, 40, O, {'word': 'on', 'confidence': '0.999', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 42, 44, O, {'word': 'the', 'confidence': '0.9993', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 46, 51, O, {'word': 'Common', 'confidence': '0.9864', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 53, 57, O, {'word': 'Units', 'confidence': '0.961', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 59, 60, O, {'word': 'on', 'confidence': '1.0', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 62, 62, O, {'word': 'a', 'confidence': '1.0', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 64, 72, O, {'word': 'quarterly', 'confidence': '1.0', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 74, 78, O, {'word': 'basis', 'confidence': '1.0', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 80, 81, O, {'word': 'at', 'confidence': '1.0', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 83, 84, O, {'word': 'an', 'confidence': '1.0', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 86, 91, O, {'word': 'annual', 'confidence': '1.0', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 93, 96, O, {'word': 'rate', 'confidence': '0.9995', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 98, 99, O, {'word': 'of', 'confidence': '0.9988', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 101, 105, O, {'word': '8.00%', 'confidence': '0.998', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 107, 108, O, {'word': 'of', 'confidence': '0.9996', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 110, 112, O, {'word': 'the', 'confidence': '0.9999', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 114, 121, O, {'word': 'Offering', 'confidence': '0.9987', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 123, 127, O, {'word': 'Price', 'confidence': '0.9873', 'sentence': '0'}),\n",
              "   Annotation(named_entity, 128, 128, O, {'word': '.', 'confidence': '0.9999', 'sentence': '0'})],\n",
              "  'embeddings': [Annotation(word_embeddings, 1, 3, AWA, {'isOOV': 'false', 'pieceId': '101', 'isWordStart': 'true', 'token': 'AWA', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 5, 9, Group, {'isOOV': 'false', 'pieceId': '101', 'isWordStart': 'true', 'token': 'Group', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 11, 12, LP, {'isOOV': 'false', 'pieceId': '101', 'isWordStart': 'true', 'token': 'LP', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 14, 20, intends, {'isOOV': 'false', 'pieceId': '4255', 'isWordStart': 'true', 'token': 'intends', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 22, 23, to, {'isOOV': 'false', 'pieceId': '631', 'isWordStart': 'true', 'token': 'to', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 25, 27, pay, {'isOOV': 'false', 'pieceId': '936', 'isWordStart': 'true', 'token': 'pay', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 29, 37, dividends, {'isOOV': 'false', 'pieceId': '1919', 'isWordStart': 'true', 'token': 'dividends', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 39, 40, on, {'isOOV': 'false', 'pieceId': '666', 'isWordStart': 'true', 'token': 'on', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 42, 44, the, {'isOOV': 'false', 'pieceId': '612', 'isWordStart': 'true', 'token': 'the', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 46, 51, Common, {'isOOV': 'false', 'pieceId': '101', 'isWordStart': 'true', 'token': 'Common', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 53, 57, Units, {'isOOV': 'false', 'pieceId': '101', 'isWordStart': 'true', 'token': 'Units', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 59, 60, on, {'isOOV': 'false', 'pieceId': '666', 'isWordStart': 'true', 'token': 'on', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 62, 62, a, {'isOOV': 'false', 'pieceId': '143', 'isWordStart': 'true', 'token': 'a', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 64, 72, quarterly, {'isOOV': 'false', 'pieceId': '2181', 'isWordStart': 'true', 'token': 'quarterly', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 74, 78, basis, {'isOOV': 'false', 'pieceId': '1277', 'isWordStart': 'true', 'token': 'basis', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 80, 81, at, {'isOOV': 'false', 'pieceId': '746', 'isWordStart': 'true', 'token': 'at', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 83, 84, an, {'isOOV': 'false', 'pieceId': '620', 'isWordStart': 'true', 'token': 'an', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 86, 91, annual, {'isOOV': 'false', 'pieceId': '1207', 'isWordStart': 'true', 'token': 'annual', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 93, 96, rate, {'isOOV': 'false', 'pieceId': '1072', 'isWordStart': 'true', 'token': 'rate', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 98, 99, of, {'isOOV': 'false', 'pieceId': '619', 'isWordStart': 'true', 'token': 'of', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 101, 105, 8.00%, {'isOOV': 'false', 'pieceId': '128', 'isWordStart': 'true', 'token': '8.00%', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 107, 108, of, {'isOOV': 'false', 'pieceId': '619', 'isWordStart': 'true', 'token': 'of', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 110, 112, the, {'isOOV': 'false', 'pieceId': '612', 'isWordStart': 'true', 'token': 'the', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 114, 121, Offering, {'isOOV': 'false', 'pieceId': '101', 'isWordStart': 'true', 'token': 'Offering', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 123, 127, Price, {'isOOV': 'false', 'pieceId': '101', 'isWordStart': 'true', 'token': 'Price', 'sentence': '0'}),\n",
              "   Annotation(word_embeddings, 128, 128, ., {'isOOV': 'false', 'pieceId': '118', 'isWordStart': 'true', 'token': '.', 'sentence': '0'})],\n",
              "  'sentence': [Annotation(document, 1, 128, AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price., {'sentence': '0'})]}]"
            ]
          },
          "execution_count": 18,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# We get company name from sample text\n",
        "\n",
        "ner_result = light_model.fullAnnotate(text)\n",
        "\n",
        "ner_result"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "uGcQcPIntV5e",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 36
        },
        "id": "uGcQcPIntV5e",
        "outputId": "dd2383a3-8dae-4ccf-88a1-b50829163b4b"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'AWA Group LP'"
            ]
          },
          "execution_count": 19,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "ORG = ner_result[0][\"ner_chunk\"][0].result\n",
        "\n",
        "ORG"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "plvb6cMVCe0O",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "plvb6cMVCe0O",
        "outputId": "5d1ccd9f-eec0-4ef6-d6e5-c55c56e74a3f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "tfhub_use download started this may take some time.\n",
            "Approximate size to download 923.7 MB\n",
            "[OK!]\n",
            "legel_edgar_company_name download started this may take some time.\n",
            "[OK!]\n"
          ]
        }
      ],
      "source": [
        "embeddings = nlp.UniversalSentenceEncoder.pretrained(\"tfhub_use\", \"en\") \\\n",
        "      .setInputCols(\"document\") \\\n",
        "      .setOutputCol(\"sentence_embeddings\")\n",
        "    \n",
        "resolver = legal.SentenceEntityResolverModel.pretrained(\"legel_edgar_company_name\", \"en\", \"legal/models\")\\\n",
        "      .setInputCols([\"sentence_embeddings\"]) \\\n",
        "      .setOutputCol(\"resolution\")\\\n",
        "      .setDistanceFunction(\"EUCLIDEAN\")\n",
        "\n",
        "pipelineModel = nlp.PipelineModel(\n",
        "      stages = [\n",
        "          documentAssembler,\n",
        "          embeddings,\n",
        "          resolver])\n",
        "\n",
        "lp_res = nlp.LightPipeline(pipelineModel)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "vFFo5P9Nttss",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "vFFo5P9Nttss",
        "outputId": "7ccc6b6a-6b76-47b3-db55-befc3bbaeaed"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'document': ['AWA Group LP'],\n",
              " 'sentence_embeddings': ['AWA Group LP'],\n",
              " 'resolution': ['AWA Group LP']}"
            ]
          },
          "execution_count": 21,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# We normalize company name\n",
        "\n",
        "el_res = lp_res.annotate(ORG)\n",
        "\n",
        "el_res"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "EbCbbGZ9ttvP",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 36
        },
        "id": "EbCbbGZ9ttvP",
        "outputId": "2289cc59-0cd6-4ba5-f322-f6309e74d6c2"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'AWA Group LP'"
            ]
          },
          "execution_count": 22,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "NORM_ORG = el_res[\"resolution\"][0]\n",
        "\n",
        "NORM_ORG"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "z-2Q7nnywUaH",
      "metadata": {
        "id": "z-2Q7nnywUaH"
      },
      "source": [
        "### Let's load our ChunkMapper model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "Vx3_V33_ttyC",
      "metadata": {
        "id": "Vx3_V33_ttyC"
      },
      "outputs": [],
      "source": [
        "documentAssembler = nlp.DocumentAssembler()\\\n",
        "    .setInputCol(\"text\")\\\n",
        "    .setOutputCol(\"document\")\n",
        "\n",
        "chunkAssembler = nlp.Doc2Chunk() \\\n",
        "    .setInputCols(\"document\") \\\n",
        "    .setOutputCol(\"chunk\") \\\n",
        "    .setIsArray(False)\n",
        "\n",
        "CM = legal.ChunkMapperModel.load(\"openedgar_2000_2022_company_mapper\")\\\n",
        "    .setInputCols([\"chunk\"])\\\n",
        "    .setOutputCol(\"mappings\")\n",
        "\n",
        "cm_pipeline = nlp.Pipeline(stages=[documentAssembler, \n",
        "                                   chunkAssembler, \n",
        "                                   CM])\n",
        "\n",
        "fit_cm_pipeline = cm_pipeline.fit(empty_data)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "w5I4muUkvJG7",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "w5I4muUkvJG7",
        "outputId": "8c7c2e2e-2f77-4ba2-9ece-ccb0420d4f06"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "+------------+\n",
            "|        text|\n",
            "+------------+\n",
            "|AWA Group LP|\n",
            "+------------+\n",
            "\n"
          ]
        }
      ],
      "source": [
        "# LightPipeline does not support Doc2Chunk, so we use a regular transform() on a DataFrame here\n",
        "\n",
        "df = spark.createDataFrame([[NORM_ORG]]).toDF(\"text\")\n",
        "\n",
        "df.show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "zS0Q-zamvOsT",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zS0Q-zamvOsT",
        "outputId": "437b4354-127d-473d-a5c0-a19acf3c525a"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "+------------+--------------------+--------------------+--------------------+\n",
            "|        text|            document|               chunk|            mappings|\n",
            "+------------+--------------------+--------------------+--------------------+\n",
            "|AWA Group LP|[{document, 0, 11...|[{chunk, 0, 11, A...|[{labeled_depende...|\n",
            "+------------+--------------------+--------------------+--------------------+\n",
            "\n"
          ]
        }
      ],
      "source": [
        "res = fit_cm_pipeline.transform(df)\n",
        "\n",
        "res.show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e7b2c22c-49f6-4277-8636-62b876b0bf08",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "e7b2c22c-49f6-4277-8636-62b876b0bf08",
        "outputId": "7142e762-c238-437f-896f-b5957107f031",
        "tags": []
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "+----------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
            "|result                                                                                                                                                          |\n",
            "+----------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
            "|[AWA Group LP, INVESTMENT ADVICE [6282], 6282, 371785232, 630, NC, DE, 116 SOUTH FRANKLIN STREET, ROCKY MOUNT, NC, 27804, 952-446-6678, , , 2017-01-23, 1645148]|\n",
            "+----------------------------------------------------------------------------------------------------------------------------------------------------------------+\n",
            "\n"
          ]
        }
      ],
      "source": [
        "res.select(\"mappings.result\").show(truncate=False)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "fqeUMOeDuVUx",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "fqeUMOeDuVUx",
        "outputId": "4e28d465-83cc-4f85-fb65-3f7282097ad3"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[Row(mappings=[Row(annotatorType='labeled_dependency', begin=0, end=11, result='AWA Group LP', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'name', 'entity': 'AWA Group LP', 'relation': 'name'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='INVESTMENT ADVICE [6282]', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'sic', 'entity': 'AWA Group LP', 'relation': 'sic'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='6282', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '0', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'sic_code', 'entity': 'AWA Group LP', 'relation': 'sic_code'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='371785232', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '0', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'irs_number', 'entity': 'AWA Group LP', 'relation': 'irs_number'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='630', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '1231:::0', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'fiscal_year_end', 'entity': 'AWA Group LP', 'relation': 'fiscal_year_end'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='NC', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': 
'', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'state_location', 'entity': 'AWA Group LP', 'relation': 'state_location'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='DE', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'state_incorporation', 'entity': 'AWA Group LP', 'relation': 'state_incorporation'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='116 SOUTH FRANKLIN STREET', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'business_street', 'entity': 'AWA Group LP', 'relation': 'business_street'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='ROCKY MOUNT', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'business_city', 'entity': 'AWA Group LP', 'relation': 'business_city'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='NC', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'business_state', 'entity': 'AWA Group LP', 'relation': 'business_state'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='27804', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'business_zip', 'entity': 'AWA Group 
LP', 'relation': 'business_zip'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='952-446-6678', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'business_phone', 'entity': 'AWA Group LP', 'relation': 'business_phone'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'former_name', 'entity': 'AWA Group LP', 'relation': 'former_name'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'former_name_date', 'entity': 'AWA Group LP', 'relation': 'former_name_date'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='2017-01-23', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '2017-03-16:::2016-01-22:::2016-01-19:::2015-06-30:::2016-04-14:::2016-07-27:::2016-10-28:::2015-06-26:::2015-09-02:::2015-09-29:::2015-12-31', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'date', 'entity': 'AWA Group LP', 'relation': 'date'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='1645148', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'company_id', 'entity': 'AWA Group LP', 'relation': 'company_id'}, embeddings=[])])]"
            ]
          },
          "execution_count": 27,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "r = res.select(\"mappings\").collect()\n",
        "r"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "3ed118db-d108-4637-8d1f-67c7e5106458",
      "metadata": {
        "id": "3ed118db-d108-4637-8d1f-67c7e5106458"
      },
      "outputs": [],
      "source": [
        "# Collect each mapped relation (e.g. 'sic_code', 'business_city') into a flat dict\n",
        "json_dict = {}\n",
        "for n in r[0]['mappings']:\n",
        "    json_dict[n.metadata['relation']] = str(n.result)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "0bee659b-9446-4255-9c67-2a685de56c56",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "0bee659b-9446-4255-9c67-2a685de56c56",
        "outputId": "35dc13ce-f121-4fc6-b9b1-6899cecb191d"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{\n",
            "    \"business_city\": \"ROCKY MOUNT\",\n",
            "    \"business_phone\": \"952-446-6678\",\n",
            "    \"business_state\": \"NC\",\n",
            "    \"business_street\": \"116 SOUTH FRANKLIN STREET\",\n",
            "    \"business_zip\": \"27804\",\n",
            "    \"company_id\": \"1645148\",\n",
            "    \"date\": \"2017-01-23\",\n",
            "    \"fiscal_year_end\": \"630\",\n",
            "    \"former_name\": \"\",\n",
            "    \"former_name_date\": \"\",\n",
            "    \"irs_number\": \"371785232\",\n",
            "    \"name\": \"AWA Group LP\",\n",
            "    \"sic\": \"INVESTMENT ADVICE [6282]\",\n",
            "    \"sic_code\": \"6282\",\n",
            "    \"state_incorporation\": \"DE\",\n",
            "    \"state_location\": \"NC\"\n",
            "}\n"
          ]
        }
      ],
      "source": [
        "import json\n",
        "\n",
        "# Pretty-print the company record with keys in alphabetical order\n",
        "print(json.dumps(json_dict, indent=4, sort_keys=True))"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "gpuClass": "standard",
    "kernelspec": {
      "display_name": "tf-gpu",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]"
    },
    "vscode": {
      "interpreter": {
        "hash": "3f47d918ae832c68584484921185f5c85a1760864bf927a683dc6fb56366cc77"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}