{ "cells": [ { "cell_type": "markdown", "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea", "metadata": { "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea" }, "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ] }, { "cell_type": "markdown", "id": "21e9eafb", "metadata": { "id": "21e9eafb" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/10.1.Chunk_Mappers_Training.ipynb)" ] }, { "cell_type": "markdown", "id": "gk3kZHmNj51v", "metadata": { "collapsed": false, "id": "gk3kZHmNj51v" }, "source": [ "# Installation" ] }, { "cell_type": "code", "execution_count": null, "id": "_914itZsj51v", "metadata": { "id": "_914itZsj51v", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "! pip install -q johnsnowlabs" ] }, { "cell_type": "markdown", "id": "YPsbAnNoPt0Z", "metadata": { "id": "YPsbAnNoPt0Z" }, "source": [ "## Automatic Installation\n", "Using my.johnsnowlabs.com SSO" ] }, { "cell_type": "code", "execution_count": null, "id": "fY0lcShkj51w", "metadata": { "id": "fY0lcShkj51w", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "from johnsnowlabs import nlp, legal\n", "\n", "# nlp.install(force_browser=True)" ] }, { "cell_type": "markdown", "id": "hsJvn_WWM2GL", "metadata": { "id": "hsJvn_WWM2GL" }, "source": [ "## Manual downloading\n", "If you are not registered at my.johnsnowlabs.com, received your license via email, or are using Safari, you may need to update the license manually.\n", "\n", "- Go to my.johnsnowlabs.com\n", "- Download your license\n", "- Upload it using the following command" ] }, { "cell_type": "code", "execution_count": null, "id": "i57QV3-_P2sQ", "metadata": { "id": "i57QV3-_P2sQ" }, "outputs": [], "source": [ "from google.colab import files\n", "print('Please Upload your John Snow Labs License using the button below')\n", "license_keys = 
files.upload()" ] }, { "cell_type": "markdown", "id": "xGgNdFzZP_hQ", "metadata": { "id": "xGgNdFzZP_hQ" }, "source": [ "- Install it" ] }, { "cell_type": "code", "execution_count": null, "id": "OfmmPqknP4rR", "metadata": { "id": "OfmmPqknP4rR" }, "outputs": [], "source": [ "nlp.install()" ] }, { "cell_type": "markdown", "id": "DCl5ErZkNNLk", "metadata": { "id": "DCl5ErZkNNLk" }, "source": [ "# Starting" ] }, { "cell_type": "code", "execution_count": null, "id": "wRXTnNl3j51w", "metadata": { "id": "wRXTnNl3j51w" }, "outputs": [], "source": [ "spark = nlp.start()" ] }, { "cell_type": "markdown", "id": "cfbbcfc0-e0b7-4c25-8bd7-c64d90f836d1", "metadata": { "id": "cfbbcfc0-e0b7-4c25-8bd7-c64d90f836d1" }, "source": [ "# Legal Data Augmentation with Chunk Mappers" ] }, { "cell_type": "markdown", "id": "d2cd4221-fbca-4ca1-86a9-65e6264c4ad1", "metadata": { "id": "d2cd4221-fbca-4ca1-86a9-65e6264c4ad1" }, "source": [ "# About Data Augmentation" ] }, { "cell_type": "markdown", "id": "bf9835fd-9def-44e4-b022-e8db0f045fec", "metadata": { "id": "bf9835fd-9def-44e4-b022-e8db0f045fec" }, "source": [ "__Data Augmentation__ is the process of enriching an extracted data point with information from external sources. \n", "\n", "For example, suppose we work with a document that mentions the company _Amazon_. The document could be about stock prices, litigation, or a commercial agreement with a provider, among other topics.\n", "\n", "We can extract the company name from the document using NER, labeled as an Organization, but that is all the information the document gives us about the company.\n", "\n", "With __Data Augmentation__, we can use external sources, such as _SEC Edgar, Crunchbase, Nasdaq_ or even _Wikipedia_, to enrich the company with much more information, allowing us to make better decisions.\n", "\n", "Let's see how to do it." 
] }, { "cell_type": "markdown", "id": "UP1E0vZOpZ3h", "metadata": { "id": "UP1E0vZOpZ3h" }, "source": [ "# Train Your Own ChunkMapper Model" ] }, { "cell_type": "markdown", "id": "34f517bb-adde-4daa-b12d-921b37dd6d38", "metadata": { "id": "34f517bb-adde-4daa-b12d-921b37dd6d38" }, "source": [ "Here, we will train a ChunkMapper model with 1000 samples." ] }, { "cell_type": "code", "execution_count": null, "id": "56mbt5eO397E", "metadata": { "id": "56mbt5eO397E" }, "outputs": [], "source": [ "! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/sample_openedgar.json" ] }, { "cell_type": "code", "execution_count": null, "id": "8be8c43c-ebf2-4d1e-b98b-31c0fe68cabc", "metadata": { "id": "8be8c43c-ebf2-4d1e-b98b-31c0fe68cabc" }, "outputs": [], "source": [ "import json\n", "with open('sample_openedgar.json', 'r') as f:\n", " company_json = json.load(f)" ] }, { "cell_type": "code", "execution_count": null, "id": "be07f846-109f-4834-89ff-483bd00c5ab5", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "be07f846-109f-4834-89ff-483bd00c5ab5", "outputId": "24b4056f-31f7-4159-f848-ff840f6bca6b" }, "outputs": [ { "data": { "text/plain": [ "{'key': 'AWA Group LP',\n", " 'relations': [{'key': 'name', 'values': ['AWA Group LP']},\n", " {'key': 'sic', 'values': ['INVESTMENT ADVICE [6282]']},\n", " {'key': 'sic_code', 'values': [6282, 0]},\n", " {'key': 'irs_number', 'values': [371785232, 0]},\n", " {'key': 'fiscal_year_end', 'values': [630, 1231, 0]},\n", " {'key': 'state_location', 'values': ['NC']},\n", " {'key': 'state_incorporation', 'values': ['DE']},\n", " {'key': 'business_street', 'values': ['116 SOUTH FRANKLIN STREET']},\n", " {'key': 'business_city', 'values': ['ROCKY MOUNT']},\n", " {'key': 'business_state', 'values': ['NC']},\n", " {'key': 'business_zip', 'values': ['27804']},\n", " {'key': 'business_phone', 'values': ['952-446-6678']},\n", " {'key': 'former_name', 'values': ['']},\n", " {'key': 
'former_name_date', 'values': ['']},\n", " {'key': 'date',\n", " 'values': ['2017-01-23',\n", " '2017-03-16',\n", " '2016-01-22',\n", " '2016-01-19',\n", " '2015-06-30',\n", " '2016-04-14',\n", " '2016-07-27',\n", " '2016-10-28',\n", " '2015-06-26',\n", " '2015-09-02',\n", " '2015-09-29',\n", " '2015-12-31']},\n", " {'key': 'company_id', 'values': [1645148]}]}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "company_json['mappings'][8]" ] }, { "cell_type": "markdown", "id": "jiN1I0L_vqPK", "metadata": { "id": "jiN1I0L_vqPK" }, "source": [ "### Check a sample company" ] }, { "cell_type": "code", "execution_count": null, "id": "80cd73e4-288c-44fc-8ff3-fdded39ba25a", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "80cd73e4-288c-44fc-8ff3-fdded39ba25a", "outputId": "4c21c161-6611-4209-b1c9-4a3e939d2e94" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'key': 'Rayton Solar Inc.', 'relations': [{'key': 'name', 'values': ['Rayton Solar Inc.']}, {'key': 'sic', 'values': ['SEMICONDUCTORS & RELATED DEVICES [3674]']}, {'key': 'sic_code', 'values': [3674]}, {'key': 'irs_number', 'values': [0]}, {'key': 'fiscal_year_end', 'values': [1231]}, {'key': 'state_location', 'values': ['CA']}, {'key': 'state_incorporation', 'values': ['DE']}, {'key': 'business_street', 'values': ['920 COLORADO AVE.']}, {'key': 'business_city', 'values': ['SANTA MONICA']}, {'key': 'business_state', 'values': ['CA']}, {'key': 'business_zip', 'values': ['90401']}, {'key': 'business_phone', 'values': ['(661) 259-4786']}, {'key': 'former_name', 'values': ['']}, {'key': 'former_name_date', 'values': ['']}, {'key': 'date', 'values': ['2017-01-10', '2017-01-20', '2017-01-06', '2017-05-15', '2017-09-28', '2016-11-29', '2016-12-20', '2016-12-22', '2022-09-21', '2019-06-27', '2018-03-22', '2018-04-30', '2018-12-10', '2021-09-22', '2020-06-08', '2020-09-28']}, {'key': 'company_id', 'values': [1654124]}]}\n" ] } ], 
"source": [ "for x in company_json['mappings']:\n", " if 'Rayton Solar Inc.' in x['key']:\n", " print(x)" ] }, { "cell_type": "markdown", "id": "loOpl4LqvvPi", "metadata": { "id": "loOpl4LqvvPi" }, "source": [ "### Check all keys" ] }, { "cell_type": "code", "execution_count": null, "id": "e9fe7709-868d-453f-a162-7fa737f50989", "metadata": { "id": "e9fe7709-868d-453f-a162-7fa737f50989" }, "outputs": [], "source": [ "all_rels = [x['key'] for x in company_json['mappings'][0]['relations']]" ] }, { "cell_type": "code", "execution_count": null, "id": "c47604d3-9028-41a1-a0be-a669d105beb3", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "c47604d3-9028-41a1-a0be-a669d105beb3", "outputId": "33e97e60-62fe-46fc-c1a9-021125e6ddc3" }, "outputs": [ { "data": { "text/plain": [ "['name',\n", " 'sic',\n", " 'sic_code',\n", " 'irs_number',\n", " 'fiscal_year_end',\n", " 'state_location',\n", " 'state_incorporation',\n", " 'business_street',\n", " 'business_city',\n", " 'business_state',\n", " 'business_zip',\n", " 'business_phone',\n", " 'former_name',\n", " 'former_name_date',\n", " 'date',\n", " 'company_id']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_rels" ] }, { "cell_type": "markdown", "id": "7opshMS1vx1H", "metadata": { "id": "7opshMS1vx1H" }, "source": [ "### Create ChunkMapperApproach" ] }, { "cell_type": "code", "execution_count": null, "id": "b5d078b5-bd12-4d4a-ade7-3200a278e061", "metadata": { "id": "b5d078b5-bd12-4d4a-ade7-3200a278e061", "tags": [] }, "outputs": [], "source": [ "chunkerMapper = legal.ChunkMapperApproach()\\\n", " .setInputCols([\"ner_chunk\"])\\\n", " .setOutputCol(\"mappings\")\\\n", " .setDictionary(\"sample_openedgar.json\")\\\n", " .setRels(all_rels)" ] }, { "cell_type": "code", "execution_count": null, "id": "940ca6b5-a603-4060-a112-d1f407834d4f", "metadata": { "id": "940ca6b5-a603-4060-a112-d1f407834d4f" }, "outputs": [], "source": [ "empty_dataset = 
spark.createDataFrame([[\"\"]]).toDF(\"text\")" ] }, { "cell_type": "code", "execution_count": null, "id": "6bd7485a-6f3c-442a-8d74-a07ae0385d56", "metadata": { "id": "6bd7485a-6f3c-442a-8d74-a07ae0385d56" }, "outputs": [], "source": [ "fit_CM = chunkerMapper.fit(empty_dataset)" ] }, { "cell_type": "code", "execution_count": null, "id": "d3336018-e50c-4cb9-a07f-755bae236d80", "metadata": { "id": "d3336018-e50c-4cb9-a07f-755bae236d80" }, "outputs": [], "source": [ "# Save model\n", "fit_CM.write().overwrite().save('openedgar_2000_2022_company_mapper')" ] }, { "cell_type": "markdown", "id": "Cg0oUxTbv3oE", "metadata": { "id": "Cg0oUxTbv3oE" }, "source": [ "### Let's test our ChunkMapper model" ] }, { "cell_type": "code", "execution_count": null, "id": "DlxWvIcTsqTd", "metadata": { "id": "DlxWvIcTsqTd" }, "outputs": [], "source": [ "text = [\"\"\" AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. \"\"\"]" ] }, { "cell_type": "markdown", "id": "R1caDZbF14eq", "metadata": { "id": "R1caDZbF14eq" }, "source": [ "We get the company name from the sample text" ] }, { "cell_type": "code", "execution_count": null, "id": "K0_gi-xF0B56", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "K0_gi-xF0B56", "outputId": "5c8d6ecb-4d8e-4f41-985f-235e88947191" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sentence_detector_dl download started this may take some time.\n", "Approximate size to download 514.9 KB\n", "[OK!]\n", "bert_embeddings_sec_bert_base download started this may take some time.\n", "Approximate size to download 390.4 MB\n", "[OK!]\n", "legner_org_per_role_date download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "documentAssembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", " \n", "sentenceDetector = nlp.SentenceDetectorDLModel.pretrained(\"sentence_detector_dl\",\"xx\")\\\n", " 
.setInputCols([\"document\"])\\\n", " .setOutputCol(\"sentence\")\n", "\n", "tokenizer = nlp.Tokenizer()\\\n", " .setInputCols([\"sentence\"])\\\n", " .setOutputCol(\"token\")\n", "\n", "embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\") \\\n", " .setInputCols([\"sentence\", \"token\"]) \\\n", " .setOutputCol(\"embeddings\")\n", "\n", "ner_model = legal.NerModel.pretrained(\"legner_org_per_role_date\", \"en\", \"legal/models\")\\\n", " .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n", " .setOutputCol(\"ner\")\n", "\n", "ner_converter = nlp.NerConverter()\\\n", " .setInputCols([\"sentence\",\"token\",\"ner\"])\\\n", " .setOutputCol(\"ner_chunk\")\\\n", " .setWhiteList([\"ORG\"]) # Return only ORG entities\n", "\n", "nlpPipeline = nlp.Pipeline(stages=[\n", " documentAssembler,\n", " sentenceDetector,\n", " tokenizer,\n", " embeddings,\n", " ner_model,\n", " ner_converter])\n", "\n", "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(empty_data)\n", "\n", "light_model = nlp.LightPipeline(model)" ] }, { "cell_type": "code", "execution_count": null, "id": "_E5Szc8ast4n", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_E5Szc8ast4n", "outputId": "5abee18b-5783-4ed4-db2a-3383db71c71f" }, "outputs": [ { "data": { "text/plain": [ "[{'document': [Annotation(document, 0, 129, AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. 
, {})],\n", " 'ner_chunk': [Annotation(chunk, 1, 12, AWA Group LP, {'entity': 'ORG', 'sentence': '0', 'chunk': '0', 'confidence': '0.9788'})],\n", " 'token': [Annotation(token, 1, 3, AWA, {'sentence': '0'}),\n", " Annotation(token, 5, 9, Group, {'sentence': '0'}),\n", " Annotation(token, 11, 12, LP, {'sentence': '0'}),\n", " Annotation(token, 14, 20, intends, {'sentence': '0'}),\n", " Annotation(token, 22, 23, to, {'sentence': '0'}),\n", " Annotation(token, 25, 27, pay, {'sentence': '0'}),\n", " Annotation(token, 29, 37, dividends, {'sentence': '0'}),\n", " Annotation(token, 39, 40, on, {'sentence': '0'}),\n", " Annotation(token, 42, 44, the, {'sentence': '0'}),\n", " Annotation(token, 46, 51, Common, {'sentence': '0'}),\n", " Annotation(token, 53, 57, Units, {'sentence': '0'}),\n", " Annotation(token, 59, 60, on, {'sentence': '0'}),\n", " Annotation(token, 62, 62, a, {'sentence': '0'}),\n", " Annotation(token, 64, 72, quarterly, {'sentence': '0'}),\n", " Annotation(token, 74, 78, basis, {'sentence': '0'}),\n", " Annotation(token, 80, 81, at, {'sentence': '0'}),\n", " Annotation(token, 83, 84, an, {'sentence': '0'}),\n", " Annotation(token, 86, 91, annual, {'sentence': '0'}),\n", " Annotation(token, 93, 96, rate, {'sentence': '0'}),\n", " Annotation(token, 98, 99, of, {'sentence': '0'}),\n", " Annotation(token, 101, 105, 8.00%, {'sentence': '0'}),\n", " Annotation(token, 107, 108, of, {'sentence': '0'}),\n", " Annotation(token, 110, 112, the, {'sentence': '0'}),\n", " Annotation(token, 114, 121, Offering, {'sentence': '0'}),\n", " Annotation(token, 123, 127, Price, {'sentence': '0'}),\n", " Annotation(token, 128, 128, ., {'sentence': '0'})],\n", " 'ner': [Annotation(named_entity, 1, 3, B-ORG, {'word': 'AWA', 'confidence': '1.0', 'sentence': '0'}),\n", " Annotation(named_entity, 5, 9, I-ORG, {'word': 'Group', 'confidence': '0.9371', 'sentence': '0'}),\n", " Annotation(named_entity, 11, 12, I-ORG, {'word': 'LP', 'confidence': '0.9993', 'sentence': '0'}),\n", " 
Annotation(named_entity, 14, 20, O, {'word': 'intends', 'confidence': '0.9983', 'sentence': '0'}),\n", " Annotation(named_entity, 22, 23, O, {'word': 'to', 'confidence': '1.0', 'sentence': '0'}),\n", " Annotation(named_entity, 25, 27, O, {'word': 'pay', 'confidence': '0.9992', 'sentence': '0'}),\n", " Annotation(named_entity, 29, 37, O, {'word': 'dividends', 'confidence': '0.9991', 'sentence': '0'}),\n", " Annotation(named_entity, 39, 40, O, {'word': 'on', 'confidence': '0.999', 'sentence': '0'}),\n", " Annotation(named_entity, 42, 44, O, {'word': 'the', 'confidence': '0.9993', 'sentence': '0'}),\n", " Annotation(named_entity, 46, 51, O, {'word': 'Common', 'confidence': '0.9864', 'sentence': '0'}),\n", " Annotation(named_entity, 53, 57, O, {'word': 'Units', 'confidence': '0.961', 'sentence': '0'}),\n", " Annotation(named_entity, 59, 60, O, {'word': 'on', 'confidence': '1.0', 'sentence': '0'}),\n", " Annotation(named_entity, 62, 62, O, {'word': 'a', 'confidence': '1.0', 'sentence': '0'}),\n", " Annotation(named_entity, 64, 72, O, {'word': 'quarterly', 'confidence': '1.0', 'sentence': '0'}),\n", " Annotation(named_entity, 74, 78, O, {'word': 'basis', 'confidence': '1.0', 'sentence': '0'}),\n", " Annotation(named_entity, 80, 81, O, {'word': 'at', 'confidence': '1.0', 'sentence': '0'}),\n", " Annotation(named_entity, 83, 84, O, {'word': 'an', 'confidence': '1.0', 'sentence': '0'}),\n", " Annotation(named_entity, 86, 91, O, {'word': 'annual', 'confidence': '1.0', 'sentence': '0'}),\n", " Annotation(named_entity, 93, 96, O, {'word': 'rate', 'confidence': '0.9995', 'sentence': '0'}),\n", " Annotation(named_entity, 98, 99, O, {'word': 'of', 'confidence': '0.9988', 'sentence': '0'}),\n", " Annotation(named_entity, 101, 105, O, {'word': '8.00%', 'confidence': '0.998', 'sentence': '0'}),\n", " Annotation(named_entity, 107, 108, O, {'word': 'of', 'confidence': '0.9996', 'sentence': '0'}),\n", " Annotation(named_entity, 110, 112, O, {'word': 'the', 'confidence': '0.9999', 
'sentence': '0'}),\n", " Annotation(named_entity, 114, 121, O, {'word': 'Offering', 'confidence': '0.9987', 'sentence': '0'}),\n", " Annotation(named_entity, 123, 127, O, {'word': 'Price', 'confidence': '0.9873', 'sentence': '0'}),\n", " Annotation(named_entity, 128, 128, O, {'word': '.', 'confidence': '0.9999', 'sentence': '0'})],\n", " 'embeddings': [Annotation(word_embeddings, 1, 3, AWA, {'isOOV': 'false', 'pieceId': '101', 'isWordStart': 'true', 'token': 'AWA', 'sentence': '0'}),\n", " Annotation(word_embeddings, 5, 9, Group, {'isOOV': 'false', 'pieceId': '101', 'isWordStart': 'true', 'token': 'Group', 'sentence': '0'}),\n", " Annotation(word_embeddings, 11, 12, LP, {'isOOV': 'false', 'pieceId': '101', 'isWordStart': 'true', 'token': 'LP', 'sentence': '0'}),\n", " Annotation(word_embeddings, 14, 20, intends, {'isOOV': 'false', 'pieceId': '4255', 'isWordStart': 'true', 'token': 'intends', 'sentence': '0'}),\n", " Annotation(word_embeddings, 22, 23, to, {'isOOV': 'false', 'pieceId': '631', 'isWordStart': 'true', 'token': 'to', 'sentence': '0'}),\n", " Annotation(word_embeddings, 25, 27, pay, {'isOOV': 'false', 'pieceId': '936', 'isWordStart': 'true', 'token': 'pay', 'sentence': '0'}),\n", " Annotation(word_embeddings, 29, 37, dividends, {'isOOV': 'false', 'pieceId': '1919', 'isWordStart': 'true', 'token': 'dividends', 'sentence': '0'}),\n", " Annotation(word_embeddings, 39, 40, on, {'isOOV': 'false', 'pieceId': '666', 'isWordStart': 'true', 'token': 'on', 'sentence': '0'}),\n", " Annotation(word_embeddings, 42, 44, the, {'isOOV': 'false', 'pieceId': '612', 'isWordStart': 'true', 'token': 'the', 'sentence': '0'}),\n", " Annotation(word_embeddings, 46, 51, Common, {'isOOV': 'false', 'pieceId': '101', 'isWordStart': 'true', 'token': 'Common', 'sentence': '0'}),\n", " Annotation(word_embeddings, 53, 57, Units, {'isOOV': 'false', 'pieceId': '101', 'isWordStart': 'true', 'token': 'Units', 'sentence': '0'}),\n", " Annotation(word_embeddings, 59, 60, on, {'isOOV': 
'false', 'pieceId': '666', 'isWordStart': 'true', 'token': 'on', 'sentence': '0'}),\n", " Annotation(word_embeddings, 62, 62, a, {'isOOV': 'false', 'pieceId': '143', 'isWordStart': 'true', 'token': 'a', 'sentence': '0'}),\n", " Annotation(word_embeddings, 64, 72, quarterly, {'isOOV': 'false', 'pieceId': '2181', 'isWordStart': 'true', 'token': 'quarterly', 'sentence': '0'}),\n", " Annotation(word_embeddings, 74, 78, basis, {'isOOV': 'false', 'pieceId': '1277', 'isWordStart': 'true', 'token': 'basis', 'sentence': '0'}),\n", " Annotation(word_embeddings, 80, 81, at, {'isOOV': 'false', 'pieceId': '746', 'isWordStart': 'true', 'token': 'at', 'sentence': '0'}),\n", " Annotation(word_embeddings, 83, 84, an, {'isOOV': 'false', 'pieceId': '620', 'isWordStart': 'true', 'token': 'an', 'sentence': '0'}),\n", " Annotation(word_embeddings, 86, 91, annual, {'isOOV': 'false', 'pieceId': '1207', 'isWordStart': 'true', 'token': 'annual', 'sentence': '0'}),\n", " Annotation(word_embeddings, 93, 96, rate, {'isOOV': 'false', 'pieceId': '1072', 'isWordStart': 'true', 'token': 'rate', 'sentence': '0'}),\n", " Annotation(word_embeddings, 98, 99, of, {'isOOV': 'false', 'pieceId': '619', 'isWordStart': 'true', 'token': 'of', 'sentence': '0'}),\n", " Annotation(word_embeddings, 101, 105, 8.00%, {'isOOV': 'false', 'pieceId': '128', 'isWordStart': 'true', 'token': '8.00%', 'sentence': '0'}),\n", " Annotation(word_embeddings, 107, 108, of, {'isOOV': 'false', 'pieceId': '619', 'isWordStart': 'true', 'token': 'of', 'sentence': '0'}),\n", " Annotation(word_embeddings, 110, 112, the, {'isOOV': 'false', 'pieceId': '612', 'isWordStart': 'true', 'token': 'the', 'sentence': '0'}),\n", " Annotation(word_embeddings, 114, 121, Offering, {'isOOV': 'false', 'pieceId': '101', 'isWordStart': 'true', 'token': 'Offering', 'sentence': '0'}),\n", " Annotation(word_embeddings, 123, 127, Price, {'isOOV': 'false', 'pieceId': '101', 'isWordStart': 'true', 'token': 'Price', 'sentence': '0'}),\n", " 
Annotation(word_embeddings, 128, 128, ., {'isOOV': 'false', 'pieceId': '118', 'isWordStart': 'true', 'token': '.', 'sentence': '0'})],\n", " 'sentence': [Annotation(document, 1, 128, AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price., {'sentence': '0'})]}]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We get company name from sample text\n", "\n", "ner_result = light_model.fullAnnotate(text)\n", "\n", "ner_result" ] }, { "cell_type": "code", "execution_count": null, "id": "uGcQcPIntV5e", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "id": "uGcQcPIntV5e", "outputId": "dd2383a3-8dae-4ccf-88a1-b50829163b4b" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'AWA Group LP'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ORG = ner_result[0][\"ner_chunk\"][0].result\n", "\n", "ORG" ] }, { "cell_type": "code", "execution_count": null, "id": "plvb6cMVCe0O", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "plvb6cMVCe0O", "outputId": "5d1ccd9f-eec0-4ef6-d6e5-c55c56e74a3f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tfhub_use download started this may take some time.\n", "Approximate size to download 923.7 MB\n", "[OK!]\n", "legel_edgar_company_name download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "embeddings = nlp.UniversalSentenceEncoder.pretrained(\"tfhub_use\", \"en\") \\\n", " .setInputCols(\"document\") \\\n", " .setOutputCol(\"sentence_embeddings\")\n", " \n", "resolver = legal.SentenceEntityResolverModel.pretrained(\"legel_edgar_company_name\", \"en\", \"legal/models\")\\\n", " .setInputCols([\"sentence_embeddings\"]) \\\n", " .setOutputCol(\"resolution\")\\\n", " .setDistanceFunction(\"EUCLIDEAN\")\n", "\n", 
"pipelineModel = nlp.PipelineModel(\n", " stages = [\n", " documentAssembler,\n", " embeddings,\n", " resolver])\n", "\n", "lp_res = nlp.LightPipeline(pipelineModel)" ] }, { "cell_type": "code", "execution_count": null, "id": "vFFo5P9Nttss", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vFFo5P9Nttss", "outputId": "7ccc6b6a-6b76-47b3-db55-befc3bbaeaed" }, "outputs": [ { "data": { "text/plain": [ "{'document': ['AWA Group LP'],\n", " 'sentence_embeddings': ['AWA Group LP'],\n", " 'resolution': ['AWA Group LP']}" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We normalize company name\n", "\n", "el_res = lp_res.annotate(ORG)\n", "\n", "el_res" ] }, { "cell_type": "code", "execution_count": null, "id": "EbCbbGZ9ttvP", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "id": "EbCbbGZ9ttvP", "outputId": "2289cc59-0cd6-4ba5-f322-f6309e74d6c2" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'AWA Group LP'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "NORM_ORG = el_res[\"resolution\"][0]\n", "\n", "NORM_ORG" ] }, { "cell_type": "markdown", "id": "z-2Q7nnywUaH", "metadata": { "id": "z-2Q7nnywUaH" }, "source": [ "### Let's load our ChunkMapper model" ] }, { "cell_type": "code", "execution_count": null, "id": "Vx3_V33_ttyC", "metadata": { "id": "Vx3_V33_ttyC" }, "outputs": [], "source": [ "documentAssembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", "\n", "chunkAssembler = nlp.Doc2Chunk() \\\n", " .setInputCols(\"document\") \\\n", " .setOutputCol(\"chunk\") \\\n", " .setIsArray(False)\n", "\n", "CM = legal.ChunkMapperModel().load(\"openedgar_2000_2022_company_mapper\")\\\n", " .setInputCols([\"chunk\"])\\\n", " .setOutputCol(\"mappings\")\n", "\n", "cm_pipeline = 
nlp.Pipeline(stages=[documentAssembler, \n", " chunkAssembler, \n", " CM])\n", "\n", "fit_cm_pipeline = cm_pipeline.fit(empty_data)" ] }, { "cell_type": "code", "execution_count": null, "id": "w5I4muUkvJG7", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "w5I4muUkvJG7", "outputId": "8c7c2e2e-2f77-4ba2-9ece-ccb0420d4f06" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------+\n", "| text|\n", "+------------+\n", "|AWA Group LP|\n", "+------------+\n", "\n" ] } ], "source": [ "# LightPipelines don't support Doc2Chunk, so we will use here usual transform\n", "\n", "df = spark.createDataFrame([[NORM_ORG]]).toDF(\"text\")\n", "\n", "df.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "zS0Q-zamvOsT", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "zS0Q-zamvOsT", "outputId": "437b4354-127d-473d-a5c0-a19acf3c525a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------+--------------------+--------------------+--------------------+\n", "| text| document| chunk| mappings|\n", "+------------+--------------------+--------------------+--------------------+\n", "|AWA Group LP|[{document, 0, 11...|[{chunk, 0, 11, A...|[{labeled_depende...|\n", "+------------+--------------------+--------------------+--------------------+\n", "\n" ] } ], "source": [ "res = fit_cm_pipeline.transform(df)\n", "\n", "res.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "e7b2c22c-49f6-4277-8636-62b876b0bf08", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "e7b2c22c-49f6-4277-8636-62b876b0bf08", "outputId": "7142e762-c238-437f-896f-b5957107f031", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", "|result |\n", 
"+----------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", "|[AWA Group LP, INVESTMENT ADVICE [6282], 6282, 371785232, 630, NC, DE, 116 SOUTH FRANKLIN STREET, ROCKY MOUNT, NC, 27804, 952-446-6678, , , 2017-01-23, 1645148]|\n", "+----------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", "\n" ] } ], "source": [ "res.select(\"mappings.result\").show(truncate=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "fqeUMOeDuVUx", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "fqeUMOeDuVUx", "outputId": "4e28d465-83cc-4f85-fb65-3f7282097ad3" }, "outputs": [ { "data": { "text/plain": [ "[Row(mappings=[Row(annotatorType='labeled_dependency', begin=0, end=11, result='AWA Group LP', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'name', 'entity': 'AWA Group LP', 'relation': 'name'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='INVESTMENT ADVICE [6282]', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'sic', 'entity': 'AWA Group LP', 'relation': 'sic'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='6282', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '0', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'sic_code', 'entity': 'AWA Group LP', 'relation': 'sic_code'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, 
result='371785232', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '0', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'irs_number', 'entity': 'AWA Group LP', 'relation': 'irs_number'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='630', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '1231:::0', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'fiscal_year_end', 'entity': 'AWA Group LP', 'relation': 'fiscal_year_end'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='NC', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'state_location', 'entity': 'AWA Group LP', 'relation': 'state_location'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='DE', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'state_incorporation', 'entity': 'AWA Group LP', 'relation': 'state_incorporation'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='116 SOUTH FRANKLIN STREET', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'business_street', 'entity': 'AWA Group LP', 'relation': 'business_street'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='ROCKY MOUNT', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', 
'__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'business_city', 'entity': 'AWA Group LP', 'relation': 'business_city'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='NC', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'business_state', 'entity': 'AWA Group LP', 'relation': 'business_state'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='27804', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'business_zip', 'entity': 'AWA Group LP', 'relation': 'business_zip'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='952-446-6678', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'business_phone', 'entity': 'AWA Group LP', 'relation': 'business_phone'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'former_name', 'entity': 'AWA Group LP', 'relation': 'former_name'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'former_name_date', 'entity': 'AWA Group LP', 'relation': 'former_name_date'}, embeddings=[]), 
Row(annotatorType='labeled_dependency', begin=0, end=11, result='2017-01-23', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '2017-03-16:::2016-01-22:::2016-01-19:::2015-06-30:::2016-04-14:::2016-07-27:::2016-10-28:::2015-06-26:::2015-09-02:::2015-09-29:::2015-12-31', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'date', 'entity': 'AWA Group LP', 'relation': 'date'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=11, result='1645148', metadata={'sentence': '0', 'ops': '0.0', 'distance': '-2.220446049250313E-16', 'all_relations': '', 'chunk': '0', '__trained__': 'AWA Group LP', '__distance_function__': 'cosine', '__relation_name__': 'company_id', 'entity': 'AWA Group LP', 'relation': 'company_id'}, embeddings=[])])]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r = res.select(\"mappings\").collect()\n", "r" ] }, { "cell_type": "code", "execution_count": null, "id": "3ed118db-d108-4637-8d1f-67c7e5106458", "metadata": { "id": "3ed118db-d108-4637-8d1f-67c7e5106458" }, "outputs": [], "source": [ "json_dict = dict()\n", "for n in r[0]['mappings']:\n", " json_dict[n.metadata['relation']] = str(n.result)" ] }, { "cell_type": "code", "execution_count": null, "id": "0bee659b-9446-4255-9c67-2a685de56c56", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0bee659b-9446-4255-9c67-2a685de56c56", "outputId": "35dc13ce-f121-4fc6-b9b1-6899cecb191d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"business_city\": \"ROCKY MOUNT\",\n", " \"business_phone\": \"952-446-6678\",\n", " \"business_state\": \"NC\",\n", " \"business_street\": \"116 SOUTH FRANKLIN STREET\",\n", " \"business_zip\": \"27804\",\n", " \"company_id\": \"1645148\",\n", " \"date\": \"2017-01-23\",\n", " \"fiscal_year_end\": \"630\",\n", " \"former_name\": \"\",\n", " \"former_name_date\": 
\"\",\n", " \"irs_number\": \"371785232\",\n", " \"name\": \"AWA Group LP\",\n", " \"sic\": \"INVESTMENT ADVICE [6282]\",\n", " \"sic_code\": \"6282\",\n", " \"state_incorporation\": \"DE\",\n", " \"state_location\": \"NC\"\n", "}\n" ] } ], "source": [ "import json\n", "print(json.dumps(json_dict, indent=4, sort_keys=True))" ] } ], "metadata": { "colab": { "provenance": [] }, "gpuClass": "standard", "kernelspec": { "display_name": "tf-gpu", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]" }, "vscode": { "interpreter": { "hash": "3f47d918ae832c68584484921185f5c85a1760864bf927a683dc6fb56366cc77" } } }, "nbformat": 4, "nbformat_minor": 5 }