{ "cells": [ { "cell_type": "markdown", "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea", "metadata": { "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea" }, "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ] }, { "cell_type": "markdown", "id": "21e9eafb", "metadata": { "id": "21e9eafb" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/10.0.Data_Augmentation_with_ChunkMappers.ipynb)" ] }, { "cell_type": "markdown", "id": "4iIO6G_B3pqq", "metadata": { "collapsed": false, "id": "4iIO6G_B3pqq" }, "source": [ "#🎬 Installation" ] }, { "cell_type": "code", "execution_count": null, "id": "hPwo4Czy3pqq", "metadata": { "id": "hPwo4Czy3pqq", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "! pip install -q johnsnowlabs" ] }, { "cell_type": "markdown", "id": "YPsbAnNoPt0Z", "metadata": { "id": "YPsbAnNoPt0Z" }, "source": [ "##🔗 Automatic Installation\n", "Using my.johnsnowlabs.com SSO" ] }, { "cell_type": "code", "execution_count": null, "id": "_L-7mLYp3pqr", "metadata": { "id": "_L-7mLYp3pqr", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "from johnsnowlabs import nlp, finance\n", "\n", "# nlp.install(force_browser=True)" ] }, { "cell_type": "markdown", "id": "hsJvn_WWM2GL", "metadata": { "id": "hsJvn_WWM2GL" }, "source": [ "##🔗 Manual downloading\n", "If you are not registered at my.johnsnowlabs.com, you received your license via e-mail, or you are using Safari, you may need to upload the license manually.\n", "\n", "- Go to my.johnsnowlabs.com\n", "- Download your license\n", "- Upload it using the following command" ] }, { "cell_type": "code", "execution_count": null, "id": "i57QV3-_P2sQ", "metadata": { "id": "i57QV3-_P2sQ" }, "outputs": [], "source": [ "from google.colab import files\n", "print('Please Upload your John Snow Labs License using the button below')\n", 
"license_keys = files.upload()" ] }, { "cell_type": "markdown", "id": "xGgNdFzZP_hQ", "metadata": { "id": "xGgNdFzZP_hQ" }, "source": [ "- Install it" ] }, { "cell_type": "code", "execution_count": null, "id": "OfmmPqknP4rR", "metadata": { "id": "OfmmPqknP4rR" }, "outputs": [], "source": [ "nlp.install()" ] }, { "cell_type": "markdown", "id": "DCl5ErZkNNLk", "metadata": { "id": "DCl5ErZkNNLk" }, "source": [ "#📌 Starting" ] }, { "cell_type": "code", "execution_count": null, "id": "x3jVICoa3pqr", "metadata": { "id": "x3jVICoa3pqr" }, "outputs": [], "source": [ "spark = nlp.start()" ] }, { "cell_type": "markdown", "id": "cfbbcfc0-e0b7-4c25-8bd7-c64d90f836d1", "metadata": { "id": "cfbbcfc0-e0b7-4c25-8bd7-c64d90f836d1" }, "source": [ "#🔎 Financial Data Augmentation with Chunk Mappers" ] }, { "cell_type": "markdown", "id": "d2cd4221-fbca-4ca1-86a9-65e6264c4ad1", "metadata": { "id": "d2cd4221-fbca-4ca1-86a9-65e6264c4ad1" }, "source": [ "#🚀 About Data Augmentation" ] }, { "cell_type": "markdown", "id": "bf9835fd-9def-44e4-b022-e8db0f045fec", "metadata": { "id": "bf9835fd-9def-44e4-b022-e8db0f045fec" }, "source": [ "__Data Augmentation__ is the process of enriching an extracted data point with information from external sources. \n", "\n", "For example, let's suppose we are working with a document that mentions the company _Amazon_. The document could be about stock prices, litigation, or a commercial agreement with a provider, among other things.\n", "\n", "In the document, we can use NER to extract the company name as an Organization, but that is all the information the document gives us about the company.\n", "\n", "With __Data Augmentation__, we can use external sources, such as _SEC Edgar, Crunchbase, Nasdaq_ or even _Wikipedia_, to enrich the company with much more information, allowing us to make better decisions.\n", "\n", "Let's see how to do it." 
] }, { "cell_type": "markdown", "id": "ZTo7fsIKgod-", "metadata": { "id": "ZTo7fsIKgod-" }, "source": [ "##📌 Sample Texts from Cadence Design Systems\n", "\n", "The examples are taken from publicly available information about Cadence in the SEC's Edgar database ([here](https://www.sec.gov/Archives/edgar/data/813672/000081367222000012/cdns-20220101.htm)) and on [Wikipedia](https://en.wikipedia.org/wiki/Cadence_Design_Systems)." ] }, { "cell_type": "code", "execution_count": null, "id": "Br_90qxMgn0m", "metadata": { "id": "Br_90qxMgn0m" }, "outputs": [], "source": [ "! wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt" ] }, { "cell_type": "code", "execution_count": null, "id": "RfPlg8I2gn6A", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "RfPlg8I2gn6A", "outputId": "61bb61d1-c713-4e84-c9d6-03b7fe30c2f6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Table of Contents\n", "UNITED STATES SECURITIES AND EXCHANGE COMMISSION\n", "Washington, D.C. 20549\n", "_____________________________________ \n", "FORM 10-K \n", "_____________________________________ \n", "(Mark One)\n", "☒\n", "ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n", "For the fiscal year en\n" ] } ], "source": [ "with open('cdns-20220101.html.txt', 'r') as f:\n", "    cadence_sec10k = f.read()\n", "print(cadence_sec10k[:300])" ] }, { "cell_type": "code", "execution_count": null, "id": "SaX_618cgn-d", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "SaX_618cgn-d", "outputId": "86286d28-2550-43ac-a46a-2e54415b7db8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "UNITED STATES SECURITIES AND EXCHANGE COMMISSION\n", "Washington, D.C. 
20549\n", "_____________________________________ \n", "FORM 10-K \n", "_____________________________________ \n", "(Mark One)\n", "☒\n", "ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n", "For the fiscal year ended January 1, 2022 \n", "OR\n", "☐\n", "TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n", "For the transition period from _________ to_________.\n", "\n", "Commission file number 000-15867 \n", "_____________________________________\n", " \n", "CADENCE DESIGN SYSTEMS, INC. \n", "(Exact name of registrant as specified in its charter)\n", "____________________________________ \n", "Delaware\n", " \n", "00-0000000\n", "(State or Other Jurisdiction ofIncorporation or Organization)\n", " \n", "(I.R.S. EmployerIdentification No.)\n", "2655 Seely Avenue, Building 5,\n", "San Jose,\n", "California\n", " \n", "95134\n", "(Address of Principal Executive Offices)\n", " \n", "(Zip Code)\n", "(408)\n", "-943-1234 \n", "(Registrant’s Telephone Number, including Area Code) \n", "Securities registered pursuant to Section 12(b) of the Act:\n", "Title of Each Class\n", "Trading Symbol(s)\n", "Names of Each Exchange on which Registered\n", "Common Stock, $0.01 par value per share\n", "CDNS\n", "Nasdaq Global Select Market\n", "Securities registered pursuant to Section 12(g) of the Act:\n", "None\n", "Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act. \n", " Yes \n", "☒\n", " No \n", "☐\n", "Indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act. 
\n", " Yes \n", "☐ \n", "No \n", "☒\n", "Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. \n", " Yes \n", "☒\n", " No \n", "☐\n", "Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§ 232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). \n", " Yes \n", "☒\n", " No \n", "☐\n", "Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of “large accelerated filer,” “accelerated filer,” “smaller reporting company,” and “emerging growth company” in Rule 12b-2 of the Exchange Act.\n", "Large Accelerated Filer\n", "☒\n", "Accelerated Filer\n", "☐\n", "Non-accelerated Filer\n", "☐\n", "Smaller Reporting Company\n", "☐\n", "Emerging Growth Company\n", "☐\n", "If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act. \n", "☐\n", "Indicate by check mark whether the registrant has filed a report on and attestation to its management’s assessment of the effectiveness of its internal control over financial reporting under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit report. 
\n", "☒\n", "Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Act). \n", " Yes \n", "☐ \n", "No \n", "☒\n", "The aggregate market value of the voting and non-voting common equity held by non-affiliates computed by reference to the price at which the common equity was last sold as of the last business day of the registrant’s most recently completed second fiscal quarter ended July 3, 2021 was approximately $38,179,000,000.\n", "On February 5, 2022, approximately 277,336,000 shares of the Registrant’s Common Stock, $0.01 par value, were outstanding.\n", "DOCUMENTS INCORPORATED BY REFERENCE\n", "Portions of the definitive proxy statement for Cadence Design Systems, Inc.’s 2022 Annual Meeting of Stockholders are incorporated by reference into Part III hereof.\n", "\n", "\n", "\n" ] } ], "source": [ "pages = [x for x in cadence_sec10k.split(\"Table of Contents\") if x.strip() != '']\n", "print(pages[0])" ] }, { "cell_type": "markdown", "id": "nLLpahwrg7uz", "metadata": { "id": "nLLpahwrg7uz" }, "source": [ "##📌 Step 1: Using Text Classification to find Relevant Parts of the Document: 10K Summary\n", "In this case, we know page 0 is always the page with summary information about the company. However, let's suppose we don't know it. 
We can use page classification to find it.\n", "\n", "To detect the SEC 10-K summary page, there is a dedicated model called `\"finclf_form_10k_summary_item\"`." ] }, { "cell_type": "code", "execution_count": null, "id": "0lJbpKJWgGg0", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0lJbpKJWgGg0", "outputId": "631dbcda-ef2d-418c-f43d-b549dbca5ce7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tfhub_use download started this may take some time.\n", "Approximate size to download 923.7 MB\n", "[OK!]\n", "finclf_form_10k_summary_item download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "# Text Classifier\n", "# This pipeline lets you plug in different classification models to decide whether an input text belongs to a specific class or to something else.\n", "\n", "document_assembler = nlp.DocumentAssembler() \\\n", "    .setInputCol(\"text\") \\\n", "    .setOutputCol(\"document\")\n", "\n", "use_embeddings = nlp.UniversalSentenceEncoder.pretrained()\\\n", "    .setInputCols(\"document\") \\\n", "    .setOutputCol(\"sentence_embeddings\")\n", "\n", "classifier = finance.ClassifierDLModel.pretrained(\"finclf_form_10k_summary_item\", \"en\", \"finance/models\")\\\n", "    .setInputCols([\"sentence_embeddings\"])\\\n", "    .setOutputCol(\"category\")\n", "\n", "nlpPipeline = nlp.Pipeline(stages=[\n", "    document_assembler, \n", "    use_embeddings,\n", "    classifier])\n", "\n", "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(empty_data)" ] }, { "cell_type": "code", "execution_count": null, "id": "vCmCEiMNgoBC", "metadata": { "id": "vCmCEiMNgoBC" }, "outputs": [], "source": [ "df = spark.createDataFrame([[pages[0]]]).toDF(\"text\")\n", "\n", "result = model.transform(df).cache()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "qp6M15QegoDi", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "qp6M15QegoDi", "outputId": "8b0b3637-93e4-4fd4-f858-0900bbeac96e" }, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------------+\n", "|            result|\n", "+------------------+\n", "|[form_10k_summary]|\n", "+------------------+\n", "\n" ] } ], "source": [ "result.select('category.result').show()" ] }, { "cell_type": "markdown", "id": "Ar1gryg4jpIc", "metadata": { "id": "Ar1gryg4jpIc" }, "source": [ "##📌 Step 2: Named Entity Recognition on 10K Summary\n", "Named Entity Recognition (NER) is the main component for information extraction: it detects and labels entities in text. \n", "\n", "This time we will use a model trained to extract many entity types from 10-K summaries." ] }, { "cell_type": "code", "execution_count": null, "id": "cb765952-24c2-48b6-8d86-5413b13bd9fa", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cb765952-24c2-48b6-8d86-5413b13bd9fa", "outputId": "74f1d58a-7f21-4d89-dbc4-de22bc47e818" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bert_embeddings_sec_bert_base download started this may take some time.\n", "Approximate size to download 390.4 MB\n", "[OK!]\n", "finner_sec_10k_summary download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "textSplitter = finance.TextSplitter()\\\n", "    .setInputCols([\"document\"])\\\n", "    .setOutputCol(\"sentence\")\n", "\n", "tokenizer = nlp.Tokenizer()\\\n", "    .setInputCols([\"sentence\"])\\\n", "    .setOutputCol(\"token\")\n", "\n", "embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\") \\\n", "    .setInputCols([\"sentence\", \"token\"]) \\\n", "    .setOutputCol(\"embeddings\")\n", "\n", "ner_model = finance.NerModel.pretrained(\"finner_sec_10k_summary\", \"en\", \"finance/models\")\\\n", "    .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n", "    .setOutputCol(\"ner\")\n", "\n", "ner_converter = nlp.NerConverter()\\\n", "    .setInputCols([\"sentence\",\"token\",\"ner\"])\\\n", "    .setOutputCol(\"ner_chunk\")\n", "\n", "nlpPipeline = nlp.Pipeline(stages=[\n", "    document_assembler,\n", " 
textSplitter,\n", " tokenizer,\n", " embeddings,\n", " ner_model,\n", " ner_converter,\n", "])\n", "\n", "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(empty_data)\n", "\n", "light_model = nlp.LightPipeline(model)" ] }, { "cell_type": "markdown", "id": "37eae9a4-52e1-400e-a1dd-effc6ed1da35", "metadata": { "id": "37eae9a4-52e1-400e-a1dd-effc6ed1da35" }, "source": [ "##✅ We use LightPipeline to get the result" ] }, { "cell_type": "code", "execution_count": null, "id": "QgkAyFqMjwOs", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 488 }, "id": "QgkAyFqMjwOs", "outputId": "d75509a2-1138-4022-af80-ec78ec263ec4" }, "outputs": [ { "data": { "text/html": [ "\n", "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>chunks</th><th>begin</th><th>end</th><th>entities</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>January 1, 2022</td><td>287</td><td>301</td><td>FISCAL_YEAR</td></tr>\n",
"    <tr><th>1</th><td>000-15867</td><td>476</td><td>484</td><td>CFN</td></tr>\n",
"    <tr><th>2</th><td>CADENCE DESIGN SYSTEMS, INC</td><td>527</td><td>553</td><td>ORG</td></tr>\n",
"    <tr><th>3</th><td>Delaware</td><td>650</td><td>657</td><td>STATE</td></tr>\n",
"    <tr><th>4</th><td>00-0000000</td><td>661</td><td>670</td><td>IRS</td></tr>\n",
"    <tr><th>5</th><td>2655 Seely Avenue, Building 5,\\nSan Jose,\\nCal...</td><td>772</td><td>822</td><td>ADDRESS</td></tr>\n",
"    <tr><th>6</th><td>(408)\\n-943-1234</td><td>886</td><td>900</td><td>PHONE</td></tr>\n",
"    <tr><th>7</th><td>Common Stock</td><td>1098</td><td>1109</td><td>TITLE_CLASS</td></tr>\n",
"    <tr><th>8</th><td>$0.01</td><td>1112</td><td>1116</td><td>TITLE_CLASS_VALUE</td></tr>\n",
"    <tr><th>9</th><td>CDNS</td><td>1138</td><td>1141</td><td>TICKER</td></tr>\n",
"    <tr><th>10</th><td>Nasdaq Global Select Market</td><td>1143</td><td>1169</td><td>STOCK_EXCHANGE</td></tr>\n",
"    <tr><th>11</th><td>Common Stock</td><td>3799</td><td>3810</td><td>TITLE_CLASS</td></tr>\n",
"    <tr><th>12</th><td>$0.01</td><td>3813</td><td>3817</td><td>TITLE_CLASS_VALUE</td></tr>\n",
"    <tr><th>13</th><td>Cadence Design Systems, Inc</td><td>3931</td><td>3957</td><td>ORG</td></tr>\n",
"  </tbody>\n",
"</table>
\n", " " ], "text/plain": [ " chunks begin end \\\n", "0 January 1, 2022 287 301 \n", "1 000-15867 476 484 \n", "2 CADENCE DESIGN SYSTEMS, INC 527 553 \n", "3 Delaware 650 657 \n", "4 00-0000000 661 670 \n", "5 2655 Seely Avenue, Building 5,\\nSan Jose,\\nCal... 772 822 \n", "6 (408)\\n-943-1234 886 900 \n", "7 Common Stock 1098 1109 \n", "8 $0.01 1112 1116 \n", "9 CDNS 1138 1141 \n", "10 Nasdaq Global Select Market 1143 1169 \n", "11 Common Stock 3799 3810 \n", "12 $0.01 3813 3817 \n", "13 Cadence Design Systems, Inc 3931 3957 \n", "\n", " entities \n", "0 FISCAL_YEAR \n", "1 CFN \n", "2 ORG \n", "3 STATE \n", "4 IRS \n", "5 ADDRESS \n", "6 PHONE \n", "7 TITLE_CLASS \n", "8 TITLE_CLASS_VALUE \n", "9 TICKER \n", "10 STOCK_EXCHANGE \n", "11 TITLE_CLASS \n", "12 TITLE_CLASS_VALUE \n", "13 ORG " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "ner_result = light_model.fullAnnotate(pages[0])\n", "\n", "chunks = []\n", "entities = []\n", "begin = []\n", "end = []\n", "\n", "for n in ner_result[0]['ner_chunk']:\n", " \n", " begin.append(n.begin)\n", " end.append(n.end)\n", " chunks.append(n.result)\n", " entities.append(n.metadata['entity']) \n", " \n", "df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 'entities':entities})\n", "\n", "df.head(20)" ] }, { "cell_type": "markdown", "id": "9fe41161-c8fd-467e-9fff-5d4fe1cb5160", "metadata": { "id": "9fe41161-c8fd-467e-9fff-5d4fe1cb5160" }, "source": [ "Alright! CADENCE DESIGN SYSTEMS, INC has been detected as an organization. 
\n", "\n", "Now, let's augment `CADENCE DESIGN SYSTEMS, INC` with more information about the company, given that there are no more details in the SEC10K form I can use.\n", "\n", "But before __augmenting__, there is a very important step we need to carry out: `Company Name Normalization`" ] }, { "cell_type": "markdown", "id": "pxtGC2HpSxIm", "metadata": { "id": "pxtGC2HpSxIm" }, "source": [ "🚀**We will continue this notebook in [10.1.Data_Augmentation_with_ChunkMappers.ipynb](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/10.1.Data_Augmentation_with_ChunkMappers_Edgar.ipynb)**" ] }, { "cell_type": "code", "execution_count": null, "id": "Z4_OwmPuUS-j", "metadata": { "id": "Z4_OwmPuUS-j" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [], "toc_visible": true }, "gpuClass": "standard", "kernelspec": { "display_name": "tf-gpu", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]" }, "vscode": { "interpreter": { "hash": "3f47d918ae832c68584484921185f5c85a1760864bf927a683dc6fb56366cc77" } } }, "nbformat": 4, "nbformat_minor": 5 }