{ "cells": [ { "cell_type": "markdown", "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea", "metadata": { "id": "db5f4f9a-7776-42b3-8758-85624d4c15ea" }, "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ] }, { "cell_type": "markdown", "id": "21e9eafb", "metadata": { "id": "21e9eafb" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/10.0.Data_Augmentation_with_ChunkMappers.ipynb)" ] }, { "cell_type": "markdown", "id": "4iIO6G_B3pqq", "metadata": { "collapsed": false, "id": "4iIO6G_B3pqq" }, "source": [ "#🎬 Installation" ] }, { "cell_type": "code", "execution_count": null, "id": "hPwo4Czy3pqq", "metadata": { "id": "hPwo4Czy3pqq", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "! pip install -q johnsnowlabs" ] }, { "cell_type": "markdown", "id": "YPsbAnNoPt0Z", "metadata": { "id": "YPsbAnNoPt0Z" }, "source": [ "##🔗 Automatic Installation\n", "Using my.johnsnowlabs.com SSO" ] }, { "cell_type": "code", "execution_count": null, "id": "_L-7mLYp3pqr", "metadata": { "id": "_L-7mLYp3pqr", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "from johnsnowlabs import nlp, finance\n", "\n", "# nlp.install(force_browser=True)" ] }, { "cell_type": "markdown", "id": "hsJvn_WWM2GL", "metadata": { "id": "hsJvn_WWM2GL" }, "source": [ "##🔗 Manual downloading\n", "If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to update the license manually.\n", "\n", "- Go to my.johnsnowlabs.com\n", "- Download your license\n", "- Upload it using the following command" ] }, { "cell_type": "code", "execution_count": null, "id": "i57QV3-_P2sQ", "metadata": { "id": "i57QV3-_P2sQ" }, "outputs": [], "source": [ "from google.colab import files\n", "print('Please Upload your John Snow Labs License using the button below')\n", 
"license_keys = files.upload()" ] }, { "cell_type": "markdown", "id": "xGgNdFzZP_hQ", "metadata": { "id": "xGgNdFzZP_hQ" }, "source": [ "- Install it" ] }, { "cell_type": "code", "execution_count": null, "id": "OfmmPqknP4rR", "metadata": { "id": "OfmmPqknP4rR" }, "outputs": [], "source": [ "nlp.install()" ] }, { "cell_type": "markdown", "id": "DCl5ErZkNNLk", "metadata": { "id": "DCl5ErZkNNLk" }, "source": [ "#📌 Starting" ] }, { "cell_type": "code", "execution_count": null, "id": "x3jVICoa3pqr", "metadata": { "id": "x3jVICoa3pqr" }, "outputs": [], "source": [ "spark = nlp.start()" ] }, { "cell_type": "markdown", "id": "cfbbcfc0-e0b7-4c25-8bd7-c64d90f836d1", "metadata": { "id": "cfbbcfc0-e0b7-4c25-8bd7-c64d90f836d1" }, "source": [ "#🔎 Financial Data Augmentation with Chunk Mappers" ] }, { "cell_type": "markdown", "id": "d2cd4221-fbca-4ca1-86a9-65e6264c4ad1", "metadata": { "id": "d2cd4221-fbca-4ca1-86a9-65e6264c4ad1" }, "source": [ "#🚀 About Data Augmentation" ] }, { "cell_type": "markdown", "id": "bf9835fd-9def-44e4-b022-e8db0f045fec", "metadata": { "id": "bf9835fd-9def-44e4-b022-e8db0f045fec" }, "source": [ "__Data Augmentation__ is the process of enriching an extracted data point with information from external sources.\n", "\n", "For example, suppose we are working with a document that mentions the company _Amazon_. The document could be about stock prices, litigation, or a commercial agreement with a provider, among other topics.\n", "\n", "Using NER, we can extract the company name from the document as an Organization, but that is all the information the document gives us about the company.\n", "\n", "With __Data Augmentation__, we can use external sources such as _SEC Edgar, Crunchbase, Nasdaq_ or even _Wikipedia_ to enrich the company with much more information, allowing us to make better decisions.\n", "\n", "Let's see how to do it." 
] }, { "cell_type": "markdown", "id": "ZTo7fsIKgod-", "metadata": { "id": "ZTo7fsIKgod-" }, "source": [ "##📌 Sample Texts from Cadence Design Systems\n", "\n", "Examples are taken from publicly available information about Cadence in the SEC's EDGAR database [here](https://www.sec.gov/Archives/edgar/data/813672/000081367222000012/cdns-20220101.htm) and on [Wikipedia](https://en.wikipedia.org/wiki/Cadence_Design_Systems)." ] }, { "cell_type": "code", "execution_count": null, "id": "Br_90qxMgn0m", "metadata": { "id": "Br_90qxMgn0m" }, "outputs": [], "source": [ "! wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt" ] }, { "cell_type": "code", "execution_count": null, "id": "RfPlg8I2gn6A", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "RfPlg8I2gn6A", "outputId": "61bb61d1-c713-4e84-c9d6-03b7fe30c2f6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Table of Contents\n", "UNITED STATES SECURITIES AND EXCHANGE COMMISSION\n", "Washington, D.C. 20549\n", "_____________________________________ \n", "FORM 10-K \n", "_____________________________________ \n", "(Mark One)\n", "☒\n", "ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n", "For the fiscal year en\n" ] } ], "source": [ "with open('cdns-20220101.html.txt', 'r') as f:\n", " cadence_sec10k = f.read()\n", "print(cadence_sec10k[:300])" ] }, { "cell_type": "code", "execution_count": null, "id": "SaX_618cgn-d", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "SaX_618cgn-d", "outputId": "86286d28-2550-43ac-a46a-2e54415b7db8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "UNITED STATES SECURITIES AND EXCHANGE COMMISSION\n", "Washington, D.C. 
20549\n", "_____________________________________ \n", "FORM 10-K \n", "_____________________________________ \n", "(Mark One)\n", "☒\n", "ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n", "For the fiscal year ended January 1, 2022 \n", "OR\n", "☐\n", "TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n", "For the transition period from _________ to_________.\n", "\n", "Commission file number 000-15867 \n", "_____________________________________\n", " \n", "CADENCE DESIGN SYSTEMS, INC. \n", "(Exact name of registrant as specified in its charter)\n", "____________________________________ \n", "Delaware\n", " \n", "00-0000000\n", "(State or Other Jurisdiction ofIncorporation or Organization)\n", " \n", "(I.R.S. EmployerIdentification No.)\n", "2655 Seely Avenue, Building 5,\n", "San Jose,\n", "California\n", " \n", "95134\n", "(Address of Principal Executive Offices)\n", " \n", "(Zip Code)\n", "(408)\n", "-943-1234 \n", "(Registrant’s Telephone Number, including Area Code) \n", "Securities registered pursuant to Section 12(b) of the Act:\n", "Title of Each Class\n", "Trading Symbol(s)\n", "Names of Each Exchange on which Registered\n", "Common Stock, $0.01 par value per share\n", "CDNS\n", "Nasdaq Global Select Market\n", "Securities registered pursuant to Section 12(g) of the Act:\n", "None\n", "Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act. \n", " Yes \n", "☒\n", " No \n", "☐\n", "Indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act. 
\n", " Yes \n", "☐ \n", "No \n", "☒\n", "Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. \n", " Yes \n", "☒\n", " No \n", "☐\n", "Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§ 232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). \n", " Yes \n", "☒\n", " No \n", "☐\n", "Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of “large accelerated filer,” “accelerated filer,” “smaller reporting company,” and “emerging growth company” in Rule 12b-2 of the Exchange Act.\n", "Large Accelerated Filer\n", "☒\n", "Accelerated Filer\n", "☐\n", "Non-accelerated Filer\n", "☐\n", "Smaller Reporting Company\n", "☐\n", "Emerging Growth Company\n", "☐\n", "If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act. \n", "☐\n", "Indicate by check mark whether the registrant has filed a report on and attestation to its management’s assessment of the effectiveness of its internal control over financial reporting under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit report. 
\n", "☒\n", "Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Act). \n", " Yes \n", "☐ \n", "No \n", "☒\n", "The aggregate market value of the voting and non-voting common equity held by non-affiliates computed by reference to the price at which the common equity was last sold as of the last business day of the registrant’s most recently completed second fiscal quarter ended July 3, 2021 was approximately $38,179,000,000.\n", "On February 5, 2022, approximately 277,336,000 shares of the Registrant’s Common Stock, $0.01 par value, were outstanding.\n", "DOCUMENTS INCORPORATED BY REFERENCE\n", "Portions of the definitive proxy statement for Cadence Design Systems, Inc.’s 2022 Annual Meeting of Stockholders are incorporated by reference into Part III hereof.\n", "\n", "\n", "\n" ] } ], "source": [ "pages = [x for x in cadence_sec10k.split(\"Table of Contents\") if x.strip() != '']\n", "print(pages[0])" ] }, { "cell_type": "markdown", "id": "nLLpahwrg7uz", "metadata": { "id": "nLLpahwrg7uz" }, "source": [ "##📌 Step 1: Using Text Classification to find Relevant Parts of the Document: 10K Summary\n", "In this case, we know page 0 is always the page with summary information about the company. However, let's suppose we don't know it. 
In that case, we can use page classification.\n", "\n", "To detect the SEC 10-K summary page, there is a dedicated model called `\"finclf_form_10k_summary_item\"`." ] }, { "cell_type": "code", "execution_count": null, "id": "0lJbpKJWgGg0", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0lJbpKJWgGg0", "outputId": "631dbcda-ef2d-418c-f43d-b549dbca5ce7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tfhub_use download started this may take some time.\n", "Approximate size to download 923.7 MB\n", "[OK!]\n", "finclf_form_10k_summary_item download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "# Text Classifier\n", "# This pipeline lets you plug in different classification models to decide whether an input text belongs to a specific class or is something else.\n", " \n", "document_assembler = nlp.DocumentAssembler() \\\n", " .setInputCol(\"text\") \\\n", " .setOutputCol(\"document\")\n", "\n", "use_embeddings = nlp.UniversalSentenceEncoder.pretrained()\\\n", " .setInputCols(\"document\") \\\n", " .setOutputCol(\"sentence_embeddings\")\n", "\n", "classifier = finance.ClassifierDLModel.pretrained(\"finclf_form_10k_summary_item\", \"en\", \"finance/models\")\\\n", " .setInputCols([\"sentence_embeddings\"])\\\n", " .setOutputCol(\"category\")\n", "\n", "nlpPipeline = nlp.Pipeline(stages=[\n", " document_assembler, \n", " use_embeddings,\n", " classifier])\n", "\n", "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(empty_data)" ] }, { "cell_type": "code", "execution_count": null, "id": "vCmCEiMNgoBC", "metadata": { "id": "vCmCEiMNgoBC" }, "outputs": [], "source": [ "df = spark.createDataFrame([[pages[0]]]).toDF(\"text\")\n", "\n", "result = model.transform(df).cache()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "qp6M15QegoDi", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "qp6M15QegoDi", "outputId": "8b0b3637-93e4-4fd4-f858-0900bbeac96e" }, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------------+\n", "|            result|\n", "+------------------+\n", "|[form_10k_summary]|\n", "+------------------+\n", "\n" ] } ], "source": [ "result.select('category.result').show()" ] }, { "cell_type": "markdown", "id": "Ar1gryg4jpIc", "metadata": { "id": "Ar1gryg4jpIc" }, "source": [ "##📌 Step 2: Named Entity Recognition on 10K Summary\n", "Named Entity Recognition (NER) is the main component for information extraction: it finds and labels entities in text.\n", "\n", "This time we will use a model trained to extract many entity types from 10-K summaries." ] }, { "cell_type": "code", "execution_count": null, "id": "cb765952-24c2-48b6-8d86-5413b13bd9fa", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cb765952-24c2-48b6-8d86-5413b13bd9fa", "outputId": "74f1d58a-7f21-4d89-dbc4-de22bc47e818" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bert_embeddings_sec_bert_base download started this may take some time.\n", "Approximate size to download 390.4 MB\n", "[OK!]\n", "finner_sec_10k_summary download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "textSplitter = finance.TextSplitter()\\\n", " .setInputCols([\"document\"])\\\n", " .setOutputCol(\"sentence\")\n", "\n", "tokenizer = nlp.Tokenizer()\\\n", " .setInputCols([\"sentence\"])\\\n", " .setOutputCol(\"token\")\n", "\n", "embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\") \\\n", " .setInputCols([\"sentence\", \"token\"]) \\\n", " .setOutputCol(\"embeddings\")\n", "\n", "ner_model = finance.NerModel.pretrained(\"finner_sec_10k_summary\", \"en\", \"finance/models\")\\\n", " .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n", " .setOutputCol(\"ner\")\n", "\n", "ner_converter = nlp.NerConverter()\\\n", " .setInputCols([\"sentence\",\"token\",\"ner\"])\\\n", " .setOutputCol(\"ner_chunk\")\n", "\n", "nlpPipeline = nlp.Pipeline(stages=[\n", " document_assembler,\n", " 
textSplitter,\n", " tokenizer,\n", " embeddings,\n", " ner_model,\n", " ner_converter,\n", "])\n", "\n", "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(empty_data)\n", "\n", "light_model = nlp.LightPipeline(model)" ] }, { "cell_type": "markdown", "id": "37eae9a4-52e1-400e-a1dd-effc6ed1da35", "metadata": { "id": "37eae9a4-52e1-400e-a1dd-effc6ed1da35" }, "source": [ "##✅ We use LightPipeline to get the result" ] }, { "cell_type": "code", "execution_count": null, "id": "QgkAyFqMjwOs", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 488 }, "id": "QgkAyFqMjwOs", "outputId": "d75509a2-1138-4022-af80-ec78ec263ec4" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "| | chunks | begin | end | entities |\n", "
|---|---|---|---|---|\n", "
| 0 | January 1, 2022 | 287 | 301 | FISCAL_YEAR |\n", "
| 1 | 000-15867 | 476 | 484 | CFN |\n", "
| 2 | CADENCE DESIGN SYSTEMS, INC | 527 | 553 | ORG |\n", "
| 3 | Delaware | 650 | 657 | STATE |\n", "
| 4 | 00-0000000 | 661 | 670 | IRS |\n", "
| 5 | 2655 Seely Avenue, Building 5,\\nSan Jose,\\nCal... | 772 | 822 | ADDRESS |\n", "
| 6 | (408)\\n-943-1234 | 886 | 900 | PHONE |\n", "
| 7 | Common Stock | 1098 | 1109 | TITLE_CLASS |\n", "
| 8 | $0.01 | 1112 | 1116 | TITLE_CLASS_VALUE |\n", "
| 9 | CDNS | 1138 | 1141 | TICKER |\n", "
| 10 | Nasdaq Global Select Market | 1143 | 1169 | STOCK_EXCHANGE |\n", "
| 11 | Common Stock | 3799 | 3810 | TITLE_CLASS |\n", "
| 12 | $0.01 | 3813 | 3817 | TITLE_CLASS_VALUE |\n", "
| 13 | Cadence Design Systems, Inc | 3931 | 3957 | ORG |\n", "