{ "cells": [ { "cell_type": "markdown", "id": "T7aq9n4pXuQ7", "metadata": { "id": "T7aq9n4pXuQ7" }, "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ] }, { "cell_type": "markdown", "id": "iKZqb8UQXvIG", "metadata": { "id": "iKZqb8UQXvIG" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/06.0.Relation_Extraction.ipynb)" ] }, { "cell_type": "markdown", "id": "4iIO6G_B3pqq", "metadata": { "collapsed": false, "id": "4iIO6G_B3pqq" }, "source": [ "#🎬 Installation" ] }, { "cell_type": "code", "execution_count": null, "id": "hPwo4Czy3pqq", "metadata": { "id": "hPwo4Czy3pqq", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "! pip install -q johnsnowlabs" ] }, { "cell_type": "markdown", "id": "YPsbAnNoPt0Z", "metadata": { "id": "YPsbAnNoPt0Z" }, "source": [ "##🔗 Automatic Installation\n", "Using my.johnsnowlabs.com SSO" ] }, { "cell_type": "code", "execution_count": null, "id": "_L-7mLYp3pqr", "metadata": { "id": "_L-7mLYp3pqr", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "from johnsnowlabs import nlp, finance, viz\n", "\n", "# nlp.install(force_browser=True)" ] }, { "cell_type": "markdown", "id": "hsJvn_WWM2GL", "metadata": { "id": "hsJvn_WWM2GL" }, "source": [ "##🔗 Manual downloading\n", "If you are not registered at my.johnsnowlabs.com, received your license via email, or are using Safari, you may need to upload the license manually.\n", "\n", "- Go to my.johnsnowlabs.com\n", "- Download your license\n", "- Upload it using the following command" ] }, { "cell_type": "code", "execution_count": null, "id": "i57QV3-_P2sQ", "metadata": { "id": "i57QV3-_P2sQ" }, "outputs": [], "source": [ "from google.colab import files\n", "print('Please Upload your John Snow Labs License using the button below')\n", "license_keys = files.upload()" ] }, { "cell_type": 
"markdown", "id": "xGgNdFzZP_hQ", "metadata": { "id": "xGgNdFzZP_hQ" }, "source": [ "- Install it" ] }, { "cell_type": "code", "execution_count": null, "id": "OfmmPqknP4rR", "metadata": { "id": "OfmmPqknP4rR" }, "outputs": [], "source": [ "nlp.install()" ] }, { "cell_type": "markdown", "id": "DCl5ErZkNNLk", "metadata": { "id": "DCl5ErZkNNLk" }, "source": [ "#📌 Starting" ] }, { "cell_type": "code", "execution_count": null, "id": "x3jVICoa3pqr", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "x3jVICoa3pqr", "outputId": "17a65b67-10bd-4097-d91a-d7de00e1f6ac" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7187 (2).json\n", "👌 Launched \u001b[92mcpu optimized\u001b[39m session with with: 🚀Spark-NLP==4.2.8, 💊Spark-Healthcare==4.2.8, running on ⚡ PySpark==3.1.2\n" ] } ], "source": [ "spark = nlp.start()" ] }, { "cell_type": "markdown", "id": "Zk7IQlPsX2DR", "metadata": { "id": "Zk7IQlPsX2DR" }, "source": [ "#🔎 Financial Relation Extraction(RE)" ] }, { "cell_type": "markdown", "id": "HTxWUmGlL648", "metadata": { "id": "HTxWUmGlL648" }, "source": [ "Financial relation extraction is a process of automatically extracting structured information from unstructured text data related to finance and economics. This can be done using natural language processing (NLP) techniques, such as named entity recognition and relation extraction.\n", "\n", "Some examples of financial relation extraction include extracting information about companies and their financial performance, such as revenue, profits, and debt, as well as information about financial markets and economic indicators, such as stock prices and exchange rates." 
] }, { "cell_type": "markdown", "id": "JSrrIXjYYSb1", "metadata": { "id": "JSrrIXjYYSb1" }, "source": [ "##✔ Pretrained Relation Extraction Models for Finance\n", "\n", "📚Here is the list of pretrained Relation Extraction models:" ] }, { "cell_type": "markdown", "id": "jo3aOCNwYSlk", "metadata": { "id": "jo3aOCNwYSlk" }, "source": [ "**Relation Extraction Models**\n", "\n", "|index|model|\n", "|-----:|:-----|\n", "| 1| [Financial Relation Extraction on Earning Calls (Small)](https://nlp.johnsnowlabs.com/2022/11/28/finre_earning_calls_sm_en.html) | \n", "| 2| [Financial Relation Extraction on 10K filings (Small)](https://nlp.johnsnowlabs.com/2022/11/07/finre_financial_small_en.html) | \n", "| 3| [Financial Relation Extraction (Tickers)](https://nlp.johnsnowlabs.com/2022/10/15/finre_has_ticker_en.html) |\n", "| 4| [Financial Relation Extraction (Acquisitions / Subsidiaries)](https://nlp.johnsnowlabs.com/2022/11/08/finre_acquisitions_subsidiaries_md_en.html) | \n", "| 5| [Financial Relation Extraction (Work Experience, Medium)](https://nlp.johnsnowlabs.com/2022/11/08/finre_work_experience_md_en.html) |\n", "| 6| [Financial Relation Extraction (Work Experience, Small)](https://nlp.johnsnowlabs.com/2022/09/28/finre_work_experience_en.html) | \n", "| 7| [Financial Relation Extraction (Alias)](https://nlp.johnsnowlabs.com/2022/08/17/finre_org_prod_alias_en_3_2.html) |\n", "| 8| [Financial Zero-shot Relation Extraction](https://nlp.johnsnowlabs.com/2022/08/22/finre_zero_shot_en_3_2.html) |\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "XVuxdiKgi_qd", "metadata": { "id": "XVuxdiKgi_qd" }, "source": [ "##✔ Common Components\n", "📚This pipeline will:\n", "1. Split Text into Sentences\n", "2. Split Sentences into Words\n", "3. 
Use Financial Text Embeddings, trained on SEC documents, to obtain numerical semantic representation of words\n", "\n", "**These components are common for all the pipelines we will use.**" ] }, { "cell_type": "code", "execution_count": null, "id": "W4VQJrV8lywb", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "W4VQJrV8lywb", "outputId": "167c0411-429c-43ab-fce2-d23cefc7a44e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bert_embeddings_sec_bert_base download started this may take some time.\n", "Approximate size to download 390.4 MB\n", "[OK!]\n" ] } ], "source": [ "def get_generic_base_pipeline():\n", " \"\"\"Common components used in all pipelines\"\"\"\n", " document_assembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", "\n", " text_splitter = finance.TextSplitter()\\\n", " .setInputCols([\"document\"])\\\n", " .setOutputCol(\"sentence\")\n", " \n", " tokenizer = nlp.Tokenizer()\\\n", " .setInputCols([\"sentence\"])\\\n", " .setOutputCol(\"token\")\n", "\n", " embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\") \\\n", " .setInputCols([\"sentence\", \"token\"])\\\n", " .setOutputCol(\"embeddings\")\n", "\n", " base_pipeline = nlp.Pipeline(stages=[\n", " document_assembler,\n", " text_splitter,\n", " tokenizer,\n", " embeddings\n", " ])\n", "\n", " return base_pipeline\n", " \n", "generic_base_pipeline = get_generic_base_pipeline()" ] }, { "cell_type": "code", "execution_count": null, "id": "nfMVrwKulyzQ", "metadata": { "id": "nfMVrwKulyzQ" }, "outputs": [], "source": [ "# Text Classifier\n", "def get_text_classification_pipeline(model):\n", " \"\"\"This pipeline allows you to use different classification models to understand if an input text is of a specific class or is something else.\n", " It will be used to check where the first summary page of SEC10K is, where the sections of Acquisitions and Subsidiaries are, or where in 
the document\n", " the management roles and experiences are mentioned\"\"\"\n", " document_assembler = nlp.DocumentAssembler() \\\n", " .setInputCol(\"text\") \\\n", " .setOutputCol(\"document\")\n", "\n", " embeddings = nlp.UniversalSentenceEncoder.pretrained() \\\n", " .setInputCols(\"document\") \\\n", " .setOutputCol(\"sentence_embeddings\")\n", "\n", " classifier = nlp.ClassifierDLModel.pretrained(model, \"en\", \"finance/models\")\\\n", " .setInputCols([\"sentence_embeddings\"])\\\n", " .setOutputCol(\"category\")\n", "\n", " nlpPipeline = nlp.Pipeline(stages=[\n", " document_assembler, \n", " embeddings,\n", " classifier])\n", " \n", " return nlpPipeline" ] }, { "cell_type": "code", "execution_count": null, "id": "EGMAqo943IVp", "metadata": { "id": "EGMAqo943IVp" }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "def get_relations_df (results, col='relations'):\n", " \"\"\"Shows a Dataframe with the relations extracted by Spark NLP\"\"\"\n", " rel_pairs=[]\n", " for rel in results[0][col]:\n", " rel_pairs.append((\n", " rel.result, \n", " rel.metadata['entity1'], \n", " rel.metadata['entity1_begin'],\n", " rel.metadata['entity1_end'],\n", " rel.metadata['chunk1'], \n", " rel.metadata['entity2'],\n", " rel.metadata['entity2_begin'],\n", " rel.metadata['entity2_end'],\n", " rel.metadata['chunk2'], \n", " rel.metadata['confidence']\n", " ))\n", "\n", " rel_df = pd.DataFrame(rel_pairs, columns=['relation','entity1','entity1_begin','entity1_end','chunk1','entity2','entity2_begin','entity2_end','chunk2', 'confidence'])\n", "\n", " return rel_df" ] }, { "cell_type": "markdown", "id": "Zy0v4uCxxw8N", "metadata": { "id": "Zy0v4uCxxw8N" }, "source": [ "##✔ NER and Relation Extraction\n", "NER only extracts isolated entities by itself. 
However, you can combine NER models with Relation Extraction annotators trained on their entities to determine whether those entities are related to each other.\n", "\n", "Let's suppose we want to extract information about **Acquisitions** and **Subsidiaries**. If we don't know where that information is in the document, we can use Text Classifiers to find it." ] }, { "cell_type": "markdown", "id": "fO-eR0rqyB4g", "metadata": { "id": "fO-eR0rqyB4g" }, "source": [ "##✔ Using Text Classification to find Relevant Parts of the Document: Acquisitions and Subsidiaries\n", "To find the pages of an SEC 10-K that discuss acquisitions, we have a specific model called `\"finclf_acquisitions_item\"`\n", "\n", "Let's send some pages and check which one(s) contain that information. In a real case, you could send all the pages to the model, but to save time we will show just a subset here." ] }, { "cell_type": "markdown", "id": "0nPVM9_By2Zc", "metadata": { "id": "0nPVM9_By2Zc" }, "source": [ "###📌 Sample Texts from Cadence Design Systems\n", "Examples taken from publicly available information about Cadence in SEC's Edgar database [here](https://www.sec.gov/Archives/edgar/data/813672/000081367222000012/cdns-20220101.htm) and [Wikipedia](https://en.wikipedia.org/wiki/Cadence_Design_Systems)" ] }, { "cell_type": "code", "execution_count": null, "id": "JjPwViV0yAI6", "metadata": { "id": "JjPwViV0yAI6" }, "outputs": [], "source": [ "!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt" ] }, { "cell_type": "code", "execution_count": null, "id": "KiX1xoXdy81t", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "KiX1xoXdy81t", "outputId": "6f9cdb51-386c-4bf6-9707-88d0806e69d1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Table of Contents\n", "UNITED STATES SECURITIES AND EXCHANGE COMMISSION\n", "Washington, D.C. 
20549\n", "__________\n" ] } ], "source": [ "with open('cdns-20220101.html.txt', 'r') as f:\n", " cadence_sec10k = f.read()\n", "print(cadence_sec10k[:100])" ] }, { "cell_type": "code", "execution_count": null, "id": "n1whJMYxzBEi", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "n1whJMYxzBEi", "outputId": "62e209f1-8844-4d95-8a06-1a08a01d2442" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "UNITED STATES SECURITIES AND EXCHANGE COMMISSION\n", "Washington, D.C. 20549\n", "_____________________________________ \n", "FORM 10-K \n", "_____________________________________ \n", "(Mark One)\n", "☒\n", "ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n", "For the fiscal year ended January 1, 2022 \n", "OR\n", "☐\n", "TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n", "For the transition period from _________ to_________.\n", "\n", "Commission file number 000-15867 \n", "_____________________________________\n", " \n", "CADENCE DESIGN SYSTEMS, INC. \n", "(Exact name of registrant as specified in its charter)\n", "____________________________________ \n", "Delaware\n", " \n", "00-0000000\n", "(State or Other Jurisdiction ofIncorporation or Organization)\n", " \n", "(I.R.S. 
EmployerIdentification No.)\n", "2655 Seely Avenue, Building 5,\n", "San Jose,\n", "California\n", " \n", "95134\n", "(Address of Principal Executive Offices)\n", " \n", "(Zip Code)\n", "(408)\n", "-943-1234 \n", "(Registrant’s Telephone Number, including Area Code) \n", "Securities registered pursuant to Section 12(b) of the Act:\n", "Title of Each Class\n", "Trading Symbol(s)\n", "Names of Each Exchange on which Registered\n", "Common Stock, $0.01 par value per share\n", "CDNS\n", "Nasdaq Global Select Market\n", "Securities registered pursuant to Section 12(g) of the Act:\n", "None\n", "Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act. \n", " Yes \n", "☒\n", " No \n", "☐\n", "Indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act. \n", " Yes \n", "☐ \n", "No \n", "☒\n", "Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. \n", " Yes \n", "☒\n", " No \n", "☐\n", "Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§ 232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). \n", " Yes \n", "☒\n", " No \n", "☐\n", "Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. 
See the definitions of “large accelerated filer,” “accelerated filer,” “smaller reporting company,” and “emerging growth company” in Rule 12b-2 of the Exchange Act.\n", "Large Accelerated Filer\n", "☒\n", "Accelerated Filer\n", "☐\n", "Non-accelerated Filer\n", "☐\n", "Smaller Reporting Company\n", "☐\n", "Emerging Growth Company\n", "☐\n", "If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act. \n", "☐\n", "Indicate by check mark whether the registrant has filed a report on and attestation to its management’s assessment of the effectiveness of its internal control over financial reporting under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit report. \n", "☒\n", "Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Act). 
\n", " Yes \n", "☐ \n", "No \n", "☒\n", "The aggregate market value of the voting and non-voting common equity held by non-affiliates computed by reference to the price at which the common equity was last sold as of the last business day of the registrant’s most recently completed second fiscal quarter ended July 3, 2021 was approximately $38,179,000,000.\n", "On February 5, 2022, approximately 277,336,000 shares of the Registrant’s Common Stock, $0.01 par value, were outstanding.\n", "DOCUMENTS INCORPORATED BY REFERENCE\n", "Portions of the definitive proxy statement for Cadence Design Systems, Inc.’s 2022 Annual Meeting of Stockholders are incorporated by reference into Part III hereof.\n", "\n", "\n", "\n" ] } ], "source": [ "pages = [x for x in cadence_sec10k.split(\"Table of Contents\") if x.strip() != '']\n", "print(pages[0])" ] }, { "cell_type": "code", "execution_count": null, "id": "jEo8dbtbzBxP", "metadata": { "id": "jEo8dbtbzBxP" }, "outputs": [], "source": [ "# Some examples\n", "candidates = [[pages[0]], [pages[1]], [pages[35]], [pages[67]]] " ] }, { "cell_type": "code", "execution_count": null, "id": "cSQ_3n5DzBz4", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cSQ_3n5DzBz4", "outputId": "0da38ec7-b79e-4280-86fe-f9f64edd3e88" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tfhub_use download started this may take some time.\n", "Approximate size to download 923.7 MB\n", "[OK!]\n", "finclf_acquisitions_item download started this may take some time.\n", "Approximate size to download 21.3 MB\n", "[OK!]\n" ] } ], "source": [ "classification_pipeline = get_text_classification_pipeline('finclf_acquisitions_item')\n", "\n", "df = spark.createDataFrame(candidates).toDF(\"text\")\n", "\n", "model = classification_pipeline.fit(df)\n", "\n", "result = model.transform(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "-saM1i7izB2k", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, 
"id": "-saM1i7izB2k", "outputId": "bbd02d06-39e1-4127-f796-626c9f7b2033" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------+\n", "| result|\n", "+--------------+\n", "| [other]|\n", "| [other]|\n", "| [other]|\n", "|[acquisitions]|\n", "+--------------+\n", "\n" ] } ], "source": [ "result.select('category.result').show()" ] }, { "cell_type": "markdown", "id": "mP5jDp_V2Aay", "metadata": { "id": "mP5jDp_V2Aay" }, "source": [ "###📌 Acquisitions, Subsidiaries and Former Names\n", "📚Let's use some NER models to obtain information about Organizations and Dates, and understand if:\n", "- An ORG was acquired by another ORG\n", "- An ORG is a subsidiary of another ORG\n", "- An ORG name is an alias / abbreviation / acronym / etc. of another ORG\n", "\n", "We will use the detected `pages[67]` as input" ] }, { "cell_type": "code", "execution_count": null, "id": "XecIx37zzB41", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "XecIx37zzB41", "outputId": "f7cb81ab-58cb-4396-f1c2-5a5b267fa24c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "finner_sec_dates download started this may take some time.\n", "[OK!]\n", "finner_orgs_prods_alias download started this may take some time.\n", "[OK!]\n", "pos_anc download started this may take some time.\n", "Approximate size to download 3.9 MB\n", "[OK!]\n", "dependency_conllu download started this may take some time.\n", "Approximate size to download 16.7 MB\n", "[OK!]\n", "finre_acquisitions_subsidiaries_md download started this may take some time.\n", "[OK!]\n" ] } ], "source": [ "ner_model_date = finance.NerModel.pretrained(\"finner_sec_dates\", \"en\", \"finance/models\")\\\n", " .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n", " .setOutputCol(\"ner_dates\")\n", "\n", "ner_converter_date = nlp.NerConverter()\\\n", " .setInputCols([\"sentence\",\"token\",\"ner_dates\"])\\\n", " .setOutputCol(\"ner_chunk_date\")\n", "\n", "ner_model_org= 
finance.NerModel.pretrained(\"finner_orgs_prods_alias\", \"en\", \"finance/models\")\\\n", " .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n", " .setOutputCol(\"ner_orgs\")\n", "\n", "ner_converter_org = nlp.NerConverter()\\\n", " .setInputCols([\"sentence\",\"token\",\"ner_orgs\"])\\\n", " .setOutputCol(\"ner_chunk_org\")\\\n", "\n", "chunk_merger = finance.ChunkMergeApproach()\\\n", " .setInputCols('ner_chunk_org', \"ner_chunk_date\")\\\n", " .setOutputCol('ner_chunk')\n", "\n", "pos = nlp.PerceptronModel.pretrained()\\\n", " .setInputCols([\"sentence\", \"token\"])\\\n", " .setOutputCol(\"pos\")\n", "\n", "dependency_parser = nlp.DependencyParserModel().pretrained(\"dependency_conllu\", \"en\")\\\n", " .setInputCols([\"sentence\", \"pos\", \"token\"])\\\n", " .setOutputCol(\"dependencies\")\n", "\n", "re_filter = finance.RENerChunksFilter()\\\n", " .setInputCols([\"ner_chunk\", \"dependencies\"])\\\n", " .setOutputCol(\"re_ner_chunk\")\\\n", " .setRelationPairs([\"ORG-ORG\", \"ORG-DATE\"])\\\n", " .setMaxSyntacticDistance(10)\n", "\n", "reDL = finance.RelationExtractionDLModel().pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\\\n", " .setInputCols([\"re_ner_chunk\", \"sentence\"])\\\n", " .setOutputCol(\"relations_acq\")\\\n", " .setPredictionThreshold(0.1)\n", "\n", "annotation_merger = finance.AnnotationMerger()\\\n", " .setInputCols(\"relations_acq\", \"relations_alias\")\\\n", " .setOutputCol(\"relations\")\n", "\n", "nlpPipeline = nlp.Pipeline(stages=[\n", " generic_base_pipeline,\n", " ner_model_date,\n", " ner_converter_date,\n", " ner_model_org,\n", " ner_converter_org,\n", " chunk_merger,\n", " pos,\n", " dependency_parser,\n", " re_filter,\n", " reDL,\n", " annotation_merger])\n", "\n", "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(empty_data)\n", "\n", "light_model = nlp.LightPipeline(model)" ] }, { "cell_type": "code", "execution_count": null, "id": 
"jneVpxi026q6", "metadata": { "id": "jneVpxi026q6" }, "outputs": [], "source": [ "sample_text = pages[67].replace(\"“\", \"\\\"\").replace(\"”\", \"\\\"\")" ] }, { "cell_type": "code", "execution_count": null, "id": "P4Uzu9zc28ig", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 722 }, "id": "P4Uzu9zc28ig", "outputId": "50d9c634-f0d8-41fb-eab2-c670c425d274" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " " ], "text/plain": [ " relation entity1 entity1_begin entity1_end \\\n", "0 has_acquisition_date ORG 440 446 \n", "1 has_acquisition_date ORG 490 504 \n", "2 was_acquired_by ORG 490 504 \n", "3 was_acquired_by ORG 518 540 \n", "4 was_acquired_by ORG 518 540 \n", "5 other ORG 1210 1212 \n", "6 other ORG 1229 1235 \n", "7 other ORG 1905 1907 \n", "8 has_acquisition_date ORG 1955 1961 \n", "9 other DATE 2219 2229 \n", "10 other DATE 2235 2245 \n", "11 other DATE 2539 2549 \n", "12 other DATE 2552 2555 \n", "13 other DATE 2560 2563 \n", "14 other DATE 3191 3222 \n", "\n", " chunk1 entity2 entity2_begin entity2_end \\\n", "0 Cadence DATE 427 437 \n", "1 AWR Corporation DATE 427 437 \n", "2 AWR Corporation ORG 440 446 \n", "3 Integrand Software, Inc ORG 440 446 \n", "4 Integrand Software, Inc ORG 490 504 \n", "5 AWR ORG 1218 1226 \n", "6 Cadence DATE 1358 1367 \n", "7 AWR ORG 1913 1921 \n", "8 Cadence DATE 2007 2017 \n", "9 fiscal 2021 ORG 2322 2330 \n", "10 fiscal 2020 ORG 2322 2330 \n", "11 fiscal 2021 ORG 2598 2606 \n", "12 2020 ORG 2598 2606 \n", "13 2019 ORG 2598 2606 \n", "14 the third quarter of fiscal 2021 ORG 3262 3270 \n", "\n", " chunk2 confidence \n", "0 fiscal 2020 0.99945384 \n", "1 fiscal 2020 0.99891853 \n", "2 Cadence 0.99111485 \n", "3 Cadence 0.99635243 \n", "4 AWR Corporation 0.94192755 \n", "5 Integrand 0.9999858 \n", "6 nine years 0.996561 \n", "7 Integrand 0.9999651 \n", "8 fiscal 2020 0.99776745 \n", "9 Cadence’s 0.99219704 \n", "10 Cadence’s 0.99703074 \n", "11 Cadence’s 0.94122887 \n", "12 Cadence’s 0.96238184 \n", "13 Cadence’s 0.9658956 \n", "14 Cadence’s 0.5690664 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = light_model.fullAnnotate(sample_text)\n", "\n", "rel_df = get_relations_df(result)\n", "\n", "rel_df" ] }, { "cell_type": "code", "execution_count": null, "id": "ThcV6sMY4_u0", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 405 }, "id": 
"ThcV6sMY4_u0", "outputId": "e2ede868-cd59-4fa2-dd00-3a7a3569a6c3" }, "outputs": [ { "data": { "text/html": [ "\n", "
 \n", " " ], "text/plain": [ " relation entity1 entity1_begin entity1_end \\\n", "0 has_acquisition_date ORG 440 446 \n", "1 has_acquisition_date ORG 490 504 \n", "2 was_acquired_by ORG 490 504 \n", "3 was_acquired_by ORG 518 540 \n", "4 was_acquired_by ORG 518 540 \n", "8 has_acquisition_date ORG 1955 1961 \n", "\n", " chunk1 entity2 entity2_begin entity2_end chunk2 \\\n", "0 Cadence DATE 427 437 fiscal 2020 \n", "1 AWR Corporation DATE 427 437 fiscal 2020 \n", "2 AWR Corporation ORG 440 446 Cadence \n", "3 Integrand Software, Inc ORG 440 446 Cadence \n", "4 Integrand Software, Inc ORG 490 504 AWR Corporation \n", "8 Cadence DATE 2007 2017 fiscal 2020 \n", "\n", " confidence \n", "0 0.99945384 \n", "1 0.99891853 \n", "2 0.99111485 \n", "3 0.99635243 \n", "4 0.94192755 \n", "8 0.99776745 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rel_df = rel_df[(rel_df[\"relation\"] != \"other\") & (rel_df[\"relation\"] != \"no_rel\")]\n", "\n", "rel_df" ] }, { "cell_type": "markdown", "id": "jTYF644Y44_n", "metadata": { "id": "jTYF644Y44_n" }, "source": [ "###📌 Visualize Results" ] }, { "cell_type": "code", "execution_count": null, "id": "DmvmsmvY28lC", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "DmvmsmvY28lC", "outputId": "722f8773-ac1c-4994-cfb0-14e0e362a813" }, "outputs": [ { "data": { "text/html": [ "[Relation extraction visualization of the sample text: the DATE and ORG entities (fiscal 2020, Cadence, AWR Corporation, Integrand Software, Inc) are highlighted and linked by has_acquisition_date and was_acquired_by arcs.]" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sparknlp_display import RelationExtractionVisualizer\n", "\n", "re_vis = viz.RelationExtractionVisualizer()\n", "\n", "re_vis.display(result = result[0], relation_col = \"relations\", document_col = \"document\", exclude_relations = [\"other\", \"no_rel\"], show_relations=True)" ] } ], "metadata": { "colab": { "provenance": [], "toc_visible": true }, "gpuClass": "standard", "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" }, "toc-showtags": false }, "nbformat": 4, "nbformat_minor": 5 }