{
"cells": [
{
"cell_type": "markdown",
"id": "T7aq9n4pXuQ7",
"metadata": {
"id": "T7aq9n4pXuQ7"
},
"source": [
"![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)"
]
},
{
"cell_type": "markdown",
"id": "iKZqb8UQXvIG",
"metadata": {
"id": "iKZqb8UQXvIG"
},
"source": [
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/06.0.Relation_Extraction.ipynb)"
]
},
{
"cell_type": "markdown",
"id": "4iIO6G_B3pqq",
"metadata": {
"collapsed": false,
"id": "4iIO6G_B3pqq"
},
"source": [
"#🎬 Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "hPwo4Czy3pqq",
"metadata": {
"id": "hPwo4Czy3pqq",
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"! pip install -q johnsnowlabs"
]
},
{
"cell_type": "markdown",
"id": "YPsbAnNoPt0Z",
"metadata": {
"id": "YPsbAnNoPt0Z"
},
"source": [
"##🔗 Automatic Installation\n",
"Using my.johnsnowlabs.com SSO"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "_L-7mLYp3pqr",
"metadata": {
"id": "_L-7mLYp3pqr",
"pycharm": {
"is_executing": true
}
},
"outputs": [],
"source": [
"from johnsnowlabs import nlp, finance, viz\n",
"\n",
"# nlp.install(force_browser=True)"
]
},
{
"cell_type": "markdown",
"id": "hsJvn_WWM2GL",
"metadata": {
"id": "hsJvn_WWM2GL"
},
"source": [
"##🔗 Manual Downloading\n",
"If you are not registered in my.johnsnowlabs.com, received your license via e-mail, or are using Safari, you may need to update the license manually.\n",
"\n",
"- Go to my.johnsnowlabs.com\n",
"- Download your license\n",
"- Upload it using the following command"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "i57QV3-_P2sQ",
"metadata": {
"id": "i57QV3-_P2sQ"
},
"outputs": [],
"source": [
"from google.colab import files\n",
"print('Please Upload your John Snow Labs License using the button below')\n",
"license_keys = files.upload()"
]
},
{
"cell_type": "markdown",
"id": "xGgNdFzZP_hQ",
"metadata": {
"id": "xGgNdFzZP_hQ"
},
"source": [
"- Install it"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "OfmmPqknP4rR",
"metadata": {
"id": "OfmmPqknP4rR"
},
"outputs": [],
"source": [
"nlp.install()"
]
},
{
"cell_type": "markdown",
"id": "DCl5ErZkNNLk",
"metadata": {
"id": "DCl5ErZkNNLk"
},
"source": [
"#📌 Starting"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "x3jVICoa3pqr",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "x3jVICoa3pqr",
"outputId": "17a65b67-10bd-4097-d91a-d7de00e1f6ac"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7187 (2).json\n",
"👌 Launched \u001b[92mcpu optimized\u001b[39m session with with: 🚀Spark-NLP==4.2.8, 💊Spark-Healthcare==4.2.8, running on ⚡ PySpark==3.1.2\n"
]
}
],
"source": [
"spark = nlp.start()"
]
},
{
"cell_type": "markdown",
"id": "Zk7IQlPsX2DR",
"metadata": {
"id": "Zk7IQlPsX2DR"
},
"source": [
"#🔎 Financial Relation Extraction (RE)"
]
},
{
"cell_type": "markdown",
"id": "HTxWUmGlL648",
"metadata": {
"id": "HTxWUmGlL648"
},
"source": [
"Financial relation extraction is the process of automatically extracting structured information from unstructured text related to finance and economics. It builds on natural language processing (NLP) techniques such as named entity recognition (NER) and relation extraction (RE).\n",
"\n",
"Some examples of financial relation extraction include extracting information about companies and their financial performance, such as revenue, profits, and debt, as well as information about financial markets and economic indicators, such as stock prices and exchange rates."
]
},
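{
"cell_type": "code",
"execution_count": null,
"id": "reRecordSketch01",
"metadata": {
"id": "reRecordSketch01"
},
"outputs": [],
"source": [
"# A minimal plain-Python sketch (no Spark required) of the flattened record an\n",
"# extracted relation boils down to: a relation label, the two entity chunks it\n",
"# links, and a confidence score. The field names mirror the metadata keys used\n",
"# later in this notebook; the label and values here are made up for illustration.\n",
"example_relation = {\n",
"    'relation': 'was_acquired_by',   # hypothetical label\n",
"    'entity1': 'ORG', 'chunk1': 'Company A',\n",
"    'entity2': 'ORG', 'chunk2': 'Company B',\n",
"    'confidence': '0.95'\n",
"}\n",
"print(f\"{example_relation['chunk1']} {example_relation['relation']} {example_relation['chunk2']}\")"
]
},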
{
"cell_type": "markdown",
"id": "JSrrIXjYYSb1",
"metadata": {
"id": "JSrrIXjYYSb1"
},
"source": [
"##✔ Pretrained Relation Extraction Models for Finance\n",
"\n",
"📚Here is the list of pretrained Relation Extraction models:"
]
},
{
"cell_type": "markdown",
"id": "jo3aOCNwYSlk",
"metadata": {
"id": "jo3aOCNwYSlk"
},
"source": [
"**Relation Extraction Models**\n",
"\n",
"|index|model|\n",
"|-----:|:-----|\n",
"| 1| [Financial Relation Extraction on Earning Calls (Small)](https://nlp.johnsnowlabs.com/2022/11/28/finre_earning_calls_sm_en.html) | \n",
"| 2| [Financial Relation Extraction on 10K filings (Small)](https://nlp.johnsnowlabs.com/2022/11/07/finre_financial_small_en.html) | \n",
"| 3| [Financial Relation Extraction (Tickers)](https://nlp.johnsnowlabs.com/2022/10/15/finre_has_ticker_en.html) |\n",
"| 4| [Financial Relation Extraction (Acquisitions / Subsidiaries)](https://nlp.johnsnowlabs.com/2022/11/08/finre_acquisitions_subsidiaries_md_en.html) | \n",
"| 5| [Financial Relation Extraction (Work Experience, Medium)](https://nlp.johnsnowlabs.com/2022/11/08/finre_work_experience_md_en.html) |\n",
"| 6| [Financial Relation Extraction (Work Experience, Small)](https://nlp.johnsnowlabs.com/2022/09/28/finre_work_experience_en.html) | \n",
"| 7| [Financial Relation Extraction (Alias)](https://nlp.johnsnowlabs.com/2022/08/17/finre_org_prod_alias_en_3_2.html) |\n",
"| 8| [Financial Zero-shot Relation Extraction](https://nlp.johnsnowlabs.com/2022/08/22/finre_zero_shot_en_3_2.html) |\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "XVuxdiKgi_qd",
"metadata": {
"id": "XVuxdiKgi_qd"
},
"source": [
"##✔ Common Components\n",
"📚This pipeline will:\n",
"1. Split Text into Sentences\n",
"2. Split Sentences into Words\n",
"3. Use Financial Text Embeddings, trained on SEC documents, to obtain a numerical semantic representation of each word\n",
"\n",
"**These components are common for all the pipelines we will use.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "W4VQJrV8lywb",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "W4VQJrV8lywb",
"outputId": "167c0411-429c-43ab-fce2-d23cefc7a44e"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"bert_embeddings_sec_bert_base download started this may take some time.\n",
"Approximate size to download 390.4 MB\n",
"[OK!]\n"
]
}
],
"source": [
"def get_generic_base_pipeline():\n",
" \"\"\"Common components used in all pipelines\"\"\"\n",
" document_assembler = nlp.DocumentAssembler()\\\n",
" .setInputCol(\"text\")\\\n",
" .setOutputCol(\"document\")\n",
"\n",
" text_splitter = finance.TextSplitter()\\\n",
" .setInputCols([\"document\"])\\\n",
" .setOutputCol(\"sentence\")\n",
" \n",
" tokenizer = nlp.Tokenizer()\\\n",
" .setInputCols([\"sentence\"])\\\n",
" .setOutputCol(\"token\")\n",
"\n",
" embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\") \\\n",
" .setInputCols([\"sentence\", \"token\"])\\\n",
" .setOutputCol(\"embeddings\")\n",
"\n",
" base_pipeline = nlp.Pipeline(stages=[\n",
" document_assembler,\n",
" text_splitter,\n",
" tokenizer,\n",
" embeddings\n",
" ])\n",
"\n",
" return base_pipeline\n",
" \n",
"generic_base_pipeline = get_generic_base_pipeline()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "nfMVrwKulyzQ",
"metadata": {
"id": "nfMVrwKulyzQ"
},
"outputs": [],
"source": [
"# Text Classifier\n",
"def get_text_classification_pipeline(model):\n",
"\"\"\"This pipeline applies the given classification model to decide whether an input text belongs to a specific class or is something else.\n",
"It will be used to find where the SEC 10-K summary page is, where the Acquisitions and Subsidiaries sections are, and where in the document\n",
"the management roles and experience are mentioned.\"\"\"\n",
" document_assembler = nlp.DocumentAssembler() \\\n",
" .setInputCol(\"text\") \\\n",
" .setOutputCol(\"document\")\n",
"\n",
" embeddings = nlp.UniversalSentenceEncoder.pretrained() \\\n",
" .setInputCols(\"document\") \\\n",
" .setOutputCol(\"sentence_embeddings\")\n",
"\n",
" classifier = nlp.ClassifierDLModel.pretrained(model, \"en\", \"finance/models\")\\\n",
" .setInputCols([\"sentence_embeddings\"])\\\n",
" .setOutputCol(\"category\")\n",
"\n",
" nlpPipeline = nlp.Pipeline(stages=[\n",
" document_assembler, \n",
" embeddings,\n",
" classifier])\n",
" \n",
" return nlpPipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "EGMAqo943IVp",
"metadata": {
"id": "EGMAqo943IVp"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"def get_relations_df(results, col='relations'):\n",
"\"\"\"Return a pandas DataFrame with the relations extracted by Spark NLP\"\"\"\n",
" rel_pairs=[]\n",
" for rel in results[0][col]:\n",
" rel_pairs.append((\n",
" rel.result, \n",
" rel.metadata['entity1'], \n",
" rel.metadata['entity1_begin'],\n",
" rel.metadata['entity1_end'],\n",
" rel.metadata['chunk1'], \n",
" rel.metadata['entity2'],\n",
" rel.metadata['entity2_begin'],\n",
" rel.metadata['entity2_end'],\n",
" rel.metadata['chunk2'], \n",
" rel.metadata['confidence']\n",
" ))\n",
"\n",
" rel_df = pd.DataFrame(rel_pairs, columns=['relation','entity1','entity1_begin','entity1_end','chunk1','entity2','entity2_begin','entity2_end','chunk2', 'confidence'])\n",
"\n",
" return rel_df"
]
},
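{
"cell_type": "code",
"execution_count": null,
"id": "relDfUsageSketch01",
"metadata": {
"id": "relDfUsageSketch01"
},
"outputs": [],
"source": [
"# Usage sketch (kept commented out): `get_relations_df` expects the list\n",
"# returned by LightPipeline.fullAnnotate and reads the first annotated document.\n",
"# Run it once a `light_model` has been built further below.\n",
"# results = light_model.fullAnnotate(sample_text)\n",
"# rel_df = get_relations_df(results, col='relations')\n",
"# rel_df"
]
},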
{
"cell_type": "markdown",
"id": "Zy0v4uCxxw8N",
"metadata": {
"id": "Zy0v4uCxxw8N"
},
"source": [
"##✔ NER and Relation Extraction\n",
"NER by itself only extracts isolated entities. You can combine NER models with Relation Extraction annotators trained on their outputs to determine whether the extracted entities are related to each other.\n",
"\n",
"Let's suppose we want to extract information about **Acquisitions** and **Subsidiaries**. If we don't know where that information is in the document, we can use Text Classifiers to find it."
]
},
{
"cell_type": "markdown",
"id": "fO-eR0rqyB4g",
"metadata": {
"id": "fO-eR0rqyB4g"
},
"source": [
"##✔ Using Text Classification to find Relevant Parts of the Document: Acquisitions and Subsidiaries\n",
"To find the pages of an SEC 10-K filing that discuss acquisitions, we have a specific model called `\"finclf_acquisitions_item\"`.\n",
"\n",
"Let's send some pages and check which one(s) contain that information. In a real case you could send all the pages to the model, but to save time here we will use just a subset."
]
},
{
"cell_type": "markdown",
"id": "0nPVM9_By2Zc",
"metadata": {
"id": "0nPVM9_By2Zc"
},
"source": [
"###📌 Sample Texts from Cadence Design System\n",
"Examples are taken from publicly available information about Cadence in the SEC's EDGAR database [here](https://www.sec.gov/Archives/edgar/data/813672/000081367222000012/cdns-20220101.htm) and from [Wikipedia](https://en.wikipedia.org/wiki/Cadence_Design_Systems)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "JjPwViV0yAI6",
"metadata": {
"id": "JjPwViV0yAI6"
},
"outputs": [],
"source": [
"!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "KiX1xoXdy81t",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "KiX1xoXdy81t",
"outputId": "6f9cdb51-386c-4bf6-9707-88d0806e69d1"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Table of Contents\n",
"UNITED STATES SECURITIES AND EXCHANGE COMMISSION\n",
"Washington, D.C. 20549\n",
"__________\n"
]
}
],
"source": [
"with open('cdns-20220101.html.txt', 'r') as f:\n",
" cadence_sec10k = f.read()\n",
"print(cadence_sec10k[:100])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "n1whJMYxzBEi",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "n1whJMYxzBEi",
"outputId": "62e209f1-8844-4d95-8a06-1a08a01d2442"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"UNITED STATES SECURITIES AND EXCHANGE COMMISSION\n",
"Washington, D.C. 20549\n",
"_____________________________________ \n",
"FORM 10-K \n",
"_____________________________________ \n",
"(Mark One)\n",
"☒\n",
"ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n",
"For the fiscal year ended January 1, 2022 \n",
"OR\n",
"☐\n",
"TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n",
"For the transition period from _________ to_________.\n",
"\n",
"Commission file number 000-15867 \n",
"_____________________________________\n",
" \n",
"CADENCE DESIGN SYSTEMS, INC. \n",
"(Exact name of registrant as specified in its charter)\n",
"____________________________________ \n",
"Delaware\n",
" \n",
"00-0000000\n",
"(State or Other Jurisdiction ofIncorporation or Organization)\n",
" \n",
"(I.R.S. EmployerIdentification No.)\n",
"2655 Seely Avenue, Building 5,\n",
"San Jose,\n",
"California\n",
" \n",
"95134\n",
"(Address of Principal Executive Offices)\n",
" \n",
"(Zip Code)\n",
"(408)\n",
"-943-1234 \n",
"(Registrant’s Telephone Number, including Area Code) \n",
"Securities registered pursuant to Section 12(b) of the Act:\n",
"Title of Each Class\n",
"Trading Symbol(s)\n",
"Names of Each Exchange on which Registered\n",
"Common Stock, $0.01 par value per share\n",
"CDNS\n",
"Nasdaq Global Select Market\n",
"Securities registered pursuant to Section 12(g) of the Act:\n",
"None\n",
"Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act. \n",
" Yes \n",
"☒\n",
" No \n",
"☐\n",
"Indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act. \n",
" Yes \n",
"☐ \n",
"No \n",
"☒\n",
"Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for the past 90 days. \n",
" Yes \n",
"☒\n",
" No \n",
"☐\n",
"Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T (§ 232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files). \n",
" Yes \n",
"☒\n",
" No \n",
"☐\n",
"Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. See the definitions of “large accelerated filer,” “accelerated filer,” “smaller reporting company,” and “emerging growth company” in Rule 12b-2 of the Exchange Act.\n",
"Large Accelerated Filer\n",
"☒\n",
"Accelerated Filer\n",
"☐\n",
"Non-accelerated Filer\n",
"☐\n",
"Smaller Reporting Company\n",
"☐\n",
"Emerging Growth Company\n",
"☐\n",
"If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act. \n",
"☐\n",
"Indicate by check mark whether the registrant has filed a report on and attestation to its management’s assessment of the effectiveness of its internal control over financial reporting under Section 404(b) of the Sarbanes-Oxley Act (15 U.S.C. 7262(b)) by the registered public accounting firm that prepared or issued its audit report. \n",
"☒\n",
"Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Act). \n",
" Yes \n",
"☐ \n",
"No \n",
"☒\n",
"The aggregate market value of the voting and non-voting common equity held by non-affiliates computed by reference to the price at which the common equity was last sold as of the last business day of the registrant’s most recently completed second fiscal quarter ended July 3, 2021 was approximately $38,179,000,000.\n",
"On February 5, 2022, approximately 277,336,000 shares of the Registrant’s Common Stock, $0.01 par value, were outstanding.\n",
"DOCUMENTS INCORPORATED BY REFERENCE\n",
"Portions of the definitive proxy statement for Cadence Design Systems, Inc.’s 2022 Annual Meeting of Stockholders are incorporated by reference into Part III hereof.\n",
"\n",
"\n",
"\n"
]
}
],
"source": [
"pages = [x for x in cadence_sec10k.split(\"Table of Contents\") if x.strip() != '']\n",
"print(pages[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "jEo8dbtbzBxP",
"metadata": {
"id": "jEo8dbtbzBxP"
},
"outputs": [],
"source": [
"# Some examples\n",
"candidates = [[pages[0]], [pages[1]], [pages[35]], [pages[67]]] "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cSQ_3n5DzBz4",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "cSQ_3n5DzBz4",
"outputId": "0da38ec7-b79e-4280-86fe-f9f64edd3e88"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tfhub_use download started this may take some time.\n",
"Approximate size to download 923.7 MB\n",
"[OK!]\n",
"finclf_acquisitions_item download started this may take some time.\n",
"Approximate size to download 21.3 MB\n",
"[OK!]\n"
]
}
],
"source": [
"classification_pipeline = get_text_classification_pipeline('finclf_acquisitions_item')\n",
"\n",
"df = spark.createDataFrame(candidates).toDF(\"text\")\n",
"\n",
"model = classification_pipeline.fit(df)\n",
"\n",
"result = model.transform(df)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "-saM1i7izB2k",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "-saM1i7izB2k",
"outputId": "bbd02d06-39e1-4127-f796-626c9f7b2033"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------+\n",
"| result|\n",
"+--------------+\n",
"| [other]|\n",
"| [other]|\n",
"| [other]|\n",
"|[acquisitions]|\n",
"+--------------+\n",
"\n"
]
}
],
"source": [
"result.select('category.result').show()"
]
},
{
"cell_type": "markdown",
"id": "mP5jDp_V2Aay",
"metadata": {
"id": "mP5jDp_V2Aay"
},
"source": [
"###📌 Acquisitions, Subsidiaries and Former Names\n",
"📚Let's use some NER models to obtain information about Organizations and Dates, and understand if:\n",
"- An ORG was acquired by another ORG\n",
"- An ORG is a subsidiary of another ORG\n",
"- An ORG name is an alias / abbreviation / acronym / etc. of another ORG\n",
"\n",
"We will use the detected `pages[67]` as input."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "XecIx37zzB41",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "XecIx37zzB41",
"outputId": "f7cb81ab-58cb-4396-f1c2-5a5b267fa24c"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"finner_sec_dates download started this may take some time.\n",
"[OK!]\n",
"finner_orgs_prods_alias download started this may take some time.\n",
"[OK!]\n",
"pos_anc download started this may take some time.\n",
"Approximate size to download 3.9 MB\n",
"[OK!]\n",
"dependency_conllu download started this may take some time.\n",
"Approximate size to download 16.7 MB\n",
"[OK!]\n",
"finre_acquisitions_subsidiaries_md download started this may take some time.\n",
"[OK!]\n"
]
}
],
"source": [
"ner_model_date = finance.NerModel.pretrained(\"finner_sec_dates\", \"en\", \"finance/models\")\\\n",
" .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n",
" .setOutputCol(\"ner_dates\")\n",
"\n",
"ner_converter_date = nlp.NerConverter()\\\n",
" .setInputCols([\"sentence\",\"token\",\"ner_dates\"])\\\n",
" .setOutputCol(\"ner_chunk_date\")\n",
"\n",
"ner_model_org= finance.NerModel.pretrained(\"finner_orgs_prods_alias\", \"en\", \"finance/models\")\\\n",
" .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n",
" .setOutputCol(\"ner_orgs\")\n",
"\n",
"ner_converter_org = nlp.NerConverter()\\\n",
" .setInputCols([\"sentence\",\"token\",\"ner_orgs\"])\\\n",
".setOutputCol(\"ner_chunk_org\")\n",
"\n",
"chunk_merger = finance.ChunkMergeApproach()\\\n",
" .setInputCols('ner_chunk_org', \"ner_chunk_date\")\\\n",
" .setOutputCol('ner_chunk')\n",
"\n",
"pos = nlp.PerceptronModel.pretrained()\\\n",
" .setInputCols([\"sentence\", \"token\"])\\\n",
" .setOutputCol(\"pos\")\n",
"\n",
"dependency_parser = nlp.DependencyParserModel.pretrained(\"dependency_conllu\", \"en\")\\\n",
" .setInputCols([\"sentence\", \"pos\", \"token\"])\\\n",
" .setOutputCol(\"dependencies\")\n",
"\n",
"re_filter = finance.RENerChunksFilter()\\\n",
" .setInputCols([\"ner_chunk\", \"dependencies\"])\\\n",
" .setOutputCol(\"re_ner_chunk\")\\\n",
" .setRelationPairs([\"ORG-ORG\", \"ORG-DATE\"])\\\n",
" .setMaxSyntacticDistance(10)\n",
"\n",
"reDL = finance.RelationExtractionDLModel.pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\\\n",
" .setInputCols([\"re_ner_chunk\", \"sentence\"])\\\n",
" .setOutputCol(\"relations_acq\")\\\n",
" .setPredictionThreshold(0.1)\n",
"\n",
"# Only the acquisitions RE model runs in this pipeline, so merge that single\n",
"# relation column; no \"relations_alias\" column is produced here.\n",
"annotation_merger = finance.AnnotationMerger()\\\n",
".setInputCols(\"relations_acq\")\\\n",
".setOutputCol(\"relations\")\n",
"\n",
"nlpPipeline = nlp.Pipeline(stages=[\n",
" generic_base_pipeline,\n",
" ner_model_date,\n",
" ner_converter_date,\n",
" ner_model_org,\n",
" ner_converter_org,\n",
" chunk_merger,\n",
" pos,\n",
" dependency_parser,\n",
" re_filter,\n",
" reDL,\n",
" annotation_merger])\n",
"\n",
"empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n",
"\n",
"model = nlpPipeline.fit(empty_data)\n",
"\n",
"light_model = nlp.LightPipeline(model)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "jneVpxi026q6",
"metadata": {
"id": "jneVpxi026q6"
},
"outputs": [],
"source": [
"sample_text = pages[67].replace(\"“\", \"\\\"\").replace(\"”\", \"\\\"\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "P4Uzu9zc28ig",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 722
},
"id": "P4Uzu9zc28ig",
"outputId": "50d9c634-f0d8-41fb-eab2-c670c425d274"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"