{ "cells": [ { "cell_type": "markdown", "id": "VNRiKOH_u_jP", "metadata": { "id": "VNRiKOH_u_jP" }, "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ] }, { "cell_type": "markdown", "id": "LF8cIN3OvA2Q", "metadata": { "id": "LF8cIN3OvA2Q" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/07.0.Understand_Entities_in_Context.ipynb)" ] }, { "cell_type": "markdown", "id": "fa8fbe50-f928-45ff-b539-0ab62a7c40b2", "metadata": { "id": "fa8fbe50-f928-45ff-b539-0ab62a7c40b2" }, "source": [ "#🔎 Understanding Financial Entities In Context\n", "📜 Assertion Status, or *Understanding Financial Entities in Context*, is an NLP task in charge of analyzing entities, extracted with:\n", "- NER models;\n", "- ContextualParser;\n", "\n", "📜 and their surroundings (usually a sentence, but it could take bigger spans too) to *assert* different conditions / statuses on the entities, such as:\n", "- If an entity is negated in the context;\n", "- If the context talks about past, present, or future;\n", "- If the entity is said to be hypothetical / possible, or certain;\n", "- etc.\n", "\n", "📜 These are just some examples: Assertion DL models can be applied to many other scenarios where you need to:\n", "- Disambiguate entities from the context.\n", "- Subclassify or specify an entity depending on context.\n", "\n", "🚀Examples:\n", "- Is an ORG mentioned to be a COMPETITOR or part of the SUPPLY_CHAIN (or neither)?\n", "- Is an ACQUIRED_COMPANY mentioned to be acquired TOTALLY or PARTIALLY in the context?\n", "- etc.\n", "\n" ] }, { "cell_type": "markdown", "id": "PjJlXILCYefM", "metadata": { "id": "PjJlXILCYefM" }, "source": [ "![image.png]()" ] }, { "cell_type": "markdown", "id": "-lTBFRM_cRzI", "metadata": { "id": "-lTBFRM_cRzI" }, "source": [ 
"Let's see which pretrained models we have and how to train custom ones!" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false, "id": "BknLo-nHX9M6" }, "source": [ "##🎬 Installation" ], "id": "BknLo-nHX9M6" }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": true }, "id": "_914itZsj51v" }, "outputs": [], "source": [ "! pip install -q johnsnowlabs" ], "id": "_914itZsj51v" }, { "cell_type": "markdown", "source": [ "##πŸ”— Automatic Installation\n", "Using my.johnsnowlabs.com SSO" ], "metadata": { "id": "YPsbAnNoPt0Z" }, "id": "YPsbAnNoPt0Z" }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": true }, "id": "fY0lcShkj51w" }, "outputs": [], "source": [ "from johnsnowlabs import *\n", "\n", "# nlp.install(force_browser=True)" ], "id": "fY0lcShkj51w" }, { "cell_type": "markdown", "source": [ "##πŸ”— Manual downloading\n", "If you are not registered in my.johnsnowlabs.com, you received a license via e-email or you are using Safari, you may need to do a manual update of the license.\n", "\n", "- Go to my.johnsnowlabs.com\n", "- Download your license\n", "- Upload it using the following command" ], "metadata": { "id": "hsJvn_WWM2GL" }, "id": "hsJvn_WWM2GL" }, { "cell_type": "code", "source": [ "from google.colab import files\n", "print('Please Upload your John Snow Labs License using the button below')\n", "license_keys = files.upload()" ], "metadata": { "id": "i57QV3-_P2sQ" }, "execution_count": null, "outputs": [], "id": "i57QV3-_P2sQ" }, { "cell_type": "markdown", "source": [ "- Install it" ], "metadata": { "id": "xGgNdFzZP_hQ" }, "id": "xGgNdFzZP_hQ" }, { "cell_type": "code", "source": [ "nlp.install()" ], "metadata": { "id": "OfmmPqknP4rR" }, "execution_count": null, "outputs": [], "id": "OfmmPqknP4rR" }, { "cell_type": "markdown", "source": [ "##πŸ“Œ Starting" ], "metadata": { "id": "DCl5ErZkNNLk" }, "id": "DCl5ErZkNNLk" }, { "cell_type": "code", "execution_count": null, 
"id": "C2eqwqTxVbdR", "metadata": { "id": "C2eqwqTxVbdR" }, "outputs": [], "source": [ "from johnsnowlabs import nlp, finance\n", "# Automatically load license data and start a session with all jars user has access to\n", "spark = nlp.start()" ] }, { "cell_type": "code", "execution_count": null, "id": "YeIQqpP6KkW9", "metadata": { "id": "YeIQqpP6KkW9" }, "outputs": [], "source": [ "from pyspark.sql import DataFrame\n", "import pyspark.sql.functions as F\n", "import pyspark.sql.types as T\n", "import pyspark.sql as SQL\n", "from pyspark import keyword_only" ] }, { "cell_type": "markdown", "id": "90643920-3813-4cfc-8e29-81a5aa972725", "metadata": { "id": "90643920-3813-4cfc-8e29-81a5aa972725" }, "source": [ "##πŸ“š Understanding time from context" ] }, { "cell_type": "markdown", "id": "l7urMeyqwXDu", "metadata": { "id": "l7urMeyqwXDu" }, "source": [ "###βœ”οΈ Past Experiences of a Role (C-level management)\n", "Let's start with a small example: analyzing whether a ROLE of a person in a company is mentioned to be a `past` or `present`.\n", "\n", "πŸ“œFor that, we need:\n", "- An NER model. We will use `finner_bert_roles` which uses `bert_embeddings_sec_bert_base` embeddings to extract `ROLE` entities;\n", "- An Assertion Model which detects time. We will use `finassertiondl_past_roles` which is a very specific one, to detect time in `ROLE` entities." 
] }, { "cell_type": "code", "execution_count": null, "id": "abf82e78-a3fd-412c-8025-64363cf5a3ef", "metadata": { "id": "abf82e78-a3fd-412c-8025-64363cf5a3ef" }, "outputs": [], "source": [ "document_assembler = nlp.DocumentAssembler()\\\n", "    .setInputCol(\"text\")\\\n", "    .setOutputCol(\"document\")\n", "\n", "tokenizer = nlp.Tokenizer()\\\n", "    .setInputCols([\"document\"])\\\n", "    .setOutputCol(\"token\")\n", "\n", "embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\") \\\n", "    .setInputCols([\"document\", \"token\"]) \\\n", "    .setOutputCol(\"embeddings\")\n", "\n", "tokenClassifier = finance.BertForTokenClassification.pretrained(\"finner_bert_roles\",\"en\",\"finance/models\")\\\n", "    .setInputCols(\"token\", \"document\")\\\n", "    .setOutputCol(\"ner\")\\\n", "    .setCaseSensitive(True)\n", "\n", "ner_converter = finance.NerConverterInternal() \\\n", "    .setInputCols([\"document\", \"token\", \"ner\"]) \\\n", "    .setOutputCol(\"ner_chunk\")\\\n", "    .setWhiteList([\"ROLE\"])\n", "\n", "assertion = finance.AssertionDLModel.pretrained(\"finassertiondl_past_roles\", \"en\", \"finance/models\")\\\n", "    .setInputCols([\"document\", \"ner_chunk\", \"embeddings\"]) \\\n", "    .setOutputCol(\"assertion\")\n", "\n", "nlpPipeline = nlp.Pipeline(stages=[\n", "    document_assembler,\n", "    tokenizer,\n", "    embeddings,\n", "    tokenClassifier,\n", "    ner_converter,\n", "    assertion\n", "    ])\n", "\n", "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(empty_data)\n", "\n", "light_model = nlp.LightPipeline(model)" ] }, { "cell_type": "markdown", "id": "1468d272-3f3f-452d-ad4a-376d06879d4e", "metadata": { "id": "1468d272-3f3f-452d-ad4a-376d06879d4e" }, "source": [ "###✔️ Example sentences extracted from 10K filings" ] }, { "cell_type": "code", "execution_count": null, "id": "80e7d275-93ab-479d-a355-20de8613c8da", "metadata": { "id": "80e7d275-93ab-479d-a355-20de8613c8da", "tags": [] }, "outputs": [], "source": 
[ "sample_texts = [\"\"\"From January 2009 to November 2017, Mr. Tan worked as the Managing Director of Cadence\"\"\",\n", "    \"\"\"Jane S. Smith works as a Computer Engineer and Product Lead at Globalize Cloud Services\"\"\",\n", "    \"\"\"Mrs. Johansson has been apointed CEO and President of Mileways\"\"\",\n", "    \"\"\"Tom Martin worked as Cadence's CTO until 2010\"\"\",\n", "    \"\"\"Mrs. Charles was before Managing Director at a big consultancy company\"\"\",\n", "    \"\"\"We are happy to announce that Mary Leigh joins Elephant as Web Designer and UX/UI Developer\"\"\"]" ] }, { "cell_type": "markdown", "id": "kuZWODMewxtv", "metadata": { "id": "kuZWODMewxtv" }, "source": [ "###✔️ Extracting with LightPipeline" ] }, { "cell_type": "code", "execution_count": null, "id": "wD-ep_BIw16_", "metadata": { "id": "wD-ep_BIw16_" }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "chunks=[]\n", "entities=[]\n", "status=[]\n", "\n", "for text in sample_texts:\n", "    light_result = light_model.fullAnnotate(text)[0]\n", "\n", "    # Each NER chunk is paired with its assertion annotation\n", "    for n,m in zip(light_result['ner_chunk'],light_result['assertion']):\n", "        chunks.append(n.result)\n", "        entities.append(n.metadata['entity'])\n", "        status.append(m.result)\n", "\n", "df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})" ] }, { "cell_type": "code", "execution_count": null, "id": "e4c81d83-a2c4-4303-b257-9615a8986361", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 332 }, "id": "e4c81d83-a2c4-4303-b257-9615a8986361", "outputId": "5c3534ef-f2fb-49b6-8c17-a424733bf389" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
# | chunks | entities | assertion
0 | Director | ROLE | PAST
1 | Computer Engineer | ROLE | NO_PAST
2 | Product Lead | ROLE | NO_PAST
3 | CEO | ROLE | NO_PAST
4 | President | ROLE | NO_PAST
5 | Cadence's CTO | ROLE | PAST
6 | Managing Director | ROLE | PAST
7 | Web Designer | ROLE | NO_PAST
8 | UX/UI Developer | ROLE | NO_PAST
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " chunks entities assertion\n", "0 Director ROLE PAST\n", "1 Computer Engineer ROLE NO_PAST\n", "2 Product Lead ROLE NO_PAST\n", "3 CEO ROLE NO_PAST\n", "4 President ROLE NO_PAST\n", "5 Cadence's CTO ROLE PAST\n", "6 Managing Director ROLE PAST\n", "7 Web Designer ROLE NO_PAST\n", "8 UX/UI Developer ROLE NO_PAST" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "c76bba1c-850e-4c15-908c-191a6c3fa3f0", "metadata": { "id": "c76bba1c-850e-4c15-908c-191a6c3fa3f0" }, "source": [ "###✔️ Visualization of Assertion Status" ] }, { "cell_type": "code", "execution_count": null, "id": "9a783e5e-0b93-4302-b776-530b7d790e63", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 623 }, "id": "9a783e5e-0b93-4302-b776-530b7d790e63", "outputId": "a51dcee8-06e4-47b2-897b-766856a47e81" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " From January 2009 to November 2017, Mr. Tan worked as the Managing Director ROLEPAST of Cadence" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " Jane S. Smith works as a Computer Engineer ROLENO_PAST and Product Lead ROLENO_PAST at Globalize Cloud Services" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " Mrs. Johansson has been apointed CEO ROLENO_PAST and President ROLENO_PAST of Mileways" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " Tom Martin worked as Cadence's CTO ROLEPAST until 2010" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " Mrs. 
Charles was before Managing Director ROLEPAST at a big consultancy company" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", " We are happy to announce that Mary Leigh joins Elephant as Web Designer ROLENO_PAST and UX/UI Developer ROLENO_PAST " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# One visualizer instance is enough for all sentences\n", "vis = nlp.viz.AssertionVisualizer()\n", "\n", "for text in sample_texts:\n", "    light_result = light_model.fullAnnotate(text)[0]\n", "\n", "    vis.display(light_result, 'ner_chunk', 'assertion')" ] }, { "cell_type": "markdown", "id": "4a3QK9yVwKmN", "metadata": { "id": "4a3QK9yVwKmN" }, "source": [ "##📚 Bigger example: Asserting time in a 10-K filing\n", "📜Now let's go bigger. We will take one 10-K filing, extract several pages and apply assertion status to detect time for:\n", "- `PERSON` (people)\n", "- `ORG` (organizations)\n", "- `ROLE` (roles of those people in their current or past organizations)\n", "\n", "📜For that, we need:\n", "- An NER model. We will use `finner_org_per_role_date` which uses `bert_embeddings_sec_bert_base` embeddings to extract `PERSON`, `ORG`, `ROLE` and `DATE` entities;\n", "- An Assertion model which detects time. We will use `finassertion_time`, a generic time assertion model, to detect time on the previously mentioned entities. 
\n", "\n", "🚀**Keep in mind that you can also apply this model to other entity types, but performance may degrade because it was not trained on them.**" ] }, { "cell_type": "code", "execution_count": null, "id": "j7iWgb2Fghle", "metadata": { "id": "j7iWgb2Fghle" }, "outputs": [], "source": [ "!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt" ] }, { "cell_type": "code", "execution_count": null, "id": "wSbj1E7fpTrK", "metadata": { "id": "wSbj1E7fpTrK" }, "outputs": [], "source": [ "import requests\n", "URL = \"https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt\"\n", "response = requests.get(URL)\n", "\n", "cadence_sec10k = response.content.decode('utf-8')" ] }, { "cell_type": "code", "execution_count": null, "id": "4PD0eMHNpe5A", "metadata": { "id": "4PD0eMHNpe5A" }, "outputs": [], "source": [ "document_assembler = nlp.DocumentAssembler() \\\n", "    .setInputCol(\"text\") \\\n", "    .setOutputCol(\"document\")\n", "\n", "text_splitter = finance.TextSplitter() \\\n", "    .setInputCols([\"document\"]) \\\n", "    .setOutputCol(\"pages\")\\\n", "    .setCustomBounds([\"Table of Contents\"])\\\n", "    .setUseCustomBoundsOnly(True)\\\n", "    .setExplodeSentences(True)\n", "\n", "nlp_pipeline = nlp.Pipeline(stages=[\n", "    document_assembler,\n", "    text_splitter])" ] }, { "cell_type": "code", "execution_count": null, "id": "VO_J6PLtp3AH", "metadata": { "id": "VO_J6PLtp3AH" }, "outputs": [], "source": [ "#fit: trains, configures and prepares the pipeline for inference. 
\n", "\n", "sdf = spark.createDataFrame([[ cadence_sec10k ]]).toDF(\"text\")\n", "\n", "fit = nlp_pipeline.fit(sdf)" ] }, { "cell_type": "code", "execution_count": null, "id": "IjZjKSTAp47P", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "IjZjKSTAp47P", "outputId": "3564fbfa-76a8-44e5-a39c-7d8351275134" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+--------------------+--------------------+\n", "| text| document| pages|\n", "+--------------------+--------------------+--------------------+\n", "|Table of Contents...|[{document, 0, 34...|[{document, 18, 4...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 4087,...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 4215,...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 5504,...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 11617...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 13985...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 20001...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 26059...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 31638...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 36733...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 42440...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 47053...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 48328...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 53745...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 59341...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 65403...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 72330...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 77951...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 84131...|\n", "|Table of Contents...|[{document, 0, 34...|[{document, 89718...|\n", 
"+--------------------+--------------------+--------------------+\n", "only showing top 20 rows\n", "\n", "CPU times: user 48 ms, sys: 13.8 ms, total: 61.7 ms\n", "Wall time: 6.41 s\n" ] } ], "source": [ "%%time\n", "\n", "#transforms: executes inference on a fit pipeline\n", "res = fit.transform(sdf)\n", "\n", "res.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "_C_GBv2Dp9D2", "metadata": { "id": "_C_GBv2Dp9D2" }, "outputs": [], "source": [ "%%time\n", "\n", "import json\n", "\n", "lp = nlp.LightPipeline(fit)\n", "\n", "json_res = lp.annotate(cadence_sec10k)\n", "\n", "print(json.dumps(json_res, indent=4))" ] }, { "cell_type": "code", "execution_count": null, "id": "ex8hiEW_qCnA", "metadata": { "id": "ex8hiEW_qCnA" }, "outputs": [], "source": [ "pages = [json_res['pages'][i] for i in range(13)]" ] }, { "cell_type": "code", "execution_count": null, "id": "9TPTrACWgaiL", "metadata": { "id": "9TPTrACWgaiL" }, "outputs": [], "source": [ "document_assembler = nlp.DocumentAssembler()\\\n", " .setInputCol(\"text\")\\\n", " .setOutputCol(\"document\")\n", "\n", "text_splitter = finance.TextSplitter()\\\n", " .setInputCols([\"document\"])\\\n", " .setOutputCol(\"sentence\")\n", "\n", "tokenizer = nlp.Tokenizer()\\\n", " .setInputCols([\"sentence\"])\\\n", " .setOutputCol(\"token\")\n", "\n", "embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\") \\\n", " .setInputCols([\"sentence\", \"token\"]) \\\n", " .setOutputCol(\"embeddings\")\n", "\n", "ner = finance.NerModel.pretrained(\"finner_org_per_role_date\", \"en\", \"finance/models\")\\\n", " .setInputCols(\"sentence\", \"token\", \"embeddings\")\\\n", " .setOutputCol(\"ner\")\n", "\n", "chunk_converter = nlp.NerConverter() \\\n", " .setInputCols([\"sentence\", \"token\", \"ner\"]) \\\n", " .setOutputCol(\"ner_chunk\")\n", "\n", "assertion = finance.AssertionDLModel.pretrained(\"finassertion_time\", \"en\", \"finance/models\")\\\n", " .setInputCols([\"sentence\", 
\"ner_chunk\", \"embeddings\"]) \\\n", " .setOutputCol(\"assertion\")\n", " \n", "nlpPipeline = nlp.Pipeline(stages=[\n", " document_assembler,\n", " text_splitter,\n", " tokenizer,\n", " embeddings,\n", " ner,\n", " chunk_converter,\n", " assertion\n", " ])\n", "\n", "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "\n", "model = nlpPipeline.fit(empty_data)\n", "\n", "lp = nlp.LightPipeline(model)" ] }, { "cell_type": "markdown", "id": "xiFoMZ4egIdl", "metadata": { "id": "xiFoMZ4egIdl" }, "source": [ "πŸš€Let's start identifying time using `finassertion_time`. As in previous notebooks, we will be using a SEC 10K filing." ] }, { "cell_type": "code", "execution_count": null, "id": "tqmGFPKcqbv_", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "tqmGFPKcqbv_", "outputId": "a1f1fed8-5bad-43fc-c5f7-f2e41d059bf2" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " INFORMATION ABOUT OUR EXECUTIVE OFFICERS
The following table provides information regarding our executive officers as of February 22, 2022:
Name
Age
Positions and Offices
Anirudh Devgan PERSONPAST
52
President ROLEPAST and Chief Executive Officer ROLEPAST
John M. Wall PERSONPAST
51
Senior Vice President ROLEPAST and Chief Financial Officer ROLEPAST
Thomas P. Beckley PERSONPAST
64
Senior Vice President ROLEPAST and General Manager ROLEPAST of the Custom IC and PCB Group
Paul Cunningham PERSONPAST
44
Senior Vice President ROLEPAST and General Manager ROLEPAST of the System and Verification Group
Alinka Flaminia PERSONPAST
60
Senior Vice President ROLEPAST , Chief Legal Officer ROLEPAST and Corporate Secretary ROLEPAST
Chin-Chi Teng PERSONPAST
56
Senior Vice President ROLEPAST and General Manager ROLEPAST of the Digital and Signoff Group
Neil Zaman PERSONPAST
53
Senior Vice President ROLEPAST and Chief Revenue Officer ROLEPAST
Our executive officers are appointed by the Board of Directors and serve at the discretion of the Board of Directors.
ANIRUDH DEVGAN PERSONPAST has served as Chief Executive Officer ROLEPAST of Cadence ORGPAST since December 2021 DATEPAST and President ROLEPAST of Cadence ORGPAST since November 2017 DATEPAST . From May 2012 DATEPAST to November 2017 DATEPAST , Dr. Devgan PERSONPAST held several positions at Cadence ORGPAST , most recently as Executive Vice President ROLEPAST , Research and Development from March 2017 DATEPAST to November 2017 DATEPAST and Senior Vice President ROLEPAST , Research and Development from November 2013 DATEPAST to March 2017 DATEPAST . Prior to joining Cadence ORGPAST , from May 2005 DATEPAST to March 2012 DATEPAST , Dr. Devgan PERSONPAST served as Corporate Vice President ROLEPAST and General Manager ROLEPAST of the Custom Design Business Unit at Magma Design Automation, Inc., an EDA company. Dr. Devgan PERSONPRESENT has a B.Tech. in electrical engineering from the Indian Institute of Technology ORGPAST , Delhi, and an M.S. and Ph.D. in electrical and computer engineering from Carnegie Mellon University ORGPRESENT .
JOHN M. WALL PERSONPRESENT has served as Senior Vice President ROLEPAST and Chief Financial Officer ROLEPAST of Cadence ORGPAST since October 2017 DATEPAST . From October 2000 DATEPAST to September 2017 DATEPAST , Mr. Wall PERSONPAST held several positions at Cadence ORGPAST , most recently as Corporate Vice President ROLEPAST and Corporate Controller ROLEPAST from April 2016 DATEPAST to October 2017 DATEPAST , Vice President ROLEPAST , Finance and Operations, Worldwide Revenue Accounting and Sales Finance from 2015 DATEPAST to 2016 DATEPAST and Vice President ROLEPAST , Finance and Operations, EMEA and Worldwide Revenue Accounting from 2005 DATEPAST to 2015 DATEPAST . Mr. Wall PERSONPAST has an NCBS from the Institute of Technology, Tralee and is a Fellow ROLEPAST of the Association of Chartered Certified Accountants ORGPAST .
THOMAS P. BECKLEY PERSONPRESENT has served as Senior Vice President ROLEPRESENT and General Manager ROLEPRESENT of the Custom IC and PCB Group of Cadence ORGPRESENT since 2018 DATEPRESENT . From September 2012 DATEPAST to September 2018 DATEPAST , Mr. Beckley PERSONPAST served as Senior Vice President ROLEPAST , Research and Development of Cadence ORGPAST . From April 2004 DATEPAST to September 2012 DATEPAST , Mr. Beckley PERSONPAST served as Corporate Vice President ROLEPAST , Research and Development of Cadence ORGPAST . Prior to joining Cadence ORGPAST , Mr. Beckley PERSONPAST served as President ROLEPAST and Chief Executive Officer ROLEPAST of Neolinear, Inc ORGPAST ., a developer of auto-interactive and automated analog/RF tools and solutions for mixed-signal design that was acquired by Cadence ORGPAST in April 2004 DATEPAST . Mr. Beckley PERSONPRESENT has a B.S. in mathematics and physics from Kalamazoo College and an M.B.A. from Vanderbilt University.
PAUL CUNNINGHAM PERSONPAST has served as Senior Vice President ROLEPAST and General Manager ROLEPAST of the System and Verification Group since March 2021 DATEPAST . From August 2011 DATEPAST to March 2021 DATEPAST , Mr. Cunningham PERSONPAST held several positions at Cadence ORGPAST , most recently as Corporate Vice President ROLEPAST of the System Verification Group beginning January 2018 DATEPAST . Prior to joining Cadence ORGPAST , Mr. Cunningham PERSONPAST was co-founder ROLEPAST and Chief Executive Officer ROLEPAST of Azuro, Inc ORGPAST ., a clock concurrent optimization company, that Cadence ORGPAST acquired in July 2011 DATEPAST . Mr. Cunningham PERSONPRESENT has an M.A. and Ph.D. in computer science from the University of Cambridge ORGPRESENT in the United Kingdom.
ALINKA FLAMINIA PERSONPAST has served as Senior Vice President ROLEPAST , Chief Legal Officer ROLEPAST and Corporate Secretary ROLEPAST of Cadence ORGPAST since June 2020 DATEPAST . Prior to joining Cadence ORGPAST , Ms. Flaminia PERSONPAST served as Senior Vice President ROLEPAST , General Counsel and Corporate Secretary ROLEPAST of Mellanox Technologies Ltd ORGPAST ., a supplier of intelligent interconnect solutions, from September 2016 DATEPAST until its acquisition by NVIDIA Corporation ORGPAST in April 2020 DATEPAST . She also served as General Counsel ROLEPAST and Corporate Secretary ROLEPAST of PMC-Sierra, Inc ORGPAST ., a semiconductor company, from 2007 DATEPAST until its acquisition by Microsemi Corporation ORGPAST in 2016 DATEPAST . Ms. Flaminia PERSONPAST has a B.A. from Yale University, and a J.D. from Colorado University, School of Law.
CHIN-CHI TENG PERSONPAST has served as Senior Vice President ROLEPAST and General Manager ROLEPAST of the Digital and Signoff Group of Cadence ORGPAST since September 2018 DATEPAST . From January 2002 DATEPAST to September 2018 DATEPAST , Dr. Teng PERSONPAST held several positions at Cadence ORGPAST , most recently as Corporate Vice President ROLEPAST , Research and Development from June 2015 DATEPAST to September 2018 DATEPAST , and Vice President ROLEPAST , Research and Development from March 2009 DATEPAST to June 2015 DATEPAST . Dr. Teng PERSONPRESENT has a B.S. in electrical engineering from the National Taiwan University ORGPRESENT and a Ph.D. in electrical and computer engineering from the University of Illinois, Urbana-Champaign.
NEIL ZAMAN PERSONPAST has served as Chief Revenue Officer ROLEPAST since October 2020 DATEPAST and as Senior Vice President ROLEPAST , Worldwide Field Operations since September 2015 DATEPAST . From October 1999 DATEPAST to September 2015 DATEPAST , Mr. Zaman PERSONPAST held several positions at Cadence ORGPAST , most recently as Corporate Vice President ROLEPAST , North America Field Operations. Prior to joining Cadence ORGPAST , Mr. Zaman PERSONPAST held positions at Phoenix Technologies Ltd ORGPAST ., a developer of core system software, and IBM Corporation ORGPAST , a technology and consulting company. Mr. Zaman PERSONPRESENT has a B.S. in finance from California State University, Hayward.
10
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from johnsnowlabs import viz\n", "\n", "texts = [pages[12]]\n", "\n", "res = lp.fullAnnotate(texts)\n", "\n", "vis = viz.AssertionVisualizer()\n", "\n", "for r in res:\n", " vis.display(r, 'ner_chunk', 'assertion')" ] }, { "cell_type": "markdown", "id": "caeae8fe-c276-4c47-a137-69f325bb105c", "metadata": { "id": "caeae8fe-c276-4c47-a137-69f325bb105c" }, "source": [ "##πŸ“š Identify COMPETITORS in a Text with Assertion Status" ] }, { "cell_type": "markdown", "id": "94fe87f8-6282-4f41-82a3-34d2a2744d0b", "metadata": { "id": "94fe87f8-6282-4f41-82a3-34d2a2744d0b" }, "source": [ "This model uses Assertion Status to identify if a **PRODUCT** or an **ORG** is mentioned to be a `COMPETITOR`. By default, if nothing is mentioned, it returns `NO_COMPETITOR`.\n", "\n", "Again, this is a model uses the context around `PRODUCT` or `ORGANIZATION` to further subclassify them.\n", "\n", "For that, we need:\n", "- An NER model. We will use `finner_org_prod_alias` which uses `bert_embeddings_sec_bert_base` embeddings to extract `ORPG`, `PRODUCT` and `ALIAS` entities;\n", "- An Assertion Model which detects time. 
We will use `finassertion_competitors`, which determines whether a company or product is a `COMPETITOR` or `NO_COMPETITOR`.\n", "\n", "🚀**Keep in mind that you can also apply this model to other entity types, but performance may be affected.**\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9d612414-0e2c-41bd-b9a1-3158ce9400eb", "metadata": { "id": "9d612414-0e2c-41bd-b9a1-3158ce9400eb", "jupyter": { "outputs_hidden": true }, "tags": [] }, "outputs": [], "source": [ "document_assembler = nlp.DocumentAssembler()\\\n", "    .setInputCol(\"text\")\\\n", "    .setOutputCol(\"document\")\n", "\n", "# Text Splitter annotator, processes various sentences per line\n", "text_splitter = finance.TextSplitter()\\\n", "    .setInputCols([\"document\"])\\\n", "    .setOutputCol(\"sentence\")\n", "\n", "# Tokenizer splits words in a relevant format for NLP\n", "tokenizer = nlp.Tokenizer()\\\n", "    .setInputCols([\"sentence\"])\\\n", "    .setOutputCol(\"token\")\n", "\n", "embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\") \\\n", "    .setInputCols([\"sentence\", \"token\"]) \\\n", "    .setOutputCol(\"embeddings\")\n", "\n", "ner_model = finance.NerModel.pretrained(\"finner_orgs_prods_alias\",\"en\",\"finance/models\")\\\n", "    .setInputCols([\"sentence\", \"token\", \"embeddings\"]) \\\n", "    .setOutputCol(\"ner\")\n", "\n", "ner_converter = finance.NerConverterInternal() \\\n", "    .setInputCols([\"sentence\", \"token\", \"ner\"]) \\\n", "    .setOutputCol(\"ner_chunk\")\n", "\n", "assertion = finance.AssertionDLModel.pretrained(\"finassertion_competitors\", \"en\", \"finance/models\")\\\n", "    .setInputCols([\"sentence\", \"ner_chunk\", \"embeddings\"]) \\\n", "    .setOutputCol(\"assertion\")\n", "\n", "pipeline = nlp.Pipeline(stages=[\n", "    document_assembler,\n", "    text_splitter,\n", "    tokenizer,\n", "    embeddings,\n", "    ner_model,\n", "    ner_converter,\n", "    assertion\n", "    ])\n", "\n", "empty_df = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", 
"\n", "model = pipeline.fit(empty_df)\n", "\n", "light_model = nlp.LightPipeline(model)" ] }, { "cell_type": "markdown", "id": "936ed953-a926-4e76-af5d-c8290a86a556", "metadata": { "id": "936ed953-a926-4e76-af5d-c8290a86a556" }, "source": [ "###βœ”οΈ Some examples " ] }, { "cell_type": "code", "execution_count": null, "id": "5a3cb7ee-3198-42d2-b0f9-9476a5052d2f", "metadata": { "id": "5a3cb7ee-3198-42d2-b0f9-9476a5052d2f" }, "outputs": [], "source": [ "sample_text = \"\"\"Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc.\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "id": "0619023c-a6c3-4b61-837c-06af6ea02804", "metadata": { "id": "0619023c-a6c3-4b61-837c-06af6ea02804" }, "outputs": [], "source": [ "data = spark.createDataFrame([[sample_text]]).toDF(\"text\")" ] }, { "cell_type": "code", "execution_count": null, "id": "3e7299b0-24d3-4bf7-bf95-833b950132f8", "metadata": { "id": "3e7299b0-24d3-4bf7-bf95-833b950132f8" }, "outputs": [], "source": [ "result = model.transform(data)" ] }, { "cell_type": "code", "execution_count": null, "id": "76bd5600-1705-482e-b728-e0459212ea53", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "76bd5600-1705-482e-b728-e0459212ea53", "outputId": "9574d35f-68a4-4652-a411-8256552efe6c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+------------+---------+----------+\n", "|sent_id|chunk |ner_label|assertion |\n", "+-------+------------+---------+----------+\n", "|0 |McAfee LLC |ORG |COMPETITOR|\n", "|0 |Broadcom Inc|ORG |COMPETITOR|\n", "+-------+------------+---------+----------+\n", "\n" ] } ], "source": [ "result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata, result.assertion.result)).alias(\"cols\"))\\\n", " .select(F.expr(\"cols['1']['sentence']\").alias(\"sent_id\"),\n", " F.expr(\"cols['0']\").alias(\"chunk\"),\n", " 
F.expr(\"cols['1']['entity']\").alias(\"ner_label\"),\n", " F.expr(\"cols['2']\").alias(\"assertion\")).show(truncate=False)" ] }, { "cell_type": "markdown", "id": "45de1301-4cba-432e-9d62-9e3a92b484a1", "metadata": { "id": "45de1301-4cba-432e-9d62-9e3a92b484a1" }, "source": [ "###✔️ Quick inference with LightPipeline" ] }, { "cell_type": "code", "execution_count": null, "id": "cc030bfe-fc6d-4e47-8daf-a527744c9e73", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "cc030bfe-fc6d-4e47-8daf-a527744c9e73", "outputId": "9b734d1e-d4f5-46a4-c9f2-08a4197cdfb0" }, "outputs": [ { "data": { "text/html": [ "\n", "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>chunks</th><th>entities</th><th>assertion</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>McAfee LLC</td><td>ORG</td><td>COMPETITOR</td></tr>\n",
"    <tr><th>1</th><td>Broadcom Inc</td><td>ORG</td><td>COMPETITOR</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n", " " ], "text/plain": [ "         chunks  entities   assertion\n", "0    McAfee LLC       ORG  COMPETITOR\n", "1  Broadcom Inc       ORG  COMPETITOR" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "light_result = light_model.fullAnnotate(sample_text)[0]\n", "\n", "chunks = []\n", "entities = []\n", "status = []\n", "\n", "# The loop assumes chunks and assertions are aligned one-to-one.\n", "for n, m in zip(light_result['ner_chunk'], light_result['assertion']):\n", "    chunks.append(n.result)\n", "    entities.append(n.metadata['entity'])\n", "    status.append(m.result)\n", "\n", "df = pd.DataFrame({'chunks': chunks, 'entities': entities, 'assertion': status})\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "c9aaa35d-28db-46f7-b687-eb17b3352424", "metadata": { "id": "c9aaa35d-28db-46f7-b687-eb17b3352424" }, "source": [ "###✔️ Visualization of Assertion Status (`COMPETITOR` example)" ] }, { "cell_type": "code", "execution_count": null, "id": "f619b1a3-6937-4914-88d3-13b4ad086491", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 118 }, "id": "f619b1a3-6937-4914-88d3-13b4ad086491", "outputId": "8bdffb94-beef-4dae-dfc8-99328d50cf46" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC [ORG, COMPETITOR] and Broadcom Inc [ORG, COMPETITOR]." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# from sparknlp_display import AssertionVisualizer\n", "\n", "vis = nlp.viz.AssertionVisualizer()\n", "\n", "vis.display(light_result, 'ner_chunk', 'assertion')" ] }, { "cell_type": "markdown", "id": "3679b0a8-5449-4d2d-b5bd-243f4a64c0ba", "metadata": { "id": "3679b0a8-5449-4d2d-b5bd-243f4a64c0ba" }, "source": [ "##📚 Writing a Generic Assertion + NER Function\n", "You can wrap individual components, or full pipelines, in functions so they can be reused with different models.\n", "\n", "The next code cell shows one way to do that."
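, "\n", "\n", "Besides the pipeline itself, the loop that pairs each chunk with its assertion label is repeated in several cells of this notebook and can be wrapped the same way. The sketch below is a hypothetical helper (`collect_assertions` is an illustrative name, not part of the John Snow Labs API). It assumes the `ner_chunk` and `assertion` lists returned by `fullAnnotate` are aligned one-to-one, and it uses stand-in objects so that it runs without a Spark session.\n", "\n", "
```python
from types import SimpleNamespace


def collect_assertions(light_result, chunk_col='ner_chunk', assertion_col='assertion'):
    '''Pair each NER chunk with its assertion label.

    Assumes both lists are aligned one-to-one, as the notebook loops do.
    '''
    rows = []
    for chunk, assertion in zip(light_result[chunk_col], light_result[assertion_col]):
        rows.append({
            'chunks': chunk.result,
            'entities': chunk.metadata['entity'],
            'assertion': assertion.result,
        })
    return rows


# Stand-in annotations (assumption: real annotations expose .result and .metadata).
fake_result = {
    'ner_chunk': [
        SimpleNamespace(result='McAfee LLC', metadata={'entity': 'ORG'}),
        SimpleNamespace(result='Broadcom Inc', metadata={'entity': 'ORG'}),
    ],
    'assertion': [
        SimpleNamespace(result='COMPETITOR'),
        SimpleNamespace(result='COMPETITOR'),
    ],
}

rows = collect_assertions(fake_result)
for row in rows:
    print(row)
```
\n", "\n", "With a real `light_result`, `pd.DataFrame(collect_assertions(light_result))` should produce the same table as those loops. The pipeline functions themselves follow in the next code cell."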
] }, { "cell_type": "code", "execution_count": null, "id": "2fb1ea6b-2a6c-41b1-b192-769a67e37a8d", "metadata": { "id": "2fb1ea6b-2a6c-41b1-b192-769a67e37a8d", "tags": [] }, "outputs": [], "source": [
"def get_base_pipeline(embeddings):\n",
"    # Shared front end: document -> sentences -> tokens -> BERT embeddings.\n",
"    documentAssembler = nlp.DocumentAssembler()\\\n",
"        .setInputCol(\"text\")\\\n",
"        .setOutputCol(\"document\")\n",
"\n",
"    textSplitter = finance.TextSplitter()\\\n",
"        .setInputCols([\"document\"])\\\n",
"        .setOutputCol(\"sentence\")\n",
"\n",
"    tokenizer = nlp.Tokenizer()\\\n",
"        .setInputCols([\"sentence\"])\\\n",
"        .setOutputCol(\"token\")\n",
"\n",
"    # `embeddings` is the name of the pretrained BERT embeddings model.\n",
"    bert_embeddings = nlp.BertEmbeddings.pretrained(embeddings, \"en\")\\\n",
"        .setInputCols([\"sentence\", \"token\"])\\\n",
"        .setOutputCol(\"embeddings\")\n",
"\n",
"    base_pipeline = nlp.Pipeline(stages=[\n",
"        documentAssembler,\n",
"        textSplitter,\n",
"        tokenizer,\n",
"        bert_embeddings])\n",
"\n",
"    return base_pipeline\n",
"\n",
"\n",
"def get_assertion(embeddings, ner_model, assertion_model):\n",
"    # Add NER and assertion stages on top of the base pipeline.\n",
"    ner = finance.NerModel.pretrained(ner_model, \"en\", \"finance/models\")\\\n",
"        .setInputCols([\"sentence\", \"token\", \"embeddings\"])\\\n",
"        .setOutputCol(\"ner\")\n",
"\n",
"    ner_converter = nlp.NerConverter()\\\n",
"        .setInputCols([\"sentence\", \"token\", \"ner\"])\\\n",
"        .setOutputCol(\"ner_chunk\")\n",
"\n",
"    assertion = finance.AssertionDLModel.pretrained(assertion_model, \"en\", \"finance/models\")\\\n",
"        .setInputCols([\"sentence\", \"ner_chunk\", \"embeddings\"])\\\n",
"        .setOutputCol(\"assertion\")\n",
"\n",
"    base_model = get_base_pipeline(embeddings)\n",
"\n",
"    nlpPipeline = nlp.Pipeline(stages=[\n",
"        base_model,\n",
"        ner,\n",
"        ner_converter,\n",
"        assertion])\n",
"\n",
"    # Fitting on an empty DataFrame only assembles the pretrained stages.\n",
"    empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n",
"\n",
"    model = nlpPipeline.fit(empty_data)\n",
"\n",
"    light_model = nlp.LightPipeline(model)\n",
"\n",
"    return light_model" ] }, { "cell_type": "markdown", "id": "c8bf2070-6bf4-4523-8a6f-5e9d3679787f", 
"metadata": { "id": "c8bf2070-6bf4-4523-8a6f-5e9d3679787f" }, "source": [ "###βœ”οΈ Quick inference with LightPipeline" ] }, { "cell_type": "code", "execution_count": null, "id": "b42476c7-d308-47fb-9218-1f73d99d3830", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "collapsed": true, "id": "b42476c7-d308-47fb-9218-1f73d99d3830", "jupyter": { "outputs_hidden": true }, "outputId": "64f6dda8-108d-4ab5-ec30-03f727316c8c", "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "finner_orgs_prods_alias download started this may take some time.\n", "[OK!]\n", "finassertion_competitors download started this may take some time.\n", "[OK!]\n", "bert_embeddings_sec_bert_base download started this may take some time.\n", "Approximate size to download 390.4 MB\n", "[OK!]\n" ] } ], "source": [ "sample_text = \"\"\"EDH combines our Cloudera Data Warehouse, Cloudera Operational DB, and Cloudera Data Science with our SDX technology.\"\"\"\n", "\n", "embeddings = \"bert_embeddings_sec_bert_base\"\n", "\n", "ner_model = \"finner_orgs_prods_alias\"\n", "\n", "assertion_model = \"finassertion_competitors\"\n", "\n", "light_result = get_assertion(embeddings, ner_model, assertion_model).fullAnnotate(sample_text)[0]\n", "\n", "chunks=[]\n", "entities=[]\n", "status=[]\n", "\n", "for n,m in zip(light_result['ner_chunk'],light_result['assertion']):\n", "\n", " chunks.append(n.result)\n", " entities.append(n.metadata['entity']) \n", " status.append(m.result)\n", "\n", "df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})" ] }, { "cell_type": "code", "execution_count": null, "id": "ec16bf66-96da-4d05-b6dd-5f0ddb58446b", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "ec16bf66-96da-4d05-b6dd-5f0ddb58446b", "outputId": "89134e63-edd8-4d5d-b954-4d127e98b974" }, "outputs": [ { "data": { "text/html": [ "\n", "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>chunks</th><th>entities</th><th>assertion</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>EDH</td><td>ORG</td><td>NO_COMPETITOR</td></tr>\n",
"    <tr><th>1</th><td>Cloudera Data Warehouse</td><td>PRODUCT</td><td>NO_COMPETITOR</td></tr>\n",
"    <tr><th>2</th><td>Cloudera Operational DB</td><td>PRODUCT</td><td>NO_COMPETITOR</td></tr>\n",
"    <tr><th>3</th><td>Cloudera Data Science</td><td>PRODUCT</td><td>NO_COMPETITOR</td></tr>\n",
"    <tr><th>4</th><td>SDX</td><td>PRODUCT</td><td>NO_COMPETITOR</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n", " " ], "text/plain": [ "                    chunks  entities      assertion\n", "0                      EDH       ORG  NO_COMPETITOR\n", "1  Cloudera Data Warehouse   PRODUCT  NO_COMPETITOR\n", "2  Cloudera Operational DB   PRODUCT  NO_COMPETITOR\n", "3    Cloudera Data Science   PRODUCT  NO_COMPETITOR\n", "4                      SDX   PRODUCT  NO_COMPETITOR" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "ced0012d-b3e1-4e28-b9a1-64d433a84313", "metadata": { "id": "ced0012d-b3e1-4e28-b9a1-64d433a84313" }, "source": [ "###✔️ Visualization of Assertion Status (`NO_COMPETITOR` example)" ] }, { "cell_type": "code", "execution_count": null, "id": "ee86323e-e79d-453b-8b79-5e8ad235576a", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 118 }, "id": "ee86323e-e79d-453b-8b79-5e8ad235576a", "outputId": "64b07a87-fc02-449b-a654-d06759c9edda" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " EDH [ORG, NO_COMPETITOR] combines our Cloudera Data Warehouse [PRODUCT, NO_COMPETITOR], Cloudera Operational DB [PRODUCT, NO_COMPETITOR], and Cloudera Data Science [PRODUCT, NO_COMPETITOR] with our SDX [PRODUCT, NO_COMPETITOR] technology." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "vis = nlp.viz.AssertionVisualizer()\n", "\n", "vis.display(light_result, 'ner_chunk', 'assertion')" ] }, { "cell_type": "markdown", "id": "GDLWYnwF0OoE", "metadata": { "id": "GDLWYnwF0OoE" }, "source": [ "#🔎 Identify `Negation` in context\n", "This model uses Assertion Status to identify whether an **ORG** or **PRODUCT** is followed by a `negation particle` in the context.\n", "\n", "Again, this model uses the context around `PRODUCT` or `ORG` entities to further subclassify them.\n", "\n", "For that, we need:\n", "- An NER model. We will use `finner_orgs_prods_alias`, which uses `bert_embeddings_sec_bert_base` embeddings to extract `ORG`, `PRODUCT` and `ALIAS` entities;\n", "- An Assertion Model which detects negation. 
We will use `finassertion_negation`, which reports whether an entity appears in a `positive` or `negative` context.\n", "\n", "🚀**Please keep in mind that you can also apply this model to other entity types, but performance may be affected.**\n" ] }, { "cell_type": "markdown", "id": "JJFARERo1e-J", "metadata": { "id": "JJFARERo1e-J" }, "source": [ "###📌 Quick inference with LightPipeline" ] }, { "cell_type": "code", "execution_count": null, "id": "JYf3ifsK00yt", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JYf3ifsK00yt", "outputId": "0a0aa66a-41fb-4fdb-9ba2-75adfd1c0e7d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "finner_orgs_prods_alias download started this may take some time.\n", "[OK!]\n", "finassertion_negation download started this may take some time.\n", "[OK!]\n", "bert_embeddings_sec_bert_base download started this may take some time.\n", "Approximate size to download 390.4 MB\n", "[OK!]\n" ] } ], "source": [ "sample_text = \"\"\"EDH combines our Cloudera Data Warehouse, Cloudera Operational DB, and Cloudera Data Science with our SDX technology.\"\"\"\n", "\n", "embeddings = \"bert_embeddings_sec_bert_base\"\n", "\n", "ner_model = \"finner_orgs_prods_alias\"\n", "\n", "assertion_model = \"finassertion_negation\"\n", "\n", "light_result = get_assertion(embeddings, ner_model, assertion_model).fullAnnotate(sample_text)[0]" ] }, { "cell_type": "code", "execution_count": null, "id": "nc4fdqMb1Xd0", "metadata": { "id": "nc4fdqMb1Xd0" }, "outputs": [], "source": [ "chunks = []\n", "entities = []\n", "status = []\n", "\n", "for n, m in zip(light_result['ner_chunk'], light_result['assertion']):\n", "    chunks.append(n.result)\n", "    entities.append(n.metadata['entity'])\n", "    status.append(m.result)\n", "\n", "df = pd.DataFrame({'chunks': chunks, 'entities': entities, 'assertion': status})" ] }, { "cell_type": "code", "execution_count": null, "id": "9zh1Hrms1ZJN", "metadata": { "colab": { "base_uri": 
"https://localhost:8080/", "height": 206 }, "id": "9zh1Hrms1ZJN", "outputId": "29320a3f-98fa-470f-ba2d-a699beee235b" }, "outputs": [ { "data": { "text/html": [ "\n", "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>chunks</th><th>entities</th><th>assertion</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>EDH</td><td>ORG</td><td>positive</td></tr>\n",
"    <tr><th>1</th><td>Cloudera Data Warehouse</td><td>PRODUCT</td><td>positive</td></tr>\n",
"    <tr><th>2</th><td>Cloudera Operational DB</td><td>PRODUCT</td><td>positive</td></tr>\n",
"    <tr><th>3</th><td>Cloudera Data Science</td><td>PRODUCT</td><td>positive</td></tr>\n",
"    <tr><th>4</th><td>SDX</td><td>PRODUCT</td><td>positive</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n", " " ], "text/plain": [ "                    chunks  entities  assertion\n", "0                      EDH       ORG   positive\n", "1  Cloudera Data Warehouse   PRODUCT   positive\n", "2  Cloudera Operational DB   PRODUCT   positive\n", "3    Cloudera Data Science   PRODUCT   positive\n", "4                      SDX   PRODUCT   positive" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "id": "iKP7U6lA1kb9", "metadata": { "id": "iKP7U6lA1kb9" }, "outputs": [], "source": [ "sample_text = \"\"\"Whatsapp did not borrow funds from Meta for its capital needs. Synapsis INC will not be considered as eligible for X Engineering, Inc. supplier financing program.\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "id": "fY6vnWlr2Epz", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "fY6vnWlr2Epz", "outputId": "1917d378-bb9e-4b5f-b38e-ce9c72d1e511" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "finner_orgs_prods_alias download started this may take some time.\n", "[OK!]\n", "finassertion_negation download started this may take some time.\n", "[OK!]\n", "bert_embeddings_sec_bert_base download started this may take some time.\n", "Approximate size to download 390.4 MB\n", "[OK!]\n" ] } ], "source": [ "light_result = get_assertion(embeddings, ner_model, assertion_model).fullAnnotate(sample_text)[0]" ] }, { "cell_type": "code", "execution_count": null, "id": "jTXsdjeS2Isr", "metadata": { "id": "jTXsdjeS2Isr" }, "outputs": [], "source": [ "chunks = []\n", "entities = []\n", "status = []\n", "\n", "for n, m in zip(light_result['ner_chunk'], light_result['assertion']):\n", "    chunks.append(n.result)\n", "    entities.append(n.metadata['entity'])\n", "    status.append(m.result)\n", "\n", "df = pd.DataFrame({'chunks': chunks, 'entities': entities, 'assertion': status})" ] }, { "cell_type": "code", "execution_count": null, "id": "YsVKt1h42KAU", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, 
"id": "YsVKt1h42KAU", "outputId": "26677b22-84da-4283-88e0-ec6ea781b3a4" }, "outputs": [ { "data": { "text/html": [ "\n", "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>chunks</th><th>entities</th><th>assertion</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Whatsapp</td><td>ORG</td><td>negative</td></tr>\n",
"    <tr><th>1</th><td>Meta</td><td>ORG</td><td>positive</td></tr>\n",
"    <tr><th>2</th><td>Synapsis INC</td><td>ORG</td><td>negative</td></tr>\n",
"    <tr><th>3</th><td>X Engineering, Inc</td><td>ORG</td><td>positive</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n", " " ], "text/plain": [ "               chunks  entities  assertion\n", "0            Whatsapp       ORG   negative\n", "1                Meta       ORG   positive\n", "2        Synapsis INC       ORG   negative\n", "3  X Engineering, Inc       ORG   positive" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "q4I7pdb_2Pj0", "metadata": { "id": "q4I7pdb_2Pj0" }, "source": [ "###📌 Visualization of Assertion Status" ] }, { "cell_type": "code", "execution_count": null, "id": "0WxQwt-12O49", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 118 }, "id": "0WxQwt-12O49", "outputId": "164d1da5-c748-4d44-8826-8f0d3363899b" }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " Whatsapp [ORG, negative] did not borrow funds from Meta [ORG, positive] for its capital needs. Synapsis INC [ORG, negative] will not be considered as eligible for X Engineering, Inc [ORG, positive]. supplier financing program." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "vis = nlp.viz.AssertionVisualizer()\n", "\n", "vis.display(light_result, 'ner_chunk', 'assertion')" ] }, { "cell_type": "code", "execution_count": null, "id": "nSgv1nCsVylR", "metadata": { "id": "nSgv1nCsVylR" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [], "toc_visible": true }, "gpuClass": "standard", "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }