{
"cells": [
{
"cell_type": "markdown",
"id": "VNRiKOH_u_jP",
"metadata": {
"id": "VNRiKOH_u_jP"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "LF8cIN3OvA2Q",
"metadata": {
"id": "LF8cIN3OvA2Q"
},
"source": [
"[](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/07.0.Understand_Entities_in_Context.ipynb)"
]
},
{
"cell_type": "markdown",
"id": "fa8fbe50-f928-45ff-b539-0ab62a7c40b2",
"metadata": {
"id": "fa8fbe50-f928-45ff-b539-0ab62a7c40b2"
},
"source": [
"#π Understanding Financial Entities In Context\n",
"π Assertion Status, or *Understanding Financial Entities in Context*, is an NLP atsk in carge of analyzing NER entities, extracted with:\n",
"- NER models;\n",
"- ContextualParser;\n",
"\n",
"π and their surroundings (usually a sentence, but it could take bigger spans too) to *assert* different conditions / status on the entities, as:\n",
"- If an entity is negated in the context;\n",
"- If the context talks about past, present, or future;\n",
"- If the entity is said to be hypothetical / possible, or certaing;\n",
"- etc.\n",
"\n",
"π The exposed above are just some examples, since the applications of Assertion DL models can be expanded to whatever to many other scenarios where you need to:\n",
"- Disambiguate entities from the context.\n",
"- Subclassify or specify an entity depending on context:\n",
"\n",
"πExamples:\n",
"- Is an ORG mentioned to be a COMPETITOR or part of the SUPPLY_CHAIN (or none of them)?\n",
"- Is an ACQUIRED_COMPANY mentioned to be acquired TOTALLY or PARTIALLY acquired in the context?\n",
"- etc\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "PjJlXILCYefM",
"metadata": {
"id": "PjJlXILCYefM"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "-lTBFRM_cRzI",
"metadata": {
"id": "-lTBFRM_cRzI"
},
"source": [
"Let's see which pretrained models we have and how to train custom ones!"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"id": "BknLo-nHX9M6"
},
"source": [
"##π¬ Installation"
],
"id": "BknLo-nHX9M6"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": true
},
"id": "_914itZsj51v"
},
"outputs": [],
"source": [
"! pip install -q johnsnowlabs"
],
"id": "_914itZsj51v"
},
{
"cell_type": "markdown",
"source": [
"##π Automatic Installation\n",
"Using my.johnsnowlabs.com SSO"
],
"metadata": {
"id": "YPsbAnNoPt0Z"
},
"id": "YPsbAnNoPt0Z"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": true
},
"id": "fY0lcShkj51w"
},
"outputs": [],
"source": [
"from johnsnowlabs import *\n",
"\n",
"# nlp.install(force_browser=True)"
],
"id": "fY0lcShkj51w"
},
{
"cell_type": "markdown",
"source": [
"##π Manual downloading\n",
"If you are not registered in my.johnsnowlabs.com, you received a license via e-email or you are using Safari, you may need to do a manual update of the license.\n",
"\n",
"- Go to my.johnsnowlabs.com\n",
"- Download your license\n",
"- Upload it using the following command"
],
"metadata": {
"id": "hsJvn_WWM2GL"
},
"id": "hsJvn_WWM2GL"
},
{
"cell_type": "code",
"source": [
"from google.colab import files\n",
"print('Please Upload your John Snow Labs License using the button below')\n",
"license_keys = files.upload()"
],
"metadata": {
"id": "i57QV3-_P2sQ"
},
"execution_count": null,
"outputs": [],
"id": "i57QV3-_P2sQ"
},
{
"cell_type": "markdown",
"source": [
"- Install it"
],
"metadata": {
"id": "xGgNdFzZP_hQ"
},
"id": "xGgNdFzZP_hQ"
},
{
"cell_type": "code",
"source": [
"nlp.install()"
],
"metadata": {
"id": "OfmmPqknP4rR"
},
"execution_count": null,
"outputs": [],
"id": "OfmmPqknP4rR"
},
{
"cell_type": "markdown",
"source": [
"##π Starting"
],
"metadata": {
"id": "DCl5ErZkNNLk"
},
"id": "DCl5ErZkNNLk"
},
{
"cell_type": "code",
"execution_count": null,
"id": "C2eqwqTxVbdR",
"metadata": {
"id": "C2eqwqTxVbdR"
},
"outputs": [],
"source": [
"from johnsnowlabs import nlp, finance\n",
"# Automatically load license data and start a session with all jars user has access to\n",
"spark = nlp.start()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "YeIQqpP6KkW9",
"metadata": {
"id": "YeIQqpP6KkW9"
},
"outputs": [],
"source": [
"from pyspark.sql import DataFrame\n",
"import pyspark.sql.functions as F\n",
"import pyspark.sql.types as T\n",
"import pyspark.sql as SQL\n",
"from pyspark import keyword_only"
]
},
{
"cell_type": "markdown",
"id": "90643920-3813-4cfc-8e29-81a5aa972725",
"metadata": {
"id": "90643920-3813-4cfc-8e29-81a5aa972725"
},
"source": [
"##π Understanding time from context"
]
},
{
"cell_type": "markdown",
"id": "l7urMeyqwXDu",
"metadata": {
"id": "l7urMeyqwXDu"
},
"source": [
"###βοΈ Past Experiences of a Role (C-level management)\n",
"Let's start with a small example: analyzing whether a ROLE of a person in a company is mentioned to be a `past` or `present`.\n",
"\n",
"πFor that, we need:\n",
"- An NER model. We will use `finner_bert_roles` which uses `bert_embeddings_sec_bert_base` embeddings to extract `ROLE` entities;\n",
"- An Assertion Model which detects time. We will use `finassertiondl_past_roles` which is a very specific one, to detect time in `ROLE` entities."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "abf82e78-a3fd-412c-8025-64363cf5a3ef",
"metadata": {
"id": "abf82e78-a3fd-412c-8025-64363cf5a3ef"
},
"outputs": [],
"source": [
"document_assembler = nlp.DocumentAssembler()\\\n",
" .setInputCol(\"text\")\\\n",
" .setOutputCol(\"document\")\n",
"\n",
"tokenizer = nlp.Tokenizer()\\\n",
" .setInputCols([\"document\"])\\\n",
" .setOutputCol(\"token\")\n",
"\n",
"embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\") \\\n",
" .setInputCols([\"document\", \"token\"]) \\\n",
" .setOutputCol(\"embeddings\")\n",
"\n",
"tokenClassifier = finance.BertForTokenClassification.pretrained(\"finner_bert_roles\",\"en\",\"finance/models\")\\\n",
" .setInputCols(\"token\", \"document\")\\\n",
" .setOutputCol(\"ner\")\\\n",
" .setCaseSensitive(True)\n",
"\n",
"ner_converter = finance.NerConverterInternal() \\\n",
" .setInputCols([\"document\", \"token\", \"ner\"]) \\\n",
" .setOutputCol(\"ner_chunk\")\\\n",
" .setWhiteList([\"ROLE\"])\n",
"\n",
"assertion = finance.AssertionDLModel.pretrained(\"finassertiondl_past_roles\", \"en\", \"finance/models\")\\\n",
" .setInputCols([\"document\", \"ner_chunk\", \"embeddings\"]) \\\n",
" .setOutputCol(\"assertion\")\n",
" \n",
"nlpPipeline = nlp.Pipeline(stages=[\n",
" document_assembler, \n",
" tokenizer,\n",
" embeddings,\n",
" tokenClassifier,\n",
" ner_converter,\n",
" assertion\n",
" ])\n",
"\n",
"empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n",
"\n",
"model = nlpPipeline.fit(empty_data)\n",
"\n",
"light_model = nlp.LightPipeline(model)"
]
},
{
"cell_type": "markdown",
"id": "1468d272-3f3f-452d-ad4a-376d06879d4e",
"metadata": {
"id": "1468d272-3f3f-452d-ad4a-376d06879d4e"
},
"source": [
"###βοΈ Example sentences extracted from 10K filings"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80e7d275-93ab-479d-a355-20de8613c8da",
"metadata": {
"id": "80e7d275-93ab-479d-a355-20de8613c8da",
"tags": []
},
"outputs": [],
"source": [
"sample_texts = [\"\"\"From January 2009 to November 2017, Mr. Tan worked as the Managing Director of Cadence\"\"\",\n",
" \"\"\"Jane S. Smith works as a Computer Engineer and Product Lead at Globalize Cloud Services\"\"\",\n",
" \"\"\"Mrs. Johansson has been apointed CEO and President of Mileways\"\"\",\n",
" \"\"\"Tom Martin worked as Cadence's CTO until 2010\"\"\",\n",
" \"\"\"Mrs. Charles was before Managing Director at a big consultancy company\"\"\",\n",
" \"\"\"We are happy to announce that Mary Leigh joins Elephant as Web Designer and UX/UI Developer\"\"\"]"
]
},
{
"cell_type": "markdown",
"id": "kuZWODMewxtv",
"metadata": {
"id": "kuZWODMewxtv"
},
"source": [
"###βοΈ We extract with LightPipelines"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "wD-ep_BIw16_",
"metadata": {
"id": "wD-ep_BIw16_"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"chunks=[]\n",
"entities=[]\n",
"status=[]\n",
"\n",
"for i in range(len(sample_texts)):\n",
" light_result = light_model.fullAnnotate(sample_texts[i])[0]\n",
"\n",
" for n,m in zip(light_result['ner_chunk'],light_result['assertion']):\n",
" chunks.append(n.result)\n",
" entities.append(n.metadata['entity']) \n",
" status.append(m.result)\n",
"\n",
"df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4c81d83-a2c4-4303-b257-9615a8986361",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 332
},
"id": "e4c81d83-a2c4-4303-b257-9615a8986361",
"outputId": "5c3534ef-f2fb-49b6-8c17-a424733bf389"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
chunks
\n",
"
entities
\n",
"
assertion
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Director
\n",
"
ROLE
\n",
"
PAST
\n",
"
\n",
"
\n",
"
1
\n",
"
Computer Engineer
\n",
"
ROLE
\n",
"
NO_PAST
\n",
"
\n",
"
\n",
"
2
\n",
"
Product Lead
\n",
"
ROLE
\n",
"
NO_PAST
\n",
"
\n",
"
\n",
"
3
\n",
"
CEO
\n",
"
ROLE
\n",
"
NO_PAST
\n",
"
\n",
"
\n",
"
4
\n",
"
President
\n",
"
ROLE
\n",
"
NO_PAST
\n",
"
\n",
"
\n",
"
5
\n",
"
Cadence's CTO
\n",
"
ROLE
\n",
"
PAST
\n",
"
\n",
"
\n",
"
6
\n",
"
Managing Director
\n",
"
ROLE
\n",
"
PAST
\n",
"
\n",
"
\n",
"
7
\n",
"
Web Designer
\n",
"
ROLE
\n",
"
NO_PAST
\n",
"
\n",
"
\n",
"
8
\n",
"
UX/UI Developer
\n",
"
ROLE
\n",
"
NO_PAST
\n",
"
\n",
" \n",
"
\n",
"
\n",
" \n",
" \n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
" "
],
"text/plain": [
" chunks entities assertion\n",
"0 Director ROLE PAST\n",
"1 Computer Engineer ROLE NO_PAST\n",
"2 Product Lead ROLE NO_PAST\n",
"3 CEO ROLE NO_PAST\n",
"4 President ROLE NO_PAST\n",
"5 Cadence's CTO ROLE PAST\n",
"6 Managing Director ROLE PAST\n",
"7 Web Designer ROLE NO_PAST\n",
"8 UX/UI Developer ROLE NO_PAST"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"id": "c76bba1c-850e-4c15-908c-191a6c3fa3f0",
"metadata": {
"id": "c76bba1c-850e-4c15-908c-191a6c3fa3f0"
},
"source": [
"###βοΈ Visualization of Assertion Status"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a783e5e-0b93-4302-b776-530b7d790e63",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 623
},
"id": "9a783e5e-0b93-4302-b776-530b7d790e63",
"outputId": "a51dcee8-06e4-47b2-897b-766856a47e81"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" From January 2009 to November 2017, Mr. Tan worked as the Managing Director ROLEPAST of Cadence"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
" Jane S. Smith works as a Computer Engineer ROLENO_PAST and Product Lead ROLENO_PAST at Globalize Cloud Services"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
" Mrs. Johansson has been apointed CEO ROLENO_PAST and President ROLENO_PAST of Mileways"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
" Tom Martin worked as Cadence's CTO ROLEPAST until 2010"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
" Mrs. Charles was before Managing Director ROLEPAST at a big consultancy company"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
" We are happy to announce that Mary Leigh joins Elephant as Web Designer ROLENO_PAST and UX/UI Developer ROLENO_PAST "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"for i in range(len(sample_texts)):\n",
" \n",
" light_result = light_model.fullAnnotate(sample_texts[i])[0]\n",
" \n",
" vis = nlp.viz.AssertionVisualizer()\n",
"\n",
" vis.display(light_result, 'ner_chunk', 'assertion')"
]
},
{
"cell_type": "markdown",
"id": "4a3QK9yVwKmN",
"metadata": {
"id": "4a3QK9yVwKmN"
},
"source": [
"##π Bigger example: Asserting time in a 10-K filing\n",
"πNow let's go bigger. We will use one 10K filing, extract several pages and apply assertion status to detect time for:\n",
"- `PER` (people)\n",
"- `ORG` (organizations)\n",
"- `ROLE` (roles of those people in that or past organizations)\n",
"\n",
"πFor that, we need:\n",
"- An NER model. We will use `finner_org_per_role_date` which uses `bert_embeddings_sec_bert_base` embeddings to extract `PERSON`, `ORG` and `ROLE` entities;\n",
"- An Assertion Model which detects time. We will use `finassertion_time` which is a generic time assertion model, to detect time on the previously mentioned entities. \n",
"\n",
"π**Please keep in mind that you can use this model also in other entities, but the performance may degrade since it was not trained on other kind of entities.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "j7iWgb2Fghle",
"metadata": {
"id": "j7iWgb2Fghle"
},
"outputs": [],
"source": [
"!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "wSbj1E7fpTrK",
"metadata": {
"id": "wSbj1E7fpTrK"
},
"outputs": [],
"source": [
"import requests\n",
"URL = \"https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt\"\n",
"response = requests.get(URL)\n",
"\n",
"cadence_sec10k = response.content.decode('utf-8')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4PD0eMHNpe5A",
"metadata": {
"id": "4PD0eMHNpe5A"
},
"outputs": [],
"source": [
"document_assembler = nlp.DocumentAssembler() \\\n",
" .setInputCol(\"text\") \\\n",
" .setOutputCol(\"document\")\n",
"\n",
"text_splitter = finance.TextSplitter() \\\n",
" .setInputCols([\"document\"]) \\\n",
" .setOutputCol(\"pages\")\\\n",
" .setCustomBounds([\"Table of Contents\"])\\\n",
" .setUseCustomBoundsOnly(True)\\\n",
" .setExplodeSentences(True)\n",
"\n",
"nlp_pipeline = nlp.Pipeline(stages=[\n",
" document_assembler,\n",
" text_splitter])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "VO_J6PLtp3AH",
"metadata": {
"id": "VO_J6PLtp3AH"
},
"outputs": [],
"source": [
"#fit: trains, configures and prepares the pipeline for inference. \n",
"\n",
"sdf = spark.createDataFrame([[ cadence_sec10k ]]).toDF(\"text\")\n",
"\n",
"fit = nlp_pipeline.fit(sdf)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "IjZjKSTAp47P",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "IjZjKSTAp47P",
"outputId": "3564fbfa-76a8-44e5-a39c-7d8351275134"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+--------------------+--------------------+\n",
"| text| document| pages|\n",
"+--------------------+--------------------+--------------------+\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 18, 4...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 4087,...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 4215,...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 5504,...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 11617...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 13985...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 20001...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 26059...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 31638...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 36733...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 42440...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 47053...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 48328...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 53745...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 59341...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 65403...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 72330...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 77951...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 84131...|\n",
"|Table of Contents...|[{document, 0, 34...|[{document, 89718...|\n",
"+--------------------+--------------------+--------------------+\n",
"only showing top 20 rows\n",
"\n",
"CPU times: user 48 ms, sys: 13.8 ms, total: 61.7 ms\n",
"Wall time: 6.41 s\n"
]
}
],
"source": [
"%%time\n",
"\n",
"#transforms: executes inference on a fit pipeline\n",
"res = fit.transform(sdf)\n",
"\n",
"res.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "_C_GBv2Dp9D2",
"metadata": {
"id": "_C_GBv2Dp9D2"
},
"outputs": [],
"source": [
"%%time\n",
"\n",
"import json\n",
"\n",
"lp = nlp.LightPipeline(fit)\n",
"\n",
"json_res = lp.annotate(cadence_sec10k)\n",
"\n",
"print(json.dumps(json_res, indent=4))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ex8hiEW_qCnA",
"metadata": {
"id": "ex8hiEW_qCnA"
},
"outputs": [],
"source": [
"pages = [json_res['pages'][i] for i in range(13)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9TPTrACWgaiL",
"metadata": {
"id": "9TPTrACWgaiL"
},
"outputs": [],
"source": [
"document_assembler = nlp.DocumentAssembler()\\\n",
" .setInputCol(\"text\")\\\n",
" .setOutputCol(\"document\")\n",
"\n",
"text_splitter = finance.TextSplitter()\\\n",
" .setInputCols([\"document\"])\\\n",
" .setOutputCol(\"sentence\")\n",
"\n",
"tokenizer = nlp.Tokenizer()\\\n",
" .setInputCols([\"sentence\"])\\\n",
" .setOutputCol(\"token\")\n",
"\n",
"embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\") \\\n",
" .setInputCols([\"sentence\", \"token\"]) \\\n",
" .setOutputCol(\"embeddings\")\n",
"\n",
"ner = finance.NerModel.pretrained(\"finner_org_per_role_date\", \"en\", \"finance/models\")\\\n",
" .setInputCols(\"sentence\", \"token\", \"embeddings\")\\\n",
" .setOutputCol(\"ner\")\n",
"\n",
"chunk_converter = nlp.NerConverter() \\\n",
" .setInputCols([\"sentence\", \"token\", \"ner\"]) \\\n",
" .setOutputCol(\"ner_chunk\")\n",
"\n",
"assertion = finance.AssertionDLModel.pretrained(\"finassertion_time\", \"en\", \"finance/models\")\\\n",
" .setInputCols([\"sentence\", \"ner_chunk\", \"embeddings\"]) \\\n",
" .setOutputCol(\"assertion\")\n",
" \n",
"nlpPipeline = nlp.Pipeline(stages=[\n",
" document_assembler,\n",
" text_splitter,\n",
" tokenizer,\n",
" embeddings,\n",
" ner,\n",
" chunk_converter,\n",
" assertion\n",
" ])\n",
"\n",
"empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n",
"\n",
"model = nlpPipeline.fit(empty_data)\n",
"\n",
"lp = nlp.LightPipeline(model)"
]
},
{
"cell_type": "markdown",
"id": "xiFoMZ4egIdl",
"metadata": {
"id": "xiFoMZ4egIdl"
},
"source": [
"πLet's start identifying time using `finassertion_time`. As in previous notebooks, we will be using a SEC 10K filing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "tqmGFPKcqbv_",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "tqmGFPKcqbv_",
"outputId": "a1f1fed8-5bad-43fc-c5f7-f2e41d059bf2"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
" INFORMATION ABOUT OUR EXECUTIVE OFFICERS The following table provides information regarding our executive officers as of February 22, 2022: Name Age Positions and Offices Anirudh Devgan PERSONPAST 52 President ROLEPAST and Chief Executive Officer ROLEPAST John M. Wall PERSONPAST 51 Senior Vice President ROLEPAST and Chief Financial Officer ROLEPAST Thomas P. Beckley PERSONPAST 64 Senior Vice President ROLEPAST and General Manager ROLEPAST of the Custom IC and PCB Group Paul Cunningham PERSONPAST 44 Senior Vice President ROLEPAST and General Manager ROLEPAST of the System and Verification Group Alinka Flaminia PERSONPAST 60 Senior Vice President ROLEPAST , Chief Legal Officer ROLEPAST and Corporate Secretary ROLEPAST Chin-Chi Teng PERSONPAST 56 Senior Vice President ROLEPAST and General Manager ROLEPAST of the Digital and Signoff Group Neil Zaman PERSONPAST 53 Senior Vice President ROLEPAST and Chief Revenue Officer ROLEPAST Our executive officers are appointed by the Board of Directors and serve at the discretion of the Board of Directors. ANIRUDH DEVGAN PERSONPAST has served as Chief Executive Officer ROLEPAST of Cadence ORGPAST since December 2021 DATEPAST and President ROLEPAST of Cadence ORGPAST since November 2017 DATEPAST . From May 2012 DATEPAST to November 2017 DATEPAST , Dr. Devgan PERSONPAST held several positions at Cadence ORGPAST , most recently as Executive Vice President ROLEPAST , Research and Development from March 2017 DATEPAST to November 2017 DATEPAST and Senior Vice President ROLEPAST , Research and Development from November 2013 DATEPAST to March 2017 DATEPAST . Prior to joining Cadence ORGPAST , from May 2005 DATEPAST to March 2012 DATEPAST , Dr. Devgan PERSONPAST served as Corporate Vice President ROLEPAST and General Manager ROLEPAST of the Custom Design Business Unit at Magma Design Automation, Inc., an EDA company. Dr. Devgan PERSONPRESENT has a B.Tech. in electrical engineering from the Indian Institute of Technology ORGPAST , Delhi, and an M.S. and Ph.D. in electrical and computer engineering from Carnegie Mellon University ORGPRESENT . JOHN M. WALL PERSONPRESENT has served as Senior Vice President ROLEPAST and Chief Financial Officer ROLEPAST of Cadence ORGPAST since October 2017 DATEPAST . From October 2000 DATEPAST to September 2017 DATEPAST , Mr. Wall PERSONPAST held several positions at Cadence ORGPAST , most recently as Corporate Vice President ROLEPAST and Corporate Controller ROLEPAST from April 2016 DATEPAST to October 2017 DATEPAST , Vice President ROLEPAST , Finance and Operations, Worldwide Revenue Accounting and Sales Finance from 2015 DATEPAST to 2016 DATEPAST and Vice President ROLEPAST , Finance and Operations, EMEA and Worldwide Revenue Accounting from 2005 DATEPAST to 2015 DATEPAST . Mr. Wall PERSONPAST has an NCBS from the Institute of Technology, Tralee and is a Fellow ROLEPAST of the Association of Chartered Certified Accountants ORGPAST . THOMAS P. BECKLEY PERSONPRESENT has served as Senior Vice President ROLEPRESENT and General Manager ROLEPRESENT of the Custom IC and PCB Group of Cadence ORGPRESENT since 2018 DATEPRESENT . From September 2012 DATEPAST to September 2018 DATEPAST , Mr. Beckley PERSONPAST served as Senior Vice President ROLEPAST , Research and Development of Cadence ORGPAST . From April 2004 DATEPAST to September 2012 DATEPAST , Mr. Beckley PERSONPAST served as Corporate Vice President ROLEPAST , Research and Development of Cadence ORGPAST . Prior to joining Cadence ORGPAST , Mr. Beckley PERSONPAST served as President ROLEPAST and Chief Executive Officer ROLEPAST of Neolinear, Inc ORGPAST ., a developer of auto-interactive and automated analog/RF tools and solutions for mixed-signal design that was acquired by Cadence ORGPAST in April 2004 DATEPAST . Mr. Beckley PERSONPRESENT has a B.S. in mathematics and physics from Kalamazoo College and an M.B.A. from Vanderbilt University. PAUL CUNNINGHAM PERSONPAST has served as Senior Vice President ROLEPAST and General Manager ROLEPAST of the System and Verification Group since March 2021 DATEPAST . From August 2011 DATEPAST to March 2021 DATEPAST , Mr. Cunningham PERSONPAST held several positions at Cadence ORGPAST , most recently as Corporate Vice President ROLEPAST of the System Verification Group beginning January 2018 DATEPAST . Prior to joining Cadence ORGPAST , Mr. Cunningham PERSONPAST was co-founder ROLEPAST and Chief Executive Officer ROLEPAST of Azuro, Inc ORGPAST ., a clock concurrent optimization company, that Cadence ORGPAST acquired in July 2011 DATEPAST . Mr. Cunningham PERSONPRESENT has an M.A. and Ph.D. in computer science from the University of Cambridge ORGPRESENT in the United Kingdom. ALINKA FLAMINIA PERSONPAST has served as Senior Vice President ROLEPAST , Chief Legal Officer ROLEPAST and Corporate Secretary ROLEPAST of Cadence ORGPAST since June 2020 DATEPAST . Prior to joining Cadence ORGPAST , Ms. Flaminia PERSONPAST served as Senior Vice President ROLEPAST , General Counsel and Corporate Secretary ROLEPAST of Mellanox Technologies Ltd ORGPAST ., a supplier of intelligent interconnect solutions, from September 2016 DATEPAST until its acquisition by NVIDIA Corporation ORGPAST in April 2020 DATEPAST . She also served as General Counsel ROLEPAST and Corporate Secretary ROLEPAST of PMC-Sierra, Inc ORGPAST ., a semiconductor company, from 2007 DATEPAST until its acquisition by Microsemi Corporation ORGPAST in 2016 DATEPAST . Ms. Flaminia PERSONPAST has a B.A. from Yale University, and a J.D. from Colorado University, School of Law. CHIN-CHI TENG PERSONPAST has served as Senior Vice President ROLEPAST and General Manager ROLEPAST of the Digital and Signoff Group of Cadence ORGPAST since September 2018 DATEPAST . From January 2002 DATEPAST to September 2018 DATEPAST , Dr. Teng PERSONPAST held several positions at Cadence ORGPAST , most recently as Corporate Vice President ROLEPAST , Research and Development from June 2015 DATEPAST to September 2018 DATEPAST , and Vice President ROLEPAST , Research and Development from March 2009 DATEPAST to June 2015 DATEPAST . Dr. Teng PERSONPRESENT has a B.S. in electrical engineering from the National Taiwan University ORGPRESENT and a Ph.D. in electrical and computer engineering from the University of Illinois, Urbana-Champaign. NEIL ZAMAN PERSONPAST has served as Chief Revenue Officer ROLEPAST since October 2020 DATEPAST and as Senior Vice President ROLEPAST , Worldwide Field Operations since September 2015 DATEPAST . From October 1999 DATEPAST to September 2015 DATEPAST , Mr. Zaman PERSONPAST held several positions at Cadence ORGPAST , most recently as Corporate Vice President ROLEPAST , North America Field Operations. Prior to joining Cadence ORGPAST , Mr. Zaman PERSONPAST held positions at Phoenix Technologies Ltd ORGPAST ., a developer of core system software, and IBM Corporation ORGPAST , a technology and consulting company. Mr. Zaman PERSONPRESENT has a B.S. in finance from California State University, Hayward. 10"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from johnsnowlabs import viz\n",
"\n",
"texts = [pages[12]]\n",
"\n",
"res = lp.fullAnnotate(texts)\n",
"\n",
"vis = viz.AssertionVisualizer()\n",
"\n",
"for r in res:\n",
" vis.display(r, 'ner_chunk', 'assertion')"
]
},
{
"cell_type": "markdown",
"id": "caeae8fe-c276-4c47-a137-69f325bb105c",
"metadata": {
"id": "caeae8fe-c276-4c47-a137-69f325bb105c"
},
"source": [
"##π Identify COMPETITORS in a Text with Assertion Status"
]
},
{
"cell_type": "markdown",
"id": "94fe87f8-6282-4f41-82a3-34d2a2744d0b",
"metadata": {
"id": "94fe87f8-6282-4f41-82a3-34d2a2744d0b"
},
"source": [
"This model uses Assertion Status to identify if a **PRODUCT** or an **ORG** is mentioned to be a `COMPETITOR`. By default, if nothing is mentioned, it returns `NO_COMPETITOR`.\n",
"\n",
"Again, this is a model uses the context around `PRODUCT` or `ORGANIZATION` to further subclassify them.\n",
"\n",
"For that, we need:\n",
"- An NER model. We will use `finner_org_prod_alias` which uses `bert_embeddings_sec_bert_base` embeddings to extract `ORPG`, `PRODUCT` and `ALIAS` entities;\n",
"- An Assertion Model which detects time. We will use `finassertion_competitors` which retrieves if a company or product is a `COMPETITOR` or `NO_COMPETITOR`\n",
"\n",
"π**Please keep in mind that you can use this model also in other entities, but the performance may be affected**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d612414-0e2c-41bd-b9a1-3158ce9400eb",
"metadata": {
"id": "9d612414-0e2c-41bd-b9a1-3158ce9400eb",
"jupyter": {
"outputs_hidden": true
},
"tags": []
},
"outputs": [],
"source": [
"document_assembler = nlp.DocumentAssembler()\\\n",
" .setInputCol(\"text\")\\\n",
" .setOutputCol(\"document\")\n",
"\n",
"# Text Splitter annotator, processes various sentences per line\n",
"text_splitter = finance.TextSplitter()\\\n",
" .setInputCols([\"document\"])\\\n",
" .setOutputCol(\"sentence\")\n",
"\n",
"# Tokenizer splits words in a relevant format for NLP\n",
"tokenizer = nlp.Tokenizer()\\\n",
" .setInputCols([\"sentence\"])\\\n",
" .setOutputCol(\"token\")\n",
"\n",
"embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\") \\\n",
" .setInputCols([\"sentence\", \"token\"]) \\\n",
" .setOutputCol(\"embeddings\")\n",
"\n",
"ner_model = finance.NerModel.pretrained(\"finner_orgs_prods_alias\",\"en\",\"finance/models\")\\\n",
" .setInputCols([\"sentence\", \"token\", \"embeddings\"]) \\\n",
" .setOutputCol(\"ner\")\\\n",
"\n",
"ner_converter = finance.NerConverterInternal() \\\n",
" .setInputCols([\"sentence\", \"token\", \"ner\"]) \\\n",
" .setOutputCol(\"ner_chunk\")\\\n",
"\n",
"assertion = finance.AssertionDLModel.pretrained(\"finassertion_competitors\", \"en\", \"finance/models\")\\\n",
" .setInputCols([\"sentence\", \"ner_chunk\", \"embeddings\"]) \\\n",
" .setOutputCol(\"assertion\")\n",
" \n",
"pipeline = nlp.Pipeline(stages=[\n",
" document_assembler, \n",
" text_splitter,\n",
" tokenizer,\n",
" embeddings,\n",
" ner_model,\n",
" ner_converter,\n",
" assertion\n",
" ])\n",
"\n",
"empty_df = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n",
"\n",
"model = pipeline.fit(empty_df)\n",
"\n",
"light_model = nlp.LightPipeline(model)"
]
},
{
"cell_type": "markdown",
"id": "936ed953-a926-4e76-af5d-c8290a86a556",
"metadata": {
"id": "936ed953-a926-4e76-af5d-c8290a86a556"
},
"source": [
"###βοΈ Some examples "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5a3cb7ee-3198-42d2-b0f9-9476a5052d2f",
"metadata": {
"id": "5a3cb7ee-3198-42d2-b0f9-9476a5052d2f"
},
"outputs": [],
"source": [
"sample_text = \"\"\"Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc.\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0619023c-a6c3-4b61-837c-06af6ea02804",
"metadata": {
"id": "0619023c-a6c3-4b61-837c-06af6ea02804"
},
"outputs": [],
"source": [
"data = spark.createDataFrame([[sample_text]]).toDF(\"text\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3e7299b0-24d3-4bf7-bf95-833b950132f8",
"metadata": {
"id": "3e7299b0-24d3-4bf7-bf95-833b950132f8"
},
"outputs": [],
"source": [
"result = model.transform(data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76bd5600-1705-482e-b728-e0459212ea53",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "76bd5600-1705-482e-b728-e0459212ea53",
"outputId": "9574d35f-68a4-4652-a411-8256552efe6c"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------+------------+---------+----------+\n",
"|sent_id|chunk |ner_label|assertion |\n",
"+-------+------------+---------+----------+\n",
"|0 |McAfee LLC |ORG |COMPETITOR|\n",
"|0 |Broadcom Inc|ORG |COMPETITOR|\n",
"+-------+------------+---------+----------+\n",
"\n"
]
}
],
"source": [
"result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata, result.assertion.result)).alias(\"cols\"))\\\n",
" .select(F.expr(\"cols['1']['sentence']\").alias(\"sent_id\"),\n",
" F.expr(\"cols['0']\").alias(\"chunk\"),\n",
" F.expr(\"cols['1']['entity']\").alias(\"ner_label\"),\n",
" F.expr(\"cols['2']\").alias(\"assertion\")).show(truncate=False)"
]
},
{
"cell_type": "markdown",
"id": "45de1301-4cba-432e-9d62-9e3a92b484a1",
"metadata": {
"id": "45de1301-4cba-432e-9d62-9e3a92b484a1"
},
"source": [
"###βοΈ Quick inference with LightPipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc030bfe-fc6d-4e47-8daf-a527744c9e73",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 112
},
"id": "cc030bfe-fc6d-4e47-8daf-a527744c9e73",
"outputId": "9b734d1e-d4f5-46a4-c9f2-08a4197cdfb0"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"