![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/14.0.Financial_ChunkKeyPhraseExtraction.ipynb)

🎬 Installation

In [None]:
! pip install -q johnsnowlabs

## 🔗 Automatic Installation
Using my.johnsnowlabs.com SSO

In [None]:
from johnsnowlabs import nlp, finance, legal

nlp.install(force_browser=True)

## 🔗 Manual downloading
If you are not registered at my.johnsnowlabs.com, you received your license by email, or you are using Safari, you may need to install the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

# 📌 Starting

In [None]:
spark = nlp.start()

⏳ Load sample txt file

In [4]:
text = """ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES AND EXCHANGE ACT OF 1934
For the annual period ended January 31, 2021
or
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from________to_______
Commission File Number: 001-38856
PAGERDUTY, INC.
(Exact name of registrant as specified in its charter)
Delaware
27-2793871
(State or other jurisdiction of
incorporation or organization)
(I.R.S. Employer
Identification Number)
600 Townsend St., Suite 200, San Francisco, CA 94103
(844) 800-3889
(Address, including zip code, and telephone number, including area code, of registrant’s principal executive offices)
Securities registered pursuant to Section 12(b) of the Act:
Title of each class
Trading symbol(s)
Name of each exchange on which registered
Common Stock, $0.000005 par value,
PD
New York Stock Exchange"""

In [5]:
empty_data = spark.createDataFrame([[""]]).toDF("text")
textDF = spark.createDataFrame([[text]]).toDF("text")

## 🔎 **Chunk Key Phrase Extraction**


📜Explanation:

Chunk Key Phrase Extraction is a technique used in natural language processing (NLP) to identify and extract key phrases or important chunks of text from a given document or text corpus. Key phrases are typically defined as meaningful and informative phrases that capture the essence of the content.

The process of Chunk Key Phrase Extraction involves several steps:

- **Tokenization:** The input text is divided into smaller units called tokens, which can be words, phrases, or even characters. Tokenization helps in breaking down the text into meaningful components that can be further analyzed.

- **Part-of-Speech (POS) Tagging:** Each token is assigned a part-of-speech tag, which indicates the grammatical category or role of the word in the sentence (e.g., noun, verb, adjective). POS tagging helps in understanding the syntactic structure of the text.

- **Chunking:** Tokens are grouped into candidate phrases based on specific patterns or rules. The resulting chunks are typically noun phrases or verb phrases that convey important information. In this notebook, the candidates come from n-grams and NER chunks instead.

- **Key Phrase Extraction:** From the extracted chunks, the algorithm selects and ranks key phrases by their importance or relevance to the overall content. Ranking can be frequency-based or rely on statistical models that consider the context of the phrases. The results in this notebook report each phrase's cosine similarity to the document (`DocumentSimilarity`) and a Maximal Marginal Relevance score (`MMRScore`).

Chunk Key Phrase Extraction is often used in applications such as information retrieval, document summarization, sentiment analysis, and text classification. It helps in identifying the most significant and informative phrases in a text, enabling better understanding and analysis of the content.
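
The selection step can be sketched in plain Python. The following is a minimal illustration of greedy Maximal Marginal Relevance over toy 2-d vectors, assuming the common formulation `(1 - divergence) * relevance - divergence * redundancy`. The helper names (`cosine`, `mmr_select`) and the vectors are hypothetical, and this is not the library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(doc_vec, candidates, top_n, divergence):
    """Greedily pick up to top_n phrases from a dict of phrase -> vector.

    Assumed scoring: (1 - divergence) * similarity to the document
    minus divergence * highest similarity to an already selected phrase.
    """
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < top_n:
        def score(phrase):
            relevance = cosine(remaining[phrase], doc_vec)
            redundancy = max(
                (cosine(remaining[phrase], candidates[s]) for s in selected),
                default=0.0,
            )
            return (1 - divergence) * relevance - divergence * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

# Toy vectors (made up for illustration): two near-duplicates and one distinct phrase.
doc = [1.0, 1.0]
cands = {
    "annual report": [1.0, 0.9],
    "yearly report": [1.0, 0.8],
    "stock exchange": [0.2, 1.0],
}
print(mmr_select(doc, cands, top_n=2, divergence=0.0))  # ['annual report', 'yearly report']
print(mmr_select(doc, cands, top_n=2, divergence=0.8))  # ['annual report', 'stock exchange']
```

With `divergence=0.0` the two near-duplicate phrases win on relevance alone; raising it to `0.8` swaps the second pick for the less similar phrase.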

In [None]:
documenter = nlp.DocumentAssembler() \
 .setInputCol("text") \
 .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
 .setInputCols(["document"]) \
 .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
 .setInputCols(["document"]) \
 .setOutputCol("tokens") \
 .setSplitChars([r'\[', r'\]'])

stop_words_cleaner = nlp.StopWordsCleaner.pretrained()\
 .setInputCols("tokens")\
 .setOutputCol("clean_tokens")\
 .setCaseSensitive(False)

ngram_generator = nlp.NGramGenerator()\
 .setInputCols(["clean_tokens"])\
 .setOutputCol("ngrams")\
 .setN(3)

ngram_key_phrase_extractor = finance.ChunkKeyPhraseExtraction.pretrained()\
 .setTopN(10) \
 .setDivergence(0.4)\
 .setInputCols(["sentences", "ngrams"])\
 .setOutputCol("ngram_key_phrases")

ngram_pipeline = nlp.Pipeline(stages=[
 documenter, 
 sentencer, 
 tokenizer, 
 stop_words_cleaner,
 ngram_generator,
 ngram_key_phrase_extractor
])

In [7]:
ngram_results = ngram_pipeline.fit(empty_data).transform(textDF)

**Let's show the N-gram candidates.**

In [8]:
ngram_results.selectExpr("explode(ngrams) AS key_phrase_candidate").show(30,truncate=False)

+--------------------------------------------------------------------------------------------+
|key_phrase_candidate |
+--------------------------------------------------------------------------------------------+
|{chunk, 0, 21, ANNUAL REPORT PURSUANT, {sentence -> 0, chunk -> 0}, []} |
|{chunk, 7, 32, REPORT PURSUANT SECTION, {sentence -> 0, chunk -> 1}, []} |
|{chunk, 14, 35, PURSUANT SECTION 13, {sentence -> 0, chunk -> 2}, []} |
|{chunk, 26, 43, SECTION 13 15(d, {sentence -> 0, chunk -> 3}, []} |
|{chunk, 34, 44, 13 15(d ), {sentence -> 0, chunk -> 4}, []} |
|{chunk, 40, 62, 15(d ) SECURITIES, {sentence -> 0, chunk -> 5}, []} |
|{chunk, 44, 75, ) SECURITIES EXCHANGE, {sentence -> 0, chunk -> 6}, []} |
|{chunk, 53, 79, SECURITIES EXCHANGE ACT, {sentence -> 0, chunk -> 7}, []} |
|{chunk, 68, 87, EXCHANGE ACT 1934, {sentence -> 0, chunk -> 8}, []} |
|{chunk, 77, 102, ACT 1934 annual, {sentence -> 0, chunk -> 9}, []} |
|{chunk, 84, 109, 1934 annual period, {sentence -> 0, chunk -> 10}
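
The candidates above skip words such as `TO` and `OR` because stop words are removed before the n-grams are built. A rough plain-Python sketch of that interaction follows, assuming whitespace tokenization and a tiny hand-picked stop word list (the pretrained `StopWordsCleaner` list is larger and is not shown in this notebook). The function name `trigram_candidates` is hypothetical.

```python
# Tiny hand-picked stop word list, for illustration only.
STOP_WORDS = {"to", "or", "of", "the", "and", "for"}

def trigram_candidates(text, n=3):
    """Drop stop words (case-insensitively), then slide a window of n tokens."""
    tokens = text.split()
    kept = [t for t in tokens if t.lower() not in STOP_WORDS]
    return [" ".join(kept[i:i + n]) for i in range(len(kept) - n + 1)]

candidates = trigram_candidates("ANNUAL REPORT PURSUANT TO SECTION 13")
print(candidates)
# ['ANNUAL REPORT PURSUANT', 'REPORT PURSUANT SECTION', 'PURSUANT SECTION 13']
```

In the output above, the chunk `REPORT PURSUANT SECTION` spans offsets 7 to 32, so its offsets still cover the removed `TO` in the original text.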

**Check the key phrases from N-Gram results.**

In [9]:
ngram_results.selectExpr("explode(ngram_key_phrases) AS ngram_key_phrases").show(truncate=170)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ngram_key_phrases|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 53, 79, SECURITIES EXCHANGE ACT, {sentence -> 0, chunk -> 7, DocumentSimilarity -> 0.6393701205777877, MMRScore -> 0.3836220875904442}, [-1.0755919, 0.8220767,...|
|{chunk, 688, 717, Securities registered pursuant, {sentence -> 0, chunk -> 95, DocumentSimilarity -> 0.6136637817081764, MMRScore -> 0.07235383001749868}, [-1.1415554,...|
|{chunk, 377, 397, Delaware 27-2793871, {sentence -> 0, chunk -> 44, DocumentSimilarity -> 0.5795593361842474, MMRScore -> 0.21278103633873757}, [-0.7930386, 0.9821784,...|
|{chunk, 274, 295, Commission File Number, {sentence -> 0, chunk -> 32, DocumentSimilarity -> 0.5611810097049629, 

**Show the selected key phrases, their cosine similarity to the document, their Maximal Marginal Relevance (MMR) score, and the sentence in which each key phrase was found.**

In [10]:
import pyspark.sql.functions as F

ngram_results.select(F.explode(F.arrays_zip(ngram_results.ngram_key_phrases.result,
 ngram_results.ngram_key_phrases.metadata)).alias("cols"))\
 .select(F.expr("cols['0']").alias("key_phrase"),
 F.expr("cols['1']['DocumentSimilarity']").alias("DocumentSimilarity"),
 F.expr("cols['1']['MMRScore']").alias("MMRScore"),
 F.expr("cols['1']['sentence']").alias("sentence")).show(truncate=False)

+-------------------------------------+------------------+-------------------+--------+
|key_phrase |DocumentSimilarity|MMRScore |sentence|
+-------------------------------------+------------------+-------------------+--------+
|SECURITIES EXCHANGE ACT |0.6393701205777877|0.3836220875904442 |0 |
|Securities registered pursuant |0.6136637817081764|0.07235383001749868|0 |
|Delaware 27-2793871 |0.5795593361842474|0.21278103633873757|0 |
|Commission File Number |0.5611810097049629|0.06628085753098495|0 |
|class Trading symbol(s |0.5605955919202351|0.11302878347562398|0 |
|Employer Identification Number |0.5440928090884692|0.11586834883339939|0 |
|PD York Stock |0.5371489243663168|0.08852012162737921|0 |
|from________to_______ Commission File|0.5155973270572032|0.1416217242859935 |0 |
|TRANSITION REPORT PURSUANT |0.5036100781247339|0.08331998249757963|0 |
|Exact registrant charter |0.4904558869833586|0.10867417072096042|0 |
+-------------------------------------+------------------+---------
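
One way to read the two score columns: with `setDivergence(0.4)`, the row with the highest `DocumentSimilarity` has an `MMRScore` of roughly 0.6 times that similarity. This is consistent with an MMR-style weighting of `(1 - divergence) * similarity` minus a redundancy penalty that is zero for the first pick, although the exact formula is an assumption here and is not stated in this notebook. The snippet below only checks the arithmetic on the printed values.

```python
import math

divergence = 0.4  # value passed to setDivergence in the pipeline

# (DocumentSimilarity, MMRScore) copied from the first two rows of the table above.
top = (0.6393701205777877, 0.3836220875904442)       # SECURITIES EXCHANGE ACT
second = (0.6136637817081764, 0.07235383001749868)   # Securities registered pursuant

weighted_top = (1 - divergence) * top[0]
print(math.isclose(weighted_top, top[1], rel_tol=1e-6))  # True

# For the second row the score falls well below the weighted similarity;
# under the assumed formula the gap would be the redundancy penalty.
gap_second = (1 - divergence) * second[0] - second[1]
print(gap_second > 0)  # True
```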

# with NER Model

Now we will show how to get key phrases from NER chunks by feeding `ChunkKeyPhraseExtraction` with the output of `NerConverterInternal`.

In [None]:
documenter = nlp.DocumentAssembler() \
 .setInputCol("text") \
 .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
 .setInputCols(["document"]) \
 .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
 .setInputCols(["document"]) \
 .setOutputCol("tokens") \
 .setSplitChars([r'\[', r'\]'])

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
 .setInputCols("sentences", "tokens") \
 .setOutputCol("embeddings")\
 .setMaxSentenceLength(512)\
 .setCaseSensitive(True)

ner_tagger = finance.NerModel.pretrained("finner_sec_10k_summary","en","finance/models")\
 .setInputCols(["sentences", "tokens", "embeddings"]) \
 .setOutputCol("ner_tags")

ner_converter = finance.NerConverterInternal()\
 .setInputCols("sentences", "tokens", "ner_tags")\
 .setOutputCol("ner_chunks")

ner_key_phrase_extractor = finance.ChunkKeyPhraseExtraction.pretrained()\
 .setTopN(10) \
 .setDivergence(0.4)\
 .setInputCols(["sentences", "ner_chunks"])\
 .setOutputCol("ner_key_phrases")

ner_pipeline = nlp.Pipeline(stages=[
 documenter, 
 sentencer, 
 tokenizer, 
 embeddings, 
 ner_tagger, 
 ner_converter, 
 ner_key_phrase_extractor
])

In [14]:
ner_results = ner_pipeline.fit(empty_data).transform(textDF)

In [15]:
# ner_chunk results

ner_results.select(F.explode(F.arrays_zip(ner_results.ner_chunks.result,
 ner_results.ner_chunks.metadata)).alias("cols"))\
 .select(F.expr("cols['0']").alias("ner_chunk"),
 F.expr("cols['1']['entity']").alias("label")).show(50, truncate=False)

+----------------------------------------------+-----------------+
|ner_chunk |label |
+----------------------------------------------+-----------------+
|January 31, 2021 |FISCAL_YEAR |
|001-38856 |CFN |
|PAGERDUTY, INC |ORG |
|Delaware |STATE |
|27-2793871 |IRS |
|600 Townsend St., Suite 200, San Francisco, CA|ADDRESS |
|(844) 800-3889 |PHONE |
|Common Stock |TITLE_CLASS |
|$0.000005 |TITLE_CLASS_VALUE|
|PD |TICKER |
|New York Stock Exchange |STOCK_EXCHANGE |
+----------------------------------------------+-----------------+



In [16]:
ner_results.select(F.explode(F.arrays_zip(ner_results.ner_key_phrases.result, 
 ner_results.ner_key_phrases.metadata)).alias("cols"))\
 .select(F.expr("cols['0']").alias("key_phrase"),
 F.expr("cols['1']['entity']").alias("label"),
 F.expr("cols['1']['DocumentSimilarity']").alias("DocumentSimilarity"),
 F.expr("cols['1']['MMRScore']").alias("MMRScore"),
 F.expr("cols['1']['sentence']").alias("sentence")).show(truncate=False)

+----------------------------------------------+-----------------+-------------------+---------------------+--------+
|key_phrase |label |DocumentSimilarity |MMRScore |sentence|
+----------------------------------------------+-----------------+-------------------+---------------------+--------+
|New York Stock Exchange |STOCK_EXCHANGE |0.5488381988412031 |0.3293029323900442 |6 |
|27-2793871 |IRS |0.45730858080350506|0.13971821238765356 |3 |
|600 Townsend St., Suite 200, San Francisco, CA|ADDRESS |0.43345846542169797|0.0629858198432803 |4 |
|(844) 800-3889 |PHONE |0.3828927936642871 |-0.05135087737234392 |4 |
|PAGERDUTY, INC |ORG |0.3797431768838629 |0.0665505773270374 |2 |
|Common Stock |TITLE_CLASS |0.35267543540066576|0.013411679066343135 |6 |
|$0.000005 |TITLE_CLASS_VALUE|0.3453745062411516 |0.042424153422046806 |6 |
|Delaware |STATE |0.3269358508439858 |0.020992782487245037 |3 |
|January 31, 2021 |FISCAL_YEAR |0.30783572125737296|0.031128095591583277 |1 |
|PD |TICKER |0.26565797678

# with NGramGenerator and NER Model

NGramGenerator and an NER (Named Entity Recognition) model can be used together with Chunk Key Phrase Extraction to broaden the set of key phrase candidates.

- NGramGenerator: An NGram refers to a contiguous sequence of n items from a given text, where an item can be a word, character, or any other linguistic unit. NGramGenerator is a component that generates NGrams from the input text. By considering NGrams of varying lengths (unigrams, bigrams, trigrams, etc.), the NGramGenerator captures both single words and multi-word expressions, which can be valuable key phrases.

For example, if the input text is "I love to play soccer," the NGramGenerator can produce unigrams like "I," "love," "to," "play," and "soccer," as well as bigrams like "I love," "love to," "to play," and "play soccer." These NGrams provide more context and improve the extraction of meaningful key phrases.

- NER Model (Named Entity Recognition): Named Entity Recognition is a subtask of NLP that aims to identify and classify named entities, such as person names, locations, organizations, and dates, in text. When an NER model feeds Chunk Key Phrase Extraction, the recognized entities are treated as candidate key phrases.

By incorporating an NER model, the extraction can focus on key phrases that represent named entities, which are often highly informative. For instance, in a news article, named entities like "Barack Obama," "New York City," or "Apple Inc." are important key phrases that convey crucial information.

Using NGramGenerator and an NER model in combination with Chunk Key Phrase Extraction widens the candidate pool to include both multi-word expressions and named entities. In the pipeline below, `ChunkMergeApproach` merges the two candidate sets before the key phrases are ranked.
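
The n-gram example from the text can be reproduced with a few lines of plain Python. This is a minimal sketch with hypothetical helper names (`ngrams`, `ngrams_up_to`) and whitespace tokenization, not the `NGramGenerator` implementation; the pipeline in this notebook calls `setN(3)`.

```python
def ngrams(tokens, n):
    """All contiguous runs of exactly n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngrams_up_to(tokens, max_n):
    """N-grams of every length from 1 to max_n, shortest first."""
    out = []
    for n in range(1, max_n + 1):
        out.extend(ngrams(tokens, n))
    return out

tokens = "I love to play soccer".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)  # ['I', 'love', 'to', 'play', 'soccer']
print(bigrams)   # ['I love', 'love to', 'to play', 'play soccer']
print(len(ngrams_up_to(tokens, 3)))  # 12
```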

In [None]:
documenter = nlp.DocumentAssembler() \
 .setInputCol("text") \
 .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
 .setInputCols(["document"]) \
 .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
 .setInputCols(["document"]) \
 .setOutputCol("tokens") \
 .setSplitChars([r'\[', r'\]'])

stop_words_cleaner = nlp.StopWordsCleaner.pretrained()\
 .setInputCols("tokens")\
 .setOutputCol("clean_tokens")\
 .setCaseSensitive(False)

ngram_generator = nlp.NGramGenerator()\
 .setInputCols(["clean_tokens"])\
 .setOutputCol("ngrams")\
 .setN(3)
 
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
 .setInputCols("sentences", "tokens") \
 .setOutputCol("embeddings")\
 .setMaxSentenceLength(512)\
 .setCaseSensitive(True)

ner_tagger = finance.NerModel.pretrained("finner_sec_10k_summary","en","finance/models")\
 .setInputCols(["sentences", "tokens", "embeddings"]) \
 .setOutputCol("ner_tags")

ner_converter = finance.NerConverterInternal()\
 .setInputCols("sentences", "tokens", "ner_tags")\
 .setOutputCol("ner_chunks")

chunk_merger = finance.ChunkMergeApproach()\
 .setInputCols("ngrams", "ner_chunks")\
 .setOutputCol("merged_chunks")\
 .setMergeOverlapping(False)

ngram_ner_key_phrase_extractor = finance.ChunkKeyPhraseExtraction.pretrained()\
 .setTopN(10) \
 .setDivergence(0.4)\
 .setInputCols(["sentences", "merged_chunks"])\
 .setOutputCol("key_phrases")

ngram_ner_pipeline = nlp.Pipeline(stages=[
 documenter, 
 sentencer, 
 tokenizer, 
 stop_words_cleaner,
 ngram_generator,
 embeddings, 
 ner_tagger, 
 ner_converter, 
 chunk_merger,
 ngram_ner_key_phrase_extractor
])

In [18]:
ngram_ner_results = ngram_ner_pipeline.fit(empty_data).transform(textDF)

**Show the merged key phrase candidates. Those labeled `UNK` come from NGramGenerator and the others from the `finner_sec_10k_summary` NER model.**

In [19]:
ngram_ner_results.selectExpr("explode(merged_chunks) AS key_phrase_candidate").show(30,truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------+
|key_phrase_candidate |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 0, 21, ANNUAL REPORT PURSUANT, {entity -> UNK, chunk -> 0, sentence -> 0}, []} |
|{chunk, 7, 32, REPORT PURSUANT SECTION, {entity -> UNK, chunk -> 1, sentence -> 0}, []} |
|{chunk, 14, 35, PURSUANT SECTION 13, {entity -> UNK, chunk -> 2, sentence -> 0}, []} |
|{chunk, 26, 43, SECTION 13 15(d, {entity -> UNK, chunk -> 3, sentence -> 0}, []} |
|{chunk, 34, 44, 13 15(d ), {entity -> UNK, chunk -> 4, sentence -> 0}, []} |
|{chunk, 40, 62, 15(d ) SECURITIES, {entity -> UNK, chunk -> 5, sentence -> 0}, []} |
|{chunk, 44, 75, ) SECURITIES EXCHANGE, {entity -> UNK, chunk -> 6, sentence -> 0}, []} |
|{chunk, 53, 79, SECURITIES EXCHANGE ACT, {entity -> UNK, chunk -> 7, sente

In [20]:
# NER chunk results
ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.merged_chunks.result,
 ngram_ner_results.merged_chunks.metadata)).alias("cols"))\
 .select(F.expr("cols['0']").alias("key_phrase_candidate"),
 F.expr("cols['1']['entity']").alias("label")).filter("label != 'UNK'").show(50, truncate=False)

+----------------------------------------------+-----------------+
|key_phrase_candidate |label |
+----------------------------------------------+-----------------+
|January 31, 2021 |FISCAL_YEAR |
|001-38856 |CFN |
|PAGERDUTY, INC |ORG |
|Delaware |STATE |
|27-2793871 |IRS |
|600 Townsend St., Suite 200, San Francisco, CA|ADDRESS |
|(844) 800-3889 |PHONE |
|Common Stock |TITLE_CLASS |
|$0.000005 |TITLE_CLASS_VALUE|
|PD |TICKER |
|New York Stock Exchange |STOCK_EXCHANGE |
+----------------------------------------------+-----------------+



In [21]:
# ngram results
ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.merged_chunks.result,
 ngram_ner_results.merged_chunks.metadata)).alias("cols"))\
 .select(F.expr("cols['0']").alias("key_phrase_candidate"),
 F.expr("cols['1']['entity']").alias("label")).filter("label == 'UNK'").show(50, truncate=False)

+---------------------------------------+-----+
|key_phrase_candidate |label|
+---------------------------------------+-----+
|ANNUAL REPORT PURSUANT |UNK |
|REPORT PURSUANT SECTION |UNK |
|PURSUANT SECTION 13 |UNK |
|SECTION 13 15(d |UNK |
|13 15(d ) |UNK |
|15(d ) SECURITIES |UNK |
|) SECURITIES EXCHANGE |UNK |
|SECURITIES EXCHANGE ACT |UNK |
|EXCHANGE ACT 1934 |UNK |
|ACT 1934 annual |UNK |
|1934 annual period |UNK |
|annual period ended |UNK |
|period ended January |UNK |
|ended January 31 |UNK |
|January 31 , |UNK |
|31 , 2021 |UNK |
|, 2021 TRANSITION |UNK |
|2021 TRANSITION REPORT |UNK |
|TRANSITION REPORT PURSUANT |UNK |
|REPORT PURSUANT SECTION |UNK |
|PURSUANT SECTION 13 |UNK |
|SECTION 13 15(d |UNK |
|13 15(d ) |UNK |
|15(d ) SECURITIES |UNK |
|) SECURITIES EXCHANGE |UNK |
|SECURITIES EXCHANGE ACT |UNK |
|EXCHANGE ACT 1934 |UNK |
|ACT 1934 transition |UNK |
|1934 transition period |UNK |
|transition period from________to_______|UNK |
|period from________to_______ Commission|

In [22]:
# merged (NER chunk + ngram) results
ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.merged_chunks.result,
 ngram_ner_results.merged_chunks.metadata)).alias("cols"))\
 .select(F.expr("cols['0']").alias("key_phrase_candidate"),
 F.expr("cols['1']['entity']").alias("label")).show(50, truncate=False)

+---------------------------------------+-----------+
|key_phrase_candidate |label |
+---------------------------------------+-----------+
|ANNUAL REPORT PURSUANT |UNK |
|REPORT PURSUANT SECTION |UNK |
|PURSUANT SECTION 13 |UNK |
|SECTION 13 15(d |UNK |
|13 15(d ) |UNK |
|15(d ) SECURITIES |UNK |
|) SECURITIES EXCHANGE |UNK |
|SECURITIES EXCHANGE ACT |UNK |
|EXCHANGE ACT 1934 |UNK |
|ACT 1934 annual |UNK |
|1934 annual period |UNK |
|annual period ended |UNK |
|period ended January |UNK |
|ended January 31 |UNK |
|January 31 , |UNK |
|January 31, 2021 |FISCAL_YEAR|
|31 , 2021 |UNK |
|, 2021 TRANSITION |UNK |
|2021 TRANSITION REPORT |UNK |
|TRANSITION REPORT PURSUANT |UNK |
|REPORT PURSUANT SECTION |UNK |
|PURSUANT SECTION 13 |UNK |
|SECTION 13 15(d |UNK |
|13 15(d ) |UNK |
|15(d ) SECURITIES |UNK |
|) SECURITIES EXCHANGE |UNK |
|SECURITIES EXCHANGE ACT |UNK |
|EXCHANGE ACT 1934 |UNK |
|ACT 1934 transition |UNK |
|1934 transition period |UNK |
|transition period from________to_______|UN

In [23]:
ngram_ner_results.selectExpr("explode(merged_chunks) AS key_phrase_candidate")\
 .selectExpr("key_phrase_candidate.result AS key_phrase_candidate",
 "IF(key_phrase_candidate.metadata.entity = 'UNK', 'ngram', 'NER') AS source",
 "key_phrase_candidate.metadata.sentence")\
 .show(50, truncate=False)

+---------------------------------------+------+--------+
|key_phrase_candidate |source|sentence|
+---------------------------------------+------+--------+
|ANNUAL REPORT PURSUANT |ngram |0 |
|REPORT PURSUANT SECTION |ngram |0 |
|PURSUANT SECTION 13 |ngram |0 |
|SECTION 13 15(d |ngram |0 |
|13 15(d ) |ngram |0 |
|15(d ) SECURITIES |ngram |0 |
|) SECURITIES EXCHANGE |ngram |0 |
|SECURITIES EXCHANGE ACT |ngram |0 |
|EXCHANGE ACT 1934 |ngram |0 |
|ACT 1934 annual |ngram |0 |
|1934 annual period |ngram |0 |
|annual period ended |ngram |0 |
|period ended January |ngram |0 |
|ended January 31 |ngram |0 |
|January 31 , |ngram |0 |
|January 31, 2021 |NER |1 |
|31 , 2021 |ngram |0 |
|, 2021 TRANSITION |ngram |0 |
|2021 TRANSITION REPORT |ngram |0 |
|TRANSITION REPORT PURSUANT |ngram |0 |
|REPORT PURSUANT SECTION |ngram |0 |
|PURSUANT SECTION 13 |ngram |0 |
|SECTION 13 15(d |ngram |0 |
|13 15(d ) |ngram |0 |
|15(d ) SECURITIES |ngram |0 |
|) SECURITIES EXCHANGE |ngram |0 |
|SECURITIES EXCHANGE A
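
The interleaving seen above, where the NER chunk `January 31, 2021` sits between the n-grams `January 31 ,` and `31 , 2021`, is what one would get by pooling both candidate lists and ordering them by position in the text. The sketch below illustrates that idea in plain Python. It is an assumption about the observable behavior, not the `ChunkMergeApproach` implementation; the function name `merge_candidates` and the character offsets are illustrative.

```python
def merge_candidates(ngram_chunks, ner_chunks):
    """Pool n-gram and NER chunks, order them by position, and tag their source.

    ngram_chunks: list of (begin, end, text)
    ner_chunks:   list of (begin, end, text, label)
    """
    merged = [
        {"begin": b, "end": e, "text": t, "entity": "UNK"}
        for b, e, t in ngram_chunks
    ] + [
        {"begin": b, "end": e, "text": t, "entity": label}
        for b, e, t, label in ner_chunks
    ]
    merged.sort(key=lambda c: (c["begin"], c["end"]))
    for c in merged:
        # Same rule as the IF(...) expression in the cell above.
        c["source"] = "ngram" if c["entity"] == "UNK" else "NER"
    return merged

# Illustrative offsets, not taken from the notebook output.
ngram_chunks = [(125, 132, "31 , 2021"), (117, 127, "January 31 ,")]
ner_chunks = [(117, 132, "January 31, 2021", "FISCAL_YEAR")]

merged = merge_candidates(ngram_chunks, ner_chunks)
for c in merged:
    print(c["text"], c["source"])
```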

In [24]:
ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.key_phrases.result,
 ngram_ner_results.key_phrases.metadata)).alias("cols"))\
 .select(F.expr("cols['0']").alias("key_phrase"),
 F.expr("cols['1']['entity']").alias("label"),
 F.expr("cols['1']['DocumentSimilarity']").alias("DocumentSimilarity"),
 F.expr("cols['1']['MMRScore']").alias("MMRScore"),
 F.expr("cols['1']['sentence']").alias("sentence")).show(truncate=False)

+-------------------------------------+-----+------------------+-------------------+--------+
|key_phrase |label|DocumentSimilarity|MMRScore |sentence|
+-------------------------------------+-----+------------------+-------------------+--------+
|SECURITIES EXCHANGE ACT |UNK |0.6393701205777877|0.3836220875904442 |0 |
|Securities registered pursuant |UNK |0.6136637817081764|0.07235383001749868|0 |
|Delaware 27-2793871 |UNK |0.5795593361842474|0.21278103633873757|0 |
|Commission File Number |UNK |0.5611810371021204|0.0662808662583036 |0 |
|class Trading symbol(s |UNK |0.5605955342194471|0.1130287649782131 |0 |
|Employer Identification Number |UNK |0.5440926937211562|0.11586821117770699|0 |
|PD York Stock |UNK |0.5371489243663168|0.08852012162737921|0 |
|from________to_______ Commission File|UNK |0.5155971905879061|0.14162160767505633|0 |
|TRANSITION REPORT PURSUANT |UNK |0.5036100781247339|0.08332010027661099|0 |
|Exact registrant charter |UNK |0.4904558869833586|0.10867417072096042|0 |