{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "I08sFJYCxR0Z" }, "source": [ "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "FwJ-P56kq6FU" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/14.0.Financial_ChunkKeyPhraseExtraction.ipynb)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": false, "id": "4iIO6G_B3pqq" }, "source": [ "🎬 Installation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hPwo4Czy3pqq", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "! pip install -q johnsnowlabs" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "YPsbAnNoPt0Z" }, "source": [ "##🔗 Automatic Installation\n", "Using my.johnsnowlabs.com SSO" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_L-7mLYp3pqr", "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "from johnsnowlabs import nlp, finance, legal\n", "\n", "nlp.install(force_browser=True)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "hsJvn_WWM2GL" }, "source": [ "##🔗 Manual downloading\n", "If you are not registered in my.johnsnowlabs.com, you received a license via e-email or you are using Safari, you may need to do a manual update of the license.\n", "\n", "- Go to my.johnsnowlabs.com\n", "- Download your license\n", "- Upload it using the following command" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "i57QV3-_P2sQ" }, "outputs": [], "source": [ "from google.colab import files\n", "print('Please Upload your John Snow Labs License using the button below')\n", "license_keys = files.upload()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "xGgNdFzZP_hQ" }, "source": [ "- Install it" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OfmmPqknP4rR" }, "outputs": [], "source": [ "nlp.install()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "DCl5ErZkNNLk" }, "source": [ "#📌 Starting" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "x3jVICoa3pqr" }, "outputs": [], "source": [ "spark = nlp.start()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "qylaK-p3F-3z" }, "source": [ "⏳ Load sample txt file" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "TCZ_yMDQfU6u" }, "outputs": [], "source": [ "text = \"\"\"ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES AND EXCHANGE ACT OF 1934\n", "For the annual period ended January 31, 2021\n", "or\n", "TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n", "For the transition period from________to_______\n", "Commission File Number: 001-38856\n", "PAGERDUTY, INC.\n", "(Exact name of registrant as specified in its charter)\n", "Delaware\n", "27-2793871\n", "(State or other jurisdiction of\n", "incorporation or organization)\n", "(I.R.S. 
Employer\n", "Identification Number)\n", "600 Townsend St., Suite 200, San Francisco, CA 94103\n", "(844) 800-3889\n", "(Address, including zip code, and telephone number, including area code, of registrant’s principal executive offices)\n", "Securities registered pursuant to Section 12(b) of the Act:\n", "Title of each class\n", "Trading symbol(s)\n", "Name of each exchange on which registered\n", "Common Stock, $0.000005 par value,\n", "PD\n", "New York Stock Exchange\"\"\"" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "LrZuODCcGA-o" }, "outputs": [], "source": [ "empty_data = spark.createDataFrame([[\"\"]]).toDF(\"text\")\n", "textDF = spark.createDataFrame([[text]]).toDF(\"text\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "-NTEaLTrkgCh" }, "source": [ "## 🔎 **Chunk Key Phrase Extraction**\n", "\n", "\n", "📜Explanation:\n", "\n", "Chunk Key Phrase Extraction is a technique used in natural language processing (NLP) to identify and extract key phrases or important chunks of text from a given document or text corpus. Key phrases are typically defined as meaningful and informative phrases that capture the essence of the content.\n", "\n", "The process of Chunk Key Phrase Extraction involves several steps:\n", "\n", "- **Tokenization:** The input text is divided into smaller units called tokens, which can be words, phrases, or even characters. Tokenization helps in breaking down the text into meaningful components that can be further analyzed.\n", "\n", "- **Part-of-Speech (POS) Tagging:** Each token is assigned a part-of-speech tag, which indicates the grammatical category or role of the word in the sentence (e.g., noun, verb, adjective). POS tagging helps in understanding the syntactic structure of the text.\n", "\n", "- **Chunking:** Chunking is the process of grouping together tokens based on specific patterns or rules. It involves identifying and extracting meaningful chunks of words that form meaningful phrases or constituents. These chunks are typically noun phrases or verb phrases that convey important information.\n", "\n", "- **Key Phrase Extraction:** From the extracted chunks, the algorithm selects and ranks key phrases based on their importance or relevance to the overall content. Various techniques can be employed for ranking, such as frequency-based approaches or statistical models that consider the contextual information of the phrases.\n", "\n", "Chunk Key Phrase Extraction is often used in applications such as information retrieval, document summarization, sentiment analysis, and text classification. It helps in identifying the most significant and informative phrases in a text, enabling better understanding and analysis of the content." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qY7XtY9YGBMX" }, "outputs": [], "source": [ "documenter = nlp.DocumentAssembler() \\\n", " .setInputCol(\"text\") \\\n", " .setOutputCol(\"document\")\n", "\n", "sentencer = nlp.SentenceDetector() \\\n", " .setInputCols([\"document\"]) \\\n", " .setOutputCol(\"sentences\")\n", "\n", "tokenizer = nlp.Tokenizer() \\\n", " .setInputCols([\"document\"]) \\\n", " .setOutputCol(\"tokens\") \\\n", " .setSplitChars(['\\[','\\]']) \\\n", "\n", "stop_words_cleaner = nlp.StopWordsCleaner.pretrained()\\\n", " .setInputCols(\"tokens\")\\\n", " .setOutputCol(\"clean_tokens\")\\\n", " .setCaseSensitive(False)\n", "\n", "ngram_generator = nlp.NGramGenerator()\\\n", " .setInputCols([\"clean_tokens\"])\\\n", " .setOutputCol(\"ngrams\")\\\n", " .setN(3)\n", "\n", "ngram_key_phrase_extractor = finance.ChunkKeyPhraseExtraction.pretrained()\\\n", " .setTopN(10) \\\n", " .setDivergence(0.4)\\\n", " .setInputCols([\"sentences\", \"ngrams\"])\\\n", " .setOutputCol(\"ngram_key_phrases\")\n", "\n", "ngram_pipeline = nlp.Pipeline(stages=[\n", " documenter, \n", " sentencer, \n", " tokenizer, \n", " stop_words_cleaner,\n", " ngram_generator,\n", " ngram_key_phrase_extractor\n", "])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "HlWNf7wKG1JB" }, "outputs": [], "source": [ "ngram_results = ngram_pipeline.fit(empty_data).transform(textDF)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "O833gNNFHJMA" }, "source": [ "**Lets show N-Gram results.**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "yF50J2XRHJl5", "outputId": "a07e98c5-0d2d-4c00-87b7-57dccc02c3a8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------------------------------------------------------------------------------+\n", "|key_phrase_candidate |\n", "+--------------------------------------------------------------------------------------------+\n", "|{chunk, 0, 21, ANNUAL REPORT PURSUANT, {sentence -> 0, chunk -> 0}, []} |\n", "|{chunk, 7, 32, REPORT PURSUANT SECTION, {sentence -> 0, chunk -> 1}, []} |\n", "|{chunk, 14, 35, PURSUANT SECTION 13, {sentence -> 0, chunk -> 2}, []} |\n", "|{chunk, 26, 43, SECTION 13 15(d, {sentence -> 0, chunk -> 3}, []} |\n", "|{chunk, 34, 44, 13 15(d ), {sentence -> 0, chunk -> 4}, []} |\n", "|{chunk, 40, 62, 15(d ) SECURITIES, {sentence -> 0, chunk -> 5}, []} |\n", "|{chunk, 44, 75, ) SECURITIES EXCHANGE, {sentence -> 0, chunk -> 6}, []} |\n", "|{chunk, 53, 79, SECURITIES EXCHANGE ACT, {sentence -> 0, chunk -> 7}, []} |\n", "|{chunk, 68, 87, EXCHANGE ACT 1934, {sentence -> 0, chunk -> 8}, []} |\n", "|{chunk, 77, 102, ACT 1934 annual, {sentence -> 0, chunk -> 9}, []} |\n", "|{chunk, 84, 109, 1934 annual period, {sentence -> 0, chunk -> 10}, []} |\n", "|{chunk, 97, 115, annual period ended, {sentence -> 0, chunk -> 11}, []} |\n", "|{chunk, 104, 123, period ended January, {sentence -> 0, chunk -> 12}, []} |\n", "|{chunk, 111, 126, ended January 31, {sentence -> 0, chunk -> 13}, []} |\n", "|{chunk, 117, 127, January 31 ,, {sentence -> 0, chunk -> 14}, []} |\n", "|{chunk, 125, 132, 31 , 2021, {sentence -> 0, chunk -> 15}, []} |\n", "|{chunk, 127, 146, , 2021 TRANSITION, {sentence -> 0, chunk -> 16}, []} |\n", "|{chunk, 129, 153, 2021 TRANSITION REPORT, {sentence -> 0, chunk -> 17}, []} |\n", "|{chunk, 137, 162, TRANSITION REPORT PURSUANT, {sentence -> 0, chunk -> 18}, []} |\n", "|{chunk, 148, 
173, REPORT PURSUANT SECTION, {sentence -> 0, chunk -> 19}, []} |\n", "|{chunk, 155, 176, PURSUANT SECTION 13, {sentence -> 0, chunk -> 20}, []} |\n", "|{chunk, 167, 184, SECTION 13 15(d, {sentence -> 0, chunk -> 21}, []} |\n", "|{chunk, 175, 185, 13 15(d ), {sentence -> 0, chunk -> 22}, []} |\n", "|{chunk, 181, 203, 15(d ) SECURITIES, {sentence -> 0, chunk -> 23}, []} |\n", "|{chunk, 185, 212, ) SECURITIES EXCHANGE, {sentence -> 0, chunk -> 24}, []} |\n", "|{chunk, 194, 216, SECURITIES EXCHANGE ACT, {sentence -> 0, chunk -> 25}, []} |\n", "|{chunk, 205, 224, EXCHANGE ACT 1934, {sentence -> 0, chunk -> 26}, []} |\n", "|{chunk, 214, 243, ACT 1934 transition, {sentence -> 0, chunk -> 27}, []} |\n", "|{chunk, 221, 250, 1934 transition period, {sentence -> 0, chunk -> 28}, []} |\n", "|{chunk, 234, 272, transition period from________to_______, {sentence -> 0, chunk -> 29}, []}|\n", "+--------------------------------------------------------------------------------------------+\n", "only showing top 30 rows\n", "\n" ] } ], "source": [ "ngram_results.selectExpr(\"explode(ngrams) AS key_phrase_candidate\").show(30,truncate=False)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "tmgCFlTnHPP8" }, "source": [ "**Check the key phrases from N-Gram results.**" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "oyXr3y_XHLLH", "outputId": "a579a9bd-837b-444f-b949-df80c527cf5f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", "| ngram_key_phrases|\n", "+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", "|{chunk, 53, 79, SECURITIES EXCHANGE ACT, {sentence -> 0, chunk -> 7, DocumentSimilarity -> 0.6393701205777877, MMRScore -> 0.3836220875904442}, [-1.0755919, 0.8220767,...|\n", "|{chunk, 688, 717, Securities registered pursuant, {sentence -> 0, chunk -> 95, DocumentSimilarity -> 0.6136637817081764, MMRScore -> 0.07235383001749868}, [-1.1415554,...|\n", "|{chunk, 377, 397, Delaware 27-2793871, {sentence -> 0, chunk -> 44, DocumentSimilarity -> 0.5795593361842474, MMRScore -> 0.21278103633873757}, [-0.7930386, 0.9821784,...|\n", "|{chunk, 274, 295, Commission File Number, {sentence -> 0, chunk -> 32, DocumentSimilarity -> 0.5611810097049629, MMRScore -> 0.06628085753098495}, [-1.010395, 0.676069...|\n", "|{chunk, 762, 783, class Trading symbol(s, {sentence -> 0, chunk -> 104, DocumentSimilarity -> 0.5605955919202351, MMRScore -> 0.11302878347562398}, [-0.46789795, 0.362...|\n", "|{chunk, 470, 499, Employer Identification Number, {sentence -> 0, chunk -> 56, DocumentSimilarity -> 0.5440928090884692, MMRScore -> 0.11586834883339939}, [-1.3709545,...|\n", "|{chunk, 863, 879, PD York Stock, {sentence -> 0, chunk -> 116, DocumentSimilarity -> 0.5371489243663168, MMRScore -> 0.08852012162737921}, [-0.64972705, 0.60796344, -1...|\n", "|{chunk, 252, 288, from________to_______ Commission File, {sentence -> 0, chunk -> 31, DocumentSimilarity -> 0.5155973270572032, MMRScore -> 0.1416217242859935}, [-0.44...|\n", "|{chunk, 137, 162, TRANSITION REPORT PURSUANT, {sentence -> 0, chunk -> 18, DocumentSimilarity -> 0.5036100781247339, MMRScore -> 0.08331998249757963}, [-1.4263232, 0.0...|\n", "|{chunk, 325, 376, 
Exact registrant charter, {sentence -> 0, chunk -> 41, DocumentSimilarity -> 0.4904558869833586, MMRScore -> 0.10867417072096042}, [-1.2218723, 0.089...|\n", "+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", "\n" ] } ], "source": [ "ngram_results.selectExpr(\"explode(ngram_key_phrases) AS ngram_key_phrases\").show(truncate=170)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "Ne0FBLs2HZfn" }, "source": [ "**Show the selected key phrases, their cosine similarity to the document, the Maximal Marginal Relevance score, and the sentence where each key phrase was found.**" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Gr5zWKnHHZ1H", "outputId": "209431fa-3cf3-4801-f0ed-1dcffdc95322" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------------------------------------+------------------+-------------------+--------+\n", "|key_phrase                           |DocumentSimilarity|MMRScore           |sentence|\n", "+-------------------------------------+------------------+-------------------+--------+\n", "|SECURITIES EXCHANGE ACT              |0.6393701205777877|0.3836220875904442 |0       |\n", "|Securities registered pursuant       |0.6136637817081764|0.07235383001749868|0       |\n", "|Delaware 27-2793871                  |0.5795593361842474|0.21278103633873757|0       |\n", "|Commission File Number               |0.5611810097049629|0.06628085753098495|0       |\n", "|class Trading symbol(s               |0.5605955919202351|0.11302878347562398|0       |\n", "|Employer Identification Number       |0.5440928090884692|0.11586834883339939|0       |\n", "|PD York Stock                        |0.5371489243663168|0.08852012162737921|0       |\n", "|from________to_______ Commission File|0.5155973270572032|0.1416217242859935 |0       |\n", "|TRANSITION REPORT PURSUANT           |0.5036100781247339|0.08331998249757963|0       |\n", "|Exact registrant charter             |0.4904558869833586|0.10867417072096042|0       |\n", "+-------------------------------------+------------------+-------------------+--------+\n", "\n" ] } ], "source": [ "import pyspark.sql.functions as F\n", "\n", "ngram_results.select(F.explode(F.arrays_zip(ngram_results.ngram_key_phrases.result,\n", " ngram_results.ngram_key_phrases.metadata)).alias(\"cols\"))\\\n", " .select(F.expr(\"cols['0']\").alias(\"key_phrase\"),\n", " F.expr(\"cols['1']['DocumentSimilarity']\").alias(\"DocumentSimilarity\"),\n", " F.expr(\"cols['1']['MMRScore']\").alias(\"MMRScore\"),\n", " F.expr(\"cols['1']['sentence']\").alias(\"sentence\")).show(truncate=False)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "yY8iJ7DUHhn0" }, "source": [ "# with NER Model\n", "\n", "Now we will show how to get key phrases from NER chunks by feeding `ChunkKeyPhraseExtraction` with the output of `NerConverterInternal`."
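, "\n\n", "Before running the pipeline, you can optionally check which entity types the pretrained `finner_sec_10k_summary` model emits, since those entities become the key phrase candidates here. The small cell below is an optional sketch: it loads the same pretrained model used in the pipeline and assumes the `getClasses()` helper of pretrained NER models is available." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch: preview the entity types emitted by the SEC 10-K summary NER model.\n", "# This loads the same pretrained model used in the pipeline below.\n", "ner_preview = finance.NerModel.pretrained(\"finner_sec_10k_summary\", \"en\", \"finance/models\")\n", "print(ner_preview.getClasses())"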
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8WuqSiHXHbkX" }, "outputs": [], "source": [ "documenter = nlp.DocumentAssembler() \\\n", " .setInputCol(\"text\") \\\n", " .setOutputCol(\"document\")\n", "\n", "sentencer = nlp.SentenceDetector() \\\n", " .setInputCols([\"document\"]) \\\n", " .setOutputCol(\"sentences\")\n", "\n", "tokenizer = nlp.Tokenizer() \\\n", " .setInputCols([\"document\"]) \\\n", " .setOutputCol(\"tokens\") \\\n", " .setSplitChars(['\\[','\\]']) \n", "\n", "embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\")\\\n", " .setInputCols(\"sentences\", \"tokens\") \\\n", " .setOutputCol(\"embeddings\")\\\n", " .setMaxSentenceLength(512)\\\n", " .setCaseSensitive(True)\n", "\n", "ner_tagger = finance.NerModel.pretrained(\"finner_sec_10k_summary\",\"en\",\"finance/models\")\\\n", " .setInputCols([\"sentences\", \"tokens\", \"embeddings\"]) \\\n", " .setOutputCol(\"ner_tags\")\n", "\n", "ner_converter = finance.NerConverterInternal()\\\n", " .setInputCols(\"sentences\", \"tokens\", \"ner_tags\")\\\n", " .setOutputCol(\"ner_chunks\")\n", "\n", "ner_key_phrase_extractor = finance.ChunkKeyPhraseExtraction.pretrained()\\\n", " .setTopN(10) \\\n", " .setDivergence(0.4)\\\n", " .setInputCols([\"sentences\", \"ner_chunks\"])\\\n", " .setOutputCol(\"ner_key_phrases\")\n", "\n", "ner_pipeline = nlp.Pipeline(stages=[\n", " documenter, \n", " sentencer, \n", " tokenizer, \n", " embeddings, \n", " ner_tagger, \n", " ner_converter, \n", " ner_key_phrase_extractor\n", "])" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "A851c6yiJRPJ" }, "outputs": [], "source": [ "ner_results = ner_pipeline.fit(empty_data).transform(textDF)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BRtANq26JXdM", "outputId": "9700351c-6e99-45a5-e78d-2dcac0907aec" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------------------------------------------+-----------------+\n", "|ner_chunk |label |\n", "+----------------------------------------------+-----------------+\n", "|January 31, 2021 |FISCAL_YEAR |\n", "|001-38856 |CFN |\n", "|PAGERDUTY, INC |ORG |\n", "|Delaware |STATE |\n", "|27-2793871 |IRS |\n", "|600 Townsend St., Suite 200, San Francisco, CA|ADDRESS |\n", "|(844) 800-3889 |PHONE |\n", "|Common Stock |TITLE_CLASS |\n", "|$0.000005 |TITLE_CLASS_VALUE|\n", "|PD |TICKER |\n", "|New York Stock Exchange |STOCK_EXCHANGE |\n", "+----------------------------------------------+-----------------+\n", "\n" ] } ], "source": [ "# ner_chunk results\n", "\n", "ner_results.select(F.explode(F.arrays_zip(ner_results.ner_chunks.result,\n", " ner_results.ner_chunks.metadata)).alias(\"cols\"))\\\n", " .select(F.expr(\"cols['0']\").alias(\"ner_chunk\"),\n", " F.expr(\"cols['1']['entity']\").alias(\"label\")).show(50, truncate=False)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "LeGwhovDQBrV", "outputId": "3c1a41e9-bb1a-4f34-f936-75bd3e533d77" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------------------------------------------+-----------------+-------------------+---------------------+--------+\n", "|key_phrase |label |DocumentSimilarity |MMRScore |sentence|\n", "+----------------------------------------------+-----------------+-------------------+---------------------+--------+\n", "|New York Stock Exchange |STOCK_EXCHANGE 
|0.5488381988412031 |0.3293029323900442 |6 |\n", "|27-2793871 |IRS |0.45730858080350506|0.13971821238765356 |3 |\n", "|600 Townsend St., Suite 200, San Francisco, CA|ADDRESS |0.43345846542169797|0.0629858198432803 |4 |\n", "|(844) 800-3889 |PHONE |0.3828927936642871 |-0.05135087737234392 |4 |\n", "|PAGERDUTY, INC |ORG |0.3797431768838629 |0.0665505773270374 |2 |\n", "|Common Stock |TITLE_CLASS |0.35267543540066576|0.013411679066343135 |6 |\n", "|$0.000005 |TITLE_CLASS_VALUE|0.3453745062411516 |0.042424153422046806 |6 |\n", "|Delaware |STATE |0.3269358508439858 |0.020992782487245037 |3 |\n", "|January 31, 2021 |FISCAL_YEAR |0.30783572125737296|0.031128095591583277 |1 |\n", "|PD |TICKER |0.2656579767820962 |-0.059599408981947044|6 |\n", "+----------------------------------------------+-----------------+-------------------+---------------------+--------+\n", "\n" ] } ], "source": [ "ner_results.select(F.explode(F.arrays_zip(ner_results.ner_key_phrases.result, \n", " ner_results.ner_key_phrases.metadata)).alias(\"cols\"))\\\n", " .select(F.expr(\"cols['0']\").alias(\"key_phrase\"),\n", " F.expr(\"cols['1']['entity']\").alias(\"label\"),\n", " F.expr(\"cols['1']['DocumentSimilarity']\").alias(\"DocumentSimilarity\"),\n", " F.expr(\"cols['1']['MMRScore']\").alias(\"MMRScore\"),\n", " F.expr(\"cols['1']['sentence']\").alias(\"sentence\")).show(truncate=False)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "RbqsAtrcX2CE" }, "source": [ "# with NGramGenerator and NER Model\n", "\n", "NGramGenerator and NER (Named Entity Recognition) Mode are additional components or techniques that can be used in conjunction with Chunk Key Phrase Extraction to enhance the extraction of key phrases.\n", "\n", "- NGramGenerator: An NGram refers to a contiguous sequence of n items from a given text, where an item can be a word, character, or any other linguistic unit. NGramGenerator is a component that generates NGrams from the input text. By considering NGrams of varying lengths (unigrams, bigrams, trigrams, etc.), the NGramGenerator captures both single words and multi-word expressions, which can be valuable key phrases.\n", "\n", "For example, if the input text is \"I love to play soccer,\" the NGramGenerator can produce unigrams like \"I,\" \"love,\" \"to,\" \"play,\" and \"soccer,\" as well as bigrams like \"I love,\" \"love to,\" \"to play,\" and \"play soccer.\" These NGrams provide more context and improve the extraction of meaningful key phrases.\n", "\n", "- NER Mode (Named Entity Recognition): Named Entity Recognition is a subtask of NLP that aims to identify and classify named entities, such as person names, locations, organizations, dates, etc., in text. NER Mode is a specific setting or approach used during Chunk Key Phrase Extraction, where named entities are recognized and treated as important chunks or key phrases.\n", "\n", "By incorporating NER Mode, the extraction process can specifically focus on extracting key phrases that represent named entities, which are typically highly informative and relevant in many applications. For instance, in a news article, named entities like \"Barack Obama,\" \"New York City,\" or \"Apple Inc.\" are important key phrases that convey crucial information.\n", "\n", "Using NGramGenerator and NER Mode in combination with Chunk Key Phrase Extraction can lead to more accurate and comprehensive extraction of key phrases from text. 
These techniques allow for the identification of meaningful phrases, including single words, multi-word expressions, and named entities, which contribute to a better understanding of the content and enable more effective analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cDpPgelgX2jP" }, "outputs": [], "source": [ "documenter = nlp.DocumentAssembler() \\\n", " .setInputCol(\"text\") \\\n", " .setOutputCol(\"document\")\n", "\n", "sentencer = nlp.SentenceDetector() \\\n", " .setInputCols([\"document\"]) \\\n", " .setOutputCol(\"sentences\")\n", "\n", "tokenizer = nlp.Tokenizer() \\\n", " .setInputCols([\"document\"]) \\\n", " .setOutputCol(\"tokens\") \\\n", " .setSplitChars(['\\[','\\]']) \n", "\n", "stop_words_cleaner = nlp.StopWordsCleaner.pretrained()\\\n", " .setInputCols(\"tokens\")\\\n", " .setOutputCol(\"clean_tokens\")\\\n", " .setCaseSensitive(False)\n", "\n", "ngram_generator = nlp.NGramGenerator()\\\n", " .setInputCols([\"clean_tokens\"])\\\n", " .setOutputCol(\"ngrams\")\\\n", " .setN(3)\n", " \n", "embeddings = nlp.BertEmbeddings.pretrained(\"bert_embeddings_sec_bert_base\",\"en\")\\\n", " .setInputCols(\"sentences\", \"tokens\") \\\n", " .setOutputCol(\"embeddings\")\\\n", " .setMaxSentenceLength(512)\\\n", " .setCaseSensitive(True)\n", "\n", "ner_tagger = finance.NerModel.pretrained(\"finner_sec_10k_summary\",\"en\",\"finance/models\")\\\n", " .setInputCols([\"sentences\", \"tokens\", \"embeddings\"]) \\\n", " .setOutputCol(\"ner_tags\")\n", "\n", "ner_converter = finance.NerConverterInternal()\\\n", " .setInputCols(\"sentences\", \"tokens\", \"ner_tags\")\\\n", " .setOutputCol(\"ner_chunks\")\n", "\n", "chunk_merger = finance.ChunkMergeApproach()\\\n", " .setInputCols(\"ngrams\", \"ner_chunks\")\\\n", " .setOutputCol(\"merged_chunks\")\\\n", " .setMergeOverlapping(False)\n", "\n", "ngram_ner_key_phrase_extractor = finance.ChunkKeyPhraseExtraction.pretrained()\\\n", " .setTopN(10) \\\n", " .setDivergence(0.4)\\\n", " .setInputCols([\"sentences\", \"merged_chunks\"])\\\n", " .setOutputCol(\"key_phrases\")\n", "\n", "ngram_ner_pipeline = nlp.Pipeline(stages=[\n", " documenter, \n", " sentencer, \n", " tokenizer, \n", " stop_words_cleaner,\n", " ngram_generator,\n", " embeddings, \n", " ner_tagger, \n", " ner_converter, \n", " chunk_merger,\n", " ngram_ner_key_phrase_extractor\n", "])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "id": "jk_Wx7ElYoxr" }, "outputs": [], "source": [ "ngram_ner_results = ngram_ner_pipeline.fit(empty_data).transform(textDF)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "Qy9Yp_CRYuZ_" }, "source": [ "**Show the merged key phrase candidate results. 
`UNK` ones come from the NGramGenerator and the others come from the `finner_sec_10k_summary` NER model.**" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7SOjf3L1Yu6F", "outputId": "3986bc2f-84bd-45cb-fdca-14780b03d616" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------------------------------------------------------------------------------------------------------------------------------------------+\n",
 "|key_phrase_candidate                                                                                                                            |\n",
 "+------------------------------------------------------------------------------------------------------------------------------------------------+\n",
 "|{chunk, 0, 21, ANNUAL REPORT PURSUANT, {entity -> UNK, chunk -> 0, sentence -> 0}, []}                                                          |\n",
 "|{chunk, 7, 32, REPORT PURSUANT SECTION, {entity -> UNK, chunk -> 1, sentence -> 0}, []}                                                         |\n",
 "|{chunk, 14, 35, PURSUANT SECTION 13, {entity -> UNK, chunk -> 2, sentence -> 0}, []}                                                            |\n",
 "|{chunk, 26, 43, SECTION 13 15(d, {entity -> UNK, chunk -> 3, sentence -> 0}, []}                                                                |\n",
 "|{chunk, 34, 44, 13 15(d ), {entity -> UNK, chunk -> 4, sentence -> 0}, []}                                                                      |\n",
 "|{chunk, 40, 62, 15(d ) SECURITIES, {entity -> UNK, chunk -> 5, sentence -> 0}, []}                                                              |\n",
 "|{chunk, 44, 75, ) SECURITIES EXCHANGE, {entity -> UNK, chunk -> 6, sentence -> 0}, []}                                                          |\n",
 "|{chunk, 53, 79, SECURITIES EXCHANGE ACT, {entity -> UNK, chunk -> 7, sentence -> 0}, []}                                                        |\n",
 "|{chunk, 68, 87, EXCHANGE ACT 1934, {entity -> UNK, chunk -> 8, sentence -> 0}, []}                                                              |\n",
 "|{chunk, 77, 102, ACT 1934 annual, {entity -> UNK, chunk -> 9, sentence -> 0}, []}                                                               |\n",
 "|{chunk, 84, 109, 1934 annual period, {entity -> UNK, chunk -> 10, sentence -> 0}, []}                                                           |\n",
 "|{chunk, 97, 115, annual period ended, {entity -> UNK, chunk -> 11, sentence -> 0}, []}                                                          |\n",
 "|{chunk, 104, 123, period ended January, {entity -> UNK, chunk -> 12, sentence -> 0}, []}                                                        |\n",
 "|{chunk, 111, 126, ended January 31, {entity -> UNK, chunk -> 13, sentence -> 0}, []}                                                            |\n",
 "|{chunk, 117, 127, January 31 ,, {entity -> UNK, chunk -> 14, sentence -> 0}, []}                                                                |\n",
 "|{chunk, 117, 132, January 31, 2021, {chunk -> 15, confidence -> 0.89890003, ner_source -> ner_chunks, entity -> FISCAL_YEAR, sentence -> 1}, []}|\n",
 "|{chunk, 125, 132, 31 , 2021, {entity -> UNK, chunk -> 16, sentence -> 0}, []}                                                                   |\n",
 "|{chunk, 127, 146, , 2021 TRANSITION, {entity -> UNK, chunk -> 17, sentence -> 0}, []}                                                           |\n",
 "|{chunk, 129, 153, 2021 TRANSITION REPORT, {entity -> UNK, chunk -> 18, sentence -> 0}, []}                                                      |\n",
 "|{chunk, 137, 162, TRANSITION REPORT PURSUANT, {entity -> UNK, chunk -> 19, sentence -> 0}, []}                                                  |\n",
 "|{chunk, 148, 173, REPORT PURSUANT SECTION, {entity -> UNK, chunk -> 20, sentence -> 0}, []}                                                     |\n",
 "|{chunk, 155, 176, PURSUANT SECTION 13, {entity -> UNK, chunk -> 21, sentence -> 0}, []}                                                         |\n",
 "|{chunk, 167, 184, SECTION 13 15(d, {entity -> UNK, chunk -> 22, sentence -> 0}, []}                                                             |\n",
 "|{chunk, 175, 185, 13 15(d ), {entity -> UNK, chunk -> 23, sentence -> 0}, []}                                                                   |\n",
 "|{chunk, 181, 203, 15(d ) SECURITIES, {entity -> UNK, chunk -> 24, sentence -> 0}, []}                                                           |\n",
 "|{chunk, 185, 212, ) SECURITIES EXCHANGE, {entity -> UNK, chunk -> 25, sentence -> 0}, []}                                                       |\n",
 "|{chunk, 194, 216, SECURITIES EXCHANGE ACT, {entity -> UNK, chunk -> 26, sentence -> 0}, []}                                                     |\n",
 "|{chunk, 205, 224, EXCHANGE ACT 1934, {entity -> UNK, chunk -> 27, sentence -> 0}, []}                                                           |\n",
 "|{chunk, 214, 243, ACT 1934 transition, {entity -> UNK, chunk -> 28, sentence -> 0}, []}                                                         |\n",
 "|{chunk, 221, 250, 1934 transition period, {entity -> UNK, chunk -> 29, sentence -> 0}, []}                                                      |\n",
"+------------------------------------------------------------------------------------------------------------------------------------------------+\n", "only showing top 30 rows\n", "\n" ] } ], "source": [ "ngram_ner_results.selectExpr(\"explode(merged_chunks) AS key_phrase_candidate\").show(30,truncate=False)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "m1fv9UDYYxCz", "outputId": "5bcb1d6f-cd61-4f11-9380-e35367469964" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------------------------------------------+-----------------+\n", "|key_phrase_candidate |label |\n", "+----------------------------------------------+-----------------+\n", "|January 31, 2021 |FISCAL_YEAR |\n", "|001-38856 |CFN |\n", "|PAGERDUTY, INC |ORG |\n", "|Delaware |STATE |\n", "|27-2793871 |IRS |\n", "|600 Townsend St., Suite 200, San Francisco, CA|ADDRESS |\n", "|(844) 800-3889 |PHONE |\n", "|Common Stock |TITLE_CLASS |\n", "|$0.000005 |TITLE_CLASS_VALUE|\n", "|PD |TICKER |\n", "|New York Stock Exchange |STOCK_EXCHANGE |\n", "+----------------------------------------------+-----------------+\n", "\n" ] } ], "source": [ "# NER chunk results\n", "ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.merged_chunks.result,\n", " ngram_ner_results.merged_chunks.metadata)).alias(\"cols\"))\\\n", " .select(F.expr(\"cols['0']\").alias(\"key_phrase_candidate\"),\n", " F.expr(\"cols['1']['entity']\").alias(\"label\")).filter(\"label != 'UNK'\").show(50, truncate=False)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ohSJbw9hY3pe", "outputId": "22695d80-c4dd-461d-9d7d-c1e14e0ffefc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------------------------------------+-----+\n", "|key_phrase_candidate |label|\n", "+---------------------------------------+-----+\n", "|ANNUAL REPORT PURSUANT |UNK |\n", "|REPORT PURSUANT SECTION |UNK |\n", "|PURSUANT SECTION 13 |UNK |\n", "|SECTION 13 15(d |UNK |\n", "|13 15(d ) |UNK |\n", "|15(d ) SECURITIES |UNK |\n", "|) SECURITIES EXCHANGE |UNK |\n", "|SECURITIES EXCHANGE ACT |UNK |\n", "|EXCHANGE ACT 1934 |UNK |\n", "|ACT 1934 annual |UNK |\n", "|1934 annual period |UNK |\n", "|annual period ended |UNK |\n", "|period ended January |UNK |\n", "|ended January 31 |UNK |\n", "|January 31 , |UNK |\n", "|31 , 2021 |UNK |\n", "|, 2021 TRANSITION |UNK |\n", "|2021 TRANSITION REPORT |UNK |\n", "|TRANSITION REPORT PURSUANT |UNK |\n", "|REPORT PURSUANT SECTION |UNK |\n", "|PURSUANT SECTION 13 |UNK |\n", "|SECTION 13 15(d |UNK |\n", "|13 15(d ) |UNK |\n", "|15(d ) SECURITIES |UNK |\n", "|) SECURITIES EXCHANGE |UNK |\n", "|SECURITIES EXCHANGE ACT |UNK |\n", "|EXCHANGE ACT 1934 |UNK |\n", "|ACT 1934 transition |UNK |\n", "|1934 transition period |UNK |\n", "|transition period from________to_______|UNK |\n", "|period from________to_______ Commission|UNK |\n", "|from________to_______ Commission File |UNK |\n", "|Commission File Number |UNK |\n", "|File Number : |UNK |\n", "|Number : 001-38856 |UNK |\n", "|: 001-38856 PAGERDUTY |UNK |\n", "|001-38856 PAGERDUTY , |UNK |\n", "|PAGERDUTY , . |UNK |\n", "|, . ( |UNK |\n", "|. 
( Exact |UNK |\n", "|( Exact registrant |UNK |\n", "|Exact registrant charter |UNK |\n", "|registrant charter ) |UNK |\n", "|charter ) Delaware |UNK |\n", "|) Delaware 27-2793871 |UNK |\n", "|Delaware 27-2793871 ( |UNK |\n", "|27-2793871 ( State |UNK |\n", "|( State jurisdiction |UNK |\n", "|State jurisdiction incorporation |UNK |\n", "|jurisdiction incorporation organization|UNK |\n", "+---------------------------------------+-----+\n", "only showing top 50 rows\n", "\n" ] } ], "source": [ "# ngram results\n", "ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.merged_chunks.result,\n", " ngram_ner_results.merged_chunks.metadata)).alias(\"cols\"))\\\n", " .select(F.expr(\"cols['0']\").alias(\"key_phrase_candidate\"),\n", " F.expr(\"cols['1']['entity']\").alias(\"label\")).filter(\"label == 'UNK'\").show(50, truncate=False)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "y8IqP7CuY7os", "outputId": "32b99207-576a-432a-b090-a484402378eb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------------------------------------+-----------+\n", "|key_phrase_candidate |label |\n", "+---------------------------------------+-----------+\n", "|ANNUAL REPORT PURSUANT |UNK |\n", "|REPORT PURSUANT SECTION |UNK |\n", "|PURSUANT SECTION 13 |UNK |\n", "|SECTION 13 15(d |UNK |\n", "|13 15(d ) |UNK |\n", "|15(d ) SECURITIES |UNK |\n", "|) SECURITIES EXCHANGE |UNK |\n", "|SECURITIES EXCHANGE ACT |UNK |\n", "|EXCHANGE ACT 1934 |UNK |\n", "|ACT 1934 annual |UNK |\n", "|1934 annual period |UNK |\n", "|annual period ended |UNK |\n", "|period ended January |UNK |\n", "|ended January 31 |UNK |\n", "|January 31 , |UNK |\n", "|January 31, 2021 |FISCAL_YEAR|\n", "|31 , 2021 |UNK |\n", "|, 2021 TRANSITION |UNK |\n", "|2021 TRANSITION REPORT |UNK |\n", "|TRANSITION REPORT PURSUANT |UNK |\n", "|REPORT PURSUANT SECTION |UNK |\n", "|PURSUANT SECTION 13 |UNK |\n", "|SECTION 13 15(d |UNK |\n", "|13 15(d ) |UNK |\n", "|15(d ) SECURITIES |UNK |\n", "|) SECURITIES EXCHANGE |UNK |\n", "|SECURITIES EXCHANGE ACT |UNK |\n", "|EXCHANGE ACT 1934 |UNK |\n", "|ACT 1934 transition |UNK |\n", "|1934 transition period |UNK |\n", "|transition period from________to_______|UNK |\n", "|period from________to_______ Commission|UNK |\n", "|from________to_______ Commission File |UNK |\n", "|Commission File Number |UNK |\n", "|File Number : |UNK |\n", "|Number : 001-38856 |UNK |\n", "|: 001-38856 PAGERDUTY |UNK |\n", "|001-38856 |CFN |\n", "|001-38856 PAGERDUTY , |UNK |\n", "|PAGERDUTY, INC |ORG |\n", "|PAGERDUTY , . |UNK |\n", "|, . ( |UNK |\n", "|. 
( Exact |UNK |\n", "|( Exact registrant |UNK |\n", "|Exact registrant charter |UNK |\n", "|registrant charter ) |UNK |\n", "|charter ) Delaware |UNK |\n", "|) Delaware 27-2793871 |UNK |\n", "|Delaware |STATE |\n", "|Delaware 27-2793871 ( |UNK |\n", "+---------------------------------------+-----------+\n", "only showing top 50 rows\n", "\n" ] } ], "source": [ "# merged (NER chunk + ngram) results\n", "ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.merged_chunks.result,\n", " ngram_ner_results.merged_chunks.metadata)).alias(\"cols\"))\\\n", " .select(F.expr(\"cols['0']\").alias(\"key_phrase_candidate\"),\n", " F.expr(\"cols['1']['entity']\").alias(\"label\")).show(50, truncate=False)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "I_r745LNZjX_", "outputId": "13fa1045-5ba5-49b2-bdc8-b3621554759d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------------------------------------+------+--------+\n", "|key_phrase_candidate |source|sentence|\n", "+---------------------------------------+------+--------+\n", "|ANNUAL REPORT PURSUANT |ngram |0 |\n", "|REPORT PURSUANT SECTION |ngram |0 |\n", "|PURSUANT SECTION 13 |ngram |0 |\n", "|SECTION 13 15(d |ngram |0 |\n", "|13 15(d ) |ngram |0 |\n", "|15(d ) SECURITIES |ngram |0 |\n", "|) SECURITIES EXCHANGE |ngram |0 |\n", "|SECURITIES EXCHANGE ACT |ngram |0 |\n", "|EXCHANGE ACT 1934 |ngram |0 |\n", "|ACT 1934 annual |ngram |0 |\n", "|1934 annual period |ngram |0 |\n", "|annual period ended |ngram |0 |\n", "|period ended January |ngram |0 |\n", "|ended January 31 |ngram |0 |\n", "|January 31 , |ngram |0 |\n", "|January 31, 2021 |NER |1 |\n", "|31 , 2021 |ngram |0 |\n", "|, 2021 TRANSITION |ngram |0 |\n", "|2021 TRANSITION REPORT |ngram |0 |\n", "|TRANSITION REPORT PURSUANT |ngram |0 |\n", "|REPORT PURSUANT SECTION |ngram |0 |\n", "|PURSUANT SECTION 13 |ngram |0 |\n", "|SECTION 13 15(d |ngram |0 |\n", "|13 15(d ) |ngram |0 |\n", "|15(d ) SECURITIES |ngram |0 |\n", "|) SECURITIES EXCHANGE |ngram |0 |\n", "|SECURITIES EXCHANGE ACT |ngram |0 |\n", "|EXCHANGE ACT 1934 |ngram |0 |\n", "|ACT 1934 transition |ngram |0 |\n", "|1934 transition period |ngram |0 |\n", "|transition period from________to_______|ngram |0 |\n", "|period from________to_______ Commission|ngram |0 |\n", "|from________to_______ Commission File |ngram |0 |\n", "|Commission File Number |ngram |0 |\n", "|File Number : |ngram |0 |\n", "|Number : 001-38856 |ngram |0 |\n", "|: 001-38856 PAGERDUTY |ngram |0 |\n", "|001-38856 |NER |2 |\n", "|001-38856 PAGERDUTY , |ngram |0 |\n", "|PAGERDUTY, INC |NER |2 |\n", "|PAGERDUTY , . |ngram |0 |\n", "|, . ( |ngram |0 |\n", "|. 
( Exact |ngram |0 |\n", "|( Exact registrant |ngram |0 |\n", "|Exact registrant charter |ngram |0 |\n", "|registrant charter ) |ngram |0 |\n", "|charter ) Delaware |ngram |0 |\n", "|) Delaware 27-2793871 |ngram |0 |\n", "|Delaware |NER |3 |\n", "|Delaware 27-2793871 ( |ngram |0 |\n", "+---------------------------------------+------+--------+\n", "only showing top 50 rows\n", "\n" ] } ], "source": [ "ngram_ner_results.selectExpr(\"explode(merged_chunks) AS key_phrase_candidate\")\\\n", " .selectExpr(\"key_phrase_candidate.result AS key_phrase_candidate\",\n", " \"IF(key_phrase_candidate.metadata.entity = 'UNK', 'ngram', 'NER') AS source\",\n", " \"key_phrase_candidate.metadata.sentence\")\\\n", " .show(50, truncate=False)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "oMAoOamcZj7k", "outputId": "05a6aecb-be50-4977-c9b9-a2869b30ae2e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------------------------------------+-----+------------------+-------------------+--------+\n", "|key_phrase |label|DocumentSimilarity|MMRScore |sentence|\n", "+-------------------------------------+-----+------------------+-------------------+--------+\n", "|SECURITIES EXCHANGE ACT |UNK |0.6393701205777877|0.3836220875904442 |0 |\n", "|Securities registered pursuant |UNK |0.6136637817081764|0.07235383001749868|0 |\n", "|Delaware 27-2793871 |UNK |0.5795593361842474|0.21278103633873757|0 |\n", "|Commission File Number |UNK |0.5611810371021204|0.0662808662583036 |0 |\n", "|class Trading symbol(s |UNK |0.5605955342194471|0.1130287649782131 |0 |\n", "|Employer Identification Number |UNK |0.5440926937211562|0.11586821117770699|0 |\n", "|PD York Stock |UNK |0.5371489243663168|0.08852012162737921|0 |\n", "|from________to_______ Commission File|UNK |0.5155971905879061|0.14162160767505633|0 |\n", "|TRANSITION REPORT PURSUANT |UNK |0.5036100781247339|0.08332010027661099|0 |\n", "|Exact registrant charter |UNK |0.4904558869833586|0.10867417072096042|0 |\n", "+-------------------------------------+-----+------------------+-------------------+--------+\n", "\n" ] } ], "source": [ "ngram_ner_results.select(F.explode(F.arrays_zip(ngram_ner_results.key_phrases.result,\n", " ngram_ner_results.key_phrases.metadata)).alias(\"cols\"))\\\n", " .select(F.expr(\"cols['0']\").alias(\"key_phrase\"),\n", " F.expr(\"cols['1']['entity']\").alias(\"label\"),\n", " F.expr(\"cols['1']['DocumentSimilarity']\").alias(\"DocumentSimilarity\"),\n", " F.expr(\"cols['1']['MMRScore']\").alias(\"MMRScore\"),\n", " F.expr(\"cols['1']['sentence']\").alias(\"sentence\")).show(truncate=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eu6QKAldZySR" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [] }, "gpuClass": "standard", "kernelspec": { "display_name": "tf-gpu", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]" }, "vscode": { "interpreter": { "hash": "3f47d918ae832c68584484921185f5c85a1760864bf927a683dc6fb56366cc77" } } }, "nbformat": 4, "nbformat_minor": 0 }