{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import Python libraries\n", "from typing import *\n", "import os\n", "import ibm_watson\n", "import ibm_watson.natural_language_understanding_v1 as nlu\n", "import ibm_cloud_sdk_core\n", "import pandas as pd\n", "import sys\n", "\n", "# And of course we need the text_extensions_for_pandas library itself.\n", "try:\n", " import text_extensions_for_pandas as tp\n", "except ModuleNotFoundError as e:\n", " raise Exception(\"text_extensions_for_pandas package not found on the Jupyter \"\n", " \"kernel's path. Please either run:\\n\"\n", " \" ln -s ../../text_extensions_for_pandas .\\n\"\n", " \"from the directory containing this notebook, or use a Python \"\n", " \"environment on which you have used `pip` to install the package.\")\n", "\n", "\n", "if \"IBM_API_KEY\" not in os.environ:\n", " raise ValueError(\"IBM_API_KEY environment variable not set. Please create \"\n", " \"a free instance of IBM Watson Natural Language Understanding \"\n", " \"(see https://www.ibm.com/cloud/watson-natural-language-understanding) \"\n", " \"and set the IBM_API_KEY environment variable to your instance's \"\n", " \"API key value.\")\n", "api_key = os.environ.get(\"IBM_API_KEY\")\n", "service_url = os.environ.get(\"IBM_SERVICE_URL\") \n", "natural_language_understanding = ibm_watson.NaturalLanguageUnderstandingV1(\n", " version=\"2021-01-01\",\n", " authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key)\n", ")\n", "natural_language_understanding.set_service_url(service_url)\n", "\n", "# Github notebook gists will be this wide: ------------------>\n", "# Screenshots of this notebook should be this wide: ----------------------------->" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Market Intelligence with Pandas and IBM Watson\n", "\n", "In this article, we'll show how to perform an example market intelligence task using [Watson Natural Language Understanding](https://www.ibm.com/cloud/watson-natural-language-understanding?cm_mmc=open_source_technology) and our open source library [Text Extensions for Pandas](https://ibm.biz/text-extensions-for-pandas). \n", "\n", "*Market intelligence* is an important application of natural language processing. In this context, \"market intelligence\" means \"finding useful facts about customers and competitors in news articles\". This article focuses on a market intelligence task: **extracting the names of executives from corporate press releases**.\n", "\n", "Information about a company's leadership has many uses. You could use that information to identify points of contact for sales or partnership discussions. Or you could estimate how much attention a company is giving to different strategic areas. Some organizations even use this information for recruiting purposes.\n", "\n", "Press releases are a good place to find the names of executives, because these articles often feature quotes from company leaders. Here's an example quote from an [IBM press release](https://newsroom.ibm.com/2020-12-02-IBM-Named-a-Leader-in-the-2020-IDC-MarketScape-For-Worldwide-Advanced-Machine-Learning-Software-Platform) from December 2020:\n", "\n", "![Snippet of a press release: \"By combining the power of AI with the flexibility and agility of hybrid cloud, our clients are driving innovation and digitizing their operations at a fast pace,\" said Daniel Hernandez, general manager, Data and AI, IBM. 
](images/quote.png)\n", "\n", "This quote contains information about the name of an executive:\n", "![The quote from the previous picture, highlighting the name \"Daniel Hernandez\"](images/annotated_quote.png)\n", "\n", "This snippet is an example of the general pattern that we will look for:\n", "* The article contains a quotation.\n", "* The person to whom the quotation is attributed is mentioned by name.\n", "\n", "The key challenge that we need to address is the many different forms that this pattern can take. Here are some examples of variations that we would like to capture:\n", "\n", "![Variations on the quote from the previous picture: (1) Present-tense \"says\" instead of \"said\"; (2) Name occurs before the quote; and (3) Name occurs in the middle of the quote](images/alternate_quotes.png)\n", "\n", "We'll deal with this variability by using general-purpose semantic models. These models extract high-level facts from formal text. The text could express a given fact in many different ways, but all of those different forms produce the same output.\n", "\n", "Semantic models can save a lot of work. There's no need to label separate training data or write separate rules for all of the variations of our target pattern. A small amount of code can capture all these variations at once.\n", "\n", "Let's get started!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Use IBM Watson to identify people quoted by name.\n", "\n", "IBM Watson Natural Language Understanding includes a model called `semantic_roles` that performs [Semantic Role Labeling](https://en.wikipedia.org/wiki/Semantic_role_labeling). You can think of Semantic Role Labeling as finding *subject-verb-object* triples:\n", "* The actions that occurred in the text (the verb),\n", "* Who performed each action (the subject), and\n", "* On whom the action was performed (the object).\n", "\n", "If we take our example executive quote and feed it through the `semantic_roles` model, we get the following raw output:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'usage': {'text_units': 1, 'text_characters': 221, 'features': 1},\n", " 'semantic_roles': [{'subject': {'text': 'our clients'},\n", " 'sentence': '\"By combining the power of AI with the flexibility and agility of hybrid cloud, our clients are driving innovation and digitizing their operations at a fast pace,\" said\\xa0Daniel Hernandez, general manager, Data and AI, IBM.',\n", " 'object': {'text': 'driving innovation and digitizing their operations'},\n", " 'action': {'verb': {'text': 'be', 'tense': 'present'},\n", " 'text': 'are',\n", " 'normalized': 'be'}},\n", " {'subject': {'text': 'our clients'},\n", " 'sentence': '\"By combining the power of AI with the flexibility and agility of hybrid cloud, our clients are driving innovation and digitizing their operations at a fast pace,\" said\\xa0Daniel Hernandez, general manager, Data and AI, IBM.',\n", " 'object': {'text': 'innovation and digitizing their operations'},\n", " 'action': {'verb': {'text': 'drive', 'tense': 'present'},\n", " 'text': 'are driving',\n", " 'normalized': 'be drive'}},\n", " {'subject': {'text': 'our clients'},\n", " 'sentence': '\"By combining the power of AI with the flexibility and agility of hybrid cloud, our clients are driving innovation and digitizing their operations at a fast pace,\" said\\xa0Daniel Hernandez, general manager, Data and AI, IBM.',\n", " 'object': {'text': 'their operations'},\n", " 'action': 
{'verb': {'text': 'digitize', 'tense': 'present'},\n", " 'text': 'digitizing',\n", " 'normalized': 'digitize'}},\n", " {'subject': {'text': 'Daniel Hernandez, general manager, Data and AI, IBM'},\n", " 'sentence': '\"By combining the power of AI with the flexibility and agility of hybrid cloud, our clients are driving innovation and digitizing their operations at a fast pace,\" said\\xa0Daniel Hernandez, general manager, Data and AI, IBM.',\n", " 'object': {'text': 'By combining the power of AI with the flexibility and agility of hybrid cloud, our clients are driving innovation and digitizing their operations at a fast pace'},\n", " 'action': {'verb': {'text': 'say', 'tense': 'past'},\n", " 'text': 'said',\n", " 'normalized': 'say'}}],\n", " 'language': 'en',\n", " 'analyzed_text': '\"By combining the power of AI with the flexibility and agility of hybrid cloud, our clients are driving innovation and digitizing their operations at a fast pace,\" said\\xa0Daniel Hernandez, general manager, Data and AI, IBM.'}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response = natural_language_understanding.analyze(\n", " text='''\"By combining the power of AI with the flexibility and agility of \\\n", "hybrid cloud, our clients are driving innovation and digitizing their operations \\\n", "at a fast pace,\" said Daniel Hernandez, general manager, Data and AI, IBM.''',\n", " return_analyzed_text=True,\n", " features=nlu.Features(\n", " semantic_roles=nlu.SemanticRolesOptions()\n", " )).get_result()\n", "response" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That format is a bit hard to read. Let's use our open-source library, [Text Extensions for Pandas](https://ibm.biz/text-extensions-for-pandas), to convert it to a Pandas DataFrame:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject.textsentenceobject.textaction.verb.textaction.verb.tenseaction.textaction.normalized
0our clients\"By combining the power of AI with the flexibi...driving innovation and digitizing their operat...bepresentarebe
1our clients\"By combining the power of AI with the flexibi...innovation and digitizing their operationsdrivepresentare drivingbe drive
2our clients\"By combining the power of AI with the flexibi...their operationsdigitizepresentdigitizingdigitize
3Daniel Hernandez, general manager, Data and AI...\"By combining the power of AI with the flexibi...By combining the power of AI with the flexibil...saypastsaidsay
\n", "
" ], "text/plain": [ " subject.text \\\n", "0 our clients \n", "1 our clients \n", "2 our clients \n", "3 Daniel Hernandez, general manager, Data and AI... \n", "\n", " sentence \\\n", "0 \"By combining the power of AI with the flexibi... \n", "1 \"By combining the power of AI with the flexibi... \n", "2 \"By combining the power of AI with the flexibi... \n", "3 \"By combining the power of AI with the flexibi... \n", "\n", " object.text action.verb.text \\\n", "0 driving innovation and digitizing their operat... be \n", "1 innovation and digitizing their operations drive \n", "2 their operations digitize \n", "3 By combining the power of AI with the flexibil... say \n", "\n", " action.verb.tense action.text action.normalized \n", "0 present are be \n", "1 present are driving be drive \n", "2 present digitizing digitize \n", "3 past said say " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import text_extensions_for_pandas as tp\n", "\n", "dfs = tp.io.watson.nlu.parse_response(response)\n", "dfs[\"semantic_roles\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can see that the `semantic_roles` model has identified four subject-verb-object triples. Each row of this DataFrame contains one triple. In the first row, the verb is \"to be\", and in the last row, the verb is \"to say\".\n", "\n", "The last row is where things get interesting for us, because the verb \"to say\" indicates that *someone made a statement*. And that's exactly the high-level pattern we're looking for. Let's filter the DataFrame down to that row and look at it more closely." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject.textsentenceobject.textaction.verb.textaction.verb.tenseaction.textaction.normalized
3Daniel Hernandez, general manager, Data and AI...\"By combining the power of AI with the flexibi...By combining the power of AI with the flexibil...saypastsaidsay
\n", "
" ], "text/plain": [ " subject.text \\\n", "3 Daniel Hernandez, general manager, Data and AI... \n", "\n", " sentence \\\n", "3 \"By combining the power of AI with the flexibi... \n", "\n", " object.text action.verb.text \\\n", "3 By combining the power of AI with the flexibil... say \n", "\n", " action.verb.tense action.text action.normalized \n", "3 past said say " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dfs[\"semantic_roles\"][dfs[\"semantic_roles\"][\"action.normalized\"] == \"say\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The subject in this subject-verb-object triple is \"Daniel Hernandez, general manager, Data and AI, IBM\", and the object is the quote from Mr. Hernandez.\n", "\n", "This model's output has captured the general action of \"\\[person\\] says \\[quotation\\]\". Different variations of that general pattern will produce the same output. If we move the attribution to the middle of the quote, we get the same result:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject.textsentenceobject.textaction.verb.textaction.verb.tenseaction.textaction.normalized
0Daniel Hernandez, general manager, Data and AI...\"By combining the power of AI with the flexibi...By combining the power of AI with the flexibil...saypastsaidsay
\n", "
" ], "text/plain": [ " subject.text \\\n", "0 Daniel Hernandez, general manager, Data and AI... \n", "\n", " sentence \\\n", "0 \"By combining the power of AI with the flexibi... \n", "\n", " object.text action.verb.text \\\n", "0 By combining the power of AI with the flexibil... say \n", "\n", " action.verb.tense action.text action.normalized \n", "0 past said say " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response = natural_language_understanding.analyze(\n", " text='''\"By combining the power of AI with the flexibility and agility of \\\n", "hybrid cloud,” said Daniel Hernandez, general manager, Data and AI, IBM, “our \\\n", "clients are driving innovation and digitizing their operations at a fast pace.\"''',\n", " return_analyzed_text=True,\n", " features=nlu.Features(semantic_roles=nlu.SemanticRolesOptions())).get_result()\n", "dfs = tp.io.watson.nlu.parse_response(response)\n", "dfs[\"semantic_roles\"][dfs[\"semantic_roles\"][\"action.normalized\"] == \"say\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we change the past-tense verb \"said\" to the present-tense \"says\", we get the same result again:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject.textsentenceobject.textaction.verb.textaction.verb.tenseaction.textaction.normalized
3Daniel Hernandez, general manager, Data and AI...\"By combining the power of AI with the flexibi...By combining the power of AI with the flexibil...saypresentsayssay
\n", "
" ], "text/plain": [ " subject.text \\\n", "3 Daniel Hernandez, general manager, Data and AI... \n", "\n", " sentence \\\n", "3 \"By combining the power of AI with the flexibi... \n", "\n", " object.text action.verb.text \\\n", "3 By combining the power of AI with the flexibil... say \n", "\n", " action.verb.tense action.text action.normalized \n", "3 present says say " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response = natural_language_understanding.analyze(\n", " text='''\"By combining the power of AI with the flexibility and agility of \\\n", "hybrid cloud, our clients are driving innovation and digitizing their operations \\\n", "at a fast pace,\" says Daniel Hernandez, general manager, Data and AI, IBM.''',\n", " return_analyzed_text=True,\n", " features=nlu.Features(semantic_roles=nlu.SemanticRolesOptions())).get_result()\n", "dfs = tp.io.watson.nlu.parse_response(response)\n", "dfs[\"semantic_roles\"][dfs[\"semantic_roles\"][\"action.normalized\"] == \"say\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All the different variations that we talked about earlier will produce the same result. This model lets us capture them all with very little code. All we need to do is to run the model and filter the outputs down to the verb we're looking for.\n", "\n", "So far we've been looking at one paragraph. Let's rerun the same process on the entire press release.\n", "\n", "## Finding instances of \"Someone Said Something\"\n", "\n", "As before, we can run the document through Watson Natural Language Understanding's Python interface and tell Watson to run its semantic_roles model. Then we use Text Extensions for Pandas to convert the model results to a DataFrame:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject.textsentenceobject.textaction.verb.textaction.verb.tenseaction.textaction.normalized
0IBM)ARMONK, N.Y., Dec. 2, 2020 /PRNewswire/ -- IBM...to the Leaders Category in the latest IDC Mark...namepasthas been namedhave be name
1The reportThe report evaluated vendors who offer tools ...vendors who offer tools and frameworks for dev...evaluatepastevaluatedevaluate
2vendorsThe report evaluated vendors who offer tools ...tools and frameworksofferpresentofferoffer
3by the IDC MarketScapeAs reported by the IDC MarketScape, IBM offer...IBM offers a wide range of innovative machine ...reportpastreportedreport
4innovative machineAs reported by the IDC MarketScape, IBM offer...capabilitieslearnpresentlearninglearn
\n", "
" ], "text/plain": [ " subject.text sentence \\\n", "0 IBM) ARMONK, N.Y., Dec. 2, 2020 /PRNewswire/ -- IBM... \n", "1 The report The report evaluated vendors who offer tools ... \n", "2 vendors The report evaluated vendors who offer tools ... \n", "3 by the IDC MarketScape As reported by the IDC MarketScape, IBM offer... \n", "4 innovative machine As reported by the IDC MarketScape, IBM offer... \n", "\n", " object.text action.verb.text \\\n", "0 to the Leaders Category in the latest IDC Mark... name \n", "1 vendors who offer tools and frameworks for dev... evaluate \n", "2 tools and frameworks offer \n", "3 IBM offers a wide range of innovative machine ... report \n", "4 capabilities learn \n", "\n", " action.verb.tense action.text action.normalized \n", "0 past has been named have be name \n", "1 past evaluated evaluate \n", "2 present offer offer \n", "3 past reported report \n", "4 present learning learn " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "DOC_URL = \"https://newsroom.ibm.com/2020-12-02-IBM-Named-a-Leader-in-the-2020-IDC-MarketScape-For-Worldwide-Advanced-Machine-Learning-Software-Platform\"\n", "\n", "# Make the request\n", "response = natural_language_understanding.analyze(\n", " url=DOC_URL, # NLU will fetch the URL for us.\n", " return_analyzed_text=True,\n", " features=nlu.Features(\n", " semantic_roles=nlu.SemanticRolesOptions()\n", " )).get_result()\n", "\n", "# Convert the output of the `semantic_roles` model to a DataFrame\n", "semantic_roles_df = tp.io.watson.nlu.parse_response(response)[\"semantic_roles\"]\n", "semantic_roles_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we filter down to the subject-verb-object triples for the verb \"to say\", we can see that this document has quite a few examples of the \"person says statement\" pattern:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject.textsentenceobject.textaction.verb.textaction.verb.tenseaction.textaction.normalized
15Daniel Hernandez, general manager, Data and AI...\"By combining the power of AI with the flexib...By combining the power of AI with the flexibil...saypastsaidsay
21Curren Katz, Director of Data Science R&D, Hig...\"At the beginning of the COVID-19 pandemic, H...At the beginning of the COVID-19 pandemic, Hig...saypastsaidsay
31Ritu Jyoti, program vice president, AI researc...Digital Transformation (DX) is one of the key...Digital Transformation (DX) is one of the key ...saypresentsayssay
\n", "
" ], "text/plain": [ " subject.text \\\n", "15 Daniel Hernandez, general manager, Data and AI... \n", "21 Curren Katz, Director of Data Science R&D, Hig... \n", "31 Ritu Jyoti, program vice president, AI researc... \n", "\n", " sentence \\\n", "15 \"By combining the power of AI with the flexib... \n", "21 \"At the beginning of the COVID-19 pandemic, H... \n", "31 Digital Transformation (DX) is one of the key... \n", "\n", " object.text action.verb.text \\\n", "15 By combining the power of AI with the flexibil... say \n", "21 At the beginning of the COVID-19 pandemic, Hig... say \n", "31 Digital Transformation (DX) is one of the key ... say \n", "\n", " action.verb.tense action.text action.normalized \n", "15 past said say \n", "21 past said say \n", "31 present says say " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quotes_df = semantic_roles_df[semantic_roles_df[\"action.normalized\"] == \"say\"]\n", "quotes_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The DataFrame `quotes_df` contains all the instances of the \"person says statement\" pattern that the model has found. We want to filter this set down to cases where the subject (the person making the statement) is mentioned by name. We also want to extract that name.\n", "\n", "\n", "In this press release, all three instances of the \"person says statement\" pattern happen to have a name in the subject. But there will not always be a name. Consider this example sentence from [another IBM press release](https://newsroom.ibm.com/2021-04-08-IBM-Consumer-Study-Points-to-Potential-Recovery-of-Retail-and-Travel-Industries-as-Consumers-Receive-the-COVID-19-Vaccine):\n", "\n", "> 27 percent of Gen Z surveyed said they will increase outside \\\n", "interaction, compared to 19 percent of Gen X surveyed and only 16 percent of \\\n", "those surveyed over 55.\n", "\n", "Here, the subject for the verb \"said\" is \"27 percent of Gen Z surveyed\". That subject that does not include a person name." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject.textsentenceobject.textaction.verb.textaction.verb.tenseaction.textaction.normalized
027 percent of Gen Z surveyed27 percent of Gen Z surveyed said they will in...they will increase outside interaction, compar...saypastsaidsay
\n", "
" ], "text/plain": [ " subject.text \\\n", "0 27 percent of Gen Z surveyed \n", "\n", " sentence \\\n", "0 27 percent of Gen Z surveyed said they will in... \n", "\n", " object.text action.verb.text \\\n", "0 they will increase outside interaction, compar... say \n", "\n", " action.verb.tense action.text action.normalized \n", "0 past said say " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Do not include this cell in the blog.\n", "\n", "# Show that the `semantic_roles` model produces the output we described above.\n", "response = natural_language_understanding.analyze(\n", " text='''27 percent of Gen Z surveyed said they will increase outside \\\n", "interaction, compared to 19 percent of Gen X surveyed and only 16 percent of \\\n", "those surveyed over 55.''',\n", " return_analyzed_text=True,\n", " features=nlu.Features(semantic_roles=nlu.SemanticRolesOptions())).get_result()\n", "\n", "# Convert the output of the `semantic_roles` model to a DataFrame\n", "tp.io.watson.nlu.parse_response(response)[\"semantic_roles\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finding places where a person is mentioned by name\n", "\n", "How can we find the matches where the subject contains a person's name? Fortunately for us, Watson Natural Language Understanding has a model for exactly that task. The `entities` model in this Watson service finds named entity mentions. A named entity mention is a place where the document mentions an *entity* like a person or company by the entity's *name*.\n", "\n", "This model will find person names with high accuracy. The code below tells the Watson service to run the entities model and retrieve mentions. Then we convert the result to a DataFrame using Text Extensions for Pandas:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typetextspanconfidence
0OrganizationIDC MarketScape[112, 127): 'IDC MarketScape'0.466973
1OrganizationIDC MarketScape[383, 398): 'IDC MarketScape'0.753796
2OrganizationIDC MarketScape[956, 971): 'IDC MarketScape'0.664680
3OrganizationIDC MarketScape[1346, 1361): 'IDC MarketScape'0.677499
4OrganizationIDC MarketScape[3786, 3801): 'IDC MarketScape'0.524242
...............
49OrganizationAI[2512, 2514): 'AI'0.514581
50OrganizationICT[3534, 3537): 'ICT'0.691880
51JobTitletelecommunications vendors[3997, 4023): 'telecommunications vendors'0.259333
52PersonTyler Allen[4213, 4224): 'Tyler Allen'0.964611
53EmailAddresstballen@us.ibm.com[4248, 4266): 'tballen@us.ibm.com'0.800000
\n", "

54 rows × 4 columns

\n", "
" ], "text/plain": [ " type text span confidence\n", "0 Organization IDC MarketScape [112, 127): 'IDC MarketScape' 0.466973\n", "1 Organization IDC MarketScape [383, 398): 'IDC MarketScape' 0.753796\n", "2 Organization IDC MarketScape [956, 971): 'IDC MarketScape' 0.664680\n", "3 Organization IDC MarketScape [1346, 1361): 'IDC MarketScape' 0.677499\n", "4 Organization IDC MarketScape [3786, 3801): 'IDC MarketScape' 0.524242" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.options.display.max_rows = 30 # Keep the output of this cell compact\n", "response = natural_language_understanding.analyze(\n", " url=DOC_URL,\n", " return_analyzed_text=True,\n", " features=nlu.Features(\n", " # Ask Watson to find mentions of named entities\n", " entities=nlu.EntitiesOptions(mentions=True),\n", " \n", " # Also divide the document into words. We'll use these in just a moment.\n", " syntax=nlu.SyntaxOptions(tokens=nlu.SyntaxOptionsTokens()),\n", " )).get_result()\n", "entity_mentions_df = tp.io.watson.nlu.parse_response(response)[\"entity_mentions\"]\n", "entity_mentions_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `entities` model's output contains mentions of many types of entity. For this application, we need\n", "mentions of person names. Let's filter our DataFrame down to just those types of mentions:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typetextspanconfidence
31PersonIBM Watson[1915, 1925): 'IBM Watson'0.364448
34PersonRitu Jyoti[2476, 2486): 'Ritu Jyoti'0.959464
39PersonWatson[2891, 2897): 'Watson'0.933148
40PersonWatson[3060, 3066): 'Watson'0.988052
\n", "
" ], "text/plain": [ " type text span confidence\n", "31 Person IBM Watson [1915, 1925): 'IBM Watson' 0.364448\n", "34 Person Ritu Jyoti [2476, 2486): 'Ritu Jyoti' 0.959464\n", "39 Person Watson [2891, 2897): 'Watson' 0.933148\n", "40 Person Watson [3060, 3066): 'Watson' 0.988052" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "person_mentions_df = entity_mentions_df[entity_mentions_df[\"type\"] == \"Person\"]\n", "person_mentions_df.tail(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tying it all together\n", "\n", "Now we have two pieces of information that we need to combine:\n", "* Instances of the \"person said statement\" pattern from the `semantic_roles` model\n", "* Mentions of person names from the `entities` model\n", "\n", "We need to align the \"subject\" part of the semantic role labeler's output with the person mentions. We can use the span manipulation facilities of Text Extensions for Pandas to do this.\n", "\n", "*Spans* are a common concept in natural language processing. A span represents a region of the document, usually as begin and end offsets and a reference to the document's text. Text Extensions for Pandas adds a special `SpanDtype` data type to Pandas DataFrames. With this data type, you can define a DataFrame with one or more columns of span data. For example, the column called \"span\" in the DataFrame above is of the `SpanDtype` data type. The first span in this column, `[1288, 1304): 'Daniel Hernandez'`, shows that the name \"Daniel Hernandez\" occurs between locations 1288 and 1304 in the document.\n", "\n", "The output of the `semantic_roles` model doesn't contain location information. But that's ok, because it's easy to create your own spans. We just need to use some string matching to recover the missing locations:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject.textspan
0Daniel Hernandez, general manager, Data and AI...[1288, 1339): 'Daniel Hernandez, general manag...
1Curren Katz, Director of Data Science R&D, Hig...[1838, 1896): 'Curren Katz, Director of Data S...
2Ritu Jyoti, program vice president, AI researc...[2476, 2581): 'Ritu Jyoti, program vice presid...
\n", "
" ], "text/plain": [ " subject.text \\\n", "0 Daniel Hernandez, general manager, Data and AI... \n", "1 Curren Katz, Director of Data Science R&D, Hig... \n", "2 Ritu Jyoti, program vice president, AI researc... \n", "\n", " span \n", "0 [1288, 1339): 'Daniel Hernandez, general manag... \n", "1 [1838, 1896): 'Curren Katz, Director of Data S... \n", "2 [2476, 2581): 'Ritu Jyoti, program vice presid... " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Retrieve the full document text from the entity mentions output.\n", "doc_text = entity_mentions_df[\"span\"].array.document_text\n", "\n", "# Filter down to just the rows and columns we're interested in\n", "subjects_df = quotes_df[[\"subject.text\"]].copy().reset_index(drop=True)\n", "\n", "# Use String.index() to find where the strings in \"subject.text\" begin\n", "subjects_df[\"begin\"] = pd.Series(\n", " [doc_text.index(s) for s in subjects_df[\"subject.text\"]], dtype=int)\n", "\n", "# Compute end offsets and wrap the triples in a SpanArray\n", "subjects_df[\"end\"] = subjects_df[\"begin\"] + subjects_df[\"subject.text\"].str.len()\n", "subjects_df[\"span\"] = tp.SpanArray(doc_text, subjects_df[\"begin\"], \n", " subjects_df[\"end\"])\n", "subjects_df = subjects_df.drop(columns=[\"begin\", \"end\"])\n", "subjects_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have a column of span data for the `semantic_roles` model's output, and we can align these spans with the spans of person mentions. Text Extensions for Pandas includes built-in span operations. One of these operations, `contain_join()`, takes two columns of span data and identifies all pairs of spans where the first span contains the second span. We can use this operation to find all the places where the span from the `semantic_roles` model contains a span from the output of the `entities` model: " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subjectperson
0[1288, 1339): 'Daniel Hernandez, general manag...[1288, 1304): 'Daniel Hernandez'
1[1838, 1896): 'Curren Katz, Director of Data S...[1838, 1849): 'Curren Katz'
2[2476, 2581): 'Ritu Jyoti, program vice presid...[2476, 2486): 'Ritu Jyoti'
\n", "
" ], "text/plain": [ " subject \\\n", "0 [1288, 1339): 'Daniel Hernandez, general manag... \n", "1 [1838, 1896): 'Curren Katz, Director of Data S... \n", "2 [2476, 2581): 'Ritu Jyoti, program vice presid... \n", "\n", " person \n", "0 [1288, 1304): 'Daniel Hernandez' \n", "1 [1838, 1849): 'Curren Katz' \n", "2 [2476, 2486): 'Ritu Jyoti' " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "execs_df = tp.spanner.contain_join(subjects_df[\"span\"], \n", " person_mentions_df[\"span\"],\n", " \"subject\", \"person\")\n", "execs_df[[\"subject\", \"person\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To recap: With a few lines of Python code, we've identified places in the article where the article quoted a person by name. For each of those quotations, we've identified the person name and its location in the document (the `person` column in the DataFrame above).\n", "\n", "### Combining Code Into One Function\n", "\n", "Here's all the code we've just created, condensed down to a single Python function:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# In the blog post, this will be a Github gist.\n", "# See https://gist.github.com/frreiss/038ac63ef20eed323a5637f9ddb2de8d\n", "\n", "import pandas as pd\n", "import text_extensions_for_pandas as tp\n", "import ibm_watson\n", "import ibm_watson.natural_language_understanding_v1 as nlu\n", "import ibm_cloud_sdk_core\n", "\n", "def find_persons_quoted_by_name(doc_url, api_key, service_url) -> pd.DataFrame:\n", " # Ask Watson Natural Language Understanding to run its \"semantic_roles\"\n", " # and \"entities\" models.\n", " natural_language_understanding = ibm_watson.NaturalLanguageUnderstandingV1(\n", " version=\"2021-01-01\",\n", " authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key)\n", " )\n", " natural_language_understanding.set_service_url(service_url)\n", " nlu_results = natural_language_understanding.analyze(\n", " url=doc_url,\n", " return_analyzed_text=True,\n", " features=nlu.Features(\n", " entities=nlu.EntitiesOptions(mentions=True),\n", " semantic_roles=nlu.SemanticRolesOptions())).get_result()\n", " \n", " # Convert the output of Watson Natural Language Understanding to DataFrames.\n", " dataframes = tp.io.watson.nlu.parse_response(nlu_results)\n", " entity_mentions_df = dataframes[\"entity_mentions\"]\n", " semantic_roles_df = dataframes[\"semantic_roles\"]\n", " \n", " # Extract mentions of person names\n", " person_mentions_df = entity_mentions_df[entity_mentions_df[\"type\"] == \"Person\"]\n", " \n", " # Extract instances of subjects that made statements\n", " quotes_df = semantic_roles_df[semantic_roles_df[\"action.normalized\"] == \"say\"]\n", " subjects_df = quotes_df[[\"subject.text\"]].copy().reset_index(drop=True)\n", " \n", " # Retrieve the full document text from the entity mentions output.\n", " doc_text = entity_mentions_df[\"span\"].array.document_text\n", "\n", " # Filter down to just the rows and columns we're interested in\n", " subjects_df = quotes_df[[\"subject.text\"]].copy().reset_index(drop=True)\n", "\n", " # Use String.index() to find where the strings in \"subject.text\" begin\n", " subjects_df[\"begin\"] = pd.Series(\n", " [doc_text.index(s) for s in subjects_df[\"subject.text\"]], dtype=int)\n", "\n", " # Compute end offsets and wrap the triples in a SpanArray column\n", " subjects_df[\"end\"] = subjects_df[\"begin\"] + subjects_df[\"subject.text\"].str.len()\n", " 
subjects_df[\"span\"] = tp.SpanArray(doc_text, subjects_df[\"begin\"], subjects_df[\"end\"])\n", "\n", " # Align subjects with person names\n", " execs_df = tp.spanner.contain_join(subjects_df[\"span\"], \n", " person_mentions_df[\"span\"],\n", " \"subject\", \"person\")\n", " # Add on the document URL.\n", " execs_df[\"url\"] = doc_url\n", " return execs_df[[\"person\", \"url\"]]\n", " " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
personurl
0[1288, 1304): 'Daniel Hernandez'https://newsroom.ibm.com/2020-12-02-IBM-Named-...
1[1838, 1849): 'Curren Katz'https://newsroom.ibm.com/2020-12-02-IBM-Named-...
2[2476, 2486): 'Ritu Jyoti'https://newsroom.ibm.com/2020-12-02-IBM-Named-...
\n", "
" ], "text/plain": [ " person \\\n", "0 [1288, 1304): 'Daniel Hernandez' \n", "1 [1838, 1849): 'Curren Katz' \n", "2 [2476, 2486): 'Ritu Jyoti' \n", "\n", " url \n", "0 https://newsroom.ibm.com/2020-12-02-IBM-Named-... \n", "1 https://newsroom.ibm.com/2020-12-02-IBM-Named-... \n", "2 https://newsroom.ibm.com/2020-12-02-IBM-Named-... " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Don't include this cell in the blog post.\n", "\n", "# Verify that the code above works\n", "find_persons_quoted_by_name(DOC_URL, api_key, service_url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calling the Function on Many Documents\n", "\n", "This function, `find_persons_quoted_by_name()`, turns a press release into a list of executive names. Here's the output that we get if we pass a year's worth articles from the [\"Announcements\" section of ibm.com](https://newsroom.ibm.com/announcements) through it:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Don't include this cell in the blog post.\n", "\n", "# Load press release URLs from a file\n", "with open(\"ibm_press_releases.txt\", \"r\") as f:\n", " lines = [l.strip() for l in f.readlines()]\n", " ibm_press_release_urls = [l for l in lines if len(l) > 0 and l[0] != \"#\"]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
personurl
0[1201, 1215): 'Wendi Whitmore'https://newsroom.ibm.com/2020-02-11-IBM-X-Forc...
0[1281, 1292): 'Rob DiCicco'https://newsroom.ibm.com/2020-02-18-IBM-Study-...
0[1213, 1229): 'Christoph Herman'https://newsroom.ibm.com/2020-02-19-IBM-Power-...
1[2227, 2242): 'Stephen Leonard'https://newsroom.ibm.com/2020-02-19-IBM-Power-...
0[2068, 2076): 'Bob Lord'https://newsroom.ibm.com/2020-02-26-2020-Call-...
.........
0[3114, 3124): 'Mike Doran'https://newsroom.ibm.com/2021-01-25-OVHcloud-t...
0[3155, 3169): 'Howard Boville'https://newsroom.ibm.com/2021-01-26-Luminor-Ba...
0[3114, 3137): 'Samuel Brack Co-Founder'https://newsroom.ibm.com/2021-01-26-DIA-Levera...
1[3509, 3523): 'Hillery Hunter'https://newsroom.ibm.com/2021-01-26-DIA-Levera...
0[1487, 1497): 'Ana Zamper'https://newsroom.ibm.com/2021-01-26-Latin-Amer...
\n", "

314 rows × 2 columns

\n", "
" ], "text/plain": [ " person \\\n", "0 [1201, 1215): 'Wendi Whitmore' \n", "0 [1281, 1292): 'Rob DiCicco' \n", "0 [1213, 1229): 'Christoph Herman' \n", "1 [2227, 2242): 'Stephen Leonard' \n", "0 [2068, 2076): 'Bob Lord' \n", ".. ... \n", "0 [3114, 3124): 'Mike Doran' \n", "0 [3155, 3169): 'Howard Boville' \n", "0 [3114, 3137): 'Samuel Brack Co-Founder' \n", "1 [3509, 3523): 'Hillery Hunter' \n", "0 [1487, 1497): 'Ana Zamper' \n", "\n", " url \n", "0 https://newsroom.ibm.com/2020-02-11-IBM-X-Forc... \n", "0 https://newsroom.ibm.com/2020-02-18-IBM-Study-... \n", "0 https://newsroom.ibm.com/2020-02-19-IBM-Power-... \n", "1 https://newsroom.ibm.com/2020-02-19-IBM-Power-... \n", "0 https://newsroom.ibm.com/2020-02-26-2020-Call-... \n", ".. ... \n", "0 https://newsroom.ibm.com/2021-01-25-OVHcloud-t... \n", "0 https://newsroom.ibm.com/2021-01-26-Luminor-Ba... \n", "0 https://newsroom.ibm.com/2021-01-26-DIA-Levera... \n", "1 https://newsroom.ibm.com/2021-01-26-DIA-Levera... \n", "0 https://newsroom.ibm.com/2021-01-26-Latin-Amer... \n", "\n", "[314 rows x 2 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "executive_names = pd.concat([\n", " find_persons_quoted_by_name(url, api_key, service_url) \n", " for url in ibm_press_release_urls\n", "])\n", "executive_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we've turned 191 press releases into a DataFrame with 301 executive names (**EDIT:** 314 names with the latest version of Watson Natural Language Understanding, as of October 2021).\n", "That's a lot of power packed into one screen's worth of code! To find out more about the advanced semantic models that let us do so much with so little code, check out Watson Natural Language Understanding [here](https://www.ibm.com/cloud/watson-natural-language-understanding?cm_mmc=open_source_technology)!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subject.textspan
0Daniel Hernandez, general manager, Data and AI...[1288, 1339): 'Daniel Hernandez, general manag...
1Curren Katz, Director of Data Science R&D, Hig...[1838, 1896): 'Curren Katz, Director of Data S...
2Ritu Jyoti, program vice president, AI researc...[2476, 2581): 'Ritu Jyoti, program vice presid...
\n", "
" ], "text/plain": [ " subject.text \\\n", "0 Daniel Hernandez, general manager, Data and AI... \n", "1 Curren Katz, Director of Data Science R&D, Hig... \n", "2 Ritu Jyoti, program vice president, AI researc... \n", "\n", " span \n", "0 [1288, 1339): 'Daniel Hernandez, general manag... \n", "1 [1838, 1896): 'Curren Katz, Director of Data S... \n", "2 [2476, 2581): 'Ritu Jyoti, program vice presid... " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Alternate version of adding spans to subjecs: Use dictionary matching.\n", "# This method is currently problematic because we don't have payloads\n", "# for dictionary entries. We have to use exact string matching to map the\n", "# original strings back to the dictionary matches.\n", "\n", "# Create a dictionary from the strings in quotes_df[\"subject.text\"].\n", "tokenizer = tp.io.spacy.simple_tokenizer()\n", "dictionary = tp.spanner.extract.create_dict(quotes_df[\"subject.text\"], tokenizer)\n", "\n", "# Match the dictionary against the document text.\n", "doc_text = entity_mentions_df[\"span\"].array.document_text\n", "tokens = tp.io.spacy.make_tokens(doc_text, tokenizer)\n", "matches_df = tp.spanner.extract_dict(tokens, dictionary, output_col_name=\"span\")\n", "matches_df[\"subject.text\"] = matches_df[\"span\"].array.covered_text # Join key\n", "\n", "# Merge the dictionary matches back with the original strings.\n", "subjects_df = quotes_df[[\"subject.text\"]].merge(matches_df)\n", "subjects_df" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }