{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src='https://spacy.io/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg'><center><span style=\"font-size:10px;\">Source: <a href=\"https://spacy.io/usage/processing-pipelines\">spaCy Language Processing Pipelines</a></span></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Quick and Dirty - Entity Extraction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span style=\"color:gray;font-size:20px;\">From idea to prototype in AI.</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<em>If you've ever been around a startup or in the tech world for any significant amount of time, you've <strong>definitely</strong> encountered some, if not all of the following phrases: \"agile software development\", \"prototyping\", \"feedback loop\", \"rapid iteration\", etc.</em>\n",
    "\n",
    "<em>This Silicon Valley techno-babble can be distilled down to one simple concept, which just so happens to be the mantra of many a successful entrepreneur: test out your idea as quickly as possible, and then make it better over time. Stated more verbosely, before you invest mind and money into creating a cutting-edge solution to a problem, it might benefit you to get a baseline performance for your task using off-the-shelf techniques. Once you establish the efficacy of a low-cost, easy approach, you can then put on your Elon Musk hat and drive towards #innovation and #disruption.</em> \n",
    "\n",
    "<em>A concrete example might help illustrate this point:</em>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Entity Extraction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's say our goal was to create a natural language system that effectively allowed someone to converse with an academic paper. This task could be step one of many towards the development of an automated scientific discovery tool. Society can thank us later. \n",
    "\n",
    "But where do we begin? Well, a part of the solution has to deal with [knowledge extraction](https://en.wikipedia.org/wiki/Knowledge_extraction). In order to create a conversational engine that understands scientific papers, we'll first need to develop an entity recognition module, and this, lucky for us, is the topic of our notebook! \n",
    "\n",
    "\"What's an entity?\" you ask? Excellent question. Take a look at the following sentence:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> Dr. Abraham is the primary author of this paper, and a physician in the specialty of internal medicine."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, it should be relatively straighforward for an English-speaking human to pick out the important concepts in this sentence:\n",
    "\n",
    "> **[Dr. Abraham]** is the **[primary author]** of this **[paper]**, and a **[physician]** in the **[specialty]** of **[internal medicine]**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These words and/or phrases are categorized as \"entities\" because they represent salient ideas, nouns, and noun phrases in the real world. A subset of entities can be \"named\", in that they correspond to <strong><em>specific</em></strong> places, people, organizations, and so on. A [named entity](https://en.wikipedia.org/wiki/Named_entity) is to a regular entity, what \"Dr. Abraham\" is to a \"physician\". The good doctor is a real person and an instance of the \"physician\" class, and is therefore considered \"named\". Examples of named entities include \"Google\", \"Neil DeGrasse Tyson\", and \"Tokyo\", while regular, garden-variety entities can include the list just mentioned, as well as things like \"dog\", \"newspaper\", \"task\", etc.\n",
    "\n",
    "Let's see if we can get a computer to run this kind of analysis to pull important concepts from sentences. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The Task"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For our conversational academic paper program, we won't be satisfied with simply capturing named entities, because we need to understand the relationships between general concepts as well as actual things, places, etc. Unfortunately, while most out-of-the-box text processing libraries have a moderately useful <strong>named entity recognizer</strong>, they have little to no support for a generalized <strong>entity recognizer</strong>. \n",
    "\n",
    "This is because of a subtle, yet important constraint. \n",
    "\n",
    "Entities, as we've discussed, correspond to a superset of named entities, which <strong><em>should</em></strong> make them easier to extract. Indeed, blindly pulling all entities from a text source is in fact simple, but it's sadly not all that useful. In order to justify this exercise, we'd need to develop an entity extraction approach that is restricted to, or is cognizant of, some particular domain, for example, neuroscience, psychology, computer science, economics, etc. This paradoxical complexity makes it nontrivial to create a generic, but useful, entity recognizer. Hence the lack of support in most open-source libraries that deal with natural language processing. \n",
    "\n",
    "To largely simplify our task then, we must generate a set of entities from a scientific paper, that is <strong><em>larger</em></strong> than a simple list of named entities, but <strong><em>smaller</em></strong> than the giant list of all entities, restricted to the domain of a particular paper in question. \n",
    "\n",
    "Yikes. Are you sweating a little? Because I am. Instead of reaching for some Ibuprofen and deep learning pills, let's make a prototype using a little ingenuity, simple open-source code, and a lot of heuristics. Hopefully, through this process, we'll also learn a bit about the text processing pipeline that brings understanding natural language into the realm of the possible. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Enought chit-chat. Let's get to it!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<em><strong>Fun fact</strong>: Curious about what 'autoreload' does? <a href=\"https://ipython.org/ipython-doc/3/config/extensions/autoreload.html\">Check this out</a>.</em>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import spacy\n",
    "from spacy.displacy.render import EntityRenderer\n",
    "from IPython.core.display import display, HTML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Utils and Prep"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's do some basic housekeeping before we start diving headfirst into entity extraction. We'll need to deal with visualization, load up a language model, and of course, examine/set-up our data source.\n",
    "\n",
    "### Show and Tell\n",
    "Our prototype will lean heavily on a popular natural langauge processing (NLP) library known as spaCy, which also has a wonderful set of classes and methods defined to help visualize parts of the NLP pipeline. Up top, where we've imported modules, you'll have noticed that we're pulling 'EntityRenderer' from spaCy's displacy module, as we'll be repurposing some of this code for our... um... purposes. In general, this is a good exercise if you ever want to get your hands dirty and really learn how certain classes work in your friendly neighborhood open-source projects. Nothing should ever be off-limits or a black box; always dissect and play with your code before you eat it.  \n",
    "\n",
    "Wander on over to spaCy's [website](https://spacy.io/), and you'll quickly discover that they've put in some serious thought into making the user interface absolutely gorgeous. (While Matthew undeniably had some input on this, I'm going to make an intelligent assumption that the design ideas are probably Ines' [contribution](https://explosion.ai/about)). \n",
    "\n",
    "<em><strong>&lt;rant&gt;</strong> Why spend so much time discussing visualization? Well, one of my biggest pet peeves is this: even if you can create a product, if you don't put in the time to make it look beautiful, or delightful to use, then you don't care about packaging your ideas for export to an audience. And that makes me sad. Once you get something working, make it pretty. <strong>&lt;/rant&gt;</strong></em>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def custom_render(doc, df, column, options={}, page=False, minify=False, idx=0):\n",
    "    \"\"\"Overload the spaCy built-in rendering to allow custom part-of-speech (POS) tags.\n",
    "    \n",
    "    Keyword arguments:\n",
    "    doc -- a spaCy nlp doc object\n",
    "    df -- a pandas dataframe object\n",
    "    column -- the name of of a column of interest in the dataframe\n",
    "    options -- various options to feed into the spaCy renderer, including colors\n",
    "    page -- rendering markup as full HTML page (default False)\n",
    "    minify -- for compact HTML (default False)\n",
    "    idx -- index for specific query or doc in dataframe (default 0)\n",
    "    \n",
    "    \"\"\"\n",
    "    renderer, converter = EntityRenderer, parse_custom_ents\n",
    "    renderer = renderer(options=options)\n",
    "    parsed = [converter(doc, df=df, idx=idx, column=column)]\n",
    "    html = renderer.render(parsed, page=page, minify=minify).strip()  \n",
    "    return display(HTML(html))\n",
    "\n",
    "def parse_custom_ents(doc, df, idx, column):\n",
    "    \"\"\"Parse custom entity types that aren't in the original spaCy module.\n",
    "    \n",
    "    Keyword arguments:\n",
    "    doc -- a spaCy nlp doc object\n",
    "    df -- a pandas dataframe object\n",
    "    idx -- index for specific query or doc in dataframe\n",
    "    column -- the name of of a column of interest in the dataframe\n",
    "    \n",
    "    \"\"\"\n",
    "    if column in df.columns:\n",
    "        entities = df[column][idx]\n",
    "        ents = [{'start': ent[1], 'end': ent[2], 'label': ent[3]} \n",
    "                for ent in entities]\n",
    "    else:\n",
    "        ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}\n",
    "            for ent in doc.ents]\n",
    "    return {'text': doc.text, 'ents': ents, 'title': None}\n",
    "\n",
    "def render_entities(idx, df, options={}, column='named_ents'):\n",
    "    \"\"\"A wrapper function to get text from a dataframe and render it visually in jupyter notebooks\n",
    "    \n",
    "    Keyword arguments:\n",
    "    idx -- index for specific query or doc in dataframe (default 0)\n",
    "    df -- a pandas dataframe object\n",
    "    options -- various options to feed into the spaCy renderer, including colors\n",
    "    column -- the name of of a column of interest in the dataframe (default 'named_ents')\n",
    "    \n",
    "    \"\"\"\n",
    "    text = df['text'][idx]\n",
    "    custom_render(nlp(text), df=df, column=column, options=options, idx=idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {},
   "outputs": [],
   "source": [
    "# colors for additional part of speech tags we want to visualize\n",
    "options = {\n",
    "    'colors': {'COMPOUND': '#FE6BFE', 'PROPN': '#18CFE6', 'NOUN': '#18CFE6', 'NP': '#1EECA6', 'ENTITY': '#FF8800'}\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.set_option('display.max_rows', 10) # edit how jupyter will render our pandas dataframes\n",
    "pd.options.mode.chained_assignment = None # prevent warning about working on a copy of a dataframe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "spaCy's pre-built models are trained on different corpora of text, to capture parts-of-speech, extract named entities, and in general understand how to tokenize words into chunks that have meaning in a given language. \n",
    "\n",
    "We'll grab the 'en_core_web_lg' model by running the following command in the shell (comment it out once you've run it so you don't keep downloading it every time you go through the notebook). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !python -m spacy download en_core_web_lg\n",
    "nlp = spacy.load('en_core_web_lg')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<em><strong>Fun fact</strong>: We can run shell commands in a Jupyter notebook by using the bang operator. This is an example of a <a href=\"https://ipython.readthedocs.io/en/stable/interactive/magics.html\">magic</a> command, of which we saw an example at the begnning with '%autoreload'.</em>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gather Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As our data source, we'll be using papers presented at the [Neural Information Processing Systems (NIPS)](https://nips.cc/) conference held in a different location around the world each year. NIPS is the premier conference for all things machine learning, and considering our goal with this notebook, is an apropos choice to source our data. We'll pull a conveniently packaged dataset from [Kaggle](https://www.kaggle.com/benhamner/nips-2015-papers/version/2/home), a data science competition site, and then work with a subset of the papers to keep our prototyping as lean and fast as possible. \n",
    "\n",
    "Once we've grabbed the files using Kaggle's [API](https://github.com/Kaggle/kaggle-api), we'll take a look at what we're working with. Let's store everything in a separate 'data' folder to keep our directory clean. I've discarded all extra files and renamed the essential one to 'nips.csv'. You'll see a few other files in there, but ignore them for now. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "PATH = './data/'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "freq_words.csv nips.csv\r\n"
     ]
    }
   ],
   "source": [
    "!ls {PATH}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<em><strong>Fun fact</strong>: You can use python variables in shell commands by nesting them inside curly braces.</em>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "file = 'nips.csv'\n",
    "df = pd.read_csv(f'{PATH}{file}')\n",
    "\n",
    "mini_df = df[:10]\n",
    "mini_df.index = pd.RangeIndex(len(mini_df.index))\n",
    "\n",
    "# comment this out to run on full dataset\n",
    "df = mini_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Game Plan"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we're all ready to get started, let's come up with a general list of tasks to to guide our approach. \n",
    "\n",
    "<br>\n",
    "<ol>\n",
    "    <strong><li>Inspect and clean data</li></strong>\n",
    "    <strong><li>Extract named entities</li></strong>\n",
    "    <strong><li>Extract nouns</li></strong>\n",
    "    <strong><li>Combine named entities and nouns</li></strong>\n",
    "    <strong><li>Extract noun phrases</li></strong>\n",
    "    <strong><li>Extract compound noun phrases</li></strong>\n",
    "    <strong><li>Combine entities and compound noun phrases</li></strong>\n",
    "    <strong><li>Reduce entity count with heuristics</li></strong>\n",
    "    <strong><li>Celebrate with excessive fist-pumping</li></strong>\n",
    "</ol>\n",
    "\n",
    "That doesn't look too bad now does it? Let's build ourselves a prototype entity extractor."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Inspect and clean data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Id</th>\n",
       "      <th>Title</th>\n",
       "      <th>EventType</th>\n",
       "      <th>PdfName</th>\n",
       "      <th>Abstract</th>\n",
       "      <th>PaperText</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5677</td>\n",
       "      <td>Double or Nothing: Multiplicative Incentive Me...</td>\n",
       "      <td>Poster</td>\n",
       "      <td>5677-double-or-nothing-multiplicative-incentiv...</td>\n",
       "      <td>Crowdsourcing has gained immense popularity in...</td>\n",
       "      <td>Double or Nothing: Multiplicative\\nIncentive M...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5941</td>\n",
       "      <td>Learning with Symmetric Label Noise: The Impor...</td>\n",
       "      <td>Spotlight</td>\n",
       "      <td>5941-learning-with-symmetric-label-noise-the-i...</td>\n",
       "      <td>Convex potential minimisation is the de facto ...</td>\n",
       "      <td>Learning with Symmetric Label Noise: The\\nImpo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6019</td>\n",
       "      <td>Algorithmic Stability and Uniform Generalization</td>\n",
       "      <td>Poster</td>\n",
       "      <td>6019-algorithmic-stability-and-uniform-general...</td>\n",
       "      <td>One of the central questions in statistical le...</td>\n",
       "      <td>Algorithmic Stability and Uniform Generalizati...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>6035</td>\n",
       "      <td>Adaptive Low-Complexity Sequential Inference f...</td>\n",
       "      <td>Poster</td>\n",
       "      <td>6035-adaptive-low-complexity-sequential-infere...</td>\n",
       "      <td>We develop a sequential low-complexity inferen...</td>\n",
       "      <td>Adaptive Low-Complexity Sequential Inference f...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5978</td>\n",
       "      <td>Covariance-Controlled Adaptive Langevin Thermo...</td>\n",
       "      <td>Poster</td>\n",
       "      <td>5978-covariance-controlled-adaptive-langevin-t...</td>\n",
       "      <td>Monte Carlo sampling for Bayesian posterior in...</td>\n",
       "      <td>Covariance-Controlled Adaptive Langevin\\nTherm...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5714</td>\n",
       "      <td>Robust Portfolio Optimization</td>\n",
       "      <td>Poster</td>\n",
       "      <td>5714-robust-portfolio-optimization.pdf</td>\n",
       "      <td>We propose a robust portfolio optimization app...</td>\n",
       "      <td>Robust Portfolio Optimization\\n\\nFang Han\\nDep...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>5937</td>\n",
       "      <td>Logarithmic Time Online Multiclass prediction</td>\n",
       "      <td>Spotlight</td>\n",
       "      <td>5937-logarithmic-time-online-multiclass-predic...</td>\n",
       "      <td>We study the problem of multiclass classificat...</td>\n",
       "      <td>Logarithmic Time Online Multiclass prediction\\...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>5802</td>\n",
       "      <td>Planar Ultrametrics for Image Segmentation</td>\n",
       "      <td>Poster</td>\n",
       "      <td>5802-planar-ultrametrics-for-image-segmentatio...</td>\n",
       "      <td>We study the problem of hierarchical clusterin...</td>\n",
       "      <td>Planar Ultrametrics for Image Segmentation\\n\\n...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>5776</td>\n",
       "      <td>Expressing an Image Stream with a Sequence of ...</td>\n",
       "      <td>Poster</td>\n",
       "      <td>5776-expressing-an-image-stream-with-a-sequenc...</td>\n",
       "      <td>We propose an approach for generating a sequen...</td>\n",
       "      <td>Expressing an Image Stream with a Sequence of\\...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>5814</td>\n",
       "      <td>Parallel Correlation Clustering on Big Graphs</td>\n",
       "      <td>Poster</td>\n",
       "      <td>5814-parallel-correlation-clustering-on-big-gr...</td>\n",
       "      <td>Given a similarity graph between items, correl...</td>\n",
       "      <td>Parallel Correlation Clustering on Big Graphs\\...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Id                                              Title  EventType  \\\n",
       "0  5677  Double or Nothing: Multiplicative Incentive Me...     Poster   \n",
       "1  5941  Learning with Symmetric Label Noise: The Impor...  Spotlight   \n",
       "2  6019   Algorithmic Stability and Uniform Generalization     Poster   \n",
       "3  6035  Adaptive Low-Complexity Sequential Inference f...     Poster   \n",
       "4  5978  Covariance-Controlled Adaptive Langevin Thermo...     Poster   \n",
       "5  5714                      Robust Portfolio Optimization     Poster   \n",
       "6  5937      Logarithmic Time Online Multiclass prediction  Spotlight   \n",
       "7  5802         Planar Ultrametrics for Image Segmentation     Poster   \n",
       "8  5776  Expressing an Image Stream with a Sequence of ...     Poster   \n",
       "9  5814      Parallel Correlation Clustering on Big Graphs     Poster   \n",
       "\n",
       "                                             PdfName  \\\n",
       "0  5677-double-or-nothing-multiplicative-incentiv...   \n",
       "1  5941-learning-with-symmetric-label-noise-the-i...   \n",
       "2  6019-algorithmic-stability-and-uniform-general...   \n",
       "3  6035-adaptive-low-complexity-sequential-infere...   \n",
       "4  5978-covariance-controlled-adaptive-langevin-t...   \n",
       "5             5714-robust-portfolio-optimization.pdf   \n",
       "6  5937-logarithmic-time-online-multiclass-predic...   \n",
       "7  5802-planar-ultrametrics-for-image-segmentatio...   \n",
       "8  5776-expressing-an-image-stream-with-a-sequenc...   \n",
       "9  5814-parallel-correlation-clustering-on-big-gr...   \n",
       "\n",
       "                                            Abstract  \\\n",
       "0  Crowdsourcing has gained immense popularity in...   \n",
       "1  Convex potential minimisation is the de facto ...   \n",
       "2  One of the central questions in statistical le...   \n",
       "3  We develop a sequential low-complexity inferen...   \n",
       "4  Monte Carlo sampling for Bayesian posterior in...   \n",
       "5  We propose a robust portfolio optimization app...   \n",
       "6  We study the problem of multiclass classificat...   \n",
       "7  We study the problem of hierarchical clusterin...   \n",
       "8  We propose an approach for generating a sequen...   \n",
       "9  Given a similarity graph between items, correl...   \n",
       "\n",
       "                                           PaperText  \n",
       "0  Double or Nothing: Multiplicative\\nIncentive M...  \n",
       "1  Learning with Symmetric Label Noise: The\\nImpo...  \n",
       "2  Algorithmic Stability and Uniform Generalizati...  \n",
       "3  Adaptive Low-Complexity Sequential Inference f...  \n",
       "4  Covariance-Controlled Adaptive Langevin\\nTherm...  \n",
       "5  Robust Portfolio Optimization\\n\\nFang Han\\nDep...  \n",
       "6  Logarithmic Time Online Multiclass prediction\\...  \n",
       "7  Planar Ultrametrics for Image Segmentation\\n\\n...  \n",
       "8  Expressing an Image Stream with a Sequence of\\...  \n",
       "9  Parallel Correlation Clustering on Big Graphs\\...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "lower = lambda x: x.lower() # make everything lowercase"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>crowdsourcing has gained immense popularity in...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>convex potential minimisation is the de facto ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one of the central questions in statistical le...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>we develop a sequential low-complexity inferen...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>monte carlo sampling for bayesian posterior in...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>we propose a robust portfolio optimization app...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>we study the problem of multiclass classificat...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>we study the problem of hierarchical clusterin...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>we propose an approach for generating a sequen...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>given a similarity graph between items, correl...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text\n",
       "0  crowdsourcing has gained immense popularity in...\n",
       "1  convex potential minimisation is the de facto ...\n",
       "2  one of the central questions in statistical le...\n",
       "3  we develop a sequential low-complexity inferen...\n",
       "4  monte carlo sampling for bayesian posterior in...\n",
       "5  we propose a robust portfolio optimization app...\n",
       "6  we study the problem of multiclass classificat...\n",
       "7  we study the problem of hierarchical clusterin...\n",
       "8  we propose an approach for generating a sequen...\n",
       "9  given a similarity graph between items, correl..."
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df = pd.DataFrame(df['Abstract'].apply(lower))\n",
    "df.columns = ['text']\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initially, there was quite a bit of metadata associated with each entry, including a unique identifier, the type of paper presented at the conference, as well as the actual paper text. After pulling out just the abstracts, we've now ended up with with a clean, read-to-go dataframe, and are ready to begin extracting entities. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Extract named entities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_named_ents(text):\n",
    "    \"\"\"Extract named entities, and beginning, middle and end idx using spaCy's out-of-the-box model. \n",
    "    \n",
    "    Keyword arguments:\n",
    "    text -- the actual text source from which to extract entities\n",
    "    \n",
    "    \"\"\"\n",
    "    return [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]\n",
    "\n",
    "def add_named_ents(df):\n",
    "    \"\"\"Create new column in data frame with named entity tuple extracted.\n",
    "    \n",
    "    Keyword arguments:\n",
    "    df -- a dataframe object\n",
    "    \n",
    "    \"\"\"\n",
    "    df['named_ents'] = df['text'].apply(extract_named_ents)    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>named_ents</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>crowdsourcing has gained immense popularity in...</td>\n",
       "      <td>[(several hundred, 896, 911, CARDINAL)]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>convex potential minimisation is the de facto ...</td>\n",
       "      <td>[(2008, 109, 113, DATE), (2008, 500, 504, DATE...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one of the central questions in statistical le...</td>\n",
       "      <td>[(one, 0, 3, CARDINAL)]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>we develop a sequential low-complexity inferen...</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>monte carlo sampling for bayesian posterior in...</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>we propose a robust portfolio optimization app...</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>we study the problem of multiclass classificat...</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>we study the problem of hierarchical clusterin...</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>we propose an approach for generating a sequen...</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>given a similarity graph between items, correl...</td>\n",
       "      <td>[(3-approximation, 257, 272, CARDINAL), (graph...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text  \\\n",
       "0  crowdsourcing has gained immense popularity in...   \n",
       "1  convex potential minimisation is the de facto ...   \n",
       "2  one of the central questions in statistical le...   \n",
       "3  we develop a sequential low-complexity inferen...   \n",
       "4  monte carlo sampling for bayesian posterior in...   \n",
       "5  we propose a robust portfolio optimization app...   \n",
       "6  we study the problem of multiclass classificat...   \n",
       "7  we study the problem of hierarchical clusterin...   \n",
       "8  we propose an approach for generating a sequen...   \n",
       "9  given a similarity graph between items, correl...   \n",
       "\n",
       "                                          named_ents  \n",
       "0            [(several hundred, 896, 911, CARDINAL)]  \n",
       "1  [(2008, 109, 113, DATE), (2008, 500, 504, DATE...  \n",
       "2                            [(one, 0, 3, CARDINAL)]  \n",
       "3                                                 []  \n",
       "4                                                 []  \n",
       "5                                                 []  \n",
       "6                                                 []  \n",
       "7                                                 []  \n",
       "8                                                 []  \n",
       "9  [(3-approximation, 257, 272, CARDINAL), (graph...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "add_named_ents(df)\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div class=\"entities\" style=\"line-height: 2.5\">given a similarity graph between items, correlation clustering (cc) groups similar items together and dissimilar ones apart. one of the most popular cc algorithms is kwikcluster:  an algorithm that serially clusters neighborhoods of vertices, and obtains a \n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    3-approximation\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">CARDINAL</span>\n",
       "</mark>\n",
       " ratio. unfortunately, in practice kwikcluster requires a large number of clustering rounds, a potential bottleneck for large \n",
       "<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    graphs.we\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
       "</mark>\n",
       " present \n",
       "<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    c4\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
       "</mark>\n",
       " and clusterwild!, \n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    two\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">CARDINAL</span>\n",
       "</mark>\n",
       " algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds, and provably achieve nearly linear speedups. c4 uses concurrency control to enforce serializability of a parallel clustering process, and guarantees a \n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    3-approximation\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">CARDINAL</span>\n",
       "</mark>\n",
       " ratio. clusterwild! is a coordination free algorithm that abandons consistency for the benefit of better scaling; this leads to a provably small loss in the \n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    3\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">CARDINAL</span>\n",
       "</mark>\n",
       " approximation ratio.we provide extensive experimental results for both algorithms,  where we outperform the state of the art, both in terms of clustering accuracy and running time. we show that our algorithms can cluster \n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    billion\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">QUANTITY</span>\n",
       "</mark>\n",
       "-edge graphs in \n",
       "<mark class=\"entity\" style=\"background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    under 5 seconds\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">TIME</span>\n",
       "</mark>\n",
       " on \n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    32\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">CARDINAL</span>\n",
       "</mark>\n",
       " cores, while achieving a \n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    15x speedup\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">QUANTITY</span>\n",
       "</mark>\n",
       ".</div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
    ],
    "source": [
     "column = 'named_ents'\n",
     "# Take a look at one of the abstracts (index 9)\n",
     "render_entities(9, df, options=options, column=column)"
    ]
   },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick glance at some of the abstracts shows that while we are able to extract numeric entities (cardinals, dates, quantities), not much else comes through, and some of what does come through is wrong: in the abstract above, \"graphs.we\" and \"c4\" are both tagged as ORG. Not great. But then again, this is exactly why simply extracting named entities is not enough. On the plus side, our intuition about built-in models and scientific text was spot on! The spaCy named entity recognizer wasn't exposed to this kind of corpus; it was trained on [blogs, news, and comments](https://spacy.io/models/en#en_core_web_lg). Academic papers lean on specialized vocabulary rather than the most common English words, so it is reasonable to expect a generally trained model to struggle when confronted with text from such a restricted domain.\n",
    "\n",
    "Look at a few more abstracts by changing the index parameter in our \"render_entities\" function to convince yourself of the following notion:\n",
    "\n",
    "We need to widen our search."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Extract all nouns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_nouns(text):\n",
    "    \"\"\"Extract common and proper nouns with their start/end character offsets and POS tags using spaCy's POS (part of speech) tagger.\n",
    "    \n",
    "    Keyword arguments:\n",
    "    text -- the actual text source from which to extract nouns\n",
    "    \n",
    "    \"\"\"\n",
    "    keep_pos = ['PROPN', 'NOUN']\n",
    "    # Each tuple is (token text, start char offset, end char offset, POS tag)\n",
    "    return [(tok.text, tok.idx, tok.idx + len(tok.text), tok.pos_) for tok in nlp(text) if tok.pos_ in keep_pos]\n",
    "\n",
    "def add_nouns(df):\n",
    "    \"\"\"Add a 'nouns' column to the data frame (in place) holding the noun tuples for each row's text.\n",
    "    \n",
    "    Keyword arguments:\n",
    "    df -- a dataframe object with a 'text' column\n",
    "    \n",
    "    \"\"\"\n",
    "    df['nouns'] = df['text'].apply(extract_nouns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>named_ents</th>\n",
       "      <th>nouns</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>crowdsourcing has gained immense popularity in...</td>\n",
       "      <td>[(several hundred, 896, 911, CARDINAL)]</td>\n",
       "      <td>[(crowdsourcing, 0, 13, NOUN), (popularity, 33...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>convex potential minimisation is the de facto ...</td>\n",
       "      <td>[(2008, 109, 113, DATE), (2008, 500, 504, DATE...</td>\n",
       "      <td>[(minimisation, 17, 29, NOUN), (approach, 46, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one of the central questions in statistical le...</td>\n",
       "      <td>[(one, 0, 3, CARDINAL)]</td>\n",
       "      <td>[(questions, 19, 28, NOUN), (learning, 44, 52,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>we develop a sequential low-complexity inferen...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(complexity, 28, 38, NOUN), (inference, 39, 4...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>monte carlo sampling for bayesian posterior in...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>we propose a robust portfolio optimization app...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(portfolio, 20, 29, NOUN), (optimization, 30,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>we study the problem of multiclass classificat...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (multiclass, 24, 34,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>we study the problem of hierarchical clusterin...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (clustering, 37, 47,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>we propose an approach for generating a sequen...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(approach, 14, 22, NOUN), (sequence, 40, 48, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>given a similarity graph between items, correl...</td>\n",
       "      <td>[(3-approximation, 257, 272, CARDINAL), (graph...</td>\n",
       "      <td>[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text  \\\n",
       "0  crowdsourcing has gained immense popularity in...   \n",
       "1  convex potential minimisation is the de facto ...   \n",
       "2  one of the central questions in statistical le...   \n",
       "3  we develop a sequential low-complexity inferen...   \n",
       "4  monte carlo sampling for bayesian posterior in...   \n",
       "5  we propose a robust portfolio optimization app...   \n",
       "6  we study the problem of multiclass classificat...   \n",
       "7  we study the problem of hierarchical clusterin...   \n",
       "8  we propose an approach for generating a sequen...   \n",
       "9  given a similarity graph between items, correl...   \n",
       "\n",
       "                                          named_ents  \\\n",
       "0            [(several hundred, 896, 911, CARDINAL)]   \n",
       "1  [(2008, 109, 113, DATE), (2008, 500, 504, DATE...   \n",
       "2                            [(one, 0, 3, CARDINAL)]   \n",
       "3                                                 []   \n",
       "4                                                 []   \n",
       "5                                                 []   \n",
       "6                                                 []   \n",
       "7                                                 []   \n",
       "8                                                 []   \n",
       "9  [(3-approximation, 257, 272, CARDINAL), (graph...   \n",
       "\n",
       "                                               nouns  \n",
       "0  [(crowdsourcing, 0, 13, NOUN), (popularity, 33...  \n",
       "1  [(minimisation, 17, 29, NOUN), (approach, 46, ...  \n",
       "2  [(questions, 19, 28, NOUN), (learning, 44, 52,...  \n",
       "3  [(complexity, 28, 38, NOUN), (inference, 39, 4...  \n",
       "4  [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...  \n",
       "5  [(portfolio, 20, 29, NOUN), (optimization, 30,...  \n",
       "6  [(problem, 13, 20, NOUN), (multiclass, 24, 34,...  \n",
       "7  [(problem, 13, 20, NOUN), (clustering, 37, 47,...  \n",
       "8  [(approach, 14, 22, NOUN), (sequence, 40, 48, ...  \n",
       "9  [(similarity, 8, 18, NOUN), (graph, 19, 24, NO...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "add_nouns(df)\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div class=\"entities\" style=\"line-height: 2.5\">\n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    crowdsourcing\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " has gained immense \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    popularity\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " in \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    machine\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " learning \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    applications\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " for obtaining large \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    amounts\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " of labeled \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    data\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ". \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    crowdsourcing\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " is cheap and fast, but suffers from the \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    problem\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " of low-\n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    quality\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    data\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ". to address this fundamental \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    challenge\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " in \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    crowdsourcing\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ", we propose a simple \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    payment\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " to incentivize \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    workers\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " to answer only the \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    questions\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " that they are sure of and skip the \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    rest\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ". we show that surprisingly, under a mild and natural \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    no\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       "-free-\n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    lunch\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    requirement\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ", this \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " is the one and only \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    incentive\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       "-compatible \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    payment\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " possible. we also show that among all possible \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    incentive\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       "-compatible  \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanisms\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " (that may or may not satisfy no-free-\n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    lunch\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       "), our \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " makes the smallest possible \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    payment\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " to \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    spammers\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ".  interestingly, this unique \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " takes a multiplicative \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    form\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ". the \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    simplicity\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " of the \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " is an added \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    benefit\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ".  in preliminary \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    experiments\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " involving over several hundred \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    workers\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ", we observe a significant \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    reduction\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " in the \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    error\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    rates\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " under our unique \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " for the same or lower monetary \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    expenditure\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ".</div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "column = 'nouns'\n",
    "render_entities(0, df, options=options, column=column)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is more colorful, but is it useful? We are able to pull out a lot of concepts, but nouns like \"rest\", \"popularity\", and \"data\" aren't all that interesting (at least in the first abstract). Our search is too wide at this point."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Good to know. Let's power through for now and merge our two lists of entities."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Combine named entities and nouns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
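    "def extract_named_nouns(row_series):\n",
    "    \"\"\"Combine nouns and named entities for a single row.\n",
    "\n",
    "    Every noun is kept unless a named entity starts at the same\n",
    "    character offset, in which case the named entity replaces it.\n",
    "    Named entities that do not start at the same offset as any noun\n",
    "    are not included.\n",
    "\n",
    "    Keyword arguments:\n",
    "    row_series -- a Pandas Series object with 'nouns' and 'named_ents' entries\n",
    "\n",
    "    \"\"\"\n",
    "    ents = set()\n",
    "    idxs = set()\n",
    "    # prefer the named entity when it shares a start offset with a noun;\n",
    "    # using sets also removes duplicate tuples\n",
    "    for noun_tuple in row_series['nouns']:\n",
    "        for named_ents_tuple in row_series['named_ents']:\n",
    "            if noun_tuple[1] == named_ents_tuple[1]:\n",
    "                idxs.add(noun_tuple[1])\n",
    "                ents.add(named_ents_tuple)\n",
    "        if noun_tuple[1] not in idxs:\n",
    "            ents.add(noun_tuple)\n",
    "\n",
    "    # order the merged tuples by their start offset in the text\n",
    "    return sorted(ents, key=lambda x: x[1])\n",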
    "\n",
    "def add_named_nouns(df):\n",
    "    \"\"\"Create new column in data frame with nouns and named ents.\n",
    "    \n",
    "    Keyword arguments:\n",
    "    df -- a dataframe object\n",
    "    \n",
    "    \"\"\"\n",
    "    df['named_nouns'] = df.apply(extract_named_nouns, axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>named_ents</th>\n",
       "      <th>nouns</th>\n",
       "      <th>named_nouns</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>crowdsourcing has gained immense popularity in...</td>\n",
       "      <td>[(several hundred, 896, 911, CARDINAL)]</td>\n",
       "      <td>[(crowdsourcing, 0, 13, NOUN), (popularity, 33...</td>\n",
       "      <td>[(crowdsourcing, 0, 13, NOUN), (popularity, 33...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>convex potential minimisation is the de facto ...</td>\n",
       "      <td>[(2008, 109, 113, DATE), (2008, 500, 504, DATE...</td>\n",
       "      <td>[(minimisation, 17, 29, NOUN), (approach, 46, ...</td>\n",
       "      <td>[(minimisation, 17, 29, NOUN), (approach, 46, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one of the central questions in statistical le...</td>\n",
       "      <td>[(one, 0, 3, CARDINAL)]</td>\n",
       "      <td>[(questions, 19, 28, NOUN), (learning, 44, 52,...</td>\n",
       "      <td>[(questions, 19, 28, NOUN), (learning, 44, 52,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>we develop a sequential low-complexity inferen...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(complexity, 28, 38, NOUN), (inference, 39, 4...</td>\n",
       "      <td>[(complexity, 28, 38, NOUN), (inference, 39, 4...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>monte carlo sampling for bayesian posterior in...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...</td>\n",
       "      <td>[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>we propose a robust portfolio optimization app...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(portfolio, 20, 29, NOUN), (optimization, 30,...</td>\n",
       "      <td>[(portfolio, 20, 29, NOUN), (optimization, 30,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>we study the problem of multiclass classificat...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (multiclass, 24, 34,...</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (multiclass, 24, 34,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>we study the problem of hierarchical clusterin...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (clustering, 37, 47,...</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (clustering, 37, 47,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>we propose an approach for generating a sequen...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(approach, 14, 22, NOUN), (sequence, 40, 48, ...</td>\n",
       "      <td>[(approach, 14, 22, NOUN), (sequence, 40, 48, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>given a similarity graph between items, correl...</td>\n",
       "      <td>[(3-approximation, 257, 272, CARDINAL), (graph...</td>\n",
       "      <td>[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...</td>\n",
       "      <td>[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text  \\\n",
       "0  crowdsourcing has gained immense popularity in...   \n",
       "1  convex potential minimisation is the de facto ...   \n",
       "2  one of the central questions in statistical le...   \n",
       "3  we develop a sequential low-complexity inferen...   \n",
       "4  monte carlo sampling for bayesian posterior in...   \n",
       "5  we propose a robust portfolio optimization app...   \n",
       "6  we study the problem of multiclass classificat...   \n",
       "7  we study the problem of hierarchical clusterin...   \n",
       "8  we propose an approach for generating a sequen...   \n",
       "9  given a similarity graph between items, correl...   \n",
       "\n",
       "                                          named_ents  \\\n",
       "0            [(several hundred, 896, 911, CARDINAL)]   \n",
       "1  [(2008, 109, 113, DATE), (2008, 500, 504, DATE...   \n",
       "2                            [(one, 0, 3, CARDINAL)]   \n",
       "3                                                 []   \n",
       "4                                                 []   \n",
       "5                                                 []   \n",
       "6                                                 []   \n",
       "7                                                 []   \n",
       "8                                                 []   \n",
       "9  [(3-approximation, 257, 272, CARDINAL), (graph...   \n",
       "\n",
       "                                               nouns  \\\n",
       "0  [(crowdsourcing, 0, 13, NOUN), (popularity, 33...   \n",
       "1  [(minimisation, 17, 29, NOUN), (approach, 46, ...   \n",
       "2  [(questions, 19, 28, NOUN), (learning, 44, 52,...   \n",
       "3  [(complexity, 28, 38, NOUN), (inference, 39, 4...   \n",
       "4  [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...   \n",
       "5  [(portfolio, 20, 29, NOUN), (optimization, 30,...   \n",
       "6  [(problem, 13, 20, NOUN), (multiclass, 24, 34,...   \n",
       "7  [(problem, 13, 20, NOUN), (clustering, 37, 47,...   \n",
       "8  [(approach, 14, 22, NOUN), (sequence, 40, 48, ...   \n",
       "9  [(similarity, 8, 18, NOUN), (graph, 19, 24, NO...   \n",
       "\n",
       "                                         named_nouns  \n",
       "0  [(crowdsourcing, 0, 13, NOUN), (popularity, 33...  \n",
       "1  [(minimisation, 17, 29, NOUN), (approach, 46, ...  \n",
       "2  [(questions, 19, 28, NOUN), (learning, 44, 52,...  \n",
       "3  [(complexity, 28, 38, NOUN), (inference, 39, 4...  \n",
       "4  [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...  \n",
       "5  [(portfolio, 20, 29, NOUN), (optimization, 30,...  \n",
       "6  [(problem, 13, 20, NOUN), (multiclass, 24, 34,...  \n",
       "7  [(problem, 13, 20, NOUN), (clustering, 37, 47,...  \n",
       "8  [(approach, 14, 22, NOUN), (sequence, 40, 48, ...  \n",
       "9  [(similarity, 8, 18, NOUN), (graph, 19, 24, NO...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "add_named_nouns(df)\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div class=\"entities\" style=\"line-height: 2.5\">convex potential \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    minimisation\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " is the de facto \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    approach\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " to binary \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    classification\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ". however, long and servedio [2008] proved that under symmetric \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    label\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    noise\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " (\n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    sln\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">PROPN</span>\n",
       "</mark>\n",
       "), \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    minimisation\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " of any convex \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    potential\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " over a linear \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    function\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    class\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " can result in \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    classification\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    performance\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " equivalent to random \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    guessing\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ". this ostensibly shows that convex \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    losses\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " are not \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    sln\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       "-robust. in this \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    paper\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ", we propose a \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    convex\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ", \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    classification\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       "-calibrated \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    loss\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " and prove that it is sln-robust. the \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    loss\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " avoids the long and servedio [2008] \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    result\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " by \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    virtue\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " of being negatively unbounded. the \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    loss\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " is a \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    modification\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " of the \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    hinge\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    loss\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ", where one does not clamp at zero; hence, we call it the unhinged \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    loss\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ". we show that the optimal unhinged \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    solution\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " is equivalent to that of a strongly regularised \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    svm\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ", and is the limiting \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    solution\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " for any convex \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    potential\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       "; this implies that strong \n",
       "<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    l2\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    regularisation\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " makes most standard \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    learners\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    sln\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       "-robust. \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    experiments\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " confirm the unhinged loss’ \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    sln\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       "-\n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    robustness\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ".</div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "column = 'named_nouns'\n",
    "render_entities(1, df, options=options, column=column)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this step, we're just combining the named entities extracted by spaCy's built-in model with the nouns identified by the part-of-speech (POS) tagger. We're dropping any numeric entities for now because they are harder to deal with and don't really represent new concepts. You'll notice, if you look closely enough, that we are also ignoring any hyphenated entities. It is possible to prevent spaCy's tokenizer from splitting hyphenated words apart, but we'll reserve this, along with other types of advanced fine-tuning or low-level editing, for if and when we move beyond the prototype phase. \n",
    "\n",
    "So far, in the past few steps, we've dealt with one-word entities. However, it's also entirely permissible for combinations of two or more words to represent a single concept. This means that in order for our prototype to successfully capture the most relevant concepts, we'll need to pull n-length phrases from our academic abstracts in addition to single-word entities. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Extract noun phrases"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A Chunky Pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Even mild exposure to computer science, or any of the various isoforms of engineering, will have introduced you to the idea of an abstraction, wherein low-level concepts are bundled into higher-order relationships. The <strong>noun phrase</strong> or <strong>chunk</strong> is an abstraction that groups a noun with the words describing it (typically two or more words, although a lone noun or pronoun also counts as a chunk), and is the by-product of dependency parsing, POS tagging, and tokenization. spaCy's POS tagger is essentially a statistical model that learns to predict the tag (noun, verb, adjective, etc.) for a given word using examples of tagged sentences. \n",
    "\n",
    "This supervised machine learning approach relies on tokens generated by splitting text into somewhat atomic units using a rule-based tokenizer (although there are some interesting [unsupervised models](https://github.com/google/sentencepiece) out there as well). Dependency parsing then uncovers relationships between these tagged tokens, allowing us to finally extract noun chunks or phrases of relevance. \n",
    "\n",
    "The full pipeline goes something like this: \n",
    "\n",
    "<strong>raw text</strong> &rarr; <strong>tokenization</strong> &rarr; <strong>POS tagging</strong> &rarr; <strong>dependency parsing</strong> &rarr; <strong>noun chunk extraction</strong>\n",
    "\n",
    "Theoretically, one could swap out noun chunk extraction for named entity recognition, but that's the part of the pipeline we are attempting to modify for our own purposes, because we want n-length entities. Barring our custom intrusion, however, this is exactly how spaCy's built-in model works! If you don't believe me (which you shouldn't, since you're a scientist), scroll up to the very top of this notebook to convince yourself. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Neat, huh? Need a visualization of tokenization, POS tagging, and dependency parsing to convince you of just how cool this is? \n",
    "\n",
    "Take a look:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" id=\"0\" class=\"displacy\" width=\"3200\" height=\"574.5\" style=\"max-width: none; height: 574.5px; color: #000000; background: #ffffff; font-family: Arial\">\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">Dr.</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">PROPN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"225\">Abraham</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"225\">PROPN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"400\">is</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"400\">VERB</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"575\">the</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"575\">DET</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"750\">primary</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"750\">ADJ</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"925\">author</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"925\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1100\">of</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1100\">ADP</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1275\">this</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1275\">DET</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1450\">paper,</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1450\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1625\">and</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1625\">CCONJ</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1800\">a</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1800\">DET</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1975\">physician</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1975\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2150\">in</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2150\">ADP</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2325\">the</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2325\">DET</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2500\">specialty</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2500\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2675\">of</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2675\">ADP</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"2850\">internal</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"2850\">ADJ</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"484.5\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"3025\">medicine.</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"3025\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-0\" stroke-width=\"2px\" d=\"M70,439.5 C70,352.0 205.0,352.0 205.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-0\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M70,441.5 L62,429.5 78,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-1\" stroke-width=\"2px\" d=\"M245,439.5 C245,352.0 380.0,352.0 380.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-1\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M245,441.5 L237,429.5 253,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-2\" stroke-width=\"2px\" d=\"M595,439.5 C595,264.5 910.0,264.5 910.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-2\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M595,441.5 L587,429.5 603,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-3\" stroke-width=\"2px\" d=\"M770,439.5 C770,352.0 905.0,352.0 905.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-3\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M770,441.5 L762,429.5 778,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-4\" stroke-width=\"2px\" d=\"M420,439.5 C420,177.0 915.0,177.0 915.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-4\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">attr</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M915.0,441.5 L923.0,429.5 907.0,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-5\" stroke-width=\"2px\" d=\"M945,439.5 C945,352.0 1080.0,352.0 1080.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-5\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1080.0,441.5 L1088.0,429.5 1072.0,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-6\" stroke-width=\"2px\" d=\"M1295,439.5 C1295,352.0 1430.0,352.0 1430.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-6\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1295,441.5 L1287,429.5 1303,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-7\" stroke-width=\"2px\" d=\"M1120,439.5 C1120,264.5 1435.0,264.5 1435.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-7\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1435.0,441.5 L1443.0,429.5 1427.0,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-8\" stroke-width=\"2px\" d=\"M945,439.5 C945,89.5 1620.0,89.5 1620.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-8\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">cc</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1620.0,441.5 L1628.0,429.5 1612.0,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-9\" stroke-width=\"2px\" d=\"M1820,439.5 C1820,352.0 1955.0,352.0 1955.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-9\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1820,441.5 L1812,429.5 1828,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-10\" stroke-width=\"2px\" d=\"M945,439.5 C945,2.0 1975.0,2.0 1975.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-10\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">conj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M1975.0,441.5 L1983.0,429.5 1967.0,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-11\" stroke-width=\"2px\" d=\"M1995,439.5 C1995,352.0 2130.0,352.0 2130.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-11\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2130.0,441.5 L2138.0,429.5 2122.0,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-12\" stroke-width=\"2px\" d=\"M2345,439.5 C2345,352.0 2480.0,352.0 2480.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-12\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2345,441.5 L2337,429.5 2353,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-13\" stroke-width=\"2px\" d=\"M2170,439.5 C2170,264.5 2485.0,264.5 2485.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-13\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2485.0,441.5 L2493.0,429.5 2477.0,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-14\" stroke-width=\"2px\" d=\"M2520,439.5 C2520,352.0 2655.0,352.0 2655.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-14\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2655.0,441.5 L2663.0,429.5 2647.0,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-15\" stroke-width=\"2px\" d=\"M2870,439.5 C2870,352.0 3005.0,352.0 3005.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-15\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M2870,441.5 L2862,429.5 2878,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-0-16\" stroke-width=\"2px\" d=\"M2695,439.5 C2695,264.5 3010.0,264.5 3010.0,439.5\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-0-16\" class=\"displacy-label\" startOffset=\"50%\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M3010.0,441.5 L3018.0,429.5 3002.0,429.5\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "</svg>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "text = \"Dr. Abraham is the primary author of this paper, and a physician in the specialty of internal medicine.\"\n",
    "\n",
    "spacy.displacy.render(nlp(text), jupyter=True)  # draw the dependency parse inline with spaCy's built-in displaCy visualizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just gorgeous. Following our pipeline, let's use this dependency tree to tease out the noun phrases in our dummy sentence. We'll have to create a few functions to do the heavy lifting first (we can reuse these guys for our full dataset later), and then use a simple procedure to visualize our example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_noun_phrases(text):\n",
    "    \"\"\"Extract noun phrases as (text, start_char, end_char, label) tuples.\n",
    "    \n",
    "    Keyword arguments:\n",
    "    text -- the actual text source from which to extract entities\n",
    "    \n",
    "    \"\"\"\n",
    "    return [(chunk.text, chunk.start_char, chunk.end_char, chunk.label_) for chunk in nlp(text).noun_chunks]\n",
    "\n",
    "def add_noun_phrases(df):\n",
    "    \"\"\"Create new column in data frame with noun phrases.\n",
    "    \n",
    "    Keyword arguments:\n",
    "    df -- a dataframe object\n",
    "    \n",
    "    \"\"\"\n",
    "    df['noun_phrases'] = df['text'].apply(extract_noun_phrases)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "def visualize_noun_phrases(text):\n",
    "    \"\"\"Create a temporary dataframe to extract and visualize noun phrases. \n",
    "    \n",
    "    Keyword arguments:\n",
    "    text -- the actual text source from which to extract entities\n",
    "    \n",
    "    \"\"\"\n",
    "    df = pd.DataFrame([text]) \n",
    "    df.columns = ['text']\n",
    "    add_noun_phrases(df)\n",
    "    column = 'noun_phrases'\n",
    "    render_entities(0, df, options=options, column=column)  # render the temporary frame built above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div class=\"entities\" style=\"line-height: 2.5\">\n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    Dr. Abraham\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " is \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    the primary author\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " of \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    this paper\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       ", and \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    a physician\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " in \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    the specialty\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " of \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    internal medicine\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       ".</div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "visualize_noun_phrases(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compare this to what we'd originally set out to accomplish:\n",
    "\n",
    "> **[Dr. Abraham]** is the **[primary author]** of this **[paper]**, and a **[physician]** in the **[specialty]** of **[internal medicine]**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I don't know about you, but every time I see this work, I'm blown away by both the intricate complexity and beautiful simplicity of this process. Ignoring the leading determiners (\"the\", \"this\", \"a\"), with one single move, we've done a damn-near perfect job of extracting the main ideas from this sentence. How amazing is that?! \n",
    "\n",
    "Hats off to spaCy, and the hordes of data scientists, machine learning engineers, and linguists who made this possible."
   ]
  },
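  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The chunker's output differs from the target mainly in its leading determiners (`the`, `this`, `a`). Here is a minimal sketch of one way to trim them, assuming the `(text, start_char, end_char, label)` tuples returned by `extract_noun_phrases`. The `DETERMINERS` set and the `strip_leading_determiner` helper are hypothetical names introduced for illustration; they are not part of spaCy or of the cells above.\n",
    "\n",
    "```python\n",
    "# Hypothetical helper (an assumption for illustration, not a spaCy API).\n",
    "# A small, hand-picked set of determiners; extend it as needed.\n",
    "DETERMINERS = {'a', 'an', 'the', 'this', 'that', 'these', 'those'}\n",
    "\n",
    "def strip_leading_determiner(phrase):\n",
    "    # phrase is a (text, start_char, end_char, label) tuple\n",
    "    text, start, end, label = phrase\n",
    "    first, sep, rest = text.partition(' ')\n",
    "    if sep and first.lower() in DETERMINERS:\n",
    "        # shift the start offset past the determiner and the space after it\n",
    "        return (rest, start + len(first) + 1, end, label)\n",
    "    return phrase\n",
    "\n",
    "sample = [('Dr. Abraham', 0, 11, 'NP'), ('the primary author', 15, 33, 'NP')]\n",
    "trimmed = [strip_leading_determiner(p) for p in sample]\n",
    "```\n",
    "\n",
    "Applied to the dummy sentence, `the primary author` would become `primary author` with its start offset moved from 15 to 19, while `Dr. Abraham` is left untouched. Checking `token.pos_ == 'DET'` on the first token of each chunk would likely be more robust than a word list, but this keeps the sketch free of any model dependency."
   ]
  },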
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Back to School"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, if we combine these noun phrases with the single-word entities we extracted from our academic abstracts earlier, we should be getting close to a pretty awesome set of concepts! Let's capture some noun phrases and see what we get. "
   ]
  },
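  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Simply concatenating the two lists would leave duplicates, since a noun such as `minimisation` also sits inside the phrase `convex potential minimisation`. The sketch below shows one possible way to combine them, keeping a single-word entity only when no noun phrase already covers its character span. `merge_entities` is a hypothetical helper and the sample tuples are illustrative; the notebook itself may combine the columns differently.\n",
    "\n",
    "```python\n",
    "# Hypothetical helper (an assumption for illustration, not a spaCy API).\n",
    "def merge_entities(single_words, phrases):\n",
    "    # Both inputs are lists of (text, start_char, end_char, label) tuples.\n",
    "    merged = list(phrases)\n",
    "    for word in single_words:\n",
    "        _, start, end, _ = word\n",
    "        # Skip a single word whose span lies inside some noun phrase.\n",
    "        covered = any(p_start <= start and end <= p_end\n",
    "                      for _, p_start, p_end, _ in phrases)\n",
    "        if not covered:\n",
    "            merged.append(word)\n",
    "    # Return the combined entities in reading order.\n",
    "    return sorted(merged, key=lambda ent: ent[1])\n",
    "\n",
    "# Illustrative sample values shaped like the notebook's tuples.\n",
    "nouns = [('minimisation', 17, 29, 'NOUN'), ('approach', 46, 54, 'NOUN')]\n",
    "phrases = [('convex potential minimisation', 0, 29, 'NP')]\n",
    "combined = merge_entities(nouns, phrases)\n",
    "```\n",
    "\n",
    "Here `minimisation` is dropped because the longer phrase already contains it, while `approach` survives as a standalone entity. Whether containment is the right rule for a given corpus is a judgment call; an exact-match check would be a stricter alternative."
   ]
  },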
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>named_ents</th>\n",
       "      <th>nouns</th>\n",
       "      <th>named_nouns</th>\n",
       "      <th>noun_phrases</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>crowdsourcing has gained immense popularity in...</td>\n",
       "      <td>[(several hundred, 896, 911, CARDINAL)]</td>\n",
       "      <td>[(crowdsourcing, 0, 13, NOUN), (popularity, 33...</td>\n",
       "      <td>[(crowdsourcing, 0, 13, NOUN), (popularity, 33...</td>\n",
       "      <td>[(crowdsourcing, 0, 13, NP), (immense populari...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>convex potential minimisation is the de facto ...</td>\n",
       "      <td>[(2008, 109, 113, DATE), (2008, 500, 504, DATE...</td>\n",
       "      <td>[(minimisation, 17, 29, NOUN), (approach, 46, ...</td>\n",
       "      <td>[(minimisation, 17, 29, NOUN), (approach, 46, ...</td>\n",
       "      <td>[(convex potential minimisation, 0, 29, NP), (...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one of the central questions in statistical le...</td>\n",
       "      <td>[(one, 0, 3, CARDINAL)]</td>\n",
       "      <td>[(questions, 19, 28, NOUN), (learning, 44, 52,...</td>\n",
       "      <td>[(questions, 19, 28, NOUN), (learning, 44, 52,...</td>\n",
       "      <td>[(the central questions, 7, 28, NP), (statisti...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>we develop a sequential low-complexity inferen...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(complexity, 28, 38, NOUN), (inference, 39, 4...</td>\n",
       "      <td>[(complexity, 28, 38, NOUN), (inference, 39, 4...</td>\n",
       "      <td>[(we, 0, 2, NP), (a sequential low-complexity ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>monte carlo sampling for bayesian posterior in...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...</td>\n",
       "      <td>[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...</td>\n",
       "      <td>[(bayesian posterior inference, 25, 53, NP), (...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>we propose a robust portfolio optimization app...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(portfolio, 20, 29, NOUN), (optimization, 30,...</td>\n",
       "      <td>[(portfolio, 20, 29, NOUN), (optimization, 30,...</td>\n",
       "      <td>[(we, 0, 2, NP), (a robust portfolio optimizat...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>we study the problem of multiclass classificat...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (multiclass, 24, 34,...</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (multiclass, 24, 34,...</td>\n",
       "      <td>[(we, 0, 2, NP), (the problem, 9, 20, NP), (mu...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>we study the problem of hierarchical clusterin...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (clustering, 37, 47,...</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (clustering, 37, 47,...</td>\n",
       "      <td>[(we, 0, 2, NP), (the problem, 9, 20, NP), (hi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>we propose an approach for generating a sequen...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(approach, 14, 22, NOUN), (sequence, 40, 48, ...</td>\n",
       "      <td>[(approach, 14, 22, NOUN), (sequence, 40, 48, ...</td>\n",
       "      <td>[(we, 0, 2, NP), (an approach, 11, 22, NP), (a...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>given a similarity graph between items, correl...</td>\n",
       "      <td>[(3-approximation, 257, 272, CARDINAL), (graph...</td>\n",
       "      <td>[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...</td>\n",
       "      <td>[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...</td>\n",
       "      <td>[(a similarity graph, 6, 24, NP), (items, 33, ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text  \\\n",
       "0  crowdsourcing has gained immense popularity in...   \n",
       "1  convex potential minimisation is the de facto ...   \n",
       "2  one of the central questions in statistical le...   \n",
       "3  we develop a sequential low-complexity inferen...   \n",
       "4  monte carlo sampling for bayesian posterior in...   \n",
       "5  we propose a robust portfolio optimization app...   \n",
       "6  we study the problem of multiclass classificat...   \n",
       "7  we study the problem of hierarchical clusterin...   \n",
       "8  we propose an approach for generating a sequen...   \n",
       "9  given a similarity graph between items, correl...   \n",
       "\n",
       "                                          named_ents  \\\n",
       "0            [(several hundred, 896, 911, CARDINAL)]   \n",
       "1  [(2008, 109, 113, DATE), (2008, 500, 504, DATE...   \n",
       "2                            [(one, 0, 3, CARDINAL)]   \n",
       "3                                                 []   \n",
       "4                                                 []   \n",
       "5                                                 []   \n",
       "6                                                 []   \n",
       "7                                                 []   \n",
       "8                                                 []   \n",
       "9  [(3-approximation, 257, 272, CARDINAL), (graph...   \n",
       "\n",
       "                                               nouns  \\\n",
       "0  [(crowdsourcing, 0, 13, NOUN), (popularity, 33...   \n",
       "1  [(minimisation, 17, 29, NOUN), (approach, 46, ...   \n",
       "2  [(questions, 19, 28, NOUN), (learning, 44, 52,...   \n",
       "3  [(complexity, 28, 38, NOUN), (inference, 39, 4...   \n",
       "4  [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...   \n",
       "5  [(portfolio, 20, 29, NOUN), (optimization, 30,...   \n",
       "6  [(problem, 13, 20, NOUN), (multiclass, 24, 34,...   \n",
       "7  [(problem, 13, 20, NOUN), (clustering, 37, 47,...   \n",
       "8  [(approach, 14, 22, NOUN), (sequence, 40, 48, ...   \n",
       "9  [(similarity, 8, 18, NOUN), (graph, 19, 24, NO...   \n",
       "\n",
       "                                         named_nouns  \\\n",
       "0  [(crowdsourcing, 0, 13, NOUN), (popularity, 33...   \n",
       "1  [(minimisation, 17, 29, NOUN), (approach, 46, ...   \n",
       "2  [(questions, 19, 28, NOUN), (learning, 44, 52,...   \n",
       "3  [(complexity, 28, 38, NOUN), (inference, 39, 4...   \n",
       "4  [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...   \n",
       "5  [(portfolio, 20, 29, NOUN), (optimization, 30,...   \n",
       "6  [(problem, 13, 20, NOUN), (multiclass, 24, 34,...   \n",
       "7  [(problem, 13, 20, NOUN), (clustering, 37, 47,...   \n",
       "8  [(approach, 14, 22, NOUN), (sequence, 40, 48, ...   \n",
       "9  [(similarity, 8, 18, NOUN), (graph, 19, 24, NO...   \n",
       "\n",
       "                                        noun_phrases  \n",
       "0  [(crowdsourcing, 0, 13, NP), (immense populari...  \n",
       "1  [(convex potential minimisation, 0, 29, NP), (...  \n",
       "2  [(the central questions, 7, 28, NP), (statisti...  \n",
       "3  [(we, 0, 2, NP), (a sequential low-complexity ...  \n",
       "4  [(bayesian posterior inference, 25, 53, NP), (...  \n",
       "5  [(we, 0, 2, NP), (a robust portfolio optimizat...  \n",
       "6  [(we, 0, 2, NP), (the problem, 9, 20, NP), (mu...  \n",
       "7  [(we, 0, 2, NP), (the problem, 9, 20, NP), (hi...  \n",
       "8  [(we, 0, 2, NP), (an approach, 11, 22, NP), (a...  \n",
       "9  [(a similarity graph, 6, 24, NP), (items, 33, ...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "add_noun_phrases(df)\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div class=\"entities\" style=\"line-height: 2.5\">\n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    crowdsourcing\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " has gained \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    immense popularity\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " in \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    machine learning applications\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " for obtaining \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    large amounts\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " of \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    labeled data\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       ". \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    crowdsourcing\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " is cheap and fast, but suffers from \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    the problem\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " of \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    low-quality data\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       ". to address \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    this fundamental challenge\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " in \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    crowdsourcing\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       ", \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    we\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " propose \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    a simple payment mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " to incentivize \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    workers\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " to answer \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    only the questions\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " that \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    they\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " are sure of and skip \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    the rest\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       ". \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    we\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " show that surprisingly, under \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    a mild and natural no-free-lunch requirement\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       ", \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    this mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " is the one and \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    only incentive-compatible payment mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " possible. \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    we\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " also show that among \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    all possible incentive-compatible  mechanisms\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " (that may or may not satisfy \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    no-free-lunch\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       "), \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    our mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " makes \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    the smallest possible payment\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " to \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    spammers\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       ".  interestingly, \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    this unique mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " takes \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    a multiplicative form\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       ". \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    the simplicity\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " of \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    the mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " is \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    an added benefit\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       ".  in \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    preliminary experiments\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " involving over \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    several hundred workers\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       ", \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    we\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " observe \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    a significant reduction\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " in \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    the error rates\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " under \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    our unique mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       " for \n",
       "<mark class=\"entity\" style=\"background: #1EECA6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    the same or lower monetary expenditure\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NP</span>\n",
       "</mark>\n",
       ".</div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "column = 'noun_phrases'\n",
    "render_entities(0, df, options=options, column=column)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hmm... we should've seen this coming. We've now done a great job of extracting noun phrases from our abstracts, but we're running into the same problem as before: our funnel is too wide, and we're pulling in uninteresting bigrams like \"the simplicity\", \"the rest\", and \"this mechanism\", plus bare pronouns like \"we\" and \"they\". These chunks are indeed noun phrases, but they aren't domain-specific concepts. Not to mention, we still have to deal with those pesky prepositions (try saying that five times fast).\n",
    "\n",
    "Let's see if we can narrow our search and keep only the most important phrases."
   ]
  },
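  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To put a rough number on how wide the funnel is, here's a quick sketch (not part of the original pipeline) that counts the noun phrases in the first abstract that either open with a determiner-like word or are a bare pronoun. The word lists `leading_words` and `pronouns` and the helper `looks_generic` are illustrative assumptions, not a linguistic inventory. The cell also assumes `df['noun_phrases']` holds the `(text, start, end, label)` tuples built above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative word lists (assumptions, not exhaustive)\n",
    "leading_words = {'a', 'an', 'the', 'this', 'that', 'these', 'those', 'our'}\n",
    "pronouns = {'we', 'they', 'it'}\n",
    "\n",
    "def looks_generic(phrase):\n",
    "    \"\"\"Hypothetical helper: True for bare pronouns or determiner-led chunks.\"\"\"\n",
    "    words = phrase.split()\n",
    "    return phrase in pronouns or (len(words) > 0 and words[0] in leading_words)\n",
    "\n",
    "# each entry in the column is a (text, start, end, label) tuple\n",
    "phrases = [phrase for phrase, start, end, label in df['noun_phrases'][0]]\n",
    "generic = [p for p in phrases if looks_generic(p)]\n",
    "print(len(generic), 'of', len(phrases), 'noun phrases look generic')"
   ]
  },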
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6: Extract compound noun phrases"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
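    "def extract_compounds(text):\n",
    "    \"\"\"Extract compound noun phrases with beginning and end idxs.\n",
    "\n",
    "    Relies on the module-level spaCy pipeline `nlp`.\n",
    "\n",
    "    Keyword arguments:\n",
    "    text -- the actual text source from which to extract entities\n",
    "\n",
    "    \"\"\"\n",
    "    comp_idx = 0       # token position of the most recent 'compound' token\n",
    "    compound = []      # words collected for the compound in progress\n",
    "    compound_nps = []  # finished (text, start, end, label) tuples\n",
    "    tok_idx = 0        # character offset where the current compound starts\n",
    "    for tok in nlp(text):\n",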
    "        if tok.dep_ == 'compound':\n",
    "\n",
    "            # capture hyphenated compounds\n",
    "            children = ''.join([c.text for c in tok.children])\n",
    "            if '-' in children:\n",
    "                compound.append(''.join([children, tok.text]))\n",
    "            else:\n",
    "                compound.append(tok.text)\n",
    "\n",
    "            # remember starting index of first child in compound or word\n",
    "            try:\n",
    "                tok_idx = [c for c in tok.children][0].idx\n",
    "            except IndexError:\n",
    "                if len(compound) == 1:\n",
    "                    tok_idx = tok.idx\n",
    "            comp_idx = tok.i\n",
    "\n",
    "        # append the last word in a compound phrase\n",
    "        if tok.i - comp_idx == 1:\n",
    "            compound.append(tok.text)\n",
    "            if len(compound) > 1: \n",
    "                compound = ' '.join(compound)\n",
    "                compound_nps.append((compound, tok_idx, tok_idx+len(compound), 'COMPOUND'))\n",
    "\n",
    "            # reset parameters\n",
    "            tok_idx = 0 \n",
    "            compound = []\n",
    "\n",
    "    return compound_nps\n",
    "\n",
    "def add_compounds(df):\n",
    "    \"\"\"Create new column in data frame with compound noun phrases.\n",
    "    \n",
    "    Keyword arguments:\n",
    "    df -- a dataframe object\n",
    "    \n",
    "    \"\"\"\n",
    "    df['compounds'] = df['text'].apply(extract_compounds)"
   ]
  },
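  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before running this on the whole data frame, it may help to see what the parser actually labels as `compound`. The sketch below prints each token's dependency label and head for one sample sentence, then passes the same sentence to `extract_compounds`. It assumes `nlp` is the spaCy pipeline loaded earlier in this notebook. `sample` is just an illustrative string borrowed from the first abstract, and the labels you get will depend on the model you loaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# illustrative input, taken from the start of the first abstract\n",
    "sample = 'crowdsourcing has gained immense popularity in machine learning applications'\n",
    "\n",
    "for tok in nlp(sample):\n",
    "    # tok.dep_ is the dependency label; tok.head is the token it attaches to\n",
    "    print(tok.text, tok.dep_, tok.head.text, sep=' | ')\n",
    "\n",
    "# tokens labeled 'compound' (plus the word that follows them) end up here\n",
    "print(extract_compounds(sample))"
   ]
  },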
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>named_ents</th>\n",
       "      <th>nouns</th>\n",
       "      <th>named_nouns</th>\n",
       "      <th>noun_phrases</th>\n",
       "      <th>compounds</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>crowdsourcing has gained immense popularity in...</td>\n",
       "      <td>[(several hundred, 896, 911, CARDINAL)]</td>\n",
       "      <td>[(crowdsourcing, 0, 13, NOUN), (popularity, 33...</td>\n",
       "      <td>[(crowdsourcing, 0, 13, NOUN), (popularity, 33...</td>\n",
       "      <td>[(crowdsourcing, 0, 13, NP), (immense populari...</td>\n",
       "      <td>[(machine learning applications, 47, 76, COMPO...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>convex potential minimisation is the de facto ...</td>\n",
       "      <td>[(2008, 109, 113, DATE), (2008, 500, 504, DATE...</td>\n",
       "      <td>[(minimisation, 17, 29, NOUN), (approach, 46, ...</td>\n",
       "      <td>[(minimisation, 17, 29, NOUN), (approach, 46, ...</td>\n",
       "      <td>[(convex potential minimisation, 0, 29, NP), (...</td>\n",
       "      <td>[(label noise, 143, 154, COMPOUND), (function ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one of the central questions in statistical le...</td>\n",
       "      <td>[(one, 0, 3, CARDINAL)]</td>\n",
       "      <td>[(questions, 19, 28, NOUN), (learning, 44, 52,...</td>\n",
       "      <td>[(questions, 19, 28, NOUN), (learning, 44, 52,...</td>\n",
       "      <td>[(the central questions, 7, 28, NP), (statisti...</td>\n",
       "      <td>[(learning theory, 44, 59, COMPOUND), (inferen...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>we develop a sequential low-complexity inferen...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(complexity, 28, 38, NOUN), (inference, 39, 4...</td>\n",
       "      <td>[(complexity, 28, 38, NOUN), (inference, 39, 4...</td>\n",
       "      <td>[(we, 0, 2, NP), (a sequential low-complexity ...</td>\n",
       "      <td>[(low-complexity inference procedure, 28, 62, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>monte carlo sampling for bayesian posterior in...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...</td>\n",
       "      <td>[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...</td>\n",
       "      <td>[(bayesian posterior inference, 25, 53, NP), (...</td>\n",
       "      <td>[(monte carlo sampling, 0, 20, COMPOUND), (mac...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>we propose a robust portfolio optimization app...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(portfolio, 20, 29, NOUN), (optimization, 30,...</td>\n",
       "      <td>[(portfolio, 20, 29, NOUN), (optimization, 30,...</td>\n",
       "      <td>[(we, 0, 2, NP), (a robust portfolio optimizat...</td>\n",
       "      <td>[(portfolio optimization approach, 20, 51, COM...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>we study the problem of multiclass classificat...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (multiclass, 24, 34,...</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (multiclass, 24, 34,...</td>\n",
       "      <td>[(we, 0, 2, NP), (the problem, 9, 20, NP), (mu...</td>\n",
       "      <td>[(test time, 134, 143, COMPOUND), (tree constr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>we study the problem of hierarchical clusterin...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (clustering, 37, 47,...</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (clustering, 37, 47,...</td>\n",
       "      <td>[(we, 0, 2, NP), (the problem, 9, 20, NP), (hi...</td>\n",
       "      <td>[(lp relaxation, 182, 195, COMPOUND), (cost pe...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>we propose an approach for generating a sequen...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(approach, 14, 22, NOUN), (sequence, 40, 48, ...</td>\n",
       "      <td>[(approach, 14, 22, NOUN), (sequence, 40, 48, ...</td>\n",
       "      <td>[(we, 0, 2, NP), (an approach, 11, 22, NP), (a...</td>\n",
       "      <td>[(image stream, 77, 89, COMPOUND), (image stre...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>given a similarity graph between items, correl...</td>\n",
       "      <td>[(3-approximation, 257, 272, CARDINAL), (graph...</td>\n",
       "      <td>[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...</td>\n",
       "      <td>[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...</td>\n",
       "      <td>[(a similarity graph, 6, 24, NP), (items, 33, ...</td>\n",
       "      <td>[(similarity graph, 8, 24, COMPOUND), (correla...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text  \\\n",
       "0  crowdsourcing has gained immense popularity in...   \n",
       "1  convex potential minimisation is the de facto ...   \n",
       "2  one of the central questions in statistical le...   \n",
       "3  we develop a sequential low-complexity inferen...   \n",
       "4  monte carlo sampling for bayesian posterior in...   \n",
       "5  we propose a robust portfolio optimization app...   \n",
       "6  we study the problem of multiclass classificat...   \n",
       "7  we study the problem of hierarchical clusterin...   \n",
       "8  we propose an approach for generating a sequen...   \n",
       "9  given a similarity graph between items, correl...   \n",
       "\n",
       "                                          named_ents  \\\n",
       "0            [(several hundred, 896, 911, CARDINAL)]   \n",
       "1  [(2008, 109, 113, DATE), (2008, 500, 504, DATE...   \n",
       "2                            [(one, 0, 3, CARDINAL)]   \n",
       "3                                                 []   \n",
       "4                                                 []   \n",
       "5                                                 []   \n",
       "6                                                 []   \n",
       "7                                                 []   \n",
       "8                                                 []   \n",
       "9  [(3-approximation, 257, 272, CARDINAL), (graph...   \n",
       "\n",
       "                                               nouns  \\\n",
       "0  [(crowdsourcing, 0, 13, NOUN), (popularity, 33...   \n",
       "1  [(minimisation, 17, 29, NOUN), (approach, 46, ...   \n",
       "2  [(questions, 19, 28, NOUN), (learning, 44, 52,...   \n",
       "3  [(complexity, 28, 38, NOUN), (inference, 39, 4...   \n",
       "4  [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...   \n",
       "5  [(portfolio, 20, 29, NOUN), (optimization, 30,...   \n",
       "6  [(problem, 13, 20, NOUN), (multiclass, 24, 34,...   \n",
       "7  [(problem, 13, 20, NOUN), (clustering, 37, 47,...   \n",
       "8  [(approach, 14, 22, NOUN), (sequence, 40, 48, ...   \n",
       "9  [(similarity, 8, 18, NOUN), (graph, 19, 24, NO...   \n",
       "\n",
       "                                         named_nouns  \\\n",
       "0  [(crowdsourcing, 0, 13, NOUN), (popularity, 33...   \n",
       "1  [(minimisation, 17, 29, NOUN), (approach, 46, ...   \n",
       "2  [(questions, 19, 28, NOUN), (learning, 44, 52,...   \n",
       "3  [(complexity, 28, 38, NOUN), (inference, 39, 4...   \n",
       "4  [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...   \n",
       "5  [(portfolio, 20, 29, NOUN), (optimization, 30,...   \n",
       "6  [(problem, 13, 20, NOUN), (multiclass, 24, 34,...   \n",
       "7  [(problem, 13, 20, NOUN), (clustering, 37, 47,...   \n",
       "8  [(approach, 14, 22, NOUN), (sequence, 40, 48, ...   \n",
       "9  [(similarity, 8, 18, NOUN), (graph, 19, 24, NO...   \n",
       "\n",
       "                                        noun_phrases  \\\n",
       "0  [(crowdsourcing, 0, 13, NP), (immense populari...   \n",
       "1  [(convex potential minimisation, 0, 29, NP), (...   \n",
       "2  [(the central questions, 7, 28, NP), (statisti...   \n",
       "3  [(we, 0, 2, NP), (a sequential low-complexity ...   \n",
       "4  [(bayesian posterior inference, 25, 53, NP), (...   \n",
       "5  [(we, 0, 2, NP), (a robust portfolio optimizat...   \n",
       "6  [(we, 0, 2, NP), (the problem, 9, 20, NP), (mu...   \n",
       "7  [(we, 0, 2, NP), (the problem, 9, 20, NP), (hi...   \n",
       "8  [(we, 0, 2, NP), (an approach, 11, 22, NP), (a...   \n",
       "9  [(a similarity graph, 6, 24, NP), (items, 33, ...   \n",
       "\n",
       "                                           compounds  \n",
       "0  [(machine learning applications, 47, 76, COMPO...  \n",
       "1  [(label noise, 143, 154, COMPOUND), (function ...  \n",
       "2  [(learning theory, 44, 59, COMPOUND), (inferen...  \n",
       "3  [(low-complexity inference procedure, 28, 62, ...  \n",
       "4  [(monte carlo sampling, 0, 20, COMPOUND), (mac...  \n",
       "5  [(portfolio optimization approach, 20, 51, COM...  \n",
       "6  [(test time, 134, 143, COMPOUND), (tree constr...  \n",
       "7  [(lp relaxation, 182, 195, COMPOUND), (cost pe...  \n",
       "8  [(image stream, 77, 89, COMPOUND), (image stre...  \n",
       "9  [(similarity graph, 8, 24, COMPOUND), (correla...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "add_compounds(df)\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div class=\"entities\" style=\"line-height: 2.5\">crowdsourcing has gained immense popularity in \n",
       "<mark class=\"entity\" style=\"background: #FE6BFE; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    machine learning applications\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">COMPOUND</span>\n",
       "</mark>\n",
       " for obtaining large amounts of labeled data. crowdsourcing is cheap and fast, but suffers from the problem of \n",
       "<mark class=\"entity\" style=\"background: #FE6BFE; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    low-quality data\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">COMPOUND</span>\n",
       "</mark>\n",
       ". to address this fundamental challenge in crowdsourcing, we propose a simple \n",
       "<mark class=\"entity\" style=\"background: #FE6BFE; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    payment mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">COMPOUND</span>\n",
       "</mark>\n",
       " to incentivize workers to answer only the questions that they are sure of and skip the rest. we show that surprisingly, under a mild and natural \n",
       "<mark class=\"entity\" style=\"background: #FE6BFE; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    no-free-lunch requirement\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">COMPOUND</span>\n",
       "</mark>\n",
       ", this mechanism is the one and only incentive-compatible \n",
       "<mark class=\"entity\" style=\"background: #FE6BFE; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    payment mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">COMPOUND</span>\n",
       "</mark>\n",
       " possible. we also show that among all possible incentive-compatible  mechanisms (that may or may not satisfy no-free-lunch), our mechanism makes the smallest possible payment to spammers.  interestingly, this unique mechanism takes a multiplicative form. the simplicity of the mechanism is an added benefit.  in preliminary experiments involving over several hundred workers, we observe a significant reduction in the \n",
       "<mark class=\"entity\" style=\"background: #FE6BFE; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    error rates\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">COMPOUND</span>\n",
       "</mark>\n",
       " under our unique mechanism for the same or lower monetary expenditure.</div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "column = 'compounds'\n",
    "render_entities(0, df, options=options, column=column)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's starting to look pretty good! By targetting words in the dependency tree that were tagged as belonging to a compound, we were able to drive the number of noun phrases down rather nicely. Next, we'll add these phrases to the list of entities we extracted from each abstract, to create a set which will include unigrams, bigrams, and more. Oh my!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7: Combine entities and compound noun phrases"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [],
   "source": [
    "def extract_comp_nouns(row_series, cols=[]):\n",
    "    \"\"\"Combine compound noun phrases and entities. \n",
    "    \n",
    "    Keyword arguments:\n",
    "    row_series -- a Pandas Series object\n",
    "    \n",
    "    \"\"\"\n",
    "    return {noun_tuple[0] for col in cols for noun_tuple in row_series[col]}\n",
    "\n",
    "def add_comp_nouns(df, cols=[]):\n",
    "    \"\"\"Create new column in data frame with merged entities.\n",
    "    \n",
    "    Keyword arguments:\n",
    "    df -- a dataframe object\n",
    "    cols -- a list of column names that need to be merged\n",
    "    \n",
    "    \"\"\"\n",
    "    df['comp_nouns'] = df.apply(extract_comp_nouns, axis=1, cols=cols)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>named_ents</th>\n",
       "      <th>nouns</th>\n",
       "      <th>named_nouns</th>\n",
       "      <th>noun_phrases</th>\n",
       "      <th>compounds</th>\n",
       "      <th>comp_nouns</th>\n",
       "      <th>clean_ents</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>crowdsourcing has gained immense popularity in...</td>\n",
       "      <td>[(several hundred, 896, 911, CARDINAL)]</td>\n",
       "      <td>[(crowdsourcing, 0, 13, NOUN), (popularity, 33...</td>\n",
       "      <td>[(crowdsourcing, 0, 13, NOUN), (popularity, 33...</td>\n",
       "      <td>[(crowdsourcing, 0, 13, NP), (immense populari...</td>\n",
       "      <td>[(machine learning applications, 47, 76, COMPO...</td>\n",
       "      <td>{mechanisms, low-quality data, problem, requir...</td>\n",
       "      <td>{mechanisms, low-quality data, problem, worker...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>convex potential minimisation is the de facto ...</td>\n",
       "      <td>[(2008, 109, 113, DATE), (2008, 500, 504, DATE...</td>\n",
       "      <td>[(minimisation, 17, 29, NOUN), (approach, 46, ...</td>\n",
       "      <td>[(minimisation, 17, 29, NOUN), (approach, 46, ...</td>\n",
       "      <td>[(convex potential minimisation, 0, 29, NP), (...</td>\n",
       "      <td>[(label noise, 143, 154, COMPOUND), (function ...</td>\n",
       "      <td>{function class, solution, result, performance...</td>\n",
       "      <td>{function class, solution, result, svm, paper,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one of the central questions in statistical le...</td>\n",
       "      <td>[(one, 0, 3, CARDINAL)]</td>\n",
       "      <td>[(questions, 19, 28, NOUN), (learning, 44, 52,...</td>\n",
       "      <td>[(questions, 19, 28, NOUN), (learning, 44, 52,...</td>\n",
       "      <td>[(the central questions, 7, 28, NP), (statisti...</td>\n",
       "      <td>[(learning theory, 44, 59, COMPOUND), (inferen...</td>\n",
       "      <td>{result, conditions, dimensionality reduction ...</td>\n",
       "      <td>{dimensionality reduction methods, conditions,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>we develop a sequential low-complexity inferen...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(complexity, 28, 38, NOUN), (inference, 39, 4...</td>\n",
       "      <td>[(complexity, 28, 38, NOUN), (inference, 39, 4...</td>\n",
       "      <td>[(we, 0, 2, NP), (a sequential low-complexity ...</td>\n",
       "      <td>[(low-complexity inference procedure, 28, 62, ...</td>\n",
       "      <td>{large-sample limit, concentration, asymptotic...</td>\n",
       "      <td>{classes, form parametric, number, function, e...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>monte carlo sampling for bayesian posterior in...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...</td>\n",
       "      <td>[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...</td>\n",
       "      <td>[(bayesian posterior inference, 25, 53, NP), (...</td>\n",
       "      <td>[(monte carlo sampling, 0, 20, COMPOUND), (mac...</td>\n",
       "      <td>{machine learning, discrete-time analogues, gr...</td>\n",
       "      <td>{machine learning, discrete-time analogues, gr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>we propose a robust portfolio optimization app...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(portfolio, 20, 29, NOUN), (optimization, 30,...</td>\n",
       "      <td>[(portfolio, 20, 29, NOUN), (optimization, 30,...</td>\n",
       "      <td>[(we, 0, 2, NP), (a robust portfolio optimizat...</td>\n",
       "      <td>[(portfolio optimization approach, 20, 51, COM...</td>\n",
       "      <td>{work, optimization, portfolio, dependence, th...</td>\n",
       "      <td>{dependence, theory, events, method, dimension...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>we study the problem of multiclass classificat...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (multiclass, 24, 34,...</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (multiclass, 24, 34,...</td>\n",
       "      <td>[(we, 0, 2, NP), (the problem, 9, 20, NP), (mu...</td>\n",
       "      <td>[(test time, 134, 143, COMPOUND), (tree constr...</td>\n",
       "      <td>{entropy, problem, classes, conditions, number...</td>\n",
       "      <td>{problem, classes, conditions, number, functio...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>we study the problem of hierarchical clusterin...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (clustering, 37, 47,...</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (clustering, 37, 47,...</td>\n",
       "      <td>[(we, 0, 2, NP), (the problem, 9, 20, NP), (hi...</td>\n",
       "      <td>[(lp relaxation, 182, 195, COMPOUND), (cost pe...</td>\n",
       "      <td>{problem, image, distances, partitions, cost, ...</td>\n",
       "      <td>{space, matching, terms, algorithm, cost perfe...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>we propose an approach for generating a sequen...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(approach, 14, 22, NOUN), (sequence, 40, 48, ...</td>\n",
       "      <td>[(approach, 14, 22, NOUN), (sequence, 40, 48, ...</td>\n",
       "      <td>[(we, 0, 2, NP), (an approach, 11, 22, NP), (a...</td>\n",
       "      <td>[(image stream, 77, 89, COMPOUND), (image stre...</td>\n",
       "      <td>{text-image parallel, language descriptions, m...</td>\n",
       "      <td>{text-image parallel, language descriptions, m...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>given a similarity graph between items, correl...</td>\n",
       "      <td>[(3-approximation, 257, 272, CARDINAL), (graph...</td>\n",
       "      <td>[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...</td>\n",
       "      <td>[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...</td>\n",
       "      <td>[(a similarity graph, 6, 24, NP), (items, 33, ...</td>\n",
       "      <td>[(similarity graph, 8, 24, COMPOUND), (correla...</td>\n",
       "      <td>{practice kwikcluster, similarity, ratio, prac...</td>\n",
       "      <td>{practice kwikcluster, ratio, serializability,...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text  \\\n",
       "0  crowdsourcing has gained immense popularity in...   \n",
       "1  convex potential minimisation is the de facto ...   \n",
       "2  one of the central questions in statistical le...   \n",
       "3  we develop a sequential low-complexity inferen...   \n",
       "4  monte carlo sampling for bayesian posterior in...   \n",
       "5  we propose a robust portfolio optimization app...   \n",
       "6  we study the problem of multiclass classificat...   \n",
       "7  we study the problem of hierarchical clusterin...   \n",
       "8  we propose an approach for generating a sequen...   \n",
       "9  given a similarity graph between items, correl...   \n",
       "\n",
       "                                          named_ents  \\\n",
       "0            [(several hundred, 896, 911, CARDINAL)]   \n",
       "1  [(2008, 109, 113, DATE), (2008, 500, 504, DATE...   \n",
       "2                            [(one, 0, 3, CARDINAL)]   \n",
       "3                                                 []   \n",
       "4                                                 []   \n",
       "5                                                 []   \n",
       "6                                                 []   \n",
       "7                                                 []   \n",
       "8                                                 []   \n",
       "9  [(3-approximation, 257, 272, CARDINAL), (graph...   \n",
       "\n",
       "                                               nouns  \\\n",
       "0  [(crowdsourcing, 0, 13, NOUN), (popularity, 33...   \n",
       "1  [(minimisation, 17, 29, NOUN), (approach, 46, ...   \n",
       "2  [(questions, 19, 28, NOUN), (learning, 44, 52,...   \n",
       "3  [(complexity, 28, 38, NOUN), (inference, 39, 4...   \n",
       "4  [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...   \n",
       "5  [(portfolio, 20, 29, NOUN), (optimization, 30,...   \n",
       "6  [(problem, 13, 20, NOUN), (multiclass, 24, 34,...   \n",
       "7  [(problem, 13, 20, NOUN), (clustering, 37, 47,...   \n",
       "8  [(approach, 14, 22, NOUN), (sequence, 40, 48, ...   \n",
       "9  [(similarity, 8, 18, NOUN), (graph, 19, 24, NO...   \n",
       "\n",
       "                                         named_nouns  \\\n",
       "0  [(crowdsourcing, 0, 13, NOUN), (popularity, 33...   \n",
       "1  [(minimisation, 17, 29, NOUN), (approach, 46, ...   \n",
       "2  [(questions, 19, 28, NOUN), (learning, 44, 52,...   \n",
       "3  [(complexity, 28, 38, NOUN), (inference, 39, 4...   \n",
       "4  [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...   \n",
       "5  [(portfolio, 20, 29, NOUN), (optimization, 30,...   \n",
       "6  [(problem, 13, 20, NOUN), (multiclass, 24, 34,...   \n",
       "7  [(problem, 13, 20, NOUN), (clustering, 37, 47,...   \n",
       "8  [(approach, 14, 22, NOUN), (sequence, 40, 48, ...   \n",
       "9  [(similarity, 8, 18, NOUN), (graph, 19, 24, NO...   \n",
       "\n",
       "                                        noun_phrases  \\\n",
       "0  [(crowdsourcing, 0, 13, NP), (immense populari...   \n",
       "1  [(convex potential minimisation, 0, 29, NP), (...   \n",
       "2  [(the central questions, 7, 28, NP), (statisti...   \n",
       "3  [(we, 0, 2, NP), (a sequential low-complexity ...   \n",
       "4  [(bayesian posterior inference, 25, 53, NP), (...   \n",
       "5  [(we, 0, 2, NP), (a robust portfolio optimizat...   \n",
       "6  [(we, 0, 2, NP), (the problem, 9, 20, NP), (mu...   \n",
       "7  [(we, 0, 2, NP), (the problem, 9, 20, NP), (hi...   \n",
       "8  [(we, 0, 2, NP), (an approach, 11, 22, NP), (a...   \n",
       "9  [(a similarity graph, 6, 24, NP), (items, 33, ...   \n",
       "\n",
       "                                           compounds  \\\n",
       "0  [(machine learning applications, 47, 76, COMPO...   \n",
       "1  [(label noise, 143, 154, COMPOUND), (function ...   \n",
       "2  [(learning theory, 44, 59, COMPOUND), (inferen...   \n",
       "3  [(low-complexity inference procedure, 28, 62, ...   \n",
       "4  [(monte carlo sampling, 0, 20, COMPOUND), (mac...   \n",
       "5  [(portfolio optimization approach, 20, 51, COM...   \n",
       "6  [(test time, 134, 143, COMPOUND), (tree constr...   \n",
       "7  [(lp relaxation, 182, 195, COMPOUND), (cost pe...   \n",
       "8  [(image stream, 77, 89, COMPOUND), (image stre...   \n",
       "9  [(similarity graph, 8, 24, COMPOUND), (correla...   \n",
       "\n",
       "                                          comp_nouns  \\\n",
       "0  {mechanisms, low-quality data, problem, requir...   \n",
       "1  {function class, solution, result, performance...   \n",
       "2  {result, conditions, dimensionality reduction ...   \n",
       "3  {large-sample limit, concentration, asymptotic...   \n",
       "4  {machine learning, discrete-time analogues, gr...   \n",
       "5  {work, optimization, portfolio, dependence, th...   \n",
       "6  {entropy, problem, classes, conditions, number...   \n",
       "7  {problem, image, distances, partitions, cost, ...   \n",
       "8  {text-image parallel, language descriptions, m...   \n",
       "9  {practice kwikcluster, similarity, ratio, prac...   \n",
       "\n",
       "                                          clean_ents  \n",
       "0  {mechanisms, low-quality data, problem, worker...  \n",
       "1  {function class, solution, result, svm, paper,...  \n",
       "2  {dimensionality reduction methods, conditions,...  \n",
       "3  {classes, form parametric, number, function, e...  \n",
       "4  {machine learning, discrete-time analogues, gr...  \n",
       "5  {dependence, theory, events, method, dimension...  \n",
       "6  {problem, classes, conditions, number, functio...  \n",
       "7  {space, matching, terms, algorithm, cost perfe...  \n",
       "8  {text-image parallel, language descriptions, m...  \n",
       "9  {practice kwikcluster, ratio, serializability,...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "cols = ['nouns', 'compounds']\n",
    "add_comp_nouns(df, cols=cols)\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div class=\"entities\" style=\"line-height: 2.5\">\n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    crowdsourcing\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " has gained immense \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    popularity\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " in \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    machine\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " learning \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    applications\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " for obtaining large \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    amounts\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " of labeled \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    data\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ". \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    crowdsourcing\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " is cheap and fast, but suffers from the \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    problem\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " of low-\n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    quality\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    data\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ". to address this fundamental \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    challenge\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " in \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    crowdsourcing\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ", we propose a simple \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    payment\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " to incentivize \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    workers\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " to answer only the \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    questions\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " that they are sure of and skip the \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    rest\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ". we show that surprisingly, under a mild and natural \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    no\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       "-free-\n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    lunch\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    requirement\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ", this \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " is the one and only \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    incentive\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       "-compatible \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    payment\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " possible. we also show that among all possible \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    incentive\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       "-compatible  \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanisms\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " (that may or may not satisfy no-free-\n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    lunch\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       "), our \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " makes the smallest possible \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    payment\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " to \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    spammers\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ".  interestingly, this unique \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " takes a multiplicative \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    form\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ". the \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    simplicity\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " of the \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " is an added \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    benefit\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ".  in preliminary \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    experiments\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " involving over several hundred \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    workers\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ", we observe a significant \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    reduction\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " in the \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    error\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    rates\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " under our unique \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       " for the same or lower monetary \n",
       "<mark class=\"entity\" style=\"background: #18CFE6; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    expenditure\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NOUN</span>\n",
       "</mark>\n",
       ".</div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# take a look at all the nouns again\n",
    "column = 'named_nouns'\n",
    "render_entities(0, df, options=options, column=column)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div class=\"entities\" style=\"line-height: 2.5\">crowdsourcing has gained immense popularity in \n",
       "<mark class=\"entity\" style=\"background: #FE6BFE; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    machine learning applications\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">COMPOUND</span>\n",
       "</mark>\n",
       " for obtaining large amounts of labeled data. crowdsourcing is cheap and fast, but suffers from the problem of \n",
       "<mark class=\"entity\" style=\"background: #FE6BFE; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    low-quality data\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">COMPOUND</span>\n",
       "</mark>\n",
       ". to address this fundamental challenge in crowdsourcing, we propose a simple \n",
       "<mark class=\"entity\" style=\"background: #FE6BFE; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    payment mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">COMPOUND</span>\n",
       "</mark>\n",
       " to incentivize workers to answer only the questions that they are sure of and skip the rest. we show that surprisingly, under a mild and natural \n",
       "<mark class=\"entity\" style=\"background: #FE6BFE; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    no-free-lunch requirement\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">COMPOUND</span>\n",
       "</mark>\n",
       ", this mechanism is the one and only incentive-compatible \n",
       "<mark class=\"entity\" style=\"background: #FE6BFE; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    payment mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">COMPOUND</span>\n",
       "</mark>\n",
       " possible. we also show that among all possible incentive-compatible  mechanisms (that may or may not satisfy no-free-lunch), our mechanism makes the smallest possible payment to spammers.  interestingly, this unique mechanism takes a multiplicative form. the simplicity of the mechanism is an added benefit.  in preliminary experiments involving over several hundred workers, we observe a significant reduction in the \n",
       "<mark class=\"entity\" style=\"background: #FE6BFE; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    error rates\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">COMPOUND</span>\n",
       "</mark>\n",
       " under our unique mechanism for the same or lower monetary expenditure.</div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# take a look at all the compound noun phrases again\n",
    "column = 'compounds'\n",
    "render_entities(0, df, options=options, column=column)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'amounts',\n",
       " 'applications',\n",
       " 'benefit',\n",
       " 'challenge',\n",
       " 'crowdsourcing',\n",
       " 'data',\n",
       " 'error',\n",
       " 'error rates',\n",
       " 'expenditure',\n",
       " 'experiments',\n",
       " 'form',\n",
       " 'incentive',\n",
       " 'low-quality data',\n",
       " 'lunch',\n",
       " 'machine',\n",
       " 'machine learning applications',\n",
       " 'mechanism',\n",
       " 'mechanisms',\n",
       " 'no',\n",
       " 'no-free-lunch requirement',\n",
       " 'payment',\n",
       " 'payment mechanism',\n",
       " 'popularity',\n",
       " 'problem',\n",
       " 'quality',\n",
       " 'questions',\n",
       " 'rates',\n",
       " 'reduction',\n",
       " 'requirement',\n",
       " 'rest',\n",
       " 'simplicity',\n",
       " 'spammers',\n",
       " 'workers'}"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# take a look at combined entities\n",
    "df['comp_nouns'][0] "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have all the entities grouped together, we can see how good we are doing. We've successfully captured single-word as well as n-grams, but there appear to be a lot of duplicates. Words that should've been included in a phrase were somehow split apart, most likely as a result of not properly dealing with hyphenation when we first tokenized our abstracts. \n",
    "\n",
    "Not to worry, this should be relatively easy to take care. We'll also apply a few other heuristics to clean up our list and remove the most common English words to further pare down the list of entities.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 8: Reduce entity count with heuristics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [],
   "source": [
    "def drop_duplicate_np_splits(ents):\n",
    "    \"\"\"Drop any entities that are already captured by noun phrases. \n",
    "    \n",
    "    Keyword arguments:\n",
    "    ents -- a set of entities\n",
    "    \n",
    "    \"\"\"\n",
    "    drop_ents = set()\n",
    "    for ent in ents:\n",
    "        if len(ent.split(' ')) > 1:\n",
    "            for e in ent.split(' '):\n",
    "                if e in ents:\n",
    "                    drop_ents.add(e)\n",
    "    return ents - drop_ents\n",
    "\n",
    "def drop_single_char_nps(ents):\n",
    "    \"\"\"Within an entity, drop single characters. \n",
    "    \n",
    "    Keyword arguments:\n",
    "    ents -- a set of entities\n",
    "    \n",
    "    \"\"\"\n",
    "    return {' '.join([e for e in ent.split(' ') if not len(e) == 1]) for ent in ents}\n",
    "\n",
    "def drop_double_char(ents):\n",
    "    \"\"\"Drop any entities that are less than three characters. \n",
    "    \n",
    "    Keyword arguments:\n",
    "    ents -- a set of entities\n",
    "    \n",
    "    \"\"\"\n",
    "    drop_ents = {ent for ent in ents if len(ent) < 3}\n",
    "    return ents - drop_ents\n",
    "\n",
    "def keep_alpha(ents):\n",
    "    \"\"\"Keep only entities with alphabetical unicode characters, hyphens, and spaces. \n",
    "    \n",
    "    Keyword arguments:\n",
    "    ents -- a set of entities\n",
    "    \n",
    "    \"\"\"\n",
    "    keep_char = set('-abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ')\n",
    "    drop_ents = {ent for ent in ents if not set(ent).issubset(keep_char)}\n",
    "    return ents - drop_ents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These last four functions will slice and dice the list of entities gathered from each abstract in various ways. In addition to this granular processing, we'll also want to remove words that are frequent in the English language, as a heuristic to naturally drop stop words and uncover the domain of each academic source. \n",
    "\n",
    "Why is this?\n",
    "\n",
    "Well, in NLP, as in search engine optimization (SEO), the most common words in a given corpus are known as [stop words](https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html). These unfortunate candidates are hunted down with extreme prejudice and removed from the population to improve search results, enhance semantic analysis, and in our case, help restrict the domain. This is because removing stop words automatically limits the vocabulary of a corpus to the words that are less frequent and therefore, more likely to exist in that abstract than anywhere else. \n",
    "\n",
    "You can, of course, argue that the most common words in a scientific paper might in fact be the most important concepts, but stop words are usually overwhelmingingly overrepresented in any corpus. This intuition however, is exactly why we aren't going to simply take the most common words in one specific abstract and remove them. Instead, we'll be targetting the most frequent words based on a large, general domain sample of the English language. \n",
    "\n",
    "The \"freq_words.csv\" file you might have noticed earlier in our file path, is actually a list generated from a corpus with 10 billion words gathered by the good people at [Word Frequencey Data](https://www.wordfrequency.info/).\n",
    "\n",
    "Let's take a look at the list and then remove these words from our set of entities. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "freq_words.csv nips.csv\r\n"
     ]
    }
   ],
   "source": [
    "!ls {PATH}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Rank</th>\n",
       "      <th>Word</th>\n",
       "      <th>Part of speech</th>\n",
       "      <th>Frequency</th>\n",
       "      <th>Dispersion</th>\n",
       "      <th>Unnamed: 5</th>\n",
       "      <th>Unnamed: 6</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>the</td>\n",
       "      <td>a</td>\n",
       "      <td>22038615.0</td>\n",
       "      <td>0.98</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2.0</td>\n",
       "      <td>be</td>\n",
       "      <td>v</td>\n",
       "      <td>12545825.0</td>\n",
       "      <td>0.97</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3.0</td>\n",
       "      <td>and</td>\n",
       "      <td>c</td>\n",
       "      <td>10741073.0</td>\n",
       "      <td>0.99</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4.0</td>\n",
       "      <td>of</td>\n",
       "      <td>i</td>\n",
       "      <td>10343885.0</td>\n",
       "      <td>0.97</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4996</th>\n",
       "      <td>4996.0</td>\n",
       "      <td>plaintiff</td>\n",
       "      <td>n</td>\n",
       "      <td>5312.0</td>\n",
       "      <td>0.88</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4997</th>\n",
       "      <td>4997.0</td>\n",
       "      <td>kid</td>\n",
       "      <td>v</td>\n",
       "      <td>5094.0</td>\n",
       "      <td>0.92</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4998</th>\n",
       "      <td>4998.0</td>\n",
       "      <td>middle-class</td>\n",
       "      <td>j</td>\n",
       "      <td>5025.0</td>\n",
       "      <td>0.93</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4999</th>\n",
       "      <td>4999.0</td>\n",
       "      <td>apology</td>\n",
       "      <td>n</td>\n",
       "      <td>4972.0</td>\n",
       "      <td>0.94</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5000</th>\n",
       "      <td>5000.0</td>\n",
       "      <td>till</td>\n",
       "      <td>i</td>\n",
       "      <td>5079.0</td>\n",
       "      <td>0.92</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5001 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "        Rank          Word Part of speech   Frequency  Dispersion  Unnamed: 5  \\\n",
       "0        NaN           NaN            NaN         NaN         NaN         NaN   \n",
       "1        1.0           the              a  22038615.0        0.98         NaN   \n",
       "2        2.0            be              v  12545825.0        0.97         NaN   \n",
       "3        3.0           and              c  10741073.0        0.99         NaN   \n",
       "4        4.0            of              i  10343885.0        0.97         NaN   \n",
       "...      ...           ...            ...         ...         ...         ...   \n",
       "4996  4996.0     plaintiff              n      5312.0        0.88         NaN   \n",
       "4997  4997.0           kid              v      5094.0        0.92         NaN   \n",
       "4998  4998.0  middle-class              j      5025.0        0.93         NaN   \n",
       "4999  4999.0       apology              n      4972.0        0.94         NaN   \n",
       "5000  5000.0          till              i      5079.0        0.92         NaN   \n",
       "\n",
       "      Unnamed: 6  \n",
       "0            NaN  \n",
       "1            NaN  \n",
       "2            NaN  \n",
       "3            NaN  \n",
       "4            NaN  \n",
       "...          ...  \n",
       "4996         NaN  \n",
       "4997         NaN  \n",
       "4998         NaN  \n",
       "4999         NaN  \n",
       "5000         NaN  \n",
       "\n",
       "[5001 rows x 7 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "filename = 'freq_words.csv'\n",
    "freq_words_df = pd.read_csv(f'{PATH}{filename}')\n",
    "display(freq_words_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1                the\n",
       "2                 be\n",
       "3                and\n",
       "4                 of\n",
       "5                  a\n",
       "            ...     \n",
       "4996       plaintiff\n",
       "4997             kid\n",
       "4998    middle-class\n",
       "4999         apology\n",
       "5000            till\n",
       "Name: Word, Length: 5000, dtype: object"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "freq_words = freq_words_df['Word'].iloc[1:]\n",
    "display(freq_words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {},
   "outputs": [],
   "source": [
    "def remove_freq_words(ents):\n",
    "    \"\"\"Drop any entities in the 5000 most common words in the English language.\n",
    "    \n",
    "    Keyword arguments:\n",
    "    ents -- a set of entities (modified in place and returned)\n",
    "    \n",
    "    \"\"\"\n",
    "    filename = 'freq_words.csv'\n",
    "    PATH = './data/'\n",
    "    # skip row 0, which is empty (all NaN) in this file\n",
    "    freq_words = pd.read_csv(f'{PATH}{filename}')['Word'].iloc[1:]\n",
    "    for word in freq_words:\n",
    "        ents.discard(word) # no-op if the word is not in the set of abstract entities\n",
    "    return ents\n",
    "\n",
    "def add_clean_ents(df, funcs=()):\n",
    "    \"\"\"Create a new column in the data frame with cleaned entities (modifies df in place).\n",
    "    \n",
    "    Keyword arguments:\n",
    "    df -- a dataframe object\n",
    "    funcs -- a sequence of heuristic functions, applied to the entities in order\n",
    "    \n",
    "    \"\"\"\n",
    "    col = 'clean_ents'\n",
    "    df[col] = df['comp_nouns']\n",
    "    for f in funcs:\n",
    "        df[col] = df[col].apply(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>named_ents</th>\n",
       "      <th>nouns</th>\n",
       "      <th>named_nouns</th>\n",
       "      <th>noun_phrases</th>\n",
       "      <th>compounds</th>\n",
       "      <th>comp_nouns</th>\n",
       "      <th>clean_ents</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>crowdsourcing has gained immense popularity in...</td>\n",
       "      <td>[(several hundred, 896, 911, CARDINAL)]</td>\n",
       "      <td>[(crowdsourcing, 0, 13, NOUN), (popularity, 33...</td>\n",
       "      <td>[(crowdsourcing, 0, 13, NOUN), (popularity, 33...</td>\n",
       "      <td>[(crowdsourcing, 0, 13, NP), (immense populari...</td>\n",
       "      <td>[(machine learning applications, 47, 76, COMPO...</td>\n",
       "      <td>{mechanisms, low-quality data, problem, requir...</td>\n",
       "      <td>{mechanisms, low-quality data, workers, questi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>convex potential minimisation is the de facto ...</td>\n",
       "      <td>[(2008, 109, 113, DATE), (2008, 500, 504, DATE...</td>\n",
       "      <td>[(minimisation, 17, 29, NOUN), (approach, 46, ...</td>\n",
       "      <td>[(minimisation, 17, 29, NOUN), (approach, 46, ...</td>\n",
       "      <td>[(convex potential minimisation, 0, 29, NP), (...</td>\n",
       "      <td>[(label noise, 143, 154, COMPOUND), (function ...</td>\n",
       "      <td>{function class, solution, result, performance...</td>\n",
       "      <td>{function class, svm, guessing, learners, conv...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>one of the central questions in statistical le...</td>\n",
       "      <td>[(one, 0, 3, CARDINAL)]</td>\n",
       "      <td>[(questions, 19, 28, NOUN), (learning, 44, 52,...</td>\n",
       "      <td>[(questions, 19, 28, NOUN), (learning, 44, 52,...</td>\n",
       "      <td>[(the central questions, 7, 28, NP), (statisti...</td>\n",
       "      <td>[(learning theory, 44, 59, COMPOUND), (inferen...</td>\n",
       "      <td>{result, conditions, dimensionality reduction ...</td>\n",
       "      <td>{dimensionality reduction methods, conditions,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>we develop a sequential low-complexity inferen...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(complexity, 28, 38, NOUN), (inference, 39, 4...</td>\n",
       "      <td>[(complexity, 28, 38, NOUN), (inference, 39, 4...</td>\n",
       "      <td>[(we, 0, 2, NP), (a sequential low-complexity ...</td>\n",
       "      <td>[(low-complexity inference procedure, 28, 62, ...</td>\n",
       "      <td>{large-sample limit, concentration, asymptotic...</td>\n",
       "      <td>{classes, form parametric, methods, dirichlet ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>monte carlo sampling for bayesian posterior in...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...</td>\n",
       "      <td>[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...</td>\n",
       "      <td>[(bayesian posterior inference, 25, 53, NP), (...</td>\n",
       "      <td>[(monte carlo sampling, 0, 20, COMPOUND), (mac...</td>\n",
       "      <td>{machine learning, discrete-time analogues, gr...</td>\n",
       "      <td>{machine learning, discrete-time analogues, gr...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>we propose a robust portfolio optimization app...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(portfolio, 20, 29, NOUN), (optimization, 30,...</td>\n",
       "      <td>[(portfolio, 20, 29, NOUN), (optimization, 30,...</td>\n",
       "      <td>[(we, 0, 2, NP), (a robust portfolio optimizat...</td>\n",
       "      <td>[(portfolio optimization approach, 20, 51, COM...</td>\n",
       "      <td>{work, optimization, portfolio, dependence, th...</td>\n",
       "      <td>{dependence, events, asset returns, dimensions...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>we study the problem of multiclass classificat...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (multiclass, 24, 34,...</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (multiclass, 24, 34,...</td>\n",
       "      <td>[(we, 0, 2, NP), (the problem, 9, 20, NP), (mu...</td>\n",
       "      <td>[(test time, 134, 143, COMPOUND), (tree constr...</td>\n",
       "      <td>{entropy, problem, classes, conditions, number...</td>\n",
       "      <td>{classes, conditions, partitions, multiclass, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>we study the problem of hierarchical clusterin...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (clustering, 37, 47,...</td>\n",
       "      <td>[(problem, 13, 20, NOUN), (clustering, 37, 47,...</td>\n",
       "      <td>[(we, 0, 2, NP), (the problem, 9, 20, NP), (hi...</td>\n",
       "      <td>[(lp relaxation, 182, 195, COMPOUND), (cost pe...</td>\n",
       "      <td>{problem, image, distances, partitions, cost, ...</td>\n",
       "      <td>{matching, algorithm, cost perfect, lp relaxat...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>we propose an approach for generating a sequen...</td>\n",
       "      <td>[]</td>\n",
       "      <td>[(approach, 14, 22, NOUN), (sequence, 40, 48, ...</td>\n",
       "      <td>[(approach, 14, 22, NOUN), (sequence, 40, 48, ...</td>\n",
       "      <td>[(we, 0, 2, NP), (an approach, 11, 22, NP), (a...</td>\n",
       "      <td>[(image stream, 77, 89, COMPOUND), (image stre...</td>\n",
       "      <td>{text-image parallel, language descriptions, m...</td>\n",
       "      <td>{text-image parallel, language descriptions, m...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>given a similarity graph between items, correl...</td>\n",
       "      <td>[(3-approximation, 257, 272, CARDINAL), (graph...</td>\n",
       "      <td>[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...</td>\n",
       "      <td>[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...</td>\n",
       "      <td>[(a similarity graph, 6, 24, NP), (items, 33, ...</td>\n",
       "      <td>[(similarity graph, 8, 24, COMPOUND), (correla...</td>\n",
       "      <td>{practice kwikcluster, similarity, ratio, prac...</td>\n",
       "      <td>{practice kwikcluster, serializability, result...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text  \\\n",
       "0  crowdsourcing has gained immense popularity in...   \n",
       "1  convex potential minimisation is the de facto ...   \n",
       "2  one of the central questions in statistical le...   \n",
       "3  we develop a sequential low-complexity inferen...   \n",
       "4  monte carlo sampling for bayesian posterior in...   \n",
       "5  we propose a robust portfolio optimization app...   \n",
       "6  we study the problem of multiclass classificat...   \n",
       "7  we study the problem of hierarchical clusterin...   \n",
       "8  we propose an approach for generating a sequen...   \n",
       "9  given a similarity graph between items, correl...   \n",
       "\n",
       "                                          named_ents  \\\n",
       "0            [(several hundred, 896, 911, CARDINAL)]   \n",
       "1  [(2008, 109, 113, DATE), (2008, 500, 504, DATE...   \n",
       "2                            [(one, 0, 3, CARDINAL)]   \n",
       "3                                                 []   \n",
       "4                                                 []   \n",
       "5                                                 []   \n",
       "6                                                 []   \n",
       "7                                                 []   \n",
       "8                                                 []   \n",
       "9  [(3-approximation, 257, 272, CARDINAL), (graph...   \n",
       "\n",
       "                                               nouns  \\\n",
       "0  [(crowdsourcing, 0, 13, NOUN), (popularity, 33...   \n",
       "1  [(minimisation, 17, 29, NOUN), (approach, 46, ...   \n",
       "2  [(questions, 19, 28, NOUN), (learning, 44, 52,...   \n",
       "3  [(complexity, 28, 38, NOUN), (inference, 39, 4...   \n",
       "4  [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...   \n",
       "5  [(portfolio, 20, 29, NOUN), (optimization, 30,...   \n",
       "6  [(problem, 13, 20, NOUN), (multiclass, 24, 34,...   \n",
       "7  [(problem, 13, 20, NOUN), (clustering, 37, 47,...   \n",
       "8  [(approach, 14, 22, NOUN), (sequence, 40, 48, ...   \n",
       "9  [(similarity, 8, 18, NOUN), (graph, 19, 24, NO...   \n",
       "\n",
       "                                         named_nouns  \\\n",
       "0  [(crowdsourcing, 0, 13, NOUN), (popularity, 33...   \n",
       "1  [(minimisation, 17, 29, NOUN), (approach, 46, ...   \n",
       "2  [(questions, 19, 28, NOUN), (learning, 44, 52,...   \n",
       "3  [(complexity, 28, 38, NOUN), (inference, 39, 4...   \n",
       "4  [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...   \n",
       "5  [(portfolio, 20, 29, NOUN), (optimization, 30,...   \n",
       "6  [(problem, 13, 20, NOUN), (multiclass, 24, 34,...   \n",
       "7  [(problem, 13, 20, NOUN), (clustering, 37, 47,...   \n",
       "8  [(approach, 14, 22, NOUN), (sequence, 40, 48, ...   \n",
       "9  [(similarity, 8, 18, NOUN), (graph, 19, 24, NO...   \n",
       "\n",
       "                                        noun_phrases  \\\n",
       "0  [(crowdsourcing, 0, 13, NP), (immense populari...   \n",
       "1  [(convex potential minimisation, 0, 29, NP), (...   \n",
       "2  [(the central questions, 7, 28, NP), (statisti...   \n",
       "3  [(we, 0, 2, NP), (a sequential low-complexity ...   \n",
       "4  [(bayesian posterior inference, 25, 53, NP), (...   \n",
       "5  [(we, 0, 2, NP), (a robust portfolio optimizat...   \n",
       "6  [(we, 0, 2, NP), (the problem, 9, 20, NP), (mu...   \n",
       "7  [(we, 0, 2, NP), (the problem, 9, 20, NP), (hi...   \n",
       "8  [(we, 0, 2, NP), (an approach, 11, 22, NP), (a...   \n",
       "9  [(a similarity graph, 6, 24, NP), (items, 33, ...   \n",
       "\n",
       "                                           compounds  \\\n",
       "0  [(machine learning applications, 47, 76, COMPO...   \n",
       "1  [(label noise, 143, 154, COMPOUND), (function ...   \n",
       "2  [(learning theory, 44, 59, COMPOUND), (inferen...   \n",
       "3  [(low-complexity inference procedure, 28, 62, ...   \n",
       "4  [(monte carlo sampling, 0, 20, COMPOUND), (mac...   \n",
       "5  [(portfolio optimization approach, 20, 51, COM...   \n",
       "6  [(test time, 134, 143, COMPOUND), (tree constr...   \n",
       "7  [(lp relaxation, 182, 195, COMPOUND), (cost pe...   \n",
       "8  [(image stream, 77, 89, COMPOUND), (image stre...   \n",
       "9  [(similarity graph, 8, 24, COMPOUND), (correla...   \n",
       "\n",
       "                                          comp_nouns  \\\n",
       "0  {mechanisms, low-quality data, problem, requir...   \n",
       "1  {function class, solution, result, performance...   \n",
       "2  {result, conditions, dimensionality reduction ...   \n",
       "3  {large-sample limit, concentration, asymptotic...   \n",
       "4  {machine learning, discrete-time analogues, gr...   \n",
       "5  {work, optimization, portfolio, dependence, th...   \n",
       "6  {entropy, problem, classes, conditions, number...   \n",
       "7  {problem, image, distances, partitions, cost, ...   \n",
       "8  {text-image parallel, language descriptions, m...   \n",
       "9  {practice kwikcluster, similarity, ratio, prac...   \n",
       "\n",
       "                                          clean_ents  \n",
       "0  {mechanisms, low-quality data, workers, questi...  \n",
       "1  {function class, svm, guessing, learners, conv...  \n",
       "2  {dimensionality reduction methods, conditions,...  \n",
       "3  {classes, form parametric, methods, dirichlet ...  \n",
       "4  {machine learning, discrete-time analogues, gr...  \n",
       "5  {dependence, events, asset returns, dimensions...  \n",
       "6  {classes, conditions, partitions, multiclass, ...  \n",
       "7  {matching, algorithm, cost perfect, lp relaxat...  \n",
       "8  {text-image parallel, language descriptions, m...  \n",
       "9  {practice kwikcluster, serializability, result...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "funcs = [drop_duplicate_np_splits, drop_double_char, keep_alpha, drop_single_char_nps, remove_freq_words]\n",
    "add_clean_ents(df, funcs)\n",
    "display(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 179,
   "metadata": {},
   "outputs": [],
   "source": [
    "def visualize_entities(df, idx=0):\n",
    "    \"\"\"Visualize the entities for a given abstract in the dataframe.\n",
    "    \n",
    "    Keyword arguments:\n",
    "    df -- a dataframe object\n",
    "    idx -- the index of interest for the dataframe (default 0)\n",
    "    \n",
    "    \"\"\"\n",
    "    # store entity start and end index for visualization in dummy df\n",
    "    ents = []\n",
    "    abstract = df['text'][idx]\n",
    "    for ent in df['clean_ents'][idx]:\n",
    "        i = abstract.find(ent) # index of the first occurrence of the entity in the abstract\n",
    "        if i == -1:\n",
    "            continue # skip entities that do not appear verbatim in the abstract\n",
    "        ents.append((ent, i, i+len(ent), 'ENTITY'))\n",
    "    ents.sort(key=lambda tup: tup[1])\n",
    "\n",
    "    dummy_df = pd.DataFrame([abstract, ents]).T # transpose dataframe\n",
    "    dummy_df.columns = ['text', 'clean_ents']\n",
    "    column = 'clean_ents'\n",
    "    render_entities(0, dummy_df, options=options, column=column)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 181,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div class=\"entities\" style=\"line-height: 2.5\">\n",
       "<mark class=\"entity\" style=\"background: #FF8800; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    crowdsourcing\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ENTITY</span>\n",
       "</mark>\n",
       " has gained immense popularity in \n",
       "<mark class=\"entity\" style=\"background: #FF8800; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    machine learning applications\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ENTITY</span>\n",
       "</mark>\n",
       " for obtaining large \n",
       "<mark class=\"entity\" style=\"background: #FF8800; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    amounts\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ENTITY</span>\n",
       "</mark>\n",
       " of labeled data. crowdsourcing is cheap and fast, but suffers from the problem of \n",
       "<mark class=\"entity\" style=\"background: #FF8800; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    low-quality data\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ENTITY</span>\n",
       "</mark>\n",
       ". to address this fundamental challenge in crowdsourcing, we propose a simple \n",
       "<mark class=\"entity\" style=\"background: #FF8800; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    payment mechanism\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ENTITY</span>\n",
       "</mark>\n",
       " to incentivize \n",
       "<mark class=\"entity\" style=\"background: #FF8800; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    workers\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ENTITY</span>\n",
       "</mark>\n",
       " to answer only the \n",
       "<mark class=\"entity\" style=\"background: #FF8800; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    questions\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ENTITY</span>\n",
       "</mark>\n",
       " that they are sure of and skip the rest. we show that surprisingly, under a mild and natural \n",
       "<mark class=\"entity\" style=\"background: #FF8800; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    no-free-lunch requirement\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ENTITY</span>\n",
       "</mark>\n",
       ", this mechanism is the one and only incentive-compatible payment mechanism possible. we also show that among all possible incentive-compatible  \n",
       "<mark class=\"entity\" style=\"background: #FF8800; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    mechanisms\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ENTITY</span>\n",
       "</mark>\n",
       " (that may or may not satisfy no-free-lunch), our mechanism makes the smallest possible payment to \n",
       "<mark class=\"entity\" style=\"background: #FF8800; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    spammers\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ENTITY</span>\n",
       "</mark>\n",
       ".  interestingly, this unique mechanism takes a multiplicative form. the \n",
       "<mark class=\"entity\" style=\"background: #FF8800; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    simplicity\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ENTITY</span>\n",
       "</mark>\n",
       " of the mechanism is an added benefit.  in preliminary \n",
       "<mark class=\"entity\" style=\"background: #FF8800; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    experiments\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ENTITY</span>\n",
       "</mark>\n",
       " involving over several hundred workers, we observe a significant reduction in the \n",
       "<mark class=\"entity\" style=\"background: #FF8800; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    error rates\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ENTITY</span>\n",
       "</mark>\n",
       " under our unique mechanism for the same or lower monetary \n",
       "<mark class=\"entity\" style=\"background: #FF8800; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
       "    expenditure\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ENTITY</span>\n",
       "</mark>\n",
       ".</div>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "visualize_entities(df, 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's a good-looking list of concepts, wouldn't you say? By removing the most frequent English words and applying our clean-up heuristics, we kept mostly the important entities in this first abstract, although a few generic terms such as <em>amounts</em> still slip through. Let's finish up with a quick recap of our approach and some thoughts on what we can do going forward."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 9: Celebrate with excessive fist-pumping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Well, at the risk of tooting our own horn, I feel rather confident saying that we've accomplished what we set out to do! We took an abstract from a scientific paper, combined named entities with plain nouns, extracted compound noun phrases, and pared down the final list using heuristics and a filter on the 5,000 most frequent English words to generate a set of important concepts."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src='https://media.giphy.com/media/t3Mzdx0SA3Eis/giphy.gif'><center><span style=\"font-size:10px;\">Source: <a href=\"https://giphy.com/gifs/excited-the-office-yes-t3Mzdx0SA3Eis\">GIPHY</a></span></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Keep in mind that the point of this exercise wasn't to create the world's best entity extractor. It was to get a fast baseline for what we can do with limited knowledge of the domain and limited use of deep learning superpowers. We've ended up with a prototype that shows we can get relatively far using out-of-the-box methods, with minor scripting for customization. And the best part? Our approach didn't require extensive compute or proprietary software!\n",
    "\n",
    "Going forward, we'd want to test our approach on larger data sets (perhaps full scientific papers) and create an easy-to-use API for visualization, as well as for individual and batch processing of text sources. Improving the entity extraction itself might involve a <strong>language model trained on academic papers</strong> or the addition of <strong>other intelligent heuristics</strong>. At some point, we'd also want to link each entity to an external database with further information, so that our conversational academic paper program could orient these concepts within a larger knowledge graph.\n",
    "\n",
    "At the end of all of this, we've built a fast entity extraction prototype that moves us towards an engine for conversing with academic papers, which will (hopefully) lay the foundation for an automated scientific discovery tool.\n",
    "\n",
    "Great work!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}