{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
Source: spaCy Language Processing Pipelines
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Quick and Dirty - Entity Extraction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From idea to prototype in AI." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you've ever been around a startup or in the tech world for any significant amount of time, you've definitely encountered some, if not all of the following phrases: \"agile software development\", \"prototyping\", \"feedback loop\", \"rapid iteration\", etc.\n", "\n", "This Silicon Valley techno-babble can be distilled down to one simple concept, which just so happens to be the mantra of many a successful entrepreneur: test out your idea as quickly as possible, and then make it better over time. Stated more verbosely, before you invest mind and money into creating a cutting-edge solution to a problem, it might benefit you to get a baseline performance for your task using off-the-shelf techniques. Once you establish the efficacy of a low-cost, easy approach, you can then put on your Elon Musk hat and drive towards #innovation and #disruption. \n", "\n", "A concrete example might help illustrate this point:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Entity Extraction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say our goal was to create a natural language system that effectively allowed someone to converse with an academic paper. This task could be step one of many towards the development of an automated scientific discovery tool. Society can thank us later. \n", "\n", "But where do we begin? Well, a part of the solution has to deal with [knowledge extraction](https://en.wikipedia.org/wiki/Knowledge_extraction). 
In order to create a conversational engine that understands scientific papers, we'll first need to develop an entity recognition module, and this, lucky for us, is the topic of our notebook! \n", "\n", "\"What's an entity?\" you ask? Excellent question. Take a look at the following sentence:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Dr. Abraham is the primary author of this paper, and a physician in the specialty of internal medicine." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, it should be relatively straightforward for an English-speaking human to pick out the important concepts in this sentence:\n", "\n", "> **[Dr. Abraham]** is the **[primary author]** of this **[paper]**, and a **[physician]** in the **[specialty]** of **[internal medicine]**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These words and/or phrases are categorized as \"entities\" because they represent salient ideas, nouns, and noun phrases in the real world. A subset of entities can be \"named\", in that they correspond to specific places, people, organizations, and so on. A [named entity](https://en.wikipedia.org/wiki/Named_entity) is to a regular entity what \"Dr. Abraham\" is to a \"physician\". The good doctor is a real person and an instance of the \"physician\" class, and is therefore considered \"named\". Examples of named entities include \"Google\", \"Neil deGrasse Tyson\", and \"Tokyo\", while regular, garden-variety entities can include the list just mentioned, as well as things like \"dog\", \"newspaper\", \"task\", etc.\n", "\n", "Let's see if we can get a computer to run this kind of analysis to pull important concepts from sentences. 
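
One way to make the distinction concrete is to write the bracketed spans down as data. The sketch below is hand-annotated rather than model output, and the labels 'NAMED' and 'ENTITY' and the helper `find_spans` are hypothetical names chosen purely for illustration. Each span is stored as a (text, start, end, label) tuple of character offsets, which is a convenient shape for highlighting spans in the original sentence.

```python
sentence = ('Dr. Abraham is the primary author of this paper, '
            'and a physician in the specialty of internal medicine.')

# Hand-picked spans from the example sentence; the labels are illustrative only.
annotations = [
    ('Dr. Abraham', 'NAMED'),
    ('primary author', 'ENTITY'),
    ('paper', 'ENTITY'),
    ('physician', 'ENTITY'),
    ('specialty', 'ENTITY'),
    ('internal medicine', 'ENTITY'),
]


def find_spans(text, labelled_phrases):
    '''Return (phrase, start, end, label) tuples using character offsets.'''
    spans = []
    cursor = 0  # search left to right so repeated phrases map to later positions
    for phrase, label in labelled_phrases:
        start = text.index(phrase, cursor)  # raises ValueError if the phrase is absent
        end = start + len(phrase)
        spans.append((phrase, start, end, label))
        cursor = end
    return spans


spans = find_spans(sentence, annotations)

# Named entities are the subset that refer to one specific thing in the world.
named = [span for span in spans if span[3] == 'NAMED']
```

Slicing the sentence with any stored pair of offsets, for example `sentence[0:11]`, gives back the span text, which makes the tuples easy to sanity-check.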
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Task" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For our conversational academic paper program, we won't be satisfied with simply capturing named entities, because we need to understand the relationships between general concepts as well as actual things, places, etc. Unfortunately, while most out-of-the-box text processing libraries have a moderately useful named entity recognizer, they have little to no support for a generalized entity recognizer. \n", "\n", "This is because of a subtle, yet important constraint. \n", "\n", "Entities, as we've discussed, correspond to a superset of named entities, which should make them easier to extract. Indeed, blindly pulling all entities from a text source is in fact simple, but it's sadly not all that useful. In order to justify this exercise, we'd need to develop an entity extraction approach that is restricted to, or is cognizant of, some particular domain, for example, neuroscience, psychology, computer science, economics, etc. This paradoxical complexity makes it nontrivial to create a generic, but useful, entity recognizer. Hence the lack of support in most open-source libraries that deal with natural language processing. \n", "\n", "To largely simplify our task then, we must generate a set of entities from a scientific paper, that is larger than a simple list of named entities, but smaller than the giant list of all entities, restricted to the domain of a particular paper in question. \n", "\n", "Yikes. Are you sweating a little? Because I am. Instead of reaching for some Ibuprofen and deep learning pills, let's make a prototype using a little ingenuity, simple open-source code, and a lot of heuristics. Hopefully, through this process, we'll also learn a bit about the text processing pipeline that brings understanding natural language into the realm of the possible. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Enought chit-chat. Let's get to it!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fun fact: Curious about what 'autoreload' does? Check this out." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import pandas as pd\n", "import spacy\n", "from spacy.displacy.render import EntityRenderer\n", "from IPython.core.display import display, HTML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Utils and Prep" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do some basic housekeeping before we start diving headfirst into entity extraction. We'll need to deal with visualization, load up a language model, and of course, examine/set-up our data source.\n", "\n", "### Show and Tell\n", "Our prototype will lean heavily on a popular natural langauge processing (NLP) library known as spaCy, which also has a wonderful set of classes and methods defined to help visualize parts of the NLP pipeline. Up top, where we've imported modules, you'll have noticed that we're pulling 'EntityRenderer' from spaCy's displacy module, as we'll be repurposing some of this code for our... um... purposes. In general, this is a good exercise if you ever want to get your hands dirty and really learn how certain classes work in your friendly neighborhood open-source projects. Nothing should ever be off-limits or a black box; always dissect and play with your code before you eat it. \n", "\n", "Wander on over to spaCy's [website](https://spacy.io/), and you'll quickly discover that they've put in some serious thought into making the user interface absolutely gorgeous. 
(While Matthew undeniably had some input on this, I'm going to make an intelligent assumption that the design ideas are probably Ines' [contribution](https://explosion.ai/about)). \n", "\n", "<rant> Why spend so much time discussing visualization? Well, one of my biggest pet peeves is this: even if you can create a product, if you don't put in the time to make it look beautiful, or delightful to use, then you don't care about packaging your ideas for export to an audience. And that makes me sad. Once you get something working, make it pretty. </rant>" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def custom_render(doc, df, column, options={}, page=False, minify=False, idx=0):\n", "    \"\"\"Overload the spaCy built-in rendering to allow custom part-of-speech (POS) tags.\n", "    \n", "    Keyword arguments:\n", "    doc -- a spaCy nlp doc object\n", "    df -- a pandas dataframe object\n", "    column -- the name of a column of interest in the dataframe\n", "    options -- various options to feed into the spaCy renderer, including colors\n", "    page -- rendering markup as full HTML page (default False)\n", "    minify -- for compact HTML (default False)\n", "    idx -- index for specific query or doc in dataframe (default 0)\n", "    \n", "    \"\"\"\n", "    renderer, converter = EntityRenderer, parse_custom_ents\n", "    renderer = renderer(options=options)\n", "    parsed = [converter(doc, df=df, idx=idx, column=column)]\n", "    html = renderer.render(parsed, page=page, minify=minify).strip()\n", "    return display(HTML(html))\n", "\n", "def parse_custom_ents(doc, df, idx, column):\n", "    \"\"\"Parse custom entity types that aren't in the original spaCy module.\n", "    \n", "    Keyword arguments:\n", "    doc -- a spaCy nlp doc object\n", "    df -- a pandas dataframe object\n", "    idx -- index for specific query or doc in dataframe\n", "    column -- the name of a column of interest in the dataframe\n", "    \n", "    \"\"\"\n", "    if column in df.columns:\n",
"        entities = df[column][idx]\n", "        ents = [{'start': ent[1], 'end': ent[2], 'label': ent[3]}\n", "                for ent in entities]\n", "    else:\n", "        ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}\n", "                for ent in doc.ents]\n", "    return {'text': doc.text, 'ents': ents, 'title': None}\n", "\n", "def render_entities(idx, df, options={}, column='named_ents'):\n", "    \"\"\"A wrapper function to get text from a dataframe and render it visually in Jupyter notebooks.\n", "    \n", "    Keyword arguments:\n", "    idx -- index for specific query or doc in dataframe\n", "    df -- a pandas dataframe object\n", "    options -- various options to feed into the spaCy renderer, including colors\n", "    column -- the name of a column of interest in the dataframe (default 'named_ents')\n", "    \n", "    \"\"\"\n", "    text = df['text'][idx]\n", "    custom_render(nlp(text), df=df, column=column, options=options, idx=idx)" ] }, { "cell_type": "code", "execution_count": 154, "metadata": {}, "outputs": [], "source": [ "# colors for additional part of speech tags we want to visualize\n", "options = {\n", "    'colors': {'COMPOUND': '#FE6BFE', 'PROPN': '#18CFE6', 'NOUN': '#18CFE6', 'NP': '#1EECA6', 'ENTITY': '#FF8800'}\n", "}" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_rows', 10) # edit how jupyter will render our pandas dataframes\n", "pd.options.mode.chained_assignment = None # prevent warning about working on a copy of a dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "spaCy's pre-built models are trained on different corpora of text, to capture parts-of-speech, extract named entities, and in general understand how to tokenize words into chunks that have meaning in a given language. 
\n", "\n", "We'll grab the 'en_core_web_lg' model by running the following command in the shell (uncomment it for the first run, then comment it out again so you don't re-download the model every time you go through the notebook). " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# !python -m spacy download en_core_web_lg\n", "nlp = spacy.load('en_core_web_lg')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fun fact: We can run shell commands in a Jupyter notebook by prefixing them with the bang operator ('!'). This is IPython's shell-escape syntax, a close relative of magic commands such as the '%autoreload' we saw at the beginning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Gather Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As our data source, we'll be using papers presented at the [Neural Information Processing Systems (NIPS)](https://nips.cc/) conference, which is held in a different location around the world each year. NIPS is the premier conference for all things machine learning and, considering our goal with this notebook, an apropos choice to source our data. We'll pull a conveniently packaged dataset from [Kaggle](https://www.kaggle.com/benhamner/nips-2015-papers/version/2/home), a data science competition site, and then work with a subset of the papers to keep our prototyping as lean and fast as possible. \n", "\n", "Once we've grabbed the files using Kaggle's [API](https://github.com/Kaggle/kaggle-api), we'll take a look at what we're working with. Let's store everything in a separate 'data' folder to keep our directory clean. I've discarded all extra files and renamed the essential one to 'nips.csv'. You'll see a few other files in there, but ignore them for now. 
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "PATH = './data/'" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "freq_words.csv nips.csv\r\n" ] } ], "source": [ "!ls {PATH}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fun fact: You can use Python variables in shell commands by nesting them inside curly braces." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "file = 'nips.csv'\n", "df = pd.read_csv(f'{PATH}{file}')\n", "\n", "mini_df = df[:10]\n", "mini_df.index = pd.RangeIndex(len(mini_df.index))\n", "\n", "# comment this out to run on full dataset\n", "df = mini_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Game Plan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we're all ready to get started, let's come up with a general list of tasks to guide our approach. \n", "\n", "
\n", "
1. Inspect and clean data\n", "
2. Extract named entities\n", "
3. Extract nouns\n", "
4. Combine named entities and nouns\n", "
5. Extract noun phrases\n", "
6. Extract compound noun phrases\n", "
7. Combine entities and compound noun phrases\n", "
8. Reduce entity count with heuristics\n", "
9. Celebrate with excessive fist-pumping\n", "
\n", "\n", "That doesn't look too bad now does it? Let's build ourselves a prototype entity extractor." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Inspect and clean data" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Id Title EventType \\\n", "0 5677 Double or Nothing: Multiplicative Incentive Me... Poster \n", "1 5941 Learning with Symmetric Label Noise: The Impor... Spotlight \n", "2 6019 Algorithmic Stability and Uniform Generalization Poster \n", "3 6035 Adaptive Low-Complexity Sequential Inference f... Poster \n", "4 5978 Covariance-Controlled Adaptive Langevin Thermo... Poster \n", "5 5714 Robust Portfolio Optimization Poster \n", "6 5937 Logarithmic Time Online Multiclass prediction Spotlight \n", "7 5802 Planar Ultrametrics for Image Segmentation Poster \n", "8 5776 Expressing an Image Stream with a Sequence of ... Poster \n", "9 5814 Parallel Correlation Clustering on Big Graphs Poster \n", "\n", " PdfName \\\n", "0 5677-double-or-nothing-multiplicative-incentiv... \n", "1 5941-learning-with-symmetric-label-noise-the-i... \n", "2 6019-algorithmic-stability-and-uniform-general... \n", "3 6035-adaptive-low-complexity-sequential-infere... \n", "4 5978-covariance-controlled-adaptive-langevin-t... \n", "5 5714-robust-portfolio-optimization.pdf \n", "6 5937-logarithmic-time-online-multiclass-predic... \n", "7 5802-planar-ultrametrics-for-image-segmentatio... \n", "8 5776-expressing-an-image-stream-with-a-sequenc... \n", "9 5814-parallel-correlation-clustering-on-big-gr... \n", "\n", " Abstract \\\n", "0 Crowdsourcing has gained immense popularity in... \n", "1 Convex potential minimisation is the de facto ... \n", "2 One of the central questions in statistical le... \n", "3 We develop a sequential low-complexity inferen... \n", "4 Monte Carlo sampling for Bayesian posterior in... \n", "5 We propose a robust portfolio optimization app... \n", "6 We study the problem of multiclass classificat... \n", "7 We study the problem of hierarchical clusterin... \n", "8 We propose an approach for generating a sequen... \n", "9 Given a similarity graph between items, correl... \n", "\n", " PaperText \n", "0 Double or Nothing: Multiplicative\\nIncentive M... 
\n", "1 Learning with Symmetric Label Noise: The\\nImpo... \n", "2 Algorithmic Stability and Uniform Generalizati... \n", "3 Adaptive Low-Complexity Sequential Inference f... \n", "4 Covariance-Controlled Adaptive Langevin\\nTherm... \n", "5 Robust Portfolio Optimization\\n\\nFang Han\\nDep... \n", "6 Logarithmic Time Online Multiclass prediction\\... \n", "7 Planar Ultrametrics for Image Segmentation\\n\\n... \n", "8 Expressing an Image Stream with a Sequence of\\... \n", "9 Parallel Correlation Clustering on Big Graphs\\... " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(df)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "lower = lambda x: x.lower() # make everything lowercase" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " text\n", "0 crowdsourcing has gained immense popularity in...\n", "1 convex potential minimisation is the de facto ...\n", "2 one of the central questions in statistical le...\n", "3 we develop a sequential low-complexity inferen...\n", "4 monte carlo sampling for bayesian posterior in...\n", "5 we propose a robust portfolio optimization app...\n", "6 we study the problem of multiclass classificat...\n", "7 we study the problem of hierarchical clusterin...\n", "8 we propose an approach for generating a sequen...\n", "9 given a similarity graph between items, correl..." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df = pd.DataFrame(df['Abstract'].apply(lower))\n", "df.columns = ['text']\n", "display(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initially, there was quite a bit of metadata associated with each entry, including a unique identifier, the type of paper presented at the conference, as well as the actual paper text. After pulling out just the abstracts, we've ended up with a clean, ready-to-go dataframe and can begin extracting entities. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Extract named entities" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def extract_named_ents(text):\n", "    \"\"\"Extract named entities with their start and end character offsets and labels, using spaCy's out-of-the-box model.\n",
"    \n", "    Keyword arguments:\n", "    text -- the actual text source from which to extract entities\n", "    \n", "    \"\"\"\n", "    return [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]\n", "\n", "def add_named_ents(df):\n", "    \"\"\"Create new column in data frame with named entity tuple extracted.\n", "    \n", "    Keyword arguments:\n", "    df -- a dataframe object\n", "    \n", "    \"\"\"\n", "    df['named_ents'] = df['text'].apply(extract_named_ents)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " text \\\n", "0 crowdsourcing has gained immense popularity in... \n", "1 convex potential minimisation is the de facto ... \n", "2 one of the central questions in statistical le... \n", "3 we develop a sequential low-complexity inferen... \n", "4 monte carlo sampling for bayesian posterior in... \n", "5 we propose a robust portfolio optimization app... \n", "6 we study the problem of multiclass classificat... \n", "7 we study the problem of hierarchical clusterin... \n", "8 we propose an approach for generating a sequen... \n", "9 given a similarity graph between items, correl... \n", "\n", " named_ents \n", "0 [(several hundred, 896, 911, CARDINAL)] \n", "1 [(2008, 109, 113, DATE), (2008, 500, 504, DATE... \n", "2 [(one, 0, 3, CARDINAL)] \n", "3 [] \n", "4 [] \n", "5 [] \n", "6 [] \n", "7 [] \n", "8 [] \n", "9 [(3-approximation, 257, 272, CARDINAL), (graph... " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "add_named_ents(df)\n", "display(df)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
given a similarity graph between items, correlation clustering (cc) groups similar items together and dissimilar ones apart. one of the most popular cc algorithms is kwikcluster: an algorithm that serially clusters neighborhoods of vertices, and obtains a [3-approximation | CARDINAL] ratio. unfortunately, in practice kwikcluster requires a large number of clustering rounds, a potential bottleneck for large [graphs.we | ORG] present [c4 | ORG] and clusterwild!, [two | CARDINAL] algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds, and provably achieve nearly linear speedups. c4 uses concurrency control to enforce serializability of a parallel clustering process, and guarantees a [3-approximation | CARDINAL] ratio. clusterwild! is a coordination free algorithm that abandons consistency for the benefit of better scaling; this leads to a provably small loss in the [3 | CARDINAL] approximation ratio.we provide extensive experimental results for both algorithms, where we outperform the state of the art, both in terms of clustering accuracy and running time. we show that our algorithms can cluster [billion | QUANTITY]-edge graphs in [under 5 seconds | TIME] on [32 | CARDINAL] cores, while achieving a [15x speedup | QUANTITY].
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "column = 'named_ents'\n", "render_entities(9, df, options=options, column=column) # take a look at one of the abstracts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick glance at some of the abstracts shows that while we are able to extract numeric entities, not much else comes through. Not great. But then again, this is exactly why simply extracting named entities is not enough. On the plus side, our intuition about built-in models and scientific text was spot on! The spaCy named entity recognizer just wasn't exposed to this kind of text and was instead trained on [blogs, news, and comments](https://spacy.io/models/en#en_core_web_lg). Academic papers don't use the most common English words, so it isn't unreasonable to expect a generally trained model to fail when confronted with text in such a restricted domain. (Lowercasing the abstracts in Step 1 probably didn't help either, since capitalization is a strong cue for named entities.) \n", "\n", "Look at a few more abstracts by changing the index parameter in our \"render_entities\" function to convince yourself of the following notion:\n", "\n", "We need to widen our search. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Extract all nouns" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def extract_nouns(text):\n", "    \"\"\"Extract nouns and proper nouns with their start and end character offsets, using spaCy's POS (part of speech) tagger.\n",
"    \n", "    Keyword arguments:\n", "    text -- the actual text source from which to extract entities\n", "    \n", "    \"\"\"\n", "    keep_pos = ['PROPN', 'NOUN']\n", "    return [(tok.text, tok.idx, tok.idx+len(tok.text), tok.pos_) for tok in nlp(text) if tok.pos_ in keep_pos]\n", "\n", "def add_nouns(df):\n", "    \"\"\"Create new column in data frame with nouns extracted.\n", "    \n", "    Keyword arguments:\n", "    df -- a dataframe object\n", "    \n", "    \"\"\"\n", "    df['nouns'] = df['text'].apply(extract_nouns)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " text \\\n", "0 crowdsourcing has gained immense popularity in... \n", "1 convex potential minimisation is the de facto ... \n", "2 one of the central questions in statistical le... \n", "3 we develop a sequential low-complexity inferen... \n", "4 monte carlo sampling for bayesian posterior in... \n", "5 we propose a robust portfolio optimization app... \n", "6 we study the problem of multiclass classificat... \n", "7 we study the problem of hierarchical clusterin... \n", "8 we propose an approach for generating a sequen... \n", "9 given a similarity graph between items, correl... \n", "\n", " named_ents \\\n", "0 [(several hundred, 896, 911, CARDINAL)] \n", "1 [(2008, 109, 113, DATE), (2008, 500, 504, DATE... \n", "2 [(one, 0, 3, CARDINAL)] \n", "3 [] \n", "4 [] \n", "5 [] \n", "6 [] \n", "7 [] \n", "8 [] \n", "9 [(3-approximation, 257, 272, CARDINAL), (graph... \n", "\n", " nouns \n", "0 [(crowdsourcing, 0, 13, NOUN), (popularity, 33... \n", "1 [(minimisation, 17, 29, NOUN), (approach, 46, ... \n", "2 [(questions, 19, 28, NOUN), (learning, 44, 52,... \n", "3 [(complexity, 28, 38, NOUN), (inference, 39, 4... \n", "4 [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... \n", "5 [(portfolio, 20, 29, NOUN), (optimization, 30,... \n", "6 [(problem, 13, 20, NOUN), (multiclass, 24, 34,... \n", "7 [(problem, 13, 20, NOUN), (clustering, 37, 47,... \n", "8 [(approach, 14, 22, NOUN), (sequence, 40, 48, ... \n", "9 [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "add_nouns(df)\n", "display(df)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " crowdsourcing\n", " NOUN\n", "\n", " has gained immense \n", "\n", " popularity\n", " NOUN\n", "\n", " in \n", "\n", " machine\n", " NOUN\n", "\n", " learning \n", "\n", " applications\n", " NOUN\n", "\n", " for obtaining large \n", "\n", " amounts\n", " NOUN\n", "\n", " of labeled \n", "\n", " data\n", " NOUN\n", "\n", ". \n", "\n", " crowdsourcing\n", " NOUN\n", "\n", " is cheap and fast, but suffers from the \n", "\n", " problem\n", " NOUN\n", "\n", " of low-\n", "\n", " quality\n", " NOUN\n", "\n", " \n", "\n", " data\n", " NOUN\n", "\n", ". to address this fundamental \n", "\n", " challenge\n", " NOUN\n", "\n", " in \n", "\n", " crowdsourcing\n", " NOUN\n", "\n", ", we propose a simple \n", "\n", " payment\n", " NOUN\n", "\n", " \n", "\n", " mechanism\n", " NOUN\n", "\n", " to incentivize \n", "\n", " workers\n", " NOUN\n", "\n", " to answer only the \n", "\n", " questions\n", " NOUN\n", "\n", " that they are sure of and skip the \n", "\n", " rest\n", " NOUN\n", "\n", ". we show that surprisingly, under a mild and natural \n", "\n", " no\n", " NOUN\n", "\n", "-free-\n", "\n", " lunch\n", " NOUN\n", "\n", " \n", "\n", " requirement\n", " NOUN\n", "\n", ", this \n", "\n", " mechanism\n", " NOUN\n", "\n", " is the one and only \n", "\n", " incentive\n", " NOUN\n", "\n", "-compatible \n", "\n", " payment\n", " NOUN\n", "\n", " \n", "\n", " mechanism\n", " NOUN\n", "\n", " possible. we also show that among all possible \n", "\n", " incentive\n", " NOUN\n", "\n", "-compatible \n", "\n", " mechanisms\n", " NOUN\n", "\n", " (that may or may not satisfy no-free-\n", "\n", " lunch\n", " NOUN\n", "\n", "), our \n", "\n", " mechanism\n", " NOUN\n", "\n", " makes the smallest possible \n", "\n", " payment\n", " NOUN\n", "\n", " to \n", "\n", " spammers\n", " NOUN\n", "\n", ". interestingly, this unique \n", "\n", " mechanism\n", " NOUN\n", "\n", " takes a multiplicative \n", "\n", " form\n", " NOUN\n", "\n", ". 
the \n", "\n", " simplicity\n", " NOUN\n", "\n", " of the \n", "\n", " mechanism\n", " NOUN\n", "\n", " is an added \n", "\n", " benefit\n", " NOUN\n", "\n", ". in preliminary \n", "\n", " experiments\n", " NOUN\n", "\n", " involving over several hundred \n", "\n", " workers\n", " NOUN\n", "\n", ", we observe a significant \n", "\n", " reduction\n", " NOUN\n", "\n", " in the \n", "\n", " error\n", " NOUN\n", "\n", " \n", "\n", " rates\n", " NOUN\n", "\n", " under our unique \n", "\n", " mechanism\n", " NOUN\n", "\n", " for the same or lower monetary \n", "\n", " expenditure\n", " NOUN\n", "\n", ".
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "column = 'nouns'\n", "render_entities(0, df, options=options, column=column)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is more colorful. But is it useful? It appears as if we are able to pull out a lot of concepts, but things like \"rest\", \"popularity\", and \"data\", aren't all that interesting (atleast in the first abstract). Our search is too wide at this point. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good to know. Let's power through for now, and merge our lists of entities. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Combine named entities and nouns" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def extract_named_nouns(row_series):\n", " \"\"\"Combine nouns and non-numerical entities. \n", " \n", " Keyword arguments:\n", " row_series -- a Pandas Series object\n", " \n", " \"\"\"\n", " ents = set()\n", " idxs = set()\n", " # remove duplicates and merge two lists together\n", " for noun_tuple in row_series['nouns']:\n", " for named_ents_tuple in row_series['named_ents']:\n", " if noun_tuple[1] == named_ents_tuple[1]: \n", " idxs.add(noun_tuple[1])\n", " ents.add(named_ents_tuple)\n", " if noun_tuple[1] not in idxs:\n", " ents.add(noun_tuple)\n", " \n", " return sorted(list(ents), key=lambda x: x[1])\n", "\n", "def add_named_nouns(df):\n", " \"\"\"Create new column in data frame with nouns and named ents.\n", " \n", " Keyword arguments:\n", " df -- a dataframe object\n", " \n", " \"\"\"\n", " df['named_nouns'] = df.apply(extract_named_nouns, axis=1)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " text \\\n", "0 crowdsourcing has gained immense popularity in... \n", "1 convex potential minimisation is the de facto ... \n", "2 one of the central questions in statistical le... \n", "3 we develop a sequential low-complexity inferen... \n", "4 monte carlo sampling for bayesian posterior in... \n", "5 we propose a robust portfolio optimization app... \n", "6 we study the problem of multiclass classificat... \n", "7 we study the problem of hierarchical clusterin... \n", "8 we propose an approach for generating a sequen... \n", "9 given a similarity graph between items, correl... \n", "\n", " named_ents \\\n", "0 [(several hundred, 896, 911, CARDINAL)] \n", "1 [(2008, 109, 113, DATE), (2008, 500, 504, DATE... \n", "2 [(one, 0, 3, CARDINAL)] \n", "3 [] \n", "4 [] \n", "5 [] \n", "6 [] \n", "7 [] \n", "8 [] \n", "9 [(3-approximation, 257, 272, CARDINAL), (graph... \n", "\n", " nouns \\\n", "0 [(crowdsourcing, 0, 13, NOUN), (popularity, 33... \n", "1 [(minimisation, 17, 29, NOUN), (approach, 46, ... \n", "2 [(questions, 19, 28, NOUN), (learning, 44, 52,... \n", "3 [(complexity, 28, 38, NOUN), (inference, 39, 4... \n", "4 [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... \n", "5 [(portfolio, 20, 29, NOUN), (optimization, 30,... \n", "6 [(problem, 13, 20, NOUN), (multiclass, 24, 34,... \n", "7 [(problem, 13, 20, NOUN), (clustering, 37, 47,... \n", "8 [(approach, 14, 22, NOUN), (sequence, 40, 48, ... \n", "9 [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... \n", "\n", " named_nouns \n", "0 [(crowdsourcing, 0, 13, NOUN), (popularity, 33... \n", "1 [(minimisation, 17, 29, NOUN), (approach, 46, ... \n", "2 [(questions, 19, 28, NOUN), (learning, 44, 52,... \n", "3 [(complexity, 28, 38, NOUN), (inference, 39, 4... \n", "4 [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... \n", "5 [(portfolio, 20, 29, NOUN), (optimization, 30,... \n", "6 [(problem, 13, 20, NOUN), (multiclass, 24, 34,... \n", "7 [(problem, 13, 20, NOUN), (clustering, 37, 47,... 
\n", "8 [(approach, 14, 22, NOUN), (sequence, 40, 48, ... \n", "9 [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "add_named_nouns(df)\n", "display(df)" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
convex potential \n", "\n", " minimisation\n", " NOUN\n", "\n", " is the de facto \n", "\n", " approach\n", " NOUN\n", "\n", " to binary \n", "\n", " classification\n", " NOUN\n", "\n", ". however, long and servedio [2008] proved that under symmetric \n", "\n", " label\n", " NOUN\n", "\n", " \n", "\n", " noise\n", " NOUN\n", "\n", " (\n", "\n", " sln\n", " PROPN\n", "\n", "), \n", "\n", " minimisation\n", " NOUN\n", "\n", " of any convex \n", "\n", " potential\n", " NOUN\n", "\n", " over a linear \n", "\n", " function\n", " NOUN\n", "\n", " \n", "\n", " class\n", " NOUN\n", "\n", " can result in \n", "\n", " classification\n", " NOUN\n", "\n", " \n", "\n", " performance\n", " NOUN\n", "\n", " equivalent to random \n", "\n", " guessing\n", " NOUN\n", "\n", ". this ostensibly shows that convex \n", "\n", " losses\n", " NOUN\n", "\n", " are not \n", "\n", " sln\n", " NOUN\n", "\n", "-robust. in this \n", "\n", " paper\n", " NOUN\n", "\n", ", we propose a \n", "\n", " convex\n", " NOUN\n", "\n", ", \n", "\n", " classification\n", " NOUN\n", "\n", "-calibrated \n", "\n", " loss\n", " NOUN\n", "\n", " and prove that it is sln-robust. the \n", "\n", " loss\n", " NOUN\n", "\n", " avoids the long and servedio [2008] \n", "\n", " result\n", " NOUN\n", "\n", " by \n", "\n", " virtue\n", " NOUN\n", "\n", " of being negatively unbounded. the \n", "\n", " loss\n", " NOUN\n", "\n", " is a \n", "\n", " modification\n", " NOUN\n", "\n", " of the \n", "\n", " hinge\n", " NOUN\n", "\n", " \n", "\n", " loss\n", " NOUN\n", "\n", ", where one does not clamp at zero; hence, we call it the unhinged \n", "\n", " loss\n", " NOUN\n", "\n", ". 
we show that the optimal unhinged \n", "\n", " solution\n", " NOUN\n", "\n", " is equivalent to that of a strongly regularised \n", "\n", " svm\n", " NOUN\n", "\n", ", and is the limiting \n", "\n", " solution\n", " NOUN\n", "\n", " for any convex \n", "\n", " potential\n", " NOUN\n", "\n", "; this implies that strong \n", "\n", " l2\n", " ORG\n", "\n", " \n", "\n", " regularisation\n", " NOUN\n", "\n", " makes most standard \n", "\n", " learners\n", " NOUN\n", "\n", " \n", "\n", " sln\n", " NOUN\n", "\n", "-robust. \n", "\n", " experiments\n", " NOUN\n", "\n", " confirm the unhinged loss’ \n", "\n", " sln\n", " NOUN\n", "\n", "-\n", "\n", " robustness\n", " NOUN\n", "\n", ".
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "column = 'named_nouns'\n", "render_entities(1, df, options=options, column=column)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this step, we're just combining the named entities extracted using spaCy's built-in model with nouns identified by the part-of-speech or POS tagger. We're dropping any numeric entities for now because they are harder to deal with and don't really represent new concepts. You'll notice (if you look closely enough), that we are also ignoring any hyphenated entities. In spaCy's tokenizer, it is possible to prevent hyphenated words form being split apart, but we'll reserve this, along with other types of advanced fine-tuning or low-level editing to if and when we move beyond the prototype phase. \n", "\n", "So far, in the past few steps, we've deal with one-word entities. However, it's also entirely permissible for combinations of two or more words to represent a single concept. This means that in order for our prototype to successfully capture the most relevant concepts, we'll need to pull n-length phrases from our academic abstracts in addition to single word entities. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5: Extract noun phrases" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A Chunky Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even mild exposure to computer science, or any of the various isoforms of engineering, will have introduced you to the idea of an abstraction, wherein low-level concepts are bundled into higher-order relationships. The noun phrase or chunk is an abstraction which consists of two or more words, and is the by-product of dependency parsing, POS tagging, and tokenization. 
spaCy's POS tagger is essentially a statistical model which learns to predict the tag (noun, verb, adjective, etc.) for a given word using examples of tagged sentences. \n", "\n", "This supervised machine learning approach relies on tokens generated from splitting text into somewhat atomic units using a rule-based tokenizer (although there are some interesting [unsupervised models](https://github.com/google/sentencepiece) out there as well). Dependency parsing then uncovers relationships between these tagged tokens, allowing us to finally extract noun chunks or phrases of relevance. \n", "\n", "The full pipeline goes something like this: \n", "\n", "raw text → tokenization → POS tagging → dependency parsing → noun chunk extraction\n", "\n", "Theoretically, one could swap out noun chunk extraction for named entity recognition, but that's the part of the pipeline we are attempting to modify for our own purposes, because we want n-length entities. Barring our custom intrusion, however, this is exactly how spaCy's built-in model works! If you don't believe me (which you shouldn't, since you're a scientist), scroll up to the very top of this notebook to convince yourself. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Neat, huh? Need a visualization of tokenization, POS tagging, and dependency parsing to convince you of just how cool this is? 
\n", "\n", "Take a look:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " Dr.\n", " PROPN\n", "\n", "\n", "\n", " Abraham\n", " PROPN\n", "\n", "\n", "\n", " is\n", " VERB\n", "\n", "\n", "\n", " the\n", " DET\n", "\n", "\n", "\n", " primary\n", " ADJ\n", "\n", "\n", "\n", " author\n", " NOUN\n", "\n", "\n", "\n", " of\n", " ADP\n", "\n", "\n", "\n", " this\n", " DET\n", "\n", "\n", "\n", " paper,\n", " NOUN\n", "\n", "\n", "\n", " and\n", " CCONJ\n", "\n", "\n", "\n", " a\n", " DET\n", "\n", "\n", "\n", " physician\n", " NOUN\n", "\n", "\n", "\n", " in\n", " ADP\n", "\n", "\n", "\n", " the\n", " DET\n", "\n", "\n", "\n", " specialty\n", " NOUN\n", "\n", "\n", "\n", " of\n", " ADP\n", "\n", "\n", "\n", " internal\n", " ADJ\n", "\n", "\n", "\n", " medicine.\n", " NOUN\n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nsubj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " amod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " attr\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " cc\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " conj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " amod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "text = \"Dr. 
Abraham is the primary author of this paper, and a physician in the specialty of internal medicine.\"\n", "\n", "spacy.displacy.render(nlp(text), jupyter=True) # generating raw-markup using spacy's built-in renderer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just gorgeous. Following our pipeline, let's use this dependency tree to tease out the noun phrases in our dummy sentence. We'll have to create a few functions to do the heavy lifting first (we can reuse these guys for our full dataset later), and then use a simple procedure to visualize our example." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "def extract_noun_phrases(text):\n", "    \"\"\"Combine noun phrases. \n", "    \n", "    Keyword arguments:\n", "    text -- the actual text source from which to extract entities\n", "    \n", "    \"\"\"\n", "    return [(chunk.text, chunk.start_char, chunk.end_char, chunk.label_) for chunk in nlp(text).noun_chunks]\n", "\n", "def add_noun_phrases(df):\n", "    \"\"\"Create new column in data frame with noun phrases.\n", "    \n", "    Keyword arguments:\n", "    df -- a dataframe object\n", "    \n", "    \"\"\"\n", "    df['noun_phrases'] = df['text'].apply(extract_noun_phrases)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "def visualize_noun_phrases(text):\n", "    \"\"\"Create a temporary dataframe to extract and visualize noun phrases. \n", "    \n", "    Keyword arguments:\n", "    text -- the actual text source from which to extract entities\n", "    \n", "    \"\"\"\n", "    df = pd.DataFrame([text])\n", "    df.columns = ['text']\n", "    add_noun_phrases(df)\n", "    column = 'noun_phrases'\n", "    # render the temporary dataframe built above (not an outer-scope variable)\n", "    render_entities(0, df, options=options, column=column)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " Dr. Abraham\n", " NP\n", "\n", " is \n", "\n", " the primary author\n", " NP\n", "\n", " of \n", "\n", " this paper\n", " NP\n", "\n", ", and \n", "\n", " a physician\n", " NP\n", "\n", " in \n", "\n", " the specialty\n", " NP\n", "\n", " of \n", "\n", " internal medicine\n", " NP\n", "\n", ".
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "visualize_noun_phrases(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare this to what we'd originally set out to accomplish:\n", "\n", "> **[Dr. Abraham]** is the **[primary author]** of this **[paper]**, and a **[physician]** in the **[specialty]** of **[internal medicine]**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I don't know about you, but everytime I see this work, I'm blown away by both the intricate complexity and beautiful simplicity of this process. Ignoring the prepositions, with one single move, we've done a damn-near perfect job of extracting the main ideas from this sentence. How amazing is that?! \n", "\n", "Hats off to spaCy, and the hordes of data scientists, machine learning engineers, and linguists that made this possible." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Back to School" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, if we just use this approach and add together the single-word entities we extracted from our academic abstracts earlier, we should be getting close to a pretty awesome set of concepts! Let's capture some noun phrases and see what we get. " ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " text \\\n", "0 crowdsourcing has gained immense popularity in... \n", "1 convex potential minimisation is the de facto ... \n", "2 one of the central questions in statistical le... \n", "3 we develop a sequential low-complexity inferen... \n", "4 monte carlo sampling for bayesian posterior in... \n", "5 we propose a robust portfolio optimization app... \n", "6 we study the problem of multiclass classificat... \n", "7 we study the problem of hierarchical clusterin... \n", "8 we propose an approach for generating a sequen... \n", "9 given a similarity graph between items, correl... \n", "\n", " named_ents \\\n", "0 [(several hundred, 896, 911, CARDINAL)] \n", "1 [(2008, 109, 113, DATE), (2008, 500, 504, DATE... \n", "2 [(one, 0, 3, CARDINAL)] \n", "3 [] \n", "4 [] \n", "5 [] \n", "6 [] \n", "7 [] \n", "8 [] \n", "9 [(3-approximation, 257, 272, CARDINAL), (graph... \n", "\n", " nouns \\\n", "0 [(crowdsourcing, 0, 13, NOUN), (popularity, 33... \n", "1 [(minimisation, 17, 29, NOUN), (approach, 46, ... \n", "2 [(questions, 19, 28, NOUN), (learning, 44, 52,... \n", "3 [(complexity, 28, 38, NOUN), (inference, 39, 4... \n", "4 [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... \n", "5 [(portfolio, 20, 29, NOUN), (optimization, 30,... \n", "6 [(problem, 13, 20, NOUN), (multiclass, 24, 34,... \n", "7 [(problem, 13, 20, NOUN), (clustering, 37, 47,... \n", "8 [(approach, 14, 22, NOUN), (sequence, 40, 48, ... \n", "9 [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... \n", "\n", " named_nouns \\\n", "0 [(crowdsourcing, 0, 13, NOUN), (popularity, 33... \n", "1 [(minimisation, 17, 29, NOUN), (approach, 46, ... \n", "2 [(questions, 19, 28, NOUN), (learning, 44, 52,... \n", "3 [(complexity, 28, 38, NOUN), (inference, 39, 4... \n", "4 [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... \n", "5 [(portfolio, 20, 29, NOUN), (optimization, 30,... \n", "6 [(problem, 13, 20, NOUN), (multiclass, 24, 34,... \n", "7 [(problem, 13, 20, NOUN), (clustering, 37, 47,... 
\n", "8 [(approach, 14, 22, NOUN), (sequence, 40, 48, ... \n", "9 [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... \n", "\n", " noun_phrases \n", "0 [(crowdsourcing, 0, 13, NP), (immense populari... \n", "1 [(convex potential minimisation, 0, 29, NP), (... \n", "2 [(the central questions, 7, 28, NP), (statisti... \n", "3 [(we, 0, 2, NP), (a sequential low-complexity ... \n", "4 [(bayesian posterior inference, 25, 53, NP), (... \n", "5 [(we, 0, 2, NP), (a robust portfolio optimizat... \n", "6 [(we, 0, 2, NP), (the problem, 9, 20, NP), (mu... \n", "7 [(we, 0, 2, NP), (the problem, 9, 20, NP), (hi... \n", "8 [(we, 0, 2, NP), (an approach, 11, 22, NP), (a... \n", "9 [(a similarity graph, 6, 24, NP), (items, 33, ... " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "add_noun_phrases(df)\n", "display(df)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " crowdsourcing\n", " NP\n", "\n", " has gained \n", "\n", " immense popularity\n", " NP\n", "\n", " in \n", "\n", " machine learning applications\n", " NP\n", "\n", " for obtaining \n", "\n", " large amounts\n", " NP\n", "\n", " of \n", "\n", " labeled data\n", " NP\n", "\n", ". \n", "\n", " crowdsourcing\n", " NP\n", "\n", " is cheap and fast, but suffers from \n", "\n", " the problem\n", " NP\n", "\n", " of \n", "\n", " low-quality data\n", " NP\n", "\n", ". to address \n", "\n", " this fundamental challenge\n", " NP\n", "\n", " in \n", "\n", " crowdsourcing\n", " NP\n", "\n", ", \n", "\n", " we\n", " NP\n", "\n", " propose \n", "\n", " a simple payment mechanism\n", " NP\n", "\n", " to incentivize \n", "\n", " workers\n", " NP\n", "\n", " to answer \n", "\n", " only the questions\n", " NP\n", "\n", " that \n", "\n", " they\n", " NP\n", "\n", " are sure of and skip \n", "\n", " the rest\n", " NP\n", "\n", ". \n", "\n", " we\n", " NP\n", "\n", " show that surprisingly, under \n", "\n", " a mild and natural no-free-lunch requirement\n", " NP\n", "\n", ", \n", "\n", " this mechanism\n", " NP\n", "\n", " is the one and \n", "\n", " only incentive-compatible payment mechanism\n", " NP\n", "\n", " possible. \n", "\n", " we\n", " NP\n", "\n", " also show that among \n", "\n", " all possible incentive-compatible mechanisms\n", " NP\n", "\n", " (that may or may not satisfy \n", "\n", " no-free-lunch\n", " NP\n", "\n", "), \n", "\n", " our mechanism\n", " NP\n", "\n", " makes \n", "\n", " the smallest possible payment\n", " NP\n", "\n", " to \n", "\n", " spammers\n", " NP\n", "\n", ". interestingly, \n", "\n", " this unique mechanism\n", " NP\n", "\n", " takes \n", "\n", " a multiplicative form\n", " NP\n", "\n", ". \n", "\n", " the simplicity\n", " NP\n", "\n", " of \n", "\n", " the mechanism\n", " NP\n", "\n", " is \n", "\n", " an added benefit\n", " NP\n", "\n", ". 
in \n", "\n", " preliminary experiments\n", " NP\n", "\n", " involving over \n", "\n", " several hundred workers\n", " NP\n", "\n", ", \n", "\n", " we\n", " NP\n", "\n", " observe \n", "\n", " a significant reduction\n", " NP\n", "\n", " in \n", "\n", " the error rates\n", " NP\n", "\n", " under \n", "\n", " our unique mechanism\n", " NP\n", "\n", " for \n", "\n", " the same or lower monetary expenditure\n", " NP\n", "\n", ".
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "column = 'noun_phrases'\n", "render_entities(0, df, options=options, column=column)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmm... should've seen this coming. While we've now done a great job of extracting noun phrases from our abstracts, we're running into the same problem as before. Our funnel is too wide, and we're pulling uninteresting bigrams like \"the simplicity\", \"the rest\", and \"this mechanism\". These chunks are indeed noun phrases, but not domain-specific concepts. Not to mention, we still have to deal with those pesky prepositions (try saying that five times fast). \n", "\n", "Let's see if we can narrow our search and just get the most important phrases. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 6: Extract compound noun phrases" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "def extract_compounds(text):\n", " \"\"\"Extract compound noun phrases with beginning and end idxs. 
\n", " \n", " Keyword arguments:\n", " text -- the actual text source from which to extract entities\n", " \n", " \"\"\"\n", " comp_idx = 0\n", " compound = []\n", " compound_nps = []\n", " tok_idx = 0\n", " for idx, tok in enumerate(nlp(text)):\n", " if tok.dep_ == 'compound':\n", "\n", " # capture hyphenated compounds\n", " children = ''.join([c.text for c in tok.children])\n", " if '-' in children:\n", " compound.append(''.join([children, tok.text]))\n", " else:\n", " compound.append(tok.text)\n", "\n", " # remember starting index of first child in compound or word\n", " try:\n", " tok_idx = [c for c in tok.children][0].idx\n", " except IndexError:\n", " if len(compound) == 1:\n", " tok_idx = tok.idx\n", " comp_idx = tok.i\n", "\n", " # append the last word in a compound phrase\n", " if tok.i - comp_idx == 1:\n", " compound.append(tok.text)\n", " if len(compound) > 1: \n", " compound = ' '.join(compound)\n", " compound_nps.append((compound, tok_idx, tok_idx+len(compound), 'COMPOUND'))\n", "\n", " # reset parameters\n", " tok_idx = 0 \n", " compound = []\n", "\n", " return compound_nps\n", "\n", "def add_compounds(df):\n", " \"\"\"Create new column in data frame with compound noun phrases.\n", " \n", " Keyword arguments:\n", " df -- a dataframe object\n", " \n", " \"\"\"\n", " df['compounds'] = df['text'].apply(extract_compounds)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " text \\\n", "0 crowdsourcing has gained immense popularity in... \n", "1 convex potential minimisation is the de facto ... \n", "2 one of the central questions in statistical le... \n", "3 we develop a sequential low-complexity inferen... \n", "4 monte carlo sampling for bayesian posterior in... \n", "5 we propose a robust portfolio optimization app... \n", "6 we study the problem of multiclass classificat... \n", "7 we study the problem of hierarchical clusterin... \n", "8 we propose an approach for generating a sequen... \n", "9 given a similarity graph between items, correl... \n", "\n", " named_ents \\\n", "0 [(several hundred, 896, 911, CARDINAL)] \n", "1 [(2008, 109, 113, DATE), (2008, 500, 504, DATE... \n", "2 [(one, 0, 3, CARDINAL)] \n", "3 [] \n", "4 [] \n", "5 [] \n", "6 [] \n", "7 [] \n", "8 [] \n", "9 [(3-approximation, 257, 272, CARDINAL), (graph... \n", "\n", " nouns \\\n", "0 [(crowdsourcing, 0, 13, NOUN), (popularity, 33... \n", "1 [(minimisation, 17, 29, NOUN), (approach, 46, ... \n", "2 [(questions, 19, 28, NOUN), (learning, 44, 52,... \n", "3 [(complexity, 28, 38, NOUN), (inference, 39, 4... \n", "4 [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... \n", "5 [(portfolio, 20, 29, NOUN), (optimization, 30,... \n", "6 [(problem, 13, 20, NOUN), (multiclass, 24, 34,... \n", "7 [(problem, 13, 20, NOUN), (clustering, 37, 47,... \n", "8 [(approach, 14, 22, NOUN), (sequence, 40, 48, ... \n", "9 [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... \n", "\n", " named_nouns \\\n", "0 [(crowdsourcing, 0, 13, NOUN), (popularity, 33... \n", "1 [(minimisation, 17, 29, NOUN), (approach, 46, ... \n", "2 [(questions, 19, 28, NOUN), (learning, 44, 52,... \n", "3 [(complexity, 28, 38, NOUN), (inference, 39, 4... \n", "4 [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... \n", "5 [(portfolio, 20, 29, NOUN), (optimization, 30,... \n", "6 [(problem, 13, 20, NOUN), (multiclass, 24, 34,... \n", "7 [(problem, 13, 20, NOUN), (clustering, 37, 47,... 
\n", "8 [(approach, 14, 22, NOUN), (sequence, 40, 48, ... \n", "9 [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... \n", "\n", " noun_phrases \\\n", "0 [(crowdsourcing, 0, 13, NP), (immense populari... \n", "1 [(convex potential minimisation, 0, 29, NP), (... \n", "2 [(the central questions, 7, 28, NP), (statisti... \n", "3 [(we, 0, 2, NP), (a sequential low-complexity ... \n", "4 [(bayesian posterior inference, 25, 53, NP), (... \n", "5 [(we, 0, 2, NP), (a robust portfolio optimizat... \n", "6 [(we, 0, 2, NP), (the problem, 9, 20, NP), (mu... \n", "7 [(we, 0, 2, NP), (the problem, 9, 20, NP), (hi... \n", "8 [(we, 0, 2, NP), (an approach, 11, 22, NP), (a... \n", "9 [(a similarity graph, 6, 24, NP), (items, 33, ... \n", "\n", " compounds \n", "0 [(machine learning applications, 47, 76, COMPO... \n", "1 [(label noise, 143, 154, COMPOUND), (function ... \n", "2 [(learning theory, 44, 59, COMPOUND), (inferen... \n", "3 [(low-complexity inference procedure, 28, 62, ... \n", "4 [(monte carlo sampling, 0, 20, COMPOUND), (mac... \n", "5 [(portfolio optimization approach, 20, 51, COM... \n", "6 [(test time, 134, 143, COMPOUND), (tree constr... \n", "7 [(lp relaxation, 182, 195, COMPOUND), (cost pe... \n", "8 [(image stream, 77, 89, COMPOUND), (image stre... \n", "9 [(similarity graph, 8, 24, COMPOUND), (correla... " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "add_compounds(df)\n", "display(df)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
crowdsourcing has gained immense popularity in \n", "\n", " machine learning applications\n", " COMPOUND\n", "\n", " for obtaining large amounts of labeled data. crowdsourcing is cheap and fast, but suffers from the problem of \n", "\n", " low-quality data\n", " COMPOUND\n", "\n", ". to address this fundamental challenge in crowdsourcing, we propose a simple \n", "\n", " payment mechanism\n", " COMPOUND\n", "\n", " to incentivize workers to answer only the questions that they are sure of and skip the rest. we show that surprisingly, under a mild and natural \n", "\n", " no-free-lunch requirement\n", " COMPOUND\n", "\n", ", this mechanism is the one and only incentive-compatible \n", "\n", " payment mechanism\n", " COMPOUND\n", "\n", " possible. we also show that among all possible incentive-compatible mechanisms (that may or may not satisfy no-free-lunch), our mechanism makes the smallest possible payment to spammers. interestingly, this unique mechanism takes a multiplicative form. the simplicity of the mechanism is an added benefit. in preliminary experiments involving over several hundred workers, we observe a significant reduction in the \n", "\n", " error rates\n", " COMPOUND\n", "\n", " under our unique mechanism for the same or lower monetary expenditure.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "column = 'compounds'\n", "render_entities(0, df, options=options, column=column)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's starting to look pretty good! By targetting words in the dependency tree that were tagged as belonging to a compound, we were able to drive the number of noun phrases down rather nicely. Next, we'll add these phrases to the list of entities we extracted from each abstract, to create a set which will include unigrams, bigrams, and more. Oh my!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 7: Combine entities and compound noun phrases" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [], "source": [ "def extract_comp_nouns(row_series, cols=[]):\n", " \"\"\"Combine compound noun phrases and entities. \n", " \n", " Keyword arguments:\n", " row_series -- a Pandas Series object\n", " \n", " \"\"\"\n", " return {noun_tuple[0] for col in cols for noun_tuple in row_series[col]}\n", "\n", "def add_comp_nouns(df, cols=[]):\n", " \"\"\"Create new column in data frame with merged entities.\n", " \n", " Keyword arguments:\n", " df -- a dataframe object\n", " cols -- a list of column names that need to be merged\n", " \n", " \"\"\"\n", " df['comp_nouns'] = df.apply(extract_comp_nouns, axis=1, cols=cols)" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " text \\\n", "0 crowdsourcing has gained immense popularity in... \n", "1 convex potential minimisation is the de facto ... \n", "2 one of the central questions in statistical le... \n", "3 we develop a sequential low-complexity inferen... \n", "4 monte carlo sampling for bayesian posterior in... \n", "5 we propose a robust portfolio optimization app... \n", "6 we study the problem of multiclass classificat... \n", "7 we study the problem of hierarchical clusterin... \n", "8 we propose an approach for generating a sequen... \n", "9 given a similarity graph between items, correl... \n", "\n", " named_ents \\\n", "0 [(several hundred, 896, 911, CARDINAL)] \n", "1 [(2008, 109, 113, DATE), (2008, 500, 504, DATE... \n", "2 [(one, 0, 3, CARDINAL)] \n", "3 [] \n", "4 [] \n", "5 [] \n", "6 [] \n", "7 [] \n", "8 [] \n", "9 [(3-approximation, 257, 272, CARDINAL), (graph... \n", "\n", " nouns \\\n", "0 [(crowdsourcing, 0, 13, NOUN), (popularity, 33... \n", "1 [(minimisation, 17, 29, NOUN), (approach, 46, ... \n", "2 [(questions, 19, 28, NOUN), (learning, 44, 52,... \n", "3 [(complexity, 28, 38, NOUN), (inference, 39, 4... \n", "4 [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... \n", "5 [(portfolio, 20, 29, NOUN), (optimization, 30,... \n", "6 [(problem, 13, 20, NOUN), (multiclass, 24, 34,... \n", "7 [(problem, 13, 20, NOUN), (clustering, 37, 47,... \n", "8 [(approach, 14, 22, NOUN), (sequence, 40, 48, ... \n", "9 [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... \n", "\n", " named_nouns \\\n", "0 [(crowdsourcing, 0, 13, NOUN), (popularity, 33... \n", "1 [(minimisation, 17, 29, NOUN), (approach, 46, ... \n", "2 [(questions, 19, 28, NOUN), (learning, 44, 52,... \n", "3 [(complexity, 28, 38, NOUN), (inference, 39, 4... \n", "4 [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... \n", "5 [(portfolio, 20, 29, NOUN), (optimization, 30,... \n", "6 [(problem, 13, 20, NOUN), (multiclass, 24, 34,... \n", "7 [(problem, 13, 20, NOUN), (clustering, 37, 47,... 
\n", "8 [(approach, 14, 22, NOUN), (sequence, 40, 48, ... \n", "9 [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... \n", "\n", " noun_phrases \\\n", "0 [(crowdsourcing, 0, 13, NP), (immense populari... \n", "1 [(convex potential minimisation, 0, 29, NP), (... \n", "2 [(the central questions, 7, 28, NP), (statisti... \n", "3 [(we, 0, 2, NP), (a sequential low-complexity ... \n", "4 [(bayesian posterior inference, 25, 53, NP), (... \n", "5 [(we, 0, 2, NP), (a robust portfolio optimizat... \n", "6 [(we, 0, 2, NP), (the problem, 9, 20, NP), (mu... \n", "7 [(we, 0, 2, NP), (the problem, 9, 20, NP), (hi... \n", "8 [(we, 0, 2, NP), (an approach, 11, 22, NP), (a... \n", "9 [(a similarity graph, 6, 24, NP), (items, 33, ... \n", "\n", " compounds \\\n", "0 [(machine learning applications, 47, 76, COMPO... \n", "1 [(label noise, 143, 154, COMPOUND), (function ... \n", "2 [(learning theory, 44, 59, COMPOUND), (inferen... \n", "3 [(low-complexity inference procedure, 28, 62, ... \n", "4 [(monte carlo sampling, 0, 20, COMPOUND), (mac... \n", "5 [(portfolio optimization approach, 20, 51, COM... \n", "6 [(test time, 134, 143, COMPOUND), (tree constr... \n", "7 [(lp relaxation, 182, 195, COMPOUND), (cost pe... \n", "8 [(image stream, 77, 89, COMPOUND), (image stre... \n", "9 [(similarity graph, 8, 24, COMPOUND), (correla... \n", "\n", " comp_nouns \\\n", "0 {mechanisms, low-quality data, problem, requir... \n", "1 {function class, solution, result, performance... \n", "2 {result, conditions, dimensionality reduction ... \n", "3 {large-sample limit, concentration, asymptotic... \n", "4 {machine learning, discrete-time analogues, gr... \n", "5 {work, optimization, portfolio, dependence, th... \n", "6 {entropy, problem, classes, conditions, number... \n", "7 {problem, image, distances, partitions, cost, ... \n", "8 {text-image parallel, language descriptions, m... \n", "9 {practice kwikcluster, similarity, ratio, prac... 
\n", "\n", " clean_ents \n", "0 {mechanisms, low-quality data, problem, worker... \n", "1 {function class, solution, result, svm, paper,... \n", "2 {dimensionality reduction methods, conditions,... \n", "3 {classes, form parametric, number, function, e... \n", "4 {machine learning, discrete-time analogues, gr... \n", "5 {dependence, theory, events, method, dimension... \n", "6 {problem, classes, conditions, number, functio... \n", "7 {space, matching, terms, algorithm, cost perfe... \n", "8 {text-image parallel, language descriptions, m... \n", "9 {practice kwikcluster, ratio, serializability,... " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "cols = ['nouns', 'compounds']\n", "add_comp_nouns(df, cols=cols)\n", "display(df)" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " crowdsourcing\n", " NOUN\n", "\n", " has gained immense \n", "\n", " popularity\n", " NOUN\n", "\n", " in \n", "\n", " machine\n", " NOUN\n", "\n", " learning \n", "\n", " applications\n", " NOUN\n", "\n", " for obtaining large \n", "\n", " amounts\n", " NOUN\n", "\n", " of labeled \n", "\n", " data\n", " NOUN\n", "\n", ". \n", "\n", " crowdsourcing\n", " NOUN\n", "\n", " is cheap and fast, but suffers from the \n", "\n", " problem\n", " NOUN\n", "\n", " of low-\n", "\n", " quality\n", " NOUN\n", "\n", " \n", "\n", " data\n", " NOUN\n", "\n", ". to address this fundamental \n", "\n", " challenge\n", " NOUN\n", "\n", " in \n", "\n", " crowdsourcing\n", " NOUN\n", "\n", ", we propose a simple \n", "\n", " payment\n", " NOUN\n", "\n", " \n", "\n", " mechanism\n", " NOUN\n", "\n", " to incentivize \n", "\n", " workers\n", " NOUN\n", "\n", " to answer only the \n", "\n", " questions\n", " NOUN\n", "\n", " that they are sure of and skip the \n", "\n", " rest\n", " NOUN\n", "\n", ". we show that surprisingly, under a mild and natural \n", "\n", " no\n", " NOUN\n", "\n", "-free-\n", "\n", " lunch\n", " NOUN\n", "\n", " \n", "\n", " requirement\n", " NOUN\n", "\n", ", this \n", "\n", " mechanism\n", " NOUN\n", "\n", " is the one and only \n", "\n", " incentive\n", " NOUN\n", "\n", "-compatible \n", "\n", " payment\n", " NOUN\n", "\n", " \n", "\n", " mechanism\n", " NOUN\n", "\n", " possible. we also show that among all possible \n", "\n", " incentive\n", " NOUN\n", "\n", "-compatible \n", "\n", " mechanisms\n", " NOUN\n", "\n", " (that may or may not satisfy no-free-\n", "\n", " lunch\n", " NOUN\n", "\n", "), our \n", "\n", " mechanism\n", " NOUN\n", "\n", " makes the smallest possible \n", "\n", " payment\n", " NOUN\n", "\n", " to \n", "\n", " spammers\n", " NOUN\n", "\n", ". interestingly, this unique \n", "\n", " mechanism\n", " NOUN\n", "\n", " takes a multiplicative \n", "\n", " form\n", " NOUN\n", "\n", ". 
the \n", "\n", " simplicity\n", " NOUN\n", "\n", " of the \n", "\n", " mechanism\n", " NOUN\n", "\n", " is an added \n", "\n", " benefit\n", " NOUN\n", "\n", ". in preliminary \n", "\n", " experiments\n", " NOUN\n", "\n", " involving over several hundred \n", "\n", " workers\n", " NOUN\n", "\n", ", we observe a significant \n", "\n", " reduction\n", " NOUN\n", "\n", " in the \n", "\n", " error\n", " NOUN\n", "\n", " \n", "\n", " rates\n", " NOUN\n", "\n", " under our unique \n", "\n", " mechanism\n", " NOUN\n", "\n", " for the same or lower monetary \n", "\n", " expenditure\n", " NOUN\n", "\n", ".
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# take a look at all the nouns again\n", "column = 'named_nouns'\n", "render_entities(0, df, options=options, column=column)" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
crowdsourcing has gained immense popularity in \n", "\n", " machine learning applications\n", " COMPOUND\n", "\n", " for obtaining large amounts of labeled data. crowdsourcing is cheap and fast, but suffers from the problem of \n", "\n", " low-quality data\n", " COMPOUND\n", "\n", ". to address this fundamental challenge in crowdsourcing, we propose a simple \n", "\n", " payment mechanism\n", " COMPOUND\n", "\n", " to incentivize workers to answer only the questions that they are sure of and skip the rest. we show that surprisingly, under a mild and natural \n", "\n", " no-free-lunch requirement\n", " COMPOUND\n", "\n", ", this mechanism is the one and only incentive-compatible \n", "\n", " payment mechanism\n", " COMPOUND\n", "\n", " possible. we also show that among all possible incentive-compatible mechanisms (that may or may not satisfy no-free-lunch), our mechanism makes the smallest possible payment to spammers. interestingly, this unique mechanism takes a multiplicative form. the simplicity of the mechanism is an added benefit. in preliminary experiments involving over several hundred workers, we observe a significant reduction in the \n", "\n", " error rates\n", " COMPOUND\n", "\n", " under our unique mechanism for the same or lower monetary expenditure.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# take a look at all the compound noun phrases again\n", "column = 'compounds'\n", "render_entities(0, df, options=options, column=column)" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'amounts',\n", " 'applications',\n", " 'benefit',\n", " 'challenge',\n", " 'crowdsourcing',\n", " 'data',\n", " 'error',\n", " 'error rates',\n", " 'expenditure',\n", " 'experiments',\n", " 'form',\n", " 'incentive',\n", " 'low-quality data',\n", " 'lunch',\n", " 'machine',\n", " 'machine learning applications',\n", " 'mechanism',\n", " 'mechanisms',\n", " 'no',\n", " 'no-free-lunch requirement',\n", " 'payment',\n", " 'payment mechanism',\n", " 'popularity',\n", " 'problem',\n", " 'quality',\n", " 'questions',\n", " 'rates',\n", " 'reduction',\n", " 'requirement',\n", " 'rest',\n", " 'simplicity',\n", " 'spammers',\n", " 'workers'}" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# take a look at combined entities\n", "df['comp_nouns'][0] " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have all the entities grouped together, we can see how good we are doing. We've successfully captured single-word as well as n-grams, but there appear to be a lot of duplicates. Words that should've been included in a phrase were somehow split apart, most likely as a result of not properly dealing with hyphenation when we first tokenized our abstracts. \n", "\n", "Not to worry, this should be relatively easy to take care. We'll also apply a few other heuristics to clean up our list and remove the most common English words to further pare down the list of entities. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 8: Reduce entity count with heuristics" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "def drop_duplicate_np_splits(ents):\n", " \"\"\"Drop any entities that are already captured by noun phrases. \n", " \n", " Keyword arguments:\n", " ents -- a set of entities\n", " \n", " \"\"\"\n", " drop_ents = set()\n", " for ent in ents:\n", " if len(ent.split(' ')) > 1:\n", " for e in ent.split(' '):\n", " if e in ents:\n", " drop_ents.add(e)\n", " return ents - drop_ents\n", "\n", "def drop_single_char_nps(ents):\n", " \"\"\"Within an entity, drop single characters. \n", " \n", " Keyword arguments:\n", " ents -- a set of entities\n", " \n", " \"\"\"\n", " return {' '.join([e for e in ent.split(' ') if not len(e) == 1]) for ent in ents}\n", "\n", "def drop_double_char(ents):\n", " \"\"\"Drop any entities that are less than three characters. \n", " \n", " Keyword arguments:\n", " ents -- a set of entities\n", " \n", " \"\"\"\n", " drop_ents = {ent for ent in ents if len(ent) < 3}\n", " return ents - drop_ents\n", "\n", "def keep_alpha(ents):\n", " \"\"\"Keep only entities with alphabetical unicode characters, hyphens, and spaces. \n", " \n", " Keyword arguments:\n", " ents -- a set of entities\n", " \n", " \"\"\"\n", " keep_char = set('-abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ')\n", " drop_ents = {ent for ent in ents if not set(ent).issubset(keep_char)}\n", " return ents - drop_ents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These last four functions will slice and dice the list of entities gathered from each abstract in various ways. In addition to this granular processing, we'll also want to remove words that are frequent in the English language, as a heuristic to naturally drop stop words and uncover the domain of each academic source. 
\n", "\n", "Why is this?\n", "\n", "Well, in NLP, as in search engine optimization (SEO), the most common words in a given corpus are known as [stop words](https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html). These unfortunate candidates are hunted down with extreme prejudice and removed from the population to improve search results, enhance semantic analysis, and in our case, help restrict the domain. This is because removing stop words automatically limits the vocabulary of a corpus to the words that are less frequent and, therefore, more likely to exist in that abstract than anywhere else. \n", "\n", "You can, of course, argue that the most common words in a scientific paper might in fact be the most important concepts, but stop words are usually overwhelmingly overrepresented in any corpus. This intuition, however, is exactly why we aren't going to simply take the most common words in one specific abstract and remove them. Instead, we'll be targeting the most frequent words based on a large, general-domain sample of the English language. \n", "\n", "The \"freq_words.csv\" file you might have noticed earlier in our file path is actually a list generated from a corpus with 10 billion words gathered by the good people at [Word Frequency Data](https://www.wordfrequency.info/).\n", "\n", "Let's take a look at the list and then remove these words from our set of entities. " ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "freq_words.csv nips.csv\r\n" ] } ], "source": [ "!ls {PATH}" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Rank Word Part of speech Frequency Dispersion Unnamed: 5 \\\n", "0 NaN NaN NaN NaN NaN NaN \n", "1 1.0 the a 22038615.0 0.98 NaN \n", "2 2.0 be v 12545825.0 0.97 NaN \n", "3 3.0 and c 10741073.0 0.99 NaN \n", "4 4.0 of i 10343885.0 0.97 NaN \n", "... ... ... ... ... ... ... \n", "4996 4996.0 plaintiff n 5312.0 0.88 NaN \n", "4997 4997.0 kid v 5094.0 0.92 NaN \n", "4998 4998.0 middle-class j 5025.0 0.93 NaN \n", "4999 4999.0 apology n 4972.0 0.94 NaN \n", "5000 5000.0 till i 5079.0 0.92 NaN \n", "\n", " Unnamed: 6 \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "... ... \n", "4996 NaN \n", "4997 NaN \n", "4998 NaN \n", "4999 NaN \n", "5000 NaN \n", "\n", "[5001 rows x 7 columns]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "filename = 'freq_words.csv'\n", "freq_words_df = pd.read_csv(f'{PATH}{filename}')\n", "display(freq_words_df)" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 the\n", "2 be\n", "3 and\n", "4 of\n", "5 a\n", " ... \n", "4996 plaintiff\n", "4997 kid\n", "4998 middle-class\n", "4999 apology\n", "5000 till\n", "Name: Word, Length: 5000, dtype: object" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "freq_words = freq_words_df['Word'].iloc[1:]\n", "display(freq_words)" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [], "source": [ "def remove_freq_words(ents):\n", " \"\"\"Drop any entities in the 5000 most common words in the English langauge. 
\n", "    \n", "    Keyword arguments:\n", "    ents -- a set of entities\n", "    \n", "    \"\"\"\n", "    filename = 'freq_words.csv'\n", "    PATH = './data/'\n", "    freq_words = pd.read_csv(f'{PATH}{filename}')['Word'].iloc[1:]\n", "    # return a new set so the caller's set is not modified in place;\n", "    # frequent words that are not among the entities are simply ignored\n", "    return ents - set(freq_words)\n", "\n", "def add_clean_ents(df, funcs=[]):\n", "    \"\"\"Create new column in data frame with cleaned entities.\n", "    \n", "    Keyword arguments:\n", "    df -- a dataframe object\n", "    funcs -- a list of heuristic functions to be applied to entities\n", "    \n", "    \"\"\"\n", "    col = 'clean_ents'\n", "    df[col] = df['comp_nouns']\n", "    for f in funcs:\n", "        df[col] = df[col].apply(f)" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " text \\\n", "0 crowdsourcing has gained immense popularity in... \n", "1 convex potential minimisation is the de facto ... \n", "2 one of the central questions in statistical le... \n", "3 we develop a sequential low-complexity inferen... \n", "4 monte carlo sampling for bayesian posterior in... \n", "5 we propose a robust portfolio optimization app... \n", "6 we study the problem of multiclass classificat... \n", "7 we study the problem of hierarchical clusterin... \n", "8 we propose an approach for generating a sequen... \n", "9 given a similarity graph between items, correl... \n", "\n", " named_ents \\\n", "0 [(several hundred, 896, 911, CARDINAL)] \n", "1 [(2008, 109, 113, DATE), (2008, 500, 504, DATE... \n", "2 [(one, 0, 3, CARDINAL)] \n", "3 [] \n", "4 [] \n", "5 [] \n", "6 [] \n", "7 [] \n", "8 [] \n", "9 [(3-approximation, 257, 272, CARDINAL), (graph... \n", "\n", " nouns \\\n", "0 [(crowdsourcing, 0, 13, NOUN), (popularity, 33... \n", "1 [(minimisation, 17, 29, NOUN), (approach, 46, ... \n", "2 [(questions, 19, 28, NOUN), (learning, 44, 52,... \n", "3 [(complexity, 28, 38, NOUN), (inference, 39, 4... \n", "4 [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... \n", "5 [(portfolio, 20, 29, NOUN), (optimization, 30,... \n", "6 [(problem, 13, 20, NOUN), (multiclass, 24, 34,... \n", "7 [(problem, 13, 20, NOUN), (clustering, 37, 47,... \n", "8 [(approach, 14, 22, NOUN), (sequence, 40, 48, ... \n", "9 [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... \n", "\n", " named_nouns \\\n", "0 [(crowdsourcing, 0, 13, NOUN), (popularity, 33... \n", "1 [(minimisation, 17, 29, NOUN), (approach, 46, ... \n", "2 [(questions, 19, 28, NOUN), (learning, 44, 52,... \n", "3 [(complexity, 28, 38, NOUN), (inference, 39, 4... \n", "4 [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... \n", "5 [(portfolio, 20, 29, NOUN), (optimization, 30,... \n", "6 [(problem, 13, 20, NOUN), (multiclass, 24, 34,... \n", "7 [(problem, 13, 20, NOUN), (clustering, 37, 47,... 
\n", "8 [(approach, 14, 22, NOUN), (sequence, 40, 48, ... \n", "9 [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... \n", "\n", " noun_phrases \\\n", "0 [(crowdsourcing, 0, 13, NP), (immense populari... \n", "1 [(convex potential minimisation, 0, 29, NP), (... \n", "2 [(the central questions, 7, 28, NP), (statisti... \n", "3 [(we, 0, 2, NP), (a sequential low-complexity ... \n", "4 [(bayesian posterior inference, 25, 53, NP), (... \n", "5 [(we, 0, 2, NP), (a robust portfolio optimizat... \n", "6 [(we, 0, 2, NP), (the problem, 9, 20, NP), (mu... \n", "7 [(we, 0, 2, NP), (the problem, 9, 20, NP), (hi... \n", "8 [(we, 0, 2, NP), (an approach, 11, 22, NP), (a... \n", "9 [(a similarity graph, 6, 24, NP), (items, 33, ... \n", "\n", " compounds \\\n", "0 [(machine learning applications, 47, 76, COMPO... \n", "1 [(label noise, 143, 154, COMPOUND), (function ... \n", "2 [(learning theory, 44, 59, COMPOUND), (inferen... \n", "3 [(low-complexity inference procedure, 28, 62, ... \n", "4 [(monte carlo sampling, 0, 20, COMPOUND), (mac... \n", "5 [(portfolio optimization approach, 20, 51, COM... \n", "6 [(test time, 134, 143, COMPOUND), (tree constr... \n", "7 [(lp relaxation, 182, 195, COMPOUND), (cost pe... \n", "8 [(image stream, 77, 89, COMPOUND), (image stre... \n", "9 [(similarity graph, 8, 24, COMPOUND), (correla... \n", "\n", " comp_nouns \\\n", "0 {mechanisms, low-quality data, problem, requir... \n", "1 {function class, solution, result, performance... \n", "2 {result, conditions, dimensionality reduction ... \n", "3 {large-sample limit, concentration, asymptotic... \n", "4 {machine learning, discrete-time analogues, gr... \n", "5 {work, optimization, portfolio, dependence, th... \n", "6 {entropy, problem, classes, conditions, number... \n", "7 {problem, image, distances, partitions, cost, ... \n", "8 {text-image parallel, language descriptions, m... \n", "9 {practice kwikcluster, similarity, ratio, prac... 
\n", "\n", " clean_ents \n", "0 {mechanisms, low-quality data, workers, questi... \n", "1 {function class, svm, guessing, learners, conv... \n", "2 {dimensionality reduction methods, conditions,... \n", "3 {classes, form parametric, methods, dirichlet ... \n", "4 {machine learning, discrete-time analogues, gr... \n", "5 {dependence, events, asset returns, dimensions... \n", "6 {classes, conditions, partitions, multiclass, ... \n", "7 {matching, algorithm, cost perfect, lp relaxat... \n", "8 {text-image parallel, language descriptions, m... \n", "9 {practice kwikcluster, serializability, result... " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "funcs = [drop_duplicate_np_splits, drop_double_char, keep_alpha, drop_single_char_nps, remove_freq_words]\n", "add_clean_ents(df, funcs)\n", "display(df)" ] }, { "cell_type": "code", "execution_count": 179, "metadata": {}, "outputs": [], "source": [ "def visualize_entities(df, idx=0):\n", " \"\"\"Visualize the entities for a given abstract in the dataframe. \n", " \n", " Keyword arguments:\n", " df -- a dataframe object\n", " idx -- the index of interest for the dataframe (default 0)\n", " \n", " \"\"\"\n", " # store entity start and end index for visualization in dummy df\n", " ents = []\n", " abstract = df['text'][idx]\n", " for ent in df['clean_ents'][idx]:\n", " i = abstract.find(ent) # locate the index of the entity in the abstract\n", " ents.append((ent, i, i+len(ent), 'ENTITY')) \n", " ents.sort(key=lambda tup: tup[1])\n", "\n", " dummy_df = pd.DataFrame([abstract, ents]).T # transpose dataframe\n", " dummy_df.columns = ['text', 'clean_ents']\n", " column = 'clean_ents'\n", " render_entities(0, dummy_df, options=options, column=column)" ] }, { "cell_type": "code", "execution_count": 181, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " crowdsourcing\n", " ENTITY\n", "\n", " has gained immense popularity in \n", "\n", " machine learning applications\n", " ENTITY\n", "\n", " for obtaining large \n", "\n", " amounts\n", " ENTITY\n", "\n", " of labeled data. crowdsourcing is cheap and fast, but suffers from the problem of \n", "\n", " low-quality data\n", " ENTITY\n", "\n", ". to address this fundamental challenge in crowdsourcing, we propose a simple \n", "\n", " payment mechanism\n", " ENTITY\n", "\n", " to incentivize \n", "\n", " workers\n", " ENTITY\n", "\n", " to answer only the \n", "\n", " questions\n", " ENTITY\n", "\n", " that they are sure of and skip the rest. we show that surprisingly, under a mild and natural \n", "\n", " no-free-lunch requirement\n", " ENTITY\n", "\n", ", this mechanism is the one and only incentive-compatible payment mechanism possible. we also show that among all possible incentive-compatible \n", "\n", " mechanisms\n", " ENTITY\n", "\n", " (that may or may not satisfy no-free-lunch), our mechanism makes the smallest possible payment to \n", "\n", " spammers\n", " ENTITY\n", "\n", ". interestingly, this unique mechanism takes a multiplicative form. the \n", "\n", " simplicity\n", " ENTITY\n", "\n", " of the mechanism is an added benefit. in preliminary \n", "\n", " experiments\n", " ENTITY\n", "\n", " involving over several hundred workers, we observe a significant reduction in the \n", "\n", " error rates\n", " ENTITY\n", "\n", " under our unique mechanism for the same or lower monetary \n", "\n", " expenditure\n", " ENTITY\n", "\n", ".
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "visualize_entities(df, 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's a good-looking list of concepts, wouldn't you say? By removing stop words (here, the 5000 most common English words) and fine-tuning our set with a few heuristics, we were able to narrow the list down to the more important entities in this first abstract! Let's finish up with a quick recap of our approach and some thoughts on what we can do going forward. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 9: Celebrate with excessive fist-pumping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, at the risk of tooting our own horn, I feel rather confident saying that we've accomplished what we set out to do! We took an abstract from a scientific paper, combined named and regular entities, extracted compound noun phrases, and pared down the final list using heuristics and stop word domain restriction to generate a set of important concepts. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Source: GIPHY
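
As a recap of the final filtering step, here is a minimal sketch of the frequent-word filter written as a set difference. It is an illustration under assumptions, not the notebook's exact implementation: the helper name `drop_frequent_words`, its `freq_words` parameter, and the tiny sample data are all hypothetical. Unlike `remove_freq_words`, it takes the word list as an argument instead of re-reading the CSV on every call, and it returns a new set rather than modifying its input.

```python
def drop_frequent_words(ents, freq_words):
    # Hypothetical helper: return a new set with common English words removed.
    # The input set is left untouched.
    return set(ents) - set(freq_words)


# Made-up sample data standing in for the frequent-word list and one row of entities
freq_words = ['the', 'be', 'problem', 'result']
ents = {'problem', 'low-quality data', 'result', 'payment mechanism'}

clean = drop_frequent_words(ents, freq_words)
print(sorted(clean))  # ['low-quality data', 'payment mechanism']
```

With the word list loaded once (for example, from the `Word` column of `freq_words.csv`), a function like this could be applied to each row with `df['comp_nouns'].apply(lambda s: drop_frequent_words(s, freq_words))`.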
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep in mind that the goal of this exercise wasn't to create the world's best entity extractor. It was to get a fast baseline for what we can do with limited knowledge about the domain and limited use of deep learning superpowers. We've now ended up with a prototype that shows we can get relatively far using out-of-the-box methods, with minor scripting for customization. And the best part? Our approach didn't require any extensive compute or proprietary software! \n", "\n", "Going forward, we'd want to test our approach on larger data sets (perhaps full scientific papers), and create an easy-to-use API for visualization, as well as individual and batch processing of text sources. Improving the actual entity extraction itself might involve a language model trained on academic papers or the addition of other intelligent heuristics. At some point, we'd also want to link each entity to an external database with further information, so that our conversational academic paper program would be able to orient these concepts within a larger knowledge graph. \n", "\n", "At the end of all of this, we've built a fast entity extraction prototype that moves us towards creating an engine to communicate with academic papers, which will (hopefully) lay the foundation for an automated scientific discovery tool.\n", "\n", "Great work! " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }