{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "All the IPython Notebooks in **[Python Natural Language Processing](https://github.com/milaan9/Python_Python_Natural_Language_Processing)** lecture series by **[Dr. Milaan Parmar](https://www.linkedin.com/in/milaanparmar/)** are available @ **[GitHub](https://github.com/milaan9)**\n", "" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "C5SsYvaIAjXd" }, "source": [ "# 06 Named Entity Recognition (NER)\n", "**Named Entity Recognition** (also known as entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, and medical codes.\n", "\n", "spaCy has an **'ner'** pipeline component that identifies token spans fitting a predetermined set of named entities. 
These are available as the **`ents`** property of a **`Doc`** object.\n", "\n", "https://spacy.io/usage/training#ner" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "oMxIwzuVAjXe" }, "outputs": [], "source": [ "# Perform standard imports\n", "import spacy\n", "nlp = spacy.load('en_core_web_sm')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "5egxj7m2AjXj" }, "outputs": [], "source": [ "# Write a function to display basic entity info:\n", "def show_ents(doc):\n", " if doc.ents:\n", " for ent in doc.ents:\n", " print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))\n", " else:\n", " print('No named entities found.')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "PU_SegPsPbNm", "outputId": "71fb6119-64f8-4e91-823b-96ea9ab81937" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Milaan Parmar CS - ORG - Companies, agencies, institutions, etc.\n" ] } ], "source": [ "doc = nlp(u'Hi, everyone welcome to Milaan Parmar CS tutorial on NLP')\n", "\n", "show_ents(doc)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "hOE3UPZ5AjXo", "outputId": "89e61213-11e1-4ba3-8144-fa101ddc4053", "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "England - GPE - Countries, cities, states\n", "Canada - GPE - Countries, cities, states\n", "next month - DATE - Absolute or relative dates or periods\n" ] } ], "source": [ "doc = nlp(u'May I go to England or Canada, next month to see the virus report?')\n", "\n", "show_ents(doc)" ] }, { "cell_type": "markdown", "metadata": { "id": "YYSnge4iAjXx" }, "source": [ "## Entity annotations\n", "`Doc.ents` are token spans with their own set of annotations.\n", "\n", "
<table>
<tr><td>`ent.text`</td><td>The original entity text</td></tr>
<tr><td>`ent.label`</td><td>The entity type's hash value</td></tr>
<tr><td>`ent.label_`</td><td>The entity type's string description</td></tr>
<tr><td>`ent.start`</td><td>The token span's *start* index position in the Doc</td></tr>
<tr><td>`ent.end`</td><td>The token span's *stop* index position in the Doc</td></tr>
<tr><td>`ent.start_char`</td><td>The entity text's *start* index position in the Doc</td></tr>
<tr><td>`ent.end_char`</td><td>The entity text's *stop* index position in the Doc</td></tr>
</table>
\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "CU0mEsoPAjXx", "outputId": "cb31c06a-fba3-4ad2-f376-42afa0bee2fd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "500 dollars 4 6 20 31 MONEY\n", "Blake 7 8 37 42 PERSON\n", "Microsoft 11 12 55 64 ORG\n" ] } ], "source": [ "doc = nlp(u'Can I please borrow 500 dollars from Blake to buy some Microsoft stock?')\n", "\n", "for ent in doc.ents:\n", " print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)" ] }, { "cell_type": "markdown", "metadata": { "id": "cvNQ5nFGAjX1" }, "source": [ "## NER Tags\n", "Tags are accessible through the `.label_` property of an entity.\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<table>
<tr><th>TYPE</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
<tr><td>`PERSON`</td><td>People, including fictional.</td><td>*Fred Flintstone*</td></tr>
<tr><td>`NORP`</td><td>Nationalities or religious or political groups.</td><td>*The Republican Party*</td></tr>
<tr><td>`FAC`</td><td>Buildings, airports, highways, bridges, etc.</td><td>*Logan International Airport, The Golden Gate*</td></tr>
<tr><td>`ORG`</td><td>Companies, agencies, institutions, etc.</td><td>*Microsoft, FBI, MIT*</td></tr>
<tr><td>`GPE`</td><td>Countries, cities, states.</td><td>*France, UAR, Chicago, Idaho*</td></tr>
<tr><td>`LOC`</td><td>Non-GPE locations, mountain ranges, bodies of water.</td><td>*Europe, Nile River, Midwest*</td></tr>
<tr><td>`PRODUCT`</td><td>Objects, vehicles, foods, etc. (Not services.)</td><td>*Formula 1*</td></tr>
<tr><td>`EVENT`</td><td>Named hurricanes, battles, wars, sports events, etc.</td><td>*Olympic Games*</td></tr>
<tr><td>`WORK_OF_ART`</td><td>Titles of books, songs, etc.</td><td>*The Mona Lisa*</td></tr>
<tr><td>`LAW`</td><td>Named documents made into laws.</td><td>*Roe v. Wade*</td></tr>
<tr><td>`LANGUAGE`</td><td>Any named language.</td><td>*English*</td></tr>
<tr><td>`DATE`</td><td>Absolute or relative dates or periods.</td><td>*20 July 1969*</td></tr>
<tr><td>`TIME`</td><td>Times smaller than a day.</td><td>*Four hours*</td></tr>
<tr><td>`PERCENT`</td><td>Percentage, including \"%\".</td><td>*Eighty percent*</td></tr>
<tr><td>`MONEY`</td><td>Monetary values, including unit.</td><td>*Twenty Cents*</td></tr>
<tr><td>`QUANTITY`</td><td>Measurements, as of weight or distance.</td><td>*Several kilometers, 55kg*</td></tr>
<tr><td>`ORDINAL`</td><td>\"first\", \"second\", etc.</td><td>*9th, Ninth*</td></tr>
<tr><td>`CARDINAL`</td><td>Numerals that do not fall under another type.</td><td>*2, Two, Fifty-two*</td></tr>
</table>
" ] }, { "cell_type": "markdown", "metadata": { "id": "Gvqo94OAAjX2" }, "source": [ "___\n", "### **Adding a Named Entity to a Span**\n", "Normally we would have spaCy build a library of named entities by training it on several samples of text.
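\n", "One caveat when setting entities by hand: a token can belong to at most one entity, so assigning a span that overlaps an existing entity raises `ValueError` (error code `[E103]`). A minimal sketch of the safe pattern, dropping any overlapping entity before assigning (this uses a blank pipeline and hand-picked labels purely for illustration, so no model download is needed):\n", "\n", "```python\n", "import spacy\n", "from spacy.tokens import Span\n", "\n", "# A blank English pipeline tokenizes only - enough to show entity assignment:\n", "nlp = spacy.blank('en')\n", "doc = nlp('Arthur to build a U.K. factory for $6 million')\n", "\n", "# Pretend the model (wrongly) tagged token 0 ('Arthur') as ORG:\n", "doc.ents = [Span(doc, 0, 1, label='ORG')]\n", "\n", "# Re-label the same token as PERSON: filter out overlapping entities first,\n", "# otherwise spaCy raises ValueError [E103]:\n", "new_ent = Span(doc, 0, 1, label='PERSON')\n", "doc.ents = [e for e in doc.ents\n", "            if e.end <= new_ent.start or e.start >= new_ent.end] + [new_ent]\n", "\n", "print([(e.text, e.label_) for e in doc.ents])\n", "```\n", "\n", "The filter keeps every existing entity that ends before the new span starts or starts after it ends.\n", "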
In this case, we only want to add one value:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6QFBkrcqAjX2", "outputId": "1f9083c2-ad2a-474c-90a1-de6ef81219e3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Arthur - ORG - Companies, agencies, institutions, etc.\n", "U.K. - GPE - Countries, cities, states\n", "$6 million - MONEY - Monetary values, including unit\n" ] } ], "source": [ "doc = nlp(u'Arthur to build a U.K. factory for $6 million')\n", "\n", "show_ents(doc)" ] }, { "cell_type": "markdown", "metadata": { "id": "Etm6Z9CWGeRA" }, "source": [ "Add **Milaan** as **PERSON**" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 248 }, "id": "G_8tbV5rAjX7", "outputId": "a19effa8-2ae6-4e38-c4cd-a15c0e92168d" }, "outputs": [ { "ename": "ValueError", "evalue": "ignored", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;31m# Add the entity to the existing Doc object\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mdoc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0ments\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdoc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0ments\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mnew_ent\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32mdoc.pyx\u001b[0m in \u001b[0;36mspacy.tokens.doc.Doc.ents.__set__\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: [E103] 
Trying to set conflicting doc.ents: '(0, 1, 'ORG')' and '(0, 1, 'PERSON')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap." ] } ], "source": [ "from spacy.tokens import Span\n", "\n", "# Get the hash value of the PERSON entity label\n", "PERSON = doc.vocab.strings[u'PERSON'] \n", "\n", "# Create a Span for the new entity\n", "new_ent = Span(doc, 0, 1, label=PERSON)\n", "\n", "# Add the entity to the existing Doc object\n", "doc.ents = list(doc.ents) + [new_ent]" ] }, { "cell_type": "markdown", "metadata": { "id": "unAVoOAxAjX_" }, "source": [ "In the code above, the arguments passed to `Span()` are:\n", "- `doc` - the name of the Doc object\n", "- `0` - the *start* index position of the span\n", "- `1` - the *stop* index position (exclusive)\n", "- `label=PERSON` - the label assigned to our entity" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JtoYKMknAjYA", "outputId": "aaf8d3ee-2655-432b-d70a-dc79208af5a0", "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Arthur - ORG - Companies, agencies, institutions, etc.\n", "U.K. - GPE - Countries, cities, states\n", "$6 million - MONEY - Monetary values, including unit\n" ] } ], "source": [ "show_ents(doc)" ] }, { "cell_type": "markdown", "metadata": { "id": "nPNZAomgAjYD" }, "source": [ "___\n", "## Adding Named Entities to All Matching Spans\n", "What if we want to tag *all* occurrences of a word or phrase? We need to use the PhraseMatcher to identify a series of spans in the Doc:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "eocm1Oq9AjYE", "outputId": "2a4a0f1d-a6f9-4256-df81-54d90528ddcc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "first - ORDINAL - \"first\", \"second\", etc.\n" ] } ], "source": [ "doc = nlp(u'Our company plans to introduce a new vacuum cleaner. 
'\n", " u'If successful, the vacuum cleaner will be our first product.')\n", "\n", "show_ents(doc)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "x_ZfN0WJAjYH" }, "outputs": [], "source": [ "# Import PhraseMatcher and create a matcher object:\n", "from spacy.matcher import PhraseMatcher\n", "matcher = PhraseMatcher(nlp.vocab)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "WwTZbId2AjYK" }, "outputs": [], "source": [ "# Create the desired phrase patterns:\n", "phrase_list = ['vacuum cleaner', 'vacuum-cleaner']\n", "phrase_patterns = [nlp(text) for text in phrase_list]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "zdvup9bOAjYN", "outputId": "c4eff7bc-0341-4460-98a5-2a333ea26820" }, "outputs": [ { "data": { "text/plain": [ "[(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Apply the patterns to our matcher object:\n", "matcher.add('newproduct', None, *phrase_patterns)\n", "\n", "# Apply the matcher to our Doc object:\n", "matches = matcher(doc)\n", "\n", "# See what matches occur:\n", "matches" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "mcRv7ezzAjYS" }, "outputs": [], "source": [ "# Here we create Spans from each match, and create named entities from them:\n", "from spacy.tokens import Span\n", "\n", "PROD = doc.vocab.strings[u'PRODUCT']\n", "\n", "new_ents = [Span(doc, match[1],match[2],label=PROD) for match in matches]\n", "\n", "doc.ents = list(doc.ents) + new_ents" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Hn77OkeNAjYW", "outputId": "0f5c8750-5c9b-404b-fc87-313dfda6edcc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. 
(not services)\n", "vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)\n", "first - ORDINAL - \"first\", \"second\", etc.\n" ] } ], "source": [ "show_ents(doc)" ] }, { "cell_type": "markdown", "metadata": { "id": "6WLXgXDYAjYa" }, "source": [ "___\n", "## Counting Entities\n", "While spaCy may not have a built-in tool for counting entities, we can pass a conditional statement into a list comprehension:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Y4AQj2yfAjYb", "outputId": "0df969e7-a578-44e9-e888-23aee6d168fb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "29.50 - MONEY - Monetary values, including unit\n", "five dollars - MONEY - Monetary values, including unit\n" ] } ], "source": [ "doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')\n", "\n", "show_ents(doc)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "NPr5Q-eWAjYi", "outputId": "ffe19b4c-3f43-4bc2-e498-50eb6f6b7df6" }, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len([ent for ent in doc.ents if ent.label_=='MONEY'])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "id": "xw0U1DC5AjYp", "outputId": "180a517b-e326-4cac-f735-095072785cb8" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'2.2.4'" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spacy.__version__" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MYzkCgmfAjYu", "outputId": "4ed899d6-9e4c-403a-cf13-4a649309b4e0" }, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "29.50 - MONEY - Monetary values, including unit\n", "five dollars - MONEY - Monetary values, including unit\n" ] } ], "source": [ "\n", "doc = nlp(u'Originally priced at $29.50,\\nthe sweater was marked down to five dollars.')\n", "\n", "show_ents(doc)" ] }, { "cell_type": "markdown", "metadata": { "id": "8E0XnrraAjYy" }, "source": [ "### With some models, a whitespace token such as the `\\n` above can itself be tagged as a named entity. However, there is a simple fix that can be added to the nlp pipeline:\n", "\n", "https://spacy.io/usage/processing-pipelines" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "id": "jkNfED2IAjYz" }, "outputs": [], "source": [ "# Quick function to remove ents formed on whitespace:\n", "def remove_whitespace_entities(doc):\n", " doc.ents = [e for e in doc.ents if not e.text.isspace()]\n", " return doc\n", "\n", "# Insert this into the pipeline AFTER the ner component:\n", "nlp.add_pipe(remove_whitespace_entities, after='ner')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "KyDsaxhFAjY2", "outputId": "2fc120a8-161d-47e0-e118-d7578e55f2ba" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "29.50 - MONEY - Monetary values, including unit\n", "five dollars - MONEY - Monetary values, including unit\n" ] } ], "source": [ "# Rerun nlp on the text above, and show ents:\n", "doc = nlp(u'Originally priced at $29.50,\\nthe sweater was marked down to five dollars.')\n", "\n", "show_ents(doc)" ] }, { "cell_type": "markdown", "metadata": { "id": "5nr_1GyZAjY7" }, "source": [ "For more on **Named Entity Recognition** visit https://spacy.io/usage/linguistic-features#101" ] }, { "cell_type": "markdown", "metadata": { "id": "_dcHMWK9AjY8" }, "source": [ "___\n", "## Noun Chunks\n", "`Doc.noun_chunks` are *base noun phrases*: token spans that include the noun and words describing the noun. Noun chunks cannot be nested, cannot overlap, and do not involve prepositional phrases or relative clauses.
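\n", "Because noun chunks are computed from the dependency parse, a pipeline without a parser cannot produce them at all. A small sketch demonstrating this (a blank pipeline has only a tokenizer, so no model download is needed, and spaCy raises a `ValueError`):\n", "\n", "```python\n", "import spacy\n", "\n", "# A blank pipeline tokenizes only - no dependency parser:\n", "nlp = spacy.blank('en')\n", "doc = nlp('Autonomous cars shift insurance liability toward manufacturers.')\n", "\n", "try:\n", "    chunks = list(doc.noun_chunks)\n", "except ValueError as err:\n", "    print('noun_chunks requires the parser:', err)\n", "```\n", "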
\n", "Where `Doc.ents` rely on the **ner** pipeline component, `Doc.noun_chunks` are provided by the **parser**." ] }, { "cell_type": "markdown", "metadata": { "id": "YUtcqSTZAjY8" }, "source": [ "### `noun_chunks` components:\n", "\n", "\n", "\n", "\n", "\n", "
<table>
<tr><td>`.text`</td><td>The original noun chunk text.</td></tr>
<tr><td>`.root.text`</td><td>The original text of the word connecting the noun chunk to the rest of the parse.</td></tr>
<tr><td>`.root.dep_`</td><td>Dependency relation connecting the root to its head.</td></tr>
<tr><td>`.root.head.text`</td><td>The text of the root token's head.</td></tr>
</table>
" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "v6cVi0G5AjY9", "outputId": "f9355e51-e6a4-44c4-a997-ae6dac031c37" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Autonomous cars - cars - nsubj - shift\n", "insurance liability - liability - dobj - shift\n", "manufacturers - manufacturers - pobj - toward\n" ] } ], "source": [ "doc = nlp(u\"Autonomous cars shift insurance liability toward manufacturers.\")\n", "\n", "for chunk in doc.noun_chunks:\n", " print(chunk.text+' - '+chunk.root.text+' - '+chunk.root.dep_+' - '+chunk.root.head.text)" ] }, { "cell_type": "markdown", "metadata": { "id": "F7BZ5FqMAjZA" }, "source": [ "### `Doc.noun_chunks` is a generator function\n", "Previously we mentioned that `Doc` objects do not retain a list of sentences, but they're available through the `Doc.sents` generator.
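\n", "As a quick illustration of that pattern, a generator has no `len()` until it is materialized with `list()`. This sketch uses a blank pipeline plus the rule-based sentencizer so no trained model is needed; it uses the spaCy v3 `add_pipe` API, whereas under v2.x the equivalent would be `nlp.add_pipe(nlp.create_pipe('sentencizer'))`:\n", "\n", "```python\n", "import spacy\n", "\n", "# Blank pipeline + rule-based sentence splitter (no trained model needed):\n", "nlp = spacy.blank('en')\n", "nlp.add_pipe('sentencizer')\n", "\n", "doc = nlp('This is one sentence. This is another sentence.')\n", "\n", "# doc.sents is a generator, so len(doc.sents) would raise TypeError;\n", "# materialize it with list() first:\n", "sents = list(doc.sents)\n", "print(len(sents))  # -> 2\n", "```\n", "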
It's the same with `Doc.noun_chunks` - lists can be created if needed:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 163 }, "id": "0vZ5biA4AjZA", "outputId": "156aac33-eff2-45bf-8420-dacb2bfb9b6d" }, "outputs": [ { "ename": "TypeError", "evalue": "ignored", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdoc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnoun_chunks\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mTypeError\u001b[0m: object of type 'generator' has no len()" ] } ], "source": [ "len(doc.noun_chunks)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Wb-4fGA4AjZF", "outputId": "82adf8ab-b87f-4c00-8b52-ae3fb3ce7b5a" }, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(list(doc.noun_chunks))" ] }, { "cell_type": "markdown", "metadata": { "id": "-HUINE9GAjZK" }, "source": [ "For more on **noun_chunks** visit https://spacy.io/usage/linguistic-features#noun-chunks" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "6_Named_Entity_Recognition.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 1 }