{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"All the IPython Notebooks in **[Python Natural Language Processing](https://github.com/milaan9/Python_Python_Natural_Language_Processing)** lecture series by **[Dr. Milaan Parmar](https://www.linkedin.com/in/milaanparmar/)** are available @ **[GitHub](https://github.com/milaan9)**\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "C5SsYvaIAjXd"
},
"source": [
"# 06 Named Entity Recognition (NER)\n",
"(also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes\n",
"\n",
"spaCy has an **'ner'** pipeline component that identifies token spans fitting a predetermined set of named entities. These are available as the **`ents`** property of a **`Doc`** object.\n",
"\n",
"https://spacy.io/usage/training#ner"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "oMxIwzuVAjXe"
},
"outputs": [],
"source": [
"# Perform standard imports\n",
"import spacy\n",
"nlp = spacy.load('en_core_web_sm')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "5egxj7m2AjXj"
},
"outputs": [],
"source": [
"# Write a function to display basic entity info:\n",
"def show_ents(doc):\n",
" if doc.ents:\n",
" for ent in doc.ents:\n",
" print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))\n",
" else:\n",
" print('No named entities found.')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "PU_SegPsPbNm",
"outputId": "71fb6119-64f8-4e91-823b-96ea9ab81937"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Milaan Parmar CS - ORG - Companies, agencies, institutions, etc.\n"
]
}
],
"source": [
"doc = nlp(u'Hi, everyone welcome to Milaan Parmar CS tutorial on NPL')\n",
"\n",
"show_ents(doc)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "hOE3UPZ5AjXo",
"outputId": "89e61213-11e1-4ba3-8144-fa101ddc4053",
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"England - GPE - Countries, cities, states\n",
"Canada - GPE - Countries, cities, states\n",
"next month - DATE - Absolute or relative dates or periods\n"
]
}
],
"source": [
"doc = nlp(u'May I go to England or Canada, next month to see the virus report?')\n",
"\n",
"show_ents(doc)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YYSnge4iAjXx"
},
"source": [
"## Entity annotations\n",
"`Doc.ents` are token spans with their own set of annotations.\n",
"
\n",
"| `ent.text` | The original entity text |
\n",
"| `ent.label` | The entity type's hash value |
\n",
"| `ent.label_` | The entity type's string description |
\n",
"| `ent.start` | The token span's *start* index position in the Doc |
\n",
"| `ent.end` | The token span's *stop* index position in the Doc |
\n",
"| `ent.start_char` | The entity text's *start* index position in the Doc |
\n",
"| `ent.end_char` | The entity text's *stop* index position in the Doc |
\n",
"
\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "CU0mEsoPAjXx",
"outputId": "cb31c06a-fba3-4ad2-f376-42afa0bee2fd"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"500 dollars 4 6 20 31 MONEY\n",
"Blake 7 8 37 42 PERSON\n",
"Microsoft 11 12 55 64 ORG\n"
]
}
],
"source": [
"doc = nlp(u'Can I please borrow 500 dollars from Blake to buy some Microsoft stock?')\n",
"\n",
"for ent in doc.ents:\n",
" print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cvNQ5nFGAjX1"
},
"source": [
"## NER Tags\n",
"Tags are accessible through the `.label_` property of an entity.\n",
"\n",
"| TYPE | DESCRIPTION | EXAMPLE |
\n",
"| `PERSON` | People, including fictional. | *Fred Flintstone* |
\n",
"| `NORP` | Nationalities or religious or political groups. | *The Republican Party* |
\n",
"| `FAC` | Buildings, airports, highways, bridges, etc. | *Logan International Airport, The Golden Gate* |
\n",
"| `ORG` | Companies, agencies, institutions, etc. | *Microsoft, FBI, MIT* |
\n",
"| `GPE` | Countries, cities, states. | *France, UAR, Chicago, Idaho* |
\n",
"| `LOC` | Non-GPE locations, mountain ranges, bodies of water. | *Europe, Nile River, Midwest* |
\n",
"| `PRODUCT` | Objects, vehicles, foods, etc. (Not services.) | *Formula 1* |
\n",
"| `EVENT` | Named hurricanes, battles, wars, sports events, etc. | *Olympic Games* |
\n",
"| `WORK_OF_ART` | Titles of books, songs, etc. | *The Mona Lisa* |
\n",
"| `LAW` | Named documents made into laws. | *Roe v. Wade* |
\n",
"| `LANGUAGE` | Any named language. | *English* |
\n",
"| `DATE` | Absolute or relative dates or periods. | *20 July 1969* |
\n",
"| `TIME` | Times smaller than a day. | *Four hours* |
\n",
"| `PERCENT` | Percentage, including \"%\". | *Eighty percent* |
\n",
"| `MONEY` | Monetary values, including unit. | *Twenty Cents* |
\n",
"| `QUANTITY` | Measurements, as of weight or distance. | *Several kilometers, 55kg* |
\n",
"| `ORDINAL` | \"first\", \"second\", etc. | *9th, Ninth* |
\n",
"| `CARDINAL` | Numerals that do not fall under another type. | *2, Two, Fifty-two* |
\n",
"
"
]
},
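{
"cell_type": "markdown",
"metadata": {},
"source": [
"The descriptions in the table above are also available programmatically via `spacy.explain()`. A quick sketch (any tag string from the table works):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Look up a tag's description without memorizing the table:\n",
"for tag in ['PERSON', 'GPE', 'WORK_OF_ART']:\n",
"    print(tag, '-', spacy.explain(tag))"
]
},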
{
"cell_type": "markdown",
"metadata": {
"id": "Gvqo94OAAjX2"
},
"source": [
"___\n",
"### **Adding a Named Entity to a Span**\n",
"Normally we would have spaCy build a library of named entities by training it on several samples of text.
In this case, we only want to add one value:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "6QFBkrcqAjX2",
"outputId": "1f9083c2-ad2a-474c-90a1-de6ef81219e3"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Arthur - ORG - Companies, agencies, institutions, etc.\n",
"U.K. - GPE - Countries, cities, states\n",
"$6 million - MONEY - Monetary values, including unit\n"
]
}
],
"source": [
"doc = nlp(u'Arthur to build a U.K. factory for $6 million')\n",
"\n",
"show_ents(doc)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Etm6Z9CWGeRA"
},
"source": [
"Add **Milaan** as **PERSON**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 248
},
"id": "G_8tbV5rAjX7",
"outputId": "a19effa8-2ae6-4e38-c4cd-a15c0e92168d"
},
"outputs": [
{
"ename": "ValueError",
"evalue": "ignored",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;31m# Add the entity to the existing Doc object\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 10\u001b[0;31m \u001b[0mdoc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0ments\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdoc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0ments\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mnew_ent\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32mdoc.pyx\u001b[0m in \u001b[0;36mspacy.tokens.doc.Doc.ents.__set__\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mValueError\u001b[0m: [E103] Trying to set conflicting doc.ents: '(0, 1, 'ORG')' and '(0, 1, 'PERSON')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap."
]
}
],
"source": [
"from spacy.tokens import Span\n",
"\n",
"# Get the hash value of the ORG entity label\n",
"ORG = doc.vocab.strings[u'PERSON'] \n",
"\n",
"# Create a Span for the new entity\n",
"new_ent = Span(doc, 0, 1, label=ORG)\n",
"\n",
"# Add the entity to the existing Doc object\n",
"doc.ents = list(doc.ents) + [new_ent]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "unAVoOAxAjX_"
},
"source": [
"In the code above, the arguments passed to `Span()` are:\n",
"- `doc` - the name of the Doc object\n",
"- `0` - the *start* index position of the span\n",
"- `1` - the *stop* index position (exclusive)\n",
"- `label=PERSON` - the label assigned to our entity"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "JtoYKMknAjYA",
"outputId": "aaf8d3ee-2655-432b-d70a-dc79208af5a0",
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Arthur - ORG - Companies, agencies, institutions, etc.\n",
"U.K. - GPE - Countries, cities, states\n",
"$6 million - MONEY - Monetary values, including unit\n"
]
}
],
"source": [
"show_ents(doc)"
]
},
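{
"cell_type": "markdown",
"metadata": {},
"source": [
"As expected, 'Arthur' is still `ORG`. A minimal sketch of the fix, reusing the same `doc` and `new_ent` from above (the overlap filter is our addition, not part of the original lesson):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Drop any existing entity that overlaps the new span, then add the span.\n",
"# This avoids the E103 conflict raised above.\n",
"doc.ents = [e for e in doc.ents if e.end <= new_ent.start or e.start >= new_ent.end] + [new_ent]\n",
"\n",
"show_ents(doc)"
]
},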
{
"cell_type": "markdown",
"metadata": {
"id": "nPNZAomgAjYD"
},
"source": [
"___\n",
"## Adding Named Entities to All Matching Spans\n",
"What if we want to tag *all* occurrences of \"WORDS\"? WE NEED TO use the PhraseMatcher to identify a series of spans in the Doc:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "eocm1Oq9AjYE",
"outputId": "2a4a0f1d-a6f9-4256-df81-54d90528ddcc"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"first - ORDINAL - \"first\", \"second\", etc.\n"
]
}
],
"source": [
"doc = nlp(u'Our company plans to introduce a new vacuum cleaner. '\n",
" u'If successful, the vacuum cleaner will be our first product.')\n",
"\n",
"show_ents(doc)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "x_ZfN0WJAjYH"
},
"outputs": [],
"source": [
"# Import PhraseMatcher and create a matcher object:\n",
"from spacy.matcher import PhraseMatcher\n",
"matcher = PhraseMatcher(nlp.vocab)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "WwTZbId2AjYK"
},
"outputs": [],
"source": [
"# Create the desired phrase patterns:\n",
"phrase_list = ['vacuum cleaner', 'vacuum-cleaner']\n",
"phrase_patterns = [nlp(text) for text in phrase_list]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "zdvup9bOAjYN",
"outputId": "c4eff7bc-0341-4460-98a5-2a333ea26820"
},
"outputs": [
{
"data": {
"text/plain": [
"[(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Apply the patterns to our matcher object:\n",
"matcher.add('newproduct', None, *phrase_patterns)\n",
"\n",
"# Apply the matcher to our Doc object:\n",
"matches = matcher(doc)\n",
"\n",
"# See what matches occur:\n",
"matches"
]
},
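{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each match is a `(match_id, start, end)` tuple: the hash of the pattern name plus the token-span boundaries, which is why the next cell slices with `match[1]` and `match[2]`. (Note: `matcher.add(name, None, *patterns)` is the spaCy 2.x signature this notebook uses; spaCy 3.x expects `matcher.add(name, patterns)`.) A quick sketch to make the tuples readable:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Decode each match: recover the pattern name from its hash,\n",
"# and the matched text by slicing the Doc with the token indices:\n",
"for match_id, start, end in matches:\n",
"    print(nlp.vocab.strings[match_id], '->', doc[start:end].text)"
]
},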
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"id": "mcRv7ezzAjYS"
},
"outputs": [],
"source": [
"# Here we create Spans from each match, and create named entities from them:\n",
"from spacy.tokens import Span\n",
"\n",
"PROD = doc.vocab.strings[u'PRODUCT']\n",
"\n",
"new_ents = [Span(doc, match[1],match[2],label=PROD) for match in matches]\n",
"\n",
"doc.ents = list(doc.ents) + new_ents"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Hn77OkeNAjYW",
"outputId": "0f5c8750-5c9b-404b-fc87-313dfda6edcc"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)\n",
"vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)\n",
"first - ORDINAL - \"first\", \"second\", etc.\n"
]
}
],
"source": [
"show_ents(doc)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6WLXgXDYAjYa"
},
"source": [
"___\n",
"## Counting Entities\n",
"While spaCy may not have a built-in tool for counting entities, we can pass a conditional statement into a list comprehension:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Y4AQj2yfAjYb",
"outputId": "0df969e7-a578-44e9-e888-23aee6d168fb"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"29.50 - MONEY - Monetary values, including unit\n",
"five dollars - MONEY - Monetary values, including unit\n"
]
}
],
"source": [
"doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')\n",
"\n",
"show_ents(doc)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "NPr5Q-eWAjYi",
"outputId": "ffe19b4c-3f43-4bc2-e498-50eb6f6b7df6"
},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len([ent for ent in doc.ents if ent.label_=='MONEY'])"
]
},
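{
"cell_type": "markdown",
"metadata": {},
"source": [
"To tally every label at once rather than one label at a time, a `collections.Counter` over the entity labels is a handy sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# Count entities by label across the whole Doc:\n",
"Counter(ent.label_ for ent in doc.ents)"
]
},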
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "xw0U1DC5AjYp",
"outputId": "180a517b-e326-4cac-f735-095072785cb8"
},
"outputs": [
{
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
},
"text/plain": [
"'2.2.4'"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spacy.__version__"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "MYzkCgmfAjYu",
"outputId": "4ed899d6-9e4c-403a-cf13-4a649309b4e0"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"29.50 - MONEY - Monetary values, including unit\n",
"five dollars - MONEY - Monetary values, including unit\n"
]
}
],
"source": [
"\n",
"doc = nlp(u'Originally priced at $29.50,\\nthe sweater was marked down to five dollars.')\n",
"\n",
"show_ents(doc)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8E0XnrraAjYy"
},
"source": [
"### However, there is a simple fix that can be added to the nlp pipeline:\n",
"\n",
"https://spacy.io/usage/processing-pipelines"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"id": "jkNfED2IAjYz"
},
"outputs": [],
"source": [
"# Quick function to remove ents formed on whitespace:\n",
"def remove_whitespace_entities(doc):\n",
" doc.ents = [e for e in doc.ents if not e.text.isspace()]\n",
" return doc\n",
"\n",
"# Insert this into the pipeline AFTER the ner component:\n",
"nlp.add_pipe(remove_whitespace_entities, after='ner')"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "KyDsaxhFAjY2",
"outputId": "2fc120a8-161d-47e0-e118-d7578e55f2ba"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"29.50 - MONEY - Monetary values, including unit\n",
"five dollars - MONEY - Monetary values, including unit\n"
]
}
],
"source": [
"# Rerun nlp on the text above, and show ents:\n",
"doc = nlp(u'Originally priced at $29.50,\\nthe sweater was marked down to five dollars.')\n",
"\n",
"show_ents(doc)"
]
},
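{
"cell_type": "markdown",
"metadata": {},
"source": [
"Passing a bare function to `nlp.add_pipe()` only works in spaCy 2.x, the version this notebook runs (`2.2.4`, as shown above). On spaCy 3.x the component must be registered by name first; a minimal sketch of the equivalent, left commented out so it does not break a 2.x run:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# spaCy 3.x equivalent of the pipeline fix above:\n",
"# from spacy.language import Language\n",
"#\n",
"# @Language.component('remove_whitespace_entities')\n",
"# def remove_whitespace_entities(doc):\n",
"#     doc.ents = [e for e in doc.ents if not e.text.isspace()]\n",
"#     return doc\n",
"#\n",
"# nlp.add_pipe('remove_whitespace_entities', after='ner')"
]
},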
{
"cell_type": "markdown",
"metadata": {
"id": "5nr_1GyZAjY7"
},
"source": [
"For more on **Named Entity Recognition** visit https://spacy.io/usage/linguistic-features#101"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_dcHMWK9AjY8"
},
"source": [
"___\n",
"## Noun Chunks\n",
"`Doc.noun_chunks` are *base noun phrases*: token spans that include the noun and words describing the noun. Noun chunks cannot be nested, cannot overlap, and do not involve prepositional phrases or relative clauses.
\n",
"Where `Doc.ents` rely on the **ner** pipeline component, `Doc.noun_chunks` are provided by the **parser**."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YUtcqSTZAjY8"
},
"source": [
"### `noun_chunks` components:\n",
"\n",
"| `.text` | The original noun chunk text. |
\n",
"| `.root.text` | The original text of the word connecting the noun chunk to the rest of the parse. |
\n",
"| `.root.dep_` | Dependency relation connecting the root to its head. |
\n",
"| `.root.head.text` | The text of the root token's head. |
\n",
"
"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "v6cVi0G5AjY9",
"outputId": "f9355e51-e6a4-44c4-a997-ae6dac031c37"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Autonomous cars - cars - nsubj - shift\n",
"insurance liability - liability - dobj - shift\n",
"manufacturers - manufacturers - pobj - toward\n"
]
}
],
"source": [
"doc = nlp(u\"Autonomous cars shift insurance liability toward manufacturers.\")\n",
"\n",
"for chunk in doc.noun_chunks:\n",
" print(chunk.text+' - '+chunk.root.text+' - '+chunk.root.dep_+' - '+chunk.root.head.text)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "F7BZ5FqMAjZA"
},
"source": [
"### `Doc.noun_chunks` is a generator function\n",
"Previously we mentioned that `Doc` objects do not retain a list of sentences, but they're available through the `Doc.sents` generator.
It's the same with `Doc.noun_chunks` - lists can be created if needed:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 163
},
"id": "0vZ5biA4AjZA",
"outputId": "156aac33-eff2-45bf-8420-dacb2bfb9b6d"
},
"outputs": [
{
"ename": "TypeError",
"evalue": "ignored",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdoc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnoun_chunks\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m: object of type 'generator' has no len()"
]
}
],
"source": [
"len(doc.noun_chunks)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Wb-4fGA4AjZF",
"outputId": "82adf8ab-b87f-4c00-8b52-ae3fb3ce7b5a"
},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(list(doc.noun_chunks))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-HUINE9GAjZK"
},
"source": [
"For more on **noun_chunks** visit https://spacy.io/usage/linguistic-features#noun-chunks"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "6_Named_Entity_Recognition.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 1
}