{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Scraping\n", "\n", "One of the things I learned early on about scraping web pages (often referred to as \"screen scraping\") is that it often amounts to trying to recreate databases that have been re-presented as web pages using HTML templates. For example:\n", "\n", "- display a database table as an HTML table in a web page;\n", "- display each row of a database as a templated HTML page.\n", "\n", "The aim of the scrape in these cases might be as simple as pulling the table from the page and representing it as a dataframe, or trying to reverse engineer the HTML template that converts data to HTML into something that can extract the data from the HTML back as a row in a corresponding data table.\n", "\n", "In the latter case, the scrape may proceed in a couple of ways. For example:\n", "\n", "- by trying to identify structural HTML tag elements that contain recognisable data items, retrieving the HTML tag element, then extracting the data value;\n", "- parsing the recognisable literal *text* displayed on the web page and trying to extract data items based on that (i.e. ignore the HTML structural eelements and go straight for the extracted text). For an example of this sort of parsing, see the [r1chardj0n3s/parse](https://github.com/r1chardj0n3s/parse) Python package as applied to text pulled from a page using something like the [kennethreitz/requests-html](https://github.com/kennethreitz/requests-html) package.\n", "\n", "In more general cases, however, such as when trying to abstract meaningful information from arbitrary, natural language, texts, we need to up our game and start to analyse the texts as natural language texts." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Entity Extraction\n", "\n", "As an example, consider the following text:\n", "\n", "```From February 2016, as an author, payments from Head of Zeus Publishing; a client of Averbrook Ltd. Address: 45-47 Clerkenwell Green London EC1R 0HT, via Sheil Land, 52 Doughty Street. London WC1N 2LS. From October 2016 until July 2018, I will receive a regular payment of £13,000 per month (previously £11,000). Hours: 12 non-consecutive hrs per week. Any additional payments are listed below. (Updated 20 January 2016, 14 October 2016 and 2 March 2018)```\n", "\n", "To a human reader, we can identify various structural patterns, as well as parsing the natural language sentences.\n", "\n", "Let's start with some of the structural patterns:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from parse import parse" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "bigtext = '''\\\n", "From February 2016, as an author, payments from Head of Zeus Publishing; \\\n", "a client of Averbrook Ltd. Address: 45-47 Clerkenwell Green London EC1R 0HT, via Sheil Land, 52 Doughty Street. \\\n", "London WC1N 2LS. From October 2016 until July 2018, I will receive a regular payment \\\n", "of £13,000 per month (previously £11,000). Hours: 12 non-consecutive hrs per week. \\\n", "Any additional payments are listed below. 
(Updated 20 January 2016, 14 October 2016 and 2 March 2018)'''" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'20 January 2016, 14 October 2016 and 2 March 2018'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Extract the sentence containing the update dates\n", "parse('{}(Updated {updated})', bigtext)['updated']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'12 non-consecutive hrs per week'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Extract the phrase describing the hours\n", "parse('{}Hours: {hours}.{}', bigtext)['hours']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There also look to be sentences that might be standard sentences, such as `Any additional payments are listed below.`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## From Web Scraping to Text-Scraping Using Natural Language Processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Within the text are things that we might recognise as company names, dates, or addresses. *Entity recognition* refers to a natural language processing technique that attempts to extract words that describe \"things\", that is, *entities*, as well as identifying what sorts of \"thing\", or entity, they are.\n", "\n", "One powerful Python natural language processing package, `spacy`, has an entity recognition capability. Let's see how to use it and what sort of output it produces:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "#Import the spacy package\n", "import spacy\n", "\n", "#The package parses lanuage according to different statistically trained models\n", "#Let's load in the basic English model:\n", "nlp = spacy.load('en')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "#Generate a version of the text annotated using features detected by the model\n", "doc = nlp(bigtext)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parsed text is annotated in a variety of ways.\n", "\n", "For example, we can directly access all the sentences in the original text:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[From February 2016, as an author, payments from Head of Zeus Publishing; a client of Averbrook Ltd.,\n", " Address: 45-47 Clerkenwell Green London EC1R 0HT, via Sheil Land, 52 Doughty Street.,\n", " London WC1N 2LS.,\n", " From October 2016 until July 2018, I will receive a regular payment of £13,000 per month (previously £11,000).,\n", " Hours: 12 non-consecutive hrs per week.,\n", " Any additional payments are listed below.,\n", " (Updated 20 January 2016, 14 October 2016 and 2 March 2018)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(doc.sents)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "February 2016 :: DATE\n", "Zeus Publishing :: PERSON\n", "Averbrook Ltd. 
:: ORG\n", "45 :: CARDINAL\n", "EC1R 0HT :: POSTCODE\n", "Sheil Land :: ORG\n", "52 Doughty Street :: QUANTITY\n", "London :: GPE\n", "WC1N 2LS :: POSTCODE\n", "October 2016 :: DATE\n", "July 2018 :: DATE\n", "13,000 :: MONEY\n", "11,000 :: MONEY\n", "12 :: CARDINAL\n", "Updated :: ORG\n", "20 January 2016 :: DATE\n", "14 October 2016 :: DATE\n", "2 March 2018 :: DATE\n" ] } ], "source": [ "ents = list(doc.ents)\n", "entTypes = []\n", "for entity in ents:\n", " entTypes.append(entity.label_)\n", " \n", " print(entity, '::', entity.label_)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ORG Companies, agencies, institutions, etc.\n", "GPE Countries, cities, states\n", "PERSON People, including fictional\n", "DATE Absolute or relative dates or periods\n", "MONEY Monetary values, including unit\n", "QUANTITY Measurements, as of weight or distance\n", "CARDINAL Numerals that do not fall under another type\n" ] } ], "source": [ "for entType in set(entTypes):\n", " print(entType, spacy.explain(entType))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also look at each of the tokens in text and identify whether it is part of a entity, and if so, what sort. The `.iob_` attributes identifies `O` as not part of an entity, `B` as the first token in an entity, and `I` as continuing part of an entity." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "From::::O\n", "February::DATE::B\n", "2016::DATE::I\n", ",::::O\n", "as::::O\n", "an::::O\n", "author::::O\n", ",::::O\n", "payments::::O\n", "from::::O\n", "Head::::O\n", "of::::O\n", "Zeus::PERSON::B\n", "Publishing::PERSON::I\n", ";::::O\n" ] } ], "source": [ "for token in doc[:15]:\n", " print('::'.join([token.text, token.ent_type_,token.ent_iob_]) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the extracted entities, we see we get some good hits:\n", "\n", "- `Averbrook Ltd.` is an `ORG`;\n", "- `20 January 2016` and `14 October 2016` are both instances of a `DATE`\n", "\n", "Some near misses:\n", "\n", "- `Zeus Publishing` isn't a `PERSON`, although we might see why it has been recognised as such. (Could we overlay the model with an additional mapping of `if PERSON and endswith.in(['Publishing', 'Holdings']) -> ORG` ?) \n", "\n", "And some things that are mis-categorised:\n", "\n", "- `52 Doughty Street` isn't really meaningful as a `QUANTITY`.\n", "\n", "Several things we might usefully want to categorise - such as a UK postcode, for example, which might be useful in and of itself, or when helping us to identify an address - is *not* recognised as an entity." 
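] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a rough sketch of the `PERSON`-to-`ORG` overlay idea mooted above (a post-processing step of our own, not something built into `spacy`), we could walk `doc.ents` and relabel any `PERSON` entity whose final token looks like a company-style suffix. The suffix list below is purely illustrative:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from spacy.tokens import Span\n", "\n", "#Illustrative (and far from complete) set of company-style suffixes\n", "COMPANY_SUFFIXES = {'Publishing', 'Holdings', 'Ltd', 'Ltd.', 'Limited', 'plc'}\n", "\n", "def relabel_company_like_persons(doc):\n", "    #Rebuild the entity list, swapping PERSON -> ORG where the last token is a company-style suffix\n", "    new_ents = []\n", "    for ent in doc.ents:\n", "        if ent.label_ == 'PERSON' and ent[-1].text in COMPANY_SUFFIXES:\n", "            ent = Span(doc, ent.start, ent.end, label=doc.vocab.strings['ORG'])\n", "        new_ents.append(ent)\n", "    doc.ents = new_ents\n", "    return doc\n", "\n", "[(e.text, e.label_) for e in relabel_company_like_persons(nlp(bigtext)).ents]"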
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Things recognised as dates we might want to then further parse as date object types:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(February 2016, datetime.datetime(2016, 2, 19, 0, 0)),\n", " (October 2016, datetime.datetime(2016, 10, 19, 0, 0)),\n", " (July 2018, datetime.datetime(2018, 7, 19, 0, 0)),\n", " (20 January 2016, datetime.datetime(2016, 1, 20, 0, 0)),\n", " (14 October 2016, datetime.datetime(2016, 10, 14, 0, 0)),\n", " (2 March 2018, datetime.datetime(2018, 3, 2, 0, 0))]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from dateutil import parser as dtparser\n", "\n", "[(d, dtparser.parse(d.string)) for d in ents if d.label_ == 'DATE']" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "#see also https://github.com/akoumjian/datefinder\n", "#datefinder - Find dates inside text using Python and get back datetime objects " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Token Shapes\n", "\n", "As well as indentifying entities, `spacy` analyses texts at several othr levels. One such level of abstraction is the \"shape\" of each token. This identifies whether or not a character is an upper or lower case alphabetic character, a digit, or a punctuation character (which appears as itself):" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "From :: Xxxx\n", "February :: Xxxxx\n", "2016 :: dddd\n", ", :: ,\n", "as :: xx\n", "an :: xx\n", "author :: xxxx\n", ", :: ,\n", "payments :: xxxx\n", "from :: xxxx\n", "Head :: Xxxx\n", "of :: xx\n", "Zeus :: Xxxx\n", "Publishing :: Xxxxx\n", "; :: ;\n" ] } ], "source": [ "for token in doc[:15]:\n", " print(token, '::', token.shape_) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scraping a Text Based on Its Shape Structure And Adding New Entity Types\n", "\n", "The \"shape\" of a token provides an additional structural item that we might be able to make use of in scrapers of the raw text.\n", "\n", "For example, writing an efficient regular expression to identify a UK postcode can be a difficult task, but we can start to cobble one together from the shapes of different postcodes written in \"standard\" postcode form:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['XXd', 'dXX', ',', 'XXdX', 'dXX', ',', 'Xd', 'dXX']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[pc.shape_ for pc in nlp('MK7 6AA, SW1A 1AA, N7 6BB')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can define a `matcher` function that will identify the tokens in a document that match a particular ordered combination of shape patterns.\n", "\n", "For example, the postcode like things described above have the shapes:\n", "\n", "- `XXd dXX`\n", "- `XXdX dXX`\n", "- `Xd dXX`\n", "\n", "We can use these structural patterns to identify token pairs as possible postcodes." 
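] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rather than writing those `SHAPE` patterns out by hand, we could generate them from example postcodes using a small helper of our own (purely illustrative; the hand-written patterns in the next cell are equivalent, and the generated list could also be splatted in via `matcher.add('POSTCODE', None, *shape_patterns(...))`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def shape_patterns(examples):\n", "    #Turn each example string into a list of {'SHAPE': ...} dicts suitable for a Matcher pattern\n", "    return [[{'SHAPE': t.shape_} for t in nlp(example)] for example in examples]\n", "\n", "shape_patterns(['MK7 6AA', 'SW1A 1AA', 'N7 6BB'])"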
] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "from spacy.matcher import Matcher\n", "\n", "nlp = spacy.load('en')\n", "matcher = Matcher(nlp.vocab)\n", "\n", "matcher.add('POSTCODE', None, \n", " [{'SHAPE':'XXdX'}, {'SHAPE':'dXX'}],\n", " [{'SHAPE':'XXd'}, {'SHAPE':'dXX'}],\n", " [{'SHAPE':'Xd'}, {'SHAPE':'dXX'}])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's test that:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Matches: [WC1N 4CC, MK7 4AA]\n", "Entities: [(James Smith, 'PERSON'), (Lady Jane Grey, 'PERSON')]\n" ] } ], "source": [ "pcdoc = nlp('pc is WC1N 4CC okay, as is MK7 4AA and Sir James Smith and Lady Jane Grey are presumably persons.')\n", "matches = matcher(pcdoc)\n", "\n", "#See what we matched, and let's see what entities we have detected\n", "print('Matches: {}\\nEntities: {}'.format([pcdoc[m[1]:m[2]] for m in matches], [(m,m.label_) for m in pcdoc.ents]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding a new entity type with a matcher callback\n", "\n", "The matcher seems to have matched the postcodes, but is not identifying them as entities. (We also note that the entity matcher has missed the \"Sir\" title. In some cases, it might also match a postcode as a person.)\n", "\n", "To add the matched items to the entity list, we need to add a callback function to the matcher." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Matches: [WC1N 4CC, MK7 4AA]\n", "Entities: [(WC1N 4CC, 'POSTCODE'), (MK7 4AA, 'POSTCODE'), (James Smith, 'PERSON')]\n" ] } ], "source": [ "##Define a POSTCODE as a new entity type by adding matched postcodes to the doc.ents\n", "#https://stackoverflow.com/a/47799669\n", "\n", "nlp = spacy.load('en')\n", "matcher = Matcher(nlp.vocab)\n", "\n", "def add_entity_label(matcher, doc, i, matches):\n", " match_id, start, end = matches[i]\n", " doc.ents += ((match_id, start, end),)\n", " \n", "#Recognise postcodes from different shapes\n", "matcher.add('POSTCODE', add_entity_label, [{'SHAPE': 'XXdX'},{'SHAPE':'dXX'}], [{'SHAPE':'XXd'},{'SHAPE':'dXX'}])\n", "\n", "pcdoc = nlp('pc is WC1N 4CC okay, as is MK7 4AA and James Smith is presumably a person')\n", "matches = matcher(pcdoc)\n", "\n", "print('Matches: {}\\nEntities: {}'.format([pcdoc[m[1]:m[2]] for m in matches], [(m,m.label_) for m in pcdoc.ents]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's put those pieces together more succinctly:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'From February 2016, as an author, payments from Head of Zeus Publishing; a client of Averbrook Ltd. Address: 45-47 Clerkenwell Green London EC1R 0HT, via Sheil Land, 52 Doughty Street. London WC1N 2LS. From October 2016 until July 2018, I will receive a regular payment of £13,000 per month (previously £11,000). Hours: 12 non-consecutive hrs per week. Any additional payments are listed below. 
(Updated 20 January 2016, 14 October 2016 and 2 March 2018)'" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bigtext" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "#Generate base tagged doc\n", "doc = nlp(bigtext)\n", "\n", "#Run postcode tagger over the doc\n", "_ = matcher(doc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The tagged document should now include `POSTCODE` entities. One of the easiest ways to check the effectiveness of a new entity tagger is to check the document with recognised entities visualised within it.\n", "\n", "The `displacy` package has a Jupyter enabled visualiser for doing just that." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
From \n", "\n", " February 2016\n", " DATE\n", "\n", ", as an author, payments from Head of \n", "\n", " Zeus Publishing\n", " PERSON\n", "\n", "; a client of \n", "\n", " Averbrook Ltd.\n", " ORG\n", "\n", " Address: \n", "\n", " 45\n", " CARDINAL\n", "\n", "-47 Clerkenwell Green London \n", "\n", " EC1R 0HT\n", " POSTCODE\n", "\n", ", via \n", "\n", " Sheil Land\n", " ORG\n", "\n", ", \n", "\n", " 52 Doughty Street\n", " QUANTITY\n", "\n", ". \n", "\n", " London\n", " GPE\n", "\n", " \n", "\n", " WC1N 2LS\n", " POSTCODE\n", "\n", ". From \n", "\n", " October 2016\n", " DATE\n", "\n", " until \n", "\n", " July 2018\n", " DATE\n", "\n", ", I will receive a regular payment of £\n", "\n", " 13,000\n", " MONEY\n", "\n", " per month (previously £\n", "\n", " 11,000\n", " MONEY\n", "\n", "). Hours: \n", "\n", " 12\n", " CARDINAL\n", "\n", " non-consecutive hrs per week. Any additional payments are listed below. (\n", "\n", " Updated\n", " ORG\n", "\n", " \n", "\n", " 20 January 2016\n", " DATE\n", "\n", ", \n", "\n", " 14 October 2016\n", " DATE\n", "\n", " and \n", "\n", " 2 March 2018\n", " DATE\n", "\n", ")
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from spacy import displacy\n", "\n", "displacy.render(doc, jupyter=True, style='ent')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matching A Large Number of Phrases\n", "\n", "If we have a large number of phrases that are examples of a particular (new) entity type, we can match them using a `PhraseMatcher`.\n", "\n", "For example, suppose we have a table of MP data:" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
constituencydate_of_birthdays_servicefirst_start_dategenderlist_namemember_idparty
0Hackney North and Stoke Newington1953-09-27110411987-06-11FAbbott, Ms Diane172Labour
1Oldham East and Saddleworth1960-09-1525432011-01-13FAbrahams, Debbie4212Labour
2Selby and Ainsty1966-11-3027952010-05-06MAdams, Nigel4057Conservative
3Hitchin and Harpenden1986-02-112792017-06-08MAfolami, Bim4639Conservative
4Windsor1965-08-0445982005-05-05MAfriyie, Adam1586Conservative
\n", "
" ], "text/plain": [ " constituency date_of_birth days_service \\\n", "0 Hackney North and Stoke Newington 1953-09-27 11041 \n", "1 Oldham East and Saddleworth 1960-09-15 2543 \n", "2 Selby and Ainsty 1966-11-30 2795 \n", "3 Hitchin and Harpenden 1986-02-11 279 \n", "4 Windsor 1965-08-04 4598 \n", "\n", " first_start_date gender list_name member_id party \n", "0 1987-06-11 F Abbott, Ms Diane 172 Labour \n", "1 2011-01-13 F Abrahams, Debbie 4212 Labour \n", "2 2010-05-06 M Adams, Nigel 4057 Conservative \n", "3 2017-06-08 M Afolami, Bim 4639 Conservative \n", "4 2005-05-05 M Afriyie, Adam 1586 Conservative " ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "mpdata=pd.read_csv('members_mar18.csv')\n", "mpdata.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this, we can extract a list of MP names, albeit in reverse word order." ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Abbott, Ms Diane',\n", " 'Abrahams, Debbie',\n", " 'Adams, Nigel',\n", " 'Afolami, Bim',\n", " 'Afriyie, Adam']" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "term_list = mpdata['list_name'].tolist()\n", "term_list[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we wanted to match those names as \"MP\" entities, we could use the following recipe to add an MP entity type that will be returned if any of the MP names are matched:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "from spacy.matcher import PhraseMatcher\n", "\n", "nlp = spacy.load('en')\n", "matcher = PhraseMatcher(nlp.vocab)\n", "\n", "patterns = [nlp(text) for text in term_list]\n", "\n", "matcher.add('MP', add_entity_label, *patterns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's test that new entity on a test string:" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
The MPs were \n", "\n", " Adams, Nigel\n", " MP\n", "\n", ", \n", "\n", " Afolami, Bim\n", " MP\n", "\n", " and \n", "\n", " Abbott, Ms Diane\n", " MP\n", "\n", ".
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "doc = nlp(\"The MPs were Adams, Nigel, Afolami, Bim and Abbott, Ms Diane.\")\n", "\n", "matches = matcher(doc)\n", "\n", "displacy.render(doc, jupyter=True, style='ent')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matching a Regular Expression\n", "\n", "Sometimes we may want to use a regular expression as an entity detector. For example, we might want to tighten up the postcode entity detectio by using a regular expression, rather than shape matching." ] }, { "cell_type": "code", "execution_count": 181, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "#https://stackoverflow.com/a/164994/454773\n", "regex_ukpc = r'([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))\\s?[0-9][A-Za-z]{2})'\n" ] }, { "cell_type": "code", "execution_count": 198, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
The postcodes were \n", "\n", " MK1 6AA\n", " POSTCODE\n", "\n", " and \n", "\n", " W1A 1AA\n", " POSTCODE\n", "\n", ".
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#Based on https://spacy.io/usage/linguistic-features\n", "nlp = spacy.load('en')\n", "\n", "doc = nlp(\"The postcodes were MK1 6AA and W1A 1AA.\")\n", "\n", "for match in re.finditer(regex_ukpc, doc.text):\n", " start, end = match.span() # get matched indices\n", " entity = doc.char_span(start, end, label='POSTCODE') # create Span from indices\n", " doc.ents = list(doc.ents) + [entity]\n", " entity.merge()\n", "\n", " \n", "displacy.render(doc, jupyter=True, style='ent')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Updating the training of already existing Entities\n", "\n", "We note previously that the matcher was missing the \"Sir\" title on matched persons." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(James Smith, Lady Jane Grey)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp('pc is WC1N 4CC okay, as is MK7 4AA and Sir James Smith and Lady Jane Grey are presumably persons').ents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see if we can update the training of the model so that it *does* recognise the \"Sir\" title as part of a person's name.\n", "\n", "We can do that by creating some new training data and using it to update the model. The `entities` *dict* identifies the index values in the test string that delimit the entity we want to extract." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# training data\n", "TRAIN_DATA = [\n", " ('Received from Sir John Smith last week.', {\n", " 'entities': [(14, 28, 'PERSON')]\n", " }),\n", " ('Sir Richard Jones is another person', {\n", " 'entities': [(0, 18, 'PERSON')]\n", " })\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, we are going to let `spacy` learn its own patterns, as a statistical model, that will - if the learning pays off correctly - identify things like \"Sir Bimble Bobs\" as a `PERSON` entity." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/ajh59/anaconda3/lib/python3.6/site-packages/numpy/linalg/linalg.py:2257: RuntimeWarning: invalid value encountered in sqrt\n", " ret = sqrt(sqnorm)\n" ] } ], "source": [ "import random\n", "\n", "#model='en' #'en_core_web_sm'\n", "#nlp = spacy.load(model)\n", "\n", "cycles=20\n", "optimizer = nlp.begin_training()\n", "for i in range(cycles):\n", " random.shuffle(TRAIN_DATA)\n", " for txt, annotations in TRAIN_DATA:\n", " nlp.update([txt], [annotations], sgd=optimizer)\n", " " ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "(Sir James Smith, Lady Jane Grey)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp('pc is WC1N 4CC okay, as is MK7 4AA and Sir James Smith and Lady Jane Grey are presumably persons').ents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the things that can be a bit fiddly is generating the training strings. 
We ca produce a little utility function that will help us create a training pattern by identifying the index value(s) associated with a particular substring, that we wish to identify as an example of a particular entity type, inside a text string.\n", "\n", "The first thing we need to do is find the index values within a string that show where a particular substring can be found. The Python `find()` and `index()` methods will find the first location of a substring in a string. However, where a substring appears several times in a sring, we need a new function to identify all the locations. There are several ways of doing this...\n", "\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "#Find multiple matches using .find()\n", "#https://stackoverflow.com/a/4665027/454773\n", "def _find_all(string, substring):\n", " #Generator to return index of each string match\n", " start = 0\n", " while True:\n", " start = string.find(substring, start)\n", " if start == -1: return\n", " yield start\n", " start += len(substring)\n", "\n", "def find_all(string, substring):\n", " return list(_find_all(string, substring))\n", "\n", "\n", " \n", "#Find multiple matches using a regular expression\n", "#https://stackoverflow.com/a/4664889/454773\n", "import re\n", "def refind_all(string, substring):\n", " return [m.start() for m in re.finditer(substring, string)]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2, 5]\n", "[2, 5]\n" ] } ], "source": [ "txt = 'This is a string.'\n", "substring = 'is'\n", "\n", "print( find_all(txt, substring) )\n", "print( refind_all(txt, substring) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use either of these functions to find the location of a substring in a string, and then use these index values to help us create our training data." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('Received from Sir John Smith last week.', {'entities': [(14, 28, 'PERSON')]})" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def trainingTupleBuilder(string, substring, typ, entities=None):\n", " ixs = refind_all(string, substring)\n", " offset = len(substring)\n", " if entities is None: entities = {'entities':[]}\n", " for ix in ixs:\n", " entities['entities'].append( (ix, ix+offset, typ) )\n", " \n", " return (string, entities)\n", "\n", "#('Received from Sir John Smith last week.', {'entities': [(14, 28, 'PERSON')]})\n", "trainingTupleBuilder('Received from Sir John Smith last week.','Sir John Smith','PERSON')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training a Simple Model to Recognise Addresses\n", "\n", "As well as extracting postcodes as entities, could we also train a simple model to extract addresses?" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('He lives at 27, Oswaldtwistle Way, Birmingham',\n", " {'entities': [(12, 45, 'B-ADDRESS')]}),\n", " ('Payments from Boondoggle Limited, 377, Hope Street, Little Village, Halifax. 
Received: October, 2017',\n", " {'entities': [(34, 75, 'B-ADDRESS')]})]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "TRAIN_DATA = []\n", "\n", "TRAIN_DATA.append(trainingTupleBuilder(\"He lives at 27, Oswaldtwistle Way, Birmingham\",'27, Oswaldtwistle Way, Birmingham','B-ADDRESS'))\n", "TRAIN_DATA.append(trainingTupleBuilder(\"Payments from Boondoggle Limited, 377, Hope Street, Little Village, Halifax. Received: October, 2017\",'377, Hope Street, Little Village, Halifax','B-ADDRESS'))\n", "\n", "TRAIN_DATA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `B-` prefix identifies the entity as a multi-token entity." ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "#https://spacy.io/usage/training\n", "def spacytrainer(model=None, output_dir=None, n_iter=100, debug=False):\n", " \"\"\"Load the model, set up the pipeline and train the entity recognizer.\"\"\"\n", " if model is not None:\n", " if isinstance(model,str):\n", " nlp = spacy.load(model) # load existing spaCy model\n", " print(\"Loaded model '%s'\" % model)\n", " #Else we assume we have passed in an nlp model\n", " else: nlp = model\n", " else:\n", " nlp = spacy.blank('en') # create blank Language class\n", " print(\"Created blank 'en' model\")\n", "\n", " # create the built-in pipeline components and add them to the pipeline\n", " # nlp.create_pipe works for built-ins that are registered with spaCy\n", " if 'ner' not in nlp.pipe_names:\n", " ner = nlp.create_pipe('ner')\n", " nlp.add_pipe(ner, last=True)\n", " # otherwise, get it so we can add labels\n", " else:\n", " ner = nlp.get_pipe('ner')\n", "\n", " # add labels\n", " for _, annotations in TRAIN_DATA:\n", " for ent in annotations.get('entities'):\n", " ner.add_label(ent[2])\n", "\n", " # get names of other pipes to disable them during training\n", " other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']\n", " with nlp.disable_pipes(*other_pipes): # only train NER\n", " optimizer = nlp.begin_training()\n", " for itn in range(n_iter):\n", " random.shuffle(TRAIN_DATA)\n", " losses = {}\n", " for text, annotations in TRAIN_DATA:\n", " nlp.update(\n", " [text], # batch of texts\n", " [annotations], # batch of annotations\n", " drop=0.5, # dropout - make it harder to memorise data\n", " sgd=optimizer, # callable to update weights\n", " losses=losses)\n", " if debug: print(losses)\n", "\n", " # test the trained model\n", " if debug:\n", " for text, _ in TRAIN_DATA:\n", " doc = nlp(text)\n", " print('Entities', [(ent.text, ent.label_) for ent in doc.ents])\n", " print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])\n", "\n", " # save model to output directory\n", " if output_dir is not None:\n", " output_dir = Path(output_dir)\n", " if not output_dir.exists():\n", " output_dir.mkdir()\n", " nlp.to_disk(output_dir)\n", " print(\"Saved model to\", output_dir)\n", "\n", " # test the saved model\n", " print(\"Loading from\", output_dir)\n", " nlp2 = spacy.load(output_dir)\n", " for text, _ in TRAIN_DATA:\n", " doc = nlp2(text)\n", " print('Entities', [(ent.text, ent.label_) for ent in doc.ents])\n", " print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])\n", "\n", " return nlp\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's update the `en` model to include a really crude address parser based on the two lines of training data described above." 
] }, { "cell_type": "code", "execution_count": 96, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded model 'en'\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/ajh59/anaconda3/lib/python3.6/site-packages/numpy/linalg/linalg.py:2257: RuntimeWarning: invalid value encountered in sqrt\n", " ret = sqrt(sqnorm)\n" ] } ], "source": [ "nlp = spacytrainer('en')" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
From \n", "\n", " February 2016\n", " DATE\n", "\n", ", as an author, payments from Head of Zeus Publishing; a client of \n", "\n", " Averbrook Ltd.\n", " ORG\n", "\n", " Address: \n", "\n", " 45-47 Clerkenwell Green London EC1R 0HT, via Sheil Land, 52 Doughty Street.\n", " B-ADDRESS\n", "\n", " London WC1N \n", "\n", " 2LS\n", " CARDINAL\n", "\n", ". From \n", "\n", " October 2016\n", " DATE\n", "\n", " until \n", "\n", " July 2018\n", " DATE\n", "\n", ", I will receive a regular payment of £\n", "\n", " 13,000\n", " MONEY\n", "\n", " per month (previously £\n", "\n", " 11,000\n", " MONEY\n", "\n", "). Hours: \n", "\n", " 12\n", " CARDINAL\n", "\n", " non-consecutive hrs per week. Any additional payments are listed below. (Updated 20 January 2016, \n", "\n", " 14 October 2016\n", " DATE\n", "\n", " and \n", "\n", " 2 March 2018\n", " DATE\n", "\n", ")
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#See if we can identify the address\n", "addr_doc = nlp(text)\n", "\n", "displacy.render(addr_doc , jupyter=True, style='ent')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parts of Speech (POS)\n", "\n", "As well as recognising different types of entity, which may be identified across several different words, the `spacy` parser also marks up each separate word (or *token*) as a particular \"part-of-speech\" (POS), such as a noun, verb, or adjective.\n", "\n", "Parts of speech are identified as `.pos_` or `.tag_` token attributes." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "From :: ADP :: IN\n", "February :: PROPN :: NNP\n", "2016 :: NUM :: CD\n", ", :: PUNCT :: ,\n", "as :: ADP :: IN\n", "an :: DET :: DT\n", "author :: NOUN :: NN\n", ", :: PUNCT :: ,\n", "payments :: NOUN :: NNS\n", "from :: ADP :: IN\n", "Head :: PROPN :: NNP\n", "of :: ADP :: IN\n", "Zeus :: PROPN :: NNP\n", "Publishing :: PROPN :: NNP\n", "; :: PUNCT :: :\n" ] } ], "source": [ "tags = []\n", "for token in doc[:15]:\n", " print(token, '::', token.pos_, '::', token.tag_)\n", " tags.append(token.tag_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An `explain()` function describes each POS type in natural language terms:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ": :: punctuation mark, colon or ellipsis\n", "NNS :: noun, plural\n", "CD :: cardinal number\n", ", :: punctuation mark, comma\n", "IN :: conjunction, subordinating or preposition\n", "NNP :: noun, proper singular\n", "NN :: noun, singular or mass\n", "DT :: determiner\n" ] } ], "source": [ "for tag in set(tags):\n", " print(tag, '::', spacy.explain(tag))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also get a list of \"noun chunks\" identified in the text, as well as other words they relate to in a sentence:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "February :: February :: pobj :: From\n", "an author :: author :: pobj :: as\n", "payments :: payments :: conj :: author\n", "Head :: Head :: pobj :: from\n", "Zeus Publishing :: Publishing :: pobj :: of\n", "Averbrook Ltd. :: Ltd. 
:: pobj :: of\n", "Address :: Address :: ROOT :: Address\n", "45-47 Clerkenwell Green London EC1R :: EC1R :: appos :: Address\n", "Sheil Land :: Land :: pobj :: via\n", "52 Doughty Street :: Street :: appos :: Land\n", "London WC1N :: WC1N :: ROOT :: WC1N\n", "October :: October :: pobj :: From\n", "July :: July :: pobj :: until\n", "I :: I :: nsubj :: receive\n", "a regular payment :: payment :: dobj :: receive\n", "month :: month :: pobj :: per\n", "Hours :: Hours :: ROOT :: Hours\n", "12 non-consecutive hrs :: hrs :: appos :: Hours\n", "week :: week :: pobj :: per\n", "Any additional payments :: payments :: nsubjpass :: listed\n", "(Updated 20 January :: January :: ROOT :: January\n", "14 October :: October :: appos :: January\n", "2 March :: March :: conj :: October\n" ] } ], "source": [ "for chunk in doc.noun_chunks:\n", " print(' :: '.join([chunk.text, chunk.root.text, chunk.root.dep_,\n", " chunk.root.head.text]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scraping a Text Based on Its POS Structure - `textacy`\n", "\n", "As well as the basic `spacy` functionality, packages exist that build on `spacy` to provide further tools for working with abstractions identified using `spacy`.\n", "\n", "For example, the `textacy` package provides a way of parsing sentences using regular expressions defined over (Ontonotes5?) POS tags:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[payments from Head of Zeus Publishing, client of Averbrook Ltd.]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import textacy\n", "\n", "list(textacy.extract.pos_regex_matches(nlp(text),r' +'))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'en': {'NP': '? * ( ? ?)* (| ?)+',\n", " 'PP': ' ? * ( ? 
?)* ( ?)+',\n", " 'VP': '* * '}}" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "textacy.constants.POS_REGEX_PATTERNS" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A DT DET\n", "sum NN NOUN\n", "of IN ADP\n", "£ $ SYM\n", "2000 CD NUM\n", "- SYM SYM\n", "3000 CD NUM\n", "last JJ ADJ\n", "or CC CCONJ\n", "£ $ SYM\n", "2,000 CD NUM\n", "or CC CCONJ\n", "£ $ SYM\n", "2000-£3000 CD NUM\n", "or CC CCONJ\n", "£ $ SYM\n", "2,000-£3,000 CD NUM\n", "year NN NOUN\n" ] } ], "source": [ "xx='A sum of £2000-3000 last or £2,000 or £2000-£3000 or £2,000-£3,000 year'\n", "for t in nlp(xx):\n", " print(t,t.tag_, t.pos_)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2,000 MONEY\n", "2000-£3000 MONEY\n", "2,000-£3,000 MONEY\n" ] } ], "source": [ "for e in nlp(xx).ents:\n", " print(e, e.label_)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[£2000-3000, £2,000, £2000-£3000, £2,000-£3,000]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(textacy.extract.pos_regex_matches(nlp('A sum of £2000-3000 last or £2,000 or £2000-£3000 or £2,000-£3,000 year'),r'??'))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we can define appropriate POS pattern, we can extract terms from an arbitrary text based on that pattern, an approach that is far more general than trying to write a regular expression pattern matcher over just the raw text." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#define approx amount eg £10,000-£15,000 or £10,000-15,000\n", "parse('{}£{a}-£{b:g}{}','eg £10,000-£15,000 or £14,000-£16,000'.replace(',',''))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More Complex Matching Rules\n", "\n", "Matchers can be created over a wide range of attributes ([docs](https://spacy.io/usage/linguistic-features#section-rule-based-matching)), including POS tags and entity labels.\n", "\n", "For example, we can start trying to build an address tagger by looking for things that end with a postcode." ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
From \n", "\n", " February 2016\n", " DATE\n", "\n", ", as an author, payments from Head of \n", "\n", " Zeus Publishing\n", " PERSON\n", "\n", "; a client of \n", "\n", " Averbrook Ltd.\n", " ORG\n", "\n", " Address: \n", "\n", " 45\n", " CARDINAL\n", "\n", "-47 Clerkenwell Green London \n", "\n", " EC1R 0HT\n", " POSTCODE\n", "\n", ", via \n", "\n", " Sheil Land\n", " ORG\n", "\n", ", \n", "\n", " 52 Doughty Street\n", " QUANTITY\n", "\n", ". \n", "\n", " London\n", " GPE\n", "\n", " \n", "\n", " WC1N 2LS\n", " POSTCODE\n", "\n", ". From \n", "\n", " October 2016\n", " DATE\n", "\n", " until \n", "\n", " July 2018\n", " DATE\n", "\n", ", I will receive a regular payment of £\n", "\n", " 13,000\n", " MONEY\n", "\n", " per month (previously £\n", "\n", " 11,000\n", " MONEY\n", "\n", "). Hours: \n", "\n", " 12\n", " CARDINAL\n", "\n", " non-consecutive hrs per week. Any additional payments are listed below. (\n", "\n", " Updated\n", " ORG\n", "\n", " \n", "\n", " 20 January 2016\n", " DATE\n", "\n", ", \n", "\n", " 14 October 2016\n", " DATE\n", "\n", " and \n", "\n", " 2 March 2018\n", " DATE\n", "\n", ")
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "47 Clerkenwell Green London EC1R 0HT\n", "EC1R 0HT\n", "London WC1N 2LS\n", "WC1N 2LS\n", "[(February 2016, 'DATE'), (Zeus Publishing, 'PERSON'), (Averbrook Ltd., 'ORG'), (45, 'CARDINAL'), (47 Clerkenwell Green London, 'ADDRESS'), (EC1R 0HT, 'POSTCODE'), (Sheil Land, 'ORG'), (52 Doughty Street, 'QUANTITY'), (London, 'ADDRESS'), (WC1N 2LS, 'POSTCODE'), (October 2016, 'DATE'), (July 2018, 'DATE'), (13,000, 'MONEY'), (11,000, 'MONEY'), (12, 'CARDINAL'), (Updated, 'ORG'), (20 January 2016, 'DATE'), (14 October 2016, 'DATE'), (2 March 2018, 'DATE')]\n" ] } ], "source": [ "nlp = spacy.load('en')\n", "\n", "matcher = Matcher(nlp.vocab)\n", "matcher.add('POSTCODE', add_entity_label, [{'SHAPE': 'XXdX'},{'SHAPE':'dXX'}], [{'SHAPE':'XXd'},{'SHAPE':'dXX'}])\n", "matcher.add('ADDRESS', add_entity_label, \n", " [{'POS':'NUM','OP':'+'},{'POS':'PROPN','OP':'+'}, {'ENT_TYPE':'POSTCODE', 'OP':'+'}],\n", " [{'ENT_TYPE':'GPE','OP':'+'}, {'ENT_TYPE':'POSTCODE', 'OP':'+'}])\n", "\n", "addr_doc = nlp(text)\n", "matcher(addr_doc)\n", "\n", "displacy.render(addr_doc , jupyter=True, style='ent')\n", "for m in matcher(addr_doc):\n", " print(addr_doc[m[1]:m[2]])\n", " \n", "print([(e, e.label_) for e in addr_doc.ents])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, we note that the visualiser cannot cope with rendering multiple entity types over one or more words. In the above example, the `POSTCODE` entitites are highlighted, but we note from the matcher that `ADDRESS` ranges are also identified that extend across entities defined over fewer terms." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualising - `displaCy`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can look at the structure of a text by printing out the child elements associated with each token in a sentence:" ] }, { "cell_type": "code", "execution_count": 135, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "From February 2016, as an author, payments from Head of Zeus Publishing; a client of Averbrook Ltd. \n", "\n", "From : [February, ,, as]\n", "February : [2016]\n", "2016 : []\n", ", : []\n", "as : [author, ;]\n", "an : []\n", "author : [an, ,, payments]\n", ", : []\n", "payments : [from]\n", "from : [Head]\n", "Head : [of]\n", "of : [Publishing]\n", "Zeus : []\n", "Publishing : [Zeus]\n", "; : [client]\n", "a : []\n", "client : [a, of]\n", "of : [Ltd.]\n", "Averbrook : []\n", "Ltd. : [Averbrook]\n", "\n", "Address: 45-47 Clerkenwell Green London EC1R 0HT, via Sheil Land, 52 Doughty Street. \n", "\n", "Address : [:, EC1R, 0HT, ,, via, .]\n", ": : []\n", "45 : []\n", "- : []\n", "47 : [-]\n", "Clerkenwell : []\n", "Green : []\n", "London : [Clerkenwell, Green]\n", "EC1R : [45, 47, London]\n", "0HT : []\n", ", : []\n", "via : [Land]\n", "Sheil : []\n", "Land : [Sheil, ,, Street]\n", ", : []\n", "52 : []\n", "Doughty : []\n", "Street : [52, Doughty]\n", ". : []\n", "\n", "London WC1N 2LS. \n", "\n", "London : []\n", "WC1N : [London, 2LS, .]\n", "2LS : []\n", ". : []\n", "\n", "From October 2016 until July 2018, I will receive a regular payment of £13,000 per month (previously £11,000). 
\n", "\n", "From : [October, until]\n", "October : [2016]\n", "2016 : []\n", "until : [July]\n", "July : [2018]\n", "2018 : []\n", ", : []\n", "I : []\n", "will : []\n", "receive : [From, ,, I, will, payment, .]\n", "a : []\n", "regular : []\n", "payment : [a, regular, of, (, 11,000, )]\n", "of : [13,000]\n", "£ : []\n", "13,000 : [£, per]\n", "per : [month]\n", "month : []\n", "( : []\n", "previously : []\n", "£ : []\n", "11,000 : [previously, £]\n", ") : []\n", ". : []\n", "\n", "Hours: 12 non-consecutive hrs per week. \n", "\n", "Hours : [:, hrs, .]\n", ": : []\n", "12 : []\n", "non : []\n", "- : [non]\n", "consecutive : []\n", "hrs : [12, -, consecutive, per]\n", "per : [week]\n", "week : []\n", ". : []\n", "\n", "Any additional payments are listed below. \n", "\n", "Any : []\n", "additional : []\n", "payments : [Any, additional]\n", "are : []\n", "listed : [payments, are, below, .]\n", "below : []\n", ". : []\n", "\n", "(Updated 20 January 2016, 14 October 2016 and 2 March 2018) \n", "\n", "( : []\n", "Updated : []\n", "20 : []\n", "January : [(, Updated, 20, 2016, ,, October, )]\n", "2016 : []\n", ", : []\n", "14 : []\n", "October : [14, 2016, and, March]\n", "2016 : []\n", "and : []\n", "2 : []\n", "March : [2, 2018]\n", "2018 : []\n", ") : []\n", "\n" ] } ], "source": [ "for sent in nlp(text).sents:\n", " print(sent,'\\n')\n", " for token in sent:\n", " print(token, ': ', str(list(token.children)))\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, the `displaCy` toolset, included as part of `spacy`, provides a more appealing way of visualising parsed documents in two different ways:\n", "\n", "- as a dependency graph, showing POS tags for each token and how they relate to each other;\n", "- as a text display with extracted entities highlighted." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dependency graph identifies POS tags as well as how tokens are related in natural language grammatical phrases:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "from spacy import displacy" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " From\n", " ADP\n", "\n", "\n", "\n", " February\n", " PROPN\n", "\n", "\n", "\n", " 2016,\n", " NUM\n", "\n", "\n", "\n", " as\n", " ADP\n", "\n", "\n", "\n", " an\n", " DET\n", "\n", "\n", "\n", " author,\n", " NOUN\n", "\n", "\n", "\n", " payments\n", " NOUN\n", "\n", "\n", "\n", " from\n", " ADP\n", "\n", "\n", "\n", " Head\n", " PROPN\n", "\n", "\n", "\n", " of\n", " ADP\n", "\n", "\n", "\n", " Zeus\n", " PROPN\n", "\n", "\n", "\n", " Publishing;\n", " PROPN\n", "\n", "\n", "\n", " a\n", " DET\n", "\n", "\n", "\n", " client\n", " NOUN\n", "\n", "\n", "\n", " of\n", " ADP\n", "\n", "\n", "\n", " Averbrook\n", " PROPN\n", "\n", "\n", "\n", " Ltd.\n", " PROPN\n", "\n", "\n", "\n", " Address:\n", " NOUN\n", "\n", "\n", "\n", " 45-\n", " NUM\n", "\n", "\n", "\n", " 47\n", " NUM\n", "\n", "\n", "\n", " Clerkenwell\n", " PROPN\n", "\n", "\n", "\n", " Green\n", " PROPN\n", "\n", "\n", "\n", " London\n", " PROPN\n", "\n", "\n", "\n", " EC1R\n", " PROPN\n", "\n", "\n", "\n", " 0HT,\n", " NOUN\n", "\n", "\n", "\n", " via\n", " ADP\n", "\n", "\n", "\n", " Sheil\n", " PROPN\n", "\n", "\n", "\n", " Land,\n", " PROPN\n", "\n", "\n", "\n", " 52\n", " NUM\n", "\n", "\n", "\n", " Doughty\n", " PROPN\n", "\n", "\n", "\n", " Street.\n", " PROPN\n", "\n", "\n", "\n", " London\n", " PROPN\n", "\n", "\n", "\n", " WC1N\n", " PROPN\n", "\n", "\n", "\n", " 2LS.\n", " NUM\n", "\n", "\n", "\n", " From\n", " ADP\n", "\n", "\n", "\n", " October\n", " PROPN\n", "\n", "\n", "\n", " 2016\n", " NUM\n", "\n", "\n", "\n", " until\n", " ADP\n", "\n", "\n", "\n", " July\n", " PROPN\n", "\n", "\n", "\n", " 2018,\n", " NUM\n", "\n", "\n", "\n", " I\n", " PRON\n", "\n", "\n", "\n", " will\n", " VERB\n", "\n", "\n", "\n", " receive\n", " VERB\n", "\n", "\n", "\n", " a\n", " DET\n", "\n", "\n", "\n", " regular\n", " ADJ\n", "\n", "\n", "\n", " payment\n", " NOUN\n", "\n", "\n", "\n", " of\n", " ADP\n", "\n", "\n", "\n", " £\n", " SYM\n", "\n", "\n", "\n", " 13,000\n", " NUM\n", "\n", "\n", "\n", " per\n", " ADP\n", "\n", "\n", "\n", " month (\n", " NOUN\n", "\n", "\n", "\n", " previously\n", " ADV\n", "\n", "\n", "\n", " £\n", " SYM\n", "\n", "\n", "\n", " 11,000).\n", " NUM\n", "\n", "\n", "\n", " Hours:\n", " NOUN\n", "\n", "\n", "\n", " 12\n", " NUM\n", "\n", "\n", "\n", " non-\n", " ADJ\n", "\n", "\n", "\n", " consecutive\n", " ADJ\n", "\n", "\n", "\n", " hrs\n", " NOUN\n", "\n", "\n", "\n", " per\n", " ADP\n", "\n", "\n", "\n", " week.\n", " NOUN\n", "\n", "\n", "\n", " Any\n", " DET\n", "\n", "\n", "\n", " additional\n", " ADJ\n", "\n", "\n", "\n", " payments\n", " NOUN\n", "\n", "\n", "\n", " are\n", " VERB\n", "\n", "\n", "\n", " listed\n", " VERB\n", "\n", "\n", "\n", " below. 
(\n", " ADV\n", "\n", "\n", "\n", " Updated\n", " VERB\n", "\n", "\n", "\n", " 20\n", " NUM\n", "\n", "\n", "\n", " January\n", " PROPN\n", "\n", "\n", "\n", " 2016,\n", " NUM\n", "\n", "\n", "\n", " 14\n", " NUM\n", "\n", "\n", "\n", " October\n", " PROPN\n", "\n", "\n", "\n", " 2016\n", " NUM\n", "\n", "\n", "\n", " and\n", " CCONJ\n", "\n", "\n", "\n", " 2\n", " NUM\n", "\n", "\n", "\n", " March\n", " PROPN\n", "\n", "\n", "\n", " 2018)\n", " NUM\n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " conj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " punct\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " intj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " appos\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " punct\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " appos\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nsubj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " aux\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " amod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " dobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nmod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " advmod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " quantmod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " appos\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " punct\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " amod\n", " \n", " \n", "\n", "\n", "\n", 
" \n", " \n", " appos\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " amod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nsubjpass\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " auxpass\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " advmod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " amod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " appos\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " cc\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " conj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(doc, jupyter=True,style='dep')" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " From\n", " ADP\n", "\n", "\n", "\n", " February\n", " PROPN\n", "\n", "\n", "\n", " 2016,\n", " NUM\n", "\n", "\n", "\n", " as\n", " ADP\n", "\n", "\n", "\n", " an\n", " DET\n", "\n", "\n", "\n", " author,\n", " NOUN\n", "\n", "\n", "\n", " payments\n", " NOUN\n", "\n", "\n", "\n", " from\n", " ADP\n", "\n", "\n", "\n", " Head\n", " PROPN\n", "\n", "\n", "\n", " of\n", " ADP\n", "\n", "\n", "\n", " Zeus\n", " PROPN\n", "\n", "\n", "\n", " Publishing;\n", " PROPN\n", "\n", "\n", "\n", " a\n", " DET\n", "\n", "\n", "\n", " client\n", " NOUN\n", "\n", "\n", "\n", " of\n", " ADP\n", "\n", "\n", "\n", " Averbrook\n", " PROPN\n", "\n", "\n", "\n", " Ltd.\n", " PROPN\n", "\n", "\n", "\n", " Address:\n", " NOUN\n", "\n", "\n", "\n", " 45-\n", " NUM\n", "\n", "\n", "\n", " 47\n", " NUM\n", "\n", "\n", "\n", " Clerkenwell\n", " PROPN\n", "\n", "\n", "\n", " Green\n", " PROPN\n", "\n", "\n", "\n", " London\n", " PROPN\n", "\n", "\n", "\n", " EC1R\n", " PROPN\n", "\n", "\n", "\n", " 0HT,\n", " NOUN\n", "\n", "\n", "\n", " via\n", " ADP\n", "\n", "\n", "\n", " Sheil\n", " PROPN\n", "\n", "\n", "\n", " Land,\n", " PROPN\n", "\n", "\n", "\n", " 52\n", " NUM\n", "\n", "\n", "\n", " Doughty\n", " PROPN\n", "\n", "\n", "\n", " Street.\n", " PROPN\n", "\n", "\n", "\n", " London\n", " PROPN\n", "\n", "\n", "\n", " WC1N\n", " PROPN\n", "\n", "\n", "\n", " 2LS.\n", " NUM\n", "\n", "\n", "\n", " From\n", " ADP\n", "\n", "\n", "\n", " October\n", " PROPN\n", "\n", "\n", "\n", " 2016\n", " NUM\n", "\n", "\n", "\n", " until\n", " ADP\n", "\n", "\n", "\n", " July\n", " PROPN\n", "\n", "\n", "\n", " 2018,\n", " NUM\n", "\n", "\n", "\n", " I\n", " PRON\n", "\n", "\n", "\n", " will\n", " VERB\n", "\n", "\n", "\n", " receive\n", " VERB\n", "\n", "\n", "\n", " a\n", " DET\n", "\n", "\n", "\n", " regular\n", " ADJ\n", "\n", "\n", "\n", " payment\n", " NOUN\n", "\n", "\n", "\n", " of\n", " ADP\n", "\n", "\n", "\n", " £\n", " SYM\n", "\n", "\n", "\n", " 13,000\n", " NUM\n", "\n", "\n", "\n", " per\n", " ADP\n", "\n", "\n", "\n", " month (\n", " NOUN\n", "\n", "\n", "\n", " previously\n", " ADV\n", "\n", "\n", "\n", " £\n", " SYM\n", "\n", "\n", "\n", " 11,000).\n", " NUM\n", "\n", "\n", "\n", " Hours:\n", " NOUN\n", "\n", "\n", "\n", " 12\n", " 
NUM\n", "\n", "\n", "\n", " non-\n", " ADJ\n", "\n", "\n", "\n", " consecutive\n", " ADJ\n", "\n", "\n", "\n", " hrs\n", " NOUN\n", "\n", "\n", "\n", " per\n", " ADP\n", "\n", "\n", "\n", " week.\n", " NOUN\n", "\n", "\n", "\n", " Any\n", " DET\n", "\n", "\n", "\n", " additional\n", " ADJ\n", "\n", "\n", "\n", " payments\n", " NOUN\n", "\n", "\n", "\n", " are\n", " VERB\n", "\n", "\n", "\n", " listed\n", " VERB\n", "\n", "\n", "\n", " below. (\n", " ADV\n", "\n", "\n", "\n", " Updated\n", " VERB\n", "\n", "\n", "\n", " 20\n", " NUM\n", "\n", "\n", "\n", " January\n", " PROPN\n", "\n", "\n", "\n", " 2016,\n", " NUM\n", "\n", "\n", "\n", " 14\n", " NUM\n", "\n", "\n", "\n", " October\n", " PROPN\n", "\n", "\n", "\n", " 2016\n", " NUM\n", "\n", "\n", "\n", " and\n", " CCONJ\n", "\n", "\n", "\n", " 2\n", " NUM\n", "\n", "\n", "\n", " March\n", " PROPN\n", "\n", "\n", "\n", " 2018)\n", " NUM\n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " conj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " punct\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " intj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " appos\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " punct\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " appos\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nsubj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " aux\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " amod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " dobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nmod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " 
prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " advmod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " quantmod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " appos\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " punct\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " amod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " appos\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " amod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nsubjpass\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " auxpass\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " advmod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " amod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " appos\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " cc\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " conj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nummod\n", " \n", " \n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(doc, jupyter=True,style='dep',options={'distance':85, 'compact':True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use `displaCy` to highlight, inline, the entities extracted from a text." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
pc is \n", "\n", " WC1N 4CC\n", " POSTCODE\n", "\n", " okay, as is \n", "\n", " MK7 4AA\n", " POSTCODE\n", "\n", " and \n", "\n", " James Smith\n", " PERSON\n", "\n", " is presumably a person
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(pcdoc, jupyter=True,style='ent')" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
From \n", "\n", " February 2016\n", " DATE\n", "\n", ", as an author, payments from Head of \n", "\n", " Zeus Publishing\n", " PERSON\n", "\n", "; a client of \n", "\n", " Averbrook Ltd.\n", " ORG\n", "\n", " Address: \n", "\n", " 45\n", " CARDINAL\n", "\n", "-47 Clerkenwell Green London \n", "\n", " EC1R 0HT\n", " POSTCODE\n", "\n", ", via \n", "\n", " Sheil Land\n", " ORG\n", "\n", ", \n", "\n", " 52 Doughty Street\n", " QUANTITY\n", "\n", ". \n", "\n", " London\n", " GPE\n", "\n", " \n", "\n", " WC1N 2LS\n", " POSTCODE\n", "\n", ". From \n", "\n", " October 2016\n", " DATE\n", "\n", " until \n", "\n", " July 2018\n", " DATE\n", "\n", ", I will receive a regular payment of £\n", "\n", " 13,000\n", " MONEY\n", "\n", " per month (previously £\n", "\n", " 11,000\n", " MONEY\n", "\n", "). Hours: \n", "\n", " 12\n", " CARDINAL\n", "\n", " non-consecutive hrs per week. Any additional payments are listed below. (\n", "\n", " Updated\n", " ORG\n", "\n", " \n", "\n", " 20 January 2016\n", " DATE\n", "\n", ", \n", "\n", " 14 October 2016\n", " DATE\n", "\n", " and \n", "\n", " 2 March 2018\n", " DATE\n", "\n", ")
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(doc, jupyter=True,style='ent')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extending Entities\n", "\n", "eg add a flag to say a person is an MP" ] }, { "cell_type": "code", "execution_count": 151, "metadata": {}, "outputs": [], "source": [ "mpdata=pd.read_csv('members_mar18.csv')\n", "tmp = mpdata.to_dict(orient='record')\n", "mpdatadict = {k['list_name']:k for k in tmp }" ] }, { "cell_type": "code", "execution_count": 148, "metadata": {}, "outputs": [], "source": [ "#via https://spacy.io/usage/processing-pipelines\n", "\n", "mpdata=pd.read_csv('members_mar18.csv')\n", "\n", "\n", "\"\"\"Example of a spaCy v2.0 pipeline component to annotate MP record with MNIS data\"\"\"\n", "from spacy.tokens import Doc, Span, Token\n", "\n", " \n", "class RESTMPComponent(object):\n", " \"\"\"spaCy v2.0 pipeline component that annotates MP entity with MP data.\n", " \"\"\"\n", " name = 'mp_annotator' # component name, will show up in the pipeline\n", "\n", " def __init__(self, nlp, label='MP'):\n", " \"\"\"Initialise the pipeline component. The shared nlp instance is used\n", " to initialise the matcher with the shared vocab, get the label ID and\n", " generate Doc objects as phrase match patterns.\n", " \"\"\"\n", " # Get MP data\n", " mpdata=pd.read_csv('members_mar18.csv')\n", " mpdatadict = mpdata.to_dict(orient='record')\n", "\n", " # Convert MP data to a dict keyed by MP name\n", " self.mpdata = {k['list_name']:k for k in mpdatadict }\n", " \n", " self.label = nlp.vocab.strings[label] # get entity label ID\n", "\n", " # Set up the PhraseMatcher with Doc patterns for each MP name\n", " patterns = [nlp(c) for c in self.mpdata.keys()]\n", " self.matcher = PhraseMatcher(nlp.vocab)\n", " self.matcher.add('MPS', None, *patterns)\n", "\n", " # Register attribute on the Token. We'll be overwriting this based on\n", " # the matches, so we're only setting a default value, not a getter.\n", " # If no default value is set, it defaults to None.\n", " Token.set_extension('is_mp', default=False)\n", " Token.set_extension('mnis_id')\n", " Token.set_extension('constituency')\n", " Token.set_extension('party')\n", "\n", " # Register attributes on Doc and Span via a getter that checks if one of\n", " # the contained tokens is set to is_country == True.\n", " Doc.set_extension('is_mp', getter=self.is_mp)\n", " Span.set_extension('is_mp', getter=self.is_mp)\n", "\n", "\n", " def __call__(self, doc):\n", " \"\"\"Apply the pipeline component on a Doc object and modify it if matches\n", " are found. 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }