{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Scraping\n", "\n", "One of the things I learned early on about scraping web pages (often referred to as \"screen scraping\") is that it often amounts to trying to recreate databases that have been re-presented as web pages using HTML templates. For example:\n", "\n", "- display a database table as an HTML table in a web page;\n", "- display each row of a database as a templated HTML page.\n", "\n", "The aim of the scrape in these cases might be as simple as pulling the table from the page and representing it as a dataframe, or trying to reverse engineer the HTML template that converts data to HTML into something that can extract the data from the HTML back as a row in a corresponding data table.\n", "\n", "In the latter case, the scrape may proceed in a couple of ways. For example:\n", "\n", "- by trying to identify structural HTML tag elements that contain recognisable data items, retrieving the HTML tag element, then extracting the data value;\n", "- parsing the recognisable literal *text* displayed on the web page and trying to extract data items based on that (i.e. ignore the HTML structural eelements and go straight for the extracted text). For an example of this sort of parsing, see the [r1chardj0n3s/parse](https://github.com/r1chardj0n3s/parse) Python package as applied to text pulled from a page using something like the [kennethreitz/requests-html](https://github.com/kennethreitz/requests-html) package.\n", "\n", "In more general cases, however, such as when trying to abstract meaningful information from arbitrary, natural language, texts, we need to up our game and start to analyse the texts as natural language texts." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Entity Extraction\n", "\n", "As an example, consider the following text:\n", "\n", "```From February 2016, as an author, payments from Head of Zeus Publishing; a client of Averbrook Ltd. 
Address: 45-47 Clerkenwell Green London EC1R 0HT, via Sheil Land, 52 Doughty Street. London WC1N 2LS. From October 2016 until July 2018, I will receive a regular payment of £13,000 per month (previously £11,000). Hours: 12 non-consecutive hrs per week. Any additional payments are listed below. (Updated 20 January 2016, 14 October 2016 and 2 March 2018)```\n", "\n", "As human readers, we can identify various structural patterns, as well as parse the natural language sentences.\n", "\n", "Let's start with some of the structural patterns:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from parse import parse" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "bigtext = '''\\\n", "From February 2016, as an author, payments from Head of Zeus Publishing; \\\n", "a client of Averbrook Ltd. Address: 45-47 Clerkenwell Green London EC1R 0HT, via Sheil Land, 52 Doughty Street. \\\n", "London WC1N 2LS. From October 2016 until July 2018, I will receive a regular payment \\\n", "of £13,000 per month (previously £11,000). Hours: 12 non-consecutive hrs per week. \\\n", "Any additional payments are listed below. 
(Updated 20 January 2016, 14 October 2016 and 2 March 2018)'''" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'20 January 2016, 14 October 2016 and 2 March 2018'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Extract the sentence containing the update dates\n", "parse('{}(Updated {updated})', bigtext)['updated']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'12 non-consecutive hrs per week'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Extract the phrase describing the hours\n", "parse('{}Hours: {hours}.{}', bigtext)['hours']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There also appear to be standard boilerplate sentences, such as `Any additional payments are listed below.`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## From Web Scraping to Text-Scraping Using Natural Language Processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Within the text are things that we might recognise as company names, dates, or addresses. *Entity recognition* refers to a natural language processing technique that attempts to extract words that describe \"things\", that is, *entities*, as well as identifying what sorts of \"thing\", or entity, they are.\n", "\n", "One powerful Python natural language processing package, `spacy`, has an entity recognition capability. 
Let's see how to use it and what sort of output it produces:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "#Import the spacy package\n", "import spacy\n", "\n", "#The package parses language according to different statistically trained models\n", "#Let's load in the basic English model:\n", "nlp = spacy.load('en')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "#Generate a version of the text annotated using features detected by the model\n", "doc = nlp(bigtext)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parsed text is annotated in a variety of ways.\n", "\n", "For example, we can directly access all the sentences in the original text:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[From February 2016, as an author, payments from Head of Zeus Publishing; a client of Averbrook Ltd.,\n", " Address: 45-47 Clerkenwell Green London EC1R 0HT, via Sheil Land, 52 Doughty Street.,\n", " London WC1N 2LS.,\n", " From October 2016 until July 2018, I will receive a regular payment of £13,000 per month (previously £11,000).,\n", " Hours: 12 non-consecutive hrs per week.,\n", " Any additional payments are listed below.,\n", " (Updated 20 January 2016, 14 October 2016 and 2 March 2018)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(doc.sents)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "February 2016 :: DATE\n", "Zeus Publishing :: PERSON\n", "Averbrook Ltd. 
:: ORG\n", "45 :: CARDINAL\n", "EC1R 0HT :: POSTCODE\n", "Sheil Land :: ORG\n", "52 Doughty Street :: QUANTITY\n", "London :: GPE\n", "WC1N 2LS :: POSTCODE\n", "October 2016 :: DATE\n", "July 2018 :: DATE\n", "13,000 :: MONEY\n", "11,000 :: MONEY\n", "12 :: CARDINAL\n", "Updated :: ORG\n", "20 January 2016 :: DATE\n", "14 October 2016 :: DATE\n", "2 March 2018 :: DATE\n" ] } ], "source": [ "ents = list(doc.ents)\n", "entTypes = []\n", "for entity in ents:\n", " entTypes.append(entity.label_)\n", " \n", " print(entity, '::', entity.label_)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ORG Companies, agencies, institutions, etc.\n", "GPE Countries, cities, states\n", "PERSON People, including fictional\n", "DATE Absolute or relative dates or periods\n", "MONEY Monetary values, including unit\n", "QUANTITY Measurements, as of weight or distance\n", "CARDINAL Numerals that do not fall under another type\n" ] } ], "source": [ "for entType in set(entTypes):\n", " print(entType, spacy.explain(entType))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also look at each of the tokens in the text and identify whether it is part of an entity, and if so, what sort. The `.ent_iob_` attribute identifies `O` as not part of an entity, `B` as the first token of an entity, and `I` as a continuation of an entity."
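The IOB flags alone are enough to reconstruct complete entity spans from a token stream. As a quick illustration, here is a minimal pure-Python sketch run over `(text, entity type, IOB)` triples transcribed from the output shown in this notebook; the `spans_from_iob` helper is our own, not part of `spacy`:

```python
# Rebuild multi-token entity spans from per-token IOB flags.
# The (text, entity_type, iob) triples are transcribed from the
# token-level output shown in this notebook.
tokens = [
    ('From', '', 'O'), ('February', 'DATE', 'B'), ('2016', 'DATE', 'I'),
    (',', '', 'O'), ('as', '', 'O'), ('an', '', 'O'), ('author', '', 'O'),
    (',', '', 'O'), ('payments', '', 'O'), ('from', '', 'O'),
    ('Head', '', 'O'), ('of', '', 'O'),
    ('Zeus', 'PERSON', 'B'), ('Publishing', 'PERSON', 'I'), (';', '', 'O'),
]

def spans_from_iob(tokens):
    """Collect (entity_text, entity_type) pairs from IOB-tagged tokens."""
    spans, current = [], None
    for text, ent_type, iob in tokens:
        if iob == 'B':                # first token of a new entity
            if current:
                spans.append(current)
            current = ([text], ent_type)
        elif iob == 'I' and current:  # continuation of the open entity
            current[0].append(text)
        else:                         # 'O' closes any open entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(' '.join(words), ent_type) for words, ent_type in spans]

print(spans_from_iob(tokens))
# [('February 2016', 'DATE'), ('Zeus Publishing', 'PERSON')]
```

In `spacy` itself, `doc.ents` already does this reconstruction for us; the sketch just makes the B/I/O bookkeeping explicit.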
] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "From::::O\n", "February::DATE::B\n", "2016::DATE::I\n", ",::::O\n", "as::::O\n", "an::::O\n", "author::::O\n", ",::::O\n", "payments::::O\n", "from::::O\n", "Head::::O\n", "of::::O\n", "Zeus::PERSON::B\n", "Publishing::PERSON::I\n", ";::::O\n" ] } ], "source": [ "for token in doc[:15]:\n", " print('::'.join([token.text, token.ent_type_,token.ent_iob_]) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the extracted entities, we see we get some good hits:\n", "\n", "- `Averbrook Ltd.` is an `ORG`;\n", "- `20 January 2016` and `14 October 2016` are both instances of a `DATE`.\n", "\n", "Some near misses:\n", "\n", "- `Zeus Publishing` isn't a `PERSON`, although we might see why it has been recognised as such. (Could we overlay the model with an additional mapping of `if PERSON and endswith.in(['Publishing', 'Holdings']) -> ORG` ?) \n", "\n", "And some things that are mis-categorised:\n", "\n", "- `52 Doughty Street` isn't really meaningful as a `QUANTITY`.\n", "\n", "Several things we might usefully want to categorise - such as a UK postcode, which might be useful in and of itself, or when helping us to identify an address - are *not* recognised as entities."
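The overlay mapping mooted above for `Zeus Publishing` can be sketched as a simple post-processing pass over `(entity text, label)` pairs. The `relabel` helper and the suffix list are illustrative assumptions of ours, not part of `spacy`:

```python
# Post-process (entity_text, label) pairs: re-label a PERSON as an ORG
# when its final word is a company-style suffix. The suffix list is
# illustrative only - extend it to suit the texts being scraped.
ORG_SUFFIXES = {'Publishing', 'Holdings', 'Ltd', 'Ltd.', 'Plc', 'LLP'}

def relabel(entities):
    fixed = []
    for text, label in entities:
        if label == 'PERSON' and text.split()[-1] in ORG_SUFFIXES:
            label = 'ORG'  # company-like suffix: treat as an organisation
        fixed.append((text, label))
    return fixed

# (text, label) pairs as produced by the entity run shown earlier
ents = [('Zeus Publishing', 'PERSON'), ('Averbrook Ltd.', 'ORG'),
        ('Sheil Land', 'ORG')]
print(relabel(ents))
# [('Zeus Publishing', 'ORG'), ('Averbrook Ltd.', 'ORG'), ('Sheil Land', 'ORG')]
```

A rule like this is deliberately dumb: it only patches a known failure mode of the statistical model rather than replacing it.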
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Things recognised as dates might then be further parsed into date objects:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(February 2016, datetime.datetime(2016, 2, 19, 0, 0)),\n", " (October 2016, datetime.datetime(2016, 10, 19, 0, 0)),\n", " (July 2018, datetime.datetime(2018, 7, 19, 0, 0)),\n", " (20 January 2016, datetime.datetime(2016, 1, 20, 0, 0)),\n", " (14 October 2016, datetime.datetime(2016, 10, 14, 0, 0)),\n", " (2 March 2018, datetime.datetime(2018, 3, 2, 0, 0))]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from dateutil import parser as dtparser\n", "\n", "[(d, dtparser.parse(d.string)) for d in ents if d.label_ == 'DATE']" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "#see also https://github.com/akoumjian/datefinder\n", "#datefinder - Find dates inside text using Python and get back datetime objects " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Token Shapes\n", "\n", "As well as identifying entities, `spacy` analyses texts at several other levels. One such level of abstraction is the \"shape\" of each token. 
This identifies whether each character is an upper or lower case alphabetic character, a digit, or a punctuation character (which appears as itself); runs of the same character type longer than four are truncated:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "From :: Xxxx\n", "February :: Xxxxx\n", "2016 :: dddd\n", ", :: ,\n", "as :: xx\n", "an :: xx\n", "author :: xxxx\n", ", :: ,\n", "payments :: xxxx\n", "from :: xxxx\n", "Head :: Xxxx\n", "of :: xx\n", "Zeus :: Xxxx\n", "Publishing :: Xxxxx\n", "; :: ;\n" ] } ], "source": [ "for token in doc[:15]:\n", " print(token, '::', token.shape_) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scraping a Text Based on Its Shape Structure And Adding New Entity Types\n", "\n", "The \"shape\" of a token provides an additional structural item that we might be able to make use of when scraping the raw text.\n", "\n", "For example, writing an efficient regular expression to identify a UK postcode can be a difficult task, but we can start to cobble together a matcher from the shapes of different postcodes written in \"standard\" postcode form:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['XXd', 'dXX', ',', 'XXdX', 'dXX', ',', 'Xd', 'dXX']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[pc.shape_ for pc in nlp('MK7 6AA, SW1A 1AA, N7 6BB')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can define a `matcher` function that will identify the tokens in a document that match a particular ordered combination of shape patterns.\n", "\n", "For example, the postcode-like things described above have the shapes:\n", "\n", "- `XXd dXX`\n", "- `XXdX dXX`\n", "- `Xd dXX`\n", "\n", "We can use these structural patterns to identify token pairs as possible postcodes."
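Before handing those shape patterns to a matcher, it may help to see how a shape string is derived. Here is a rough pure-Python approximation of the `.shape_` attribute; the `word_shape` helper is our own, assuming, as the outputs above suggest, that runs of one character type are capped at four:

```python
def word_shape(token):
    """Approximate spacy's token .shape_ attribute."""
    shape, run = [], 0
    for ch in token:
        if ch.isdigit():
            mapped = 'd'
        elif ch.isalpha():
            mapped = 'X' if ch.isupper() else 'x'
        else:
            mapped = ch  # punctuation appears as itself
        run = run + 1 if shape and shape[-1] == mapped else 1
        if run > 4:      # cap runs of one character type at four
            continue
        shape.append(mapped)
    return ''.join(shape)

for tok in ['February', '2016', 'EC1R', '0HT', ',']:
    print(tok, '::', word_shape(tok))
# February :: Xxxxx
# 2016 :: dddd
# EC1R :: XXdX
# 0HT :: dXX
# , :: ,
```

This is only a sketch for intuition; in practice we use the `.shape_` values `spacy` itself computes, as in the matcher patterns below.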
] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "from spacy.matcher import Matcher\n", "\n", "nlp = spacy.load('en')\n", "matcher = Matcher(nlp.vocab)\n", "\n", "matcher.add('POSTCODE', None, \n", " [{'SHAPE':'XXdX'}, {'SHAPE':'dXX'}],\n", " [{'SHAPE':'XXd'}, {'SHAPE':'dXX'}],\n", " [{'SHAPE':'Xd'}, {'SHAPE':'dXX'}])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's test that:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Matches: [WC1N 4CC, MK7 4AA]\n", "Entities: [(James Smith, 'PERSON'), (Lady Jane Grey, 'PERSON')]\n" ] } ], "source": [ "pcdoc = nlp('pc is WC1N 4CC okay, as is MK7 4AA and Sir James Smith and Lady Jane Grey are presumably persons.')\n", "matches = matcher(pcdoc)\n", "\n", "#See what we matched, and let's see what entities we have detected\n", "print('Matches: {}\\nEntities: {}'.format([pcdoc[m[1]:m[2]] for m in matches], [(m,m.label_) for m in pcdoc.ents]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding a new entity type with a matcher callback\n", "\n", "The matcher seems to have matched the postcodes, but is not identifying them as entities. (We also note that the entity recogniser has missed the \"Sir\" title. In some cases, it might also match a postcode as a person.)\n", "\n", "To add the matched items to the entity list, we need to add a callback function to the matcher."
] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Matches: [WC1N 4CC, MK7 4AA]\n", "Entities: [(WC1N 4CC, 'POSTCODE'), (MK7 4AA, 'POSTCODE'), (James Smith, 'PERSON')]\n" ] } ], "source": [ "##Define a POSTCODE as a new entity type by adding matched postcodes to the doc.ents\n", "#https://stackoverflow.com/a/47799669\n", "\n", "nlp = spacy.load('en')\n", "matcher = Matcher(nlp.vocab)\n", "\n", "def add_entity_label(matcher, doc, i, matches):\n", " match_id, start, end = matches[i]\n", " doc.ents += ((match_id, start, end),)\n", " \n", "#Recognise postcodes from different shapes\n", "matcher.add('POSTCODE', add_entity_label, [{'SHAPE': 'XXdX'},{'SHAPE':'dXX'}], [{'SHAPE':'XXd'},{'SHAPE':'dXX'}])\n", "\n", "pcdoc = nlp('pc is WC1N 4CC okay, as is MK7 4AA and James Smith is presumably a person')\n", "matches = matcher(pcdoc)\n", "\n", "print('Matches: {}\\nEntities: {}'.format([pcdoc[m[1]:m[2]] for m in matches], [(m,m.label_) for m in pcdoc.ents]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's put those pieces together more succinctly:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'From February 2016, as an author, payments from Head of Zeus Publishing; a client of Averbrook Ltd. Address: 45-47 Clerkenwell Green London EC1R 0HT, via Sheil Land, 52 Doughty Street. London WC1N 2LS. From October 2016 until July 2018, I will receive a regular payment of £13,000 per month (previously £11,000). Hours: 12 non-consecutive hrs per week. Any additional payments are listed below. 
(Updated 20 January 2016, 14 October 2016 and 2 March 2018)'" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bigtext" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "#Generate base tagged doc\n", "doc = nlp(bigtext)\n", "\n", "#Run postcode tagger over the doc\n", "_ = matcher(doc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The tagged document should now include `POSTCODE` entities. One of the easiest ways to check the effectiveness of a new entity tagger is to view the document with the recognised entities visualised within it.\n", "\n", "The `displacy` module, bundled with `spacy`, has a Jupyter-enabled visualiser for doing just that." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | constituency | \n", "date_of_birth | \n", "days_service | \n", "first_start_date | \n", "gender | \n", "list_name | \n", "member_id | \n", "party | \n", "
---|---|---|---|---|---|---|---|---|
0 | \n", "Hackney North and Stoke Newington | \n", "1953-09-27 | \n", "11041 | \n", "1987-06-11 | \n", "F | \n", "Abbott, Ms Diane | \n", "172 | \n", "Labour | \n", "
1 | \n", "Oldham East and Saddleworth | \n", "1960-09-15 | \n", "2543 | \n", "2011-01-13 | \n", "F | \n", "Abrahams, Debbie | \n", "4212 | \n", "Labour | \n", "
2 | \n", "Selby and Ainsty | \n", "1966-11-30 | \n", "2795 | \n", "2010-05-06 | \n", "M | \n", "Adams, Nigel | \n", "4057 | \n", "Conservative | \n", "
3 | \n", "Hitchin and Harpenden | \n", "1986-02-11 | \n", "279 | \n", "2017-06-08 | \n", "M | \n", "Afolami, Bim | \n", "4639 | \n", "Conservative | \n", "
4 | \n", "Windsor | \n", "1965-08-04 | \n", "4598 | \n", "2005-05-05 | \n", "M | \n", "Afriyie, Adam | \n", "1586 | \n", "Conservative | \n", "