{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Regular expressions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous tutorial, we learned to cut up a string, and also to check whether a string contains a substring. For example:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n" ] } ], "source": [ "print(\"cat\" in \"Implicate\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But what if you want to search text for a _pattern_ rather than a specific string? For example, an email address is of the pattern XXX@YYY.ZZZ where the XXXs, YYYs, and ZZZs can be anything. Searching for the email addresses in a chunk of text, one might at first think to just search for the characteristic @ sign of an email address. But then Twitter handles also come with @ signs.\n", "\n", "We need something cleverer.\n", "\n", "This is where regular expressions come in. Regular expressions are very powerful, but also rather fiddly. So, if you can get away with not using them, then great. But when you need them, you really need them; I had to use regular expressions at one point when writing these tutorials to fix a recurring formatting error!\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That will load the regular expressions library into memory. Now, there's no way I can show you everything that regular expressions can do. All I can hope to do is give you some examples and appreciation for them and how you would go about starting to use them. So here's the low-down. The feature is available in some form in pretty much every programming language, text editor, and command-line shell to search text for specific patterns. A regular expression is essentially a string that describes a pattern with some fairly complex syntax, and then other strings can be checked to see if they match this pattern. Let's take some text and see if we can extract all the email addresses." ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true }, "outputs": [], "source": [ "TLDs = \"A top-level domain is a suffix to a domain name, such as the .com in google.com or .gov.uk in many government websites. This means there are now many possible ending for an email address, and every email address needs one: you can't just have my@address or something. Rather than simply joebloggs@gmail.com, we could now have joe@bloggs.me, or joe@bloggs.blog. Or how about jane@doe.email, or judgejudy@television.attorney, or santa@claus.christmas. These are all valid email addresses, even if they look like twitter handles, which are of the form @father_christmas. Email addresses can contain many ascii characters, including even a . such as sam.cassidy@notmyactualaddress.email, and as we know, some have many dots after the @ sign like me@student.liv.ac.uk. So there are a lot of possible combinations and elementary Python methods couldn't account for them all!\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, regular expressions are essentially their own mini programming language. So as not to overwhelm, we'll go through a process of iteration. Firstly, we should decide which function from re we want to use. The options are match(), search() and findall(), corresponding to checking the start of a string, checking anywhere in the string, and collecting all instances into a list. We want to extract all the email addresses into a list, so we'll use ```findall```. Let's start by telling ```re``` to find all in the instances of the @ sign." ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['@', '@', '@', '@', '@', '@', '@', '@', '@', '@', '@']\n" ] } ], "source": [ "addresses = re.findall('@', TLDs)\n", "print(addresses)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, not very useful. But you can see the syntax for the function: first argument is a pattern to match, second argument is the string to check for. Most characters in a regular expression are just matched exactly, for example, passing \"@\" really did search for all instances of @. But there are special ways to specify more complex patterns.\n", "\n", "So the first thing we might want to try is to find substring in ```TLDs``` that have an @ sign preceeded by and followed by \"word\" characters. Word characters are letters and numbers. The symbol for any word character is ```\\w```. But here's the first problem, a little annoyance of Python: Python strings treat backslashes as special escape characters, and Python evaluates the string literal before passing it to the ```findall()``` function. Hence we must first escape the backslash with another backslash. In other words, we write ```\\\\w```, Python understands it as ```\\w``` and then passes it to ```findall()```. So, let's do an example:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['y@a', 's@g', 'e@b', 'e@b', 'e@d', 'y@t', 'a@c', 'y@n', 'e@s']\n" ] } ], "source": [ "addresses = re.findall('\\\\w@\\\\w', TLDs)\n", "print(addresses)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Better. But of course, we don't just want a single word character, we want several word characters. There are two operators that mean \"match several repeats of the previous character\". These are ```*``` and ```+```. The difference is that ```*``` will still match if there are 0 repeats; ie. it's optional. If we use ```+```, there must be at least one instance of a word character to count as a match. Look:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['my@address', 'joebloggs@gmail', 'joe@bloggs', 'joe@bloggs', 'jane@doe', 'judgejudy@television', 'santa@claus', '@father_christmas', 'cassidy@notmyactualaddress', '@', 'me@student']\n" ] } ], "source": [ "addresses = re.findall('\\\\w*@\\\\w*', TLDs)\n", "print(addresses)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, better. But there are many problems. Firstly, we didn't account for the fact that email addresses can contain punctuation. Also by making word characters optional, we grabbed a lone \"@\" sign, and a twitter handle (which only has characters to the right of the @). To capture punctuation as well, we can use ```\\S```.\n", "\n", "Useful rule: ```\\w``` matches a word character. ```\\W``` means a non-word character. ```\\s``` matches a space, ```\\S``` matches a non-space, and so on.\n", "\n", "So here, we are going to look for @ signs wedged between any non-space characters:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['my@address', 'joebloggs@gmail.com,', 'joe@bloggs.me,', 'joe@bloggs.blog.', 'jane@doe.email,', 'judgejudy@television.attorney,', 'santa@claus.christmas.', 'sam.cassidy@notmyactualaddress.email,', 'me@student.liv.ac.uk.']\n" ] } ], "source": [ "addresses = re.findall('\\\\S+@\\\\S+', TLDs)\n", "print(addresses)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking good, but we've also captured a whole bunch of extra punctuation at the end. Useful option is to use ```\\b```, which matches word boundaries (in other words, punctuation ad spaces that commonly appear at the end of a word):" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['my@address', 'joebloggs@gmail.com', 'joe@bloggs.me', 'joe@bloggs.blog', 'jane@doe.email', 'judgejudy@television.attorney', 'santa@claus.christmas', 'sam.cassidy@notmyactualaddress.email', 'me@student.liv.ac.uk']\n" ] } ], "source": [ "addresses = re.findall('\\\\b\\\\S+@\\\\S+\\\\b', TLDs)\n", "print(addresses)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're so close to getting them now. But we've captured one non-example, the my@address one. We need to make sure there's at least a ```.``` in there somewhere after the @. So next I try changing ```\\\\S+``` to ```\\\\w+\\\\.\\\\S\\\\+```, which reads \"match word characters, followed by a ., followed by any non-space characters\". Notice I had to escape the ```.``` with ```\\\\.```, because usually ```.``` in a regular expression means \"match any character whatsoever\"." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['joebloggs@gmail.com', 'joe@bloggs.me', 'joe@bloggs.blog', 'jane@doe.email', 'judgejudy@television.attorney', 'santa@claus.christmas', 'sam.cassidy@notmyactualaddress.email', 'me@student.liv.ac.uk']\n" ] } ], "source": [ "addresses = re.findall(\"\\\\b\\\\S+@\\\\w+\\\\.\\\\S+\\\\b\", TLDs)\n", "print(addresses)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Perfect. We captured all and only the email addresses in the above paragraph. Now, you're probably looking at that and reeling at how disgusting it is to read. This is one very good reason to avoid regular expressions unless they really are the only tool for the job! But to make it a little nicer, I'll let you in on a tip, with a warning.\n", "\n", "You can get around the need to \"double escape\" everything with ```\\\\``` if you use Python \"raw string\" literals. This means that Python will evaluate the string directly as you type it, with no respect for the usual escape characters. To make a raw string, just add an ```r``` before the literal:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['joebloggs@gmail.com', 'joe@bloggs.me', 'joe@bloggs.blog', 'jane@doe.email', 'judgejudy@television.attorney', 'santa@claus.christmas', 'sam.cassidy@notmyactualaddress.email', 'me@student.liv.ac.uk']\n" ] } ], "source": [ "addresses = re.findall(r\"\\b\\S+@\\w+\\.\\S+\\b\", TLDs)\n", "print(addresses)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This may be a little more pleasant, especially if you want to match an actual backslash in the text (to match an actual backslash, ```\\\\\\\\``` is required. Can you work out why?). The one word of warning is that you cannot search for quote marks like this, because a quote mark will close the string unless escaped with ```\\\"```, and raw strings ignore escape characters. Let's explore some more concepts and features of regular expressions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Greedy vs non-greedy repetition\n", "\n", "We saw how ```+``` and ```*``` could be used to repeat a character an arbitrary number of times. However, there are two possible behaviours for this repetition operator: \"greedy\", and \"non-greedy\". The greedy version is the default: it will keep matching characters until the condition fails, for example, ```\\S``` will keep matching non-space characters until the next character is a space. The non-greedy option is specified by adding ```?```: the non-greedy option will keep matching until the next part of the pattern succeeeds. In other words ```\\S+?o``` will match all non-space characters up to the next ```o``` character. For example, ```\\S+?\\.com``` could be used to extract ```.com``` web addresses, but ```\\S+\\.com``` won't, because .com is a non-space character and will be considered part of the ```\\S+``` part of the string: this part \"eats up\" everything until the next space." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Matching and searching\n", "\n", "Matching and searching are similar operations. They return a \"match\" object, which can be queried for information about matches that were found and where they start and end. Let's suppose I want to scan through documents and see they contain any references to a topic I'm researching, say \"cats\". Some considerations: \"cat\" is a very short word, appearing inside many other words, and it may appear in plural or possessive form, eg \"cats\" or \"cat's\", and it may begin a sentence and hence be capitalized. So let's make a list of sentences and see which ones contain references to cats, using a regular expression." ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sentences = [\"Cats were revered as sacred animals in ancient egypt.\",\n", " \"The term 'big cat' is often reserved for species with the capability to roar.\",\n", " \"A cat's diet is almost exclusively meat, making them among the strictest carnivores of all mammals.\",\n", " \"Quintus Lutatius Catulus consistently opposed Ceasar, whom he endeavoured to implicate in the Catilinarian conspiracy.\",\n", " \"To display the content of text file in the bash shell, you can use the cat command.\"]" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cats were revered as sacred animals in ancient egypt.\n", "is about cats!\n", "The term 'big cat' is often reserved for species with the capability to roar.\n", "is about cats!\n", "A cat's diet is almost exclusively meat, making them among the strictest carnivores of all mammals.\n", "is about cats!\n", "Quintus Lutatius Catulus consistently opposed Ceasar, whom he endeavoured to implicate in the Catilinarian conspiracy.\n", "is not about cats\n", "To display the content of text file in the bash shell, you can use the cat command.\n", "is about cats!\n" ] } ], "source": [ "match_pairs = []\n", "for snippet in sentences:\n", " match = re.search(\"\\\\bcat(s|'s)?\\\\b\", snippet, re.IGNORECASE)\n", " print(snippet)\n", " if match:\n", " print(\"is about cats!\")\n", " match_pairs.append((match, snippet))\n", " else:\n", " print(\"is not about cats\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So let's break it down. The first thing you should notice is we have used ```\\b``` again on either side to show that this should be a standalone word: it must have word boundaries such as spaces, commas, or full stops on either side. Then we have ```(s|'s)```. The brackets \"group\" strings of characters, the ```|``` means \"or\". So we're looking for ```cat```, followed by ```s``` or ```'s```. Then the question mark makes the previous group optional: it'll still match even if it is just \"cat\" by itself. Notice I also passed an additional argument to ```search()``` called ```re.IGNORECASE``` --- there are several such arguments that modify the behaviour of regular expression matching. This one just says that \"cat\", \"CAT\" or \"Cat\" are absolutely fine.\n", "\n", "The previous code also created a list of pairs, containing the sentence and the so-called match object. Let's show what can be done:" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cats were revered as sacred animals in ancient egypt.\n", "Word \"Cats\" found starting at character 0\n", "The term 'big cat' is often reserved for species with the capability to roar.\n", "Word \"cat\" found starting at character 14\n", "A cat's diet is almost exclusively meat, making them among the strictest carnivores of all mammals.\n", "Word \"cat's\" found starting at character 2\n", "To display the content of text file in the bash shell, you can use the cat command.\n", "Word \"cat\" found starting at character 71\n" ] } ], "source": [ "for pair in match_pairs:\n", " match, snippet = pair[0], pair[1]\n", " print(snippet)\n", " print(\"Word \\\"{}\\\" found starting at character {}\".format(match.group(), match.start()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Naturally, we also have ```match.end()``` to show where the group ends, and also ```match.span()```, which returns the tuple ```(match.start(), match.end())```. There is so much more that can be done with regular expressions, such as search and replace, but this is beyond the scope of this tutorial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Further resources:\n", "\n", "* https://docs.python.org/3/howto/regex.html Official tutorial for Python regular expressions\n", "* https://www.tutorialspoint.com/python/python_reg_expressions.htm Has a helpful list of patterns\n", "* https://docs.python.org/3/library/re.html Full documentation outlining every feature." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise\n", "\n" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": true }, "outputs": [], "source": [ "phrases = [\"You can call me on 07700 900432.\",\n", " \"My mobile number is 07700930710\",\n", " \"My date of birth is 07.08.92\",\n", " \"Why not phone me on 202-555-0136?\"\n", " \"There are around 7600000000 people on Earth\",\n", " \"If you're from overseas, call +44 7700 900190\",\n", " \"Try calling +447700900999 now!\",\n", " \"56+44=100.\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the ```re``` module to extract the UK mobile numbers (those starting with 07 or +44) from the above list of strings. You may need to refer to the resources above to figure out what you need!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }