{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Regex introduction\n", "\n", "## What is a regex?\n", "[**Regex**](https://en.wikipedia.org/wiki/Regular_expression) stands for _regular expression_, and regular expressions are a way of writing patterns that match strings. Usually these patterns can be used to search strings for specific things, or to search and then replace certain things, etc. Regular expressions are great for string manipulation!\n", "\n", "## Why do regular expressions matter?\n", "From the first paragraph in this guide you might have guessed it, but regular expressions can be very useful **whenever you have to deal with strings**, from the basic renaming of a set of similarly named variables in your source code to [data preprocessing](https://github.com/clone95/Virgilio/blob/master/Specializations/HardSkills/DataPreprocessing.md). Regular expressions usually offer a concise way of expressing whatever type of things you want to find. For example, if you wanted to parse a form and look for the year that someone might have been born in, you could use something like `(19|20)[0-9][0-9]`. This is an example of a regular expression!\n", "\n", "## Prerequisites\n", "This guide does not assume any prior knowledge. Examples will be coded in Python, but mastery of the programming language is neither assumed nor needed. 
You are welcome to read the guide in your browser, or to download it, run the examples and toy around with them.\n", "\n", "# Index\n", " - [Basic regex](#Basic-regex)\n", " - [Using Python re](#Using-Python-re)\n", " - [$\\pi$ lookup](#$\\pi$-lookup)\n", " - [Matching options](#Matching-options)\n", " - [Virgilio or Virgil?](#Virgilio-or-Virgil?)\n", " - [Matching repetitions](#Matching-repetitions)\n", " - [Greed](#Greed)\n", " - [Removing excessive spaces](#Removing-excessive-spaces)\n", " - [Character classes](#Character-classes)\n", " - [Phone numbers v1](#Phone-numbers-v1)\n", " - [More `re` functions](#More-re-functions)\n", " - [`search` with `match`](#search-with-match)\n", " - [Count matches with `findall`](#Count-matches-with-findall)\n", " - [Special characters](#Special-characters)\n", " - [Phone numbers v2](#Phone-numbers-v2)\n", " - [Groups](#Groups)\n", " - [Phone numbers v3](#Phone-numbers-v3)\n", " - [Toy project about regex](#Toy-project-about-regex)\n", " - [Further reading](#Further-reading)\n", " - [Suggested solutions](#Suggested-solutions)\n", " \n", "Let's dive right in!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Just a quick word:** I tried to include some small exercises whenever I show you something new, so that you can try and test your knowledge. Examples of solutions are provided at the [end of the notebook](#Suggested-solutions)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic regex\n", "\n", "A regex is just a string written in a certain format, that can then be used by specific tools/libraries/programs to perform pattern matching on strings. Throughout this guide we will use `this formatting` to refer to regular expressions!\n", "\n", "The simplest regular expressions that one can create are just composed of regular characters. If you wanted to find all the occurrences of the word _\"Virgilio\"_ in a text, you could write the regex `Virgilio`. 
In this regular expression, no character is doing anything special or different. In fact, this regular expression is just a normal word. That is ok, regular expressions are strings, after all!\n", "\n", "If you were given the text _\"Project Virgilio is great\"_, you could use your `Virgilio` regex to find the occurrence of the word _\"Virgilio\"_. However, if the text was _\"Project virgilio is great\"_, then your regex wouldn't work, because regular expressions are **case-sensitive** by default and thus should match everything exactly. We say that `Virgilio` matches the sequence of characters \"Virgilio\" literally." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Python re\n", "\n", "To check if our regular expressions are working well and to give you the opportunity to directly experiment with them, we will be using Python's `re` module to work with regular expressions. To use the `re` module we first import it, then define a regular expression and then use the `search()` function over a string! Pretty simple:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'Virgilio' is in 'Project Virgilio is great'\n", "'Virgilio' is not in 'Project virgilio is great'\n" ] } ], "source": [ "import re\n", "\n", "regex = \"Virgilio\"\n", "str1 = \"Project Virgilio is great\"\n", "str2 = \"Project virgilio is great\"\n", "\n", "if re.search(regex, str1):\n", " print(\"'{}' is in '{}'\".format(regex, str1))\n", "else:\n", " print(\"'{}' is not in '{}'\".format(regex, str1))\n", " \n", "if re.search(regex, str2):\n", " print(\"'{}' is in '{}'\".format(regex, str2))\n", "else:\n", " print(\"'{}' is not in '{}'\".format(regex, str2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `re.search(regex, string)` function takes a regex as first argument and then searches for any matches over the string that was given as the second argument. 
However, the return value of the function is **not** a boolean, but a *match object*:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<re.Match object; span=(8, 16), match='Virgilio'>\n" ] } ], "source": [ "print(re.search(regex, str1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Match objects have relevant information about the match(es) encountered: the start and end positions, the string that was matched, and even some other things for more complex regular expressions.\n", "\n", "We can see that in this case the match is exactly the same as the regular expression, so it may look like the `match` information inside the match object is irrelevant... but it becomes relevant as soon as we introduce options or repetitions into our regex.\n", "\n", "If no matches are found, then the `.search()` function returns `None`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n" ] } ], "source": [ "print(re.search(regex, str2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whenever the match is not `None`, we can save the returned match object and use it to extract all the needed information!" 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The match started at pos 8 and ended at pos 16\n", "Or with tuple notation, the match is at (8, 16)\n", "And btw, the actual string matched was 'Virgilio'\n" ] } ], "source": [ "m = re.search(regex, str1)\n", "if m is not None:\n", "    print(\"The match started at pos {} and ended at pos {}\".format(m.start(), m.end()))\n", "    print(\"Or with tuple notation, the match is at {}\".format(m.span()))\n", "    print(\"And btw, the actual string matched was '{}'\".format(m.group()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you should try to get some more matches and some failures with your own literal regular expressions. I provide three examples of my own:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The match is at (20, 25)\n", "\n", "Whoops, did I just get the alphabet wrong..?\n", "\n", "I just matched 'a' inside 'aaaaa aaaaaa a aaa'\n" ] } ], "source": [ "m1 = re.search(\"regex\", \"This guide is about regexes\")\n", "if m1 is not None:\n", "    print(\"The match is at {}\\n\".format(m1.span()))\n", "\n", "m2 = re.search(\"abc\", \"The alphabet goes 'abdefghij...'\")\n", "if m2 is None:\n", "    print(\"Whoops, did I just get the alphabet wrong..?\\n\")\n", " \n", "s = \"aaaaa aaaaaa a aaa\"\n", "m3 = re.search(\"a\", s)\n", "if m3 is not None:\n", "    print(\"I just matched '{}' inside '{}'\".format(m3.group(), s))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### $\\pi$ lookup\n", "\n", "$$\\pi = 3.1415\\cdots$$\n", "\n", "right? Well, what comes after the dots? An infinite sequence of digits, right? Could it be that your date of birth appears in the first million digits of $\\pi$? Well, we could use a regex to find that out! 
Change the `regex` variable below to look for your date of birth or for any number you want, in the first million digits of $\\pi$!" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "pifile = \"regex-bin/pi.txt\"\n", "regex = \"\" # define your regex to look your favourite number up\n", "\n", "with open(pifile, \"r\") as f:\n", " pistr = f.read() # pistr is a string that contains 1M digits of pi\n", " \n", "## search for your number here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To search for numbers in the first 100 million digits of $\\pi$ (or 200 million, I didn't really get it) you can check [this](https://www.angio.net/pi/piquery) website." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matching options\n", "\n", "We just saw a very simple regular expression that was trying to find the word _\"Virgilio\"_ in text, but we also saw that we had zero flexibility and we couldn't even handle the fact that someone may have forgotten to capitalize the name properly, spelling it like _\"virgilio\"_ instead.\n", "\n", "To prevent problems like this, regular expressions can be written in a way to handle different possibilities. For our case, we want the first letter to be either _\"V\"_ or _\"v\"_, and that should be followed by _\"irgilio\"_.\n", "\n", "In order to handle different possibilities, we use the character `|`. 
For instance, `V|v` matches the letter vee, regardless of its capitalization:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "small v found\n", "big V found\n" ] } ], "source": [ "v = \"v\"\n", "V = \"V\"\n", "regex = \"v|V\"\n", "if re.search(regex, v):\n", "    print(\"small v found\")\n", "if re.search(regex, V):\n", "    print(\"big V found\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can concatenate the regex for the first letter and the `irgilio` regex (for the rest of the name) to get a regex that matches the name of Virgilio, regardless of the capitalization of its first letter:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "virgilio found!\n", "Virgilio found!\n" ] } ], "source": [ "virgilio = \"virgilio\"\n", "Virgilio = \"Virgilio\"\n", "regex = \"(V|v)irgilio\"\n", "if re.search(regex, virgilio):\n", "    print(\"virgilio found!\")\n", "if re.search(regex, Virgilio):\n", "    print(\"Virgilio found!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that we write the regex with parentheses: `(V|v)irgilio`\n", "\n", "If we only wrote `V|virgilio`, then the regular expression would match either \"V\" or \"virgilio\", instead of \"Virgilio\" or \"virgilio\":" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<re.Match object; span=(29, 30), match='V'>\n" ] } ], "source": [ "regex = \"V|virgilio\"\n", "print(re.search(regex, \"This sentence only has a big V\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we really need to parenthesize the `(V|v)` there. If we do, it will work as expected!" 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<re.Match object; span=(27, 35), match='virgilio'>\n", "None\n" ] } ], "source": [ "regex = \"(V|v)irgilio\"\n", "print(re.search(regex, \"The name of the project is virgilio, but with a big V!\"))\n", "print(re.search(regex, \"This sentence only has a big V\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Maybe you didn't even notice, but there is something else going on! Notice that we used the characters `|`, `(` and `)`, and those are not present in the word _\"virgilio\"_, but nonetheless our regex `(V|v)irgilio` matched it... that is because these three characters have special meanings in the regex world, and hence are **not** interpreted literally, contrary to what happens to any letter in `irgilio`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Virgilio or Virgil?\n", "\n", "Here are a couple of paragraphs from Wikipedia's [article on Virgil](https://en.wikipedia.org/wiki/Virgil):\n", "\n", " > Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called Virgil or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]\n", "\n", " > Virgil is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. 
Virgil's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which Virgil appears as Dante's guide through Hell and Purgatory.\n", " \n", "\"Virgilio\" is the Italian form of \"Virgil\", and I edited the above paragraphs to have the Italian version instead of the English one. I want you to revert this!\n", "\n", "You might want to take a look at [`while` loops in Python](https://realpython.com/python-while-loop/), [string indexing](https://www.digitalocean.com/community/tutorials/how-to-index-and-slice-strings-in-python-3) and [string concatenation](https://realpython.com/python-string-split-concatenate-join/). The point is that you find a match, you break the string into the part _before_ the match and the part _after_ the match, and you glue those two together with _Virgil_ in between.\n", "\n", "Notice that [string replacement](https://www.tutorialspoint.com/python/string_replace.htm) would probably be faster and easier, but that would defeat the purpose of this exercise. After fixing everything, print the final results to be sure that you fixed every occurrence of the name." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "paragraphs = \\\n", "\"\"\"Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called virgilio or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]\n", "\n", "Virgilio is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. 
Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. virgilio's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which virgilio appears as Dante's guide through Hell and Purgatory.\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matching repetitions\n", "\n", "Sometimes we want to find patterns that have bits that will be repeated. For example, people make an _\"awww\"_ or _\"owww\"_ sound when they see something cute, like a baby. But the number of _\"w\"_ I used there was completely arbitrary! If the baby is really really cute, someone might write _\"awwwwwwwwwww\"_. So how can I write a regex that matches _\"aww\"_ and _\"oww\"_, but with an arbitrary number of characters _\"w\"_?\n", "\n", "I will illustrate several ways of capturing repetitions, by testing regular expressions against the following strings:\n", "\n", " - \"awww\" (3 letters \"w\")\n", " - \"awwww\" (4 letters \"w\")\n", " - \"awwwwwww\" (7 letters \"w\")\n", " - \"awwwwwwwwwwwwwwww\" (16 letters \"w\")\n", " - \"aw\" (1 letter \"w\")\n", " - \"a\" (0 letters \"w\")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "cute_strings = [\n", "    \"awww\",\n", "    \"awwww\",\n", "    \"awwwwwww\",\n", "    \"awwwwwwwwwwwwwwww\",\n", "    \"aw\",\n", "    \"a\"\n", "]\n", "\n", "def match_cute_strings(regex):\n", "    \"\"\"Takes a regex, prints matches and non-matches\"\"\"\n", "    for s in cute_strings:\n", "        m = re.search(regex, s)\n", "        if m:\n", "            print(\"match: {}\".format(s))\n", "        else:\n", "            print(\"non match: {}\".format(s))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### At least once\n", "\n", "If I want to match all strings that contain **at least** one \"w\", I can use the character `+`. 
A `+` means that we want to find **one or more repetitions** of whatever was to the left of it. For example, the regex `a+` will match any string that has at least one \"a\"." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "match: awww\n", "match: awwww\n", "match: awwwwwww\n", "match: awwwwwwwwwwwwwwww\n", "match: aw\n", "non match: a\n" ] } ], "source": [ "regex = \"aw+\"\n", "match_cute_strings(regex)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Any number of times\n", "\n", "If I want to match all strings that contain an arbitrary number of letters \"w\", I can use the character `*`. The character `*` means **match any number of repetitions** of whatever comes on the left of it, _even 0 repetitions_! So the regex `a*` would match the empty string \"\", because the empty string \"\" has 0 repetitions of the letter \"a\"." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "match: awww\n", "match: awwww\n", "match: awwwwwww\n", "match: awwwwwwwwwwwwwwww\n", "match: aw\n", "match: a\n" ] } ], "source": [ "regex = \"aw*\"\n", "match_cute_strings(regex)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A specific number of times\n", "\n", "If I want to match a string that contains a certain particle a specific number of times, I can use the `{n}` notation, where `n` is replaced by the number of repetitions I want. For example, `a{3}` matches the string \"aaa\" but not the string \"aa\"." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "match: awww\n", "match: awwww\n", "match: awwwwwww\n", "match: awwwwwwwwwwwwwwww\n", "non match: aw\n", "non match: a\n" ] } ], "source": [ "regex = \"aw{3}\"\n", "match_cute_strings(regex)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Wait a minute**, why did the pattern `aw{3}` match the longer expressions of cuteness, like \"awwww\" or \"awwwwwww\"? Because the regular expressions try to find _substrings_ that match the pattern. Our pattern is `awww` (if I write the `w{3}` explicitly) and the string **awww**w has that substring, just like the string **awww**wwww has it, or the longer version with 16 letters \"w\". If we wanted to exclude the strings \"awwww\", \"awwwwwww\" and \"awwwwwwwwwwwwwwww\" we would have to fix our regex. A better example that demonstrates how `{n}` works is by considering, instead of expressions of cuteness, expressions of amusement like \"wow\", \"woow\" and \"wooooooooooooow\". We define some expressions of amusement:\n", "\n", " - \"wow\"\n", " - \"woow\"\n", " - \"wooow\"\n", " - \"woooow\"\n", " - \"wooooooooow\"\n", " \n", "and now we test our `{3}` pattern." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "wow_strings = [\n", " \"wow\",\n", " \"woow\",\n", " \"wooow\",\n", " \"woooow\",\n", " \"wooooooooow\"\n", "]\n", "\n", "def match_wow_strings(regex):\n", " \"\"\"Takes a regex, prints matches and non-matches\"\"\"\n", " for s in wow_strings:\n", " m = re.search(regex, s)\n", " if m:\n", " print(\"match: {}\".format(s))\n", " else:\n", " print(\"non match: {}\".format(s))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "non match: wow\n", "non match: woow\n", "match: wooow\n", "non match: woooow\n", "non match: wooooooooow\n" ] } ], "source": [ "regex = \"wo{3}w\"\n", "match_wow_strings(regex)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Between $n$ and $m$ times\n", "\n", "Expressing amusement with only three \"o\" is ok, but people might also use two or four \"o\". How can we capture a variable number of letters, but within a range? Say I only want to capture versions of \"wow\" that have between 2 and 4 letters \"o\". I can do it with `{2,4}`." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "non match: wow\n", "match: woow\n", "match: wooow\n", "match: woooow\n", "non match: wooooooooow\n" ] } ], "source": [ "regex = \"wo{2,4}w\"\n", "match_wow_strings(regex)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Up to $n$ times or at least $m$ times\n", "\n", "Now we are just playing with the type of repetitions we might want, but of course we might say that we want **no more** than $n$ repetitions, which you would do with `{,n}`, or that we want **at least** $m$ repetitions, which you would do with `{m,}`.\n", "\n", "In fact, take a look at these regular expressions:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "match: wow\n", "match: woow\n", "match: wooow\n", "match: woooow\n", "non match: wooooooooow\n" ] } ], "source": [ "regex = \"wo{,4}w\" # should not match strings with more than 4 o's\n", "match_wow_strings(regex)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "non match: wow\n", "non match: woow\n", "match: wooow\n", "match: woooow\n", "match: wooooooooow\n" ] } ], "source": [ "regex = \"wo{3,}w\" # should not match strings with fewer than 3 o's\n", "match_wow_strings(regex)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### To be or not to be\n", "\n", "Last but not least, sometimes we care about something that might or might not be present. For example, above we dealt with the English and Italian versions of the name Virgilio. If we wanted to write a regular expression to capture both versions, we could write `((V|v)irgil)|((V|v)irgilio)`, or slightly more compact, `(V|v)((irgil)|(irgilio))`. But this does not look good at all, right? 
All we need to say is that the final \"io\" might or might not be present. We do this with the `?` character. So the regex `(V|v)irgil(io)?` matches the upper and lower case versions of \"Virgil\" and \"Virgilio\"." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The name virgil was matched!\n", "The name Virgil was matched!\n", "The name virgilio was matched!\n", "The name Virgilio was matched!\n" ] } ], "source": [ "regex = \"(V|v)irgil(io)?\"\n", "names = [\"virgil\", \"Virgil\", \"virgilio\", \"Virgilio\"]\n", "for name in names:\n", "    m = re.search(regex, name)\n", "    if m:\n", "        print(\"The name {} was matched!\".format(name))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Greed\n", "\n", "The `+`, `?`, `*` and `{,}` operators are all greedy. What does this mean? It means that they will try to match as much as possible. This is their default behaviour, as opposed to stopping as soon as the regex is satisfied. To better illustrate what I mean by this, let us look again at the information contained in the `match` object we have been dealing with:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<re.Match object; span=(0, 3), match='aaa'>\n" ] } ], "source": [ "regex = \"a+\"\n", "s = \"aaa\"\n", "m = re.search(regex, s)\n", "print(m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the part of the printed information that says `match='aaa'`. The function `m.group()` will let me know what was the actual string that was matched by the regular expression, and in this case it was \"aaa\". Why does it make sense to have access to this information? Well, the regex I wrote, `a+`, will match one or more letters \"a\" in a row. 
If I use the regex over a string and I get a match, how would I be able to know how many \"a\"s were matched, if I didn't have access to that type of information?" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "aaa\n" ] } ], "source": [ "print(m.group())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So let us verify that, in fact, the operators I mentioned are all greedy. Again, because they all match as many characters as they can.\n", "\n", "Below, we see that given a string of thirty times the letter \"a\",\n", "\n", " - the pattern `a?` matches 1 \"a\", which is as much as it could\n", " - the pattern `a+` matches 30 \"a\"s, which is as much as it could\n", " - the pattern `a*` also matches 30\n", " - the pattern `a{5,10}` matches 10 \"a\"s, which was the limit imposed by us" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a\n", "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n", "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n", "aaaaaaaaaa\n" ] } ], "source": [ "s = \"a\"*30\n", "print(re.search(\"a?\", s).group())\n", "print(re.search(\"a+\", s).group())\n", "print(re.search(\"a*\", s).group())\n", "print(re.search(\"a{5,10}\", s).group())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we don't want our operators to be greedy, we just put an extra `?` after them. So the following regular expressions are **not** greedy:\n", "\n", " - the pattern `a??` will match **no** characters, much like `a*?`, because now their goal is to match as little as possible. But a match of length 0 is the shortest match possible!\n", " - the pattern `a+?` will only match 1 \"a\"\n", " - the pattern `a{5,10}?` will only match 5 \"a\"s\n", " \n", "We can easily confirm what I just said by running the code below. 
Notice that now I print things differently, because otherwise we wouldn't be able to see the `a??` and `a*?` patterns matching nothing." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "''\n", "'a'\n", "''\n", "'aaaaa'\n" ] } ], "source": [ "s = \"a\"*30\n", "print(\"'{}'\".format(re.search(\"a??\", s).group()))\n", "print(\"'{}'\".format(re.search(\"a+?\", s).group()))\n", "print(\"'{}'\".format(re.search(\"a*?\", s).group()))\n", "print(\"'{}'\".format(re.search(\"a{5,10}?\", s).group()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Removing excessive spaces\n", "\n", "Now that we know about repetitions, I am going to tell you about the `sub` function and we are going to use that to parse a piece of text and remove all extra spaces that are present. Calling `re.sub(regex, rep, string)` applies the given regex to the given string and replaces every match with `rep`.\n", "\n", "For example, I can use that to replace all English/Italian occurrences of the name Virgilio with a standardized one:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Virgilio has many names, like Virgilio, Virgilio, Virgilio, Virgilio, or even Virgilio.\n" ] } ], "source": [ "s = \"Virgilio has many names, like virgil, virgilio, Virgil, Vergil, or even vergil.\"\n", "regex = \"(V|v)(e|i)rgil(io)?\"\n", "\n", "print(\n", "    re.sub(regex, \"Virgilio\", s)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Now it is your turn. I am going to give you this sentence as input, and your job is to fix the whitespace in it. When you are done, save the result in a string named `s`, and check if `s.count(\"  \")` is equal to 0 or not." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weird_text = \"Now   it is your    turn. I am  going to give you this sentence  as input, and your job is   to fix the whitespace in it. When you are  done, save the result in a string named `s`, and check if `s.count(\\\"  \\\")` is equal to 0 or not.\"\n", "regex = \"\" # put your regex here\n", "\n", "# substitute the extra whitespace here\n", "# save the result in 's'\n", "\n", "# this print should be 0\n", "print(s.count(\"  \"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Character classes\n", "\n", "So far we have been writing some simple regular expressions that match words, names, and things like that. Now we have a different plan. We will write a regular expression that will match US phone numbers, which we will assume are of the form xxx-xxx-xxxx. The first three digits are the area code, but we will not care about whether the area code actually makes sense or not. How do we match this, then?\n", "\n", "In fact, how can I match the first digit? It can be any number from 0 to 9, so should I write `(0|1|2|3|4|5|6|7|8|9)` to match the first digit, and then repeat? Actually, we could do that, yes, to get this regex:\n", "\n", "`(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){4}`\n", "\n", "Does this work?" 
] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<re.Match object; span=(0, 12), match='202-555-0181'>\n", "None\n", "None\n", "<re.Match object; span=(0, 12), match='512-555-0191'>\n", "None\n" ] } ], "source": [ "regex = \"(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){3}-(0|1|2|3|4|5|6|7|8|9){4}\"\n", "numbers = [\n", "    \"202-555-0181\",\n", "    \"202555-0181\",\n", "    \"202 555 0181\",\n", "    \"512-555-0191\",\n", "    \"96-125-3546\",\n", "]\n", "for nr in numbers:\n", "    print(re.search(regex, nr))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like it works, but surely there must be a better way... and there is! Instead of writing out every digit like we did, we can actually write a range of values! In fact, the regex `[0-9]` matches any single digit from 0 to 9. So we can actually shorten our regex to `[0-9]{3}-[0-9]{3}-[0-9]{4}`:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<re.Match object; span=(0, 12), match='202-555-0181'>\n", "None\n", "None\n", "<re.Match object; span=(0, 12), match='512-555-0191'>\n", "None\n" ] } ], "source": [ "regex = \"[0-9]{3}-[0-9]{3}-[0-9]{4}\"\n", "numbers = [\n", "    \"202-555-0181\",\n", "    \"202555-0181\",\n", "    \"202 555 0181\",\n", "    \"512-555-0191\",\n", "    \"96-125-3546\",\n", "]\n", "for nr in numbers:\n", "    print(re.search(regex, nr))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The magic here is being done by the `[]`, which denotes a character class. The way `[]` works is, the regex will try to match any of the things that are inside, and it just so happens that `0-9` is a shorter way of listing all the digits. Of course you could also do `[0123456789]{3}-[0123456789]{3}-[0123456789]{4}` which is slightly shorter than our first attempt, but still pretty bad. 
Similar to `0-9`, we have `a-z` and `A-Z`, which go through all letters of the alphabet.\n", "\n", "You can also start and end in different places, for example `c-o` can be used to match words that only use letters between the \"c\" and the \"o\", like \"hello\":" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n" ] } ], "source": [ "regex = \"[c-o]+\"\n", "print(re.search(regex, \"hello\"))\n", "print(re.search(regex, \"rice\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With these character classes we can actually rewrite our Virgilio regex into something slightly shorter, going from `(V|v)(e|i)rgil(io)?` to `[Vv][ie]rgil(io)?`." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Virgilio has many names, like Virgilio, Virgilio, Virgilio, Virgilio, or even Virgilio.\n" ] } ], "source": [ "s = \"Virgilio has many names, like virgil, virgilio, Virgil, Vergil, or even vergil.\"\n", "regex = \"[Vv][ie]rgil(io)?\"\n", "\n", "print(\n", " re.sub(regex, \"Virgilio\", s)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again we see that our regular expression matched the **ice** in r**ice**, because the \"r\" was not inside the legal range of letters, but **ice** was.\n", "\n", "The _character class_ is the square brackets `[]` and whatever goes inside it. Also, note that the special characters we have been using lose their meaning inside a character class! 
So `[()?+*{}]` will actually look to match any of those characters:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "regex = \"[()?+*{}]\"\n", "print(re.search(regex, \"Did I just ask a question?\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A final note on character classes: if they start with `^`, then we are actually saying \"use everything _except_ what is inside this\":" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n", "\n" ] } ], "source": [ "regex = \"[^c-o]+\"\n", "print(re.search(regex, \"hello\"))\n", "print(re.search(regex, \"rice\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Phone numbers v1\n", "\n", "Now that you know how to use character classes to denote ranges, you need to write a regular expression that matches American phone numbers with the format xxx-xxx-xxxx. Not only that, but you must also cope with the fact that the numbers may or may not be preceded by the country indicator, which you can assume will look like \"+1\" or \"001\". The country indicator may be separated from the rest of the number with a space or with a dash."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regex = \"\" # write your regex here\n", "matches = [ # you should be able to match those\n", " \"202-555-0181\",\n", " \"001 202-555-0181\",\n", " \"+1-512-555-0191\"\n", "]\n", "non_matches = [ # for now, none of these should be matched\n", " \"202555-0181\",\n", " \"96-125-3546\",\n", " \"(+1)5125550191\"\n", "]\n", "for s in matches:\n", " print(re.search(regex, s))\n", "for s in non_matches:\n", " print(re.search(regex, s))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More `re` functions\n", "\n", "So far we have only looked at the `.search()` function of the `re` module, but now I am going to tell you about a couple more functions that can be quite handy when you are dealing with pattern matching. By the time you are done with this small section, you will know the following functions: `match()`, `search()`, `findall()`, `sub()` and `split()`.\n", "\n", "If you are here mostly for the regular expressions, and you don't care much about using them with Python, you can just skim through this section... even though it is still a nice read." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `search()` and `sub()`\n", "\n", "You already know these two functions. `re.search(regex, string)` tries to find the pattern given by `regex` in the given `string` and returns the information about the match in a `match` object. The function `re.sub(regex, rep, string)` takes a regex and two strings; it looks for the pattern you specified in `string` and replaces the matches with the other string, `rep`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `match()`\n", "\n", "The function `re.match(regex, string)` is similar to the function `re.search()`, except that `.match()` will only check if your pattern applies to the **beginning** of the string. 
That is, if your string does not **start** with the pattern you provided, the function returns `None`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ".search() found abc in abcdef\n", ".search() found abc in the alphabet starts with abc\n", ".match() says that abcdef starts with abc\n" ] } ], "source": [ "regex = \"abc\"\n", "string1 = \"abcdef\"\n", "string2 = \"the alphabet starts with abc\"\n", "# the .search() function finds the patterns, regardless of position\n", "if re.search(regex, string1):\n", " print(\".search() found {} in {}\".format(regex, string1))\n", "if re.search(regex, string2):\n", " print(\".search() found {} in {}\".format(regex, string2))\n", " \n", "# the .match() function only checks if the string STARTS with the pattern\n", "if re.match(regex, string1):\n", " print(\".match() says that {} starts with {}\".format(string1, regex))\n", "if re.match(regex, string2): # this one should NOT print\n", " print(\".match() says that {} starts with {}\".format(string2, regex))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `findall()`\n", "\n", "The `re.findall(regex, string)` is exactly like the `.search()` function, except that it will return **all** the matches it can find, instead of just the first one. Instead of returning a `match` object, it just returns the string that matched." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "['wow', 'wow', 'wow']\n" ] } ], "source": [ "regex = \"wow\"\n", "string = \"wow wow wow!\"\n", "\n", "print(re.search(regex, string))\n", "\n", "print(re.findall(regex, string))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "['ab1', 'ab2', 'ab3']\n" ] } ], "source": [ "regex = \"ab[0-9]\"\n", "string = \"ab1 ab2 ab3\"\n", "\n", "print(re.search(regex, string))\n", "\n", "print(re.findall(regex, string))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is important to note that the `findall()` function only returns _non-overlapping_ matches. That is, one could argue that `wow` appears twice in \"wowow\", at the beginning: **wow**ow, and at the end: wo**wow**. Nonetheless, `findall()` only returns one match because the second match overlaps with the first:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['wow']\n" ] } ], "source": [ "regex = \"wow\"\n", "string = \"wowow\"\n", "print(re.findall(regex, string))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this information it now makes a bit more sense to consider the greediness of the operators we showed before, like `?` and `+`. Imagine we are dealing with the regex `a+` and we have a string \"aaaaaaaaa\". If we use the greedy version of `+`, then we get a single match which is the whole string. If we use the non-greedy version of the operator `+`, perhaps because we want as many matches as possible, we will get a bunch of \"a\" matches!"
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['aaaaaaaaa']\n", "['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']\n" ] } ], "source": [ "regex_greedy = \"a+\"\n", "regex_nongreedy = \"a+?\"\n", "string = \"aaaaaaaaa\"\n", "\n", "print(re.findall(regex_greedy, string))\n", "\n", "print(re.findall(regex_nongreedy, string))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `split()`\n", "\n", "The `re.split(regex, string)` splits the given string into bits wherever it is able to find the pattern you specified. Say we are interested in finding all the sequences of consecutive consonants in a sentence (I don't know why you would want that...). Then we can use the vowels and the space \" \" to break up the sentence:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Th', 's', 's', 'j', 'st', 'r', 'g', 'l', 'r', 's', 'nt', 'nc', '']\n" ] } ], "source": [ "regex = \"[aeiou ]+\" # this will eliminate all vowels/spaces that appear consecutively\n", "string = \"This is just a regular sentence\"\n", "\n", "print(re.split(regex, string))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `search` with `match`\n", "\n", "Recall that the `match()` function only checks if your pattern is in the beginning of the string. What I want you to do is define your own `search` function that takes a regex and a string, and returns `True` if the pattern is inside the string, and `False` otherwise. Can you do it?" 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def my_search(regex, string):\n", " pass # write your code here\n", "\n", "regex = \"[0-9]{2,4}\"\n", "\n", "# your function should be able to match in all these strings\n", "string1 = \"1984 was already some years ago.\"\n", "string2 = \"There is also a book whose title is '1984', but the story isn't set in the year of 1984.\"\n", "string3 = \"Sometimes people write '84 for short.\"\n", "\n", "# your function should also match with this regex and this string\n", "regex = \"a*\"\n", "string = \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count matches with `findall`\n", "\n", "Now I want you to define the `count_matches` function, which takes a regex and a string, and returns the number of non-overlapping matches in the given string. Can you do it?" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def count_matches(regex, string):\n", " pass # your code goes here\n", "\n", "regex = \"wow\"\n", "\n", "string1 = \"wow wow wow\" # this should be 3\n", "string2 = \"wowow\" # this should be 1\n", "string3 = \"wowowow\" # this should be 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Special characters\n", "\n", "It is time to ramp things up a bit! We have seen some characters that have special meanings, and now I am going to introduce a couple more of those! 
I will start by listing them, and then I'll explain them in more detail:\n", "\n", " - `.` is used to match **any** character, except for a newline\n", " - `^` is used to match at the beginning of the string\n", " - `$` is used to match at the end of the string\n", " - `\\d` is used to match any digit\n", " - `\\w` is used to match any alphanumeric character\n", " - `\\s` is used to match any type of whitespace\n", " - `\\` is used to remove the special meaning of the characters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dot `.`\n", "\n", "The `.` can be used in a regular expression to capture any character that might have been used there, as long as we are still in the same line. That is, the only place where `.` doesn't work is if we changed lines in the text. Imagine the pattern was `d.ck`. Then the pattern would match\n", "\n", "```\n", "\"duck\"\n", "```\n", "\n", "but it would not match\n", "\n", "```\n", "\"d\n", "ck\"\n", "```\n", "\n", "because we changed lines in the middle of the string." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Caret `^`\n", "\n", "If we use a `^` at the beginning of the regular expression, then we only care about matches at the beginning of the string. That is, `^wow` would only match if the string started with \"wow\":" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "None\n" ] } ], "source": [ "regex = \"^wow\"\n", "\n", "print(re.search(regex, \"wow, this is awesome\"))\n", "print(re.search(regex, \"this is awesome, wow\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that `^` inside the character class can also mean \"anything but whatever is in this class\", so the regular expression `[^d]uck` matches **uck** preceded by any single character other than \"d\": it matches \"luck\" and \"truck\", but not \"duck\". 
If the caret `^` appears inside a character class `[]` but it is not the first character, then it has no special meaning and it just stands for the character itself. This means that the regex `[()^{}]` is looking to match any of the characters listed:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n" ] } ], "source": [ "regex = \"[()^{}]\"\n", "print(re.search(regex, \"^\"))\n", "print(re.search(regex, \"(\"))\n", "print(re.search(regex, \"}\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dollar sign `$`\n", "\n", "Contrary to the caret `^`, the dollar sign only matches at the end of the string!" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n", "\n" ] } ], "source": [ "regex = \"wow$\"\n", "\n", "print(re.search(regex, \"wow, this is awesome\"))\n", "print(re.search(regex, \"this is awesome, wow\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combining the `^` with the `$` means we are looking to match the whole string with our pattern. For example `^[a-zA-Z ]*$` checks if our string only contains letters and spaces and nothing else:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "None\n", "None\n" ] } ], "source": [ "regex = \"^[a-zA-Z ]*$\"\n", "\n", "s1 = \"this is a sentence with only letters and spaces\"\n", "s2 = \"this sentence has 1 number\"\n", "s3 = \"this one has punctuation...\"\n", "\n", "print(re.search(regex, s1))\n", "print(re.search(regex, s2))\n", "print(re.search(regex, s3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Character groups `\\d`, `\\w` and `\\s`\n", "\n", "Whenever you see a backslash followed by a letter, that probably means that something _special_ is going on. 
These three special \"characters\" are shorthand notation for some character classes `[]`. For example, the `\\d` is the same as `[0-9]`. The `\\w` represents any alphanumeric character (like letters, numbers and `_`), and `\\s` represents any whitespace character (like the space \" \", the tab, the newline, etc).\n", "\n", "All three of these special characters can be capitalized. If they are, then they mean the exact opposite! So `\\D` means \"anything **except** a digit\", `\\W` means \"anything **except** an alphanumeric character\" and `\\S` means \"anything **except** whitespace characters\"." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['these are some words']\n" ] } ], "source": [ "regex = \"\\D+\"\n", "s = \"these are some words\"\n", "print(re.findall(regex, s))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On top of that, these special characters can be used inside a character class, so for instance `[abc\\d]` would match any digit and the letters \"a\", \"b\" and \"c\". If the caret character `^` is used, then we are excluding whatever the special character refers to. As an example, since `[\\d]` matches any digit, `[^\\d]` matches anything that is not a digit." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The backslash `\\`\n", "\n", "We already saw the backslash being used before letters to give them some special meaning... Well, the backslash before a special character also strips it of its special meaning! So, if you wanted to match a backslash, you could use `\\\\`. If you want to match any of the other special characters we already saw, you could put a `\\` before them, like `\\+` to match a plus sign. 
The next regular expression can be used to match an addition expression like \"16 + 6\"" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "None\n" ] } ], "source": [ "regex = \"[\\d]+ ?\\+ ?[\\d]+\"\n", "add1 = \"16 + 6\"\n", "add2 = \"4325+2\"\n", "add3 = \"4+ 564\"\n", "mult1 = \"56 * 2\"\n", "\n", "print(re.search(regex, add1))\n", "print(re.search(regex, add2))\n", "print(re.search(regex, add3))\n", "print(re.search(regex, mult1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Phone numbers v2\n", "\n", "Now I invite you to take a look at [Phone numbers v1](#Phone-numbers-v1) and rewrite your regular expression to include some new special characters that you didn't know before!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "regex = \"\" # write your regex here\n", "matches = [ # you should be able to match those\n", " \"202-555-0181\",\n", " \"001 202-555-0181\",\n", " \"+1-512-555-0191\"\n", "]\n", "non_matches = [ # for now, none of these should be matched\n", " \"202555-0181\",\n", " \"96-125-3546\",\n", " \"(+1)5125550191\"\n", "]\n", "for s in matches:\n", " print(re.search(regex, s))\n", "for s in non_matches:\n", " print(re.search(regex, s))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Groups\n", "\n", "So far, when we used a regex to match a string we could retrieve the whole information of the match by using the `.group()` function on the match object:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "my nam is\n" ] } ], "source": [ "regex = \"my name? 
is\"\n", "\n", "m = re.search(regex, \"my nam is Virgilio\")\n", "if m is not None:\n", " print(m.group())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Say we are dealing with phone numbers again, and we want to look for phone numbers in a big text. But after that, we also want to extract the country code that each number starts with. How could we do it..? Well, we can use a regex to match the phone numbers, and then use a second regex to extract the country code, right? (Let us just assume that phone numbers are written with the digits all in a sequence, with no spaces or \"-\" separating them.)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The country code is: +351\n", "The country code is: 001\n", "The country code is: +1\n", "The country code is: 0048\n" ] } ], "source": [ "regex_number = \"((00|[+])\\d{1,3}[ -])\\d{8,12}\"\n", "regex_code = \"((00|[+])\\d{1,3})\"\n", "matches = [ # you should be able to match those\n", " \"+351 2025550181\",\n", " \"001 2025550181\",\n", " \"+1-5125550191\",\n", " \"0048 123456789\"\n", "]\n", "\n", "for s in matches:\n", "    m = re.search(regex_number, s) # match the phone number\n", "    if m is not None:\n", "        phone_number = m.group() # extract the phone number\n", "        code = re.search(regex_code, phone_number) # match the country code\n", "        print(\"The country code is: {}\".format(code.group()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But not only is this repetitive, because I just copied the beginning of the `regex_number` into the `regex_code`, but it becomes very cumbersome if I am trying to retrieve several different parts of my match. Because of this, regular expressions offer a feature called _grouping_. 
By grouping parts of the regular expression, you can do things like using the repetition operators on them and **retrieve their information** later on.\n", "\n", "To do grouping, one only needs to use the `()` parentheses. For example, the regex `(ab)+` looks for matches of the form \"ab\", \"abab\", \"ababab\", etc.\n", "\n", "We also used grouping [in the beginning](#Matching-options) to create a regex that matched \"Virgilio\" and \"virgilio\", by writing `(V|v)irgilio`.\n", "\n", "Now off to the part that really matters! We can use grouping to retrieve portions of the matches, and we do that with the `.group()` function! Any set of `()` defines a group, and then we can use the `.group(i)` function to retrieve group `i`. Just note that the 0th group is always the whole match, and the remaining groups are numbered by the position of their opening parenthesis, counting from the left!" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "abc defghi\n", "abc defghi\n", "abc\n", "defghi\n", "fg\n", "('abc', 'defghi', 'fg')\n" ] } ], "source": [ "regex_with_grouping = \"(abc) (de(fg)hi)\"\n", "m = re.search(regex_with_grouping, \"abc defghi jklm n opq\")\n", "print(m.group())\n", "print(m.group(0))\n", "print(m.group(1))\n", "print(m.group(2))\n", "print(m.group(3))\n", "print(m.groups())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that `match.group()` and `match.group(0)` are the same thing. Also note that the function `match.groups()` returns all the groups in a tuple!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Phone numbers v3\n", "\n", "Using what you learned so far, write a regex that matches phone numbers with different country codes. 
Assume the following:\n", "\n", " - The country code starts with either `00` or `+`, followed by one to three digits\n", " - The phone number is between 8 and 12 digits long\n", " - The phone number and country code are separated by a space \" \" or by a hyphen \"-\"\n", " \n", "Have your code look for phone numbers in the string I will provide next, and have it print the different country codes it finds.\n", "\n", "You might want to read what the exact behaviour of `re.findall()` is when the regex has groups in it. You can do that by checking the [documentation of the `re` module](https://docs.python.org/3/library/re.html#re.findall)." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "paragraph = \"\"\"Hello, I am Virgilio and I am from Italy.\n", "If phones were a thing when I was alive, my number would've probably been 0039 3123456789.\n", "I would also love to get a house with 3 floors and something like +1 000 square meters.\n", "Now that we are at it, I can also tell you that the number 0039 3135313531 would have suited Leo da Vinci very well...\n", "And come to think of it, someone told me that Socrates had dibs on +30-2111112222\"\"\"\n", "# you should find 3 phone numbers\n", "# and you should not be fooled by the other numbers that show up in the text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Toy project about regex\n", "\n", "The toy project, which is far from trivial, is to mimic what [I did here](http://mathspp.blogspot.com/2017/11/on-computing-all-patterns-matched-by.html). 
If you follow that link, you will find a piece of code that takes a regular expression and then prints all the strings that the given regex would match.\n", "\n", "I'll just give you a couple of examples of how this works:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "sys.path.append(\"./regex-bin\")\n", "import regexPrinter\n", "\n", "def get_iter(regex):\n", "    return regexPrinter.printRegex(regex).print()\n", "\n", "def printall(regex):\n", "    for poss_match in get_iter(regex):\n", "        print(poss_match)\n", "\n", "regex = \"V|virgilio\"\n", "printall(regex)\n", "print(\"-\"*30)\n", "regex = \"wo+w\"\n", "printall(regex)\n", "print(\"-\"*30)\n", "# notice that for some reason, dumb me used {n:m} instead of {n,m}\n", "# also note that I only implemented {n,m}, and not {n,} nor {,m} nor {n}\n", "# also note that this supports neither \\d nor [0-9]\n", "regex = \"((00|[+])1[ -])?[0123456789]{3:3}\"\n", "printall(regex)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the code is protected against infinite patterns, which are signaled with `...`." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "this is infinite!\n", "this is infinite!!\n", "this is infinite!...!\n" ] } ], "source": [ "printall(\"this is infinite!+\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are completely new to this sort of thing, then this will look completely impossible... but it is not, because I am a normal person and I was able to do it! So if you really want you can also do it! 
In the link you have listed all the functionality I decided to include, which excluded `\\d`, for example.\n", "\n", "I was only able to do this in the way I did because I had gone through some (not all) of the blog posts in [this amazing series](https://ruslanspivak.com/lsbasi-part1/).\n", "\n", "Maybe you can implement a smaller subset of the features without too much trouble? The point of this is that you could only print the strings matched by a regex if you know how regular expressions work. Try starting with only implementing literal matching and the `|` and `?` operators. Can you now include grouping `()` so that `(ab)?` would work as expected? Can you add `[]`? What about `+` and `*`? Or maybe start with `{n,m}` and write `?`, `+` and `*` as `{0,1}`, `{1,}` and `{0,}` respectively.\n", "\n", "You can also postpone this project for a bit, and dig deeper into the world of regex. The next section contains some additional references and some websites with exercises to practice your new knowledge!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further reading\n", "For regular expressions in Python, you can take a look at the [documentation](https://docs.python.org/3/library/re.html) of the `re` module, as well as this [regex HOWTO](https://docs.python.org/3/howto/regex.html).\n", "\n", "Some nice topics to follow up on this would include, but are not limited to:\n", " - Non capturing groups (and named groups for Python)\n", " - Lookaheads (positive, negative, ...)\n", " - Regex compilation and flags (for Python)\n", " - Recursive regular expressions\n", "\n", "[This](https://regexr.com/) interesting website (and [this one](https://regex101.com/) as well) provides an interface for you to type regular expressions and see what they match in a text. The tool also gives you an explanation of what your regular expression is doing.\n", "\n", "---\n", "\n", "I found some interesting websites with exercises on regular expressions. 
[This one](https://regexone.com/lesson/introduction_abcs) has more \"basic\" exercises, each one of them preceded by an explanation of whatever you will need to complete the exercise. I suggest you go through them. [Hackerrank](https://www.hackerrank.com/domains/regex) and [regexplay](http://play.inginf.units.it/#/) also have some interesting exercises, but those require you to log in in some way.\n", "\n", "---\n", "\n", "If you enjoyed this guide and/or it was useful, consider leaving a star in the [Virgilio repository](https://github.com/clone95/Virgilio) and sharing it with your friends!\n", "\n", "This was brought to you by the editor of the [Mathspp Blog](https://mathspp.blogspot.com), [RojerGS](https://github.com/RojerGS)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Suggested solutions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### $\\pi$ lookup (solved)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found the number '9876' at positions (4087, 4091)\n" ] } ], "source": [ "pifile = \"regex-bin/pi.txt\"\n", "regex = \"9876\" # define your regex to look your favourite number up\n", "\n", "with open(pifile, \"r\") as f:\n", " pistr = f.read() # pistr is a string that contains 1M digits of pi\n", " \n", "## search for your number here\n", "m = re.search(regex, pistr)\n", "if m:\n", " print(\"Found the number '{}' at positions {}\".format(regex, m.span()))\n", "else:\n", " print(\"Sorry, the first million digits of pi can't help you with that...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Virgilio or Virgil? 
(solved)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called Virgil or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]\n", "\n", "Virgil is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. Virgil's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which Virgil appears as Dante's guide through Hell and Purgatory.\n" ] } ], "source": [ "paragraphs = \\\n", "\"\"\"Publius Vergilius Maro (Classical Latin: [ˈpuː.blɪ.ʊs wɛrˈɡɪ.lɪ.ʊs ˈma.roː]; traditional dates October 15, 70 BC – September 21, 19 BC[1]), usually called virgilio or Vergil (/ˈvɜːrdʒɪl/) in English, was an ancient Roman poet of the Augustan period. He wrote three of the most famous poems in Latin literature: the Eclogues (or Bucolics), the Georgics, and the epic Aeneid. A number of minor poems, collected in the Appendix Vergiliana, are sometimes attributed to him.[2][3]\n", "\n", "Virgilio is traditionally ranked as one of Rome's greatest poets. His Aeneid has been considered the national epic of ancient Rome since the time of its composition. 
Modeled after Homer's Iliad and Odyssey, the Aeneid follows the Trojan refugee Aeneas as he struggles to fulfill his destiny and reach Italy, where his descendants Romulus and Remus were to found the city of Rome. virgilio's work has had wide and deep influence on Western literature, most notably Dante's Divine Comedy, in which virgilio appears as Dante's guide through Hell and Purgatory.\"\"\"\n", "\n", "regex = \"(V|v)irgilio\"\n", "parsed_str = paragraphs\n", "m = re.search(regex, parsed_str)\n", "while m is not None:\n", " parsed_str = parsed_str[:m.start()] + \"Virgil\" + parsed_str[m.end():]\n", " m = re.search(regex, parsed_str)\n", "\n", "print(parsed_str)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Removing excessive spaces (solved)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "Now it is your turn. I am going to give you this sentence as input, and your job is to fix the whitespace in it. When you are done, save the result in a string named `s`, and check if `s.count()` is equal to 0 or not.\n" ] } ], "source": [ "weird_text = \"Now it is your turn. I am going to give you this sentence as input, and your job is to fix the whitespace in it. 
When you are done, save the result in a string named `s`, and check if `s.count(\"  \")` is equal to 0 or not.\"\n", "regex = \" +\" # put your regex here\n", "# there are several possible solutions, I chose this one\n", "\n", "# substitute the extra whitespace here\n", "s = re.sub(regex, \" \", weird_text)\n", "\n", "# this should print 0: no two consecutive spaces are left\n", "print(s.count(\"  \"))\n", "print(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Phone numbers v1 (solved)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<re.Match object; span=(0, 12), match='202-555-0181'>\n", "<re.Match object; span=(0, 16), match='001 202-555-0181'>\n", "<re.Match object; span=(0, 15), match='+1-512-555-0191'>\n", "None\n", "None\n", "None\n" ] } ], "source": [ "regex = \"((00|[+])1[ -])?[0-9]{3}-[0-9]{3}-[0-9]{4}\" # write your regex here\n", "matches = [ # you should be able to match those\n", "    \"202-555-0181\",\n", "    \"001 202-555-0181\",\n", "    \"+1-512-555-0191\"\n", "]\n", "non_matches = [ # for now, none of these should be matched\n", "    \"202555-0181\",\n", "    \"96-125-3546\",\n", "    \"(+1)5125550191\"\n", "]\n", "for s in matches:\n", "    print(re.search(regex, s))\n", "for s in non_matches:\n", "    print(re.search(regex, s))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `search` with `match` (solved)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "True\n", "True\n", "True\n" ] } ], "source": [ "def my_search(regex, string):\n", "    # try to match at the start of the string, then drop the first character and retry\n", "    while string:\n", "        m = re.match(regex, string)\n", "        if m:\n", "            return True\n", "        string = string[1:]\n", "    # the string is now empty: check if the pattern matches the empty string\n", "    return re.match(regex, string) is not None\n", "\n", "regex = \"[0-9]{2,4}\"\n", "\n", "# your function should be able to match in all these strings\n", "string1 = \"1984 was already some years ago.\"\n", "print(my_search(regex, string1))\n", "string2 = \"There is also a book whose title is 
'1984', but the story isn't set in the year of 1984.\"\n", "print(my_search(regex, string2))\n", "string3 = \"Sometimes people write '84 for short.\"\n", "print(my_search(regex, string3))\n", "\n", "# your function should also match with this regex and this string\n", "regex = \"a*\"\n", "string = \"\"\n", "print(my_search(regex, string))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Count matches with `findall` (solved)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3\n", "1\n", "2\n" ] } ], "source": [ "def count_matches(regex, string):\n", "    # findall returns the non-overlapping matches, so its length is the count\n", "    return len(re.findall(regex, string))\n", "\n", "regex = \"wow\"\n", "\n", "string1 = \"wow wow wow\" # this should be 3\n", "print(count_matches(regex, string1))\n", "string2 = \"wowow\" # this should be 1\n", "print(count_matches(regex, string2))\n", "string3 = \"wowowow\" # this should be 2\n", "print(count_matches(regex, string3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Phone numbers v2 (solved)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<re.Match object; span=(0, 12), match='202-555-0181'>\n", "<re.Match object; span=(0, 16), match='001 202-555-0181'>\n", "<re.Match object; span=(0, 15), match='+1-512-555-0191'>\n", "None\n", "None\n", "None\n" ] } ], "source": [ "# raw string, so that the backslash in \\d reaches the regex engine untouched\n", "regex = r\"((00|[+])1[ -])?\\d{3}-\\d{3}-\\d{4}\" # write your regex here\n", "matches = [ # you should be able to match those\n", "    \"202-555-0181\",\n", "    \"001 202-555-0181\",\n", "    \"+1-512-555-0191\"\n", "]\n", "non_matches = [ # for now, none of these should be matched\n", "    \"202555-0181\",\n", "    \"96-125-3546\",\n", "    \"(+1)5125550191\"\n", "]\n", "for s in matches:\n", "    print(re.search(regex, s))\n", "for s in non_matches:\n", "    print(re.search(regex, s))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Phone numbers v3 (solved)\n", "\n", "For this \"problem\", the natural choice is the `.findall()` function, which looks for all the matches. 
When we do that, we don't get a list of match objects but a list of tuples, where each tuple holds the groups of our regex for one match. This is the behaviour that is [documented for the `re.findall()` function](https://docs.python.org/3/library/re.html#re.findall).\n", "\n", "This is fine, because we only really care about the country code, which is the first element of each tuple, and we can print it easily. If we wanted the match objects, the alternative would be to use the [`re.finditer()`](https://docs.python.org/3/library/re.html#re.finditer) function." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('0039', '00')\n", "('0039', '00')\n", "('+30', '+')\n", "The number '0039 3123456789' has country code: 0039\n", "The number '0039 3135313531' has country code: 0039\n", "The number '+30-2111112222' has country code: +30\n" ] } ], "source": [ "paragraph = \"\"\"Hello, I am Virgilio and I am from Italy.\n", "If phones were a thing when I was alive, my number would've probably been 0039 3123456789.\n", "I would also love to get a house with 3 floors and something like +1 000 square meters.\n", "Now that we are at it, I can also tell you that the number 0039 3135313531 would have suited Leo da Vinci very well...\n", "And come to think of it, someone told me that Socrates had dibs on +30-2111112222\"\"\"\n", "# you should find 3 phone numbers\n", "# and you should not be fooled by the other numbers that show up in the text\n", "\n", "# group 1 is the full country code, group 2 is just its prefix (00 or +)\n", "regex = r\"((00|[+])\\d{1,3})[ -]\\d{8,12}\"\n", "ns = re.findall(regex, paragraph) # find numbers\n", "for n in ns:\n", "    # n is a tuple with the two groups our regex has\n", "    print(n)\n", "\n", "# finditer yields match objects, so we can get both the full match and each group\n", "for n in re.finditer(regex, paragraph):\n", "    print(\"The number '{}' has country code: {}\".format(n.group(), n.group(1)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", 
"language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }