{ "cells": [ { "cell_type": "markdown", "id": "c0f94bec", "metadata": {}, "source": [ "--- \n", " \n", "\n", "

Department of Data Science

\n", "

Course: Tools and Techniques for Data Science

\n", "\n", "---\n", "

Instructor: Muhammad Arif Butt, Ph.D.

" ] }, { "cell_type": "markdown", "id": "3234ed0d", "metadata": {}, "source": [ "

Lecture 2.18

" ] }, { "cell_type": "markdown", "id": "fe13f1af", "metadata": {}, "source": [ "\"Open" ] }, { "cell_type": "markdown", "id": "9aa4323e", "metadata": {}, "source": [ "## _Regular Expressions Part-II.ipynb_" ] }, { "cell_type": "markdown", "id": "6ad97a3c", "metadata": {}, "source": [ "\n", "" ] }, { "cell_type": "markdown", "id": "4126c633", "metadata": {}, "source": [ "## Learning Agenda\n", "1. Review of Meta Characters, Quantifiers and Escape Codes\n", "2. The Python `re` Module\n", " 1. The `re.compile()` Method\n", " 2. The `re.Pattern.search()` Method\n", " 3. The `re.Pattern.match()` Method\n", " 4. The `re.Pattern.findall()` Method\n", " 5. The `re.Pattern.finditer()` Method\n", "3. Practical Example\n", " 1. Extracting Names\n", " 2. Extracting Date of Births\n", " 3. Extracting Emails and Usernames\n", " 4. Extracting valid Cell phones\n", " 5. Extracting Domain names from URLs\n", "4. Modifying Strings\n", " 1. The `re.Pattern.split()` Method\n", " 2. The `re.Pattern.sub()` Method\n", " 3. The `re.Pattern.subn()` Method" ] }, { "cell_type": "markdown", "id": "58b13156", "metadata": {}, "source": [ "## 1. Review of Meta Characters, Quantifiers and Escape Codes" ] }, { "cell_type": "markdown", "id": "af0832f8", "metadata": {}, "source": [ "### a. Wild Card / Meta Characters\n", "Special characters are characters that do not match themselves as seen but have a special meaning when used in a regular expression. Some commonly used wild cards or meta characters are listed below:\n", "\n", "\n", "| Wild Card | Description \n", "| :-: |:-------------\n", "| **^** |Caret symbol specifies that the match must start at the beginning of the string, and in MULTILINE mode also matches immediately after each newline
- `^b` will check if the string starts with 'b' such as baba, boss, basic, b, by, etc.
- `^si` will check if the string starts with 'si' such as simple, sister, si, etc.\n", "| **$** |Specifies that the match must occur at the end of the string
- `s$` will check for the string that ends with s such as geeks, ends, s, etc.
- `ing$` will check for the string that ends with ing such as going, seeing, ing, etc.\n", "| **.** |Represents a single occurrence of any character except a newline
- `a.b` will check for the string that contains any character at the place of the dot such as acb, acbd, abbb, etc
- `..` will check if the string contains at least 2 characters\n", "| **\\\\** |Used to drop special meaning of a character following it or used to refer to a special character.
- Since dot `(.)` is a metacharacter, to search for a literal dot in a string you have to put a backslash `(\\)` just before the dot `(.)` so that it loses its special meaning. \n", "| **[...]** |Matches a single character in the listed set. If a caret is the first character inside the set, it means negation
- `[abc]` means match any single character out of this set
- `[123]` means match any single digit out of this set
- `[a-z]` means match any single character out of lower case alphabets
- `[0-9]` means match any single digit out of this set
- `[^0-3]` means any character except 0, 1, 2, or 3
- `[^a-c]` means any character except a, b, or c
- `[0-5][0-9]` will match all the two-digit numbers from 00 to 59
- `[0-9A-Fa-f]` will match any hexadecimal digit.
- Special characters lose their special meaning inside sets, so `[(+*)]` will match any of the literal characters '(', '+', '*', or ')'.
- To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set, so `[()[\\]{}]` and `[]()[{}]` will both match a right bracket, as well as a left bracket, braces, and parentheses.\n", "| **^[...]**|Matches any character in the set at the beginning of the string\n", "| **[^...]**|Matches any single character that is NOT in the listed set (negation)\n", "| **\\|** |The or symbol works as the OR operator: it checks whether the pattern before or after the symbol is present in the string
- `a\\|b` will match any string that contains a or b such as acd, bcd, abcd, etc.
- To match a literal '\\|', use `\\|`, or enclose it inside a character class, as in `[\\|]`.\n", "| **( )** |Used to capture and group" ] }, { "cell_type": "markdown", "id": "1a40818f", "metadata": {}, "source": [ "### b. Quantifiers\n", "- A quantifier metacharacter immediately follows a portion of a regular expression and indicates how many times that portion must occur for the match to succeed. The quantifiers are `*`, `+`, `?`, `{m}` and `{m,n}`. When used alone, the quantifier metacharacters `*`, `+`, and `?` are all greedy, meaning they produce the longest possible match. \n", "\n", "| Quantifier | Description \n", "| :-: |:-------------\n", "| **\\*** |The preceding character/expression is repeated zero or more times\n", "| **+** |The preceding character/expression is repeated one or more times
- `ab+c` will be matched for the string abc, abbc, dabc, but will not be matched for ac, abdc because there is no b in ac and b is not followed by c in abdc.\n", "| **?** |The preceding character/expression is optional (zero or one occurrence).
- `ab?c` will be matched for the string ac, abc, acb, dabc, dac but will not be matched for abbc because there are two b's. Similarly, it will not be matched for abdc because b is not followed by c.\n", "| **{n,m}** |The preceding character/expression is repeated from n to m times (both inclusive).
- `a{2,4}` will be matched for the string aaab, baaaac, gaad, but will not be matched for strings like abc, bc because there is only one a or no a in both the cases.\n", "| **{n}** |The preceding character/expression is repeated n times.
- `a{6}` will match exactly six 'a' characters, but not five. \n", "| **{n,}** |The preceding character/expression is repeated at least n times \n", "| **{,m}** |The preceding character/expression is repeated up to m times\n", " \n", " \n", "Note: The repeat characters (`*` and `+`) perform a greedy search to match the largest possible string. However, you can perform a non-greedy search by using `*?` (0 or more characters but non-greedy) or `+?` (1 or more characters but non-greedy)" ] }, { "cell_type": "markdown", "id": "53a7b442", "metadata": {}, "source": [ "### c. Escape Codes\n", "- You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. \n", "- The following list of special sequences isn’t complete.\n", "\n", "| Code | Description \n", "| :-: |:-------------\n", "| **\\d** |Matches any decimal digit. This is equivalent to [0-9] \n", "| **\\D** |Matches any non-digit character. This is equivalent to [^0-9] or [^\\d] \n", "| **\\s** |Matches any whitespace character. This is equivalent to [ \\t\\n\\r\\f\\v] \n", "| **\\S** |Matches any non-whitespace character. This is equivalent to [^ \\t\\n\\r\\f\\v] or [^\\s] \n", "| **\\w** |Matches any alphanumeric character or underscore. This is equivalent to [a-zA-Z0-9_] \n", "| **\\W** |Matches any non-alphanumeric character. This is equivalent to [^a-zA-Z0-9_] or [^\\w] \n", "| **\\b** |Matches where the specified characters are at the beginning or at the end of a word r\"\\bain\" OR r\"ain\\b\"\n", "| **\\B** |Matches where the specified characters are present, but NOT at the beginning or at the end of a word r\"\\Bain\" OR r\"ain\\B\" " ] }, { "cell_type": "markdown", "id": "c0f18cf9", "metadata": {}, "source": [ "## 2. 
The Python `re` Module" ] }, { "cell_type": "code", "execution_count": 11, "id": "a3e77f46", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['A', 'ASCII', 'DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'Match', 'Pattern', 'RegexFlag', 'S', 'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '__version__', '_cache', '_compile', '_compile_repl', '_expand', '_locale', '_pickle', '_special_chars_map', '_subx', 'compile', 'copyreg', 'enum', 'error', 'escape', 'findall', 'finditer', 'fullmatch', 'functools', 'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub', 'subn', 'template']\n" ] } ], "source": [ "import re\n", "print(dir(re))" ] }, { "cell_type": "markdown", "id": "7af8d1dd", "metadata": {}, "source": [ "### a. The `re.compile()` Method\n", "This method compiles a regular expression pattern into a regular expression object, which can then be used for matching through its `match()`, `search()` and other methods.\n", "\n", "**`re.compile(pattern, flags=0)`**\n", "\n", "Where,\n", " - `pattern` is the regular expression that you want to compile in order to search or modify a string, or perhaps a corpus of documents.\n", " - `flags` can be used to modify the expression’s behaviour. Its value can be any of the following constants, combined using bitwise OR (the | operator):\n", " - `IGNORECASE` or `I` to do a case-insensitive search\n", " - `LOCALE` or `L` to perform a locale-aware match\n", " - `MULTILINE` or `M` to do multiline matching, affecting `^` and `$`\n", " - `DOTALL` or `S` to make the '.' special character match any character, including a newline; without this flag, '.' 
will match anything except a newline.\n", "\n", "Once you have a `Pattern` object representing a compiled regular expression, you can use its methods to perform various operations on a string, or perhaps on a corpus of documents:\n", "- `p.match()`: Determines if the RE matches at the beginning of the string.\n", "- `p.search()`: Scans through a string, looking for any location where this RE matches.\n", "- `p.findall()`: Finds all substrings where the RE matches, and returns them as a list.\n", "- `p.finditer()`: Finds all substrings where the RE matches, and returns them as an iterator.\n", "- `p.split()`: Splits the string at the occurrences of the pattern.\n", "- `p.sub()`: Finds the occurrences of the pattern and replaces them.\n", "\n", "**Note:** Creating a regex object with `re.compile` is recommended if you intend to apply the same expression to many strings; doing so can save CPU cycles." ] }, { "cell_type": "code", "execution_count": 12, "id": "1099db63", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "re.compile('[A]+[a-z]*', re.MULTILINE)\n", "<class 're.Pattern'>\n" ] } ], "source": [ "import re\n", "# The regular expression looks for one or more uppercase 'A' characters, followed by \n", "# zero or more lowercase letters, in multi-line mode\n", "p = re.compile(r\"[A]+[a-z]*\", flags=re.M) \n", "\n", "print(p)\n", "print(type(p))" ] }, { "cell_type": "markdown", "id": "dc9ee288", "metadata": {}, "source": [ "Once you have a pattern object, you can use its various methods to search a string." ] }, { "cell_type": "code", "execution_count": null, "id": "f26f0321", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "d724dbd3", "metadata": {}, "source": [ "### b. The `re.Pattern.search()` Method\n", "- Scans through the string looking for the first location where the pattern object `p` produces a match, and returns a corresponding match object. 
Returns None if no position in the string matches the pattern.\n", "\n", "**`p.search(string, pos=0, endpos=9223372036854775807)`**\n", "\n", "- Where,\n", " - `p` is the compiled pattern object\n", " - `string` is the test string in which we want to search\n", " - `pos` and `endpos` can be used to specify the portion of the test string to search" ] }, { "cell_type": "code", "execution_count": 13, "id": "9d427527", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<re.Match object; span=(6, 10), match='Arif'>\n", "<class 're.Match'>\n" ] } ], "source": [ "import re\n", "str1 = \"Rauf, Arif, Jamil and Ahmad are good at playing acrobatic games. AAA is triple As. Arif Butt.\"\n", "# The regular expression looks for one or more uppercase 'A' characters, followed by \n", "# zero or more lowercase letters, in multi-line mode\n", "p = re.compile(r\"[A]+[a-z]*\", flags=re.M) \n", "\n", "match = p.search(str1)\n", "print(match)\n", "print(type(match))\n" ] }, { "cell_type": "code", "execution_count": null, "id": "3e0a5df4", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e7eab265", "metadata": {}, "source": [ "### c. The `re.Pattern.match()` Method\n", "- Looks for the pattern at the beginning of the string and, if found, returns a corresponding match object. 
Returns None if the string does not match the pattern.\n", "\n", "**`p.match(string, pos=0, endpos=9223372036854775807)`**\n", "\n", "- Where,\n", " - `p` is the compiled pattern object\n", " - `string` is the test string in which we want to search\n", " - `pos` and `endpos` can be used to specify the portion of the test string to search\n", "\n", "Note: \n", "- Even in MULTILINE mode, `match()` will only match at the beginning of the string and not at the beginning of each line.\n", "- If you want to locate a match anywhere in the string, use `search()` instead " ] }, { "cell_type": "code", "execution_count": 15, "id": "7da73734", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<re.Match object; span=(0, 4), match='Arif'>\n", "<class 're.Match'>\n" ] } ], "source": [ "import re\n", "str1 = \"Arif, Jamil and Ahmad are good at playing acrobatic games. AAA is triple As. Arif Butt.\"\n", "str2 = \"Mr. Arif, Jamil and Ahmad are good at playing acrobatic games. AAA is triple As. Arif Butt.\"\n", "# The regular expression looks for one or more uppercase 'A' characters, followed by \n", "# zero or more lowercase letters, in multi-line mode\n", "p = re.compile(r\"[A]+[a-z]*\", flags=re.M) \n", "\n", "rv = p.match(str1)\n", "print(rv)\n", "print(type(rv))" ] }, { "cell_type": "code", "execution_count": null, "id": "494543e0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e87a6f71", "metadata": {}, "source": [ "### d. The `re.Pattern.findall()` Method\n", "- The `search()` method only returns the first match and `match()` only matches at the beginning of the string, while `findall()` returns all matches in a string.\n", "- Returns all non-overlapping matches of the pattern in the string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. 
Empty matches are included in the result.\n", "- If pattern `p` does not match, it returns an empty list.\n", "\n", "**`p.findall(string, pos=0, endpos=9223372036854775807)`**\n", "\n", "- Where,\n", " - `p` is the compiled pattern object\n", " - `string` is the test string in which we want to search\n", " - `pos` and `endpos` can be used to specify the portion of the test string to search\n", " \n", "Note: The result depends on the number of capturing groups in the pattern. If there are no groups, it returns a list of strings matching the whole pattern. If there is exactly one group, it returns a list of strings matching that group. If multiple groups are present, it returns a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result." ] }, { "cell_type": "code", "execution_count": 16, "id": "c2109514", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Arif', 'Ahmad', 'AAA', 'As', 'Arif']\n", "<class 'list'>\n" ] } ], "source": [ "import re\n", "str1 = \"Arif, Jamil and Ahmad are good at playing acrobatic games. AAA is triple As. Arif Butt.\"\n", "# The regular expression looks for one or more uppercase 'A' characters, followed by \n", "# zero or more lowercase letters, in multi-line mode\n", "p = re.compile(r\"[A]+[a-z]*\", flags=re.M) \n", "\n", "\n", "rv = p.findall(str1)\n", "print(rv)\n", "print(type(rv))" ] }, { "cell_type": "code", "execution_count": null, "id": "9ac11878", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "daa64e6b", "metadata": {}, "source": [ "### e. The `re.Pattern.finditer()` Method\n", "- Returns an iterator yielding match objects over all non-overlapping matches for the RE pattern in the string. The string is scanned left-to-right, and matches are returned in the order found. 
Empty matches are included in the result.\n", "\n", "**`p.finditer(string, pos=0, endpos=9223372036854775807)`**\n", "\n", "- Where,\n", " - `p` is the compiled pattern object\n", " - `string` is the test string in which we want to search\n", " - `pos` and `endpos` can be used to specify the portion of the test string to search\n", " " ] }, { "cell_type": "code", "execution_count": 17, "id": "83e49cd6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<callable_iterator object at 0x...>\n", "<class 'callable_iterator'>\n" ] } ], "source": [ "import re\n", "str1 = \"Rauf, Arif, Jamil and Ahmad are good at playing acrobatic games. AAA is triple As. Arif Butt.\"\n", "\n", "# The regular expression looks for one or more uppercase 'A' characters, followed by \n", "# zero or more lowercase letters, in multi-line mode\n", "p = re.compile(r\"[A]+[a-z]*\", flags=re.M) \n", "\n", "matches = p.finditer(str1)\n", "print(matches)\n", "print(type(matches))" ] }, { "cell_type": "markdown", "id": "2bbb2991", "metadata": {}, "source": [ ">- **Once we have the iterator of `Match` objects, we can iterate over it using a `for` loop.**\n", ">- **Let us see how many match objects there are in this iterator named `matches`.**" ] }, { "cell_type": "code", "execution_count": 18, "id": "e090d9ca", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<re.Match object; span=(6, 10), match='Arif'>\n", "<re.Match object; span=(22, 27), match='Ahmad'>\n", "<re.Match object; span=(66, 69), match='AAA'>\n", "<re.Match object; span=(80, 82), match='As'>\n", "<re.Match object; span=(84, 88), match='Arif'>\n" ] } ], "source": [ "for m in matches:\n", "    print(m)" ] }, { "cell_type": "code", "execution_count": null, "id": "61861968", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e5186bd2", "metadata": {}, "source": [ ">- **Every match object has many associated methods.**\n", ">- **Let us see different attributes of each match object using these methods.**" ] }, { "cell_type": "markdown", "id": "8ad323a9", "metadata": {}, "source": [ "The **`group()`** method of the match object returns one or more subgroups of the match (if they exist). 
By default, it returns the entire match." ] }, { "cell_type": "code", "execution_count": 22, "id": "bccf0177", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Arif\n", "Ahmad\n", "AAA\n", "As\n", "Arif\n" ] } ], "source": [ "import re\n", "str1 = \"Rauf, Arif, Jamil and Ahmad are good at playing acrobatic games. AAA is triple As. Arif Butt.\"\n", "\n", "p = re.compile(r\"[A]+[a-z]*\") \n", "matches = p.finditer(str1)\n", "\n", "for m in matches:\n", "    print(m.group(0))" ] }, { "cell_type": "markdown", "id": "95d4245d", "metadata": {}, "source": [ "The **`span()`** method of the match object returns a 2-tuple containing the start and end index of the match (end index not inclusive)." ] }, { "cell_type": "code", "execution_count": 23, "id": "1b392b71", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(6, 10)\n", "(22, 27)\n", "(66, 69)\n", "(80, 82)\n", "(84, 88)\n" ] } ], "source": [ "import re\n", "str1 = \"Rauf, Arif, Jamil and Ahmad are good at playing acrobatic games. AAA is triple As. Arif Butt.\"\n", "\n", "p = re.compile(r\"[A]+[a-z]*\") \n", "matches = p.finditer(str1)\n", "\n", "for m in matches:\n", "    print(m.span())" ] }, { "cell_type": "markdown", "id": "a1200ba9", "metadata": {}, "source": [ "The **`start(group=0)`** method of the match object returns the index of the start of the substring matched by the group." ] }, { "cell_type": "code", "execution_count": 24, "id": "8697591e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n", "20\n", "78\n", "82\n" ] } ], "source": [ "import re\n", "str1 = \"Arif, Jamil and Ahmad are good at playing acrobatic games. AAA is triple As. Arif Butt.\"\n", "str2 = \"Mr. Arif, Jamil and Ahmad are good at playing acrobatic games. AAA is triple As. 
Arif Butt.\"\n", "p = re.compile(r\"[A]+[a-z]+\") \n", "matches = p.finditer(str2)\n", "\n", "for m in matches:\n", "    print(m.start())" ] }, { "cell_type": "markdown", "id": "a9cbe56a", "metadata": {}, "source": [ "The `end(group=0)` method of the match object returns the index of the end of the substring matched by the group." ] }, { "cell_type": "code", "execution_count": 25, "id": "bc08495d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8\n", "25\n", "80\n", "86\n" ] } ], "source": [ "import re\n", "str1 = \"Arif, Jamil and Ahmad are good at playing acrobatic games. AAA is triple As. Arif Butt.\"\n", "str2 = \"Mr. Arif, Jamil and Ahmad are good at playing acrobatic games. AAA is triple As. Arif Butt.\"\n", "p = re.compile(r\"[A]+[a-z]+\") \n", "matches = p.finditer(str2)\n", "\n", "for m in matches:\n", "    print(m.end())" ] }, { "cell_type": "code", "execution_count": null, "id": "fb975548", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "de97f34c", "metadata": {}, "source": [ "## 3. Practical Example" ] }, { "cell_type": "markdown", "id": "247c42cb", "metadata": {}, "source": [ "**Read a text file**" ] }, { "cell_type": "code", "execution_count": 27, "id": "561b15b4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mr. Arif Butt\r\n", "615-555-7164\r\n", "131 Model Town, Lahore\r\n", "01-04-1975\r\n", "arifpucit@gmail.com\r\n", "http://www.arifbutt.me\r\n", "\r\n", "\r\n", "Mrs. Nasira Jadoon\r\n", "317.615.9124\r\n", "33 Garden Town, Lahore\r\n", "20/02/1969\r\n", "nasira-123@gmail.com\r\n", "http://nasira.pu.edu.pk\r\n", "\r\n", "\r\n", "\r\n", "Mr. 
Khurram Shahzad\r\n", "321#521#9254\r\n", "69, A Wapda Town, Lahore\r\n", "12.09.1985\r\n", "khurram3@yahoo.com\r\n", "https://www.khurram.pu.edu.pk\r\n", "\r\n", "\r\n", "Ms Aqsa\r\n", "123.555.1997\r\n", "56 Joher Town, Lahore\r\n", "12/08/2001\r\n", "aqsa_007@gmail.com\r\n", "http://youtube.com\r\n", "\r\n", "Mr. B\r\n", "321-555-4321\r\n", "19 Township, Lahore\r\n", "05-07-2002\r\n", "mrB@yahoo.com\r\n", "http://facebook.com" ] } ], "source": [ "! cat datasets/names_addresses.txt" ] }, { "cell_type": "code", "execution_count": 26, "id": "4fdd049e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mr. Arif Butt\n", "615-555-7164\n", "131 Model Town, Lahore\n", "01-04-1975\n", "arifpucit@gmail.com\n", "http://www.arifbutt.me\n", "\n", "\n", "Mrs. Nasira Jadoon\n", "317.615.9124\n", "33 Garden Town, Lahore\n", "20/02/1969\n", "nasira-123@gmail.com\n", "http://nasira.pu.edu.pk\n", "\n", "\n", "\n", "Mr. Khurram Shahzad\n", "321#521#9254\n", "69, A Wapda Town, Lahore\n", "12.09.1985\n", "khurram3@yahoo.com\n", "https://www.khurram.pu.edu.pk\n", "\n", "\n", "Ms Aqsa\n", "123.555.1997\n", "56 Joher Town, Lahore\n", "12/08/2001\n", "aqsa_007@gmail.com\n", "http://youtube.com\n", "\n", "Mr. B\n", "321-555-4321\n", "19 Township, Lahore\n", "05-07-2002\n", "mrB@yahoo.com\n", "http://facebook.com\n" ] } ], "source": [ "with open(\"datasets/names_addresses.txt\", \"r\") as fd:\n", " print(fd.read())" ] }, { "cell_type": "code", "execution_count": null, "id": "06bb08a6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "549d96f8", "metadata": {}, "source": [ "**Let us read the data from the file in a string**" ] }, { "cell_type": "code", "execution_count": 28, "id": "04ec382d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Mr. Arif Butt\\n615-555-7164\\n131 Model Town, Lahore\\n01-04-1975\\narifpucit@gmail.com\\nhttp://www.arifbutt.me\\n\\n\\nMrs. 
Nasira Jadoon\\n317.615.9124\\n33 Garden Town, Lahore\\n20/02/1969\\nnasira-123@gmail.com\\nhttp://nasira.pu.edu.pk\\n\\n\\n\\nMr. Khurram Shahzad\\n321#521#9254\\n69, A Wapda Town, Lahore\\n12.09.1985\\nkhurram3@yahoo.com\\nhttps://www.khurram.pu.edu.pk\\n\\n\\nMs Aqsa\\n123.555.1997\\n56 Joher Town, Lahore\\n12/08/2001\\naqsa_007@gmail.com\\nhttp://youtube.com\\n\\nMr. B\\n321-555-4321\\n19 Township, Lahore\\n05-07-2002\\nmrB@yahoo.com\\nhttp://facebook.com'" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with open(\"datasets/names_addresses.txt\", \"r\") as fd:\n", " teststring = fd.read()\n", "teststring" ] }, { "cell_type": "code", "execution_count": null, "id": "a91390ca", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e63f46d2", "metadata": {}, "source": [ "### a. Extracting Names\n", "- Assume that every name starts with Mr or Ms or Mrs, with an optional dot, a space and then followed by alphanumeric characters" ] }, { "cell_type": "code", "execution_count": null, "id": "1f0ffe74", "metadata": {}, "outputs": [], "source": [ "import re\n", "p = re.compile(r'(Mr|Ms|Mrs)\\.?\\s\\w+') \n", "\n", "#teststring contains data read from file\n", "matches = p.finditer(teststring)\n", "for match in matches:\n", " print(match)" ] }, { "cell_type": "code", "execution_count": null, "id": "5d321152", "metadata": {}, "outputs": [], "source": [ "p = re.compile(r'(Mr|Ms|Mrs)\\.?\\s\\w+') \n", "\n", "#teststring contains data read from file\n", "matches = p.finditer(teststring)\n", "\n", "for match in matches:\n", " print(match.group())\n", " " ] }, { "cell_type": "markdown", "id": "e1e0940e", "metadata": {}, "source": [ "What if we want to get the complete name" ] }, { "cell_type": "code", "execution_count": 31, "id": "2d9d41a0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mr. Arif Butt\n", "Mrs. Nasira Jadoon\n", "Mr. 
Khurram Shahzad\n", "Ms Aqsa\n", "Mr. B\n" ] } ], "source": [ "# A literal space (not \\s) keeps each match on one line; the second name is optional\n", "p = re.compile(r'(Mr|Ms|Mrs)\\.? \\w+(?: [A-Za-z]+)?') \n", "\n", "#teststring contains data read from file\n", "matches = p.finditer(teststring)\n", "\n", "for match in matches:\n", "    print(match.group())\n", " " ] }, { "cell_type": "code", "execution_count": null, "id": "63362e7a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "7e49118e", "metadata": {}, "source": [ "### b. Extracting Date of Births\n", "- Assume that the day, month and year have two, two and four digits respectively, and that they are separated by either a dot, a hyphen or a slash" ] }, { "cell_type": "code", "execution_count": null, "id": "8e6cbbd3", "metadata": {}, "outputs": [], "source": [ "# [./-] matches only the three assumed separators; a bare dot would match any character\n", "p = re.compile(r'\\d{2}[./-]\\d{2}[./-]\\d{4}') \n", "\n", "#teststring contains data read from file\n", "matches = p.finditer(teststring)\n", "\n", "for match in matches:\n", "    print(match.group())\n", " " ] }, { "cell_type": "markdown", "id": "e422df1b", "metadata": {}, "source": [ "What if we just want to get the day, month and year separately?" ] }, { "cell_type": "code", "execution_count": 37, "id": "81694292", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "01 04 1975\n", "20 02 1969\n", "12 09 1985\n", "12 08 2001\n", "05 07 2002\n" ] } ], "source": [ "p = re.compile(r'(\\d{2})[./-](\\d{2})[./-](\\d{4})') \n", "\n", "#teststring contains data read from file\n", "matches = p.finditer(teststring)\n", "\n", "for match in matches:\n", "    # groups 1, 2 and 3 hold the day, month and year\n", "    print(match.group(1), match.group(2), match.group(3))\n", " " ] }, { "cell_type": "code", "execution_count": null, "id": "68686abe", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "244a5e57", "metadata": {}, "source": [ "### c. Extracting Emails and Usernames\n", "**Valid Name Part:**\n", "- Lowercase letters\n", "- Uppercase letters\n", "- Digits: 0123456789\n", "- dot: . 
(not first or last character)\n", "- For simplicity, assume the only other characters allowed are underscore, plus and hyphen\n", "\n", "**Valid Domain Part:**\n", "- Lowercase letters\n", "- Uppercase letters\n", "- Digits: 0123456789\n", "- Hyphen: - (not first or last character)\n", "- Can contain an IP address surrounded by square brackets: test@[192.168.2.4] or test@[IPv6:2018:db8::1]." ] }, { "cell_type": "code", "execution_count": null, "id": "b2e71404", "metadata": {}, "outputs": [], "source": [ "!cat datasets/names_addresses.txt" ] }, { "cell_type": "code", "execution_count": null, "id": "9f64ef19", "metadata": {}, "outputs": [], "source": [ "pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-z]{2,3}')\n", "\n", "#teststring contains data read from file\n", "matches = pattern.finditer(teststring)\n", "\n", "for match in matches:\n", "    print(match.group())" ] }, { "cell_type": "markdown", "id": "46728040", "metadata": {}, "source": [ "What if we want to extract just the usernames?" ] }, { "cell_type": "code", "execution_count": 40, "id": "decac4e4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "arifpucit\n", "nasira-123\n", "khurram3\n", "aqsa_007\n", "mrB\n" ] } ], "source": [ "p = re.compile(r'([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+)(\\.[a-z]{2,3})')\n", "\n", "#teststring contains data read from file\n", "matches = p.finditer(teststring)\n", "\n", "for match in matches:\n", "    # group 1 is the username, i.e. the part before the @\n", "    print(match.group(1))" ] }, { "cell_type": "code", "execution_count": null, "id": "03d3df90", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "20196c7f", "metadata": {}, "source": [ "### d. Extracting valid Cell phones\n", "Assume that every valid phone number consists of 10 digits in three groups of three, three and four digits. 
The three groups are separated by either a `-`, `.` or `/` symbol" ] }, { "cell_type": "code", "execution_count": null, "id": "fc1dc661", "metadata": {}, "outputs": [], "source": [ "p = re.compile(r'\\d{3}[./-]\\d{3}[./-]\\d{4}')\n", "\n", "#teststring contains data read from file\n", "matches = p.finditer(teststring)\n", "\n", "for match in matches:\n", "    print(match.group())\n", " " ] }, { "cell_type": "markdown", "id": "c5445400", "metadata": {}, "source": [ "You can easily extract the city codes, country codes and so on by yourself by creating groups inside your regular expressions." ] }, { "cell_type": "code", "execution_count": null, "id": "e6ef7066", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "adc6b387", "metadata": {}, "source": [ "### e. Extracting Domain names from URLs\n", "- Assume simple URLs, having the protocol either `http://` or `https://`\n", "- Then we have an optional `www.` string\n", "- Then we have a group of characters that make up our domain name, followed by the top-level domain" ] }, { "cell_type": "code", "execution_count": null, "id": "db9ed774", "metadata": {}, "outputs": [], "source": [ "p = re.compile(r'https?://(www\\.)?(\\w+)(\\.\\w+)')\n", "\n", "#teststring contains data read from file\n", "matches = p.finditer(teststring)\n", "\n", "for match in matches:\n", "    print(match.group()) " ] }, { "cell_type": "code", "execution_count": null, "id": "0cfae26b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c95c3ef1", "metadata": {}, "source": [ "Let us extract the top-level domains (TLDs) only." 
] }, { "cell_type": "code", "execution_count": null, "id": "ee4b2933", "metadata": {}, "outputs": [], "source": [ "p = re.compile(r'https?://(www\\.)?(\\w+)(\\.\\w+)')\n", "\n", "#teststring contains data read from file\n", "matches = p.finditer(teststring)\n", "\n", "for match in matches:\n", "    print(match.group(3)) " ] }, { "cell_type": "code", "execution_count": null, "id": "bf17fd81", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "a58adefb", "metadata": {}, "source": [ "## 4. Modifying Strings\n", "- Up to this point, we’ve simply performed searches against a static string. Regular expressions are also commonly used to modify strings in various ways, using the following pattern methods:\n", "\n", " - split(): Split the string into a list, splitting it wherever the RE matches\n", " - sub(): Find all substrings where the RE matches, and replace them with a different string\n", " - subn(): Does the same thing as sub(), but returns the new string and the number of replacements" ] }, { "cell_type": "markdown", "id": "41c31ada", "metadata": {}, "source": [ "### a. The `re.Pattern.split()` Method\n", "- It splits the target string wherever the regular expression pattern matches, and returns the resulting pieces as a list.\n", "\n", "**`p.split(string, maxsplit=0)`**\n", "\n", "- Where,\n", " - `string`: The variable pointing to the target string (i.e., the string we want to split).\n", " - `maxsplit`: The maximum number of splits to perform. If maxsplit is 2, at most two splits occur, and the remainder of the string is returned as the final element of the list.\n", " \n", "\n", "Note: \n", "It’s similar to the `split()` method of strings but provides much more generality in the delimiters that you can split by; string split() only supports splitting by whitespace or by a fixed string." 
] }, { "cell_type": "code", "execution_count": 43, "id": "3ad5ffe3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['My',\n", " 'name',\n", " '',\n", " '',\n", " '',\n", " '',\n", " '',\n", " '',\n", " '',\n", " '',\n", " '',\n", " '',\n", " '',\n", " '',\n", " 'is',\n", " 'Arif',\n", " 'Butt',\n", " 'and',\n", " 'my',\n", " 'lucky',\n", " 'number',\n", " 'is',\n", " '54',\n", " 'and',\n", " '143']" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# defining string (note the run of 13 spaces between 'name' and 'is')\n", "mystring = \"My name             is Arif Butt and my lucky number is 54 and 143\"\n", "# splitting on a single space yields an empty string for every extra space\n", "mylist = mystring.split(sep=' ')\n", "mylist" ] }, { "cell_type": "code", "execution_count": 45, "id": "48b033ea", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['My', 'name', 'is', 'Arif', 'Butt', 'and', 'my', 'lucky', 'number', 'is', '54', 'and', '143']\n" ] } ], "source": [ "# importing required libraries\n", "import re\n", "\n", "# defining string (same string, with the run of spaces between 'name' and 'is')\n", "mystring = \"My name             is Arif Butt and my lucky number is 54 and 143\"\n", "\n", "# \\s+ treats any run of whitespace as a single delimiter\n", "p = re.compile(r\"\\s+\")\n", "\n", "word_list = p.split(mystring)\n", "\n", "print(word_list)" ] }, { "cell_type": "code", "execution_count": null, "id": "a32a2923", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "914f36db", "metadata": {}, "source": [ "The `maxsplit` parameter of `split()` limits how many splits are performed. In simple words, if `maxsplit` is 2, at most two splits are done, and the remainder of the string is returned as the final element of the list. The default value of 0 means there is no limit." 
] }, { "cell_type": "code", "execution_count": 48, "id": "d371245a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['My', 'name', 'is', 'Arif', 'Butt', 'and', 'my', 'lucky', 'number', 'is', '54', 'and', '143']\n" ] } ], "source": [ "import re\n", "\n", "# defining string\n", "mystring = \"My name is Arif Butt and my lucky number is 54 and 143\"\n", "\n", "p = re.compile(r\"\\s+\")\n", "\n", "# maxsplit=0 is the default: no limit, so the string is split at every match\n", "word_list = p.split(mystring, maxsplit=0)\n", "\n", "print(word_list)" ] }, { "cell_type": "markdown", "id": "f6c871a5", "metadata": {}, "source": [ "- The `split()` method of strings allows you to split by whitespace or by a fixed string.\n", "- The regex `split()` method lets you describe the delimiters with a regex pattern, so a single call can split on several different delimiters.\n", "- For example, using the regex `split()` method, we can split a string on either a `comma` or a `space`." ] }, { "cell_type": "markdown", "id": "4a2719d4", "metadata": {}, "source": [ "- Let us split on either a `comma` or a `hyphen`." 
] }, { "cell_type": "code", "execution_count": 49, "id": "16836216", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['12', '45', '78', '85', '17', '89']\n" ] } ], "source": [ "import re\n", "\n", "# defining string\n", "mystring = \"12,45,78,85-17-89\"\n", "\n", "# two equivalent ways to match a hyphen or a comma\n", "p = re.compile(r\"-|,\")   # alternation\n", "p = re.compile(r\"[-,]\")  # character class (this is the pattern used below)\n", "\n", "word_list = p.split(mystring)\n", "\n", "print(word_list)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9fa5ff04", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 50, "id": "9519a1a6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['12', '45', '78', '85', '17', '89', '97', '54']\n" ] } ], "source": [ "import re\n", "\n", "# defining string\n", "mystring = \"12and45, 78and85-17and89-97,54\"\n", "\n", "# split on the word 'and', or on any run of whitespace, commas, and hyphens\n", "p = re.compile(r\"and|[\\s,-]+\")\n", "\n", "word_list = p.split(mystring)\n", "\n", "print(word_list)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0435dc2a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "f1463696", "metadata": {}, "source": [ "### b. The `re.Pattern.sub()` and `re.Pattern.subn()` Methods\n", "- Python's `re` module offers the `sub()` and `subn()` methods to search for and replace patterns in a string. Using these methods, we can replace one or more occurrences of a regex pattern in the target string with a substitute string.\n", "\n", "- The `sub()` method returns the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in `string` with the replacement `repl`.\n", "\n", "**`p.sub(repl, string, count=0)`**\n", "\n", "- Where,\n", " - `repl`: The replacement that we are going to insert for each occurrence of a pattern. 
The replacement can be a string or a function.\n", " - `string`: The target string (the string in which we want to perform the replacement).\n", " - `count`: The maximum number of occurrences to replace. The default value of zero means find and replace all occurrences of the pattern. With `count=n`, only the first n occurrences of the pattern are replaced.\n", "\n", "- It returns the string obtained by replacing the pattern occurrences in the string with the replacement string. If the pattern isn’t found, the string is returned unchanged.\n", "- The `subn()` method does the same thing as `sub()`, but returns both the new string and the number of replacements made." ] }, { "cell_type": "markdown", "id": "eb6fea21", "metadata": {}, "source": [ "Replace each whitespace character with an underscore" ] }, { "cell_type": "code", "execution_count": 52, "id": "b757f36e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Learning_is_fun_with_Arif_Butt\n" ] } ], "source": [ "import re\n", "\n", "# defining string\n", "mystring = \"Learning is fun with Arif Butt\"\n", "\n", "# \\s matches a single whitespace character\n", "p = re.compile(r\"\\s\")\n", "\n", "result = p.sub(\"_\", mystring)\n", "\n", "print(result)" ] }, { "cell_type": "code", "execution_count": null, "id": "211ed06a", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "bbaaa1b7", "metadata": {}, "source": [ "Remove all whitespace from a string" ] }, { "cell_type": "code", "execution_count": 53, "id": "60288c16", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LearningisfunwithArifButt\n" ] } ], "source": [ "import re\n", "\n", "# defining string\n", "mystring = \"Learning is fun with Arif Butt\"\n", "\n", "# \\s+ matches one or more consecutive whitespace characters\n", "p = re.compile(r\"\\s+\")\n", "\n", "result = p.sub(\"\", mystring)\n", "\n", "print(result)" ] }, { "cell_type": "code", "execution_count": null, "id": "9e4a368b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "0c220e65", "metadata": {}, "source": [ "Remove leading spaces from a string" ] }, { "cell_type": "code", "execution_count": 54, 
"id": "f5c26a56", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Learning is fun with Arif Butt\n" ] } ], "source": [ "import re\n", "\n", "# defining string\n", "mystring = \" Learning is fun with Arif Butt\"\n", "\n", "# ^\\s+ remove only leading spaces\n", "# caret (^) matches only at the start of the string\n", "p = re.compile(r\"^\\s+\")\n", "\n", "word_list = p.sub(\"\", mystring)\n", "\n", "print(word_list)" ] }, { "cell_type": "code", "execution_count": null, "id": "76da9f55", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "55865bc9", "metadata": {}, "source": [ "Remove both leading and trailing spaces" ] }, { "cell_type": "code", "execution_count": 55, "id": "2fb34ee4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Learning is fun with Arif Butt \t.\n" ] } ], "source": [ "import re\n", "\n", "# defining string\n", "mystring = \" Learning is fun with Arif Butt \\t. \"\n", "\n", "# ^\\s+ remove leading spaces\n", "# ^\\s+$ removes trailing spaces\n", "p = re.compile(r\"^\\s+|\\s+$\")\n", "\n", "word_list = p.sub(\"\", mystring)\n", "\n", "print(word_list)" ] }, { "cell_type": "code", "execution_count": 60, "id": "7d045434", "metadata": {}, "outputs": [], "source": [ "string1 = 'a aa ab bc bb abc abcd ba aaaa'\n", "p = re.compile(\"[abc]\")\n", "matches = p.finditer(string1)" ] }, { "cell_type": "code", "execution_count": 61, "id": "2916ae03", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] } ], "source": [ "for m in matches:\n", " print(m)" ] }, { "cell_type": "code", "execution_count": null, "id": "9e4a6ae6", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { 
"codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }