{ "cells": [ { "cell_type": "markdown", "id": "c0f94bec", "metadata": {}, "source": [ "--- \n", " \n", "\n", "

Department of Data Science

\n", "

Course: Tools and Techniques for Data Science

\n", "\n", "---\n", "

Instructor: Muhammad Arif Butt, Ph.D.

" ] }, { "cell_type": "markdown", "id": "3234ed0d", "metadata": {}, "source": [ "

Lecture 2.17

" ] }, { "cell_type": "markdown", "id": "07ab6ff1", "metadata": {}, "source": [ "\"Open" ] }, { "cell_type": "markdown", "id": "9aa4323e", "metadata": {}, "source": [ "## _Regular Expressions Part-I.ipynb_\n", "https://docs.python.org/3/howto/regex.html#regex-howto\n", "\n", "https://docs.python.org/3/library/re.html" ] }, { "cell_type": "code", "execution_count": null, "id": "610fe665", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "794e0410", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "265bc1ec", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "aa38c930", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "6ad97a3c", "metadata": {}, "source": [ "

A Gentle Introduction to Regular Expressions (Regex)
" ] }, { "cell_type": "markdown", "id": "99cbfc89", "metadata": {}, "source": [ "# Learning Agenda\n", "**PART-I:**\n", "1. A gentle introduction to Regular Expressions\n", "2. Overview of Regex Metacharacters, Anchors, Quantifiers, Escape Codes and Grouping Constructs\n", "3. Overview of regex101\n", "4. A Step by Step hands-on practical understanding of REs on regex101.com\n", "5. Practical Use Cases\n", " - Identify valid phone numbers\n", " - Identify/locate valid names or city codes\n", " - Identify valid email addresses\n", " - Identify valid URLs\n", "6. Substitution and Replacement\n", "\n", "

**PART-II:**\n", "\n", "**Lecture 2.18 (Regular Expressions in Python)**" ] }, { "cell_type": "markdown", "id": "e6bc917e", "metadata": {}, "source": [ "## Wild Card / Meta Characters\n", "Special characters are characters that do not match themselves as seen but have a special meaning when used in a regular expression. Some commonly used wild cards or meta characters are listed below:\n", "\n", "\n", "| Wild Card | Description \n", "| :-: |:-------------\n", "| **^** |Caret symbol specifies that the match must start at the beginning of the string, and in MULTILINE mode also matches immediately after each newline
- `^b` will check if the string starts with 'b' such as baba, boss, basic, b, by, etc.
- `^si` will check if the string starts with 'si' such as simple, sister, si, etc.\n", "| **$** |Specifies that the match must occur at the end of the string
- `s$` will check for the string that ends with 's' such as geeks, ends, s, etc.
- `ing$` will check for the string that ends with 'ing' such as going, seeing, ing, etc.\n", "| **.** |Represents a single occurrence of any character except a newline
- `a.b` will check for the string that contains any character at the place of the dot such as acb, acbd, abbb, etc
- `..` will check if the string contains at least 2 characters\n", "| **\\\\** |Used to drop special meaning of a character following it or used to refer to a special character.
- Since the dot `(.)` is a metacharacter, to search for a literal dot in a string you must place a backslash `(\\)` just before the dot `(.)` so that it loses its special meaning. \n", "| **[...]** |Matches a single character in the listed set. If a caret is the first character inside the brackets, the set is negated
- `[abc]` means match any single character out of this set
- `[123]` means match any single digit out of this set
- `[a-z]` means match any single character out of lower case alphabets
- `[0-9]` means match any single digit out of this set
- `[^0-3]` means any character except the digits 0, 1, 2, or 3
- `[^a-c]` means any character except a, b, or c
- `[0-5][0-9]` will match all the two-digit numbers from 00 to 59
- `[0-9A-Fa-f]` will match any hexadecimal digit.
- Special characters lose their special meaning inside sets, so `[(+*)]` will match any of the literal characters '(', '+', '*', or ')'.
- To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set, so `[()[\\]{}]` and `[]()[{}]` will both match any single parenthesis, square bracket, or brace.\n", "| **^[...]**|Matches any character in the set at the beginning of the string\n", "| **[^...]**|Matches any character NOT in the listed set (negation)\n", "| **\\|** |The pipe symbol works as the OR operator: the match succeeds if the pattern before it or the pattern after it is present in the string
- `a\\|b` will match any string that contains a or b such as acd, bcd, abcd, etc.
- To match a literal '\\|', use `\\|`, or enclose it inside a character class, as in `[\\|]`.\n", "| **( )** |Used to capture and group" ] }, { "cell_type": "markdown", "id": "f0cfcb57", "metadata": {}, "source": [ "## Quantifiers\n", "- A quantifier metacharacter immediately follows a portion of a regular expression and indicates how many times that portion must occur for the match to succeed. The quantifiers are *, +, ?, {m}, and {m,n}. When used alone, the quantifier metacharacters *, +, and ? are all greedy, meaning they produce the longest possible match. \n", "\n", "| Wild Card | Description \n", "| :-: |:-------------\n", "| **\\*** |The preceding character/expression is repeated zero or more times\n", "| **+** |The preceding character/expression is repeated one or more times.
- `ab+c` will be matched for the string abc, abbc, dabc, but will not be matched for ac, abdc because there is no b in ac and b is not followed by c in abdc.\n", "| **?** |The preceding character/expression is optional (zero or one occurrence).
- `ab?c` will be matched for the string ac, abc, acb, dabc, dac but will not be matched for abbc because there are two b's. Similarly, it will not be matched for abdc because b is not followed by c.\n", "| **{n,m}** |The preceding character/expression is repeated from n to m times (both inclusive).
- `a{2,4}` will be matched for the string aaab, baaaac, gaad, but will not be matched for strings like abc, bc because abc has only one a and bc has none.\n", "| **{n}** |The preceding character/expression is repeated exactly n times.
- `a{6}` will match exactly six 'a' characters, but not five. \n", "| **{n,}** |The preceding character/expression is repeated at least n times \n", "| **{,m}** |The preceding character/expression is repeated up to m times" ] }, { "cell_type": "code", "execution_count": null, "id": "68e10fe6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "3cb10f31", "metadata": {}, "source": [ "## Escape Codes\n", "- You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. \n", "- The following list of special sequences isn’t complete.\n", "\n", "| Code | Description \n", "| :-: |:-------------\n", "| **\d** |Matches any decimal digit. This is equivalent to [0-9] \n", "| **\D** |Matches any non-digit character. This is equivalent to [^0-9] or [^\d] \n", "| **\s** |Matches any whitespace character. This is equivalent to [ \t\n\r\f\v] \n", "| **\S** |Matches any non-whitespace character. This is equivalent to [^ \t\n\r\f\v] or [^\s] \n", "| **\w** |Matches any alphanumeric character or underscore. This is equivalent to [a-zA-Z0-9_] \n", "| **\W** |Matches any non-alphanumeric character. 
This is equivalent to [^a-zA-Z0-9_] or [^\\w] \n", "| **\\b** |Matches where the specified characters are at the beginning or at the end of a word, e.g. r\"\\bain\" OR r\"ain\\b\"\n", "| **\\B** |Matches where the specified characters are present, but NOT at the beginning or at the end of a word, e.g. r\"\\Bain\" OR r\"ain\\B\" " ] }, { "cell_type": "code", "execution_count": null, "id": "793f5d84", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "7d5b31e5", "metadata": {}, "source": [ "## Practice Regular Expressions\n", "[Visit regex101](https://regex101.com/)" ] }, { "cell_type": "code", "execution_count": null, "id": "ad12e36f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "e05e9f85", "metadata": {}, "source": [ "abcdefghijklmnopqrstuvwxyz\n", "ABCDEFGHIJKLMNOPQRSTUVWXYZ\n", "1234567890\n", "Ha HaHa\n", "MetaCharacters (Need to be escaped): \n", ".[{()\\^$|?*+\n", "arifbutt.me\n", "321-555-4321\n", "123.555.1234\n", "111#923#9234\n", "cat\n", "mat\n", "bat\n", "0x45\n", "0X4Ad\n", "0x2g3\n", "0x349ABf\n", "0x" ] }, { "cell_type": "markdown", "id": "a0cc2851", "metadata": {}, "source": [ "Hello World\n", "Mr. Shahzad\n", "Mr Khurram\n", "Ms Aqsa\n", "Mrs. Shaista\n", "Mr. 
B\n", "Learning is fun" ] }, { "cell_type": "markdown", "id": "c54e1d32", "metadata": {}, "source": [ "List of Valid Email Addresses\n", "arif@pucit.edu.pk\n", "arif.ds@pu.edu.pk\n", "arifpucit@gmail.com\n", "arif.pucit@pu.edu.pk\n", "first+123.5@example.com\n", "abc%xyz@subdomain.example.com\n", "my_name@example.com\n", "first-last@example.com\n", "\n", "List of Invalid Email Addresses\n", "#@%^%#$@#$@#.com\n", "abc.def@mail\n", "abc.def@mail#archive.com\n", "@example.com\n", "arif butt @example.com\n", "khurram#@gmail.com\n", "Abc.example.com" ] }, { "cell_type": "markdown", "id": "a098f5e7", "metadata": {}, "source": [ "https://www.google.com\n", "http://arifbutt.me\n", "https://youtube.com\n", "https://www.yahoo.com\n", "http://facebook.com" ] }, { "cell_type": "code", "execution_count": null, "id": "d0846c43", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "de40e1d0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "053be69e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "945de166", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "6513a159", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "0cb557a1", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "2af94d61", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "622aae0e", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "80f77f75", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 24, "id": "7ace107a", "metadata": {}, "outputs": [], "source": [ "def myfa(mynumb):\n", "    # Count the '1' bits at even and at odd positions of the binary string\n", "    oddbits = 0\n", "    evenbits = 0\n", "    for i in range(len(mynumb)):\n", "        if (mynumb[i] == '1'):\n", "            if (i % 2 == 0):\n", "                evenbits += 1\n", "            else:\n", "                oddbits += 1\n", "    # The value is a multiple of 3 exactly when the two counts differ by a multiple of 3\n", "    if (abs(oddbits - evenbits) % 3 == 0):\n", "        print(\"Yes\")\n", "    else:\n", "        print(\"No\")" ] }, { "cell_type": "code", "execution_count": 28, "id": "e04b9a85", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No\n" ] } ], "source": [ "myfa(\"1011111\")" ] }, { "cell_type": "code", "execution_count": null, "id": "ca0a3a6d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "360a2ad0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "b914b6e8", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "7364f17b", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "14fb314d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "1083b3b0", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "31ab2ef3", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "497da2b6", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "fef81076", "metadata": {}, "outputs": [], "source": [ "import re\n", "pattern = re.compile(r'^(0|(1(01*0)*10*)+)$')  # matches binary multiples of 3 written without leading zeros" ] }, { "cell_type": "code", "execution_count": null, "id": "896d5e4e", "metadata": {}, "outputs": [], "source": [ "0\n", "1\n", "10\n", "11\n", "100\n", "101\n", 
"110\n", "111\n", "1000\n", "1001\n", "1010\n", "1011\n", "1100\n", "1101\n", "1110\n", "1111\n", "10000\n", "10001\n", "10010\n", "10011\n", "10100\n", "10101\n", "10110\n", "10111\n", "11000\n", "11001\n", "11010\n", "11011\n", "11100\n", "11101\n", "11110\n", "11111" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }