---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Lecture 2.17</h1>

<a href="https://colab.research.google.com/github/arifpucit/data-science/blob/master/Section-2-Basics-of-Python-Programming/Lec-2.17-Regular-Expressions-in-Regex101/Regular-Expressions-regex101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## _Regular Expressions Part-I.ipynb_
https://docs.python.org/3/howto/regex.html#regex-howto

https://docs.python.org/3/library/re.html

<h1 align="center">A Gentle Introduction to Regular Expressions (Regex)</h1> <br><br>

<img align="center" width="800" height="800"  src="images/re.jpeg"  >
<img align="center" width="500" height="500"  src="images/tm.jpg"  >

<br><br><br><br><br><br><br><br><br>

# Learning Agenda
**PART-I:**
1. A gentle introduction to Regular Expressions
2. Overview of Regex Metacharacters, Anchors, Quantifiers, Escape Codes and Grouping Constructs
3. Overview of regex101
4. A Step by Step hands-on practical understanding of REs on regex101.com
5. Practical Use Cases
    - Identify valid phone numbers
    - Identify/locate valid names or city codes
    - Identify valid email addresses
    - Identify valid URLs
6. Substitution and Replacement

<br><br>**PART-II:**

**Lecture 2.18 (Regular Expressions in Python)**

## Wild Card / Meta Characters
Metacharacters (also called wild cards) are characters that do not match themselves literally; instead, they have a special meaning inside a regular expression. Some commonly used metacharacters are listed below:


| Wild Card | Description         
| :-:       |:-------------
| **^**     |Caret symbol specifies that the match must start at the beginning of the string, and in MULTILINE mode also matches immediately after each newline<br>- `^b` will check if the string starts with 'b' such as baba, boss, basic, b, by, etc.<br>- `^si` will check if the string starts with 'si' such as simple, sister, si, etc.
| **$**     |Specifies that the match must occur at the end of the string <br> - `s$` will check for a string that ends with s such as geeks, ends, s, etc.<br>- `ing$` will check for a string that ends with ing such as going, seeing, ing, etc.
| **.**     |Represents a single occurrence of any character except a newline <br> - `a.b` will match a string that contains any one character in place of the dot, such as acb, acbd, abbb, etc.<br> - `..` will check if the string contains at least 2 characters
| **\\**    |Used to drop the special meaning of the character following it, or to introduce a special sequence. <br> - Since dot `(.)` is a metacharacter, to search for a literal dot in a string you have to put a backslash `(\)` just before the dot `(.)` so that it loses its special meaning.
| **[...]** |Matches a single character in the listed set. If a caret is the first character inside the set, it means negation<br>- `[abc]` means match any single character out of this set<br>- `[123]` means match any single digit out of this set<br>- `[a-z]` means match any single lowercase letter<br>- `[0-9]` means match any single digit<br>- `[^0-3]` means any single character except 0, 1, 2, or 3<br>- `[^a-c]` means any single character except a, b, or c<br>- `[0-5][0-9]` will match all the two-digit numbers from 00 to 59<br>- `[0-9A-Fa-f]` will match any hexadecimal digit.<br>- Special characters lose their special meaning inside sets, so `[(+*)]` will match any of the literal characters '(', '+', '*', or ')'.<br>- To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set, so `[()[\]{}]` and `[]()[{}]` will both match any single parenthesis, square bracket, or brace.
| **^[...]**|Matches any character in the set at the beginning of the string
| **[^...]**|Matches any single character that is NOT in the listed set (negation)
| **\|**    |The or symbol works as the OR operator: the regex matches if the pattern before or the pattern after the symbol is present in the string<br>- `a\|b` will match any string that contains a or b such as acd, bcd, abcd, etc.<br>- To match a literal pipe character, precede it with a backslash, or enclose it inside a character class, as in `[\|]`.
| **( )**   |Groups part of a pattern so that a quantifier or alternation applies to the whole group, and captures the text matched by the group

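The table rows above can be tried out quickly outside regex101. The following is a minimal sketch that assumes Python's built-in `re` module (covered in Part-II); the sample strings are taken from the table or made up for illustration, and other regex flavors may differ in small details.

```python
import re

# Anchors: ^ pins the match to the start of the string, $ to the end.
starts_with_si = bool(re.search(r"^si", "sister"))
ends_with_ing = bool(re.search(r"ing$", "going"))

# Dot matches any single character except a newline.
dot_hits = re.findall(r"a.b", "acb abbb a\nb")

# A backslash removes the special meaning of the dot.
literal_dots = re.findall(r"\.", "arifbutt.me")

# Sets and negated sets match exactly one character each.
minutes = re.findall(r"[0-5][0-9]", "07 59 60 99")
not_abc = re.findall(r"[^a-c]", "abcd1")

# Alternation inside a capturing group; search returns the leftmost match.
pet = re.search(r"(cat|bat)", "mat bat cat").group(1)

print(starts_with_si, ends_with_ing)   # True True
print(dot_hits, literal_dots)          # ['acb', 'abb'] ['.']
print(minutes, not_abc, pet)           # ['07', '59'] ['d', '1'] bat
```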
## Quantifiers
- A quantifier metacharacter immediately follows a portion of a regex and indicates how many times that portion must occur for the match to succeed. The quantifiers are `*`, `+`, `?`, `{n}`, `{n,m}`, `{n,}` and `{,m}`. When used alone (without a trailing `?`), the quantifier metacharacters `*`, `+`, and `?` are all greedy, meaning they produce the longest possible match.

| Quantifier | Description         
| :-:       |:-------------
| **\***    |The preceding character/expression is repeated zero or more times
| **+**     |The preceding character/expression is repeated one or more times. <br>- `ab+c` will match the strings abc, abbc, dabc, but will not match ac or abdc, because there is no b in ac, and in abdc the b is followed by d rather than c.
| **?**     |The preceding character/expression is optional (zero or one occurrence). <br>- `ab?c` will match the strings ac, abc, acb, dabc, dac, but will not match abbc because there are two b's. Similarly, it will not match abdc because the b is not followed by c.
| **{n,m}** |The preceding character/expression is repeated from n to m times (both inclusive). <br> - `a{2,4}` will match the strings aaab, baaaac, gaad, but will not match strings like abc or bc, because they contain only one a or no a at all.
| **{n}**   |The preceding character/expression is repeated n times.<br>- `a{6}` will match exactly six 'a' characters, but not five.           
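| **{n,}**  |The preceding character/expression is repeated at least n times 
| **{,m}**  |The preceding character/expression is repeated up to m times, i.e. from zero to m times (this form is accepted by Python's `re`; support in other regex flavors varies)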

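To see the quantifier rows in action, here is a small sketch that again assumes Python's `re` module. The word list reuses the strings from the table, and the variable names are only illustrative.

```python
import re

words = ["ac", "abc", "abbc", "abdc"]

# + needs at least one 'b', ? allows at most one, * allows any number.
plus_hits = [w for w in words if re.search(r"ab+c", w)]
opt_hits = [w for w in words if re.search(r"ab?c", w)]
star_hits = [w for w in words if re.search(r"ab*c", w)]

# {n,m} is greedy: with five a's available it takes the maximum of four.
greedy = re.search(r"a{2,4}", "baaaaac").group()

# {n} requires exactly n repetitions, so five a's cannot satisfy a{6}.
exact = re.fullmatch(r"a{6}", "aaaaa")

print(plus_hits)   # ['abc', 'abbc']
print(opt_hits)    # ['ac', 'abc']
print(star_hits)   # ['ac', 'abc', 'abbc']
print(greedy)      # aaaa
print(exact)       # None
```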
## Escape Codes
- You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. 
- The following list of special sequences isn't complete.

| Code | Description         
| :-:  |:-------------
| **\d** |Matches any decimal digit. This is equivalent to [0-9]                              
| **\D** |Matches any non-digit character. This is equivalent to [^0-9] or [^\d]                           
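| **\s** |Matches any whitespace character. For ASCII text, this is equivalent to [ \t\n\r\f\v]                
| **\S** |Matches any non-whitespace character. For ASCII text, this is equivalent to [^ \t\n\r\f\v] or [^\s]                         
| **\w** |Matches any word character (letter, digit, or underscore). For ASCII text, this is equivalent to [a-zA-Z0-9_]                  
| **\W** |Matches any non-word character. For ASCII text, this is equivalent to [^a-zA-Z0-9_] or [^\w]                  
| **\b** |Matches at a word boundary, i.e. where the specified characters are at the beginning or at the end of a word: r"\bain" matches 'ain' at the start of a word, and r"ain\b" matches 'ain' at the end of a word
| **\B** |Matches where the specified characters are present, but NOT at the beginning or at the end of a word: r"\Bain" matches 'ain' when it is not at the start of a word, and r"ain\B" matches 'ain' when it is not at the end of a word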

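The escape codes above can be checked with a short sketch, assuming Python's `re` module. The sample string is assembled from lines of the practice text further down; the variable names are hypothetical.

```python
import re

sample = "Ha HaHa 321-555-4321"

digits = re.findall(r"\d+", sample)     # runs of decimal digits
words = re.findall(r"\w+", sample)      # runs of word characters
spaces = re.findall(r"\s", sample)      # individual whitespace characters

# \b only matches at a word boundary, so just the standalone "Ha" qualifies.
whole_word = re.findall(r"\bHa\b", sample)

# \B requires a non-boundary, so only the second "Ha" inside "HaHa" qualifies.
inner = [m.start() for m in re.finditer(r"\BHa", sample)]

# \D matches every non-digit, which makes it easy to strip separators.
only_digits = re.sub(r"\D", "", "321-555-4321")

print(digits)       # ['321', '555', '4321']
print(words)        # ['Ha', 'HaHa', '321', '555', '4321']
print(spaces)       # [' ', ' ']
print(whole_word)   # ['Ha']
print(inner)        # [5]
print(only_digits)  # 3215554321
```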
##  Practice Regular Expressions
[Visit regex101](https://regex101.com/)

abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
Ha HaHa
MetaCharacters (Need to be escaped): 
.[{()\^$|?*+
arifbutt.me
321-555-4321
123.555.1234
111#923#9234
cat
mat
bat
0x45
0X4Ad
0x2g3
0x349ABf
0x

Hello World
Mr. Shahzad
Mr Khurram
Ms Aqsa
Mrs. Shaista
Mr. B
Learning is fun

List of Valid Email Addresses
arif@pucit.edu.pk
arif.ds@pu.edu.pk
arifpucit@gmail.com
arif.pucit@pu.edu.pk
first+123.5@example.com
abc%xyz@subdomain.example.com
my_name@example.com
first-last@example.com

List of Invalid Email Addresses
#@%^%#$@#$@#.com
abc.def@mail
abc.def@mail#archive.com
@example.com
arif butt @example.com
khurram#@gmail.com
Abc.example.com

https://www.google.com
http://arifbutt.me
https://youtube.com
https://www.yahoo.com
http://facebook.com

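As a rough illustration of the phone-number and email use cases, the sketch below runs two patterns against a few lines of the practice text, assuming Python's `re` module. Both patterns (`PHONE` and `EMAIL`) are hypothetical examples written only for this sample data: the phone pattern assumes `-` or `.` as the separator, and the email pattern is a deliberate simplification, not a complete validator.

```python
import re

# Hypothetical patterns, tuned only to the practice text in this notebook.
PHONE = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")
EMAIL = re.compile(r"[\w.+%-]+@[\w-]+(\.[\w-]+)+")

phones = ["321-555-4321", "123.555.1234", "111#923#9234"]
emails = [
    "arif@pucit.edu.pk",
    "first+123.5@example.com",
    "abc%xyz@subdomain.example.com",
    "abc.def@mail",          # no dot in the domain part
    "@example.com",          # empty local part
    "khurram#@gmail.com",    # '#' is not allowed by this pattern
]

# fullmatch requires the whole string to fit the pattern.
valid_phones = [p for p in phones if PHONE.fullmatch(p)]
valid_emails = [e for e in emails if EMAIL.fullmatch(e)]

print(valid_phones)
print(valid_emails)
```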
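In [24]:
def myfa(mynumb):
    """Print "Yes" if the binary string mynumb is divisible by 3, else "No"."""
    oddbits = 0   # number of '1' bits at odd indices
    evenbits = 0  # number of '1' bits at even indices
    for i, bit in enumerate(mynumb):
        if bit == '1':
            if i % 2 == 0:
                evenbits += 1
            else:
                oddbits += 1
    # Powers of 2 alternate between 1 and 2 (i.e. -1) modulo 3, so the value
    # is divisible by 3 exactly when the two counts differ by a multiple of 3.
    if abs(oddbits - evenbits) % 3 == 0:
        print("Yes")
    else:
        print("No")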

In [28]:
myfa("1011111")

No


In [23]:
# A bare regex is not valid Python, so it is stored as a raw string; paste the pattern itself into regex101
pattern = r"^(0|(1(01*0)*10*)+)$"

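The regex in the cell above appears intended to match binary strings whose value is divisible by 3, the same property that `myfa` tests. The sketch below, assuming Python's `re` module, cross-checks the two against plain integer arithmetic for the values 0 to 31 (the binary strings listed in the next cell). The helper `divisible_by_3` is a hypothetical variant of `myfa` that returns a boolean instead of printing.

```python
import re

# Pattern copied from the cell above, compiled from a raw string.
DIV3 = re.compile(r"^(0|(1(01*0)*10*)+)$")

def divisible_by_3(bits: str) -> bool:
    """Parity-count check, mirroring myfa() but returning a bool."""
    even = sum(1 for i, b in enumerate(bits) if b == "1" and i % 2 == 0)
    odd = sum(1 for i, b in enumerate(bits) if b == "1" and i % 2 == 1)
    return (even - odd) % 3 == 0

# Binary strings for 0..31, without leading zeros.
samples = [format(n, "b") for n in range(32)]

# Collect any string where the regex disagrees with integer arithmetic.
regex_mismatches = [
    s for s in samples if bool(DIV3.search(s)) != (int(s, 2) % 3 == 0)
]
# Collect any string where the parity check disagrees with integer arithmetic.
parity_mismatches = [
    s for s in samples if divisible_by_3(s) != (int(s, 2) % 3 == 0)
]

print(regex_mismatches)   # []
print(parity_mismatches)  # []
```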
In [None]:
0
1
10
11
100
101
110
111
1000
1001
1010
1011
1100
1101
1110
1111
10000
10001
10010
10011
10100
10101
10110
10111
11000
11001
11010
11011
11100
11101
11110
11111