{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<img src=\"../../img/ods_stickers.jpg\" />\n",
    "    \n",
    "## Open Machine Learning Course\n",
    "    \n",
    "The Author of the material: Aditya Soni, the nickname in the ODS @ecdrid. This notebook serves as a very short glimpse from this [website](https://www.regular-expressions.info) primarily."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <center> A Tutorial On Understanding ([Rr]ege)(x|xp|xes|xps|xen)</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://i.imgur.com/bYwl7Vf.png\" alt=\"Learn Regex\">\n",
    "\n",
    "\n",
    "\n",
    "A common workflow with regular expressions is that you write a pattern for the thing you are looking for..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "# Let's see our very first expression\n",
    "\n",
    "- **<\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b>**  It's a complex pattern as it includes lots of things like \n",
    "    - Character Class\n",
    "    - Alphabets\n",
    "    - Percentage\n",
    "    - Numbers\n",
    "    - Underscores\n",
    "    - $\\{\\}$, word Boundaries etc...\n",
    "    \n",
    "> In short  - **this pattern describes an email address; With the above regex pattern, we can search through a text file to find email addresses, or verify if a given string looks like an email address..**\n",
    "\n",
    "The most basic regex pattern in a token like just an $<b>$ i.e a single literal character. In the string **\" Zebra is an animal.\"**, this will match the very first $b$ in the **Ze$b$** Note that it doesn't matter whether it's present in the middle of the word as of now..\n",
    "\n",
    "Now let me introduce few very basics things used in $<regex>$ to define itself (remeber the e-mail address pattern above, now we will break into piece by piece..)\n",
    "\n",
    "In the regex discussed in this tutorial, there are **11** characters with special meanings:\n",
    "the opening square bracket $<[>$, the backslash, the caret <^>, the dollar sign <$>, the period or dot <.>, the\n",
    "vertical bar or pipe symbol <|>, the question mark <?>, the asterisk or star <*>, the plus sign <+>, the opening\n",
    "round bracket <(> and the closing round bracket <)>. These special characters are often called **“metacharacters”**.\n",
    "\n",
    "|Meta character|Description|\n",
    "|:----:|----|\n",
    "|**.**|<b>Period matches any single character except a line break.<b>|\n",
    "|**[ ]**|<b>Character class. Matches any character contained between the square brackets.<b>|\n",
    "|**[^ ]**|<b>Negated character class. Matches any character that is not contained between the square brackets <b>.|\n",
    "|*****|<b>Matches 0 or more repetitions of the preceding symbol.<b>|\n",
    "|**+**|<b>Matches 1 or more repetitions of the preceding symbol.<b>|\n",
    "|**?**|<b>Makes the preceding symbol optional.<b>|\n",
    "|**{n,m}**|<b>Braces. Matches at least \"n\" but not more than \"m\" repetitions of the preceding symbol.<b>|\n",
    "|**(xyz)**|<b>Character group. Matches the characters xyz in that exact order.<b>|\n",
    "|**&#124;**|<b>Alternation. Matches either the characters before or the characters after the symbol.<b>|\n",
    "|**&#92;**|<b>Escapes the next character. This allows you to match reserved characters `[ ] ( ) { } . * + ? ^ $ \\`.<b>| \n",
    "|**^**|<b>Matches the beginning of the input.<b>|\n",
    "|**$**|<b>Matches the end of the input.<b>|\n",
    "\n",
    "\n",
    "Read More [here](http://www.rexegg.com/regex-quickstart.html#chars) and [here](http://www.greenend.org.uk/rjk/tech/regexp.html). Both are Very Very Good...\n",
    "\n",
    "- Example  - If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If\n",
    "you want to match **<1+1=2>**, the correct regex is $1\\+1=2$. Otherwise, the plus sign will have a special meaning. **Note** that **<1+1=2>**, with the *backslash omitted*, is a **valid** regex. So you will **not** get an error message. But it\n",
    "will not match **<1+1=2>**. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "# The Regex-Directed Engine Always Returns the Left-most Match\n",
    "This is a very important point to understand: a regex-directed engine will always return the leftmost match,\n",
    "even if a **better** match could be found later. When applying a regex to a string, the engine will start at the\n",
    "first character of the string. It will try all possible permutations of the regular expression at the first character.\n",
    "Only if all possibilities have been tried and found to fail, will the engine continue with the second character in\n",
    "the text. Again, it will try all possible permutations of the regex, in exactly the same order. The result is that\n",
    "the regex-directed engine will return the leftmost match.\n",
    "- When applying **<cat\\>** to **He captured a catfish for his cat.**, the engine will try to match the first\n",
    "token in the regex **<c\\>** to the first character in the match **H**. This fails. There are no other possible\n",
    "permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine\n",
    "tries to match the **<c\\>** with the **e**. This fails too, as does matching the **c** with the space. Arriving at the 4th\n",
    "character in the match, **<c\\>** matches **c**. The engine will then try to match the second token **<a\\>** to the 5th\n",
    "character, **a**. This succeeds too. But then, **<t\\>** fails to match **p**. At that point, the engine knows the regex\n",
    "cannot be matched starting at the 4th character in the match. So it will continue with the 5th: **a**. Again, **<c\\>**\n",
    "fails to match here and the engine carries on. At the 15th character in the match, **<c\\>** again matches **c**. The\n",
    "engine then proceeds to attempt to match the remainder of the regex at character 15 and finds that **<a\\>**\n",
    "matches **a** and **<t\\>** matches **t**.\n",
    "\n",
    "- The entire regular expression could be matched starting at character 15. The engine is **\"eager\"** to report a\n",
    "match. **It will therefore report the first three letters of catfish as a valid match**. The engine **never** proceeds\n",
    "beyond this point to see if there are any **better** matches. The *first match* is considered good enough. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regex's Fundamentals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Character Sets/Classes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Character sets are also called character class. **Square brackets** are used to specify character sets. Use a **hyphen** inside a character set to specify the characters' range. The order of the character range inside square brackets doesn't matter. For example, the regular expression `[Tt]he` means: `an uppercase T or lowercase t, followed by the letter h, followed by the letter e.`\n",
    "\n",
    "- **<[Tt]he>** => <font color=red>The</font> car parked in <font color=red>the</font> garage.\n",
    "\n",
    "A period inside a character set, however, means a literal period. The regular expression **<ar[.]>** means: a lowercase character a, followed by letter r, followed by a period **.** character.\n",
    "\n",
    "- **<ar[.]>** => A garage is a good place to park a c<font color=red>ar.</font>\n",
    "\n",
    "- **<[0-9]>** => Matches a **single digit between 0 and 9**. You can use more than one range.\n",
    "- **<[0-9a-fA-F]>** => Matches a **single hexadecimal digit**, case insensitively. \n",
    "- You can combine ranges and single characters. **<[0-9a-fxA-FX]>** matches a hexadecimal digit or the letter X.* Again, the order of the characters and the ranges does not matter.*\n",
    "- Find a word, even if it is misspelled, such as **<sep[ae]r[ae]te>** or **<li[cs]en[cs]e>**. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Negated Character Sets/Classes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Typing a **caret(^)** after the opening square bracket will negate the character class. **The result is that the character\n",
    "class will match any character that is <font color = red>not </font> in the character class.**\n",
    "- It is important to remember that a negated character class **still must match a character**. **<q[^u]>** does not\n",
    "mean: **<font color= red> a q not followed by a u </font>**. It means: **<font color= red a q followed by a character that is not a u </font>**. It will **not** match the\n",
    "$q$ in the string $Iraq$. It will match the $q$ and $the space$ after the $q$ in **Iraq is a country**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "###  Shorthand Character Sets\n",
    "\n",
    "Regular expression provides **shorthands** for the commonly used character sets,\n",
    "which offer **convenient shorthands** for commonly used regular expressions. The\n",
    "shorthand character sets are as follows:\n",
    "\n",
    "|Shorthand|Description|\n",
    "|:----:|----|\n",
    "|<b>.<b>|<b>Any character except new line. It's the most commonly misused metacharacter.<b>|\n",
    "|<b>\\w<b>|<b>Matches alphanumeric characters: `[a-zA-Z0-9_]`<b>|\n",
    "|<b>\\W<b>|<b>Matches non-alphanumeric characters: `[^\\w]`<b>|\n",
    "|<b>\\d<b>|<b>Matches digit: `[0-9]`<b>|\n",
    "|<b>\\D<b>|<b>Matches non-digit: `[^\\d]`<b>|\n",
    "|<b>\\s<b>|<b>Matches whitespace character: `[\\t\\n\\f\\r\\p{Z}]`<b>|\n",
    "|<b>\\S<b>|<b>Matches non-whitespace character: `[^\\s]`<b>|"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Repetitions\n",
    "\n",
    "Following meta characters `+`, `*` or `?` are used to specify how many times a\n",
    "subpattern can occur. These meta characters act differently in different\n",
    "situations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "### The Star *\n",
    "\n",
    "The symbol `*` matches zero or more repetitions of the preceding matcher. The\n",
    "regular expression `a*` means: zero or more repetitions of preceding lowercase\n",
    "character `a`. But if it appears after a character set or class then it finds\n",
    "the repetitions of the whole character set. \n",
    "For example, the regular expression\n",
    "- `[a-z]*` means: any number of lowercase letters in a row.\n",
    "\n",
    "The `*` symbol can be used with the meta character `.` to match any string of\n",
    "characters `.*`. The `*` symbol can be used with the whitespace character `\\s`\n",
    "to match a string of whitespace characters. For example, the expression\n",
    "`\\s*cat\\s*` means: zero or more spaces, followed by lowercase character `c`,\n",
    "followed by lowercase character `a`, followed by lowercase character `t`,\n",
    "followed by zero or more spaces."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "### The Plus +\n",
    "\n",
    "The symbol `+` matches one or more repetitions of the preceding character. For\n",
    "example, the regular expression `c.+t` means: lowercase letter `c`, followed by\n",
    "at least one character, followed by the lowercase character `t`. It needs to be\n",
    "clarified that `t` is the last `t` in the sentence.\n",
    "\n",
    "- **<c.+t>** => The fat <font color='red'> cat sat on the mat</font>."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "### The Question Mark ?\n",
    "In regular expression the meta character `?` makes the preceding character optional. This symbol matches zero or one instance of the preceding character. For example, the regular expression `[T]?he` means: `Optional the uppercase letter T, followed by the lowercase character h, followed by the lowercase character e.`\n",
    "\n",
    "- **<[Tt]he>** => <font color=red>The</font> car parked in <font color=red>the</font> garage."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The Lazy Star *? \n",
    "\n",
    "Repeats the previous item zero or more times. Lazy, so the engine first attempts to skip the\n",
    "previous item, before trying permutations with ever increasing matches of the preceding\n",
    "item. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "|Regex|Means|\n",
    "|:----:|----|\n",
    "|abc+|        matches a string that has ab followed by one or more c|\n",
    "|abc?|       matches a string that has ab followed by zero or one c|\n",
    "|abc{2}|      matches a string that has ab followed by 2 c|\n",
    "|abc{2,}|    matches a string that has ab followed by 2 or more c|\n",
    "|abc{2,5}|    matches a string that has ab followed by 2 up to 5 c|\n",
    "|a(bc)\\*|     matches a string that has a followed by zero or more copies of the sequence bc|\n",
    "|a(bc){2,5}|  matches a string that has a followed by 2 up to 5 copies of the sequence bc|\n",
    "|**<.+>**| matches `<div>simple div</div>`|"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Full stop or Period or dot **.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In regular expressions, the dot or period is one of the most commonly used metacharacters. Unfortunately, it\n",
    "is also the most commonly misused metacharacter. The dot is short for the negated character class **<[^\\n]>** (UNIX regex flavors) or\n",
    "**<[^\\r\\n]>** (Windows regex flavors).\n",
    "\n",
    "<font color = 'Red'> <b> Use The Dot Sparingly <b></font>\n",
    "- The dot is a **very powerful** regex metacharacter. It allows you to be **lazy**. `Put in a dot, and everything will\n",
    "match just fine when you test the regex on valid data. The problem is that the regex will also match in cases\n",
    "where it should not match..`\n",
    "    \n",
    "Example - Let’s say we want to match a date in `mm/dd/yy` format, but we\n",
    "want to leave the user the choice of date separators. The quick solution is **<\\d\\d.\\d\\d.\\d\\d>**. Seems fine at\n",
    "first sight.. It will match a date like `02/12/03` just what we intended, So fine... \n",
    "- <font color='red'> <b> Trouble is: 02512703<b></font> is also considered a **valid date** by this regular expression. In this match, the first dot matched $5$, and the second matched $7$. Obviously $not$ what we intended. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Start of String and End of String Anchors ( $ and ^)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Anchors are a different breed. They do not match any character at all. Instead, they match a position before,\n",
    "after or between characters. They can be used to `anchor` the regex match at a certain position. \n",
    "- The caret **<^>**\n",
    "matches the position before the first character in the string. Applying **<^a>** to `abc` matches `a`. **<^b>** will\n",
    "not match `abc` at all, because the **<b\\>** cannot be matched right after the start of the string, matched by **<^>**.\n",
    "- Similarly, **<\\$>** matches right after the last character in the string. **<c\\$>** matches `c` in `abc`, while **<a\\$>** `does not` match `abc` at all...."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center> **So Now we are good to go!! Armed with regex, let's see what they can do..** <center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word Boundaries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The word boundary \\b matches positions where one side is a word character (usually a letter, digit or underscore—but see below for variations across engines) and the other side is not a word character (for instance, it may be the beginning of the string or a space character).\n",
    "\n",
    "The `\\bcat\\b` would therefore match `cat` in a `black cat`, but it wouldn't match it in `catatonic`, `tomcat` or `certificate`. Removing one of the boundaries, `\\bcat` would match `cat` in `catfish`, and `cat\\b` would match `cat` in `tomcat`, but not vice-versa. Both, of course, would match `cat` on its own. \n",
    "\n",
    "Word boundaries are useful when you want to match a sequence of letters (or digits) on their own, or to ensure that they occur at the beginning or the end of a sequence of characters.\n",
    "\n",
    "Be aware, though, that `\\bcat\\b` will not match `cat` in `_cat` or in `cat25` because there is no boundary between an underscore and a letter, nor between a letter and a digit: these all belong to what regex defines as word characters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Grouping ()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By placing part of a regular expression inside round brackets or parentheses, you can group that part of the\n",
    "regular expression together. This allows you to apply a regex operator, e.g. a repetition operator, to the entire\n",
    "group.\n",
    "Only round brackets can be used for grouping. Square brackets define a character class, and curly\n",
    "braces are used by a special repetition operator.\n",
    "\n",
    "- The regex `Set(Value)?` matches \"**Set** or **SetValue**\". \n",
    "    - In the first case, the first backreference will be\n",
    "empty, because it did not match anything. \n",
    "    - In the second case, the `first backreference` will contain **Value**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### So How can we use it?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Backreferences allow you to re-use part of the regex match. You can reuse it inside the regular expression before or afterwards depending on the Regex Flavour you are using...`\n",
    "Some regex flavours use `\\`, some flavours use `$`, etc..\n",
    "In Perl, you can use the magic variables $1, $2, etc. to access the part of the string matched by the backreference"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Regex** : `(\\w+)\\1` on String `seek` will match `ee`\n",
    "\n",
    "PS I am myself studying this section properly, hence couldn't add more details :))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Examples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Regex are now written in Quotes (\\`)\n",
    "- The String to be matched is in (\"bold\")\n",
    "\n",
    "Suppose you want to use a regex to match a list of function names in a programming language: \"**Get, GetValue, Set or SetValue.**\" \n",
    "- The obvious solution is `Get|GetValue|Set|SetValue`\n",
    "\n",
    "*Now take a look closer carefully at the regex and the string, both.\n",
    "Here are some other ways to do the same task*\n",
    "-  `Get(Value)?|Set(Value)?`\n",
    "-  `\\b(Get|GetValue|Set|SetValue)\\b`\n",
    "-  `\\b(Get(Value)?|Set(Value)?)\\b`\n",
    "- Even this one is correct `\\b(Get|Set)(Value)?\\b`\n",
    "\n",
    "**Regex**:\t`<[^>]+>`\n",
    "- **What it does**:\tThis finds any HTML, such as `<\\a>, <\\b>, <\\img />, <\\br />, etc`. You can use this to find segments that have HTML tags you need to deal with, or to remove all HTML tags from a text.\n",
    "    \n",
    "**Regex**:\t`https?:\\/\\/[\\w\\.\\/\\-?=&%,]+`\n",
    "- What it does:\tThis will find a URL. It will capture most URLs that begin with http:// or https://.\n",
    "\n",
    "**Regex**:\t`'\\w+?'`\n",
    "- **What it does**:\tThis finds single words that are surrounded by apostrophes.\n",
    "\n",
    "**Regex**:\t`([-A-Za-z0-9_]*?([-A-Za-z_][0-9]|[0-9][-A-Za-z_])[-A-Za-z0-9_]*)`\n",
    "- **What it does**:\tAlphanumeric part numbers and references like: 1111_A, AA1AAA or 1-1-1-A, 21A1 and 10UC10P-BACW, abcd-1234, 1234-pqtJK, sft-0021 or 21-1_AB and 55A or AK7_GY.\n",
    "This can be very useful if you are translating documents that have a lot of alphanumeric codes or references in them, and you need to be able to find them easily.\n",
    "\n",
    "**Regex**:\t`\\b(the|The)\\b.*?\\b(?=\\W?\\b(is|are|was|can|shall| must|that|which|about|by|at|if|when|should|among|above|under|$)\\b)`\n",
    "- **What it does**:\tThis finds text that begins with the or The and ends with stop words such as is, are, was, can, shall, must, that, which, about, by, at, if, when, should, among, above or under, or the end of the segment.\n",
    "This is particularly useful when you need to extract terminology. Suppose you have segments like these:\n",
    "`\n",
    "The Web based look up is our new feature. A project manager should not proofread... Our Product Name is...`\n",
    "    - The Regex shown above would find anything between The and is, or should. With most texts, there is a good chance that anything this Regex finds is a good term that you can add to your Termbase.\n",
    "\n",
    "**Regex**:\t`\\b(a|an|A|An)\\b.*?\\b(?=\\W?\\b(is|are|was|can|shall|must |that|which|about|by|at|if|when|among|above|under|$)\\b)`\n",
    "- **What it does**:\tThis works much like the Regex shown above, except that it finds text that begins with a or an, rather than the. This can also be very helpful when you need to extract terminology from a project.\n",
    "\n",
    "**Regex**:  `\\b(this|these|This|These)\\b.*?\\b(?=\\W?\\b(is|are|was|can|shall|must|that|which|about|by|at|if|when|among|above|under|$)\\b)`\n",
    "    - **What it does**:\tThis works much like the Regex shown above, except that it finds text that begins with this or these. This can also be very helpful when you need to extract terminology from a project.\n",
    "\n",
    "**Regex** :`(.*?)`\n",
    "- **What it does** : Accept blah-blah-blah..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Python [re module](https://docs.python.org/3/library/re.html)\n",
    "\n",
    "- `re.sub(regex, replacement, subject)` performs a search-and-replace across subject, replacing all\n",
    "matches of regex in subject with replacement. The result is returned by the sub() function. **The subject\n",
    "string you pass is not modified**. The re.sub() function applies the same backslash logic to the replacement text as is applied to the regular expression. Therefore, you should use raw strings for the replacement text..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-11-07T05:27:28.416159Z",
     "start_time": "2018-11-07T05:27:28.224381Z"
    }
   },
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload\n",
    "import re\n",
    "import time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-11-07T05:27:28.436099Z",
     "start_time": "2018-11-07T05:27:28.416159Z"
    }
   },
   "outputs": [],
   "source": [
    "s = \"How do you do this\"\n",
    "print(\n",
    "    \"After applying re.sub -- \",\n",
    "    re.sub(r\"How do you\", \"How do I\", s),\n",
    "    \"\\nOriginal Text is still -- \",\n",
    "    s,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So does that mens that we have to type one regex expression everytime, run and check it and then the substituion willl happen? i.e Can't we stack `re.sub(), re.sub(), re.sub()....`\n",
    "\n",
    "Surely not, Remeber `re.sub()` is returning a string after making the chnages that matched the pattern you asked more.."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-11-07T05:27:28.520045Z",
     "start_time": "2018-11-07T05:27:28.436099Z"
    }
   },
   "outputs": [],
   "source": [
    "s_old = \"How do you do this\"\n",
    "print(\"After applying re.sub -- \", end=\"\")\n",
    "s_new = re.sub(r\"How do you\", \"How do I\", s_old)\n",
    "print(f\"\\nOriginal Text isn't still **{s_old}** but it's now **{s_new}**\")\n",
    "# Obviously s_old and s_new are different, I am just trying to show that we can stack the operations...."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-11-07T05:27:28.597175Z",
     "start_time": "2018-11-07T05:27:28.520045Z"
    }
   },
   "outputs": [],
   "source": [
    "tweet = \"#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android +#apps +#beautiful \\\n",
    "         #cute #health #igers #iphoneonly #iphonesia #iphone \\\n",
    "             <3 ;D :( :-(\"\n",
    "\n",
    "# Let's take care of emojis and the #(hash-tags)...\n",
    "\n",
    "print(f\"Original Tweet ---- \\n {tweet}\")\n",
    "\n",
    "## Replacing #hashtag with only hashtag\n",
    "tweet = re.sub(r\"#(\\S+)\", r\" \\1 \", tweet)\n",
    "# this gets a bit technical as here we are using Backreferencing and Character Sets Shorthands and replacing the captured Group.\n",
    "# \\S = [^\\s] Matches any charachter that isn't white space\n",
    "print(f\"\\n Tweet after replacing hashtags ----\\n  {tweet}\")\n",
    "\n",
    "## Love -- <3, :*\n",
    "tweet = re.sub(r\"(<3|:\\*)\", \" EMO_POS \", tweet)\n",
    "print(f\"\\n Tweet after replacing Emojis for Love with EMP_POS ----\\n  {tweet}\")\n",
    "\n",
    "# The parentheses are for Grouping, so we search (remeber the raw string (`r`))\n",
    "# either for <3 or(|) :\\* (as * is a meta character, so preceeded by the backslash)\n",
    "\n",
    "## Wink -- ;-), ;), ;-D, ;D, (;,  (-;\n",
    "tweet = re.sub(r\"(;-?\\)|;-?D|\\(-?;)\", \" EMO_POS \", tweet)\n",
    "print(f\"\\n Tweet after replacing Emojis for Wink with EMP_POS ----\\n  {tweet}\")\n",
    "\n",
    "# The parentheses are for Grouping as usual, then we first focus on `;-), ;),`, so we can see that 1st we need to have a ;\n",
    "# and then we can either have a `-` or nothing, so we can do this via using our `?` clubbed with `;` and hence we have the very\n",
    "# starting with `(;-?\\)` and simarly for others...\n",
    "\n",
    "## Sad -- :-(, : (, :(, ):, )-:\n",
    "tweet = re.sub(r\"(:\\s?\\(|:-\\(|\\)\\s?:|\\)-:)\", \" EMO_NEG \", tweet)\n",
    "print(f\"\\n Tweet after replacing Emojis for Sad with EMP_NEG ----\\n  {tweet}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-11-07T05:27:28.671495Z",
     "start_time": "2018-11-07T05:27:28.597175Z"
    }
   },
   "outputs": [],
   "source": [
    "##See the Output Carefully, there are Spaces inbetween un-necessary...\n",
    "## Replace multiple spaces with a single space\n",
    "tweet = re.sub(r\"\\s+\", \" \", tweet)\n",
    "print(f\"\\n Tweet after replacing xtra spaces ----\\n  {tweet}\")\n",
    "\n",
    "##Replace the Puctuations (+,;)\n",
    "tweet = re.sub(r\"[^\\w\\s]\", \"\", tweet)\n",
    "print(f\"\\n Tweet after replacing Punctuation + with PUNC ----\\n  {tweet}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-11-07T05:27:28.755386Z",
     "start_time": "2018-11-07T05:27:28.675486Z"
    }
   },
   "outputs": [],
   "source": [
    "# bags of positive/negative smiles (You can extend the above example to take care of these few too...))) A good Excercise...\n",
    "\n",
    "positive_emojis = set(\n",
    "    [\n",
    "        \":‑)\",\n",
    "        \":)\",\n",
    "        \":-]\",\n",
    "        \":]\",\n",
    "        \":-3\",\n",
    "        \":3\",\n",
    "        \":->\",\n",
    "        \":>\",\n",
    "        \"8-)\",\n",
    "        \"8)\",\n",
    "        \":-}\",\n",
    "        \":}\",\n",
    "        \":o)\",\n",
    "        \":c)\",\n",
    "        \":^)\",\n",
    "        \"=]\",\n",
    "        \"=)\",\n",
    "        \":‑D\",\n",
    "        \":D\",\n",
    "        \"8‑D\",\n",
    "        \"8D\",\n",
    "        \"x‑D\",\n",
    "        \"xD\",\n",
    "        \"X‑D\",\n",
    "        \"XD\",\n",
    "        \"=D\",\n",
    "        \"=3\",\n",
    "        \"B^D\",\n",
    "        \":-))\",\n",
    "        \";‑)\",\n",
    "        \";)\",\n",
    "        \"*-)\",\n",
    "        \"*)\",\n",
    "        \";‑]\",\n",
    "        \";]\",\n",
    "        \";^)\",\n",
    "        \":‑,\",\n",
    "        \";D\",\n",
    "        \":‑P\",\n",
    "        \":P\",\n",
    "        \"X‑P\",\n",
    "        \"XP\",\n",
    "        \"x‑p\",\n",
    "        \"xp\",\n",
    "        \":‑p\",\n",
    "        \":p\",\n",
    "        \":‑Þ\",\n",
    "        \":Þ\",\n",
    "        \":‑þ\",\n",
    "        \":þ\",\n",
    "        \":‑b\",\n",
    "        \":b\",\n",
    "        \"d:\",\n",
    "        \"=p\",\n",
    "        \">:P\",\n",
    "        \":'‑)\",\n",
    "        \":')\",\n",
    "        \":-*\",\n",
    "        \":*\",\n",
    "        \":×\",\n",
    "    ]\n",
    ")\n",
    "negative_emojis = set(\n",
    "    [\n",
    "        \":‑(\",\n",
    "        \":(\",\n",
    "        \":‑c\",\n",
    "        \":c\",\n",
    "        \":‑<\",\n",
    "        \":<\",\n",
    "        \":‑[\",\n",
    "        \":[\",\n",
    "        \":-||\",\n",
    "        \">:[\",\n",
    "        \":{\",\n",
    "        \":@\",\n",
    "        \">:(\",\n",
    "        \"D‑':\",\n",
    "        \"D:<\",\n",
    "        \"D:\",\n",
    "        \"D8\",\n",
    "        \"D;\",\n",
    "        \"D=\",\n",
    "        \"DX\",\n",
    "        \":‑/\",\n",
    "        \":/\",\n",
    "        \":‑.\",\n",
    "        \">:\\\\\",\n",
    "        \">:/\",\n",
    "        \":\\\\\",\n",
    "        \"=/\",\n",
    "        \"=\\\\\",\n",
    "        \":L\",\n",
    "        \"=L\",\n",
    "        \":S\",\n",
    "        \":‑|\",\n",
    "        \":|\",\n",
    "        \"|‑O\",\n",
    "        \"<:‑|\",\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-11-07T05:27:28.839276Z",
     "start_time": "2018-11-07T05:27:28.763373Z"
    }
   },
   "outputs": [],
   "source": [
    "## Pattern to match any IP Addresses\n",
    "pattern = r\"\\b\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\b\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "the above pattern will also match `999.999.999.999` but that isn't a valid IP at all\n",
    "Now this depends on the data at hand as to how far you want the regex to be accurate...\n",
    "To restrict all `4` numbers in the IP address to `0..255`, you can use this\n",
    "complex beast: \n",
    "- `\\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(25[0-5]|2[0-4][0-\n",
    "9]|[01]?[0-9][0-9]?)\\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(25[0-5]|2[0-\n",
    "4][0-9]|[01]?[0-9][0-9]?)\\b`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-11-07T05:27:28.949510Z",
     "start_time": "2018-11-07T05:27:28.847262Z"
    }
   },
   "outputs": [],
   "source": [
    "updated_pattern = r\"\\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\b\"\n",
    "updated_pattern"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-11-07T05:27:29.040308Z",
     "start_time": "2018-11-07T05:27:28.957499Z"
    }
   },
   "outputs": [],
   "source": [
    "if re.search(pattern, \"999.999.999.999\"):\n",
    "    print(\"Matched\")\n",
    "if re.search(updated_pattern, \"256.999.999.999\"):\n",
    "    print(\"Matched\")\n",
    "else:\n",
    "    print(\"Not Matched\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-11-07T05:27:29.116200Z",
     "start_time": "2018-11-07T05:27:29.048297Z"
    }
   },
   "outputs": [],
   "source": [
    "# Valid Dates..\n",
    "pattern = r\"(19|20)\\d\\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- matches a date in yyyy-mm-dd format from between 1900-01-01 and 2099-12-31, with a choice of four separators(space included :))\n",
    "\n",
    "- The year is matched by `(19|20)\\d\\d`\n",
    "- The month is matched by `(0[1-9]|1[012])` (rounding brackets are necessary so that to include both the options)\n",
    "    - By using character classes, \n",
    "        - the first option matches a number between `01 and 09`, and \n",
    "        - the second matches `10, 11 or 12`\n",
    "- The last part of the regex consists of three options. The first matches the numbers `01\n",
    "through 09`, the second `10 through 29`, and the third matches `30 or 31`... "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# References (A lot)\n",
    "\n",
    "- http://www.rexegg.com/ <<<< THE ULTIMATE WEBSITE>>>>\n",
    "- https://github.com/aloisdg/awesome-regex\n",
    "- http://linuxreviews.org/beginner/tao_of_regular_expressions/tao_of_regular_expressions.en.print.pdf\n",
    "- https://developers.google.com/edu/python/regular-expressions\n",
    "- https://www.youtube.com/watch?v=EkluES9Rvak"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center> <b>Oh, yes, and forget about practice, that's completely overrated.<b> </center>\n",
    "    <center><b> <font color='red'> Just kidding.... </font> <b> </center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src='https://i0.wp.com/www.scoutingforbeer.org.uk/wp-content/uploads/2016/04/thank-you-1400x800-c-default.gif'>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  },
  "nbTranslate": {
   "displayLangs": [
    "*"
   ],
   "hotkey": "alt-t",
   "langInMainMenu": true,
   "sourceLang": "en",
   "targetLang": "fr",
   "useGoogleTranslate": true
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}