{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# [X-Village] Lesson08-regular_expression\n",
    "# by 劉正仁"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "  \n",
    "  \n",
    "  # Regular Expressions\n",
    "\n",
    "* As the name suggests, a regular expression is a **notation**.\n",
    "* A regular expression is simply a <span style=\"color: green\">string</span> (written in regular-expression syntax). You can use this predefined <span style=\"color: green\">string</span> to find the parts of other <span style=\"color: red\">text data</span> that match it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Diagram\n",
    "\n",
    "<center> <img src=\"./images/re1.png\"> </center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
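   },
   "source": [
    "# Applications\n",
    "\n",
    "* Find-in-page string search in web browsers (Ctrl+F)\n",
    "* Server-side validation of username and password formats\n",
    "* Text mining on big data\n",
    "* Substring replacement in documents\n",
    "* And many, many more...\n"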
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# A quick glance at what regular expressions look like in code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The location of 'regular expression' is between 27 and 45 in text\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "# Search sub-string in text\n",
    "text = \"Today is good day to learn regular expression.\"\n",
    "\n",
    "# Define re pattern\n",
    "pattern = r'regular expression'\n",
    "\n",
    "# Search if there is 'regular expression' in the text\n",
    "match = re.search(pattern, text)\n",
    "\n",
    "# Check result\n",
    "start_index = match.span()[0]\n",
    "end_index = match.span()[1]\n",
    "match_string = match.group()\n",
    "print(\"The location of '{}' is between {} and {} in text\".format(match_string, start_index, end_index))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Code walkthrough"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Code walkthrough (cont'd)\n",
    "\n",
    "### Define the text data  \n",
    "\n",
    "<center> <img src=\"./images/a1.png\"> </center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Code walkthrough (cont'd)\n",
    "\n",
    "### Define the regular expression  \n",
    "\n",
    "<center> <img src=\"./images/a2.png\"> </center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Code walkthrough (cont'd)\n",
    "\n",
    "### What is a raw string?  \n",
    "\n",
    "<center> <img src=\"./images/a3.png\"> </center>"
   ]
  },
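  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "A minimal sketch of why patterns are usually written as raw strings. The sample text below is illustrative and not part of the original lesson. In a normal string literal, Python turns `\\b` into a backspace character before the re module ever sees it, while a raw string keeps the backslash so that re can read it as a word boundary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# \"\\b\" is one character (backspace); r\"\\b\" is two (backslash + 'b')\n",
    "print(len(\"\\b\"), len(r\"\\b\"))  # prints: 1 2\n",
    "\n",
    "# Illustrative sample text (an assumption, not from the lesson)\n",
    "text = \"cake cheesecake\"\n",
    "\n",
    "# Raw string: \\b reaches re as a word boundary, so the standalone word is found\n",
    "print(re.search(r\"\\bcake\\b\", text))\n",
    "\n",
    "# Normal string: the pattern contains backspace characters, so nothing is found\n",
    "print(re.search(\"\\bcake\\b\", text))"
   ]
  },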
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Code walkthrough (cont'd)\n",
    "\n",
    "Match the **text data** against the **regular expression**  \n",
    "  \n",
    "<center> <img src=\"./images/a4.png\"> </center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Code walkthrough (cont'd)\n",
    "\n",
    "### Check the result\n",
    "\n",
    "<center> <img src=\"./images/a5.png\"> </center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Using regular expressions\n",
    "\n",
    "* Ordinary text\n",
    "* Basic workflow\n",
    "* Common functions\n",
    "* Special characters \n",
    "* Repetition syntax   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
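   },
   "source": [
    "# Ordinary text\n",
    "\n",
    "<h3> If you put <span style='color:green'>ordinary text</span> (e.g. 'hello', 'johnny', 'name') in a regular expression, the program searches the <span style=\"color:red\">text data</span> for a string that is **exactly the same** as the given text. </h3>\n",
    "\n",
    "<h3> Think of it as the Ctrl+F feature you often use in a browser. </h3>\n",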
    "\n",
    "![CTRL+F](./images/re2.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class '_sre.SRE_Match'>\n",
      "The location of 'Monday' is between 9 and 15 in text\n"
     ]
    }
   ],
   "source": [
    "# Demo\n",
    "# Ordinary literal in regular expression\n",
    "import re\n",
    "\n",
    "# The text which re applies to\n",
    "text = \"Today is Monday, tomorrow is Tuesday.\"\n",
    "\n",
    "# define pattern\n",
    "pattern = r'Monday'\n",
    "\n",
    "# search pattern in the text\n",
    "match = re.search(pattern, text)\n",
    "\n",
    "# Check\n",
    "print(type(match))\n",
    "start_index = match.span()[0]\n",
    "end_index = match.span()[1]\n",
    "match_string = match.group()\n",
    "print(\"The location of '{}' is between {} and {} in text\".format(match_string, start_index, end_index))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'NoneType'>\n",
      "None\n"
     ]
    }
   ],
   "source": [
    "# Ordinary literal in regular expression\n",
    "# re cannot find a match in the text \n",
    "import re\n",
    "\n",
    "# The text which re applies to\n",
    "text = \"Today is Monday, tomorrow is Tuesday.\"\n",
    "\n",
    "# Specify pattern\n",
    "pattern = r'Wednesday'\n",
    "\n",
    "# Apply re to the text\n",
    "match = re.search(pattern, text)\n",
    "\n",
    "# Print out result\n",
    "print(type(match))\n",
    "print(match)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Basic workflow\n",
    "  \n",
    "1. Define the regular expression    \n",
    "2. Apply the regular expression to the text data  \n",
    "3. Extract the matched string  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Define the regular expression\n",
    "\n",
    "<center> <img src=\"./images/box.png\"> </center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Define pattern\n",
    "pattern = r'cookie'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Apply the regular expression to the text data\n",
    "\n",
    "<center> <img src=\"./images/box2.png\"> </center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "text = \"Cake and cookie\"\n",
    "\n",
    "# Non-Compiled Version\n",
    "match = re.search(pattern, text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Extract the matched string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cookie\n"
     ]
    }
   ],
   "source": [
    "print(match.group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Common functions\n",
    "\n",
    "The functions and methods below are the ones you will use most often with the re module\n",
    "* re module\n",
    "    * search\n",
    "    * match\n",
    "    * sub\n",
    "    * compile\n",
    "* Methods of the Match object\n",
    "    * span\n",
    "    * group\n",
    "    * groups\n",
    "    * groupdict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "##  [re.search(pattern, string, flags=0)](https://docs.python.org/3/library/re.html#re.search)\n",
    "* **Arguments**  \n",
    "    * <span style=\"color:green\">pattern</span>  \n",
    "    The regular expression you defined (e.g. r'cookie')  \n",
    "    * <span style=\"color:red\">string</span>   \n",
    "    The text data to apply the regular expression to\n",
    "    * flags (not needed for now)  \n",
    "    Defaults to 0\n",
    "* **Return Value**  \n",
    "If a substring matching <span style=\"color:green\">pattern</span> is found in <span style=\"color:red\"> string </span>, a Match object is returned. Otherwise, None is returned.\n",
    "> **Note**    \n",
    "> Unlike re.match, re.search returns a Match object as long as a substring matching pattern appears **anywhere** in string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Type of 'match' object: <class '_sre.SRE_Match'>\n",
      "Matching pattern: 'cookie'\n"
     ]
    }
   ],
   "source": [
    "# re.search \n",
    "\n",
    "import re\n",
    "\n",
    "# text which re applies to\n",
    "text = \"Cake and cookie\"\n",
    "\n",
    "# Define pattern\n",
    "pattern = \"cookie\"\n",
    "\n",
    "# search pattern in the text\n",
    "match = re.search(pattern, text)\n",
    "\n",
    "# Check\n",
    "print(\"Type of 'match' object:\", type(match))\n",
    "print(\"Matching pattern: %r\" % match.group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# [re.match(pattern, string, flags=0)](https://docs.python.org/3/library/re.html#re.match)\n",
    "\n",
    "* **Arguments**\n",
    "    * <span style=\"color:green\">pattern</span>   \n",
    "    The regular expression you defined (e.g. r'cookie')  \n",
    "  \n",
    "    * <span style=\"color:red\">string</span>  \n",
    "    The text data to apply the regular expression to\n",
    "  \n",
    "    * flags (not needed for now)  \n",
    "    Defaults to 0\n",
    "* **Return Value**  \n",
    "If the beginning of <span style=\"color:red\"> string </span> matches <span style=\"color:green\">pattern</span>, a Match object is returned. Otherwise, None is returned.\n",
    "> **Note**\n",
    "> re.match only checks the beginning of string. If a substring matching pattern appears only in the **middle** of string, **no** Match object is returned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Type of 'match1' object: <class 'NoneType'>\n",
      "None\n",
      "==================================================\n",
      "Type of 'match2' object: <class '_sre.SRE_Match'>\n",
      "Matching pattern: 'cookie'\n"
     ]
    }
   ],
   "source": [
    "# demo\n",
    "# re.match\n",
    "\n",
    "import re\n",
    "\n",
    "# texts which re applies to\n",
    "text1 = \"Cake and cookie\"\n",
    "text2 = \"cookie and Cake\"\n",
    "\n",
    "# specify pattern\n",
    "pattern = \"cookie\"\n",
    "\n",
    "# search pattern in texts\n",
    "match1 = re.match(pattern, text1)\n",
    "match2 = re.match(pattern, text2)\n",
    "\n",
    "# Check\n",
    "print(\"Type of 'match1' object:\", type(match1))\n",
    "print(match1)\n",
    "print(\"=\"*50)\n",
    "print(\"Type of 'match2' object:\", type(match2))\n",
    "print(\"Matching pattern: %r\" % match2.group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
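   },
   "source": [
    "# [re.sub(pattern, repl, string, count=0, flags=0)](https://docs.python.org/3/library/re.html#re.sub)\n",
    "\n",
    "* **Arguments**  \n",
    "    * <span style=\"color:green\">pattern</span>  \n",
    "    The regular expression you defined (e.g. r'cookie')  \n",
    "    * repl  \n",
    "    Can be a **string** or a **function** (a callable object). If it is a string, every substring matching pattern is replaced with repl. If it is a function, it is called with the Match object of each match and its return value is used as the replacement text.  \n",
    "    * <span style=\"color:red\">string</span>  \n",
    "    The text string to apply the regular expression to  \n",
    "    * count  \n",
    "    Defaults to 0, which replaces **all** substrings matching pattern with repl. If it is greater than 0, only the first **count** matches are replaced.  \n",
    "    * flags (not needed for now)  \n",
    "    Defaults to 0\n",
    "  \n",
    "* **Return Value**  \n",
    "Returns a new string with the replacements applied"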
   ]
  },
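  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "A short sketch of the `count` argument described above. The sample text is illustrative and not part of the original lesson."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Illustrative sample text (an assumption, not from the lesson)\n",
    "text = \"cookie, cookie and cookie\"\n",
    "\n",
    "# count=0 (the default) replaces every occurrence\n",
    "print(re.sub(r'cookie', 'biscuit', text))           # biscuit, biscuit and biscuit\n",
    "\n",
    "# count=1 replaces only the first occurrence\n",
    "print(re.sub(r'cookie', 'biscuit', text, count=1))  # biscuit, cookie and cookie"
   ]
  },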
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original text: cookie and Cake\n",
      "After substitution: biscuit and Cake\n"
     ]
    }
   ],
   "source": [
    "# Demo\n",
    "# re.sub with plain string\n",
    "\n",
    "import re\n",
    "\n",
    "# text which re applies to\n",
    "text = \"cookie and Cake\"\n",
    "\n",
    "# Define pattern\n",
    "pattern = r'cookie'\n",
    "\n",
    "# substitute the string 'cookie' with 'biscuit'\n",
    "new_text = re.sub(pattern, 'biscuit', text)\n",
    "\n",
    "# Check\n",
    "print(\"Original text:\", text)\n",
    "print(\"After substitution:\", new_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original text: cookie and Cake\n",
      "After substitution: cookie-cookie and Cake\n"
     ]
    }
   ],
   "source": [
    "# Demo\n",
    "# re.sub with a custom function\n",
    "\n",
    "import re\n",
    "\n",
    "# text which re applies to\n",
    "text = \"cookie and Cake\"\n",
    "\n",
    "# Define pattern\n",
    "pattern = r'cookie'\n",
    "\n",
    "# Custom replace function\n",
    "def repl(match):\n",
    "    new_string = match.group()+'-'+match.group()\n",
    "    return new_string\n",
    "\n",
    "new_text = re.sub(pattern, repl, text)\n",
    "\n",
    "print(\"Original text:\", text)\n",
    "print(\"After substitution:\", new_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# [re.compile(pattern, flags=0)](https://docs.python.org/3/library/re.html#re.compile)\n",
    "\n",
    "* **Arguments**\n",
    "    * <span style=\"color:green\">pattern</span>  \n",
    "    The regular expression to compile and keep for reuse\n",
    "  \n",
    "    * flags (not needed for now)  \n",
    "    Defaults to 0\n",
    "  \n",
    "* **Return Value**  \n",
    "Returns an object of type Pattern"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biscuit and Cake\n",
      "biscuit and Cake\n"
     ]
    }
   ],
   "source": [
    "# Demo\n",
    "# re.compile\n",
    "\n",
    "import re\n",
    "\n",
    "# texts which re applies to\n",
    "text = \"cookie and Cake\"\n",
    "\n",
    "# Define pattern\n",
    "pattern = r'cookie'\n",
    "\n",
    "# Compile\n",
    "regex = re.compile(pattern)\n",
    "\n",
    "# Substitute with the compiled regex, then with the module-level function\n",
    "new_text1 = regex.sub('biscuit', text)\n",
    "new_text2 = re.sub(r'cookie', 'biscuit', text)\n",
    "\n",
    "print(new_text1)\n",
    "print(new_text2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Why compile?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 275 µs, sys: 25 µs, total: 300 µs\n",
      "Wall time: 304 µs\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "# Newbie style\n",
    "\n",
    "text1 = 'Cake and cookie'\n",
    "text2 = 'cookie and Cake'\n",
    "text3 = 'Cake cookie'\n",
    "\n",
    "# search pattern\n",
    "for i in range(100):\n",
    "    match1 = re.search(r'cookie', text1)\n",
    "    match2 = re.search(r'cookie', text2)\n",
    "    match3 = re.search(r'cookie', text3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 101 µs, sys: 9 µs, total: 110 µs\n",
      "Wall time: 113 µs\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "# Prof style\n",
    "\n",
    "text1 = 'Cake and cookie'\n",
    "text2 = 'cookie and Cake'\n",
    "text3 = 'Cake cookie'\n",
    "\n",
    "# Compile regex\n",
    "regex = re.compile(r'cookie')\n",
    "\n",
    "# search pattern\n",
    "for i in range(100):\n",
    "    match1 = regex.search(text1)\n",
    "    match2 = regex.search(text2)\n",
    "    match3 = regex.search(text3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
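   },
   "source": [
    "# Why compile? (cont'd)\n",
    "* The compiled re object is kept for reuse  \n",
    "* No need to repeat the same pattern throughout the code\n",
    "* Faster execution when the same pattern is used many times"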
   ]
  },
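  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "A single `%%time` run is noisy, so here is a hedged sketch that repeats the comparison with the standard `timeit` module. The repetition count is an arbitrary choice, and the numbers will differ from machine to machine. Because the re module also keeps its own cache of recently used patterns, the gap may well be small."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import re\n",
    "import timeit\n",
    "\n",
    "text = 'Cake and cookie'\n",
    "regex = re.compile(r'cookie')\n",
    "\n",
    "# Module-level function: the pattern is looked up in re's internal cache on every call\n",
    "t_module = timeit.timeit(lambda: re.search(r'cookie', text), number=100000)\n",
    "\n",
    "# Precompiled Pattern object: the lookup is skipped\n",
    "t_compiled = timeit.timeit(lambda: regex.search(text), number=100000)\n",
    "\n",
    "print(\"re.search:    {:.4f} s\".format(t_module))\n",
    "print(\"regex.search: {:.4f} s\".format(t_compiled))"
   ]
  },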
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# [match.span( [group] )](https://docs.python.org/3.5/library/re.html#re.match.span)\n",
    "\n",
    "* **Arguments**\n",
    "    * \\[group\\]  \n",
    "    Defaults to group 0, which stands for the entire matched string. group can be the index of any valid group (e.g. 1, 2, 3, etc.).\n",
    "  \n",
    "  \n",
    "* **Return Value**  \n",
    "A tuple giving the position of this group in the text: (start index, end index). The end index is exclusive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Match starting index: 9\n",
      "Match ending index: 15\n",
      "Result String: cookie\n"
     ]
    }
   ],
   "source": [
    "# 5 min practice\n",
    "# match.span\n",
    "\n",
    "import re\n",
    "\n",
    "text = 'Cake and cookie'\n",
    "\n",
    "# Define pattern\n",
    "pattern = r'cookie'\n",
    "\n",
    "# search pattern\n",
    "match = re.search(pattern, text)\n",
    "\n",
    "# Check\n",
    "print(\"Match starting index:\", match.span()[0])\n",
    "print(\"Match ending index:\", match.span()[1])\n",
    "print(\"Result String:\", text[match.span()[0]:match.span()[1]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# [match.group( [group1, ...] )](https://docs.python.org/3.5/library/re.html#re.match.group)\n",
    "\n",
    "* \\[group1, ...\\]  \n",
    "If group1 is not given, it defaults to 0, which returns the entire matched string.  \n",
    "groupN can be any valid group index (e.g. 1, 2, 3, etc.).\n",
    "  \n",
    "  \n",
    "* **Return Value**  \n",
    "Returns the string matched by that group. If more than one argument is passed, a tuple containing the string of each requested group is returned.  \n",
    "If a group did not take part in the match, None is returned for it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# How to define a group\n",
    "\n",
    "<center><h2>(<span style=\"color:green\"> pattern </span>)</h2></center>\n",
    "  \n",
    "> <span style=\"color:green\">pattern</span>: the regular expression for the group  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Why define groups?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Suppose I want to extract (1) the last name and (2) the first name from the text \"Johnny Liu\"..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Would you define the regular expression like this?\n",
    "\n",
    "<center> <h3> pattern = r'Johnny Liu' </h3> </center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# It should be defined like this\n",
    "\n",
    "<center> <h3> pattern = r'(Johnny) (Liu)'</h3> </center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Entire matching string: cookie\n",
      "Entire matching string: cookie\n"
     ]
    }
   ],
   "source": [
    "# Recap\n",
    "# Only one parameter in match.group\n",
    "\n",
    "import re\n",
    "\n",
    "text = 'Cake and cookie'\n",
    "\n",
    "# Define pattern\n",
    "pattern = r'cookie'\n",
    "\n",
    "# search pattern\n",
    "match = re.search(pattern, text)\n",
    "\n",
    "# Check\n",
    "print(\"Entire matching string:\", match.group())\n",
    "print(\"Entire matching string:\", match.group(0))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Group1 matching string: Cake\n",
      "Group2 matching string: cookie\n",
      "Group1 and Group2 matching strings: ('Cake', 'cookie')\n"
     ]
    }
   ],
   "source": [
    "# Demo\n",
    "# two or more parameters in match.group\n",
    "\n",
    "import re\n",
    "\n",
    "text = 'Cake and cookie'\n",
    "\n",
    "# Define pattern: () will define a group\n",
    "pattern = r'(Cake) and (cookie)'\n",
    "\n",
    "# search pattern\n",
    "match = re.search(pattern, text)\n",
    "\n",
    "# Check: only one argument\n",
    "print(\"Group1 matching string:\", match.group(1))\n",
    "print(\"Group2 matching string:\", match.group(2))\n",
    "\n",
    "# Check: two arguments\n",
    "print(\"Group1 and Group2 matching strings:\", match.group(1, 2))\n",
    "\n",
    "# Error\n",
    "# print(\"Group3 matching string:\", match.group(3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
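   },
   "source": [
    "# [match.groups(default=None)](https://docs.python.org/3.5/library/re.html#re.match.groups)\n",
    "\n",
    "* **Return Value**  \n",
    "Returns a tuple containing all the groups of the match.  \n",
    "A group that did not take part in the match is given the default value, which is None unless you specify otherwise."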
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Groups of match: ('Cake', 'cookie')\n"
     ]
    }
   ],
   "source": [
    "# match.groups()\n",
    "\n",
    "import re\n",
    "\n",
    "text = 'Cake and cookie'\n",
    "\n",
    "# Define pattern\n",
    "pattern = r'(Cake) and (cookie)?'\n",
    "\n",
    "# search pattern\n",
    "match = re.search(pattern, text)\n",
    "\n",
    "# Check\n",
    "print(\"Groups of match:\", match.groups())"
   ]
  },
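  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "A short sketch of what `default` does when an optional group takes no part in the match. The sample text and the value `'nothing'` are illustrative and not part of the original lesson."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Illustrative sample text without 'cookie' (an assumption, not from the lesson)\n",
    "text = 'Cake and pie'\n",
    "\n",
    "# The second group is optional, so the match succeeds without it\n",
    "match = re.search(r'(Cake) and (cookie)?', text)\n",
    "\n",
    "# A group that did not take part in the match is reported as None ...\n",
    "print(match.groups())                   # ('Cake', None)\n",
    "\n",
    "# ... or as the value passed for 'default'\n",
    "print(match.groups(default='nothing'))  # ('Cake', 'nothing')"
   ]
  },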
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# [match.groupdict(default=None)](https://docs.python.org/3.5/library/re.html#re.match.groupdict)\n",
    "\n",
    "* **Return Value**  \n",
    "Returns a dictionary containing all the named groups of the match. The **key** is the group name and the **value** is the string matched by that group."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# How to name a group\n",
    "\n",
    "<center><h2> (<span style=\"color:blue\">?P</span><<span style=\"color:red\">group_name</span>> <span style=\"color:green\">pattern</span>) </h2></center>  \n",
    "  \n",
    "> <span style=\"color:blue\">?P</span>: the prefix required when naming a group  \n",
    "> <span style=\"color:red\">group_name</span>: the name of the group  \n",
    "> <span style=\"color:green\">pattern</span>: the regular expression for the group  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'dict'>\n",
      "{'fat': 'cookie', 'fatter': 'Cake'}\n"
     ]
    }
   ],
   "source": [
    "# Demo\n",
    "# match.groupdict\n",
    "\n",
    "import re\n",
    "\n",
    "text = \"Cake and cookie\"\n",
    "\n",
    "# Define pattern\n",
    "pattern = r'(?P<fatter>Cake) and (?P<fat>cookie)'\n",
    "\n",
    "# search pattern\n",
    "match = re.search(pattern, text)\n",
    "\n",
    "# Check\n",
    "d = match.groupdict()\n",
    "print(type(d))\n",
    "print(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Hands-on\n",
    "\n",
    "### Use match.groupdict() to extract the first name and last name from the \"Johnny Liu\" example above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
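   },
   "source": [
    "# Special characters\n",
    "\n",
    "If regular expressions could only use plain text, they would be far too limited. So besides plain text, they also offer **special characters** for expressing **character sets**, **special symbols**, **flags**, and other more flexible and powerful features.\n",
    "\n",
    "<center><h1 style=\"color:red\">Be sure to practice this part a lot!</h1></center>"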
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Special character reference\n",
    "\n",
    "| Character | Meaning\n",
    "| :------- | :--------\n",
    "| . | Match any single character except newline ('\\n')\n",
    "| \\w | Match any single letter, digit, or underscore\n",
    "| \\W | Match any single character not matched by \\w\n",
    "| \\s | Match a single whitespace character such as space, newline, tab, or carriage return\n",
    "| \\S | Match any single character not matched by \\s\n",
    "| \\t | Match a tab\n",
    "| \\n | Match a newline\n",
    "| \\r | Match a carriage return\n",
    "| \\d | Match a decimal digit 0-9\n",
    "| ^ | Match at the start of the string (and after each newline in MULTILINE mode)\n",
    "| $ | Match at the end of the string (or just before a trailing newline)\n",
    "| \\A | Match only at the start of the string, regardless of mode\n",
    "| \\b | Match the empty string at the beginning or end of a word\n",
    "| \\ | Escape a special character so that it is matched literally\n",
    "| [...] | Match any single character listed between '[ ]'\n",
    "| [^...] | Match any single character not listed between '[ ]'\n",
    "\n",
    "> There are many more special characters. If you are interested, see the Python re documentation."
   ]
  },
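  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "A few entries from the table can be tried out as follows. This is only a sketch: the sample text is illustrative and not part of the original lesson."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Illustrative sample text (an assumption, not from the lesson)\n",
    "text = \"Room 101 is next to Room_B.\"\n",
    "\n",
    "# \\d: three consecutive digits\n",
    "print(re.search(r'\\d\\d\\d', text).group())\n",
    "\n",
    "# ^: 'Room' only at the start of the string\n",
    "print(re.search(r'^Room', text).span())\n",
    "\n",
    "# \\b: the whole word 'is', not the letters inside another word\n",
    "print(re.search(r'\\bis\\b', text).span())\n",
    "\n",
    "# [^...] and $: a character that is neither a word character nor whitespace, at the end\n",
    "print(re.search(r'[^\\w\\s]$', text).group())"
   ]
  },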
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "<center><h1> \\w and \\d practice</h1></center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'room_id': '65405', 'department': 'CSIE'}\n"
     ]
    }
   ],
   "source": [
    "# demo\n",
    "# \\w and \\d practice\n",
    "# Extract CSIE and room_id\n",
    "\n",
    "import re\n",
    "\n",
    "# text which re applies to\n",
    "text = \"CSIE-65405\"\n",
    "\n",
    "# specify pattern\n",
    "pattern = r'(?P<department>\\w\\w\\w\\w)-(?P<room_id>\\d\\d\\d\\d\\d)'\n",
    "\n",
    "# search pattern\n",
    "match = re.search(pattern, text)\n",
    "\n",
    "# Check\n",
    "print(match.groupdict())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# In-class exercise ex1 (10 minutes) (ungraded)\n",
    "\n",
    "Using the re module, write a program that lets the user enter a name and extracts its lastname and firstname.\n",
    "\n",
    "* Input:  \n",
    "    Tom Tsai   \n",
    "    Amy Wang    \n",
    "    Tim Chen    \n",
    "    \n",
    "---\n",
    "\n",
    "* Output:  \n",
    "    \"Tom Tsai\": { 'lastname': \"Tsai\", 'firstname': \"Tom\" }  \n",
    "    \"Amy Wang\": { 'lastname': \"Wang\", 'firstname': \"Amy\" }  \n",
    "    \"Tim Chen\": { 'lastname': \"Chen\", 'firstname': \"Tim\" } "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "<center><h1> [...] practice </h1></center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'room_id': '65405', 'department': 'CSIE'}\n"
     ]
    }
   ],
   "source": [
    "# 5 mins for coding\n",
    "# [...] practice\n",
    "# Extract CSIE and room_id \n",
    "\n",
    "import re\n",
    "\n",
    "text = \"CSIE-65405\"\n",
    "\n",
    "# define pattern\n",
    "pattern = r'(?P<department>[a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z])-(?P<room_id>[0-9][0-9][0-9][0-9][0-9])'\n",
    "\n",
    "# search pattern in text\n",
    "match = re.search(pattern, text)\n",
    "\n",
    "# check\n",
    "print(match.groupdict())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
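   },
   "source": [
    "# In-class exercise ex2 (40 minutes) (ungraded)\n",
    "\n",
    "Using the re module, write a simple **user authentication system**: the user provides a username and a password, and the program decides whether they are valid.  \n",
    "* **Username format**:\n",
    "    1. The total length is 3  \n",
    "    2. The first character is an uppercase English letter  \n",
    "    3. The remaining characters are lowercase letters  \n",
    "  \n",
    "* **Password format**:\n",
    "    1. The total length is 9  \n",
    "    2. The first 3 characters are lowercase English letters  \n",
    "    3. The last 6 characters are digits 0-9  \n",
    "    4. The first of those digits must be 0     "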
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# In-class exercise ex2 (cont'd)\n",
    "* Input:  \n",
    "    Tom, tom059357  \n",
    "    Amy, amy154852  \n",
    "    TiM, tim0002356  \n",
    "    Yen, yen0054321  \n",
    "    \n",
    "---\n",
    "\n",
    "* Output:  \n",
    "    Welcome, Tom!  \n",
    "    Password format error! Your password is amy154852  \n",
    "    Username format error! Your username is TiM  \n",
    "    Password length error! Your password length is 10."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Correct username\n",
      "Correct password\n",
      "Valid user\n"
     ]
    }
   ],
   "source": [
    "# Example\n",
    "\n",
    "import re\n",
    "\n",
    "class AuthSystem:\n",
    "    \n",
    "    def __init__(self):\n",
    "        \"\"\"Define regex\"\"\"\n",
    "        self.username_regex = re.compile(r'johnny')\n",
    "        self.password_regex = re.compile(r'johnny860410')\n",
    "    \n",
    "    def _check_username(self, username):\n",
    "        \"\"\"Check username is valid or not\"\"\"\n",
    "        # fullmatch requires the whole string to match, so 'xjohnnyx' is rejected\n",
    "        if self.username_regex.fullmatch(username) is not None:\n",
    "            print(\"Correct username\")\n",
    "            return True\n",
    "        else: \n",
    "            print(\"Wrong username\")\n",
    "            return False\n",
    "        \n",
    "    def _check_password(self, password):\n",
    "        \"\"\"Check password is valid or not\"\"\"\n",
    "        if self.password_regex.fullmatch(password) is not None:\n",
    "            print(\"Correct password\")\n",
    "            return True\n",
    "        else:\n",
    "            print(\"Wrong password\")\n",
    "            return False\n",
    "        \n",
    "    def authenticate(self, username, password):\n",
    "        \"\"\"authenticate the user\"\"\"\n",
    "        if not self._check_username(username):\n",
    "            return\n",
    "        \n",
    "        if not self._check_password(password):\n",
    "            return\n",
    "        \n",
    "        print(\"Valid user\")\n",
    "\n",
    "    \n",
    "# Construct a object of type AuthSystem\n",
    "auth = AuthSystem()\n",
    "\n",
    "# authenticate the user's credentials\n",
    "auth.authenticate(\"johnny\", \"johnny860410\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# 重複語法\n",
    "\n",
    "為了讓在定義正規表示式的時候可以更為**簡潔**，且更為的**彈性**，正規表示式中有提供**重複語法**。  \n",
    "我們可以將之前寫的方法做一些修改。\n",
    "\n",
    "<center><h2> \\w\\w\\w\\w ==> \\w{4} </h2></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# 重複語法查詢表\n",
    "\n",
    "| Character | Meaning\n",
    "| :--- | :--- \n",
    "| + | Match one or more characters to its left\n",
    "| * | Match zero or more characters to its left\n",
    "| ? | Match zero or one character to its left\n",
    "| {x} | Match 'x' times of character to its left\n",
    "| {x,} | Match 'x' or more times fo charater to its left\n",
    "| {x, y} | Match 'x' or more times but less than 'y' times of character to its left"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "<center><h1> + 練習 </h1></center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n",
      "CakeCakeCake and cookie\n"
     ]
    }
   ],
   "source": [
    "# Demo\n",
    "# '+' practice\n",
    "import re\n",
    "\n",
    "# texts which re applies to\n",
    "text1 = \" and cookie\"\n",
    "text2 = \"CakeCakeCake and cookie\"\n",
    "\n",
    "# Define pattern\n",
    "plus_pattern = \"(Cake)+ and cookie\"\n",
    "\n",
    "# search pattern in texts # \n",
    "plus_match1 = re.search(plus_pattern, text1)\n",
    "plus_match2 = re.search(plus_pattern, text2)\n",
    "\n",
    "# check\n",
    "print(plus_match1)\n",
    "print(plus_match2.group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "<center><h1> * 練習 </h1></center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " and cookie\n",
      "CakeCakeCake and cookie\n"
     ]
    }
   ],
   "source": [
    "# Demo\n",
    "\n",
    "import re\n",
    "\n",
    "# texts which re applies to\n",
    "text1 = \" and cookie\"\n",
    "text2 = \"CakeCakeCake and cookie\"\n",
    "\n",
    "# Define pattern\n",
    "mul_pattern = \"(Cake)* and cookie\"\n",
    "\n",
    "# search pattern in texts\n",
    "mul_match1 = re.search(mul_pattern, text1)\n",
    "mul_match2 = re.search(mul_pattern, text2)\n",
    "\n",
    "# check\n",
    "print(mul_match1.group())\n",
    "print(mul_match2.group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "<center><h1> ？ 練習 </h1></center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " and cookie\n",
      "Cake and cookie\n"
     ]
    }
   ],
   "source": [
    "# 5 mins for coding\n",
    "# '+' practice\n",
    "import re\n",
    "\n",
    "# texts which re applies to\n",
    "text1 = \" and cookie\"\n",
    "text2 = \"Cake and cookie\"\n",
    "\n",
    "# Define pattern\n",
    "ques_pattern = \"(Cake)? and cookie\"\n",
    "\n",
    "# search pattern in texts \n",
    "ques_match1 = re.search(ques_pattern, text1)\n",
    "ques_match2 = re.search(ques_pattern, text2)\n",
    "\n",
    "# Check\n",
    "print(ques_match1.group())\n",
    "print(ques_match2.group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# 課堂練習 ex3 (10 分鐘)\n",
    "\n",
    "請延續上一題的練習題，擴充其功能，讓能夠接受(accept)的帳號和密碼格式更複雜。    \n",
    "* **帳號格式**:  \n",
    "    1. 總長度大於 6 以上，小於 12 以下  \n",
    "    2. 第一個英文字母為大寫  \n",
    "    3. 其餘可為數字或是英文字母大小寫    \n",
    "    \n",
    "* **密碼格式**:  \n",
    "    1. 總長度大於 6 以上  \n",
    "    2. 只能為小寫英文或數字  \n",
    "   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# 課堂練習 ex3 Con't\n",
    "\n",
    "* 輸入:  \n",
    "    Tommy7410, tom7410  \n",
    "    Amy8520, amy85  \n",
    "    tim9630, tim9630  \n",
    "    Yen5566123456, yen0054321  \n",
    "    \n",
    "---\n",
    "\n",
    "* 輸出:  \n",
    "    Welcome, Tommy7410!  \n",
    "    Password length error! Your password length is 5  \n",
    "    Username format error! Your username is tim9630  \n",
    "    Username length error! Your username length is 13. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "\n",
    "# Greedy and Non-Greedy Match\n",
    "\n",
    "正規表示式在做查找(match)的時候，會盡量查找(match)到**最長符合的字串**，這種行為我們稱為 Greedy match，但有時這些行為不是我們所期望的。  \n",
    "在**最短符合的字串**就停止查找，這種行為我們稱為 Non-Greedy match。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "<center><h1> Greedy Match </h1></center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<h1> Title </h1>\n"
     ]
    }
   ],
   "source": [
    "# Greedy Match\n",
    "import re\n",
    "\n",
    "text = \"<h1> Title </h1>\"\n",
    "\n",
    "# Define pattern\n",
    "pattern = r'<.*>'\n",
    "\n",
    "# search pattern\n",
    "match = re.search(pattern, text)\n",
    "\n",
    "# Check\n",
    "print(match.group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "<center><h1> Non-Greedy Match </h1></center>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<h1>\n"
     ]
    }
   ],
   "source": [
    "# Non-Greedy match\n",
    "import re\n",
    "\n",
    "text = \"<h1> Title </h1>\"\n",
    "\n",
    "# Define pattern\n",
    "pattern = r'<.*?>'\n",
    "\n",
    "# search pattern\n",
    "match = re.search(pattern, text)\n",
    "\n",
    "# Check\n",
    "print(match.group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# 課程練習 ex4 (10 分鐘) (不計分)\n",
    "\n",
    "請利用 re 模組，提取出 html 檔案中 'a' 標籤的訊息。\n",
    "\n",
    "<center><h1> \\<a href=\"www.google.com\"\\> Google \\</a\\> </h1></center>\n",
    "> Link: www.google.com  \n",
    "> Content: Google"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<center><h1> 感謝聆聽 </h1></center>\n",
    "<center><h3> Reference: https://docs.python.org/3/library/re.html </h3></center>"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}