{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# [X-Village] Lesson08-regular_expression\n", "# by 劉正仁" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ " \n", " \n", " # Regular expression\n", "\n", "* As the name suggests, a regular expression is a **notation**.\n", "* A regular expression is itself a string written in regular expression syntax; you use it to find the parts of other text data that match it." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Diagram\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Applications\n", "\n", "* In-page text search in a browser (Ctrl+F)\n", "* Server-side validation of username and password formats\n", "* Text mining on large datasets\n", "* Replacing substrings in documents\n", "* And many more...\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# A quick look at how regular expressions work in a program" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The location of 'regular expression' is between 27 and 45 in text\n" ] } ], "source": [ "import re\n", "\n", "# Search sub-string in text\n", "text = \"Today is good day to learn regular expression.\"\n", "\n", "# Define re pattern\n", "pattern = r'regular expression'\n", "\n", "# Search if there is 'regular expression' in the text\n", "match = re.search(pattern, text)\n", "\n", "# Check result\n", "start_index = match.span()[0]\n", "end_index = match.span()[1]\n", "match_string = match.group()\n", "print(\"The location of '{}' is between {} and {} in text\".format(match_string, start_index, end_index))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Code walkthrough" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Code walkthrough (Cont'd)\n", "\n", "### Define the text data \n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Code walkthrough (Cont'd)\n", "\n", "### Define the regular expression \n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Code walkthrough (Cont'd)\n", "\n", "### What is a raw string? \n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Code walkthrough (Cont'd)\n", "\n", "Match the **text data** against the **regular expression** \n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Code walkthrough (Cont'd)\n", "\n", "### Check the result\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Using regular expressions\n", "\n", "* Ordinary text\n", "* Basic workflow\n", "* Common functions\n", "* Special characters \n", "* Repetition syntax " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Ordinary text\n", "\n", "

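If you put ordinary text (e.g. 'hello', 'johnny', 'name') in a regular expression, the program looks through the text data for a substring that is **exactly the same** as that text.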

\n", "\n", "

Think of it as the Ctrl+F find function you often use in a browser.

\n", "\n", "![CTRL+F](./images/re2.png)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class '_sre.SRE_Match'>\n", "The location of 'Monday' is between 9 and 15 in text\n" ] } ], "source": [ "# Demo\n", "# Ordinary literal in regular expression\n", "import re\n", "\n", "# The text which re applies to\n", "text = \"Today is Monday, tomorrow is Tuesday.\"\n", "\n", "# define pattern\n", "pattern = r'Monday'\n", "\n", "# search pattern in the text\n", "match = re.search(pattern, text)\n", "\n", "# Check\n", "print(type(match))\n", "start_index = match.span()[0]\n", "end_index = match.span()[1]\n", "match_string = match.group()\n", "print(\"The location of '{}' is between {} and {} in text\".format(match_string, start_index, end_index))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'NoneType'>\n", "None\n" ] } ], "source": [ "# Ordinary literal in regular expression\n", "# re cannot find a match in the text \n", "import re\n", "\n", "# The text which re applies to\n", "text = \"Today is Monday, tomorrow is Tuesday.\"\n", "\n", "# Specify pattern\n", "pattern = r'Wednesday'\n", "\n", "# Apply re to the text\n", "match = re.search(pattern, text)\n", "\n", "# Print out result\n", "print(type(match))\n", "print(match)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Basic workflow\n", " \n", "1. Define the regular expression \n", "2. Apply the regular expression to the text data \n", "3. Extract the matched string " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Define the regular expression\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import re\n", "\n", "# Define pattern\n", "pattern = r'cookie'" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Apply the regular expression to the text data\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "text = \"Cake and cookie\"\n", "\n", "# Non-Compiled Version\n", "match = re.search(pattern, text)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Extract the matched string" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cookie\n" ] } ], "source": [ "print(match.group())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Common functions\n", "\n", "Below are some commonly used functions and methods of the re module\n", "* re module\n", " * search\n", " * match\n", " * sub\n", " * compile\n", "* Match object methods\n", " * span\n", " * group\n", " * groups\n", " * groupdict" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## [re.search(pattern, string, flags=0)](https://docs.python.org/3/library/re.html#re.search)\n", "* **Arguments** \n", " * pattern \n", " The regular expression to use (e.g. r'cookie') \n", " * string \n", " The text data to apply the regular expression to\n", " * flags (not needed for now) \n", " Defaults to 0\n", "* **Return Value** \n", "If a substring of string matches pattern, a Match object is returned; otherwise None is returned.\n", "> **Note** \n", "> Unlike re.match, re.search returns a Match object when the pattern matches at **any position** in string." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Type of 'match' object: <class '_sre.SRE_Match'>\n", "Matching pattern: 'cookie'\n" ] } ], "source": [ "# re.search \n", "\n", "import re\n", "\n", "# text which re applies to\n", "text = \"Cake and cookie\"\n", "\n", "# Define pattern\n", "pattern = \"cookie\"\n", "\n", "# search pattern in the text\n", "match = re.search(pattern, text)\n", "\n", "# Check\n", "print(\"Type of 'match' object:\", type(match))\n", 
"print(\"Matching pattern: %r\" % match.group())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# [re.match(pattern, string, flags=0)](https://docs.python.org/3/library/re.html#re.match)\n", "\n", "* **Arguments**\n", " * pattern \n", " The regular expression to use (e.g. r'cookie') \n", " \n", " * string \n", " The text data to apply the regular expression to\n", " \n", " * flags (not needed for now) \n", " Defaults to 0\n", "* **Return Value** \n", "If the beginning of string matches pattern, a Match object is returned; otherwise None is returned.\n", "> **Note**\n", "> re.match only matches at the beginning of string. If the matching text appears only in the **middle** of string, a Match object is **not** returned." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Type of 'match1' object: <class 'NoneType'>\n", "None\n", "==================================================\n", "Type of 'match2' object: <class '_sre.SRE_Match'>\n", "Matching pattern: 'cookie'\n" ] } ], "source": [ "# demo\n", "# re.match\n", "\n", "import re\n", "\n", "# texts which re applies to\n", "text1 = \"Cake and cookie\"\n", "text2 = \"cookie and Cake\"\n", "\n", "# specify pattern\n", "pattern = \"cookie\"\n", "\n", "# search pattern in texts\n", "match1 = re.match(pattern, text1)\n", "match2 = re.match(pattern, text2)\n", "\n", "# Check\n", "print(\"Type of 'match1' object:\", type(match1))\n", "print(match1)\n", "print(\"=\"*50)\n", "print(\"Type of 'match2' object:\", type(match2))\n", "print(\"Matching pattern: %r\" % match2.group())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# [re.sub(pattern, repl, string, count=0, flags=0)](https://docs.python.org/3/library/re.html#re.sub)\n", "\n", "* **Arguments** \n", " * pattern \n", " The regular expression to use (e.g. r'cookie') \n", " * repl \n", " Either a **string** or a **function** (a callable object). If it is a string, every match of pattern is replaced by repl. If it is a function, it is called with each Match object and its return value is used as the replacement text. \n", " * string \n", " The text to apply the regular expression to \n", " * count \n", " Defaults to 0, which replaces **all** matches of pattern 
with the replacement (repl). If it is greater than 0, at most **count** matches are replaced. \n", " * flags (not needed for now) \n", " Defaults to 0\n", " \n", "* **Return Value** \n", "Returns a new string with the replacements applied" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original text: cookie and Cake\n", "After substitution: biscuit and Cake\n" ] } ], "source": [ "# Demo\n", "# re.sub with plain string\n", "\n", "import re\n", "\n", "# text which re applies to\n", "text = \"cookie and Cake\"\n", "\n", "# Define pattern\n", "pattern = r'cookie'\n", "\n", "# substitute the string 'cookie' with 'biscuit'\n", "new_text = re.sub(pattern, 'biscuit', text)\n", "\n", "# Check\n", "print(\"Original text:\", text)\n", "print(\"After substitution:\", new_text)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original text: cookie and Cake\n", "After substitution: cookie-cookie and Cake\n" ] } ], "source": [ "# Demo\n", "# re.sub with custom function\n", "\n", "import re\n", "\n", "# text which re applies to\n", "text = \"cookie and Cake\"\n", "\n", "# Define pattern\n", "pattern = r'cookie'\n", "\n", "# Custom replace function: receives the Match object, returns the replacement\n", "def repl(match):\n", "    new_string = match.group()+'-'+match.group()\n", "    return new_string\n", "\n", "new_text = re.sub(pattern, repl, text)\n", "\n", "print(\"Original text:\", text)\n", "print(\"After substitution:\", new_text)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# [re.compile(pattern, flags=0)](https://docs.python.org/3/library/re.html#re.compile)\n", "\n", "* **Arguments**\n", " * pattern \n", " The regular expression to compile and keep for reuse\n", " \n", " * flags (not needed for now) \n", " Defaults to 0\n", " \n", "* **Return Value** \n", "Returns a compiled pattern object" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { 
"slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "biscuit and Cake\n", "biscuit and Cake\n" ] } ], "source": [ "# Demo\n", "# re.compile\n", "\n", "import re\n", "\n", "# texts which re applies to\n", "text = \"cookie and Cake\"\n", "\n", "# Define pattern\n", "pattern = r'cookie'\n", "\n", "# Compile\n", "regex = re.compile(pattern)\n", "\n", "# Substitute with the compiled pattern and with the module-level function\n", "new_text1 = regex.sub('biscuit', text)\n", "new_text2 = re.sub(r'cookie', 'biscuit', text)\n", "\n", "print(new_text1)\n", "print(new_text2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Why compile?" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 275 µs, sys: 25 µs, total: 300 µs\n", "Wall time: 304 µs\n" ] } ], "source": [ "%%time\n", "\n", "# Newbie style\n", "\n", "text1 = 'Cake and cookie'\n", "text2 = 'cookie and Cake'\n", "text3 = 'Cake cookie'\n", "\n", "# search pattern\n", "for i in range(100):\n", "    match1 = re.search(r'cookie', text1)\n", "    match2 = re.search(r'cookie', text2)\n", "    match3 = re.search(r'cookie', text3)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 101 µs, sys: 9 µs, total: 110 µs\n", "Wall time: 113 µs\n" ] } ], "source": [ "%%time\n", "\n", "# Prof style\n", "\n", "text1 = 'Cake and cookie'\n", "text2 = 'cookie and Cake'\n", "text3 = 'Cake cookie'\n", "\n", "# Compile regex\n", "regex = re.compile(r'cookie')\n", "\n", "# search pattern\n", "for i in range(100):\n", "    match1 = regex.search(text1)\n", "    match2 = regex.search(text2)\n", "    match3 = regex.search(text3)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# 
Why compile? (Cont'd)\n", "* Keeps the compiled pattern object for reuse \n", "* Avoids repeating the same pattern throughout the code\n", "* Runs faster" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# [match.span( [group] )](https://docs.python.org/3.5/library/re.html#re.match.span)\n", "\n", "* **Arguments**\n", " * \\[group\\] \n", " Defaults to group 0, meaning the entire matched string. group can be the index of any valid group (e.g. 1, 2, 3, etc.).\n", " \n", " \n", "* **Return Value** \n", "A tuple (start, end) giving the position of this group in the text" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Match starting index: 9\n", "Match ending index: 15\n", "Result String: cookie\n" ] } ], "source": [ "# 5 min practice\n", "# match.span\n", "\n", "import re\n", "\n", "text = 'Cake and cookie'\n", "\n", "# Define pattern\n", "pattern = r'cookie'\n", "\n", "# search pattern\n", "match = re.search(pattern, text)\n", "\n", "# Check\n", "print(\"Match starting index:\", match.span()[0])\n", "print(\"Match ending index:\", match.span()[1])\n", "print(\"Result String:\", text[match.span()[0]:match.span()[1]])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# [match.group( [group1, ...] )](https://docs.python.org/3.5/library/re.html#re.match.group)\n", "\n", "* \\[group1, ...\\] \n", "If group1 is omitted it defaults to 0, which returns the entire matched string. \n", "groupN can be the index of any valid group (e.g. 1, 2, 3, etc.).\n", " \n", " \n", "* **Return Value** \n", "Returns the string matched by that group. If more than one argument is given, returns a tuple with one string per group. \n", "If a group did not take part in the match, its value is None." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# How to define a group\n", "\n", "

( pattern )

\n", " \n", "> pattern: the regular expression of the group " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Why define a group?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Suppose I want to extract (1) the last name and (2) the first name from the text \"Johnny Liu\"..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Would you define the regular expression like this?\n", "\n", "

pattern = r'Johnny Liu'

" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# It should be defined like this\n", "\n", "

pattern = r'(Johnny) (Liu)'

" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Entire matching string: cookie\n", "Entire matching string: cookie\n" ] } ], "source": [ "# Recap\n", "# Only one parameter in match.group\n", "\n", "import re\n", "\n", "text = 'Cake and cookie'\n", "\n", "# Define pattern\n", "pattern = r'cookie'\n", "\n", "# search pattern\n", "match = re.search(pattern, text)\n", "\n", "# Check: group() and group(0) both return the entire match\n", "print(\"Entire matching string:\", match.group())\n", "print(\"Entire matching string:\", match.group(0))\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Group1 matching string: Cake\n", "Group2 matching string: cookie\n", "Group1 and Group2 matching strings: ('Cake', 'cookie')\n" ] } ], "source": [ "# Demo\n", "# two or more parameters in match.group\n", "\n", "import re\n", "\n", "text = 'Cake and cookie'\n", "\n", "# Define pattern: () will define a group\n", "pattern = r'(Cake) and (cookie)'\n", "\n", "# search pattern\n", "match = re.search(pattern, text)\n", "\n", "# Check: only one argument\n", "print(\"Group1 matching string:\", match.group(1))\n", "print(\"Group2 matching string:\", match.group(2))\n", "\n", "# Check: two arguments\n", "print(\"Group1 and Group2 matching strings:\", match.group(1, 2))\n", "\n", "# Error: the pattern has no group 3, so this raises IndexError\n", "# print(\"Group3 matching string:\", match.group(3))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# [match.groups(default=None)](https://docs.python.org/3.5/library/re.html#re.match.groups)\n", "\n", "* **Return Value** \n", "Returns a tuple containing all the groups of the match. \n", "A group that did not take part in the match is given the default value, which is None unless you pass another default." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "Groups of match: ('Cake', 'cookie')\n" ] } ], "source": [ "# match.groups()\n", "\n", "import re\n", "\n", "text = 'Cake and cookie'\n", "\n", "# Define pattern\n", "pattern = r'(Cake) and (cookie)?'\n", "\n", "# search pattern\n", "match = re.search(pattern, text)\n", "\n", "# Check\n", "print(\"Groups of match:\", match.groups())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# [match.groupdict(default=None)](https://docs.python.org/3.5/library/re.html#re.match.groupdict)\n", "\n", "* **Return Value** \n", "Returns a dictionary of all the named groups. Each **key** is a group name and each **value** is the string matched by that group." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# How to name a group\n", "\n", "

(?P<group_name> pattern)

\n", " \n", "> ?P: the prefix required before the name when naming a group \n", "> group_name: the name of the group \n", "> pattern: the regular expression of the group \n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'dict'>\n", "{'fat': 'cookie', 'fatter': 'Cake'}\n" ] } ], "source": [ "# Demo\n", "# match.groupdict\n", "\n", "import re\n", "\n", "text = \"Cake and cookie\"\n", "\n", "# Define pattern with two named groups\n", "pattern = r'(?P<fatter>Cake) and (?P<fat>cookie)'\n", "\n", "# search pattern\n", "match = re.search(pattern, text)\n", "\n", "# Check\n", "d = match.groupdict()\n", "print(type(d))\n", "print(d)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Hands-on\n", "\n", "### Use match.groupdict() to extract the first name and last name from the earlier \"Johnny Liu\" example." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Special characters\n", "\n", "If regular expressions could only use plain text, they would be far too weak. Besides plain text, a regular expression can contain **special characters** that express **sets**, **special symbols**, **flags**, and other more flexible and powerful features.\n", "\n", "

Be sure to practice this part a lot.

" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Special character reference table\n", "\n", "| Character | Meaning\n", "| :------- | :--------\n", "| . | Match any single character except newline ('\\n')\n", "| \\w | Match any single letter, digit, or underscore\n", "| \\W | Match any single character that \\w does not match\n", "| \\s | Match a single whitespace character such as space, newline, tab, or carriage return\n", "| \\S | Match any single character that \\s does not match\n", "| \\t | Match a tab\n", "| \\n | Match a newline\n", "| \\r | Match a carriage return\n", "| \\d | Match a decimal digit 0-9\n", "| ^ | Match at the start of the string\n", "| $ | Match at the end of the string\n", "| \\A | Match only at the start of the string\n", "| \\b | Match only at the beginning or end of a word\n", "| \\ | Escape a special character so that it is matched literally\n", "| [...] | Match any single character listed between '[ ]'\n", "| [^...] | Match any single character not listed between '[ ]'\n", "\n", "> There are many more special characters; if you are interested, see the Python re documentation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "

\\w and \\d practice

" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'room_id': '65405', 'department': 'CSIE'}\n" ] } ], "source": [ "# demo\n", "# \\w and \\d practice\n", "# Extract CSIE and room_id\n", "\n", "import re\n", "\n", "# text which re applies to\n", "text = \"CSIE-65405\"\n", "\n", "# specify pattern: four word characters, a hyphen, then five digits\n", "pattern = r'(?P<department>\\w\\w\\w\\w)-(?P<room_id>\\d\\d\\d\\d\\d)'\n", "\n", "# search pattern\n", "match = re.search(pattern, text)\n", "\n", "# Check\n", "print(match.groupdict())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# In-class exercise ex1 (10 minutes) (ungraded)\n", "\n", "Using the re module, write a program that lets the user enter a name and extracts its lastname and firstname.\n", "\n", "* Input: \n", " Tom Tsai \n", " Amy Wang \n", " Tim Chen \n", " \n", "---\n", "\n", "* Output: \n", " \"Tom Tsai\": { 'lastname': \"Tsai\", 'firstname': \"Tom\" } \n", " \"Amy Wang\": { 'lastname': \"Wang\", 'firstname': \"Amy\" } \n", " \"Tim Chen\": { 'lastname': 'Chen', 'firstname': \"Tim\" } " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "

[...] practice

" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'room_id': '65405', 'department': 'CSIE'}\n" ] } ], "source": [ "# 5 mins for coding\n", "# [...] practice\n", "# Extract CSIE and room_id \n", "\n", "import re\n", "\n", "text = \"CSIE-65405\"\n", "\n", "# define pattern: four ASCII letters, a hyphen, then five digits\n", "pattern = r'(?P<department>[a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z])-(?P<room_id>[0-9][0-9][0-9][0-9][0-9])'\n", "\n", "# search pattern in text\n", "match = re.search(pattern, text)\n", "\n", "# check\n", "print(match.groupdict())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# In-class exercise ex2 (40 minutes) (ungraded)\n", "\n", "Using the re module, write a simple **user authentication system**: the user supplies a username and a password, and the program decides whether they are in a valid format. \n", "* **Username format**:\n", " 1. Total length is 3 \n", " 2. The first letter is uppercase \n", " 3. The remaining letters are lowercase \n", " \n", "* **Password format**:\n", " 1. Total length is 9 \n", " 2. The first 3 characters are lowercase letters \n", " 3. The last 6 characters are digits 0-9 \n", " 4. The first digit must be 0 " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# In-class exercise ex2 (Cont'd)\n", "* Input: \n", " Tom, tom059357 \n", " Amy, amy154852 \n", " TiM, tim0002356 \n", " Yen, yen0054321 \n", " \n", "---\n", "\n", "* Output: \n", " Welcome, Tom! \n", " Password format error! Your password is amy154852 \n", " Username format error! Your username is TiM \n", " Password length error! Your password length is 10." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Correct username\n", "Correct password\n", "Valid user\n" ] } ], "source": [ "# Example\n", "\n", "import re\n", "\n", "class AuthSystem:\n", "\n", "    def __init__(self):\n", "        \"\"\"Define regex\"\"\"\n", "        self.username_regex = re.compile(r'johnny')\n", "        self.password_regex = re.compile(r'johnny860410')\n", "\n", "    def _check_username(self, username):\n", "        \"\"\"Check username is valid or not\"\"\"\n", "        # fullmatch requires the whole input to match, not just a substring\n", "        if self.username_regex.fullmatch(username) is not None:\n", "            print(\"Correct username\")\n", "            return True\n", "        else:\n", "            print(\"Wrong username\")\n", "            return False\n", "\n", "    def _check_password(self, password):\n", "        \"\"\"Check password is valid or not\"\"\"\n", "        if self.password_regex.fullmatch(password) is not None:\n", "            print(\"Correct password\")\n", "            return True\n", "        else:\n", "            print(\"Wrong password\")\n", "            return False\n", "\n", "    def authenticate(self, username, password):\n", "        \"\"\"authenticate the user\"\"\"\n", "        if not self._check_username(username):\n", "            return\n", "\n", "        if not self._check_password(password):\n", "            return\n", "\n", "        print(\"Valid user\")\n", "\n", "\n", "# Construct an object of type AuthSystem\n", "auth = AuthSystem()\n", "\n", "# authenticate the user's credentials\n", "auth.authenticate(\"johnny\", \"johnny860410\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Repetition syntax\n", "\n", "To make regular expressions more **concise** and more **flexible**, regular expression syntax provides **repetition**. \n", "We can use it to revise what we wrote earlier.\n", "\n", "

\\w\\w\\w\\w ==> \\w{4}

" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Repetition syntax reference table\n", "\n", "| Character | Meaning\n", "| :--- | :--- \n", "| + | Match one or more repetitions of the element to its left\n", "| * | Match zero or more repetitions of the element to its left\n", "| ? | Match zero or one occurrence of the element to its left\n", "| {x} | Match exactly 'x' repetitions of the element to its left\n", "| {x,} | Match 'x' or more repetitions of the element to its left\n", "| {x,y} | Match between 'x' and 'y' repetitions (inclusive) of the element to its left" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "

+ practice

" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n", "CakeCakeCake and cookie\n" ] } ], "source": [ "# Demo\n", "# '+' practice\n", "import re\n", "\n", "# texts which re applies to\n", "text1 = \" and cookie\"\n", "text2 = \"CakeCakeCake and cookie\"\n", "\n", "# Define pattern: 'Cake' must appear at least once\n", "plus_pattern = \"(Cake)+ and cookie\"\n", "\n", "# search pattern in texts\n", "plus_match1 = re.search(plus_pattern, text1)\n", "plus_match2 = re.search(plus_pattern, text2)\n", "\n", "# check\n", "print(plus_match1)\n", "print(plus_match2.group())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "

* practice

" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " and cookie\n", "CakeCakeCake and cookie\n" ] } ], "source": [ "# Demo\n", "# '*' practice\n", "import re\n", "\n", "# texts which re applies to\n", "text1 = \" and cookie\"\n", "text2 = \"CakeCakeCake and cookie\"\n", "\n", "# Define pattern: 'Cake' may appear any number of times, including zero\n", "mul_pattern = \"(Cake)* and cookie\"\n", "\n", "# search pattern in texts\n", "mul_match1 = re.search(mul_pattern, text1)\n", "mul_match2 = re.search(mul_pattern, text2)\n", "\n", "# check\n", "print(mul_match1.group())\n", "print(mul_match2.group())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "

? practice

" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " and cookie\n", "Cake and cookie\n" ] } ], "source": [ "# 5 mins for coding\n", "# '?' practice\n", "import re\n", "\n", "# texts which re applies to\n", "text1 = \" and cookie\"\n", "text2 = \"Cake and cookie\"\n", "\n", "# Define pattern: 'Cake' is optional\n", "ques_pattern = \"(Cake)? and cookie\"\n", "\n", "# search pattern in texts \n", "ques_match1 = re.search(ques_pattern, text1)\n", "ques_match2 = re.search(ques_pattern, text2)\n", "\n", "# Check\n", "print(ques_match1.group())\n", "print(ques_match2.group())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# In-class exercise ex3 (10 minutes)\n", "\n", "Continue from the previous exercise and extend it so that the accepted username and password formats are more complex. \n", "* **Username format**: \n", " 1. Total length is at least 6 and at most 12 \n", " 2. The first letter is uppercase \n", " 3. The remaining characters may be digits or uppercase or lowercase letters \n", " \n", "* **Password format**: \n", " 1. Total length is at least 6 \n", " 2. Only lowercase letters or digits are allowed \n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# In-class exercise ex3 (Cont'd)\n", "\n", "* Input: \n", " Tommy7410, tom7410 \n", " Amy8520, amy85 \n", " tim9630, tim9630 \n", " Yen5566123456, yen0054321 \n", " \n", "---\n", "\n", "* Output: \n", " Welcome, Tommy7410! \n", " Password length error! Your password length is 5 \n", " Username format error! Your username is tim9630 \n", " Username length error! Your username length is 13. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "# Greedy and Non-Greedy Match\n", "\n", "When matching, a regular expression tries to match the **longest possible string**. This behavior is called a greedy match, but it is not always what we want. \n", "Stopping at the **shortest possible string** instead is called a non-greedy match." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "

Greedy Match

" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<h1>Title</h1>\n" ] } ], "source": [ "# Greedy Match\n", "import re\n", "\n", "# Sample HTML snippet (the tag name is illustrative)\n", "text = \"<h1>Title</h1>\"\n", "\n", "# Define pattern\n", "pattern = r'<.*>'\n", "\n", "# search pattern\n", "match = re.search(pattern, text)\n", "\n", "# Check\n", "print(match.group())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "

Non-Greedy Match

" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<h1>\n" ] } ], "source": [ "# Non-Greedy match\n", "import re\n", "\n", "# Sample HTML snippet (the tag name is illustrative)\n", "text = \"<h1>Title</h1>\"\n", "\n", "# Define pattern\n", "pattern = r'<.*?>'\n", "\n", "# search pattern\n", "match = re.search(pattern, text)\n", "\n", "# Check\n", "print(match.group())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# In-class exercise ex4 (10 minutes) (ungraded)\n", "\n", "Using the re module, extract the information in the 'a' tag of an HTML file.\n", "\n", "

\\<a href=\"www.google.com\"\\> Google \\</a\\>

\n", "> Link: www.google.com \n", "> Content: Google" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

Thank you for listening

\n", "

Reference: https://docs.python.org/3/library/re.html

" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 2 }