{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 07-1 정규 표현식 살펴보기\n",
    "\n",
    "- 정규 표현식(Regular Expressions), 정규식\n",
    "\n",
    "### 정규 표현식은 왜 필요한가?\n",
    "\n",
    "> 주민등록번호를 포함하고 있는 텍스트에서 모든 주민등록번호의 뒷자리를 * 문자로 변경하시오.\n",
    "\n",
    "- 정규식을 모르는 경우\n",
    "    1. 전체 텍스트를 공백 문자로 나눈다(split).\n",
    "    2. 나누어진 단어들이 주민번호 형식인지 조사.\n",
    "    3. 단어가 주민번호 형식이면 뒷자리를 * 로 변환\n",
    "    4. 나누어진 단어들을 다시 조립"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "park 800905-*******\n",
      "kim 700905-*******\n",
      "\n"
     ]
    }
   ],
   "source": [
    "data = '''\n",
    "park 800905-1049118\n",
    "kim 700905-1059119\n",
    "'''\n",
    "\n",
    "result = []\n",
    "\n",
    "for line in data.split('\\n'):\n",
    "    word_result = []\n",
    "    for word in line.split(' '):\n",
    "        if len(word) == 14 and word[:6].isdigit() and word[6] == '-' and word[7:].isdigit():\n",
    "            word = word[:7] + \"*******\"\n",
    "        word_result.append(word)\n",
    "    result.append(\" \".join(word_result))\n",
    "    \n",
    "print(\"\\n\".join(result))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "park 800905-*******\n",
      "kim 700905-*******\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# 정규식을 사용하면\n",
    "\n",
    "import re\n",
    "\n",
    "data = '''\n",
    "park 800905-1049118\n",
    "kim 700905-1059119\n",
    "'''\n",
    "\n",
    "pat = re.compile(\"(\\d{6})[-]\\d{7}\")\n",
    "print(pat.sub(\"\\g<1>-*******\", data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 07-2  정규 표현식 시작하기\n",
    "\n",
    "### 정규 표현식의 기초, 메타 문자\n",
    "\n",
    "- 정규 표현식에는 다음과 같은 메타 문자가 있다.\n",
    "\n",
    "> . ^ $ * + ? { } [ ] \\ | ( )\n",
    "    \n",
    "#### 문자 클래스 [ ]\n",
    "\n",
    "문자 클래스로 만들어진 정규식은 \"[ 와 ] 사이의 문자들과 매치\" 라는 의미를 갖는다.\n",
    "\n",
    "***문자 클래스를 만드는 메타 문자인 [ 와 ] 사이에는 어떤 문자도 들어갈 수 있다.***\n",
    "\n",
    "즉, 정규 표현식이 [abc] 라면 이 표현식의 의미는 \"a, b, c 중 한개의 문자와 매치\"를 뜻한다.\n",
    "\n",
    "- \"a\"는 정규식과 일치하는 문자인 \"a\"가 있으므로 매치\n",
    "- \"before\"는 정규식과 일치하는 문자인 \"b\"가 있으므로 매치\n",
    "- \"dude\"는 정규식과 일치하는 문자인 a, b, c 중 어느 하나도 포함하고 있지 않으므로 매치되지 않음\n",
    "\n",
    "[ ] 안에 하이픈(-)을 사용하면 두 문자 사이의 범위(From - To)를 의마한다. [a-c]는 [abc]와 동일하고, [0-5]는 [012345]와 동일하다.\n",
    "\n",
    "- [a-zA-Z]: 알파벳 모두\n",
    "- [0-9]: 숫자\n",
    "\n",
    "메타 문자 ^ 는 반대(not)라는 의미로 주의를 해야 한다.\n",
    "\n",
    "- [^0-9]: 숫자가 아닌 문자만 매치\n",
    "\n",
    "#### [자주 사용하는 문자 클래스]\n",
    "\n",
    "[0-9], [a-zA-Z] 등과 같이 자주 사용하는 정규식의 경우 다음과 같이 별도의 표기법으로 표현할 수 있다.\n",
    "\n",
    "- \\d - 숫자와 매치, [0-9]와 동일\n",
    "- \\D - 숫자가 아닌 것과 매치, [^0-9]와 동일\n",
    "- \\s - whitespace 문자와 매치, [ \\t\\n\\r\\f\\v]와 동일, 맨 앞은 공백 문자\n",
    "- \\S - whitespace 문자가 아닌 것과 매치, [^ \\t\\n\\r\\f\\v]와 동일\n",
    "- \\w - 문자+숫자와 매치, [a-zA-Z0-9]와 동일\n",
    "- \\W - 문자+숫자가 아닌 문자와 매치, [^a-zA-Z0-9]와 동일\n",
    "\n",
    "#### Dot(.)\n",
    "\n",
    "Dot(.) 메타 문자는 \\n(줄바꿈 문자)를 제외한 모든 문자와 매치된다.\n",
    "\n",
    "- a.b - \"a + 모든문자 + b\"\n",
    "    - aab: 매치\n",
    "    - a0b: 매치\n",
    "    - abc: \"a\"와 \"b\" 사이에 어떤 문자도 없기 때문에 매치되지 않는다.\n",
    "    \n",
    "- a[.]b - \"모든 문자\"라는 의미가 아닌 문자 . 그대로를 의미한다.\n",
    "\n",
    "#### 반복(*)\n",
    "\n",
    "- ca*t - * 문자 앞의 a가 무한대로 반복될 수 있다는 의미(0번 포함)\n",
    "    - ct, cat, caaaat\n",
    "    \n",
    "#### 반복(+)\n",
    "\n",
    "최소한 한번 이상은 나와야 하는 반복\n",
    "\n",
    "- ca+t: cat, caaat 는 매치가 되나 ct 는 매치되지 않음\n",
    "\n",
    "#### 반복({m, n}, ?)\n",
    "\n",
    "{ } 메타 문자를 사용하면 반복 횟수를 고정시킬 수 있다.\n",
    " \n",
    "- {m, n}: m부터 n까지 반복\n",
    "- {m, }: 반복횟수가 m 이상인 경우\n",
    "- {, n}: 반복횟구가 n 이하인 경우\n",
    "\n",
    "\n",
    "- {1, }: + 와 동일\n",
    "- {0, }: * 와 동일\n",
    "- ?: {0, 1} 과 동일\n",
    "\n",
    "#### 파이썬에서 정규 표현식을 지원하는 re 모듈\n",
    "\n",
    "파이썬이 설치될 때 기본 라이브러리로 정규식을 지원하기 위한 re 모듈을 제공한다.\n",
    "\n",
    "```\n",
    "import re\n",
    "p = re.compile('ab*')\n",
    "```\n",
    "\n",
    "re.compile 을 이용하여 정규식을 컴파일하고 컴파일된 패턴객체(p)를 이용하여 작업을 수행한다.\n",
    "\n",
    "#### 정규식을 이용한 문자열 검색\n",
    "\n",
    "- match(): 문자열의 처음부터 정규식과 매치되는지 조사\n",
    "- search(): 문자열 전체를 검색하여 정규식과 매치되는지 조사\n",
    "- findall(): 정규식과 매치되는 모든 문자열(substring)을 리스트로 리턴\n",
    "- finditer(): 정규식과 매치되는 모든 문자열(substring)을 iterator 객체로 리턴\n",
    "\n",
    "match, search 는 정규식과 매치될 때에는 match 객체를 리턴하고 그렇지 않을 경우에는 None을 리턴한다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import re\n",
    "p = re.compile('[a-z]+')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<_sre.SRE_Match object; span=(0, 6), match='python'>\n"
     ]
    }
   ],
   "source": [
    "# match\n",
    "\n",
    "m = p.match(\"python\")\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    }
   ],
   "source": [
    "m = p.match(\"3 python\")\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "match found:  string\n"
     ]
    }
   ],
   "source": [
    "m = p.match(\"string goes here\")\n",
    "if m:\n",
    "    print(\"match found: \", m.group())\n",
    "else:\n",
    "    print(\"No match\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<_sre.SRE_Match object; span=(0, 6), match='python'>\n"
     ]
    }
   ],
   "source": [
    "# search\n",
    "\n",
    "m = p.search(\"python\")\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<_sre.SRE_Match object; span=(2, 8), match='python'>\n"
     ]
    }
   ],
   "source": [
    "m = p.search(\"3 python\") # 문자열 전체를 검색하기 때문에 매치\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['life', 'is', 'too', 'short']\n"
     ]
    }
   ],
   "source": [
    "# findall\n",
    "\n",
    "result = p.findall(\"life is too short\")\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<callable_iterator object at 0x7fba643a9978>\n"
     ]
    }
   ],
   "source": [
    "# finditer\n",
    "\n",
    "result = p.finditer(\"life is too short\")\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<_sre.SRE_Match object; span=(0, 4), match='life'>\n",
      "life\n",
      "<_sre.SRE_Match object; span=(5, 7), match='is'>\n",
      "is\n",
      "<_sre.SRE_Match object; span=(8, 11), match='too'>\n",
      "too\n",
      "<_sre.SRE_Match object; span=(12, 17), match='short'>\n",
      "short\n"
     ]
    }
   ],
   "source": [
    "for r in result: \n",
    "    print(r)\n",
    "    print(r.group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "---\n",
    "### match 객체의 메서드\n",
    "\n",
    "- group(): 매치된 문자열을 리턴\n",
    "- start(): 매치된 문자열의 시작 위치를 리턴\n",
    "- end(): 매치된 문자열의 끝 위치를 리턴\n",
    "- span(): 매치된 문자열의 (시작, 끝)에 해당하는 튜플을 리턴"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "m = p.match(\"python\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'python'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.start()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.end()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0, 6)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.span()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### 컴파일 옵션\n",
    "\n",
    "- DOTALL(S): . 이 줄바꿈 문자를 포함하도록 한다.\n",
    "- IGNORECASE(I): 대소문자를 구분없이 매치한다.\n",
    "- MULTILINE(M): 여러줄과 매치한다.\n",
    "- VERBOSE(X): verbose 모드를 사용할 수 있도록 한다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    }
   ],
   "source": [
    "# DOTALL, S\n",
    "\n",
    "import re\n",
    "\n",
    "p = re.compile('a.b')\n",
    "m = p.match('a\\nb')\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<_sre.SRE_Match object; span=(0, 3), match='a\\nb'>\n"
     ]
    }
   ],
   "source": [
    "p = re.compile('a.b', re.DOTALL)\n",
    "m = p.match('a\\nb')\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 1), match='p'>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# IGNORECASE, I\n",
    "\n",
    "p = re.compile('[a-z]', re.I)\n",
    "p.match('python')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 1), match='p'>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p.match('python3')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 1), match='p'>"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p.match('python 3')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "p.match('3 python')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 1), match='P'>"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p.match('Python')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 1), match='P'>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p.match('PYTHON')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['python one']\n"
     ]
    }
   ],
   "source": [
    "# MULTILINE, M\n",
    "\n",
    "import re\n",
    "p = re.compile(\"^python\\s\\w+\")\n",
    "\n",
    "data = \"\"\"python one\n",
    "life is too short\n",
    "python two\n",
    "you need python\n",
    "python three\"\"\"\n",
    "\n",
    "print(p.findall(data))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['python one', 'python two', 'python three']\n"
     ]
    }
   ],
   "source": [
    "p = re.compile(\"^python\\s\\w+\", re.M)\n",
    "\n",
    "print(p.findall(data))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# VERBOSE, X\n",
    "\n",
    "charref = re.compile(r'&[#](0[0-7]+|[0-9]+|x[0-9a-fA-F]+);')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "charref = re.compile(r\"\"\"\n",
    " &[#]                # Start of a numeric entity reference\n",
    " (\n",
    "     0[0-7]+         # Octal form\n",
    "   | [0-9]+          # Decimal form\n",
    "   | x[0-9a-fA-F]+   # Hexadecimal form\n",
    " )\n",
    " ;                   # Trailing semicolon\n",
    "\"\"\", re.VERBOSE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 패턴 객체인 charref 는 동일한 역할을 한다.\n",
    "- re.VERBOSE 옵션을 사용하면 문자열에 사용된 whitespace는 컴파일 시 제거된다.([] 내에 사용된 것은 제외)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## 07-3 강력한 정규 표현식의 세계로\n",
    "\n",
    "### 메타문자\n",
    "\n",
    "문자열 소모가 없는 메타 문자"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<_sre.SRE_Match object; span=(0, 4), match='Crow'>\n"
     ]
    }
   ],
   "source": [
    "# | 메타문자는 or 의 의미돠 동일하다.\n",
    "\n",
    "p = re.compile(\"Crow|Servo\")\n",
    "m = p.match(\"CrowHello\")\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<_sre.SRE_Match object; span=(0, 4), match='Life'>\n"
     ]
    }
   ],
   "source": [
    "# ^ 는 문자열의 맨 처음과 일치함을 의미한다.\n",
    "\n",
    "print(re.search(\"^Life\", \"Life is woo short\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    }
   ],
   "source": [
    "print(re.search(\"^Life\", \"My Life\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<_sre.SRE_Match object; span=(12, 17), match='short'>\n"
     ]
    }
   ],
   "source": [
    "# $ 는 ^ 메타문자와 반대의 경우\n",
    "\n",
    "print(re.search(\"short$\", \"Life is too short\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    }
   ],
   "source": [
    "print(re.search(\"short$\", \"Life is too short, you need python\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### \\A\n",
    "\n",
    "- 문자열의 처음과 매치\n",
    "- ^ 와 동일한 의미이지만, re.MULTILINE 옵션을 사용할 경우는 라인과 상관없이 전체 문자열의 처음하고만 매치\n",
    "\n",
    "#### \\Z\n",
    "\n",
    "- 문자열의 끝과 매치\n",
    "- $ 와 동일한 의미이지만, re.MULTILINE 옵션 사용 시 라인과 상관없이 전체 문자열의 끝하고만 매치"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# \\b: 단어 구분자\n",
    "\n",
    "p = re.compile(r'\\bclass\\b')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<_sre.SRE_Match object; span=(3, 8), match='class'>\n"
     ]
    }
   ],
   "source": [
    "print(p.search(\"no class at all\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    }
   ],
   "source": [
    "print(p.search(\"the declassified algorithm\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    }
   ],
   "source": [
    "print(p.search(\"one subclass is\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# \\B: \\b 와 반대의 경우 매치된다.\n",
    "\n",
    "p = re.compile(r'\\Bclass\\B')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    }
   ],
   "source": [
    "print(p.search(\"no class at all\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<_sre.SRE_Match object; span=(6, 11), match='class'>\n"
     ]
    }
   ],
   "source": [
    "print(p.search(\"the declassified algorithm\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    }
   ],
   "source": [
    "print(p.search(\"one subclass is\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 그룹핑\n",
    "\n",
    "- 반복되는 문자열을 찾는 경우\n",
    "- 매치된 문자열 중에서 특정 부분의 문자열만 뽑아내긴 위한 경우\n",
    "- 그룹핑된 문자열 재참조\n",
    "- 그룹을 만들어 주는 메타문자는 '(', ')'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<_sre.SRE_Match object; span=(0, 9), match='ABCABCABC'>\n"
     ]
    }
   ],
   "source": [
    "p = re.compile('(ABC)+')\n",
    "m = p.search('ABCABCABC OK?')\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ABCABCABC\n"
     ]
    }
   ],
   "source": [
    "print(m.group())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "park 010-1234-1234\n"
     ]
    }
   ],
   "source": [
    "p = re.compile(r\"\\w+\\s+\\d+[-]\\d+[-]\\d+\")\n",
    "m = p.search(\"park 010-1234-1234\")\n",
    "print(m.group())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "park 010-1234-1234\n"
     ]
    }
   ],
   "source": [
    "p = re.compile(r\"(\\w+)\\s+\\d+[-]\\d+[-]\\d+\")\n",
    "m = p.search(\"park 010-1234-1234\")\n",
    "print(m.group())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "park\n"
     ]
    }
   ],
   "source": [
    "print(m.group(1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "ename": "IndexError",
     "evalue": "no such group",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mIndexError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-7-9d9166b3aa7e>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroup\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mIndexError\u001b[0m: no such group"
     ]
    }
   ],
   "source": [
    "print(m.group(2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "park 010-1234-1234\n"
     ]
    }
   ],
   "source": [
    "print(m.group(0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "p = re.compile(r\"(\\w+)\\s+((\\d+)[-]\\d+[-]\\d+)\")\n",
    "m = p.search(\"park 010-1234-1234\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "park 010-1234-1234\n",
      "park 010-1234-1234\n",
      "park\n",
      "010-1234-1234\n",
      "010\n"
     ]
    },
    {
     "ename": "IndexError",
     "evalue": "no such group",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mIndexError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-10-614b6fd0421a>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      4\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroup\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroup\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroup\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mIndexError\u001b[0m: no such group"
     ]
    }
   ],
   "source": [
    "print(m.group())\n",
    "print(m.group(0))\n",
    "print(m.group(1))\n",
    "print(m.group(2))\n",
    "print(m.group(3))\n",
    "print(m.group(4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'the the'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p = re.compile(r'(\\b\\w+)\\s+\\1')\n",
    "p.search('Paris in the the spring').group()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 그룹핑된 문자열에 이름 붙이기\n",
    "\n",
    "- 정규식 내에 그룹이 많아질 경우 관리하기 위해 다음과 같이 그룹명을 지정할 수 있다.\n",
    "> (?P<name>\\w+)\\s+((\\d+)[-]\\d+[-]\\d+)\n",
    "- 달라진 부분은 다음과 같다. (\\w+)라는 그룹에 name 이라는 이름을 지정\n",
    "> (\\w+) --> (?P<name>\\w+)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "park\n"
     ]
    }
   ],
   "source": [
    "p = re.compile(r\"(?P<name>\\w+)\\s+((\\d+)[-]\\d+[-]\\d+)\")\n",
    "m = p.search(\"park 010-1234-1234\")\n",
    "print(m.group(\"name\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'the the'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 그룹명으로 재참조\n",
    "\n",
    "p = re.compile(r'(?P<word>\\b\\w+)\\s+(?P=word)')\n",
    "p.search('Paris in the the spring').group()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 전방 탐색\n",
    "\n",
    "- 긍정형 전방 탐색((?=...)) - ... 에 해당되는 정규식과 매치되어야 하며 조건이 통과되어도 문자열이 소모되지 않는다.\n",
    "- 부정형 전방 탐색((?!...)) - ...에 해당되는 정규식과 매치되지 않아야 하며 조건이 통과되어도 문자열이 소모되지 않는다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "http:\n"
     ]
    }
   ],
   "source": [
    "p = re.compile(\".+:\")\n",
    "m = p.search(\"http://google.com\")\n",
    "print(m.group())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "http\n"
     ]
    }
   ],
   "source": [
    "# : 제외한 결과를 얻기 위해 전방탐색 적용\n",
    "\n",
    "p = re.compile(\".+(?=:)\")\n",
    "m = p.search(\"http://google.com\")\n",
    "print(m.group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 부정형 전방 탐색"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "test.txt\n"
     ]
    }
   ],
   "source": [
    "p = re.compile(\".*[.](?!bat$).*$\")\n",
    "m = p.search(\"test.txt\")\n",
    "print(m.group())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "ename": "AttributeError",
     "evalue": "'NoneType' object has no attribute 'group'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-4-af87954517f3>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0mm\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msearch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"test.bat\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroup\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mAttributeError\u001b[0m: 'NoneType' object has no attribute 'group'"
     ]
    }
   ],
   "source": [
    "m = p.search(\"test.bat\")\n",
    "print(m.group())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "test.exe\n"
     ]
    }
   ],
   "source": [
    "m = p.search(\"test.exe\")\n",
    "print(m.group())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "test.txt\n"
     ]
    }
   ],
   "source": [
    "p = re.compile(\".*[.](?!bat$|exe$).*$\")\n",
    "m = p.search(\"test.txt\")\n",
    "print(m.group())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "ename": "AttributeError",
     "evalue": "'NoneType' object has no attribute 'group'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-7-af87954517f3>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0mm\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msearch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"test.bat\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroup\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mAttributeError\u001b[0m: 'NoneType' object has no attribute 'group'"
     ]
    }
   ],
   "source": [
    "m = p.search(\"test.bat\")\n",
    "print(m.group())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "ename": "AttributeError",
     "evalue": "'NoneType' object has no attribute 'group'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-8-499eb26ffefe>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0mm\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msearch\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"test.exe\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroup\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mAttributeError\u001b[0m: 'NoneType' object has no attribute 'group'"
     ]
    }
   ],
   "source": [
    "m = p.search(\"test.exe\")\n",
    "print(m.group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 문자열 바꾸기"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'color socks and color shoes'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p = re.compile('(blue|red|white)')\n",
    "p.sub('color', 'blue socks and red shoes')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'color socks and red shoes'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 바꾸는 횟수를 제한\n",
    "\n",
    "p.sub('color', 'blue socks and red shoes', count=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### [sub 메서드와 유사한 subn 메서드]\n",
    "\n",
    "- 동일한 기능이지만 결과를 튜플로 리턴한다.\n",
    "- 리턴되는 튜플의 첫번째 요소는 변경된 문자열이고, 두번째 요소는 바꾸기가 발생한 횟수이다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('color socks and color shoes', 2)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p.subn('color', 'blue socks and red shoes')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "#### sub 메서드 사용 시 참조 구문 사용하기"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "010-1234-1234 park\n"
     ]
    }
   ],
   "source": [
    "p = re.compile(r\"(?P<name>\\w+)\\s+(?P<phone>(\\d+)[-]\\d+[-]\\d+)\")\n",
    "print(p.sub(\"\\g<phone> \\g<name>\", \"park 010-1234-1234\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "010-1234-1234 park\n"
     ]
    }
   ],
   "source": [
    "# 그룹명 대신 참조번호 이용\n",
    "\n",
    "p = re.compile(r\"(?P<name>\\w+)\\s+(?P<phone>(\\d+)[-]\\d+[-]\\d+)\")\n",
    "print(p.sub(\"\\g<2> \\g<1>\", \"park 010-1234-1234\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### sub 메서드의 입력 인수로 함수 넣기"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def hexrepl(match):\n",
    "     \"Return the hex string for a decimal number\"\n",
    "     value = int(match.group())\n",
    "     return hex(value)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Call 0xffd2 for printing, 0xc000 for user code.'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p = re.compile(r'\\d+')\n",
    "p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Greedy vs Non-Greedy\n",
    "\n",
    "- \\* 메타문자는 매우 탐욕스러워서 매치할 수 있는 최대한의 문자열을 소모한다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "32"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s = '<html><head><title>Title</title>'\n",
    "len(s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(0, 32)\n"
     ]
    }
   ],
   "source": [
    "print(re.match('<.*>', s).span())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<html><head><title>Title</title>\n"
     ]
    }
   ],
   "source": [
    "print(re.match('<.*>', s).group())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- < html > 이라는 문자열만 소모되도록 제한하기 위해서는 Non-Greedy 문자인 ? 를 사용한다.\n",
    "- non-greedy 문자인 ?은 *?, +?, ??, {m,n}?과 같이 사용할 수 있다.\n",
    "- 가능한 한 가장 최소한의 반복을 수행하도록 도와주는 역할을 한다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<html>\n"
     ]
    }
   ],
   "source": [
    "print(re.match('<.*?>', s).group())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}