{
 "cells": [
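  {
   "cell_type": "markdown",
   "id": "831ff0f5",
   "metadata": {},
   "source": [
    "# The re module: regular expressions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08450111",
   "metadata": {},
   "source": [
    "A regular expression is a pattern for matching strings or substrings. The pattern can be very specific or very general.\n",
    "\n",
    "The Python standard library provides the `re` module:"
   ]
  },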
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "2d5fa2b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b96a8bdb",
   "metadata": {},
   "source": [
    "## The re.match() function\n",
    "\n",
    "`re.match()` tries the pattern only at the beginning of the string. It returns a Match object if the pattern matches there, otherwise None:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "8c0b625f",
   "metadata": {},
   "outputs": [],
   "source": [
    "pat = r\"\\d+\"  # raw string, so the backslash reaches the regex engine unchanged"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a8cd96f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "s = \"abc123abc123456\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7167a2af",
   "metadata": {},
   "outputs": [],
   "source": [
    "re.match(pat, s)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9bd305a9",
   "metadata": {},
   "source": [
    "Because the string does not start with a digit, there is no match and `re.match()` returns None."
   ]
  },
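  {
   "cell_type": "markdown",
   "id": "a3f10c01",
   "metadata": {},
   "source": [
    "As a minimal sketch of the behaviour described above, the snippet below contrasts a string that fails to match at position 0 with one that succeeds. The second input string is a hypothetical example, not taken from this notebook.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "pat = r\"\\d+\"  # one or more digits\n",
    "\n",
    "# re.match() tries the pattern only at the start of the string.\n",
    "no_match = re.match(pat, \"abc123abc123456\")\n",
    "print(no_match)  # None, because the string starts with 'a'\n",
    "\n",
    "# Hypothetical input that does start with digits.\n",
    "m = re.match(pat, \"123abc456\")\n",
    "if m is not None:  # check for None before using the result\n",
    "    print(m.group(0))  # 123\n",
    "```\n",
    "\n",
    "Checking the result against None before calling methods on it avoids an AttributeError when nothing matches."
   ]
  },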
  {
   "cell_type": "markdown",
   "id": "1e81588e",
   "metadata": {},
   "source": [
    "## The re.search() function"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "659d7c15",
   "metadata": {},
   "source": [
    "Unlike `re.match()`, `re.search()` scans the whole string for a position where the pattern matches. If it finds one, it returns the Match object for the first match, otherwise None:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "fa28fa59",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<re.Match object; span=(3, 6), match='123'>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.search(pat, s)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b17b6f30",
   "metadata": {},
   "source": [
    "Call the `.group()` method of the returned Match object to see the matched string:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "364cb723",
   "metadata": {},
   "outputs": [],
   "source": [
    "m = re.search(pat, s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "33a4dc68",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'123'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.group(0)"
   ]
  },
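  {
   "cell_type": "markdown",
   "id": "a3f10c02",
   "metadata": {},
   "source": [
    "A Match object carries more than the matched text. The sketch below uses a hypothetical two-group pattern to show `.group()` with group numbers and `.span()`. It also calls `re.findall()`, which is not covered in this notebook, to list every non-overlapping match rather than only the first.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "s = \"abc123abc123456\"\n",
    "\n",
    "# Hypothetical pattern: a run of lowercase letters followed by a run of digits.\n",
    "m = re.search(r\"([a-z]+)(\\d+)\", s)\n",
    "\n",
    "whole = m.group(0)    # the entire match: 'abc123'\n",
    "letters = m.group(1)  # first parenthesised group: 'abc'\n",
    "digits = m.group(2)   # second parenthesised group: '123'\n",
    "span = m.span()       # (start, end) indices of the match: (0, 6)\n",
    "\n",
    "# re.findall() returns all non-overlapping matches as a list of strings.\n",
    "all_digits = re.findall(r\"\\d+\", s)  # ['123', '123456']\n",
    "```\n",
    "\n",
    "`m.group(0)` and `m.group()` are equivalent: both return the whole match."
   ]
  },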
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "4b6759f4",
   "metadata": {},
   "source": [
    "## The re.split() function\n",
    "\n",
    "`re.split()` splits a string, using every match of the given regular expression as a separator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "f979ae49",
   "metadata": {},
   "outputs": [],
   "source": [
    "pat = \" +\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "b1f9ddd8",
   "metadata": {},
   "outputs": [],
   "source": [
    "s = \"a b    c   d  e\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "717cd7f6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['a', 'b', 'c', 'd', 'e']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.split(pat, s)"
   ]
  },
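  {
   "cell_type": "markdown",
   "id": "a3f10c03",
   "metadata": {},
   "source": [
    "Two variations on the split above may be useful; this is a sketch, and the short input in the second call is a hypothetical example. The `maxsplit` argument limits the number of splits, and wrapping the pattern in a capturing group keeps the separators in the result.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "s = \"a b    c   d  e\"\n",
    "\n",
    "# Split at most twice; the rest of the string is returned unsplit.\n",
    "first_two = re.split(r\" +\", s, maxsplit=2)  # ['a', 'b', 'c   d  e']\n",
    "\n",
    "# With a capturing group, the matched separators are kept in the list.\n",
    "with_seps = re.split(r\"( +)\", \"a b  c\")  # ['a', ' ', 'b', '  ', 'c']\n",
    "```"
   ]
  },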
  {
   "cell_type": "markdown",
   "id": "fa005fcf",
   "metadata": {},
   "source": [
    "## The re.sub() function"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "7e1b4b8e",
   "metadata": {},
   "source": [
    "`re.sub()` replaces every part of the string that matches the regular expression:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "ee543515",
   "metadata": {},
   "outputs": [],
   "source": [
    "pat = \" +\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "e2210ec5",
   "metadata": {},
   "outputs": [],
   "source": [
    "replace = \";\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "4174ab3d",
   "metadata": {},
   "outputs": [],
   "source": [
    "s = \"a b    c   d  e\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "c60255b9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'a;b;c;d;e'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.sub(pat, replace, s)"
   ]
  },
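  {
   "cell_type": "markdown",
   "id": "a3f10c04",
   "metadata": {},
   "source": [
    "Beyond a fixed replacement string, `re.sub()` accepts a `count` limit, backreferences such as `\\1` in the replacement, and a function as the replacement. The following is a sketch under assumptions: the strings `user@host` and `3 apples, 10 pears` are hypothetical inputs chosen for illustration.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "s = \"a b    c   d  e\"\n",
    "\n",
    "# count=1 replaces only the first match.\n",
    "only_first = re.sub(r\" +\", \";\", s, count=1)  # 'a;b    c   d  e'\n",
    "\n",
    "# \\1 and \\2 in the replacement refer to the captured groups.\n",
    "swapped = re.sub(r\"(\\w+)@(\\w+)\", r\"\\2@\\1\", \"user@host\")  # 'host@user'\n",
    "\n",
    "# A function receives each Match object and returns the replacement text.\n",
    "def double(m):\n",
    "    return str(int(m.group(0)) * 2)\n",
    "\n",
    "doubled = re.sub(r\"\\d+\", double, \"3 apples, 10 pears\")  # '6 apples, 20 pears'\n",
    "```"
   ]
  },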
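  {
   "cell_type": "markdown",
   "id": "bc36b819",
   "metadata": {},
   "source": [
    "## Regular expression rules\n",
    "\n",
    "A regular expression is made up of ordinary characters and metacharacters. Ordinary characters include upper- and lowercase letters and digits, while metacharacters have special meanings:\n",
    "\n",
    "Subexpression | Matches\n",
    "-- | --\n",
    "`.` | any character except a newline\n",
    "`\\w` | any letter, digit, or underscore\n",
    "`\\d` | any digit; equivalent to `[0-9]` for ASCII text\n",
    "`\\s` | any whitespace; equivalent to `[ \\t\\n\\r\\f\\v]` for ASCII text\n",
    "`\\W`, `\\D`, `\\S` | the complement of the corresponding lowercase class\n",
    "`[...]` | any one character from the set; ranges such as a-z and 0-9 are supported\n",
    "`(...)` | groups a subexpression so that it is matched as a unit\n",
    "`\\|` | alternation (logical OR)\n",
    "`^` | as the first character inside `[...]`, negates the set; outside a set, anchors the match at the start of the string\n",
    "`*` | the preceding subexpression 0 or more times\n",
    "`+` | the preceding subexpression 1 or more times\n",
    "`?` | the preceding subexpression 0 or 1 times\n",
    "`{m}` | the preceding subexpression exactly m times\n",
    "`{m,}` | the preceding subexpression at least m times\n",
    "`{m,n}` | the preceding subexpression at least m and at most n times\n",
    "\n",
    "For example:\n",
    "\n",
    "- `ca*t` matches ct, cat, caaaat, ...\n",
    "- `ab\\d|ac\\d` matches ab1, ac9, ...\n",
    "- `([^a-q]bd)` matches rbd, 5bd, ..."
   ]
  },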
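  {
   "cell_type": "markdown",
   "id": "a3f10c05",
   "metadata": {},
   "source": [
    "The example patterns above can be checked with a small helper. This is a sketch: `matches` is a hypothetical helper name, and it relies on `re.fullmatch()`, which succeeds only when the entire string matches the pattern.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "def matches(pattern, text):\n",
    "    \"\"\"Return True if the whole of text matches pattern.\"\"\"\n",
    "    return re.fullmatch(pattern, text) is not None\n",
    "\n",
    "print(matches(r\"ca*t\", \"ct\"))         # True: 'a' repeated 0 times\n",
    "print(matches(r\"ca*t\", \"caaaat\"))     # True: 'a' repeated 4 times\n",
    "print(matches(r\"ca*t\", \"cot\"))        # False: 'o' is not allowed\n",
    "print(matches(r\"ab\\d|ac\\d\", \"ac9\"))   # True: second alternative\n",
    "print(matches(r\"([^a-q]bd)\", \"rbd\"))  # True: 'r' is outside a-q\n",
    "print(matches(r\"([^a-q]bd)\", \"abd\"))  # False: 'a' is inside a-q\n",
    "```"
   ]
  },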
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66c90ed1",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}