{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **NER 사전 만들기**\n",
    "https://pypi.org/project/mdict-utils/\n",
    "1. **MDX 사전파일을** 활용하여 객체에 구분하기\n",
    "\n",
    "## **1 Sqlite3 데이터 불러오기**\n",
    "sqlite3 데이터 불러오기"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import sqlite3\n",
    "# import pandas as pd\n",
    "# conn = sqlite3.connect('backup/kordict.db')\n",
    "# resp = conn.execute(\"SELECT name FROM sqlite_master WHERE type='table';\")\n",
    "# tableNames = [name[0]  for name in resp]\n",
    "# tableNames"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %%time\n",
    "# df = pd.read_sql_query(\"SELECT * FROM {}\".format(tableNames[1]), conn)\n",
    "# itemIndex = df.entry.values.tolist()\n",
    "# itemData  = df.paraphrase.values.tolist()\n",
    "# print(len(itemIndex), itemIndex[:10], df.shape)\n",
    "# df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **2 자료 전처리**\n",
    "1. regex 를 활용하여 가치있는 **명사내용** 추출하기\n",
    "1. \"「1」\" 내용은 동일한 개념의 단어를 정리한 것\n",
    "1. sentence 별로 내용의 확인 및 구분이 필요\n",
    "\n",
    "```html\n",
    "<font color=crimson size=+2><b>대상행동</b></font>\n",
    "<br><br>\n",
    "<font COLOR=\"#00008B\"><b>대상^행동(代償行動)</b></font>\n",
    "『심리』<br> 자기가 요구하는 바를 얻지 못할 때 그와 비슷한 다른 대상으로 만족을 채우려는 행동. ≒보상 작용\n",
    "<br>「1」ㆍ보상 행동.\\r\\n'\n",
    " ```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # 색인정보 : 51만개의 단어사전 데이터\n",
    "# itemIndex = df.entry.values.tolist()\n",
    "# itemData  = df.paraphrase.values.tolist()\n",
    "# len(itemIndex), itemIndex[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle\n",
    "with open('../data/nerDict.pk', 'rb') as handle:\n",
    "    nerDict = pickle.load(handle)\n",
    "itemIndex = list(nerDict.Text)\n",
    "itemData  = list(nerDict.Data)\n",
    "len(itemIndex), len(itemData)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **3 HTML 태그 제거 및 본문선별**\n",
    "1. **/font** 태그가 제목 이외에, 세부내용에도 포함됩니다.\n",
    "1. 때문에 이를 무조건 적용하기엔 문제가 있어서 Tag 구조를 명확하게 파악하기\n",
    "1. 우선 구조에 대한 형태를 몇가지로 확인한 뒤 작업을 진행하기\n",
    "\n",
    "```python\n",
    "'<title.*?>(.+?)</title>'  # 특정태그\n",
    "\"<[^>]+>|[^<]+\"            # html 태그 내부의 한글 추출\n",
    "'<.*?>'                    # 모든 태그\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "search_query = \"대리인\"\n",
    "num = [no  for no, _ in enumerate(itemIndex)  if _ == search_query][0]\n",
    "num"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "item_sample = \"\"\"<font color=crimson size=+2><b>대상행동</b></font><br><br>\n",
    "<font COLOR=\"#00008B\"><b>대상^행동(代償行動)</b></font>\n",
    "<img src=/url/source/image.jpg>\n",
    "『심리』<br> 자기가 요구하는 바를 얻지 못할 때 그와 비슷한 다른 대상으로 만족을 채우려는 행동. ≒보상 작용\n",
    "<br>「1」ㆍ보상 행동.\\r\\n\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tempXml = itemData[num] \n",
    "# tempXml = \" \".join(tempXml)# .lower()\n",
    "tempXml = item_sample.lower()\n",
    "from bs4 import BeautifulSoup\n",
    "html_source = BeautifulSoup(tempXml, 'html.parser').prettify()\n",
    "html_source # print(html_source)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "# regex_html = '<.*>'                 # html 태그내용만 추출 (탐욕적 수량자)\n",
    "regex_html = '<.*?>'                 # html 태그내용만 추출 (게으른 수량자)\n",
    "regex_text = \"<[^>]+>|[^<]+\"         # html 태그 내부의 한글 추출\n",
    "regex_font = '<font.*?>(.+?)</font>' # html 추출 : font 태그 내용만 추출\n",
    "\n",
    "# html 태그의 추출\n",
    "f_html = re.findall(regex_html, tempXml) # HTML 태그내용\n",
    "\", \".join(f_html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 특정 font 태그 내부의 text 를 추출\n",
    "f_font = re.findall(regex_font, tempXml)\n",
    "[\"<font>{}</font>\".format(_) for _ in f_font]\n",
    "[_ for _ in f_font]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# f_font 의 태그를 제거한 나머지\n",
    "f_txt = [re.sub(regex_html,\"\", _)  for _ in f_font] \n",
    "f_txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "f_full = re.findall(regex_text, tempXml) \n",
    "\", \".join(f_full)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 「1」', '「2」 개념설명 1, 2\n",
    "# ¶  문장의 예시\n",
    "[_  for _ in f_full  if _ not in f_html]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# font 태그 내용을 제거한 나머지\n",
    "[_  for _ in f_full  if _ not in f_html  if _ not in f_txt]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = []\n",
    "# for _ in itemData:\n",
    "#     result += re.findall(regex_html, _) # HTML 태그내용\n",
    "result      = list(set(result))\n",
    "result_tags = []         # 결과 중 사진포함 내용을 제거한 나머지\n",
    "regex_img   = '<img.*?>' # html 추출 : font 태그 내용만 추출\n",
    "for _ in result:\n",
    "    _ = re.sub(regex_img, \"\", _)\n",
    "    if len(_) > 2:\n",
    "        result_tags.append(_.lower())\n",
    "len(set(result_tags)), sorted(set(result_tags))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **4 분류 작업용 함수 만들기**\n",
    "3개의 분류를 활용하여 List 만들기, 이후 대표적인 대상만 남기고 줄이기\n",
    "1. 타이들\n",
    "1. 대분류\n",
    "1. 소분류 1, 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "search_query = \"대리인\"\n",
    "num     = [no  for no, _ in enumerate(itemIndex)  if _ == search_query][0]\n",
    "tempXml = itemData[num].lower()\n",
    "tempXml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "regex_html = '<.*?>'                 # html 태그내용만 추출\n",
    "regex_text = \"<[^>]+>|[^<]+\"         # html 태그 내부의 한글 추출\n",
    "regex_font = '<font.*?>(.+?)</font>' # html 추출 : font 태그 내용만 추출\n",
    "f_full = re.findall(regex_text, tempXml) \n",
    "f_html = re.findall(regex_html, tempXml) # HTML 태그내용\n",
    "f_test = [_  for _ in f_full  if _ not in f_html]\n",
    "f_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# (?<=\\$)\n",
    "test  = \"「명사」「1」「동-사」\"\n",
    "t_txt = re.compile(r\"「[·가-힣]+」\")\n",
    "t_txt = re.compile(r\"「([·가-힣]+)」\")\n",
    "t_num = re.compile(r\"「\\d+」\")\n",
    "t_box = re.compile(r\"「.*?」\")\n",
    "# t_box = re.compile(r\"(?<=\\「).*?(?<=\\」)$\")\n",
    "# t_num.findall(test)\n",
    "# t_txt.findall(test)\n",
    "t_box.findall(test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "search_query = \"대리인\"\n",
    "num     = [no  for no, _ in enumerate(itemIndex)  if _ == search_query][0]\n",
    "tempXml = itemData[num].lower()\n",
    "tempXml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result, result_len = [], []\n",
    "for _ in itemData:\n",
    "    temp = t_txt.findall(_)\n",
    "    temp_num = len(temp)\n",
    "    result_len.append(temp_num)\n",
    "    if temp_num > 1:\n",
    "        result += temp\n",
    "result =list(set(result))\n",
    "len(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **4 분류 작업용 함수 만들기 2**\n",
    "1. \"「.*?」\" 를 기준으로 객체들을 구분 분류하기\n",
    "1. 추후 숫자와 개념어를 구분하여 분류 재정의 하기"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 텍스트 인덱스값 찾기\n",
    "search_query = \"가결\"\n",
    "search_query = \"대리인\"\n",
    "search_query = \"서버\"\n",
    "num     = [no  for no, _ in enumerate(itemIndex)  if _ == search_query][0]\n",
    "tempXml = itemData[num].lower()\n",
    "tempXml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "regex_html = '<.*?>'                 # html 태그내용만 추출\n",
    "regex_text = \"<[^>]+>|[^<]+\"         # html 태그 내부의 한글 추출\n",
    "f_txt      = [re.sub(regex_html,\"\", _)  for _ in f_font] \n",
    "f_full     = re.findall(regex_text, tempXml)\n",
    "t_temp     = [_  for _ in f_full  if _ not in f_html]\n",
    "t_temp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = []\n",
    "for _ in itemData:\n",
    "    temp = ''\n",
    "    tok  = re.compile(r\"(「.*」)\")\n",
    "    for t in tok.findall(_):\n",
    "        if len(t) < 10:\n",
    "            t = t.replace(\"「\",\"\")\n",
    "            temp += \",\"+ t.replace(\"」\",\"\")\n",
    "    result.append(temp)\n",
    "len(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_html = \" \".join(list(set(result)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test = \"<img src='hc513sy.jpg' width='300'> 테스트 추출\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **3 HTML 태그 제거 및 본문선별**\n",
    "1. **/font** 태그가 제목 이외에, 세부내용에도 포함됩니다.\n",
    "1. 때문에 이를 무조건 적용하기엔 문제가 있어서 Tag 구조를 명확하게 파악하기\n",
    "1. 우선 구조에 대한 형태를 몇가지로 확인한 뒤 작업을 진행하기\n",
    "\n",
    "```python\n",
    "'<title.*?>(.+?)</title>'  # 특정태그\n",
    "'<.*?>' # 모든 태그\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer = re.compile(\"<*h4[^>]*>(.*?)<\\s*/\\s*h4>/g\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer.findall('<h4 class=\"sds\">And more ...</h4>')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "re.findall"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "regex = re.compile(\"/<(.|\\n)*?>/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "regex.finditer(\"<p>test</p>\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "idxNo = 12004\n",
    "itemTemp = itemData[idxNo].lower()\n",
    "toknizer = re.compile('<.*?>')\n",
    "toknizer_class = re.compile('class=\"[a-zA-Z0-9:;\\.\\s\\(\\)\\-\\,]*\"')\n",
    "# re.sub(toknizer, '', itemTemp)\n",
    "re.findall(toknizer, itemTemp)[6]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **3 Regex를 활용하여 구분하기**\n",
    "본문에 \"\\<br\\>\" 내용이 많아서 \"//text()\" 를 사용하면 컬럼갯수가 꼬여버린다\n",
    "regex, lxml 을 사용하면 좋겠지만 경우의 수가 많아서 모두 처리하지 않으면 안됨.."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = itemData[2945]\n",
    "data = data.lower()  # html Tag 전처리\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "re.findall(r'<font color=crimson (.*?)</font>', data, re.DOTALL)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re # 강결합으로 정치큰 부분 제거하기\n",
    "re.sub('<br><font color=.*</font>', \"\",data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = re.sub('<font .*</font><br>', \"\",data).split(\"<br>\")\n",
    "_[0], _[1].replace(\"\\r\\n\", \"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from lxml.html import fromstring\n",
    "def exportMdx(data):\n",
    "    data = data.lower()  # html Tag 전처리\n",
    "    _ = re.sub('<font .*</font><br>', \"\",data).split(\"<br>\")\n",
    "    return fromstring(data).xpath('//font//text()'), (_[0], _[1].replace(\"\\r\\n\", \"\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}