{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# **Append**\n",
    "통계적 문법접근방법 빠르고 간단하게 살펴보기"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br></br>\n",
    "## **1 Parsing Tree**\n",
    "문법태그를 활용한 문법구조 생성하기"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %%time\n",
    "# text = '여배우 박민영은 높은 싱크로를 보여줬다'\n",
    "\n",
    "# from konlpy.tag import Twitter\n",
    "# twitter = Twitter()\n",
    "# words = twitter.pos(text, stem=True)\n",
    "# print(words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# from nltk import RegexpParser\n",
    "\n",
    "# grammar = \"\"\"\n",
    "# NP: {<N.*>*<Suffix>?}   # 명사구를 정의한다\n",
    "# VP: {<V.*>*}            # 동사구를 정의한다\n",
    "# AP: {<A.*>*}            # 형용사구를 정의한다 \"\"\"\n",
    "# parser = RegexpParser(grammar)\n",
    "# parser"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# chunks = parser.parse(words)\n",
    "# chunks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# text_tree = [list(txt)    for txt in chunks.subtrees()]\n",
    "# from pprint import pprint\n",
    "# pprint(text_tree[1:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br></br>\n",
    "## **2. CFG 분석방법 맛보기**\n",
    "정해진 유형에 따른 문법적 구조예제 활용하기"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Grammar with 23 productions (start state = S)\n",
      "    S -> NP VP [1.0]\n",
      "    VP -> V NP [0.59]\n",
      "    VP -> V [0.4]\n",
      "    VP -> VP PP [0.01]\n",
      "    NP -> Det N [0.41]\n",
      "    NP -> Name [0.28]\n",
      "    NP -> NP PP [0.31]\n",
      "    PP -> P NP [1.0]\n",
      "    V -> 'saw' [0.21]\n",
      "    V -> 'ate' [0.51]\n",
      "    V -> 'ran' [0.28]\n",
      "    N -> 'boy' [0.11]\n",
      "    N -> 'cookie' [0.12]\n",
      "    N -> 'table' [0.13]\n",
      "    N -> 'telescope' [0.14]\n",
      "    N -> 'hill' [0.5]\n",
      "    Name -> 'Jack' [0.52]\n",
      "    Name -> 'Bob' [0.48]\n",
      "    P -> 'with' [0.61]\n",
      "    P -> 'under' [0.39]\n",
      "    Det -> 'the' [0.41]\n",
      "    Det -> 'a' [0.31]\n",
      "    Det -> 'my' [0.28]\n"
     ]
    }
   ],
   "source": [
    "from nltk.grammar import toy_pcfg2\n",
    "grammar = toy_pcfg2\n",
    "print(grammar)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # Early Chart 분석방법 맛보기\n",
    "# import nltk\n",
    "# nltk.parse.featurechart.demo( print_times   = False, \n",
    "#                               print_grammar = True, \n",
    "#                               parser = nltk.parse.featurechart.FeatureChartParser, \n",
    "#                               sent   = 'I saw a dog' )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br></br>\n",
    "## **3 Word Net을 활용한 명사/동사/형용사 의미분석**\n",
    "**SynSet** 내용 살펴보기 ( Word Net 에 포함된 같은단어 Node 모음)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Synset('dog.n.01'),\n",
       " Synset('frump.n.01'),\n",
       " Synset('dog.n.03'),\n",
       " Synset('cad.n.01'),\n",
       " Synset('frank.n.02'),\n",
       " Synset('pawl.n.01'),\n",
       " Synset('andiron.n.01'),\n",
       " Synset('chase.v.01')]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from nltk.corpus import wordnet as wn\n",
    "wn.synsets('dog')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['she got a reputation as a frump', \"she's a real dog\"]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wn.synset('frump.n.01').examples()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'a dull unattractive unpleasant girl or woman'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wn.synset('frump.n.01').definition()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br></br>\n",
    "## **3 SynSet 활용하기**\n",
    "Wordnet을 활용하여 단의 의미 구분하기"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: pywsd in /home/markbaum/Python/nltk/lib/python3.6/site-packages (1.1.7)\n",
      "Requirement already satisfied: pandas in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pywsd) (0.23.3)\n",
      "Requirement already satisfied: nltk in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pywsd) (3.3)\n",
      "Requirement already satisfied: numpy in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pywsd) (1.14.5)\n",
      "Requirement already satisfied: pytz>=2011k in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pandas->pywsd) (2018.5)\n",
      "Requirement already satisfied: python-dateutil>=2.5.0 in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pandas->pywsd) (2.7.3)\n",
      "Requirement already satisfied: six in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from nltk->pywsd) (1.11.0)\n"
     ]
    }
   ],
   "source": [
    "# NLTK 기본 모듈에 포함된 wordnet DB를 활용\n",
    "! pip3 install pywsd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Warming up PyWSD (takes ~10 secs)... took 2.9580390453338623 secs.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Synset('frump.n.01')"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sent = 'He act like a real dog'\n",
    "ambiguous = 'dog'\n",
    "\n",
    "from pywsd.lesk import simple_lesk\n",
    "answer = simple_lesk(sent, ambiguous)\n",
    "answer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'a dull unattractive unpleasant girl or woman'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "answer.definition()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Synset('cad.n.01')"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sent = 'He looks like dirty dog'\n",
    "ambiguous = 'dog'\n",
    "answer = simple_lesk(sent, ambiguous)\n",
    "answer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'someone who is morally reprehensible'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "answer.definition()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br></br>\n",
    "## **4 NLTK 객체 활용**\n",
    "nltk 객체를 활용하여 작업을 효율적으로 활용한다"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **01 nltk 객체 정의하기**\n",
    "Token List 객체를 생성한 뒤, 이를 활용하여 nltk 객체를 만든다"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # 삼성전자 지속가능경영 보고서\n",
    "# skipword = ['갤러시', '가치창출']\n",
    "\n",
    "# from txtutil import txtnoun\n",
    "# from nltk.tokenize import word_tokenize\n",
    "# texts  = txtnoun(\"../data/kr-Report_2018.txt\", skip=skipword)\n",
    "# tokens = word_tokenize(texts)\n",
    "# tokens[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # nltk Token 객체를 활용한 다양한 메소드를 제동\n",
    "# import nltk\n",
    "# ss_nltk = nltk.Text(tokens, name='2018지속성장')\n",
    "# ss_nltk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **02 nltk 객체 활용하기**\n",
    "내부 메서드를 활용한다"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # 객체의 이름을 출력\n",
    "# ss_nltk.name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # Token 과 연어관계에 있는 단어목록\n",
    "# ss_nltk.collocations(num=30, window_size=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # Token의 주변에 등장하는 단어들\n",
    "# ss_nltk.common_contexts(['책임경영'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # 인접하여 위치하는 Token 을 출력\n",
    "# ss_nltk.concordance('책임경영', lines=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ss_nltk.concordance_list('책임경영')[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # Token 의 빈도값 출력\n",
    "# ss_nltk.count('책임경영')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %matplotlib inline\n",
    "# from matplotlib import rc\n",
    "# rc('font', family=['NanumGothic','Malgun Gothic'])\n",
    "\n",
    "# # 해당 단어별 출현빈도 비교출력\n",
    "# ss_nltk.dispersion_plot(['책임경영', '경영진', '갤럭시', '갤러시', '업사이클링'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # 객체의 빈도를 Matplot linechart 로 출력\n",
    "# ss_nltk.plot(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # ko.readability('biline')\n",
    "# ss_nltk.similar('삼성전자',num=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### **03 nltk.vocab() 객체 활용하기**\n",
    "Token 객체들 다루기"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # Token의 출현빈도 상위객체 출력\n",
    "# # ko.tokens(['초등학교', '저학년'])\n",
    "# ss_nltk.vocab().most_common(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "# list(ss_nltk.vocab().keys())[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "# list(ss_nltk.vocab().values())[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ss_nltk.vocab().freq('삼성전자')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}