{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **Append**\n",
"통계적 문법접근방법 빠르고 간단하게 살펴보기"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## **1 Parsing Tree**\n",
"Building grammatical structure from POS tags"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# %%time\n",
"# text = '여배우 박민영은 높은 싱크로를 보여줬다'\n",
"\n",
"# from konlpy.tag import Twitter\n",
"# twitter = Twitter()\n",
"# words = twitter.pos(text, stem=True)\n",
"# print(words)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# from nltk import RegexpParser\n",
"\n",
"# grammar = \"\"\"\n",
"# NP: {*?} # 명사구를 정의한다\n",
"# VP: {*} # 동사구를 정의한다\n",
"# AP: {*} # 형용사구를 정의한다 \"\"\"\n",
"# parser = RegexpParser(grammar)\n",
"# parser"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# chunks = parser.parse(words)\n",
"# chunks"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# text_tree = [list(txt) for txt in chunks.subtrees()]\n",
"# from pprint import pprint\n",
"# pprint(text_tree[1:])"
]
},
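{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cells above are commented out, presumably because they depend on KoNLPy and a Java runtime. As a rough, self-contained sketch of the same chunking idea, the next cell applies RegexpParser to an English sentence tagged with NLTK's built-in tagger; the example sentence and the NP pattern are illustrative only, and the punkt and averaged_perceptron_tagger data packages are fetched if missing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal English stand-in for the KoNLPy chunking example above\n",
"import nltk\n",
"from nltk import RegexpParser\n",
"\n",
"nltk.download('punkt', quiet=True)                        # tokenizer model\n",
"nltk.download('averaged_perceptron_tagger', quiet=True)   # POS tagger model\n",
"\n",
"tagged = nltk.pos_tag(nltk.word_tokenize('the little dog barked at the cat'))\n",
"grammar_en = 'NP: {<DT>?<JJ>*<NN.*>+}   # a simple noun-phrase pattern'\n",
"print(RegexpParser(grammar_en).parse(tagged))"
]
},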
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## **2 A Taste of CFG Parsing**\n",
"Working with example grammatical structures defined by fixed production rules"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Grammar with 23 productions (start state = S)\n",
" S -> NP VP [1.0]\n",
" VP -> V NP [0.59]\n",
" VP -> V [0.4]\n",
" VP -> VP PP [0.01]\n",
" NP -> Det N [0.41]\n",
" NP -> Name [0.28]\n",
" NP -> NP PP [0.31]\n",
" PP -> P NP [1.0]\n",
" V -> 'saw' [0.21]\n",
" V -> 'ate' [0.51]\n",
" V -> 'ran' [0.28]\n",
" N -> 'boy' [0.11]\n",
" N -> 'cookie' [0.12]\n",
" N -> 'table' [0.13]\n",
" N -> 'telescope' [0.14]\n",
" N -> 'hill' [0.5]\n",
" Name -> 'Jack' [0.52]\n",
" Name -> 'Bob' [0.48]\n",
" P -> 'with' [0.61]\n",
" P -> 'under' [0.39]\n",
" Det -> 'the' [0.41]\n",
" Det -> 'a' [0.31]\n",
" Det -> 'my' [0.28]\n"
]
}
],
"source": [
"from nltk.grammar import toy_pcfg2\n",
"grammar = toy_pcfg2\n",
"print(grammar)"
]
},
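{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the PCFG in action, the next cell is a small illustrative addition (not part of the original toy_pcfg2 demo): it parses a sentence covered by the toy grammar with nltk's ViterbiParser, which returns the most probable tree."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Most-probable parse of a sentence covered by toy_pcfg2\n",
"from nltk.grammar import toy_pcfg2\n",
"from nltk.parse import ViterbiParser\n",
"\n",
"viterbi = ViterbiParser(toy_pcfg2)\n",
"for tree in viterbi.parse('Jack saw the boy with my telescope'.split()):\n",
"    print(tree)"
]
},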
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# # Early Chart 분석방법 맛보기\n",
"# import nltk\n",
"# nltk.parse.featurechart.demo( print_times = False, \n",
"# print_grammar = True, \n",
"# parser = nltk.parse.featurechart.FeatureChartParser, \n",
"# sent = 'I saw a dog' )"
]
},
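{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the featurechart demo above is commented out, the next cell sketches Earley-style chart parsing in a self-contained way; the tiny CFG below is hypothetical and exists only for this illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Earley chart parsing over a tiny hand-written CFG\n",
"import nltk\n",
"from nltk.parse import EarleyChartParser\n",
"\n",
"cfg = nltk.CFG.fromstring('''\n",
"S  -> NP VP\n",
"NP -> Det N | 'I'\n",
"VP -> V NP\n",
"Det -> 'a'\n",
"N  -> 'dog'\n",
"V  -> 'saw'\n",
"''')\n",
"\n",
"for tree in EarleyChartParser(cfg).parse('I saw a dog'.split()):\n",
"    print(tree)"
]
},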
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## **3 Noun/Verb/Adjective Sense Analysis with WordNet**\n",
"A look at **Synset** contents (a Synset groups the WordNet nodes that share one sense)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Synset('dog.n.01'),\n",
" Synset('frump.n.01'),\n",
" Synset('dog.n.03'),\n",
" Synset('cad.n.01'),\n",
" Synset('frank.n.02'),\n",
" Synset('pawl.n.01'),\n",
" Synset('andiron.n.01'),\n",
" Synset('chase.v.01')]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from nltk.corpus import wordnet as wn\n",
"wn.synsets('dog')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['she got a reputation as a frump', \"she's a real dog\"]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synset('frump.n.01').examples()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'a dull unattractive unpleasant girl or woman'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wn.synset('frump.n.01').definition()"
]
},
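{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beyond examples and definitions, synsets expose semantic relations. The next cell is a small illustrative addition that prints the hypernyms of dog.n.01 and its path similarity to cat.n.01."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Semantic relations between synsets\n",
"from nltk.corpus import wordnet as wn\n",
"\n",
"dog = wn.synset('dog.n.01')\n",
"cat = wn.synset('cat.n.01')\n",
"print(dog.hypernyms())            # more general concepts\n",
"print(dog.path_similarity(cat))   # similarity from hypernym-path distance"
]
},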
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## **4 Using Synsets**\n",
"Disambiguating word senses with WordNet"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: pywsd in /home/markbaum/Python/nltk/lib/python3.6/site-packages (1.1.7)\n",
"Requirement already satisfied: pandas in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pywsd) (0.23.3)\n",
"Requirement already satisfied: nltk in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pywsd) (3.3)\n",
"Requirement already satisfied: numpy in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pywsd) (1.14.5)\n",
"Requirement already satisfied: pytz>=2011k in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pandas->pywsd) (2018.5)\n",
"Requirement already satisfied: python-dateutil>=2.5.0 in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pandas->pywsd) (2.7.3)\n",
"Requirement already satisfied: six in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from nltk->pywsd) (1.11.0)\n"
]
}
],
"source": [
"# NLTK 기본 모듈에 포함된 wordnet DB를 활용\n",
"! pip3 install pywsd"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Warming up PyWSD (takes ~10 secs)... took 2.9580390453338623 secs.\n"
]
},
{
"data": {
"text/plain": [
"Synset('frump.n.01')"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sent = 'He act like a real dog'\n",
"ambiguous = 'dog'\n",
"\n",
"from pywsd.lesk import simple_lesk\n",
"answer = simple_lesk(sent, ambiguous)\n",
"answer"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'a dull unattractive unpleasant girl or woman'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer.definition()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Synset('cad.n.01')"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sent = 'He looks like dirty dog'\n",
"ambiguous = 'dog'\n",
"answer = simple_lesk(sent, ambiguous)\n",
"answer"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'someone who is morally reprehensible'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer.definition()"
]
},
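{
"cell_type": "markdown",
"metadata": {},
"source": [
"If installing pywsd is not an option, NLTK also ships its own Lesk implementation in nltk.wsd. The sketch below applies it to the same sentence; the result may differ from simple_lesk because the two implementations weigh gloss overlaps differently."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Word-sense disambiguation with NLTK's built-in Lesk algorithm\n",
"from nltk.wsd import lesk\n",
"\n",
"sense = lesk('He looks like dirty dog'.split(), 'dog')\n",
"if sense is not None:\n",
"    print(sense, '-', sense.definition())"
]
},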
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## **5 Working with the nltk.Text Object**\n",
"Use the nltk.Text object to handle these tasks more efficiently"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **01 nltk 객체 정의하기**\n",
"Token List 객체를 생성한 뒤, 이를 활용하여 nltk 객체를 만든다"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# # 삼성전자 지속가능경영 보고서\n",
"# skipword = ['갤러시', '가치창출']\n",
"\n",
"# from txtutil import txtnoun\n",
"# from nltk.tokenize import word_tokenize\n",
"# texts = txtnoun(\"../data/kr-Report_2018.txt\", skip=skipword)\n",
"# tokens = word_tokenize(texts)\n",
"# tokens[:5]"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# # nltk Token 객체를 활용한 다양한 메소드를 제동\n",
"# import nltk\n",
"# ss_nltk = nltk.Text(tokens, name='2018지속성장')\n",
"# ss_nltk"
]
},
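{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two cells above depend on a local txtutil module and the ../data/kr-Report_2018.txt file, which is likely why they are commented out. As a self-contained stand-in, the sketch below builds an nltk.Text object from a short made-up English paragraph; the methods used in the following cells (count, concordance, vocab, and so on) work on it in the same way."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A hypothetical stand-in corpus, used for demonstration only\n",
"import nltk\n",
"nltk.download('punkt', quiet=True)   # tokenizer model, fetched once\n",
"\n",
"sample = ('The report describes responsible management. '\n",
"          'Responsible management supports sustainable growth, '\n",
"          'and sustainable growth is covered in the report.')\n",
"\n",
"text_demo = nltk.Text(nltk.word_tokenize(sample), name='demo')\n",
"print(text_demo.count('report'))\n",
"text_demo.concordance('management', lines=2)"
]
},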
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **02 nltk 객체 활용하기**\n",
"내부 메서드를 활용한다"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# # 객체의 이름을 출력\n",
"# ss_nltk.name"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# # Token 과 연어관계에 있는 단어목록\n",
"# ss_nltk.collocations(num=30, window_size=2)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# # Token의 주변에 등장하는 단어들\n",
"# ss_nltk.common_contexts(['책임경영'])"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# # 인접하여 위치하는 Token 을 출력\n",
"# ss_nltk.concordance('책임경영', lines=2)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# ss_nltk.concordance_list('책임경영')[1]"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# # Token 의 빈도값 출력\n",
"# ss_nltk.count('책임경영')"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# %matplotlib inline\n",
"# from matplotlib import rc\n",
"# rc('font', family=['NanumGothic','Malgun Gothic'])\n",
"\n",
"# # 해당 단어별 출현빈도 비교출력\n",
"# ss_nltk.dispersion_plot(['책임경영', '경영진', '갤럭시', '갤러시', '업사이클링'])"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# # 객체의 빈도를 Matplot linechart 로 출력\n",
"# ss_nltk.plot(10)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# # ko.readability('biline')\n",
"# ss_nltk.similar('삼성전자',num=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### **03 nltk.vocab() 객체 활용하기**\n",
"Token 객체들 다루기"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# # Token의 출현빈도 상위객체 출력\n",
"# # ko.tokens(['초등학교', '저학년'])\n",
"# ss_nltk.vocab().most_common(10)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# list(ss_nltk.vocab().keys())[:5]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# list(ss_nltk.vocab().values())[:5]"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"# ss_nltk.vocab().freq('삼성전자')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}