{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **Append**\n", "통계적 문법접근방법 빠르고 간단하게 살펴보기" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "## **1 Parsing Tree**\n", "문법태그를 활용한 문법구조 생성하기" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# %%time\n", "# text = '여배우 박민영은 높은 싱크로를 보여줬다'\n", "\n", "# from konlpy.tag import Twitter\n", "# twitter = Twitter()\n", "# words = twitter.pos(text, stem=True)\n", "# print(words)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# from nltk import RegexpParser\n", "\n", "# grammar = \"\"\"\n", "# NP: {*?} # 명사구를 정의한다\n", "# VP: {*} # 동사구를 정의한다\n", "# AP: {*} # 형용사구를 정의한다 \"\"\"\n", "# parser = RegexpParser(grammar)\n", "# parser" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# chunks = parser.parse(words)\n", "# chunks" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# text_tree = [list(txt) for txt in chunks.subtrees()]\n", "# from pprint import pprint\n", "# pprint(text_tree[1:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "## **2. CFG 분석방법 맛보기**\n", "정해진 유형에 따른 문법적 구조예제 활용하기" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Grammar with 23 productions (start state = S)\n", " S -> NP VP [1.0]\n", " VP -> V NP [0.59]\n", " VP -> V [0.4]\n", " VP -> VP PP [0.01]\n", " NP -> Det N [0.41]\n", " NP -> Name [0.28]\n", " NP -> NP PP [0.31]\n", " PP -> P NP [1.0]\n", " V -> 'saw' [0.21]\n", " V -> 'ate' [0.51]\n", " V -> 'ran' [0.28]\n", " N -> 'boy' [0.11]\n", " N -> 'cookie' [0.12]\n", " N -> 'table' [0.13]\n", " N -> 'telescope' [0.14]\n", " N -> 'hill' [0.5]\n", " Name -> 'Jack' [0.52]\n", " Name -> 'Bob' [0.48]\n", " P -> 'with' [0.61]\n", " P -> 'under' [0.39]\n", " Det -> 'the' [0.41]\n", " Det -> 'a' [0.31]\n", " Det -> 'my' [0.28]\n" ] } ], "source": [ "from nltk.grammar import toy_pcfg2\n", "grammar = toy_pcfg2\n", "print(grammar)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# # Early Chart 분석방법 맛보기\n", "# import nltk\n", "# nltk.parse.featurechart.demo( print_times = False, \n", "# print_grammar = True, \n", "# parser = nltk.parse.featurechart.FeatureChartParser, \n", "# sent = 'I saw a dog' )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "## **3 Word Net을 활용한 명사/동사/형용사 의미분석**\n", "**SynSet** 내용 살펴보기 ( Word Net 에 포함된 같은단어 Node 모음)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Synset('dog.n.01'),\n", " Synset('frump.n.01'),\n", " Synset('dog.n.03'),\n", " Synset('cad.n.01'),\n", " Synset('frank.n.02'),\n", " Synset('pawl.n.01'),\n", " Synset('andiron.n.01'),\n", " Synset('chase.v.01')]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.corpus import wordnet as wn\n", "wn.synsets('dog')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['she got a reputation as a frump', \"she's a real dog\"]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wn.synset('frump.n.01').examples()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'a dull unattractive unpleasant girl or woman'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wn.synset('frump.n.01').definition()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "## **3 SynSet 활용하기**\n", "Wordnet을 활용하여 단의 의미 구분하기" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pywsd in /home/markbaum/Python/nltk/lib/python3.6/site-packages (1.1.7)\n", "Requirement already satisfied: pandas in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pywsd) (0.23.3)\n", "Requirement already satisfied: nltk in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pywsd) (3.3)\n", "Requirement already satisfied: numpy in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pywsd) (1.14.5)\n", "Requirement already satisfied: pytz>=2011k in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pandas->pywsd) (2018.5)\n", "Requirement already satisfied: python-dateutil>=2.5.0 in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from pandas->pywsd) (2.7.3)\n", "Requirement already satisfied: six in /home/markbaum/Python/nltk/lib/python3.6/site-packages (from nltk->pywsd) (1.11.0)\n" ] } ], "source": [ "# NLTK 기본 모듈에 포함된 wordnet DB를 활용\n", "! pip3 install pywsd" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warming up PyWSD (takes ~10 secs)... took 2.9580390453338623 secs.\n" ] }, { "data": { "text/plain": [ "Synset('frump.n.01')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sent = 'He act like a real dog'\n", "ambiguous = 'dog'\n", "\n", "from pywsd.lesk import simple_lesk\n", "answer = simple_lesk(sent, ambiguous)\n", "answer" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'a dull unattractive unpleasant girl or woman'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "answer.definition()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Synset('cad.n.01')" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sent = 'He looks like dirty dog'\n", "ambiguous = 'dog'\n", "answer = simple_lesk(sent, ambiguous)\n", "answer" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'someone who is morally reprehensible'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "answer.definition()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "## **4 NLTK 객체 활용**\n", "nltk 객체를 활용하여 작업을 효율적으로 활용한다" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **01 nltk 객체 정의하기**\n", "Token List 객체를 생성한 뒤, 이를 활용하여 nltk 객체를 만든다" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# # 삼성전자 지속가능경영 보고서\n", "# skipword = ['갤러시', '가치창출']\n", "\n", "# from txtutil import txtnoun\n", "# from nltk.tokenize import word_tokenize\n", "# texts = txtnoun(\"../data/kr-Report_2018.txt\", skip=skipword)\n", "# tokens = word_tokenize(texts)\n", "# tokens[:5]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# # nltk Token 객체를 활용한 다양한 메소드를 제동\n", "# import nltk\n", "# ss_nltk = nltk.Text(tokens, name='2018지속성장')\n", "# ss_nltk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **02 nltk 객체 활용하기**\n", "내부 메서드를 활용한다" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# # 객체의 이름을 출력\n", "# ss_nltk.name" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# # Token 과 연어관계에 있는 단어목록\n", "# ss_nltk.collocations(num=30, window_size=2)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# # Token의 주변에 등장하는 단어들\n", "# ss_nltk.common_contexts(['책임경영'])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# # 인접하여 위치하는 Token 을 출력\n", "# ss_nltk.concordance('책임경영', lines=2)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# ss_nltk.concordance_list('책임경영')[1]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# # Token 의 빈도값 출력\n", "# ss_nltk.count('책임경영')" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# %matplotlib inline\n", "# from matplotlib import rc\n", "# rc('font', family=['NanumGothic','Malgun Gothic'])\n", "\n", "# # 해당 단어별 출현빈도 비교출력\n", "# ss_nltk.dispersion_plot(['책임경영', '경영진', '갤럭시', '갤러시', '업사이클링'])" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# # 객체의 빈도를 Matplot linechart 로 출력\n", "# ss_nltk.plot(10)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# # ko.readability('biline')\n", "# ss_nltk.similar('삼성전자',num=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **03 nltk.vocab() 객체 활용하기**\n", "Token 객체들 다루기" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# # Token의 출현빈도 상위객체 출력\n", "# # ko.tokens(['초등학교', '저학년'])\n", "# ss_nltk.vocab().most_common(10)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# list(ss_nltk.vocab().keys())[:5]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# list(ss_nltk.vocab().values())[:5]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# ss_nltk.vocab().freq('삼성전자')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }