{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **Understanding Sentence Structure**\n", "## **1 Syntactic Parsing**\n", "- **[Stanford University parser module](https://github.com/smilli/py-corenlp)**\n", "- ! pip install **pycorenlp**\n", "- Uses NLTK's **CFG parsing** module by default\n", "- The **pycorenlp (Stanford CoreNLP)** module raises a Java connection error" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "--------Parsing result as per defined grammar-------\n", "(S (NP I) (VP (V shot) (NP (Det an) (N elephant))))\n" ] } ], "source": [ "import nltk\n", "from nltk import CFG\n", "from nltk.tree import Tree\n", "\n", "# Part 1: Print the parse produced by NLTK's built-in CFG chart parser\n", "def define_grammar_parse_result():\n", "    Grammar = nltk.CFG.fromstring(\"\"\"\n", "    S -> NP VP\n", "    PP -> P NP\n", "    NP -> Det N | Det N PP | 'I'\n", "    VP -> V NP | VP PP\n", "    Det -> 'an' | 'my'\n", "    N -> 'elephant' | 'pajamas'\n", "    V -> 'shot'\n", "    P -> 'in'\n", "    \"\"\")\n", "    sent = \"I shot an elephant\".split()\n", "    parser = nltk.ChartParser(Grammar)\n", "    trees = parser.parse(sent)\n", "    for tree in trees:\n", "        print(tree)\n", "\n", "print(\"\\n--------Parsing result as per defined grammar-------\")\n", "define_grammar_parse_result()" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "--------Drawing Parse Tree-------\n", "(s (dp (d the) (np dog)) (vp (v chased) (dp (d the) (np cat))))\n", "\\Tree [.s\n", " [.dp [.d the ] [.np dog ] ]\n", " [.vp [.v chased ] [.dp [.d the ] [.np cat ] ] ] ]\n", " s \n", " ________|_____ \n", " | vp \n", " | _____|___ \n", " dp | dp \n", " ___|___ | ___|___ \n", " d np v d np\n", " | | | | | \n", "the dog chased the cat\n", "\n" ] } ], "source": [ "# Part 2: Drawing a parse tree with the Tree class\n", "def draw_parser_tree():\n", "    dp1 = Tree('dp', [Tree('d', ['the']), Tree('np', ['dog'])])\n", "    dp2 = Tree('dp', [Tree('d', ['the']), Tree('np', ['cat'])])\n", "    vp = Tree('vp', [Tree('v', ['chased']), dp2])\n", "    tree = Tree('s', [dp1, vp])\n", "    print(tree)\n", "    print(tree.pformat_latex_qtree())\n", "    tree.pretty_print()\n", "\n", "print(\"\\n--------Drawing Parse Tree-------\")\n", "draw_parser_tree()" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# from pycorenlp import StanfordCoreNLP\n", "# from collections import defaultdict\n", "\n", "# Part 3: Visualization using \"pycorenlp\"\n", "# After installing pycorenlp, stanford-corenlp-full-* must also be downloaded from the Stanford CoreNLP website\n", "# def stanford_parsing_result():\n", "#     text = \"\"\" I shot an elephant. The dog chased the cat. School go to boy. \"\"\"\n", "#     nlp = StanfordCoreNLP('http://localhost:9000')\n", "#     res = nlp.annotate(text, properties={\n", "#         'annotators': 'tokenize,ssplit,pos,depparse,parse',\n", "#         'outputFormat': 'json'\n", "#     })\n", "#     print(res['sentences'][0]['parse'])\n", "#     print(res['sentences'][2]['parse'])\n", "\n", "# Raises an error: Exception: Check whether you have started the CoreNLP server e.g. ...\n", "# print(\"\\n--------Stanford Parser result------\")\n", "# stanford_parsing_result()" ] },
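{ "cell_type": "markdown", "metadata": {}, "source": [ "The grammar above also covers the classic ambiguous sentence *I shot an elephant in my pajamas*. The cell below is an added sketch (it is not part of the original notebook, and the variable names are only illustrative): it reuses the same CFG and lists the noun phrases of every parse the chart parser finds, which makes the PP-attachment ambiguity visible." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Added sketch: enumerate parses of an ambiguous sentence and list their NPs\n", "import nltk\n", "\n", "grammar = nltk.CFG.fromstring(\"\"\"\n", "    S -> NP VP\n", "    PP -> P NP\n", "    NP -> Det N | Det N PP | 'I'\n", "    VP -> V NP | VP PP\n", "    Det -> 'an' | 'my'\n", "    N -> 'elephant' | 'pajamas'\n", "    V -> 'shot'\n", "    P -> 'in'\n", "\"\"\")\n", "parser = nltk.ChartParser(grammar)\n", "sent = \"I shot an elephant in my pajamas\".split()\n", "for tree in parser.parse(sent):\n", "    # subtrees() walks every node; the filter keeps only noun phrases\n", "    nps = [' '.join(sub.leaves()) for sub in tree.subtrees(lambda t: t.label() == 'NP')]\n", "    print(nps)" ] },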
{ "cell_type": "markdown", "metadata": {}, "source": [ "## **2 spaCy NER Example**\n", "Implementing NER (named-entity recognition) with spaCy\n", "- **[spaCy](https://spacy.io/usage/models)** also ships a Korean (ko) model, but documentation and examples for it are still scarce.\n", "- It distinguishes classes such as **PERSON** (people), **NORP** (nationalities, religious or political groups), **FACILITY** (buildings, airports, etc.), **ORG** (companies, institutions), **GPE** (cities, countries), **LOC** (non-GPE locations such as mountain ranges and bodies of water), **PRODUCT** (objects, vehicles, foods), **EVENT** (named events), **WORK_OF_ART** (books, songs, etc.), **LANGUAGE** (named languages), plus **dates, times, percentages, monetary values, quantities, ordinals and cardinals**.\n", "1. ! pip install spacy\n", "1. ! python -m spacy download en # [how to install the model](https://stackoverflow.com/questions/49964028/spacy-oserror-cant-find-model-en)" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-------Example 1 ------\n", "GPE : London\n", "GPE : the United Kingdom\n" ] } ], "source": [ "import spacy\n", "nlp = spacy.load('en')\n", "doc = nlp(u'London is a big city in the United Kingdom.')\n", "print(\"-------Example 1 ------\")\n", "for ent in doc.ents:\n", "    print(\"{} : {}\".format(ent.label_, ent.text))" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-------Example 2 ------\n", "GPE : France\n", "PERSON : Christine Lagarde\n", "TIME : 5:00 P.M.\n", "ORG : the Wall Street Journal\n" ] } ], "source": [ "doc1 = nlp(u'While in France, Christine Lagarde discussed short-term stimulus efforts in a '\n", "           u'recent interview on 5:00 P.M. with the Wall Street Journal')\n", "print(\"-------Example 2 ------\")\n", "for ent in doc1.ents:\n", "    print(\"{} : {}\".format(ent.label_, ent.text))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "# **Bag of Words**\n", "Building a **Bag of Words** with **sklearn**" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1, 1, 1, 0, 1, 1, 1, 0],\n", " [1, 1, 0, 1, 1, 1, 0, 1]])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "# character bigrams (including word boundaries) as features\n", "ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)\n", "counts = ngram_vectorizer.fit_transform(['words', 'wprds'])\n", "# sanity check of the learned bigram vocabulary\n", "ngram_vectorizer.get_feature_names() == [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp']\n", "counts.toarray().astype(int)" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CountVectorizer(analyzer='char_wb', binary=False, decode_error='strict',\n", " dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n", " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", " ngram_range=(2, 2), preprocessor=None, stop_words=None,\n", " strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n", " tokenizer=None, vocabulary=None)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ngram_vectorizer" ] },
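{ "cell_type": "markdown", "metadata": {}, "source": [ "For comparison with the character-bigram features above, the cell below is an added sketch (not part of the original notebook; the two sample sentences are made up) showing the more common word-level bag of words with CountVectorizer's default `analyzer='word'`." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Added sketch: word-level bag of words with CountVectorizer's default analyzer\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "docs = ['the dog chased the cat', 'the cat chased the mouse']  # illustrative sentences\n", "vectorizer = CountVectorizer()         # analyzer='word' is the default\n", "bow = vectorizer.fit_transform(docs)   # sparse document-term count matrix\n", "print(sorted(vectorizer.vocabulary_))  # vocabulary learned from the two sentences\n", "print(bow.toarray())                   # one row per document, one column per word" ] },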
{ "cell_type": "markdown", "metadata": {}, "source": [ "# **Measuring Tf-IDF**\n", "## **1 Tf-IDF with textblob**\n", "! pip install textblob" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[TextBlob(\"tf idf, short form of term frequency, inverse document frequency\"),\n", " TextBlob(\"is a numerical statistic that is intended to reflect how important\"),\n", " TextBlob(\"a word is to a document in a collection or corpus\")]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = 'tf idf, short form of term frequency, inverse document frequency'\n", "text2 = 'is a numerical statistic that is intended to reflect how important'\n", "text3 = 'a word is to a document in a collection or corpus'\n", "\n", "from textblob import TextBlob\n", "blob = TextBlob(text)\n", "blob2 = TextBlob(text2)\n", "blob3 = TextBlob(text3)\n", "bloblist = [blob, blob2, blob3]\n", "bloblist" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tf : 0.1\n", "idf : 0.40546\n", "tf*idf : 0.04054\n" ] } ], "source": [ "import math\n", "def tf(word, blob):\n", "    return blob.words.count(word) / len(blob.words)\n", "\n", "def idf(word, bloblist):\n", "    # number of documents containing the word, with add-1 smoothing so it is never 0\n", "    x = 1 + sum(1 for blob in bloblist if word in blob)\n", "    return math.log(len(bloblist) / (x if x else 1))\n", "\n", "def tfidf(word, blob, bloblist):\n", "    return tf(word, blob) * idf(word, bloblist)\n", "\n", "tf_score = tf('short', blob)\n", "idf_score = idf('short', bloblist)\n", "tfidf_score = tfidf('short', blob, bloblist)\n", "print(\"\"\"tf : {:.7}\\nidf : {:.7}\\ntf*idf : {:.7}\"\"\".format(\n", "    str(tf_score), str(idf_score), str(tfidf_score)))" ] },
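{ "cell_type": "markdown", "metadata": {}, "source": [ "In formula form, the cell above computes (the symbols $N$, $n_w$ and $|d|$ are just shorthand added here for quantities in the code):\n", "\n", "$$\\mathrm{tf}(w, d) = \\frac{\\text{count}(w, d)}{|d|}, \\qquad \\mathrm{idf}(w) = \\log\\frac{N}{1 + n_w}, \\qquad \\mathrm{tfidf}(w, d) = \\mathrm{tf}(w, d)\\,\\mathrm{idf}(w)$$\n", "\n", "where $N$ is the number of documents (blobs), $n_w$ is the number of documents containing $w$, $|d|$ is the number of words in document $d$, and the $1+$ term is the add-1 smoothing in `idf`. For `'short'`: $\\mathrm{tf} = 1/10 = 0.1$ and $\\mathrm{idf} = \\log(3/2) \\approx 0.4055$, so $\\mathrm{tfidf} \\approx 0.0405$, matching the printed output." ] },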
{ "cell_type": "markdown", "metadata": {}, "source": [ "## **2 Tf-IDF with scikit-learn**\n", "1. ! pip install scikit-learn\n", "1. https://stackoverflow.com/questions/23175809/str-translate-gives-typeerror-translate-takes-one-argument-2-given-worked\n", "    - text.**translate**(**str**.maketrans('', '', string.punctuation)) strips punctuation in Python 3" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('and', 48), ('the', 33), ('to', 29), ('i', 26), ('of', 25)]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load a single document and build its tokens\n", "import nltk, string, os\n", "from collections import Counter\n", "\n", "# Preprocessing: lowercasing, punctuation removal, etc.\n", "def get_tokens(file):\n", "    with open(file, 'r') as shakes:\n", "        text = shakes.read()\n", "\n", "    lowers = text.lower()  # lowercase\n", "    no_punctuation = lowers.translate(str.maketrans('', '', string.punctuation))  # remove punctuation\n", "    tokens = no_punctuation.split(' ')  # split the text into tokens\n", "    return tokens\n", "\n", "tokens = get_tokens('./data/shakes1.txt')\n", "count = Counter(tokens)\n", "count.most_common(5)" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('thy', 11), ('go', 7), ('love', 7), ('would', 5), ('thou', 5)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Token normalization using NLTK's stopword list\n", "from nltk.corpus import stopwords\n", "\n", "filtered = [w for w in tokens if w not in stopwords.words('english')]\n", "count = Counter(filtered)\n", "count.most_common(5)" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('thi', 11), ('go', 7), ('love', 7), ('would', 5), ('natur', 5)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Stemming with the PorterStemmer\n", "from nltk.stem.porter import PorterStemmer\n", "\n", "def stem_tokens(tokens, stemmer):\n", "    stemmed = []\n", "    for item in tokens:\n", "        stemmed.append(stemmer.stem(item))\n", "    return stemmed\n", "\n", "stemmer = PorterStemmer()\n", "stemmed = stem_tokens(filtered, stemmer)\n", "count = Counter(stemmed)\n", "count.most_common(5)" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "from glob import glob\n", "fileList = glob(\"./data/shake*.txt\")\n", "token_dict = {}\n", "stemmer = PorterStemmer()\n", "\n", "def stem_tokens(tokens, stemmer):\n", "    stemmed = []\n", "    for item in tokens:\n", "        stemmed.append(stemmer.stem(item))\n", "    return stemmed\n", "\n", "def tokenize(text):\n", "    tokens = nltk.word_tokenize(text)\n", "    stems = stem_tokens(tokens, stemmer)\n", "    return stems\n", "\n", "for file in fileList:\n", "    shakes = open(file, 'r')\n", "    text = shakes.read()\n", "    lowers = text.lower()\n", "    no_punctuation = lowers.translate(str.maketrans('', '', string.punctuation))  # remove punctuation\n", "    token_dict[file] = no_punctuation\n", "# token_dict" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/markbaum/Python/nltk/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'formerli', 'forti', 'ha', 'henc', 'hereaft', 'herebi', 'hi', 'howev', 'hundr', 'inde', 'latterli', 'mani', 'meanwhil', 'moreov', 'mostli', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'seriou', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'thi', 'thu', 'togeth', 'twelv', 'twenti', 'veri', 'wa', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'] not in stop_words.\n", "  'stop_words.' % sorted(inconsistent))\n" ] }, { "data": { "text/plain": [ "array([0.34618161, 0.66338461, 0.66338461])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this can take some time\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')\n", "tfs = tfidf.fit_transform(token_dict.values())\n", "sample_str = 'this sentence has unseen text such as computer but also king lord juliet'\n", "response = tfidf.transform([sample_str])\n", "response.data" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "thi - 0.34618161159873423\n", "lord - 0.6633846138519129\n", "king - 0.6633846138519129\n" ] } ], "source": [ "# map the nonzero columns of the transformed sample back to feature names\n", "feature_names = tfidf.get_feature_names()\n", "for col in response.nonzero()[1]:\n", "    print(feature_names[col], ' - ', response[0, col])" ] },
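{ "cell_type": "markdown", "metadata": {}, "source": [ "The fitted matrix `tfs` can be inspected the same way as `response`. The cell below is an added sketch (not part of the original notebook; it reuses `tfs` and `feature_names` from the cells above, and its output depends on the local `./data/shake*.txt` files, so none is shown): it lists the highest-weighted terms of the first Shakespeare document." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Added sketch: top tf-idf terms of the first fitted document\n", "import numpy as np\n", "\n", "row = tfs[0].toarray().ravel()   # dense tf-idf weights of document 0\n", "top = np.argsort(row)[::-1][:5]  # column indices of the 5 largest weights\n", "for col in top:\n", "    print(feature_names[col], '-', row[col])  # term and its tf-idf weight" ] },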
{ "cell_type": "markdown", "metadata": {}, "source": [ "# **Encoding and Normalization**\n", "## **1 One-Hot Encoding**\n", "Encoding categorical columns with **pandas** and **scikit-learn**" ] },
{ "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " name age-group\n", "0 rick young\n", "1 phil old" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from sklearn.feature_extraction import DictVectorizer\n", "df = pd.DataFrame([['rick','young'],['phil','old']], columns=['name','age-group'])\n", "df" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " name_phil name_rick age-group_old age-group_young\n", "0 0 1 0 1\n", "1 1 0 1 0" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Encoding with the pandas module\n", "pd.get_dummies(df)" ] },
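{ "cell_type": "markdown", "metadata": {}, "source": [ "`get_dummies` can also be restricted to specific columns. The cell below is an added sketch (not part of the original notebook) that reuses the `df` defined above and one-hot encodes only `age-group`, leaving `name` untouched." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Added sketch: encode only the selected column(s); 'name' stays as a plain column\n", "pd.get_dummies(df, columns=['age-group'])" ] },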
{ "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "({'country=US': 2, 'country=CAN': 0, 'country=MEX': 1},\n", " '\\n',\n", " array([[0., 0., 1.],\n", " [1., 0., 0.],\n", " [0., 0., 1.],\n", " [1., 0., 0.],\n", " [0., 1., 0.],\n", " [0., 0., 1.]]))" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# One-hot encoding of a categorical column with scikit-learn's DictVectorizer\n", "X = pd.DataFrame({'income': [100000,110000,90000,30000,14000,50000],\n", "                  'country':['US', 'CAN', 'US', 'CAN', 'MEX', 'US'],\n", "                  'race':['White', 'Black', 'Latino', 'White', 'White', 'Black']})\n", "\n", "v = DictVectorizer()\n", "qualitative_features = ['country']\n", "X_qual = v.fit_transform(X[qualitative_features].to_dict('records'))\n", "v.vocabulary_, \"\\n\", X_qual.toarray()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## **2 Concept Text Normalization**\n", "**Concept text normalization**: converting text into a single standard form\n", "1. **Stemming** is itself one such normalization method\n", "1. **Min-max scaling** is a useful normalization when building a **language model** for NLP analysis\n", "\n", "## Indexing: use categorical data as index numbers\n", "## Ranking: reorder the data so the most relevant items come first and the least relevant last" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 4 }