{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **NLP (Natural Language Processing) Analysis**\n", "**Analyzing the 20 Newsgroups dataset**\n", "1. At the high level, NLP organizes the knowledge and concepts behind words; an **ontology** is one example\n", "1. At the lower level there are tasks such as **tagging**, e.g. **POS (part-of-speech)** tagging\n", "1. Python libraries for NLP include **NLTK, Gensim, and TextBlob**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## **1. NLP concepts**\n", "\n", "**Natural language tasks in NLTK**\n", "1. **Tokenization** : splits text into pieces (tokens), typically on whitespace (document --> tokens, n-grams)\n", "1. **POS tagging** (part-of-speech tagging) : apply a ready-made tagger from **KoNLPy or NLTK**, or call a function such as **pos_tag**\n", "1. **NER** (**named entity recognition**) : identifies named entities such as people, places, and organizations in a sentence\n", "1. **Stemming** : cuts a word back to its **stem or root**; the result is not always a valid word\n", "1. **Lemmatization** : a more careful form of stemming that always returns **a valid dictionary word**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'machin'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Stemming: reduce a word to its stem\n", "from nltk.stem.porter import PorterStemmer\n", "porter_stemmer = PorterStemmer()\n", "porter_stemmer.stem('machines')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'machine'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Lemmatization: restore the dictionary form of a token\n", "from nltk.stem import WordNetLemmatizer\n", "lemmatization = WordNetLemmatizer()\n", "lemmatization.lemmatize('machines')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Natural language tasks in Gensim**\n", "A library started in 2008 to measure similarity between documents; it supports a variety of models\n", "1. **Similarity querying** : retrieves the objects most similar to a given query object\n", "1. **Word vectorization** : represents words as vectors while preserving their co-occurrence features\n", "1. **Distributed computing** : learns efficiently from a large number of documents\n", "\n", "### **TextBlob**\n", "A library built on NLTK that adds **spelling check and correction, language detection, and translation**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "## **2. Exploring the newsgroups data**\n", "The training set holds **11,314** posts drawn from 20 **newsgroups**\n", "### **01 A first look at the newsgroup data**\n", "> **fetch_20newsgroups()**\n", "\n", "1. subset : which part of the data to load ex) **train** (default), test, all\n", "1. data_home : folder where the files are stored ex) **~/scikit_learn_data** (default)\n", "1. categories : list of categories to load ex) **None** (default, all categories)" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['alt.atheism',\n", " 'comp.graphics',\n", " 'comp.os.ms-windows.misc',\n", " 'comp.sys.ibm.pc.hardware',\n", " 'comp.sys.mac.hardware',\n", " 'comp.windows.x',\n", " 'misc.forsale',\n", " 'rec.autos',\n", " 'rec.motorcycles',\n", " 'rec.sport.baseball',\n", " 'rec.sport.hockey',\n", " 'sci.crypt',\n", " 'sci.electronics',\n", " 'sci.med',\n", " 'sci.space',\n", " 'soc.religion.christian',\n", " 'talk.politics.guns',\n", " 'talk.politics.mideast',\n", " 'talk.politics.misc',\n", " 'talk.religion.misc']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Download the news data (about 14 MB)\n", "# Newsgroups are numbered 0 to 19 (20 categories)\n", "from sklearn.datasets import fetch_20newsgroups\n", "groups = fetch_20newsgroups(data_home='data/news/')\n", "groups['target_names']" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data : \n", "filenames : \n", "target_names : \n", "target : \n", "DESCR : \n" ] }, { "data": { "text/plain": [ "dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Inspect the dict-like object returned by the loader: print each key and the type of its value\n", "for key in groups.keys():\n", "    print(key, ':', type(groups[key]))\n", "groups.keys()" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[7 4 4 ... 3 1 8]\n" ] }, { "data": { "text/plain": [ "array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n", " 17, 18, 19])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Each post's newsgroup is encoded as an integer label\n", "# np.unique lists the distinct labels without duplicates\n", "import numpy as np\n", "print(groups.target)\n", "np.unique(groups.target)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### **02 Looking at an individual post**\n", "Check the contents of post 0" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of posts in the dataset: 11,314\n", "\n", "Sample 0: \n", "From: lerxst@wam.umd.edu (where's my thing)\n", "Subject: WHAT car is this!?\n", "Nntp-Posting-Host: rac3.wam.umd.edu\n", "Organization: University of Maryland, College Park\n", "Lines: 15\n", "\n", " I was wondering if anyone out there could enlighten me on this car I saw\n", "the other day. It was a 2-door sports car, looked to be from the late 60s/\n", "early 70s. It was called a Bricklin. The doors were really small. In addition,\n", "the front bumper was separate from the rest of the body. This is \n", "all I know. If anyone can tellme a model name, engine specs, years\n", "of production, where this car is made, history, or whatever info you\n", "have on this funky looking car, please e-mail.\n", "\n", "Thanks,\n", "- IL\n", " ---- brought to you by your neighborhood Lerxst ----\n", "\n", "\n", "\n", "\n", "\n" ] } ], "source": [ "# Print the dataset size and the text of post 0\n", "print(\"Number of posts in the dataset: {:,}\\n\\nSample 0: \\n{}\".format(\n", "    len(groups.data), groups.data[0]))" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'rec.autos'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Look up the category name of post 0\n", "groups.target_names[groups.target[0]]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### **03 Visualizing the data**\n", "1. When analyzing documents, each one is treated as a **bag of words** (an unordered collection of words)\n", "1. We check how this differs from **word-level modeling**\n", "1. Plot a histogram of the posts across **all 20 categories** (the categories hold roughly similar numbers of posts)" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/markbaum/Python/python/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6521: MatplotlibDeprecationWarning: \n", "The 'normed' kwarg was deprecated in Matplotlib 2.1 and will be removed in 3.1. Use 'density' instead.\n", " alternative=\"'density'\", removal=\"3.1\")\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "# Suppress the FutureWarning raised by Seaborn internals\n", "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "\n", "# Plot the distribution of category labels (0 to 19) over all posts\n", "# The 11,314 posts are spread fairly evenly over the 20 categories\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.distplot(groups.target)\n", "plt.grid()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### **04 Counting words across all texts with a feature set**\n", "1. Use **scikit-learn's CountVectorizer** to represent each post by the counts of the **500 most frequent** tokens\n", "1. The plot above only counted **posts per category**, which is why the bars look so alike\n", "1. Let's see what the distribution looks like when the feature set is built from **words**\n", "\n", "> **CountVectorizer()** parameters\n", "\n", "1. **stop_words** : stop-word list to apply ex) **None (default), 'english', ['a', 'the', 'of']**\n", "1. **ngram_range** : lower and upper bound of the n-gram sizes to extract ex) **(1,1) (default), (1,2), (2,2)**\n", "1. **lowercase** : whether to convert text to lowercase first ex) **True (default), False**\n", "1. **max_features** : if not None, keep only this many of the most frequent tokens ex) **None (default), 500**\n", "1. **binary** : if True, every nonzero count is set to 1 ex) **False (default), True**" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['00', '000', '0d', '0t', '10', '100', '11', '12', '13', '14', '145', '15', '16', '17', '18', '19', '1993', '1d9', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '34u', '35', '40', '45', '50', '55', '80', '92', '93', '__', '___', 'a86', 'able', 'ac', 'access', 'actually', 'address', 'ago', 'agree', 'al', 'american', 'andrew', 'answer', 'anybody', 'apple', 'application', 'apr', 'april', 'area', 'argument', 'armenian', 'armenians', 'article', 'ask', 'asked', 'att', 'au', 'available', 'away', 'ax', 'b8f', 'bad', 'based', 'believe', 'berkeley', 'best', 'better', 'bible', 'big', 'bike', 'bit', 'black', 'board', 'body', 'book', 'box', 'buy', 'ca', 'california', 'called', 'came', 'canada', 'car', 'card', 'care', 'case', 'cause']\n" ] } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "import numpy as np\n", "\n", "# Build the vocabulary from the 500 most frequent tokens only\n", "cv = CountVectorizer(stop_words=\"english\", max_features=500)\n", "transformed = cv.fit_transform(groups.data)\n", "print(cv.get_feature_names()[:100]) \n", "# Prints the first 100 of the 500 tokens in alphabetical order: numbers and symbols with little discriminative power are included" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Histogram of the log total count of each of the 500 tokens extracted above\n", "sns.distplot(np.log(transformed.toarray().sum(axis=0)))\n", "plt.xlabel('Log Count')\n", "plt.ylabel('Frequency')\n", "plt.title('Distribution Plot of 500 Word Counts')\n", "plt.grid(); plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## **3. Analyzing the newsgroups data**\n", "Preprocess and then analyze the data step by step (re-selecting the text that is easiest to analyze)\n", "### **01 Preprocessing (keep only the tokens with high discriminative power)**\n", "1. Drop numbers and symbols, which have little discriminative power, and keep purely alphabetic tokens\n", "1. Apply **lemmatization** so that different forms of a word map to the same token\n", "1. Among the alphabetic tokens, remove stop words and personal names, which also discriminate poorly\n", "\n", "```python\n", "# If NLTK raises a LookupError while running, download the missing corpora\n", "import nltk\n", "nltk.download('names')\n", "nltk.download('wordnet')\n", "```" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# str.isalpha() is the alphabetic check used below\n", "'names'.isalpha()" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['able', 'accept', 'access', 'according', 'act', 'action', 'actually', 'add', 'address', 'ago']\n", "CPU times: user 12.6 s, sys: 30.8 ms, total: 12.6 s\n", "Wall time: 12.6 s\n" ] } ], "source": [ "%%time\n", "from nltk.corpus import names\n", "from nltk.stem import WordNetLemmatizer\n", "\n", "# Return True only if every character of the token is a letter\n", "def letters_only(astr):\n", "    for c in astr:\n", "        if not c.isalpha(): return False\n", "    return True\n", "\n", "# Lemmatize each remaining token, skipping non-alphabetic tokens and personal names\n", "all_names = set(names.words())\n", "lemmatizer = WordNetLemmatizer()\n", "cleaned = [' '.join([lemmatizer.lemmatize(word.lower())\n", "                      for word in post.split()\n", "                      if letters_only(word) and word not in all_names])\n", "           for post in groups.data]\n", "\n", "# Reuses cv from the earlier cell: CountVectorizer(stop_words=\"english\", max_features=500)\n", "transformed = cv.fit_transform(cleaned)\n", "print(cv.get_feature_names()[:10])\n", "len(names.words())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### **02 Clustering with K-means**\n", "1. **(transformed)** : uses the data preprocessed above and reduced to the 500 most frequent tokens\n", "1. Clustering groups the dataset into a number of clusters\n", "1. **Hard clustering** : each sample (here, a post) is assigned to **exactly one cluster** (strict); K-means works this way\n", "1. **Soft clustering** : each sample is assigned to **several clusters with different probabilities** (flexible)\n", "1. **Outlier** : a value that is not assigned to any cluster is called an **outlier or noise**\n", "\n", "> **KMeans()** parameters\n", "\n", "1. **n_clusters** : number of clusters to form ex) 8 (default)\n", "1. **max_iter** : maximum number of iterations per run ex) 300 (default)\n", "1. **n_init** : number of times the algorithm is rerun with different initial centroids ex) 10 (default)\n", "1. **tol** : convergence tolerance used as the stopping condition ex) 1e-4 (default)" ] },
{ "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 448 ms, sys: 401 ms, total: 848 ms\n", "Wall time: 1min 28s\n" ] } ], "source": [ "%%time\n", "# Cluster the posts with K-means and compare clusters against the true newsgroups\n", "from sklearn.cluster import KMeans\n", "km = KMeans(n_clusters=20, n_jobs=-1)\n", "km.fit(transformed)\n", "\n", "labels = groups.target\n", "plt.scatter(labels, km.labels_)\n", "plt.xlabel('Newsgroup'); plt.ylabel('Cluster')\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### **03 Topic modeling**\n", "1. Pick out the tokens that are **central to the topic** of each document\n", "1. Each document is described as an **additive model of topics, each with its own weight**\n", "1. **Non-negative matrix factorization (NMF)** : factorizes the document-term matrix into non-negative document-topic and topic-term matrices" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0: wa thought later took left order seen taken\n", "1: db bit data place stuff add time line\n", "2: server using display screen support code mouse application\n", "3: file section information write source change entry number\n", "4: disk drive hard controller support card board head\n", "5: entry rule program source number info email build\n", "6: new york sale change service result study early\n", "7: image software user package using display include support\n", "8: window manager application using offer user information course\n", "9: gun united control house american second national issue\n", "10: hockey league team game division player list san\n", "11: turkish government sent war study came american world\n", "12: program change technology display information version application rate\n", "13: space nasa technology service national international small communication\n", "14: government political federal sure free private local country\n", "15: output line open write read return build section\n", "16: people country doing tell live killed lot saying\n", "17: widget application 
value set type return function list\n", "18: child case rate le report area research group\n", "19: jew jewish world war history help research arab\n", "20: armenian russian muslim turkish world city road today\n", "21: president said group tax press working package job\n", "22: ground box usually power code current house white\n", "23: russian president american support food money important private\n", "24: ibm color week memory hardware monitor software standard\n", "25: anonymous posting service server user group message post\n", "26: la win san went list year radio near\n", "27: work job young school lot private create business\n", "28: encryption technology access device policy security government data\n", "29: tape driver work memory using cause note following\n", "30: war military world attack way united russian force\n", "31: god bible shall man come life hell love\n", "32: atheist religious religion belief god sort feel idea\n", "33: data available information user research set model based\n", "34: center research medical institute national study test north\n", "35: think lot try trying talk kind agree certainly\n", "36: water city division list public similar north high\n", "37: section military shall weapon person division application mean\n", "38: good cover great pretty probably bad issue life\n", "39: drive head single mode set using model type\n", "40: israeli arab attack policy true apr fact stop\n", "41: use note using usually similar available standard work\n", "42: know tell way come sure understand let saw\n", "43: car speed driver change high buy different design\n", "44: internet email address information anonymous user network mail\n", "45: like look sound long little guy pretty having\n", "46: going come way mean kind sure working got\n", "47: state united public national political federal member local\n", "48: dod bike member computer list started live email\n", "49: greek killed act word western muslim turkish talk\n", "50: computer 
information public internet list issue network communication\n", "51: law act federal specific issue clear order moral\n", "52: book read reference list copy second study offer\n", "53: argument form true evidence event truth particular known\n", "54: make sense difference little sure making end tell\n", "55: scsi hard pc drive device bus different data\n", "56: time long having able lot order light response\n", "57: gun rate crime city death study control difference\n", "58: right second free shall security mean left american\n", "59: went came said told started saw took woman\n", "60: power period second san special le play goal\n", "61: used using product way function version note single\n", "62: problem work having using help apple running error\n", "63: available version widget server includes sun set support\n", "64: question answer ask asked science reason claim post\n", "65: san information police said group league political including\n", "66: number serial large men report following million le\n", "67: year ago old best sale hit long project\n", "68: want help let life reason trying copy tell\n", "69: point way different line algorithm exactly idea view\n", "70: run running home version start hit win speed\n", "71: got shot play took goal went hit lead\n", "72: thing saw sure got trying kind seen asked\n", "73: graphic send mail message package server various computer\n", "74: university science department general computer thanks engineering texas\n", "75: just maybe start thought big probably look getting\n", "76: key message public security algorithm standard method attack\n", "77: doe mean anybody actually different ask reading difference\n", "78: game win sound play left second lead great\n", "79: ha able called taken given past exactly looking\n", "80: believe belief christian truth evidence claim mean different\n", "81: drug study information war group reason usa evidence\n", "82: need help phone able needed kind thanks bike\n", "83: did death let 
money fact man wanted body\n", "84: chip clipper serial algorithm phone communication encryption key\n", "85: card driver video support mode mouse board bus\n", "86: church christian member group true bible different view\n", "87: ftp available anonymous general nasa package source version\n", "88: better player best play probably hit maybe big\n", "89: human life person moral kill claim reason world\n", "90: bit using let change mode attack size quite\n", "91: say mean word act clear said read simply\n", "92: health medical public national care study service user\n", "93: article post usa read world discussion opinion gmt\n", "94: team player win play city look bad great\n", "95: day come word christian said tell little way\n", "96: really lot sure look fact idea actually feel\n", "97: unit disk size serial total national got return\n", "98: image color version free available display current better\n", "99: woman men muslim religion way man great world\n", "CPU times: user 52.3 s, sys: 32.2 s, total: 1min 24s\n", "Wall time: 38.7 s\n" ] } ], "source": [ "%%time\n", "from sklearn.decomposition import NMF\n", "nmf = NMF(n_components=100, random_state=43).fit(transformed)\n", "\n", "for topic_idx, topic in enumerate(nmf.components_):\n", " label = '{}: '.format(topic_idx)\n", " print(label, \" \".join([cv.get_feature_names()[i]\n", " for i in topic.argsort()[:-9:-1]]))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }