{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **NLP (Natural Language Processing) Analysis**\n", "**Analyzing the newsgroups data set**\n", "1. Organizing the knowledge and concepts behind words is one NLP task; an **Ontology** is an example\n", "1. Lower-level tasks include **Tagging** and **POS (Part of Speech)** analysis\n", "1. Python libraries for NLP include **NLTK, Gensim, and TextBlob**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## **1. Learning NLP concepts**\n", "\n", "**Natural language analysis tasks in NLTK**\n", "1. **Tokenization** : splitting text into pieces (tokens) on whitespace (Document --> Ngram)\n", "1. **POS Tagging** (part-of-speech tagging) : apply a ready-made tagger from **Konlpy or NLTK**, or use a function such as **pos_tag**\n", "1. **NER** (**Named Entity Recognition**) : identifying named entities such as people, organizations, and places in a sentence\n", "1. **Stemming** : reducing a word to its **stem or root form**; the result is not always a valid word\n", "1. **Lemmatization** : a stricter form of stemming that always **returns a valid word** (the lemma)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'machin'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Stemming: extract the stem of a word\n", "from nltk.stem.porter import PorterStemmer\n", "porter_stemmer = PorterStemmer()\n", "porter_stemmer.stem('machines')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'machine'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Lemmatization: restore the token to its base form\n", "# Requires the WordNet corpus: nltk.download('wordnet')\n", "from nltk.stem import WordNetLemmatizer\n", "lemmatization = WordNetLemmatizer()\n", "lemmatization.lemmatize('machines')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Natural language analysis tasks in Gensim**\n", "Gensim is a module started in 2008 to measure similarity between texts; it supports a variety of models\n", "1. **Similarity Querying** : retrieve objects similar to a given query object\n", "1. **Word vectorization** : represent words while preserving their co-occurrence features\n", "1. **Distributed computing** : train efficiently on a large number of documents\n", "\n", "### **TextBlob**\n", "A library built on NLTK that adds **spelling check and correction, language detection, and translation**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "## **2. Looking at the newsgroups data**\n", "The data holds **11,314** news posts drawn from 20 **newsgroups**\n", "### **01 Exploring the newsgroups data**\n", "> **fetch_20newsgroups()**\n", "\n", "1. subset : which part of the data to load ex) **train** (default), test, all\n", "1. data_home : folder where the files are stored ex) **~/scikit_learn_data** (default)\n", "1. categories : list of categories to extract" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['alt.atheism',\n", " 'comp.graphics',\n", " 'comp.os.ms-windows.misc',\n", " 'comp.sys.ibm.pc.hardware',\n", " 'comp.sys.mac.hardware',\n", " 'comp.windows.x',\n", " 'misc.forsale',\n", " 'rec.autos',\n", " 'rec.motorcycles',\n", " 'rec.sport.baseball',\n", " 'rec.sport.hockey',\n", " 'sci.crypt',\n", " 'sci.electronics',\n", " 'sci.med',\n", " 'sci.space',\n", " 'soc.religion.christian',\n", " 'talk.politics.guns',\n", " 'talk.politics.mideast',\n", " 'talk.politics.misc',\n", " 'talk.religion.misc']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Download the news data (about 14MB)\n", "# Newsgroups are numbered 0 to 19 (20 categories)\n", "from sklearn.datasets import fetch_20newsgroups\n", "groups = fetch_20newsgroups(data_home='data/news/')\n", "groups['target_names']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data : \n", "filenames : \n", "target_names : \n", "target : \n", "DESCR : \n" ] }, { "data": { "text/plain": [ "dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Inspect the keys of the groups object (a dict-like Bunch) and the type of each value\n", "for key in groups.keys():\n", "    print(key, ':', type(groups[key]))\n", "groups.keys()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[7 4 4 ... 
3 1 8]\n" ] }, { "data": { "text/plain": [ "array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,\n", " 17, 18, 19])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Each post's newsgroup is encoded as an integer label\n", "# np.unique returns the distinct label values\n", "import numpy as np\n", "print(groups.target)\n", "np.unique(groups.target)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **02 Looking at an individual post**\n", "Check the contents of post 0" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of posts in the newsgroups data: 11,314\n", "\n", "Sample 0: \n", "From: lerxst@wam.umd.edu (where's my thing)\n", "Subject: WHAT car is this!?\n", "Nntp-Posting-Host: rac3.wam.umd.edu\n", "Organization: University of Maryland, College Park\n", "Lines: 15\n", "\n", " I was wondering if anyone out there could enlighten me on this car I saw\n", "the other day. It was a 2-door sports car, looked to be from the late 60s/\n", "early 70s. It was called a Bricklin. The doors were really small. In addition,\n", "the front bumper was separate from the rest of the body. This is \n", "all I know. 
If anyone can tellme a model name, engine specs, years\n", "of production, where this car is made, history, or whatever info you\n", "have on this funky looking car, please e-mail.\n", "\n", "Thanks,\n", "- IL\n", " ---- brought to you by your neighborhood Lerxst ----\n", "\n", "\n", "\n", "\n", "\n" ] } ], "source": [ "# Print the number of posts and the text of post 0\n", "print(\"Number of posts in the newsgroups data: {:,}\\n\\nSample 0: \\n{}\".format(\n", "    len(groups.data), groups.data[0]))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'rec.autos'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Look up the category name of post 0\n", "groups.target_names[groups.target[0]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **03 Visualizing the data**\n", "1. When analyzing documents, each one is treated as a **Bag of Words** (an unordered collection of words)\n", "1. Check how this differs from **word-level modeling**\n", "1. Plot the posts in **all 20 categories** as a histogram (the categories hold similar numbers of posts)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/markbaum/Python/python/lib/python3.6/site-packages/matplotlib/axes/_axes.py:6521: MatplotlibDeprecationWarning: \n", "The 'normed' kwarg was deprecated in Matplotlib 2.1 and will be removed in 3.1. 
Use 'density' instead.\n", " alternative=\"'density'\", removal=\"3.1\")\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD8CAYAAACb4nSYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3Xt4XPV95/H3d2Ykje7WxZZtybZkbPANMCDjtAEih0AgbeIkhQBtWjYldbPF3U273adstsumPG2fpk82afuEduMGNoSUmoSQxC1OIYEqXBKMbWywLdtYlm+62Eb360hz+e4fM3IGRbaOxjM6c/m+nkePzpzzm9H3p7E+c/w75/yOqCrGGGNyg8ftAowxxswdC31jjMkhFvrGGJNDLPSNMSaHWOgbY0wOsdA3xpgcYqFvjDE5xFHoi8gdInJURFpF5KFptheIyNOx7btEpD62/rdEZH/cV0RE1ie3C8YYY5ySmS7OEhEv8A5wG9AO7AbuU9WWuDZ/AFyjqp8TkXuBT6jqPVNe52rgB6p6RZL7YIwxxiGfgzY3Aq2q2gYgItuBzUBLXJvNwBdjy88AXxMR0fd+otwHbJ/ph1VXV2t9fb2DstLbyMgIxcXFbpcx56zfucX6nT727t3brarzZ2rnJPRrgTNxj9uBjRdro6ohERkAqoDuuDb3EP1w+CUisgXYAlBTU8OXv/xlB2Wlt+HhYUpKStwuY85Zv3OL9Tt9bNq06ZSTdk5C/7KJyEZgVFUPTrddVbcB2wAaGxu1qalpLspKqebmZrKhH7Nl/c4t1u/M4+RAbgewJO5xXWzdtG1ExAeUAz1x2+8F/iXxMo0xxiSDk9DfDawUkQYRySca4DumtNkB3B9bvgt4aXI8X0Q8wKdwMJ5vjDEmtWYc3omN0W8Fnge8wOOqekhEHgH2qOoO4DHgSRFpBXqJfjBMugU4M3kg2BhjjHscjemr6k5g55R1D8ctB4C7L/LcZuB9iZdojDEmWeyKXGOMySEW+sYYk0Ms9I0xJodY6BtjTA6Zk4uzjHHLU7tOO2r3mxuXXtbr+Ucm3rPN6eulMye/O//IxBxUYpLJQt8YnH84JPP1suGDwWQeC31z2eIDbuoe76TL3ZNO9PVySbI/uEx2stDPAMn+Y7bAzCwW5iaZ7ECuMcbkENvTd1Eu7cHlUl+NSWe2p2+MMTnE9vTNRdneuTHZx/b0jTEmh1joG2NMDrHQN8aYHGKhb4wxOcQO5Bpj0oZdkZ16tqdvjDE5xELfGGNyiIW+McbkEAt9Y4zJIRb6xhiTQxydvSMidwB/B3iBb6jqX0/ZXgB8C7gB6AHuUdWTsW3XAF8HyoAIsEFVA8nqgDEmM9i0Hulhxj19EfECjwJ3AmuA+0RkzZRmDwB9qroC+CrwpdhzfcC3gc+p6lqgCQgmrXpjjDGz4mR450agVVXbVHUC2A5sntJmM/BEbPkZ4FYREeB24G1VfQtAVXtUNZyc0o0xxsyWqOqlG4jcBdyhqp+NPf5tYKOqbo1rczDWpj32+DiwEfg00SGfBcB8YLuq/s00P2MLsAWgpqbmhu3btyeha+4aHh6mpKTkkm16XbqpdGVxvqN2idTnCQWI+Pyzft5spbIPiUik3+nWh0R4QgHmlZc5apvMfjj93aWKk7/vubZp06a9qto4U7tUX5HrA24CNgCjwIsisldVX4xvpKrbgG0AjY2N2tTUlOKyUq+5uZmZ+uHWGGdTku9XG8/ffYRA9apZP2+2UtmHRCTS73TrQyL83Udm/Hc+KZn9cPq7SxUnf9/pysnwTgewJO5xXWzdtG1i4/jlRA/otgMvq2q3qo4CO4HrL7doY4wxiXES+ruBlSLSICL5wL3AjiltdgD3x
5bvAl7S6LjR88DVIlIU+zD4ANCSnNKNMcbM1ozDO6oaEpGtRAPcCzyuqodE5BFgj6ruAB4DnhSRVqCX6AcDqtonIl8h+sGhwE5VfS5FfTHGGDMDR2P6qrqT6NBM/LqH45YDwN0Xee63iZ62aYwxxmV2Ra4xxuQQC31jjMkhFvrGGJNDLPSNMSaHWOgbY0wOsdA3xpgcYqFvjDE5xELfGGNyiIW+McbkEAt9Y4zJIRb6xhiTQyz0jTEmh1joG2NMDrHQN8aYHJLq2yXmrN6RibS+zZ0xJjdZ6JucoaoEw8pEOMJ4MMx4KMJEKBL9Hr8uHCEcUSKqqEIkthxRiKgiIvg8gtcT/e7zCCXjYbxj/RTmeSnM91747s/z4PPYf6hN+rDQNxltbCJMz8g4PcMTcd8n6Bkep2dkgoMdAwyPhxgZDzMyHiIUUcevLYDHI3gEPCJI7LsqhCNKKBLhvS93ZtrXyfd5KM73UlzgoyjfS3F+9Hvf6AQVRflUFudRUZRPVUk+FUX5FBf4KPB5EJHL+t1cjMY+wMIXPsyUPK8Hn0dS9jNN+rDQN2kjGI4wNhFmLBh+z/fRiRDDsdAemQgxPB7iH5pb6R2ZYHQiPO1rFfg8VJcUAFBakMfCskKKC7wU5UcDNd/nifvuvfC4wBtd53UYgBFVQmGF80fpL20gMFl3MMxYMNafiRAjsX6MjId5d2ickYkwrx3vuejrioDfF/3fQjii5HmFPG+0Lo37oFE0eiNSot/CEY0L86mPJ5en/5kegQKfl4I8D4V5XsoL86gozqem1M/ieX4Wlvnxee1/LZnOQj9LqCpjwTCDgRDBUIRgJEIorATD0b3RyfjyCPy45VxsLxZEJLoscmFvdnKP9kT3CEL0cTgWbuFI9DXDkejjYCQSWxfd82U4TKCjk1CsTTCihMMRQhGNfk0+N/Y1uTweDF9yL9wrQnFBdG+5pMDHutpyKouje8fVxQW/WC6JLhflexGRlB9X8YiQ7xP8+UJ+qX9Wz/3k9bX0jwbpHZmgb3SC3pHo1/B4iPHYB0cgGOFQ5yDBcCT6u4slvvCLD6TJz6bJ/5l4PYI39n5GH0d/f9H/tUS3T373SvTfQCgcITA51BUKMzoRpn80SFv3CBOhCAA+j1BfXczKBSWsXRz9/ZvMY6GfocYmwhw9N8Tp3hHO9I7RPTzOeOyPcybfTlEQCpDnAa+3H593crzbc2HZ6/FQkOe9MB7undzukQt7l/Hj4YV50a+ifB/+vPcOd/zmxqUp6cNc8ud5WVjuZWH5pT8s3DwhQFXpGw3S2T/GqZ4Rjp0f5kcHz/Kjg2dZUlHIjZVhbgsEKfXnzWldTn8n2fDvJNks9DNM6/lhdp3o4cjZIcIRJd/roa6ikOuXVVBRlE+Z30e+14PP6yHfK/i8HjweIToKED0wece6hb84SKmT/92PHaiMKEp0/U9azgPRP3zP5EHL2Njve5a90fD2CBT2HCVQvcrNX5FJIhGhsjifyuJ81tWWA9A3MsHbHQO8daaf7x2P8O9/9SKfvL6OLbcsZ0llkcsVm5lY6GeIzv4x/v3QWVrPD1Nc4GNjQyXX1s2jtqIQzywPvk3+8c7kZPdoIqWaLFdRnM8HrpzPB66cz7kTLbR7FvH07jM89cZpPnFdLQ9uWkFDdbHbZZqLcBT6InIH8HeAF/iGqv71lO0FwLeAG4Ae4B5VPSki9cBh4Gis6euq+rnklJ4bVJWXj3XzwqGzFOZ7+bWrF7GxodIOqJm0sKzUwx/92rX89w9fxddfPs5Tu07z7JvtfPTaxfzhB1ewYkGp2yWaKWYMfRHxAo8CtwHtwG4R2aGqLXHNHgD6VHWFiNwLfAm4J7btuKquT3LdOWE8GOa7e9tp6Rrk6tpyPnFdLf48r9tlGfNLFpb7+d8fXct/brqCb7xygid/fop/fauTu26o449uu5JF5YVul2hinOzp3wi0qmobgIhsBzYD8aG/GfhibPkZ4GtiJ/xelmA4w
hM/P8Xp3hE+cvUi3n9FlZ1DbdLeglI/X/jIan7/luX8Q/Nxnvz5KX64v5PfvamB+SUFttOSBkT10heriMhdwB2q+tnY498GNqrq1rg2B2Nt2mOPjwMbgRLgEPAOMAj8maq+Ms3P2AJsAaipqblh+/btSeiau/oHBon4ZncK36SwKk8cDnOoR/n0Ki/XzU/uUI7TU+16RyZm/dqeUCDhfs9GKvuQiET6nW59SIQnFGBeedlFt787GuF7xyZ4vStMsQ8+tNTD+xdFTwCYC6k6rXR4eJiSkpKUvHaiNm3atFdVG2dql+oDuV3AUlXtEZEbgB+IyFpVHYxvpKrbgG0AjY2N2tTUlOKyUu/Z515I+CyW7+/r4GBPLx+7djGrl1cRSHJtTQ5PY0vkVEF/95E5OXsnlX1IRCL9Trc+JMLffYSZ/l7vBg52DPD57fv5YdswL5/1ctuahVxbV57y/706/R3PVnNz84z9TldOdiE7gCVxj+ti66ZtIyI+oBzoUdVxVe0BUNW9wHHgysstOpsd6hxg98leblk5n/ctr3K7HGOSYl1tOb97UwOf+dV6/HlevrPnDNtebqOzf8zt0nKOk9DfDawUkQYRyQfuBXZMabMDuD+2fBfwkqqqiMyPHQhGRJYDK4G25JSefUbGQ/xgfyeL5/m5bU2N2+UYk3Qra0p5cNMKPnldLe8Oj/Pof7Ty3NudF676Nak34/COqoZEZCvwPNFTNh9X1UMi8giwR1V3AI8BT4pIK9BL9IMB4BbgEREJAhHgc6ram4qOZIMdb3USmAjzwE0NeOdozNOYueYRobG+krWLy3mh5SyvHe/h8NkhfuP6Oju/fw44GtNX1Z3AzinrHo5bDhAdupv6vO8B37vMGnPC0bODHOgY4PY1NSwsS/2BUGPcVpjvZfP6Wq6uLefZfR1845U2bl1dQ9NV82d9waFxzq7wSQOqyo9bzlFZnM/NK+e7XY4xc2r5/BL+8IMruKaunJ8cPse3fn6SsYvMnmoun4V+GmjpGqRzIMAHVy2wYR2Tkwp8Xj7VuITN6xdz/PwI2145zsBY0O2yspKFvssiqrx4+DzVJflcWzfP7XKMcY2IsLGhiv/0/nr6RoN8/afHeXdo3O2yso6FvssOdgxwdjDAratqbC/fGOCK+SX83s3LCUaUb7zaltYXp2UiC32XvXKsm/klBVxd52zmS2NyQe28Qj57UwOhsPL4aycYCthQT7JY6Luoo3+Mjv4x3re80s5WMGaKmjI/9/9qPUOBIN/82UnGQ3ZwNxks9F20+2QvPo+wfkmF26UYk5aWVhbxWxuXcXYgwPf3dTDTXGFmZhb6LhkPhXnrTD/X1JVTmG8zDxpzMVfWlHLbmhrebh/g9baL30zeOGOh75ID7QOMhyJsqK90uxRj0t4tV85n1cJSdh44y5leu6Pb5bDQd8kbJ3upKStgqd1T1JgZeUS4+4YllBb6+O7eMwTDNldPoiz0XXB+KEB73xiNyyrtxijGOFSY7+WT19XRPTzBi4fPu11OxrLQd0FLZ/R2Ak5vUG6MiVqxoITGZRW82vouHX02LXMiLPRdcLBzgCUVhZQX5rldijEZ5851iygu8PHsvnbCETubZ7Ys9OdY38gEnf0B28s3JkGF+V5+/ZrFdA0EePNUn9vlZBwL/Tl2qCs6tLNm0cXvK2qMubR1i8tYVlnEC4fPMR60i7Zmw0J/jh3qGGBRuZ+qkgK3SzEmY4kIH7l6ESPjIX567F23y8koFvpzaDAQ5HTvKGsX216+MZdrSWUR19SV8+qxbvpHbVI2pyz059DhrkEUWLvYxvONSYYPr12IKvz0Hdvbd8pCfw4dOzfMvKI8FpTa0I4xyVBRlM/1y+ax91Qfg3bTFUcc3SPXXL5wRGnrHmbd4nK7IMuYJPrAlQvYe6qPV1u7+cjVi96z7aldpx29xm9uXJqK0tKS7enPkY7+MQLBCCsWlLhdijFZpbI4ete5XSd6GB4PuV1O2nMU+
iJyh4gcFZFWEXlomu0FIvJ0bPsuEamfsn2piAyLyJ8kp+zM03p+GCF6VyBjTHJ94Kr5hMLKz1q73S4l7c0Y+iLiBR4F7gTWAPeJyJopzR4A+lR1BfBV4EtTtn8F+NHll5u5Ws8Ps2ien+ICG1EzJtkWlPpZs7iMXSd6mQjZZGyX4mRP/0agVVXbVHUC2A5sntJmM/BEbPkZ4FaJDVyLyMeBE8Ch5JScecZDYc70jrLC9vKNSZlfuaKKsWCYt9v73S4lrTkJ/VrgTNzj9ti6aduoaggYAKpEpAT4U+DPL7/UzHWye4SwKisWlLpdijFZq6GqmJqyAn7e1mN32LqEVI81fBH4qqoOX+qMFRHZAmwBqKmpobm5OcVlpZ4nFMDffQSAE6fC+ASu1DPkd7t/5k5zc5ujdv6R2V/wEt/vVEplHxKRSL/TrQ+J8IQCjv9e56IfNy8I80xrhLMnDtNQ5vw8FafvxaTh4eGMzSknod8BLIl7XBdbN12bdhHxAeVAD7ARuEtE/gaYB0REJKCqX4t/sqpuA7YBNDY2alNTUwJdSS/PPvcCgepVALzz1jGWVXuJ1Cwn4HJdAE0OT09zerpbPH/3kQv9TqVU9iERifQ73fqQCH/3EZz+vc5FP9bOC/Nvp47w054SFi13fhqm0/diUnNzs+N+pxsnH4W7gZUi0iAi+cC9wI4pbXYA98eW7wJe0qibVbVeVeuBvwX+amrgZ7uxiTBnBwM0VNt4vjGpVuDzcsPSCg52DDAUsIu1pjNj6MfG6LcCzwOHge+o6iEReUREPhZr9hjRMfxW4I+BXzqtM1ed7h0BYFmV3RbRmLmwob6SiMJb7QNul5KWHI3pq+pOYOeUdQ/HLQeAu2d4jS8mUF/GO9UzikdgSYWFvjFzYUGZn7qKQvad7uOmFdVul5N27IrcFDvVO8qi8kLyffarNmauXLdkHl0DAboG7JaKU1kSpVA4orT3jdrQjjFz7Jq6eXhF2HfaztmfykI/hTr7xwiGlWVVxW6XYkxOKS7wcdXCUvaf6bf76E5hoZ9Cp3piB3ErbU/fmLl2/dJ5DI+HaD0/5HYpacVCP4VO9Y5SUZRHWWGe26UYk3OuXFhKYZ7XzuKZwkI/RVSVUz2jNrRjjEt8Hg9rFpVxuGuQUNgmYZtkoZ8iPQEYHg/ZQVxjXLSutozxUITj7w67XUrasNBPkVND0YNHS2083xjXXDG/hAKfh4Mdg26XkjYs9FPkzJCS5xUWlPrdLsWYnOXzeli9qIyWrkE7iyfGQj9Fzgwri8oL8Xrcn1XTmFy2bnE5Y8Ewbd02xAMW+ikRCkdoH1bqKgrdLsWYnLeypoR8rw3xTLLQT4Fj54cJRrDQNyYN5Hk9XLWwlJauQSJ2cxUL/VQ4EDsvuG6eHcQ1Jh2sXlTKyHiIjj6bi8dCPwXeau/H74XKkny3SzHGAFcuKEWAo+fs6lwL/RR4u32AJSWC5xK3iDTGzJ2iAh9LKos4etZC30I/ycZDYY6cHWRJqQW+MenkqoWldPSP5fwdtSz0k+xI1xDBsLKkxELfmHRyVU0pAO+cy+1TNy30k+zt9uj83banb0x6WVTup9Tvy/lxfQv9JHu7fYCq4nwqCtyuxBgTT0S4qqaU1vNDOX11roV+kh3qHGRtbTliB3GNSTtX1pQSCEY41TvidimusdBPoolQhGPnh1izqMztUowx01ixoASPwPHzuTuub6GfRK3nhwmGlTWLLfSNSUf+PC+18wpptdC/NBG5Q0SOikiriDw0zfYCEXk6tn2XiNTH1t8oIvtjX2+JyCeSW356aemKzu1he/rGpK8rFpTQ0T9GIBh2uxRXzBj6IuIFHgXuBNYA94nIminNHgD6VHUF8FXgS7H1B4FGVV0P3AF8XUR8ySo+3bR0DuLP89BQbXfLMiZdrZhfQkThRHdujus72dO/EWhV1TZVnQC2A5untNkMPBFbfga4VUREVUdVNRRb7wey+pB5S9cAqxaW2XTKx
qSxpZVF5HmF1hy9m5aT0K8FzsQ9bo+tm7ZNLOQHgCoAEdkoIoeAA8Dn4j4Esoqq0tI5aOP5xqQ5n9fDsqrinD2Ym/KhFlXdBawVkdXAEyLyI1UNxLcRkS3AFoCamhqam5tTXVbSdY9FGAyE8A2dpbm5B08ogL/7iNtlTau5uc1RO//IxKxfe676nco+JCKRfqdbHxLhCQUc/72mUz9WFYX5t/MRJjoPU5Yvjt+LScPDwxmZU+As9DuAJXGP62LrpmvTHhuzLwd64huo6mERGQbWAXumbNsGbANobGzUpqamWXQhPbxw6Cywl483NXL90gqefe4FAtWr3C5rWk0blzpq99Su07N+bX/3kTnpdyr7kIhE+p1ufUiEv/sITv9e06kfy7xjcLKVlvBC1ldXOH4vJjU3Nzvud7pxMryzG1gpIg0ikg/cC+yY0mYHcH9s+S7gJVXV2HN8ACKyDFgFnExK5WmmpWsQj8DqhTa8Y0y6WzTPT2Gel+Pnc+9g7ox7+qoaEpGtwPOAF3hcVQ+JyCPAHlXdATwGPCkirUAv0Q8GgJuAh0QkCESAP1DV7lR0xG0tnYM0VBdTmO91uxRjzAw8IjRUF+fkfXMdjemr6k5g55R1D8ctB4C7p3nek8CTl1ljRmjpGuS6pRVul2GMcaihupiWrkH6R9PnWMNcsCtyk2BgLEh735hdlGVMBpm8nuZkT24N8VjoJ8HhyStx7XRNYzLGwnI//jwPJ7pH3S5lTlnoJ0FLp02/YEym8YiwrLI4567MtdBPgkOdg8wvLWB+qU2ib0wmaagupnt4nHeHxt0uZc5Y6CdBS9eg7eUbk4Emx/XfONHrciVzx0L/Mk2EIrSeH7LxfGMy0OJ5heR7PbxxomfmxlnCQv8yHTsfvRG67ekbk3m8HmFpVRG7bE/fOHXhIK7t6RuTkeqrijlydihnzte30L9MLV2DFOZ5qa+yOfSNyUT11UUA7D3V53Ilc8NC/zK1dA6yalGpzaFvTIaqmxedX3/3SQt9MwNVtTN3jMlw+T4P62rL2XMyN8b1LfQvQ3vfGEOBkI3nG5PhNtRX8nb7QE7cN9dC/zLYjdCNyQ6NyyqYCEc40DHgdikpZ6F/GVo6o3Por7I59I3JaDcsi86QuzsHhngs9C9DS5fNoW9MNqgqKeCK+cXsyYGDuRb6lyF6I/Ryt8swxiTBhvpK9pzsJRJRt0tJKQv9BA2MBunotzn0jckWjfWVDAZCHDuf3XfTstBPUIvNoW9MVtlQnxvj+hb6CbIzd4zJLksri5hfWpD15+tb6CeoxebQNyariAgb6ivYk+XTMVjoJ8iuxDUm+zQuq6S9b4yugTG3S0kZC/0E2Bz6xmSnDfWVAFl96qaj0BeRO0TkqIi0ishD02wvEJGnY9t3iUh9bP1tIrJXRA7Evn8wueW7w+bQNyY7rV5USlG+N6vH9WcMfRHxAo8CdwJrgPtEZM2UZg8Afaq6Avgq8KXY+m7go6p6NXA/8GSyCnfT5Bz6a21P35is4vN6uH5pRVbPuOlkT/9GoFVV21R1AtgObJ7SZjPwRGz5GeBWERFV3aeqnbH1h4BCEcn4I58tXYMU5XtZZnPoG5N1GusrOHJ2kMFA0O1SUsJJ6NcCZ+Iet8fWTdtGVUPAAFA1pc1vAG+qasbfdr6lc5BVC20OfWOy0Yb6SiIK+073u11KSvjm4oeIyFqiQz63X2T7FmALQE1NDc3NzXNRVkIiqrx9ZpT3LfJdsk5PKIC/+8jcFTYLzc1tjtr5R2Z/+7i56ncq+5CIRPqdbn1IhCcUcPz3ms79iH8vAiHFI/DMT/ehnfnTth8eHk7rnLoUJ6HfASyJe1wXWzddm3YR8QHlQA+AiNQB3wd+R1WPT/cDVHUbsA2gsbFRm5qaZtGFuXWye4Sx55v58IbVNN249KLtnn3uBQLVq+awMueaNl687nhP7To969f2dx+Zk36nsg+JSKTf6daHRPi7j+D07zWd+zH1vVh35
FXeVS9NTb8ybfvm5mbH/U43ToZ3dgMrRaRBRPKBe4EdU9rsIHqgFuAu4CVVVRGZBzwHPKSqryWraDcd7IzOt72u1iZaMyZb3bCsgv1n+pkIRdwuJelmDP3YGP1W4HngMPAdVT0kIo+IyMdizR4DqkSkFfhjYPK0zq3ACuBhEdkf+1qQ9F7MoYMdg+R5hStrSt0uxRiTIhvqKwkEIxzqzL6bqjga01fVncDOKesejlsOAHdP87y/AP7iMmtMK4c6B7hqYSn5PruuzZhs1Ri7qcqek31ct7TC5WqSy5JrFlSVAx0DrLM59I3JagvK/CyrKsrKGTct9Geho3+M/tGgjecbkwMal1Wy51Qfqtl1UxUL/Vk42BG9EtdC35jst6G+gt6RCdq6R9wuJaks9GfhUOcAXo+waqEdxDUm2zVemHwtu4Z4LPRn4UDHACsXlODPsxuhG5PtrphfTEVRXtbNw2Oh75CqcrBjgLV2ENeYnCAiNMZulp5NLPQdOj80TvfwBFfX2syaxuSKDfUVnOwZ5d2hjJ8y7AILfYfeOhOdfMkO4hqTOybH9feeyp69fQt9h/af6cfnEQt9Y3LIusXlFPg8WTWub6Hv0P4z/axaVGoHcY3JIfk+D+uXzMuqcX0LfQfCEeXt9gGuW5Jdl2MbY2a2ob6Sg52DjE6E3C4lKSz0HTj+7jDD4yHWL5nndinGmDnWWF9BOKLsz5KbqljoO7DvdHQ8b/1SC31jcs31yyoQIWvG9S30Hdh/pp8yv48GuyeuMTmnzJ/HqoVl7MmSM3gs9B3Yd7qf9Usr8Ng9cY3JSRvqK3jzVB+hcObfVMVCfwYj4yHeOTdk4/nG5LDG+kpGJsIcOTvkdimXzUJ/Bm+3DxBRuM5C35ictaE+euZeNsyvb6E/g/2xK3GvtdA3JmctKi+kdl4hb5yw0M96b57uo76qiMrifLdLMca46FeuqOJnx3sIRzL7pioW+pcQiShvnOhlY0OV26UYY1x288pqBsaCHOzI7JulW+hfwpGzQwyMBdm4vNLtUowxLrtpRTUArxx71+VKLo+F/iW83tYDwMbltqdvTK6rKilg7eIyXj7W7XYpl8VR6IvIHSJyVERaReShabYXiMjTse27RKQ+tr5KRP5DRIZF5GvJLT31Xm/rYWllEbXzCt0uxRiTBm5eOZ99p/sYC2XuuP6MoS8iXuBR4E5gDXCfiKyZ0uwBoE9VVwBfBb6JtZ/MAAAKLklEQVQUWx8A/hfwJ0mreI5EIsobJ3t5nw3tGGNibl5ZTTCsHO0Nu11Kwpzs6d8ItKpqm6pOANuBzVPabAaeiC0/A9wqIqKqI6r6KtHwzyhHzw3RPxq0g7jGmAtuWFaBP8/DoZ7sDv1a4Ezc4/bYumnbqGoIGAAyOi1/MZ5ve/rGmCh/npcbG6o42J25oe9zuwAAEdkCbAGoqamhubnZ3YKAf9sXYH6h0PrWG7Qm8HxPKIC/+0jS60qG5uY2R+38IxOzfu256ncq+5CIRPqdbn1IhCcUcPz3ms79cPpeACyWIC+PKN/d+RLzizLvXBgnod8BLIl7XBdbN12bdhHxAeVAj9MiVHUbsA2gsbFRm5qanD41JSIR5Y9e/jG3rllEU9O1Cb3Gs8+9QKB6VZIrS46mjUsdtXtq1+lZv7a/+8ic9DuVfUhEIv1Otz4kwt99BKd/r+ncD6fvBcCydSNsP9rMUFkDd9/UkMKqUsPJx9RuYKWINIhIPnAvsGNKmx3A/bHlu4CXVDVjD28f6BigbzTI+1dk9AiVMSYFGqqLqS0Rnj901u1SEjLjnr6qhkRkK/A84AUeV9VDIvIIsEdVdwCPAU+KSCvQS/SDAQAROQmUAfki8nHgdlVtSX5XkufFw+fwCDRducDtUowxaej6Gh/PtfXSOzKRcVO0OBrTV9WdwM4p6x6OWw4Ad1/kufWXUZ8rfnL4PDcsq6Aiw95MY8zcuGGBl389HuTFw+e4u3HJzE9II5l3FCLFugbGa
Oka5NbVNW6XYoxJU8vKPCwu9/P8oXNulzJrFvpTvHj4PAAfWm1DO8aY6YkIt69dyCvH3mV0IuR2ObNioT/Fi4fPsbSyiCvml7hdijEmjd2+pobxUISX38msCdgs9OOMToR47XgPt65egIjdD9cYc3E3NlRSVZzPD/d3ul3KrFjox3mttYeJUIQP2Xi+MWYGPq+Hj19Xy08On6M3jS88m8pCP84P9ndQUZTHhnqbesEYM7O7G+sIhpUf7Jt6vWr6stCP6R+d4MeHzrF5fS35Pvu1GGNmtmphGVfXlvPdve1ul+KYpVvMjrc6mQhHuOuGOrdLMcZkkE811nG4azBjbqNooR/z3T3trF5UxrracrdLMcZkkI9dGx0deCZD9vYt9IEjZwc50DHA3baXb4yZpfKiPD68diHPvtnOUCDodjkzstAnupef5xU+ft3U2wQYY8zMtty8nMFAiG+/nr4ziU7K+dAfDAR5Zm87t62pybiJk4wx6eHqunJuuXI+j73aRiCY3jdYyfnQf+K1kwyMBfmDphVul2KMyWAPNl1B9/AE299I7739nA79wUCQf3qljQ+trrEDuMaYy7JxeRUb6iv4+sttTIQibpdzUTkd+t987SSDgRCf/9BKt0sxxmSBrR9cSddAgG/+7ITbpVxUzob+wFiQb7zSxm1rbC/fGJMct6ys5kOrF/CVH7/Dmd5Rt8uZVs6G/p/vOMTIRNj28o0xSSMiPLJ5HV4RvvD9A6TjXWNzMvR3Huji2X0dbN20grWLbS/fGJM8i+cV8qd3ruKVY908+2b6zcmTc6F/fjDAF75/gGvrytn6QTtjxxiTfJ/euIwb6yv5nz84wJun+9wu5z1yKvQHxoL83pN7CQTDfOWe9eR5c6r7xpg54vEI//jp61lY5ueBb+6m7d1ht0u6IGdSr390gt9+bBctnQP8/b3X2Z2xjDEpVVVSwDc/cyMeEX7n8Td459yQ2yUBORL6R84Ocu+21znSNcT//fQN3L52odslGWNyQH11Mf/vMxsIBCNs/tpraTHvvqPQF5E7ROSoiLSKyEPTbC8Qkadj23eJSH3ctv8RW39URD6cvNJnNjwe4svPH+XX//5Vzg+N80/3N3Kr3RXLGDOHrqmbx87/chNX15bz+af389kn9nCg3b1pmH0zNRARL/AocBvQDuwWkR2q2hLX7AGgT1VXiMi9wJeAe0RkDXAvsBZYDPxERK5U1ZRMTqGqdPSP0dI5yL8fPMuPDp5lLBjmk9fX8me/tsbm1jHGuGJBmZ9//r2NfP2nx9n2chsf/do53re8kltX1XDTymoaqovx53nnpJYZQx+4EWhV1TYAEdkObAbiQ38z8MXY8jPA1yR6Z/HNwHZVHQdOiEhr7PV+npzyf2Hf6T7uf/wNBgMhAEr9Pj5xfS33NC7h2iXzkv3jjDFmVvK8HrZ+cCX3/2o93/r5KX64v4O/3Hn4wvbqknx+/ZrFfPFja1Nah5PQrwXOxD1uBzZerI2qhkRkAKiKrX99ynNTMn/xksoiPnrtYlYvKmP1ojLWLi6bs09OY4xxqtSfx4ObVvDgphV09o+x60QP7b1jdA6MUV9VlPKf7yT0U05EtgBbYg+HReSom/UkSTXQ7XYR0/mt1L78nPQ7xX1IxKz7nYZ9SETa/jufjQTei5T1+zOJP3WZk0ZOQr8DWBL3uC62bro27SLiA8qBHofPRVW3AducFJwpRGSPqja6Xcdcs37nFut35nFy9s5uYKWINIhIPtEDszumtNkB3B9bvgt4SaOTTuwA7o2d3dMArATeSE7pxhhjZmvGPf3YGP1W4HnACzyuqodE5BFgj6ruAB4DnowdqO0l+sFArN13iB70DQEPpurMHWOMMTOTdJwFLhuIyJbYsFVOsX7nFut35rHQN8aYHJIT0zAYY4yJstBPspmmrMhWInJSRA6IyH4R2eN2PakkIo+LyHkRORi3rlJEfiwix2LfK9ysMRUu0u8vikhH7H3fLyIfc
bPGVBCRJSLyHyLSIiKHROS/xtZn5HtuoZ9EcVNW3AmsAe6LTUWRKzap6vpMPZVtFr4J3DFl3UPAi6q6Engx9jjbfJNf7jfAV2Pv+3pV3TnHNc2FEPDfVHUN8D7gwdjfdUa+5xb6yXVhygpVnQAmp6wwWURVXyZ6llq8zcATseUngI/PaVFz4CL9znqq2qWqb8aWh4DDRGcWyMj33EI/uaabsiIl006kIQVeEJG9sSusc02NqnbFls8CuTSd61YReTs2/JMRQxyJis0gfB2wiwx9zy30TbLcpKrXEx3aelBEbnG7ILfELkzMldPi/hG4AlgPdAH/x91yUkdESoDvAZ9X1cH4bZn0nlvoJ5ejaSeykap2xL6fB75PdKgrl5wTkUUAse/nXa5nTqjqOVUNq2oE+Cey9H0XkTyigf/PqvpsbHVGvucW+snlZMqKrCMixSJSOrkM3A4cvPSzsk78VCT3Az90sZY5Mxl6MZ8gC9/32DTxjwGHVfUrcZsy8j23i7OSLHbK2t/yiykr/tLlklJORJYT3buH6NQeT2Vzv0XkX4AmojMtngP+N/AD4DvAUuAU8ClVzaqDnhfpdxPRoR0FTgK/HzfOnRVE5CbgFeAAEImt/gLRcf2Me88t9I0xJofY8I4xxuQQC31jjMkhFvrGGJNDLPSNMSaHWOgbY0wOsdA3xpgcYqFvjDE5xELfGGNyyP8HHwd4yVMVvI0AAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "# Suppress the FutureWarning raised inside Seaborn\n", "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "\n", "# Visualize how the posts are distributed over the category labels\n", "# The 11,314 posts are spread fairly evenly across the 20 categories\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.distplot(groups.target)\n", "plt.grid()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **04 Counting text with a feature set**\n", "1. Use **scikit-learn's CountVectorizer** to vectorize each post with the **500 most frequent** tokens\n", "1. The plot above compared the **number of posts per category**, so the bars came out similar\n", "1. Let's see what happens when the feature set is built on a **per-word basis**\n", "\n", "> Parameters of **CountVectorizer()**\n", "\n", "1. **stop_words** : the stop-word list to apply ex) **None (default), 'english', ['a', 'the', 'of']**\n", "1. **ngram_range** : lower and upper bound of the n-grams to extract ex) **(1,1) (default), (1,2), (2,2)**\n", "1. **lowercase** : whether to convert text to lowercase ex) **True (default), False**\n", "1. **max_features** : if not None, the maximum number of tokens to keep ex) **None (default), 500**\n", "1. 
**binary** : if True, every nonzero count is set to 1 ex) **False (default), True**" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['00', '000', '0d', '0t', '10', '100', '11', '12', '13', '14', '145', '15', '16', '17', '18', '19', '1993', '1d9', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '34u', '35', '40', '45', '50', '55', '80', '92', '93', '__', '___', 'a86', 'able', 'ac', 'access', 'actually', 'address', 'ago', 'agree', 'al', 'american', 'andrew', 'answer', 'anybody', 'apple', 'application', 'apr', 'april', 'area', 'argument', 'armenian', 'armenians', 'article', 'ask', 'asked', 'att', 'au', 'available', 'away', 'ax', 'b8f', 'bad', 'based', 'believe', 'berkeley', 'best', 'better', 'bible', 'big', 'bike', 'bit', 'black', 'board', 'body', 'book', 'box', 'buy', 'ca', 'california', 'called', 'came', 'canada', 'car', 'card', 'care', 'case', 'cause']\n" ] } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "import numpy as np\n", "\n", "# Keep only the 500 most frequent words\n", "cv = CountVectorizer(stop_words=\"english\", max_features=500)\n", "transformed = cv.fit_transform(groups.data)\n", "# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()\n", "print(cv.get_feature_names()[:100])\n", "# Prints the first 100 of the 500 tokens in alphabetical order: numbers and symbols with little power to tell posts apart are included" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYUAAAEWCAYAAACJ0YulAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3Xl8XHW5+PHPMzOZ7M3StGmbLkkXKC0tS0PLThDUsqNXZVEUr9iListFfxe8ehFxvdcNRbgIygVRqIjsi0DBQBEopS0t3SjdaNN0S5qkzZ7JPL8/zskwTbNMlpPJTJ736zWvzJzzPec835nJeeac7/d8j6gqxhhjDIAv3gEYY4wZPiwpGGOMibCkYIwxJsKSgjHGmAhLCsYYYyIsKRhjjImwpJDgROROEfmvQVrXZBGpFxG/+7pcRK4ZjHW763tWRD43WOuLcZs3i8ifhmhbp4nIe+57eOlQbHM4EJFiEVERCcQ7FjNwlhSGMRHZLiJNInJIRGpF5DURuVZEIp+bql6rqj+IcV3n9lRGVXeoapaqtg9C7EfsjFX1PFW9b6Dr7mJb94pIq7szPiAiL4jIzH6sp9f3qBe3AL9138PHulh/uYg0u3HWi8i7neZfKSLvi0iDiDwmIvlR8/JF5FF33vsicmU3dRjv7qALo6Z9p5tpfx9AXfvErdtbbr13uz8QTh+C7aqITPd6O8nEksLwd5GqZgNTgJ8CNwB/GOyNJMGvvP9R1SxgIrAPuDcOMUwB1vVS5jo3aWSp6tEdE0VkNvA74CqgEGgE7oha7nag1Z33aeB/3WUOo6q7gc3AmVGTzwQ2djHtlRjrFdGf74mIXA/cCvwYJ/7JOHW7pK/rMkNAVe0xTB/AduDcTtPmA2HgWPf1vcAP3ecFwFNALXAAWIqT+O93l2kC6oH/AIoBBb4A7MDZQXRMC7jrKwd+ArwJHAQeB/LdeWVARVfxAgtxdmBt7vZWR63vGve5D/gu8D7OTvyPQI47ryOOz7mxVQHf6eF9irwH7usLgHr3+c3An6LmXYyz46514znGnX7Ee9TNtr6Is9M9ADwBTHCnb+m0fGoXy0bq38W8HwMPRL2e5r6H2UCm+/yoqPn3Az/tZl1/AG5zn/uB/cC1naYdBE53X+e47/9+9/P4LuBz510N/BP4FVAN/NBd/ufu57IV+Er096ZTLDnu+/HJHj6/VJykUek+bu14/9ztv9qpvALToz7724GngUPAMmCaO+8Vt2yDG8NldPM/Eu//9eH0sCOFBKOqbwIVwBldzP6mO28Mzi+y/3QW0atwdq4XqfML9X+iljkLOAb4aDeb/Czwr8B4IAT8JoYY/46zk/uLu73juih2tfs4G5gKZAG/7VTmdOBo4BzgJhE5prdti0gWzi/pVV3MOwp4EPgGznv0DPCkiAR7eY86lv8QTpL8FM778T6w2K3ztE7Lt3QT4k9EpEpE/ikiZVHTZwOrO16o6hbcROA+Qqq6Kar8aneZrrzCB0cFJwAbgBc7TUvBSfYAt+HsvKfifB8+C3w+an0LcHb+hcCPcBLjhe56SoFPdBMHwClAGvBoD2W+A5wMHA8ch/PD57s9lO/scuD7QB5Owv4RgKp21Pc49zP5C938j/RhW0nPkkJiqgTyu5jehrOzmqKqbaq6VN2fTD24WVUbVLWpm/n3q+paVW0A/gv4VEdD9AB9Gvilqm5V1Xrg28DlnU5PfF9Vm1R1Nc5OsKvk0uFbIlKLs1PIwkk4nV0GPK2qL6hqG86v3XTg1D7EfI+qrnR3+t8GThGR4hiXvwFnx1sE3IWTkKa587KAuk7l63COFLJwftl3Na8rLwPHikguzo+Hpar6HjAmatobqtrqfpaXA99W1UOquh34Bc5prA6Vqnqbqobc78mngFtVdaeqHsBJlN0ZDVSpaqiHMp8GblHVfaq6H2cHf1UP5Tt7VFXfdLfxZ5zk0p3+/I+MKJYUElMRzqFvZz/
D2Sk+LyJbReTGGNa1sw/z38f5hVkQU5Q9m+CuL3rdAZxfbx32RD1vxNk5dufnqpqrquNU9WL3l3aP21TVME79ivoTs5vMqmNdXlWXuTveFnUa3P8JnO/OrgdGdVpkFM4pkZ7mdbWd7cAunJ3/mTinSABei5rW0Z5QgPOZdv4souvU+TsygSO/F92pBgp6aYvo6rswoYfynfXle9Kf/5ERxZJCghGRk3D+YV/tPM/d4XxTVafinDu/XkTO6ZjdzSp7+5U0Ker5ZJxfWlU452kzouLy4xySx7reSpyG2eh1h4C9vSw3EIdtU0QEp3673El9illEMnF+Ce/qdomeKSDu83VEHQmJyFScc+2b3EdARGZELXscPTdqd5xCOgUnGYCTHM7EOS3XkRSqcD7Tzp9FdJ06vy+7OfJ70Z3XgRagpy66XX0XKt3nnb9n43pYT696+R8xWFJIGCIySkQuxDmH/SdVfaeLMheKyHR3Z1cHtOM0foKzs53aj01/RkRmiUgGTpfLh9XpsroJSBORC0QkBecccGrUcnuB4ujus508CPy7iJS47QAdbRA9nWYYqIeAC0TkHDfmb+LssDp2mr29Rw8CnxeR40Uk1Y15mfvLvEcikisiHxWRNBEJiMincXbQHd1C/wxcJCJnuMnmFuARdyfWADwC3CIimSJyGk7Pnft72OQrOG0DlaracerpVXdaDs7OGvezfAj4kYhki8gU4Hqgp2s7HgK+JiITRSQP6PbXtqrWATcBt4vIpSKSISIpInKeiHS02zwIfFdExohIgVu+Y/urgdnue56G03GgLw77THv5HzFYUkgET4rIIZzD9e8Av+TwRsBoM4AlOKcbXgfuUNV/uPN+gvOPVysi3+rD9u/H6eGxB6fB8GsQ+Wf/MvB7nF+VDTgNeB3+6v6tFpGVXaz3HnfdrwDbgGbgq32Iq89U9V3gMzgNq1XARTgNw61ukR7fI1VdgtOu8jecX8vTcM7HxyIFp+fOfnfbXwUu7Wg8VtV1OD2E/ozTGysb5/3t8GWc9o99ODvRL7nLdOdlYCyHH1G+7a5jhao2Rk3/Ks7nt9Ut/wDO59Odu4HncHbYK3ESVrdU9Rc4iea7OPXfCVwHdFzL8UPgLWAN8I67zh+6y27CSZBLgPfo4gi5FzcD97mf6afo+X/EAGJtLMYYYzrYkYIxxpgISwrGGGMiLCkYY4yJsKRgjDEmIuEGQSsoKNDi4mJPt9HQ0EBmZqan2xhqyVgnsHolkmSsEyROvVasWFGlqmN6K5dwSaG4uJi33nrL022Ul5dTVlbm6TaGWjLWCaxeiSQZ6wSJUy8R6enK8wg7fWSMMSbCkoIxxpgIz5KCiNwjIvtEZG0v5U4SkZCI9DT8rjHGmCHg5ZHCvTg3W+mWO4jafwPPexiHMcaYGHmWFFT1Fboe3jnaV3HGkdnnVRzGGGNi5+nYR+7NR55S1WO7mFeEM/DW2TiDbz2lqg93s55FwCKAwsLCeYsXL/YqZADq6+vJyuppSPbEk4x1AqtXIknGOkHi1Ovss89eoaqlvZWLZ5fUW4EbVDXsjGLbPVW9C+dOVZSWlqrX3b8SpYtZXyRjncDqlUiSsU6QfPWKZ1IoBRa7CaEAOF9EQqr6WM+LGWOM8UrckoKqlnQ8F5F7cU4fWUIwxpg48iwpiMiDQBnO/VkrgO/h3GgEVb3Tq+3G2wPLdvRa5soFPd290Bhj4sezpKCqV/Sh7NVexWGMMSZ2dkWzMcaYCEsKxhhjIiwpGGOMibCkYIwxJsKSgjHGmAhLCsYYYyIsKRhjjImwpGCMMSbCkoIxxpgISwrGGGMiLCkYY4yJsKRgjDEmwpKCMcaYCEsKxhhjIiwpGGOMibCkYIwxJsKSgjHGmAhLCsYYYyIsKRhjjImwpGCMMSbCkoIxxpgIz5KCiNwjIvtEZG038z8tImtE5B0ReU1EjvMqFmOMMbHx8kjhXmBhD/O3AWep6hzgB8BdHsZijDEmBgGvVqy
qr4hIcQ/zX4t6+QYw0atYjDHGxEZU1buVO0nhKVU9tpdy3wJmquo13cxfBCwCKCwsnLd48eJBjvRw9fX1ZGVl9WvZAw2tvZbJzwz2a90DMZA6DWdWr8SRjHWCxKnX2WefvUJVS3sr59mRQqxE5GzgC8Dp3ZVR1btwTy+VlpZqWVmZpzGVl5fT3208sGxHr2XKFkzu17oHYiB1Gs6sXokjGesEyVevuCYFEZkL/B44T1Wr4xmLMcaYOHZJFZHJwCPAVaq6KV5xGGOM+YBnRwoi8iBQBhSISAXwPSAFQFXvBG4CRgN3iAhAKJbzXcYYY7zjZe+jK3qZfw3QZcOyMcaY+LArmo0xxkRYUjDGGBNhScEYY0yEJQVjjDERlhSMMcZEWFIwxhgTYUnBGGNMhCUFY4wxEZYUjDHGRFhSMMYYE2FJwRhjTIQlBWOMMRGWFIwxxkRYUjDGGBNhScEYY0yEJQVjjDERlhSMMcZEWFIwxhgTYUnBGGNMhCUFY4wxEZYUjDHGRHiWFETkHhHZJyJru5kvIvIbEdksImtE5ESvYjHGGBMbL48U7gUW9jD/PGCG+1gE/K+HsRhjjImBZ0lBVV8BDvRQ5BLgj+p4A8gVkfFexWOMMaZ3oqrerVykGHhKVY/tYt5TwE9V9VX39YvADar6VhdlF+EcTVBYWDhv8eLFnsUMUF9fT1ZWVr+WPdDQ2muZ/Mxgv9Y9EAOp03Bm9UocyVgnSJx6nX322StUtbS3coGhCGagVPUu4C6A0tJSLSsr83R75eXl9HcbDyzb0WuZsgWT+7XugRhInYYzq1fiSMY6QfLVK569j3YBk6JeT3SnGWOMiZN4JoUngM+6vZBOBupUdXcc4zHGmBHPs9NHIvIgUAYUiEgF8D0gBUBV7wSeAc4HNgONwOe9isUYY0xsPEsKqnpFL/MV+IpX2x+u2sPKsq3V/HNzFbvrmvnexbPJSk2Iph1jzAhge6Mhdv8b29m0tx6fgAKt7WFuvex4RCTeoRljjA1zMZR21TaxaW89/3bmVFbd9BG++eGjePztSh56a2e8QzPGGMCSwpB6Y0s1KX7hy2dPJyc9hS+VTef06QXc9Pg63t1zKN7hGWOMJYWh0tgSYnVFLSdMziMnPQUAv0/41WXHkx7084vn341zhMYYY0lhyCx/v4ZQWDl56ujDpo/JTuXK+ZNZsmEvOw80xik6Y4xxWFIYAmF1ehxNLchk3Ki0I+Z/5uQpiAh/euP9OERnjDEfsKQwBHYeaKS2qY35Jfldzp+Qm87C2eN48M0dNLaGhjg6Y4z5gCWFIbDDPS1UUpDZbZmrTyvmYHOIx1ZVDlVYxhhzBEsKQ2BnTRO5GSlkp6V0W6Z0Sh6zJ4zivte24+XItcYY0xO7eG0IVBxoZFJ+RuR1dyOpTh+bxeNvV/Lz5zfx/z569FCFZ4wxEXak4LGDzW3UNrUdlhS6M7coF79PWLWjZggiM8aYI1lS8FiF254wKS+917LpQT/HjMtm9c5a2trDXodmjDFHiCkpiMgcrwNJVjtrmvCLMCG396QAcMLkPBpa23ll036PIzPGmCPFeqRwh4i8KSJfFpEcTyNKMjsONDIuJ40Uf2xv9VGF2WQE/Tyyyu43ZIwZejHtqVT1DODTOHdKWyEiD4jIhz2NLAmEVdlV28Sk/NiOEsAZ+uK4ibm8sH4vdU1tHkZnjDFHirlNQVXfA74L3ACcBfxGRDaKyMe9Ci7R7TvYQmsozKS83huZo50wOZfWUJhn3rEb0RljhlasbQpzReRXwAbgQ8BFqnqM+/xXHsaX0DrGMoql51G0otx0po/N4tGVdgrJGDO0Yj1SuA1YCRynql9R1ZUAqlqJc/RgurCrtom0FB+jM4N9Wk5E+NgJRby5/QA7qm2QPGPM0Ik1KVwAPKCqTQAi4hORDABVvd+r4BLd/voWxman9euuapeeUIQIPGoNzsaYIRRrUlgCRLeWZrjTTA+q6lsoyErt17JFuemcMnU
0j6yqsGEvjDFDJtakkKaq9R0v3Od9O1E+wrS0tXOoOURBVt9OHUX7+IkTeb+6kZU7agcxMmOM6V6sSaFBRE7seCEi84Cm3hYSkYUi8q6IbBaRG7uYP1lE/iEiq0RkjYicH3vow1tVfStAv48UABYeO460FB9/W1kxWGEZY0yPYk0K3wD+KiJLReRV4C/AdT0tICJ+4HbgPGAWcIWIzOpU7LvAQ6p6AnA5cEdfgh/OqupbACjI7n9SyEoNcP6c8TzxdiUNLXafBWOM92K9eG05MBP4EnAtcIyqruhlsfnAZlXdqqqtwGLgks6rBka5z3OApLmZwP76FgT63POosyvnT6a+JcSTq5PmrTHGDGN9GTr7JKDYXeZEEUFV/9hD+SJgZ9TrCmBBpzI3A8+LyFeBTODcPsQzrFXVt5CbkRLz8BaddQyvraqMzU7ltpc2E+7U3nzlgskDDdMYYw4TU1IQkfuBacDbQLs7WYGekkIsrgDuVdVfiMgpwP0icqyqHjZEqIgsAhYBFBYWUl5ePsDN9qy+vr7f20hrcNoSDtS2MTZVSKvaOOB4ThvTzqNbw1Rt38DErA+6t5aXb415HQOp03Bm9UocyVgnSL56xXqkUArM0r71jdyFM1ZSh4nutGhfABYCqOrrIpIGFAD7ogup6l3AXQClpaVaVlbWhzD6rry8nP5u44FlO1BV9jWvZ15hHs0FEwYcz7Gj2nnq/Q0srRnFx4qLItPL+nCkMJA6DWdWr8SRjHWC5KtXrOc21gLj+rju5cAMESkRkSBOQ/ITncrsAM4BEJFjgDQg4ceMPtQcojUUHlDPo2jpQT9zinJZXVFLc1t77wsYY0w/xZoUCoD1IvKciDzR8ehpAVUN4fRQeg5nzKSHVHWdiNwiIhe7xb4JfFFEVgMPAlf38WhkWOroeTRmkJICwMlT82kNhVm+/cCgrdMYYzqL9fTRzf1Zuao+AzzTadpNUc/XA6f1Z93D2f6O7qgDuHCts4l5GZQUZPLalmpOnVaA39f3oTOMMaY3sXZJfRnYDqS4z5fjDJBnulBd30qKXxiVnjKo6z1zRgF1TW2sqbArnI0x3oh16OwvAg8Dv3MnFQGPeRVUott/qIXRman4+jEQXk+OKsxmbHYqS9+rsvGQjDGeiLVN4Ss4p3kOQuSGO2O9CirRVdW3DOhK5u6ICGfMGMOeg828t6++9wWMMaaPYm1TaFHV1o4hoEUkgHOdgumkPazUNLYyp8ibW1kfNymHJRv28uKGvcwYm9XrsNx2gZsxpi9iPVJ4WUT+E0h37838V+BJ78JKXAeb2wgr5A1weIvuBHw+zj56LDtrmnh3zyFPtmGMGbliTQo34lw/8A7wbzg9iuyOa12oaXSuaM7L8CYpAMybkkd+ZpAXNuwlbG0LxphBFGvvo7Cq3q2qn1TVT7jPbW/UhdqGNgDyMga351E0v084Z+ZYdtc1s67yoGfbMcaMPLH2PtomIls7P7wOLhHVNLYiQM4gd0ft7LhJuYzJTuWF9Xtp7zxSnjHG9FNfxj7qkAZ8Esgf/HASX01jG9lpAQL9HB01Vj4RFs4ex/1vvM8bW6s5bXqBp9szxowMsZ4+qo567FLVW4ELPI4tIdU0tnranhBt5rhsZozN4sWNe6m3m/AYYwZBrKePTox6lIrItfTtXgwjRm1jq2c9jzoTES6YM57WUJgX1u8Zkm0aY5JbrDv2X0Q9D+EMefGpQY8mwYXaw9Q1tZHrYSNzZ2NHpXHK1NG8tqWa+SWjKcpNH7JtG2OST0xJQVXP9jqQZLDnYLNzjcIQnT7q8KGZhby9s5anVley6MypvV7QZowx3Yn1zmvX9zRfVX85OOEktoqaJmDok0J60M9HZo/j0VW7WFNRx3GTcod0+8aY5BFrF5lS4Es4A+EVAdcCJwLZ7sMQnRSG7vRRh3lT8piQm8aza3fTGgr3voAxxnQh1qQwEThRVb+pqt8E5gGTVfX7qvp978JLLBU1jUNyjUJ
XfCJcNHcCB5tDvLxpX+8LGGNMF2JNCoVAa9TrVneaiVJR0zQk1yh0Z8roTI6flMvS96qodm/0Y4wxfRHr3uuPwJsicrOI3AwsA+7zLKoEVVHTOOTtCZ0tnD0On094+p3dcY3DGJOYYr147UfA54Ea9/F5Vf2xl4ElooqapiG7RqE7o9JTOGfmWDbuOcTG3TYukjGmb/pyniMDOKiqvwYqRKTEo5gSUqg9zO665iG9RqE7p04rYEx2Kk+9s5vmtvZ4h2OMSSCxXtH8PeAG4NvupBTgT14FlYj2HGymPaxxP30EziiqF82dwIGGVu5+xcYtNMbELtYjhY8BFwMNAKpaSQxdUUVkoYi8KyKbReTGbsp8SkTWi8g6EXkg1sCHm3hdo9Cd6WOzOHbCKG4v30xFTWO8wzHGJIhYk0Kre/8EBRCRzN4WEBE/cDtwHjALuEJEZnUqMwPn6OM0VZ0NfKMPsQ8rHUlhOJw+6nD+nPEIwo+e3hDvUIwxCSLWpPCQiPwOyBWRLwJLgLt7WWY+sFlVt6pqK7AYuKRTmS8Ct6tqDYCqJmwH+8paJynE4xqF7uRmBLnuQ9N5du0e1lZZ24Ixpnex9j76OfAw8DfgaOAmVb2tl8WKgJ1RryvcadGOAo4SkX+KyBsisjC2sIefytomCrJSSYnTNQrdueaMEibnZ7B4Y4vdjMcY06texz5yTwMtcQfFe8GD7c8AynCumn5FROaoam2nGBYBiwAKCwspLy8f5DAOV19f3+dtrN3azCi/kla10Zug+un1V7dy4aR27lit/PiBJZwxcfgcyQyG/nxWiSAZ65WMdYLkq1evSUFV20UkLCI5qlrXh3XvAiZFvZ7oTotWASxT1TZgm4hswkkSyzvFcBdwF0BpaamWlZX1IYy+Ky8vp6/buGVFOTMnZ9NcMMaboPqpEigqUSZvXcdfN4eZVFJMMHDk0cyVCyYPeWyDoT+fVSJIxnolY50g+eoV67mOeuAdEfmDiPym49HLMsuBGSJSIiJB4HLgiU5lHsM5SkBECnBOJyVcH0pVpbK2iQk5w/NeBiLCxSU+DjaHeHVzVbzDMcYMY7HeZOcR9xEzVQ2JyHXAc4AfuEdV14nILcBbqvqEO+8jIrIeaAf+n6pW92U7w0FNYxvNbWEmDOMb3EzN8TFrfCavvLefk4rzyE5LrtNIxpjB0WNSEJHJqrpDVfs1zpGqPgM802naTVHPFbjefSSsjp5HRXnpVNe39lI6fhbOHsetL27ipY37uOT4zm3+xhjT++mjxzqeiMjfPI4lYe3qSArD+EgBoCA7lfkl+SzffoB9h5rjHY4xZhjqLSlE39dxqpeBJLJd7oVrw/n0UYcPzSwkxe/jubV74h2KMWYY6i0paDfPTZTK2ibSUnxxueNaX2WlBjjrqDFs2HOI7VUN8Q7HGDPM9JYUjhORgyJyCJjrPj8oIodExMZldlXWNTEhNx0R6b3wMHDqtAKyUgO8uHFvvEMxxgwzPSYFVfWr6ihVzVbVgPu84/WooQpyuNtV2zzs2xOiBQM+zjxqDFv2N7DNjhaMMVGG15gMCWpXTVNCJQWA+cX5drRgjDmCJYUBam5rp6q+JSEamaN1HC1staMFY0wUSwoDtKfO6dqZaEkB7GjBGHMkSwoDVJkg1yh0JRjwcZZ7tLBsa8JdSG6M8YAlhQFKlAvXujO/JJ/s1AC/fvG9eIdijBkGLCkM0K7aJkSgMCc13qH0S4rfaVt4bUu1HS0YYywpDFRlbRNjslJJDfjjHUq/zS/JZ0x2KrcusaMFY0Y6SwoDVFnbnJCNzNFS/D6uPWsar2+tZvn2A/EOxxgTR5YUBqiippGJeYmdFACunD+Zgqwgv7G2BWNGNEsKAxAOK5W1zRQlQVJID/q55oypLH2vird31va+gDEmKVlSGID99S20toeZmJcR71AGxWdOnkJuRgq/fcmOFowZqSwpDEBFTSNAUpw+AmcE1X89rYQlG/axrrIvt+M2xiQ
LSwoDUOHeR2FSkiQFgM+dWkx2aoDfvrQ53qEYY+LAksIAVCTQzXVilZOewtWnFfPs2j1s2nso3uEYY4aYJYUBqKhpYnRmkIxgj7e6Tjj/eloJGUE/t//DjhaMGWksKQxAsnRH7SwvM8hVJ0/hydWVNoKqMSOMp0lBRBaKyLsisllEbuyh3L+IiIpIqZfxDLZdNU1J0/Oos2vOmEow4LOjBWNGGM+Sgoj4gduB84BZwBUiMquLctnA14FlXsXihXBYqahtSsojBYAx2alcOX8Kj67aZfdyNmYE8fJIYT6wWVW3qmorsBi4pItyPwD+G2j2MJZBV1XfQmsonBQXrnXn2rKppPjFRlA1ZgTxMikUATujXle40yJE5ERgkqo+7WEcnqhwh8xO1iMFgLHZaXz2lGIee3sXm/dZTyRjRoK4dZsRER/wS+DqGMouAhYBFBYWUl5e7mls9fX1vW7jjd0hAHa/t47yPRsi09MaWr0Mrd98oWbSqjZ2O7+8fGuX048NKEEf/OcD/+TLx6d5FV6/xfJZJaJkrFcy1gmSr15eJoVdwKSo1xPdaR2ygWOBchEBGAc8ISIXq+pb0StS1buAuwBKS0u1rKzMw7ChvLyc3raxvnwzrH6XSz9yJpmpH7yNDyzb4Wls/ZVWtZHmgpndzi9bMLnbee/qRm7/xxa+d9QJzJ6Q40V4/RbLZ5WIkrFeyVgnSL56eXn6aDkwQ0RKRCQIXA480TFTVetUtUBVi1W1GHgDOCIhDFcVNU3kZaQclhCS1aIzppGTnsJPntmIqsY7HGOMhzxLCqoaAq4DngM2AA+p6joRuUVELvZqu0OlIom7o3aWk5HC186Zwaubq3h50/54h2OM8ZCnP3NV9RngmU7TbuqmbJmXsQy2XTWNHFWYHe8whsxVJ0/hj69v58fPbOD06QUE/HbdozHJKPnPfXhAVamoaeJDM8fGO5RBE0tbyA0LZ/LlP6/k4RUVXD6/+zYIY0zisp97/VBV30pLKHnuoxCr844dx4mTc/nFC5toaAnFOxxjjAcsKfTDjgPOFb7JfI1CV0SE71wwi/2HWrh7adddWI0xic2SQj9sq3JurlNckBnnSIbevCl5XDBnPL97eSvEQd06AAATA0lEQVT7DibURejGmBhYUuiH7VUN+AQmjbDTRx3+Y+HRhMJhfvnCpniHYowZZJYU+mFbdQMT8zIIBkbm2zdldCafPaWYv7y1k7W77LadxiSTkblXG6DtVQ0j8tRRtK+dM4P8jCA3Pb6WcNguaDMmWVhS6CNVZXtVA1NHeFLISU/hhvNmsnJHLY+u2tX7AsaYhGBJoY/217fQ0NpO8eiR2Z4Q7RMnTuSEybn85NmNHGxui3c4xphBYBev9dH2EdzzqKsL3E6dWsAdOzbzpftXcMHcCVzZw8B6xpjhz44U+qjjLmQlIzApdKUoL52TSvJ5fWs1e6yLqjEJz5JCH22rbiDgE4pyR9aFaz35yDGFpAb8PLm60kZRNSbBWVLoo+1VDUzOz7AB4aJkpAb4yOxCtlU18OSa3fEOxxgzALZn66Nt1h21SycV51OUm86Pnl5v4yIZk8AsKfSBqvJ+dSPFoy0pdOYT4aLjJrD3YAu3vbQ53uEYY/rJkkIf7D3YQlNbOyUF1h21K5PzM/jkvIn84dWtbNlfH+9wjDH9YF1S+2Cb2/PITh91b0ZhNn6f8G/3r+Dzpxbj3n/7MNZt1Zjhy44U+mB7tZsU7PRRt7JSA5x7TCGb99WztvJgvMMxxvSRJYU+2F7VQNDvY4J1R+3RgpLRjM9J4+k1lbS0tcc7HGNMH1hS6IMNew4xfWwWft+Rp0TMB/w+4dLjizjUHGLJhr3xDscY0weWFPpgw+6DHDN+VLzDSAiT8jM4qTif17ZUU1nbFO9wjDExsqQQo6r6FvYfauGY8dnxDiVhfHT2ODKCfh5dtYt2G17bmITgaVIQkYUi8q6IbBaRG7uYf72IrBeRNSLyoohM8TKegdi
w22k0nWVHCjFLD/q56LgJ7Kpt4rUtVfEOxxgTA8+Sgoj4gduB84BZwBUiMqtTsVVAqarOBR4G/sereAaqIynY6aO+mVOUwzHjsnlh/V6q61viHY4xphdeHinMBzar6lZVbQUWA5dEF1DVf6hqo/vyDWCih/EMyIbdhxg3Ko28zGC8Q0koIsLFxxfh9wmPrNpF2AbMM2ZY8/LitSJgZ9TrCmBBD+W/ADzb1QwRWQQsAigsLKS8vHyQQuxafX39EdtY/l4jY9N9vW47raHVu8AGwBdqJq1qY1y2nQZcWiL85b0G3li9nonN2wZt3V19VskgGeuVjHWC5KvXsLiiWUQ+A5QCZ3U1X1XvAu4CKC0t1bKyMk/jKS8vJ3obLaF29jz/HJecVExZ2cwel+3qRjTDQVrVRpoLeo7dS3NHK2sbdvDM9oMsuuB45k7MHZT1dv6skkUy1isZ6wTJVy8vTx/tAiZFvZ7oTjuMiJwLfAe4WFWH5Unn9/bWEwqrtScMgIjwsROKyE5L4WsPrrKRVI0ZprxMCsuBGSJSIiJB4HLgiegCInIC8DuchLDPw1gGxBqZB0dGMMCnSiex40Aj3/rrasLWTdWYYcezpKCqIeA64DlgA/CQqq4TkVtE5GK32M+ALOCvIvK2iDzRzeriasPuQ6Sl+GzMo0FQUpDJf55/DM+u3cOvlmyKdzjGmE48bVNQ1WeAZzpNuynq+blebn+wbNh9kKPHjbLhLQbJF04v4b299dz20mamj83ikuOL4h2SMcZlVzT3IhxW1u8+yCy7knnQiAg/uPRYFpTk882HVvP8uj3xDskY47Kk0It39x6irqmNeVPy4x1KUgkGfNz9uVKOLcrhKw+sZMl6GzjPmOHAkkIvXttSDcAp00bHOZLkMyothT9+YT6zxo/iS39eweNvH9E5zRgzxCwp9OL1LdUUj86gyO6h4AknMSzghMl5fH3x2/x6yXuoXfVsTNwMi4vXhqtQe5hlW6u58LgJ8Q4lqXR1gd+Fc8bTFgrzqyWbWLJhLw98cQHZaSlxiM6Ykc2OFHqwrvIgh1pCdupoCAT8Pj4xbyIfnVXI2l11XHTbq6zdVRfvsIwZcSwp9CDSnjDVksJQEBHOOnos15wxlaa2dj5+x2vc//p2O51kzBCypNCD17dWc1RhFmOyU+MdyohSUpDJM187g1Omjea/Hl/HVx5YSV1TW7zDMmZEsKTQjdZQmOXbDnDqtIJ4hzIijc5K5f+uPokbz5vJc+v2cv6vl/LG1up4h2VM0rOk0I23d9bS1NbOyXbqKG58PuHas6bx8LWnkOIXrrj7DX767EZaQ+F4h2ZM0rKk0I0nV1eSGvBx6nRLCvF2wuQ8nv7aGVx+0iTufHkLH7vjn2zedyjeYRmTlKxLahda2pXH3t7FBXPGM8q6RcZFV91W5xTl4l/g45FVFSy8dSk3XTSLSdYIbcygsqTQhbf2hDjUHOKykyb1XtgMqVkTRjEpfwZ/W1nBTY+vY06Bn1nzmikclRbv0IxJCnb6qAsvV4QoKchkfomNdzQcZael8LlTivnBJbPZeKCdD/28nDtf3kJLqD3eoRmT8OxIoZMt++vZVBPmxvMmIWJDZQ9XIsJVJ08htWYrz+8fxU+f3cifl73PNadP5ZOlE8kI9v+rraocbApR1dDCgYZWGlvbaWlrJ6xKaoqfjBQ/haPSGJ+bRmrAP4i1Mib+LCl08pflO/ELfPxEG+M/EYzN8PH7z5VS/u4+bl3yHt97Yh2/fGET5x5TyFlHj2HelDzGjUo77F4Y4bBS3dBKZW0Tu+uaeHL1bg40tlLT0EptYxsHGltj6uEkQF5mkIl56UzKy2BGYRZfP2eG/ZgwCc2SQpTtVQ3c99p2Sgv9jM22c9SJpOzosZQdPZYV79dw/+vbeWnjXv62sgKAgE8oyErFJ9CuSk1DG63th+/0gwEf+RlB8jJSKBmTSV56CllpATKDAVIDPgJ+HyLQ1q60hsLUNbVR29j
KnoPNvF/dyJqKOngHHlm5i4/OLuTCuROYOzHHEoRJOJYUXOGwcsPf1hAM+Lh8pvU4SlTzpuQxb0oe7WFlTUUt9732PrWNrRxsDiEAAkcX+snJCJKbnkKO+8gI+ge0A69pbOXdPYc41NzGva9t5+6l25icn8GFc8dz4dwJHDM+2xKESQiWFFyLl+9k2bYD/PTjc8hr3BrvcEwMHli2g7SG1i67r3YYqs4CeRnByIWOp08fw/rddaypqOPOl7dwR/kWCrJSmTsxh7lFOXzjw0cNSUzQddfezq5cMHkIIjGJwpICsHpnLT95ZgOnTB3NZSdN4uWXLSmY/ksP+pk3JZ95U/KpbwmxrtJJEP/YuI+XNu7j7+v2cMGc8Zw2o4C5RTkE/NYJ0AwfIz4pPPvObv79obcpyErlfz4x1w7xzaDKSg2woGQ0C0pGc7CpjbVugvjFC5v4xQubCAZ8TMhJZ3xOGuNy0hg3Ko3CUWkUx7h+VafRvKKmiYqaRnYecP52vN5xoJFQu3OBX8AvZKelkJ0aYOyoNMbnOI+GlhCZqSN+V2Bcnn4TRGQh8GvAD/xeVX/aaX4q8EdgHlANXKaq272MqcPmffXc889tPPjmDo6flMvdny2lIMtGQzXeGZWewqnTCjh1WgH1LSG2VTWwdX89u+uaWbGj5rAeT6OCQvHaVxmdGSQ7LYXMVKfrazgM9S0hahpb2X+ohYqaJpraDr8+IzcjhYl56RxVmM34nHRS/M4PndZQmPqWEHVNIdbuqmP59gMA3LV0K8WjM5k1fhSzi0Yxe0IOsyeMsv+HEcqzpCAifuB24MNABbBcRJ5Q1fVRxb4A1KjqdBG5HPhv4DIv4qmub2HZtgOsqahj5Y4a3tx2gGDAx6cXTOa7F8wiLcX6m5uhk5UaYE5RDnOKcgAIq1Lb2Maeuib2HGyh9VA1ZASpqm9lW1UDDa3tCCACmakB8jKCpPh9nDg5l7zMIHkZQXIzUsjLCMb0XVZV6pra2F3XTEFWKut317G6opan39kdKTM6M8jE/AyKctPIzwySk55CmntdRscBtYjQ1h6mNRRmTUUdoXCY9rASaldCYaXdfYRVCYYaebhyJakBP6kpPtJT/GSmBsgMun9T/WQGA2SlBshIDZCV6icjGCAY8BHwCQG/+9cn+H0y4KP69rA6sbeHaQuFaWt3XreEwjS3tdPU1k5Tq/P3xQ17aQupU9Z9tLrLhBsb+dmapYzJTsUnQmrAR2rARzDgIzXgJyPVT1bwgzo5dQ6QmRogI+gnKzVAetCPT5x6+UXw+cDvE3ziPFrbw7S0tRMM+Dy/I6GXRwrzgc2quhVARBYDlwDRSeES4Gb3+cPAb0VE1IO7qry2pZqvPriKFL9w9Lhsrv/wUVy5YLL9GjLDgk+E/Mwg+ZlBZk2ACU31lJXN73GZWBqRuyMi5GYEyc0IHtbQXNfYxvrdB1lXWceW/fVU1DSxcfchat0uuOFu/jNT/ILg7NQC/g923AGfz92BQ0Or0rT7IC1tYVpC7TS2Oo/+Crjr7U5Pe5Gward1iYUAKQEfKX4fqaIEUlrw+wRVaAm10xIK09IWptmt52CN7PulsmncsHDmoKyrO+LVXa1E5BPAQlW9xn19FbBAVa+LKrPWLVPhvt7ilqnqtK5FwCL35dHAu54E/YECoKrXUoklGesEVq9Ekox1gsSp1xRVHdNboYRoXVLVu4C7hmp7IvKWqpYO1faGQjLWCaxeiSQZ6wTJVy8v+8LtAqKHGZ3oTuuyjIgEgBycBmdjjDFx4GVSWA7MEJESEQkClwNPdCrzBPA59/kngJe8aE8wxhgTG89OH6lqSESuA57D6ZJ6j6quE5FbgLdU9QngD8D9IrIZOICTOIaDITtVNYSSsU5g9UokyVgnSLJ6edbQbIwxJvHY9fXGGGMiLCkYY4yJsKQQRURyReRhEdkoIhtE5JR4xzRQInK0iLwd9TgoIt+Id1yDQUT+XUTWichaEXlQRBL+Jhg
i8nW3PusS+XMSkXtEZJ97LVLHtHwReUFE3nP/5sUzxv7opl6fdD+vsIgkfNdUSwqH+zXwd1WdCRwHbIhzPAOmqu+q6vGqejzOGFONwKNxDmvARKQI+BpQqqrH4nRmGC4dFfpFRI4FvogzGsBxwIUiMj2+UfXbvcDCTtNuBF5U1RnAi+7rRHMvR9ZrLfBx4JUhj8YDlhRcIpIDnInTIwpVbVXV2vhGNejOAbao6vvxDmSQBIB09xqXDKAyzvEM1DHAMlVtVNUQ8DLOzibhqOorOD0Ko10C3Oc+vw+4dEiDGgRd1UtVN6iq16MsDBlLCh8oAfYD/yciq0Tk9yKSGe+gBtnlwIPxDmIwqOou4OfADmA3UKeqz8c3qgFbC5whIqNFJAM4n8MvAE10haraMeLeHqAwnsGYrllS+EAAOBH4X1U9AWggMQ9vu+ReQHgx8Nd4xzIY3PPRl+Ak8wlApoh8Jr5RDYyqbsAZKfh54O/A20D/R4wbxtyLVK0//DBkSeEDFUCFqi5zXz+MkySSxXnASlXdG+9ABsm5wDZV3a+qbcAjwKlxjmnAVPUPqjpPVc8EaoBN8Y5pEO0VkfEA7t99cY7HdMGSgktV9wA7ReRod9I5HD7Md6K7giQ5deTaAZwsIhniDKx/DknQMUBExrp/J+O0JzwQ34gGVfSwNp8DHo9jLKYbdkVzFBE5Hvg9EAS2Ap9X1Zr4RjVwbtvIDmCqqtbFO57BIiLfx7kpUwhYBVyjqi3xjWpgRGQpMBpoA65X1RfjHFK/iMiDQBnOsNJ7ge8BjwEPAZOB94FPqWrnxuhhrZt6HQBuA8YAtcDbqvrReMU4UJYUjDHGRNjpI2OMMRGWFIwxxkRYUjDGGBNhScEYY0yEJQVjjDERlhTMiCEi9R6sc5yILBaRLSKyQkSeEZGjBnkbZSKS8BfmmcRgScGYfnIvmnsUKFfVaao6D/g2gz+mTxlJcLW2SQyWFMyIJiLFIvKSiKwRkRfdK4kRkWki8oaIvCMiP+zmKONsoE1V7+yYoKqrVXWpOH7m3hvhHRG5zF1vmYg8FbX934rI1e7z7SLyfRFZ6S4zU0SKgWuBf3fvh3GGZ2+GMVhSMOY24D5VnQv8GfiNO/3XwK9VdQ7OuFhdORZY0c28jwPH49wX4VzgZx3j/vSiSlVPBP4X+JaqbgfuBH7l3hdjaQzrMKbfLCmYke4UPhhf6H7g9KjpHSPK9mf8odOBB1W13R2E8GXgpBiWe8T9uwIo7sd2jRkQSwrG9N86nLvZ9UWIw//vOt9CtGPspnac4dyNGVKWFMxI9xof3Mbz00DH6Zk3gH9xn3d3m8+XgFQRWdQxQUTmuuf9lwKXiYhfRMbg3NXvTZyB4GaJSKqI5OKM7tqbQ0B2H+pkTL9ZUjAjSYaIVEQ9rge+CnxeRNYAVwFfd8t+A7jenT4dOGJ0WfdGMR8DznW7pK4DfoJzV7FHgTXAapzk8R+qukdVd+KMFLrW/bsqhrifBD5mDc1mKNgoqcZ0wb0dZpOqqohcDlyhqpfEOy5jvGbnLI3p2jzgt+61CLXAv8Y5HmOGhB0pGGOMibA2BWOMMRGWFIwxxkRYUjDGGBNhScEYY0yEJQVjjDER/x9LRmT5v1wdWwAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# 위에서 추출한 임베딩 데이터로 히스토그램을 보여준다\n", "sns.distplot(np.log(transformed.toarray().sum(axis=0)))\n", "plt.xlabel('Log Count')\n", "plt.ylabel('Frequency')\n", "plt.title('Distribution Plot of 500 Word Counts')\n", "plt.grid(); plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## **3. Analyzing the newsgroups data**\n", "Preprocessing and analysis proceed step by step (the text is re-filtered into a form that is easier to analyze)\n", "### **01 Data preprocessing (keep only the highly discriminative material)**\n", "1. Keep only purely alphabetic tokens; digits and symbols are dropped\n", "1. Apply **lemmatization** so that variants of the same word match\n", "1. Among the remaining words, remove those with little discriminative power, such as stopwords and names\n", "\n", "```python\n", "# If NLTK raises a missing-resource error at run time\n", "import nltk\n", "nltk.download('names')\n", "```" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# str.isalpha(), the alphabetic check used below\n", "'names'.isalpha()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['able', 'accept', 'access', 'according', 'act', 'action', 'actually', 'add', 'address', 'ago']\n", "CPU times: user 12.6 s, sys: 30.8 ms, total: 12.6 s\n", "Wall time: 12.6 s\n" ] } ], "source": [ "%%time\n", "from nltk.corpus import names\n", "from nltk.stem import WordNetLemmatizer\n", "\n", "# Accept only tokens made up entirely of letters\n", "def letters_only(astr):\n", "    for c in astr:\n", "        if not c.isalpha(): return False\n", "    return True\n", "\n", "# Lemmatize each remaining token, skipping anything found in the names corpus\n", "all_names = set(names.words())\n", "lemmatizer = WordNetLemmatizer()\n", "cleaned = [' '.join([lemmatizer.lemmatize(word.lower())\n", "                     for word in post.split()\n", "                     if letters_only(word) and word not in all_names])\n", "           for post in groups.data]\n", "\n", "# cv = CountVectorizer(stop_words=\"english\", max_features=500)\n", "transformed = cv.fit_transform(cleaned)\n", "# Note: scikit-learn 1.0+ offers get_feature_names_out(); get_feature_names() was removed in 1.2\n", "print(cv.get_feature_names()[:10])\n", "len(names.words())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **02 Clustering with K-means**\n", "1. **(transformed)** : uses the data preprocessed above and restricted to the 500 most frequent words\n", "1. The dataset is grouped into a number of clusters\n", "1. **Hard clustering** : each sample (here, a post) is assigned to **exactly one cluster** (strict)\n", "1. **Soft clustering** : each sample is assigned to **several clusters with varying probabilities** (flexible)\n", "1. 
**Outlier** : a value that is not assigned to any cluster is called an **outlier, or noise**\n", "\n", "> **KMeans(n_clusters, max_iter, n_init, tol)**\n", "\n", "1. **n_clusters** : number of clusters to form, e.g. 8 (default)\n", "1. **max_iter** : maximum number of iterations in a single run, e.g. 300 (default)\n", "1. **n_init** : number of times the algorithm is rerun with different initial centroids, e.g. 10 (default in the scikit-learn version used here)\n", "1. **tol** : convergence tolerance used as the stopping condition, e.g. 1e-4" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYwAAAELCAYAAADKjLEqAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3X+UXGWd5/H3xzZoTwRDoImkSQSZnLgoSLCGH4M6AQ1BUIiMR+GggrpmdMZRhyUzYWQVGWeNmxUdxaMTlAUGNuJoaLMDGrIqhx8DgQ4EGpQYfkSSBkk0hAC2Strv/lG3Q1Gp6r5Vt6tu39TndU6drnru89T95nZVf3Of+zzPVURgZmY2lpfkHYCZmRWDE4aZmaXihGFmZqk4YZiZWSpOGGZmlooThpmZpdKyhCFphqSfSvqZpAckfTIpnypptaQNyc9967Q/J6mzQdI5rYrTzMzSUavmYUg6EDgwIu6WtDewFlgAnAtsi4glkhYD+0bEP1S1nQr0AyUgkrZvjIinWhKsmZmNqWVnGBHxRETcnTx/Bvg50AucDlyZVLuSchKpNh9YHRHbkiSxGji5VbGamdnY2nINQ9LBwBxgDTAtIp5INv0KmFajSS+wqeL15qTMzMxy8tJW70DSK4DvA5+KiB2Sdm2LiJCUqU9M0kJgIcDkyZPf+NrXvjbL25mZdZS1a9f+OiJ60tRtacKQNIlysrgmIlYkxU9KOjAinkiuc2yp0XQQmFvx+iDgplr7iIhlwDKAUqkU/f394xS9mdmeT9Iv09Zt5SgpAd8Gfh4Rl1RsWgmMjHo6B/hBjeargJMk7ZuMojopKTMzs5y08hrG8cD7gRMlrUsepwBLgHmSNgBvS14jqSTpWwARsQ34J+Cu5HFxUmZmZjlp2bDaPLhLysysMZLWRkQpTV3P9DYzs1ScMMzMLJWWD6s1s/r67hlk6ar1PL59iOlTulk0fzYL5njKkU1MThhmOem7Z5ALVgww9PwwAIPbh7hgxQCAk4ZNSO6SMsvJ0lXrdyWLEUPPD7N01fqcIjIbnROGWU4e3z7UULlZ3pwwzHIyfUp3Q+VmeXPCMMvJovmz6Z7U9aKy7kldLJo/O6eIzEbni95mORm5sO1RUlYUThhmOVowp9cJwgrDXVJmZpaKE4aZmaXihGFmZql0/DWMoi/NcGHfAMvXbGI4gi6Js46ZwecXHN62/ed9/PLef94OXnz9bmUbl5zatv0X/fgXPf526+gzjJGlGQa3DxG8sDRD3z2DeYeWyoV9A1x9x2MMJ0vUD0dw9R2PcWHfQFv2n/fxy3v/eauVLEYrH29FP/5Fjz8PHZ0wir40w/I1mxoqH295H7+899/pin78ix5/Hjo6YRR9aYbhOje/qlc+3vI+fnnvv9MV/fgXPf48tPKe3pdL2iLp/oqyaytu17pR0ro6bTdKGkjqtewWekVfmqFLaqh8vOV9/PLef6cr+vEvevx5aOUZxhXAyZUFEfHeiDgyIo4Evg+sGKX9CUndVLcObEbRl2Y465gZDZWPt7yPX97773RFP/5Fjz8PLRslFRE3Szq41jZJAt4DnNiq/adR9
KUZRkZD5TVKKu/jl/f+87Zxyam5jpIq+vEvevx5ULSwvztJGP8REa+vKn8LcEm9swdJjwJPAQH8a0QsS7O/UqkU/f0t68EyM9vjSFqbticnr3kYZwHLR9n+pogYlHQAsFrSgxFxc62KkhYCCwFmzpw5/pGamRmQwygpSS8FzgCurVcnIgaTn1uA64CjR6m7LCJKEVHq6ekZ73DNzCyRxxnG24AHI2JzrY2SJgMviYhnkucnARe3M0Czojj7stu57eFtu14ff+hUrvnIcTlGZHuyVg6rXQ7cDsyWtFnSh5NNZ1LVHSVpuqQbkpfTgFsl3QvcCVwfET9qVZxmRVWdLABue3gbZ192e04R2Z6ulaOkzqpTfm6NsseBU5LnjwBvaFVcZnuK6mQxVrlZVh0909vMzNJzwjAzs1ScMMwK6vhDpzZUbpaVE4ZZQV3zkeN2Sw4eJWWt1PE3UDIrMicHayefYZiZWSpOGGZmlooThpmZpeJrGGbWtHmX3MSGLc/tej3rgMmsPm9ufgFZS/kMw8yaUp0sADZseY55l9yUT0DWck4YZtaU6mQxVrkVnxOGmZml4oRhZmapOGGYWVNmHTC5oXIrPicMM2vK6vPm7pYcPEpqz+ZhtWbWNCeHzuIzDDMzS6WVt2i9XNIWSfdXlF0kaVDSuuRxSp22J0taL+khSYtbFaOZmaXXyi6pK4BLgauqyr8cEf+rXiNJXcDXgXnAZuAuSSsj4metCtSad2HfAMvXbGI4gi6Js46ZwecXHJ53WGbWAi07w4iIm4Fmbi58NPBQRDwSEX8AvgOcPq7B2bi4sG+Aq+94jOEIAIYjuPqOx7iwbyDnyMysFfK4hvFxSfclXVb71tjeC2yqeL05KbMJZvmaTQ2Vm1mxtTthfAM4FDgSeAL4UtY3lLRQUr+k/q1bt2Z9O2vAyJlF2nIzK7a2JoyIeDIihiPij8BllLufqg0CMypeH5SU1XvPZRFRiohST0/P+AZso+qSGio3s2Jra8KQdGDFy3cB99eodhcwS9IhkvYCzgRWtiM+a8xZx8xoqNzMiq1lo6QkLQfmAvtL2gx8Fpgr6UgggI3AXyV1pwPfiohTImKnpI8Dq4Au4PKIeKBVcVrzRkZDeZSUWWdQ7EH9zaVSKfr7+/MOw8ysMCStjYhSmrqe6W1mZqk4YZiZWSodv/jg2Zfdzm0PvzC/8PhDp3LNR47LMaLG9N0zyNJV63l8+xDTp3SzaP5sFsxp37SVvPefVd7xF33/RW+f9ft/xGd/xI7fD+96vc/Lurjvcyenbp91pYR2f346+hpG9YdlRFGSRt89g1ywYoCh51/4wHZP6uILZxzelj86ee8/q7zjL/r+i94+6/e/OlmMSJs0RlZKqPa+Y2emShrj9fnxNYyUan1YRiufaJauWv+iDwvA0PPDLF21viP2n1Xe8Rd9/0Vvn/X7XytZjFZeLetKCXl8fjo6YRTd49uHGirf0/afVd7xF33/RW+ft6wrJeTx73fCKLDpU7obKt/T9p9V3vEXff9Fb5+3rCsl5PHv7+iEcfyhUxsqn2gWzZ9N96SuF5V1T+pi0fzZHbH/rPKOv+j7L3r7rN//fV7W1VB5tawrJeTx+em66KKLWvbm7bZs2bKLFi5cmLr+X75xBnc9+hs2PfXCKVxRLngDvPbAfTho324GBp/m2d/tpHdKN59552Ftu+Cc9/6zyjv+ou+/6O2zfv8/dsKfcvktj/D74Re6kBoZJXXia6fx62d/zwODOwjKZxZnp7zgDeP3+fnc5z73xEUXXbQsTd2OHiVlZtbpPErKzMzGnROGmZml4oRhZmapdPzSIJ0u69ICx/zzap585g+7Xk/bey/WfHpeK0I1s5z5DKODjSwtMLh9iAAGtw9xwYoB+u6pe4PDF6lOFgBPPvMHjvnn1S2I1szy5oTRwbIuLVCdLMYqN7Nic8LoYEVfWsHM2qtlCUPS5ZK2SLq/omyppAcl3SfpOklT6rTdKGlA0
jpJnljRIkVfWsHM2quVZxhXANVTHlcDr4+II4BfABeM0v6EiDgy7YQSa1zWpQWm7b1XQ+VmVmwtSxgRcTOwrarsxojYmby8AzioVfu3sS2Y08sXzjic3indCOid0t3QWvprPj1vt+TgUVJme648h9V+CLi2zrYAbpQUwL9GRKp1TqxxC+b0Zlq7yMnBrHPkkjAkfRrYCVxTp8qbImJQ0gHAakkPJmcstd5rIbAQYObMmS2J18zMchglJelc4B3A2VFn5cOIGEx+bgGuA46u934RsSwiShFR6unpaUHEZmYGbT7DkHQy8PfAX0TEb+vUmQy8JCKeSZ6fBFzcxjCtAe2+Cb2Z5aeVw2qXA7cDsyVtlvRh4FJgb8rdTOskfTOpO13SDUnTacCtku4F7gSuj4gftSpOa17WmeJmViy+H4Y17fglP2GwxiS/3ind3Lb4xBwiMrNG+X4Y1haeKW7WWZwwrGmeKW7WWZwwrGl53ITezPLj+2FY00ZGQ3mUlFlncMKwTLLOFDez4nCXlJmZpeKEYWZmqbhLquA809rM2sUJo8BGZlqP3GZ1ZKY14KRhZuPOXVIFlvWe3GZmjRgzYUjqkvR37QjGGuOZ1mbWTmMmjIgYBs5qQyzWIM+0NrN2StsldZukSyW9WdJRI4+WRmZj8kxrM2untBe9j0x+Vt6XIgAvSZojz7Q2s3ZKlTAi4oRWB2LN8UxrM2uXVF1SkqZJ+rakHyavD0tuiGRmZh0i7TWMK4BVwPTk9S+AT7UiIDMzm5jSJoz9I+K7wB8BImInMDx6E5B0uaQtku6vKJsqabWkDcnPfeu0PSeps0HSOSnjNDOzFkl70fs5SftRvtCNpGOBp1O0u4LyfbyvqihbDPw4IpZIWpy8/ofKRpKmAp8FSsk+10paGRFPpYzXzGzCm3fJTWzY8tyu17MOmMzq8+bmF9AY0p5hnAesBA6VdBvlBPCJsRpFxM3Atqri04Erk+dXAgtqNJ0PrI6IbUmSWA2cnDJWM7MJrzpZAGzY8hzzLrkpn4BSSHuG8QDwF8BsQMB6ml9WZFpEPJE8/xUwrUadXmBTxevNSZmZ2R6hOlmMVT4RpP2jf3tE7IyIByLi/oh4Hrg9684jIki6uZolaaGkfkn9W7duzRqSmZnVMWrCkPQqSW8EuiXNqZjlPRf4kyb3+aSkA5P3PxDYUqPOIDCj4vVBSdluImJZRJQiotTT09NkSGZmNpaxuqTmA+dS/oP9JcrdUQDPAP/Y5D5XAucAS5KfP6hRZxXwPypGUJ0EXNDk/szMJpxZB0yu2f0064DJOUSTzqhnGBFxZTLL+9yIODEiTkgep0XEirHeXNJyyl1XsyVtTib7LQHmSdoAvC15jaSSpG8l+90G/BNwV/K4OCkzM9sjrD5v7m7JYaKPklL5MsIYlaRPAv+b8pnFZcBRwOKIuLG14TWmVCpFf39/3mGYmRWGpLURUUpTN+1F7w9FxA7KXUP7Ae8nOTMwM7POkDZhjFy7OAW4KiIeqCgzM7MOkHYexlpJNwKHABdI2ptkmZCiO3jx9buVbVxyqtu3qf0hi69/0bhqAY8WKH63d/sit29U2jOMD1NewuPPIuK3wF7AB1sWVZvUOtijlbv9+LavThZQnpRzSEHid3u3L3L7ZqQ9w3hT8vMIyT1RNj7qDbfINJPTzFombcJYVPH85cDRwFp8xz0zs46R9o5776x8LWkG8JWWRGRmZhNSswsIbgb+y3gGYp2nXuemOz3NJqa0t2j9mqSvJo9LgVuAu1sbWuvVG02QdpSB22dr/+iSU3dLDo2Mkso7frd3+yK3b0bamd6Vd7zbCWyMiNtaFlWTPNPbzKwxjcz0TnsN48qxa5mZ2Z5s1IQhaYBRRjlGxBHjHpGZmU1IY51hnEH5jnibqspnUL5bnpmZdYixEsaXgQsi4peVhZL2Sba9s2arDnL2Zbdz28MvrLx+/KFTueYjx+UYkZlZa4w1SmpaRAxUFyZlB7cko
gKpThYAtz28jbMvy3z3WjOzCWeshDFllG3d4xlIEVUni7HKzcyKbKyE0S/pI9WFkv4r5aVBzMysQ4x1DeNTwHWSzuaFBFGivFrtu5rZoaTZwLUVRa8BPhMRX6moM5fyvb4fTYpWRMTFzezPzMzGx6gJIyKeBP5c0gnA65Pi6yPiJ83uMCLWA0cCSOoCBoHralS9JSLe0ex+2uH4Q6fW7H46/tCpOURjZtZaqZYGiYifRsTXkkfTyaKGtwIPV4/CKoprPnLcbsnBo6TMbE+VdnnzVjkTWF5n23GS7gUeB85Pbgs74Tg5mFmnaHa12swk7QWcBvx7jc13A6+OiDcAXwP6RnmfhZL6JfVv3bq1NcGamVl+CQN4O3B3cp3kRSJiR0Q8mzy/AZgkaf9abxIRyyKiFBGlnp6e1kZsZtbB8uySOos63VGSXgU8GREh6WjKie037Qwurb57Blm6aj2Pbx9i+pRuFs2fzYI5vXmHZWY27nJJGJImA/OAv6oo+yhARHwTeDfwMUk7gSHgzEizDnub9d0zyAUrBhh6fhiAwe1DXLCiPDHeScPM9jS5JIyIeA7Yr6rsmxXPLwUubXdcjVq6av2uZDFi6Plhlq5a74RhZnucPK9hFN7j24caKjczKzInjAymT6m9nFa9cjOzInPCyGDR/Nl0T+p6UVn3pC4WzZ+dU0RmZq2T98S9Qhu5TuFRUmbWCZwwMlowp9cJwsw6grukzMwsFScMMzNLxV1SlsmFfQMsX7OJ4Qi6JM46ZgafX3B43mGZWQs4YVjTLuwb4Oo7Htv1ejhi12snDbM9j7ukrGnL12xqqNzMis0Jw5o2XGd5r3rlZlZsThjWtC6poXIzKzYnDGvaWcfMaKjczIrNF72taSMXtj1KyqwzaALeZqJppVIp+vv78w7DzKwwJK2NiFKauu6SMjOzVJwwzMwsldwShqSNkgYkrZO0Wz+Syr4q6SFJ90k6Ko84zcysLO+L3idExK/rbHs7MCt5HAN8I/k5rg5efP1uZRuXnOr2bu/2br/Ht2/URO6SOh24KsruAKZIOnA8d1DrYI9W7vZu7/Zuv6e0b0aeCSOAGyWtlbSwxvZeoHKNic1JmZmZ5SDPLqk3RcSgpAOA1ZIejIibG32TJNksBJg5c+Z4x2hmZonczjAiYjD5uQW4Dji6qsogUDll+KCkrPp9lkVEKSJKPT09rQrXzKzj5ZIwJE2WtPfIc+Ak4P6qaiuBDySjpY4Fno6IJ9ocqpmZJfI6w5gG3CrpXuBO4PqI+JGkj0r6aFLnBuAR4CHgMuCvxzuIeqMJ0o4ycHu3d3u3L2r7ZnhpEDOzDualQczMbNw5YZiZWSp5z/TOXdaZkn33DLJ01Xoe3z7E9CndLJo/mwVz0k8Xydo+q7z3n1XW+C/sG8i0PPshi6+nslNXwKNtnKmbNf6s5l1yExu2PLfr9awDJrP6vLmp2x/x2R+x4/fDu17v87Iu7vvcyanb5z1TuujtG9XRZxhZZ0r23TPIBSsGGNw+RACD24e4YMUAfffsNvq3Je2zynv/WWWN/8K+Aa6+47Fdt5QdjuDqOx7jwr6BVO2rkwWUZ6Me0qaZulnjz6o6WQBs2PIc8y65KVX76mQBsOP3wxzx2R+lap/3TOmit29GRyeMrJauWs/Q8y/+wA89P8zSVevb0j6rvPefVdb4l6/Z1FB5tXrDRdo1jCRr/FlVJ4uxyqtVJ4uxyi1/ThgZPL59qKHy8W6fVd77zypr/MN1RgjWK59oih6/FY8TRgbTp3Q3VD7e7bPKe/9ZZY2/S2qofKIpevxWPE4YGSyaP5vuSV0vKuue1MWi+bPb0j6rvPefVdb4zzpmRkPl1er9WW7Xn+us8Wc164DJDZVX2+dlXQ2VW/46OmFknSm5YE4vXzjjcHqndCOgd0o3Xzjj8NSjdLK2zyrv/WeVNf7PLzic9x07c9f/yLsk3nfszNSjjB5dcupuyaGRUVJZP39Z489q9Xlzd0sOj
YySuu9zJ++WHBoZJZX3TOmit2+GZ3qbmXUwz/Q2M7Nx54RhZmapOGGYmVkqHb80iJlZs7IujVK0pXl8hmFm1oSsS6MUcWkeJwwzsyZkXRqliEvzOGGYmeWgiEvztD1hSJoh6aeSfibpAUmfrFFnrqSnJa1LHp9pd5xmZq1UxKV58jjD2An8t4g4DDgW+BtJh9Wod0tEHJk8Lm5viGZmo8u6NEoRl+Zpe8KIiCci4u7k+TPAz4GJOyzAzKyGrEujFHFpnlyXBpF0MHAz8PqI2FFRPhf4PrAZeBw4PyIeGOv9vDSImVljGlkaJLd5GJJeQTkpfKoyWSTuBl4dEc9KOgXoA2bVeZ+FwEKAmTNntjBiM7POlssoKUmTKCeLayJiRfX2iNgREc8mz28AJknav9Z7RcSyiChFRKmnp6elcZuZdbK2n2FIEvBt4OcRcUmdOq8CnoyIkHQ05cT2mzaGaWbWckWb6Z1Hl9TxwPuBAUnrkrJ/BGYCRMQ3gXcDH5O0ExgCzow9aR12M+t4IzO9Rybvjcz0BiZs0mh7woiIWxnjpmQRcSlwaXsiMjNrv9Fmek/UhOGZ3mZmOfBMbzMzS8Uzvc3MLJUizvT2/TDMzHIwcp3Co6TMzGxMC+b0TugEUc1dUmZmlooThpmZpeIuKbMCK9pMYSs2JwyzgiriTGErNndJmRVUEe8JbcXmhGFWUEWcKWzF5oRhVlBFnClsxeaEYVZQRZwpbMXmi95mBVXEmcJWbE4YZgVWtJnCVmzukjIzs1ScMMzMLJVcEoakkyWtl/SQpMU1tr9M0rXJ9jWSDm5/lGZmVqnt1zAkdQFfB+YBm4G7JK2MiJ9VVPsw8FRE/KmkM4EvAu9td6xmE13RlwY5ePH1u5VtXHJqDpFYGnmcYRwNPBQRj0TEH4DvAKdX1TkduDJ5/j3grZJGvQ+4WacZWRpkcPsQwQtLg/TdM5h3aKnUShajlVv+8kgYvcCmitebk7KadSJiJ/A0sF9bojMrCC8NYu1W+IvekhZK6pfUv3Xr1rzDMWsbLw1i7ZZHwhgEZlS8Pigpq1lH0kuBVwK/qfVmEbEsIkoRUerp6WlBuGYTk5cGsXbLI2HcBcySdIikvYAzgZVVdVYC5yTP3w38JCKijTGaTXheGsTare0JI7km8XFgFfBz4LsR8YCkiyWdllT7NrCfpIeA84Ddht6adboFc3r5whmH0zulGwG9U7r5whmHF2aUVL3RUB4lNXFpT/qPe6lUiv7+/rzDMDMrDElrI6KUpm7hL3qbmVl7OGGYmVkqThhmZpaKE4aZmaXihGFmZqnsUaOkJG0Fftlk8/2BX49jOOPN8WXj+LJxfNlM5PheHRGpZj3vUQkjC0n9aYeW5cHxZeP4snF82Uz0+NJyl5SZmaXihGFmZqk4YbxgWd4BjMHxZeP4snF82Uz0+FLxNQwzM0vFZxhmZpZKxyUMSSdLWi/pIUm7rYIr6WWSrk22r5F0cBtjmyHpp5J+JukBSZ+sUWeupKclrUsen2lXfMn+N0oaSPa920qPKvtqcvzuk3RUG2ObXXFc1knaIelTVXXaevwkXS5pi6T7K8qmSlotaUPyc986bc9J6myQdE6tOi2Kb6mkB5Pf33WSptRpO+pnoYXxXSRpsOJ3eEqdtqN+11sY37UVsW2UtK5O25Yfv3EXER3zALqAh4HXAHsB9wKHVdX5a+CbyfMzgWvbGN+BwFHJ872BX9SIby7wHzkew43A/qNsPwX4ISDgWGBNjr/rX1EeY57b8QPeAhwF3F9R9j+BxcnzxcAXa7SbCjyS/Nw3eb5vm+I7CXhp8vyLteJL81loYXwXAeen+P2P+l1vVXxV278EfCav4zfej047wzgaeCgiHomIPwDfAU6vqnM6cGXy/HvAWyWpHcFFxBMRcXfy/BnK9wspxs0NXnA6cFWU3QFMkXRgDnG8FXg4IpqdyDkuIuJmYFtVceVn7EpgQY2m84HVEbEtIp4CVgMntyO+i
LgxyvetAbiD8l0xc1Hn+KWR5rue2WjxJX833gMsH+/95qXTEkYvsKni9WZ2/4O8q07ypXka2K8t0VVIusLmAGtqbD5O0r2SfijpdW0NDAK4UdJaSQtrbE9zjNvhTOp/UfM8fgDTIuKJ5PmvgGk16kyU4/ghymeMtYz1WWiljyddZpfX6dKbCMfvzcCTEbGhzvY8j19TOi1hFIKkVwDfBz4VETuqNt9NuZvlDcDXgL42h/emiDgKeDvwN5Le0ub9jym59e9pwL/X2Jz38XuRKPdNTMihipI+DewErqlTJa/PwjeAQ4EjgScod/tMRGcx+tnFhP8uVeu0hDEIzKh4fVBSVrOOpJcCrwR+05boyvucRDlZXBMRK6q3R8SOiHg2eX4DMEnS/u2KLyIGk59bgOson/pXSnOMW+3twN0R8WT1hryPX+LJkW665OeWGnVyPY6SzgXeAZydJLXdpPgstEREPBkRwxHxR+CyOvvN+/i9FDgDuLZenbyOXxadljDuAmZJOiT5X+iZwMqqOiuBkREp7wZ+Uu8LM96SPs9vAz+PiEvq1HnVyDUVSUdT/h22JaFJmixp75HnlC+O3l9VbSXwgWS01LHA0xXdL+1S9392eR6/CpWfsXOAH9Soswo4SdK+SZfLSUlZy0k6Gfh74LSI+G2dOmk+C62Kr/Ka2Lvq7DfNd72V3gY8GBGba23M8/hlkvdV93Y/KI/i+QXlERSfTsoupvzlAHg55a6Mh4A7gde0MbY3Ue6euA9YlzxOAT4KfDSp83HgAcqjPu4A/ryN8b0m2e+9SQwjx68yPgFfT47vAFBq8+93MuUE8MqKstyOH+XE9QTwPOV+9A9Tvib2Y2AD8P+AqUndEvCtirYfSj6HDwEfbGN8D1Hu/x/5DI6MGpwO3DDaZ6FN8f1b8tm6j3ISOLA6vuT1bt/1dsSXlF8x8pmrqNv24zfeD8/0NjOzVDqtS8rMzJrkhGFmZqk4YZiZWSpOGGZmlooThpmZpeKEYR1LUkj6UsXr8yVdlGNIZhOaE4Z1st8DZ+Qw07umZHaw2YTlhGGdbCflW2f+XfUGST2Svi/pruRxfFI+IGlKMpP9N5I+kJRfJWmepNdJujO5x8F9kmYl2/97cm+GWyUtl3R+Un6TpK8k90P4pKSDJf0kaftjSTOTeldIendFfM8mP+dKulnS9cn7f1OSv9fWEv5gWaf7OnC2pFdWlf8L8OWI+DPgL4FvJeW3AccDr6N8j4o3J+XHAf9JeVb5v0TEkZRnbm+WNPIeb6C8zlWpal97RUQpIr5EeUHEKyPiCMqL/n01xb/haOBvgcMoL8p3Rpp/uFmjfApsHS0idki6CvgEMFSx6W3AYRW3QtknWUX4Fso3zfkl5VVTF0rqBZ6KiOck3Q58WtJBwIqI2JCcnfwgIn4H/E7S/60Ko3KBuuN44Q/+v1G+2dJY7oyIRwAkLae8xMz30vz7zRrhMwwz+ArlNYomV5S9BDg2Io5MHr1RXuX2ZspnFW8GbgK2Ul4g462EAAABTklEQVSk8haAiPg/lJdWHwJukHRiiv0/l6LOziQmki6nvSq2Va/v4/V+rCWcMKzjRcQ24LuUk8aIGyl38wAg6cik7iZgf2BW8r/6W4HzKScSJL0GeCQivkp5FdojKHdjvVPSy5OzlHeMEs5/Ul5ZFeBskkRE+Xaeb0yenwZMqmhzdLIq60uA9yYxmY07Jwyzsi9RTgQjPgGUkovPP6N8bWLEGsqroEL5D3ovL/yRfg9wv6R1wOsp3672Lsqrqt5H+e51A5Tv5FjL3wIflHQf8H7gk0n5ZcBfSLqXcrdV5VnJXcCllG/p+yjleyuYjTuvVmvWBpJeERHPSvoTymcjCyO5f3vG950LnB8Ro521mI0LX/Q2a49lkg6jfL+VK8cjWZi1m88wzMwsFV/DMDOzVJwwzMwsFScMMzNLxQnDzMxSccIwM7NUnDDMzCyV/w9MmpId/zviNQAAAABJRU5ErkJgg
g==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 448 ms, sys: 401 ms, total: 848 ms\n", "Wall time: 1min 28s\n" ] } ], "source": [ "%%time\n", "# K Mean를 활용한 묶음처리\n", "from sklearn.cluster import KMeans\n", "km = KMeans(n_clusters=20, n_jobs=-1)\n", "km.fit(transformed)\n", "\n", "labels = groups.target\n", "plt.scatter(labels, km.labels_)\n", "plt.xlabel('Newsgroup'); plt.ylabel('Cluster')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **03 Topic 모델링**\n", "1. 문장 내 단어 Token중 **핵심(주제)가 되는 Token을** 선별한다\n", "1. Topic 마다 **다른 가중치를 할당하여 additive model을** 정의한다\n", "1. **비음수 행렬 인수분해** : Non-Negative Matrix Factorization" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0: wa thought later took left order seen taken\n", "1: db bit data place stuff add time line\n", "2: server using display screen support code mouse application\n", "3: file section information write source change entry number\n", "4: disk drive hard controller support card board head\n", "5: entry rule program source number info email build\n", "6: new york sale change service result study early\n", "7: image software user package using display include support\n", "8: window manager application using offer user information course\n", "9: gun united control house american second national issue\n", "10: hockey league team game division player list san\n", "11: turkish government sent war study came american world\n", "12: program change technology display information version application rate\n", "13: space nasa technology service national international small communication\n", "14: government political federal sure free private local country\n", "15: output line open write read return build section\n", "16: people country doing tell live killed lot saying\n", "17: widget application 
value set type return function list\n", "18: child case rate le report area research group\n", "19: jew jewish world war history help research arab\n", "20: armenian russian muslim turkish world city road today\n", "21: president said group tax press working package job\n", "22: ground box usually power code current house white\n", "23: russian president american support food money important private\n", "24: ibm color week memory hardware monitor software standard\n", "25: anonymous posting service server user group message post\n", "26: la win san went list year radio near\n", "27: work job young school lot private create business\n", "28: encryption technology access device policy security government data\n", "29: tape driver work memory using cause note following\n", "30: war military world attack way united russian force\n", "31: god bible shall man come life hell love\n", "32: atheist religious religion belief god sort feel idea\n", "33: data available information user research set model based\n", "34: center research medical institute national study test north\n", "35: think lot try trying talk kind agree certainly\n", "36: water city division list public similar north high\n", "37: section military shall weapon person division application mean\n", "38: good cover great pretty probably bad issue life\n", "39: drive head single mode set using model type\n", "40: israeli arab attack policy true apr fact stop\n", "41: use note using usually similar available standard work\n", "42: know tell way come sure understand let saw\n", "43: car speed driver change high buy different design\n", "44: internet email address information anonymous user network mail\n", "45: like look sound long little guy pretty having\n", "46: going come way mean kind sure working got\n", "47: state united public national political federal member local\n", "48: dod bike member computer list started live email\n", "49: greek killed act word western muslim turkish talk\n", "50: computer 
information public internet list issue network communication\n", "51: law act federal specific issue clear order moral\n", "52: book read reference list copy second study offer\n", "53: argument form true evidence event truth particular known\n", "54: make sense difference little sure making end tell\n", "55: scsi hard pc drive device bus different data\n", "56: time long having able lot order light response\n", "57: gun rate crime city death study control difference\n", "58: right second free shall security mean left american\n", "59: went came said told started saw took woman\n", "60: power period second san special le play goal\n", "61: used using product way function version note single\n", "62: problem work having using help apple running error\n", "63: available version widget server includes sun set support\n", "64: question answer ask asked science reason claim post\n", "65: san information police said group league political including\n", "66: number serial large men report following million le\n", "67: year ago old best sale hit long project\n", "68: want help let life reason trying copy tell\n", "69: point way different line algorithm exactly idea view\n", "70: run running home version start hit win speed\n", "71: got shot play took goal went hit lead\n", "72: thing saw sure got trying kind seen asked\n", "73: graphic send mail message package server various computer\n", "74: university science department general computer thanks engineering texas\n", "75: just maybe start thought big probably look getting\n", "76: key message public security algorithm standard method attack\n", "77: doe mean anybody actually different ask reading difference\n", "78: game win sound play left second lead great\n", "79: ha able called taken given past exactly looking\n", "80: believe belief christian truth evidence claim mean different\n", "81: drug study information war group reason usa evidence\n", "82: need help phone able needed kind thanks bike\n", "83: did death let 
money fact man wanted body\n", "84: chip clipper serial algorithm phone communication encryption key\n", "85: card driver video support mode mouse board bus\n", "86: church christian member group true bible different view\n", "87: ftp available anonymous general nasa package source version\n", "88: better player best play probably hit maybe big\n", "89: human life person moral kill claim reason world\n", "90: bit using let change mode attack size quite\n", "91: say mean word act clear said read simply\n", "92: health medical public national care study service user\n", "93: article post usa read world discussion opinion gmt\n", "94: team player win play city look bad great\n", "95: day come word christian said tell little way\n", "96: really lot sure look fact idea actually feel\n", "97: unit disk size serial total national got return\n", "98: image color version free available display current better\n", "99: woman men muslim religion way man great world\n", "CPU times: user 52.3 s, sys: 32.2 s, total: 1min 24s\n", "Wall time: 38.7 s\n" ] } ], "source": [ "%%time\n", "from sklearn.decomposition import NMF\n", "nmf = NMF(n_components=100, random_state=43).fit(transformed)\n", "\n", "# Look up the vocabulary once instead of once per printed word\n", "feature_names = cv.get_feature_names()\n", "\n", "# Print the 8 highest-weighted words of each topic\n", "for topic_idx, topic in enumerate(nmf.components_):\n", "    label = '{}: '.format(topic_idx)\n", "    print(label, \" \".join([feature_names[i]\n", "                          for i in topic.argsort()[:-9:-1]]))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }