{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "# **Mini Project One**\n", "Stop Words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "## **1 text 문서에서 token 추출하기**\n", "Document 에서 한글 추출하기" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Step 1 - pdf 에서 변환한 Document 불러오기\n", "filename = '../data/kr-Report_2018.txt'\n", "with open(filename, 'r', encoding='utf-8') as f:\n", " texts = f.read()\n", "texts[:300]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from txtutil import txtnoun\n", "texts = txtnoun(filename, tags=[\"Noun\", \"Adjective\", \"Verb\"], stem=True)\n", "texts[:300]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Document 문서를 Token List 객체로 변환하기\n", "from nltk.tokenize import word_tokenize\n", "texts = word_tokenize(texts)\n", "texts[:8]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "## **2 StopWord 데이터 만들기**\n", "**stopwords_list** : 2015, 2016, 2017, 2018년 모두 존재하는 단어목록" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from glob import glob\n", "filelist = glob('../data/kr-Report_201?.txt')\n", "filelist" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stopword_list = []\n", "for file in filelist:\n", " token_list = txtnoun(file, tags=[\"Noun\", \"Adjective\", \"Verb\"], set_tokens=True)\n", " if len(stopword_list) == 0:\n", " stopword_list = token_list\n", " else:\n", " stopword_list = [token for token in token_list \n", " if token in stopword_list]\n", " print(\"{}로 필터링 된 StopWord 갯수 : {}\".format(file, len(stopword_list)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "## **3 추출한 StopWord 로 Token 필터링**\n", "stopword 를 사용하여 필터링" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Stopwords 를 활용하여 Token을 필터링\n", "texts = [text for text in texts \n", " if text not in stopword_list]\n", "\n", "# pandas 를 활용하여 상위빈도 객체를 출력한다\n", "import pandas as pd\n", "from nltk import FreqDist\n", "freqtxt = pd.Series(dict(FreqDist(texts))).sort_values(ascending=False)\n", "freqtxt[:25]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "## **3 Konlpy 의 단점들**\n", "오타/ 비정형 텍스트의 처리" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from konlpy.tag import Twitter\n", "twitter = Twitter()\n", "twitter.pos('가치창출', stem=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "twitter.pos('갤럭시', stem=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "twitter.pos('갤러시', stem=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "## **4 WordCloud 출력**\n", "visualization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# wordcloud 출력\n", "from wordcloud import WordCloud\n", "wcloud = WordCloud('../data/D2Coding.ttf',\n", " relative_scaling = 0.2,\n", " background_color = 'white').generate(\" \".join(texts))\n", "wcloud" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.figure(figsize=(12,12))\n", "plt.imshow(wcloud, interpolation='bilinear')\n", "plt.axis(\"off\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }