{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br></br>\n",
    "# **Mini Project One**\n",
    "Stop Words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br></br>\n",
    "## **1 text 문서에서 token 추출하기**\n",
    "Document 에서 한글 추출하기"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 1 - pdf 에서 변환한 Document 불러오기\n",
    "filename = '../data/kr-Report_2018.txt'\n",
    "with open(filename, 'r', encoding='utf-8') as f:\n",
    "    texts = f.read()\n",
    "texts[:300]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from txtutil import txtnoun\n",
    "texts = txtnoun(filename, tags=[\"Noun\", \"Adjective\", \"Verb\"], stem=True)\n",
    "texts[:300]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Document 문서를 Token List 객체로 변환하기\n",
    "from nltk.tokenize import word_tokenize\n",
    "texts = word_tokenize(texts)\n",
    "texts[:8]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br></br>\n",
    "## **2 StopWord 데이터 만들기**\n",
    "**stopwords_list** : 2015, 2016, 2017, 2018년 모두 존재하는 단어목록"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from glob import glob\n",
    "filelist = glob('../data/kr-Report_201?.txt')\n",
    "filelist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stopword_list = []\n",
    "for file in filelist:\n",
    "    token_list = txtnoun(file, tags=[\"Noun\", \"Adjective\", \"Verb\"], set_tokens=True)\n",
    "    if len(stopword_list) == 0:\n",
    "        stopword_list = token_list\n",
    "    else:\n",
    "        stopword_list = [token for token in token_list  \n",
    "                               if token in stopword_list]\n",
    "    print(\"{}로 필터링 된 StopWord 갯수 : {}\".format(file, len(stopword_list)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br></br>\n",
    "## **3 추출한 StopWord 로 Token 필터링**\n",
    "stopword 를 사용하여 필터링"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Stopwords 를 활용하여 Token을 필터링\n",
    "texts = [text for text in texts  \n",
    "              if text not in stopword_list]\n",
    "\n",
    "# pandas 를 활용하여 상위빈도 객체를 출력한다\n",
    "import pandas as pd\n",
    "from nltk import FreqDist\n",
    "freqtxt = pd.Series(dict(FreqDist(texts))).sort_values(ascending=False)\n",
    "freqtxt[:25]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br></br>\n",
    "## **3 Konlpy 의 단점들**\n",
    "오타/ 비정형 텍스트의 처리"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from konlpy.tag import Twitter\n",
    "twitter = Twitter()\n",
    "twitter.pos('가치창출', stem=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "twitter.pos('갤럭시', stem=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "twitter.pos('갤러시', stem=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br></br>\n",
    "## **4 WordCloud 출력**\n",
    "visualization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# wordcloud 출력\n",
    "from wordcloud import WordCloud\n",
    "wcloud = WordCloud('../data/D2Coding.ttf',\n",
    "                   relative_scaling = 0.2,\n",
    "                   background_color = 'white').generate(\" \".join(texts))\n",
    "wcloud"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "plt.figure(figsize=(12,12))\n",
    "plt.imshow(wcloud, interpolation='bilinear')\n",
    "plt.axis(\"off\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}