{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br></br>\n",
    "# **tf-idf**\n",
    "\n",
    "## **1 Document 자료를 불러오기**\n",
    "sklearn을 활용한 tf-idf 계산\n",
    "[**연간 기업결과 리포트**](https://news.samsung.com/global/samsung-electronics-announces-fourth-quarter-and-fy-2017-results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Document 자료를 불러온다 : 2017년 연간결산 리포트\n",
    "with open('./data/News2017.txt', 'r', encoding='utf-8') as f:\n",
    "    texts = f.read()\n",
    "\n",
    "texts = texts.lower()\n",
    "texts[:300]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 영문 Token만 추출한다\n",
    "from nltk.tokenize import RegexpTokenizer\n",
    "re_capt  = RegexpTokenizer(r'[ =Quiz!= ]\\w+')\n",
    "tokens   = re_capt.tokenize(texts)\n",
    "document = \" \".join(tokens)\n",
    "document[:300]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 추출한 Token의 빈도를 계산한다\n",
    "from nltk import FreqDist\n",
    "import pandas as pd\n",
    "token_freq = FreqDist(tokens)\n",
    "token_freq = pd.Series(token_freq).sort_values(ascending=False)\n",
    "token_freq[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br></br>\n",
    "## **2 sklean 을 활용한 tf idf 계산**\n",
    "sklearn의 기본 데이터를 활용하여 tf-idf 결과값 출력"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ! pip3 install sklearn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ! pip3 install scipy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "tfidf_vec   = TfidfVectorizer(stop_words=' =Quiz!= ')\n",
    "transformed = tfidf_vec.fit_transform(raw_documents = [\" =Quiz!= \"])\n",
    "transformed = np.array(transformed.todense())\n",
    "transformed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "index_value   = {i[1]:i[0] for i in tfidf_vec.vocabulary_.items()}\n",
    "fully_indexed = {index_value[column]:value  for row in transformed  \n",
    "                                            for (column,value) in enumerate(row)}\n",
    "\n",
    "token_tfidf = pd.Series(fully_indexed).sort_values(ascending=False)\n",
    "token_tfidf[:10]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}