{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"# **tf-idf**\n",
"\n",
"## **1 Document 자료를 불러오기**\n",
"sklearn을 활용한 tf-idf 계산\n",
"[**연간 기업결과 리포트**](https://news.samsung.com/global/samsung-electronics-announces-fourth-quarter-and-fy-2017-results)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Document 자료를 불러온다 : 2017년 연간결산 리포트\n",
"with open('./data/News2017.txt', 'r', encoding='utf-8') as f:\n",
" texts = f.read()\n",
"\n",
"texts = texts.lower()\n",
"texts[:300]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 영문 Token만 추출한다\n",
"from nltk.tokenize import RegexpTokenizer\n",
"re_capt = RegexpTokenizer(r'[ =Quiz!= ]\\w+')\n",
"tokens = re_capt.tokenize(texts)\n",
"document = \" \".join(tokens)\n",
"document[:300]"
]
},
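{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the tokenizing step concrete, here is a minimal sketch using only the standard library. It assumes a pattern of one lowercase letter followed by one or more word characters, which is one plausible way to keep English tokens and not necessarily the intended quiz answer. `re.findall` stands in for `RegexpTokenizer`, and the `sample` string is hypothetical.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Hypothetical sample text, already lowercased like `texts` above\n",
"sample = 'a record q4 2017: memory chips and oled panels led growth'\n",
"\n",
"# Assumed pattern: a lowercase letter, then one or more word characters\n",
"pattern = r'[a-z]\\w+'\n",
"tokens = re.findall(pattern, sample)\n",
"print(tokens)\n",
"```\n",
"\n",
"Purely numeric strings such as `2017` and one-letter words such as `a` do not match this pattern, so they are dropped."
]
},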
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 추출한 Token의 빈도를 계산한다\n",
"from nltk import FreqDist\n",
"import pandas as pd\n",
"token_freq = FreqDist(tokens)\n",
"token_freq = pd.Series(token_freq).sort_values(ascending=False)\n",
"token_freq[:10]"
]
},
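{
"cell_type": "markdown",
"metadata": {},
"source": [
"The counting step can also be sketched without nltk or pandas. `FreqDist` is built on `collections.Counter`, so the snippet below shows the same idea with the standard library; the token list is a hypothetical stand-in for `tokens`.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Hypothetical tokens standing in for the notebook's `tokens` list\n",
"sample_tokens = ['growth', 'chips', 'growth', 'oled', 'chips', 'growth']\n",
"\n",
"# Count occurrences, then take the two most frequent tokens\n",
"freq = Counter(sample_tokens)\n",
"top = freq.most_common(2)\n",
"print(top)  # [('growth', 3), ('chips', 2)]\n",
"```"
]
},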
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"## **2 sklean 을 활용한 tf idf 계산**\n",
"sklearn의 기본 데이터를 활용하여 tf-idf 결과값 출력"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ! pip3 install sklearn"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ! pip3 install scipy"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"tfidf_vec = TfidfVectorizer(stop_words=' =Quiz!= ')\n",
"transformed = tfidf_vec.fit_transform(raw_documents = [\" =Quiz!= \"])\n",
"transformed = np.array(transformed.todense())\n",
"transformed"
]
},
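{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the weighting that `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `norm='l2'`), the sketch below recomputes it by hand: idf is ln((1 + n) / (1 + df)) + 1, and each document's row is then scaled to unit length. The toy `docs`, the helper name `tfidf`, and the whitespace tokenization are assumptions for illustration; the real vectorizer uses its own token pattern.\n",
"\n",
"```python\n",
"import math\n",
"from collections import Counter\n",
"\n",
"# Hypothetical toy corpus (not the report used in this notebook)\n",
"docs = ['memory chips led growth', 'oled panels led growth growth']\n",
"\n",
"def tfidf(docs):\n",
"    tokenized = [d.split() for d in docs]\n",
"    n = len(tokenized)\n",
"    # Document frequency: number of documents containing each term\n",
"    df = Counter(t for toks in tokenized for t in set(toks))\n",
"    rows = []\n",
"    for toks in tokenized:\n",
"        tf = Counter(toks)\n",
"        # Term count times smoothed idf: ln((1 + n) / (1 + df)) + 1\n",
"        raw = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}\n",
"        # Scale the row to unit Euclidean (L2) length\n",
"        norm = math.sqrt(sum(v * v for v in raw.values()))\n",
"        rows.append({t: v / norm for t, v in raw.items()})\n",
"    return rows\n",
"\n",
"weights = tfidf(docs)\n",
"single = tfidf(['led growth growth'])[0]\n",
"print(weights[0])\n",
"print(single)\n",
"```\n",
"\n",
"With a single document every idf equals 1, so the weights reduce to L2-normalized term counts."
]
},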
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"index_value = {i[1]:i[0] for i in tfidf_vec.vocabulary_.items()}\n",
"fully_indexed = {index_value[column]:value for row in transformed \n",
" for (column,value) in enumerate(row)}\n",
"\n",
"token_tfidf = pd.Series(fully_indexed).sort_values(ascending=False)\n",
"token_tfidf[:10]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}