{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "# **tf-idf**\n", "\n", "## **1 Loading the Document Data**\n", "Computing tf-idf with sklearn\n", "[**Annual corporate results report**](https://news.samsung.com/global/samsung-electronics-announces-fourth-quarter-and-fy-2017-results)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the document: the FY 2017 annual results report\n", "with open('./data/News2017.txt', 'r', encoding='utf-8') as f:\n", "    texts = f.read()\n", "\n", "texts = texts.lower()\n", "texts[:300]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract only the English tokens\n", "from nltk.tokenize import RegexpTokenizer\n", "re_capt = RegexpTokenizer(r'[ =Quiz!= ]\\w+')\n", "tokens = re_capt.tokenize(texts)\n", "document = \" \".join(tokens)\n", "document[:300]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Count the frequency of each extracted token\n", "from nltk import FreqDist\n", "import pandas as pd\n", "token_freq = FreqDist(tokens)\n", "token_freq = pd.Series(token_freq).sort_values(ascending=False)\n", "token_freq[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

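The frequency table above holds raw term counts. As a rough illustration of how such counts relate to tf-idf weights, the sketch below recomputes them by hand with the standard library only. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is what scikit-learn's `TfidfVectorizer` applies with its default settings. The toy corpus and the helper name `tfidf_rows` are hypothetical and not part of this notebook.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    # docs: a list of token lists; returns one {token: weight} dict per document
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears
    df = Counter(token for doc in docs for token in set(doc))
    rows = []
    for doc in docs:
        tf = Counter(doc)
        # Raw count multiplied by the smoothed idf (assumed default formula)
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        # L2-normalise so that each document vector has unit length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Hypothetical toy corpus of two already-tokenised documents
toy_docs = [['memory', 'chip', 'memory'], ['display', 'chip']]
weights = tfidf_rows(toy_docs)

# The same first document fitted on its own
single = tfidf_rows([['memory', 'chip', 'memory']])
```

When a vectorizer is fitted on a single document, every idf equals 1, so the weights reduce to L2-normalised term counts and the ranking matches the raw frequency ranking. The idf term only starts to separate tokens once several documents are fitted together.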
\n", "## **2 Computing tf-idf with sklearn**\n", "Print the tf-idf values using sklearn's default data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ! pip3 install scikit-learn" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ! pip3 install scipy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Fit a tf-idf vectorizer and convert the sparse result to a dense array\n", "import numpy as np\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "tfidf_vec = TfidfVectorizer(stop_words=' =Quiz!= ')\n", "transformed = tfidf_vec.fit_transform(raw_documents = [\" =Quiz!= \"])\n", "transformed = np.array(transformed.todense())\n", "transformed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Invert vocabulary_ (token -> column index) into column index -> token\n", "index_value = {i[1]:i[0] for i in tfidf_vec.vocabulary_.items()}\n", "# Pair each token with its tf-idf weight\n", "fully_indexed = {index_value[column]:value for row in transformed \n", "                 for (column,value) in enumerate(row)}\n", "\n", "# Sort the tokens by tf-idf weight, highest first\n", "token_tfidf = pd.Series(fully_indexed).sort_values(ascending=False)\n", "token_tfidf[:10]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }