{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "LSA: 선형대수학적으로 토픽벡터를 발굴하는 방법

\n", "LDA: 확률적으로 토픽벡터를 계산해주는 방법" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import warnings\n", "import nltk\n", "from nltk.corpus import stopwords\n", "from sklearn.feature_extraction.text import TfidfVectorizer \n", "from sklearn.decomposition import TruncatedSVD #축소된 svd\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Latent Semantic Analysis (LSA):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# The data.\n", "my_docs = [\"The economic slowdown is becoming more severe\",\n", " \"The movie was simply awesome\",\n", " \"I like cooking my own food\",\n", " \"Samsung is announcing a new technology\",\n", " \"Machine Learning is an example of awesome technology\",\n", " \"All of us were excited at the movie\",\n", " \"We have to do more to reverse the economic slowdown\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1. Create a TF IDF representation:\n", "TfidfVectorizer() arguments:
\n", "- *max_features* : maximum number of features (distict words).
\n", "- *min_df* : The minimum DF. Integer value means count and real number (0~1) means proportion.
\n", "- *max_df* : The maximum DF. Integer value means count and real number (0~1) means proportion. Helps to filter out the stop words.
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "my_docs = [x.lower() for x in my_docs] #소문자로 변경" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "my_stop_words = ['us', 'like']" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "vectorizer = TfidfVectorizer(max_features = 15, min_df = 1, max_df = 3, stop_words = stopwords.words('english') + my_stop_words)\n", "X = vectorizer.fit_transform(my_docs).toarray() #my_stop_words 리스트 +로 추가" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0. , 0. , 0. , 0.53828134, 0. ,\n", " 0. , 0. , 0. , 0. , 0. ,\n", " 0. , 0.64846464, 0. , 0.53828134, 0. ],\n", " [0. , 0.53828134, 0. , 0. , 0. ,\n", " 0. , 0. , 0.53828134, 0. , 0. ,\n", " 0. , 0. , 0.64846464, 0. , 0. ],\n", " [0. , 0. , 0.70710678, 0. , 0. ,\n", " 0. , 0.70710678, 0. , 0. , 0. ,\n", " 0. , 0. , 0. , 0. , 0. ],\n", " [0.52064676, 0. , 0. , 0. , 0. ,\n", " 0. , 0. , 0. , 0.52064676, 0. ,\n", " 0.52064676, 0. , 0. , 0. , 0.43218152],\n", " [0. , 0.53828134, 0. , 0. , 0.64846464,\n", " 0. , 0. , 0. , 0. , 0. ,\n", " 0. , 0. , 0. , 0. , 0.53828134],\n", " [0. , 0. , 0. , 0. , 0. ,\n", " 0.76944876, 0. , 0.63870855, 0. , 0. ,\n", " 0. , 0. , 0. , 0. , 0. ],\n", " [0. , 0. , 0. , 0.53828134, 0. ,\n", " 0. , 0. , 0. , 0. , 0.64846464,\n", " 0. , 0. , 0. , 0.53828134, 0. ]])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7, 15)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Size of X (=m x n). m = number of documents = 7 & n = number of features.\n", "X.shape #[]가 7개 열의 수가 15개" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['announcing', 'awesome', 'cooking', 'economic', 'example', 'excited', 'food', 'movie', 'new', 'reverse', 'samsung', 'severe', 'simply', 'slowdown', 'technology']\n" ] } ], "source": [ "# View the features.(칼럼의 이름들)\n", "features = vectorizer.get_feature_names()\n", "print(features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2. Apply the truncated SVD:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "TruncatedSVD(algorithm='randomized', n_components=4, n_iter=100,\n", " random_state=None, tol=0.0)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#객체 만들고 학습\n", "n_topics = 4 #4개 토픽 발굴이 목표 (15개까지의 주성분이 나올 수 있는데 4차원으로 차원축소하겠다)\n", "svd = TruncatedSVD(n_components=n_topics, n_iter=100) #n_iter는 디폴트\n", "svd.fit(X) " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# get the V^t matrix. \n", "vt = svd.components_ #v의 행렬의 전치가 나옴\n", "vtabs = np.abs(vt)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4, 15)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check for the size of V^t. 
\n", "vt.shape #전치되어 행이 4됨" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 4.09648554e-17, -3.60448192e-17, -1.64706896e-18,\n", " 6.05710899e-01, -3.14797383e-17, 1.41315903e-17,\n", " -1.64706896e-18, 1.68987806e-18, 1.51580645e-17,\n", " 3.64848333e-01, 1.51580645e-17, 3.64848333e-01,\n", " -1.20346441e-17, 6.05710899e-01, -1.34983170e-17],\n", " [ 1.09328020e-01, 5.24120449e-01, 9.83780423e-16,\n", " -7.70860566e-17, 2.79641926e-01, 3.00367284e-01,\n", " 9.83780423e-16, 5.41324276e-01, 1.09328020e-01,\n", " -2.12425741e-16, 1.09328020e-01, 1.27976612e-16,\n", " 3.51763163e-01, -7.70867745e-17, 3.22878460e-01],\n", " [ 3.17757779e-01, 1.09242359e-01, -3.28091723e-16,\n", " -4.77055574e-18, 2.84805383e-01, -3.73322920e-01,\n", " -4.04664825e-16, -4.37060653e-01, 3.17757779e-01,\n", " -2.37628237e-17, 3.17757779e-01, 7.91995645e-18,\n", " -1.53201700e-01, -4.77057326e-18, 5.00179172e-01],\n", " [ 6.99407274e-16, -1.07768538e-15, 7.07106781e-01,\n", " 3.12368259e-17, -9.61233971e-16, -2.53713733e-16,\n", " 7.07106781e-01, -5.54089684e-16, 6.90182980e-16,\n", " 6.04864874e-19, 6.90182980e-16, -5.77947367e-18,\n", " -3.31494761e-16, -2.42743250e-17, -2.59157645e-16]])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vt \n", "\n", "#[5줄]이 한 행임\n", "#4.09648554e-1 첫번째원소는 announcing, -3.09648554e-1 두번쨰는 awesome에 해당하는 원소" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3. From each topic, extract the top features: 토픽벡터들의 이름정하기" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "행렬가지고 SVD한 뒤 토픽벡터 가져옴

\n", "\n", "토픽벡터 성분이 개개 가중치임

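\n", "\n", "The ranking in the next cell relies on np.argsort applied to the negated absolute components. A tiny illustration with made-up weights (not values from this notebook):\n", "\n", "```python\n", "import numpy as np\n", "w = np.array([0.1, 0.7, 0.4])\n", "np.argsort(-w)  # array([1, 2, 0]): indices ordered from the largest weight to the smallest\n", "```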
\n", "\n", "성분가지고 3개의 탑 feature를 뽑음" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "n_top = 3 #토픽벡터들 0~3까지 하나씩 가져옴\n", "for i in range(n_topics): \n", " topic_features = [features[idx] for idx in np.argsort(-vtabs[i,:])] # argsort() shows the sorted index.\n", " #np.argsort: 소에서 대로 정렬해 그 위치값(인덱스)을 가져옴(값을 가져오는거 아님)\n", " #대에서 소로 정렬하고 싶기에 -vtabs \n", " #제일 큰 것들의 인덱스를 가져와 그 인덱스에 해당하는 feature들을 모아 리스트 만듬\n", " \n", " topic_features_top = topic_features[0:n_top]\n", " #탑3만 가져옴\n", " \n", " if i == 0:\n", " topic_matrix = [topic_features_top] # list의 list 만들 준비!\n", " else:\n", " topic_matrix.append(topic_features_top) " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['economic', 'slowdown', 'severe'],\n", " ['movie', 'awesome', 'simply'],\n", " ['technology', 'movie', 'excited'],\n", " ['cooking', 'food', 'awesome']]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show the top features for each topic.\n", "topic_matrix \n", "\n", "#첫번째 토픽벡터에서 탑 3는 ['economic', 'slowdown', 'severe']\n", "#제일 많은 기여하는 성분들 \n", "#성분가지고 3개의 탑 feature를 뽑음" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# In view of the top features, we can name the topics.\n", "topic_names = ['Economy', 'Movie','Technology', 'Cuisine']\n", "#첫번째 토픽벡터는 Economy라고 이름 지음" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4. Label each document with the most predominant topic:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "개개문서에서 키워드들이 몇 번 발생하는지" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Document 1 = Economy\n", "Document 2 = Movie\n", "Document 3 = Cuisine\n", "Document 4 = Technology\n", "Document 5 = Movie\n", "Document 6 = Technology\n", "Document 7 = Economy\n" ] } ], "source": [ "n_docs = len(my_docs)\n", "for i in range(n_docs):\n", " score_pick = 0\n", " topic_pick = 0\n", " tokennized_doc = nltk.word_tokenize(my_docs[i])\n", " for j in range(n_topics):\n", " found = [ x in topic_matrix[j] for x in tokennized_doc ] \n", " score = np.sum(found)\n", " if (score > score_pick):\n", " score_pick = score\n", " topic_pick = j\n", " print(\"Document \" + str(i+1) + \" = \" + topic_names[topic_pick])\n", "\n", "#문서들을 하나씩 가져와서 문서안의 토픽벡터 탑3 키워드가 문서 안에 있냐 없냐 보고 많이 겹치면 그 문서의 토픽으로 삼음\n", " \n", "#첫번째문서는 이코노미\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE**: 어떤 경우는 제대로 레이블 된 것 같고, 어떤 경우는 제대로 안된 것 같음" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }