{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Blog Post Clustering\n", "\n", "*by Dominic Reichl (@domreichl)*\n", "\n", "October 2018\n", "\n", "## Introduction\n", "\n", "My blog (https://www.mindcoolness.com) currently has 322 blog posts, which I have categorized into four broad topics:\n", "- Psychology & Cognitive Science\n", "- Willpower & Self-Improvement\n", "- Philosophy & Spirituality\n", "- Morality & Ethics\n", "\n", "Recently, I experienced a curious desire to find out how unsupervised NLP models would cluster my writings, so I've created this notebook.\n", "\n", "## Overview\n", "1. Modules & Data\n", "2. Word Vectorization\n", "3. Word Frequency\n", "4. Clustering (KMeans)\n", "5. Cluster Visualization (MDS, TSNE)\n", "6. Cluster Exploration\n", "7. Predictive Evaluation\n", "8. More Models (NMF, LSA, LDA)\n", "9. Qualitative Evaluation\n", "10. Autoencoder\n", "11. Quantitative Evaluation\n", "12. Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Modules & Data\n", "\n", "This notebook requires the libraries Pandas, Beautiful Soup 4, Matplotlib, Mpld3, Scikit-learn, and TensorFlow.\n", "\n", "With the blog post data already exported from my MySQL server in CSV format, all we have to do here is load the CSV file into a Pandas DataFrame, filter the data, and convert the HTML code into text via BeautifulSoup." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " post_title \\\n", "317 how the brain makes emotions \n", "318 is willpower a cognitive strength? \n", "319 6 reasons why people use moral language \n", "320 the bayesian brain: placebo effects explained \n", "321 great minds discuss ideas, great men also disc... \n", "\n", " post_content \n", "317 here's the latest state of the art in the cogn... \n", "318 \\nwillpower is the ability to pursue long-term... \n", "319 \\nwhy do we use moral language?\\nhello, i'm do... \n", "320 \\n\\nin my article on predictive processing, i ... \n", "321 \\non great minds and great men\\ngreat minds di... " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from bs4 import BeautifulSoup\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\n", "from sklearn.cluster import KMeans\n", "from sklearn.metrics.pairwise import cosine_similarity\n", "from sklearn.manifold import MDS, TSNE\n", "import mpld3\n", "from sklearn.decomposition import NMF, TruncatedSVD, LatentDirichletAllocation\n", "import tensorflow as tf\n", "from sklearn.metrics import silhouette_score, calinski_harabaz_score\n", "\n", "# load data\n", "data = pd.read_csv('wp_posts.csv', sep=';')\n", "\n", "# filter data (exclude pages, drafts, revisions, etc.), then keep only title & content\n", "data = data[(data['post_type'] == 'post') & (data['post_status'] == 'publish')]\n", "data = data[['post_title', 'post_content']].reset_index(drop=True)\n", "\n", "# convert html code into text, then lowercase all words\n", "# (assign via data.loc[row, column] to avoid chained assignment on a copy)\n", "for i in data.index:\n", "    soup = BeautifulSoup(data.loc[i, 'post_content'], 'html.parser')\n", "    data.loc[i, 'post_content'] = soup.get_text().lower()\n", "    data.loc[i, 'post_title'] = data.loc[i, 'post_title'].lower()\n", "\n", "# display the last five blog posts\n", "data.tail()" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "# 2. Word Vectorization\n", "\n", "Natural language processing requires all words to be represented as numbers. For our purposes, the best way to achieve this is with Scikit-learn's TfidfVectorizer. This tool not only transforms words into vectors, but also ensures that the terms defining a cluster provide enough differentiation.\n", "\n", "How does it work?\n", "\n", "Consider first that the frequency of words like 'the', 'a', 'is', and 'and' is likely to be high in any English corpus, which means that they're of little value for document clustering. Moreover, basically every AI-related document would have the same topic if 'network', 'model', and 'algorithm' were cluster-defining terms.\n", "\n", "A clustering algorithm will produce much better results if the term frequency (tf = how often a word appears in a document) is multiplied by the inverse document frequency (idf = a measure of how much information the word provides). This is precisely what the TfidfVectorizer does when it penalizes terms that occur in many documents for lacking informational value.\n", "\n", "The CountVectorizer, by contrast, uses a simple bag-of-words approach where each term is transformed into a vector based on its count/frequency. This is useful for plotting the top 20 words in our data, and it's also needed for Latent Dirichlet Allocation (LDA), a structured probabilistic model.\n", "\n", "Lastly, we should set a lower bound on document frequency (min_df): with a threshold of 0.05, the vectorizer ignores all words that appear in fewer than 5% of the documents, i.e., the rarest words in our vocabulary. This will also prove useful for the autoencoder later because it speeds up the neural network training quite a lot." 
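, "\n", "\n", "To make the tf-idf weighting concrete, here is a minimal sketch that recomputes it by hand on a tiny corpus and compares the result with the TfidfVectorizer. The three sentences are hypothetical stand-ins, not blog data, and the sketch assumes Scikit-learn's default settings (smoothed idf, no sublinear tf, L2-normalized rows).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# hypothetical toy corpus (not taken from the blog data)
docs = ['willpower needs self control',
        'pride and humility shape the self',
        'moral values give life meaning']

counts = CountVectorizer().fit_transform(docs).toarray()  # raw term counts (tf)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf, assuming the default smooth_idf=True
manual = counts * idf                       # tf * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))       # compare the manual result with the vectorizer
```

In this toy corpus, 'self' appears in two of the three documents, so its idf is lower than that of words appearing in only one document."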
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((322, 995), (322, 995))" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# tf-idf (term frequency-inverse document frequency)\n", "tfidf_vectorizer = TfidfVectorizer(stop_words='english', min_df = 0.05)\n", "tfidf_matrix = tfidf_vectorizer.fit_transform(data['post_content'])\n", "tfidf_words = tfidf_vectorizer.get_feature_names()\n", "\n", "# bag of words (term frequency)\n", "tf_vectorizer = CountVectorizer(stop_words='english', min_df = 0.05)\n", "tf_matrix = tf_vectorizer.fit_transform(data['post_content'])\n", "tf_words = tf_vectorizer.get_feature_names()\n", "\n", "tfidf_matrix.shape, tf_matrix.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's 322 documents (blog posts) and a vocabulary with 995 vectorized words." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Word Frequency\n", "\n", "To visualize the frequency of words in our data set, we must first retrieve each term and its count (as the sum of its vector) from the vocabulary, sort all terms by count, and then build a list with the 20 most frequent words as well as a list with their counts. With that, we can plot the lists in a bar chart. (Note that the most common English words were already filtered out as stop words by the vectorizer.)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# get word frequencies from the bag of words and sort them by count in descending order\n", "term_frequency = [(term, tf_matrix.sum(axis=0)[0, i]) for term, i in tf_vectorizer.vocabulary_.items()]\n", "term_frequency = sorted(term_frequency, key = lambda x: x[1], reverse=True)\n", "terms = [i[0] for i in term_frequency[:20]] # get top 20 words\n", "count = [i[1] for i in term_frequency[:20]] # get counts of top 20 words\n", "\n", "# plot the 20 most frequent words in a bar chart\n", "fig, ax = plt.subplots(figsize=(16,8))\n", "ax.bar(range(len(terms)), count)\n", "ax.set_xticks(range(len(terms)))\n", "ax.set_xticklabels(terms)\n", "ax.set_title('Top 20 Most Frequent Words')\n", "ax.set_xlabel('Term')\n", "ax.set_ylabel('Count')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Clustering (KMeans)\n", "\n", "Our main algorithm for clustering will be k-means, which randomly initializes cluster centers, assigns all data points to their closest centroid (measured as least squared Euclidean distance), and then gradually and heuristically moves the centroids until convergence, i.e., until each centroid has become the actual center of its assigned data points (although reaching an optimum is not guaranteed). Note that Scikit-learn's KMeans uses an improved initialization algorithm—k-means++—by default.\n", "\n", "In my blog, I manually grouped my posts into four broad topic categories, so we will tell the model to find 4 clusters. After fitting the model with the matrix of vectorized words and storing the centroids, we can peek into the clustering results by printing out the top 3 defining words of each cluster." 
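, "\n", "\n", "The procedure described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name lloyd_kmeans and the two-blob toy data are hypothetical, and the sketch uses plain random initialization rather than the k-means++ scheme that Scikit-learn applies by default.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    '''Illustrative k-means loop (hypothetical helper, not Scikit-learn's implementation).'''
    rng = np.random.RandomState(seed)
    # plain random initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance from every point to every centroid
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)  # assign each point to its closest centroid
        # move each centroid to the mean of its points (keep it if its cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

# hypothetical toy data: two blobs in two dimensions
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)), rng.normal(5, 0.5, size=(20, 2))])
labels, centroids = lloyd_kmeans(X, k=2)
print(centroids.round(1))
```

At convergence, every centroid is the mean of the points assigned to it, which is the fixed point mentioned above; whether that fixed point is the global optimum depends on the initialization."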
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cluster 0: moral, meaning, values\n", "Cluster 1: willpower, self, control\n", "Cluster 2: pride, ego, humility\n", "Cluster 3: mind, emotions, life\n" ] } ], "source": [ "k = 4 # number of clusters\n", "\n", "# build and fit model, then sort each centroid's term indices by descending weight\n", "km = KMeans(k)\n", "km_matrix = km.fit_transform(tfidf_matrix)\n", "km_centroids = km.cluster_centers_.argsort()[:, ::-1]\n", "\n", "# create a dictionary with the top three words for each cluster\n", "top_words = {}\n", "for i in range(k):\n", "    top_words[i] = \", \".join(tfidf_words[c] for c in km_centroids[i, :3])\n", "    print('Cluster %s:' % i, top_words[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The clusters already make sense, but we shall wait until Section 9 for an in-depth qualitative evaluation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5. Cluster Visualization (MDS, TSNE)\n", "\n", "To visualize our clusters in a two-dimensional plot, we can use manifold learning models such as\n", "1. MDS (multi-dimensional scaling), which models dissimilarity data by computing geometric distances, and\n", "2. T-SNE (t-distributed Stochastic Neighbor Embedding), which converts pairwise affinities of data points to probabilities (see van der Maaten & Hinton, 2008).\n", "\n", "Let's fit both models with and without cosine distance.\n", "\n", "Using cosine distance is generally recommended for text data, but as we shall see, the plots look better without it. To build the plots, we create a data frame with x and y coordinates, post titles, and cluster labels before we group it by the latter. For the fun of it, let's also store the figure as a PNG image." 
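, "\n", "\n", "Why is cosine distance usually recommended for text? A small sketch with hypothetical term-count vectors (not blog data) may help: cosine distance compares the direction of two vectors and ignores their length, whereas Euclidean distance is sensitive to how long a document is.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# hypothetical term-count vectors over a four-word vocabulary
doc_a = np.array([[2, 1, 0, 0]])
doc_b = 3 * doc_a                 # same word proportions as doc_a, but three times longer
doc_c = np.array([[0, 0, 1, 2]])  # a different topic with no shared words
X = np.vstack([doc_a, doc_b, doc_c])

cos_dist = 1 - cosine_similarity(X)  # same construction as in the cell below
euc_dist = euclidean_distances(X)

print(cos_dist.round(2))  # doc_a and doc_b are at distance 0, doc_a and doc_c at distance 1
print(euc_dist.round(2))  # here doc_a ends up closer to doc_c than to doc_b
```

A square matrix like cos_dist is what MDS expects when it is told that the dissimilarities are precomputed."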
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# fit two non-linear dimensionality reduction models\n", "mds = MDS().fit_transform(km_matrix)\n", "tsne = TSNE().fit_transform(km_matrix)\n", "\n", "# fit models with cosine distance\n", "cos_dist = 1 - cosine_similarity(km_matrix)\n", "mds_cos = MDS(dissimilarity=\"precomputed\").fit_transform(cos_dist)\n", "tsne_cos = TSNE(metric=\"cosine\").fit_transform(cos_dist)\n", "\n", "# create data frame with coordinates, cluster labels, and post titles, grouped by clusters\n", "df = pd.DataFrame(dict(x1=mds[:,0], y1=mds[:,1], x2=mds_cos[:,0], y2=mds_cos[:,1],\n", " x3=tsne[:,0], y3=tsne[:,1], x4=tsne_cos[:,0], y4=tsne_cos[:,1],\n", " label=km.labels_.tolist(), title=data['post_title']))\n", "groups = df.groupby('label')\n", "\n", "# set a color and get the top three words for each cluster\n", "clusters = {0: ('#1b9e77', top_words[0]),\n", " 1: ('#d98f02', top_words[1]),\n", " 2: ('#7580b3', top_words[2]),\n", " 3: ('#e7196a', top_words[3]), }\n", "\n", "# build two plots for the manifold learning models\n", "fig, ax = plt.subplots(2,2, figsize=(16,12)) # 2x2 subplots\n", "ax[0,0].set_title('MDS (euclidean distance)'); ax[0,1].set_title('MDS (cosine distance)') # titles for first row\n", "ax[1,0].set_title('T-SNE (euclidean distance)'); ax[1,1].set_title('T-SNE (cosine distance)') # titles for second row\n", "for i,g in groups: # iterate over clusters\n", " ax[0,0].plot(g.x1, g.y1, marker='o', linestyle='', ms=12, color=clusters[i][0], label=clusters[i][1])\n", " ax[0,1].plot(g.x2, g.y2, marker='o', linestyle='', ms=12, color=clusters[i][0], label=clusters[i][1])\n", " ax[1,0].plot(g.x3, g.y3, marker='o', linestyle='', ms=12, color=clusters[i][0], label=clusters[i][1])\n", " ax[1,1].plot(g.x4, g.y4, marker='o', linestyle='', ms=12, color=clusters[i][0], label=clusters[i][1])\n", "ax[0,0].legend(); ax[0,1].legend(); ax[1,0].legend(); ax[1,1].legend() # add legends\n", "\n", 
"# save the figure as png and display it\n", "plt.savefig('clusters.png', dpi=200)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6. Cluster Exploration\n", "\n", "With the mpld3 library, we can use JavaScript and CSS code to create an interactive map that displays a tooltip with the blog title for each data point when the mouse hovers over it. This is great for exploring how my blog posts were clustered! I must give credit here to Brandon Rose (http://brandonrose.org) for this sweet piece of code—thank you." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#define custom toolbar location\n", "class TopToolbar(mpld3.plugins.PluginBase):\n", " \"\"\"Plugin for moving toolbar to top of figure\"\"\"\n", "\n", " JAVASCRIPT = \"\"\"\n", " mpld3.register_plugin(\"toptoolbar\", TopToolbar);\n", " TopToolbar.prototype = Object.create(mpld3.Plugin.prototype);\n", " TopToolbar.prototype.constructor = TopToolbar;\n", " function TopToolbar(fig, props){\n", " mpld3.Plugin.call(this, fig, props);\n", " };\n", "\n", " TopToolbar.prototype.draw = function(){\n", " this.fig.toolbar.draw();\n", " this.fig.toolbar.toolbar.attr(\"x\", 150);\n", " this.fig.toolbar.toolbar.attr(\"y\", 400);\n", " this.fig.toolbar.draw = function() {}\n", " }\n", " \"\"\"\n", " def __init__(self):\n", " self.dict_ = {\"type\": \"toptoolbar\"}\n", "\n", "\n", "# define custom css to format the font and to remove the axis labeling\n", "css = \"\"\"\n", "text.mpld3-text, div.mpld3-tooltip {\n", " font-family:Arial, Helvetica, sans-serif;\n", " font-size:14px;\n", " font-weight: bold;\n", " color: White;\n", " background-color: DodgerBlue;\n", "}\n", "\n", "g.mpld3-xaxis, g.mpld3-yaxis {\n", "display: none; }\n", "\n", "svg.mpld3-figure {\n", "margin-left: -75px;}\n", "\"\"\"\n", "\n", "# create plot\n", "fig, ax = plt.subplots(figsize=(14,6))\n", "for i,g in groups: # layer the plot by iterating through cluster labels\n", " points = ax.plot(g.x3, g.y3, marker='o', linestyle='', ms=14, color=clusters[i][0], label=clusters[i][1])\n", " labels = [i.title() for i in g.title] # get the blog posts titles in title case\n", " tooltip = mpld3.plugins.PointHTMLTooltip(points[0], labels, voffset=10, hoffset=10, css=css) # set tooltip\n", " mpld3.plugins.connect(fig, tooltip, TopToolbar()) # connect tooltip to fig\n", "ax.legend(edgecolor='white') # add \n", "\n", "# save as html file and show plot\n", "html = mpld3.fig_to_html(fig)\n", "with 
open(\"clusters.html\", \"w\") as file: file.write(html)\n", "mpld3.display()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Being deeply familiar with every data point (as it represents a blog post I have written), I can learn a lot from this interactive plot. But you, too, if you just briefly look at some of the titles and their relative distances, will quickly be able to confirm that the clustering has been very successful: the patterns make sense!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 7. Predictive Evaluation\n", "\n", "To find out the degree to which blog titles and contents belong to the same clusters, we can let the k-means model predict in which cluster each title and content fits best and then compute the overlap of these predictions.\n", "\n", "In addition, we can see how the four pairs of topic categories I use on my blog map to the four clusters generated by the model. 100% would mean that the model has categorized my blog posts very similarly to how I have categorized them." 
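, "\n", "\n", "Both percentages are simple ratios. The following sketch uses hypothetical cluster labels (not the real predictions) to show the arithmetic: the title/content match is the share of posts whose two predictions agree, and the category similarity is the number of distinct clusters hit by the four category names divided by the number of clusters.

```python
import numpy as np

# hypothetical predicted cluster labels for five posts
title_pred = np.array([0, 1, 2, 3, 1])
content_pred = np.array([0, 1, 2, 0, 1])
match_rate = (title_pred == content_pred).mean() * 100  # 4 of 5 predictions agree

# hypothetical clusters predicted for the four category names
category_pred = [2, 1, 2, 0]
k = 4
similarity = len(set(category_pred)) / k * 100  # 3 distinct clusters out of 4

print('Title/content match: %.1f%%' % match_rate)
print('Category similarity: %.1f%%' % similarity)
```

Note that the category score can only take the values 25%, 50%, 75%, or 100% when there are four categories and four clusters, so it is a coarse indicator."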
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Title/content match: 76.7%\n", "Category similarity: 75.0%\n" ] } ], "source": [ "# use model to predict the cluster for each title and content\n", "title_predictions = []\n", "content_predictions = []\n", "for i in range(len(data['post_content'])):\n", " titles = tfidf_vectorizer.transform([data['post_title'][i]])\n", " title_predictions.append(km.predict(titles))\n", " contents = tfidf_vectorizer.transform([data['post_content'][i]])\n", " content_predictions.append(km.predict(contents))\n", "\n", "# check how often a post's title and content are predicted to belong to the same cluster\n", "match = []\n", "for i in range(len(title_predictions)):\n", " if title_predictions[i] == content_predictions[i]:\n", " match.append(1)\n", " else:\n", " match.append(0)\n", "print('Title/content match: ' + str(round(sum(match)/len(match)*100, 1)) + '%')\n", "\n", "# test to what extent each manually defined topic category falls into its own cluster\n", "category_predictions = []\n", "for topic in ('psychology cognitive science', 'willpower self improvement',\n", " 'philosophy spirituality', 'morality ethics'):\n", " Category = tfidf_vectorizer.transform([topic])\n", " category_predictions.append(km.predict(Category)[0]) \n", "print('Category similarity: ' + str(len(set(category_predictions))/k*100) + '%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8. More Models (NMF, LSA, LDA)\n", "\n", "Now let's look at three additional models, and let's also combine them with k-means:\n", "1. NMF for non-negative matrix factorization (see Kim & Park, 2008)\n", "2. TruncatedSVD for Latent Semantic Analysis (LSA)\n", "3. LatentDirichletAllocation for Latent Dirichlet Allocation (LDA)\n", "4. NMF→KMeans for NMF-based k-means\n", "5. TruncatedSVD→Kmeans for LSA-based k-means\n", "6. 
LatentDirichletAllocation→KMeans for LDA-based k-means (see Guan, 2016 and Bui et al., 2017)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "nmf = NMF(k)\n", "nmf_matrix = nmf.fit_transform(tfidf_matrix)\n", "\n", "lsa = TruncatedSVD(k)\n", "lsa_matrix = lsa.fit_transform(tfidf_matrix)\n", "\n", "lda = LatentDirichletAllocation(k, learning_method='batch')\n", "lda_matrix = lda.fit_transform(tf_matrix)\n", "\n", "km_nmf = KMeans(k).fit(nmf_matrix) # NMF-based k-means\n", "km_lsa = KMeans(k).fit(lsa_matrix) # LSA-based k-means\n", "km_lda = KMeans(k).fit(lda_matrix) # LDA-based k-means" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 9. Qualitative Evaluation\n", "\n", "For qualitative evaluation, we can look at the three words that were most defining for each cluster produced by a model. If the word combinations make sense, if all top words of a cluster belong to a distinct category, and if there's little topical overlap between clusters, we may judge the model as good." 
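, "\n", "\n", "Extracting the most defining words boils down to sorting each row of a weight matrix and reading off the last three indices. Here is a minimal sketch with a hypothetical six-word vocabulary and made-up weights, standing in for the components of a fitted decomposition model:

```python
import numpy as np

terms = ['control', 'ego', 'moral', 'pride', 'values', 'willpower']  # hypothetical vocabulary
# hypothetical term weights, one row per cluster
components = np.array([[0.1, 0.7, 0.0, 0.9, 0.2, 0.0],
                       [0.8, 0.0, 0.1, 0.0, 0.3, 0.9]])

top_words_per_cluster = []
for i, topic in enumerate(components):
    top = topic.argsort()[:-4:-1]  # indices of the three largest weights, largest first
    top_words_per_cluster.append(' '.join(terms[t] for t in top))
    print('Cluster %d:' % i, top_words_per_cluster[i])
```

The slice [:-4:-1] walks backwards from the end of the ascending argsort and stops after three entries, which is why the words come out ordered from most to least defining."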
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " ---NMF---\n", "Cluster 0: meditation mindfulness life\n", "Cluster 1: moral values meaning\n", "Cluster 2: pride humility ego\n", "Cluster 3: willpower emotions control\n", "\n", " ---LSA---\n", "Cluster 0: pride self true\n", "Cluster 1: moral values meaning\n", "Cluster 2: pride emotion emotions\n", "Cluster 3: emotions willpower control\n", "\n", " ---LDA---\n", "Cluster 0: emotions pride self\n", "Cluster 1: people human values\n", "Cluster 2: true life want\n", "Cluster 3: willpower self control\n", "\n", " ---K-M---\n", "Cluster 0: moral meaning values\n", "Cluster 1: willpower self control\n", "Cluster 2: pride ego humility\n", "Cluster 3: mind emotions life\n", "\n", " ---NMF-KM---\n", "Cluster 0: life mind meditation\n", "Cluster 1: moral values meaning\n", "Cluster 2: willpower emotions control\n", "Cluster 3: pride humility true\n", "\n", " ---LSA-KM---\n", "Cluster 0: willpower control emotions\n", "Cluster 1: moral values meaning\n", "Cluster 2: pride humility true\n", "Cluster 3: life meditation mindfulness\n", "\n" ] } ], "source": [ "def top_words_decomp(model_name, model, terms):\n", " ''' prints the top 3 words of each cluster\n", " from the components of decomposition models '''\n", " print(model_name)\n", " for i, topic in enumerate(model.components_):\n", " print(\"Cluster %d: \" % (i), end=\"\")\n", " print(\" \".join([terms[t] for t in topic.argsort()[:-4:-1]]))\n", " print()\n", "\n", "top_words_decomp(\" ---NMF---\", nmf, tfidf_words)\n", "top_words_decomp(\" ---LSA---\", lsa, tfidf_words)\n", "top_words_decomp(\" ---LDA---\", lda, tf_words)\n", " \n", "def top_words_cluster(model_name, centers):\n", " ''' prints the top 3 words of each cluster\n", " from the centroids of the k-means models '''\n", " print(model_name)\n", " for i in range(k):\n", " print(\"Cluster %d: \" % i, end=\"\")\n", " print(\" 
\".join([tfidf_words[c] for c in centers[i, :3]]))\n", " print()\n", "\n", "top_words_cluster(\" ---K-M---\", km_centroids)\n", "top_words_cluster(\" ---NMF-KM---\", nmf.inverse_transform(km_nmf.cluster_centers_).argsort()[:, ::-1])\n", "top_words_cluster(\" ---LSA-KM---\", lsa.inverse_transform(km_lsa.cluster_centers_).argsort()[:, ::-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From what I know about the data, consisting of my own blog posts (hence easy for me to interpret), the NMF clusters are certainly the best. Their top three words neatly outline the very topics I have written about the most on my blog:\n", "- \"meditation mindfulness life\" ⇨ Philosophy & Spirituality (especially the latter)\n", "- \"moral values meaning\" ⇨ Morality & Ethics (quite obviously)\n", "- \"pride humility ego\" ⇨ Psychology & Cognitive Science (especially psychology of pride)\n", "- \"willpower emotions control\" ⇨ Willpower & Self-Improvement\n", "\n", "Here's the full model ranking:\n", "1. NMF is the winner: clear and distinct clusters that match my manually chosen topic categories.\n", "2. NMF-based KMeans and LSA-based KMeans share the second place: they produced almost identical clusters, with only 'mind' vs. 'mindfulness' differing, and even the \"pride humility true\" cluster makes sense, given that I have written several posts on the notion of 'true pride' (or 'authentic pride' as it's called in psychology and behavioral economics).\n", "3. KMeans takes the third place: it has the same ethics/morality and pride/humility clusters as NMF and a cluster with 'willpower' and 'self-control', but the \"mind emotions life\" cluster could be more distinct.\n", "4. LDA performed worse than all the KMeans variations: \"people human values\" is an ethics cluster, but not particularly expressive; 'self' occurs twice among the top three words; \"true life want\" could be related to my posts on the True Will, a life philosophy topic.\n", "5. 
LSA is the loser here: 'self' shouldn't be a top word for the psychology of pride cluster because the 'self' as in 'self-control' and 'self-discipline' is associated with a somewhat different topic; 'pride' shouldn't occur twice and certainly not as the top word of two different clusters; and, of course, we shouldn't have three occurrences of 'emotion(s)'.\n", "\n", "Had we used better tokenization and stemmed our tokens, LSA would have performed better. Still, it is worth noting that all other algorithms did quite well even without any stemming prior to the word vectorization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 10. Autoencoder\n", "\n", "Ready for something more complex?\n", "\n", "Autoencoders!\n", "\n", "What's that?\n", "\n", "Autoencoders are neural networks used for unsupervised learning. They are a powerful tool for dealing with the curse of dimensionality.\n", "\n", "Every autoencoder consists of two parts: (1) an encoder with multiple layers to reduce dimensionality and (2) a decoder with multiple layers to reconstruct the input from the dimensionally-reduced data. By reconstructing its inputs, the network detects the most important features in the data as it learns the identity function under the constraint of reduced dimensionality (or added noise). Since clustering is a form of dimensionality reduction, autoencoders should be useful for categorizing my blog posts into four broad topics.\n", "\n", "In the code below, we use TensorFlow to build an autoencoder with two hidden layers. First, we set two hyperparameters (learning rate and number of epochs) as well as the network parameters (numbers of nodes for all three layers). Then, after defining the graph input (X) and all weights and biases, initialized with normally-distributed random numbers, we build an encoder and a decoder function, both with sigmoid activation functions for each layer. 
Then we construct the model and define the functions for loss and optimization: minimize squared error with adaptive moment estimation (Adam). Finally, we initialize the variables and launch the graph before we run the session and training cycles. In the last two lines, we get the results and end the training session." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch: 000 cost = 0.076162167\n", "Epoch: 100 cost = 0.020960618\n", "Epoch: 200 cost = 0.007014040\n", "Epoch: 300 cost = 0.001955998\n", "Epoch: 400 cost = 0.000939283\n", "Epoch: 500 cost = 0.000937003\n" ] } ], "source": [ "learning_rate = 0.001\n", "training_epochs = 501\n", "\n", "n_input = tfidf_matrix.shape[1]\n", "n_hidden_1 = tfidf_matrix.shape[1] // 4\n", "n_hidden_2 = 4\n", "\n", "X = tf.placeholder(\"float\", [None, n_input])\n", "\n", "weights = {\n", " 'encoder_h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),\n", " 'encoder_h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),\n", " 'decoder_h1': tf.Variable(tf.random_normal([n_hidden_2, n_hidden_1])),\n", " 'decoder_h2': tf.Variable(tf.random_normal([n_hidden_1, n_input])),\n", "}\n", "\n", "biases = {\n", " 'encoder_b1': tf.Variable(tf.random_normal([n_hidden_1])),\n", " 'encoder_b2': tf.Variable(tf.random_normal([n_hidden_2])),\n", " 'decoder_b1': tf.Variable(tf.random_normal([n_hidden_1])),\n", " 'decoder_b2': tf.Variable(tf.random_normal([n_input])),\n", "}\n", "\n", "\n", "def encoder(x):\n", " layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['encoder_h1']), biases['encoder_b1']))\n", " layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['encoder_h2']), biases['encoder_b2']))\n", " return layer_2\n", "\n", "def decoder(x):\n", " layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['decoder_h1']),biases['decoder_b1']))\n", " layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['decoder_h2']), 
biases['decoder_b2']))\n", " return layer_2\n", "\n", "enc = encoder(X)\n", "dec = decoder(enc)\n", "\n", "cost = tf.reduce_mean(tf.pow(X - dec, 2))\n", "optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)\n", "\n", "init = tf.global_variables_initializer()\n", "sess = tf.InteractiveSession() # interactive for jupyter notebook\n", "sess.run(init)\n", "\n", "for epoch in range(training_epochs):\n", " for i in range(len(data)): # one batch per blog post\n", " _, c = sess.run([optimizer, cost], feed_dict={X: tfidf_matrix[i].toarray()})\n", " if epoch % 100 == 0: # display every hundredth epoch\n", " print(\"Epoch:\", '%03d' % epoch, \"cost =\", \"{:.9f}\".format(c))\n", "\n", "autoenc_results = dec.eval(feed_dict={X: tfidf_matrix.toarray()}) \n", "sess.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The loss decreased well enough, certainly better than for all the many other architectures and parameters I tried before.\n", "\n", "Now, let's feed the output of our NN into a k-means model and plot the results." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": false }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "km_autoenc = KMeans(k).fit(autoenc_results) # autoencoder-based k-means\n", "\n", "# fit T-SNE with cosine distance of autoencoder and autoencoder-based k-means results\n", "cos_dist_autoenc = 1 - cosine_similarity(autoenc_results)\n", "tsne_autoenc = TSNE(metric=\"cosine\").fit_transform(cos_dist_autoenc)\n", "cos_dist_km_autoenc = 1 - cosine_similarity(KMeans(k).fit_transform(autoenc_results))\n", "tsne_km_autoenc = TSNE(metric=\"cosine\").fit_transform(cos_dist_km_autoenc)\n", "\n", "# plot T-SNE results\n", "fig, ax = plt.subplots(1,2, figsize=(16,8))\n", "ax[0].set_title('Autoencoder')\n", "ax[0].scatter(tsne_autoenc[:,0], tsne_autoenc[:,1])\n", "ax[1].set_title('Autoencoder-based KMeans')\n", "ax[1].scatter(tsne_km_autoenc[:,0], tsne_km_autoenc[:,1])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 11. Quantitative Evaluation\n", "\n", "For quantitative evaluation, we will use three metrics that don't require ground truth labels:\n", "1. *Silhouette* is a coefficient between -1 and 1 that measures how well each data point fits its own cluster compared to the nearest other cluster; it should be non-negative and the closer to 1 the better.\n", "2. *WCSS* or *inertia* means within-cluster sum-of-squares, which measures cluster compactness; the smaller the better.\n", "3. *Calinski-Harabasz* is an index calculated as the ratio of between-cluster dispersion to within-cluster dispersion, thus measuring both denseness and separateness of clusters; the larger the better.\n", "\n", "In this final step, we create an evaluation table with the scores for all our k-means models on these three metrics." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Model Silhouette WCSS Calinski-Harabasz\n", "0 km 0.023253 276.33 5.97\n", "1 km_nmf 0.023509 2.37 6.35\n", "2 km_lsa 0.024074 8.68 6.32\n", "3 km_lda -0.009829 23.09 0.54\n", "4 km_autoenc 0.012356 0.09 1.92" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create evaluation table\n", "evaluation = pd.DataFrame({'Model': ['km', 'km_nmf', 'km_lsa', 'km_lda', 'km_autoenc']})\n", "sc, wcss, chi = [], [], []\n", "\n", "# calculate scores\n", "for model in (km, km_nmf, km_lsa, km_lda, km_autoenc):\n", "    sc.append(silhouette_score(tfidf_matrix.toarray(), model.labels_))\n", "    wcss.append(round(model.inertia_, 2))\n", "    chi.append(round(calinski_harabaz_score(tfidf_matrix.toarray(), model.labels_), 2))\n", "\n", "# use term frequency matrix for LDA (with the LDA-based labels, not the last loop variable)\n", "sc[-2] = silhouette_score(tf_matrix.toarray(), km_lda.labels_)\n", "chi[-2] = round(calinski_harabaz_score(tf_matrix.toarray(), km_lda.labels_), 2)\n", "\n", "# fill and display evaluation table\n", "evaluation['Silhouette'] = sc\n", "evaluation['WCSS'] = wcss\n", "evaluation['Calinski-Harabasz'] = chi\n", "evaluation.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 12. Conclusion\n", "\n", "Although these stats aren't particularly impressive (e.g., all silhouette coefficients are almost zero), their relative values are diagnostic nonetheless:\n", "1. NMF-based KMeans and LSA-based KMeans are the best models: relatively high SC, low WCSS, and high CHI.\n", "2. KMeans alone is not quite as good: worse on all three metrics.\n", "3. Autoencoder-based KMeans takes third place.\n", "4. LDA-based KMeans is the loser with a negative silhouette score and the lowest Calinski-Harabasz index.\n", "\n", "This is consistent with the qualitative analysis above.\n", "\n", "In conclusion, it appears that sticking to NMF-based KMeans, LSA-based KMeans, or even just NMF alone is the best choice for this data set and our word vectorization method." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }