{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "< [Model inference](tutorial_5.ipynb) | [Contents](index.ipynb) | >" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model application and visualization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial we will create visualizations of topic model results. We'll highlight some of the most important techniques and give some examples of faceting by meta-data and semantic clusters. First, we load the text data and the previously computed data (Tutorial I). We create two variables $\\theta = p(z|d)$ and $\\phi = p(w|z)$. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "options(stringsAsFactors = FALSE)\n", "require(topicmodels)\n", "require(lda)\n", "require(data.table)\n", "require(quanteda)\n", "require(ggplot2)\n", "require(magrittr)\n", "require(dplyr)\n", "\n", "\n", "\n", "# textdata <- read.csv(\"data/data job posts.csv\", header = TRUE, sep = \",\", encoding = \"UTF-8\",quote = \"\\\"\")\n", "# textdata <- as.data.table(textdata)\n", "\n", "# textdata %<>% mutate(d_id = 1:nrow(textdata))\n", "\n", "# Load the posterior and data from tutorial I\n", "load(\"DTM.RData\")\n", "load(\"tmResult.RData\")\n", "load(\"topicModel.RData\")\n", "load(\"corpus.RData\")\n", "load(\"textdata.RData\")\n", "K <- topicModel@k # set a global topic model K parameter to use\n", "\n", "# From the posterior get theta (p(z|d)) and phi (p(w|z)) \n", "theta <- tmResult$topics\n", "phi <- tmResult$terms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Topic visualizations\n", "\n", "Different quantities can be determined from the topic model posterior. In this section we explain and show some of the most important measures and utilize different visualizations.\n", "\n", "### Visualizing p(w|z)\n", "\n", "Even though one can question the scientific contribution of word-clouds this visualization allows a quick overview of weighted sets of terms. So we take a closer look on some topics with word-clouds." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# visualize topics as word cloud\n", "library(wordcloud)\n", "\n", "#generate the names again to filter for a certain topic\n", "#terms: Function to extract the most likely terms for each topic or the most likely topics for each document.\n", "top5termsPerTopicProb <- lda::top.topic.words(phi, 5, by.score = T)\n", "topicNames <- apply(top5termsPerTopicProb, 2, paste, collapse = \" \")\n", "\n", "topicToViz <- 11 # change for your own topic of interest\n", "topicToViz <- grep('engineer', topicNames)[1] # Or select a topic by a term contained in its name\n", "# select to 40 most probable terms from the topic by sorting the term-topic-probability vector in decreasing order\n", "top40terms <- sort(phi[topicToViz,], decreasing = TRUE)[1:40]\n", "words <- names(top40terms)\n", "# extract the probabilities of each of the 40 terms\n", "probabilities <- sort(phi[topicToViz,], decreasing = TRUE)[1:40]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# visualize the terms as word-cloud\n", "library(wordcloud)\n", "wordcloud(names(probabilities), probabilities, scale = c(3, .9), colors = brewer.pal(8, \"Dark2\"), random.order = F)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you change the variable **topicToViz** with values between 1 and 20 you can visualize other topics.\n", "\n", "### Visualizing p(z|d)\n", "\n", "In the next step we will visualize the topic distributions within single documents. We use 3 example documents for this purpose." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(IRdisplay)\n", "\n", "\n", "exampleIds <- c(150, 10000, 15000)\n", "display_html(paste0(\"\",gsub(\"\\n\", \"
\", textdata$jobpost[exampleIds[1]])))\n", "display_html(paste0(\"

\",gsub(\"\\n\", \"
\", textdata$jobpost[exampleIds[2]])))\n", "display_html(paste0(\"

\",gsub(\"\\n\", \"
\", textdata$jobpost[exampleIds[3]])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We visualize the topic-distributions $p(z|d)$ within. We use the `ggplot2` library for plotting. This is a good choice since this library offers a lot of freedom in design and layout. Furthermore, multiple plotting types are possible with slightly different commands." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load libraries for visualization\n", "require(\"reshape2\")\n", "require(\"ggplot2\")\n", "\n", "N <- length(exampleIds)\n", "\n", "# get topic proportions form example documents\n", "topicProportionExamples <- theta[as.character(textdata$d_id[exampleIds]), ]\n", "colnames(topicProportionExamples) <- topicNames\n", "vizDataFrame <- melt(data = cbind(data.frame(topicProportionExamples), document = stringr::str_sub(textdata$Company[exampleIds],1,10)), \n", " variable.name = \"topic\", \n", " id.vars = \"document\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ggplot(data = vizDataFrame, aes(x = topic, y = value, fill = document), ylab = \"proportion\") +\n", " geom_bar(stat = \"identity\", position = \"stack\") +\n", " theme(axis.text.x = element_text(angle = 90, hjust = 1)) + \n", " coord_flip() + facet_wrap(~document, ncol = N)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another useful quantity is the share each topic has within the whole document collection. We can display this proportion in two ways. In the first example we are using a pie chart. To prepare the plot, we define a color scheme and a ranking for the topics. We keep the order for the rest of the tutorial in order to create a consistent reading of the plots." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(RColorBrewer)\n", "getPalette = colorRampPalette(brewer.pal(9, \"Set1\"))\n", "\n", "\n", "# Again we use the topic distribution over documents to determine the share of each topic\n", "# What are the most probable topics in the entire collection?\n", "topicProportions <- colSums(theta) / sum(theta)\n", "names(topicProportions) <- topicNames\n", "\n", "# For the next examples we create a main sorting and coloring of the topics which we could apply to all visualizations in ggplot\n", "\n", "# We start with sorting the topics by their probability\n", "topicProportions <- topicProportions[order(topicProportions)]\n", "\n", "# ordering in the ggplot library can be done using a factor for the topic labels\n", "topicsOrd <- factor(\n", " names(topicProportions), # Take the names of the topics as examples\n", " levels = names(topicProportions), # Set them also as possible levels of the nominal factor\n", " ordered = T) # the given order of the topic names is also the order of the factor\n", "\n", "# next we randomly create some colors from the \"rainbow\"-palette of R\n", "# colorScale <- sample(rainbow(length(topicsOrd)))\n", "# Alternative: use precompiled color palettes from the the pals package\n", "colorScale <- paste0(getPalette(length(topicsOrd)), \"FF\")\n", "\n", "# Finally, a data frame is created associating the colors with the topic names\n", "refOrdColors <- data.frame(topicsOrd, colorScale)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll now use the colors in order to create a colored pie chart." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ggplot2 does only understand data.frame objects\n", "# melt creates a data.frame from our matrix representing each cell as a row\n", "topicProportions_df <- melt(topicProportions) \n", "\n", "# add the just created factor as name description column to the rows of the data.frame\n", "topicProportions_df$topicNamesFactor <- refOrdColors$topicsOrd \n", "\n", "# Create a bar plot:\n", "# Initialize the plot by assigning the values to the y-axis. The running order of the topics is given by fill and the ordered topic factor. The scale_fill_manual command defines the order of the colors and is assigned to our reference colors.\n", "bp <- ggplot(topicProportions_df, aes(x = \"\", y = value, fill = refOrdColors$topicsOrd)) + \n", " geom_bar(width = 1, stat = \"identity\") +\n", " scale_fill_manual(values = refOrdColors$colorScale) \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "require(scales)\n", "# from the bar plot create a polar coordinate view, choose a minimal theme\n", "pie <- bp + coord_polar(\"y\", start = 0) + theme_minimal() + \n", " theme(\n", " axis.text.x = element_blank(), \n", " axis.title.x = element_blank(), # make every graphical element blank except the pie\n", " axis.title.y = element_blank(),\n", " panel.border = element_blank(),\n", " panel.grid = element_blank(),\n", " axis.ticks = element_blank(),\n", " legend.position = \"left\", \n", " legend.key.width = unit(3, \"mm\"),\n", " legend.key.height = unit(3, \"mm\"),\n", " plot.title = element_text(size = 14, face = \"bold\")\n", " ) +\n", " ggtitle(\"Topic distribution in corpus\") +\n", " geom_text(size = 3, aes(x = 1.7, \n", " label = percent(value)), \n", " position = position_stack(vjust = 0.5)) + \n", " guides(fill = guide_legend(title = \"Topic names\", reverse = T))\n", "print(pie)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second example plots a histogram of each topic's share." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ggplot(data = topicProportions_df, \n", " aes(x = topicNamesFactor, y = value, fill = refOrdColors$topicsOrd)) + # set data for axis and fill a gradient\n", " geom_bar(stat = \"identity\", width = .5) + # define attributes for bars\n", " scale_fill_manual(values = refOrdColors$colorScale) + \n", " coord_flip() + # flip the plot to horizontal bars\n", " guides(fill = FALSE) + # hide guide\n", " ggtitle(\"Topic proportions\") + # set the title\n", " xlab(\"Topic name\") + # set the x axis label\n", " ylab(\"Proportion\") # set the y axis label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LDAvis Topic Browser\n", "\n", "The `ldaVis`-package can produce a very comfortable visualization to browse a topic models outcome. With a short method call on the posterior of a topic model the package creates a web-application which can be viewed in an external browser. This application subsumes many of the ideas from the above exercises and provides a very convenient way to access and communicate the results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An example was created within the workspace folder. Please navigate to the folder `./ldaviz` and open the index.html file within a browser. 
The created result is a graphical topic browser and should look like this:\n", "\n", "![The LDAviz topic browser](resources/screenLDAViz.PNG)\n", "\n", "## Topics across meta data\n", "\n", "### Filter documents\n", "\n", "The posterior variable $\\theta$ contains the modeled topic probabilities for each document. This information can be used to select semantically coherent clusters of documents from the collection. Such a filter is applied by introducing a topic threshold that a certain topic has to exceed within a document for the document to be selected. For example, a filtering query could select all documents in which the topic “work job …” has a minimum share of 15 %. \n", "\n", "In the following code snippet we select documents on the basis of the topic distribution from our text source.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "topicToFilter <- 20 # you can set this manually ...\n", "# ... or have it selected by a term contained in the topic name (e.g. 'engineer')\n", "topicToFilter <- grep('engineer', topicNames)[1]\n", "topicThreshold <- 0.15\n", "selectedDocumentIndexes <- which(theta[, topicToFilter] >= topicThreshold)\n", "# The document IDs from the corpus are contained in the rownames of the posterior variable theta\n", "selectedDocumentIndexes <- as.integer(rownames(theta[selectedDocumentIndexes, ]))\n", "filteredCorpus <- corpus_subset(data_corpus, textdata$d_id %in% selectedDocumentIndexes)\n", "\n", "# show length of filtered corpus\n", "length(filteredCorpus)\n", "\n", "# look into the first post\n", "display_html(gsub(\"\\n\", \"<br/>
\", filteredCorpus[1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Frequency time series\n", "\n", "The filtered documents can be further analyzed w.r.t their diachronic distribution. We extract a time series of the filtered documents by aggregating the document's counts for years. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "docYear <- textdata$Year[textdata$d_id %in% selectedDocumentIndexes]\n", "# Second, count how many occurrences are present for each decade\n", "docsPerYear <- table(docYear)\n", "\n", "# classic way in R\n", "# plot(docsPerDecade, type = \"o\", xlab = \"Jahr\", ylab = \"Absolute frequency\", main = paste0(\"Topic \", topicNames[topicToFilter]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Determine the topic color from our ordered reference factor\n", "topicColor <- refOrdColors[refOrdColors$topicsOrd == topicNames[topicToFilter], \"colorScale\"]\n", "\n", "ggplot(data = melt(docsPerYear), aes(x = docYear, y = value)) +\n", " geom_line(aes(linetype = \"solid\"), size=1,colour=topicColor) + # define a line plot\n", " xlab(\"Decade\") + ylab(\"Count\") +\n", " ggtitle(\"Topic count per decade\") +\n", " theme(legend.position=\"none\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In case you have an uneven distribution of documents across time, it would make sense to plot relative frequencies instead of absolute counts. Relative frequencies can be obtained by dividing counts in `docsPerYear` by the number of all documents in each year (`(docsPerYear / table(textdata$Year))[names(docsPerYear)]`).\n", "\n", "Exercise: Print a new line plot with relative frequencies!\n", "\n", "## Heat-maps\n", "\n", "A line plot can become confusing when comparing many time series of topic frequencies. A visualization technique called heat-maps can be a better choice for visualization. In this method each time series is displayed as a row in a grid. The number of the columns in the grid is the same than the number of data points within the time series. A color which corresponds to to a certain value in the time series will be assigned to each cell in the grid. Through this mechanism we could easily plot many time series in parallel. Additionally, heat-map plotting can align time series that have a similar progression next to each other, such that the user can capture similar topic trajectories faster and explorative." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We create a matrix where we repeat the creation of a time series \n", "topicThreshold <- 0.1\n", "all_years <- unique(textdata$Year)\n", "hm_matrix <- matrix(0, nrow = K, ncol = length(all_years), dimnames = list(topicNames, all_years))\n", "for (k in 1:K) {\n", " selectedDocumentIndexes <- theta[, k] >= topicThreshold\n", " docYears <- textdata$Year[selectedDocumentIndexes]\n", " docsPerYear <- table(docYears)\n", " hm_matrix[k, names(docsPerYear)] <- docsPerYear\n", "}\n", "\n", "# The basic ?heatmap command of R is able to cluster similar time series and place them next to each other in the plot. In ggplot we need to do this by ourselves. 
But the plot looks prettier.\n", "# We cluster the data by the Manhattan measure and build a cluster dendrogram in order to determine the optimal sorting within the heat-map.\n", "ord <- hclust(dist(hm_matrix, method = \"manhattan\"), method = \"average\" )$order\n", "\n", "# According to the sorting we create a data.frame with the melt command and the given order by the clustering.\n", "hm_matrix_ord <- melt(hm_matrix[ord, ])\n", "colnames(hm_matrix_ord) <- c(\"topicNames\", \"Year\",\"value\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ggplot(hm_matrix_ord, aes(Year, topicNames)) + geom_tile(aes(fill = value), colour = \"grey\") + \n", " scale_fill_gradient(low = \"white\",high = \"green\") +\n", " theme_bw() +\n", " xlab(\"Year\") + ylab(\"Topic name\") +\n", " theme(legend.position=\"none\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Area plots\n", "\n", "The frequency of multiple topics can also be displayed as area plot which stacks the shares of each topic for a point in time. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We order the decade counts by topics, and then normalize them to sum to 1\n", "hm_matrix_norm <- t(hm_matrix[as.character(refOrdColors$topicsOrd), ])\n", "hm_matrix_norm <- hm_matrix_norm / rowSums(hm_matrix_norm)\n", "\n", "matrix.m <- melt(t(hm_matrix_norm))\n", "colnames(matrix.m) <- c(\"topicNames\", \"Year\", \"value\")\n", "\n", "head(matrix.m)\n", "\t\t\t\t\t \n", "# Establish a sorting of topics by tranforming the topicNames column into a factor\n", "# with a specific ordering of levels\n", "matrix.m$topicNames <- factor(matrix.m$topicNames, levels(refOrdColors$topicsOrd), ordered = T)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ggplot(matrix.m, aes(x = Year, y = value, fill = topicNames)) + \n", " geom_area() + scale_fill_manual(values = refOrdColors$colorScale)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Topic networks\n", "\n", "### Co-occurrence of primary and secondary topics\n", "\n", "According to the definition of topic models the documents contain mixtures of multiple topics. Therefore, an interesting observation is the interaction between topics, e.g. to answer the question which topics appear together in documents. In the next section we analyze co-occurring topics and visualize them in a topic-network with a co-occurrence analysis. \n", "\n", "Analogue to the term-term-matrix of co-occurrence of words we create a topic-topic matrix from the $\\theta$ variable of our model. Basic co-occurrence calculation is described in detail in @heyer_text_2006. 
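As a similarity measure we use the Dice coefficient, which is exactly what the code below computes: for two topics $z_i$ and $z_j$,\n",
 "\n",
 "$$\\text{dice}(z_i, z_j) = \\frac{2\\, k_{ij}}{k_i + k_j},$$\n",
 "\n",
 "where $k_i$ and $k_j$ count the documents in which topic $i$ or topic $j$, respectively, occurs among the two most probable topics, and $k_{ij}$ counts the documents in which both occur together. 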
As a result we will construct and visualize a graph connecting the topics in a network.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "require(igraph)\n", "# Load tm result on paragraphs from Tutorial I\n", "load(\"tmResult_documents.RData\")\n", "# Create a 0-matrix of the same dimension like theta\n", "doc_topic_matrix <- matrix(0, nrow = nrow(new_data$topics), ncol = ncol(new_data$topics))\n", "topicNames <- apply(lda::top.topic.words(new_data$terms, 5, by.score = T), 2, paste, collapse = \" \")\n", "colnames(doc_topic_matrix) <- topicNames\n", "\n", "# Count the 2 most probable topics for each paragraph\n", "sapply(1:nrow(new_data$topics), function(x){\n", " doc_topic_matrix[x, order(new_data$topics[x, ], decreasing = T)[1:2]] <<- 1\n", "})\n", "\n", "# count co-occurrence of topics\n", "topic_topic_matrix <- t(doc_topic_matrix) %*% doc_topic_matrix\n", "diag(topic_topic_matrix) <- 0\n", "\n", "# now we calculate Dice statistics to determine significant topic combinations \n", "final_topic_coocs <- matrix(0, nrow = nrow(topic_topic_matrix), ncol = ncol(topic_topic_matrix), dimnames = dimnames(topic_topic_matrix))\n", "\n", "k <- nrow(doc_topic_matrix) # number of all documents\n", "kj <- colSums(doc_topic_matrix) # number of docs containing topic j as primary/secondary\n", "names(kj) <- colnames(doc_topic_matrix)\n", "\n", "for (topicName in colnames(doc_topic_matrix)) {\n", " # retrieve numbers for statistic calculation\n", " ki <- kj[topicName]\n", " kij <- topic_topic_matrix[topicName, ]\n", " \n", " dicesig <- 2 * kij / (ki + kj)\n", " sig <- dicesig\n", " sig[is.na(sig)] <- 0\n", " final_topic_coocs[topicName,] <- sig\n", "}\n", "\n", "# Create a data.frame to produce the graph object\n", "topicGraph <- melt(final_topic_coocs)\n", "# The table must be of the form from to sig -- This is the description of all edges in the graph\n", "colnames(topicGraph) <- c(\"from\",\"to\",\"sig\")\n", "\n", "# We only use the edges with a significance of more than a defined threshold\n", "topicGraph <- topicGraph[topicGraph[, 3] > 0.07, ]\n", "\n", "# The visualization is done with the igraph package\n", "require(igraph)\n", "\n", "# We initialize the graph with our edge list and define the type of the graph as undirected\n", "graphNetwork <- graph.data.frame(topicGraph, directed = F)\n", "\n", "# The vertices get a size based on the proportion of the topic in the overall collection\n", "V(graphNetwork)$size <- colSums(new_data$topics) / sum(new_data$topics) * 200\n", "\n", "# But vertices get a minimum size \n", "V(graphNetwork)$size[V(graphNetwork)$size < 3] <- 3 \n", "\n", "# We deactivate the standard curved form of the edges in the graph and force straight lines.\n", "E(graphNetwork)$curved <- 0 \n", "\n", "# Define the thickness of the edges\n", "E(graphNetwork)$width <- 2\n", "\n", "# Definition of some margins for the plot\n", "par(mai=c(0,0,0,0)) \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot(graphNetwork,\n", " layout = layout.fruchterman.reingold,\t# Force Directed Layout\n", " main = \"Topic co-occurrence\", # Title\n", " vertex.label.family = \"sans\",\n", " vertex.shape = \"circle\",\n", " vertex.label.dist = 0.5,\t\t\t# Slightly push the labels away from the vertices\n", " vertex.frame.color = 'darkolivegreen',\n", " vertex.label.color = 'black',\t\t# Vertex label color\n", " vertex.label.font = 2,\t\t\t# font for the vertex label\n", " vertex.label = 
V(graphNetwork)$name,\t\t# content of the vertex label\n", " vertex.label.cex = 0.7 # size of the vertex label\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extended Exercise\n", "\n", "Visualize the topics and the words with the alternative rankings of tutorial I. You can copy the code for the ranking calculation and put the results to the visualizations from this tutorial. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.3.2" } }, "nbformat": 4, "nbformat_minor": 2 }