{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "< [Co-occurrence analysis](tutorial_4.ipynb) | [Contents](index.ipynb) | [Model application and visualization](tutorial_6.ipynb) >" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model inference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial the application and the basic operations of the topic modelling packages in R are introduced. We apply the LDA model from [1] to a publicly available dataset: the Job postings from the Kaggle page. \n", "\n", "We introduce different observations which can be derived from the posterior distributions of the LDA model. These basic observations allow to interpret and quantify semantically coherent topics within the data. It is also possible to compare the topical composition of a text source with other sources.\n", "\n", "First, we start with initializing our text data set, creating a quanteda-corpus object from it, apply some pre-processing measures, and finally create a Document-Term-Matrix (DTM). For the model inferences we use the `topicmodels`-package of R [2]." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Setup and initial package import\n", "options(stringsAsFactors = FALSE)\n", "require(quanteda)\n", "require(magrittr)\n", "require(dplyr)\n", "require(topicmodels)\n", "require(lda)\n", "require(openNLP)\n", "require(NLP)\n", "require(data.table)\n", "# Read stop word file\n", "english_stopwords <- readLines(\"resources/stopwords_en.txt\", encoding = \"UTF-8\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We read a CSV containing 19.000 \"Job Description\". The texts are freely available from https://www.kaggle.com/madhab/jobposts. Our CSV file has the format: `\"jobpost\";\"date\";\"Title\"; etc.`..\n", "\n", "For linguistic preprocessing to unify the vocabulary, we can use either stemming or lemmatization. Stemming can be done by rules for a given language (use the `stemDocument` function). We will use a simple lemmatization on the basis of a lookup table (`resources/baseform_en.tsv`)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read text data from CSV file\n", "textdata <- read.csv(\"data/data job posts.csv\", header = TRUE, sep = \",\", encoding = \"UTF-8\",quote = \"\\\"\")\n", "textdata <- as.data.table(textdata)\n", "\n", "english_stopwords <- readLines(\"resources/stopwords_en.txt\", encoding = \"UTF-8\")\n", "\n", "textdata %<>% filter(!duplicated(jobpost))\n", "textdata %<>% mutate(d_id = 1:nrow(textdata))\n", "\n", "save(textdata,file=\"textdata.RData\")\n", "\n", "#Build a dictionary of lemmas\n", "lemmaData <- read.csv2(\"resources/baseform_en.tsv\", sep=\"\\t\", header=FALSE, encoding = \"UTF-8\", stringsAsFactors = F)\n", "\n", "data_corpus <- corpus(textdata$jobpost, docnames = textdata$d_id)\n", " \n", "data_dfm_entries <- data_corpus %>% tokens() %>% tokens_remove(pattern = c(stopwords(),english_stopwords)) %>%\n", " tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% tokens_tolower() %>% \n", " tokens_replace(., lemmaData$V1, lemmaData$V2) %>%\n", " tokens_ngrams(1) %>% dfm() \n", "\n", "\n", "data_dfm_entries_sub <- data_dfm_entries %>% dfm_trim(min_docfreq = 5) %>%\n", " dfm_select(pattern = \"[a-z]\", valuetype = \"regex\", selection = 'keep')\n", "\n", "colnames(data_dfm_entries_sub) <- colnames(data_dfm_entries_sub) %>% stringi::stri_replace_all_regex(\"[^_a-z]\", \"\") \n", "\n", "data_dfm_entries_sub <- dfm_compress(data_dfm_entries_sub, \"features\")\n", "\n", "data_dfm_entries_sub <- \n", " data_dfm_entries_sub[rowSums(data_dfm_entries_sub) >=10, nchar(colnames(data_dfm_entries_sub)) > 1]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we save the corpus objects for later use and create the DTM. We keep only terms occurring more than 5 times." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "save(data_corpus, file = \"corpus.RData\")\n", "\n", "# RENAME\n", "DTM <- data_dfm_entries_sub" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We save our DTM for later use in other tutorials, so we do not need to compute it all over again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# have a look at the number of documents and terms in the matrix\n", "dim(DTM)\n", "save(DTM, file = \"DTM.RData\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## LDA Model inference\n", "\n", "The inference of topic models aims for the separation of a document collection into a fixed set of topics. Topic models are unsupervised machine learning algorithms. Thus, they are suited best for exploring data. It is useful to experiment with different parameters to find the optimal setting for the intended analysis.\n", "\n", "For parametrized models like the Latent Dirichlet Allocation the number of topics K has to be set in advance. The decision for an optimal K is dependent on different factors. If K is too low the collection of documents is separated into few very coarse semantic coherent topics. If K is too large the result is a separation into very detailed topics which are hard to distinguish or interpret. \n", "\n", "As a start we choose a manually overseeable number of 20 topics. The other hyper-parameters relevant for the inference process are set to `alpha = 20` as prior for the topic-distributions, and beta to an automatically estimated value based on the given data. The inference is run with Gibbs sampling for 500 iterations." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# number of topics\n", "K <- 20\n", "\n", "selector <- textdata %>% filter(Company != \"Samsung Electronics Representative Office in Armenia\") %>% select(\"d_id\")\n", "selector <- rownames(DTM) %in% as.character(selector$d_id)\n", "\n", "# compute the LDA model, inference via 500 iterations of Gibbs sampling\n", "topicModel <- LDA(DTM[selector,], K, method=\"Gibbs\", control=list(iter = 500, verbose = 20, alpha = 0.2, estimate.beta = TRUE))\n", "save(topicModel, file = \"topicModel.RData\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The inference can take quite long, dependent on the size of the vocabulary, the number of documents and the setting of K. Our calculation may take a few minutes. If an inference process takes too long on your computer, you can reduce the size of the vocabulary by raising the minimum frequency passed to the creation process of the document term matrix or reduce the number of iterations.\n", "\n", "After running the inference process for a while, two posterior distributions can be calculated. The first, called `theta`, represents the distributions of all topics K w.r.t. each document. The second posterior parameter, called `beta` represents the distribution of all terms V w.r.t. each topic. V represents the length of the vocabulary (`r ncol(DTM)`). Let's take a closer look:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# have a look to some of the results (posterior distributions)\n", "tmResult <- posterior(topicModel)\n", "save(tmResult, file = \"tmResult.RData\") #we save the posterior for later use\n", "# format of the resulting object\n", "attributes(tmResult)\n", "\n", "ncol(DTM) # lengthOfVocab\n", "# topics are probability distributions over the entire vocabulary\n", "beta <- tmResult$terms # get beta from results\n", "dim(beta) # K distributions over nTerms(DTM) terms\n", "rowSums(beta) # rows in beta sum to 1\n", "nrow(DTM) # size of collection \n", "# for every document we have a probaility distribution of its contained topics\n", "theta <- tmResult$topics \n", "dim(theta) # nDocs(DTM) distributions over K topics\n", "rowSums(theta)[1:10] # rows in theta sum to 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ranking of words\n", "\n", "Now let us look at the 10 **most probable words** for each topic within the inferred `beta` distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "terms(topicModel, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of numbers we would like to set descriptive names for the topics in the next step. For this reason, we concatenate the top 5 probable words of each topic into a pseudo-name for each topic. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the function terms extracts the most likely terms for each topic\n", "top5termsPerTopicProb <- terms(topicModel, 5)\n", "topicNames <- apply(top5termsPerTopicProb, 2, paste, collapse=\" \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often, the ranking by probability returns words which might be very frequent in a topic but, at the same time, are not very distinctive. This might be the case for high frequent words such as \"year\", \"job\" or \"qualification\" that are no stopwords, but relevant for many topics. 
An alternative to probability-based ranking is ranking the words by a tf-idf-like mechanism. In this procedure, words are penalized if they appear in many topics with high probability.\n", "\n", "### Ranking by score (lda package)\n", "\n", "This mechanism ranks the words according to a score defined by [3]:\n", "\n", "$$score(w, t) = p(w|t) \cdot \left(\log p(w|t) - \frac{1}{K} \sum_{t'} \log p(w|t')\right).$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The lda package function top.topic.words needs a K x V matrix\n", "dim(tmResult$terms) # the output from the topicmodels package has the right dimension\n", "\n", "# The dimensions are compatible, so we can pass beta directly into the function\n", "top5termsPerTopicScore <- top.topic.words(beta, num.words = 5, by.score = T)\n", "topicNamesByScore <- apply(top5termsPerTopicScore, 2, paste, collapse = \" \")\n", "print(topicNamesByScore)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, for instance, the term \"year\", which occurred `r sum(unlist(top5termsPerTopicProb) == \"year\")` time(s) among the top 5 terms ranked by probability, only occurs `r sum(unlist(top5termsPerTopicScore) == \"year\")` time(s) using the score ranking.\n", "\n", "### Ranking by score II (LDAvis package)\n", "\n", "This alternative scoring strategy ranks the words according to a relevance score defined by [4]:\n", "\n", "$$relevance(w|t) = \lambda \cdot \log p(w|t) + (1 - \lambda)\cdot \log \frac{p(w|t)}{p(w)}.$$\n", "\n", "The parameter $\lambda$, ranging between 0 and 1, influences the scoring by penalizing terms that are not only probable in a specific topic but also in the entire collection. $\lambda = 1$ simply ranks a term according to its probability within the topic. $\lambda$ near 0 puts emphasis on the exclusivity of a term for a topic. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This formula is not readily available in the packages we use here,\n", "# so we implement it on our own\n", "scoreByLambda <- function(p_w_t, num.words = 5, p_w, lambda)\n", "{\n", " # for each topic (row), combine topic-specific probability and lift\n", " apply(p_w_t, 1, function(x, num.words, p_w, lambda) {\n", " x <- lambda * log(x) + (1 - lambda) * log(x / p_w)\n", " return(names(sort(x, decreasing = T)[1:num.words]))\n", " }, num.words, p_w, lambda)\n", "}\n", "\n", "# empirical term probabilities in the entire collection\n", "p_w <- slam::col_sums(DTM) / sum(DTM)\n", "top5termsPerTopicScoreII <- scoreByLambda(tmResult$terms, num.words = 5, p_w, lambda = 0.6)\n", "topicNamesByScoreII <- apply(top5termsPerTopicScoreII, 2, paste, collapse = \" \")\n", "print(top5termsPerTopicScoreII)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Sievert and Shirley [4] suggest $\lambda = 0.6$ as a good value to balance the two terms of the scoring formula.\n", "\n", "## Ranking of topics\n", "\n", "Which topics are prominent in a document collection? 
Multiple approaches exist to determine a ranking of the inferred topics.\n", "\n", "### Approach 1\n", "\n", "We sort the topics by their probability in the whole collection:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# What are the most probable topics in the entire collection?\n", "topicProportions <- colSums(theta) / sum(theta) # mean probabilities over all documents\n", "names(topicProportions) <- topicNamesByScoreII # assign the topic names we created before\n", "sort(topicProportions, decreasing = TRUE) # show proportions in decreasing order" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some topics have a noticeably higher probability; these tend to describe rather general contexts. Other topics represent tangible, clearly interpretable subjects. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Approach 2\n", "\n", "We count how often a topic appears as the primary topic of a document. This method is called *Rank-1*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "countsOfPrimaryTopics <- rep(0, K)\n", "names(countsOfPrimaryTopics) <- topicNamesByScoreII\n", "for (i in 1:dim(theta)[1]) {\n", " topicsPerDoc <- theta[i, ] # select topic distribution for document i\n", " # get the position of the most probable topic\n", " primaryTopic <- order(topicsPerDoc, decreasing = TRUE)[1] \n", " countsOfPrimaryTopics[primaryTopic] <- countsOfPrimaryTopics[primaryTopic] + 1\n", "}\n", "sort(countsOfPrimaryTopics, decreasing = TRUE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As one can see, the Rank-1 metric moves some very specific topics to the upper ranks. This sorting can be used for further analysis steps such as the semantic interpretation of the topics in the documents, time series analysis of topics, or the filtering of documents w.r.t. specific topics. \n", "\n", "## Predict topics on new data\n", "\n", "Once a topic model has been created, we can use its posterior to infer the topic composition of a previously unseen document collection w.r.t. our reference topic model. This can be very helpful in case topics should be compared with respect to a reference collection. In another scenario, where a very large corpus should be modelled, we could calculate a model on a smaller sample of documents and then use it to infer topics for the remaining out-of-sample documents.\n", "\n", "IMPORTANT: The matrix for the out-of-sample documents must have the same dimensions, i.e. the same vocabulary, as the matrix from which the model was created. 
Terms of new documents that are not part of this vocabulary are dropped, which can lead to empty or very short documents.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Select the documents we held out during model inference\n", "selector <- textdata %>% filter(Company == \"Samsung Electronics Representative Office in Armenia\") %>% select(\"d_id\")\n", "selector <- rownames(DTM) %in% as.character(selector$d_id)\n", "\n", "# Infer topic proportions for the held-out documents\n", "new_data <- posterior(topicModel, DTM[selector, ])\n", "save(new_data, file = \"tmResult_documents.RData\")\n", "\n", "# Proportions of topics\n", "topicProportionsNewData <- colSums(new_data$topics) / sum(new_data$topics)\n", "names(topicProportionsNewData) <- topicNamesByScoreII \n", "sort(topicProportionsNewData, decreasing = TRUE)\n", "\n", "\n", "# Rank-1 metric\n", "countsOfPrimaryTopicsNewData <- rep(0, K)\n", "names(countsOfPrimaryTopicsNewData) <- topicNamesByScoreII\n", "for (i in 1:nrow(DTM[selector, ])) {\n", " topicsPerDoc <- new_data$topics[i, ] \n", " primaryTopic <- order(topicsPerDoc, decreasing = TRUE)[1] \n", " countsOfPrimaryTopicsNewData[primaryTopic] <- countsOfPrimaryTopicsNewData[primaryTopic] + 1\n", "}\n", "sort(countsOfPrimaryTopicsNewData, decreasing = TRUE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use POS-tagging for pre-processing\n", "\n", "Very frequent words can impede both the modelling itself and the interpretation of a topic model result. For this reason, filtering stop words beforehand is usually a very good idea. \n", "\n", "Furthermore, it can be a good strategy to filter out specific word types such as verbs or adjectives. In the next code block, we process the corpus with POS-tagging and only use common and proper nouns to calculate the model. \n", "\n", "First, we create some functions to run the tagging.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Function to split a document into sentences and return it as one POS-tagged string\n", "convert_text_to_sentences_postag <- function(text, SentModel = NULL, TokModel = NULL, POSModel = NULL) {\n", " # Convert to NLP::String object\n", " text <- NLP::as.String(text)\n", " # Annotate sentence boundaries\n", " sentenceBoundaries <- NLP::annotate(text, SentModel)\n", " # Select the sentences out of the original text object using the annotations\n", " sentences <- text[sentenceBoundaries]\n", " # Pass sentences to the POS annotator\n", " sentences <- sapply(sentences, processPos, TokModel = TokModel, POSModel = POSModel)\n", " document <- paste(sentences[\"POStagged\", ], collapse = \" \")\n", " # return the tagged document\n", " return(document)\n", "}\n", "\n", "processPos <- function(sentence, TokModel = NULL, POSModel = NULL) {\n", " sentence <- as.String(sentence)\n", " # Add sentence annotation\n", " a1 <- Annotation(1L, \"sentence\", 1L, nchar(sentence))\n", " # Add token annotation\n", " a2 <- NLP::annotate(sentence, TokModel, a1)\n", " # Add POS annotation\n", " a3 <- NLP::annotate(sentence, POSModel, a2)\n", " # Select words from annotation list\n", " a3w <- a3[a3$type == \"word\"]\n", " # Select POS tags from annotation list\n", " POStags <- unlist(lapply(a3w$features, `[[`, \"POS\"))\n", " # Concatenate word/tag pairs for all words in the sentence\n", " POStagged <- paste(sprintf(\"%s/%s\", sentence[a3w], POStags), collapse = \" \")\n", " # return result as list\n", " list(POStagged = POStagged, POStags = POStags)\n", "}\n", "\n", "# Function to convert a vector of documents into a vector of POS-tagged documents\n", "postag_text_source <- function(text, ...) 
{\n", " # load models\n", " WTA <- Maxent_Word_Token_Annotator(model = \"resources/en-token.bin\")\n", " STA <- Maxent_Sent_Token_Annotator(model = \"resources/en-sent.bin\")\n", " PTA <- Maxent_POS_Tag_Annotator(model = \"resources/en-pos-maxent.bin\")\n", " # create txt-progress bar to follow progress of pos tagging\n", " pb <- txtProgressBar(min = 0, max = length(text))\n", " i <- 0\n", " # loop over documents to extract sentences and pos-tag them\n", " docs <- sapply(text, FUN=function(x) {\n", " i <<- i + 1\n", " setTxtProgressBar(pb, i)\n", " convert_text_to_sentences_postag(x, SentModel = STA, TokModel = WTA, POSModel = PTA)\n", " }, ...)\n", " close(pb)\n", " return(docs)\n", "}\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we run the tagging. This may take a while. Consider loading the precomuted results with `load(\"pos_result.RData\")`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pos_result <- postag_text_source(textdata$jobpost)\n", "save(pos_result, file = \"pos_result.RData\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the results. All tokens in the first document should now have a Part-od-Speech-Tag in the end. This allows subsequent processing steps to distinguish, for instance, the term \"state\" as noun (\"state/nn\") from state as verb (\"state/vb\")." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "substr(pos_result[[1]], 0, 300)\n", "textdata$POS_TEXT <- pos_result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can use this annotated text corpus to create a new DTM. From this DTM, we remove all tokens, which do not belong into the NN(S) classes for nouns, or NNP(S) classes for proper nouns. Finally, we compute a new topic model." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create corpus object\n", "data_corpus_pos <- corpus(textdata$POS_TEXT, docnames = textdata$d_id)\n", " \n", "\n", "data_dfm_entries_pos <- data_corpus_pos %>% tokens(what = \"fasterword\") %>% tokens_tolower() %>%\n", " tokens_ngrams(1) %>% dfm() \n", "\n", "\n", "data_dfm_entries_sub_pos <- data_dfm_entries_pos %>% dfm_trim(min_docfreq = 5) \n", "\n", "colnames(data_dfm_entries_sub_pos) <- colnames(data_dfm_entries_sub_pos) %>% stringi::stri_replace_all_regex(\"[^_/a-z]\", \"\") \n", "\n", "data_dfm_entries_sub_pos <- dfm_compress(data_dfm_entries_sub_pos, \"features\")\n", "\n", "data_dfm_entries_sub_pos <- \n", " data_dfm_entries_sub_pos[rowSums(data_dfm_entries_sub_pos) >=10, nchar(colnames(data_dfm_entries_sub_pos)) > 1]\n", "\n", "DTM_pos <- data_dfm_entries_sub_pos\n", "\n", "# Remove unwanted pos-tags by using a regex query to the DTM\n", "idx <- grep(colnames(DTM_pos), perl = T, pattern = \"(/nns?|/nnps?)$\")\n", "\n", "DTM_pos <- DTM_pos[, idx]\n", "\n", "#This needs to be an extra step in order to work properly\n", "DTM_pos <- DTM_pos[rowSums(DTM_pos) > 0, ]\n", "\n", "# number of topics\n", "K <- 30\n", "\n", "# compute the LDA model, inference via 500 iterations of Gibbs sampling\n", "topicModel_pos <- LDA(DTM_pos, K, method = \"Gibbs\", control = list(iter = 500, verbose = 100, alpha = 0.1, estimate.beta = TRUE))\n", "top5termsPerTopicScore <- top.topic.words(posterior(topicModel_pos)$terms, num.words = 5, by.score = T)\n", "print(top5termsPerTopicScore)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the topic terms consist only (proper) nouns and might be easier to interpret." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "1. Blei, D., Ng, A., Jordan, M.: Latent dirichlet allocation. The Journal of Machine Learning Research. 3, 993–1022 (2003).\n", "\n", "2. Hornik, K., Grün, B.: Topicmodels: An R package for fitting topic models. Journal of Statistical Software. 40, 1–30 (2011).\n", "\n", "3. Chang, J., Chang, M.J.: Package “lda”. Citeseer (2010).\n", "\n", "4. Sievert, C., Shirley, K.E.: LDAvis: A method for visualizing and interpreting topics. In: Proceedings of the workshop on interactive language learning, visualization, and interfaces. pp. 63–70 (2014)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.3.2" } }, "nbformat": 4, "nbformat_minor": 2 }