{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "< [Frequency analysis](tutorial_2.ipynb) | [Contents](index.ipynb) | [Co-occurrence analysis](tutorial_4.ipynb) >" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Key term extraction\n", "\n", "This tutorial shows how to extract key terms from document and (sub-)collections with TF-IDF and the log-likelihood statistic and a reference corpus.\n", "\n", "1. TF-IDF\n", "2. Log-likelihood ratio test\n", "3. Aggregations and visualization\n", "\n", "Like in the previous tutorial we read the CSV data file containing the job posts and preprocess the corpus object with a sequence of `quanteda` functions.\n", "\n", "This time, we also apply **lemmatization** to the corpus. Lemmatization reduces (potentially) inflected word forms to its lexical lemma to unify similar semantic forms to the same text representation. For instance, 'presidents' becomes '' and 'singing' becomes 'sing'.\n", "\n", "Finally, we create a Document-Term-Matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "options(stringsAsFactors = FALSE, digits = 3)\n", "require(quanteda)\n", "require(magrittr)\n", "require(dplyr)\n", "require(data.table)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# read csv into a data.frame\n", "textdata <- read.csv(\"data/data job posts.csv\", header = TRUE, sep = \",\", encoding = \"UTF-8\",quote = \"\\\"\")\n", "textdata <- as.data.table(textdata)\n", "\n", "english_stopwords <- readLines(\"data/stopwords_en.txt\", encoding = \"UTF-8\")\n", "\n", "textdata %<>% filter(!duplicated(jobpost))\n", "textdata %<>% mutate(d_id = 1:nrow(textdata))\n", "\n", "#Build a dictionary of lemmas\n", "lemmaData <- read.csv2(\"data/baseform_en.tsv\", sep=\"\\t\", header=FALSE, encoding = \"UTF-8\", stringsAsFactors = F)\n", "\n", "data_corpus <- corpus(textdata$jobpost, docnames = textdata$d_id)\n", "\n", "# Create a DTM (may take a while)\n", "data_dfm_entries <- data_corpus %>% tokens() %>%\n", " tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% tokens_tolower() %>% \n", " tokens_replace(., lemmaData$V1, lemmaData$V2) %>%\n", " tokens_ngrams(1) %>% tokens_remove(pattern = stopwords()) %>% dfm() \n", "\n", "\n", "data_dfm_entries_sub <- data_dfm_entries %>%\n", " dfm_select(pattern = \"[a-z]\", valuetype = \"regex\", selection = 'keep')\n", "\n", "colnames(data_dfm_entries_sub) <- colnames(data_dfm_entries_sub) %>% stringi::stri_replace_all_regex(\"[^_a-z]\", \"\") \n", "\n", "DTM <- dfm_compress(data_dfm_entries_sub, \"features\")\n", "# Show some information\n", "DTM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TF-IDF\n", "\n", "A widely used method to weight terms according to their semantic contribution to a document is the **term frequency--inverse document frequency** measure (TF-IDF). The idea is, the more a term occurs in a document, the more contributing it is. At the same time, in the more documents a term occurs, the less informative it is for a single document. The product of both measures is the resulting weight.\n", "\n", "Let us compute TF-IDF weights for all terms." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute IDF: log(N / n_i)\n", "number_of_docs <- nrow(DTM)\n", "term_in_docs <- colSums(DTM > 0)\n", "idf <- log2(number_of_docs / term_in_docs)\n", "\n", "# Compute TF\n", "jobpost <- which(textdata$Company == \"Samsung Electronics Representative Office in Armenia\")[1]\n", "tf <- as.vector(DTM[jobpost, ])\n", "\n", "# Compute TF-IDF\n", "tf_idf <- tf * idf\n", "names(tf_idf) <- colnames(DTM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last operation is to append the column names again to the resulting term weight vector. If we now sort the tf-idf weights decreasingly, we get the most important terms for the job post, according to this weight." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sort(tf_idf, decreasing = T)[1:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we would have just relied upon term frequency, we would have obtained a list of stop words as most important terms. By re-weighting with inverse document frequency, we can see a hmore senseful word list. By the way, the quanteda-package provides a convenient function for computing tf-idf weights of a given DTM: `weightTfIdf(DTM)`.\n", "\n", "## Log likelihood\n", "\n", "We now use a more sophisticated method with a comparison corpus and the log likelihood statistic.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "targetDTM <- DTM\n", "\n", "termCountsTarget <- as.vector(targetDTM[jobpost, ])\n", "names(termCountsTarget) <- colnames(targetDTM)\n", "# Just keep counts greater than zero\n", "termCountsTarget <- termCountsTarget[termCountsTarget > 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In *termCountsTarget* we have the tf for the document again.\n", "\n", "As a comparison corpus, we select a corpus from the Leipzig Corpora Collection (http://corpora.uni-leipzig.de): 30.000 randomly selected sentences from the Wikipedia of 2010. \n", "**CAUTION:** The preprocessing of the comparison corpus must be identical to the preprocessing Of the target corpus to achieve meaningful results!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lines <- data.frame(text=readLines(\"resources/eng_wikipedia_2010_30K-sentences.txt\", encoding = \"UTF-8\"))\n", "\n", "lines %<>% mutate(d_id = 1:nrow(lines))\n", "\n", "#Build a dictionary of lemmas\n", "data_corpus_reference <- corpus(lines$text, docnames = lines$d_id)\n", "\n", "# Create a DTM (may take a while)\n", "data_dfm_entries_reference <- data_corpus_reference %>% tokens() %>%\n", " tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% tokens_tolower() %>% \n", " tokens_replace(., lemmaData$V1, lemmaData$V2) %>%\n", " tokens_ngrams(1) %>% tokens_remove(pattern = stopwords()) %>% dfm() \n", "\n", "\n", "data_dfm_entries_sub_reference <- data_dfm_entries_reference %>%\n", " dfm_select(pattern = \"[a-z]\", valuetype = \"regex\", selection = 'keep')\n", "\n", "colnames(data_dfm_entries_sub_reference) <- colnames(data_dfm_entries_sub_reference) %>% stringi::stri_replace_all_regex(\"[^_a-z]\", \"\") \n", "\n", "DTM_reference <- dfm_compress(data_dfm_entries_sub_reference, \"features\")\n", "# Show some information\n", "DTM_reference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the comparison corpus, we also create a count of all terms. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "comparisonDTM <- DTM_reference\n", "termCountsComparison <- colSums(comparisonDTM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In *termCountsComparison* we now have the frequencies of all (target) terms in the comparison corpus.\n", "\n", "Let us now calculate the log-likelihood ratio test by comparing frequencies of a term in both corpora, taking the size of both corpora into account. First for a single term:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Loglikelihood for a single term\n", "term <- \"representative\"\n", "\n", "# Determine variables\n", "a <- termCountsTarget[term]\n", "b <- termCountsComparison[term]\n", "c <- sum(termCountsTarget)\n", "d <- sum(termCountsComparison)\n", "\n", "# Compute log likelihood test\n", "Expected1 = c * (a+b) / (c+d)\n", "Expected2 = d * (a+b) / (c+d)\n", "t1 <- a * log((a/Expected1))\n", "t2 <- b * log((b/Expected2))\n", "logLikelihood <- 2 * (t1 + t2)\n", "\n", "print(logLikelihood)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The LL value indicates whether the term occurs significantly more frequently / less frequently in the target counts than we would expect from the observation in the comparative counts. Specific significance thresholds are defined for the LL values:\n", "\n", "- 95th percentile; 5% level; p < 0.05; critical value = 3.84\n", "- 99th percentile; 1% level; p < 0.01; critical value = 6.63\n", "- 99.9th percentile; 0.1% level; p < 0.001; critical value = 10.83\n", "- 99.99th percentile; 0.01% level; p < 0.0001; critical value = 15.13 \n", "\n", "With R it is easy to calculate the LL-value for all terms at once. This is possible because many computing operations in R can be applied not only to individual values, but to entire vectors and matrices. For example, `a / 2` results in a single value *a divided by 2* if `a` is a single number. If `a` is a vector, the result is also a vector, in which all values are divided by 2.\n", "\n", "ATTENTION: A comparison of term occurrences between two documents/corpora is actually only useful if the term occurs in both units. Since, however, we also want to include terms which are not contained in the comparative corpus (the `termCountsComparison` vector contains 0 values for these terms), we simply add 1 to all counts during the test. This is necessary to avoid `NaN` values which otherwise would result from the log-function on 0-values during the LL test. Alternatively, the test could be performed only on terms that actually occur in both corpora.\n", "\n", "First, let's have a look into the set of terms only occurring in the target document, but not in the comparison corpus.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use set operation to get terms only occurring in target document\n", "uniqueTerms <- setdiff(names(termCountsTarget), names(termCountsComparison))\n", "# Have a look into a random selection of terms unique in the target corpus\n", "sample(uniqueTerms, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we calculate the statistics the same way as above, but with vectors. But, since there might be terms in the targetCounts which we did not observe in the comparison corpus, we need to make both vocabularies matching. For this, we append unique terms from the target as zero counts to the comparison frequency vector. 
Moreover, we use a little trick to handle zero counts in `a` or `b` when computing `t1` or `t2`. If a count is zero, the log function would produce a `NaN` value, which we want to avoid. In this case, the expression `a == 0` (resp. `b == 0`) adds 1 to the ratio, which yields 0 after applying the log function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create vector of zeros to append to comparison counts\n", "zeroCounts <- rep(0, length(uniqueTerms))\n", "names(zeroCounts) <- uniqueTerms\n", "termCountsComparison <- c(termCountsComparison, zeroCounts)\n", "\n", "# Get list of terms to compare from intersection of target and comparison vocabulary\n", "termsToCompare <- intersect(names(termCountsTarget), names(termCountsComparison))\n", "\n", "# Calculate statistics (same as above, but now with vectors!)\n", "a <- termCountsTarget[termsToCompare]\n", "b <- termCountsComparison[termsToCompare]\n", "c <- sum(termCountsTarget)\n", "d <- sum(termCountsComparison)\n", "Expected1 = c * (a+b) / (c+d)\n", "Expected2 = d * (a+b) / (c+d)\n", "t1 <- a * log((a/Expected1) + (a == 0))\n", "t2 <- b * log((b/Expected2) + (b == 0))\n", "logLikelihood <- 2 * (t1 + t2)\n", "\n", "# Compare relative frequencies to indicate over/underuse\n", "relA <- a / c\n", "relB <- b / d\n", "# underused terms are multiplied by -1\n", "logLikelihood[relA < relB] <- logLikelihood[relA < relB] * -1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the results: the 25 most overused and the 25 most underused terms, followed by a comparison of the 100 highest LL values with the corresponding term frequencies. We can see that even terms with comparatively low frequencies are identified by the LL test as statistically significant compared to the reference corpus." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# top terms (overuse in targetCorpus compared to comparisonCorpus)\n", "sort(logLikelihood, decreasing=TRUE)[1:25]\n", "# bottom terms (underuse in targetCorpus compared to comparisonCorpus)\n", "sort(logLikelihood, decreasing=FALSE)[1:25]\n", "\n", "llTop100 <- sort(logLikelihood, decreasing=TRUE)[1:100]\n", "frqTop100 <- termCountsTarget[names(llTop100)]\n", "frqLLcomparison <- data.frame(llTop100, frqTop100)\n", "frqLLcomparison" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Number of significantly overused terms (p < 0.01)\n", "sum(logLikelihood > 6.63)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregation and visualization\n", "\n", "Finally, we visualize the 50 most significant terms as a word cloud. This can be realized simply with a function of the wordcloud package. In addition to the words and their weights (here we use the LL values), we override the default scaling and color parameters. Feel free to try different parameters to modify the word cloud rendering." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "require(wordcloud)\n", "top50 <- sort(logLikelihood, decreasing = TRUE)[1:50]\n", "wordcloud(names(top50), top50, max.words = 50, scale = c(3, .9), colors = brewer.pal(8, \"Dark2\"), random.order = F)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Key term extraction can be applied not only to single documents, but also to entire (sub-)corpora. Depending on the comparison corpus, the results may vary. 
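The for-loop below loads a helper function `calculateLogLikelihood` via the `source` command. Since the file `calculateLogLikelihood.R` itself is not printed in this tutorial, the following fenced sketch is only a guess at what such a function could look like, modeled on the vectorized computation above; it is an assumption, not the actual contents of the file.\n\n```r\n# Hypothetical sketch of calculateLogLikelihood.R (assumption, not the distributed file)\ncalculateLogLikelihood <- function(termCountsTarget, termCountsComparison) {\n  # align the comparison counts to the target vocabulary; missing terms become 0\n  a <- termCountsTarget\n  b <- termCountsComparison[names(termCountsTarget)]\n  b[is.na(b)] <- 0\n  c <- sum(termCountsTarget)\n  d <- sum(termCountsComparison)\n  Expected1 <- c * (a + b) / (c + d)\n  Expected2 <- d * (a + b) / (c + d)\n  t1 <- a * log((a / Expected1) + (a == 0))\n  t2 <- b * log((b / Expected2) + (b == 0))\n  ll <- 2 * (t1 + t2)\n  # a negative sign marks underuse relative to the comparison counts\n  underused <- (a / c) < (b / d)\n  ll[underused] <- -ll[underused]\n  ll\n}\n```\n\n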
Instead of comparing a single document to a Wikipedia corpus, we now compare the collection of job posts of a single company to the job posts of all other companies.\n", "\n", "For this, we iterate over all distinct company names using a for-loop. Within the loop, we use a logical vector (Boolean TRUE/FALSE values) to split the DTM into two sub-matrices: the rows of the currently selected company and the rows of all other companies. From these matrices, the target and comparison frequency counts are created. The statistical computation of the log-likelihood measure from above has been moved into the function `calculateLogLikelihood` (a possible implementation was sketched above), which we load with the `source` command at the beginning of the block. The function takes both frequency vectors as input parameters and returns an LL value for each term of the target vector.\n", "\n", "The results of the LL key term extraction are visualized again as word clouds. Instead of plotting into RStudio, this time we write each visualization to disk as a PDF file whose name starts with `wordclouds-`. After the for-loop has completed, the working directory should contain numerous word cloud PDFs, one for each company." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "source(\"calculateLogLikelihood.R\")\n", "\n", "companies <- unique(textdata$Company)\n", "for (company in companies) {\n", " \n", " cat(\"Extracting terms for company\", company, \"\\n\")\n", " \n", " selector_logical_idx <- textdata$Company == company\n", " \n", " companyDTM <- targetDTM[selector_logical_idx, ]\n", " termCountsTarget <- colSums(companyDTM)\n", " \n", " otherDTM <- targetDTM[!selector_logical_idx, ]\n", " termCountsComparison <- colSums(otherDTM)\n", " \n", " loglik_terms <- calculateLogLikelihood(termCountsTarget, termCountsComparison)\n", " \n", " top50 <- sort(loglik_terms, decreasing = TRUE)[1:50]\n", " \n", " # Open a PDF device so the word cloud is written to disk\n", " fileName <- paste0(\"wordclouds-\", company, \".pdf\")\n", " pdf(fileName, width = 9, height = 7)\n", " wordcloud(names(top50), top50, max.words = 50, scale = c(3, .9), colors = brewer.pal(8, \"Dark2\"), random.order = F)\n", " dev.off()\n", " \n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.3.2" } }, "nbformat": 4, "nbformat_minor": 2 }