{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://github.com/MartinSchweinberger/SLAT7829/blob/master/images/bannerSLAT7829.jpeg?raw=true)\n", "\n", "# Topic modelling\n", "\n", "This tutorial show how you can perform topic modeling.\n", "\n", "However, please keep in mind that the case studies merely aim to exemplify ways in which R can be used in language-based research - rather than providing detailed procedures on how to do corpus-based research. \n", "\n", "Topic modeling is a technique used in data analysis to identify and extract topics or themes from a large collection of texts. It is a way to automatically identify patterns and insights in large amounts of unstructured data.\n", "\n", "Topic modeling is useful in a variety of fields, such as social sciences, humanities, and marketing research. For example, it can be used to analyze customer reviews to identify common themes, to study political speeches to identify key issues and topics, or to analyze social media data to understand public opinion on a particular topic.\n", "\n", "The technique works by analyzing the frequency of words that appear in a text corpus and grouping them into topics that frequently co-occur. These topics can then be interpreted and labeled based on the words that are most strongly associated with them. The result is a set of topics that represent the most important themes in the text corpus.\n", "\n", "## Preparation and session set up\n", "\n", "Activate required packages.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load packages\n", "library(here)\n", "library(tidyr)\n", "library(quanteda.textstats)\n", "library(quanteda.textplots)\n", "library(seededlda)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading data\n", "\n", "To perform topic modelling, we first need to load some data. In this tutorial, we will use the essays written by German and Spanish learners of English provided in the International Corpus of Learner English (ICLE).\n", "\n", "Loading corpus data into R consists of two steps: \n", "\n", "1. create a list of paths of the corpus files\n", "\n", "2. loop over these paths and load the data in the files identified by the paths.\n", "\n", "To create a list of corpus files, you could use the code chunk below (the code chunk assumes that the ICLE data is in a folder called *ICLE*).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load ace files\n", "corpusfiles <- list.files(here::here(\"ICLE\"), # path to the corpus data\n", " pattern = \"GE|SP\",\n", " # full paths - not just the names of the files\n", " full.names = T) \n", "# load the files by scanning the content\n", "corpus <- sapply(corpusfiles, function(x){\n", " x <- scan(x, what = \"char\", sep = \"\", quote = \"\", quiet = T, skipNul = T)\n", " x <- paste0(x, sep = \" \", collapse = \" \")\n", " x <- stringr::str_squish(x)\n", "})\n", "# inspect\n", "str(corpus)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "\n", "## Using your own data\n", "\n", "
\n", "\n", "

\n", "To use your own data<\/b>, click on the folder called `MyTexts`<\/b> (it is in the menu to the left of the screen) and then simply drag and drop your txt-files into the folder.
When you then execute the code chunk below, you will upload your own data and you can then use it in this notebook.
You can upload only txt-files<\/b> (simple unformatted files created in or saved by a text editor)!
The notebook assumes that you upload some form of text data - not tabular data! \n", "
\n", "<\/p>\n", "

\n", "<\/p><\/span>\n", "<\/div>\n", "\n", "
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load function that helps loading texts\n", "source(\"https://slcladal.github.io/rscripts/loadtxts.R\")\n", "# load texts\n", "text <- loadtxts(\"notebooks/MyTexts\")\n", "# inspect the structure of the text object\n", "str(text)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "## Cleaning and tokenising\n", "\n", "We start by cleaning the corpus data by removing tags, artefacts and non-alpha-numeric characters.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corpus_clean <- \n", " # IF YOU ARE USING YOUR OWN DATA: \n", " # REPLACE \"corpus\" WITH \"text\" IN THE LINE BELOW\n", " stringr::str_remove_all(corpus, \"<.*?>\") %>%\n", " # remove superfluous white spaces\n", " stringr::str_squish()\n", "# inspect\n", "substr(corpus_clean[1], start=1, stop=200)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now split the clean corpora into individual words.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "toks_corpus <- tokens(corpus, remove_punct = TRUE, remove_numbers = TRUE, remove_symbol = TRUE)\n", "toks_corpus <- tokens_remove(toks_corpus, pattern = c(stopwords(\"en\"), \"*-time\", \"updated-*\", \"gmt\", \"bst\"))\n", "dfmat_corpus <- dfm(toks_corpus) %>% \n", " dfm_trim(min_termfreq = 0.8, termfreq_type = \"quantile\",\n", " max_docfreq = 0.1, docfreq_type = \"prop\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Unsupervised LDA\n", "\n", "Now that we have cleaned the data, we can perform the topic modelling. This consists of two steps:\n", "\n", "1. First, we perform an unsupervised LDA. We do this to check what topics are in our corpus. \n", "\n", "2. Then, we perform a supervised LDA (based on the results of the unsupervised LDA) to identify meaningful topics in our data. For the supervised LDA, we define so-called *seed terms* that help in generating coherent topics.\n", "\n", "Here we look for 15 topics but we would vary the number of topics (k) to check what topics are in our data.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set seed\n", "set.seed(1234)\n", "# generate model: change k to different numbers, e.g. 10 or 20 and look for consistencies in the keywords for the topics below.\n", "tmod_lda <- seededlda::textmodel_lda(dfmat_corpus, k = 15)\n", "# inspect\n", "terms(tmod_lda, 10)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Supervised LDA\n", "\n", "Now, we perform a supervised LDA. Here we use the keywords extracted based on the unsupervised LDA as *seed terms* for topics to create coherent topics.\n", "\n", "

\n", "\n", "

\n", "IMPORTANT<\/b>: If you are using your own data, you need to change and adapt the topics and keywords defined below (simply replace the topics and seed terms with your own topics and seed terms (based on the results of the unsupervised LDA!). \n", "
\n", "<\/p>\n", "<\/span>\n", "<\/div>\n", "\n", "
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# semisupervised LDA\n", "dict <- dictionary(list(Computer = c(\"computers\", \"information\", \"machine\", \"computer\"),\n", " Education = c(\"students\", \"courses\", \"education\", \"university\"),\n", " Movies = c(\"movie\", \"film\", \"commercial\", \"watch\"),\n", " Family = c(\"parents\", \"home\", \"mother\", \"father\"),\n", " War = c(\"war\", \"peace\", \"somalia\", \"consequences\"),\n", " Foreigners = c(\"foreigners\", \"germans\", \"turkish\", \"turks\"),\n", " Phone = c(\"phone\", \"telephone\", \"call\"),\n", " Food = c(\"mcdonald's\", \"restaurant\", \"chips\", \"fastfood\", \"taste\"),\n", " Pets = c(\"dog*\", \"walk\", \"cat\", \"pet\", \"happy\"),\n", " Eco = c(\"green\", \"car*\", \"drive\", \"speed\", \"accident\", \"exhaust\"),\n", " Dating = c(\"girl\", \"boy\", \"date\", \"merries\")))\n", "tmod_slda <- textmodel_seededlda(dfmat_corpus, dict, residual = TRUE, min_termfreq = 10)\n", "terms(tmod_slda)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now inspect topics by file. This shows what text contains which topic.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "topics(tmod_slda)[1:20]\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can extract files and create a data frame of topics and documents. This shows what topic is dominant in which file in tabular form. \n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "files <- stringr::str_replace_all(names(topics(tmod_slda)), \".*/(.*?).txt\", \"\\\\1\")\n", "topics <- topics(tmod_slda)\n", "language <- ifelse(stringr::str_detect(files, \"GE\"), \"German\", \"Spanish\")\n", "df <- data.frame(language, topics) %>%\n", " dplyr::mutate_if(is.character, factor)\n", "# inspect\n", "head(df)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exporting data\n", "\n", "To export a data frame as an MS Excel spreadsheet, we use `write_xlsx`. Be aware that we use the `here` function to save the file in the current working directory.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# save data for MyOutput folder\n", "write_xlsx(dfp, here::here(\"notebooks/MyOutput/df.xlsx\"))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "\n", "

\n", "You will find the generated MS Excel spreadsheet named \"df.xlsx\" in the `MyOutput` folder (located on the left side of the screen).<\/b>

Simply double-click the `MyOutput` folder icon, then right-click on the \"df.xlsx\" file, and choose Download from the dropdown menu to download the file.
\n", "<\/p>\n", "

\n", "<\/p><\/span>\n", "<\/div>\n", "\n", "
\n", "\n", "\n", "To visualize the results, we summaries the table to show the percentage of topics by language.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfp <- df %>%\n", " dplyr::group_by(language, topics) %>%\n", " dplyr::summarise(freq = n()) %>%\n", " dplyr::group_by(language) %>%\n", " dplyr::mutate(all = sum(freq),\n", " percent = round(freq/all*100, 2))\n", "# inspect\n", "head(dfp)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a final step, we visualize the results.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfp %>%\n", " ggplot(aes(x = topics, y = percent, label = percent, fill = language)) +\n", " geom_bar(stat = \"identity\", position = position_dodge()) + \n", " geom_text(vjust=-0.3, position = position_dodge(0.9)) + \n", " theme_bw() +\n", " coord_cartesian(ylim = c(0, 30)) +\n", " labs(x = \"Topic\", y = \"Percent\") +\n", " theme(legend.position = \"top\",\n", " axis.text.x = element_text(angle = 90))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exporting images\n", "\n", "To export network graph as an png-file, we use `ggsave`. Be aware that we use the `here` function to save the file in the `MyOutput` folder.\n", "\n", "The `ggsave` function has the following main arguments:\n", "\n", "+ `filename`: File name to create on disk. \n", "+ `device`: Device to use. Can either be a device function (e.g. png), or one of \"eps\", \"ps\", \"tex\" (pictex), \"pdf\", \"jpeg\", \"tiff\", \"png\", \"bmp\", \"svg\" or \"wmf\" (windows only). If NULL (default), the device is guessed based on the filename extension \n", "+ `path`: Path of the directory to save plot to: path and filename are combined to create the fully qualified file name. Defaults to the working directory. \n", "+ `width, height`: Plot size in units expressed by the units argument. If not supplied, uses the size of the current graphics device. \n", "+ `units`: One of the following units in which the width and height arguments are expressed: \"in\", \"cm\", \"mm\" or \"px\". \n", "+ `dpi`: Plot resolution. Also accepts a string input: \"retina\" (320), \"print\" (300), or \"screen\" (72). Applies only to raster output types. \n", "+ `bg`: Background colour. If NULL, uses the plot.background fill value from the plot theme. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# save network graph for MyOutput folder\n", "ggsave(here::here(\"notebooks/MyOutput/image_01.png\"), bg = \"white\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", "\n", "

\n", "You will find the image-file named *image_01.png* in the `MyOutput` folder (located on the left side of the screen).<\/b>

Simply double-click the `MyOutput` folder icon, then right-click on the *image_01.png* file, and choose Download from the dropdown menu to download the file.
\n", "<\/p>\n", "

\n", "<\/p><\/span>\n", "<\/div>\n", "\n", "
\n", "\n", "\n", "\n", "## Outro\n", "\n", "\n", "We end the session by calling the session info which tells us what packages and what version of the software and packages we have used.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sessionInfo()\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "[Back to HOME](https://github.com/MartinSchweinberger/SLAT7829Tutorials)\n", "\n", "***\n" ] } ], "metadata": { "anaconda-cloud": "", "kernelspec": { "display_name": "R", "langauge": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.4.1" } }, "nbformat": 4, "nbformat_minor": 1 }