{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://github.com/MartinSchweinberger/SLAT7829/blob/master/images/bannerSLAT7829.jpeg?raw=true)\n", "\n", "# Case Study: Swearing in Irish English\n", "\n", "This tutorial presents a case study on swearing in Irish English based on the Irish component of the *International Corpus of English* (ICE-Ireland, Version 1.2.2).\n", "\n", "This case study represents a corpus-based study of sociolinguistic variation that aims to answer if swearing differs across social groups in Irish English. In particular, this case study analyzes if speakers coming from different age groups and genders, i.e. whether old or young or men or women, differ in their use of swear words based on a sub-sample of the Irish component of the [International Corpus of English (ICE)](https://www.ice-corpora.uzh.ch/en.html). .\n", "\n", "\n", "## Preparation and session set up\n", "\n", "Activate required packages.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load packages\n", "library(dplyr)\n", "library(stringr)\n", "library(ggplot2)\n", "library(openxlsx)\n", "library(quanteda)\n", "library(cfa)\n", "library(here)\n", "library(tidyr)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the corpus data\n", "\n", "Loading corpus data consists of two steps: \n", "\n", "1. create a list of paths of the corpus files\n", "\n", "2. loop over these paths and load the data in the files identified by the paths.\n", "\n", "To create a list of corpus files, you could use the code chunk below (the code chunk assumes that the corpus data is in a folder called *Corpus* in the *data* sub-folder of your Rproject folder).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corpusfiles <- list.files(here::here(\"ICEIreland\"), # path to the corpus data\n", " # file types you want to analyze, e.g. txt-files\n", " pattern = \".*.txt\",\n", " # full paths - not just the names of the files\n", " full.names = T) \n", "# inspect\n", "head(corpusfiles)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can then use the `sapply` function to loop over the paths and load the data int R using e.g. the `scan` function as shown below. In addition to loading the file content, we also paste all the content together using the `paste0` function and remove superfluous white spaces using the `str_squish` function from the `stringr` package.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corpus <- sapply(corpusfiles, function(x){\n", " x <- scan(x, \n", " what = \"char\", \n", " sep = \"\", \n", " quote = \"\", \n", " quiet = T, \n", " skipNul = T)\n", " x <- paste0(x, sep = \" \", collapse = \" \")\n", " x <- stringr::str_squish(x)\n", "})\n", "# inspect\n", "str(corpus)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have loaded your data into R, you can then continue with processing and transforming the data according to your needs.\n", "\n", "\n", "***\n", "\n", "## Using your own data\n", "\n", "You can also use your own data. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Once you have loaded your data into R, you can then continue with processing and transforming the data according to your needs.\n", "\n", "\n", "***\n", "\n", "## Using your own data\n", "\n", "You can also use your own data. Below, you can see what you need to do to upload and use your own data.\n", "\n", "To be able to load your own data, you need to click on the folder symbol to the left of the screen:\n", "\n", "![Binder Folder Symbol](https://slcladal.github.io/images/binderfolder.JPG)\n", "\n", "Then, click on the `New Folder` symbol, create a new folder, and call it `MyData`.\n", "\n", "![Binder New Folder Symbol](https://slcladal.github.io/images/bindernewfolder.JPG)\n", "\n", "Then click on the upload symbol and upload your files into the `MyData` folder.\n", "\n", "![Binder Upload Symbol](https://slcladal.github.io/images/binderupload.JPG)\n", "\n", "Select and upload the files you want to analyze (**IMPORTANT**: here, we assume that you upload some form of text data - not tabular data!). When you then execute the code chunk below, you will load your own data and you can then use it in this notebook.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myfiles <- list.files(here::here(\"MyData\"), # path to your own corpus data\n", "                      # full paths - not just the names of the files\n", "                      full.names = TRUE) \n", "# load your own files\n", "mycorpus <- sapply(myfiles, function(x){\n", "  x <- scan(x, \n", "            what = \"char\", \n", "            sep = \"\", \n", "            quote = \"\", \n", "            quiet = TRUE, \n", "            skipNul = TRUE)\n", "  x <- paste(x, collapse = \" \")\n", "  x <- stringr::str_squish(x)\n", "})\n", "# inspect\n", "str(mycorpus)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Keep in mind, though, that you need to adapt the names of the texts in the code chunks below so that the code works on your own texts!**\n", "\n", "***\n", "\n", "\n", "## Data processing\n", "\n", "\n", "Now that the corpus data is loaded, we can prepare the searches by defining the search patterns. We will use regular expressions to retrieve all variants of the swear words. The sequence `\\\\b` denotes a word boundary, the sequence `[es]{0,2}` means that *ass* can be followed by up to two characters drawn from *e* and *s* (so that the search also retrieves *asses*), and `[a-z]{0,3}` allows up to three additional lowercase letters (so that, e.g., *shit* also matches *shitty*). We separate the search patterns by `|` as this means *or*.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "searchpatterns <- c(\"\\\\bass[es]{0,2}\\\\b|\\\\basshole[s]{0,1}\\\\b|\\\\bbitch[a-z]{0,3}\\\\b|\\\\b[a-z]{0,}fuck[a-z]{0,3}\\\\b|\\\\bshit[a-z]{0,3}\\\\b|\\\\bcock[a-z]{0,1}\\\\b|\\\\bwanker[a-z]{0,3}\\\\b|\\\\bboll[io]{1,1}[a-z]{0,3}\\\\b|\\\\bcrap[a-z]{0,3}\\\\b|\\\\bbugger[a-z]{0,3}\\\\b|\\\\bcunt[a-z]{0,3}\\\\b\")\n", "\n" ] },
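{ "cell_type": "markdown", "metadata": {}, "source": [ "To get a feel for how these regular expressions behave, you can test them against a few made-up strings before applying them to the corpus. In the example below (the test words are invented for illustration), only *class* should fail to match, because `\\\\b` requires a word boundary before *ass*.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# made-up test strings (not corpus data)\n", "teststrings <- c(\"ass\", \"asses\", \"fucking\", \"motherfucker\", \"bollocks\", \"class\")\n", "# check which strings the search patterns match\n", "grepl(searchpatterns, teststrings)\n" ] },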
\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# extract kwic\n", "kwicswears <- quanteda::kwic(tokens(corpus), searchpatterns, window = 10, valuetype = \"regex\")\n", "# inspect data\n", "head(kwicswears)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `kwic` function has the following schema:\n", "\n", "`kwic(x, pattern, window = 5, valuetype = c(\"glob\", \"regex\", \"fixed\"), separator = \" \", case_insensitive = TRUE, index = NULL)`\n", "\n", "The arguments (or parameters) of the `kwic` function mean:\n", "\n", "* `x`: a character, corpus, or tokens object \n", "* `pattern`: a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. \n", "* `window`: the number of context words to be displayed around the keyword \n", "* `valuetype`: the type of pattern matching: \"glob\" for \"glob\"-style wildcard expressions; \"regex\" for regular expressions; or \"fixed\" for exact matching. See valuetype for details. \n", "* `separator`: a character to separate words in the output \n", "* `case_insensitive`: logical; if TRUE, ignore case when matching a pattern or dictionary values \n", "* `index`: an index object to specify keywords\n", "\n", "\n", "\n", "We now clean the kwic so that it is easier to see the relevant information.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kwicswearsclean <- kwicswears %>%\n", " as.data.frame() %>%\n", " dplyr::rename(\"File\" = colnames(.)[1], \n", " \"PreviousContext\" = colnames(.)[4], \n", " \"Token\" = colnames(.)[5], \n", " \"FollowingContext\" = colnames(.)[6]) %>%\n", " dplyr::select(-from, -to, -pattern) %>%\n", " dplyr::mutate(File = str_remove_all(File, \".*/\"),\n", " File = stringr::str_remove_all(File, \" .*\"))\n", "# inspect data\n", "head(kwicswearsclean)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now create another kwic but with much more context because we want to extract the speaker that has uttered the swear word. To this end, we remove everything that proceeds the `$` symbol as the speakers are identified by characters that follow the `$` symbol, remove everything that follows the `>` symbol which end the speaker identification sequence, remove remaining white spaces, and convert the remaining character to upper case. \n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# extract kwic\n", "Speaker <- kwic(tokens(corpus), searchpatterns, window = 1000, valuetype = \"regex\") %>%\n", " as.data.frame() %>%\n", " dplyr::mutate(Speaker = str_remove_all(pre, \".*\\\\$\"),\n", " Speaker = str_remove_all(Speaker, \">.*\"),\n", " Speaker = str_squish(Speaker),\n", " Speaker = toupper(Speaker)) %>%\n", " dplyr::pull(Speaker)\n", "# inspect results\n", "head(Speaker)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now add the Speaker to our initial kwic. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "We now add the speakers to our initial kwic. This way, we combine the swear word kwic with the speaker information, and since we already have the file names, we can use the file plus the speaker identification to determine whether a speaker was a man or a woman.\n", "\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "swire <- kwicswearsclean %>%\n", "  dplyr::mutate(Speaker = Speaker)\n", "# inspect data\n", "head(swire)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now, we inspect the extracted swear word tokens to check if our search patterns have indeed captured swear words. \n", "\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# convert tokens to lower case\n", "swire <- swire %>%\n", "  dplyr::mutate(Token = tolower(Token))\n", "# inspect tokens\n", "table(swire$Token)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "*Fuck* and its variants are by far the most common swear words in our corpus. However, we do not need the type of swear word to answer our research question, and we thus summarize the table to show how many swear words each speaker has used in each file.\n", "\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "swire <- swire %>%\n", "  dplyr::mutate(File = stringr::str_remove_all(File, \" .*\")) %>%\n", "  dplyr::group_by(File, Speaker) %>%\n", "  dplyr::summarise(Swearwords = n())\n", "# inspect data\n", "head(swire)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have extracted how many swear words the speakers in the corpus have used, we can load the biodata of the speakers.\n", "\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load bio data\n", "bio <- openxlsx::read.xlsx(here::here(\"speakerinfo/SpeakerInfoICEIreland.xlsx\"))\n", "# inspect data\n", "head(bio)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Saving data from Binder\n", "\n", "To save data from a notebook, you first need to write the data to a file, which you can then simply download. The code below saves the data to the folder in which you currently are (NOTE: the data will be deleted after the session closes though!).\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "openxlsx::write.xlsx(bio, here::here(\"SpeakerInfoICEIreland.xlsx\"))\n", "\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Combining swear words with speaker information\n", "\n", "\n", "In a next step, we rename the column names so that they match the column names in the concordance data (`swire`).\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bio <- bio %>%\n", "  dplyr::rename(File = text.id, \n", "                Speaker = spk.ref,\n", "                Gender = sex,\n", "                Age = age,\n", "                Words = word.count) %>%\n", "  dplyr::select(File, Speaker, Gender, Age, Words)\n", "# inspect data\n", "head(bio)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In a next step, we combine the table containing the speaker information with the table showing the swear word use.\n", "\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# combine frequencies and biodata\n", "swire <- dplyr::left_join(bio, swire, by = c(\"File\", \"Speaker\")) %>%\n", "  # replace NA with 0\n", "  dplyr::mutate(Swearwords = ifelse(is.na(Swearwords), 0, Swearwords),\n", "                File = factor(File),\n", "                Speaker = factor(Speaker),\n", "                Gender = factor(Gender),\n", "                Age = factor(Age))\n", "# inspect data\n", "head(swire)\n" ] },
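{ "cell_type": "markdown", "metadata": {}, "source": [ "The reason for replacing `NA` with 0 in the step above becomes clear with a toy example: speakers who never swear do not appear in the concordance table, so the join leaves their `Swearwords` cell empty (`NA`). The two mini tables below are invented for illustration.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# made-up toy tables (not corpus data)\n", "speakers <- data.frame(File = c(\"f1\", \"f1\"), Speaker = c(\"A\", \"B\"), Words = c(100, 200))\n", "swears   <- data.frame(File = \"f1\", Speaker = \"A\", Swearwords = 2)\n", "# speaker B gets NA for Swearwords, which we then recode as 0\n", "dplyr::left_join(speakers, swears, by = c(\"File\", \"Speaker\"))\n" ] },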
{ "cell_type": "markdown", "metadata": {}, "source": [ "We now clean the table by removing speakers for whom we do not have any information on their age or gender. Also, we summarize the table to extract the mean frequencies of swear words (per 1,000 words) by age and gender.\n", "\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# clean data\n", "swire_vis <- swire %>%\n", "  dplyr::filter(!is.na(Gender),\n", "                !is.na(Age),\n", "                Age != \"0-18\") %>%\n", "  dplyr::group_by(Age, Gender) %>%\n", "  dplyr::summarise(SumWords = sum(Words),\n", "                   SumSwearwords = sum(Swearwords),\n", "                   FrequencySwearwords = round(SumSwearwords/SumWords*1000, 3)) \n", "# inspect data\n", "head(swire_vis)\n" ] },
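{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick illustration of the normalization: a hypothetical group that produces 30 swear words in 15,000 words of speech has a relative frequency of 30 / 15000 * 1000 = 2 swear words per 1,000 words.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hypothetical counts (not results from the corpus)\n", "sum_swearwords <- 30\n", "sum_words <- 15000\n", "round(sum_swearwords / sum_words * 1000, 3)\n" ] },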
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Tabulating and visualizing the data\n", "\n", "Now that we have prepared our data, we can plot swear word use by age and gender. \n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "swire_vis %>% \n", "  ggplot(aes(x = Age, y = FrequencySwearwords, group = Gender, fill = Gender)) +\n", "  geom_bar(stat = \"identity\", position = position_dodge()) +\n", "  theme_bw() +\n", "  scale_fill_manual(values = c(\"orange\", \"darkgrey\")) +\n", "  labs(y = \"Relative frequency \\n swear words per 1,000 words\")\n", "ggsave(here::here(\"swearing_ire.png\"))\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The graph suggests that the genders do not differ in their use of swear words except in the age bracket from 26 to 41: men swear more among speakers aged between 26 and 33, while women swear more between 34 and 41 years of age. \n", "\n", "## Statistical analysis\n", "\n", "We now perform a statistical test, a Configural Frequency Analysis (CFA), to check which groups in the data significantly over- or under-use swear words.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cfa_swear <- swire %>%\n", "  dplyr::filter(!is.na(Gender),\n", "                !is.na(Age),\n", "                Age != \"0-18\") %>%\n", "  dplyr::group_by(Age, Gender) %>%\n", "  dplyr::summarise(SumWords = sum(Words),\n", "                   SumSwearwords = sum(Swearwords)) %>%\n", "  tidyr::gather(Type, Frequency, SumWords:SumSwearwords)\n", "# inspect data\n", "head(cfa_swear, 20)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We now select only Age, Gender, and Type and store the counts in a separate vector (or object).\n", "\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# define configurations\n", "configs <- cfa_swear %>%\n", "  dplyr::select(Age, Gender, Type)\n", "# define counts\n", "counts <- cfa_swear$Frequency\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now that the configurations and the counts are separated, we can perform the configural frequency analysis.\n", "\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# perform cfa\n", "cfa(configs, counts)$table %>%\n", "  as.data.frame() %>%\n", "  dplyr::filter(p.chisq < .1,\n", "                stringr::str_detect(label, \"Swear\")) %>%\n", "  dplyr::select(-z, -p.z, -sig.z, -sig.chisq, -Q)\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "After filtering out the significant over-use of non-swear words from the results of the CFA, we find that men and women in the age bracket of 26 to 33 use significantly more swear words than the other groups in the data.\n", "\n", "It has to be borne in mind, though, that this is merely a case study and that a more fine-grained analysis of a substantially larger data set would be necessary to gain a more reliable impression.\n", "\n", "\n", "We end the session by calling the session info, which tells us what version of R and of the packages we have used.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sessionInfo()\n", "\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "[Back to HOME](https://github.com/MartinSchweinberger/SLAT7829Tutorials)\n", "\n", "***\n" ] } ], "metadata": { "anaconda-cloud": "", "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.4.1" } }, "nbformat": 4, "nbformat_minor": 1 }