{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Import and Preprocessing\n",
    "\n",
    "In this notebook, we demonstrate how to read text data in R, tokenize texts and create a document-term matrix.\n",
    "\n",
    "We start by loading the required dependencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "require(quanteda)\n",
    "require(magrittr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The corpus we will work with is a collection of blogposts about American politics written in 2008 put together by the Carnegie Mellon University 2008 Political Blog Corpus ([Eisenstein & Xing 2010](http://www.sailing.cs.cmu.edu/main/socialmedia/blog2008.pdf))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "poliblogs2008 <- read.csv(\"data/poliblogs2008.csv\", header = TRUE, sep = \",\", encoding = \"UTF-8\",quote = \"\\\"\", stringsAsFactors = F)\n",
    "head(poliblogs2008,2) # inspect the first 2 documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "table(poliblogs2008$rating)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "table(poliblogs2008$blog)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_corpus <- corpus(poliblogs2008, text_field = \"documents\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DTM.1 <- data_corpus %>% tokens() %>%\n",
    "  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% tokens_tolower() %>%  dfm() \n",
    "DTM.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Simple Frequency Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wordlist <- textstat_frequency(DTM.1)\n",
    "head(wordlist, 20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot(wordlist$frequency , type = \"l\", lwd=2, main = \"Rank frequency Plot\", xlab=\"Rank\", ylab =\"Frequency\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot(wordlist$frequency , type = \"l\", log=\"xy\", lwd=2, main = \"Rank-Frequency Plot\", xlab=\"log-Rank\", ylab =\"log-Frequency\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stopw_idx <- which(wordlist$feature %in% stopwords('en'))\n",
    "low_frequent_idx <- which(wordlist$frequency < 10)\n",
    "trash_idx <- union(stopw_idx, low_frequent_idx)\n",
    "vocab_idx <- setdiff(1:nrow(wordlist), trash_idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot(wordlist$frequency, type = \"l\", log=\"xy\",lwd=2, main = \"Rank-Frequency plot\", xlab=\"Rank\", ylab = \"Frequency\")\n",
    "lines(vocab_idx, wordlist$frequency[vocab_idx], col = \"green\", lwd=2, type=\"p\", pch=20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "head(wordlist[vocab_idx], 20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "head(wordlist[trash_idx], 20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DTM.2 <- dfm_remove(DTM.1, wordlist[trash_idx]$feature)\n",
    "DTM.2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "textplot_wordcloud(DTM.2, max_words = 100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finding Important Words in a Document\n",
    "**T**erm **F**requency–**I**nverse **D**ocument **F**requency (**TF-IDF**), is intended to reflect the importance of a word in a document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "some_docname <- \"at0800300_2.text\"\n",
    "print(poliblogs2008[poliblogs2008$docname == some_docname, ]$documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "number_of_docs <- nrow(DTM.2)\n",
    "term_in_docs <- colSums(DTM.2 > 0)\n",
    "idf <- log2(number_of_docs / term_in_docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tf <- as.vector(DTM.2[poliblogs2008[poliblogs2008$docname == some_docname, ]$X, ])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tf_idf <- tf * idf\n",
    "names(tf_idf) <- colnames(DTM.2)\n",
    "head(sort(tf_idf, decreasing = T),10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Working with Dictionaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "positive_terms <- data_dictionary_LSD2015$positive\n",
    "negative_terms <- data_dictionary_LSD2015$negative"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "positive_terms_in_suto <- intersect(colnames(DTM.2), positive_terms)\n",
    "counts_positive <- rowSums(DTM.2[, positive_terms_in_suto])\n",
    "\n",
    "negative_terms_in_suto <- intersect(colnames(DTM.2), negative_terms)\n",
    "counts_negative <- rowSums(DTM.2[, negative_terms_in_suto])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "counts_all_terms <- rowSums(DTM.2)\n",
    "\n",
    "relative_sentiment_frequencies <- data.frame(\n",
    "  docname = docvars(DTM.2)$docname,\n",
    "  positive = counts_positive / counts_all_terms,\n",
    "  negative = counts_negative / counts_all_terms\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "head(relative_sentiment_frequencies,5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save(DTM.2, file = \"data/DTM.2.RData\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.4.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}