{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Models\n",
    "## Naive Bayes classifier\n",
    "We continue with the Political Blog Corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "require(quanteda)\n",
    "load(\"data/DTM.2.RData\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and split our data test (0.2) / train (0.8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "set.seed(1)\n",
    "id_train <- sample(1:ndoc(DTM.2), ndoc(DTM.2) %*% .8, replace = FALSE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_dfm <- DTM.2[id_train] # training set\n",
    "test_dfm <- DTM.2[!docvars(DTM.2)$X %in% id_train] # test set\n",
    "table(docvars(training_dfm, \"rating\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training the classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nb <- textmodel_nb(training_dfm, docvars(training_dfm, \"rating\"))\n",
    "summary(nb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model evaluation\n",
    "The classifier can only take features into consideration that occur both in the training set and the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_dfm <- dfm_select(test_dfm, training_dfm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "table(docvars(test_dfm, \"rating\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "actual_class <- docvars(test_dfm, \"rating\")\n",
    "predicted_class <- predict(nb, test_dfm)\n",
    "table(actual_class, predicted_class)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Consider using the *caret* package ([link](http://topepo.github.io/caret/index.html))."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Mixed Membership Clustering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "require(topicmodels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dtm <- convert(DTM.2, to = \"topicmodels\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Parameter estimation can take some time, depending on the size of the vocabulary, the number of documents and the setting of K"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lda <- LDA(dtm, k = 10, method=\"Gibbs\", control=list(iter = 100, verbose = 20, alpha = 0.2, estimate.beta = TRUE))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**What is this Gibbs sampler?**\n",
    "\n",
    "Julia code implementing [T. Griffiths and M. Steyvers, 2004](http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf)\n",
    "```julia\n",
    "for iter = 1:maxIter\n",
    "    for n = randperm(N)\n",
    "\n",
    "        w = data[1,n]\n",
    "        d = data[2,n]\n",
    "        topic = z[n]\n",
    "        \n",
    "        n_kw[topic,w] -= 1\n",
    "        n_dk[d,topic] -= 1\n",
    "        n_k[topic] -= 1\n",
    "        \n",
    "        # Full conditional posterior distribution in Eq. 5\n",
    "        p = [(n_dk[d,k] + α) * (n_kw[k,w] + β) / (n_k[k] + Vβ) for k = 1:K ]\n",
    "        \n",
    "        topic = discrete(p)\n",
    "\n",
    "        n_kw[topic,w] += 1\n",
    "        n_dk[d,topic] += 1\n",
    "        n_k[topic] += 1\n",
    "        \n",
    "        z[n] = topic\n",
    "    \n",
    "    end\n",
    "end\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "terms(lda, 10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "example_ids <- c(1, 2, 3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "require(\"reshape2\")\n",
    "require(\"ggplot2\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lda_posterior <- posterior(lda)\n",
    "top5termsPerTopicProb <- lda::top.topic.words(lda_posterior$terms, 5, by.score = T)\n",
    "topicProportionExamples <- lda_posterior$topics[example_ids, ]\n",
    "colnames(topicProportionExamples) <- apply(top5termsPerTopicProb, 2, paste, collapse = \" \")\n",
    "\n",
    "vizDataFrame <- melt(data = cbind(data.frame(topicProportionExamples), document = docvars(DTM.2[example_ids])$docname), \n",
    "                     variable.name = \"topic\", \n",
    "                     id.vars = \"document\")\n",
    "\n",
    "ggplot(data = vizDataFrame, aes(x = topic, y = value, fill = document), ylab = \"proportion\") +\n",
    "  geom_bar(stat = \"identity\", position = \"stack\") +\n",
    "  theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position=\"none\") +  \n",
    "  coord_flip() + facet_wrap(~document, ncol = length(example_ids))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For further options see [here](https://nbviewer.jupyter.org/github/gesiscss/ptm/blob/master/tutorial_6.ipynb).\n",
    "\n",
    "## The Structural Topic Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(stm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DTM.stm <- convert(DTM.2, to = \"stm\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "head(DTM.stm$meta)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "poliblogPrevFit <- stm(documents = DTM.stm$documents, vocab = DTM.stm$vocab, K = 10, prevalence =~ rating + s(day), max.em.its = 75, data = DTM.stm$meta)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labelTopics(poliblogPrevFit)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot(poliblogPrevFit, type = \"summary\", xlim = c(0, .3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prep <- estimateEffect(1:10 ~ rating+s(day), poliblogPrevFit, meta=DTM.stm$meta, uncertainty=\"Global\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot(prep, \"day\", method = \"continuous\", topics = 7, model = z, printlegend = FALSE, xaxt = \"n\", xlab = \"Time (2008)\")\n",
    "monthseq <- seq(from = as.Date(\"2008-01-01\"), to = as.Date(\"2008-12-01\"), by = \"month\")\n",
    "monthnames <- months(monthseq)\n",
    "axis(1, at=as.numeric(monthseq)-min(as.numeric(monthseq)), labels=monthnames)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please have a look at the *stm* [vignette](https://cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf) and [Learning Structural Topic Modeling](https://github.com/dondealban/learning-stm)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.4.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}