{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Crowd Tangle LDA Evaluation Workflow "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Import text and stop words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "from util import read_crowdtangle_files, create_corpus\n",
    "import time\n",
    "from datetime import timedelta\n",
    "from pprint import pprint\n",
    "from gensim.models.wrappers import LdaMallet\n",
    "import pickle\n",
    "from gensim.corpora.mmcorpus import MmCorpus\n",
    "\n",
    "#Specify path to input and output directories\n",
    "input_dir = '/Users/dankoban/Documents/EM6575/LDAInput'\n",
    "output_dir = '/Users/dankoban/Documents/EM6575/LDAOutput'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 1/1 [00:03<00:00,  3.09s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--- 0:00:03.095897 time elapsed ---\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "# Extract file names from input directory\n",
    "files = [file for file in os.listdir(input_dir) if file.endswith(\".csv\")]   \n",
    "file_paths = [input_dir + \"/\" + file for file in files]\n",
    "\n",
    "# Select only n files for testing\n",
    "file_paths = file_paths[0:1]\n",
    "\n",
    "start_time = time.time()\n",
    "df = read_crowdtangle_files(file_paths)\n",
    "print(\"--- %s time elapsed ---\" % str(timedelta(seconds=time.time() - start_time)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "130193\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Facebook Id</th>\n",
       "      <th>Text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>624614494274945</td>\n",
       "      <td>Nika Vetsko, excerpts: ...Many researchers believe that Russia is trying to increase this traffic in Georgia, having already been active in fuelling anti-vaccination conspiracy theories. Some link this directly to the countrys measles outbreak last year. ...Russia has also revived conspiracy theories around the Lugar Laboratory, a US fi ced high-tech research centre in Tbilisi. Over the years, Russian authorities and media have worked to discredit the lab and US-Georgia relations more widely.     Is Russia Exploiting Coronavirus Fears In Georgia? By Nika Vetsko* Experts warn that Russia is exploiting the recent appearance of coronavirus in Georgia to spread a new wave of disinformation and conspiracy theories. Georgia has registered only 15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>26781952138</td>\n",
       "      <td>The capitals first Covid-19 patient, a 45-year-old man from Mayur Vihar Phase II, has recovered fully from the viral infection. He was discharged from Ram Manohar Lohia Hospital on Saturday, said a source.   Delhis first coronavirus patient recovers fully The capitals first Covid-19 patient, a 45-year-old man from Mayur Vihar Phase II, has recovered fully from the viral infection. He was discharged fro</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>251907774312</td>\n",
       "      <td>The coronavirus pandemic is yet to force widespread school shutdowns but many families are voluntarily withdrawing their children.   'I'm happy to be a small drop': Families withdrawing children from school to fight coronavirus The coronavirus pandemic is yet to force widespread school shutdowns but across Sydney, many families are voluntarily withdrawing their children.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>138280549589759</td>\n",
       "      <td>The safety and well-being of our community and the Brothers Fish&amp;chips family is always the top priority. In challenging times like this, we are faced with many uncertainties. However, one thing that is certain is that together as a community we will overcome this situation and wed like to reassure that we are following CDC recommended guidelines regarding coronavirus, COVID-19 to keep you and our family safe as much as we can! #ossining #croton #briarcliff #westchester #lohudfood We are temporarily offering prepaid delivery and curb side pick-up. Call (914) 488-5141 to place your order and before arrival. Timeline Photos</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>32204506174</td>\n",
       "      <td>With the coronavirus spreading across the globe @carynceolin with how the White House is trying to prevent it from spreading around the West Wing.     Trump tested negative for COVID-19 - CityNews Toronto As the coronavirus inches closer to President Trump, Caryn Ceolin with how the White House is trying to prevent it from spreading around the West Wing.</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Facebook Id  \\\n",
       "0  624614494274945   \n",
       "1      26781952138   \n",
       "2     251907774312   \n",
       "3  138280549589759   \n",
       "4      32204506174   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Text  \n",
       "0  Nika Vetsko, excerpts: ...Many researchers believe that Russia is trying to increase this traffic in Georgia, having already been active in fuelling anti-vaccination conspiracy theories. Some link this directly to the countrys measles outbreak last year. ...Russia has also revived conspiracy theories around the Lugar Laboratory, a US fi ced high-tech research centre in Tbilisi. Over the years, Russian authorities and media have worked to discredit the lab and US-Georgia relations more widely.     Is Russia Exploiting Coronavirus Fears In Georgia? By Nika Vetsko* Experts warn that Russia is exploiting the recent appearance of coronavirus in Georgia to spread a new wave of disinformation and conspiracy theories. Georgia has registered only 15  \n",
       "1                                                                                                                                                                                                                                                                                                                                                           The capitals first Covid-19 patient, a 45-year-old man from Mayur Vihar Phase II, has recovered fully from the viral infection. He was discharged from Ram Manohar Lohia Hospital on Saturday, said a source.   Delhis first coronavirus patient recovers fully The capitals first Covid-19 patient, a 45-year-old man from Mayur Vihar Phase II, has recovered fully from the viral infection. He was discharged fro  \n",
       "2                                                                                                                                                                                                                                                                                                                                                                                           The coronavirus pandemic is yet to force widespread school shutdowns but many families are voluntarily withdrawing their children.   'I'm happy to be a small drop': Families withdrawing children from school to fight coronavirus The coronavirus pandemic is yet to force widespread school shutdowns but across Sydney, many families are voluntarily withdrawing their children.  \n",
       "3                                                                                                                         The safety and well-being of our community and the Brothers Fish&chips family is always the top priority. In challenging times like this, we are faced with many uncertainties. However, one thing that is certain is that together as a community we will overcome this situation and wed like to reassure that we are following CDC recommended guidelines regarding coronavirus, COVID-19 to keep you and our family safe as much as we can! #ossining #croton #briarcliff #westchester #lohudfood We are temporarily offering prepaid delivery and curb side pick-up. Call (914) 488-5141 to place your order and before arrival. Timeline Photos    \n",
       "4                                                                                                                                                                                                                                                                                                                                                                                                            With the coronavirus spreading across the globe @carynceolin with how the White House is trying to prevent it from spreading around the West Wing.     Trump tested negative for COVID-19 - CityNews Toronto As the coronavirus inches closer to President Trump, Caryn Ceolin with how the White House is trying to prevent it from spreading around the West Wing.  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Extract subset of total data for testing the workflow\n",
    "pd.set_option('display.max_colwidth', None)\n",
    "df = pd.concat(df)\n",
    "print(len(df))\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# write a text file of the merged files to run via the terminal\n",
    "df.to_csv('/Users/dankoban/Documents/EM6575/coherence_test2/input.txt',sep=' ',header=False)\n",
    "#~/mallet-2.0.8/bin/mallet import-file --input mallet_terminal_input_crowdtangle.txt --output ct.mallet --remove-stopwords TRUE --extra-stopwords new_stopwords.txt --keep-sequence  TRUE\n",
    "#~/mallet-2.0.8/bin/mallet train-topics --input ct.mallet --output-topic-keys ct.keys --topic-word-weights-file ct.topicwordweights --word-topic-counts-file ct.wordtopiccounts --output-doc-topics ct.doctopics_sparse --num-topics 20 --num-threads 48 --optimize-interval 10 --doc-topics-threshold 0.3 --diagnostics-file diagnostics.xml"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import custom list of stop words\n",
    "stop_words = pd.read_csv(\"/Users/dankoban/Documents/lda_evaluation/data/new_stopwords.csv\")\n",
    "stop_words = stop_words['stop_word'].tolist()\n",
    "stop_words[0:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Generate a corpus and dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "TopicTermFreq = pd.read_csv('/Users/dankoban/Documents/EM6575/mallet_command_line/tidy_topics.csv')\n",
    "def top_n_terms(k, n = 20):\n",
    "    result = (TopicTermFreq[TopicTermFreq['topic'] == k].\n",
    "                  sort_values('count', ascending=False).head(n))\n",
    "    return result\n",
    "\n",
    "topics = []\n",
    "for k in range(0, 20):\n",
    "    terms = top_n_terms(k, n = 20)['term'].tolist()\n",
    "    terms = [term.replace('.', '') for term in terms]\n",
    "    terms = [term.replace(\"'\", '') for term in terms]\n",
    "    topics.append(terms)\n",
    "#topics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "[dictionary, corpus] = create_corpus(text = df.Text, stop_words = stop_words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save dictionary and corpus to disc\n",
    "dictionary.save(output_dir + \"/dictionary.pkl\")\n",
    "MmCorpus.serialize(output_dir + \"/corpus.pkl\", corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Fit an LDA model using gensim LdaMallet "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fit a new model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ.update({'MALLET_HOME':r'/Users/dankoban/mallet-2.0.8/'})\n",
    "\n",
    "start_time = time.time()\n",
    "lda = LdaMallet(mallet_path = '/Users/dankoban/mallet-2.0.8/bin/mallet', \n",
    "                corpus=corpus, num_topics=50, id2word=dictionary, \n",
    "                workers = 20, iterations = 500, random_seed = 1)\n",
    "\n",
    "print(\"--- %s time elapsed ---\" % str(timedelta(seconds=time.time() - start_time)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save model to disk\n",
    "pickle.dump(lda, open(output_dir + \"/mallet.pkl\", \"wb\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ** Save point.  If a model is already fit, start here and continue on for follow on evaluation **"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load an existing model.  If an existing model doesn't exist, execute the code to fit a new model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lda = pickle.load(open(output_dir + \"/mallet.pkl\", \"rb\"))\n",
    "dictionary = pickle.load(open(output_dir + \"/dictionary.pkl\", \"rb\"))\n",
    "corpus = MmCorpus(open(output_dir + \"/corpus.pkl\", \"rb\"))\n",
    "\n",
    "# Show Topics\n",
    "pprint(lda.show_topic(1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Reorganize topics into a readable format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tm_results = lda[corpus]\n",
    "corpus_topics = [sorted(topics, \n",
    "                        key=lambda record: -record[1])[0] for topics in tm_results]\n",
    "\n",
    "topics = [[(term, round(wt, 3)) for term, wt in lda.show_topic(n, topn=20)] \n",
    "                                for n in range(0, lda.num_topics)]\n",
    "\n",
    "topics_df = pd.DataFrame([[term for term, wt in topic] for topic in topics], \n",
    "                           columns = ['Term'+str(i) for i in range(1, 21)], \n",
    "                           index=['Topic '+str(t) for t in range(1, lda.num_topics+1)]).T\n",
    "topics_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# set column width\n",
    "pd.set_option('display.max_colwidth', None)\n",
    "topics_df = pd.DataFrame([', '.join([term for term, wt in topic]) for topic in topics], \n",
    "                         columns = ['Terms per Topic'], \n",
    "                         index=['Topic'+str(t) for t in range(1, lda.num_topics+1)] )\n",
    "topics_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Visualize topics with a clustermap"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import OrderedDict\n",
    "data_lda = {i: OrderedDict(lda.show_topic(i,5)) for i in range(50)}\n",
    "df_lda = pd.DataFrame(data_lda)\n",
    "df_lda = df_lda.fillna(0).T\n",
    "print(df_lda.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "g=sns.clustermap(df_lda.corr(), center=0, standard_scale=1, \n",
    "                 cmap=\"RdBu\", metric='cosine', linewidths=.75, figsize=(15, 15))\n",
    "plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "forcenexus",
   "language": "python",
   "name": "forcenexus"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}