{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://github.com/dc-aihub/dc-aihub.github.io/blob/master/img/ai-logo-transparent-banner.png?raw=true\" \n",
    "alt=\"Ai/Hub Logo\"/>\n",
    "\n",
    "<h1 style=\"text-align:center;color:#0B8261;\"><center>TensorFlow NLP</center></h1>\n",
    "<h1 style=\"text-align:center;\"><center>Lesson 4</center></h1>\n",
    "<h1 style=\"text-align:center;\"><center>Keras Text Summarization</center></h1>\n",
    "\n",
    "<hr />\n",
    "\n",
    "<center><a href=\"#TensorFlow-Devices\">TensorFlow Devices</a></center>\n",
    "\n",
    "<center><a href=\"#Prep-and-Process\">Preparation and Pre-Processing</a></center>\n",
    "\n",
    "<center><a href=\"#Training\">Training the Model</a></center>\n",
    "\n",
    "<center><a href=\"#Beast\">The Beast</a></center>\n",
    "\n",
    "<center><a href=\"#Testing\">Testing the Model</a></center>\n",
    "\n",
    "<center><a href=\"#Summary\">Summary</a></center>\n",
    "\n",
    "<center><a href=\"#Challenge\">Challenge</a></center>\n",
    "\n",
    "<hr />\n",
    "\n",
    "<center>***Original Content by Xianshun Chen:*** <br/>https://github.com/chen0040/keras-text-summarization</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\">\n",
    "OVERVIEW\n",
    "</div>\n",
    "\n",
    "<center style=\"color:#0B8261;\">\n",
    "This Lesson will show you how to implement the Keras Sequence2Sequence Text Summarizer on a News dataset in order to create summaries.\n",
    "<br/>\n",
    "This lessons folder (L4_data) contains several different Seq2Seq and Encoder-Decoder RNN implementations for you to experiment with. They may even yield better results depending on the data-set you use.\n",
    "</center>\n",
    "\n",
    "<br/>\n",
    "\n",
    "<center><b>[Click here for an Introduction to Text Summarization](https://machinelearningmastery.com/gentle-introduction-text-summarization/)</b></center>\n",
    "\n",
    "<center><b>[Click here for an Introduction to Encoder/Decoder Models](https://machinelearningmastery.com/encoder-decoder-models-text-summarization-keras/)</b></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"TensorFlow-Devices\">\n",
    "TENSORFLOW DEVICES\n",
    "</div>\n",
    "\n",
    "<b>After executing the code cell below, you can see further details for your devices in the Jupyter Console.</b> "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.python.client import device_lib\n",
    "\n",
    "def get_available_devices():\n",
    "    local_device_protos = device_lib.list_local_devices()\n",
    "    return [x.name for x in local_device_protos]\n",
    "\n",
    "get_available_devices()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"Prep-and-Process\">\n",
    "PREPARATION AND PRE-PROCESSING\n",
    "</div>\n",
    "\n",
    "<h3 style=\"color:#45A046;\">Imports</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "from __future__ import print_function\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import tensorflow as tf\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "from keras_text_summarization.library.utility.plot_utils import plot_and_save_history\n",
    "from keras_text_summarization.library.seq2seq import Seq2SeqSummarizer\n",
    "from keras_text_summarization.library.applications.fake_news_loader import fit_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "LOAD_EXISTING_WEIGHTS = True\n",
    "\n",
    "np.random.seed(42)\n",
    "data_dir_path = './L4_data/data'\n",
    "report_dir_path = './L4_data/reports'\n",
    "model_dir_path = './L4_data/models'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"Training\">\n",
    "TRAINING\n",
    "</div>\n",
    "\n",
    "<h3 style=\"color:#45A046;\">Load Training Data</h3>\n",
    "\n",
    "We will use a provided news data-set which contains articles and titles from various news sources.\n",
    "\n",
    "This data is pre-processed inside the custom functions in the 'keras_text_summarization' folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading CSV . . .\n",
      "Extracting for config . . . \n",
      "-> Complete\n"
     ]
    }
   ],
   "source": [
    "# Load CSV into DataFrame\n",
    "print('Loading CSV . . .')\n",
    "df = pd.read_csv(data_dir_path + \"/news.csv\")\n",
    "\n",
    "# Extract text for configuration\n",
    "print('Extracting for config . . . ')\n",
    "Y = df.title\n",
    "X = df['text']\n",
    "config = fit_text(X, Y)\n",
    "print('-> Complete')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#D33222; margin-left:10%; width:90%; height:38px; color:white; font-size:18px; padding:10px; float:right;\">\n",
    "WARNING\n",
    "</div>\n",
    ">- Make sure that the dataset is fully downloaded and extracted before continuing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3 style=\"color:#45A046;\">Quote</h3>\n",
    "\n",
    "<blockquote style=\"font-style: italic;\">\n",
    "    \n",
    "...there are two different approaches for automatic summarization currently:\n",
    "<br/><br/>\n",
    "<b>Extraction</b> and <b>Abstraction</b>.\n",
    "<br/><br/>\n",
    "<b>Extractive summarization</b> methods work by identifying important sections of the text and generating them verbatim;  \n",
    "<br/>\n",
    "<b>...Abstractive summarization</b> methods aim at producing important material in a new way. In other words, they interpret and examine the text using advanced natural language techniques in order to generate a new shorter text that conveys the most critical information from the original text.\n",
    "<br/><br/>\n",
    "- [Text Summarization Techniques: A Brief Survey, 2017](https://arxiv.org/abs/1707.02268)</blockquote>\n",
    "\n",
    "\n",
    "<h3 style=\"color:#45A046;\">Initialize Summarizer Model</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "summarizer = Seq2SeqSummarizer(config)\n",
    "\n",
    "# Change this value to 'false' above to start fresh!\n",
    "if LOAD_EXISTING_WEIGHTS:\n",
    "    summarizer.load_weights(weight_file_path=Seq2SeqSummarizer.get_weight_file_path(model_dir_path=model_dir_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3 style=\"color:#45A046;\">Split Data into Train and Test Sets</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3 style=\"color:#45A046;\">Fit the Training Data to the Model</h3>\n",
    "\n",
    "In other words - let's start training our model!\n",
    "\n",
    "</br>\n",
    "\n",
    "<div style=\"background-color:#D33222; margin-left:10%; width:90%; height:38px; color:white; font-size:18px; padding:10px; float:right;\">\n",
    "WARNING\n",
    "</div>\n",
    ">- <b>The code cell directly below will start training the model!</b>\n",
    ">- This model is set to execute 100 epochs with a batch size of 5.\n",
    ">- This results in a <b>long training time</b> unless you are secretly Megatron.\n",
    ">- See 'The Beast' section for more information on speeding this up.\n",
    ">- If you get tired of waiting for it to train locally:\n",
    "    - Interrupt the kernel and continue to the 'Testing' section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional TF Device Selection (code below must be indented)\n",
    "with tf.device('/GPU:0'):\n",
    "    history = summarizer.fit(Xtrain, Ytrain, Xtest, Ytest, epochs=100, batch_size=5, model_dir_path=model_dir_path)\n",
    "    \n",
    "history_plot_file_path = report_dir_path + '/' + Seq2SeqSummarizer.model_name + '-history.png'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"Beast\">\n",
    "THE BEAST\n",
    "</div>\n",
    "\n",
    "AI/Hub Team Members can also use 'The Beast' to process this training code at a faster rate!\n",
    "\n",
    "An informational document is being created for using The Beast; It will be available on the ORSIE AI/Hub Internal Site once it has been completed!\n",
    "\n",
    "<b>Please ask your Lead Researcher for more information regarding this.</b>\n",
    "\n",
    "However, you will be able to test the current model locally, even with limited training!\n",
    "\n",
    "(. . . Mind the results)\n",
    "\n",
    "</br>\n",
    "\n",
    "<div style=\"background-color:#D33222; margin-left:10%; width:90%; height:38px; color:white; font-size:18px; padding:10px; float:right;\">\n",
    "NOTE\n",
    "</div>\n",
    ">- <b>The code cell directly below will only execute after completing a full training loop!</b>\n",
    "\n",
    ">- 'history' is created on completion of the summarizer.fit() function\n",
    ">- If you manually stop the training, you will not be able to run this cell!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if LOAD_EXISTING_WEIGHTS:\n",
    "    history_plot_file_path = report_dir_path + '/' + Seq2SeqSummarizer.model_name + '-history-v' + str(summarizer.version) + '.png'\n",
    "# Plot and Save History\n",
    "plot_and_save_history(history, summarizer.model_name, history_plot_file_path, metrics={'loss', 'acc'})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"Testing\">\n",
    "TESTING\n",
    "</div>\n",
    "\n",
    "<h3 style=\"color:#45A046;\">Load Testing Data</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading CSV . . .\n",
      "Extracting features . . .\n",
      "-> Complete\n"
     ]
    }
   ],
   "source": [
    "# Randomize Seed\n",
    "np.random.seed(42)\n",
    "\n",
    "# Define Directory Paths\n",
    "data_dir_path = './L4_data/data' # refers to the demo/data folder\n",
    "model_dir_path = './L4_data/models' # refers to the demo/models folder\n",
    "\n",
    "# Load CSV from Directory\n",
    "print('Loading CSV . . .')\n",
    "df = pd.read_csv(data_dir_path + \"/news.csv\")\n",
    "\n",
    "# Assign dataframe text and title to X and Y values\n",
    "print('Extracting features . . .')\n",
    "X = df['text']\n",
    "Y = df.title\n",
    "print('-> Complete')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3 style=\"color:#45A046;\">Load Stored Model and Re-Initialize</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load stored model configuration using NumPy.load()\n",
    "config = np.load(Seq2SeqSummarizer.get_config_file_path(model_dir_path=model_dir_path)).item()\n",
    "\n",
    "# Re-Initialize the model using the stored configuration\n",
    "summarizer = Seq2SeqSummarizer(config)\n",
    "\n",
    "# Load the stored weights into the model\n",
    "summarizer.load_weights(weight_file_path=Seq2SeqSummarizer.get_weight_file_path(model_dir_path=model_dir_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3 style=\"color:#45A046;\">Predict Some Headlines</h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Predicting Headlines . . .\n",
      "\n",
      " Original:  You Can Smell Hillary’s Fear\n",
      "Generated:  clinton campaign biggest national are - the onion - america's finest news source\n",
      "\n",
      " Original:  Watch The Exact Moment Paul Ryan Committed Political Suicide At A Trump Rally (VIDEO)\n",
      "Generated:  the trump is what trump's rick of gop debate\n",
      "\n",
      " Original:  Kerry to go to Paris in gesture of sympathy\n",
      "Generated:  not to back to back at least time\n",
      "\n",
      " Original:  Bernie supporters on Twitter erupt in anger against the DNC: 'We tried to warn you!'\n",
      "Generated:  the gop debate on the party is in against trump is a bit to twitter\n",
      "\n",
      " Original:  The Battle of New York: Why This Primary Matters\n",
      "Generated:  the battle of new why why many could go to win\n",
      "\n",
      " Original:  Tehran, USA\n",
      "Generated:  john obama: political top to daily\n",
      "\n",
      " Original:  Girl Horrified At What She Watches Boyfriend Do After He Left FaceTime On\n",
      "Generated:  of be hillary’s why trump’s campaign in 2016\n",
      "\n",
      " Original:  ‘Britain’s Schindler’ Dies at 106\n",
      "Generated:  re: clinton’s email and coming\n",
      "\n",
      " Original:  Fact check: Trump and Clinton at the 'commander-in-chief' forum\n",
      "Generated:  is republicans the jeb bush director up the gop debate in the\n",
      "\n",
      " Original:  Iran reportedly makes new push for uranium concessions in nuclear talks\n",
      "Generated:  election is coming in the world war iii - the onion - america's finest news source\n",
      "\n",
      " -> Complete\n"
     ]
    }
   ],
   "source": [
    "# Print predicted headlines along with their original title\n",
    "print('Predicting Headlines . . .')\n",
    "for i in range(10):\n",
    "    x = X[i]\n",
    "    actual_headline = Y[i]\n",
    "    headline = summarizer.summarize(x)\n",
    "\n",
    "    print('\\n', 'Original: ', actual_headline)\n",
    "    #print('Article: ', x)\n",
    "    print('Generated: ', headline)\n",
    "print('\\n', '-> Complete')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"Summary\">\n",
    "SUMMARY\n",
    "</div>\n",
    "\n",
    "This tutorial showed how to generate headlines for news articles of various length using Keras' sequence2sequence text summarizer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;\" id=\"Challenge\">\n",
    "CHALLENGE\n",
    "</div>\n",
    "\n",
    "These are a few suggestions for exercises that may help improve your skills with TensorFlow. It is important to get hands-on experience with TensorFlow in order to learn how to use it properly.\n",
    "\n",
    "You may want to backup this Notebook before making any changes.\n",
    "\n",
    "* Train the model for larger/smaller batches. Does it improve the quality of the generated summaries?\n",
    "* Try another architecture for the Recurrent Neural Network (See the demo folder) Can you improve the quality of the generated summaries?\n",
    "* Try using a different dataset to train and test this model - or one of the others provided in the lesson folder (L4_data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}