<img src="https://github.com/dc-aihub/dc-aihub.github.io/blob/master/img/ai-logo-transparent-banner.png?raw=true" 
alt="Ai/Hub Logo"/>

<h1 style="text-align:center;color:#0B8261;"><center>TensorFlow NLP</center></h1>
<h1 style="text-align:center;"><center>Lesson 4</center></h1>
<h1 style="text-align:center;"><center>Keras Text Summarization</center></h1>

<hr />

<center><a href="#TensorFlow-Devices">TensorFlow Devices</a></center>

<center><a href="#Prep-and-Process">Preparation and Pre-Processing</a></center>

<center><a href="#Training">Training the Model</a></center>

<center><a href="#Beast">The Beast</a></center>

<center><a href="#Testing">Testing the Model</a></center>

<center><a href="#Summary">Summary</a></center>

<center><a href="#Challenge">Challenge</a></center>

<hr />

<center>***Original Content by Xianshun Chen:*** <br/>https://github.com/chen0040/keras-text-summarization</center>

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;">
OVERVIEW
</div>

<center style="color:#0B8261;">
This lesson shows how to use the Keras Sequence2Sequence Text Summarizer to generate headline-style summaries for a news dataset.
<br/>
This lesson's folder (L4_data) contains several other Seq2Seq and Encoder-Decoder RNN implementations for you to experiment with. Depending on the dataset you use, they may even yield better results.
</center>

<br/>

<center><b>[Click here for an Introduction to Text Summarization](https://machinelearningmastery.com/gentle-introduction-text-summarization/)</b></center>

<center><b>[Click here for an Introduction to Encoder/Decoder Models](https://machinelearningmastery.com/encoder-decoder-models-text-summarization-keras/)</b></center>

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="TensorFlow-Devices">
TENSORFLOW DEVICES
</div>

<b>After executing the code cell below, you can see further details for your devices in the Jupyter Console.</b> 

In [None]:
from tensorflow.python.client import device_lib

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

get_available_devices()

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Prep-and-Process">
PREPARATION AND PRE-PROCESSING
</div>

<h3 style="color:#45A046;">Imports</h3>

In [2]:
from __future__ import print_function

import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.model_selection import train_test_split

from keras_text_summarization.library.utility.plot_utils import plot_and_save_history
from keras_text_summarization.library.seq2seq import Seq2SeqSummarizer
from keras_text_summarization.library.applications.fake_news_loader import fit_text

Using TensorFlow backend.


In [3]:
LOAD_EXISTING_WEIGHTS = True

np.random.seed(42)
data_dir_path = './L4_data/data'
report_dir_path = './L4_data/reports'
model_dir_path = './L4_data/models'

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Training">
TRAINING
</div>

<h3 style="color:#45A046;">Load Training Data</h3>

We will use the provided news dataset, which contains articles and titles from various news sources.

This data is pre-processed inside the custom functions in the 'keras_text_summarization' folder.
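
To give a feel for what this kind of pre-processing involves, here is a minimal sketch of building a word-to-index vocabulary from raw text. It is an illustration only: `build_vocab_config`, `encode`, the `PAD`/`UNK` ids and the dictionary keys are hypothetical names chosen for this example, and the library's own `fit_text` function may work differently.

```python
from collections import Counter

def build_vocab_config(texts, max_vocab_size=5000):
    """Hypothetical helper: map the most frequent words to integer ids."""
    counter = Counter()
    max_len = 0
    for text in texts:
        words = text.lower().split()
        counter.update(words)
        max_len = max(max_len, len(words))

    # Reserve ids 0 and 1 for padding and out-of-vocabulary words.
    word2idx = {'PAD': 0, 'UNK': 1}
    for word, _ in counter.most_common(max_vocab_size):
        word2idx[word] = len(word2idx)

    return {'word2idx': word2idx, 'max_seq_length': max_len}

def encode(text, config):
    """Turn a text into a list of ids, using the UNK id for unseen words."""
    unk = config['word2idx']['UNK']
    return [config['word2idx'].get(w, unk) for w in text.lower().split()]

demo_config = build_vocab_config(['the cat sat', 'the dog sat down'])
print(encode('the bird sat', demo_config))
```

A sequence model works on these integer ids rather than on raw strings, which is why a configuration step of this kind comes before the model is built.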

In [4]:
# Load CSV into DataFrame
print('Loading CSV . . .')
df = pd.read_csv(data_dir_path + "/news.csv")

# Extract text for configuration
print('Extracting for config . . . ')
Y = df.title
X = df['text']
config = fit_text(X, Y)
print('-> Complete')

Loading CSV . . .
Extracting for config . . . 
-> Complete


<div style="background-color:#D33222; margin-left:10%; width:90%; height:38px; color:white; font-size:18px; padding:10px; float:right;">
WARNING
</div>
>- Make sure that the dataset is fully downloaded and extracted before continuing.

<h3 style="color:#45A046;">Quote</h3>

<blockquote style="font-style: italic;">
    
...there are two different approaches for automatic summarization currently:
<br/><br/>
<b>Extraction</b> and <b>Abstraction</b>.
<br/><br/>
<b>Extractive summarization</b> methods work by identifying important sections of the text and generating them verbatim;  
<br/>
<b>...Abstractive summarization</b> methods aim at producing important material in a new way. In other words, they interpret and examine the text using advanced natural language techniques in order to generate a new shorter text that conveys the most critical information from the original text.
<br/><br/>
- [Text Summarization Techniques: A Brief Survey, 2017](https://arxiv.org/abs/1707.02268)</blockquote>
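
To make the contrast concrete, below is a minimal sketch of the <b>extractive</b> approach: it scores each sentence by the average frequency of its words and returns the top sentence verbatim. The scoring rule and the name `extractive_summary` are assumptions made for this illustration; they are not part of this lesson's library, which takes the abstractive route instead.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences verbatim, in original order."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    word_freq = Counter(re.findall(r'\w+', text.lower()))

    def score(sentence):
        words = re.findall(r'\w+', sentence.lower())
        return sum(word_freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return ' '.join(sentences[i] for i in chosen)

article = ("The council approved the new budget. "
           "The budget raises funding for schools. "
           "Residents were surprised.")
print(extractive_summary(article))
```

Every word of the output already appears in the input. An abstractive model such as the Seq2Seq summarizer used here generates its words one at a time, so it can produce phrasing that never occurs in the article.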


<h3 style="color:#45A046;">Initialize Summarizer Model</h3>

In [5]:
summarizer = Seq2SeqSummarizer(config)

# Set LOAD_EXISTING_WEIGHTS to False (in the configuration cell above) to start fresh!
if LOAD_EXISTING_WEIGHTS:
    summarizer.load_weights(weight_file_path=Seq2SeqSummarizer.get_weight_file_path(model_dir_path=model_dir_path))

<h3 style="color:#45A046;">Split Data into Train and Test Sets</h3>

In [6]:
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=42)
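
For intuition, the sketch below shows roughly what an 80/20 split with a fixed seed does, using only the standard library. `simple_train_test_split` is a hypothetical stand-in written for this example; it is not scikit-learn's implementation and will not reproduce the same shuffle order.

```python
import random

def simple_train_test_split(items, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed, then cut off the test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

train, test = simple_train_test_split(list(range(10)))
print(len(train), len(test))
```

Fixing the seed means the same rows land in the test set on every run, so results from different training runs stay comparable.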

<h3 style="color:#45A046;">Fit the Training Data to the Model</h3>

In other words - let's start training our model!

<br/>

<div style="background-color:#D33222; margin-left:10%; width:90%; height:38px; color:white; font-size:18px; padding:10px; float:right;">
WARNING
</div>
>- <b>The code cell directly below will start training the model!</b>
>- This model is set to execute 100 epochs with a batch size of 5.
>- This results in a <b>long training time</b> unless you are secretly Megatron.
>- See 'The Beast' section for more information on speeding this up.
>- If you get tired of waiting for it to train locally:
    - Interrupt the kernel and continue to the 'Testing' section.
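
To see why this takes so long, the sketch below estimates the number of weight updates for a given setup. It assumes one update per batch and counts a final partial batch; the training-set size of 5000 is a made-up figure for illustration, not the size of this lesson's dataset.

```python
import math

def updates_per_run(num_train_samples, batch_size, epochs):
    """Estimate total weight updates, assuming one update per batch."""
    steps_per_epoch = math.ceil(num_train_samples / batch_size)
    return steps_per_epoch * epochs

# Hypothetical training-set size; substitute len(Xtrain) in the notebook.
print(updates_per_run(num_train_samples=5000, batch_size=5, epochs=100))
```

A larger batch size reduces the number of updates per epoch, which is one of the trade-offs explored in the Challenge section.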

In [None]:
# Optional TF Device Selection (code below must be indented)
with tf.device('/GPU:0'):
    history = summarizer.fit(Xtrain, Ytrain, Xtest, Ytest, epochs=100, batch_size=5, model_dir_path=model_dir_path)
    
history_plot_file_path = report_dir_path + '/' + Seq2SeqSummarizer.model_name + '-history.png'

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Beast">
THE BEAST
</div>

AI/Hub Team Members can also use 'The Beast' to process this training code at a faster rate!

An informational document on using The Beast is being written; it will be available on the ORSIE AI/Hub Internal Site once it has been completed!

<b>Please ask your Lead Researcher for more information regarding this.</b>

However, you will be able to test the current model locally, even with limited training!

(. . . Mind the results)

<br/>

<div style="background-color:#D33222; margin-left:10%; width:90%; height:38px; color:white; font-size:18px; padding:10px; float:right;">
NOTE
</div>
>- <b>The code cell directly below will only execute after completing a full training loop!</b>

>- 'history' is created on completion of the summarizer.fit() function
>- If you manually stop the training, you will not be able to run this cell!
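
If you are unsure whether training finished, one option is to check for `history` before plotting. The helper below is a hypothetical convenience written for this note, not part of the lesson's library.

```python
def history_is_available(namespace):
    """Return True if a 'history' object exists in the given namespace."""
    return namespace.get('history') is not None

# In a notebook cell, globals() is the namespace holding earlier results.
if history_is_available(globals()):
    print('history found: safe to plot')
else:
    print('history missing: finish a full training run first')
```

This avoids a `NameError` when the plotting cell is run after an interrupted training loop.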

In [None]:
if LOAD_EXISTING_WEIGHTS:
    history_plot_file_path = report_dir_path + '/' + Seq2SeqSummarizer.model_name + '-history-v' + str(summarizer.version) + '.png'
# Plot and Save History
plot_and_save_history(history, summarizer.model_name, history_plot_file_path, metrics={'loss', 'acc'})

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Testing">
TESTING
</div>

<h3 style="color:#45A046;">Load Testing Data</h3>

In [7]:
# Fix the random seed for reproducibility
np.random.seed(42)

# Define Directory Paths
data_dir_path = './L4_data/data'  # lesson data folder
model_dir_path = './L4_data/models'  # stored models folder

# Load CSV from Directory
print('Loading CSV . . .')
df = pd.read_csv(data_dir_path + "/news.csv")

# Assign dataframe text and title to X and Y values
print('Extracting features . . .')
X = df['text']
Y = df.title
print('-> Complete')

Loading CSV . . .
Extracting features . . .
-> Complete


<h3 style="color:#45A046;">Load Stored Model and Re-Initialize</h3>

In [8]:
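# Load the stored model configuration (a pickled dict) using np.load();
# allow_pickle=True is required on newer NumPy versions to load object arrays
config = np.load(Seq2SeqSummarizer.get_config_file_path(model_dir_path=model_dir_path), allow_pickle=True).item()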

# Re-Initialize the model using the stored configuration
summarizer = Seq2SeqSummarizer(config)

# Load the stored weights into the model
summarizer.load_weights(weight_file_path=Seq2SeqSummarizer.get_weight_file_path(model_dir_path=model_dir_path))

<h3 style="color:#45A046;">Predict Some Headlines</h3>

In [10]:
# Print predicted headlines along with their original title
print('Predicting Headlines . . .')
for i in range(10):
    x = X[i]
    actual_headline = Y[i]
    headline = summarizer.summarize(x)

    print('\n', 'Original: ', actual_headline)
    #print('Article: ', x)
    print('Generated: ', headline)
print('\n', '-> Complete')

Predicting Headlines . . .

 Original:  You Can Smell Hillary’s Fear
Generated:  clinton campaign biggest national are - the onion - america's finest news source

 Original:  Watch The Exact Moment Paul Ryan Committed Political Suicide At A Trump Rally (VIDEO)
Generated:  the trump is what trump's rick of gop debate

 Original:  Kerry to go to Paris in gesture of sympathy
Generated:  not to back to back at least time

 Original:  Bernie supporters on Twitter erupt in anger against the DNC: 'We tried to warn you!'
Generated:  the gop debate on the party is in against trump is a bit to twitter

 Original:  The Battle of New York: Why This Primary Matters
Generated:  the battle of new why why many could go to win

 Original:  Tehran, USA
Generated:  john obama: political top to daily

 Original:  Girl Horrified At What She Watches Boyfriend Do After He Left FaceTime On
Generated:  of be hillary’s why trump’s campaign in 2016

 Original:  ‘Britain’s Schindler’ Dies at 106
Generated:  re: c

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Summary">
SUMMARY
</div>

This tutorial showed how to generate headlines for news articles of various lengths using the Keras sequence2sequence text summarizer.

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Challenge">
CHALLENGE
</div>

These are a few suggestions for exercises that may help improve your skills with TensorFlow. It is important to get hands-on experience with TensorFlow in order to learn how to use it properly.

You may want to back up this Notebook before making any changes.

* Train the model with larger or smaller batch sizes. Does it improve the quality of the generated summaries?
* Try another architecture for the Recurrent Neural Network (see the demo folder). Can you improve the quality of the generated summaries?
* Try using a different dataset to train and test this model - or one of the others provided in the lesson folder (L4_data).
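
To compare runs with something more than eyeballing, a rough word-overlap score between the original and generated headline can help. The sketch below counts shared unigrams, in the spirit of ROUGE-1; `unigram_overlap` is a hypothetical helper written for this exercise, and a proper evaluation would use an established metric implementation.

```python
import re
from collections import Counter

def unigram_overlap(reference, generated):
    """Return (precision, recall, f1) based on shared lowercase words."""
    ref = Counter(re.findall(r'\w+', reference.lower()))
    gen = Counter(re.findall(r'\w+', generated.lower()))
    overlap = sum((ref & gen).values())  # shared words, respecting counts
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Headline pair taken from the prediction output in the Testing section.
p, r, f1 = unigram_overlap(
    'The Battle of New York: Why This Primary Matters',
    'the battle of new why why many could go to win')
print(round(p, 3), round(r, 3), round(f1, 3))
```

Averaging the F1 value over the test set gives a single number to track as you change the batch size, architecture, or dataset. Keep in mind that word overlap says nothing about fluency or meaning.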