{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "

TensorFlow NLP

\n", "

Lesson 4

\n", "

Keras Text Summarization

\n", "\n", "
\n", "\n", "
TensorFlow Devices
\n", "\n", "
Preparation and Pre-Processing
\n", "\n", "
Training the Model
\n", "\n", "
The Beast
\n", "\n", "
Testing the Model
\n", "\n", "
Summary
\n", "\n", "
Challenge
\n", "\n", "
\n", "\n", "
***Original Content by Xianshun Chen:***
https://github.com/chen0040/keras-text-summarization
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "OVERVIEW\n", "
\n", "\n", "
\n", "This lesson shows how to apply a Keras sequence-to-sequence (Seq2Seq) text summarizer to a news dataset in order to generate headline-style summaries.\n", "
\n", "This lesson's folder (L4_data) contains several other Seq2Seq and encoder-decoder RNN implementations for you to experiment with. Depending on the dataset you use, some of them may yield better results.\n", "
\n", "\n", "
\n", "\n", "
[Click here for an Introduction to Text Summarization](https://machinelearningmastery.com/gentle-introduction-text-summarization/)
\n", "\n", "
[Click here for an Introduction to Encoder/Decoder Models](https://machinelearningmastery.com/encoder-decoder-models-text-summarization-keras/)
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "TENSORFLOW DEVICES\n", "
\n", "\n", "After executing the code cell below, you can see further details for your devices in the Jupyter Console. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tensorflow.python.client import device_lib\n", "\n", "def get_available_devices():\n", " local_device_protos = device_lib.list_local_devices()\n", " return [x.name for x in local_device_protos]\n", "\n", "get_available_devices()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "PREPARATION AND PRE-PROCESSING\n", "
\n", "\n", "

Imports

" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] } ], "source": [ "from __future__ import print_function\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import tensorflow as tf\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "from keras_text_summarization.library.utility.plot_utils import plot_and_save_history\n", "from keras_text_summarization.library.seq2seq import Seq2SeqSummarizer\n", "from keras_text_summarization.library.applications.fake_news_loader import fit_text" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "LOAD_EXISTING_WEIGHTS = True\n", "\n", "np.random.seed(42)\n", "data_dir_path = './L4_data/data'\n", "report_dir_path = './L4_data/reports'\n", "model_dir_path = './L4_data/models'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "TRAINING\n", "
\n", "\n", "

Load Training Data

\n", "\n", "We will use the provided news dataset, which contains articles and their titles from various news sources.\n", "\n", "The data is pre-processed by the custom functions in the 'keras_text_summarization' folder." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading CSV . . .\n", "Extracting for config . . . \n", "-> Complete\n" ] } ], "source": [ "# Load CSV into DataFrame\n", "print('Loading CSV . . .')\n", "df = pd.read_csv(data_dir_path + \"/news.csv\")\n", "\n", "# Extract text for configuration:\n", "# titles are the target summaries (Y), article bodies are the inputs (X)\n", "print('Extracting for config . . . ')\n", "Y = df['title']\n", "X = df['text']\n", "config = fit_text(X, Y)\n", "print('-> Complete')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "WARNING\n", "
\n", ">- Make sure that the dataset is fully downloaded and extracted before continuing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Quote

\n", "\n", "
\n", " \n", "...there are two different approaches for automatic summarization currently:\n", "

\n", "Extraction and Abstraction.\n", "

\n", "Extractive summarization methods work by identifying important sections of the text and generating them verbatim; \n", "
\n", "...Abstractive summarization methods aim at producing important material in a new way. In other words, they interpret and examine the text using advanced natural language techniques in order to generate a new shorter text that conveys the most critical information from the original text.\n", "

\n", "- [Text Summarization Techniques: A Brief Survey, 2017](https://arxiv.org/abs/1707.02268)
\n", "\n", "\n", "
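To make the extractive side of the quote concrete, here is a minimal sketch, assuming a naive word-frequency score. It is not part of the 'keras_text_summarization' library, and the function name `extractive_summary` and the sample text are made up for illustration. It picks the highest-scoring sentence and returns it verbatim, which is exactly what an abstractive Seq2Seq model does not do.

```python
from collections import Counter
import string


def extractive_summary(text, n_sentences=1):
    # Naive sentence split on full stops (real splitters handle abbreviations).
    sentences = [s.strip() for s in text.split('.') if s.strip()]

    # Count how often each word occurs in the whole text.
    words = (w.strip(string.punctuation).lower() for w in text.split())
    freq = Counter(w for w in words if w)

    # Score a sentence by the summed frequency of its words.
    def score(sentence):
        return sum(freq[w.strip(string.punctuation).lower()]
                   for w in sentence.split())

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])

    # Return the chosen sentences verbatim, in their original order.
    return [s for s in sentences if s in top]


# Hypothetical sample text, for illustration only.
sample = ('The model reads the article. '
          'The model writes a headline for the article. '
          'Cats sleep.')
print(extractive_summary(sample))
```

Summed frequencies favour longer sentences, so a practical extractive scorer would normalise by sentence length or use a stronger ranking method.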

Initialize Summarizer Model

" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "summarizer = Seq2SeqSummarizer(config)\n", "\n", "# Set LOAD_EXISTING_WEIGHTS to False (in the setup cell above) to train from scratch\n", "if LOAD_EXISTING_WEIGHTS:\n", "    summarizer.load_weights(weight_file_path=Seq2SeqSummarizer.get_weight_file_path(model_dir_path=model_dir_path))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Split Data into Train and Test Sets

" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
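As a rough picture of what an 80/20 split with a fixed seed does, the sketch below shuffles row indices once and slices them, so every article stays paired with its title. This is only an illustration in plain Python, not scikit-learn's actual implementation, and `split_pairs` and the toy data are hypothetical names.

```python
import random


def split_pairs(inputs, targets, test_size=0.2, seed=42):
    # Shuffle the row indices once so inputs and targets stay aligned.
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)

    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    return (pick(inputs, train_idx), pick(inputs, test_idx),
            pick(targets, train_idx), pick(targets, test_idx))


# Hypothetical toy data: ten articles with matching titles.
articles = ['article %d' % i for i in range(10)]
titles = ['title %d' % i for i in range(10)]
x_train, x_test, y_train, y_test = split_pairs(articles, titles)
print(len(x_train), len(x_test))
```

Using the same seed on every run reproduces the same split, which is why the notebook passes `random_state=42`.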

Fit the Training Data to the Model

\n", "\n", "In other words - let's start training our model!\n", "\n", "
\n", "\n", "
\n", "WARNING\n", "
\n", ">- The code cell directly below will start training the model!\n", ">- This model is set to execute 100 epochs with a batch size of 5.\n", ">- This results in a long training time unless you are secretly Megatron.\n", ">- See 'The Beast' section for more information on speeding this up.\n", ">- If you get tired of waiting for it to train locally:\n", " - Interrupt the kernel and continue to the 'Testing' section." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional TF Device Selection (code below must be indented)\n", "with tf.device('/GPU:0'):\n", " history = summarizer.fit(Xtrain, Ytrain, Xtest, Ytest, epochs=100, batch_size=5, model_dir_path=model_dir_path)\n", " \n", "history_plot_file_path = report_dir_path + '/' + Seq2SeqSummarizer.model_name + '-history.png'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
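To see why 100 epochs with a batch size of 5 takes so long, it helps to count the weight updates. The sketch below is a back-of-the-envelope estimate only: the row count of 1000 is hypothetical (check `len(df)` for the real figure), and whether a trailing partial batch counts as a step depends on the library.

```python
import math


def training_steps(n_rows, test_size=0.2, batch_size=5, epochs=100):
    # Rows left for training after holding out the test split.
    n_test = math.ceil(n_rows * test_size)
    n_train = n_rows - n_test

    # One step per batch; a partial final batch is counted as a step here.
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# Hypothetical dataset of 1000 rows.
print(training_steps(1000))                 # (160, 16000)
print(training_steps(1000, batch_size=32))  # (25, 2500)
```

A larger batch size cuts the number of steps per epoch, although each step then processes more data and the results may differ.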
\n", "THE BEAST\n", "
\n", "\n", "AI/Hub team members can also use 'The Beast' to run this training code much faster.\n", "\n", "A guide to using The Beast is being written; it will be available on the ORSIE AI/Hub Internal Site once it is complete.\n", "\n", "Please ask your Lead Researcher for more information.\n", "\n", "In the meantime, you can still test the current model locally, even with limited training.\n", "\n", "(Expect the results to reflect that.)\n", "\n", "
\n", "\n", "
\n", "NOTE\n", "
\n", ">- The code cell directly below only works after training has run to completion.\n", ">- 'history' is assigned only when summarizer.fit() returns.\n", ">- If you interrupt training manually, 'history' is never defined and this cell will fail." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# When resuming from existing weights, include the model version in the plot file name\n", "if LOAD_EXISTING_WEIGHTS:\n", "    history_plot_file_path = report_dir_path + '/' + Seq2SeqSummarizer.model_name + '-history-v' + str(summarizer.version) + '.png'\n", "\n", "# Plot and Save History\n", "plot_and_save_history(history, summarizer.model_name, history_plot_file_path, metrics={'loss', 'acc'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "TESTING\n", "
\n", "\n", "

Load Testing Data

" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading CSV . . .\n", "Extracting features . . .\n", "-> Complete\n" ] } ], "source": [ "# Fix the random seed for reproducibility\n", "np.random.seed(42)\n", "\n", "# Define Directory Paths\n", "data_dir_path = './L4_data/data' # folder containing news.csv\n", "model_dir_path = './L4_data/models' # folder containing the saved config and weights\n", "\n", "# Load CSV from Directory\n", "print('Loading CSV . . .')\n", "df = pd.read_csv(data_dir_path + \"/news.csv\")\n", "\n", "# Assign the article text to X and the titles to Y\n", "print('Extracting features . . .')\n", "X = df['text']\n", "Y = df['title']\n", "print('-> Complete')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Load Stored Model and Re-Initialize

" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Load the stored model configuration (a pickled dict) using NumPy.load();\n", "# allow_pickle=True is required on NumPy 1.16.3 and later\n", "config = np.load(Seq2SeqSummarizer.get_config_file_path(model_dir_path=model_dir_path), allow_pickle=True).item()\n", "\n", "# Re-Initialize the model using the stored configuration\n", "summarizer = Seq2SeqSummarizer(config)\n", "\n", "# Load the stored weights into the model\n", "summarizer.load_weights(weight_file_path=Seq2SeqSummarizer.get_weight_file_path(model_dir_path=model_dir_path))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Predict Some Headlines

" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Predicting Headlines . . .\n", "\n", " Original: You Can Smell Hillary’s Fear\n", "Generated: clinton campaign biggest national are - the onion - america's finest news source\n", "\n", " Original: Watch The Exact Moment Paul Ryan Committed Political Suicide At A Trump Rally (VIDEO)\n", "Generated: the trump is what trump's rick of gop debate\n", "\n", " Original: Kerry to go to Paris in gesture of sympathy\n", "Generated: not to back to back at least time\n", "\n", " Original: Bernie supporters on Twitter erupt in anger against the DNC: 'We tried to warn you!'\n", "Generated: the gop debate on the party is in against trump is a bit to twitter\n", "\n", " Original: The Battle of New York: Why This Primary Matters\n", "Generated: the battle of new why why many could go to win\n", "\n", " Original: Tehran, USA\n", "Generated: john obama: political top to daily\n", "\n", " Original: Girl Horrified At What She Watches Boyfriend Do After He Left FaceTime On\n", "Generated: of be hillary’s why trump’s campaign in 2016\n", "\n", " Original: ‘Britain’s Schindler’ Dies at 106\n", "Generated: re: clinton’s email and coming\n", "\n", " Original: Fact check: Trump and Clinton at the 'commander-in-chief' forum\n", "Generated: is republicans the jeb bush director up the gop debate in the\n", "\n", " Original: Iran reportedly makes new push for uranium concessions in nuclear talks\n", "Generated: election is coming in the world war iii - the onion - america's finest news source\n", "\n", " -> Complete\n" ] } ], "source": [ "# Print predicted headlines along with their original title\n", "print('Predicting Headlines . . 
.')\n", "for i in range(10):\n", " x = X[i]\n", " actual_headline = Y[i]\n", " headline = summarizer.summarize(x)\n", "\n", " print('\\n', 'Original: ', actual_headline)\n", " #print('Article: ', x)\n", " print('Generated: ', headline)\n", "print('\\n', '-> Complete')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
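Reading the generated headlines by eye only goes so far. One rough way to put a number on them is unigram overlap with the original title, in the spirit of ROUGE-1. The sketch below is a simplified stand-in, not an official ROUGE implementation, and the helper names `tokens` and `unigram_overlap` are made up for this example. It scores one of the pairs printed above.

```python
from collections import Counter
import string


def tokens(text):
    # Lowercase and strip surrounding punctuation from each word.
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]


def unigram_overlap(reference, generated):
    # Counter intersection clips each word at its smaller count.
    ref, gen = Counter(tokens(reference)), Counter(tokens(generated))
    overlap = sum((ref & gen).values())
    recall = overlap / sum(ref.values())
    precision = overlap / sum(gen.values())
    return overlap, recall, precision


reference = 'The Battle of New York: Why This Primary Matters'
generated = 'the battle of new why why many could go to win'
overlap, recall, precision = unigram_overlap(reference, generated)
print(overlap, round(recall, 2), round(precision, 2))  # 5 0.56 0.45
```

Averaging such a score over the test set gives a simple baseline for comparing batch sizes or architectures in the Challenge section. The sketch assumes neither string is empty.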
\n", "SUMMARY\n", "
\n", "\n", "This tutorial showed how to generate headlines for news articles of various lengths using a Keras sequence-to-sequence text summarizer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "CHALLENGE\n", "
\n", "\n", "Here are a few suggested exercises that may help improve your skills with TensorFlow. Hands-on experience is important for learning how to use it properly.\n", "\n", "You may want to back up this notebook before making any changes.\n", "\n", "* Train the model with larger or smaller batch sizes. Does it improve the quality of the generated summaries?\n", "* Try another architecture for the Recurrent Neural Network (see the demo folder). Can you improve the quality of the generated summaries?\n", "* Try using a different dataset to train and test this model, or one of the other models provided in the lesson folder (L4_data)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }