{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7b777dd-94ac-409a-be55-c785890890ee",
   "metadata": {},
   "outputs": [],
   "source": [
    "# =============================================================\n",
    "# Copyright © 2023 Intel Corporation\n",
    "# \n",
    "# SPDX-License-Identifier: MIT\n",
    "# ============================================================="
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3871d018-5cda-4049-b15e-78ad140f58fe",
   "metadata": {},
   "source": [
    "# Leveraging Intel Extension for TensorFlow with LSTM for Text Generation\n",
    "\n",
    "The sample will present the way to train the model for text generation with LSTM (Long short-term Memory) using the Intel extension for TensorFlow. It will focus on the parts that are relevant for faster execution on Intel hardware enabling transition of existing model training notebooks to use Intel extension for TensorFlow (later in the text Itex).\n",
    "\n",
    "In order to have text generated, one needs a deep learning model. The goal of text generation model is to predict the probability distribution of the next word in a sequence given the previous words. For that, large amount of text is feed to the model training."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5f67d14-613f-43b1-b756-b1e310d4c832",
   "metadata": {},
   "source": [
    "## Preparing the data\n",
    "\n",
    "Training data can be text input (in form of a book, article, etc). For this sample we will [The Republic by Plato](https://www.gutenberg.org/cache/epub/1497/pg1497.txt). \n",
    "\n",
    "Once downloaded, the data should be pre-processed: remove punctuation, remove special characters and set all letter to lowercase."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f4ab219-6dec-46d7-9317-8a97514cca73",
   "metadata": {},
   "outputs": [],
   "source": [
    "import string\n",
    "import requests\n",
    "import os\n",
    "\n",
    "response = requests.get('https://www.gutenberg.org/cache/epub/1497/pg1497.txt')\n",
    "data = response.text.split('\\n')\n",
    "data = \" \".join(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7bc08f6-b520-4171-95f2-f43d10d32b24",
   "metadata": {},
   "outputs": [],
   "source": [
    "def clean_text(doc):\n",
    "    tokens = doc.split()\n",
    "    table = str.maketrans('', '', string.punctuation)\n",
    "    tokens = [(w.translate(table)) for w in tokens] # list without punctuations\n",
    "    tokens = [word for word in tokens if word.isalpha()] # remove  alphanumeric special characters\n",
    "    tokens = [word.lower() for word in tokens]\n",
    "    return tokens"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bcec7409-0800-4083-b8e2-412723b52096",
   "metadata": {},
   "outputs": [],
   "source": [
    "tokens = clean_text(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "848b0685-600f-49c6-af80-f1f5dc770f8b",
   "metadata": {},
   "source": [
    "Depending on the data, training data width (the number of words/tokens) should be updated. For instance: for longer texts (e.g. novels, lecture books, etc) that need a context the width could be 40 or 50.\n",
    "This means we would provide the input of training data width to get our model to generate next word.\n",
    "\n",
    "According to the width (the number of words/tokens), training data needs to be updated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "40d53ef9-ac59-416d-bb25-c0a61204112b",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_aligned_training_data(text_tokens, train_data_width):\n",
    "    text_tokens[:train_data_width]\n",
    "    \n",
    "    length = train_data_width + 1\n",
    "    lines = []\n",
    "    \n",
    "    for i in range(length, len(text_tokens)): \n",
    "        seq = text_tokens[i - length:i]\n",
    "        line = ' '.join(seq)\n",
    "        lines.append(line)\n",
    "    return lines\n",
    "\n",
    "lines = get_aligned_training_data(tokens, 50)\n",
    "len(lines)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7dd5fd9-4c5d-4b05-a88c-0621f5107991",
   "metadata": {},
   "source": [
    "## Checking available devices\n",
    "\n",
    "Since we want to leverage Intel's GPU for model training, here are simple instructions on how to check if the environment is setup.\n",
    "\n",
    "In order to see which devices are available for TensorFlow to run its training on, run the next cell.\n",
    "\n",
    "NOTE: GPU will be displayed as `XPU` The line should look like:\n",
    "```\n",
    "PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU')\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7a66682-c0cb-494d-80a5-9fb5c53080bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "\n",
    "xpus = tf.config.list_physical_devices()\n",
    "xpus"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff91f6df-f2b2-46d6-9e48-b4d8f3ee27dc",
   "metadata": {},
   "source": [
    "When it comes to the TensorFlow execution, by default, it is using eager mode. In case user wants to run graph mode, it can be done by adding following line:\n",
    "```\n",
    "tf.compat.v1.disable_eager_execution()\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd99b2ec-6964-425e-b8da-fdfcb8cb263a",
   "metadata": {},
   "outputs": [],
   "source": [
    "tf.compat.v1.disable_eager_execution()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12819985-d0f5-4b19-91fd-4c4d96019837",
   "metadata": {},
   "source": [
    "## Preparing and training the model\n",
    "\n",
    "As a final step for training the model, the data needs to be tokenized (every word gets the index assigned) and converted to sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ea4e3a3-0ab3-4eec-891f-612c6316f911",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Tokenization\n",
    "import numpy as np\n",
    "from tensorflow.keras.preprocessing.text import Tokenizer\n",
    "from tensorflow.keras.utils import to_categorical\n",
    "# Keras layers\n",
    "from tensorflow.keras.layers import Embedding, Dense\n",
    "from tensorflow.keras.models import Sequential"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c3ea71f-15df-41fa-931f-244432572e20",
   "metadata": {},
   "outputs": [],
   "source": [
    "def tokenize_prepare_dataset(lines):\n",
    "    tokenizer = Tokenizer()\n",
    "    tokenizer.fit_on_texts(lines)\n",
    "    \n",
    "    # Get vocabulary size of our model\n",
    "    vocab_size = len(tokenizer.word_index) + 1\n",
    "    sequences = tokenizer.texts_to_sequences(lines)\n",
    "    \n",
    "    # Convert to numpy matrix\n",
    "    sequences = np.array(sequences)\n",
    "    x, y = sequences[:, :-1], sequences[:, -1]\n",
    "    y = to_categorical(y, num_classes=vocab_size)\n",
    "    return x, y, tokenizer\n",
    "\n",
    "x, y, itex_tokenizer = tokenize_prepare_dataset(lines)\n",
    "seq_length = x.shape[1]\n",
    "vocab_size = y.shape[1]\n",
    "vocab_size"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7af822da-4567-4cd3-8343-904df220a26e",
   "metadata": {},
   "source": [
    "## LSTM Operator Override Intel Extension for TensorFlow \n",
    "\n",
    "Besides leveraging AI and oneAPI Kit for GPU execution, Intel extension for TensorFlow (Itex) offers operator overrides for some of the Keras layers. In this sample LSTM for text generation model will be used. A LSTM (Long Short-term Memory first proposed in Hochreiter & Schmidhuber, 1997) Neural Network is just another kind of Artificial Neural Network, containing LSTM cells as neurons in some of its layers. Every LSTM layer will contain many LSTM cells. Every LSTM cell looks at its own input column, plus the previous column's cell output. This is how an LSTM cell logic works on determining what qualifies to extract from the previous cell:\n",
    " - Forget gate - determines influence from the previous cell (and its timestamp) on the current one. It uses the 'sigmoid' layer (the result is between 0.0 and 1.0) to decide whether should it be forgotten or remembered\n",
    " - Input gate - the cell tries to learn from the input to this cell. It does that with a dot product of sigmoid (from Forget Gate) and tanh unit. \n",
    " - Output gate - another dot product of the tanh and sigmoid unit. Updated information is passed from the current to the next timestamp.\n",
    "\n",
    "Passing a cell's state/information at timestamps is related to long-term memory and hidden state - to short-term memory.\n",
    "\n",
    "Instead of using LSTM layer from Keras, Itex LSTM layer is better optimized for execution on Intel platform. LSTM layer, provided by Itex is semantically the same as [tf.keras.layers.LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM). Based on available runtime hardware and constraints, this layer will choose different implementations (ITEX-based or fallback-TensorFlow) to maximize the performance.\n",
    "\n",
    "After creating the model and adding Embedding layer, optimized LSTM layer from Itex can be added. Note, however, that this is sample for mentioned input text. Parameters are open to experiment, depending on the input text. Model accuracy with existing parameters is reaching 80%."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9758c3d6-9f7e-4ab6-9366-76ca1d23e63c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import intel_extension_for_tensorflow as itex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56d4d5c3-e2d1-4d13-9a5d-27731bf33206",
   "metadata": {},
   "outputs": [],
   "source": [
    "num_epochs = 200\n",
    "# For custom epochs numbers from the environment\n",
    "if \"ITEX_NUM_EPOCHS\" in os.environ:\n",
    "    num_epochs = int(os.environ.get('ITEX_NUM_EPOCHS'))\n",
    "\n",
    "neuron_coef = 4\n",
    "itex_lstm_model = Sequential()\n",
    "itex_lstm_model.add(Embedding(input_dim=vocab_size, output_dim=seq_length, input_length=seq_length))\n",
    "itex_lstm_model.add(itex.ops.ItexLSTM(seq_length * neuron_coef, return_sequences=True))\n",
    "itex_lstm_model.add(itex.ops.ItexLSTM(seq_length * neuron_coef))\n",
    "itex_lstm_model.add(Dense(units=seq_length * neuron_coef, activation='relu'))\n",
    "itex_lstm_model.add(Dense(units=vocab_size, activation='softmax'))\n",
    "itex_lstm_model.summary()\n",
    "itex_lstm_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])\n",
    "itex_lstm_model.fit(x,y, batch_size=256, epochs=num_epochs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e733a15f-e5f4-43f6-8f4f-ff6cabc5f374",
   "metadata": {},
   "source": [
    "## Compared to LSTM from Keras\n",
    "\n",
    "The training done with Itex LSTM has efficient memory management on Intel GPU. As a reference, on the system with Intel® Arc™ 770, GPU memory was constant at around 4.5GB.\n",
    "\n",
    "Below is the example cell with the same dataset using LSTM layer from keras. To run on the same system, parameters such as sequence length, number of epochs, and other training layer parameters had to be lowered.\n",
    "\n",
    "Compared to parameters that were used by training with Itex LSTM (3192280 total parameters), with keras LSTM only 221870 total parameters were used. Besides accelerating the model training, Itex LSTM offers better memory management in Intel platform."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9a56430e-b17d-46d7-afad-30de62e71125",
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.keras.layers import LSTM\n",
    "\n",
    "# Reducing the sequence to 10 compared to 50 with Itex LSTM\n",
    "lines = get_aligned_training_data(tokens, 10)\n",
    "\n",
    "# Tokenization\n",
    "x, y, keras_tokenizer = tokenize_prepare_dataset(lines)\n",
    "seq_length = x.shape[1]\n",
    "vocab_size = y.shape[1]\n",
    "\n",
    "num_epochs = 20\n",
    "# For custom epochs numbers\n",
    "if \"KERAS_NUM_EPOCHS\" in os.environ:\n",
    "    num_epochs = int(os.environ.get('KERAS_NUM_EPOCHS'))\n",
    "\n",
    "neuron_coef = 1\n",
    "keras_lstm_model = Sequential()\n",
    "keras_lstm_model.add(Embedding(input_dim=vocab_size, output_dim=seq_length, input_length=seq_length))\n",
    "keras_lstm_model.add(LSTM(seq_length * neuron_coef, return_sequences=True))\n",
    "keras_lstm_model.add(LSTM(seq_length * neuron_coef))\n",
    "keras_lstm_model.add(Dense(units=seq_length * neuron_coef, activation='relu'))\n",
    "keras_lstm_model.add(Dense(units=vocab_size, activation='softmax'))\n",
    "keras_lstm_model.summary()\n",
    "keras_lstm_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])\n",
    "keras_lstm_model.fit(x,y, batch_size=256, epochs=num_epochs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "800d3cba-50bc-4654-96fb-f2c1ecac03c7",
   "metadata": {},
   "source": [
    "## Generating text based on the input\n",
    "\n",
    "Now that that the model has been trained, it is time to use it for generating text based on given input (seed text).\n",
    "One can input its own line, but for better result it is best to take the input line from text.\n",
    "\n",
    "A method for generating text has been created, which will take following parameters:\n",
    " - trained model;\n",
    " - tokenizer;\n",
    " - data width that we used;\n",
    " - input text - seed text;\n",
    " - number of words to generate and append to the seed text.\n",
    "\n",
    "For testing, a random line from input text will be taken as a seed text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34e50af6-e783-434d-87c4-352639acb653",
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
    "def generate_text_seq(model, tokenizer, text_seq_length, seed_text, generated_words_count):\n",
    "    text = []\n",
    "    input_text = seed_text\n",
    "    for _ in range(generated_words_count):\n",
    "        encoded = tokenizer.texts_to_sequences([input_text])[0]\n",
    "        encoded = pad_sequences([encoded], maxlen = text_seq_length, truncating = 'pre')\n",
    "        predict_x=model.predict(encoded)\n",
    "        y_predict=np.argmax(predict_x, axis=1)\n",
    "        predicted_word = ''\n",
    "        for word, index in tokenizer.word_index.items():\n",
    "            if index == y_predict:\n",
    "                predicted_word = word\n",
    "                break\n",
    "        input_text += ' ' + predicted_word\n",
    "        text.append(predicted_word)\n",
    "    return ' '.join(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e537a26f-97d4-4f51-9624-fc33dcf2f747",
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "random.seed(101)\n",
    "random_index = random.randint(0, len(lines))\n",
    "random_seed_text = lines[random_index]\n",
    "random_seed_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18e42b75-1411-465b-b742-6366237ad9a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "number_of_words_to_generate = 10\n",
    "generated_text = generate_text_seq(itex_lstm_model, itex_tokenizer, 50, random_seed_text, number_of_words_to_generate)\n",
    "print(\"::: SEED TEXT::: \" + random_seed_text)\n",
    "print(\"::: GENERATED TEXT::: \" + generated_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8dded495",
   "metadata": {},
   "source": [
    "# Summary\n",
    "\n",
    "This was the sample with the basic concept of how to train the model for text generation. The main focus was on leveraging Intel's libraries and platform to address one of the challenges with LSTM and that it is a more complex architecture than simple RNN (Recurrent Neural Network). LSTM takes more memory and time to train due to additional parameters and operations. On the other hand, LSTM has the ability to learn long-term dependencies and capture complex patterns in sequential data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5860692c-3b07-405a-a81d-6758619adfe7",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"[CODE_SAMPLE_COMPLETED_SUCCESFULLY]\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}