{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## AI for Medicine Course 3 Week 2 lecture notebook \n", "## Preparing Input for Text Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lecture notebook you'll be working with input for text classification models. You'll simulate [BERT's](https://github.com/google-research/bert) tokenizer for a simple example, and then use it for real in the upcoming assignment!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Library" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import tensorflow as tf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Define Model Inputs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Say you're in the following situation: You have a passage containing a patient's medical information and would like your model to be able to answer questions using information from this passage. First, you'll need to reformulate this question and text in a way that BERT can interpret correctly. Let's define a question and the passage:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "q = \"How old is the patient?\"\n", "p = '''\n", "The patient is a 64-year-old male named Bob. \n", "He has no history of chronic spine conditions but is \n", "showing mild degenerative changes in the lumbar spine and old right rib fractures.\n", "'''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenize Sentences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this information, you would normally use BERT's tokenizer to tokenize the sentences like this: \n", "```python\n", "tokenizer.tokenize(q)\n", "```\n", "\n", "Luckily, this has already been taken care of for you!" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "q_tokens = ['How', 'old', 'is', 'the', 'patient', '?']\n", "p_tokens = ['The', 'patient', 'is', 'a', '64', 'year', 'old', 'male', 'named', 'Bob', '.',\n", " 'He', 'has', 'no', 'history', 'of', 'chronic', 'spine', 'conditions', 'but', 'is',\n", " 'showing', 'mild', 'de', '##gene', '##rative', 'changes', 'in', 'the', 'l', '##umba',\n", " '##r', 'spine', 'and', 'old', 'right', 'rib', 'fracture', '##s', '.']\n", "classification_token = '[CLS]'\n", "separator_token = '[SEP]'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The classification token and separator token are also provided. These tokens can be accessed using the tokenizer like so:\n", "\n", "```python\n", "CLS = tokenizer.cls_token\n", "SEP = tokenizer.sep_token\n", "```\n", "These tokens are really important because you'll need to combine the question and passage tokens into a single list of tokens. These special tokens allow BERT to understand which is which.\n", "\n", "The CLS, or classification token, should come first. Then, use the SEP token as a separator between the question and the passage. 
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The token list looks like this: \n", "\n", "['[CLS]', 'How', 'old', 'is', 'the', 'patient', '?', '[SEP]', 'The', 'patient', 'is', 'a', '64', 'year', 'old', 'male', 'named', 'Bob', '.', 'He', 'has', 'no', 'history', 'of', 'chronic', 'spine', 'conditions', 'but', 'is', 'showing', 'mild', 'de', '##gene', '##rative', 'changes', 'in', 'the', 'l', '##umba', '##r', 'spine', 'and', 'old', 'right', 'rib', 'fracture', '##s', '.']\n" ] } ], "source": [ "tokens = []\n", "tokens.append(classification_token)\n", "tokens.extend(q_tokens)\n", "tokens.append(separator_token)\n", "tokens.extend(p_tokens)\n", "print(f\"The token list looks like this: \\n\\n{tokens}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert Tokens to Numerical Representations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You now have the complete token list. However, you still need to convert these tokens into numeric representations of themselves. Usually, you would convert them like this:\n", "\n", "```python\n", "tokenizer.convert_tokens_to_ids(tokens)\n", "```\n", "Fortunately for you, this has also been provided: " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "token_ids = [101, 1731, 1385, 1110, 1103, 5351, 136, 102, 1109, 5351, 1110, 170, 3324,\n", " 1214, 1385, 2581, 1417, 3162, 119, 1124, 1144, 1185, 1607, 1104, 13306, 8340,\n", " 2975, 1133, 1110, 4000, 10496, 1260, 27054, 15306, 2607, 1107, 1103, 181, 25509,\n", " 1197, 8340, 1105, 1385, 1268, 23298, 22869, 1116, 119]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Apply Padding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is great, except the length of the list of `token ids` depends on the number of words in the question. The passage and BERT only accepts fixed-size input.\n", "\n", "To deal with this, you'll use **padding**, which involves filling out the rest of this list with an empty value until it reaches a maximum length that you set. \n", "\n", "In this case you'll use \"0\" as your empty value, 60 as the maximum length, and then leverage the `pad_sequences()` function from Keras' Sequence module:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 101, 1731, 1385, 1110, 1103, 5351, 136, 102, 1109,\n", " 5351, 1110, 170, 3324, 1214, 1385, 2581, 1417, 3162,\n", " 119, 1124, 1144, 1185, 1607, 1104, 13306, 8340, 2975,\n", " 1133, 1110, 4000, 10496, 1260, 27054, 15306, 2607, 1107,\n", " 1103, 181, 25509, 1197, 8340, 1105, 1385, 1268, 23298,\n", " 22869, 1116, 119, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0]], dtype=int32)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", "\n", "max_length = 60\n", "\n", "token_ids = pad_sequences([token_ids], padding=\"post\", maxlen=max_length)\n", "token_ids" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It appears the padding has been done correctly. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Apply Padding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is great, except that the length of `token_ids` depends on the number of words in the question and the passage, while BERT only accepts fixed-size input.\n", "\n", "To deal with this, you'll use **padding**: filling out the rest of the list with an empty value until it reaches a maximum length that you set. \n", "\n", "In this case you'll use \"0\" as your empty value and 60 as the maximum length, and leverage the `pad_sequences()` function from Keras' `preprocessing.sequence` module:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 101, 1731, 1385, 1110, 1103, 5351, 136, 102, 1109,\n", " 5351, 1110, 170, 3324, 1214, 1385, 2581, 1417, 3162,\n", " 119, 1124, 1144, 1185, 1607, 1104, 13306, 8340, 2975,\n", " 1133, 1110, 4000, 10496, 1260, 27054, 15306, 2607, 1107,\n", " 1103, 181, 25509, 1197, 8340, 1105, 1385, 1268, 23298,\n", " 22869, 1116, 119, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0]], dtype=int32)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", "\n", "max_length = 60\n", "\n", "token_ids = pad_sequences([token_ids], padding=\"post\", maxlen=max_length)\n", "token_ids" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It appears the padding has been done correctly. Next, the token ids need to be a Tensor; the padded array is easily recast using the `convert_to_tensor()` function from TensorFlow:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<tf.Tensor: shape=(1, 60), dtype=int32, numpy=\n", "array([[ 101, 1731, 1385, 1110, 1103, 5351, 136, 102, 1109,\n", " 5351, 1110, 170, 3324, 1214, 1385, 2581, 1417, 3162,\n", " 119, 1124, 1144, 1185, 1607, 1104, 13306, 8340, 2975,\n", " 1133, 1110, 4000, 10496, 1260, 27054, 15306, 2607, 1107,\n", " 1103, 181, 25509, 1197, 8340, 1105, 1385, 1268, 23298,\n", " 22869, 1116, 119, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0]], dtype=int32)>" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token_ids = tf.convert_to_tensor(token_ids)\n", "token_ids" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add the Input Mask" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You're almost done! BERT also needs an input mask: a list of the same length as `token_ids`, indicating whether each position holds a real token or an empty value created by padding.\n", "\n", "You'll see how to build this using Keras' Masking layer, although in this simple case it could also be done with a little plain Python. \n", "\n", "If you're interested in learning some of the details of padding and masking, check [this](https://www.tensorflow.org/guide/keras/masking_and_padding) out." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<tf.Tensor: shape=(1, 60), dtype=bool, numpy=\n", "array([[ True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, False, False, False, False, False, False,\n", " False, False, False, False, False, False]])>" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from tensorflow.keras import layers\n", "\n", "masking_layer = layers.Masking()\n", "\n", "# Masking expects a features axis, so expand the (1, 60) ids to (1, 60, 1) and cast to float\n", "unmasked = tf.cast(\n", " tf.tile(tf.expand_dims(tf.convert_to_tensor(\n", " token_ids), axis=-1), [1, 1, 1]),\n", " tf.float32)\n", "\n", "# Positions equal to the layer's mask_value (0 by default) are masked out\n", "masked = masking_layer(unmasked)\n", "token_mask = masked._keras_mask\n", "token_mask" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the token mask outputs `True` for tokens and `False` for padding.\n", "\n", "Now you've successfully created and formatted the inputs necessary to use the BERT model!\n", "\n", "In cases where you don't want to use Keras, you can compute the same mask with plain Python lists, although the result is then a plain list of booleans rather than the Tensor produced by the Masking layer. A sketch of that approach follows below." ] },
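{ "cell_type": "markdown", "metadata": {}, "source": [ "Here's what that plain-Python version might look like: a sketch over a toy padded list, where (as above) an id of 0 counts as padding. The result is a plain list of booleans rather than a Tensor:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Toy padded id list, made up for illustration; the zeros are padding\n", "toy_padded_ids = [101, 1731, 1385, 136, 102, 0, 0, 0]\n", "\n", "# True where a real token id sits, False where there is padding\n", "python_mask = [token_id != 0 for token_id in toy_padded_ids]\n", "python_mask" ] },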
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Manipulating Tensors\n", "\n", "Before moving on, let's take the padded list of token ids and convert it into the same kind of tensor you just created:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "padded_token_ids = [101, 1731, 1385, 1110, 1103, 5351, 136, 102, 1109,\n", " 5351, 1110, 170, 3324, 1214, 1385, 2581, 1417, 3162,\n", " 119, 1124, 1144, 1185, 1607, 1104, 13306, 8340, 2975,\n", " 1133, 1110, 4000, 10496, 1260, 27054, 15306, 2607, 1107,\n", " 1103, 181, 25509, 1197, 8340, 1105, 1385, 1268, 23298,\n", " 22869, 1116, 119, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's convert the list into a tensor:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "padded_token_ids = tf.convert_to_tensor(padded_token_ids)\n", "padded_token_ids" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the shape of this tensor doesn't match the desired one: the batch dimension is missing. You can easily check this by doing the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "padded_token_ids.shape == token_ids.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the `expand_dims()` function from TensorFlow, you can add the missing batch dimension like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "padded_token_ids = tf.expand_dims(padded_token_ids, 0)\n", "padded_token_ids" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "padded_token_ids.shape == token_ids.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Congratulations on finishing this lecture notebook!!!** \n", "\n", "You're all done preparing some simple input for BERT. Excellent job!" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }