{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "etGrrhGmuRi0" }, "source": [ "+ title: Natural Language Processing of German texts - Part 3: Introducing transformer models to predict ratings\n", "+ date: 2020-07-06\n", "+ tags: python, NLP, classification, BERT, neural-networks, tensorflow, transformers\n", "+ Slug: binary-text-classification-predict-ratings-part3-transformer-neural-network-bert\n", "+ Category: Python\n", "+ Authors: MC\n", "+ Summary: Using a unique German data set containing ratings and comments on doctors, we build a Binary Text Classifier. In this third part, we introduce a state of the art model based on the transformer architecture, namely BERT. Using the transformers library for tensorflow, we push our ability to predict the sentiment of comments towards its limit." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "GV_mzVrnuh7l" }, "source": [ "### Motivation\n", "\n", "In the [first part]({filename}/doctors_nlp1.ipynb) of this series, we implemented a complete machine-learning workflow for binary text classification. Using a unique German data set, we achieved decent results in predicting doctor ratings from patients' text comments. In the [second part]({filename}/doctors_nlp2.ipynb) we introduced the concept of word embeddings and built a LSTM neural network in tensorflow, which significantly improved our predictions. \n", "In this post, we will further improve our model and achieve tremendous results in predicting a comment's sentiment. For this, we will introduce a state of the art NLP model based on the transformer architecture. Bidirectional Encoder Representations from Transformers (BERT) combines multiple novel approaches and has significantly surpassed former models in many language related tasks. As before, we will follow this work flow:\n", "\n", "
\n", "\n", "[Image: workflow chart]\n", "\n", "
A typical NLP machine-learning workflow (own illustration)
\n", "
\n", "\n", "The cleaning and pre-processing will be identical to [part 2]({filename}/doctors_nlp2.ipynb). For the feature creation and the modeling, we will use the Hugging Face implementation of [transformers](https://github.com/huggingface/transformers) for Tensorflow 2.0. Transformers provides a general architecture implementation for several state of the art models in the natural language domain. We will use one of the most exciting ones, namely BERT. \n", "In the following notebook, we will go through this process step by step. \n", "
\n", "You can download this notebook or follow along in an interactive version of it on \"Open and \n", " \"Open." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "SERsEbpuugGN" }, "source": [ "### Setup / Data set / cleaning / pre processing\n", "\n", "As this is not the focus of this post, we will go through these steps rather quickly. If you need a refresher, check out the [first]({filename}/doctors_nlp1.ipynb) and [second post]({filename}/doctors_nlp2.ipynb) again. \n", "\n", "We'll be using the same data as before. You can take a look at it on [data.world](https://data.world/mc51/german-language-reviews-of-doctors-by-patients) or directly download it from [here](https://query.data.world/s/v5xl53bs2rgq476vqy7cg7xx2db55y). \n", "As in the second part, we will need to use a [TPU](https://cloud.google.com/tpu/docs/colabs) because of the high computational demand of BERT. While you can get away with using GPUs, you won't stand any chance to run this on a CPU. Luckily, the notebooks on Google Colab offer free TPU usage. If you want to replicate this post, your best bet is to start there." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 714 }, "colab_type": "code", "id": "r-JKjIUr5Frz", "outputId": "57682e4b-e2e3-47bc-b355-dd1c082973b8" }, "outputs": [], "source": [ "# Needed on Google Colab\n", "import os\n", "if os.environ.get('COLAB_GPU', False):\n", " !pip install -U transformers\n", " from google.colab import drive\n", " drive.mount(\"/drive\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 69 }, "colab_type": "code", "id": "fYFC1iYJuQEZ", "outputId": "ef6953b7-60d5-43b7-f133-6c636b252712" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.2.0\n" ] } ], "source": [ "import nltk\n", "import re\n", "import pickle\n", "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import tensorflow as tf\n", "from datetime import datetime\n", "from sklearn import metrics\n", "from sklearn.model_selection import train_test_split\n", "from nltk.stem.snowball import SnowballStemmer\n", "from nltk.tokenize import word_tokenize\n", "from nltk.corpus import stopwords\n", "import warnings\n", "\n", "pd.options.display.max_colwidth = 6000\n", "pd.options.display.max_rows = 400\n", "np.set_printoptions(suppress=True)\n", "warnings.filterwarnings(\"ignore\")\n", "print(tf.__version__)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "14VeTB-gu72U" }, "source": [ "Executing this on Colab will make sure that our model runs on a TPU if available and falls back to GPU / CPU otherwise:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 746 }, "colab_type": "code", "id": "nDBKQm8j_Meg", "outputId": "97e74683-0eef-48fd-d5d9-9d85bff57b5b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running on TPU ['10.111.30.122:8470']\n", "INFO:tensorflow:Initializing the TPU system: grpc://10.111.30.122:8470\n", "INFO:tensorflow:Clearing out eager caches\n", "INFO:tensorflow:Finished initializing TPU system.\n", "INFO:tensorflow:Found TPU system:\n", "INFO:tensorflow:*** Num TPU Cores: 8\n", "INFO:tensorflow:*** Num TPU Workers: 1\n", "INFO:tensorflow:*** Num TPU Cores Per Worker: 8\n", "INFO:tensorflow:*** Available Device: 
_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)\n", "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)\n", "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)\n", "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)\n", "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)\n", "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)\n", "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)\n", "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)\n", "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)\n", "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)\n", "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)\n", "INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)\n", "REPLICAS: 8\n" ] } ], "source": [ "# Try to run on TPU if available\n", "# Detect hardware, return appropriate distribution strategy\n", "try:\n", " tpu = tf.distribute.cluster_resolver.TPUClusterResolver() # TPU detection\n", " print(\"Running on TPU \", tpu.cluster_spec().as_dict()[\"worker\"])\n", "except ValueError:\n", " tpu = None\n", "\n", "if tpu:\n", " tf.config.experimental_connect_to_cluster(tpu)\n", " tf.tpu.experimental.initialize_tpu_system(tpu)\n", " strategy = tf.distribute.experimental.TPUStrategy(tpu)\n", "else:\n", " strategy = tf.distribute.get_strategy()\n", "print(\"REPLICAS: \", strategy.num_replicas_in_sync)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Zi6GR-GBvBId" }, "source": [ "In our case, we are running on a TPU with eight cores. This will give us an immense speed up! \n", "As before, we first download and extract our data:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 350 }, "colab_type": "code", "id": "OqiD4dLhv1UK", "outputId": "50976f07-6afc-4c87-bbf9-b68b12ff4476" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2020-07-11 09:23:05-- https://query.data.world/s/v5xl53bs2rgq476vqy7cg7xx2db55y\n", "Saving to: ‘reviews.zip’\n", "\n", "reviews.zip [ <=> ] 42.94M 60.4MB/s in 0.7s \n", "\n", "2020-07-11 09:23:06 (60.4 MB/s) - ‘reviews.zip’ saved [45022322]\n", "\n", "Archive: reviews.zip\n", " inflating: german_doctor_reviews.csv \n" ] } ], "source": [ "# store current path and download and extract data there\n", "CURR_PATH = !pwd\n", "!wget -O reviews.zip https://query.data.world/s/v5xl53bs2rgq476vqy7cg7xx2db55y\n", "!unzip reviews.zip" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": {}, "colab_type": "code", "id": "zxTJmxYsSGRM" }, "outputs": [], "source": [ "# PARAMETERS\n", "PATH_DATA = CURR_PATH[0]\n", "PATH_GDRIVE_TMP = \"/drive/My Drive/tmp/\" # Google Drive" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "XjeL0DV3vQgb" }, "source": [ "Let's load the data set and create our target variable. 
Positive ratings (one or two) will be labeled as good and negative ones (five or six) as bad. As before, we ignore neutral ratings:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 164 }, "colab_type": "code", "id": "_ds8eLyeQ-qf", "outputId": "29d20e48-2ca2-456c-d73d-f811f0c06585" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 | comment | rating | grade_bad
0 | Ich bin franzose und bin seit ein paar Wochen in muenchen. Ich hatte Zahn Schmerzen und mein Kollegue hat mir Dr mainka empfohlen. Ich habe schnell ein Termin bekommen, das Team war nett und meine schmerzen sind weg!! Ich bin als Angst Patient sehr zurieden!! | 2.0 | 0.0
1 | Dieser Arzt ist das unmöglichste was mir in meinem Leben je begegnet ist,er ist unfreundlich ,sehr herablassend und medizinisch unkompetent.Nach seiner Diagnose bin ich zu einem anderen Hautarzt gegangen der mich ordentlich behandelt hat und mir auch half.Meine Beschweerden hatten einen völlig anderen Grund.<br />\\nNach seiner \" Behandlung \" und Diagnose ,waren seine letzten Worte .....und tschüss.Alles inerhalb von ca 5 Minuten. | 6.0 | 1.0
\n", "
" ], "text/plain": [ " comment ... grade_bad\n", "0 Ich bin franzose und bin seit ein paar Wochen in muenchen. Ich hatte Zahn Schmerzen und mein Kollegue hat mir Dr mainka empfohlen. Ich habe schnell ein Termin bekommen, das Team war nett und meine schmerzen sind weg!! Ich bin als Angst Patient sehr zurieden!! ... 0.0\n", "1 Dieser Arzt ist das unmöglichste was mir in meinem Leben je begegnet ist,er ist unfreundlich ,sehr herablassend und medizinisch unkompetent.Nach seiner Diagnose bin ich zu einem anderen Hautarzt gegangen der mich ordentlich behandelt hat und mir auch half.Meine Beschweerden hatten einen völlig anderen Grund.
\\nNach seiner \" Behandlung \" und Diagnose ,waren seine letzten Worte .....und tschüss.Alles inerhalb von ca 5 Minuten. ... 1.0\n", "\n", "[2 rows x 3 columns]" ] }, "execution_count": 6, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# read data from csv\n", "data = pd.read_csv(PATH_DATA + \"/german_doctor_reviews.csv\")\n", "\n", "# Create binary grade, class 1-2 or 5-6 = good or bad\n", "data[\"grade_bad\"] = 0\n", "data.loc[data[\"rating\"] >= 3, \"grade_bad\"] = np.nan\n", "data.loc[data[\"rating\"] >= 5, \"grade_bad\"] = 1\n", "\n", "data.head(2)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "LTqSgBkZvVyC" }, "source": [ "Our cleaning and pre-processing strategy will be almost identical to the [second post]({filename}/doctors_nlp2.ipynb). We will keep the text structure unchanged and only tidy it up a bit. The only difference to before is that we won't transform the texts to lowercase:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 86 }, "colab_type": "code", "id": "ikzWTxAASFZM", "outputId": "35166e88-83ef-4fe6-9abe-1d368387da2e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", "[nltk_data] Unzipping corpora/stopwords.zip.\n", "[nltk_data] Downloading package punkt to /root/nltk_data...\n", "[nltk_data] Unzipping tokenizers/punkt.zip.\n" ] } ], "source": [ "nltk.download(\"stopwords\")\n", "nltk.download(\"punkt\")\n", "stemmer = SnowballStemmer(\"german\")\n", "stop_words = set(stopwords.words(\"german\"))\n", "\n", "\n", "def clean_text(text, for_embedding=False):\n", " \"\"\"\n", " - remove any html tags (< /br> often found)\n", " - Keep only ASCII + European Chars and whitespace, no digits\n", " - remove single letter chars\n", " - convert all whitespaces (tabs etc.) to single wspace\n", " if not for embedding (but e.g. tdf-idf):\n", " - all lowercase\n", " - remove stopwords, punctuation and stemm\n", " \"\"\"\n", " RE_WSPACE = re.compile(r\"\\s+\", re.IGNORECASE)\n", " RE_TAGS = re.compile(r\"<[^>]+>\")\n", " RE_ASCII = re.compile(r\"[^A-Za-zÀ-ž ]\", re.IGNORECASE)\n", " RE_SINGLECHAR = re.compile(r\"\\b[A-Za-zÀ-ž]\\b\", re.IGNORECASE)\n", " if for_embedding:\n", " # Keep punctuation\n", " RE_ASCII = re.compile(r\"[^A-Za-zÀ-ž,.!? 
]\", re.IGNORECASE)\n", " RE_SINGLECHAR = re.compile(r\"\\b[A-Za-zÀ-ž,.!?]\\b\", re.IGNORECASE)\n", "\n", " text = re.sub(RE_TAGS, \" \", text)\n", " text = re.sub(RE_ASCII, \" \", text)\n", " text = re.sub(RE_SINGLECHAR, \" \", text)\n", " text = re.sub(RE_WSPACE, \" \", text)\n", "\n", " word_tokens = word_tokenize(text)\n", " words_tokens_lower = [word.lower() for word in word_tokens]\n", "\n", " if for_embedding:\n", " # no stemming, lowering and punctuation / stop words removal\n", " words_filtered = word_tokens\n", " else:\n", " words_filtered = [\n", " stemmer.stem(word) for word in words_tokens_lower if word not in stop_words\n", " ]\n", "\n", " text_clean = \" \".join(words_filtered)\n", " return text_clean" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Vmb_LGpLvdxm" }, "source": [ "Now, we can we apply this pre-processing and cleaning to our original data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "colab_type": "code", "id": "8YJgw-JPSnWD", "outputId": "a18c83fa-6d14-4462-e390-59648b694b5d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3min 47s, sys: 236 ms, total: 3min 47s\n", "Wall time: 3min 47s\n" ] } ], "source": [ "%%time\n", "# Clean Comments\n", "data[\"comment_clean\"] = data.loc[data[\"comment\"].str.len() > 20, \"comment\"]\n", "data[\"comment_clean\"] = data[\"comment_clean\"].map(\n", " lambda x: clean_text(x, for_embedding=True) if isinstance(x, str) else x\n", ")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "NqAX1AmfvjcM" }, "source": [ "This is how the final comments will look like:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 164 }, "colab_type": "code", "id": "bqglg5EmV9No", "outputId": "6b7b82fd-1680-43ef-9b9d-4e43ff70fa7a" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 | text | label
0 | Ich bin franzose und bin seit ein paar Wochen in muenchen . Ich hatte Zahn Schmerzen und mein Kollegue hat mir Dr mainka empfohlen . Ich habe schnell ein Termin bekommen , das Team war nett und meine schmerzen sind weg ! ! Ich bin als Angst Patient sehr zurieden ! ! | 0.0
1 | Dieser Arzt ist das unmöglichste was mir in meinem Leben je begegnet ist er ist unfreundlich , sehr herablassend und medizinisch unkompetent Nach seiner Diagnose bin ich zu einem anderen Hautarzt gegangen der mich ordentlich behandelt hat und mir auch half Meine Beschweerden hatten einen völlig anderen Grund . Nach seiner Behandlung und Diagnose , waren seine letzten Worte ... ..und tschüss Alles inerhalb von ca Minuten . | 1.0
\n", "
" ], "text/plain": [ " text label\n", "0 Ich bin franzose und bin seit ein paar Wochen in muenchen . Ich hatte Zahn Schmerzen und mein Kollegue hat mir Dr mainka empfohlen . Ich habe schnell ein Termin bekommen , das Team war nett und meine schmerzen sind weg ! ! Ich bin als Angst Patient sehr zurieden ! ! 0.0\n", "1 Dieser Arzt ist das unmöglichste was mir in meinem Leben je begegnet ist er ist unfreundlich , sehr herablassend und medizinisch unkompetent Nach seiner Diagnose bin ich zu einem anderen Hautarzt gegangen der mich ordentlich behandelt hat und mir auch half Meine Beschweerden hatten einen völlig anderen Grund . Nach seiner Behandlung und Diagnose , waren seine letzten Worte ... ..und tschüss Alles inerhalb von ca Minuten . 1.0" ] }, "execution_count": 9, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# Drop Missing\n", "data = data.dropna(axis=\"index\", subset=[\"grade_bad\", \"comment_clean\"]).reset_index(\n", " drop=True\n", ")\n", "data = data[[\"comment_clean\", \"grade_bad\"]]\n", "data.columns = [\"text\", \"label\"]\n", "data.head(2)\n", "# data.to_csv(PATH_GDRIVE_TMP + \"/data.csv\", index=False)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": {}, "colab_type": "code", "id": "z3PI8aNr76jc" }, "outputs": [], "source": [ "# skip pre processing if done before\n", "# data = pd.read_csv(PATH_GDRIVE_TMP + \"/data.csv\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "CoDBvCOPDT1K" }, "source": [ "### A brief background on transformers and BERT\n", "\n", "You've made it to the interesting part. But before we further dive into the code, let's get a brief background on transformers and BERT. Transformer models, similar to RNNs (e.g. the LSTM we used in [part 2]({filename}/doctors_nlp2.ipynb)), are designed to handle sequences of data well. They shine at picking up relations between various inputs in an input sequence. Hence, they lend themselves perfectly for NLP tasks where such associations correspond to semantic relationships. This is also a convenient property of LSTM models, e.g. the one we implemented previously. It was able to learn relationships between sequences of the input, e.g. word vectors. For LSTMs, however, this ability is limited. Because inputs are processed in sequence, in practice, at the end of a long sentence information from the beginning will often be lost. Also, computational limits come into play really quick for the same reason. Transformer models address this issue by using the so called `attention` mechanism. This allows them to process inputs in a non sequential way which allows for better parallelization. Following, computational efficiency is greatly improved. In addition, they learn associations between words even if those are far apart in a sentence. This is an outstanding improvement when trying to model language. It explains why transformers are at the foundation of most state of the art models and why they have surpassed former architectures like LSTMs. \n", "BERT is currently one of the most successful transformer architectures (although, this will change quickly as the speed of innovation at the moment is mind blowing). For a well illustrated and thorough introduction, check out [this article](http://jalammar.github.io/illustrated-bert/) by Jay Allamar. BERT has been developed by Google and has set new records on several language related tasks. Its main novelty is the introduction of bi-directionality. 
Previous models looked at inputs unidirectionally, i.e. from left to right. In contrast, BERT also looks at them from right to left. The authors showed that this leads to a deeper understanding of context in texts and increases performance significantly. \n", "Like most other state-of-the-art models, a lot of BERT's performance comes from sheer size. BERT large (24 layers, 16 attention heads, 340 million parameters) is trained on a huge text dataset in order to gain a thorough understanding of language in general. For this, it follows several ingenious unsupervised training strategies. Training such models is extremely computationally demanding and expensive. For all but big institutions and companies it is prohibitive. \n", "Fortunately, we can apply the principle of transfer learning to such models. That is, we can fine-tune a pre-trained model on a more specific task and dataset (which doesn't need to be that huge). In our case, we will simply add a classification layer to the pre-trained BERT model. Then, we will run supervised training on our labeled dataset. Consequently, we will mainly train our classifier layer while most other layers will only be minimally impacted. In a sense, we use the general language understanding of the pre-trained model and improve its understanding of our unique domain. Moreover, we teach it to solve a specific task. In our case, this will be sentiment analysis in the form of binary text classification." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "bRP_b3_6vnk3" }, "source": [ "### Feature Creation and Modeling\n", "\n", "As mentioned before, we use the [`transformers`](https://github.com/huggingface/transformers) library. To stay consistent with our previous neural network model, we use the Tensorflow 2.0 implementation. In addition, this allows us to use the TPU on Google Colab without much trouble. Another invaluable advantage of the transformers library is that several pre-trained models are readily available for use. \n", "BERT comes with its own tokenizer. As with the model itself, we will use a pre-trained version of the tokenizer. Here, we use a variant that has been specifically trained on German texts and made public. Hence, it already comes with a huge German vocabulary:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 66, "referenced_widgets": [ "9059282e9697457491a909f99b07d7d6", "357c5cecfded46158c390c94938cdafa", "29a0fd54e7c84a1c9d94af6ed2d54d9e", "818dfbe55e61424c975383f033e7687c", "95f1d26e2a9d4234aad0b092dc22ffeb", "7c4f58ccfb71419cb35746b1e6db289d", "0b030820fc404024a713bc98c07abe09", "b7272f21484244d0b49f5da3a28395b5" ] }, "colab_type": "code", "id": "J7tt6I3g5r1k", "outputId": "88e1c493-cf2c-43ba-be28-801790e3607a" }, "outputs": [], "source": [ "# this will download and initialize the pre trained tokenizer\n", "from transformers import BertTokenizer, TFBertModel\n", "\n", "tokenizer = BertTokenizer.from_pretrained(\"bert-base-german-cased\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "iTe4YJL27VYx" }, "source": [ "The inputs expected by BERT are very similar to the ones we've used before: they are just vectors containing integers which can be mapped to tokens by using a dictionary. The only difference is that BERT expects several \"special\" tokens. `[CLS]` stands for classification and marks the beginning of a new input to be classified. 
`[SEP]` marks the separation between sentences. Finally, `[PAD]` is used as a placeholder in order to pad all vectors to the same fixed length. The helper method `encode_plus` of the `Tokenizer` object deals with creating the numeric vectors while taking care of the extra tokens: \n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": {}, "colab_type": "code", "id": "V1V50oOB5rUV" }, "outputs": [], "source": [ "MAXLEN = 192\n", "\n", "def preprocess_text(data):\n", " \"\"\" take texts and prepare as input features for BERT \n", " \"\"\"\n", " input_ids = []\n", " # For every sentence...\n", " for comment in data:\n", " encoded_sent = tokenizer.encode_plus(\n", " text=comment,\n", " add_special_tokens=True, # Add `[CLS]` and `[SEP]`\n", " max_length=MAXLEN, # Max length to truncate/pad\n", " pad_to_max_length=True, # Pad sentence to max length\n", " return_attention_mask=False, # attention mask not needed for our task\n", " )\n", " # Add the outputs to the lists\n", " input_ids.append(encoded_sent.get(\"input_ids\"))\n", " return input_ids" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "P4s_1of29uJz" }, "source": [ "Before creating our features, let's check out the workings of the tokenizer with an example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 69 }, "colab_type": "code", "id": "AxfQAU3sH5yV", "outputId": "6906ebf2-d56d-47c0-eaee-9027aaa8233d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Comment: ['Ich liebe data-dive.com und meine Katze.']\n", "Tokenized Comment: ['[CLS]', 'Ich', 'liebe', 'dat', '##a', '-', 'di', '##ve', '.', 'c', '##om', 'und', 'meine', 'Katze', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']\n", "Token IDs: [3, 1671, 16619, 10453, 26903, 26935, 4616, 2304, 26914, 1350, 101, 42, 6667, 25285, 26914, 4, 0, 0, 0, 0]\n" ] } ], "source": [ "# Original Comment and encoding outputs\n", "comment = [\"Ich liebe data-dive.com und meine Katze.\"]\n", "input_ids = preprocess_text(comment)\n", "print(\"Comment: \", comment)\n", "print(\"Tokenized Comment: \", tokenizer.convert_ids_to_tokens(input_ids[0])[0:20])\n", "print(\"Token IDs: \", input_ids[0][0:20])" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "nd457Xb197pS" }, "source": [ "First, you can see the special tokens being automatically added in the right places. Second, we see that some tokens simply correspond to regular words, i.e. \"Ich\" and \"liebe\". This will be the case for all words that are in the vocabulary. Here, that will amount to most regular German words. In contrast, unknown words get a special treatment. Instead of just being left out, as was the case with the FastText embeddings, they are split to shorter character sequences. For example \"data\", which is not a German word, becomes \"dat\" and \"##a\". This allows the model to still use these novel words. \n", "There is one more major difference between BERT's word embeddings and e.g. FastText's. Recall how vector representations of words with similar semantic meaning in FastText were similar, i.e. had a short distance between them. This allowed our model to group similar words together. BERT takes this one step further. A word's vector representation is not static anymore but depends on context. Consequently, the vector for \"broke\" is different when it's in a context of \"money\" vs. \" a record\". 
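To make this concrete, here is a small illustrative sketch that is not part of the original pipeline. The helper name `word_vector`, the example sentences and the choice of "Bank" (bench vs. financial institution) are my own assumptions for demonstration; the snippet simply compares the contextual vectors BERT returns for the same word in two different sentences:

```python
# Illustrative sketch only: compare contextual embeddings of the same German word.
# Uses the same pre-trained German model/tokenizer as the rest of this notebook.
import numpy as np
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")
bert = TFBertModel.from_pretrained("bert-base-german-cased")

def word_vector(sentence, word):
    """Return the contextual vector of `word` within `sentence`.
    Assumes `word` is a single token in the model vocabulary."""
    input_ids = tokenizer.encode(sentence, return_tensors="tf")  # (1, seq_len)
    hidden_states = bert(input_ids)[0]  # last hidden states, (1, seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0].numpy().tolist())
    return hidden_states[0, tokens.index(word)].numpy()

# "Bank" as a bench by the river vs. "Bank" as a financial institution
v_bench = word_vector("Wir sitzen auf der Bank am Fluss.", "Bank")
v_money = word_vector("Die Bank hat mein Konto gesperrt.", "Bank")
cosine = np.dot(v_bench, v_money) / (np.linalg.norm(v_bench) * np.linalg.norm(v_money))
print(f"Cosine similarity of the two 'Bank' vectors: {cosine:.2f}")
```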
This immensely improves contextual awareness and might benefit predictions in many cases. \n", "Now, we apply the tokenization process to our data:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "colab_type": "code", "id": "HLK7nNldJgaj", "outputId": "9880d0d9-d2ac-4d5e-b925-567a236dcb38" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.62 s, sys: 932 ms, total: 4.55 s\n", "Wall time: 5.16 s\n" ] } ], "source": [ "%%time\n", "import pickle\n", "\n", "input_ids = preprocess_text(data[\"text\"])\n", "# tokenization takes quite long\n", "# we can save the result and load it quickly via pickle\n", "pickle.dump(input_ids, open(PATH_GDRIVE_TMP + \"/input_ids.pkl\", \"wb\"))\n", "# input_ids = pickle.load(open(PATH_GDRIVE_TMP+\"/input_ids.pkl\", \"rb\"))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "hmlMfRp5_5tm" }, "source": [ "Next, we split out data into train and test for cross validation:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "colab_type": "code", "id": "w9TyLpZhOW2u", "outputId": "c3ba7758-e8c4-4705-e499-4fe1c941d308" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train set: 253783\n", "Test set: 84595\n" ] } ], "source": [ "# Sample data for cross validation\n", "train_ids, test_ids, train_labels, test_labels = train_test_split(\n", " input_ids, data[\"label\"], random_state=1, test_size=0.25, shuffle=True\n", ")\n", "print(f\"Train set: {len(train_ids)}\\nTest set: {len(test_ids)}\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "dare43crAM1B" }, "source": [ "Here, we set the model parameters. `MAXLEN` is the max. number of tokens in our input. Longer inputs will be truncated to this. While greater lengths will yield better predictions, they also mean a greater computational toll. We differentiate between `BATCH_SIZE_PER_REPLICA` and `BATCH_SIZE` when we run on multiple GPUs or TPUs, as is the case on Google Colab. Each TPU core will deal with `BATCH_SIZE_PER_REPLICA` batches at a time. `EPOCHS` is simply the number of training iterations over the whole training set:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": {}, "colab_type": "code", "id": "VQcG-40V6REl" }, "outputs": [], "source": [ "# Set Model Parameters\n", "MAXLEN = MAXLEN\n", "BATCH_SIZE_PER_REPLICA = 16\n", "BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync\n", "EPOCHS = 8\n", "LEARNING_RATE = 1e-5\n", "DATA_LENGTH = len(data)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "OmkFfQVXBkBO" }, "source": [ "As a last step before building our model, we need to prepare our dataset. Before, we used NumPy arrays as inputs. 
Here, we use the `tf.data.Dataset` class which offers several convenient methods:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": {}, "colab_type": "code", "id": "h2Su9FAcNWiK" }, "outputs": [], "source": [ "def create_dataset(\n", " data_tuple,\n", " epochs=EPOCHS,\n", " batch_size=BATCH_SIZE,\n", " buffer_size=DATA_LENGTH,\n", " train=False,\n", "):\n", " dataset = tf.data.Dataset.from_tensor_slices(data_tuple)\n", " if train:\n", " dataset = dataset.shuffle(\n", " buffer_size=buffer_size, reshuffle_each_iteration=True\n", " ).repeat(epochs)\n", " dataset = dataset.batch(batch_size)\n", " return dataset\n", "\n", "\n", "train = create_dataset(\n", " (train_ids, train_labels), buffer_size=len(train_ids), train=True\n", ")\n", "test = create_dataset((test_ids, test_labels), buffer_size=len(test_ids))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "CVuhbfcACT6-" }, "source": [ "Finally, we define a function that returns our model architecture:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": {}, "colab_type": "code", "id": "aP7WVslnIabU" }, "outputs": [], "source": [ "def build_model(transformer, max_len=MAXLEN):\n", " \"\"\" add binary classification to pretrained model\n", " \"\"\"\n", " input_word_ids = tf.keras.layers.Input(\n", " shape=(max_len,), dtype=tf.int32, name=\"input_word_ids\"\n", " )\n", " sequence_output = transformer(input_word_ids)[0]\n", " cls_token = sequence_output[:, 0, :]\n", " out = tf.keras.layers.Dense(1, activation=\"sigmoid\")(cls_token)\n", " model = tf.keras.models.Model(inputs=input_word_ids, outputs=out)\n", " model.compile(\n", " tf.keras.optimizers.Adam(lr=LEARNING_RATE),\n", " loss=\"binary_crossentropy\",\n", " metrics=[\"accuracy\"],\n", " )\n", " return model" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "cWt-_DYeCbzL" }, "source": [ "In the first step, we define an `Input` layer which expects our numeric vectors as input. Then, we add the pre trained transformer which receives the inputs from the previous layer. The output of the transformer is then fed into a `Dense` layer which finally outputs a probability for our input belonging to class 0 or 1. \n", "Now, we build our model by first downloading the pre trained BERT and passing it to our `build_model` function:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 382 }, "colab_type": "code", "id": "b7FUGjp9IIoq", "outputId": "35ee73ba-06fa-4dc7-8212-3878e8dbdbcd" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:transformers.modeling_tf_utils:Some weights of the model checkpoint at bert-base-german-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']\n", "- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. 
initializing a BertForSequenceClassification model from a BertForPretraining model).\n", "- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n", "WARNING:transformers.modeling_tf_utils:All the weights of TFBertModel were initialized from the model checkpoint at bert-base-german-cased.\n", "If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Model: \"model_1\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "input_word_ids (InputLayer) [(None, 192)] 0 \n", "_________________________________________________________________\n", "tf_bert_model_1 (TFBertModel ((None, 192, 768), (None, 109081344 \n", "_________________________________________________________________\n", "tf_op_layer_strided_slice_1 [(None, 768)] 0 \n", "_________________________________________________________________\n", "dense_1 (Dense) (None, 1) 769 \n", "=================================================================\n", "Total params: 109,082,113\n", "Trainable params: 109,082,113\n", "Non-trainable params: 0\n", "_________________________________________________________________\n" ] } ], "source": [ "with strategy.scope():\n", " transformer_layers = TFBertModel.from_pretrained(\"bert-base-german-cased\")\n", " model = build_model(transformer_layers, max_len=MAXLEN)\n", "model.summary()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "NGLGT6q_FX_f" }, "source": [ "In total we have almost 110 Mio. trainable parameters. It's convenient that almost all of them have already been trained. Because we use the pre trained weights for them, they will only need to change ever so slightly. Still, it is a big model and even training on GPUs takes considerable time. So again, good thing that we can use TPUs. \n", "Next, we define callbacks that will be used during training. The `EartlyStopping` callback will stop the training if validation loss stops decreasing between epochs. This avoids overfitting. `ModelCheckpoint` saves checkpoints of the model after each epoch, so that training can be resumed. On Google Colab, you can currently only save TPU models in a Cloud Bucket but not on a mounted Google Drive. Hence, I commented out that part:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "colab": {}, "colab_type": "code", "id": "huw96xaoPwHG" }, "outputs": [], "source": [ "# Stop training when validation acc starts dropping\n", "# Save checkpoint of model each period\n", "now = datetime.now().strftime(\"%Y-%m-%d_%H%M\")\n", "# Create callbacks\n", "callbacks = [\n", " tf.keras.callbacks.EarlyStopping(\n", " monitor=\"val_loss\", verbose=1, patience=1, restore_best_weights=True\n", " ),\n", " # tf.keras.callbacks.ModelCheckpoint(\n", " # PATH_GDRIVE_TMP + now + \"_Model_{epoch:02d}_{val_loss:.4f}.h5\",\n", " # monitor=\"val_loss\",\n", " # save_best_only=True,\n", " # verbose=1,\n", " # ),\n", "]" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "GWSJkmkXIGVq" }, "source": [ "Finally, time to get excited! 
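One quick aside: if you do have a Google Cloud Storage bucket available, a checkpoint callback along the following lines should also work from a TPU. This is only a sketch with a placeholder bucket path (`gs://your-bucket`), not something used for the results in this post:

```python
# Hypothetical sketch: persist checkpoints to a GCS bucket, which TPUs can access.
# "gs://your-bucket" is a placeholder, not a real bucket from this post.
gcs_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "gs://your-bucket/bert-doctor-reviews/weights_{epoch:02d}",
    monitor="val_loss",
    save_best_only=True,
    save_weights_only=True,  # TensorFlow-format weights can be written to gs:// paths
    verbose=1,
)
callbacks.append(gcs_ckpt)
```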
We can now start the model training:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 416 }, "colab_type": "code", "id": "H0EWQeqgZA2N", "outputId": "49b5d4f0-5bfa-4437-fc35-bb1f956534bc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model Params:\n", "batch_size: 128\n", "Epochs: 8\n", "Step p. Epoch: 1982\n", "Learning rate: 1e-05\n", "Epoch 1/8\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_1/bert/pooler/dense/kernel:0', 'tf_bert_model_1/bert/pooler/dense/bias:0'] when minimizing the loss.\n", "WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_1/bert/pooler/dense/kernel:0', 'tf_bert_model_1/bert/pooler/dense/bias:0'] when minimizing the loss.\n", "WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_1/bert/pooler/dense/kernel:0', 'tf_bert_model_1/bert/pooler/dense/bias:0'] when minimizing the loss.\n", "WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_1/bert/pooler/dense/kernel:0', 'tf_bert_model_1/bert/pooler/dense/bias:0'] when minimizing the loss.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "1982/1982 [==============================] - 501s 253ms/step - accuracy: 0.9684 - loss: 0.0789 - val_accuracy: 0.9779 - val_loss: 0.0571\n", "Epoch 2/8\n", "1982/1982 [==============================] - 480s 242ms/step - accuracy: 0.9815 - loss: 0.0488 - val_accuracy: 0.9792 - val_loss: 0.0544\n", "Epoch 3/8\n", "1982/1982 [==============================] - ETA: 0s - accuracy: 0.9869 - loss: 0.0363Restoring model weights from the end of the best epoch.\n", "1982/1982 [==============================] - 482s 243ms/step - accuracy: 0.9869 - loss: 0.0363 - val_accuracy: 0.9791 - val_loss: 0.0569\n", "Epoch 00003: early stopping\n", "CPU times: user 2min 4s, sys: 10.6 s, total: 2min 15s\n", "Wall time: 25min 29s\n" ] } ], "source": [ "%%time\n", "# Train using appropriate steps per epochs (go through all train data in an epoch)\n", "steps_per_epoch = int(np.floor((len(train_ids) / BATCH_SIZE)))\n", "print(\n", " f\"Model Params:\\nbatch_size: {BATCH_SIZE}\\nEpochs: {EPOCHS}\\n\"\n", " f\"Step p. Epoch: {steps_per_epoch}\\n\"\n", " f\"Learning rate: {LEARNING_RATE}\"\n", ")\n", "hist = model.fit(\n", " train,\n", " batch_size=BATCH_SIZE,\n", " epochs=EPOCHS,\n", " steps_per_epoch=steps_per_epoch,\n", " validation_data=test,\n", " verbose=1,\n", " callbacks=callbacks,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "_nzBqsg8I1eY" }, "source": [ "The model converges really fast and stops early after only three epochs. 
This is because it already achieves the lowest validation loss in the second period:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 399 }, "colab_type": "code", "id": "r1HG7IBOfzmf", "outputId": "8ef62530-f8da-4fd6-e3af-b6c0a4d5d362" }, "outputs": [ { "data": { "text/plain": [ "[Text(0, 0.5, ''),\n", " [,\n", " ,\n", " ],\n", " [Text(0, 0, '1'), Text(0, 0, '2'), Text(0, 0, '3')],\n", " Text(0.5, 1.0, 'Model loss')]" ] }, "execution_count": 28, "metadata": { "tags": [] }, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEWCAYAAABollyxAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3deXxU9dn//9eVnSQQsrEmEBBlCYFAhr2KSlWU3opCAgqCuIHd7tZvvcFfe1tt7a1Wa60tFVBUFBVDUIt1rRXFlrAkEWSXVZOwhYQEyEK26/fHDBjSJARIMsnM9Xw88nBmzpkz18TwPmc+5zPXEVXFGGOM5/JxdwHGGGOalwW9McZ4OAt6Y4zxcBb0xhjj4SzojTHGw1nQG2OMh7OgNwYQkTgRURHxa8S6d4jIvy52O8a0FAt60+aIyH4RKReRqFqPf+kK2Tj3VGZM62RBb9qqfcCtp++ISAIQ7L5yjGm9LOhNW/UqMKPG/ZnAKzVXEJEwEXlFRPJE5BsR+ZWI+LiW+YrIUyJyVET2AhPqeO5iETkoIrki8qiI+J5vkSLSTURWikiBiOwWkXtqLBsuIhkiclxEDovI067Hg0RkqYjki0ihiGwQkc7n+9rGnGZBb9qqtUAHEenvCuCpwNJa6/wZCAN6A2Nx7hhmuZbdA/wAGAI4gMm1nvsyUAn0ca1zLXD3BdS5DMgBurle4/9E5GrXsj8Bf1LVDsAlQKrr8ZmuumOBSGAOUHoBr20MYEFv2rbTR/XXANuB3NMLaoT/g6p6QlX3A38AbnetkgI8o6rZqloAPFbjuZ2BG4CfqWqxqh4B/ujaXqOJSCwwBpirqmWquhF4ge8+iVQAfUQkSlVPquraGo9HAn1UtUpVM1X1+Pm8tjE1WdCbtuxV4DbgDmoN2wBRgD/wTY3HvgG6u253A7JrLTutp+u5B11DJ4XAQqDTedbXDShQ1RP11HAXcBmwwzU884Ma7+sjYJmIHBCR34uI/3m+tjFnWNCbNktVv8F5UvYG4K1ai4/iPDLuWeOxHnx31H8Q59BIzWWnZQOngChV7ej66aCq8edZ4gEgQkTa11WDqu5S1Vtx7kCeANJEJERVK1T1EVUdAIzGOcQ0A2MukAW9aevuAq5W1eKaD6pqFc4x79+JSHsR6Qncz3fj+KnAT0UkRkTCgXk1nnsQ+Bj4g4h0EBEfEblERMaeT2Gqmg2sAR5znWAd5Kp3KYCITBeRaFWtBgpdT6sWkatEJME1/HQc5w6r+nxe25iaLOhNm6aqe1Q1o57FPwGKgb3Av4DXgRddy57HOTyyCcjiPz8RzAACgG3AMSAN6HoBJd4KxOE8un8b+LWqfuJaNh7YKiIncZ6YnaqqpUAX1+sdx3nu4XOcwznGXBCxC48YY4xnsyN6Y4zxcBb0xhjj4SzojTHGw1nQG2OMh2t1rVSjoqI0Li7O3WUYY0ybkpmZeVRVo+ta1uqCPi4ujoyM+mbLGWOMqYuIfFPfskYN3YjIeBHZ6eq+N6+O5YEi8qZr+brT/cBFxF9ElojIZhHZLiIPXuibMMYYc2HOGfSub+fNB64HBgC3isiAWqvdBRxT1T44mz894Xo8GQhU1QQgCZhtF4UwxpiW1Zgj+uHAblXdq6rlONuu3lRrnZuAJa7bacA4ERFAgRDXZdXaAeU4v+1njDGmhTRmjL47Z3f5ywFG1LeOqlaKSBHONqtpOHcCB3Fe/efnrpawZxGRe4F7AXr06FF7sTHGQ1RUVJCTk0NZWZm7S2mzgoKCiImJwd+/8Q1Nm/tk7HCgCme71nDgCxH5RFX31lxJVRcBiwAcDof1ZDDGQ+Xk5NC+fXvi4uJwfug350NVyc/PJycnh169ejX6eY0Zusnl7HauMdS4wEPtdVzDNGFAPs5e4R+62q4eAf6N82o+xhgvVFZWRmRkpIX8BRIRIiMjz/sTUWOCfgNwqYj0EpEAnFfZWVlrnZU4L38GzsulfarObmnfAle7CgwBRgI7zqtCY4xHsZC/OBfy+ztn0KtqJfBjnC1dtwOpqrpVRH4jIje6VlsMRIrIbpw9v09PwZwPhIrIVpw7jJdU9avzrrIRSsureHjlVgpLyptj88YY02Y1ah69qr6vqpep6iWq+jvXYw+p6krX7TJVTVbVPqo6/PQYvOs6mMmqGq+qA1T1yeZ6I1sOFPH6um9JWZjOoSI70WOMaR433HADhYWFDa4TGhpa5+N33HEHaWlpzVFWgzym182wuAhevnMYBwrLmPTcGvbmnXR3ScYYD6KqVFdX8/7779OxY0d3l3NePCboAUZfEsWye0dSVlFF8oJ0NucUubskY0wrM2/ePObPn3/m/sMPP8yjjz7KuHHjGDp0KAkJCfztb38DYP/+/fTt25cZM2YwcOBAsrOziYuL4+jRowBMnDiRpKQk4uPjWbRo0Vmv8/Of/5z4+HjGjRtHXl7ef9SRmZnJ2LFjSUpK4rrrruPgwYPN96ZVtVX9JCUl6cXac+SEjn7snxr/0If67915F709Y0zT2LZtm7tL0KysLL3iiivO3O/fv79+++23WlRUpKqqeXl5eskll2h1dbXu27dPRUTT09PPrN+zZ0/Ny3PmSn5+vqqqlpSUaHx8vB49elRVVQFdunSpqqo+8sgj+qMf/UhVVWfOnKnLly/X8vJyHTVqlB45ckRVVZctW6azZs1q9Huo6/cIZGg9udrqmpo1hd7Roay4bzQzXlzHHS9u4NlbExk/8EIu92mM8TRDhgzhyJEjHDhwgLy8PMLDw+nSpQs///nPWb16NT4+PuTm5nL48GEAevbsyciRI+vc1rPPPsvbb78NQHZ2Nrt27SIyMhIfHx+mTJkCwPTp07nlllvOet7OnTvZsmUL11xzDQBVVVV07dp8GeWRQQ/QJSyI1NmjuGtJBj98LYvf3Zz
ArcPtW7fGGEhOTiYtLY1Dhw4xZcoUXnvtNfLy8sjMzMTf35+4uLgzc9VDQkLq3MZnn33GJ598Qnp6OsHBwVx55ZX1zm+vPSVSVYmPjyc9Pb1p31g9PGqMvraOwQEsvWsEV1wWzYNvbWb+qt2oXQzdGK83ZcoUli1bRlpaGsnJyRQVFdGpUyf8/f1ZtWoV33xTb8ffM4qKiggPDyc4OJgdO3awdu3aM8uqq6vPzK55/fXX+d73vnfWc/v27UteXt6ZoK+oqGDr1q1N+A7P5tFBD9AuwJfnZziYmNiNJz/ayaPvbae62sLeGG8WHx/PiRMn6N69O127dmXatGlkZGSQkJDAK6+8Qr9+/c65jfHjx1NZWUn//v2ZN2/eWcM7ISEhrF+/noEDB/Lpp5/y0EMPnfXcgIAA0tLSmDt3LoMHDyYxMZE1a9Y0+fs8TVrbEa7D4dDmuPBIdbXy2/e28dK/93PLkO48MXkQ/r4ev58zplXZvn07/fv3d3cZbV5dv0cRyVTVOlvMeOwYfW0+PsJDPxhAZEgAT338NYWlFcy/bSjtAnzdXZoxxjQrrzqkFRF+fPWl/O7mgazaeYTbF6+jqKTC3WUZY0yz8qqgP23aiJ7Mv20oX+UUMWVROoePW8sEY4zn8sqgB7ghoSsvzRpGdkEJk55bw/6jxe4uyRhjmoXXBj3AmD5RvHHvSErKq5i8YA1bcq1lgjHG83h10AMMiunI8jmjCPTz5dZFa1m7N9/dJRljTJPy+qAHuCQ6lLT7RtElLIgZL67no62H3F2SMaYZFBYW8te//vWCntuY9sQ1Pfzwwzz11FMX9FpNzYLepWtYO1JnjyK+WwfuW5pJ6obscz/JGNOmNBT0lZWVDT63LbYnPs2CvobwkABeu3sE37s0mv9Z8RULPt/j7pKMMU1o3rx57Nmzh8TERB544AE+++wzLr/8cm688UYGDBgA1N96+HR74v3799O/f3/uuece4uPjufbaayktLW3wdTdu3MjIkSMZNGgQN998M8eOHQOcTdEGDBjAoEGDmDp1KgCff/45iYmJJCYmMmTIEE6cOHHR79trvjDVWMEBfrwww8Evlm/i8Q92UFBczoPX97PrXBrTxB55dyvbDhxv0m0O6NaBX/9XfL3LH3/8cbZs2cLGjRsBZ2OyrKwstmzZQq9evQB48cUXiYiIoLS0lGHDhjFp0iQiIyPP2s6uXbt44403eP7550lJSWHFihVMnz693tedMWMGf/7znxk7diwPPfQQjzzyCM888wyPP/44+/btIzAw8Myw0FNPPcX8+fMZM2YMJ0+eJCgo6GJ/LXZEX5cAPx+emZLIzFE9WbR6L79Y/hWVVdXuLssY0wyGDx9+JuTBeZQ9ePBgRo4ceab1cG29evUiMTERgKSkJPbv31/v9ouKiigsLGTs2LEAzJw5k9WrVwMwaNAgpk2bxtKlS/Hzcx53jxkzhvvvv59nn32WwsLCM49fDDuir4ePj/DwjfGEhwTwzCe7KCot5y+3DSXI31omGNMUGjrybkk12xA3tvVwYGDgmdu+vr7nHLqpz3vvvcfq1at59913+d3vfsfmzZuZN28eEyZM4P3332fMmDF89NFHjWqy1hA7om+AiPCz71/Gb2+K5587jjBj8XqKSq1lgjFtVfv27Rsc826o9fCFCgsLIzw8nC+++AKAV199lbFjx1JdXU12djZXXXUVTzzxBEVFRZw8eZI9e/aQkJDA3LlzGTZsGDt27LjoGuyIvhFuHxVHx+AA7k/dyNRFa1ly5zA6tb/4cTNjTMuKjIxkzJgxDBw4kOuvv54JEyactXz8+PEsWLCA/v3707dv33qvLHW+lixZwpw5cygpKaF379689NJLVFVVMX36dIqKilBVfvrTn9KxY0f+93//l1WrVuHj40N8fDzXX3/9Rb++17Qpbgqrv85jztJMokIDefWu4fSMrPvKM8aYulmb4qZxvm2KbejmPFxxWTSv3T2C42UVTHouvclnDBhjTHOwoD9PQ3qEkzZnFP6+wpRF6azfV+DukowxpkEW9BegT6f2pN03muj2gdy+eB2fbDvs7pKMMaZeFvQXqHvHdqTNGU2/Lu2ZvTST5RnWMsEY0zpZ0F+EiJAAXrtnJKN6R/JA2lcsWm0tE4wxrY8F/UUKDfRj8R0OJiR05f/e38FjH2yntc1kMsZ4Nwv6JhDo58uztw5h2ogeLPx8L3NXWMsEY1qji2lTDPDMM89QUlJS57Irr7yS1jo13IK+ifj6CI9OHMhPx11KakYOP3wti7KKKneXZYypoTmDvjWzoG9CIsL911zGw/81gI+3HWbmi+s5XmYtE4xpLWq3KQZ48sknGTZsGIMGDeLXv/41AMXFxUyYMIHBgwczcOBA3nzzTZ599lkOHDjAVVddxVVXXdXg67zxxhskJCQwcOBA5s6dC0BVVRV33HEHAwcOJCEhgT/+8Y9A3a2Km5q1QGgGd4zpRXhIAP8vdRNTF65lyZ3DiW4feO4nGuNNPpgHhzY37Ta7JMD1j9e7uHab4o8//phdu3axfv16VJUbb7yR1atXk5eXR7du3XjvvfcAZw+csLAwnn76aVatWkVUVFS9r3HgwAHmzp1LZmYm4eHhXHvttbzzzjvExsaSm5vLli1bAM60Ja6rVXFTsyP6ZnJTYndemOlg39FikhesIbug7X3cM8bTffzxx3z88ccMGTKEoUOHsmPHDnbt2kVCQgL/+Mc/mDt3Ll988QVhYWGN3uaGDRu48soriY6Oxs/Pj2nTprF69Wp69+7N3r17+clPfsKHH35Ihw4dgLpbFTc1O6JvRlf27cTSu0dw58sbmPTcGl65azj9unRwd1nGtA4NHHm3FFXlwQcfZPbs2f+xLCsri/fff59f/epXjBs3joceeuiiXis8PJxNmzbx0UcfsWDBAlJTU3nxxRfrbFXc1IFvR/TNLKlnOMvnjMJHhJQF6WzYby0TjHGX2m2Kr7vuOl588UVOnjwJQG5uLkeOHOHAgQMEBwczffp0HnjgAbKysup8fl2GDx/O559/ztGjR6mqquKNN95g7NixHD16lOrqaiZNmsSjjz5KVlZWva2Km5od0beAyzq3J+2+UcxYvJ7pL6zjuelDubpfZ3eXZYzXqd2m+Mknn2T79u2MGjUKgNDQUJYuXcru3bt54IEH8PHxwd/fn+eeew6Ae++9l/Hjx9OtWzdWrVpV52t07dqVxx9/nKuuugpVZcKECdx0001s2rSJWbNmUV3tnHr92GOP1duquKk1qk2xiIwH/gT4Ai+o6uO1lgcCrwBJQD4wRVX3i8g04IEaqw4ChqrqxvpeqzW3Kb5Y+SdPccdLG9h28DhPTh7ELUNj3F2SMS3K2hQ3jSZvUywivsB84HpgAHCriAyotdpdwDFV7QP8EXgCQFVfU9VEVU0Ebgf2NRTyni4yNJA37h3JiF4R3J+6iRe+2OvukowxXqAxY/TDgd2quldVy4FlwE211rkJWOK6nQaMExGptc6trud6tdBAP16aNYzrB3bh0fe28/sPd1jLBGNMs2pM0H
cHarZmzHE9Vuc6qloJFAGRtdaZArxxYWV6lkA/X/5y21BuHd6Dv362h//v7c1UVVvYG+9gBzYX50J+fy1yMlZERgAlqrqlnuX3AvcC9OjRoyVKcjtfH+H/bh5IZEgAf1m1m2PFFTwzNZEgf193l2ZMswkKCiI/P5/IyEj+80O/ORdVJT8/n6Cg87tmdWOCPheIrXE/xvVYXevkiIgfEIbzpOxpU2ngaF5VFwGLwHkythE1eQQR4RfX9SUiJIDf/H0bs17awKIZSbQP8nd3acY0i5iYGHJycsjLy3N3KW1WUFAQMTHnN5GjMUG/AbhURHrhDPSpwG211lkJzATSgcnAp+r6fCEiPkAKcPl5VeZF7vxeL8JD/Hlg+Vfc+vxaXp41nKhQa5lgPI+/vz+9evVydxle55xj9K4x9x8DHwHbgVRV3SoivxGRG12rLQYiRWQ3cD8wr8YmrgCyVdWmmDTg5iExPD/Dwe4jJ0lekG4tE4wxTaZR8+hbkifPo2+MjP0F3PnyBtoF+PLKnSPo26W9u0syxrQBFzWP3rQsR1wEqXNGoQopC9PJ/OaYu0syxrRxFvStUL8uHVhx32jCg/2Z/sI6Ptt5xN0lGWPaMAv6Vio2Ipi0+0bTOzqEu5dk8LeNtSc6GWNM41jQt2JRoYEsu3ckjrhw/nvZRl7+9z53l2SMaYMs6Fu59kH+vDxrONfFd+bhd7fx9Mc77ZuFxpjzYkHfBgT5+zL/tqFMccTy7Ke7+dU7W6xlgjGm0awffRvh5+vD45MSiAgN4LnP9lBYUsHTUwYT6GctE4wxDbOgb0NEhLnj+xEZEsCj722nqLSCBbcnERpo/xuNMfWzoZs26O7Le/OH5MGk781n2vNrKSgud3dJxphWzIK+jZqUFMPC6UnsOHSCyQvWkFtY6u6SjDGtlAV9G/b9AZ1ZevcI8k6cYvJza9h9pOGLFhtjvJMFfRs3LC6C1NmjqKxWJi9I58tvrWWCMeZsFvQeoH/XDqyYM5qwdv5Me2Edq7+2Xt/GmO9Y0HuIHpHBLJ8zip6RIdy1ZAPvbjrg7pKMMa2EBb0H6dQ+iDdnj2RIj3B+uuxLXk3f7+6SjDGtgAW9h+kQ5M8rdw5nXL/O/O/ftvLHf3xtLROM8XIW9B4oyN+XBdOHkpwUw5/+uYtfr9xKtbVMMMZr2VcqPZSfrw+/nzyIiJAAFq7eS0FxOU+nJBLgZ/t2Y7yNBb0HExEevKE/ESEBPPbBDmfLhOlJhFjLBGO8ih3eeYHZYy/h95MHsWZPPtNeWMcxa5lgjFexoPcSKY5YFkxPYtvB4yQvTOeAtUwwxmtY0HuRawZ05tU7h3O4qMzVMuGku0syxrQAC3ovM6J3JMtmj6S8SklesIZN2YXuLskY08ws6L1QfLcwVtw3itAgP259fi3/2nXU3SUZY5qRBb2X6hkZwoo5o+kREcysl9fz3lcH3V2SMaaZWNB7sU4dgnhz9igSYzvy4zeyWLr2G3eXZIxpBhb0Xi6snT+v3DmCq/t24lfvbOHZf+6ylgnGeBgLekO7AF8W3J7ELUO78/Q/vuaRd7dZywRjPIh9RdIA4O/rw1OTBxMRHMAL/9rHsZJynpw82FomGOMBLOjNGT4+wi8n9CciNIDff7iTwpIKnps+lOAA+zMxpi2zwzVzFhHhh1f24fFbEvhiVx7TX1hHYYm1TDCmLbOgN3WaOrwHf502lC25x0lZmM6hojJ3l2SMuUAW9KZe4wd25eU7h3GgsIxJz61hb561TDCmLbKgNw0afUkUy+4dSVlFFZMXpLM5p8jdJRljzpMFvTmngd3DWD5nFO38fZm6KJ01u61lgjFtiQW9aZTe0aGsuG803cPbccdLG/hgs7VMMKatsKA3jdYlLIjU2aNIiAnjR69n8cb6b91dkjGmESzozXnpGBzA0rtGcMVl0Tz41mbmr9ptLROMaeUaFfQiMl5EdorIbhGZV8fyQBF507V8nYjE1Vg2SETSRWSriGwWkaCmK9+4Q7sAX56f4WBiYjee/Ggnv/37dmuZYEwrds6vPIqILzAfuAbIATaIyEpV3VZjtbuAY6raR0SmAk8AU0TED1gK3K6qm0QkEqho8ndhWpy/rw9PpyQSHhLAi/92tkz4/eRB+Pvah0RjWpvG/KscDuxW1b2qWg4sA26qtc5NwBLX7TRgnIgIcC3wlapuAlDVfFWtaprSjbv5+AgP/WAAv7j2Mt7+MpfZr2ZSWm7/e41pbRoT9N2B7Br3c1yP1bmOqlYCRUAkcBmgIvKRiGSJyP/U9QIicq+IZIhIRl5e3vm+B+NGIsKPr76U3908kFU7jzB98TqKSuxDmzGtSXN/zvYDvgdMc/33ZhEZV3slVV2kqg5VdURHRzdzSaY5TBvRk/m3DWVzThEpC9M5fNxaJhjTWjQm6HOB2Br3Y1yP1bmOa1w+DMjHefS/WlWPqmoJ8D4w9GKLNq3TDQldeWnWMHKOlTDpuTXsO1rs7pKMMTQu6DcAl4pILxEJAKYCK2utsxKY6bo9GfhUnXPuPgISRCTYtQMYC2zDeKwxfaJ4496RlJRXkbxgDVtyrWWCMe52zqB3jbn/GGdobwdSVXWriPxGRG50rbYYiBSR3cD9wDzXc48BT+PcWWwEslT1vaZ/G6Y1GRTTkeVzRhHo58vURWtJ35Pv7pKM8WrS2r7s4nA4NCMjw91lmCZwsKiUGYvX801BCX++dQjXxXdxd0nGeCwRyVRVR13LbNKzaTZdw9qROnsU8d06cN/STN7cYC0TjHEHC3rTrMJDAnjt7hF879Jo5q7YzHOf7bGWCca0MAt60+yCA/x4YYaDGwd344kPd/B/71vLBGNakl312bSIAD8fnpmSSHiwP89/sY+C4goen5RgLROMaQEW9KbF+PgID98YT0RIIH/85GsKS8qZP20oQf6+7i7NGI9mh1OmRYkI//39S/ntxIF8uvMIty9eR1GptUwwpjl5zhF98VHYtAxCopw/wVEQEu287Rfo7upMLbeP7El4sD8/f3MjUxam88qdw+nUwTpYG9McPCfo8/fAx7+se1lgBwiO/C74a+8Iat4PjgS/gJat3Uv9YFA3wtr5M/vVTCYvSOfVu4bTMzLE3WUZ03Kqq6GsEEoKoLQAAkKgc3yTv4znfGFKFcqKnEf2JUehOM95u977R6G+jslBYfXvCOraMfh6zv7SHTZmFzLrpfX4+viw5M5hxHcLc3dJxpy/6iooPQYl+c7gLsl3hvfp+6UFrsdrLCs9Blr93Tbib4Hkly7o5Rv6wpTnBP35Or0nrRn8xXnO/wFndgo17pfkn/0/pKZ24a7gP8dOISQK2kXYjqEOu4+c4PbF6zlZVskLMx2M6B3p7pKMN6uqqBHOtYO7oO4gLyusf3u+gc6DwuAI50+7iBr3I7+737EHRF92QSVb0DeF6mrX3vpo3TuC2p8eSgqAun634twxNObTQkiUc10f75iVcqCwlNsXryPnWCl/uW0o1wzo7O6SjCeoPFV3MDcU5KeO1789/2BXMEf8Z1CfuR9+9n3/YBBp1rdpQe8O1
VWuP54aO4b6Pi0UH3X+gdVFfJx/RHXuCFznHc58moiGoI7g03YnUxUUlzPr5Q1syS3i8VsSSHbEnvtJxnuUl9QRzjWGS+oK8vKT9W8voD0EhzcurE+Hu3+7lnu/56GhoLcxhObi4wuh0c4f+p97/apK5x9lvZ8WXPcPbXber+9jovi6Tjw34tNCSJRzx9DMRxrnIyIkgNfvHsGcpZk8kPYVBcXlzB57ibvLMk1NFcqLax1lH6vjqLvWcEllaf3bDAz77ig7tBNE93MFdQNB7iUz8izoWwtfP+cfZ2inxq1fVVH/sFHNTw8HNznvn6qnL7yPXz3nF2p+WqixYwjs0Ow7hpBAP16Y6eD+1E089sEOCorLmXd9P6QV7ZBMDarOoY4Gh0PqGC6pKq9ngwLtOn4Xzh1ioMugeo6wXffbhYOvf4u+7bbEgr6t8vWH9l2cP41RecoV/uc46XzsG+dj5Sfq3o6P/3c7gsbMTApsf0E7hkA/X56dOoTwYH8Wrt5LQXE5j92SgJ+1TGhepycpNDR7pK7hkurKurd3eujxdDiHx0H3oQ2MbUc4Q95Lzku1FAt6b+EXCB26OX8ao6KsxmykBqaoFux1/oOvbxzUN7Dxw0jBUc55xK4dg6+P8NubBhIZEsif/rmLwtIK/nzrEGuZ0FjVVVBa2MBwSB3DJbWn+9Xk43d2MEddWv/skdNDKIFhbfqckaewk7GmaVSUnvvTwumdRHFe/WOtfu3qHEbKyPNl2bZSojp148c3jiK0YxfnsoDgln2f7lJV4TqKbsQR9pk52oXUPfML8A1wBXIDJx3Puh3RIsN25sLZyVjT/PzbQcdY509jlBfX82mh5o4hD/J2QHEejsoyHP7AMWBJzdcNbvynhZCo1jFj4vR0v0bP0T5W/zkWcO4ca550DIupJ7hr3K/xycl4Pgt64x4BIc6f8J7nXvf0DI3iPLJ27Ob5D9YT166E2Y4OdNTj3+0kThyEw1uc9+s70RcQeo6TzrXOPZxrVkZF6fnP0W5wul/o2cEc0fs/h0POOgkZ4T2faswFs6Eb0+ZkfXuMO1/eQLFsS90AABHiSURBVICvD6/cNZx+XTqcvYIqnDrRuGGk0+caquvpoFmzT1JwpHPIqWaQV5TUX2hgWD1T++oI69O3vWS6n2l69oUp43F2HXa2TCgpr2TxHcMYFhdx4Rs73SfpP3YKR88+r1BSAP5B9Q+H1J6jbdP9TAuyoDceKedYCTMWrye3sJTnpg/l6n7WMsF4r4aC3uY9mTYrJjyY5XNG0bdLe+55JZO3snLcXZIxrZIFvWnTIkMDef2ekYzsHcH9qZt44Yu97i7JmFbHgt60eaGBfrx4xzBuSOjCo+9t5/cf7qC1DUka4042vdJ4hEA/X/5861A6Bm/hr5/t4VhJOY9OTMDXx+aKG2NBbzyGr4/wu4kDiQwJ4M+f7uZYcQXPTE20lgnG69nQjfEoIsL/u7Yvv/6vAXy49RCzXtrAibJ65sgb4yUs6I1HmjWmF89MSWTD/gJufX4tR0+ecndJxriNBb3xWBOHdOf5mQ52HzlJ8oJ0sgsa+BarMR7Mgt54tKv6duK1u0eQf/IUkxesYeehevrsG+PBLOiNx0vqGcHyOaMBSFmYTuY39Vyf1xgPZUFvvELfLu1JmzOaiJAApr2wjlU7j7i7JGNajAW98RqxEc6WCX06hXLPkgze+TLX3SUZ0yIs6I1XiQoN5I17RjIsLoKfvbmRl/69z90lGdPsLOiN12kf5M9Ls4YxPr4Lj7y7jT98vNNaJhiPZkFvvFKQvy/zpw1l6rBY/vzpbn75zhaqqi3sjWdqVNCLyHgR2Skiu0VkXh3LA0XkTdfydSIS53o8TkRKRWSj62dB05ZvzIXz9REeuyWBH155Ca+v+5afvJHFqcoqd5dlTJM7Z68bEfEF5gPXADnABhFZqarbaqx2F3BMVfuIyFTgCWCKa9keVU1s4rqNaRIiwv+M70dESACPvredotINLLzdQWigtYEynqMxR/TDgd2quldVy4FlwE211rkJWOK6nQaME7FLzJu24+7Le/OH5MGs3VvAbc+vJd9aJhgP0pig7w5k17if43qsznVUtRIoAiJdy3qJyJci8rmIXF7XC4jIvSKSISIZeXl55/UGjGkqk5JiWDg9iZ2HTpC8MJ3cwlJ3l2RMk2juk7EHgR6qOgS4H3hdRDrUXklVF6mqQ1Ud0dHRzVySMfX7/oDOLL17BHknTjHpr2vYddhaJpi2rzFBnwvE1rgf43qsznVExA8IA/JV9ZSq5gOoaiawB7jsYos2pjkNi4sgdfYoqlRJXphO1rfH3F2SMRelMUG/AbhURHqJSAAwFVhZa52VwEzX7cnAp6qqIhLtOpmLiPQGLgXsop6m1evftQMr5owmrJ0/055fx+df25CiabvOGfSuMfcfAx8B24FUVd0qIr8RkRtdqy0GIkVkN84hmtNTMK8AvhKRjThP0s5RVesoZdqEHpHOlglxUSHcvWQDKzcdcHdJxlwQaW3fCHQ4HJqRkeHuMow543hZBXcvyWDD/gJuGRLDlGGxDIsLxyaWmdZERDJV1VHXMpssbMw5dAjy55U7h/PY+9tJy8xhRVYOcZHBJDtiuWVod7qGtXN3icY0yI7ojTkPJeWVfLD5EKkZ2azbV4CPwOWXRpPiiOX7AzoR6GcXIjfu0dARvQW9MRfom/xi0jJzSMvM4WBRGR2D/ZmY2J1kRwzx3cLcXZ7xMhb0xjSjqmrl37uPkpqRzcdbD1NeVc2Arh1IccRwU2J3wkMC3F2i8QIW9Ma0kMKSclZuOkBqRjZbco8T4OvDNQM6k+yI4fJLo/H1sRO4pnlY0BvjBtsOHGd5ZjbvfJnLsZIKunQIYlJSd5KTYomLCnF3ecbDWNAb40anKqv4dPsRUjOy+fzrPKoVhsdFkOyI4YaEroRYp0zTBCzojWklDhWV8daXOSzPyGHf0WJCAnyZMKgrKY5Yknra3Hxz4SzojWllVJXMb46RmpHN3786SEl5Fb2jQpjsiGHS0Bg6dwhyd4mmjbGgN6YVKz5VyfubD7I8I4f1+51z88de5pybP65/ZwL87Iqf5tws6I1pI/YdLSYtM5sVmbkcOl5GREgANyV2IzkplgHd/qPDtzFnWNAb08ZUVStf7MpjeUYO/9jmnJs/sHsHUhyx3Di4Gx2DbW6+OZsFvTFt2LHicv62MZfUjBy2HTxOgJ8P1w7oTIojljF9omxuvgEs6I3xGFtyi0jLzOGdjbkUllTQNSyIyUkxTE6KoWekzc33Zhb0xniYU5VVfLLNOTf/i13OufkjekWQ4ojl+oQuBAfY3HxvY0FvjAc7WFTKW1m5LM/IZn9+CaGBfvxgUFeSHbEM7dHR5uZ7CQt6Y7yAqrJhv3Nu/vubXXPzo0NIccRyy5DudLK5+R7Ngt4YL3PyVCXvf3WQ1IxsMr45hq+PcOVl0SQ7Yrm6Xyebm++BLOiN8WJ7806yPDOHFZk5HDlxioiQAG4e4uyb36+Lzc33FBb0xhgqq6r5Ypezb/4n2w9TUaUM
igkj2TU3P6ydv7tLNBfBgt4Yc5aC4nLe+TKX1Ixsdhw6QYCfD+Pju5DiiGX0JZH42Nz8NseC3hhTJ1Vl64HjpGZk87eNBygqraB7x3ZMSoohOSmG2Ihgd5doGsmC3hhzTmUVVfxj22FSM7L51+6jqMKo3pGkDIthfHxX2gXYhc9bMwt6Y8x5OVBYyorMHJZn5vBtQQntA/34weBupDhiSIy1ufmtkQW9MeaCVFcr6/cXkJqRzQebD1FaUUWfTqGkOGK4eUgM0e0D3V2icbGgN8ZctBNlFbz31UGWZ+aQ6Zqbf1XfTqQ4YriqXyf8fW1uvjtZ0BtjmtTuIydZnpnNW1m55J04RVTo6bn5sVzWub27y/NKFvTGmGZRWVXN5187++Z/sv0wldXK4NiOpDhi+K/B3egQZHPzW4oFvTGm2eWfPMXbX+ayPCOHnYdPEOjnw/UDnXPzR/a2ufnNzYLeGNNiVJXNuUUsz8jhbxtzOV5WSUx4OyYnOS98bnPzm4cFvTHGLcoqqvho6yHSMnPOzM0f0yeSFEcs18V3Icjf5uY3FQt6Y4zb5RwrYUVmLmlZ2WQXlNI+yI8bB3cjxRHLoJgwm5t/kSzojTGtRnW1snZfPmkZOby/5SBlFdVc1jmUFEcsE4d0JyrU5uZfCAt6Y0yrdLysgr9vOsjyzGy+/LYQPx/h6n6dSHHEcmXfaPxsbn6jWdAbY1q9XYdPkJaZw4qsXI6ePEVUaCCThjr75vfpZHPzz8WC3hjTZlRUVfP5zjxSM7L5dMcRKquVIT06kuKI5QeDutLe5ubXyYLeGNMm5Z04daZv/q4jJwny9+GGgc4Ln4/oFWFz82uwoDfGtGmqyqacIpZnZLNy4wFOnKokNqIdyUmxTEqKoXvHdu4u0e0uOuhFZDzwJ8AXeEFVH6+1PBB4BUgC8oEpqrq/xvIewDbgYVV9qqHXsqA3xjTk9Nz81Ixs/r07HxH4Xp8okh2xXDugs9fOzb+ooBcRX+Br4BogB9gA3Kqq22qs80NgkKrOEZGpwM2qOqXG8jRAgXUW9MaYppJdUMKKrByWZ+SQW1hKhyA/bkrsToojloHdO3jV3PyLDfpROI/Er3PdfxBAVR+rsc5HrnXSRcQPOAREq6qKyERgDFAMnLSgN8Y0tepqZe3efGff/C2HOFVZTb8u7Ul2xDIxsRuRXjA3v6Gg92vE87sD2TXu5wAj6ltHVStFpAiIFJEyYC7OTwO/aKDAe4F7AXr06NGIkowx5js+PsLoPlGM7hPFI6UV/P2rA6Rm5PDbv2/j8Q+2M65fZ1KGxXDFpd45N78xQX8xHgb+qKonG/oIpaqLgEXgPKJv5pqMMR4srJ0/00b0ZNqInnx9+ATLM5x98z/ceohO7QO5ZWgMyY4YLokOdXepLaYxQZ8LxNa4H+N6rK51clxDN2E4T8qOACaLyO+BjkC1iJSp6l8uunJjjDmHyzq355cTBvA/4/uxascRUjNyeP6LvSz4fA9JPcNJccQwYVA3QgOb+5jXvRozRu+H82TsOJyBvgG4TVW31ljnR0BCjZOxt6hqSq3tPIyN0Rtj3OzIiTLX3Pwcdh85STt/X25I6EqKI4bhvSLa7Ancixqjd425/xj4COf0yhdVdauI/AbIUNWVwGLgVRHZDRQAU5uufGOMaTqd2gdx7xWXcM/lvdmYXUhqRg7vbjrAiqwcekYGk5wUw6SkGLqGec7cfPvClDHG65WWV/Hh1oOkbsghfa9zbv7ll0aT4ojh+/3bxtx8+2asMcY00rf5JaRl5bAi0zk3P6ydPxMTu5HsiGVg9zB3l1cvC3pjjDlP1dXKmj3Oufkfbj1EeWU1/bt2IMURw8TE7oSHBLi7xLNY0BtjzEUoKqlg5VcHWJ6RzVc5Rfj7CtcM6EyyI5YrLo3GtxU0V7OgN8aYJrLj0HGWZ+Tw9pe5FBSX07lDIJOGxpDsiKVXVIjb6rKgN8aYJlZeWc2nO46wPCObVTuPUK0wLC6cZEcsExK6EtLCc/Mt6I0xphkdOV7GW66++XvzigkO8GVCgrNv/rC48BaZm29Bb4wxLUBVyfq2kOUZ2by76QDF5VX0igphclIMk4bG0CUsqNle24LeGGNaWEl5JR9sdvbNX7evAB+BKy6LJjkplu8P6ESgX9POzbegN8YYN/omv5i0zBzSMnM4WFRGx2B/JiY6L3we361p5uZb0BtjTCtQVa38e/dRUjOy+XjrYcqrqonv1oEURyw3JXajY/CFz823oDfGmFamsKSclZsOkJqRzZbc4wT4+jBzdE9+OWHABW3vYi88Yowxpol1DA5gxqg4ZoyKY9uB4yzPzG62i5xb0BtjjJsN6NaBX3eLb7bte981tYwxxstY0BtjjIezoDfGGA9nQW+MMR7Ogt4YYzycBb0xxng4C3pjjPFwFvTGGOPhWl0LBBHJA765iE1EAUebqBxjarO/L9OcLubvq6eqRte1oNUF/cUSkYz6+j0Yc7Hs78s0p+b6+7KhG2OM8XAW9MYY4+E8MegXubsA49Hs78s0p2b5+/K4MXpjjDFn88QjemOMMTVY0BtjjIfziKAXkRdF5IiIbHF3LcbziEisiKwSkW0islVE/tvdNRnPIiJBIrJeRDa5/sYeadLte8IYvYhcAZwEXlHVge6ux3gWEekKdFXVLBFpD2QCE1V1m5tLMx5CRAQIUdWTIuIP/Av4b1Vd2xTb94gjelVdDRS4uw7jmVT1oKpmuW6fALYD3d1blfEk6nTSddff9dNkR+EeEfTGtBQRiQOGAOvcW4nxNCLiKyIbgSPAP1S1yf7GLOiNaSQRCQVWAD9T1ePursd4FlWtUtVEIAYYLiJNNgxtQW9MI7jGTVcAr6nqW+6ux3guVS0EVgHjm2qbFvTGnIPrRNliYLuqPu3ueoznEZFoEenout0OuAbY0VTb94igF5E3gHSgr4jkiMhd7q7JeJQxwO3A1SKy0fVzg7uLMh6lK7BKRL4CNuAco/97U23cI6ZXGmOMqZ9HHNEbY4ypnwW9McZ4OAt6Y4zxcBb0xhjj4SzojTHGw1nQG9OERORKEWmyaXHGNAULemOM8XAW9MYrich0V//vjSKy0NVQ6qSI/NHVD/yfIhLtWjdRRNaKyFci8raIhLse7yMin7h6iGeJyCWuzYeKSJqI7BCR11zfrDXGbSzojdcRkf7AFGCMq4lUFTANCAEyVDUe+Bz4tesprwBzVXUQsLnG468B81V1MDAaOOh6fAjwM2AA0BvnN2uNcRs/dxdgjBuMA5KADa6D7XY4W8NWA2+61lkKvCUiYUBHVf3c9fgSYLnrAiTdVfVtAFUtA3Btb72q5rjubwTicF5Iwhi3sKA33kiAJar64FkPivxvrfUutD/IqRq3q7B/Z8bNbOjGeKN/ApNFpBOAiESISE+c/x4mu9a5DfiXqhYBx0TkctfjtwOfu640lSMiE13bCBSR4BZ9F8Y0kh1pGK+jqttE5FfAxyLiA1QAPwKKcV7w4Vc4h3KmuJ4yE1jgCvK9wCzX47cDC0XkN65tJLfg2zCm0ax7pTEuInJ
SVUPdXYcxTc2GbowxxsPZEb0xxng4O6I3xhgPZ0FvjDEezoLeGGM8nAW9McZ4OAt6Y4zxcP8/FDMQmpQAfU8AAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light", "tags": [] }, "output_type": "display_data" } ], "source": [ "loss = pd.DataFrame(\n", " {\"train loss\": hist.history[\"loss\"], \"test loss\": hist.history[\"val_loss\"]}\n", ").melt()\n", "loss[\"epoch\"] = loss.groupby(\"variable\").cumcount() + 1\n", "sns.lineplot(x=\"epoch\", y=\"value\", hue=\"variable\", data=loss).set(\n", " title=\"Model loss\",\n", " ylabel=\"\",\n", " xticks=range(1, loss[\"epoch\"].max() + 1),\n", " xticklabels=loss[\"epoch\"].unique(),\n", ")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ttN0O4gq8fpU" }, "source": [ "Since the low loss value indicates that our model is already really good, we don't do any hyperparameter tuning. Also, our parameters were already chosen with good care. However, when looking for peak performance, it would be reasonable to compare different configurations. Natural starting points would be to:\n", "1. Increasing the input length\n", "2. Change the learning rate or use a dynamic rate adaption\n", "3. Try different batch sizes\n", "\n", "I'd expect to see some further but very marginal improvement from these steps." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "vtpQKGwN78Cm" }, "source": [ "### Evaluation \n", "\n", "Was this all worth it? Let's find out:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "cr7Q2oMRWc1p", "outputId": "3b39ca9b-1c0e-4815-e433-316dbcbc03d0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "661/661 - 55s\n" ] } ], "source": [ "# predict on test set\n", "pred = model.predict(test, batch_size=BATCH_SIZE, verbose=2, use_multiprocessing=True)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "colab_type": "code", "id": "QR5Tr86Yf3zr", "outputId": "bafa43a8-1ca3-4246-9e60-c88ccd60d931" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0.0 0.99 0.99 0.99 76290\n", " 1.0 0.88 0.92 0.90 8305\n", "\n", " accuracy 0.98 84595\n", " macro avg 0.93 0.95 0.94 84595\n", "weighted avg 0.98 0.98 0.98 84595\n", "\n" ] } ], "source": [ "# Load best model from Checkpoint\n", "# model = load_model(PATH_GDRIVE_TMP+\"BERT.h5\", compile=False)\n", "pred_class = (pred > 0.5).astype(int)\n", "report = metrics.classification_report(test_labels, pred_class)\n", "print(report)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "mgEzkcGZG_dY" }, "source": [ "Let's recap our previous results: In [the first part]({filename}/doctors_nlp1.ipynb) we used a traditional classification method and achieved a decent macro f1-score of `0.9`. In [part two]({filename}/doctors_nlp1.ipynb), we improved the score to `0.93` using a LSTM neural network and FastText word embeddings. For our harder to predict class 1 (bad ratings), we had a precision of `0.86` and recall of `0.89`. \n", "Compared to that, our BERT approach is a significant improvement. We increase the macro f1 to `0.94` while getting a decent improvement in both, the precision and recall for class 1 . You might be fooled by the small increase of the absolute numbers and wonder why I'm talking about \"big improvement\". Well, we really need to put this into relation. The better a model already performs, the harder it gets to squeeze out further improvements. 
{ "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "vtpQKGwN78Cm" }, "source": [ "### Evaluation\n", "\n", "Was this all worth it? Let's find out:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 34 }, "colab_type": "code", "id": "cr7Q2oMRWc1p", "outputId": "3b39ca9b-1c0e-4815-e433-316dbcbc03d0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "661/661 - 55s\n" ] } ], "source": [ "# predict on test set\n", "pred = model.predict(test, batch_size=BATCH_SIZE, verbose=2, use_multiprocessing=True)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 173 }, "colab_type": "code", "id": "QR5Tr86Yf3zr", "outputId": "bafa43a8-1ca3-4246-9e60-c88ccd60d931" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "              precision    recall  f1-score   support\n", "\n", "         0.0       0.99      0.99      0.99     76290\n", "         1.0       0.88      0.92      0.90      8305\n", "\n", "    accuracy                           0.98     84595\n", "   macro avg       0.93      0.95      0.94     84595\n", "weighted avg       0.98      0.98      0.98     84595\n", "\n" ] } ], "source": [ "# Load best model from checkpoint\n", "# model = load_model(PATH_GDRIVE_TMP+\"BERT.h5\", compile=False)\n", "pred_class = (pred > 0.5).astype(int)\n", "report = metrics.classification_report(test_labels, pred_class)\n", "print(report)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "mgEzkcGZG_dY" }, "source": [ "Let's recap our previous results: In [the first part]({filename}/doctors_nlp1.ipynb) we used a traditional classification method and achieved a decent macro f1-score of `0.9`. In [part two]({filename}/doctors_nlp2.ipynb), we improved the score to `0.93` using an LSTM neural network and FastText word embeddings. For our harder-to-predict class 1 (bad ratings), we had a precision of `0.86` and a recall of `0.89`. \n", "Compared to that, our BERT approach is a significant improvement: we increase the macro f1 to `0.94` and see a decent gain in both precision and recall for class 1. You might be fooled by the small increase in the absolute numbers and wonder why I'm calling this a \"significant improvement\". Well, we really need to put this into perspective. The better a model already performs, the harder it gets to squeeze out further improvements. An increase in f1 from `0.6` to `0.7` is usually easier to achieve and less impressive than an improvement from `0.95` to `0.97`. Think \"Pareto principle\" and \"diminishing marginal returns\"!" ] },
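{ "cell_type": "markdown", "metadata": {}, "source": [ "One small side note on the evaluation: above, we turned the predicted probabilities into classes with a fixed cut-off of `0.5`. If, say, recall on the bad ratings mattered more to us than precision (or vice versa), we could sweep this decision threshold and pick a better trade-off. Below is a minimal sketch reusing `pred` and `test_labels` from the cells above; in practice, you would tune the threshold on a validation split rather than on the test set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch only: macro f1 for a range of decision thresholds.\n", "# Reuses `pred` and `test_labels` from the evaluation cells above.\n", "for threshold in np.arange(0.3, 0.71, 0.1):\n", "    pred_class = (pred > threshold).astype(int)\n", "    f1 = metrics.f1_score(test_labels, pred_class, average=\"macro\")\n", "    print(f\"threshold={threshold:.1f}  macro f1={f1:.3f}\")" ] },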
" ] } ], "metadata": { "accelerator": "TPU", "colab": { "collapsed_sections": [], "machine_shape": "hm", "name": "doctors_nlp3.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "0b030820fc404024a713bc98c07abe09": { "model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "DescriptionStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "description_width": "" } }, "29a0fd54e7c84a1c9d94af6ed2d54d9e": { "model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "FloatProgressModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "ProgressView", "bar_style": "success", "description": "Downloading: 100%", "description_tooltip": null, "layout": "IPY_MODEL_7c4f58ccfb71419cb35746b1e6db289d", "max": 254728, "min": 0, "orientation": "horizontal", "style": "IPY_MODEL_95f1d26e2a9d4234aad0b092dc22ffeb", "value": 254728 } }, "357c5cecfded46158c390c94938cdafa": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "7c4f58ccfb71419cb35746b1e6db289d": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, 
"max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } }, "818dfbe55e61424c975383f033e7687c": { "model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HTMLModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HTMLView", "description": "", "description_tooltip": null, "layout": "IPY_MODEL_b7272f21484244d0b49f5da3a28395b5", "placeholder": "​", "style": "IPY_MODEL_0b030820fc404024a713bc98c07abe09", "value": " 255k/255k [00:00<00:00, 546kB/s]" } }, "9059282e9697457491a909f99b07d7d6": { "model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": { "_dom_classes": [], "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "HBoxModel", "_view_count": null, "_view_module": "@jupyter-widgets/controls", "_view_module_version": "1.5.0", "_view_name": "HBoxView", "box_style": "", "children": [ "IPY_MODEL_29a0fd54e7c84a1c9d94af6ed2d54d9e", "IPY_MODEL_818dfbe55e61424c975383f033e7687c" ], "layout": "IPY_MODEL_357c5cecfded46158c390c94938cdafa" } }, "95f1d26e2a9d4234aad0b092dc22ffeb": { "model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": { "_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_model_name": "ProgressStyleModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "StyleView", "bar_color": null, "description_width": "initial" } }, "b7272f21484244d0b49f5da3a28395b5": { "model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": { "_model_module": "@jupyter-widgets/base", "_model_module_version": "1.2.0", "_model_name": "LayoutModel", "_view_count": null, "_view_module": "@jupyter-widgets/base", "_view_module_version": "1.2.0", "_view_name": "LayoutView", "align_content": null, "align_items": null, "align_self": null, "border": null, "bottom": null, "display": null, "flex": null, "flex_flow": null, "grid_area": null, "grid_auto_columns": null, "grid_auto_flow": null, "grid_auto_rows": null, "grid_column": null, "grid_gap": null, "grid_row": null, "grid_template_areas": null, "grid_template_columns": null, "grid_template_rows": null, "height": null, "justify_content": null, "justify_items": null, "left": null, "margin": null, "max_height": null, "max_width": null, "min_height": null, "min_width": null, "object_fit": null, "object_position": null, "order": null, "overflow": null, "overflow_x": null, "overflow_y": null, "padding": null, "right": null, "top": null, "visibility": null, "width": null } } } } }, "nbformat": 4, "nbformat_minor": 4 }