{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "vFlPm1QPTzVS", "toc": true }, "source": [ "

Table of Contents

" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 296 }, "colab_type": "code", "id": "f8GF5FE1T_rG", "outputId": "d5b323a5-6147-43f7-ca65-914236a5480e" }, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# code for loading the format for the notebook\n", "import os\n", "\n", "# path : store the current path to convert back to it later\n", "path = os.getcwd()\n", "os.chdir(os.path.join('..', '..', 'notebook_format'))\n", "\n", "from formats import load_style\n", "load_style(plot_style=False)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 254 }, "colab_type": "code", "id": "8AIJxRPfTzVT", "outputId": "75b674d2-0604-444f-82c0-20529c5a9a5c" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Ethen 2019-12-31 11:20:36 \n", "\n", "CPython 3.6.4\n", "IPython 7.9.0\n", "\n", "numpy 1.16.5\n", "pandas 0.25.0\n", "sklearn 0.21.2\n", "keras 2.2.2\n", "sentencepiece n\u0007\n" ] } ], "source": [ "os.chdir(path)\n", "\n", "# 1. magic for inline plot\n", "# 2. magic to print version\n", "# 3. magic so that the notebook will reload external python modules\n", "# 4. magic to enable retina (high resolution) plots\n", "# https://gist.github.com/minrk/3301035\n", "%matplotlib inline\n", "%load_ext watermark\n", "%load_ext autoreload\n", "%autoreload 2\n", "%config InlineBackend.figure_format='retina'\n", "\n", "import os\n", "import time\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "from typing import List, Tuple\n", "from keras import layers\n", "from keras.models import Model\n", "from keras.preprocessing.text import Tokenizer\n", "from keras.utils.np_utils import to_categorical\n", "from keras.preprocessing.sequence import pad_sequences\n", "\n", "# prevent scientific notations\n", "pd.set_option('display.float_format', lambda x: '%.3f' % x)\n", "\n", "%watermark -a 'Ethen' -d -t -v -p numpy,pandas,sklearn,keras,sentencepiece" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Ra_cl9HpTzVX" }, "source": [ "# Subword Tokenization for Text Classification" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "LIxkV18yd_7o" }, "source": [ "In this notebook, we will be experimenting with subword tokenization. Tokenization is often one of the first mandatory tasks in an NLP pipeline, where we break down a piece of text into meaningful individual units/tokens.\n", "\n", "There're three major ways of performing tokenization.\n", "\n", "**Character Level**\n", "\n", "Treats each character (or unicode code point) as one individual token.\n", "\n", "- Pros: This approach requires the least amount of preprocessing.\n", "- Cons: The downstream task needs to be able to learn relative positions of the characters, dependencies and spellings, making it harder to achieve good performance.\n", "\n", "**Word Level**\n", "\n", "Performs word segmentation on top of our text data.\n", "\n", "- Pros: Words are how we as humans process text information.\n", "- Cons: The correctness of the segmentation is highly dependent on the software we're using, e.g. 
[Spacy's Tokenization](https://spacy.io/usage/spacy-101#annotations-token) applies language-specific rules to segment the original text into words. Also, word level can't handle unseen words (a.k.a. out-of-vocabulary words) and performs poorly on rare words.\n", "\n", "[Blog: Language modeling a billion words](http://torch.ch/blog/2016/07/25/nce.html) also shared some thoughts comparing character-based tokenization versus word-based tokenization. Taken directly from the post:\n", "\n", "> Word-level models have an important advantage over char-level models. Take the following sequence as an example (a quote from Robert A. Heinlein):\n", ">\n", "> Progress isn't made by early risers. It's made by lazy men trying to find easier ways to do something.\n", ">\n", "> After tokenization, the word-level model might view this sequence as containing 22 tokens. On the other hand, the char-level will view this sequence as containing 102 tokens. This longer sequence makes the task of the character model harder than the word model, as it must take into account dependencies between more tokens over more time-steps. Another issue with character language models is that they need to learn spelling in addition to syntax, semantics, etc. In any case, word language models will typically have lower error than character models.\n", ">\n", "> The main advantage of character over word language models is that they have a really small vocabulary. For example, the GBW dataset will contain approximately 800 characters compared to 800,000 words (after pruning low-frequency tokens). In practice this means that character models will require less memory and have faster inference than their word counterparts. Another advantage is that they do not require tokenization as a preprocessing step.\n", "\n", "**Subword Level**\n", "\n", "As we can probably imagine, subword level sits somewhere between character level and word level, hence it tries to bring in the pros from both approaches (being able to handle out-of-vocabulary or rare words better) and mitigate their drawbacks (too fine-grained for downstream tasks). With subword level, what we are aiming for is to represent an open vocabulary through a fixed-size vocabulary of variable-length character sequences, e.g. the word highest might be segmented into the subwords high and est.\n", "\n", "There're many different methods for generating these subwords, e.g.\n", "\n", "- A naive way is to brute force generate the subwords by sliding a fixed-size window through the word, e.g. highest -> hig, igh, ghe, etc. (a short sketch of this idea follows below).\n", "- More clever approaches such as Byte Pair Encoding and Unigram language models. We won't be covering the internals of these approaches here. There's another [document](https://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/deep_learning/subword/bpe.ipynb) that goes more in-depth into Byte Pair Encoding and sentencepiece, the open-sourced package that we'll be using here to experiment with subword tokenization."
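, "\n",
"To make the naive sliding-window idea concrete, here is a tiny illustrative sketch (the `sliding_window_subwords` helper is ours, purely for illustration; the rest of this notebook relies on sentencepiece instead):\n",
"\n",
"```python\n",
"def sliding_window_subwords(word, size=3):\n",
"    # brute-force subwords: every contiguous window of `size` characters\n",
"    return [word[i:i + size] for i in range(len(word) - size + 1)]\n",
"\n",
"print(sliding_window_subwords('highest'))  # ['hig', 'igh', 'ghe', 'hes', 'est']\n",
"```\n"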
 ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "5DMXNXY8cxyB" }, "source": [ "## Data Preprocessing" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "J4-4wdUCLTfd" }, "source": [ "We'll use the movie review sentiment analysis dataset from [Kaggle](https://www.kaggle.com/c/word2vec-nlp-tutorial/overview) for this example. It's a binary classification problem with AUC as the ultimate evaluation metric. The next few code chunks perform the usual text preprocessing, build up the word vocabulary, and perform a train/test split." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": {}, "colab_type": "code", "id": "ouINoqYoLj-G" }, "outputs": [], "source": [ "data_dir = 'data'\n", "submission_dir = 'submission'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 215 }, "colab_type": "code", "id": "Sw-Fv7onTzVa", "outputId": "9a9596b6-bf7a-4f7a-ff91-75a5ccf6ec0b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(25000, 3)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " id sentiment review\n", "0 5814_8 1 With all this stuff going down at the moment w...\n", "1 2381_9 1 \\The Classic War of the Worlds\\\" by Timothy Hi...\n", "2 7759_3 0 The film starts with a manager (Nicholas Bell)...\n", "3 3630_4 0 It must be assumed that those who praised this...\n", "4 9495_8 1 Superbly trashy and wondrously unpretentious 8..." ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "input_path = os.path.join(data_dir, 'word2vec-nlp-tutorial', 'labeledTrainData.tsv')\n", "df = pd.read_csv(input_path, delimiter='\\t')\n", "print(df.shape)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 55 }, "colab_type": "code", "id": "jhkw8aNWTzVc", "outputId": "b80aa706-984b-46a7-8b97-ae4fb7b6c55c" }, "outputs": [ { "data": { "text/plain": [ "\"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.

Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.

The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.

Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.

Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter.\"" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "raw_text = df['review'].iloc[0]\n", "raw_text" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": {}, "colab_type": "code", "id": "EzKRUBdaTzVd" }, "outputs": [], "source": [ "import re\n", "\n", "def clean_str(string: str) -> str:\n", " string = re.sub(r\"\\\\\", \"\", string) \n", " string = re.sub(r\"\\'\", \"\", string) \n", " string = re.sub(r\"\\\"\", \"\", string) \n", " return string.strip().lower()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": {}, "colab_type": "code", "id": "N7zXKhi_TzVf" }, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "def clean_text(df: pd.DataFrame,\n", " text_col: str,\n", " label_col: str) -> Tuple[List[str], List[int]]:\n", " texts = []\n", " labels = []\n", " for raw_text, label in zip(df[text_col], df[label_col]): \n", " text = BeautifulSoup(raw_text).get_text()\n", " cleaned_text = clean_str(text)\n", " texts.append(cleaned_text)\n", " labels.append(label)\n", "\n", " return texts, labels" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 72 }, "colab_type": "code", "id": "atEsQAQmTzVh", "outputId": "47af827f-3347-43ac-94a4-d8e711e2ef36" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sample text: with all this stuff going down at the moment with mj ive started listening to his music, watching the odd documentary here and there, watched the wiz and watched moonwalker again. maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. some of it has subtle messages about mjs feeling towards the press and also the obvious message of drugs are bad mkay.visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring. some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him.the actual feature film bit when it finally starts is only on for 20 minutes or so excluding the smooth criminal sequence and joe pesci is convincing as a psychopathic all powerful drug lord. why he wants mj dead so bad is beyond me. because mj overheard his plans? nah, joe pescis character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates mjs music.lots of cool things in this like mj turning into a car and a robot and the whole speed demon sequence. 
also, the director must have had the patience of a saint when it came to filming the kiddy bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.bottom line, this movie is for people who like mj on one level or another (which i think is most people). if not, then stay away. it does try and give off a wholesome message and ironically mjs bestest buddy in this movie is a girl! michael jackson is truly one of the most talented people ever to grace this planet but is he guilty? well, with all the attention ive gave this subject....hmmm well i dont know because people can be different behind closed doors, i know this for a fact. he is either an extremely nice but stupid guy or one of the most sickest liars. i hope he is not the latter.\n", "corresponding label: 1\n" ] } ], "source": [ "text_col = 'review'\n", "label_col = 'sentiment'\n", "\n", "texts, labels = clean_text(df, text_col, label_col)\n", "print('sample text: ', texts[0])\n", "print('corresponding label:', labels[0])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 69 }, "colab_type": "code", "id": "TegoqBX2LXhN", "outputId": "eec77599-a881-49ac-f79b-aec45c5e53ae" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "labels shape: (25000, 2)\n", "train size: 20000\n", "validation size: 5000\n" ] } ], "source": [ "random_state = 1234\n", "val_split = 0.2\n", "\n", "labels = to_categorical(labels)\n", "texts_train, texts_val, y_train, y_val = train_test_split(\n", " texts, labels,\n", " test_size=val_split,\n", " random_state=random_state)\n", "\n", "print('labels shape:', labels.shape)\n", "print('train size: ', len(texts_train))\n", "print('validation size: ', len(texts_val))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "nuzAIjT2coR0" }, "source": [ "## Model" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "P2xIWcTuN51O" }, "source": [ "To train our text classifier, we specify a 1D convolutional network. The comparison we'll be experimenting is whether subword-level model gives a better performance than word-level model." 
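, "\n",
"Side note on the shapes: with a `max_sequence_len` of 1000 and 'valid' padding, each `Conv1D` with kernel size 5 shortens the sequence by 4 and each `MaxPooling1D(5)` divides it by 5, so the length reaches exactly 35 right before the last pooling layer, which is why the model below finishes with `MaxPooling1D(35)` to collapse the sequence to length 1. A quick check of that arithmetic (a sketch that simply mirrors those hyperparameters):\n",
"\n",
"```python\n",
"seq_len = 1000\n",
"for _ in range(2):\n",
"    # Conv1D(kernel_size=5, padding='valid') followed by MaxPooling1D(5)\n",
"    seq_len = (seq_len - 5 + 1) // 5\n",
"seq_len = seq_len - 5 + 1  # third Conv1D\n",
"print(seq_len)  # 35 -> MaxPooling1D(35) then reduces it to 1\n",
"```\n"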
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": {}, "colab_type": "code", "id": "dU9OGjo-ogZg" }, "outputs": [], "source": [ "def simple_text_cnn(max_sequence_len: int, max_features: int, num_classes: int,\n", " optimizer: str='adam', metrics: List[str]=['acc']) -> Model:\n", "\n", " sequence_input = layers.Input(shape=(max_sequence_len,), dtype='int32')\n", " embedded_sequences = layers.Embedding(max_features, 100,\n", " trainable=True)(sequence_input)\n", " conv1 = layers.Conv1D(128, 5, activation='relu')(embedded_sequences)\n", " pool1 = layers.MaxPooling1D(5)(conv1)\n", " conv2 = layers.Conv1D(128, 5, activation='relu')(pool1)\n", " pool2 = layers.MaxPooling1D(5)(conv2)\n", " conv3 = layers.Conv1D(128, 5, activation='relu')(pool2)\n", " pool3 = layers.MaxPooling1D(35)(conv3)\n", " flatten = layers.Flatten()(pool3)\n", " dense = layers.Dense(128, activation='relu')(flatten)\n", " preds = layers.Dense(num_classes, activation='softmax')(dense)\n", "\n", " model = Model(sequence_input, preds)\n", " model.compile(loss='categorical_crossentropy',\n", " optimizer=optimizer,\n", " metrics=metrics)\n", " return model" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "nZ7kauOBPoRH" }, "source": [ "### Subword-Level Tokenizer" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "cJQ4UXyEP3ZJ" }, "source": [ "The next couple of code chunks train the subword vocabulary, encode our original text into these subwords, and pad the sequences to a fixed length.\n", "\n", "Note that the `pad_sequences` function from keras assumes that index 0 is reserved for padding, hence when learning the subword vocabulary using `sentencepiece`, we make sure to keep the index consistent." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": {}, "colab_type": "code", "id": "Dl0bs8GMP7BG" }, "outputs": [], "source": [ "# write the raw text so that sentencepiece can consume it\n", "temp_file = 'train.txt'\n", "with open(temp_file, 'w') as f:\n", " f.write('\\n'.join(texts))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 52 }, "colab_type": "code", "id": "o2SDuD4OP7Ms", "outputId": "86114e4e-7a74-4756-f7c8-e075ccdb22cf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--input=train.txt --model_type=unigram --model_prefix=unigram --vocab_size=30000 --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sentencepiece import SentencePieceTrainer, SentencePieceProcessor\n", "\n", "max_num_words = 30000\n", "model_type = 'unigram'\n", "model_prefix = model_type\n", "pad_id = 0\n", "unk_id = 1\n", "bos_id = 2\n", "eos_id = 3\n", "\n", "sentencepiece_params = ' '.join([\n", " '--input={}'.format(temp_file),\n", " '--model_type={}'.format(model_type),\n", " '--model_prefix={}'.format(model_type),\n", " '--vocab_size={}'.format(max_num_words),\n", " '--pad_id={}'.format(pad_id),\n", " '--unk_id={}'.format(unk_id),\n", " '--bos_id={}'.format(bos_id),\n", " '--eos_id={}'.format(eos_id)\n", "])\n", "print(sentencepiece_params)\n", "SentencePieceTrainer.train(sentencepiece_params)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "0Cx88PSVP7Ts", "outputId": "9fb8ff68-ac3c-4e61-e180-0fbb18164faf" }, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 30000 unique tokens.\n" ] } ], "source": [ "sp = SentencePieceProcessor()\n", "sp.load(\"{}.model\".format(model_prefix))\n", "print('Found %s unique tokens.' % sp.get_piece_size())" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "tNAqHThVP7aq", "outputId": "c7eb12e6-ae8e-4939-efea-4438c5648ed6" }, "outputs": [ { "data": { "text/plain": [ "[62, 5086, 4170, 2260, 2520]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "max_sequence_len = 1000\n", "\n", "sequences_train = [sp.encode_as_ids(text) for text in texts_train]\n", "x_train = pad_sequences(sequences_train, maxlen=max_sequence_len)\n", "\n", "sequences_val = [sp.encode_as_ids(text) for text in texts_val]\n", "x_val = pad_sequences(sequences_val, maxlen=max_sequence_len)\n", "\n", "sequences_train[0][:5]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 72 }, "colab_type": "code", "id": "9IdOv4MYQCU8", "outputId": "3c6b9111-c645-4d2e-c058-0a65ce9d3e59" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sample text: when gundam0079 became the movie trilogy most of us are familiar with, a lot of it was sheer action and less of anything else. this ova is kinda the opposite. though therere only half a dozen episodes, it isnt filled with action, but emotional things. the two main action sequences in this, i believe, are enough to satisfy me. after seeing so many gundam series, movies, and ovas, i was completely ready for a civilian-esquire movie. this movie did a fantastic job of that. what makes this movie stand out is that shows both sides of the war have good and bad people. it made the zeons seem more human rather than the original movies where theyre depicted as the second rise of evil nazis. most people that dont like anime that ive forced to watch this movie (lol), liked it. so, id recommend it to a lot of people just for the anti-war story. 
if youre a gundam fan, and havent seen this, you shouldnt be reading this; you should already be watching it right now.\n", "sample text: ['▁when', '▁gundam', '00', '7', '9', '▁became', '▁the', '▁movie', '▁trilogy', '▁most', '▁of', '▁us', '▁are', '▁familiar', '▁with', ',', '▁a', '▁lot', '▁of', '▁it', '▁was', '▁sheer', '▁action', '▁and', '▁less', '▁of', '▁anything', '▁else', '.', '▁this', '▁ova', '▁is', '▁kinda', '▁the', '▁opposite', '.', '▁though', '▁there', 're', '▁only', '▁half', '▁a', '▁dozen', '▁episodes', ',', '▁it', '▁isnt', '▁filled', '▁with', '▁action', ',', '▁but', '▁emotional', '▁things', '.', '▁the', '▁two', '▁main', '▁action', '▁sequences', '▁in', '▁this', ',', '▁i', '▁believe', ',', '▁are', '▁enough', '▁to', '▁satisfy', '▁me', '.', '▁after', '▁seeing', '▁so', '▁many', '▁gundam', '▁series', ',', '▁movies', ',', '▁and', '▁ova', 's', ',', '▁i', '▁was', '▁completely', '▁ready', '▁for', '▁a', '▁civilian', '-', 'esquire', '▁movie', '.', '▁this', '▁movie', '▁did', '▁a', '▁fantastic', '▁job', '▁of', '▁that', '.', '▁what', '▁makes', '▁this', '▁movie', '▁stand', '▁out', '▁is', '▁that', '▁shows', '▁both', '▁sides', '▁of', '▁the', '▁war', '▁have', '▁good', '▁and', '▁bad', '▁people', '.', '▁it', '▁made', '▁the', '▁zeon', 's', '▁seem', '▁more', '▁human', '▁rather', '▁than', '▁the', '▁original', '▁movies', '▁where', '▁theyre', '▁depicted', '▁as', '▁the', '▁second', '▁rise', '▁of', '▁evil', '▁nazis', '.', '▁most', '▁people', '▁that', '▁dont', '▁like', '▁anime', '▁that', '▁ive', '▁forced', '▁to', '▁watch', '▁this', '▁movie', '▁(', 'lol', '),', '▁liked', '▁it', '.', '▁so', ',', '▁id', '▁recommend', '▁it', '▁to', '▁a', '▁lot', '▁of', '▁people', '▁just', '▁for', '▁the', '▁anti', '-', 'war', '▁story', '.', '▁if', '▁youre', '▁a', '▁gundam', '▁fan', ',', '▁and', '▁havent', '▁seen', '▁this', ',', '▁you', '▁shouldnt', '▁be', '▁reading', '▁this', ';', '▁you', '▁should', '▁already', '▁be', '▁watching', '▁it', '▁right', '▁now', '.']\n" ] } ], "source": [ "print('sample text: ', texts_train[0])\n", "print('sample text: ', sp.encode_as_pieces(sp.decode_ids(x_train[0].tolist())))" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 763 }, "colab_type": "code", "id": "x1thP8gLomx1", "outputId": "480a08b0-b80a-4d6f-bf5e-7d80eeac5e99" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:66: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.\n", "\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:541: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.\n", "\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4432: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.\n", "\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4267: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.\n", "\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:793: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.\n", "\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3576: The name tf.log is deprecated. 
Please use tf.math.log instead.\n", "\n", "Model: \"model_1\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "input_1 (InputLayer) (None, 1000) 0 \n", "_________________________________________________________________\n", "embedding_1 (Embedding) (None, 1000, 100) 3000100 \n", "_________________________________________________________________\n", "conv1d_1 (Conv1D) (None, 996, 128) 64128 \n", "_________________________________________________________________\n", "max_pooling1d_1 (MaxPooling1 (None, 199, 128) 0 \n", "_________________________________________________________________\n", "conv1d_2 (Conv1D) (None, 195, 128) 82048 \n", "_________________________________________________________________\n", "max_pooling1d_2 (MaxPooling1 (None, 39, 128) 0 \n", "_________________________________________________________________\n", "conv1d_3 (Conv1D) (None, 35, 128) 82048 \n", "_________________________________________________________________\n", "max_pooling1d_3 (MaxPooling1 (None, 1, 128) 0 \n", "_________________________________________________________________\n", "flatten_1 (Flatten) (None, 128) 0 \n", "_________________________________________________________________\n", "dense_1 (Dense) (None, 128) 16512 \n", "_________________________________________________________________\n", "dense_2 (Dense) (None, 2) 258 \n", "=================================================================\n", "Total params: 3,245,094\n", "Trainable params: 3,245,094\n", "Non-trainable params: 0\n", "_________________________________________________________________\n" ] } ], "source": [ "num_classes = 2\n", "model1 = simple_text_cnn(max_sequence_len, max_num_words + 1, num_classes)\n", "model1.summary()" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 676 }, "colab_type": "code", "id": "vPnaTN_Pybx-", "outputId": "112be31d-3950-4407-cac8-b774266f7f6e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/math_grad.py:1424: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.where in 2.0, which has the same broadcast rule as np.where\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1033: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.\n", "\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1020: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.\n", "\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3005: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.\n", "\n", "Train on 20000 samples, validate on 5000 samples\n", "Epoch 1/8\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.\n", "\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:197: The name tf.ConfigProto is deprecated. 
Please use tf.compat.v1.ConfigProto instead.\n", "\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:207: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.\n", "\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:216: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.\n", "\n", "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:223: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.\n", "\n", "20000/20000 [==============================] - 7s 363us/step - loss: 0.5963 - acc: 0.6101 - val_loss: 0.3138 - val_acc: 0.8702\n", "Epoch 2/8\n", "20000/20000 [==============================] - 4s 224us/step - loss: 0.2239 - acc: 0.9120 - val_loss: 0.2991 - val_acc: 0.8820\n", "Epoch 3/8\n", "20000/20000 [==============================] - 4s 223us/step - loss: 0.0797 - acc: 0.9738 - val_loss: 0.3427 - val_acc: 0.8852\n", "Epoch 4/8\n", "20000/20000 [==============================] - 4s 224us/step - loss: 0.0193 - acc: 0.9946 - val_loss: 0.5095 - val_acc: 0.8814\n", "Epoch 5/8\n", "20000/20000 [==============================] - 4s 222us/step - loss: 0.0050 - acc: 0.9988 - val_loss: 0.7519 - val_acc: 0.8704\n", "Epoch 6/8\n", "20000/20000 [==============================] - 4s 223us/step - loss: 0.0016 - acc: 0.9999 - val_loss: 0.7487 - val_acc: 0.8840\n", "Epoch 7/8\n", "20000/20000 [==============================] - 4s 223us/step - loss: 2.0759e-04 - acc: 1.0000 - val_loss: 0.8045 - val_acc: 0.8810\n", "Epoch 8/8\n", "20000/20000 [==============================] - 4s 223us/step - loss: 5.2034e-05 - acc: 1.0000 - val_loss: 0.8260 - val_acc: 0.8824\n" ] }, { "data": { "text/plain": [ "39.04836106300354" ] }, "execution_count": 18, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# time : 120\n", "# performance : 0.92936\n", "start = time.time()\n", "history1 = model1.fit(x_train, y_train,\n", " validation_data=(x_val, y_val),\n", " batch_size=128,\n", " epochs=8)\n", "end = time.time()\n", "elapse1 = end - start\n", "elapse1" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ujh5usDDPwca" }, "source": [ "### Word-Level Tokenizer" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "dzasobcmv0NK", "outputId": "2c08cac6-73ef-46e9-a744-f55a531d5aba" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 74207 unique tokens.\n" ] } ], "source": [ "tokenizer = Tokenizer(num_words=max_num_words, oov_token='')\n", "tokenizer.fit_on_texts(texts_train)\n", "print('Found %s unique tokens.' 
% len(tokenizer.word_index))" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "QcCS7RnJv3gJ" }, "outputs": [], "source": [ "sequences_train = tokenizer.texts_to_sequences(texts_train)\n", "x_train = pad_sequences(sequences_train, maxlen=max_sequence_len)\n", "\n", "sequences_val = tokenizer.texts_to_sequences(texts_val)\n", "x_val = pad_sequences(sequences_val, maxlen=max_sequence_len)" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 535 }, "colab_type": "code", "id": "2izlBZeKwd4a", "outputId": "c73da6f6-2842-4570-cf65-1a62852d3f87" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"model_2\"\n", "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", "input_2 (InputLayer) (None, 1000) 0 \n", "_________________________________________________________________\n", "embedding_2 (Embedding) (None, 1000, 100) 3000100 \n", "_________________________________________________________________\n", "conv1d_4 (Conv1D) (None, 996, 128) 64128 \n", "_________________________________________________________________\n", "max_pooling1d_4 (MaxPooling1 (None, 199, 128) 0 \n", "_________________________________________________________________\n", "conv1d_5 (Conv1D) (None, 195, 128) 82048 \n", "_________________________________________________________________\n", "max_pooling1d_5 (MaxPooling1 (None, 39, 128) 0 \n", "_________________________________________________________________\n", "conv1d_6 (Conv1D) (None, 35, 128) 82048 \n", "_________________________________________________________________\n", "max_pooling1d_6 (MaxPooling1 (None, 1, 128) 0 \n", "_________________________________________________________________\n", "flatten_2 (Flatten) (None, 128) 0 \n", "_________________________________________________________________\n", "dense_3 (Dense) (None, 128) 16512 \n", "_________________________________________________________________\n", "dense_4 (Dense) (None, 2) 258 \n", "=================================================================\n", "Total params: 3,245,094\n", "Trainable params: 3,245,094\n", "Non-trainable params: 0\n", "_________________________________________________________________\n" ] } ], "source": [ "num_classes = 2\n", "model2 = simple_text_cnn(max_sequence_len, max_num_words + 1, num_classes)\n", "model2.summary()" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 328 }, "colab_type": "code", "id": "cV9kwyEFwiCA", "outputId": "923c0643-c82b-4809-a036-43192b6ad7ca" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train on 20000 samples, validate on 5000 samples\n", "Epoch 1/8\n", "20000/20000 [==============================] - 5s 257us/step - loss: 0.5386 - acc: 0.6734 - val_loss: 0.3237 - val_acc: 0.8708\n", "Epoch 2/8\n", "20000/20000 [==============================] - 5s 227us/step - loss: 0.2028 - acc: 0.9216 - val_loss: 0.2670 - val_acc: 0.8908\n", "Epoch 3/8\n", "20000/20000 [==============================] - 4s 225us/step - loss: 0.0668 - acc: 0.9785 - val_loss: 0.3612 - val_acc: 0.8886\n", "Epoch 4/8\n", "20000/20000 [==============================] - 5s 225us/step - loss: 0.0205 - acc: 0.9937 - val_loss: 0.4852 - val_acc: 0.8826\n", "Epoch 5/8\n", "20000/20000 [==============================] - 5s 
225us/step - loss: 0.0059 - acc: 0.9985 - val_loss: 0.6764 - val_acc: 0.8786\n", "Epoch 6/8\n", "20000/20000 [==============================] - 5s 228us/step - loss: 0.0021 - acc: 0.9995 - val_loss: 0.7321 - val_acc: 0.8788\n", "Epoch 7/8\n", "20000/20000 [==============================] - 5s 226us/step - loss: 0.0022 - acc: 0.9995 - val_loss: 0.8057 - val_acc: 0.8840\n", "Epoch 8/8\n", "20000/20000 [==============================] - 5s 226us/step - loss: 0.0034 - acc: 0.9990 - val_loss: 0.8816 - val_acc: 0.8808\n" ] }, { "data": { "text/plain": [ "37.271193742752075" ] }, "execution_count": 22, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# time : 120\n", "# performance : 0.92520\n", "start = time.time()\n", "history2 = model2.fit(x_train, y_train,\n", " validation_data=(x_val, y_val),\n", " batch_size=128,\n", " epochs=8)\n", "end = time.time()\n", "elapse2 = end - start\n", "elapse2" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "4dML3eb9XZn2" }, "source": [ "## Submission" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "xtshvmlWPOrM" }, "source": [ "For the submission section, we read in and preprocess the test data provided by the competition, then generate the predicted probability column for both the model that uses word-level tokenization and one that uses subword tokenization to compare their performance." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 215 }, "colab_type": "code", "id": "lVQr2m_7uvaK", "outputId": "83c75fe3-51e0-4413-c8a0-c96e1bb22ada" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(25000, 2)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " id review\n", "0 12311_10 Naturally in a film who's main themes are of m...\n", "1 8348_2 This movie is a disaster within a disaster fil...\n", "2 5828_4 All in all, this is a movie for kids. We saw i...\n", "3 7186_2 Afraid of the Dark left me with the impression...\n", "4 12128_7 A very accurate depiction of small time mob li..." ] }, "execution_count": 23, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "input_path = os.path.join(data_dir, 'word2vec-nlp-tutorial', 'testData.tsv')\n", "df_test = pd.read_csv(input_path, delimiter='\\t')\n", "print(df_test.shape)\n", "df_test.head()" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "126OJ6zRuzLQ" }, "outputs": [], "source": [ "def clean_text_without_label(df: pd.DataFrame, text_col: str) -> List[str]:\n", " texts = []\n", " for raw_text in df[text_col]:\n", " text = BeautifulSoup(raw_text).get_text()\n", " cleaned_text = clean_str(text)\n", " texts.append(cleaned_text)\n", "\n", " return texts" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "dWv6qd2Ru15x", "outputId": "041fa64c-f30c-4cdc-fcb7-b667479a177b" }, "outputs": [ { "data": { "text/plain": [ "25000" ] }, "execution_count": 25, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "texts_test = clean_text_without_label(df_test, text_col)\n", "\n", "# word-level\n", "word_sequences_test = tokenizer.texts_to_sequences(texts_test)\n", "word_x_test = pad_sequences(word_sequences_test, maxlen=max_sequence_len)\n", "len(word_x_test)" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "colab_type": "code", "id": "tARSoP6IxvcL", "outputId": "632ff76c-ccc7-489a-d30d-7097962bcf14" }, "outputs": [ { "data": { "text/plain": [ "25000" ] }, "execution_count": 26, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# subword-level\n", "sentencepiece_sequences_test = [sp.encode_as_ids(text) for text in texts_test]\n", "sentencepiece_x_test = pad_sequences(sentencepiece_sequences_test, maxlen=max_sequence_len)\n", "len(sentencepiece_x_test)" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": {}, "colab_type": "code", "id": "uH29uBsFv3my" }, "outputs": [], "source": [ "def create_submission(ids, predictions, ids_col, prediction_col, submission_path) -> pd.DataFrame:\n", " df_submission = pd.DataFrame({\n", " ids_col: ids,\n", " prediction_col: predictions\n", " }, columns=[ids_col, prediction_col])\n", "\n", " if submission_path is not None:\n", " # create the directory if need be, e.g. 
if the submission_path = submission/submission.csv\n", " # we'll create the submission directory first if it doesn't exist\n", " directory = os.path.split(submission_path)[0]\n", " if (directory != '' and directory != '.') and not os.path.isdir(directory):\n", " os.makedirs(directory, exist_ok=True)\n", "\n", " df_submission.to_csv(submission_path, index=False, header=True)\n", "\n", " return df_submission" ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 250 }, "colab_type": "code", "id": "I2U6tjSxu96J", "outputId": "449677e8-ea5a-4b4f-fd94-f80cb7c71159" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "generating submission for: sentencepiece_cnn\n", "generating submission for: word_cnn\n", "(25000, 2)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " id sentiment\n", "0 12311_10 1.000\n", "1 8348_2 0.000\n", "2 5828_4 0.000\n", "3 7186_2 1.000\n", "4 12128_7 1.000" ] }, "execution_count": 28, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "ids_col = 'id'\n", "prediction_col = 'sentiment'\n", "ids = df_test[ids_col]\n", "\n", "predictions_dict = {\n", " 'sentencepiece_cnn': model1.predict(sentencepiece_x_test)[:, 1], # 0.92936\n", " 'word_cnn': model2.predict(word_x_test)[:, 1] # 0.92520\n", "}\n", "\n", "for model_name, predictions in predictions_dict.items():\n", " print('generating submission for: ', model_name)\n", " submission_path = os.path.join(submission_dir, '{}_submission.csv'.format(model_name))\n", " df_submission = create_submission(ids, predictions, ids_col, prediction_col, submission_path)\n", "\n", "# sanity check to make sure the size and the output of the submission makes sense\n", "print(df_submission.shape)\n", "df_submission.head()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "W88BHuhWYb11" }, "source": [ "## Summary" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "j-KOzbOLYgAD" }, "source": [ "We've looked at the performance of leveraging subword tokenization for our text classification task. Note that some other ideas that we did not try out are:\n", "\n", "- Use [other word-level tokenizers](https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/). Another popular choice at the point of writing this documentation is [spacy's tokenizer](https://spacy.io/usage/spacy-101#annotations-token).\n", "- [Sentencepiece suggests](https://github.com/google/sentencepiece#trains-from-raw-sentences) that it can be trained on raw text without the need to perform language specific segmentation beforehand, e.g. using the spacy tokenizer on our raw text data before feeding it to sentencepiece to learn the subword vocabulary. We can conduct our own experiment on the task at hand to verify that claim. Sentencepiece also includes an [experiments page](https://github.com/google/sentencepiece/blob/master/doc/experiments.md) that documents some of the experiments they've conducted." 
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "xy-F2InzTzVw" }, "source": [ "# Reference" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "R_7heWfXTzVx" }, "source": [ "- [Github: sentencepiece](https://github.com/google/sentencepiece)\n", "- [Blog: NLP - Four Ways to Tokenize Chinese Documents](https://medium.com/the-artificial-impostor/nlp-four-ways-to-tokenize-chinese-documents-f349eb6ba3c3)" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "keras_subword_tokenization.ipynb", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "274px" }, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }