{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# code for loading the format for the notebook\n", "import os\n", "\n", "# path : store the current path to convert back to it later\n", "path = os.getcwd()\n", "os.chdir(os.path.join('..', '..', 'notebook_format'))\n", "\n", "from formats import load_style\n", "load_style(plot_style=False)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Using TensorFlow backend.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Ethen 2019-06-14 09:48:45 \n", "\n", "CPython 3.6.4\n", "IPython 7.5.0\n", "\n", "numpy 1.16.1\n", "pandas 0.24.2\n", "sklearn 0.20.3\n", "matplotlib 3.0.3\n", "xgboost 0.81\n", "keras 2.2.2\n" ] } ], "source": [ "os.chdir(path)\n", "\n", "# 1. magic for inline plot\n", "# 2. magic to print version\n", "# 3. magic so that the notebook will reload external python modules\n", "# 4. magic to enable retina (high resolution) plots\n", "# https://gist.github.com/minrk/3301035\n", "%matplotlib inline\n", "%load_ext watermark\n", "%load_ext autoreload\n", "%autoreload 2\n", "%config InlineBackend.figure_format='retina'\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.model_selection import train_test_split\n", "\n", "# change default style figure and font size\n", "plt.rcParams['figure.figsize'] = 8, 6\n", "plt.rcParams['font.size'] = 12\n", "\n", "%watermark -a 'Ethen' -d -t -v -p numpy,pandas,sklearn,matplotlib,xgboost,keras" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Leveraging Word2vec for Text Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many machine learning algorithms requires the input features to be represented as a fixed-length feature\n", "vector. When it comes to texts, one of the most common fixed-length features is one hot encoding methods such as bag of words or tf-idf. The advantage of these approach is that they have fast execution time, while the main drawback is they lose the ordering & semantics of the words.\n", "\n", "The motivation behind converting text into semantic vectors (such as the ones provided by `Word2Vec`) is that not only do these type of methods have the capabilities to extract the semantic relationships (e.g. the word powerful should be closely related to strong as oppose to another word like bank), but they should be preserve most of the relevant information about a text while having relatively low dimensionality.\n", "\n", "In this notebook, we'll take a look at how a `Word2Vec` model can also be used as a dimensionality reduction algorithm to feed into a text classifier. A good one should be able to extract the signal from the noise efficiently, hence improving the performance of the classifier." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preparation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll download the text classification data, read it into a `pandas` dataframe and split it into train and test set." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import os\n", "from subprocess import call\n", "\n", "\n", "def download_data(base_dir='.'):\n", " \"\"\"download Reuters' text categorization benchmarks from its url.\"\"\"\n", " \n", " train_data = 'r8-train-no-stop.txt'\n", " test_data = 'r8-test-no-stop.txt'\n", " concat_data = 'r8-no-stop.txt'\n", " base_url = 'http://www.cs.umb.edu/~smimarog/textmining/datasets/'\n", " \n", " if not os.path.isdir(base_dir):\n", " os.makedirs(base_dir, exist_ok=True)\n", " \n", " dir_prefix_flag = ' --directory-prefix ' + base_dir\n", "\n", " # brew install wget\n", " # on a mac if you don't have it\n", " train_data_path = os.path.join(base_dir, train_data)\n", " if not os.path.isfile(train_data_path):\n", " call('wget ' + base_url + train_data + dir_prefix_flag, shell=True)\n", " \n", " test_data_path = os.path.join(base_dir, test_data)\n", " if not os.path.isfile(test_data_path):\n", " call('wget ' + base_url + test_data + dir_prefix_flag, shell=True)\n", "\n", " concat_data_path = os.path.join(base_dir, concat_data)\n", " if not os.path.isfile(concat_data_path):\n", " # concatenate train and test files, we'll make our own train-test splits\n", " # the > piping symbol directs the concatenated file to a new file, it\n", " # will replace the file if it already exists; on the other hand, the >> symbol\n", " # will append if it already exists\n", " train_test_path = os.path.join(base_dir, 'r8-*-no-stop.txt')\n", " call('cat {} > {}'.format(train_test_path, concat_data_path), shell=True)\n", "\n", " return concat_data_path" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'data/r8-no-stop.txt'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base_dir = 'data'\n", "data_path = download_data(base_dir)\n", "data_path" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def load_data(data_path):\n", " texts, labels = [], []\n", " with open(data_path) as f:\n", " for line in f:\n", " label, text = line.split('\\t')\n", " # texts are already tokenized, just split on space\n", " # in a real use-case we would put more effort in preprocessing\n", " texts.append(text.split())\n", " labels.append(label)\n", " \n", " return pd.DataFrame({'texts': texts, 'labels': labels})" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dimension: (7674, 2)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textslabels
0[asian, exporters, fear, damage, japan, rift, ...trade
1[china, daily, vermin, eat, pct, grain, stocks...grain
2[australian, foreign, ship, ban, ends, nsw, po...ship
3[sumitomo, bank, aims, quick, recovery, merger...acq
4[amatil, proposes, two, for, bonus, share, iss...earn
\n", "
" ], "text/plain": [ " texts labels\n", "0 [asian, exporters, fear, damage, japan, rift, ... trade\n", "1 [china, daily, vermin, eat, pct, grain, stocks... grain\n", "2 [australian, foreign, ship, ban, ends, nsw, po... ship\n", "3 [sumitomo, bank, aims, quick, recovery, merger... acq\n", "4 [amatil, proposes, two, for, bonus, share, iss... earn" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = load_data(data_path)\n", "data['labels'] = data['labels'].astype('category')\n", "print('dimension: ', data.shape)\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "label_mapping = data['labels'].cat.categories\n", "data['labels'] = data['labels'].cat.codes\n", "X = data['texts']\n", "y = data['labels']" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "test_size = 0.1\n", "random_state = 1234\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=test_size, random_state=random_state, stratify=y)\n", "\n", "# val_size = 0.1\n", "# X_train, X_val, y_train, y_val = train_test_split(\n", "# X_train, y_train, test_size=val_size, random_state=random_state, stratify=y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gensim Implementation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After feeding the `Word2Vec` algorithm with our corpus, it will learn a vector representation for each word. This by itself, however, is still not enough to be used as features for text classification as each record in our data is a document not a word.\n", "\n", "To extend these word vectors and generate document level vectors, we'll take the naive approach and use an average of all the words in the document (We could also leverage tf-idf to generate a weighted-average version, but that is not done here). The `Word2Vec` algorithm is wrapped inside a sklearn-compatible transformer which can be used almost the same way as `CountVectorizer` or `TfidfVectorizer` from `sklearn.feature_extraction.text`. Almost - because sklearn vectorizers can also do their own tokenization - a feature which we won't be using anyway because the corpus we will be using is already tokenized.\n", "\n", "In the next few code chunks, we will build a pipeline that transforms the text into low dimensional vectors via average word vectors as use it to fit a boosted tree model, we then report the performance of the training/test set.\n", "\n", "The `transformers` folder that contains the implementation is at the following [link](https://github.com/ethen8181/machine-learning/tree/master/keras/text_classification/transformers)." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pipeline(memory=None,\n", " steps=[('w2v', GensimWord2VecVectorizer(alpha=0.025, batch_words=10000, callbacks=(),\n", " cbow_mean=1, compute_loss=False,\n", " hashfxn=, hs=0, iter=10,\n", " max_final_vocab=None, max_vocab_size=None, min_alpha=0.0001,\n", " min_count=3, negati...tate=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,\n", " seed=None, silent=True, subsample=1))])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from xgboost import XGBClassifier\n", "from sklearn.pipeline import Pipeline\n", "from transformers import GensimWord2VecVectorizer\n", "\n", "gensim_word2vec_tr = GensimWord2VecVectorizer(size=50, min_count=3, sg=1, alpha=0.025, iter=10)\n", "xgb = XGBClassifier(learning_rate=0.01, n_estimators=100, n_jobs=-1)\n", "w2v_xgb = Pipeline([\n", " ('w2v', gensim_word2vec_tr), \n", " ('xgb', xgb)\n", "])\n", "w2v_xgb" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "elapsed: 11.784907102584839\n" ] }, { "data": { "text/plain": [ "Pipeline(memory=None,\n", " steps=[('w2v', GensimWord2VecVectorizer(alpha=0.025, batch_words=10000, callbacks=(),\n", " cbow_mean=1, compute_loss=False,\n", " hashfxn=, hs=0, iter=10,\n", " max_final_vocab=None, max_vocab_size=None, min_alpha=0.0001,\n", " min_count=3, negati...\n", " reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,\n", " silent=True, subsample=1))])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import time\n", "\n", "start = time.time()\n", "w2v_xgb.fit(X_train, y_train)\n", "elapse = time.time() - start\n", "print('elapsed: ', elapse)\n", "w2v_xgb" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training set accuracy 0.9485954242687518\n" ] }, { "data": { "text/plain": [ "array([[2026, 2, 29, 0, 0, 3, 0, 3],\n", " [ 12, 313, 3, 0, 0, 0, 7, 1],\n", " [ 148, 0, 3381, 0, 0, 1, 0, 0],\n", " [ 3, 1, 6, 23, 1, 0, 4, 8],\n", " [ 5, 0, 3, 0, 205, 28, 1, 2],\n", " [ 5, 1, 7, 0, 9, 238, 0, 4],\n", " [ 16, 6, 5, 0, 2, 1, 96, 4],\n", " [ 3, 1, 7, 0, 7, 6, 0, 269]])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import accuracy_score, confusion_matrix\n", "\n", "y_train_pred = w2v_xgb.predict(X_train)\n", "print('Training set accuracy %s' % accuracy_score(y_train, y_train_pred))\n", "confusion_matrix(y_train, y_train_pred)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test set accuracy 0.9244791666666666\n" ] }, { "data": { "text/plain": [ "array([[224, 0, 3, 0, 0, 1, 0, 1],\n", " [ 2, 34, 0, 0, 0, 0, 1, 1],\n", " [ 17, 0, 375, 0, 0, 0, 1, 0],\n", " [ 0, 0, 0, 1, 0, 0, 1, 3],\n", " [ 1, 1, 0, 0, 19, 5, 0, 1],\n", " [ 2, 0, 1, 0, 3, 21, 0, 2],\n", " [ 1, 1, 2, 0, 0, 0, 8, 2],\n", " [ 0, 2, 0, 0, 0, 3, 0, 28]])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_test_pred = w2v_xgb.predict(X_test)\n", "print('Test set accuracy %s' % accuracy_score(y_test, y_test_pred))\n", "confusion_matrix(y_test, y_test_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can extract the Word2vec part of the pipeline and do some sanity check of whether the word 
vectors that were learned make any sense." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "vocabulary size: 9846\n" ] }, { "data": { "text/plain": [ "[('shares', 0.8288736939430237),\n", " ('common', 0.8102667927742004),\n", " ('dilutive', 0.7489669919013977),\n", " ('effected', 0.7259572744369507),\n", " ('warrants', 0.718142032623291),\n", " ('lazo', 0.7158320546150208),\n", " ('pubco', 0.7046886682510376),\n", " ('dealings', 0.7039074897766113),\n", " ('fractional', 0.7031500339508057),\n", " ('spartech', 0.7023240327835083)]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vocab_size = len(w2v_xgb.named_steps['w2v'].model_.wv.index2word)\n", "print('vocabulary size:', vocab_size)\n", "w2v_xgb.named_steps['w2v'].model_.wv.most_similar(positive=['stock'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Keras Implementation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll also show how we can use a generic deep learning framework to implement the `Word2Vec` part of the pipeline. There are many variants of `Word2Vec`; here, we'll only be implementing skip-gram with negative sampling.\n", "\n", "The flow would look like the following:\n", "\n", "An (integer) input of a target word and a real or negative context word. This is essentially the skipgram part, where any word within the context of the target word is a real context word and we randomly draw from the rest of the vocabulary to serve as the negative context words.\n", "\n", "An embedding layer lookup (i.e. looking up the integer index of the word in the embedding matrix to get the word vector).\n", "\n", "A dot product operation. As the network trains, words which are similar should end up having similar embedding vectors. The most popular way of measuring similarity between two vectors $A$ and $B$ is the cosine similarity.\n", "\n", "\\begin{align}\n", "similarity = \\cos(\\theta) = \\frac{\\textbf{A}\\cdot\\textbf{B}}{\\parallel\\textbf{A}\\parallel_2 \\parallel \\textbf{B} \\parallel_2}\n", "\\end{align}\n", "\n", "The denominator of this measure acts to normalize the result – the real similarity operation is on the numerator: the dot product between vectors $A$ and $B$.\n", "\n", "Followed by a sigmoid output layer. Our network is a binary classifier since it's distinguishing words from the same context versus those that aren't."
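, "\n", "\n", "Keras ships a helper, `keras.preprocessing.sequence.skipgrams`, that generates exactly these (target word, context word, label) pairs, including the negative samples drawn from the vocabulary. The snippet below is a minimal sketch of producing such pairs to feed into the network defined next, reusing the `vocab_size` computed above (the `encoded_doc` indices are made-up toy values):\n", "\n", "```python\n", "from keras.preprocessing.sequence import skipgrams\n", "\n", "# toy sequence of word indices; in practice this would be an integer-encoded document\n", "encoded_doc = [2354, 69, 203, 4288, 815]\n", "couples, labels = skipgrams(encoded_doc, vocabulary_size=vocab_size,\n", "                            window_size=5, negative_samples=2)\n", "\n", "batch_center = [center for center, context in couples]\n", "batch_context = [context for center, context in couples]\n", "# these pairs can then be fed to the model, e.g.\n", "# model.train_on_batch([batch_center, batch_context], labels)\n", "```"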
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/mingyuliu/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Colocations handled automatically by placer.\n", "__________________________________________________________________________________________________\n", "Layer (type) Output Shape Param # Connected to \n", "==================================================================================================\n", "input_1 (InputLayer) (None, 1) 0 \n", "__________________________________________________________________________________________________\n", "input_2 (InputLayer) (None, 1) 0 \n", "__________________________________________________________________________________________________\n", "embed_in (Embedding) (None, 1, 100) 984600 input_1[0][0] \n", " input_2[0][0] \n", "__________________________________________________________________________________________________\n", "reshape_1 (Reshape) (None, 100) 0 embed_in[0][0] \n", "__________________________________________________________________________________________________\n", "reshape_2 (Reshape) (None, 100) 0 embed_in[1][0] \n", "__________________________________________________________________________________________________\n", "dot_1 (Dot) (None, 1) 0 reshape_1[0][0] \n", " reshape_2[0][0] \n", "__________________________________________________________________________________________________\n", "dense_1 (Dense) (None, 1) 2 dot_1[0][0] \n", "==================================================================================================\n", "Total params: 984,602\n", "Trainable params: 984,602\n", "Non-trainable params: 0\n", "__________________________________________________________________________________________________\n" ] } ], "source": [ "# the keras model/graph would look something like this:\n", "from keras import layers, optimizers, Model\n", "\n", "# adjustable parameter that control the dimension of the word vectors\n", "embed_size = 100\n", "\n", "input_center = layers.Input((1,))\n", "input_context = layers.Input((1,))\n", "\n", "embedding = layers.Embedding(vocab_size, embed_size, input_length=1, name='embed_in')\n", "center = embedding(input_center) # shape [seq_len, # features (1), embed_size]\n", "context = embedding(input_context)\n", "\n", "center = layers.Reshape((embed_size,))(center)\n", "context = layers.Reshape((embed_size,))(context)\n", "\n", "dot_product = layers.dot([center, context], axes=1)\n", "output = layers.Dense(1, activation='sigmoid')(dot_product)\n", "model = Model(inputs=[input_center, input_context], outputs=output)\n", "model.compile(loss='binary_crossentropy', optimizer=optimizers.RMSprop(lr=0.01))\n", "model.summary()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/mingyuliu/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "Use tf.cast instead.\n" ] }, { "data": { "text/plain": [ "0.6951684" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# then we can 
feed in the skipgram and its label (whether the word pair is in or outside\n", "# the context)\n", "batch_center = [2354, 2354, 2354, 69, 69]\n", "batch_context = [4288, 203, 69, 2535, 815]\n", "batch_label = [0, 1, 1, 0, 1]\n", "model.train_on_batch([batch_center, batch_context], batch_label)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `transformers` folder that contains the implementation is at the following [link](https://github.com/ethen8181/machine-learning/tree/master/keras/text_classification/transformers)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "KerasWord2VecVectorizer(batch_size=64, embed_size=50, epochs=5000,\n", " learning_rate=0.05, min_count=3, negative_samples=2,\n", " sort_vocab=True, use_sampling_table=True, window_size=5)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from transformers import KerasWord2VecVectorizer\n", "\n", "keras_word2vec_tr = KerasWord2VecVectorizer(embed_size=50, min_count=3, epochs=5000,\n", " negative_samples=2)\n", "keras_word2vec_tr" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 5000/5000 [02:49<00:00, 29.45it/s]\n" ] }, { "data": { "text/plain": [ "Pipeline(memory=None,\n", " steps=[('w2v', KerasWord2VecVectorizer(batch_size=64, embed_size=50, epochs=5000,\n", " learning_rate=0.05, min_count=3, negative_samples=2,\n", " sort_vocab=True, use_sampling_table=True, window_size=5)), ('xgb', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n", " ...\n", " reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,\n", " silent=True, subsample=1))])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "keras_w2v_xgb = Pipeline([\n", " ('w2v', keras_word2vec_tr), \n", " ('xgb', xgb)\n", "])\n", "\n", "keras_w2v_xgb.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training set accuracy 0.9093541847668694\n" ] }, { "data": { "text/plain": [ "array([[1966, 4, 69, 0, 6, 6, 2, 10],\n", " [ 57, 255, 14, 0, 1, 1, 5, 3],\n", " [ 143, 4, 3377, 0, 2, 0, 2, 2],\n", " [ 10, 2, 3, 16, 0, 2, 8, 5],\n", " [ 5, 2, 5, 0, 209, 16, 0, 7],\n", " [ 19, 6, 13, 0, 26, 177, 0, 23],\n", " [ 43, 6, 7, 4, 0, 1, 65, 4],\n", " [ 10, 14, 39, 0, 5, 7, 3, 215]])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train_pred = keras_w2v_xgb.predict(X_train)\n", "print('Training set accuracy %s' % accuracy_score(y_train, y_train_pred))\n", "confusion_matrix(y_train, y_train_pred)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Test set accuracy 0.8958333333333334\n" ] }, { "data": { "text/plain": [ "array([[218, 1, 6, 1, 0, 0, 1, 2],\n", " [ 7, 28, 1, 0, 0, 0, 1, 1],\n", " [ 20, 1, 371, 0, 1, 0, 0, 0],\n", " [ 1, 1, 1, 0, 0, 0, 1, 1],\n", " [ 0, 0, 4, 0, 18, 4, 0, 1],\n", " [ 2, 0, 1, 0, 4, 20, 0, 2],\n", " [ 2, 2, 1, 0, 0, 1, 7, 1],\n", " [ 2, 0, 2, 0, 2, 1, 0, 26]])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_test_pred = keras_w2v_xgb.predict(X_test)\n", "print('Test set accuracy %s' % accuracy_score(y_test, y_test_pred))\n", "confusion_matrix(y_test, y_test_pred)" ] }, { "cell_type": "code", 
"execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "vocabulary size: 9847\n" ] }, { "data": { "text/plain": [ "[('common', 0.7450751),\n", " ('split', 0.72033733),\n", " ('annual', 0.7107709),\n", " ('acquistion', 0.7084387),\n", " ('remaining', 0.6948875),\n", " ('total', 0.69227046),\n", " ('shareholder', 0.68726707),\n", " ('shareholders', 0.6678084),\n", " ('associates', 0.64986753),\n", " ('board', 0.6477556)]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('vocabulary size:', keras_w2v_xgb.named_steps['w2v'].vocab_size_)\n", "keras_w2v_xgb.named_steps['w2v'].most_similar(positive=['stock'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Benchmarks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll compare the word2vec + xgboost approach with tfidf + logistic regression. The latter approach is known for its interpretability and fast training time, hence serves as a strong baseline.\n", "\n", "Note that for sklearn's tfidf, we didn't use the default analyzer 'words', as this means it expects that input is a single string which it will try to split into individual words, but our texts are already tokenized, i.e. already lists of words." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", " \n", "tfidf = TfidfVectorizer(stop_words='english', analyzer=lambda x: x)\n", "logistic = LogisticRegression(solver='liblinear', multi_class='auto')\n", "\n", "tfidf_logistic = Pipeline([\n", " ('tfidf', tfidf), \n", " ('logistic', logistic)\n", "])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import randint, uniform\n", "\n", "w2v_params = {'w2v__size': [100, 150, 200]}\n", "tfidf_params = {'tfidf__ngram_range': [(1, 1), (1, 2)]}\n", "logistic_params = {'logistic__C': [0.5, 1.0, 1.5]}\n", "xgb_params = {'xgb__max_depth': randint(low=3, high=12),\n", " 'xgb__colsample_bytree': uniform(loc=0.8, scale=0.2),\n", " 'xgb__subsample': uniform(loc=0.8, scale=0.2)}\n", "\n", "tfidf_logistic_params = {**tfidf_params, **logistic_params}\n", "w2v_xgb_params = {**w2v_params, **xgb_params}" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "training: w2v_xgb\n", "Fitting 3 folds for each of 3 candidates, totalling 9 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 4 out of 9 | elapsed: 2.0min remaining: 2.4min\n", "[Parallel(n_jobs=-1)]: Done 9 out of 9 | elapsed: 2.3min finished\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "training: tfidf_logistic\n", "Fitting 3 folds for each of 3 candidates, totalling 9 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 4 out of 9 | elapsed: 1.4s remaining: 1.7s\n", "[Parallel(n_jobs=-1)]: Done 9 out of 9 | elapsed: 2.0s finished\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
model_nametrain_scoretest_scoreestimator
0tfidf_logistic0.9491750.962240RandomizedSearchCV(cv=3, error_score='raise-de...
1w2v_xgb0.9545320.958333RandomizedSearchCV(cv=3, error_score='raise-de...
\n", "
" ], "text/plain": [ " model_name train_score test_score \\\n", "0 tfidf_logistic 0.949175 0.962240 \n", "1 w2v_xgb 0.954532 0.958333 \n", "\n", " estimator \n", "0 RandomizedSearchCV(cv=3, error_score='raise-de... \n", "1 RandomizedSearchCV(cv=3, error_score='raise-de... " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import RandomizedSearchCV\n", "\n", "cv = 3\n", "n_iter = 3\n", "random_state = 1234\n", "scoring = 'accuracy'\n", "\n", "all_models = [\n", " ('w2v_xgb', w2v_xgb, w2v_xgb_params),\n", " ('tfidf_logistic', tfidf_logistic, tfidf_logistic_params)\n", "]\n", "\n", "all_models_info = []\n", "for name, model, params in all_models:\n", " print('training:', name)\n", " model_tuned = RandomizedSearchCV(\n", " estimator=model,\n", " param_distributions=params,\n", " cv=cv,\n", " n_iter=n_iter,\n", " n_jobs=-1,\n", " verbose=1,\n", " scoring=scoring,\n", " random_state=random_state,\n", " return_train_score=False\n", " ).fit(X_train, y_train)\n", " \n", " y_test_pred = model_tuned.predict(X_test)\n", " test_score = accuracy_score(y_test, y_test_pred)\n", " info = name, model_tuned.best_score_, test_score, model_tuned\n", " all_models_info.append(info)\n", "\n", "columns = ['model_name', 'train_score', 'test_score', 'estimator']\n", "results = pd.DataFrame(all_models_info, columns=columns)\n", "results = (results\n", " .sort_values('test_score', ascending=False)\n", " .reset_index(drop=True))\n", "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that different run may result in different performance being reported. And as our dataset changes, different approaches might that worked the best on one dataset might no longer be the best. Especially since the dataset we're working with here isn't very big, training an embedding from scratch will most likely not reach its full potential.\n", "\n", "There are many other text classification techniques in the deep learning realm that we haven't yet explored, we'll leave that for another day." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Blog: A Word2Vec Keras tutorial](https://adventuresinmachinelearning.com/word2vec-keras-tutorial/)\n", "- [Blog: Text Classification With Word2Vec](http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "287px" }, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }