{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\";" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text Regression with Extra Regressors: An Example of Using Custom Data Formats and Models in *ktrain*\n", "\n", "This notebook illustrates how one can construct custom data formats and models for use in *ktrain*. In this example, we will build a model that predicts the price of a wine from **both** its textual description and the winery that produced it. This example is inspired by [FloydHub's regression template](https://github.com/floydhub/regression-template) for wine price prediction. However, instead of using the wine variety as the extra regressor, we will use the winery.\n", "\n", "Text classification (or text regression) with extra predictors arises in many scenarios. For instance, when making a prediction about the trustworthiness of a news story, one may want to consider both the text of the news article and extra metadata such as the news publication and the authors. This notebook shows how such models can be built in *ktrain*.\n", "\n", "The dataset in CSV format can be obtained from FloydHub at [this URL](https://www.floydhub.com/floydhub/datasets/wine-reviews/1/wine_data.csv). We will begin by importing some necessary modules and reading in the dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>Unnamed: 0</th><th>country</th><th>description</th><th>designation</th><th>points</th><th>price</th><th>province</th><th>region_1</th><th>region_2</th><th>variety</th><th>winery</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>82956</th><td>82956</td><td>Spain</td><td>Spiced apple and dried cheese aromas are simul...</td><td>Mercat Brut</td><td>84</td><td>12.0</td><td>Catalonia</td><td>Cava</td><td>NaN</td><td>Sparkling Blend</td><td>El Xamfrà</td></tr>\n",
"<tr><th>60767</th><td>60767</td><td>US</td><td>A little too sharp and acidic, with jammy cher...</td><td>NaN</td><td>82</td><td>9.0</td><td>California</td><td>California</td><td>California Other</td><td>Shiraz</td><td>Woodbridge by Robert Mondavi</td></tr>\n",
"<tr><th>123576</th><td>123576</td><td>Spain</td><td>Starts out rustic and leathery, with hints of ...</td><td>Selección 12 Crianza</td><td>89</td><td>15.0</td><td>Levante</td><td>Jumilla</td><td>NaN</td><td>Red Blend</td><td>Bodegas Luzón</td></tr>\n",
"<tr><th>71003</th><td>71003</td><td>Chile</td><td>Ripe to the point that it's soft and flat. Big...</td><td>NaN</td><td>82</td><td>8.0</td><td>Maule Valley</td><td>NaN</td><td>NaN</td><td>Chardonnay</td><td>Melania</td></tr>\n",
"<tr><th>78168</th><td>78168</td><td>Italy</td><td>From one of the best producers in the little-t...</td><td>Contado Riserva</td><td>88</td><td>17.0</td><td>Southern Italy</td><td>Molise</td><td>NaN</td><td>Aglianico</td><td>Di Majo Norante</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Unnamed: 0 country description \\\n", "82956 82956 Spain Spiced apple and dried cheese aromas are simul... \n", "60767 60767 US A little too sharp and acidic, with jammy cher... \n", "123576 123576 Spain Starts out rustic and leathery, with hints of ... \n", "71003 71003 Chile Ripe to the point that it's soft and flat. Big... \n", "78168 78168 Italy From one of the best producers in the little-t... \n", "\n", " designation points price province region_1 \\\n", "82956 Mercat Brut 84 12.0 Catalonia Cava \n", "60767 NaN 82 9.0 California California \n", "123576 Selección 12 Crianza 89 15.0 Levante Jumilla \n", "71003 NaN 82 8.0 Maule Valley NaN \n", "78168 Contado Riserva 88 17.0 Southern Italy Molise \n", "\n", " region_2 variety winery \n", "82956 NaN Sparkling Blend El Xamfrà \n", "60767 California Other Shiraz Woodbridge by Robert Mondavi \n", "123576 NaN Red Blend Bodegas Luzón \n", "71003 NaN Chardonnay Melania \n", "78168 NaN Aglianico Di Majo Norante " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# import some modules and read in the dataset\n", "import pandas as pd\n", "from tensorflow import keras\n", "import numpy as np\n", "import math\n", "path = 'data/wine/wine_data.csv' # ADD path/to/dataset\n", "data = pd.read_csv(path)\n", "data = data.sample(frac=1., random_state=42)\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning the Data\n", "\n", "We use the exact same data-cleaning steps employed in [FloydHub's regression example](https://github.com/floydhub/regression-template) for this dataset." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train size: 95612\n", "Test size: 23904\n" ] } ], "source": [ "# Remove rows with null values\n", "data = data[pd.notnull(data['country'])]\n", "data = data[pd.notnull(data['price'])]\n", "data = data.drop(data.columns[0], axis=1) # drop the redundant 'Unnamed: 0' column\n", "variety_threshold = 500 # Varieties occurring this many times or fewer are removed.\n", "value_counts = data['variety'].value_counts()\n", "to_remove = value_counts[value_counts <= variety_threshold].index\n", "data.replace(to_remove, np.nan, inplace=True)\n", "data = data[pd.notnull(data['variety'])]\n", "data = data[pd.notnull(data['winery'])]\n", "\n", "# Split data into train and test\n", "train_size = int(len(data) * .8)\n", "print(\"Train size: %d\" % train_size)\n", "print(\"Test size: %d\" % (len(data) - train_size))\n", "\n", "# Train features\n", "description_train = data['description'][:train_size]\n", "variety_train = data['variety'][:train_size]\n", "\n", "# Train labels\n", "labels_train = data['price'][:train_size]\n", "\n", "# Test features\n", "description_test = data['description'][train_size:]\n", "variety_test = data['variety'][train_size:]\n", "\n", "# Test labels\n", "labels_test = data['price'][train_size:]\n", "\n", "x_train = description_train.values\n", "y_train = labels_train.values\n", "x_test = description_test.values\n", "y_test = labels_test.values\n", "\n", "# winery metadata to be used later\n", "winery_train = data['winery'][:train_size]\n", "winery_test = data['winery'][train_size:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building a Vanilla Text Regression Model in *ktrain*\n", "\n", "We will preprocess the data and select a `linreg` model for our initial \"vanilla\" text regression model."
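, "\n", "\n", "To give a feel for what preprocessing text into model input involves, the sketch below turns a few made-up descriptions into fixed-length sequences of word IDs. The names `toy_texts`, `build_vocab`, and `to_padded_ids` are hypothetical, and the tokenization, truncation, and padding rules are assumptions chosen for illustration; the preprocessing that *ktrain* performs also adds n-gram features and may differ in its details.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the wine descriptions.
toy_texts = ['ripe cherry and plum', 'crisp apple and citrus', 'plum and oak']


def build_vocab(texts, max_features=35000):
    # Rank words by frequency; ID 0 is reserved for padding and unknown words.
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))[:max_features]
    return {word: i + 1 for i, word in enumerate(ranked)}


def to_padded_ids(text, vocab, maxlen=6):
    # Map words to IDs, keep the first maxlen of them, then left-pad with zeros.
    ids = [vocab.get(word, 0) for word in text.lower().split()][:maxlen]
    return [0] * (maxlen - len(ids)) + ids


vocab = build_vocab(toy_texts)
sequences = [to_padded_ids(text, vocab) for text in toy_texts]
print(sequences)
```

Every row has the same length, which is what allows the descriptions to be stacked into a single 2d array such as the `(95612, 200)` training array reported in this notebook."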
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "using Keras version: 2.2.4-tf\n" ] } ], "source": [ "import ktrain\n", "from ktrain import text" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "task: text regression (supply class_names argument if this is supposed to be classification task)\n", "language: en\n", "Word Counts: 30807\n", "Nrows: 95612\n", "95612 train sequences\n", "train sequence lengths:\n", "\tmean : 41\n", "\t95percentile : 62\n", "\t99percentile : 74\n", "Adding 3-gram features\n", "max_features changed to 1765149 with addition of ngrams\n", "Average train sequence length with ngrams: 120\n", "train (w/ngrams) sequence lengths:\n", "\tmean : 121\n", "\t95percentile : 183\n", "\t99percentile : 219\n", "x_train shape: (95612,200)\n", "y_train shape: 95612\n", "23904 test sequences\n", "test sequence lengths:\n", "\tmean : 41\n", "\t95percentile : 62\n", "\t99percentile : 74\n", "Average test sequence length with ngrams: 111\n", "test (w/ngrams) sequence lengths:\n", "\tmean : 112\n", "\t95percentile : 171\n", "\t99percentile : 207\n", "x_test shape: (23904,200)\n", "y_test shape: 23904\n" ] } ], "source": [ "trn, val, preproc = text.texts_from_array(x_train=x_train, y_train=y_train,\n", " x_test=x_test, y_test=y_test,\n", " ngram_range=3, \n", " maxlen=200, \n", " max_features=35000)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]\n", "linreg: linear text regression using a trainable Embedding layer\n", "bigru: Bidirectional GRU with pretrained word vectors [https://arxiv.org/abs/1712.09405]\n", "standard_gru: simple 2-layer GRU with randomly initialized embeddings\n", "bert: Bidirectional Encoder Representations from 
Transformers (BERT) [https://arxiv.org/abs/1810.04805]\n", "distilbert: distilled, smaller, and faster BERT from Hugging Face [https://arxiv.org/abs/1910.01108]\n" ] } ], "source": [ "text.print_text_regression_models()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "maxlen is 200\n", "done.\n" ] } ], "source": [ "model = text.text_regression_model('linreg', train_data=trn, preproc=preproc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Adding an Extra Regressor to Our Model\n", "\n", "Next, we will add an extra regressor to our model, thereby creating a new, augmented model. We choose the winery, a categorical variable, as the extra regressor. Instead of representing the winery as a one-hot-encoded vector, we will learn an embedding for it during training. The output of this embedding module is then concatenated with the output of our `linreg` text regression model and passed through additional dense layers to form the new model. The new model expects two distinct inputs: the first is an integer representing the winery, and the second is a sequence of word IDs, the standard input to neural text classifiers and regressors."
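, "\n", "\n", "As a rough illustration of the winery input described above, the following NumPy-only sketch maps a few made-up winery names to integer IDs and applies an embedding-size heuristic that caps the dimension at 50. The names `toy_wineries` and `embedding_dim` are hypothetical, and `np.unique` stands in for a label encoder here; this is a sketch under those assumptions rather than the preprocessing used in this notebook.

```python
import math

import numpy as np

# Hypothetical toy data standing in for data['winery'].
toy_wineries = np.array(['Melania', 'El Xamfra', 'Melania', 'Bodegas Luzon'])

# np.unique with return_inverse=True yields the sorted distinct names and,
# for every row, the index of its name in that sorted list.
classes, winery_ids = np.unique(toy_wineries, return_inverse=True)


def embedding_dim(n_categories, cap=50):
    # Half the number of categories (rounded up), but never more than cap.
    return int(min(math.ceil(n_categories / 2), cap))


# Shape (n_samples, 1): one integer ID per wine, ready for an embedding lookup.
winery_input = winery_ids.reshape(-1, 1)

print(classes.tolist())
print(winery_input.ravel().tolist())
print(embedding_dim(len(classes)))
```

Each distinct winery gets one ID, so repeated wineries share a row of the embedding table and the model can learn a single price signal per winery."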
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "50\n" ] } ], "source": [ "extra_train_data = winery_train\n", "extra_test_data = winery_test\n", "\n", "# encode winery as integers\n", "from sklearn.preprocessing import LabelEncoder\n", "encoder = LabelEncoder()\n", "encoder.fit(data['winery'])\n", "extra_train = encoder.transform(extra_train_data)\n", "extra_test = encoder.transform(extra_test_data)\n", "# count every winery known to the encoder so that IDs seen only in the test split stay in range\n", "no_of_unique_cat = len(encoder.classes_)\n", "embedding_size = min(np.ceil((no_of_unique_cat)/2), 50 )\n", "embedding_size = int(embedding_size)\n", "vocab = no_of_unique_cat+1\n", "print(embedding_size)\n", "extra_train = np.expand_dims(extra_train, -1)\n", "extra_test = np.expand_dims(extra_test, -1)\n", "\n", "# winery module\n", "extra_input = keras.layers.Input(shape=(1,))\n", "extra_output = keras.layers.Embedding(vocab, embedding_size, input_length=1)(extra_input)\n", "extra_output = keras.layers.Flatten()(extra_output)\n", "extra_model = keras.Model(inputs=extra_input, outputs=extra_output)\n", "extra_model.compile(loss='mse', optimizer='adam', metrics=['mae'])\n", "\n", "# Combine winery module with linreg model\n", "merged_out = keras.layers.concatenate([extra_model.output, model.output])\n", "merged_out = keras.layers.Dropout(0.25)(merged_out)\n", "merged_out = keras.layers.Dense(1000, activation='relu')(merged_out)\n", "merged_out = keras.layers.Dropout(0.25)(merged_out)\n", "merged_out = keras.layers.Dense(500, activation='relu')(merged_out)\n", "merged_out = keras.layers.Dropout(0.5)(merged_out)\n", "merged_out = keras.layers.Dense(1)(merged_out)\n", "combined_model = keras.Model([extra_model.input] + [model.input], merged_out)\n", "combined_model.compile(loss='mae',\n", "                       optimizer='adam',\n", "                       metrics=['mae'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrapping our Data in an Instance of `ktrain.Dataset`\n", "To use this custom data format 
of two inputs in *ktrain*, we will wrap it in a `ktrain.Dataset` instance. There are two ways to do this. \n", "\n", "The first is to represent our datasets as `tf.data.Dataset` instances and then wrap each in a `ktrain.TFDataset` instance, which is a wrapper to a `tf.data.Dataset`. Use of `tf.data.Dataset` instances can potentially [yield certain performance improvements](https://www.tensorflow.org/guide/data_performance). See [this example notebook](https://github.com/amaiya/ktrain/blob/master/examples/vision/mnist-tf_workflow.ipynb) for a demonstration of using the `ktrain.TFDataset` class. For this example, one can make use of `ktrain.TFDataset` instances as follows:\n", "\n", "```python\n", "import tensorflow as tf\n", "from ktrain.data import TFDataset\n", "BATCH_SIZE = 256\n", "\n", "trn_combined = [extra_train] + [trn[0]] + [trn[1]]\n", "val_combined = [extra_test] + [val[0]] + [val[1]]\n", "\n", "def features_to_tfdataset(examples):\n", "\n", "    def gen():\n", "        for idx, ex0 in enumerate(examples[0]):\n", "            ex1 = examples[1][idx]\n", "            label = examples[2][idx]\n", "            x = (ex0, ex1)\n", "            y = label\n", "            yield ( (x, y) )\n", "\n", "    # the labels are prices (floats), so a float dtype avoids truncating them\n", "    tfdataset= tf.data.Dataset.from_generator(gen,\n", "        ((tf.int32, tf.int32), tf.float32),\n", "        ((tf.TensorShape([None]), tf.TensorShape([None])), tf.TensorShape([])) )\n", "    return tfdataset\n", "train_tfdataset= features_to_tfdataset(trn_combined)\n", "val_tfdataset= features_to_tfdataset(val_combined)\n", "train_tfdataset = train_tfdataset.shuffle(trn_combined[0].shape[0]).batch(BATCH_SIZE).repeat(-1)\n", "val_tfdataset = val_tfdataset.batch(BATCH_SIZE)\n", "\n", "train_data = ktrain.TFDataset(train_tfdataset, n=trn_combined[0].shape[0], y=trn_combined[2])\n", "val_data = ktrain.TFDataset(val_tfdataset, n=val_combined[0].shape[0], y=val_combined[2])\n", "learner = ktrain.get_learner(combined_model, train_data=train_data, val_data=val_data)\n", "```\n", "\n", "\n", "\n", "The second approach is to wrap our datasets in a subclass of 
`ktrain.SequenceDataset`. We must be sure to override and implement the required methods (e.g., `def nsamples` and `def get_y`). The `ktrain.SequenceDataset` class is simply a subclass of `tf.keras.utils.Sequence`. See the TensorFlow documentation on the [Sequence class](https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence) for more information on how Sequence wrappers work. \n", "\n", "We employ the second approach in this tutorial. Note that, in the implementation below, we have made `MyCustomDataset` more general so that it can wrap lists containing an arbitrary number of inputs instead of just the two needed in our example. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "class MyCustomDataset(ktrain.SequenceDataset):\n", "    def __init__(self, x, y, batch_size=32, shuffle=True):\n", "        # error checks: x must be a 2d array or a list of 2d arrays\n", "        err = False\n", "        if isinstance(x, np.ndarray):\n", "            if x.ndim != 2: err = True\n", "        elif isinstance(x, list):\n", "            for d in x:\n", "                if not isinstance(d, np.ndarray) or d.ndim != 2:\n", "                    err = True\n", "                    break\n", "        else: err = True\n", "        if err:\n", "            raise ValueError('x must be a 2d numpy array or a list of 2d numpy arrays')\n", "        if not isinstance(y, np.ndarray):\n", "            raise ValueError('y must be a numpy array')\n", "        if isinstance(x, np.ndarray):\n", "            x = [x]\n", "\n", "        # set variables\n", "        super().__init__(batch_size=batch_size)\n", "        self.x, self.y = x, y\n", "        self.indices = np.arange(self.x[0].shape[0])\n", "        self.n_inputs = len(x)\n", "        self.shuffle = shuffle\n", "\n", "    # required for instances of tf.keras.utils.Sequence\n", "    def __len__(self):\n", "        return math.ceil(self.x[0].shape[0] / self.batch_size)\n", "\n", "    # required for instances of tf.keras.utils.Sequence\n", "    def __getitem__(self, idx):\n", "        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]\n", "        batch_x = []\n", "        for i in range(self.n_inputs):\n", "            batch_x.append(self.x[i][inds])\n", "        batch_y = self.y[inds]\n", "        return tuple(batch_x), batch_y\n", "\n", "    # required for instances of ktrain.Dataset\n", "    def nsamples(self):\n", "        return self.x[0].shape[0]\n", "\n", "    # required for instances of ktrain.Dataset\n", "    def get_y(self):\n", "        return self.y\n", "\n", "    def on_epoch_end(self):\n", "        if self.shuffle: np.random.shuffle(self.indices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that you can also add a `to_tfdataset` method to your `ktrain.SequenceDataset` subclass. The `to_tfdataset` method is responsible for converting your dataset to a `tf.data.Dataset` and, if it exists, will be called by *ktrain* just prior to training. We have not done this here.\n", "\n", "\n", "\n", "## Using the Custom Model and Data Format\n", "\n",
"Once we wrap our data in a `ktrain.SequenceDataset` instance, we can wrap the model and datasets in a `Learner` object and use *ktrain* normally." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "train_data = MyCustomDataset([extra_train] + [trn[0]], trn[1], shuffle=True)\n", "val_data = MyCustomDataset([extra_test] + [val[0]], val[1], shuffle=False)\n", "learner = ktrain.get_learner(combined_model, train_data=train_data, val_data=val_data, batch_size=256)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Estimate Learning Rate\n", "\n", "We'll choose a learning rate where the loss is falling. As shown in the plot, *1e-3* seems to be a good choice in this case." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "simulating training for different learning rates... 
this may take a few moments...\n", "Train for 373 steps\n", "Epoch 1/1024\n", "373/373 [==============================] - 9s 24ms/step - loss: 34.1117 - mae: 34.1153\n", "Epoch 2/1024\n", "373/373 [==============================] - 8s 20ms/step - loss: 28.8677 - mae: 28.8826\n", "Epoch 3/1024\n", "373/373 [==============================] - 8s 20ms/step - loss: 13.2890 - mae: 13.2908\n", "Epoch 4/1024\n", "373/373 [==============================] - 8s 21ms/step - loss: 20.4389 - mae: 20.4431\n", "Epoch 5/1024\n", "359/373 [===========================>..] - ETA: 0s - loss: 17.9780 - mae: 17.9780\n", "\n", "done.\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.lr_find(show_plot=True, restore_weights_only=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train the Model\n", "\n", "We will now train the model using the estimated learning rate from above for 12 epochs using the [1cycle learning rate policy](https://arxiv.org/pdf/1803.09820.pdf)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "begin training using onecycle policy with max lr of 0.001...\n", "Train for 374 steps, validate for 94 steps\n", "Epoch 1/12\n", "374/374 [==============================] - 9s 23ms/step - loss: 22.8788 - mae: 22.8866 - val_loss: 13.7107 - val_mae: 13.7028\n", "Epoch 2/12\n", "374/374 [==============================] - 9s 23ms/step - loss: 12.2521 - mae: 12.2531 - val_loss: 10.8341 - val_mae: 10.8276\n", "Epoch 3/12\n", "374/374 [==============================] - 9s 23ms/step - loss: 9.9158 - mae: 9.9183 - val_loss: 9.9131 - val_mae: 9.9106\n", "Epoch 4/12\n", "374/374 [==============================] - 8s 23ms/step - loss: 8.9252 - mae: 8.9264 - val_loss: 9.4691 - val_mae: 9.4692\n", "Epoch 5/12\n", "374/374 [==============================] - 8s 23ms/step - loss: 8.3064 - mae: 8.3072 - val_loss: 9.1714 - val_mae: 9.1709\n", "Epoch 6/12\n", "374/374 [==============================] - 8s 22ms/step - loss: 7.9027 - mae: 7.9037 - val_loss: 9.0367 - val_mae: 9.0353\n", "Epoch 7/12\n", "374/374 [==============================] - 9s 23ms/step - loss: 7.4723 - mae: 7.4741 - val_loss: 8.6807 - val_mae: 8.6820\n", "Epoch 8/12\n", "374/374 [==============================] - 9s 23ms/step - loss: 6.9741 - mae: 6.9762 - val_loss: 8.3878 - val_mae: 8.3916\n", "Epoch 9/12\n", "374/374 [==============================] - 8s 22ms/step - loss: 6.4518 - mae: 6.4508 - val_loss: 8.2264 - val_mae: 8.2321\n", "Epoch 10/12\n", 
"374/374 [==============================] - 8s 23ms/step - loss: 5.9795 - mae: 5.9803 - val_loss: 7.8524 - val_mae: 7.8609\n", "Epoch 11/12\n", "374/374 [==============================] - 8s 23ms/step - loss: 5.7376 - mae: 5.7394 - val_loss: 7.8682 - val_mae: 7.8760\n", "Epoch 12/12\n", "374/374 [==============================] - 8s 23ms/step - loss: 5.5266 - mae: 5.5273 - val_loss: 7.8161 - val_mae: 7.8243\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.fit_onecycle(1e-3, 12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our final validation MAE is **7.82**, which means our predictions are, on average, about $8 off the mark, which is not bad considering our model only looks at the textual description of the wine and the winery.\n", "\n", "### Plot Some Training History\n", "\n", "The validation loss is still decreasing, which suggests we could train further if desired. The second and third plot show the learning rate and momentum schedules employed by `fit_onecycle`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.plot('loss')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.plot('lr')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.plot('momentum')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### View Top Losses\n", "\n", "Let's examine the validation examples that we got the most wrong. Looks like our model has trouble with expensive wines." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----------\n", "id:21790 | loss:1042.46 | true:1100.0 | pred:57.54)\n", "\n", "----------\n", "id:13745 | loss:1014.34 | true:1400.0 | pred:385.66)\n", "\n", "----------\n", "id:11710 | loss:884.58 | true:980.0 | pred:95.42)\n", "\n" ] } ], "source": [ "learner.view_top_losses(n=3)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wet earth, rain-wet stones, damp moss, wild sage and very ripe pear make for a complex opening. Further sniffs reveal more citrus: both juice and zest of lemon. The palate still holds a lot of leesy yeast flavors but its phenolic richness is tempered by total citrus freshness. This is still tightly wound; leave it so it can come into its own. The warming resonance on the palate suggests it has a long future. Drink from 2019.\n" ] } ], "source": [ "print(x_test[21790])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A wine that has created its own universe. It has a unique, special softness that allies with the total purity that comes from a small, enclosed single vineyard. The fruit is almost irrelevant here, because it comes as part of a much deeper complexity. 
This is a great wine, at the summit of Champagne, a sublime, unforgettable experience.\n" ] } ], "source": [ "print(x_test[13745])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "preds = learner.predict(val_data)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([385.65793], dtype=float32)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preds[13745]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making Predictions\n", "\n", "Lastly, we will use our model to make predictions on 5 randomly selected wines in the validation set." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TEXT:\n", "Relatively full-bodied and muscular as well as dry, this new effort from winemaker Steve Bird features plenty of brawny citrus and spice flavors that finish long. There's no real track record, so it's probably best to drink now.\n", "\n", "\tpredicted: 18.009167\n", "\tactual: 17.0\n", "----------------------------------------\n", "TEXT:\n", "Very tart and spicy, with distinct notes of clove and orange peel. Citrus and apple flavors crop up unexpectedly, and the tannins have a hint of green tea about them.\n", "\n", "\tpredicted: 20.4764\n", "\tactual: 20.0\n", "----------------------------------------\n", "TEXT:\n", "Dusty apple aromas are given lift courtesy of citrus notes. This feels good on the palate, with zesty acidity. Flavors of stone fruits, tropical fruits, apple and citrus meld together well, while the finish is pure and long.\n", "\n", "\tpredicted: 15.768029\n", "\tactual: 17.0\n", "----------------------------------------\n", "TEXT:\n", "Smoky and savory on the nose, with saucy fruit sitting below a veil of firm oak. 
Runs a bit tart and racy in the mouth, where cherry and plum flavors are boosted by blazing natural acidity. Not a sour wine, but definitely crisp and racy.\n", "\n", "\tpredicted: 15.798236\n", "\tactual: 24.0\n", "----------------------------------------\n", "TEXT:\n", "Textbook Gewurztraminer, done well, starting with scents of rose petals and lychees, and moving through pear and melon flavors into a finish that shows a hint of bitterness. Medium-weight and just slightly off-dry.\n", "\n", "\tpredicted: 23.65241\n", "\tactual: 16.0\n", "----------------------------------------\n" ] } ], "source": [ "# 5 random predictions\n", "val_data.batch_size = 1\n", "for i in range(5):\n", "    idx = np.random.choice(len(x_test))\n", "    print(\"TEXT:\\n%s\" % (x_test[idx]))\n", "    print()\n", "    print(\"\\tpredicted: %s\" % (np.squeeze(learner.predict(val_data[idx]))))\n", "    print(\"\\tactual: %s\" % (y_test[idx]))\n", "    print('----------------------------------------')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at our highest-priced prediction (`$404`). It is associated with an expensive wine priced at `$800`, which is encouraging, but the prediction is still off by `~$400`. Again, our model has trouble with expensive wines. This is somewhat understandable, since the model only looks at short textual descriptions and the winery, neither of which contains a clear indicator of an exorbitant price." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "highest-priced prediction: 404.31885\n", "actual price for this wine:800.0\n", "TEXT:\n", "The palate opens slowly, offering an initial citrus character, followed by wood and then, finally, wonderfully rich, but taut fruit. There is still a toast character here, with apricots and pear on top of the citrus, but it is still only just developing. 
In 10–15 years, it will be a magnificent wine.\n" ] } ], "source": [ "max_pred_id = np.argmax(preds)\n", "print(\"highest-priced prediction: %s\" % (np.squeeze(preds[max_pred_id])))\n", "print(\"actual price for this wine:%s\" % (y_test[max_pred_id]))\n", "print('TEXT:\\n%s' % (x_test[max_pred_id]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making Predictions on Unseen Examples\n", "\n", "In the example above, we made predictions for examples in the validation set. To make predictions for an arbitrary set of wine data, the steps are as follows:\n", "1. Encode the winery using the same label encoder used above for validation data\n", "2. Preprocess the wine description using the `preprocess_test` method. In this example, you will use `preproc.preprocess_test`.\n", "3. Combine both into a `ktrain.Dataset` instance, as we did above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }