{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\";" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "using Keras version: 2.2.4-tf\n" ] } ], "source": [ "import ktrain\n", "from ktrain import text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicting Wine Prices from Textual Descriptions\n", "\n", "This notebook shows an example of **text regression** in *ktrain*. Given a textual description of a wine, we will attempt to predict its price. The data is available from FloydHub [here](https://www.floydhub.com/floydhub/datasets/wine-reviews/1/wine_data.csv).\n", "\n", "## Clean and Prepare the Data\n", "\n", "We follow the same data preparation as the [original FloydHub example notebook](https://github.com/floydhub/regression-template) that inspired this example." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>Unnamed: 0</th><th>country</th><th>description</th><th>designation</th><th>points</th><th>price</th><th>province</th><th>region_1</th><th>region_2</th><th>variety</th><th>winery</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>8486</th><td>8486</td><td>Italy</td><td>Made entirely from Nero d'Avola, this opens wi...</td><td>Violino</td><td>89</td><td>20.0</td><td>Sicily &amp; Sardinia</td><td>Vittoria</td><td>NaN</td><td>Nero d'Avola</td><td>Paolo Calì</td></tr>\n",
"<tr><th>148584</th><td>148585</td><td>Portugal</td><td>Warre's seems to have found just the right for...</td><td>Otima 20-year old tawny</td><td>90</td><td>42.0</td><td>Port</td><td>NaN</td><td>NaN</td><td>Port</td><td>Warre's</td></tr>\n",
"<tr><th>18353</th><td>18353</td><td>Italy</td><td>A more evolved and sophisticated expression of...</td><td>Campogrande</td><td>87</td><td>23.0</td><td>Veneto</td><td>Soave Superiore</td><td>NaN</td><td>Garganega</td><td>Sandro de Bruno</td></tr>\n",
"<tr><th>5281</th><td>5281</td><td>Spain</td><td>Red-fruit and citrus aromas create an astringe...</td><td>NaN</td><td>84</td><td>12.0</td><td>Northern Spain</td><td>Ribera del Duero</td><td>NaN</td><td>Tempranillo</td><td>Condado de Oriza</td></tr>\n",
"<tr><th>87768</th><td>87768</td><td>US</td><td>Lightly funky and showing definite signs of ea...</td><td>Lia's Vineyard</td><td>89</td><td>35.0</td><td>Oregon</td><td>Chehalem Mountains</td><td>Willamette Valley</td><td>Pinot Noir</td><td>Seven of Hearts</td></tr>\n",
"</tbody>\n", "</table>
" ], "text/plain": [ " Unnamed: 0 country \\\n", "8486 8486 Italy \n", "148584 148585 Portugal \n", "18353 18353 Italy \n", "5281 5281 Spain \n", "87768 87768 US \n", "\n", " description \\\n", "8486 Made entirely from Nero d'Avola, this opens wi... \n", "148584 Warre's seems to have found just the right for... \n", "18353 A more evolved and sophisticated expression of... \n", "5281 Red-fruit and citrus aromas create an astringe... \n", "87768 Lightly funky and showing definite signs of ea... \n", "\n", " designation points price province \\\n", "8486 Violino 89 20.0 Sicily & Sardinia \n", "148584 Otima 20-year old tawny 90 42.0 Port \n", "18353 Campogrande 87 23.0 Veneto \n", "5281 NaN 84 12.0 Northern Spain \n", "87768 Lia's Vineyard 89 35.0 Oregon \n", "\n", " region_1 region_2 variety winery \n", "8486 Vittoria NaN Nero d'Avola Paolo Calì \n", "148584 NaN NaN Port Warre's \n", "18353 Soave Superiore NaN Garganega Sandro de Bruno \n", "5281 Ribera del Duero NaN Tempranillo Condado de Oriza \n", "87768 Chehalem Mountains Willamette Valley Pinot Noir Seven of Hearts " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "path = 'data/wine/wine_data.csv' # ADD path/to/dataset\n", "data = pd.read_csv(path)\n", "data = data.sample(frac=1., random_state=0)\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train size: 95646\n", "Test size: 23912\n" ] } ], "source": [ "# this code was taken directly from FloydHub's regression template for\n", "# wine price prediction: https://github.com/floydhub/regression-template\n", "\n", "# Clean it from null values\n", "data = data[pd.notnull(data['country'])]\n", "data = data[pd.notnull(data['price'])]\n", "data = data.drop(data.columns[0], axis=1) \n", "variety_threshold = 500 # Anything that occurs less than this will be removed.\n", 
"value_counts = data['variety'].value_counts()\n", "to_remove = value_counts[value_counts <= variety_threshold].index\n", "data.replace(to_remove, np.nan, inplace=True)\n", "data = data[pd.notnull(data['variety'])]\n", "\n", "# Split data into train and test\n", "train_size = int(len(data) * .8)\n", "print (\"Train size: %d\" % train_size)\n", "print (\"Test size: %d\" % (len(data) - train_size))\n", "\n", "# Train features\n", "description_train = data['description'][:train_size]\n", "variety_train = data['variety'][:train_size]\n", "\n", "# Train labels\n", "labels_train = data['price'][:train_size]\n", "\n", "# Test features\n", "description_test = data['description'][train_size:]\n", "variety_test = data['variety'][train_size:]\n", "\n", "# Test labels\n", "labels_test = data['price'][train_size:]\n", "\n", "x_train = description_train.values\n", "y_train = labels_train.values\n", "x_test = description_test.values\n", "y_test = labels_test.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 1: Preprocess the Data" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "task: text regression (supply class_names argument if this is supposed to be classification task)\n", "language: en\n", "Word Counts: 30953\n", "Nrows: 95646\n", "95646 train sequences\n", "train sequence lengths:\n", "\tmean : 41\n", "\t95percentile : 62\n", "\t99percentile : 74\n", "Adding 3-gram features\n", "max_features changed to 1769319 with addition of ngrams\n", "Average train sequence length with ngrams: 120\n", "train (w/ngrams) sequence lengths:\n", "\tmean : 121\n", "\t95percentile : 183\n", "\t99percentile : 219\n", "x_train shape: (95646,200)\n", "y_train shape: 95646\n", "23912 test sequences\n", "test sequence lengths:\n", "\tmean : 41\n", "\t95percentile : 62\n", "\t99percentile : 73\n", "Average test sequence length with ngrams: 111\n", "test (w/ngrams) sequence lengths:\n", 
"\tmean : 112\n", "\t95percentile : 172\n", "\t99percentile : 207\n", "x_test shape: (23912,200)\n", "y_test shape: 23912\n" ] } ], "source": [ "trn, val, preproc = text.texts_from_array(x_train=x_train, y_train=y_train,\n", " x_test=x_test, y_test=y_test,\n", " ngram_range=3, \n", " maxlen=200, \n", " max_features=35000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 2: Create a Text Regression Model and Wrap in Learner" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]\n", "linreg: linear text regression using a trainable Embedding layer\n", "bigru: Bidirectional GRU with pretrained word vectors [https://arxiv.org/abs/1712.09405]\n", "standard_gru: simple 2-layer GRU with randomly initialized embeddings\n", "bert: Bidirectional Encoder Representations from Transformers (BERT) [https://arxiv.org/abs/1810.04805]\n", "distilbert: distilled, smaller, and faster BERT from Hugging Face [https://arxiv.org/abs/1910.01108]\n" ] } ], "source": [ "text.print_text_regression_models()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "maxlen is 200\n", "done.\n" ] } ], "source": [ "model = text.text_regression_model('linreg', train_data=trn, preproc=preproc)\n", "learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=256)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lower the `batch size` above if you run out of GPU memory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 3: Estimate the LR" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "simulating training for different learning rates... 
this may take a few moments...\n", "Train on 95646 samples\n", "Epoch 1/1024\n", "95646/95646 [==============================] - 8s 81us/sample - loss: 2627.6407 - mae: 34.2769\n", "Epoch 2/1024\n", "95646/95646 [==============================] - 7s 70us/sample - loss: 2610.0313 - mae: 34.0299\n", "Epoch 3/1024\n", "95646/95646 [==============================] - 7s 70us/sample - loss: 2148.5174 - mae: 26.8848\n", "Epoch 4/1024\n", "95646/95646 [==============================] - 7s 71us/sample - loss: 1158.6146 - mae: 15.1160\n", "Epoch 5/1024\n", "15360/95646 [===>..........................] - ETA: 5s - loss: 4022.5116 - mae: 36.6476\n", "\n", "done.\n", "Please invoke the Learner.lr_plot() method to visually inspect the loss plot to help identify the maximal learning rate associated with falling loss.\n" ] } ], "source": [ "learner.lr_find()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.lr_plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 4: Train and Inspect the Model" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "begin training using onecycle policy with max lr of 0.03...\n", "Train on 95646 samples, validate on 23912 samples\n", "Epoch 1/10\n", "95646/95646 [==============================] - 8s 79us/sample - loss: 1556.0435 - mae: 19.2369 - val_loss: 984.7442 - val_mae: 15.1122\n", "Epoch 2/10\n", "95646/95646 [==============================] - 8s 79us/sample - loss: 1052.5454 - mae: 13.0505 - val_loss: 808.2142 - val_mae: 12.5382\n", "Epoch 3/10\n", "95646/95646 [==============================] - 7s 76us/sample - loss: 809.7949 - mae: 9.4578 - val_loss: 695.8532 - val_mae: 10.8098\n", "Epoch 4/10\n", "95646/95646 [==============================] - 8s 80us/sample - loss: 616.9707 - mae: 6.6427 - val_loss: 621.5498 - val_mae: 9.9253\n", "Epoch 5/10\n", "95646/95646 [==============================] - 7s 78us/sample - loss: 471.5737 - mae: 4.8021 - val_loss: 582.4865 - val_mae: 9.9948\n", "Epoch 6/10\n", "95646/95646 [==============================] - 8s 79us/sample - loss: 369.3043 - mae: 4.1017 - val_loss: 572.5836 - val_mae: 10.4219\n", "Epoch 7/10\n", "95646/95646 [==============================] - 8s 79us/sample - loss: 304.9035 - mae: 3.7351 - val_loss: 563.6406 - val_mae: 10.3136\n", "Epoch 8/10\n", "95646/95646 [==============================] - 8s 79us/sample - loss: 257.2997 - mae: 2.8500 - val_loss: 562.7244 - val_mae: 10.0789\n", "Epoch 9/10\n", "95646/95646 [==============================] - 8s 79us/sample - loss: 226.9375 - mae: 1.9855 - val_loss: 559.9848 - val_mae: 9.7024\n", "Epoch 10/10\n", "95646/95646 [==============================] - 8s 79us/sample - loss: 211.9495 - mae: 1.3842 - val_loss: 
561.2627 - val_mae: 9.6977\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.fit_onecycle(0.03, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our validation MAE is roughly 10, which means our model's predictions are about $10 off on average. This isn't bad considering the wide range of wine prices and the fact that predictions are made purely from text descriptions.\n", "\n", "Let's examine the wines we got the most wrong." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----------\n", "id:6695 | loss:675000.75 | true:980.0 | pred:158.42)\n", "\n", "this was a great vintage port year and this white port which was bottled in 2015 has hints of the firm tannins and structure that marked out the year it also has preserved an amazing amount of freshness still suggesting orange marmalade flavors these are backed up by the fine concentrated old wood tastes the wine is of course ready to drink\n", "----------\n", "id:19469 | loss:524528.9 | true:775.0 | pred:50.76)\n", "\n", "perfumed florals mingle curiously with deep dusty mineral notes on this bracing tba sunny nectarine and tangerine flavors are mouthwatering and juicy struck with acidity then plunged into of sweet honey and nectar it's a delightful sensory roller coaster that feels endless on the finish\n", "----------\n", "id:3310 | loss:400394.03 | true:848.0 | pred:215.23)\n", "\n", "full of ripe fruit opulent and concentrated this is a fabulous and impressive wine it has a beautiful line of acidity balanced with ripe fruits the wood aging is subtle just a hint of smokiness and toast this is one of those wines from a great white wine vintage that will age many years drink from 2024\n" ] } ], "source": [ "learner.view_top_losses(n=3, preproc=preproc)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "It looks like our model has trouble with expensive wines. This is understandable, since their descriptions may not differ much from those of less expensive wines." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## STEP 5: Making Predictions" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "predictor = ktrain.get_predictor(learner.model, preproc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make a prediction for a random wine in the validation set." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Description: This Millesimato sparkling blend of Pinot Nero and oak-aged Chardonnay delivers a generous and creamy mouthfeel followed by refined aromas of dried fruit and baked bread. This is a beautiful wine to serve with tempura appetizers.\n", "Actual Price: 52.0\n" ] } ], "source": [ "idx = np.random.randint(len(x_test))\n", "print('Description: %s' % (x_test[idx]))\n", "print('Actual Price: %s' % (y_test[idx]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our prediction for this wine:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([52.698753], dtype=float32)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictor.predict(x_test[idx])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using the Transformer API for Text Regression\n", "\n", "*ktrain* includes a simplified interface to the Hugging Face transformers library. This interface can also be used for text regression. Here is a short example of training a [DistilBERT model](https://arxiv.org/abs/1910.01108) for a single epoch to predict wine prices." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "preprocessing train...\n", "language: en\n", "train sequence lengths:\n", "\tmean : 41\n", "\t95percentile : 61\n", "\t99percentile : 73\n" ] }, { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "preprocessing test...\n", "language: en\n", "test sequence lengths:\n", "\tmean : 41\n", "\t95percentile : 62\n", "\t99percentile : 73\n" ] }, { "data": { "text/html": [], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "begin training using onecycle policy with max lr of 0.0001...\n", "Train for 748 steps, validate for 187 steps\n", "748/748 [==============================] - 310s 415ms/step - loss: 1443.0076 - mae: 18.3470 - val_loss: 879.1416 - val_mae: 13.9600\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "MODEL_NAME = 'distilbert-base-uncased'\n", "t = text.Transformer(MODEL_NAME, maxlen=75)\n", "trn = t.preprocess_train(x_train, y_train)\n", "val = t.preprocess_test(x_test, y_test)\n", "model = t.get_regression_model()\n", "learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)\n", "learner.fit_onecycle(1e-4, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making Predictions with the Transformer Model (as discussed in [issue #417](https://github.com/amaiya/ktrain/issues/417))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p = ktrain.get_predictor(learner.model, t)\n", "p.predict(['This is first document.', 'This is second document.', 'This is third document.'])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, 
"language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }