{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"] = \"PCI_BUS_ID\"\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text Regression with Extra Regressors: An Example of Using Custom Data Formats and Models in *ktrain*\n", "\n", "This notebook illustrates how one can construct custom data formats and models for use in *ktrain*. In this example, we will build a model that predicts the price of a wine from **both** its textual description and the winery that produced it. This example is inspired by [FloydHub's regression template](https://github.com/floydhub/regression-template) for wine price prediction. However, instead of using the wine variety as the extra regressor, we will use the winery.\n", "\n", "Text classification (or text regression) with extra predictors arises in many scenarios. For instance, when predicting the trustworthiness of a news story, one may want to consider both the text of the news article and extra metadata such as the news publication and the authors. This notebook shows how such models can be built in *ktrain*.\n", "\n", "The dataset in CSV format can be obtained from FloydHub at [this URL](https://www.floydhub.com/floydhub/datasets/wine-reviews/1/wine_data.csv). We will begin by importing some necessary modules and reading in the dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "| | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 82956 | 82956 | Spain | Spiced apple and dried cheese aromas are simul... | Mercat Brut | 84 | 12.0 | Catalonia | Cava | NaN | Sparkling Blend | El Xamfrà |\n",
"| 60767 | 60767 | US | A little too sharp and acidic, with jammy cher... | NaN | 82 | 9.0 | California | California | California Other | Shiraz | Woodbridge by Robert Mondavi |\n",
"| 123576 | 123576 | Spain | Starts out rustic and leathery, with hints of ... | Selección 12 Crianza | 89 | 15.0 | Levante | Jumilla | NaN | Red Blend | Bodegas Luzón |\n",
"| 71003 | 71003 | Chile | Ripe to the point that it's soft and flat. Big... | NaN | 82 | 8.0 | Maule Valley | NaN | NaN | Chardonnay | Melania |\n",
"| 78168 | 78168 | Italy | From one of the best producers in the little-t... | Contado Riserva | 88 | 17.0 | Southern Italy | Molise | NaN | Aglianico | Di Majo Norante |\n", "