{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "import os\n", "os.environ[\"CUDA_DEVICE_ORDER\"]=\"PCI_BUS_ID\";\n", "os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"\n", "\n", "import numpy as np\n", "import random\n", "import tensorflow as tf\n", "import pandas as pd\n", "pd.set_option('display.max_columns', None)\n", "\n", "seed_value = 0\n", "os.environ['PYTHONHASHSEED']=str(seed_value)\n", "random.seed(seed_value)\n", "np.random.seed(seed_value)\n", "tf.random.set_seed(seed_value)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import ktrain\n", "from ktrain import tabular" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Classification and Regression on Tabular Data in `ktrain`\n", "\n", "As of v0.19.x, *ktrain* supports classification and regression on \"traditional\" tabular datasets. We will cover two examples in this notebook:\n", "- **Part I: Classification**: predicting which [Titanic passengers survived](https://www.kaggle.com/c/titanic)\n", "- **Part II: Regression**: predicting the age of people from [census data](http://archive.ics.uci.edu/ml/datasets/Census+Income)\n", "\n", "Let's begin with a demonstration of tabular classification using the well-studied Titanic dataset from Kaggle.\n", "\n", "## Part I: Classification for Tabular Data\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solving the Titanic Kaggle Challenge in `ktrain`\n", "\n", "This notebook demonstrates using *ktrain* for predicting which passengers survived the Titanic shipwreck.\n", "\n", "The dataset can be [downloaded from Kaggle here](https://www.kaggle.com/c/titanic/overview). There is a `train.csv` with labels (i.e., `Survived`) and a `test.csv` with no labels. 
We will only use `train.csv` in this notebook.\n", "\n", "Let's begin by loading the data as a pandas DataFrame and inspecting it." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "train_df = pd.read_csv('data/titanic/train.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"    <tr><th>PassengerId</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1</th><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>4</th><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
"    <tr><th>5</th><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"  </tbody>\n",
"</table>" ], "text/plain": [ " Survived Pclass \\\n", "PassengerId \n", "1 0 3 \n", "2 1 1 \n", "3 1 3 \n", "4 1 1 \n", "5 0 3 \n", "\n", " Name Sex Age \\\n", "PassengerId \n", "1 Braund, Mr. Owen Harris male 22.0 \n", "2 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 \n", "3 Heikkinen, Miss. Laina female 26.0 \n", "4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 \n", "5 Allen, Mr. William Henry male 35.0 \n", "\n", " SibSp Parch Ticket Fare Cabin Embarked \n", "PassengerId \n", "1 1 0 A/5 21171 7.2500 NaN S \n", "2 1 0 PC 17599 71.2833 C85 C \n", "3 0 0 STON/O2. 3101282 7.9250 NaN S \n", "4 1 0 113803 53.1000 C123 S \n", "5 0 0 373450 8.0500 NaN S " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll drop the `Name`, `Ticket`, and `Cabin` columns, as they seem less likely to be predictive. These columns are largely unique or near-unique to passengers." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "train_df = train_df.drop('Name', axis=1)\n", "train_df = train_df.drop('Ticket', axis=1)\n", "train_df = train_df.drop('Cabin', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*ktrain* will automatically split out a validation set if given only a training set. But, let's also manually split out a test set that we can evaluate later." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "p = 0.1 # 10% for test set\n", "prop = 1-p\n", "df = train_df.copy()\n", "msk = np.random.rand(len(df)) < prop\n", "train_df = df[msk]\n", "test_df = df[~msk]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(799, 8)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.shape" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(92, 8)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### STEP 1: Load and Preprocess the Data" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "processing train: 717 rows x 8 columns\n", "\n", "The following integer column(s) are being treated as categorical variables:\n", "['Pclass', 'SibSp', 'Parch']\n", "To treat any of these column(s) as numerical, cast the column to float in DataFrame or CSV\n", " and re-run tabular_from* function.\n", "\n", "processing test: 82 rows x 8 columns\n" ] } ], "source": [ "trn, val, preproc = tabular.tabular_from_df(train_df, label_columns=['Survived'], random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Automated Preprocessing\n", "*ktrain* automatically preprocesses the dataset appropriately. Numerical columns are automatically normalized, missing values are handled, and categorical variables will be vectorized as [entity embeddings](https://arxiv.org/abs/1604.06737) for input to a neural network. \n", "\n", "##### Auto-generated Features\n", "*ktrain* will auto-generate some new features. 
For instance, if `Age` is missing for a particular individual, an `Age_na=True` feature will be automatically added.\n", "\n", "New date features are also automatically added. This dataset does not have any **date** fields. If it did, we could populate the `date_columns` parameter to `tabular_from_df` in which case they would be used to auto-generate new features (e.g., `Day`, `Week`, `Is_month_start`, `Is_quarter_end`, etc.) using methods adapted from the **fastai** library.\n", "\n", "##### Manually-Engineered Features\n", "\n", "In addition to these auto-generated features, one can also optionally add manually-generated, dataset-specific features to `train_df` **prior** to invoking `tabular_from_df`. For instance, the `Cabin` feature we discarded earlier might be used to extract the **deck** associated with each passenger (e.g., **B22** --> **Deck B**)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### STEP 2: Create a Model and Wrap in `Learner`\n", "\n", "*ktrain* uses multilayer perceptrons as the model for tabular datasets. The model can be configured with arguments to `tabular_classifier` (e.g., number and size of hidden layers, dropout values, etc.), but we will leave the defaults here." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mlp: a configurable multilayer perceptron with categorical variable embeddings [https://arxiv.org/abs/1604.06737]\n" ] } ], "source": [ "tabular.print_tabular_classifiers()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Is Multi-Label? 
False\n", "done.\n" ] } ], "source": [ "model = tabular.tabular_classifier('mlp', trn)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=32)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### STEP 3: Estimate the Learning Rate\n", "\n", "Based on the plot below, we will choose a maximum learning rate of `5e-3`." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "simulating training for different learning rates... this may take a few moments...\n", "Train for 22 steps\n", "Epoch 1/5\n", "22/22 [==============================] - 1s 50ms/step - loss: 0.6882 - accuracy: 0.5985\n", "Epoch 2/5\n", "22/22 [==============================] - 0s 18ms/step - loss: 0.6819 - accuracy: 0.6263\n", "Epoch 3/5\n", "22/22 [==============================] - 0s 20ms/step - loss: 0.6495 - accuracy: 0.6584\n", "Epoch 4/5\n", "22/22 [==============================] - 0s 19ms/step - loss: 2.1039 - accuracy: 0.6569\n", "Epoch 5/5\n", " 3/22 [===>..........................] - ETA: 0s - loss: 25.0747 - accuracy: 0.5455\n", "\n", "done.\n", "Visually inspect loss plot and select learning rate associated with falling loss\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.lr_find(show_plot=True, max_epochs=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### STEP 4: Train the Model" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "begin training using onecycle policy with max lr of 0.005...\n", "Train for 23 steps, validate for 3 steps\n", "Epoch 1/10\n", "23/23 [==============================] - 1s 58ms/step - loss: 0.6388 - accuracy: 0.6597 - val_loss: 0.5437 - val_accuracy: 0.7561\n", "Epoch 2/10\n", "23/23 [==============================] - 1s 23ms/step - loss: 0.5855 - accuracy: 0.6876 - val_loss: 0.4851 - val_accuracy: 0.7805\n", "Epoch 3/10\n", "23/23 [==============================] - 0s 22ms/step - loss: 0.5259 - accuracy: 0.7448 - val_loss: 0.4044 - val_accuracy: 0.8659\n", "Epoch 4/10\n", "23/23 [==============================] - 1s 23ms/step - loss: 0.4985 - accuracy: 0.7713 - val_loss: 0.3639 - val_accuracy: 0.8902\n", "Epoch 5/10\n", "23/23 [==============================] - 1s 22ms/step - loss: 0.4762 - accuracy: 0.7894 - val_loss: 0.3364 - val_accuracy: 0.8659\n", "Epoch 6/10\n", "23/23 [==============================] - 1s 23ms/step - loss: 0.4626 - accuracy: 0.7908 - val_loss: 0.3174 - val_accuracy: 0.9146\n", "Epoch 7/10\n", "23/23 [==============================] - 1s 24ms/step - loss: 0.4444 - accuracy: 0.8061 - val_loss: 0.3126 - val_accuracy: 0.9024\n", "Epoch 8/10\n", "23/23 [==============================] - 1s 23ms/step - loss: 0.4279 - accuracy: 0.8159 - val_loss: 0.2599 - val_accuracy: 0.9146\n", "Epoch 9/10\n", "23/23 [==============================] - 1s 25ms/step - loss: 0.4030 - accuracy: 0.8243 - val_loss: 0.2721 - val_accuracy: 0.9024\n", "Epoch 10/10\n", "23/23 [==============================] - 1s 23ms/step - loss: 0.3990 - accuracy: 0.8257 - val_loss: 0.2686 - val_accuracy: 
0.9024\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.fit_onecycle(5e-3, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we don't appear to be quite overfitting yet, we could try to train further. But, we will stop here.\n", "\n", "\n", "**Let's evaluate the validation set:**" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", "not_Survived 0.89 0.96 0.92 49\n", " Survived 0.93 0.82 0.87 33\n", "\n", " accuracy 0.90 82\n", " macro avg 0.91 0.89 0.90 82\n", "weighted avg 0.90 0.90 0.90 82\n", "\n" ] }, { "data": { "text/plain": [ "array([[47, 2],\n", " [ 6, 27]])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.evaluate(val, class_names=preproc.get_classes())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Make Predictions\n", "\n", "The `Predictor` for tabular datasets accepts input as a dataframe in the same format as the original training dataframe. \n", "\n", "We will use `test_df` that we created earlier." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "predictor = ktrain.get_predictor(learner.model, preproc)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "preds = predictor.predict(test_df, return_proba=True)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(92, 2)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preds.shape" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "test accuracy:\n" ] }, { "data": { "text/plain": [ "0.8478260869565217" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('test accuracy:')\n", "(np.argmax(preds, axis=1) == test_df['Survived'].values).sum()/test_df.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Our final results as a DataFrame:**" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Pclass</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Fare</th><th>Embarked</th><th>Survived</th><th>predicted_Survived</th></tr>\n",
"    <tr><th>PassengerId</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>2</th><td>1</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>C</td><td>1</td><td>1</td></tr>\n",
"    <tr><th>12</th><td>1</td><td>female</td><td>58.0</td><td>0</td><td>0</td><td>26.5500</td><td>S</td><td>1</td><td>1</td></tr>\n",
"    <tr><th>34</th><td>2</td><td>male</td><td>66.0</td><td>0</td><td>0</td><td>10.5000</td><td>S</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>35</th><td>1</td><td>male</td><td>28.0</td><td>1</td><td>0</td><td>82.1708</td><td>C</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>44</th><td>2</td><td>female</td><td>3.0</td><td>1</td><td>2</td><td>41.5792</td><td>C</td><td>1</td><td>1</td></tr>\n",
"  </tbody>\n",
"</table>" ], "text/plain": [ " Pclass Sex Age SibSp Parch Fare Embarked Survived \\\n", "PassengerId \n", "2 1 female 38.0 1 0 71.2833 C 1 \n", "12 1 female 58.0 0 0 26.5500 S 1 \n", "34 2 male 66.0 0 0 10.5000 S 0 \n", "35 1 male 28.0 1 0 82.1708 C 0 \n", "44 2 female 3.0 1 2 41.5792 C 1 \n", "\n", " predicted_Survived \n", "PassengerId \n", "2 1 \n", "12 1 \n", "34 0 \n", "35 0 \n", "44 1 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = test_df.copy()[[c for c in test_df.columns.values if c != 'Survived']]\n", "df['Survived'] = test_df['Survived']\n", "df['predicted_Survived'] = np.argmax(preds, axis=1)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explaining Predictions\n", "\n", "We can use the `explain` method to better understand **why** a prediction was made for a particular example. Consider the passenger in the fourth row above (`PassengerId=35`), who did not survive. Although we classified this passenger correctly here, this row tends to get classified differently across different training runs. It is sometimes classified correctly (as in this run), but is also often misclassified. \n", "\n", "Let's better understand why.\n", "\n", "The `explain` method accepts at minimum the following three inputs:\n", "1. **df**: a pandas DataFrame in the same format as the original training DataFrame\n", "2. **row_index**: the DataFrame index of the example (here, we choose `PassengerId=35`)\n", "3. **class_id**: the id of the class of interest (we choose the **Survived** class in this case)\n", "\n", "One can also replace `row_index=35` with `row_num=3`, as both denote the fourth row." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Explanation for class = Survived (PassengerId=35): \n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "predictor.explain(test_df, row_index=35, class_id=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plot above is generated using the [shap](https://github.com/slundberg/shap) library. You can install it with either `pip install shap` or, for *conda* users, `conda install -c conda-forge shap`. The features in red are causing our model to increase the prediction for the **Survived** class, while features in blue cause our model to *decrease* the prediction for **Survived** (or *increase* the prediction for **Not_Survived**). \n", "\n", "From the plot, we see that the predicted softmax probability for `Survived` is **50%**, which is a much less confident classification than most others. Why is this?\n", "\n", "We see that `Sex=male` is an influential feature that is pushing the prediction lower towards **Not_Survived**, as women and children were given priority when allocating lifeboats on the Titanic. \n", "\n", "On the other hand, we also see that this is a First Class passenger (`Pclass=1`) with a higher-than-average `Fare` price of *82.17*. In the cell below, you'll see that the average `Fare` price is only *32*. (Moreover, this passenger embarked from Cherbourg, which has been shown to be correlated with survival.) Such features suggest that this is an upper-class, wealthier passenger and, therefore, more likely to make it onto a lifeboat and survive. We know from history that crew members were ordered to close gates that led to the upper decks so the first and second class passengers could be evacuated first. As a result, these \"upper class\" features are pushing our model to increase the classification to **Survived**. 
\n", "\n", "**Thus, there are two opposing forces working against each other in this prediction,** which explains why the prediction probability is closer to the decision boundary than in other examples.\n", "\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "32.23080325406759" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df['Fare'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE**: We choose `class_id=1` in the example above because the **Survived** class of interest has an index position of 1 in the `class_names` list:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "['not_Survived', 'Survived']" ], "text/plain": [ "['not_Survived', 'Survived']" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preproc.get_classes()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us now look at the examples for which we were the most wrong (highest loss)." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "processing test: 92 rows x 8 columns\n", "----------\n", "id:27 | loss:3.31 | true:Survived | pred:not_Survived)\n", "\n", "----------\n", "id:53 | loss:2.84 | true:not_Survived | pred:Survived)\n", "\n", "----------\n", "id:19 | loss:2.52 | true:Survived | pred:not_Survived)\n", "\n" ] } ], "source": [ "learner.view_top_losses(val_data=preproc.preprocess_test(test_df), preproc=preproc, n=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The examples with the highest losses are `row_num={27, 53, 19}`. Why did we get these so wrong? Let's examine `row_num=53`. Note that the IDs shown in the `view_top_losses` output are raw row numbers, not DataFrame indices (or PassengerIds). 
So, we need to use `row_num`, not `row_index`, here.\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Explanation for class = Survived (row_num=53): \n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "predictor.explain(test_df, row_num=53, class_id=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a wealthy First Class (`Pclass=1`) female passenger with a very high `Fare` price of 151.55. As mentioned above, such a passenger had a high chance of survival, which explains our model's high prediction for **Survived**. Yet, she did not survive. Upon further investigation, we can understand why. This particular passenger is **Bess Allison**, a wealthy married 25-year-old mother to two toddlers. When the collision occurred, she and her husband could not locate their nanny (Alice Cleaver) and son (Trevor). So, Bess, her husband, and her 3-year-old daughter Loraine stayed behind to wait for them instead of evacuating with other First and Second Class passengers with children. They were last seen standing together smiling on the promenade deck. All three died; her daughter Loraine was the only child in First and Second Class who died on the Titanic. Their son and nanny successfully evacuated and survived.\n", "\n", "REFERENCE: [https://rt.cto.mil/stpe/](https://rt.cto.mil/stpe/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving and Reloading the Tabular Predictor\n", "\n", "It is easy to save and reload the predictor for deployment scenarios." 
] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "predictor.save('/tmp/titanic_predictor')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "reloaded_predictor = ktrain.load_predictor('/tmp/titanic_predictor/')" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "['Survived', 'Survived', 'not_Survived', 'not_Survived', 'Survived']" ], "text/plain": [ "['Survived', 'Survived', 'not_Survived', 'not_Survived', 'Survived']" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reloaded_predictor.predict(test_df)[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluating Test Sets Automatically\n", "\n", "When we evaluated the test set above, we did so manually. To evaluate a test set automatically,\n", "one can invoke the `learner.evaluate` method and supply a preprocessed test set as an argument:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "processing test: 92 rows x 8 columns\n", " precision recall f1-score support\n", "\n", "not_Survived 0.85 0.91 0.88 57\n", " Survived 0.84 0.74 0.79 35\n", "\n", " accuracy 0.85 92\n", " macro avg 0.85 0.83 0.83 92\n", "weighted avg 0.85 0.85 0.85 92\n", "\n" ] }, { "data": { "text/plain": [ "array([[52, 5],\n", " [ 9, 26]])" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.evaluate(preproc.preprocess_test(test_df), class_names=preproc.get_classes())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `learner.evaluate` method is simply an alias to `learner.validate`, which can also accept a dataset as an argument. If no argument is supplied, metrics will be computed for `learner.val_data`, which was supplied to `get_learner` above." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part II: Regression for Tabular Data\n", "\n", "We will briefly demonstrate tabular regression in *ktrain* by simply predicting the `age` attribute in the Census dataset available from the UCI Machine Learning repository. This is the same example used in the [AutoGluon regression example](https://autogluon.mxnet.io/tutorials/tabular_prediction/tabular-quickstart.html#regression-predicting-numeric-table-columns). Let's begin by downloading the dataset from the AutoGluon website." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/tmp/train.csv\r\n" ] } ], "source": [ "import urllib.request\n", "urllib.request.urlretrieve('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv', \n", " '/tmp/train.csv')\n", "!ls /tmp/train.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### STEP 1: Load and Preprocess Data\n", "\n", "Make sure you specify `is_regression=True` here as we are predicting a numerical dependent variable (i.e., `age`)." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "processing train: 35179 rows x 15 columns\n", "\n", "The following integer column(s) are being treated as categorical variables:\n", "['education-num']\n", "To treat any of these column(s) as numerical, cast the column to float in DataFrame or CSV\n", " and re-run tabular_from* function.\n", "\n", "processing test: 3894 rows x 15 columns\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Task is being treated as REGRESSION because either class_names argument was not supplied or is_regression=True. 
If this is incorrect, change accordingly.\n" ] } ], "source": [ "trn, val, preproc = tabular.tabular_from_csv('/tmp/train.csv', label_columns='age', \n", " is_regression=True, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We used `tabular_from_csv` to load the dataset, but let's also quickly load as DataFrame to see it:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>age</th><th>workclass</th><th>fnlwgt</th><th>education</th><th>education-num</th><th>marital-status</th><th>occupation</th><th>relationship</th><th>race</th><th>sex</th><th>capital-gain</th><th>capital-loss</th><th>hours-per-week</th><th>native-country</th><th>class</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>25</td><td>Private</td><td>178478</td><td>Bachelors</td><td>13</td><td>Never-married</td><td>Tech-support</td><td>Own-child</td><td>White</td><td>Female</td><td>0</td><td>0</td><td>40</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
"    <tr><th>1</th><td>23</td><td>State-gov</td><td>61743</td><td>5th-6th</td><td>3</td><td>Never-married</td><td>Transport-moving</td><td>Not-in-family</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>35</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
"    <tr><th>2</th><td>46</td><td>Private</td><td>376789</td><td>HS-grad</td><td>9</td><td>Never-married</td><td>Other-service</td><td>Not-in-family</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>15</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
"    <tr><th>3</th><td>55</td><td>?</td><td>200235</td><td>HS-grad</td><td>9</td><td>Married-civ-spouse</td><td>?</td><td>Husband</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>50</td><td>United-States</td><td>&gt;50K</td></tr>\n",
"    <tr><th>4</th><td>36</td><td>Private</td><td>224541</td><td>7th-8th</td><td>4</td><td>Married-civ-spouse</td><td>Handlers-cleaners</td><td>Husband</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>40</td><td>El-Salvador</td><td>&lt;=50K</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ " age workclass fnlwgt education education-num marital-status \\\n", "0 25 Private 178478 Bachelors 13 Never-married \n", "1 23 State-gov 61743 5th-6th 3 Never-married \n", "2 46 Private 376789 HS-grad 9 Never-married \n", "3 55 ? 200235 HS-grad 9 Married-civ-spouse \n", "4 36 Private 224541 7th-8th 4 Married-civ-spouse \n", "\n", " occupation relationship race sex capital-gain \\\n", "0 Tech-support Own-child White Female 0 \n", "1 Transport-moving Not-in-family White Male 0 \n", "2 Other-service Not-in-family White Male 0 \n", "3 ? Husband White Male 0 \n", "4 Handlers-cleaners Husband White Male 0 \n", "\n", " capital-loss hours-per-week native-country class \n", "0 0 40 United-States <=50K \n", "1 0 35 United-States <=50K \n", "2 0 15 United-States <=50K \n", "3 0 50 United-States >50K \n", "4 0 40 El-Salvador <=50K " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv('/tmp/train.csv').head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### STEP 2: Create a Model and Wrap in `Learner`\n", "\n", "We'll use `tabular_regression_model` to create a regression model." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mlp: a configurable multilayer perceptron with categorical variable embeddings [https://arxiv.org/abs/1604.06737]\n" ] } ], "source": [ "tabular.print_tabular_regression_models()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "done.\n" ] } ], "source": [ "model = tabular.tabular_regression_model('mlp', trn)\n", "learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### STEP 3: Estimate Learning Rate" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "simulating training for different learning rates... this may take a few moments...\n", "Train for 274 steps\n", "Epoch 1/1024\n", "274/274 [==============================] - 8s 29ms/step - loss: 1681.9281 - mae: 38.6405\n", "Epoch 2/1024\n", "274/274 [==============================] - 7s 25ms/step - loss: 1650.8196 - mae: 38.2378\n", "Epoch 3/1024\n", "274/274 [==============================] - 7s 26ms/step - loss: 677.0598 - mae: 20.7480\n", "Epoch 4/1024\n", "274/274 [==============================] - 7s 26ms/step - loss: 123.2551 - mae: 8.7116\n", "Epoch 5/1024\n", "274/274 [==============================] - 7s 26ms/step - loss: 229.2279 - mae: 11.2846\n", "Epoch 6/1024\n", " 67/274 [======>.......................] - ETA: 6s - loss: 384.0570 - mae: 12.9530\n", "\n", "done.\n", "Visually inspect loss plot and select learning rate associated with falling loss\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "learner.lr_find(show_plot=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### STEP 4: Train the Model\n", "\n", "According to our final validation MAE (see below), our age predictions are only off about **~7 years**." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "early_stopping automatically enabled at patience=5\n", "reduce_on_plateau automatically enabled at patience=2\n", "\n", "\n", "begin training using triangular learning rate policy with max lr of 0.001...\n", "Train for 275 steps, validate for 122 steps\n", "Epoch 1/1024\n", "275/275 [==============================] - 11s 39ms/step - loss: 411.0144 - mae: 14.8399 - val_loss: 108.4525 - val_mae: 8.2879\n", "Epoch 2/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 116.3624 - mae: 8.4576 - val_loss: 102.6719 - val_mae: 8.0353\n", "Epoch 3/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 112.9161 - mae: 8.3066 - val_loss: 100.8348 - val_mae: 7.9844\n", "Epoch 4/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 111.2987 - mae: 8.2026 - val_loss: 97.9699 - val_mae: 7.7425\n", "Epoch 5/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 109.3430 - mae: 8.1120 - val_loss: 95.7590 - val_mae: 7.6947\n", "Epoch 6/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 107.6256 - mae: 8.0252 - val_loss: 95.1659 - val_mae: 7.5768\n", "Epoch 7/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 107.1517 - mae: 8.0267 - val_loss: 94.3338 - val_mae: 7.5559\n", "Epoch 8/1024\n", "275/275 [==============================] - 10s 36ms/step - loss: 106.7320 - mae: 7.9814 - val_loss: 94.3334 - val_mae: 7.5357\n", "Epoch 9/1024\n", "275/275 [==============================] - 10s 
35ms/step - loss: 106.7105 - mae: 7.9865 - val_loss: 94.0436 - val_mae: 7.5332\n", "Epoch 10/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 105.0565 - mae: 7.9049 - val_loss: 94.0949 - val_mae: 7.5426\n", "Epoch 11/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 105.6540 - mae: 7.9441 - val_loss: 93.6455 - val_mae: 7.5383\n", "Epoch 12/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 105.2223 - mae: 7.9144 - val_loss: 93.8997 - val_mae: 7.5404\n", "Epoch 13/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 105.0021 - mae: 7.9089 - val_loss: 93.5568 - val_mae: 7.5250\n", "Epoch 14/1024\n", "275/275 [==============================] - 10s 36ms/step - loss: 105.1489 - mae: 7.9176 - val_loss: 94.2954 - val_mae: 7.5771\n", "Epoch 15/1024\n", "273/275 [============================>.] - ETA: 0s - loss: 104.6181 - mae: 7.8825\n", "Epoch 00015: Reducing Max LR on Plateau: new max lr will be 0.0005 (if not early_stopping).\n", "275/275 [==============================] - 10s 36ms/step - loss: 104.7387 - mae: 7.8875 - val_loss: 93.6825 - val_mae: 7.4777\n", "Epoch 16/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 103.6717 - mae: 7.8581 - val_loss: 92.8922 - val_mae: 7.4872\n", "Epoch 17/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 103.1032 - mae: 7.8318 - val_loss: 92.6652 - val_mae: 7.4591\n", "Epoch 18/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 103.3571 - mae: 7.8300 - val_loss: 92.6492 - val_mae: 7.4712\n", "Epoch 19/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 102.9795 - mae: 7.8306 - val_loss: 92.6980 - val_mae: 7.4230\n", "Epoch 20/1024\n", "275/275 [==============================] - 10s 36ms/step - loss: 102.9318 - mae: 7.8292 - val_loss: 92.5345 - val_mae: 7.4105\n", "Epoch 21/1024\n", "275/275 [==============================] - 10s 36ms/step - 
loss: 103.0119 - mae: 7.8332 - val_loss: 92.7922 - val_mae: 7.4064\n", "Epoch 22/1024\n", "269/275 [============================>.] - ETA: 0s - loss: 102.2146 - mae: 7.7910\n", "Epoch 00022: Reducing Max LR on Plateau: new max lr will be 0.00025 (if not early_stopping).\n", "275/275 [==============================] - 10s 35ms/step - loss: 102.1557 - mae: 7.7870 - val_loss: 93.0830 - val_mae: 7.5391\n", "Epoch 23/1024\n", "275/275 [==============================] - 10s 36ms/step - loss: 102.1588 - mae: 7.7912 - val_loss: 92.6078 - val_mae: 7.4737\n", "Epoch 24/1024\n", "272/275 [============================>.] - ETA: 0s - loss: 101.5359 - mae: 7.7678\n", "Epoch 00024: Reducing Max LR on Plateau: new max lr will be 0.000125 (if not early_stopping).\n", "275/275 [==============================] - 10s 35ms/step - loss: 101.7744 - mae: 7.7765 - val_loss: 92.8352 - val_mae: 7.5266\n", "Epoch 25/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 101.2561 - mae: 7.7520 - val_loss: 92.2433 - val_mae: 7.4054\n", "Epoch 26/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 101.2565 - mae: 7.7492 - val_loss: 92.1415 - val_mae: 7.4383\n", "Epoch 27/1024\n", "275/275 [==============================] - 10s 36ms/step - loss: 101.9846 - mae: 7.7632 - val_loss: 92.1260 - val_mae: 7.4596\n", "Epoch 28/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 101.0920 - mae: 7.7495 - val_loss: 91.9819 - val_mae: 7.4022\n", "Epoch 29/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 100.7677 - mae: 7.7300 - val_loss: 91.7970 - val_mae: 7.3984\n", "Epoch 30/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 100.7598 - mae: 7.7330 - val_loss: 91.9531 - val_mae: 7.4084\n", "Epoch 31/1024\n", "269/275 [============================>.] 
- ETA: 0s - loss: 101.6460 - mae: 7.7700\n", "Epoch 00031: Reducing Max LR on Plateau: new max lr will be 6.25e-05 (if not early_stopping).\n", "275/275 [==============================] - 10s 35ms/step - loss: 101.8179 - mae: 7.7705 - val_loss: 91.9712 - val_mae: 7.4199\n", "Epoch 32/1024\n", "275/275 [==============================] - 10s 35ms/step - loss: 100.8309 - mae: 7.7345 - val_loss: 91.9763 - val_mae: 7.3991\n", "Epoch 33/1024\n", "272/275 [============================>.] - ETA: 0s - loss: 100.7709 - mae: 7.7300\n", "Epoch 00033: Reducing Max LR on Plateau: new max lr will be 3.125e-05 (if not early_stopping).\n", "275/275 [==============================] - 10s 35ms/step - loss: 100.8522 - mae: 7.7294 - val_loss: 91.8345 - val_mae: 7.4091\n", "Epoch 34/1024\n", "269/275 [============================>.] - ETA: 0s - loss: 100.5609 - mae: 7.7150Restoring model weights from the end of the best epoch.\n", "275/275 [==============================] - 10s 35ms/step - loss: 100.5432 - mae: 7.7158 - val_loss: 91.8488 - val_mae: 7.3933\n", "Epoch 00034: early stopping\n", "Weights from best epoch have been loaded into model.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.autofit(1e-3)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "[('mae', 7.398410168683522)]" ], "text/plain": [ "[('mae', 7.398410168683522)]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "learner.validate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See the [House Price Prediction notebook](https://github.com/amaiya/ktrain/blob/master/examples/tabular/HousePricePrediction-MLP.ipynb) for another regression example." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }