{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import re\n", "import pandas as pd\n", "import numpy as np\n", "import statsmodels.api as sm # for datasets\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# Hide Numpy warnings from Statsmodels\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# Sequence of Appelpy imports\n", "from appelpy.eda import statistical_moments\n", "from appelpy.utils import DummyEncoder\n", "from appelpy.linear_model import OLS\n", "from appelpy.diagnostics import heteroskedasticity_test" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook shows how to make a simple **model pipeline** with Appelpy, using the [Cars93 dataset](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Cars93.html).\n", "\n", "There is data available about 93 cars. One variable of interest to model is the price of the cars.\n", "\n", "It can be particularly challenging to fit models with many **categorical variables**. If you fit models on data that come from a database, where categories of a variable may shift over time, then model pipelines can help to make the modelling process more robust and easier to maintain.\n", "\n", "**Notebook workflow:**\n", "- Load data\n", "- Explore data\n", "- Transform data (set up functions for transforming the raw data)\n", " - Column renaming\n", " - Dummy column encoding\n", " - Log variables\n", "- Fit model\n", " - Initial model: set up a basic `ModelPipeline` and fit `model1` using it. Run model diagnostics.\n", " - Second model: fit `model2` with the pipeline. Run model diagnostics.\n", " - Second model with different base levels for categorical variables." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df_raw = sm.datasets.get_rdataset('Cars93', 'MASS').data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(93, 27)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_raw.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Explore data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " mean var skew kurtosis\n", "Min.Price 17.1258 76.493 1.16382 0.9016\n", "Price 19.5097 93.3046 1.50824 3.18369\n", "Max.Price 21.8989 121.671 2.00091 6.9816\n", "MPG.city 22.3656 31.5823 1.67682 3.72841\n", "MPG.highway 29.086 28.4273 1.20997 2.41192\n", "EngineSize 2.66774 1.07612 0.845494 0.297016\n", "Horsepower 143.828 2743.08 0.936308 0.98822\n", "RPM 5280.65 356089 -0.254344 -0.451623\n", "Rev.per.mile 2332.2 246519 0.276984 0.145034\n", "Fuel.tank.capacity 16.6645 10.7543 0.106394 0.0566398\n", "Passengers 5.08602 1.07948 0.061504 0.822782\n", "Length 183.204 213.23 -0.0886349 0.361628\n", "Wheelbase 103.946 46.5079 0.111884 -0.819052\n", "Width 69.3763 14.2807 0.25975 -0.297207\n", "Turn.circle 38.957 10.3894 -0.131405 -0.757256\n", "Rear.seat.room 27.8297 8.93455 0.0769636 0.781057\n", "Luggage.room 13.8902 8.9878 0.225345 0.444448\n", "Weight 3072.9 347978 -0.141341 -0.873658" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "statistical_moments(df_raw)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exploratory data analysis and causal reasoning are not the focus of this notebook.\n", "\n", "We are interested in **modelling the `Price` of cars based on these independent variables:**\n", "- `Type` (categorical variable)\n", "- `MPG.city` (continuous variable)\n", "- `AirBags` (categorical variable)\n", "- `Origin` (categorical variable)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Transform data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rename columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The source dataset has capitalised column names that use dots as separators (e.g. `Min.Price`, `MPG.city`). As we're using Python, let's get them closer to **snake case**." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def _rename_columns(df):\n", "    \"Make column names lowercase and more like snake case.\"\n", "    # Replace spaces, commas, dots and hyphens with underscores, then lowercase\n", "    return (df\n", "            .rename(columns=lambda x: re.sub(\"[ ,.-]\", '_', x))\n", "            .rename(columns=str.lower))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Manufacturer', 'Model', 'Type', 'Min.Price', 'Price', 'Max.Price',\n", "       'MPG.city', 'MPG.highway', 'AirBags', 'DriveTrain', 'Cylinders',\n", "       'EngineSize', 'Horsepower', 'RPM', 'Rev.per.mile', 'Man.trans.avail',\n", "       'Fuel.tank.capacity', 'Passengers', 'Length', 'Wheelbase', 'Width',\n", "       'Turn.circle', 'Rear.seat.room', 'Luggage.room', 'Weight', 'Origin',\n", "       'Make'],\n", "      dtype='object')" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_raw.columns" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "df_raw = df_raw.pipe(_rename_columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The column names have now been cleaned with the `_rename_columns` function:" 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['manufacturer', 'model', 'type', 'min_price', 'price', 'max_price',\n", "       'mpg_city', 'mpg_highway', 'airbags', 'drivetrain', 'cylinders',\n", "       'enginesize', 'horsepower', 'rpm', 'rev_per_mile', 'man_trans_avail',\n", "       'fuel_tank_capacity', 'passengers', 'length', 'wheelbase', 'width',\n", "       'turn_circle', 'rear_seat_room', 'luggage_room', 'weight', 'origin',\n", "       'make'],\n", "      dtype='object')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_raw.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Encoding of columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several categorical variables in the dataset, so for a linear regression model we need to encode each one as a set of dummy variables." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def _process_data(raw_df, cat_base_levels):\n", "    \"Transform raw dataset by encoding columns, e.g. dummy columns from categorical variables.\"\n", "    # Use the raw_df argument rather than the global df_raw\n", "    return (raw_df\n", "            .pipe(_rename_columns)\n", "            .pipe(DummyEncoder,\n", "                  categorical_col_base_levels=cat_base_levels)\n", "            .transform()\n", "            .pipe(_rename_columns))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is an example of a categorical variable in the dataset and its values:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['None', 'Driver & Passenger', 'Driver only'], dtype=object)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_raw['airbags'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The raw dataset comes with category values that have capital letters and whitespace. 
When encoding dummy columns from them, it is desirable to have the new columns as close to snake case as possible. The `_process_data` function therefore applies `_rename_columns` again after the **encoder** step, so that the dummy column names are more like snake case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For demonstration, see the columns from a transformed dataset below." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['manufacturer', 'model', 'min_price', 'price', 'max_price', 'mpg_city',\n", "       'mpg_highway', 'drivetrain', 'cylinders', 'enginesize', 'horsepower',\n", "       'rpm', 'rev_per_mile', 'man_trans_avail', 'fuel_tank_capacity',\n", "       'passengers', 'length', 'wheelbase', 'width', 'turn_circle',\n", "       'rear_seat_room', 'luggage_room', 'weight', 'make', 'type_large',\n", "       'type_midsize', 'type_small', 'type_sporty', 'type_van',\n", "       'airbags_driver_&_passenger', 'airbags_driver_only', 'origin_non_usa'],\n", "      dtype='object')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_raw.pipe(_process_data, {'type': 'Compact',\n", "                            'airbags': 'None',\n", "                            'origin': 'USA'}).columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Log variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a basic function for generating log transformations of variables. This will be particularly useful for positively-skewed variables." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "def _create_log_variables(raw_df, cols_list):\n", "    \"Transform dataset to include log variables of the variables in cols_list.\"\n", "    df = raw_df.copy()  # avoid mutating the caller's dataframe\n", "    for col in cols_list:\n", "        df['ln_' + col] = np.log(df[col])\n", "    return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These data cleaning functions could be gathered in a specific class, e.g. `Transformer`." 
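] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of that idea, the cell below gathers the three cleaning steps in a hypothetical `Transformer` class. The class and its method names are illustrative assumptions and are not part of Appelpy. `pd.get_dummies` stands in for `DummyEncoder` here, so the sketch depends only on pandas and NumPy. The rest of this notebook keeps using the standalone functions." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "\n", "class Transformer:\n", "    'Hypothetical helper (not part of Appelpy) that gathers the cleaning steps.'\n", "\n", "    def __init__(self, cat_base_levels, log_cols):\n", "        self.cat_base_levels = cat_base_levels  # e.g. {'type': 'Compact'}\n", "        self.log_cols = log_cols                # e.g. ['price', 'mpg_city']\n", "\n", "    @staticmethod\n", "    def rename_columns(df):\n", "        'Make column names lowercase and more like snake case.'\n", "        return (df\n", "                .rename(columns=lambda x: re.sub('[ ,.-]', '_', x))\n", "                .rename(columns=str.lower))\n", "\n", "    def encode_dummies(self, df):\n", "        'Encode dummies with pandas (stand-in for DummyEncoder) and drop the base levels.'\n", "        df = pd.get_dummies(df, columns=list(self.cat_base_levels), dtype=int)\n", "        base_cols = [f'{col}_{level}'\n", "                     for col, level in self.cat_base_levels.items()]\n", "        return df.drop(columns=base_cols)\n", "\n", "    def add_log_variables(self, df):\n", "        'Add an ln_<col> column for each column in log_cols.'\n", "        df = df.copy()\n", "        for col in self.log_cols:\n", "            df['ln_' + col] = np.log(df[col])\n", "        return df\n", "\n", "    def transform(self, raw_df):\n", "        'Rename, encode, rename the new dummy columns, then add log variables.'\n", "        return (raw_df\n", "                .pipe(self.rename_columns)\n", "                .pipe(self.encode_dummies)\n", "                .pipe(self.rename_columns)\n", "                .pipe(self.add_log_variables))\n", "\n", "\n", "# Example usage (same base levels as elsewhere in this notebook):\n", "# Transformer({'type': 'Compact', 'airbags': 'None', 'origin': 'USA'},\n", "#             ['price', 'mpg_city']).transform(df_raw)"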
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fit model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initial model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's set out these key things for a linear model, from the raw dataset:\n", "- Dependent variable (`raw_y_list`)\n", "- Independent variables (`raw_X_list`) and the base levels for the categorical variables." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "raw_y_list = ['price']\n", "raw_X_list = ['type', 'mpg_city', 'airbags', 'origin']\n", "cat_base_levels = {'type': 'Compact',\n", " 'airbags': 'None',\n", " 'origin': 'USA'}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how many of the variables in `raw_X_list` are categorical variables." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "type object\n", "mpg_city int64\n", "airbags object\n", "origin object\n", "dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_raw[raw_X_list].dtypes" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "type 6\n", "mpg_city 21\n", "airbags 3\n", "origin 2\n", "dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_raw[raw_X_list].nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One trick is to make another representation of the independent variables – an X_list – where the categorical variables have the `DummyEncoder` separator appended to them." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['type_', 'mpg_city', 'airbags_', 'origin_']" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[str(x + '_') if x in cat_base_levels.keys()\n", " else x for x in raw_X_list]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `ModelPipeline`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can transform the raw dataset, given lists of its y & X variables (and base levels), and make a dataset to use for modelling.\n", "\n", "Let's set up a **`ModelPipeline` class**, which takes those variables from the raw dataset and does the following:\n", "- Get a final dataset to use for modelling (`get_dataset` method)\n", "- Get an Appelpy model object that consumes the final dataset (`get_model` method)." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "class ModelPipeline:\n", "    def __init__(self, raw_df, raw_y_list, raw_X_list, cat_base_levels):\n", "        self.raw_df = raw_df\n", "        self.raw_y_list = raw_y_list\n", "        self.raw_X_list = raw_X_list\n", "        self.cat_base_levels = cat_base_levels\n", "\n", "        # Transformed dataset and its columns\n", "        self.df = None\n", "        self.y_list = None\n", "        self.X_list = None\n", "\n", "    def get_dataset(self):\n", "        \"Return processed dataset, ready to use for modelling\"\n", "        df = (self.raw_df\n", "              .pipe(_process_data, self.cat_base_levels)\n", "              .pipe(_create_log_variables, ['price', 'mpg_city']))\n", "\n", "        # Append separator to columns that represent categorical variables:\n", "        X_list_prefixes = [str(x + '_') if x in self.cat_base_levels.keys()\n", "                           else x for x in self.raw_X_list]\n", "\n", "        # Get column names to use in model dataset:\n", "        self.y_list = self.raw_y_list\n", "        self.X_list = df.columns[df.columns.str.startswith(tuple(X_list_prefixes))]\n", "        # and filter the source dataset on those columns:\n", 
"        filter_condition = df.columns.str.startswith(tuple([*self.y_list, *self.X_list]))\n", "        self.df = df.loc[:, filter_condition]\n", "        return self.df\n", "\n", "    def get_model(self):\n", "        \"Return OLS Appelpy model instance (not fitted)\"\n", "        # `if self.df:` raises ValueError for a DataFrame, so compare with None\n", "        if self.df is None:\n", "            self.get_dataset()\n", "        return OLS(self.df, self.y_list, self.X_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is useful, as it will make it easier to run different specifications of models and inspect the dataset used for modelling." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model results" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: price R-squared: 0.616\n", "Model: OLS Adj. R-squared: 0.574\n", "Method: Least Squares F-statistic: 14.79\n", "Date: Fri, 03 Jan 2020 Prob (F-statistic): 5.17e-14\n", "Time: 21:40:32 Log-Likelihood: -297.87\n", "No. Observations: 93 AIC: 615.7\n", "Df Residuals: 83 BIC: 641.1\n", "Df Model: 9 \n", "Covariance Type: nonrobust \n", "==============================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------------------\n", "const 29.6931 4.723 6.288 0.000 20.300 39.086\n", "mpg_city -0.7957 0.191 -4.162 0.000 -1.176 -0.415\n", "type_large 3.0755 2.774 1.109 0.271 -2.442 8.593\n", "type_midsize 5.1573 2.183 2.362 0.020 0.815 9.499\n", "type_small -0.2819 2.598 -0.109 0.914 -5.449 4.885\n", "type_sporty 0.3151 2.329 0.135 0.893 -4.318 4.948\n", "type_van -0.8718 2.904 -0.300 0.765 -6.647 4.903\n", "airbags_driver_&_passenger 8.9089 2.284 3.900 0.000 4.365 13.452\n", "airbags_driver_only 4.5643 1.672 2.730 0.008 1.239 7.890\n", "origin_non_usa 5.1411 1.439 3.573 0.001 2.280 8.003\n", "==============================================================================\n", "Omnibus: 43.351 Durbin-Watson: 1.962\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 123.544\n", "Skew: 1.620 Prob(JB): 1.49e-27\n", "Kurtosis: 7.624 Cond. No. 
195.\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model1 = (ModelPipeline(df_raw, raw_y_list, raw_X_list, cat_base_levels)\n", " .get_model()\n", " .fit())\n", "model1.results_output" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
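<table>
<caption>Unstandardized and Standardized Estimates</caption>
<tr><th>price</th><th>coef</th><th>t</th><th>P&gt;|t|</th><th>coef_stdX</th><th>coef_stdXy</th><th>stdev_X</th></tr>
<tr><th>mpg_city</th><td>-0.7957</td><td>-4.162</td><td>0.000</td><td>-4.4719</td><td>-0.4630</td><td>5.6198</td></tr>
<tr><th>type_large</th><td>+3.0755</td><td>+1.109</td><td>0.271</td><td>+0.9986</td><td>+0.1034</td><td>0.3247</td></tr>
<tr><th>type_midsize</th><td>+5.1573</td><td>+2.362</td><td>0.020</td><td>+2.2036</td><td>+0.2281</td><td>0.4273</td></tr>
<tr><th>type_small</th><td>-0.2819</td><td>-0.109</td><td>0.914</td><td>-0.1185</td><td>-0.0123</td><td>0.4204</td></tr>
<tr><th>type_sporty</th><td>+0.3151</td><td>+0.135</td><td>0.893</td><td>+0.1133</td><td>+0.0117</td><td>0.3595</td></tr>
<tr><th>type_van</th><td>-0.8718</td><td>-0.300</td><td>0.765</td><td>-0.2591</td><td>-0.0268</td><td>0.2973</td></tr>
<tr><th>airbags_driver_&amp;_passenger</th><td>+8.9089</td><td>+3.900</td><td>0.000</td><td>+3.3806</td><td>+0.3500</td><td>0.3795</td></tr>
<tr><th>airbags_driver_only</th><td>+4.5643</td><td>+2.730</td><td>0.008</td><td>+2.2880</td><td>+0.2369</td><td>0.5013</td></tr>
<tr><th>origin_non_usa</th><td>+5.1411</td><td>+3.573</td><td>0.001</td><td>+2.5831</td><td>+0.2674</td><td>0.5024</td></tr>
</table>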
" ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model1.results_output_standardized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Access the dataset used for modelling from the model pipeline." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "df = (ModelPipeline(df_raw, raw_y_list, raw_X_list, cat_base_levels)\n", " .get_dataset())" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " price mpg_city type_large type_midsize type_small type_sporty \\\n", "0 15.9 25 0 0 1 0 \n", "1 33.9 18 0 1 0 0 \n", "2 29.1 20 0 0 0 0 \n", "3 37.7 19 0 1 0 0 \n", "4 30.0 22 0 1 0 0 \n", "5 15.7 22 0 1 0 0 \n", "6 20.8 19 1 0 0 0 \n", "7 23.7 16 1 0 0 0 \n", "8 26.3 19 0 1 0 0 \n", "9 34.7 16 1 0 0 0 \n", "\n", " type_van airbags_driver_&_passenger airbags_driver_only origin_non_usa \n", "0 0 0 0 1 \n", "1 0 1 0 1 \n", "2 0 0 1 1 \n", "3 0 1 0 1 \n", "4 0 0 1 1 \n", "5 0 0 1 0 \n", "6 0 0 1 0 \n", "7 0 0 1 0 \n", "8 0 0 1 0 \n", "9 0 0 1 0 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model diagnostics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's gather the main residual plots in one place and examine whether the assumptions of OLS hold in the model." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, axarr = plt.subplots(2, 2, figsize=(10, 10))\n", "model1.diagnostic_plot('pp_plot', ax=axarr[0][0])\n", "model1.diagnostic_plot('qq_plot', ax=axarr[1][0])\n", "model1.diagnostic_plot('rvf_plot', ax=axarr[1][1])\n", "axarr[0, 1].axis('off')\n", "plt.tight_layout()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(figsize=(8, 5))\n", "model1.resid_standardized.hist(ax=ax)\n", "ax.set_title('Histogram of standardized residuals')\n", "ax.set_xlabel('Standardized residuals')\n", "ax.set_ylabel('Frequency')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the residual plots, it is clear that the assumptions of OLS do not hold in the model:\n", "- Residuals are not normally distributed (for y-values close to mean and ones near the tails)\n", "- RVF plot has a funnel pattern. Perhaps there is heteroskedasticity in the dataset." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Breusch-Pagan test (studentized) :: chi2(1)\n", "Test statistic: 10.6543\n", "Test p-value: 0.3002\n" ] } ], "source": [ "bps_stats = heteroskedasticity_test('breusch_pagan_studentized', model1)\n", "print('Breusch-Pagan test (studentized) :: {}'.format(bps_stats['distribution'] + '({})'.format(bps_stats['nu'])))\n", "print('Test statistic: {:.4f}'.format(bps_stats['test_stat']))\n", "print('Test p-value: {:.4f}'.format(bps_stats['p_value']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Second model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The functional form of the first model may be inappropriate.\n", "\n", "In the statistical moments dataframe we see that the dependent variable `price` is positively skewed. The model estimates were worst at the tail end of the `price` distribution and the residuals did not appear normally distributed.\n", "\n", "Let's use the log-transformed price (`ln_price`) as the dependent variable in a second model." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def log_plots(df, col, figsize=(11, 5)):\n", "    \"Plot histograms of a column and of its log transform, side by side.\"\n", "    plot_df = (df[col].to_frame()\n", "               .pipe(_create_log_variables, [col]))\n", "\n", "    fig, axarr = plt.subplots(1, 2, figsize=figsize)\n", "\n", "    plot_df[col].hist(ax=axarr[0])\n", "    axarr[0].set_title('Histogram of {} values'.format(col))\n", "    axarr[0].set_ylabel('Frequency')\n", "    axarr[0].set_xlabel('Value')\n", "\n", "    plot_df['ln_' + col].hist(ax=axarr[1])\n", "    axarr[1].set_title('Histogram of ln_{} values'.format(col))\n", "    axarr[1].set_ylabel('Frequency')\n", "    axarr[1].set_xlabel('Value')\n", "\n", "    return fig" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "log_plots(df_raw, 'price')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "log_plots(df_raw, 'mpg_city')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The log-transforms of the continuous variables in the model have a more normal distribution than their original variables. However, they still have a slight positive skew. There are a few particularly influential points in the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`model2` arguments" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "raw_y_list = ['ln_price']\n", "raw_X_list = ['type', 'ln_mpg_city', 'airbags', 'origin']\n", "cat_base_levels = {'type': 'Compact',\n", " 'airbags': 'None',\n", " 'origin': 'USA'}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model results" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: ln_price R-squared: 0.789\n", "Model: OLS Adj. R-squared: 0.766\n", "Method: Least Squares F-statistic: 34.46\n", "Date: Fri, 03 Jan 2020 Prob (F-statistic): 1.98e-24\n", "Time: 21:40:33 Log-Likelihood: 14.159\n", "No. Observations: 93 AIC: -8.318\n", "Df Residuals: 83 BIC: 17.01\n", "Df Model: 9 \n", "Covariance Type: nonrobust \n", "==============================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------------------\n", "const 6.3091 0.559 11.286 0.000 5.197 7.421\n", "type_large 0.0889 0.099 0.895 0.374 -0.109 0.286\n", "type_midsize 0.1393 0.078 1.790 0.077 -0.015 0.294\n", "type_small -0.1280 0.090 -1.428 0.157 -0.306 0.050\n", "type_sporty -0.0140 0.082 -0.172 0.864 -0.176 0.148\n", "type_van -0.1196 0.107 -1.114 0.268 -0.333 0.094\n", "airbags_driver_&_passenger 0.3635 0.080 4.550 0.000 0.205 0.522\n", "airbags_driver_only 0.2386 0.058 4.088 0.000 0.122 0.355\n", "origin_non_usa 0.2322 0.050 4.632 0.000 0.132 0.332\n", "ln_mpg_city -1.2106 0.178 -6.789 0.000 -1.565 -0.856\n", "==============================================================================\n", "Omnibus: 5.184 Durbin-Watson: 1.891\n", "Prob(Omnibus): 0.075 Jarque-Bera (JB): 4.821\n", "Skew: 0.555 Prob(JB): 0.0898\n", "Kurtosis: 3.101 Cond. No. 
86.9\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model2 = (ModelPipeline(df_raw, raw_y_list, raw_X_list, cat_base_levels)\n", " .get_model()\n", " .fit())\n", "model2.results_output" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
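<table>
<caption>Unstandardized and Standardized Estimates</caption>
<tr><th>ln_price</th><th>coef</th><th>t</th><th>P&gt;|t|</th><th>coef_stdX</th><th>coef_stdXy</th><th>stdev_X</th></tr>
<tr><th>type_large</th><td>+0.0889</td><td>+0.895</td><td>0.374</td><td>+0.0289</td><td>+0.0635</td><td>0.3247</td></tr>
<tr><th>type_midsize</th><td>+0.1393</td><td>+1.790</td><td>0.077</td><td>+0.0595</td><td>+0.1309</td><td>0.4273</td></tr>
<tr><th>type_small</th><td>-0.1280</td><td>-1.428</td><td>0.157</td><td>-0.0538</td><td>-0.1184</td><td>0.4204</td></tr>
<tr><th>type_sporty</th><td>-0.0140</td><td>-0.172</td><td>0.864</td><td>-0.0051</td><td>-0.0111</td><td>0.3595</td></tr>
<tr><th>type_van</th><td>-0.1196</td><td>-1.114</td><td>0.268</td><td>-0.0355</td><td>-0.0782</td><td>0.2973</td></tr>
<tr><th>airbags_driver_&amp;_passenger</th><td>+0.3635</td><td>+4.550</td><td>0.000</td><td>+0.1379</td><td>+0.3033</td><td>0.3795</td></tr>
<tr><th>airbags_driver_only</th><td>+0.2386</td><td>+4.088</td><td>0.000</td><td>+0.1196</td><td>+0.2630</td><td>0.5013</td></tr>
<tr><th>origin_non_usa</th><td>+0.2322</td><td>+4.632</td><td>0.000</td><td>+0.1166</td><td>+0.2565</td><td>0.5024</td></tr>
<tr><th>ln_mpg_city</th><td>-1.2106</td><td>-6.789</td><td>0.000</td><td>-0.2710</td><td>-0.5961</td><td>0.2239</td></tr>
</table>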
" ], "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model2.results_output_standardized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model diagnostics" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, axarr = plt.subplots(2, 2, figsize=(10, 10))\n", "model2.diagnostic_plot('pp_plot', ax=axarr[0][0])\n", "model2.diagnostic_plot('qq_plot', ax=axarr[1][0])\n", "model2.diagnostic_plot('rvf_plot', ax=axarr[1][1])\n", "axarr[0, 1].axis('off')\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The residual plots look much better for this model:\n", "- The residuals are closer to the 45-degree lines in the P-P and Q-Q plots.\n", "- The funnel-like appearance of the RVF plot has disappeared." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(figsize=(8, 5))\n", "model2.resid_standardized.hist(ax=ax)\n", "ax.set_title('Histogram of standardized residuals')\n", "ax.set_xlabel('Standardized residuals')\n", "ax.set_ylabel('Frequency')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Breusch-Pagan test (studentized) :: chi2(1)\n", "Test statistic: 5.8346\n", "Test p-value: 0.7564\n" ] } ], "source": [ "bps_stats = heteroskedasticity_test('breusch_pagan_studentized', model2)\n", "print('Breusch-Pagan test (studentized) :: {}'.format(bps_stats['distribution'] + '({})'.format(bps_stats['nu'])))\n", "print('Test statistic: {:.4f}'.format(bps_stats['test_stat']))\n", "print('Test p-value: {:.4f}'.format(bps_stats['p_value']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Second model - change base levels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we want to run a model with different base levels for the categorical variables.\n", "\n", "This is relatively easy now that we have made the `ModelPipeline`.\n", "\n", "It's possible to do so without the model pipeline class, e.g. by manually listing the final columns in an `X_list`, but it is difficult to check whether column names are correct or if columns are missing, especially for datasets where there are multiple categorical variables or categorical variables that have multiple values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's set up different base levels for categorical variables in an alternative form of model 2, `model2_alt`." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "raw_y_list = ['ln_price']\n", "raw_X_list = ['type', 'ln_mpg_city', 'airbags', 'origin']\n", "cat_base_levels = {'type': 'Small', # changed level\n", " 'airbags': 'None',\n", " 'origin': 'USA'}" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
OLS Regression Results
Dep. Variable: ln_price R-squared: 0.789
Model: OLS Adj. R-squared: 0.766
Method: Least Squares F-statistic: 34.46
Date: Fri, 03 Jan 2020 Prob (F-statistic): 1.98e-24
Time: 21:40:34 Log-Likelihood: 14.159
No. Observations: 93 AIC: -8.318
Df Residuals: 83 BIC: 17.01
Df Model: 9
Covariance Type: nonrobust
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
coef std err t P>|t| [0.025 0.975]
const 6.1811 0.602 10.271 0.000 4.984 7.378
type_compact 0.1280 0.090 1.428 0.157 -0.050 0.306
type_large 0.2169 0.127 1.703 0.092 -0.036 0.470
type_midsize 0.2673 0.104 2.574 0.012 0.061 0.474
type_sporty 0.1140 0.099 1.153 0.252 -0.083 0.311
type_van 0.0085 0.131 0.065 0.949 -0.252 0.269
airbags_driver_&_passenger 0.3635 0.080 4.550 0.000 0.205 0.522
airbags_driver_only 0.2386 0.058 4.088 0.000 0.122 0.355
origin_non_usa 0.2322 0.050 4.632 0.000 0.132 0.332
ln_mpg_city -1.2106 0.178 -6.789 0.000 -1.565 -0.856
\n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "\n", " \n", "\n", "
Omnibus: 5.184 Durbin-Watson: 1.891
Prob(Omnibus): 0.075 Jarque-Bera (JB): 4.821
Skew: 0.555 Prob(JB): 0.0898
Kurtosis: 3.101 Cond. No. 95.1


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: ln_price R-squared: 0.789\n", "Model: OLS Adj. R-squared: 0.766\n", "Method: Least Squares F-statistic: 34.46\n", "Date: Fri, 03 Jan 2020 Prob (F-statistic): 1.98e-24\n", "Time: 21:40:34 Log-Likelihood: 14.159\n", "No. Observations: 93 AIC: -8.318\n", "Df Residuals: 83 BIC: 17.01\n", "Df Model: 9 \n", "Covariance Type: nonrobust \n", "==============================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "----------------------------------------------------------------------------------------------\n", "const 6.1811 0.602 10.271 0.000 4.984 7.378\n", "type_compact 0.1280 0.090 1.428 0.157 -0.050 0.306\n", "type_large 0.2169 0.127 1.703 0.092 -0.036 0.470\n", "type_midsize 0.2673 0.104 2.574 0.012 0.061 0.474\n", "type_sporty 0.1140 0.099 1.153 0.252 -0.083 0.311\n", "type_van 0.0085 0.131 0.065 0.949 -0.252 0.269\n", "airbags_driver_&_passenger 0.3635 0.080 4.550 0.000 0.205 0.522\n", "airbags_driver_only 0.2386 0.058 4.088 0.000 0.122 0.355\n", "origin_non_usa 0.2322 0.050 4.632 0.000 0.132 0.332\n", "ln_mpg_city -1.2106 0.178 -6.789 0.000 -1.565 -0.856\n", "==============================================================================\n", "Omnibus: 5.184 Durbin-Watson: 1.891\n", "Prob(Omnibus): 0.075 Jarque-Bera (JB): 4.821\n", "Skew: 0.555 Prob(JB): 0.0898\n", "Kurtosis: 3.101 Cond. No. 
95.1\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model2_alt = (ModelPipeline(df_raw, raw_y_list, raw_X_list, cat_base_levels)\n", " .get_model()\n", " .fit())\n", "model2_alt.results_output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the alternative base levels, of course we still have the same model performance (e.g. R-squared) but different dummy columns are shown." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unstandardized and Standardized Estimates
                            coef     t        P>|t|  coef_stdX  coef_stdXy  stdev_X
ln_price
type_compact                +0.1280  +1.428   0.157  +0.0486    +0.1069     0.3795
type_large                  +0.2169  +1.703   0.092  +0.0704    +0.1549     0.3247
type_midsize                +0.2673  +2.574   0.012  +0.1142    +0.2512     0.4273
type_sporty                 +0.1140  +1.153   0.252  +0.0410    +0.0901     0.3595
type_van                    +0.0085  +0.065   0.949  +0.0025    +0.0055     0.2973
airbags_driver_&_passenger  +0.3635  +4.550   0.000  +0.1379    +0.3033     0.3795
airbags_driver_only         +0.2386  +4.088   0.000  +0.1196    +0.2630     0.5013
origin_non_usa              +0.2322  +4.632   0.000  +0.1166    +0.2565     0.5024
ln_mpg_city                 -1.2106  -6.789   0.000  -0.2710    -0.5961     0.2239
" ], "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model2_alt.results_output_standardized" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }