{ "cells": [ { "cell_type": "markdown", "id": "587f4596", "metadata": {}, "source": [ "# Debug housing price predictions" ] }, { "cell_type": "markdown", "id": "df6550b0", "metadata": {}, "source": [ "This notebook demonstrates the use of the `responsibleai` API to assess a classification model trained on Kaggle's apartments dataset (https://www.kaggle.com/alphaepsilon/housing-prices-dataset). The model predicts if the house sells for more than median price or not. It walks through the API calls necessary to create a widget with model analysis insights, then guides a visual analysis of the model." ] }, { "cell_type": "markdown", "id": "b20f25c3", "metadata": {}, "source": [ "## Launch Responsible AI Toolbox" ] }, { "cell_type": "markdown", "id": "a122e129", "metadata": {}, "source": [ "The following section examines the code necessary to create datasets and a model. It then generates insights using the `responsibleai` API that can be visually analyzed." ] }, { "cell_type": "markdown", "id": "4736295c", "metadata": {}, "source": [ "### Train a Model\n", "*The following section can be skipped. It loads a dataset and trains a model for illustrative purposes.*" ] }, { "cell_type": "code", "execution_count": null, "id": "ebae7586", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "from sklearn.model_selection import train_test_split\n", "from lightgbm import LGBMClassifier\n", "import zipfile" ] }, { "cell_type": "markdown", "id": "927997ce", "metadata": {}, "source": [ "First, load the apartment dataset and specify the different types of features. Then, clean it and put it into a DataFrame with named columns. After loading and cleaning the data, split the datapoints into training and test sets. Assemble separate datasets for the full sample and the test data." 
] }, { "cell_type": "code", "execution_count": null, "id": "99f4447e", "metadata": {}, "outputs": [], "source": [ "from packaging import version\n", "from raiutils.dataset import fetch_dataset\n", "import sklearn\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.compose import ColumnTransformer\n", "\n", "# for older scikit-learn versions use sparse, for newer sparse_output:\n", "if version.parse(sklearn.__version__) < version.parse('1.2'):\n", " ohe_params = {\"sparse\": False}\n", "else:\n", " ohe_params = {\"sparse_output\": False}\n", "\n", "def split_label(dataset, target_feature):\n", " X = dataset.drop([target_feature], axis=1)\n", " y = dataset[[target_feature]]\n", " return X, y\n", "\n", "def clean_data(X, y, target_feature):\n", " features = X.columns.values.tolist()\n", " classes = y[target_feature].unique().tolist()\n", " pipe_cfg = {\n", " 'num_cols': X.dtypes[X.dtypes == 'int64'].index.values.tolist(),\n", " 'cat_cols': X.dtypes[X.dtypes == 'object'].index.values.tolist(),\n", " }\n", " num_pipe = Pipeline([\n", " ('num_imputer', SimpleImputer(strategy='median'))\n", " ])\n", " cat_pipe = Pipeline([\n", " ('cat_imputer', SimpleImputer(strategy='constant', fill_value='?')),\n", " ('cat_encoder', OneHotEncoder(handle_unknown='ignore', **ohe_params))\n", " ])\n", " feat_pipe = ColumnTransformer([\n", " ('num_pipe', num_pipe, pipe_cfg['num_cols']),\n", " ('cat_pipe', cat_pipe, pipe_cfg['cat_cols'])\n", " ])\n", " X = feat_pipe.fit_transform(X)\n", " print(pipe_cfg['cat_cols'])\n", " return X, feat_pipe, features, classes\n", "\n", "target_feature = 'Sold_HigherThan_Median'\n", "categorical_features = []\n", "\n", "outdirname = 'responsibleai.12.28.21'\n", "zipfilename = outdirname + '.zip'\n", "\n", "fetch_dataset('https://publictestdatasets.blob.core.windows.net/data/' + zipfilename, zipfilename)\n", "\n", "with zipfile.ZipFile(zipfilename, 'r') as unzip:\n", " unzip.extractall('.')\n", "\n", "all_data = pd.read_csv('apartments-train.csv')\n", "all_data = all_data.drop(['SalePrice','SalePriceK'], axis=1)\n", "X, y = split_label(all_data, target_feature)\n", "\n", "\n", "X_train_original, X_test_original, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=7, stratify=y)\n", "\n", "X_train, feat_pipe, features, classes = clean_data(X_train_original, y_train, target_feature)\n", "y_train = y_train[target_feature].to_numpy()\n", "\n", "X_test = feat_pipe.transform(X_test_original)\n", "y_test = y_test[target_feature].to_numpy()\n", "\n", "train_data = X_train_original.copy()\n", "train_data[target_feature] = y_train\n", "\n", "test_data = X_test_original.copy()\n", "test_data[target_feature] = y_test" ] }, { "cell_type": "markdown", "id": "9b29ec13", "metadata": {}, "source": [ "Train a LightGBM classifier on the training data." 
] }, { "cell_type": "code", "execution_count": null, "id": "8eb4d0de", "metadata": {}, "outputs": [], "source": [ "clf = LGBMClassifier(random_state=0)\n", "model = clf.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "id": "69e3860a", "metadata": {}, "source": [ "### Create Model and Data Insights" ] }, { "cell_type": "code", "execution_count": null, "id": "516b2fa9", "metadata": {}, "outputs": [], "source": [ "from raiwidgets import ResponsibleAIDashboard\n", "from responsibleai import RAIInsights" ] }, { "cell_type": "markdown", "id": "7c8b0634", "metadata": {}, "source": [ "To use Responsible AI Dashboard, initialize a RAIInsights object upon which different components can be loaded.\n", "\n", "RAIInsights accepts the model, the full dataset, the test dataset, the target feature string and the task type string as its arguments.\n", "\n", "You may also create the `FeatureMetadata` container, identify any feature of your choice as the `identity_feature`, specify a list of strings of categorical feature names via the `categorical_features` parameter, and specify dropped features via the `dropped_features` parameter. The `FeatureMetadata` may also be passed into the `RAIInsights`." ] }, { "cell_type": "code", "execution_count": null, "id": "d965f768", "metadata": {}, "outputs": [], "source": [ "from responsibleai.feature_metadata import FeatureMetadata\n", "feature_metadata = FeatureMetadata(categorical_features=categorical_features, dropped_features=[])\n" ] }, { "cell_type": "code", "execution_count": null, "id": "c941835b", "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "\n", "dashboard_pipeline = Pipeline(steps=[('preprocess', feat_pipe), ('model', model)])\n", "rai_insights = RAIInsights(dashboard_pipeline, train_data, test_data, target_feature, 'classification',\n", " feature_metadata=feature_metadata, \n", " classes=['Less than median', 'More than median'])" ] }, { "cell_type": "markdown", "id": "ab1068ee", "metadata": {}, "source": [ "Add the components of the toolbox that are focused on model assessment." ] }, { "cell_type": "code", "execution_count": null, "id": "b8587d53", "metadata": {}, "outputs": [], "source": [ "# Interpretability\n", "rai_insights.explainer.add()\n", "# Error Analysis\n", "rai_insights.error_analysis.add()\n", "# Counterfactuals: accepts total number of counterfactuals to generate, the label that they should have, and a list of \n", " # strings of categorical feature names\n", "rai_insights.counterfactual.add(total_CFs=10, desired_class='opposite')" ] }, { "cell_type": "markdown", "id": "ec82a856", "metadata": {}, "source": [ "Once all the desired components have been loaded, compute insights on the test set." ] }, { "cell_type": "code", "execution_count": null, "id": "127ec383", "metadata": {}, "outputs": [], "source": [ "rai_insights.compute()" ] }, { "cell_type": "markdown", "id": "83b5112c", "metadata": {}, "source": [ "Finally, visualize and explore the model insights. Use the resulting widget or follow the link to view this in a new tab." 
] }, { "cell_type": "code", "execution_count": null, "id": "a73ad853", "metadata": {}, "outputs": [], "source": [ "ResponsibleAIDashboard(rai_insights)" ] }, { "cell_type": "markdown", "id": "3a6d610b", "metadata": {}, "source": [ "See this [developer blog](https://techcommunity.microsoft.com/t5/ai-machine-learning-blog/responsible-ai-dashboard-a-one-stop-shop-for-operationalizing/ba-p/3030944) (Model Debugging Flow section) to learn more about this use case and how to use the dashboard to debug your housing price prediction model." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 5 }