{ "cells": [ { "cell_type": "markdown", "id": "chronic-ethiopia", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Introduction to Azure ML SDK\n", "> Presentation notebook from 'Azure Saturday, Hamburg 2021' event. \n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- categories: [azureml, sdk, mlops, python, dataops]\n", "- hide: false" ] }, { "cell_type": "markdown", "id": "automatic-marina", "metadata": {}, "source": [ "## Azure Saturday Hamburg, Feb 20, 2021\n", "\n", "### Sandeep Pawar \n", "\n", "#### Twitter : @PawarBI | LinkedIn: in/sanpawar | Blog : PawarBI.com" ] }, { "cell_type": "markdown", "id": "eligible-finding", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "overall-kinase", "metadata": {}, "source": [ "*Note:* This notebook was presented at [Azure Saturday Hamburg](https://www.eventleaf.com/azuresaturdayhamburg)" ] }, { "cell_type": "markdown", "id": "stock-supervision", "metadata": {}, "source": [ "# Agenda\n", "\n", "#### - Machine Learning Process - *As Advertised*\n", " - Motivation\n", " - Demo\n", "\n", "#### - Glimpse of *Real* Machine Learning Process - *Using Azure ML*\n", " - What is Azure ML Service\n", " - Classes in Azure ML SDK\n", " - Workspace\n", " - DataOps using Datastore & Datasets\n", " - Experiments\n", " - Model Deployment\n" ] }, { "cell_type": "markdown", "id": "integrated-planning", "metadata": {}, "source": [ "# Motivation\n", "\n", "Before I talk about Azure ML, I would like to first provide some motivation for why we want to learn and use Azure ML. \n", "\n", "The goal of this presentation is not to show how to create machine learning models but rather, how to use Azure ML to operationalize the machine learning models at scale. I will create an example machine learning model but really the focus is understanding the common 'design patterns' in Azure ML. 
If you are familiar with the theory of machine learning, this presentation/example notebook will help you understand the often neglected MLOps part of ML. If you do not have experience with creating ML models or are new to Python/Azure, focus on the logical process rather than the exact mechanics. You can always revisit this example notebook or Microsoft Learn but hopefully from this session you will understand, at a high level, how to use Azure ML to deploy ML models in production." ] }, { "cell_type": "markdown", "id": "talented-enzyme", "metadata": {}, "source": [ "## Machine Learning Process - *As Advertised*\n", "\n", "Let's start with a typical machine learning process. You will see plenty of tutorials on how to create machine learning models. Just type \"Machine learning process\" into Google and you will see results like the ones below. Most of these describe the process broadly as follows: \n", "\n", " - Obtain data\n", " - Clean data\n", " - EDA\n", " - Preprocess the data\n", " - Build model(s)\n", " - Validate the model\n", " - Serialize the model \n" ] }, { "cell_type": "markdown", "id": "honey-dispatch", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "according-transformation", "metadata": {}, "source": [ "Let's follow this process to build a model. \n", "\n", "#### Data \n", "I will use a dataset from the UCI Machine Learning repository called [\"Bank Marketing Data Set\"](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing). You may have seen this in many tutorials. I chose this dataset because the focus of this presentation is learning Azure ML, so I wanted to pick something that most can understand, and I recently gave a presentation on [Machine Learning Model Interpretability](https://youtu.be/0ocVtXU8o1I) using the same dataset. In case you are interested in that topic, you will already be familiar with this dataset after this presentation. 
\n", "\n", "This dataset has 20 features, a mix of numerical and categorical, and a target label with \"Yes/No\" values. It's a binary classification problem and the goal is to predict if a customer will sign up for a bank term deposit. Feel free to explore the dataset on your own before proceeding. " ] }, { "cell_type": "code", "execution_count": 16, "id": "spread-botswana", "metadata": {}, "outputs": [], "source": [ "#collapse-hide\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.model_selection import train_test_split\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "from sklearn.metrics import roc_auc_score, accuracy_score\n", "\n", "from sklearn.preprocessing import OneHotEncoder, FunctionTransformer, StandardScaler\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.pipeline import Pipeline\n", "\n", "from sklearn import metrics\n", "from interpret import show\n", "from interpret.perf import ROC\n", "\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": 17, "id": "funded-joyce", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.2.1\n" ] } ], "source": [ "print(pd.__version__)\n" ] }, { "cell_type": "markdown", "id": "impressed-layout", "metadata": {}, "source": [ "#### Obtain data\n", "\n", "There are 32950 observations and 20 features. Each observation describes a potential customer with details such as job, age and marital status, along with the macroeconomic conditions (employment rate, bond rate, etc.) at the time that customer was last contacted. 
" ] }, { "cell_type": "code", "execution_count": 18, "id": "universal-danish", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(32950, 21)\n" ] }, { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>age</th>\n", " <th>job</th>\n", " <th>marital</th>\n", " <th>education</th>\n", " <th>default</th>\n", " <th>housing</th>\n", " <th>loan</th>\n", " <th>contact</th>\n", " <th>month</th>\n", " <th>day_of_week</th>\n", " <th>...</th>\n", " <th>campaign</th>\n", " <th>pdays</th>\n", " <th>previous</th>\n", " <th>poutcome</th>\n", " <th>emp.var.rate</th>\n", " <th>cons.price.idx</th>\n", " <th>cons.conf.idx</th>\n", " <th>euribor3m</th>\n", " <th>nr.employed</th>\n", " <th>y</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>57</td>\n", " <td>technician</td>\n", " <td>married</td>\n", " <td>high.school</td>\n", " <td>no</td>\n", " <td>no</td>\n", " <td>yes</td>\n", " <td>cellular</td>\n", " <td>may</td>\n", " <td>mon</td>\n", " <td>...</td>\n", " <td>1</td>\n", " <td>999</td>\n", " <td>1</td>\n", " <td>failure</td>\n", " <td>-1.8</td>\n", " <td>92.893</td>\n", " <td>-46.2</td>\n", " <td>1.299</td>\n", " <td>5099.1</td>\n", " <td>no</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>55</td>\n", " <td>unknown</td>\n", " <td>married</td>\n", " <td>unknown</td>\n", " <td>unknown</td>\n", " <td>yes</td>\n", " <td>no</td>\n", " <td>telephone</td>\n", " <td>may</td>\n", " <td>thu</td>\n", " <td>...</td>\n", " <td>2</td>\n", " <td>999</td>\n", " <td>0</td>\n", " <td>nonexistent</td>\n", " <td>1.1</td>\n", " <td>93.994</td>\n", " <td>-36.4</td>\n", " 
<td>4.860</td>\n", " <td>5191.0</td>\n", " <td>no</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>33</td>\n", " <td>blue-collar</td>\n", " <td>married</td>\n", " <td>basic.9y</td>\n", " <td>no</td>\n", " <td>no</td>\n", " <td>no</td>\n", " <td>cellular</td>\n", " <td>may</td>\n", " <td>fri</td>\n", " <td>...</td>\n", " <td>1</td>\n", " <td>999</td>\n", " <td>1</td>\n", " <td>failure</td>\n", " <td>-1.8</td>\n", " <td>92.893</td>\n", " <td>-46.2</td>\n", " <td>1.313</td>\n", " <td>5099.1</td>\n", " <td>no</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>36</td>\n", " <td>admin.</td>\n", " <td>married</td>\n", " <td>high.school</td>\n", " <td>no</td>\n", " <td>no</td>\n", " <td>no</td>\n", " <td>telephone</td>\n", " <td>jun</td>\n", " <td>fri</td>\n", " <td>...</td>\n", " <td>4</td>\n", " <td>999</td>\n", " <td>0</td>\n", " <td>nonexistent</td>\n", " <td>1.4</td>\n", " <td>94.465</td>\n", " <td>-41.8</td>\n", " <td>4.967</td>\n", " <td>5228.1</td>\n", " <td>no</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>27</td>\n", " <td>housemaid</td>\n", " <td>married</td>\n", " <td>high.school</td>\n", " <td>no</td>\n", " <td>yes</td>\n", " <td>no</td>\n", " <td>cellular</td>\n", " <td>jul</td>\n", " <td>fri</td>\n", " <td>...</td>\n", " <td>2</td>\n", " <td>999</td>\n", " <td>0</td>\n", " <td>nonexistent</td>\n", " <td>1.4</td>\n", " <td>93.918</td>\n", " <td>-42.7</td>\n", " <td>4.963</td>\n", " <td>5228.1</td>\n", " <td>no</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>5 rows × 21 columns</p>\n", "</div>" ], "text/plain": [ " age job marital education default housing loan contact \\\n", "0 57 technician married high.school no no yes cellular \n", "1 55 unknown married unknown unknown yes no telephone \n", "2 33 blue-collar married basic.9y no no no cellular \n", "3 36 admin. married high.school no no no telephone \n", "4 27 housemaid married high.school no yes no cellular \n", "\n", " month day_of_week ... 
campaign pdays previous poutcome emp.var.rate \\\n", "0 may mon ... 1 999 1 failure -1.8 \n", "1 may thu ... 2 999 0 nonexistent 1.1 \n", "2 may fri ... 1 999 1 failure -1.8 \n", "3 jun fri ... 4 999 0 nonexistent 1.4 \n", "4 jul fri ... 2 999 0 nonexistent 1.4 \n", "\n", " cons.price.idx cons.conf.idx euribor3m nr.employed y \n", "0 92.893 -46.2 1.299 5099.1 no \n", "1 93.994 -36.4 4.860 5191.0 no \n", "2 92.893 -46.2 1.313 5099.1 no \n", "3 94.465 -41.8 4.967 5228.1 no \n", "4 93.918 -42.7 4.963 5228.1 no \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#https://archive.ics.uci.edu/ml/datasets/Bank+Marketing\n", "\n", "path = \"https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv\"\n", "data = pd.read_csv(path)\n", "\n", "print(data.shape)\n", "data.head()" ] }, { "cell_type": "markdown", "id": "conservative-blackjack", "metadata": {}, "source": [ "#### Clean the data\n", "\n", "Some of the column names contain periods ('.'). We will clean the column names, change the dtype of some columns to categorical and binarize the target to [1,0] instead of yes/no. 
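
The cleaning steps described above can also be sketched as a single function that returns a cleaned copy instead of modifying its input. This is only a minimal illustration under stated assumptions: the three-row `sample` frame and the `clean` helper are hypothetical and not part of the original notebook. It casts with `DataFrame.astype`, because assigning a converted column through `.loc[:, col]` may keep the original dtype in some pandas versions.

```python
import pandas as pd

# Hypothetical three-row sample that mimics a few columns of the bank marketing data
sample = pd.DataFrame({
    'job': ['technician', 'unknown', 'blue-collar'],
    'emp.var.rate': [-1.8, 1.1, -1.8],
    'y': ['no', 'no', 'yes'],
})

def clean(df, cat_cols):
    '''Return a cleaned copy: periods replaced in names, categoricals cast, target binarized.'''
    out = df.copy()
    # Replace periods in the column names with underscores
    out.columns = [col.replace('.', '_') for col in out.columns]
    # Cast the listed columns to the pandas category dtype
    out = out.astype({col: 'category' for col in cat_cols})
    # Map yes/no to 1/0
    out['y'] = (out['y'] == 'yes').astype(int)
    return out

cleaned = clean(sample, cat_cols=['job'])
print(cleaned.dtypes)
```

Because the function works on a copy, the raw `sample` frame stays untouched and can be reused.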
" ] }, { "cell_type": "code", "execution_count": 19, "id": "registered-reply", "metadata": {}, "outputs": [], "source": [ "#Define functions to clean the data\n", "\n", "def clean_col_names(df):\n", " \n", " df.columns = [col.replace('.','_') for col in df.columns]\n", " \n", " return df\n", "\n", "def clean_dtype(df):\n", " \n", " cat_cols = ['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']\n", " \n", " for col in cat_cols:\n", " df.loc[:,col] = df[col].astype('category')\n", " \n", " return df\n", "\n", "def binarize_y(y):\n", " y = (y=='yes').astype(int)\n", " \n", " return y\n", "\n" ] }, { "cell_type": "code", "execution_count": 20, "id": "wireless-binding", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>age</th>\n", " <th>job</th>\n", " <th>marital</th>\n", " <th>education</th>\n", " <th>default</th>\n", " <th>housing</th>\n", " <th>loan</th>\n", " <th>contact</th>\n", " <th>month</th>\n", " <th>day_of_week</th>\n", " <th>...</th>\n", " <th>campaign</th>\n", " <th>pdays</th>\n", " <th>previous</th>\n", " <th>poutcome</th>\n", " <th>emp_var_rate</th>\n", " <th>cons_price_idx</th>\n", " <th>cons_conf_idx</th>\n", " <th>euribor3m</th>\n", " <th>nr_employed</th>\n", " <th>y</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>57</td>\n", " <td>technician</td>\n", " <td>married</td>\n", " <td>high.school</td>\n", " <td>no</td>\n", " <td>no</td>\n", " <td>yes</td>\n", " <td>cellular</td>\n", " <td>may</td>\n", " <td>mon</td>\n", " <td>...</td>\n", " <td>1</td>\n", " <td>999</td>\n", " 
<td>1</td>\n", " <td>failure</td>\n", " <td>-1.8</td>\n", " <td>92.893</td>\n", " <td>-46.2</td>\n", " <td>1.299</td>\n", " <td>5099.1</td>\n", " <td>no</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>55</td>\n", " <td>unknown</td>\n", " <td>married</td>\n", " <td>unknown</td>\n", " <td>unknown</td>\n", " <td>yes</td>\n", " <td>no</td>\n", " <td>telephone</td>\n", " <td>may</td>\n", " <td>thu</td>\n", " <td>...</td>\n", " <td>2</td>\n", " <td>999</td>\n", " <td>0</td>\n", " <td>nonexistent</td>\n", " <td>1.1</td>\n", " <td>93.994</td>\n", " <td>-36.4</td>\n", " <td>4.860</td>\n", " <td>5191.0</td>\n", " <td>no</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>33</td>\n", " <td>blue-collar</td>\n", " <td>married</td>\n", " <td>basic.9y</td>\n", " <td>no</td>\n", " <td>no</td>\n", " <td>no</td>\n", " <td>cellular</td>\n", " <td>may</td>\n", " <td>fri</td>\n", " <td>...</td>\n", " <td>1</td>\n", " <td>999</td>\n", " <td>1</td>\n", " <td>failure</td>\n", " <td>-1.8</td>\n", " <td>92.893</td>\n", " <td>-46.2</td>\n", " <td>1.313</td>\n", " <td>5099.1</td>\n", " <td>no</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>3 rows × 21 columns</p>\n", "</div>" ], "text/plain": [ " age job marital education default housing loan contact \\\n", "0 57 technician married high.school no no yes cellular \n", "1 55 unknown married unknown unknown yes no telephone \n", "2 33 blue-collar married basic.9y no no no cellular \n", "\n", " month day_of_week ... campaign pdays previous poutcome emp_var_rate \\\n", "0 may mon ... 1 999 1 failure -1.8 \n", "1 may thu ... 2 999 0 nonexistent 1.1 \n", "2 may fri ... 
1 999 1 failure -1.8 \n", "\n", " cons_price_idx cons_conf_idx euribor3m nr_employed y \n", "0 92.893 -46.2 1.299 5099.1 no \n", "1 93.994 -36.4 4.860 5191.0 no \n", "2 92.893 -46.2 1.313 5099.1 no \n", "\n", "[3 rows x 21 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Copy and Clean the data\n", "\n", "df = data.copy()\n", "df = clean_col_names(df)\n", "df = clean_dtype(df)\n", "\n", "\n", "df.head(3)" ] }, { "cell_type": "markdown", "id": "royal-easter", "metadata": {}, "source": [ "##### Split the data" ] }, { "cell_type": "markdown", "id": "forced-world", "metadata": {}, "source": [ "Before conducting the exploratory data analysis (EDA), we will split the data into train and test sets. EDA should *always* be performed on the training data only, so that no information from the test set leaks into modeling decisions and inflates the performance estimate. The test set should be used only for the final model evaluation.\n", "\n", "Quick note - I have dropped the `duration` column because, based on my analysis explained [here](https://youtu.be/0ocVtXU8o1I), this feature leaks information about the target. Watch the presentation if you would like to understand how creating interpretable models can help avoid such data leakage. 
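
To illustrate the split described above, here is a minimal sketch on synthetic data (an assumption, not the bank dataset) showing how the `stratify` argument of `train_test_split` keeps the share of positive labels nearly the same in both parts. The names `X_demo` and `y_demo` and the 11% positive rate are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the real target (an assumption)
rng = np.random.default_rng(0)
y_demo = (rng.random(1000) < 0.11).astype(int)
X_demo = rng.normal(size=(1000, 3))

# 80/20 split that preserves the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, train_size=0.80, random_state=0
)

# With stratify, the share of positives should be nearly identical in both parts
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```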
" ] }, { "cell_type": "code", "execution_count": 21, "id": "chicken-aluminum", "metadata": {}, "outputs": [], "source": [ "X = df.drop(['y','duration'], axis=1)\n", "y = df.y\n", "\n", "y = binarize_y(y)" ] }, { "cell_type": "code", "execution_count": 22, "id": "shaped-settlement", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training set: 26360 \n", "Test set: 6590\n" ] } ], "source": [ "cat_cols = ['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome']\n", "num_cols = list(set(X.columns)-set(cat_cols))\n", "\n", "\n", "x1,x2, y1,y2 = train_test_split(X,y, stratify=y, train_size=0.80, shuffle=True, random_state = 0)\n", "\n", "print(\"Training set:\",len(x1),\"\\nTest set:\",len(x2))" ] }, { "cell_type": "markdown", "id": "ahead-block", "metadata": {}, "source": [ "Training set has 26,360 observations and test has 6590 observations. The 80/20 split is arbitrary at this point. You can create [learning curves](https://www.dataquest.io/blog/learning-curves-machine-learning/) to figure out how much data you need for training. It will also depend on the algorithm you are using. 
" ] }, { "cell_type": "code", "execution_count": 8, "id": "cardiovascular-centre", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>age</th>\n", " <th>job</th>\n", " <th>marital</th>\n", " <th>education</th>\n", " <th>default</th>\n", " <th>housing</th>\n", " <th>loan</th>\n", " <th>contact</th>\n", " <th>month</th>\n", " <th>day_of_week</th>\n", " <th>campaign</th>\n", " <th>pdays</th>\n", " <th>previous</th>\n", " <th>poutcome</th>\n", " <th>emp_var_rate</th>\n", " <th>cons_price_idx</th>\n", " <th>cons_conf_idx</th>\n", " <th>euribor3m</th>\n", " <th>nr_employed</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>23612</th>\n", " <td>55</td>\n", " <td>technician</td>\n", " <td>divorced</td>\n", " <td>professional.course</td>\n", " <td>unknown</td>\n", " <td>yes</td>\n", " <td>no</td>\n", " <td>cellular</td>\n", " <td>aug</td>\n", " <td>tue</td>\n", " <td>1</td>\n", " <td>999</td>\n", " <td>0</td>\n", " <td>nonexistent</td>\n", " <td>1.4</td>\n", " <td>93.444</td>\n", " <td>-36.1</td>\n", " <td>4.965</td>\n", " <td>5228.1</td>\n", " </tr>\n", " <tr>\n", " <th>32560</th>\n", " <td>40</td>\n", " <td>admin.</td>\n", " <td>single</td>\n", " <td>university.degree</td>\n", " <td>no</td>\n", " <td>no</td>\n", " <td>no</td>\n", " <td>telephone</td>\n", " <td>jun</td>\n", " <td>mon</td>\n", " <td>2</td>\n", " <td>999</td>\n", " <td>0</td>\n", " <td>nonexistent</td>\n", " <td>1.4</td>\n", " <td>94.465</td>\n", " <td>-41.8</td>\n", " <td>4.958</td>\n", " <td>5228.1</td>\n", " </tr>\n", " <tr>\n", " <th>15168</th>\n", " <td>51</td>\n", " <td>technician</td>\n", " 
<td>divorced</td>\n", " <td>unknown</td>\n", " <td>unknown</td>\n", " <td>no</td>\n", " <td>no</td>\n", " <td>telephone</td>\n", " <td>jul</td>\n", " <td>fri</td>\n", " <td>1</td>\n", " <td>999</td>\n", " <td>0</td>\n", " <td>nonexistent</td>\n", " <td>1.4</td>\n", " <td>93.918</td>\n", " <td>-42.7</td>\n", " <td>4.962</td>\n", " <td>5228.1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " age job marital education default housing loan \\\n", "23612 55 technician divorced professional.course unknown yes no \n", "32560 40 admin. single university.degree no no no \n", "15168 51 technician divorced unknown unknown no no \n", "\n", " contact month day_of_week campaign pdays previous poutcome \\\n", "23612 cellular aug tue 1 999 0 nonexistent \n", "32560 telephone jun mon 2 999 0 nonexistent \n", "15168 telephone jul fri 1 999 0 nonexistent \n", "\n", " emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed \n", "23612 1.4 93.444 -36.1 4.965 5228.1 \n", "32560 1.4 94.465 -41.8 4.958 5228.1 \n", "15168 1.4 93.918 -42.7 4.962 5228.1 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x1.head(3)" ] }, { "cell_type": "markdown", "id": "thirty-intent", "metadata": {}, "source": [ "#### Exploratory Data Analysis\n", "\n", "The data is clean for our demonstration purposes. Before building the model, you should invest significant time in understanding the data first. This is definitely the most important part of building a reliable machine learning model. In this demo, I am going to skip this step and leave it up to you. 
" ] }, { "cell_type": "code", "execution_count": 9, "id": "faced-knitting", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "<Figure size 720x720 with 20 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "#For demonstration, using only few numerical columns and 1000 random observations\n", "sns.pairplot(x1[['emp_var_rate','cons_price_idx','cons_conf_idx','euribor3m']].sample(1000),diag_kind='kde');" ] }, { "cell_type": "code", "execution_count": 10, "id": "owned-participation", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0.887936\n", "1 0.112064\n", "Name: y, dtype: float64\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD1CAYAAABA+A6aAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAAKmElEQVR4nO3cUYid+VnH8e/PCbmQqhUzlnaSmIBZa4Su6Jh6oVgR3aR7EQQvshVLl5YQMKJ3mxu96U1LEUSaGkIJxRtz46Kxjc2FUL1YipmFddt0yTqk7WZMobNaBOtFzO7jxYx6PHtmzjvZMzmZZ78fGJj3/f9z5rmYfPPy5j0nVYUkae/7gXkPIEmaDYMuSU0YdElqwqBLUhMGXZKaMOiS1MS+ef3gAwcO1JEjR+b14yVpT3rxxRdfr6rFSWtzC/qRI0dYWVmZ14+XpD0pybe3WvOWiyQ1YdAlqQmDLklNGHRJasKgS1ITBl2SmjDoktSEQZekJub2xqK94siFL817hFa+9amn5z2C1JZX6JLUhEGXpCYMuiQ1YdAlqQmDLklNGHRJasKgS1ITBl2SmjDoktSEQZekJgy6JDVh0CWpCYMuSU0YdElqwqBLUhMGXZKaGBT0JCeT3E6ymuTChPUfSfI3Sf4pya0kz85+VEnSdqYGPckCcBE4BRwHnklyfGzb7wLfqKongQ8Bf5xk/4xnlSRtY8gV+glgtaruVNV94CpwemxPAT+UJMC7gH8DHsx0UknStoYEfQm4O3K8tnlu1GeBnwbuAV8Dfr+q3pzJhJKkQYYEPRPO1djxU8BLwPuAnwU+m+SH3/JCydkkK0lW1tfXdziqJGk7Q4K+BhwaOT7IxpX4qGeB52vDKvBN4P3jL1RVl6tquaqWFxcXH3ZmSdIEQ4J+EziW5Ojmf3SeAa6N7XkN+DWAJO8Bfgq4M8tBJUnb2zdtQ1U9SHIeuAEsAFeq6laSc5vrl4BPAl9I8jU2btE8V1Wv7+LckqQxU4MOUFXXgetj5y6NfH8P+I3ZjiZJ2gnfKSpJTRh0SWrCoEtSEwZdkpow6JLUhEGXpCYMuiQ1YdAlqQmDLklNGHRJasKgS1ITBl2SmjDoktSEQZekJgy6JDVh0CWpCYMuSU0YdElqwqBLUhMGXZKaMOiS1IRBl6QmDLokNWHQJakJgy5JTRh0SWrCoEtSEwZdkpow6JLUhEGXpCYMuiQ1YdAlqQmDLklNGHRJasKgS1ITBl2SmjDoktSEQZekJgYFPcnJ
JLeTrCa5sMWeDyV5KcmtJH8/2zElSdPsm7YhyQJwEfh1YA24meRaVX1jZM+7gc8BJ6vqtSQ/vkvzSpK2MOQK/QSwWlV3quo+cBU4PbbnI8DzVfUaQFV9d7ZjSpKmGRL0JeDuyPHa5rlRTwA/muQrSV5M8tFJL5TkbJKVJCvr6+sPN7EkaaIhQc+EczV2vA/4eeBp4CngD5M88ZY/VHW5qparanlxcXHHw0qStjb1HjobV+SHRo4PAvcm7Hm9qr4PfD/JPwBPAq/OZEpJ0lRDrtBvAseSHE2yHzgDXBvb89fALyfZl+QHgQ8Cr8x2VEnSdqZeoVfVgyTngRvAAnClqm4lObe5fqmqXknyZeBl4E3g81X19d0cXJL0/w255UJVXQeuj527NHb8GeAzsxtNkrQTvlNUkpow6JLUhEGXpCYMuiQ1YdAlqQmDLklNGHRJasKgS1ITBl2SmjDoktSEQZekJgy6JDVh0CWpCYMuSU0YdElqwqBLUhMGXZKaMOiS1IRBl6QmDLokNWHQJakJgy5JTRh0SWrCoEtSEwZdkpow6JLUhEGXpCYMuiQ1YdAlqQmDLklNGHRJasKgS1ITBl2SmjDoktSEQZekJgy6JDVh0CWpiUFBT3Iyye0kq0kubLPvF5K8keS3ZjeiJGmIqUFPsgBcBE4Bx4FnkhzfYt+ngRuzHlKSNN2QK/QTwGpV3amq+8BV4PSEfb8H/CXw3RnOJ0kaaEjQl4C7I8drm+f+V5Il4DeBS7MbTZK0E0OCngnnauz4T4DnquqNbV8oOZtkJcnK+vr6wBElSUPsG7BnDTg0cnwQuDe2Zxm4mgTgAPDhJA+q6q9GN1XVZeAywPLy8vg/CpKkt2FI0G8Cx5IcBf4FOAN8ZHRDVR39n++TfAH44njMJUm7a2rQq+pBkvNsPL2yAFypqltJzm2ue99ckh4DQ67QqarrwPWxcxNDXlUfe/tjSZJ2yneKSlITBl2SmjDoktSEQZekJgy6JDVh0CWpCYMuSU0YdElqwqBLUhMGXZKaMOiS1IRBl6QmDLokNWHQJakJgy5JTRh0SWrCoEtSEwZdkpow6JLUhEGXpCYMuiQ1YdAlqQmDLklNGHRJasKgS1ITBl2SmjDoktSEQZekJgy6JDVh0CWpCYMuSU0YdElqwqBLUhMGXZKaMOiS1IRBl6QmDLokNWHQJamJQUFPcjLJ7SSrSS5MWP/tJC9vfr2Q5MnZjypJ2s7UoCdZAC4Cp4DjwDNJjo9t+ybwK1X1AeCTwOVZDypJ2t6QK/QTwGpV3amq+8BV4PTohqp6oaq+t3n4VeDgbMeUJE0zJOhLwN2R47XNc1v5OPC3b2coSdLO7RuwJxPO1cSNya+yEfRf2mL9LHAW4PDhwwNHlCQNMeQKfQ04NHJ8ELg3vinJB4DPA6er6l8nvVBVXa6q5apaXlxcfJh5JUlbGBL0m8CxJEeT7AfOANdGNyQ5DDwP/E5VvTr7MSVJ00y95VJVD5KcB24AC8CVqrqV5Nzm+iXgj4AfAz6XBOBBVS3v3tiSpHFD7qFTVdeB62PnLo18/wngE7MdTZK0E75TVJKaMOiS1IRBl6QmDLokNWHQJakJgy5JTRh0SWrCoEtSEwZdkpow6JLUhEGXpCYMuiQ1YdAlqQmDLklNGHRJamLQ56FLevwcufCleY/Qyrc+9fS8R3jbvEKXpCYMuiQ1YdAlqQmDLklNGHRJasKgS1ITBl2SmjDoktSEQZekJgy6JDVh0CWpCYMuSU0YdElqwqBLUhMGXZKaMOiS1IRBl6QmDLokNWHQJakJgy5JTRh0SWrCoEtSE4OCnuRkkttJVpNcmLCeJH+6uf5ykp+b/aiSpO1MDXqSBeAicAo4DjyT5PjYtlPAsc2vs8CfzXhOSdIUQ67QTwCrVXWnqu4DV4HTY3tOA39eG74KvDvJe2c8qyRpG/sG7FkC7o4crwEfHLBnCfjO6KYkZ9m4ggf4jyS3dzSttnMAeH3eQ0yT
T897As2Bv5uz9RNbLQwJeiacq4fYQ1VdBi4P+JnaoSQrVbU87zmkcf5uPjpDbrmsAYdGjg8C9x5ijyRpFw0J+k3gWJKjSfYDZ4BrY3uuAR/dfNrlF4F/r6rvjL+QJGn3TL3lUlUPkpwHbgALwJWqupXk3Ob6JeA68GFgFfhP4NndG1lb8FaWHlf+bj4iqXrLrW5J0h7kO0UlqQmDLklNGHRJamLIc+h6DCV5Pxvv0F1i45n/e8C1qnplroNJmhuv0PegJM+x8REMAf6RjUdLA/zFpA9Pkx4HSXz6bZf5lMselORV4Geq6r/Gzu8HblXVsflMJm0tyWtVdXjec3TmLZe96U3gfcC3x86/d3NNmoskL2+1BLznUc7yTmTQ96Y/AP4uyT/zfx+Kdhj4SeD8vIaS2Ij2U8D3xs4HeOHRj/POYtD3oKr6cpIn2Pho4yU2/rKsATer6o25Dqd3ui8C76qql8YXknzlkU/zDuM9dElqwqdcJKkJgy5JTRh0SWrCoEtSEwZdkpr4b64TIkv+PEboAAAAAElFTkSuQmCC\n", "text/plain": [ "<Figure size 432x288 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "print((y1.value_counts(normalize=True)))\n", "(y1.value_counts(normalize=True)).plot(kind='bar');" ] }, { "cell_type": "markdown", "id": "satisfied-affiliate", "metadata": {}, "source": [ "In the bar chart above, 1 is 'yes' and 0 is 'no'. As you can see, ~89% customers did not sign up for the term deposit and 11% did. Thus, the target labels are not balanced (i.e not ~50/50%). This will affect the model performance metric we choose. For imbalanced dataset, using `accuracy` as the metric can lead to incorrect results. `ROC-AUC` is often used in such situations. This is a big topic so for now we just need to know that based on the EDA we see that the target is imbalanced and we will have to keep it in mind when building the model." ] }, { "cell_type": "markdown", "id": "loving-reflection", "metadata": {}, "source": [ "#### Preprocess the data\n" ] }, { "cell_type": "markdown", "id": "oriented-curtis", "metadata": {}, "source": [ "In the previous steps we split the data and now we are ready to build the ML pipeline. We build the preprocessing pipeline for catgorical and numerical columns using `Pipeline()` from sklearn.\n", "\n", "Categorical columns will be encoded using `OneHotEncoder` and numerical features will be scaled using `StandardScaler`. 
The standard scaler rescales each numerical feature to mean = 0 and std dev = 1. There are various ways of encoding and scaling but for demo purposes we will stick with this. \n", "\n", "Also note that not all ML algorithms need encoding and scaling. Linear methods such as Logistic Regression do, while tree-based algorithms (Random Forest, GBMs) do not need scaling, and some implementations can handle categorical features natively. We will still preprocess the data so we can use the same pipeline for different algorithms, if needed. \n" ] }, { "cell_type": "code", "execution_count": 11, "id": "opponent-glucose", "metadata": {}, "outputs": [], "source": [ "#Get the column index for each column type \n", "\n", "cat_nums = [list(x1.columns).index(col) for col in cat_cols]\n", "num_nums = [list(x1.columns).index(col) for col in num_cols]" ] }, { "cell_type": "code", "execution_count": 12, "id": "bright-causing", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1, 2, 3, 4, 5, 6, 7, 8, 9, 13]\n", "[14, 0, 15, 12, 11, 10, 17, 16, 18]\n" ] } ], "source": [ "print(cat_nums)\n", "print(num_nums)" ] }, { "cell_type": "markdown", "id": "rational-teens", "metadata": {}, "source": [ " " ] }, { "cell_type": "code", "execution_count": 13, "id": "killing-diagram", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ColumnTransformer(transformers=[('cat',\n", " Pipeline(steps=[('ohe',\n", " OneHotEncoder(handle_unknown='ignore',\n", " sparse=False))]),\n", " [1, 2, 3, 4, 5, 6, 7, 8, 9, 13]),\n", " ('num',\n", " Pipeline(steps=[('std', StandardScaler())]),\n", " [14, 0, 15, 12, 11, 10, 17, 16, 18])])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#One hot encode\n", "cat_ohe_step = ('ohe', OneHotEncoder(sparse=False,\n", " handle_unknown='ignore'))\n", "#Build Pipeline\n", "cat_pipe = Pipeline([cat_ohe_step])\n", "num_pipe = Pipeline([('std', StandardScaler())])\n", "transformers = [\n", " ('cat', cat_pipe, cat_nums),\n", " ('num', num_pipe, num_nums)\n", "]\n", "ct = 
ColumnTransformer(transformers=transformers)\n", "\n", "ct" ] }, { "cell_type": "markdown", "id": "administrative-appreciation", "metadata": {}, "source": [ "Visualize the preprocessing steps:" ] }, { "cell_type": "code", "execution_count": 14, "id": "likely-action", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<style>div.sk-top-container {color: black;background-color: white;}div.sk-toggleable {background-color: white;}label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.2em 0.3em;box-sizing: border-box;text-align: center;}div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}div.sk-estimator {font-family: monospace;background-color: #f0f8ff;margin: 0.25em 0.25em;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;}div.sk-estimator:hover {background-color: #d4ebff;}div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 2em;bottom: 0;left: 50%;}div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;}div.sk-item {z-index: 1;}div.sk-parallel {display: 
flex;align-items: stretch;justify-content: center;background-color: white;}div.sk-parallel-item {display: flex;flex-direction: column;position: relative;background-color: white;}div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}div.sk-parallel-item:only-child::after {width: 0;}div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0.2em;box-sizing: border-box;padding-bottom: 0.1em;background-color: white;position: relative;}div.sk-label label {font-family: monospace;font-weight: bold;background-color: white;display: inline-block;line-height: 1.2em;}div.sk-label-container {position: relative;z-index: 2;text-align: center;}div.sk-container {display: inline-block;position: relative;}</style><div class=\"sk-top-container\"><div class=\"sk-container\"><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"7b7a6ed5-6823-4e8a-be28-eb59184036bd\" type=\"checkbox\" ><label class=\"sk-toggleable__label\" for=\"7b7a6ed5-6823-4e8a-be28-eb59184036bd\">ColumnTransformer</label><div class=\"sk-toggleable__content\"><pre>ColumnTransformer(transformers=[('cat',\n", " Pipeline(steps=[('ohe',\n", " OneHotEncoder(handle_unknown='ignore',\n", " sparse=False))]),\n", " [1, 2, 3, 4, 5, 6, 7, 8, 9, 13]),\n", " ('num',\n", " Pipeline(steps=[('std', StandardScaler())]),\n", " [14, 0, 15, 12, 11, 10, 17, 16, 18])])</pre></div></div></div><div class=\"sk-parallel\"><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"55ed1997-5c52-41ee-8ad8-b87d3c039d98\" type=\"checkbox\" ><label class=\"sk-toggleable__label\" for=\"55ed1997-5c52-41ee-8ad8-b87d3c039d98\">cat</label><div class=\"sk-toggleable__content\"><pre>[1, 2, 3, 4, 5, 6, 7, 8, 9, 
13]</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"bc5c0c0f-465f-42d4-b6f2-2aa75fe7cd03\" type=\"checkbox\" ><label class=\"sk-toggleable__label\" for=\"bc5c0c0f-465f-42d4-b6f2-2aa75fe7cd03\">OneHotEncoder</label><div class=\"sk-toggleable__content\"><pre>OneHotEncoder(handle_unknown='ignore', sparse=False)</pre></div></div></div></div></div></div></div></div><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"5fe22998-4ef9-41cf-94ac-dca017fee752\" type=\"checkbox\" ><label class=\"sk-toggleable__label\" for=\"5fe22998-4ef9-41cf-94ac-dca017fee752\">num</label><div class=\"sk-toggleable__content\"><pre>[14, 0, 15, 12, 11, 10, 17, 16, 18]</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"40d28cda-2103-4150-b156-020c7e3906c3\" type=\"checkbox\" ><label class=\"sk-toggleable__label\" for=\"40d28cda-2103-4150-b156-020c7e3906c3\">StandardScaler</label><div class=\"sk-toggleable__content\"><pre>StandardScaler()</pre></div></div></div></div></div></div></div></div></div></div></div></div>" ], "text/plain": [ "ColumnTransformer(transformers=[('cat',\n", " Pipeline(steps=[('ohe',\n", " OneHotEncoder(handle_unknown='ignore',\n", " sparse=False))]),\n", " [1, 2, 3, 4, 5, 6, 7, 8, 9, 13]),\n", " ('num',\n", " Pipeline(steps=[('std', StandardScaler())]),\n", " [14, 0, 15, 12, 11, 10, 17, 16, 18])])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Visualize the Preprocessing steps\n", "\n", "from sklearn import set_config\n", "\n", 
"set_config(display='diagram') \n", "\n", "ct" ] }, { "cell_type": "markdown", "id": "seven-furniture", "metadata": {}, "source": [ "#### Build the model\n", "\n", "I am going to use **Random forest** algorithm. Random Forest often gives a good baseline performance right out of the box in most scenarios without overfitting. Also, Random Forest has a nice feature - ['Out of Bag' (OOB)](https://en.wikipedia.org/wiki/Out-of-bag_error) score. It will help us estimate model performance over multiple bootstrapped samples thus providing a good proxy for cross-validation performance. I am using OOB here, just to save model building time. In a real project, you will carefully construct a CV scheme. \n", "\n", "We use the above pipeline of transformations with the Random Forest estimator with default parameters. Set the `oob_score=True` to get the OOB score. Also, `class_weight` is set to `balanced` to mitigate class imbalance.\n" ] }, { "cell_type": "code", "execution_count": 15, "id": "proud-allah", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<style>div.sk-top-container {color: black;background-color: white;}div.sk-toggleable {background-color: white;}label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.2em 0.3em;box-sizing: border-box;text-align: center;}div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: 
-1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}div.sk-estimator {font-family: monospace;background-color: #f0f8ff;margin: 0.25em 0.25em;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;}div.sk-estimator:hover {background-color: #d4ebff;}div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 2em;bottom: 0;left: 50%;}div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;}div.sk-item {z-index: 1;}div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;}div.sk-parallel-item {display: flex;flex-direction: column;position: relative;background-color: white;}div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}div.sk-parallel-item:only-child::after {width: 0;}div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0.2em;box-sizing: border-box;padding-bottom: 0.1em;background-color: white;position: relative;}div.sk-label label {font-family: monospace;font-weight: bold;background-color: white;display: inline-block;line-height: 1.2em;}div.sk-label-container {position: relative;z-index: 2;text-align: center;}div.sk-container {display: inline-block;position: relative;}</style><div class=\"sk-top-container\"><div class=\"sk-container\"><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"c8ab40bc-f1d0-45df-9f3b-12b6f547731a\" type=\"checkbox\" ><label class=\"sk-toggleable__label\" for=\"c8ab40bc-f1d0-45df-9f3b-12b6f547731a\">Pipeline</label><div class=\"sk-toggleable__content\"><pre>Pipeline(steps=[('ct',\n", " 
ColumnTransformer(transformers=[('cat',\n", " Pipeline(steps=[('ohe',\n", " OneHotEncoder(handle_unknown='ignore',\n", " sparse=False))]),\n", " [1, 2, 3, 4, 5, 6, 7, 8, 9,\n", " 13]),\n", " ('num',\n", " Pipeline(steps=[('std',\n", " StandardScaler())]),\n", " [14, 0, 15, 12, 11, 10, 17,\n", " 16, 18])])),\n", " ('rf',\n", " RandomForestClassifier(class_weight='balanced', oob_score=True,\n", " random_state=0))])</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"5136d72f-3d1d-43d5-b38b-15eb55eab08c\" type=\"checkbox\" ><label class=\"sk-toggleable__label\" for=\"5136d72f-3d1d-43d5-b38b-15eb55eab08c\">ct: ColumnTransformer</label><div class=\"sk-toggleable__content\"><pre>ColumnTransformer(transformers=[('cat',\n", " Pipeline(steps=[('ohe',\n", " OneHotEncoder(handle_unknown='ignore',\n", " sparse=False))]),\n", " [1, 2, 3, 4, 5, 6, 7, 8, 9, 13]),\n", " ('num',\n", " Pipeline(steps=[('std', StandardScaler())]),\n", " [14, 0, 15, 12, 11, 10, 17, 16, 18])])</pre></div></div></div><div class=\"sk-parallel\"><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"36b1ebcb-9464-414a-833c-f10631a8a999\" type=\"checkbox\" ><label class=\"sk-toggleable__label\" for=\"36b1ebcb-9464-414a-833c-f10631a8a999\">cat</label><div class=\"sk-toggleable__content\"><pre>[1, 2, 3, 4, 5, 6, 7, 8, 9, 13]</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"d5682005-de0d-4c82-a808-00b37b7dd47d\" type=\"checkbox\" ><label class=\"sk-toggleable__label\" 
for=\"d5682005-de0d-4c82-a808-00b37b7dd47d\">OneHotEncoder</label><div class=\"sk-toggleable__content\"><pre>OneHotEncoder(handle_unknown='ignore', sparse=False)</pre></div></div></div></div></div></div></div></div><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"aefca277-ab0c-4ff2-839e-370ff9882d3c\" type=\"checkbox\" ><label class=\"sk-toggleable__label\" for=\"aefca277-ab0c-4ff2-839e-370ff9882d3c\">num</label><div class=\"sk-toggleable__content\"><pre>[14, 0, 15, 12, 11, 10, 17, 16, 18]</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"59f8ff65-058e-4989-875b-7b9f6854f98a\" type=\"checkbox\" ><label class=\"sk-toggleable__label\" for=\"59f8ff65-058e-4989-875b-7b9f6854f98a\">StandardScaler</label><div class=\"sk-toggleable__content\"><pre>StandardScaler()</pre></div></div></div></div></div></div></div></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"d4533a61-61f2-419d-9808-c4541fa2123c\" type=\"checkbox\" ><label class=\"sk-toggleable__label\" for=\"d4533a61-61f2-419d-9808-c4541fa2123c\">RandomForestClassifier</label><div class=\"sk-toggleable__content\"><pre>RandomForestClassifier(class_weight='balanced', oob_score=True, random_state=0)</pre></div></div></div></div></div></div></div>" ], "text/plain": [ "Pipeline(steps=[('ct',\n", " ColumnTransformer(transformers=[('cat',\n", " Pipeline(steps=[('ohe',\n", " OneHotEncoder(handle_unknown='ignore',\n", " sparse=False))]),\n", " [1, 2, 3, 4, 5, 6, 7, 8, 9,\n", " 13]),\n", " ('num',\n", " Pipeline(steps=[('std',\n", " StandardScaler())]),\n", " [14, 0, 15, 12, 11, 10, 17,\n", " 16, 18])])),\n", " ('rf',\n", 
" RandomForestClassifier(class_weight='balanced', oob_score=True,\n", " random_state=0))])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe = Pipeline([\n", " ('ct', ct),\n", " ('rf', RandomForestClassifier(oob_score=True,\n", " random_state=0, \n", " class_weight = 'balanced')),\n", " ])\n", "\n", "#Fit the model\n", "pipe.fit(x1,y1)\n", "\n" ] }, { "cell_type": "markdown", "id": "virtual-mirror", "metadata": {}, "source": [ "#### Validate the model" ] }, { "cell_type": "code", "execution_count": 49, "id": "rubber-guatemala", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "OOB AUC is: 0.63\n", "OOB AUC on test set is: 0.63\n" ] } ], "source": [ "#Access the RF estimator from the pipeline\n", "rf = pipe[-1]\n", "\n", "# OOB, by default, gives Accuracy score. This is a slightly imbalanced dataset,\n", "# so we will calculate AUC on OOB predictions\n", "\n", "oob_pred = np.argmax(rf.oob_decision_function_,axis=1)\n", "auc1 = metrics.roc_auc_score(y1, oob_pred)\n", "\n", "print(\"OOB AUC is: \",np.round(auc1,2))\n", "print(\"OOB AUC on test set is: \",np.round(roc_auc_score(y2, pipe.predict(x2)),2))\n", "\n" ] }, { "cell_type": "markdown", "id": "fantastic-chase", "metadata": {}, "source": [ "Although the AUC is not very high, OOB gave an excellent estimation of the test score. We are happy with the model and it's ready to be used for future predictions." 
] }, { "cell_type": "markdown", "id": "romance-vacation", "metadata": {}, "source": [ "#### Serialize the Model" ] }, { "cell_type": "markdown", "id": "committed-encounter", "metadata": {}, "source": [ "Serialize the model using `joblib`." ] }, { "cell_type": "code", "execution_count": 50, "id": "terminal-insider", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['baseline_rf.pkl']" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import joblib\n", "\n", "joblib.dump(pipe, 'baseline_rf.pkl')" ] }, { "cell_type": "markdown", "id": "nominated-brave", "metadata": {}, "source": [ "Reload the pickle file and predict on the test set again to make sure it's working as expected." ] }, { "cell_type": "code", "execution_count": 51, "id": "boxed-killing", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 1, ..., 0, 0, 0])" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "joblib.load('baseline_rf.pkl').predict(x2)" ] }, { "cell_type": "markdown", "id": "designed-northwest", "metadata": {}, "source": [ "**Success!**" ] }, { "cell_type": "markdown", "id": "beneficial-review", "metadata": {}, "source": [ "We followed every step of the 'typical' machine learning process. But is that how it works in real life? \n", "\n", "The answer is a resounding **NO**. \n", "\n", "This is in fact just a fraction of the actual process. In reality it's a very convoluted, non-linear process with multiple stakeholders/teams involved in creating the final model. You have business stakeholders defining the goals and business objectives, IT/data engineers extracting and staging the data, data scientists creating the models, software engineers integrating them with the product, and business intelligence developers consuming the predictions in a dashboard. 
All these teams collaborate with each other, going through many iterations before finalizing a model.\n", "\n", "\n", "\n", "\n", "Ref: https://chapeau.freevariable.com/static/202010/mlops-tube.png" ] }, { "cell_type": "markdown", "id": "affecting-section", "metadata": {}, "source": [ "Things that are missing from this 'typical' process are:\n", " \n", " - Multiple stakeholders and collaborators: identifying business objectives and tying them to model metrics\n", " - Computational resources needed to run the ML models. If you are working on a dataset with millions of rows or a DNN model, you will very likely need GPUs \n", " - Experimentation design: algorithms, preprocessing steps, feature selection, feature engineering. You will create thousands of models before identifying a few model candidates that meet the business objectives.\n", " - Experiment tracking: You will need to efficiently track these thousands of ML experiments to understand the patterns\n", " - Data versioning: You will work with several different versions of the data. By the time you arrive at the final model, the data used for training & evaluating that model will be very different from what you started with. You or your colleagues may also need a different version of that data for some other project. \n", " - Track model artifacts: Each model will have its dependency requirements, input/output schema, hyperparameters\n", " - Package the model: Containerize the model with its dependencies \n", " - Deploy & monitor: scale, data security, performance monitoring, data drift, model interpretability" ] }, { "cell_type": "markdown", "id": "periodic-prime", "metadata": {}, "source": [ "This is where **Azure Machine Learning Service** helps! 
It's a fully managed cloud service that lets you:\n", " - Work in collaboration while keeping control over data security and resources\n", " - Scale the compute targets as needed\n", " - Track data and model versions\n", " - Experiment with thousands of models and keep track of them\n", " - Deploy the models based on requirements (real-time, batch, IoT) \n", " - Monitor in production\n", " - Trace the model back to data and model artifacts\n", " - Apply DevOps practices" ] }, { "cell_type": "markdown", "id": "occupational-nature", "metadata": {}, "source": [ "Azure ML is a **fully managed MLOps platform** that will help you manage the machine learning process based on project requirements." ] }, { "cell_type": "markdown", "id": "executive-infection", "metadata": {}, "source": [ "# Azure ML Service" ] }, { "cell_type": "markdown", "id": "historical-membership", "metadata": {}, "source": [ "Hopefully the above example gave you reasons to learn and understand why MLOps is important. With Scikit-learn you can create the models *but* it won't help you put those models in production. We will now see how to operationalize this model using Azure ML. " ] }, { "cell_type": "markdown", "id": "enabling-contrast", "metadata": {}, "source": [ "### Tour of Azure ML Service" ] }, { "cell_type": "markdown", "id": "maritime-indonesian", "metadata": {}, "source": [ "Create a free Azure account by visiting the Azure page: https://azure.microsoft.com/en-us/services/machine-learning/\n", "The account is free and you get $200 credit for the first 30 days. Create a 'Pay-As-You-Go' subscription so you incur costs only for the services you use. Be careful when creating compute resources. Shut them down when you are not using them to avoid costly surprises. If you are a student, you may get some additional benefits. 
" ] }, { "cell_type": "markdown", "id": "million-prompt", "metadata": {}, "source": [ "##### Create Pay As You Go subscription\n", "\n", "" ] }, { "cell_type": "markdown", "id": "great-opportunity", "metadata": {}, "source": [ "#### Create Azure ML Resource\n", "\n", "[From MS Docs](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/manage-resource-groups-portal#what-is-a-resource-group): A resource group is a container that holds related resources for an Azure solution. The resource group can include all the resources for the solution, or only those resources that you want to manage as a group. You decide how you want to allocate resources to resource groups based on what makes the most sense for your organization. Generally, add resources that share the same lifecycle to the same resource group so you can easily deploy, update, and delete them as a group.\n", "\n", "The resource group stores metadata about the resources. Therefore, when you specify a location for the resource group, you are specifying where that metadata is stored. For compliance reasons, you may need to ensure that your data is stored in a particular region." ] }, { "cell_type": "markdown", "id": "departmental-favor", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "right-disposition", "metadata": {}, "source": [ "#### What's in Azure Resources?\n", "\n", "\n", "Ref: https://miro.medium.com/max/700/0*2B9p3J0A0efCL2J0.jpg" ] }, { "cell_type": "markdown", "id": "promotional-synthetic", "metadata": {}, "source": [ "You can access and manage these resources in Azure ML studio using the GUI. Some of these resources can also be managed using the Azure ML SDK. As you create machine learning models, you will need to access these resources based on project requirements. The Python SDK allows you to access them in your notebook on the fly. If the resources don't exist, you can create them programmatically. 
\n", "\n", "###### Architecture\n", "" ] }, { "cell_type": "markdown", "id": "official-payday", "metadata": {}, "source": [ "### Azure ML Python SDK" ] }, { "cell_type": "markdown", "id": "exciting-orange", "metadata": {}, "source": [ "I highly recommend creating a virtual environment that's specific to Azure ML projects to manage dependencies, especially for Azure AutoML. Azure AutoML dependencies are often hard to resolve.\n", "\n", "\n", "\n", "Create a virtual environment (e.g., `evenv`) and install the Azure ML SDK: `pip install --upgrade azureml-sdk[notebooks,automl]`.\n", "You can read more [here](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/install?view=azure-ml-py)" ] }, { "cell_type": "code", "execution_count": 52, "id": "modular-balance", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I am working with, Azure ML sdk ver: 1.20.0\n" ] } ], "source": [ "print(\"I am working with, Azure ML sdk ver: \",azureml.core.VERSION)" ] }, { "cell_type": "markdown", "id": "juvenile-integration", "metadata": {}, "source": [ "##### Azure ML Classes\n", "\n", "You will use these classes to operationalize the model in Azure ML. This list is definitely not exhaustive, but 80% of the time you will be working with these classes in Azure ML. I encourage you to read the Microsoft documentation for each of these. 
" ] }, { "cell_type": "markdown", "id": "included-password", "metadata": {}, "source": [ "\n", "| Feature | Description | Class |\n", "|--|--|--|\n", "| Workspace |Foundational resource in the cloud to manage experiments, models |`Workspace(..)` \n", "| Compute Instance |Fully managed development environment (DSVM) |`ComputeInstance(..)`\n", "| Compute Cluster |Fully managed multi-node, scalable compute |`ComputeTarget(..)`\n", "| Datastore |Azure Data storage |`Datastore(..)`\n", "| Dataset |Abstracted File or Tabular data stored in Datastore |`Dataset(..)`\n", "| Experiment |ML Experiment folder |`Experiment(..)`\n", "| Run |An instance of an experiment with artifacts |`Run(..)`\n", "| Log |Log metrics, artifacts related to run |`.log(..)`\n", "| Environment |Package environment and dependencies |`Environment(..)`\n", "| ScriptRunConfig |Configuration to run experiments |`ScriptRunConfig(..)`\n", "| Model |Manage, register, deploy models in the cloud |`Model(..)`\n", "| Webservice |Containerized packages for deployment, Endpoints |`Webservice(..)`\n" ] }, { "cell_type": "markdown", "id": "educated-richardson", "metadata": {}, "source": [ "### Workspace" ] }, { "cell_type": "markdown", "id": "bigger-transformation", "metadata": {}, "source": [ "You may have different workspaces for different teams, projects, etc. In fact, it's recommended to create a separate resource group for each so that all the project data, metadata, and artifacts stay together. This is especially useful if you are just trying out Azure ML: you can delete the whole resource group afterwards and avoid future charges for anything in it. To create or access a workspace, we use the `Workspace` class. The easiest way is to download the workspace's `config.json` file from the Azure portal to your working directory. It holds the subscription ID, resource group, and workspace name needed to connect to that workspace. You will be prompted to authenticate your credentials." 
] }, { "cell_type": "markdown", "id": "eight-season", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 53, "id": "supported-substitute", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connecting to the Workspace....\n", "Workspacename: demows , \n", "Workspace location: centralus\n" ] } ], "source": [ "from azureml.core import Workspace\n", "\n", "print(\"Connecting to the Workspace....\", end=\"\",sep='\\n')\n", "\n", "ws = Workspace.from_config()\n", "\n", "print(\"\\nWorkspacename:\",ws.name, \", \\nWorkspace location:\", ws.location)" ] }, { "cell_type": "markdown", "id": "solid-graduation", "metadata": {}, "source": [ "We are connected to the workspace. Now we can access the assets and artifacts in it." ] }, { "cell_type": "markdown", "id": "tracked-notion", "metadata": {}, "source": [ "### Datastore" ] }, { "cell_type": "markdown", "id": "seventh-editor", "metadata": {}, "source": [ "When the ML resource was created, Azure automatically created a Blob storage account and attached it to this workspace. That's the power of managed resources! You can always attach other Blob or ADLS Gen2 storage accounts as needed. Let's look at the datastores registered in this workspace." ] }, { "cell_type": "code", "execution_count": 54, "id": "general-information", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "workspacefilestore AzureFile\n", "workspaceblobstore AzureBlob\n" ] } ], "source": [ "# List all datastores registered in the current workspace\n", "datastores = ws.datastores\n", "for name, datastore in datastores.items():\n", " print(name, datastore.datastore_type)" ] }, { "cell_type": "markdown", "id": "reflected-panama", "metadata": {}, "source": [ "We have two datastores in this workspace: an Azure file share and a Blob container. Let's connect to the default datastore." 
] }, { "cell_type": "code", "execution_count": 55, "id": "protecting-pearl", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{\n", " \"name\": \"workspaceblobstore\",\n", " \"container_name\": \"azureml-blobstore-dca32a5a-2be2-43c0-a924-9f9d9b7c7789\",\n", " \"account_name\": \"demows8142312183\",\n", " \"protocol\": \"https\",\n", " \"endpoint\": \"core.windows.net\"\n", "}" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "datastore = ws.get_default_datastore()\n", "datastore" ] }, { "cell_type": "markdown", "id": "affected-agreement", "metadata": {}, "source": [ "Remember this is the 'datastore'. We haven't accessed any datasets in this datastore yet. Your data engineering team, for example, can do ETL using ADF, Synapse Analytics, Power Query, etc. for you and register a dataset in this datastore. You can also add other datasets to this datastore. We will register the current bank marketing data to this datastore. Once it is registered, your other team members can access this dataset simply by referencing it. " ] }, { "cell_type": "markdown", "id": "brave-burden", "metadata": {}, "source": [ "### Dataset" ] }, { "cell_type": "markdown", "id": "editorial-cartoon", "metadata": {}, "source": [ "Sometimes you may find it easier to use the GUI in the Azure ML studio to register a dataset. The GUI is more interactive and can also generate a dataset profile. \n", "\n", "Although it may not seem like a big deal, being able to register, track, and version the datasets seamlessly is one of the most important steps in creating reliable machine learning models. In the model creation process, you will generate different versions of the data. By versioning and tracking, you will be able to trace which dataset was used for training the deployed model and thus debug the models in production. \n", "\n", "Don't take my word for it. 
See what renowned ML researchers Andrew Ng and Francois Chollet (creator of Keras) say about the importance of data collection and versioning. [Ref](https://twitter.com/AndrewYNg/status/1353814743190913024)" ] }, { "cell_type": "markdown", "id": "intelligent-surfing", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "speaking-mumbai", "metadata": {}, "source": [ "Following DataOps practices will pay rich dividends and avoid many headaches when you have to debug and re-train models in the future.\n", "\n", "The dataset you create in Azure ML is actually an abstraction/reference to the stored data and its metadata [(ref)](https://docs.microsoft.com/en-us/azure/machine-learning/concept-data#reference-data-in-storage-with-datasets). Because datasets are references and are lazily evaluated, you get:\n", " - No additional storage cost\n", " - Data versioning\n", " - No risk of changing original data\n", " " ] }, { "cell_type": "markdown", "id": "advanced-louisiana", "metadata": {}, "source": [ "##### Register a dataset" ] }, { "cell_type": "code", "execution_count": 56, "id": "polar-double", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Registering dataset to the cloud...\n", "\n", "Data registration successful\n", "\n", " { 'bankmarketing': DatasetRegistration(id='5845bcb9-fde0-4419-bba8-55b9e8296da8', name='bankmarketing', version=1, description='This is the original data', tags={})}\n" ] } ], "source": [ "from azureml.core import Dataset\n", "from azureml.data.dataset_factory import DataType\n", "\n", "# First create a dataset object\n", "ds1 = Dataset.Tabular.from_delimited_files(path=path)\n", "\n", "# Register this dataset to the datastore\n", "print(\"Registering dataset to the cloud...\", end=\"\")\n", "\n", "\n", "ds1 = ds1.register(workspace = ws,\n", " name= \"bankmarketing\",\n", " description = \"This is the original data\")\n", "\n", "print(\"\\n\\nData registration successful\\n\\n\", Dataset.get_all(ws) )" ] }, { 
"cell_type": "markdown", "id": "adaptive-kernel", "metadata": {}, "source": [ "We have registered the original dataset to the default datastore." ] }, { "cell_type": "markdown", "id": "practical-drain", "metadata": {}, "source": [ "##### Create a new version of the dataset\n", "\n", "As you may have noticed, I dropped the `duration` column from the training data used above. We will create another version of the same dataset `ds1` by dropping the `duration` column and call it `ds2`." ] }, { "cell_type": "code", "execution_count": 57, "id": "closed-input", "metadata": {}, "outputs": [], "source": [ "ds2 = ds1.drop_columns('duration')\n", "\n", "ds2 = ds2.register(workspace = ws,\n", " name= \"bankmarketing\",\n", " description = \"Duration column dropped\",\n", " create_new_version=True)\n" ] }, { "cell_type": "code", "execution_count": 58, "id": "occupied-sustainability", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'bankmarketing': DatasetRegistration(id='9980d47e-df35-45ec-972a-b68a8dd64bf6', name='bankmarketing', version=2, description='Duration column dropped', tags={})}" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Dataset.get_all(ws)" ] }, { "cell_type": "markdown", "id": "professional-editing", "metadata": {}, "source": [ "Note that we still have only 1 dataset in the Datastore. We registered a new version under the same name rather than creating a second dataset. Notice the version number `version=2` above. We can retrieve any version when needed. We have only 1 file in the datastore, not 2. This is how, by creating an abstraction, we are able to save storage cost. If needed, you can add properties, tags, and a description to the dataset for future reference. In fact, it's good practice to do so for traceability.\n", "\n", "By default, when you reference a dataset, you get the latest version unless you specify one." 
] }, { "cell_type": "markdown", "id": "looking-temperature", "metadata": {}, "source": [ "We are still not done with the dataset. For training, we cleaned the data and split it into train/test. Those splits should also be registered to the datastore. It isn't mandatory, but it's good DataOps/MLOps practice. Also, anytime you create cross-validation folds for your final model training/validation, register those in the datastore too for reproducibility. To keep things simple, I am going to upload the train and test data created above to the datastore as one single file. Also note that you can directly register a pandas dataframe as a dataset. " ] }, { "cell_type": "code", "execution_count": 59, "id": "artificial-aspect", "metadata": {}, "outputs": [], "source": [ "# Create one single file with training and testing data\n", "# Add a column to label which data is train and test\n", "# This way we can keep the data in one single file\n", "# Be sure to drop the 'data' column before training and testing.\n", "\n", "train = x1.copy()\n", "train['target'] = y1\n", "train['data'] = 'train'\n", "\n", "test = x2.copy()\n", "test['target'] = y2\n", "test['data'] = 'test'\n", "\n", "train_test_data = train.append(test)\n", "\n" ] }, { "cell_type": "code", "execution_count": 60, "id": "coated-macintosh", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Method register_pandas_dataframe: This is an experimental method, and may change at any time.<br/>For more information, see https://aka.ms/azuremlexperimental.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Validating arguments.\n", "Arguments validated.\n", "Successfully obtained datastore reference and path.\n", "Uploading file to bank_train_test/e9360b10-63b9-4d6b-be50-53add1436e99/\n", "Successfully uploaded file to datastore.\n", "Creating and registering a new dataset.\n", "Successfully created and registered a new dataset.\n" ] } ], "source": [ "# Register the pandas 
dataframe as a dataset \n", "# Add tags for traceability\n", "\n", "from azureml.data.dataset_factory import TabularDatasetFactory\n", "\n", "ds3 = (TabularDatasetFactory\n", " .register_pandas_dataframe(\n", " train_test_data,\n", " target=(datastore,'bank_train_test'),\n", " name='bank_train_test',\n", " tags = {'Author':'Sandeep','Project':'Bank Marketing'},\n", " show_progress=True)\n", " )" ] }, { "cell_type": "code", "execution_count": 61, "id": "athletic-mitchell", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bank_train_test\n", "bankmarketing\n" ] } ], "source": [ "for dataset in Dataset.get_all(ws):\n", " print(dataset)" ] }, { "cell_type": "markdown", "id": "objective-engineering", "metadata": {}, "source": [ "Just for illustration purposes, if we want to retrieve a dataset by name, we can use the `get_by_name()` method. The output also shows the `id` (i.e., the unique ID) of the dataset. We will log this `id` during model building so we can trace the exact train/test data used." 
] }, { "cell_type": "code", "execution_count": 62, "id": "promising-founder", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{\n", " \"source\": [\n", " \"('workspaceblobstore', 'bank_train_test/e9360b10-63b9-4d6b-be50-53add1436e99/')\"\n", " ],\n", " \"definition\": [\n", " \"GetDatastoreFiles\",\n", " \"ReadParquetFile\",\n", " \"DropColumns\"\n", " ],\n", " \"registration\": {\n", " \"id\": \"7b81a6c0-1e72-4948-86bb-ddac0e4e5d77\",\n", " \"name\": \"bank_train_test\",\n", " \"version\": 1,\n", " \"tags\": {\n", " \"Author\": \"Sandeep\",\n", " \"Project\": \"Bank Marketing\"\n", " },\n", " \"workspace\": \"Workspace.create(name='demows', subscription_id='4cedc5dd-e3ad-468d-bf66-32e31bdb9148', resource_group='1-f4dcfa62-playground-sandbox')\"\n", " }\n", "}" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Dataset.get_by_name(ws, 'bank_train_test')" ] }, { "cell_type": "code", "execution_count": 64, "id": "minute-complexity", "metadata": {}, "outputs": [], "source": [ "ds_uid = '7b81a6c0-1e72-4948-86bb-ddac0e4e5d77'" ] }, { "cell_type": "markdown", "id": "special-publicity", "metadata": {}, "source": [ "The datastore now has two datasets, which can be accessed or versioned at any time." ] }, { "cell_type": "markdown", "id": "quick-robinson", "metadata": {}, "source": [ "### Compute " ] }, { "cell_type": "markdown", "id": "afraid-thong", "metadata": {}, "source": [ "We can train the model locally and deploy it to the cloud. But if you want to scale up the process by parallelizing model training, you can use a compute cluster. There are two types of compute:\n", "\n", " - Compute Instance: This is like a managed VM with R, Python and Jupyter installed. You can use it for remote training, and it can also be accessed from the Studio for development. \n", " \n", " - Compute Cluster: This is a scalable multi-node compute, meaning if your training requires a lot of compute power (e.g. 
12 machines with 24 cores each) you can push the training to the compute cluster to do that. This can also be used for batch-inferencing.\n", " \n", "For example purposes, I will show how to create it but won't use it. Compute is expensive. Companies often set compute quotas to limit cost and use remote compute for hyperparameter tuning or large jobs. Note that if you are using Azure ML pipelines, you have to use a Compute instance/cluster; local training is not available.\n", "\n", "I generally prefer creating compute using the GUI because you can see the cost of each compute option. " ] }, { "cell_type": "code", "execution_count": 65, "id": "indie-terrace", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DS12V2 exists already\n", "\n", "Running\n" ] } ], "source": [ "from azureml.core.compute import ComputeTarget, AmlCompute\n", "from azureml.core.compute_target import ComputeTargetException\n", "\n", "compute_name = \"DS12V2\"\n", "\n", "try:\n", "    vm = ComputeTarget(ws, compute_name)\n", "    print(f\"{compute_name} exists already\")\n", "except ComputeTargetException:\n", "    #Compute target not found in the workspace, so create it\n", "    compute_config = AmlCompute.provisioning_configuration(vm_size=\"Standard_D2_V2\", max_nodes=4)\n", "    vm = ComputeTarget.create(ws, compute_name, compute_config)\n", " \n", "vm.wait_for_completion(show_output=True)" ] }, { "cell_type": "markdown", "id": "driving-globe", "metadata": {}, "source": [ "### Experiment" ] }, { "cell_type": "markdown", "id": "changing-processor", "metadata": {}, "source": [ "This is the heart of machine learning and where all the magic happens. When you are working on a machine learning project, it's rarely a linear process as we discussed above. You try many different algorithms, debug them, understand how they work, try different preprocessing steps, feature engineering, data augmentation etc., which means you will end up creating thousands of models per project. To keep track of all these experimental runs, Azure ML provides the `Experiment` class. 
\n", "\n", "Think of `Experiment()` as one big folder where you save the model runs and the artifacts associated with that experiment. At the end of your experiment, you will see how each model performed based on the selected metric and choose the right model for your project. The steps you will follow for each experimental run:\n", "\n", " - Create Experiment object\n", " - Start run\n", " - Log metrics\n", " - Get run/experiment details\n", " \n", "Just for demonstration purposes, I will create a `Demo_Experiment` and log values `1,2,3` for a metric called `demo_metric`. " ] }, { "cell_type": "code", "execution_count": 66, "id": "wooden-malaysia", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<table style=\"width:100%\"><tr><th>Name</th><th>Workspace</th><th>Report Page</th><th>Docs Page</th></tr><tr><td>Demo_Experiment</td><td>demows</td><td><a href=\"https://ml.azure.com/experiments/Demo_Experiment?wsid=/subscriptions/4cedc5dd-e3ad-468d-bf66-32e31bdb9148/resourcegroups/1-f4dcfa62-playground-sandbox/workspaces/demows\" target=\"_blank\" rel=\"noopener\">Link to Azure Machine Learning studio</a></td><td><a href=\"https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment.Experiment?view=azure-ml-py\" target=\"_blank\" rel=\"noopener\">Link to Documentation</a></td></tr></table>" ], "text/plain": [ "Experiment(Name: Demo_Experiment,\n", "Workspace: demows)" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from azureml.core import Experiment\n", "\n", "exp1 = Experiment(workspace=ws, name=\"Demo_Experiment\")\n", "\n", "exp1" ] }, { "cell_type": "markdown", "id": "arranged-fifth", "metadata": {}, "source": [ "If you click on the above link, it will take you directly to the Azure ML Studio Experiment page. We have created the Experiment, i.e. the folder. 
Now, we *run* some experiments" ] }, { "cell_type": "code", "execution_count": 67, "id": "binding-freight", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8ab071769b804bb1a45b352ce3865140", "version_major": 2, "version_minor": 0 }, "text/plain": [ "_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/aml.mini.widget.v1": "{\"status\": \"Completed\", \"workbench_run_details_uri\": \"https://ml.azure.com/experiments/Demo_Experiment/runs/e41963a9-ce60-411b-93e8-30849ac12ed2?wsid=/subscriptions/4cedc5dd-e3ad-468d-bf66-32e31bdb9148/resourcegroups/1-f4dcfa62-playground-sandbox/workspaces/demows\", \"run_id\": \"e41963a9-ce60-411b-93e8-30849ac12ed2\", \"run_properties\": {\"run_id\": \"e41963a9-ce60-411b-93e8-30849ac12ed2\", \"created_utc\": \"2021-02-22T17:30:19.558499Z\", \"properties\": {\"ContentSnapshotId\": \"3b0e7ceb-338e-4d0d-bb10-d1bde356c31f\"}, \"tags\": {}, \"end_time_utc\": \"2021-02-22T17:30:33.00177Z\", \"status\": \"Completed\", \"log_files\": {}, \"log_groups\": [], \"run_duration\": \"0:00:13\"}, \"child_runs\": [], \"children_metrics\": {}, \"run_metrics\": [{\"name\": \"demo_metric\", \"run_id\": \"e41963a9-ce60-411b-93e8-30849ac12ed2\", \"categories\": [0, 1, 2], \"series\": [{\"data\": [1, 2, 3]}]}], \"run_logs\": \"\\nRun is completed.\", \"graph\": {}, \"widget_settings\": {\"childWidgetDisplay\": \"popup\", \"send_telemetry\": false, \"log_level\": \"INFO\", \"sdk_version\": \"1.20.0\"}, \"loading\": false}" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#Start run\n", "from azureml.widgets import RunDetails\n", "\n", "\n", "demo_run = exp1.start_logging()\n", "\n", "#Start Logging\n", "demo_run.log('demo_metric' , 1)\n", "demo_run.log('demo_metric' , 2)\n", "demo_run.log('demo_metric' , 3)\n", "\n", "#Stop logging\n", "\n", 
"demo_run.complete()\n", "\n", "RunDetails(demo_run).show()\n" ] }, { "cell_type": "markdown", "id": "convenient-september", "metadata": {}, "source": [ "Remember to use `run.complete()` to stop the run. A better and easier way is to use `with` as follows. When the run is complete, it will be completed automatically.\n", "\n", "For the bank marketing project, we created a random forest model using default hyper params. To demonstrate how create experiments, we will train four RF models by changing the `max_depth` parameter. When `max_depth` is None, it's just a stump of a tree. As we grow the depth, features are split and will identify non-linear patterns in the data. We will try `max_depth` = [None, 5,7,9]. In a real project, you will perform hyperparameter optimization using RandomSearch, Baysian Optimization using SKLearn, HyperOpt, HyperDrive etc. \n" ] }, { "cell_type": "code", "execution_count": 68, "id": "fallen-canberra", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "max_depth: None , oob_auc: 0.626\n", "max_depth: 5 , oob_auc: 0.74\n", "max_depth: 7 , oob_auc: 0.746\n", "max_depth: 9 , oob_auc: 0.746\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "d00ac55f06b545b4bbca1b55e851d358", "version_major": 2, "version_minor": 0 }, "text/plain": [ "_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/aml.mini.widget.v1": "{\"status\": \"Completed\", \"workbench_run_details_uri\": \"https://ml.azure.com/experiments/Bank_Marketing/runs/feee8fa9-f7f0-40ca-8766-163f88def905?wsid=/subscriptions/4cedc5dd-e3ad-468d-bf66-32e31bdb9148/resourcegroups/1-f4dcfa62-playground-sandbox/workspaces/demows\", \"run_id\": \"feee8fa9-f7f0-40ca-8766-163f88def905\", \"run_properties\": {\"run_id\": \"feee8fa9-f7f0-40ca-8766-163f88def905\", \"created_utc\": 
\"2021-02-22T17:31:15.164474Z\", \"properties\": {\"ContentSnapshotId\": \"c560339c-c187-45a7-b3d6-2bb9c8932ce1\"}, \"tags\": {}, \"end_time_utc\": \"2021-02-22T17:31:22.63437Z\", \"status\": \"Completed\", \"log_files\": {}, \"log_groups\": [], \"run_duration\": \"0:00:07\"}, \"child_runs\": [], \"children_metrics\": {}, \"run_metrics\": [{\"name\": \"Model\", \"run_id\": \"feee8fa9-f7f0-40ca-8766-163f88def905\", \"categories\": [0], \"series\": [{\"data\": [\"Random Forest\"]}]}, {\"name\": \"Dataset\", \"run_id\": \"feee8fa9-f7f0-40ca-8766-163f88def905\", \"categories\": [0], \"series\": [{\"data\": [\"7b81a6c0-1e72-4948-86bb-ddac0e4e5d77\"]}]}, {\"name\": \"max_depth\", \"run_id\": \"feee8fa9-f7f0-40ca-8766-163f88def905\", \"categories\": [0], \"series\": [{\"data\": [9]}]}, {\"name\": \"input_columns\", \"run_id\": \"feee8fa9-f7f0-40ca-8766-163f88def905\", \"categories\": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], \"series\": [{\"data\": [\"age\", \"job\", \"marital\", \"education\", \"default\", \"housing\", \"loan\", \"contact\", \"month\", \"day_of_week\", \"campaign\", \"pdays\", \"previous\", \"poutcome\", \"emp_var_rate\", \"cons_price_idx\", \"cons_conf_idx\", \"euribor3m\", \"nr_employed\"]}]}, {\"name\": \"oob_score\", \"run_id\": \"feee8fa9-f7f0-40ca-8766-163f88def905\", \"categories\": [0], \"series\": [{\"data\": [\"True\"]}]}, {\"name\": \"class_weight\", \"run_id\": \"feee8fa9-f7f0-40ca-8766-163f88def905\", \"categories\": [0], \"series\": [{\"data\": [\"balanced\"]}]}, {\"name\": \"oob_auc\", \"run_id\": \"feee8fa9-f7f0-40ca-8766-163f88def905\", \"categories\": [0], \"series\": [{\"data\": [0.7460066717843007]}]}], \"run_logs\": \"\\nRun is completed.\", \"graph\": {}, \"widget_settings\": {\"childWidgetDisplay\": \"popup\", \"send_telemetry\": false, \"log_level\": \"INFO\", \"sdk_version\": \"1.20.0\"}, \"loading\": false}" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#Create another experiment 
called bank\n", "from azureml.core import Run \n", "\n", "#Define new experiment\n", "bank = Experiment(workspace=ws, name=\"Bank_Marketing\")\n", "\n", "#Define Hyperparameter to tune\n", "max_depth=[None,5,7,9] \n", " \n", "#Run the experiment\n", "\n", "for depth in max_depth:\n", " \n", "    with bank.start_logging() as run: #optionally pass snapshot_directory='snapshot' to snapshot only that folder\n", "\n", "        #Log run details\n", "        run.log('Model', 'Random Forest')\n", "        run.log('Dataset', ds_uid)\n", "        run.log('max_depth', int(0 if depth is None else depth)) #None is logged as 0\n", "        run.log_list(\"input_columns\", list(x1.columns))\n", " \n", "        #train the pipeline\n", "        pipe2 = Pipeline([\n", "            ('ct', ct),\n", "            ('rf', RandomForestClassifier(oob_score=True,\n", "                                          random_state=0, \n", "                                          class_weight = 'balanced', \n", "                                          max_depth = depth )),\n", "        ])\n", "\n", "        pipe2.fit(x1,y1)\n", " \n", "\n", "        rf2 = pipe2[-1]\n", "\n", "        #Log model details\n", "        run.log('oob_score', 'True')\n", "        run.log('class_weight', 'balanced')\n", "\n", "\n", "        oob_pred2 = np.argmax(rf2.oob_decision_function_,axis=1)\n", "        auc2 = metrics.roc_auc_score(y1, oob_pred2)\n", " \n", "        #Log metrics\n", "        run.log('oob_auc', auc2)\n", " \n", "        print(\"max_depth: \",depth,\" , oob_auc: \", np.round(auc2,3))\n", " \n", "RunDetails(run).show()\n" ] }, { "cell_type": "markdown", "id": "saved-nursing", "metadata": {}, "source": [ "By limiting `max_depth`, the OOB AUC increased from 0.626 to about 0.74! \n", "\n", "In the experiment above, we logged the model class, dataset used, hyperparameter, input columns, AUC score etc. After the experiment is complete, you can visit the Studio to see the output and/or interact with the model artifacts. \n", "\n", "Note that by default, when you run an experiment, Azure ML will take a snapshot of the working folder. See below. Depending on your needs, this can be good or bad. You may not want to snapshot all the files and folders. 
You can specify an `.amlignore` or `.gitignore` file to indicate which files/folders to ignore. Another option is to specify which directory to snapshot. For example, `start_logging(snapshot_directory = 'snapshot')` snapshots only the `snapshot` folder. This helps reproducibility. You can save data, yaml, config files etc. so you or your colleagues can reproduce the results months later. The maximum snapshot size is 300MB. If your directory exceeds that, the run will fail. You can increase the limit but you will incur storage costs. Also, the directories `./outputs` and `./logs` are special: files written to them are always uploaded automatically with the run.\n", "\n", "I recommend 'snapshotting' only the required model artifacts and specifying which folder to snapshot.\n", "\n" ] }, { "cell_type": "markdown", "id": "foreign-institute", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "dress-joint", "metadata": {}, "source": [ "### Model Packaging - (Registration & Deployment)" ] }, { "cell_type": "markdown", "id": "grand-syria", "metadata": {}, "source": [ "We ran some experiments with the `max_depth` hyperparameter and found that using `max_depth` = [5,7,9] improves the results significantly. Let's use the 'one-standard error' rule [(Ref: ESL, p. 61)](https://web.stanford.edu/~hastie/ElemStatLearn//printings/ESLII_print12_toc.pdf) to pick a parsimonious model. We will pick `max_depth=5`, create a pickle file and deploy it as a service.\n", "\n", "Model accuracy is not the only criterion for selecting a model; in fact, it shouldn't be. Focus should be on selecting simple, parsimonious models that are interpretable & explainable. Watch my interpretability presentation for more details. For now, we will assume this is the right model for us. 
" ] }, { "cell_type": "code", "execution_count": 69, "id": "persistent-lindsay", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['bank_model.pkl']" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "final_model = Pipeline([\n", " ('ct', ct),\n", " ('rf', RandomForestClassifier(oob_score=True,\n", " random_state=0, \n", " class_weight = 'balanced',\n", " max_depth = 5)),\n", " ])\n", "final_model.fit(x1,y1)\n", "\n", "joblib.dump(final_model, 'bank_model.pkl')" ] }, { "cell_type": "code", "execution_count": 70, "id": "athletic-stick", "metadata": {}, "outputs": [], "source": [ "test_final_model = joblib.load('bank_model.pkl')" ] }, { "cell_type": "code", "execution_count": 71, "id": "sustained-equipment", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 0, ..., 1, 1, 1])" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_final_model.predict(x1.iloc[0:])" ] }, { "cell_type": "code", "execution_count": 79, "id": "celtic-morgan", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AUC on Test set: , 0.74\n" ] } ], "source": [ "print(\"AUC on Test set: ,\", np.round(metrics.roc_auc_score(y2, test_final_model.predict(x2)),2))" ] }, { "cell_type": "markdown", "id": "partial-adams", "metadata": {}, "source": [ "Excellent, OOB score is same as the test (very rare!)." ] }, { "cell_type": "markdown", "id": "fancy-northeast", "metadata": {}, "source": [ "There are actually multiple ways to register and deploy a model as webservice. Typically, you will first create training script, register an environment, create inference schema, register model, create deployment config etc. But there is a shorter way to do all of that in one single step. Usually you will go through everything step-by-step but for demonstration purposes, I will roll these steps into one by using `ResourceConfiguration` class. 
Also, note that this is for real-time inferencing using Azure Container Instance. You should always deploy the model locally first for debugging, testing before deploying it to the cloud. For batch-inferencing, follow [these steps](https://docs.microsoft.com/en-us/learn/modules/deploy-batch-inference-pipelines-with-azure-machine-learning/). \n", "\n", "You can also register and deploy using the interface in the Studio.\n", "\n", "We will also save the sample features and labels for future reference and model debugging. " ] }, { "cell_type": "code", "execution_count": 81, "id": "hungry-nickname", "metadata": {}, "outputs": [], "source": [ "np.savetxt('features.csv', np.array(x1), delimiter=',', fmt='%s')\n", "np.savetxt('labels.csv', y1, delimiter=',')" ] }, { "cell_type": "code", "execution_count": 82, "id": "different-excerpt", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Uploading an estimated of 2 files\n", "Uploading ./features.csv\n", "Uploaded ./features.csv, 1 files out of an estimated total of 2\n", "Uploading ./labels.csv\n", "Uploaded ./labels.csv, 2 files out of an estimated total of 2\n", "Uploaded 2 files\n" ] } ], "source": [ "\n", "datastore.upload_files(files=['./features.csv', './labels.csv'],\n", " target_path='sample_data/',\n", " overwrite=True)\n", "\n", "input_dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, 'sample_data/features.csv')])\n", "output_dataset = Dataset.Tabular.from_delimited_files(path=[(datastore, 'sample_data/labels.csv')])" ] }, { "cell_type": "code", "execution_count": 83, "id": "preceding-arthritis", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Registering model bank_model\n" ] } ], "source": [ "from azureml.core import Model\n", "from azureml.core.resource_configuration import ResourceConfiguration\n", "\n", "model = (Model.register(workspace = ws, \n", " model_name = \"bank_model\", #name for the model\n", " model_path = 
'./bank_model.pkl', #Specify the .pkl file\n", " model_framework=Model.Framework.SCIKITLEARN, #This will automatically create environment & schema\n", " sample_input_dataset=input_dataset, #Sample input\n", " sample_output_dataset=output_dataset, #Sample output \n", " resource_configuration=ResourceConfiguration(cpu=1, memory_in_gb=0.5), #ACI config to use\n", " description='Bank Marketing model to predict if a customer will sign up',\n", " tags = {'Author':'Sandeep', \n", " 'Date':'2/18/2021', \n", " 'Model':'RandomForest', \n", " 'Dataset':ds_uid} \n", " ))" ] }, { "cell_type": "code", "execution_count": 84, "id": "joint-circle", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: bank_model\n", "Version: 1\n" ] } ], "source": [ "print('Name:', model.name)\n", "print('Version:', model.version)" ] }, { "cell_type": "code", "execution_count": 85, "id": "colonial-seafood", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.\n", "Running............................................................................................\n", "Succeeded\n", "ACI service creation operation finished, operation \"Succeeded\"\n" ] } ], "source": [ "service = Model.deploy(ws, \"service3\", [model])\n", "service.wait_for_deployment(show_output=True)" ] }, { "cell_type": "code", "execution_count": 86, "id": "varying-basis", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AciWebservice(workspace=Workspace.create(name='demows', subscription_id='4cedc5dd-e3ad-468d-bf66-32e31bdb9148', resource_group='1-f4dcfa62-playground-sandbox'), name=service3, image_id=None, compute_type=None, state=ACI, scoring_uri=Healthy, tags=http://ec1b6ac7-1a4a-4d2e-8f98-ed8f8628ba44.centralus.azurecontainer.io/score, properties={}, 
created_by={'hasInferenceSchema': 'True', 'hasHttps': 'False'})" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "service" ] }, { "cell_type": "code", "execution_count": 87, "id": "dominican-advantage", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Healthy'" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "service.state" ] }, { "cell_type": "markdown", "id": "white-constitution", "metadata": {}, "source": [ "The webservice status is `Healthy`, so it is ready to be used in production for inferencing." ] }, { "cell_type": "markdown", "id": "signal-audio", "metadata": {}, "source": [ "#### Create input data to test inferencing\n", "\n", "We will send 10 sample observations to test if the service is responding and returning the expected output. You can also test this by sending a request to the `scoring_uri`." ] }, { "cell_type": "code", "execution_count": 89, "id": "therapeutic-lottery", "metadata": {}, "outputs": [], "source": [ "import json \n", "\n", "#Named input_data to avoid shadowing the built-in input()\n", "input_data = json.dumps({'data':x1.iloc[:10,:].to_dict('list'),'method': 'predict'})\n", "#headers are only needed when posting to the scoring_uri directly\n", "headers = {'Content-Type': 'application/json'}" ] }, { "cell_type": "code", "execution_count": 90, "id": "treated-baghdad", "metadata": {}, "outputs": [], "source": [ "output = service.run(input_data)" ] }, { "cell_type": "code", "execution_count": 91, "id": "eastern-nowhere", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'predict': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]}" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "output" ] }, { "cell_type": "markdown", "id": "employed-apache", "metadata": {}, "source": [ "We got the response back with predictions. The service is running and ready for action.\n", "If you are trying this example, be sure to delete the service and the workspace to avoid charges."
] }, { "cell_type": "code", "execution_count": 87, "id": "micro-andrew", "metadata": {}, "outputs": [], "source": [ "\n", "service.delete()" ] }, { "cell_type": "markdown", "id": "decreased-interim", "metadata": {}, "source": [ "#### Monitor the Webservice" ] }, { "cell_type": "markdown", "id": "peaceful-excerpt", "metadata": {}, "source": [ "To monitor the performance of the webservice and the deployed model, we need to do a few things:\n", "\n", " - Stress test the model for distribution shifts and loads\n", " - Collect webservice performance metrics using **Application Insights**. This will help us collect:\n", "     - Responses\n", "     - Request rates, response time, failure rates\n", "     - Exceptions\n", " \n", " - Monitor **Concept Drift**\n", "     - Performance of the ML model will likely degrade over time as the distribution of the input data, or its relationship with the target, changes\n", "     - By monitoring drift, we can measure it and decide when to re-train the model \n", " \n", " - Collect **Model Interpretability** data during inferencing\n", "     - This is to track how the model is making predictions and whether the predictions are fair \n", " \n", " \n", "This is a big topic and will require a separate presentation. But just know that with Azure ML service, you can monitor the model performance in the production environment." ] }, { "cell_type": "markdown", "id": "cognitive-cathedral", "metadata": {}, "source": [ "### Next Steps\n", "\n", "This was just an introduction to give you a flavor of how to use the Azure ML SDK. There are more advanced methods available depending on the project needs. I would encourage you to research those on your own from MS Docs and MS Learn."
] }, { "cell_type": "markdown", "id": "generic-melbourne", "metadata": {}, "source": [ " - Azure ML Pipelines\n", " - Azure ML HyperDrive\n", " - Azure Auto ML\n", " - Azure ML Studio Designer" ] }, { "cell_type": "markdown", "id": "exempt-research", "metadata": {}, "source": [ "### Resources" ] }, { "cell_type": "markdown", "id": "defined-upgrade", "metadata": {}, "source": [ " - [Microsoft Learn](https://docs.microsoft.com/en-us/learn/paths/create-machine-learn-models/)\n", " - [Microsoft Documentation](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/?view=azure-ml-py)\n" ] }, { "cell_type": "markdown", "id": "soviet-consumer", "metadata": {}, "source": [ "Thank you ! I hope you found this helpful. As always, feel free to get in touch if you have any questions. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 5 }