{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "TZ-PuwoyX4qQ" }, "source": [ "# Week 1 Assignment: Data Validation\n", "\n", "[TensorFlow Data Validation (TFDV)](https://cloud.google.com/solutions/machine-learning/analyzing-and-validating-data-at-scale-for-ml-using-tfx) is an open-source library that helps you understand, validate, and monitor production machine learning (ML) data at scale. Common use cases include comparing training, evaluation and serving datasets, as well as checking for training/serving skew. You have seen the core functionalities of this package in the previous ungraded lab and you will get to practice them in this week's assignment.\n", "\n", "In this lab, you will use TFDV to:\n", "\n", "* Generate and visualize statistics from a dataframe\n", "* Infer a dataset schema\n", "* Calculate, visualize and fix anomalies\n", "\n", "Let's begin!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Table of Contents\n", "\n", "- [1 - Setup and Imports](#1)\n", "- [2 - Load the Dataset](#2)\n", " - [2.1 - Read and Split the Dataset](#2-1)\n", " - [2.1.1 - Data Splits](#2-1-1)\n", " - [2.1.2 - Label Column](#2-1-2)\n", "- [3 - Generate and Visualize Training Data Statistics](#3)\n", " - [3.1 - Removing Irrelevant Features](#3-1)\n", " - [Exercise 1 - Generate Training Statistics](#ex-1)\n", " - [Exercise 2 - Visualize Training Statistics](#ex-2)\n", "- [4 - Infer a Data Schema](#4)\n", " - [Exercise 3: Infer the training set schema](#ex-3)\n", "- [5 - Calculate, Visualize and Fix Evaluation Anomalies](#5)\n", " - [Exercise 4: Compare Training and Evaluation Statistics](#ex-4)\n", " - [Exercise 5: Detecting Anomalies](#ex-5)\n", " - [Exercise 6: Fix evaluation anomalies in the schema](#ex-6)\n", "- [6 - Schema Environments](#6)\n", " - [Exercise 7: Check anomalies in the serving set](#ex-7)\n", " - [Exercise 8: Modifying the domain](#ex-8)\n", " - [Exercise 9: Detecting anomalies with environments](#ex-9)\n", "- [7 - 
Check for Data Drift and Skew](#7)\n", "- [8 - Display Stats for Data Slices](#8)\n", "- [9 - Freeze the Schema](#9)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZEnMK4DRNV1O" }, "source": [ "\n", "## 1 - Setup and Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "zrLPRsQgImel" }, "outputs": [], "source": [ "# Import packages\n", "import os\n", "import pandas as pd\n", "import tensorflow as tf\n", "import tempfile, urllib, zipfile\n", "import tensorflow_data_validation as tfdv\n", "\n", "\n", "from tensorflow.python.lib.io import file_io\n", "from tensorflow_data_validation.utils import slicing_util\n", "from tensorflow_metadata.proto.v0.statistics_pb2 import DatasetFeatureStatisticsList, DatasetFeatureStatistics\n", "\n", "# Set TF's logger to only display errors to avoid internal warnings being shown\n", "tf.get_logger().setLevel('ERROR')" ] }, { "cell_type": "markdown", "metadata": { "id": "5MizoHg1DRlK" }, "source": [ "\n", "## 2 - Load the Dataset\n", "You will be using the [Diabetes 130-US hospitals for years 1999-2008 Data Set](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008) donated to the University of California, Irvine (UCI) Machine Learning Repository. The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes.\n", "\n", "This dataset has already been included in your Jupyter workspace so you can easily load it." ] }, { "cell_type": "markdown", "metadata": { "id": "S2o2NGqIxc5e" }, "source": [ "\n", "### 2.1 - Read and Split the Dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "YyO3RSuLF0Nf" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " encounter_id patient_nbr race gender age weight \\\n", "0 2278392 8222157 Caucasian Female [0-10) NaN \n", "1 149190 55629189 Caucasian Female [10-20) NaN \n", "2 64410 86047875 AfricanAmerican Female [20-30) NaN \n", "3 500364 82442376 Caucasian Male [30-40) NaN \n", "4 16680 42519267 Caucasian Male [40-50) NaN \n", "\n", " admission_type_id discharge_disposition_id admission_source_id \\\n", "0 6 25 1 \n", "1 1 1 7 \n", "2 1 1 7 \n", "3 1 1 7 \n", "4 1 1 7 \n", "\n", " time_in_hospital ... citoglipton insulin glyburide-metformin \\\n", "0 1 ... No No No \n", "1 3 ... No Up No \n", "2 2 ... No No No \n", "3 2 ... No Up No \n", "4 1 ... No Steady No \n", "\n", " glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone \\\n", "0 No No No \n", "1 No No No \n", "2 No No No \n", "3 No No No \n", "4 No No No \n", "\n", " metformin-pioglitazone change diabetesMed readmitted \n", "0 No No No NO \n", "1 No Ch Yes >30 \n", "2 No No Yes NO \n", "3 No Ch Yes NO \n", "4 No Ch Yes NO \n", "\n", "[5 rows x 50 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read CSV data into a dataframe and recognize the missing data that is encoded with '?' string as NaN\n", "df = pd.read_csv('dataset_diabetes/diabetic_data.csv', header=0, na_values = '?')\n", "\n", "# Preview the dataset\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### Data splits\n", "\n", "In a production ML system, the model performance can be negatively affected by anomalies and divergence between data splits for training, evaluation, and serving. To emulate a production system, you will split the dataset into:\n", "\n", "* 70% training set \n", "* 15% evaluation set\n", "* 15% serving set\n", "\n", "You will then use TFDV to visualize, analyze, and understand the data. 
You will create a data schema from the training dataset, then compare the evaluation and serving sets with this schema to detect anomalies and data drift/skew.\n", "\n", "\n", "#### Label Column\n", "\n", "This dataset has been prepared to analyze the factors related to readmission outcome. In this notebook, you will treat the `readmitted` column as the *target* or label column. \n", "\n", "The target (or label) is important to know while splitting the data into training, evaluation and serving sets. In supervised learning, you need to include the target in the training and evaluation datasets. For the serving set, however (i.e., the set that simulates the data coming from your users), the **label column needs to be dropped** since that is the feature that your model will be trying to predict.\n", "\n", "The following function returns the training, evaluation and serving partitions of a given dataset:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "Tv1I6Dd2IS5J" }, "outputs": [], "source": [ "def prepare_data_splits_from_dataframe(df):\n", " '''\n", " Splits a Pandas Dataframe into training, evaluation and serving sets.\n", "\n", " Parameters:\n", " df : pandas dataframe to split\n", "\n", " Returns:\n", " train_df: Training dataframe (70% of the entire dataset)\n", " eval_df: Evaluation dataframe (15% of the entire dataset) \n", " serving_df: Serving dataframe (15% of the entire dataset, label column dropped)\n", " '''\n", " \n", " # 70% of records for generating the training set\n", " train_len = int(len(df) * 0.7)\n", " \n", " # Remaining 30% of records for generating the evaluation and serving sets\n", " eval_serv_len = len(df) - train_len\n", " \n", " # Half of the 30%, which makes up 15% of total records, for generating the evaluation set\n", " eval_len = eval_serv_len // 2\n", " \n", " # Remaining 15% of total records for generating the serving set\n", " serv_len = eval_serv_len - eval_len \n", " \n", " # Sample the train, validation and 
serving sets. We specify a random state for repeatable outcomes.\n", " train_df = df.iloc[:train_len].sample(frac=1, random_state=48).reset_index(drop=True)\n", " eval_df = df.iloc[train_len: train_len + eval_len].sample(frac=1, random_state=48).reset_index(drop=True)\n", " serving_df = df.iloc[train_len + eval_len: train_len + eval_len + serv_len].sample(frac=1, random_state=48).reset_index(drop=True)\n", " \n", " # Serving data emulates the data that would be submitted for predictions, so it should not have the label column.\n", " serving_df = serving_df.drop(['readmitted'], axis=1)\n", "\n", " return train_df, eval_df, serving_df" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "rJV6__uVhz0B" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training dataset has 71236 records\n", "Validation dataset has 15265 records\n", "Serving dataset has 15265 records\n" ] } ], "source": [ "# Split the datasets\n", "train_df, eval_df, serving_df = prepare_data_splits_from_dataframe(df)\n", "print('Training dataset has {} records\\nValidation dataset has {} records\\nServing dataset has {} records'.format(len(train_df),len(eval_df),len(serving_df)))" ] }, { "cell_type": "markdown", "metadata": { "id": "Nnln8dH8Nmm8" }, "source": [ "\n", "## 3 - Generate and Visualize Training Data Statistics\n", "\n", "In this section, you will be generating descriptive statistics from the dataset. This is usually the first step when dealing with a dataset you are not yet familiar with. It is also known as performing an *exploratory data analysis* and its purpose is to understand the data types, the data itself and any possible issues that need to be addressed.\n", "\n", "It is important to mention that **exploratory data analysis should be performed on the training dataset** only. 
This is because getting information out of the evaluation or serving datasets can be seen as \"cheating\" since this data is used to emulate data that you have not collected yet and will try to predict using your ML algorithm. **In general, it is good practice to avoid leaking information from your evaluation and serving data into your model.**" ] }, { "cell_type": "markdown", "metadata": { "id": "PCrnVmQUY4We" }, "source": [ "\n", "### Removing Irrelevant Features\n", "\n", "Before you generate the statistics, you may want to drop irrelevant features from your dataset. You can do that in TFDV with the [tfdv.StatsOptions](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/StatsOptions) class. It is usually **not a good idea** to drop features without knowing what information they contain. However, there are times when this can be fairly obvious.\n", "\n", "One of the important parameters of the `StatsOptions` class is `feature_whitelist`, which defines the features to include while calculating the data statistics. You can check the [documentation](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/StatsOptions#args) to learn more about the class arguments.\n", "\n", "In this case, you will omit the statistics for `encounter_id` and `patient_nbr` since they are part of the internal tracking of patients in the hospital and they don't contain valuable information for the task at hand." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "z4jKM0gyj8Qc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['race', 'gender', 'age', 'weight', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'time_in_hospital', 'payer_code', 'medical_specialty', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1', 'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted']\n" ] } ], "source": [ "# Define features to remove\n", "features_to_remove = {'encounter_id', 'patient_nbr'}\n", "\n", "# Collect features to whitelist while computing the statistics\n", "approved_cols = [col for col in df.columns if (col not in features_to_remove)]\n", "\n", "# Instantiate a StatsOptions class and define the feature_whitelist property\n", "stats_options = tfdv.StatsOptions(feature_whitelist=approved_cols)\n", "\n", "# Review the features to generate the statistics\n", "print(stats_options.feature_whitelist)" ] }, { "cell_type": "markdown", "metadata": { "id": "CvHoWMcYNzIx" }, "source": [ "\n", "### Exercise 1: Generate Training Statistics " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TFDV allows you to generate statistics from different data formats such as CSV or a Pandas DataFrame. 
\n", "\n", "Since you already have the data stored in a DataFrame you can use the function [`tfdv.generate_statistics_from_dataframe()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_dataframe) which, given a DataFrame and `stats_options`, generates an object of type `DatasetFeatureStatisticsList`. This object includes the computed statistics of the given dataset.\n", "\n", "Complete the cell below to generate the statistics of the training set. Remember to pass the training dataframe and the `stats_options` that you defined above as arguments." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "EE481oMbT-H0" }, "outputs": [], "source": [ "### START CODE HERE\n", "train_stats = tfdv.generate_statistics_from_dataframe(train_df, stats_options)\n", "### END CODE HERE" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of features used: 48\n", "Number of examples used: 71236\n", "First feature: race\n", "Last feature: readmitted\n" ] } ], "source": [ "# TEST CODE\n", "\n", "# get the number of features used to compute statistics\n", "print(f\"Number of features used: {len(train_stats.datasets[0].features)}\")\n", "\n", "# check the number of examples used\n", "print(f\"Number of examples used: {train_stats.datasets[0].num_examples}\")\n", "\n", "# check the column names of the first and last feature\n", "print(f\"First feature: {train_stats.datasets[0].features[0].path.step[0]}\")\n", "print(f\"Last feature: {train_stats.datasets[0].features[-1].path.step[0]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Expected Output:**\n", "\n", "```\n", "Number of features used: 48\n", "Number of examples used: 71236\n", "First feature: race\n", "Last feature: readmitted\n", "```" ] }, { "cell_type": "markdown", "metadata": { "id": "ElOMvBOKNvLp" }, "source": [ "\n", "### Exercise 2: Visualize Training 
Statistics\n", "\n", "Now that you have the computed statistics in the `DatasetFeatureStatisticsList` instance, you will need a way to **visualize** these to get actual insights. TFDV provides this functionality through the function [`tfdv.visualize_statistics()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics).\n", "\n", "Calling this function in an interactive Python environment such as this one renders an interactive view of the descriptive statistics you generated earlier. \n", "\n", "**Try it out yourself!** Remember to pass in the training statistics generated in the previous exercise as an argument." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "graded": true, "id": "U3tUKgh7Up3x", "name": "train_stats_visualize_statistics" }, "outputs": [ { "data": { "text/html": [ "\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "### START CODE HERE\n", "tfdv.visualize_statistics(train_stats)\n", "### END CODE HERE" ] }, { "cell_type": "markdown", "metadata": { "id": "KVR02-y4V0uM" }, "source": [ "\n", "## 4 - Infer a Data Schema" ] }, { "cell_type": "markdown", "metadata": { "id": "IPRioB7hZ03b" }, "source": [ "A schema defines the **properties of the data** and can thus be used to detect errors. Some of these properties include:\n", "\n", "- which features are expected to be present\n", "- feature type\n", "- the number of values for a feature in each example\n", "- the presence of each feature across all examples\n", "- the expected domains of features\n", "\n", "The schema is expected to be fairly static, whereas statistics can vary per data split. So, you will **infer the data schema from only the training dataset**. Later, you will generate statistics for evaluation and serving datasets and compare their state with the data schema to detect anomalies, drift and skew." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 3: Infer the training set schema\n", "\n", "Schema inference is straightforward using [`tfdv.infer_schema()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/infer_schema). This function needs only the **statistics** (an instance of `DatasetFeatureStatisticsList`) of your data as input. The output will be a Schema [protocol buffer](https://developers.google.com/protocol-buffers) containing the results.\n", "\n", "A complementary function is [`tfdv.display_schema()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/display_schema) for displaying the schema in a table. This accepts a **Schema** protocol buffer as input.\n", "\n", "Fill in the code below to infer the schema from the training statistics using TFDV and display the result." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "6LLkRJThVr9m" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Type Presence Valency \\\n", "Feature name \n", "'race' STRING optional single \n", "'gender' STRING required \n", "'age' STRING required \n", "'weight' STRING optional single \n", "'admission_type_id' INT required \n", "'discharge_disposition_id' INT required \n", "'admission_source_id' INT required \n", "'time_in_hospital' INT required \n", "'payer_code' STRING optional single \n", "'medical_specialty' STRING optional single \n", "'num_lab_procedures' INT required \n", "'num_procedures' INT required \n", "'num_medications' INT required \n", "'number_outpatient' INT required \n", "'number_emergency' INT required \n", "'number_inpatient' INT required \n", "'diag_1' BYTES optional single \n", "'diag_2' BYTES optional single \n", "'diag_3' BYTES optional single \n", "'number_diagnoses' INT required \n", "'max_glu_serum' STRING required \n", "'A1Cresult' STRING required \n", "'metformin' STRING required \n", "'repaglinide' STRING required \n", "'nateglinide' STRING required \n", "'chlorpropamide' STRING required \n", "'glimepiride' STRING required \n", "'acetohexamide' STRING required \n", "'glipizide' STRING required \n", "'glyburide' STRING required \n", "'tolbutamide' STRING required \n", "'pioglitazone' STRING required \n", "'rosiglitazone' STRING required \n", "'acarbose' STRING required \n", "'miglitol' STRING required \n", "'troglitazone' STRING required \n", "'tolazamide' STRING required \n", "'examide' STRING required \n", "'citoglipton' STRING required \n", "'insulin' STRING required \n", "'glyburide-metformin' STRING required \n", "'glipizide-metformin' STRING required \n", "'glimepiride-pioglitazone' STRING required \n", "'metformin-rosiglitazone' STRING required \n", "'metformin-pioglitazone' STRING required \n", "'change' STRING required \n", "'diabetesMed' STRING required \n", "'readmitted' STRING required \n", "\n", " Domain \n", "Feature name \n", "'race' 'race' \n", "'gender' 'gender' \n", "'age' 'age' \n", "'weight' 'weight' 
\n", "'admission_type_id' - \n", "'discharge_disposition_id' - \n", "'admission_source_id' - \n", "'time_in_hospital' - \n", "'payer_code' 'payer_code' \n", "'medical_specialty' 'medical_specialty' \n", "'num_lab_procedures' - \n", "'num_procedures' - \n", "'num_medications' - \n", "'number_outpatient' - \n", "'number_emergency' - \n", "'number_inpatient' - \n", "'diag_1' - \n", "'diag_2' - \n", "'diag_3' - \n", "'number_diagnoses' - \n", "'max_glu_serum' 'max_glu_serum' \n", "'A1Cresult' 'A1Cresult' \n", "'metformin' 'metformin' \n", "'repaglinide' 'repaglinide' \n", "'nateglinide' 'nateglinide' \n", "'chlorpropamide' 'chlorpropamide' \n", "'glimepiride' 'glimepiride' \n", "'acetohexamide' 'acetohexamide' \n", "'glipizide' 'glipizide' \n", "'glyburide' 'glyburide' \n", "'tolbutamide' 'tolbutamide' \n", "'pioglitazone' 'pioglitazone' \n", "'rosiglitazone' 'rosiglitazone' \n", "'acarbose' 'acarbose' \n", "'miglitol' 'miglitol' \n", "'troglitazone' 'troglitazone' \n", "'tolazamide' 'tolazamide' \n", "'examide' 'examide' \n", "'citoglipton' 'citoglipton' \n", "'insulin' 'insulin' \n", "'glyburide-metformin' 'glyburide-metformin' \n", "'glipizide-metformin' 'glipizide-metformin' \n", "'glimepiride-pioglitazone' 'glimepiride-pioglitazone' \n", "'metformin-rosiglitazone' 'metformin-rosiglitazone' \n", "'metformin-pioglitazone' 'metformin-pioglitazone' \n", "'change' 'change' \n", "'diabetesMed' 'diabetesMed' \n", "'readmitted' 'readmitted' " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
" ], "text/plain": [ " Values\n", "Domain \n", "'race' 'AfricanAmerican', 'Asian', 'Caucasian', 'Hispanic', 'Other' \n", "'gender' 'Female', 'Male', 'Unknown/Invalid' \n", "'age' '[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)', '[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)' \n", "'weight' '>200', '[0-25)', '[100-125)', '[125-150)', '[150-175)', '[175-200)', '[25-50)', '[50-75)', '[75-100)' \n", "'payer_code' 'BC', 'CH', 'CM', 'CP', 'DM', 'HM', 'MC', 'MD', 'MP', 'OG', 'OT', 'PO', 'SI', 'SP', 'UN', 'WC' \n", "'medical_specialty' 'AllergyandImmunology', 'Anesthesiology', 'Anesthesiology-Pediatric', 'Cardiology', 'Cardiology-Pediatric', 'Dentistry', 'Dermatology', 'Emergency/Trauma', 'Endocrinology', 'Family/GeneralPractice', 'Gastroenterology', 'Gynecology', 'Hematology', 'Hematology/Oncology', 'Hospitalist', 'InfectiousDiseases', 'InternalMedicine', 'Nephrology', 'Neurology', 'Obsterics&Gynecology-GynecologicOnco', 'Obstetrics', 'ObstetricsandGynecology', 'Oncology', 'Ophthalmology', 'Orthopedics', 'Orthopedics-Reconstructive', 'Osteopath', 'Otolaryngology', 'OutreachServices', 'Pathology', 'Pediatrics', 'Pediatrics-AllergyandImmunology', 'Pediatrics-CriticalCare', 'Pediatrics-EmergencyMedicine', 'Pediatrics-Endocrinology', 'Pediatrics-Hematology-Oncology', 'Pediatrics-InfectiousDiseases', 'Pediatrics-Neurology', 'Pediatrics-Pulmonology', 'Perinatology', 'PhysicalMedicineandRehabilitation', 'PhysicianNotFound', 'Podiatry', 'Proctology', 'Psychiatry', 'Psychiatry-Addictive', 'Psychiatry-Child/Adolescent', 'Psychology', 'Pulmonology', 'Radiologist', 'Radiology', 'Rheumatology', 'Speech', 'SportsMedicine', 'Surgeon', 'Surgery-Cardiovascular', 'Surgery-Cardiovascular/Thoracic', 'Surgery-Colon&Rectal', 'Surgery-General', 'Surgery-Maxillofacial', 'Surgery-Neuro', 'Surgery-Pediatric', 'Surgery-Plastic', 'Surgery-PlasticwithinHeadandNeck', 'Surgery-Thoracic', 'Surgery-Vascular', 'SurgicalSpecialty', 'Urology'\n", "'max_glu_serum' '>200', '>300', 'None', 'Norm' 
\n", "'A1Cresult' '>7', '>8', 'None', 'Norm' \n", "'metformin' 'Down', 'No', 'Steady', 'Up' \n", "'repaglinide' 'Down', 'No', 'Steady', 'Up' \n", "'nateglinide' 'Down', 'No', 'Steady', 'Up' \n", "'chlorpropamide' 'Down', 'No', 'Steady', 'Up' \n", "'glimepiride' 'Down', 'No', 'Steady', 'Up' \n", "'acetohexamide' 'No', 'Steady' \n", "'glipizide' 'Down', 'No', 'Steady', 'Up' \n", "'glyburide' 'Down', 'No', 'Steady', 'Up' \n", "'tolbutamide' 'No', 'Steady' \n", "'pioglitazone' 'Down', 'No', 'Steady', 'Up' \n", "'rosiglitazone' 'Down', 'No', 'Steady', 'Up' \n", "'acarbose' 'Down', 'No', 'Steady', 'Up' \n", "'miglitol' 'Down', 'No', 'Steady', 'Up' \n", "'troglitazone' 'No', 'Steady' \n", "'tolazamide' 'No', 'Steady', 'Up' \n", "'examide' 'No' \n", "'citoglipton' 'No' \n", "'insulin' 'Down', 'No', 'Steady', 'Up' \n", "'glyburide-metformin' 'Down', 'No', 'Steady', 'Up' \n", "'glipizide-metformin' 'No', 'Steady' \n", "'glimepiride-pioglitazone' 'No' \n", "'metformin-rosiglitazone' 'No' \n", "'metformin-pioglitazone' 'No' \n", "'change' 'Ch', 'No' \n", "'diabetesMed' 'No', 'Yes' \n", "'readmitted' '<30', '>30', 'NO' " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "### START CODE HERE\n", "# Infer the data schema by using the training statistics that you generated\n", "schema = tfdv.infer_schema(train_stats)\n", "\n", "# Display the data schema\n", "tfdv.display_schema(schema)\n", "### END CODE HERE" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of features in schema: 48\n", "Second feature in schema: gender\n" ] } ], "source": [ "# TEST CODE\n", "\n", "# Check number of features\n", "print(f\"Number of features in schema: {len(schema.feature)}\")\n", "\n", "# Check domain name of 2nd feature\n", "print(f\"Second feature in schema: {list(schema.feature)[1].domain}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Expected Output:**\n", "\n", 
"```\n", "Number of features in schema: 48\n", "Second feature in schema: gender\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Be sure to check the information displayed before moving forward.**" ] }, { "cell_type": "markdown", "metadata": { "id": "ZVa3EXE8WEDE" }, "source": [ "\n", "## 5 - Calculate, Visualize and Fix Evaluation Anomalies\n" ] }, { "cell_type": "markdown", "metadata": { "id": "_PG0tVZDaDTF" }, "source": [ "It is important that the schema of the evaluation data is consistent with the training data since the data that your model is going to receive should be consistent to the one you used to train it with.\n", "\n", "Moreover, it is also important that the **features of the evaluation data belong roughly to the same range as the training data**. This ensures that the model will be evaluated on a similar loss surface covered during training." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 4: Compare Training and Evaluation Statistics\n", "\n", "Now you are going to generate the evaluation statistics and compare it with training statistics. You can use the [`tfdv.generate_statistics_from_dataframe()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_dataframe) function for this. But this time, you'll need to pass the **evaluation data**. For the `stats_options` parameter, the list you used before works here too.\n", "\n", "Remember that to visualize the evaluation statistics you can use [`tfdv.visualize_statistics()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/visualize_statistics). \n", "\n", "However, it is impractical to visualize both statistics separately and do your comparison from there. Fortunately, TFDV has got this covered. You can use the `visualize_statistics` function and pass additional parameters to overlay the statistics from both datasets (referenced as left-hand side and right-hand side statistics). 
Let's see what these parameters are:\n", "\n", "- `lhs_statistics`: Required parameter. Expects an instance of `DatasetFeatureStatisticsList`.\n", "\n", "\n", "- `rhs_statistics`: Expects an instance of `DatasetFeatureStatisticsList` to compare with `lhs_statistics`.\n", "\n", "\n", "- `lhs_name`: Name of the `lhs_statistics` dataset.\n", "\n", "\n", "- `rhs_name`: Name of the `rhs_statistics` dataset.\n", "\n", "For this case, remember to pass `eval_stats` as the `lhs_statistics` argument and `train_stats` as the optional `rhs_statistics` argument.\n", "\n", "Additionally, check the function signature for the name arguments, and set `lhs_name` and `rhs_name` to `'EVAL_DATASET'` and `'TRAIN_DATASET'` respectively." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "graded": true, "id": "j_P0RLYlV6XG", "name": "eval_stats_visualize_statistics" }, "outputs": [ { "data": { "text/html": [ "\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "### START CODE HERE\n", "# Generate evaluation dataset statistics\n", "# HINT: Remember to use the evaluation dataframe and to pass the stats_options (that you defined before) as an argument\n", "eval_stats = tfdv.generate_statistics_from_dataframe(eval_df, stats_options=stats_options)\n", "\n", "# Compare evaluation data with training data \n", "# HINT: Remember to use both the evaluation and training statistics with the lhs_statistics and rhs_statistics arguments\n", "# HINT: Assign the names 'EVAL_DATASET' and 'TRAIN_DATASET' to the lhs_name and rhs_name arguments\n", "tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,\n", " lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')\n", " \n", "### END CODE HERE" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of features: 48\n", "Number of examples: 15265\n", "First feature: race\n", "Last 
feature: readmitted\n" ] } ], "source": [ "# TEST CODE\n", "\n", "# get the number of features used to compute statistics\n", "print(f\"Number of features: {len(eval_stats.datasets[0].features)}\")\n", "\n", "# check the number of examples used\n", "print(f\"Number of examples: {eval_stats.datasets[0].num_examples}\")\n", "\n", "# check the column names of the first and last feature\n", "print(f\"First feature: {eval_stats.datasets[0].features[0].path.step[0]}\")\n", "print(f\"Last feature: {eval_stats.datasets[0].features[-1].path.step[0]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Expected Output:**\n", "\n", "```\n", "Number of features: 48\n", "Number of examples: 15265\n", "First feature: race\n", "Last feature: readmitted\n", "```" ] }, { "cell_type": "markdown", "metadata": { "id": "COwqJqf8aLGx" }, "source": [ "\n", "### Exercise 5: Detecting Anomalies ###\n", "\n", "At this point, you should ask if your evaluation dataset matches the schema from your training dataset. For instance, if you scroll through the output cell in the previous exercise, you can see that the categorical feature **glimepiride-pioglitazone** has 1 unique value in the training set while the evaluation dataset has 2. You can verify with the built-in Pandas `describe()` method as well." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 71236\n", "unique 1 \n", "top No \n", "freq 71236\n", "Name: glimepiride-pioglitazone, dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df[\"glimepiride-pioglitazone\"].describe()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 15265\n", "unique 2 \n", "top No \n", "freq 15264\n", "Name: glimepiride-pioglitazone, dtype: object" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eval_df[\"glimepiride-pioglitazone\"].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is possible but highly inefficient to visually inspect and determine all the anomalies. So, let's instead use TFDV functions to detect and display these.\n", "\n", "You can use the function [`tfdv.validate_statistics()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/validate_statistics) for detecting anomalies and [`tfdv.display_anomalies()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/display_anomalies) for displaying them.\n", "\n", "The `validate_statistics()` method has two required arguments:\n", "- an instance of `DatasetFeatureStatisticsList`\n", "- an instance of `Schema`\n", "\n", "Fill in the following graded function which, given the statistics and schema, displays the anomalies found." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "id": "QBUX-ocHs5NK" }, "outputs": [], "source": [ "def calculate_and_display_anomalies(statistics, schema):\n", " '''\n", " Calculate and display anomalies.\n", "\n", " Parameters:\n", " statistics : Data statistics in statistics_pb2.DatasetFeatureStatisticsList format\n", " schema : Data schema in schema_pb2.Schema format\n", "\n", " Returns:\n", " display of calculated anomalies\n", " '''\n", " ### START CODE HERE\n", " # HINTS: Pass the statistics and schema parameters into the validation function \n", " anomalies = tfdv.validate_statistics(statistics, schema)\n", " \n", " # HINTS: Display input anomalies by using the calculated anomalies\n", " tfdv.display_anomalies(anomalies)\n", " ### END CODE HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see detected anomalies in the `medical_specialty` and `glimepiride-pioglitazone` features by running the cell below." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "graded": true, "id": "T7uGVeL2WOam", "name": "calculate_and_display_anomalies" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Anomaly short descriptionAnomaly long description
Feature name
'glimepiride-pioglitazone'Unexpected string valuesExamples contain values missing from the schema: Steady (<1%).
'medical_specialty'Unexpected string valuesExamples contain values missing from the schema: Neurophysiology (<1%).
\n", "
" ], "text/plain": [ " Anomaly short description \\\n", "Feature name \n", "'glimepiride-pioglitazone' Unexpected string values \n", "'medical_specialty' Unexpected string values \n", "\n", " Anomaly long description \n", "Feature name \n", "'glimepiride-pioglitazone' Examples contain values missing from the schema: Steady (<1%). \n", "'medical_specialty' Examples contain values missing from the schema: Neurophysiology (<1%). " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Check evaluation data for errors by validating the evaluation data staticss using the previously inferred schema\n", "calculate_and_display_anomalies(eval_stats, schema=schema)" ] }, { "cell_type": "markdown", "metadata": { "id": "dzxx1gBpJIBa" }, "source": [ "\n", "### Exercise 6: Fix evaluation anomalies in the schema\n", "\n", "The evaluation data has records with values for the features **glimepiride-pioglitazone** and **medical_speciality** that were not included in the schema generated from the training data. You can fix this by adding the new values that exist in the evaluation dataset to the domain of these features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get the `domain` of a particular feature you can use [`tfdv.get_domain()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/get_domain).\n", "\n", "You can use the `append()` method to the `value` property of the returned `domain` to add strings to the valid list of values. To be more explicit, given a domain you can do something like:\n", "\n", "```python\n", "domain.value.append(\"feature_value\")\n", "\n", "```" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "id": "legN2nXLWZAc" }, "outputs": [ { "data": { "text/html": [ "

No anomalies found.

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "### START CODE HERE\n", "\n", "# Get the domain associated with the input feature, glimepiride-pioglitazone, from the schema\n", "glimepiride_pioglitazone_domain = tfdv.get_domain(schema, 'glimepiride-pioglitazone') \n", "\n", "# HINT: Append the missing value 'Steady' to the domain\n", "glimepiride_pioglitazone_domain.value.append('Steady')\n", "\n", "# Get the domain associated with the input feature, medical_specialty, from the schema\n", "medical_specialty_domain = tfdv.get_domain(schema, 'medical_specialty') \n", "\n", "# HINT: Append the missing value 'Neurophysiology' to the domain\n", "medical_specialty_domain.value.append('Neurophysiology')\n", "\n", "# HINT: Re-calculate and re-display anomalies with the new schema\n", "calculate_and_display_anomalies(eval_stats, schema=schema)\n", "### END CODE HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you did the exercise correctly, you should see *\"No anomalies found.\"* after running the cell above." ] }, { "cell_type": "markdown", "metadata": { "id": "KZ1P4ucHJj5o" }, "source": [ "\n", "## 6 - Schema Environments\n", "\n", "By default, all datasets in a pipeline should use the same schema. However, there are some exceptions. \n", "\n", "For example, the **label column is dropped in the serving set** so this will be flagged when comparing with the training set schema. \n", "\n", "**In this case, introducing slight schema variations is necessary.**\n", "\n", "\n", "### Exercise 7: Check anomalies in the serving set\n", "\n", "Now you are going to check for anomalies in the **serving data**. The process is very similar to the one you previously did for the evaluation data with a little change. \n", "\n", "Let's create a new `StatsOptions` that is aware of the information provided by the schema and use it when generating statistics from the serving DataFrame." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Define new statistics options for the serving data with the tfdv.StatsOptions class, passing the previously inferred schema\n", "options = tfdv.StatsOptions(schema=schema, \n", " infer_type_from_schema=True, \n", " feature_whitelist=approved_cols)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "id": "OhtYF8aAczpd" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Anomaly short descriptionAnomaly long description
Feature name
'metformin-pioglitazone'Unexpected string valuesExamples contain values missing from the schema: Steady (<1%).
'payer_code'Unexpected string valuesExamples contain values missing from the schema: FR (<1%).
'medical_specialty'Unexpected string valuesExamples contain values missing from the schema: DCPTEAM (<1%), Endocrinology-Metabolism (<1%), Resident (<1%).
'metformin-rosiglitazone'Unexpected string valuesExamples contain values missing from the schema: Steady (<1%).
'readmitted'Column droppedColumn is completely missing
\n", "
" ], "text/plain": [ " Anomaly short description \\\n", "Feature name \n", "'metformin-pioglitazone' Unexpected string values \n", "'payer_code' Unexpected string values \n", "'medical_specialty' Unexpected string values \n", "'metformin-rosiglitazone' Unexpected string values \n", "'readmitted' Column dropped \n", "\n", " Anomaly long description \n", "Feature name \n", "'metformin-pioglitazone' Examples contain values missing from the schema: Steady (<1%). \n", "'payer_code' Examples contain values missing from the schema: FR (<1%). \n", "'medical_specialty' Examples contain values missing from the schema: DCPTEAM (<1%), Endocrinology-Metabolism (<1%), Resident (<1%). \n", "'metformin-rosiglitazone' Examples contain values missing from the schema: Steady (<1%). \n", "'readmitted' Column is completely missing " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "### START CODE HERE\n", "# Generate serving dataset statistics\n", "# HINT: Remember to use the serving dataframe and to pass the newly defined statistics options\n", "serving_stats = tfdv.generate_statistics_from_dataframe(serving_df, stats_options=options)\n", "\n", "# HINT: Calculate and display anomalies using the generated serving statistics\n", "calculate_and_display_anomalies(serving_stats, schema=schema)\n", "### END CODE HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see that `metformin-rosiglitazone`, `metformin-pioglitazone`, `payer_code` and `medical_specialty` features have an anomaly (i.e. Unexpected string values) which is less than 1%. \n", "\n", "Let's **relax the anomaly detection constraints** for the last two of these features by defining the `min_domain_mass` of the feature's distribution constraints." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Anomaly short descriptionAnomaly long description
Feature name
'metformin-pioglitazone'Unexpected string valuesExamples contain values missing from the schema: Steady (<1%).
'metformin-rosiglitazone'Unexpected string valuesExamples contain values missing from the schema: Steady (<1%).
'readmitted'Column droppedColumn is completely missing
\n", "
" ], "text/plain": [ " Anomaly short description \\\n", "Feature name \n", "'metformin-pioglitazone' Unexpected string values \n", "'metformin-rosiglitazone' Unexpected string values \n", "'readmitted' Column dropped \n", "\n", " Anomaly long description \n", "Feature name \n", "'metformin-pioglitazone' Examples contain values missing from the schema: Steady (<1%). \n", "'metformin-rosiglitazone' Examples contain values missing from the schema: Steady (<1%). \n", "'readmitted' Column is completely missing " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This relaxes the minimum fraction of values that must come from the domain for the feature.\n", "\n", "# Get the feature and relax to match 90% of the domain\n", "payer_code = tfdv.get_feature(schema, 'payer_code')\n", "payer_code.distribution_constraints.min_domain_mass = 0.9 \n", "\n", "# Get the feature and relax to match 90% of the domain\n", "medical_specialty = tfdv.get_feature(schema, 'medical_specialty')\n", "medical_specialty.distribution_constraints.min_domain_mass = 0.9 \n", "\n", "# Detect anomalies with the updated constraints\n", "calculate_and_display_anomalies(serving_stats, schema=schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the `payer_code` and `medical_specialty` are no longer part of the output cell, then the relaxation worked!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise 8: Modifying the Domain\n", "\n", "Let's investigate the possible cause of the anomalies for the other features, namely `metformin-pioglitazone` and `metformin-rosiglitazone`. From the output of the previous exercise, you'll see that the `anomaly long description` says: \"Examples contain values missing from the schema: Steady (<1%)\". 
You can redisplay the schema and look at the domain of these features to verify this statement.\n", "\n", "When you inferred the schema at the start of this lab, it's possible that some values did not appear in the training data, so they were not included in the expected domain values of the feature's schema. In the case of `metformin-rosiglitazone` and `metformin-pioglitazone`, the value \"Steady\" is indeed missing. You will just see \"No\" in the domain of these two features after running the code cell below." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TypePresenceValencyDomain
Feature name
'race'STRINGoptionalsingle'race'
'gender'STRINGrequired'gender'
'age'STRINGrequired'age'
'weight'STRINGoptionalsingle'weight'
'admission_type_id'INTrequired-
'discharge_disposition_id'INTrequired-
'admission_source_id'INTrequired-
'time_in_hospital'INTrequired-
'payer_code'STRINGoptionalsingle'payer_code'
'medical_specialty'STRINGoptionalsingle'medical_specialty'
'num_lab_procedures'INTrequired-
'num_procedures'INTrequired-
'num_medications'INTrequired-
'number_outpatient'INTrequired-
'number_emergency'INTrequired-
'number_inpatient'INTrequired-
'diag_1'BYTESoptionalsingle-
'diag_2'BYTESoptionalsingle-
'diag_3'BYTESoptionalsingle-
'number_diagnoses'INTrequired-
'max_glu_serum'STRINGrequired'max_glu_serum'
'A1Cresult'STRINGrequired'A1Cresult'
'metformin'STRINGrequired'metformin'
'repaglinide'STRINGrequired'repaglinide'
'nateglinide'STRINGrequired'nateglinide'
'chlorpropamide'STRINGrequired'chlorpropamide'
'glimepiride'STRINGrequired'glimepiride'
'acetohexamide'STRINGrequired'acetohexamide'
'glipizide'STRINGrequired'glipizide'
'glyburide'STRINGrequired'glyburide'
'tolbutamide'STRINGrequired'tolbutamide'
'pioglitazone'STRINGrequired'pioglitazone'
'rosiglitazone'STRINGrequired'rosiglitazone'
'acarbose'STRINGrequired'acarbose'
'miglitol'STRINGrequired'miglitol'
'troglitazone'STRINGrequired'troglitazone'
'tolazamide'STRINGrequired'tolazamide'
'examide'STRINGrequired'examide'
'citoglipton'STRINGrequired'citoglipton'
'insulin'STRINGrequired'insulin'
'glyburide-metformin'STRINGrequired'glyburide-metformin'
'glipizide-metformin'STRINGrequired'glipizide-metformin'
'glimepiride-pioglitazone'STRINGrequired'glimepiride-pioglitazone'
'metformin-rosiglitazone'STRINGrequired'metformin-rosiglitazone'
'metformin-pioglitazone'STRINGrequired'metformin-pioglitazone'
'change'STRINGrequired'change'
'diabetesMed'STRINGrequired'diabetesMed'
'readmitted'STRINGrequired'readmitted'
\n", "
" ], "text/plain": [ " Type Presence Valency \\\n", "Feature name \n", "'race' STRING optional single \n", "'gender' STRING required \n", "'age' STRING required \n", "'weight' STRING optional single \n", "'admission_type_id' INT required \n", "'discharge_disposition_id' INT required \n", "'admission_source_id' INT required \n", "'time_in_hospital' INT required \n", "'payer_code' STRING optional single \n", "'medical_specialty' STRING optional single \n", "'num_lab_procedures' INT required \n", "'num_procedures' INT required \n", "'num_medications' INT required \n", "'number_outpatient' INT required \n", "'number_emergency' INT required \n", "'number_inpatient' INT required \n", "'diag_1' BYTES optional single \n", "'diag_2' BYTES optional single \n", "'diag_3' BYTES optional single \n", "'number_diagnoses' INT required \n", "'max_glu_serum' STRING required \n", "'A1Cresult' STRING required \n", "'metformin' STRING required \n", "'repaglinide' STRING required \n", "'nateglinide' STRING required \n", "'chlorpropamide' STRING required \n", "'glimepiride' STRING required \n", "'acetohexamide' STRING required \n", "'glipizide' STRING required \n", "'glyburide' STRING required \n", "'tolbutamide' STRING required \n", "'pioglitazone' STRING required \n", "'rosiglitazone' STRING required \n", "'acarbose' STRING required \n", "'miglitol' STRING required \n", "'troglitazone' STRING required \n", "'tolazamide' STRING required \n", "'examide' STRING required \n", "'citoglipton' STRING required \n", "'insulin' STRING required \n", "'glyburide-metformin' STRING required \n", "'glipizide-metformin' STRING required \n", "'glimepiride-pioglitazone' STRING required \n", "'metformin-rosiglitazone' STRING required \n", "'metformin-pioglitazone' STRING required \n", "'change' STRING required \n", "'diabetesMed' STRING required \n", "'readmitted' STRING required \n", "\n", " Domain \n", "Feature name \n", "'race' 'race' \n", "'gender' 'gender' \n", "'age' 'age' \n", "'weight' 'weight' 
\n", "'admission_type_id' - \n", "'discharge_disposition_id' - \n", "'admission_source_id' - \n", "'time_in_hospital' - \n", "'payer_code' 'payer_code' \n", "'medical_specialty' 'medical_specialty' \n", "'num_lab_procedures' - \n", "'num_procedures' - \n", "'num_medications' - \n", "'number_outpatient' - \n", "'number_emergency' - \n", "'number_inpatient' - \n", "'diag_1' - \n", "'diag_2' - \n", "'diag_3' - \n", "'number_diagnoses' - \n", "'max_glu_serum' 'max_glu_serum' \n", "'A1Cresult' 'A1Cresult' \n", "'metformin' 'metformin' \n", "'repaglinide' 'repaglinide' \n", "'nateglinide' 'nateglinide' \n", "'chlorpropamide' 'chlorpropamide' \n", "'glimepiride' 'glimepiride' \n", "'acetohexamide' 'acetohexamide' \n", "'glipizide' 'glipizide' \n", "'glyburide' 'glyburide' \n", "'tolbutamide' 'tolbutamide' \n", "'pioglitazone' 'pioglitazone' \n", "'rosiglitazone' 'rosiglitazone' \n", "'acarbose' 'acarbose' \n", "'miglitol' 'miglitol' \n", "'troglitazone' 'troglitazone' \n", "'tolazamide' 'tolazamide' \n", "'examide' 'examide' \n", "'citoglipton' 'citoglipton' \n", "'insulin' 'insulin' \n", "'glyburide-metformin' 'glyburide-metformin' \n", "'glipizide-metformin' 'glipizide-metformin' \n", "'glimepiride-pioglitazone' 'glimepiride-pioglitazone' \n", "'metformin-rosiglitazone' 'metformin-rosiglitazone' \n", "'metformin-pioglitazone' 'metformin-pioglitazone' \n", "'change' 'change' \n", "'diabetesMed' 'diabetesMed' \n", "'readmitted' 'readmitted' " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Values
Domain
'race''AfricanAmerican', 'Asian', 'Caucasian', 'Hispanic', 'Other'
'gender''Female', 'Male', 'Unknown/Invalid'
'age''[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)', '[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)'
'weight''>200', '[0-25)', '[100-125)', '[125-150)', '[150-175)', '[175-200)', '[25-50)', '[50-75)', '[75-100)'
'payer_code''BC', 'CH', 'CM', 'CP', 'DM', 'HM', 'MC', 'MD', 'MP', 'OG', 'OT', 'PO', 'SI', 'SP', 'UN', 'WC'
'medical_specialty''AllergyandImmunology', 'Anesthesiology', 'Anesthesiology-Pediatric', 'Cardiology', 'Cardiology-Pediatric', 'Dentistry', 'Dermatology', 'Emergency/Trauma', 'Endocrinology', 'Family/GeneralPractice', 'Gastroenterology', 'Gynecology', 'Hematology', 'Hematology/Oncology', 'Hospitalist', 'InfectiousDiseases', 'InternalMedicine', 'Nephrology', 'Neurology', 'Obsterics&Gynecology-GynecologicOnco', 'Obstetrics', 'ObstetricsandGynecology', 'Oncology', 'Ophthalmology', 'Orthopedics', 'Orthopedics-Reconstructive', 'Osteopath', 'Otolaryngology', 'OutreachServices', 'Pathology', 'Pediatrics', 'Pediatrics-AllergyandImmunology', 'Pediatrics-CriticalCare', 'Pediatrics-EmergencyMedicine', 'Pediatrics-Endocrinology', 'Pediatrics-Hematology-Oncology', 'Pediatrics-InfectiousDiseases', 'Pediatrics-Neurology', 'Pediatrics-Pulmonology', 'Perinatology', 'PhysicalMedicineandRehabilitation', 'PhysicianNotFound', 'Podiatry', 'Proctology', 'Psychiatry', 'Psychiatry-Addictive', 'Psychiatry-Child/Adolescent', 'Psychology', 'Pulmonology', 'Radiologist', 'Radiology', 'Rheumatology', 'Speech', 'SportsMedicine', 'Surgeon', 'Surgery-Cardiovascular', 'Surgery-Cardiovascular/Thoracic', 'Surgery-Colon&Rectal', 'Surgery-General', 'Surgery-Maxillofacial', 'Surgery-Neuro', 'Surgery-Pediatric', 'Surgery-Plastic', 'Surgery-PlasticwithinHeadandNeck', 'Surgery-Thoracic', 'Surgery-Vascular', 'SurgicalSpecialty', 'Urology', 'Neurophysiology'
'max_glu_serum''>200', '>300', 'None', 'Norm'
'A1Cresult''>7', '>8', 'None', 'Norm'
'metformin''Down', 'No', 'Steady', 'Up'
'repaglinide''Down', 'No', 'Steady', 'Up'
'nateglinide''Down', 'No', 'Steady', 'Up'
'chlorpropamide''Down', 'No', 'Steady', 'Up'
'glimepiride''Down', 'No', 'Steady', 'Up'
'acetohexamide''No', 'Steady'
'glipizide''Down', 'No', 'Steady', 'Up'
'glyburide''Down', 'No', 'Steady', 'Up'
'tolbutamide''No', 'Steady'
'pioglitazone''Down', 'No', 'Steady', 'Up'
'rosiglitazone''Down', 'No', 'Steady', 'Up'
'acarbose''Down', 'No', 'Steady', 'Up'
'miglitol''Down', 'No', 'Steady', 'Up'
'troglitazone''No', 'Steady'
'tolazamide''No', 'Steady', 'Up'
'examide''No'
'citoglipton''No'
'insulin''Down', 'No', 'Steady', 'Up'
'glyburide-metformin''Down', 'No', 'Steady', 'Up'
'glipizide-metformin''No', 'Steady'
'glimepiride-pioglitazone''No', 'Steady'
'metformin-rosiglitazone''No'
'metformin-pioglitazone''No'
'change''Ch', 'No'
'diabetesMed''No', 'Yes'
'readmitted''<30', '>30', 'NO'
\n", "
" ], "text/plain": [ " Values\n", "Domain \n", "'race' 'AfricanAmerican', 'Asian', 'Caucasian', 'Hispanic', 'Other' \n", "'gender' 'Female', 'Male', 'Unknown/Invalid' \n", "'age' '[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)', '[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)' \n", "'weight' '>200', '[0-25)', '[100-125)', '[125-150)', '[150-175)', '[175-200)', '[25-50)', '[50-75)', '[75-100)' \n", "'payer_code' 'BC', 'CH', 'CM', 'CP', 'DM', 'HM', 'MC', 'MD', 'MP', 'OG', 'OT', 'PO', 'SI', 'SP', 'UN', 'WC' \n", "'medical_specialty' 'AllergyandImmunology', 'Anesthesiology', 'Anesthesiology-Pediatric', 'Cardiology', 'Cardiology-Pediatric', 'Dentistry', 'Dermatology', 'Emergency/Trauma', 'Endocrinology', 'Family/GeneralPractice', 'Gastroenterology', 'Gynecology', 'Hematology', 'Hematology/Oncology', 'Hospitalist', 'InfectiousDiseases', 'InternalMedicine', 'Nephrology', 'Neurology', 'Obsterics&Gynecology-GynecologicOnco', 'Obstetrics', 'ObstetricsandGynecology', 'Oncology', 'Ophthalmology', 'Orthopedics', 'Orthopedics-Reconstructive', 'Osteopath', 'Otolaryngology', 'OutreachServices', 'Pathology', 'Pediatrics', 'Pediatrics-AllergyandImmunology', 'Pediatrics-CriticalCare', 'Pediatrics-EmergencyMedicine', 'Pediatrics-Endocrinology', 'Pediatrics-Hematology-Oncology', 'Pediatrics-InfectiousDiseases', 'Pediatrics-Neurology', 'Pediatrics-Pulmonology', 'Perinatology', 'PhysicalMedicineandRehabilitation', 'PhysicianNotFound', 'Podiatry', 'Proctology', 'Psychiatry', 'Psychiatry-Addictive', 'Psychiatry-Child/Adolescent', 'Psychology', 'Pulmonology', 'Radiologist', 'Radiology', 'Rheumatology', 'Speech', 'SportsMedicine', 'Surgeon', 'Surgery-Cardiovascular', 'Surgery-Cardiovascular/Thoracic', 'Surgery-Colon&Rectal', 'Surgery-General', 'Surgery-Maxillofacial', 'Surgery-Neuro', 'Surgery-Pediatric', 'Surgery-Plastic', 'Surgery-PlasticwithinHeadandNeck', 'Surgery-Thoracic', 'Surgery-Vascular', 'SurgicalSpecialty', 'Urology', 'Neurophysiology'\n", "'max_glu_serum' '>200', 
'>300', 'None', 'Norm' \n", "'A1Cresult' '>7', '>8', 'None', 'Norm' \n", "'metformin' 'Down', 'No', 'Steady', 'Up' \n", "'repaglinide' 'Down', 'No', 'Steady', 'Up' \n", "'nateglinide' 'Down', 'No', 'Steady', 'Up' \n", "'chlorpropamide' 'Down', 'No', 'Steady', 'Up' \n", "'glimepiride' 'Down', 'No', 'Steady', 'Up' \n", "'acetohexamide' 'No', 'Steady' \n", "'glipizide' 'Down', 'No', 'Steady', 'Up' \n", "'glyburide' 'Down', 'No', 'Steady', 'Up' \n", "'tolbutamide' 'No', 'Steady' \n", "'pioglitazone' 'Down', 'No', 'Steady', 'Up' \n", "'rosiglitazone' 'Down', 'No', 'Steady', 'Up' \n", "'acarbose' 'Down', 'No', 'Steady', 'Up' \n", "'miglitol' 'Down', 'No', 'Steady', 'Up' \n", "'troglitazone' 'No', 'Steady' \n", "'tolazamide' 'No', 'Steady', 'Up' \n", "'examide' 'No' \n", "'citoglipton' 'No' \n", "'insulin' 'Down', 'No', 'Steady', 'Up' \n", "'glyburide-metformin' 'Down', 'No', 'Steady', 'Up' \n", "'glipizide-metformin' 'No', 'Steady' \n", "'glimepiride-pioglitazone' 'No', 'Steady' \n", "'metformin-rosiglitazone' 'No' \n", "'metformin-pioglitazone' 'No' \n", "'change' 'Ch', 'No' \n", "'diabetesMed' 'No', 'Yes' \n", "'readmitted' '<30', '>30', 'NO' " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tfdv.display_schema(schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Towards the bottom of the Domain-Values pairs of the cell above, you can see that many features (including **'metformin'**) have the same values: `['Down', 'No', 'Steady', 'Up']`. These values are common to many features including the ones with missing values during schema inference. \n", "\n", "TFDV allows you to modify the domains of some features to match an existing domain. To address the detected anomaly, you can **set the domain** of these features to the domain of the `metformin` feature.\n", "\n", "Complete the function below to set the domain of a feature list to an existing feature domain. 
\n", "\n", "For this, use the [`tfdv.set_domain()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/set_domain) function, which has the following parameters:\n", "\n", "- `schema`: The schema\n", "\n", "\n", "- `feature_path`: The name of the feature whose domain needs to be set.\n", "\n", "\n", "- `domain`: A domain protocol buffer or the name of a global string domain present in the input schema." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "id": "1ACLvPPTkZfp" }, "outputs": [], "source": [ "def modify_domain_of_features(features_list, schema, to_domain_name):\n", "    '''\n", "    Modify a list of features' domains.\n", "\n", "    Parameters:\n", "        features_list : Features that need to be modified\n", "        schema: Inferred schema\n", "        to_domain_name : Target domain to be transferred to the features list\n", "\n", "    Returns:\n", "        schema: new schema\n", "    '''\n", "    ### START CODE HERE\n", "    # HINT: Loop over the feature list and use set_domain with the inferred schema, feature name and target domain name\n", "    for feature in features_list:\n", "        tfdv.set_domain(schema, feature, to_domain_name)\n", "    ### END CODE HERE\n", "    return schema" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using this function, set the domain of the features defined in the `domain_change_features` list below to be equal to **metformin's domain** to address the anomalies found.\n", "\n", "**Since you are overriding the existing domain of the features, it is normal to get a warning so you don't do this by accident.**" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "id": "4_jNanzjfeS-" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Replacing existing domain of feature \"repaglinide\".\n", "WARNING:root:Replacing existing domain of feature \"nateglinide\".\n", "WARNING:root:Replacing existing domain of feature \"chlorpropamide\".\n", "WARNING:root:Replacing existing domain of feature 
\"glimepiride\".\n", "WARNING:root:Replacing existing domain of feature \"acetohexamide\".\n", "WARNING:root:Replacing existing domain of feature \"glipizide\".\n", "WARNING:root:Replacing existing domain of feature \"glyburide\".\n", "WARNING:root:Replacing existing domain of feature \"tolbutamide\".\n", "WARNING:root:Replacing existing domain of feature \"pioglitazone\".\n", "WARNING:root:Replacing existing domain of feature \"rosiglitazone\".\n", "WARNING:root:Replacing existing domain of feature \"acarbose\".\n", "WARNING:root:Replacing existing domain of feature \"miglitol\".\n", "WARNING:root:Replacing existing domain of feature \"troglitazone\".\n", "WARNING:root:Replacing existing domain of feature \"tolazamide\".\n", "WARNING:root:Replacing existing domain of feature \"examide\".\n", "WARNING:root:Replacing existing domain of feature \"citoglipton\".\n", "WARNING:root:Replacing existing domain of feature \"insulin\".\n", "WARNING:root:Replacing existing domain of feature \"glyburide-metformin\".\n", "WARNING:root:Replacing existing domain of feature \"glipizide-metformin\".\n", "WARNING:root:Replacing existing domain of feature \"glimepiride-pioglitazone\".\n", "WARNING:root:Replacing existing domain of feature \"metformin-rosiglitazone\".\n", "WARNING:root:Replacing existing domain of feature \"metformin-pioglitazone\".\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TypePresenceValencyDomain
Feature name
'race'STRINGoptionalsingle'race'
'gender'STRINGrequired'gender'
'age'STRINGrequired'age'
'weight'STRINGoptionalsingle'weight'
'admission_type_id'INTrequired-
'discharge_disposition_id'INTrequired-
'admission_source_id'INTrequired-
'time_in_hospital'INTrequired-
'payer_code'STRINGoptionalsingle'payer_code'
'medical_specialty'STRINGoptionalsingle'medical_specialty'
'num_lab_procedures'INTrequired-
'num_procedures'INTrequired-
'num_medications'INTrequired-
'number_outpatient'INTrequired-
'number_emergency'INTrequired-
'number_inpatient'INTrequired-
'diag_1'BYTESoptionalsingle-
'diag_2'BYTESoptionalsingle-
'diag_3'BYTESoptionalsingle-
'number_diagnoses'INTrequired-
'max_glu_serum'STRINGrequired'max_glu_serum'
'A1Cresult'STRINGrequired'A1Cresult'
'metformin'STRINGrequired'metformin'
'repaglinide'STRINGrequired'metformin'
'nateglinide'STRINGrequired'metformin'
'chlorpropamide'STRINGrequired'metformin'
'glimepiride'STRINGrequired'metformin'
'acetohexamide'STRINGrequired'metformin'
'glipizide'STRINGrequired'metformin'
'glyburide'STRINGrequired'metformin'
'tolbutamide'STRINGrequired'metformin'
'pioglitazone'STRINGrequired'metformin'
'rosiglitazone'STRINGrequired'metformin'
'acarbose'STRINGrequired'metformin'
'miglitol'STRINGrequired'metformin'
'troglitazone'STRINGrequired'metformin'
'tolazamide'STRINGrequired'metformin'
'examide'STRINGrequired'metformin'
'citoglipton'STRINGrequired'metformin'
'insulin'STRINGrequired'metformin'
'glyburide-metformin'STRINGrequired'metformin'
'glipizide-metformin'STRINGrequired'metformin'
'glimepiride-pioglitazone'STRINGrequired'metformin'
'metformin-rosiglitazone'STRINGrequired'metformin'
'metformin-pioglitazone'STRINGrequired'metformin'
'change'STRINGrequired'change'
'diabetesMed'STRINGrequired'diabetesMed'
'readmitted'STRINGrequired'readmitted'
\n", "
" ], "text/plain": [ " Type Presence Valency Domain\n", "Feature name \n", "'race' STRING optional single 'race' \n", "'gender' STRING required 'gender' \n", "'age' STRING required 'age' \n", "'weight' STRING optional single 'weight' \n", "'admission_type_id' INT required - \n", "'discharge_disposition_id' INT required - \n", "'admission_source_id' INT required - \n", "'time_in_hospital' INT required - \n", "'payer_code' STRING optional single 'payer_code' \n", "'medical_specialty' STRING optional single 'medical_specialty'\n", "'num_lab_procedures' INT required - \n", "'num_procedures' INT required - \n", "'num_medications' INT required - \n", "'number_outpatient' INT required - \n", "'number_emergency' INT required - \n", "'number_inpatient' INT required - \n", "'diag_1' BYTES optional single - \n", "'diag_2' BYTES optional single - \n", "'diag_3' BYTES optional single - \n", "'number_diagnoses' INT required - \n", "'max_glu_serum' STRING required 'max_glu_serum' \n", "'A1Cresult' STRING required 'A1Cresult' \n", "'metformin' STRING required 'metformin' \n", "'repaglinide' STRING required 'metformin' \n", "'nateglinide' STRING required 'metformin' \n", "'chlorpropamide' STRING required 'metformin' \n", "'glimepiride' STRING required 'metformin' \n", "'acetohexamide' STRING required 'metformin' \n", "'glipizide' STRING required 'metformin' \n", "'glyburide' STRING required 'metformin' \n", "'tolbutamide' STRING required 'metformin' \n", "'pioglitazone' STRING required 'metformin' \n", "'rosiglitazone' STRING required 'metformin' \n", "'acarbose' STRING required 'metformin' \n", "'miglitol' STRING required 'metformin' \n", "'troglitazone' STRING required 'metformin' \n", "'tolazamide' STRING required 'metformin' \n", "'examide' STRING required 'metformin' \n", "'citoglipton' STRING required 'metformin' \n", "'insulin' STRING required 'metformin' \n", "'glyburide-metformin' STRING required 'metformin' \n", "'glipizide-metformin' STRING required 'metformin' \n", 
"'glimepiride-pioglitazone' STRING required 'metformin' \n", "'metformin-rosiglitazone' STRING required 'metformin' \n", "'metformin-pioglitazone' STRING required 'metformin' \n", "'change' STRING required 'change' \n", "'diabetesMed' STRING required 'diabetesMed' \n", "'readmitted' STRING required 'readmitted' " ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Values
Domain
'race''AfricanAmerican', 'Asian', 'Caucasian', 'Hispanic', 'Other'
'gender''Female', 'Male', 'Unknown/Invalid'
'age''[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)', '[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)'
'weight''>200', '[0-25)', '[100-125)', '[125-150)', '[150-175)', '[175-200)', '[25-50)', '[50-75)', '[75-100)'
'payer_code''BC', 'CH', 'CM', 'CP', 'DM', 'HM', 'MC', 'MD', 'MP', 'OG', 'OT', 'PO', 'SI', 'SP', 'UN', 'WC'
'medical_specialty''AllergyandImmunology', 'Anesthesiology', 'Anesthesiology-Pediatric', 'Cardiology', 'Cardiology-Pediatric', 'Dentistry', 'Dermatology', 'Emergency/Trauma', 'Endocrinology', 'Family/GeneralPractice', 'Gastroenterology', 'Gynecology', 'Hematology', 'Hematology/Oncology', 'Hospitalist', 'InfectiousDiseases', 'InternalMedicine', 'Nephrology', 'Neurology', 'Obsterics&Gynecology-GynecologicOnco', 'Obstetrics', 'ObstetricsandGynecology', 'Oncology', 'Ophthalmology', 'Orthopedics', 'Orthopedics-Reconstructive', 'Osteopath', 'Otolaryngology', 'OutreachServices', 'Pathology', 'Pediatrics', 'Pediatrics-AllergyandImmunology', 'Pediatrics-CriticalCare', 'Pediatrics-EmergencyMedicine', 'Pediatrics-Endocrinology', 'Pediatrics-Hematology-Oncology', 'Pediatrics-InfectiousDiseases', 'Pediatrics-Neurology', 'Pediatrics-Pulmonology', 'Perinatology', 'PhysicalMedicineandRehabilitation', 'PhysicianNotFound', 'Podiatry', 'Proctology', 'Psychiatry', 'Psychiatry-Addictive', 'Psychiatry-Child/Adolescent', 'Psychology', 'Pulmonology', 'Radiologist', 'Radiology', 'Rheumatology', 'Speech', 'SportsMedicine', 'Surgeon', 'Surgery-Cardiovascular', 'Surgery-Cardiovascular/Thoracic', 'Surgery-Colon&Rectal', 'Surgery-General', 'Surgery-Maxillofacial', 'Surgery-Neuro', 'Surgery-Pediatric', 'Surgery-Plastic', 'Surgery-PlasticwithinHeadandNeck', 'Surgery-Thoracic', 'Surgery-Vascular', 'SurgicalSpecialty', 'Urology', 'Neurophysiology'
'max_glu_serum''>200', '>300', 'None', 'Norm'
'A1Cresult''>7', '>8', 'None', 'Norm'
'metformin''Down', 'No', 'Steady', 'Up'
'repaglinide''Down', 'No', 'Steady', 'Up'
'nateglinide''Down', 'No', 'Steady', 'Up'
'chlorpropamide''Down', 'No', 'Steady', 'Up'
'glimepiride''Down', 'No', 'Steady', 'Up'
'acetohexamide''No', 'Steady'
'glipizide''Down', 'No', 'Steady', 'Up'
'glyburide''Down', 'No', 'Steady', 'Up'
'tolbutamide''No', 'Steady'
'pioglitazone''Down', 'No', 'Steady', 'Up'
'rosiglitazone''Down', 'No', 'Steady', 'Up'
'acarbose''Down', 'No', 'Steady', 'Up'
'miglitol''Down', 'No', 'Steady', 'Up'
'troglitazone''No', 'Steady'
'tolazamide''No', 'Steady', 'Up'
'examide''No'
'citoglipton''No'
'insulin''Down', 'No', 'Steady', 'Up'
'glyburide-metformin''Down', 'No', 'Steady', 'Up'
'glipizide-metformin''No', 'Steady'
'glimepiride-pioglitazone''No', 'Steady'
'metformin-rosiglitazone''No'
'metformin-pioglitazone''No'
'change''Ch', 'No'
'diabetesMed''No', 'Yes'
'readmitted''<30', '>30', 'NO'
\n", "
" ], "text/plain": [ " Values\n", "Domain \n", "'race' 'AfricanAmerican', 'Asian', 'Caucasian', 'Hispanic', 'Other' \n", "'gender' 'Female', 'Male', 'Unknown/Invalid' \n", "'age' '[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)', '[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)' \n", "'weight' '>200', '[0-25)', '[100-125)', '[125-150)', '[150-175)', '[175-200)', '[25-50)', '[50-75)', '[75-100)' \n", "'payer_code' 'BC', 'CH', 'CM', 'CP', 'DM', 'HM', 'MC', 'MD', 'MP', 'OG', 'OT', 'PO', 'SI', 'SP', 'UN', 'WC' \n", "'medical_specialty' 'AllergyandImmunology', 'Anesthesiology', 'Anesthesiology-Pediatric', 'Cardiology', 'Cardiology-Pediatric', 'Dentistry', 'Dermatology', 'Emergency/Trauma', 'Endocrinology', 'Family/GeneralPractice', 'Gastroenterology', 'Gynecology', 'Hematology', 'Hematology/Oncology', 'Hospitalist', 'InfectiousDiseases', 'InternalMedicine', 'Nephrology', 'Neurology', 'Obsterics&Gynecology-GynecologicOnco', 'Obstetrics', 'ObstetricsandGynecology', 'Oncology', 'Ophthalmology', 'Orthopedics', 'Orthopedics-Reconstructive', 'Osteopath', 'Otolaryngology', 'OutreachServices', 'Pathology', 'Pediatrics', 'Pediatrics-AllergyandImmunology', 'Pediatrics-CriticalCare', 'Pediatrics-EmergencyMedicine', 'Pediatrics-Endocrinology', 'Pediatrics-Hematology-Oncology', 'Pediatrics-InfectiousDiseases', 'Pediatrics-Neurology', 'Pediatrics-Pulmonology', 'Perinatology', 'PhysicalMedicineandRehabilitation', 'PhysicianNotFound', 'Podiatry', 'Proctology', 'Psychiatry', 'Psychiatry-Addictive', 'Psychiatry-Child/Adolescent', 'Psychology', 'Pulmonology', 'Radiologist', 'Radiology', 'Rheumatology', 'Speech', 'SportsMedicine', 'Surgeon', 'Surgery-Cardiovascular', 'Surgery-Cardiovascular/Thoracic', 'Surgery-Colon&Rectal', 'Surgery-General', 'Surgery-Maxillofacial', 'Surgery-Neuro', 'Surgery-Pediatric', 'Surgery-Plastic', 'Surgery-PlasticwithinHeadandNeck', 'Surgery-Thoracic', 'Surgery-Vascular', 'SurgicalSpecialty', 'Urology', 'Neurophysiology'\n", "'max_glu_serum' '>200', 
'>300', 'None', 'Norm' \n", "'A1Cresult' '>7', '>8', 'None', 'Norm' \n", "'metformin' 'Down', 'No', 'Steady', 'Up' \n", "'repaglinide' 'Down', 'No', 'Steady', 'Up' \n", "'nateglinide' 'Down', 'No', 'Steady', 'Up' \n", "'chlorpropamide' 'Down', 'No', 'Steady', 'Up' \n", "'glimepiride' 'Down', 'No', 'Steady', 'Up' \n", "'acetohexamide' 'No', 'Steady' \n", "'glipizide' 'Down', 'No', 'Steady', 'Up' \n", "'glyburide' 'Down', 'No', 'Steady', 'Up' \n", "'tolbutamide' 'No', 'Steady' \n", "'pioglitazone' 'Down', 'No', 'Steady', 'Up' \n", "'rosiglitazone' 'Down', 'No', 'Steady', 'Up' \n", "'acarbose' 'Down', 'No', 'Steady', 'Up' \n", "'miglitol' 'Down', 'No', 'Steady', 'Up' \n", "'troglitazone' 'No', 'Steady' \n", "'tolazamide' 'No', 'Steady', 'Up' \n", "'examide' 'No' \n", "'citoglipton' 'No' \n", "'insulin' 'Down', 'No', 'Steady', 'Up' \n", "'glyburide-metformin' 'Down', 'No', 'Steady', 'Up' \n", "'glipizide-metformin' 'No', 'Steady' \n", "'glimepiride-pioglitazone' 'No', 'Steady' \n", "'metformin-rosiglitazone' 'No' \n", "'metformin-pioglitazone' 'No' \n", "'change' 'Ch', 'No' \n", "'diabetesMed' 'No', 'Yes' \n", "'readmitted' '<30', '>30', 'NO' " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "domain_change_features = ['repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', \n", " 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', \n", " 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', \n", " 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', \n", " 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone']\n", "\n", "\n", "# Infer new schema by using your modify_domain_of_features function \n", "# and the defined domain_change_features feature list\n", "schema = modify_domain_of_features(domain_change_features, schema, 'metformin')\n", "\n", "# Display new schema\n", "tfdv.display_schema(schema)" ] }, { "cell_type": "code", "execution_count": 
27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Domain name of 'chlorpropamide': metformin\n", "Domain values of 'chlorpropamide': ['Down', 'No', 'Steady', 'Up']\n", "Domain name of 'repaglinide': metformin\n", "Domain values of 'repaglinide': ['Down', 'No', 'Steady', 'Up']\n", "Domain name of 'nateglinide': metformin\n", "Domain values of 'nateglinide': ['Down', 'No', 'Steady', 'Up']\n" ] } ], "source": [ "# TEST CODE\n", "\n", "# Check that the domains of some features are now switched to `metformin`\n", "print(f\"Domain name of 'chlorpropamide': {tfdv.get_feature(schema, 'chlorpropamide').domain}\")\n", "print(f\"Domain values of 'chlorpropamide': {tfdv.get_domain(schema, 'chlorpropamide').value}\")\n", "print(f\"Domain name of 'repaglinide': {tfdv.get_feature(schema, 'repaglinide').domain}\")\n", "print(f\"Domain values of 'repaglinide': {tfdv.get_domain(schema, 'repaglinide').value}\")\n", "print(f\"Domain name of 'nateglinide': {tfdv.get_feature(schema, 'nateglinide').domain}\")\n", "print(f\"Domain values of 'nateglinide': {tfdv.get_domain(schema, 'nateglinide').value}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Expected Output:**\n", "\n", "```\n", "Domain name of 'chlorpropamide': metformin\n", "Domain values of 'chlorpropamide': ['Down', 'No', 'Steady', 'Up']\n", "Domain name of 'repaglinide': metformin\n", "Domain values of 'repaglinide': ['Down', 'No', 'Steady', 'Up']\n", "Domain name of 'nateglinide': metformin\n", "Domain values of 'nateglinide': ['Down', 'No', 'Steady', 'Up']\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do a final anomaly check to see whether this solved the issue." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
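<table>
<thead><tr><th>Feature name</th><th>Anomaly short description</th><th>Anomaly long description</th></tr></thead>
<tbody><tr><th>'readmitted'</th><td>Column dropped</td><td>Column is completely missing</td></tr></tbody>
</table>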
\n", "
" ], "text/plain": [ " Anomaly short description Anomaly long description\n", "Feature name \n", "'readmitted' Column dropped Column is completely missing" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "calculate_and_display_anomalies(serving_stats, schema=schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should now see the `metformin-pioglitazone` and `metformin-rosiglitazone` features dropped from the output anomalies." ] }, { "cell_type": "markdown", "metadata": { "id": "bJjh5rigc5xy" }, "source": [ "\n", "### Exercise 9: Detecting anomalies with environments\n", "\n", "There is still one thing to address. The `readmitted` feature (which is the label column) showed up as an anomaly ('Column dropped'). Since labels are not expected in the serving data, let's tell TFDV to ignore this detected anomaly.\n", "\n", "This requirement of introducing slight schema variations can be expressed by using [environments](https://www.tensorflow.org/tfx/data_validation/get_started#schema_environments). In particular, features in the schema can be associated with a set of environments using `default_environment`, `in_environment` and `not_in_environment`." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# All features are by default in both TRAINING and SERVING environments.\n", "schema.default_environment.append('TRAINING')\n", "schema.default_environment.append('SERVING')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Complete the code below to exclude the `readmitted` feature from the `SERVING` environment.\n", "\n", "To achieve this, you can use the [`tfdv.get_feature()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/get_feature) function to get the `readmitted` feature from the inferred schema and use its `not_in_environment` attribute to specify that `readmitted` should be removed from the `SERVING` environment's schema. 
This **attribute is a list** so you will have to **append** the name of the environment from which you wish to omit this feature.\n", "\n", "To be more explicit, given a feature you can do something like:\n", "\n", "```python\n", "feature.not_in_environment.append('NAME_OF_ENVIRONMENT')\n", "```\n", "\n", "The function `tfdv.get_feature` receives the following parameters:\n", "\n", "- `schema`: The schema.\n", "- `feature_path`: The path of the feature to obtain from the schema. In this case this is equal to the name of the feature." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "id": "bnbnw8H6Lp2M" }, "outputs": [], "source": [ "### START CODE HERE\n", "# Specify that 'readmitted' feature is not in SERVING environment.\n", "# HINT: Append the 'SERVING' environment to the not_in_environment attribute of the feature\n", "tfdv.get_feature(schema, 'readmitted').not_in_environment.append('SERVING')\n", "\n", "# HINT: Calculate anomalies with the validate_statistics function by using the serving statistics, \n", "# inferred schema and the SERVING environment parameter.\n", "serving_anomalies_with_env = tfdv.validate_statistics(serving_stats, schema, environment='SERVING')\n", "### END CODE HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see \"No anomalies found\" by running the cell below." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

No anomalies found.

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Display anomalies\n", "tfdv.display_anomalies(serving_anomalies_with_env)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you have successfully addressed all anomaly-related issues!" ] }, { "cell_type": "markdown", "metadata": { "id": "yteMr3AGMYEp" }, "source": [ "\n", "## 7 - Check for Data Drift and Skew" ] }, { "cell_type": "markdown", "metadata": { "id": "Foe3aT1OcePh" }, "source": [ "During data validation, you also need to check for data skew (between the training and serving data) and data drift (between successive spans of data). You can do this by specifying the [skew_comparator and drift_comparator](https://www.tensorflow.org/tfx/data_validation/get_started#checking_data_skew_and_drift) in the schema. \n", "\n", "Drift and skew are expressed in terms of [L-infinity distance](https://en.wikipedia.org/wiki/Chebyshev_distance), which measures the difference between two vectors as the greatest of their differences along any coordinate dimension.\n", "\n", "You can set the threshold distance so that you receive warnings when the skew or drift is higher than is acceptable. Setting the correct distance is typically an iterative process requiring domain knowledge and experimentation.\n", "\n", "Let's check for the skew in the **diabetesMed** feature and drift in the **payer_code** feature." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "id": "wEUsZm_rOd1Q" }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table>
<thead><tr><th>Feature name</th><th>Anomaly short description</th><th>Anomaly long description</th></tr></thead>
<tbody>
<tr><th>'diabetesMed'</th><td>High Linfty distance between training and serving</td><td>The Linfty distance between training and serving is 0.0325464 (up to six significant digits), above the threshold 0.03. The feature value with maximum difference is: No</td></tr>
<tr><th>'payer_code'</th><td>High Linfty distance between current and previous</td><td>The Linfty distance between current and previous is 0.0342144 (up to six significant digits), above the threshold 0.03. The feature value with maximum difference is: MC</td></tr>
</tbody>
</table>
\n", "
" ], "text/plain": [ " Anomaly short description \\\n", "Feature name \n", "'diabetesMed' High Linfty distance between training and serving \n", "'payer_code' High Linfty distance between current and previous \n", "\n", " Anomaly long description \n", "Feature name \n", "'diabetesMed' The Linfty distance between training and serving is 0.0325464 (up to six significant digits), above the threshold 0.03. The feature value with maximum difference is: No \n", "'payer_code' The Linfty distance between current and previous is 0.0342144 (up to six significant digits), above the threshold 0.03. The feature value with maximum difference is: MC " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Calculate skew for the diabetesMed feature\n", "diabetes_med = tfdv.get_feature(schema, 'diabetesMed')\n", "diabetes_med.skew_comparator.infinity_norm.threshold = 0.03 # domain knowledge helps to determine this threshold\n", "\n", "# Calculate drift for the payer_code feature\n", "payer_code = tfdv.get_feature(schema, 'payer_code')\n", "payer_code.drift_comparator.infinity_norm.threshold = 0.03 # domain knowledge helps to determine this threshold\n", "\n", "# Calculate anomalies\n", "skew_drift_anomalies = tfdv.validate_statistics(train_stats, schema,\n", " previous_statistics=eval_stats,\n", " serving_statistics=serving_stats)\n", "\n", "# Display anomalies\n", "tfdv.display_anomalies(skew_drift_anomalies)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In both of these cases, the detected anomaly distance is not too far from the threshold value of `0.03`. For this exercise, let's accept this as within bounds (i.e. 
you can set the threshold to something like `0.035` instead).\n", "\n", "**However, if the anomaly truly indicates skew or drift, then further investigation is necessary as this could have a direct impact on model performance.**" ] }, { "cell_type": "markdown", "metadata": { "id": "1ikCxNUlUa2u" }, "source": [ "\n", "## 8 - Display Stats for Data Slices " ] }, { "cell_type": "markdown", "metadata": { "id": "n4aqhj8jUiFs" }, "source": [ "Finally, you can [slice the dataset and calculate the statistics](https://www.tensorflow.org/tfx/data_validation/get_started#computing_statistics_over_slices_of_data) for each unique value of a feature. By default, TFDV computes statistics for the overall dataset in addition to the configured slices. Each slice is identified by a unique name which is set as the dataset name in the [DatasetFeatureStatistics](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/statistics.proto#L43) protocol buffer. Generating and displaying statistics over different slices of data can help track model and anomaly metrics. \n", "\n", "Let's first define a few helper functions to make the code in the exercise neater." 
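, "\n", "\n", "
Before the helpers, here is a toy sketch of what slicing by a feature value means, written in plain Python rather than with TFDV. The records, the `mean_by_slice` name and the slice-name format are all made up for illustration; the `All Examples` label simply follows the name this notebook uses for the overall dataset, and real TFDV statistics are far richer than a single mean.

```python
from collections import defaultdict
from statistics import mean

# Toy records standing in for rows of the diabetes dataframe (values are made up).
rows = [
    {'medical_specialty': 'Cardiology', 'time_in_hospital': 3},
    {'medical_specialty': 'Cardiology', 'time_in_hospital': 5},
    {'medical_specialty': 'Urology', 'time_in_hospital': 2},
]


def mean_by_slice(records, slice_key, value_key):
    '''Return the mean of value_key over all records and over each slice_key value.'''
    groups = defaultdict(list)
    for record in records:
        # Every record counts towards the overall dataset...
        groups['All Examples'].append(record[value_key])
        # ...and towards the one slice named after its feature value.
        groups[f'{slice_key}_{record[slice_key]}'].append(record[value_key])
    return {name: mean(values) for name, values in groups.items()}


slice_means = mean_by_slice(rows, 'medical_specialty', 'time_in_hospital')
for name, value in slice_means.items():
    print(name, value)
```

Each record contributes to the overall group and to exactly one slice, which mirrors the idea of overall statistics being reported alongside the configured slices.
"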
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "def split_datasets(dataset_list):\n", "    '''\n", "    Split a statistics list into one single-dataset list per slice.\n", "\n", "    Parameters:\n", "        dataset_list: DatasetFeatureStatisticsList holding the statistics of all slices\n", "\n", "    Returns:\n", "        datasets: list of DatasetFeatureStatisticsList protos, one per slice\n", "    '''\n", "    datasets = []\n", "    for dataset in dataset_list.datasets:\n", "        proto_list = DatasetFeatureStatisticsList()\n", "        proto_list.datasets.extend([dataset])\n", "        datasets.append(proto_list)\n", "    return datasets\n", "\n", "\n", "def display_stats_at_index(index, datasets):\n", "    '''\n", "    Display the statistics of the slice at the specified index.\n", "\n", "    Parameters:\n", "        index: index of the slice to display\n", "        datasets: list of per-slice statistics returned by split_datasets\n", "\n", "    Returns:\n", "        None; prints the slice name and renders its statistics\n", "    '''\n", "    if index < len(datasets):\n", "        print(datasets[index].datasets[0].name)\n", "        tfdv.visualize_statistics(datasets[index])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function below returns a list of `DatasetFeatureStatisticsList` protocol buffers. As shown in the ungraded lab, the first one covers `All Examples`, followed by one entry per slice of the feature you specified.\n", "\n", "To configure TFDV to generate statistics for dataset slices, you will use the function `tfdv.StatsOptions()` with the following four arguments:\n", "\n", "- `schema`\n", "\n", "\n", "- `slice_functions` passed as a list.\n", "\n", "\n", "- `infer_type_from_schema` set to `True`.\n", "\n", "\n", "- `feature_whitelist` set to the approved features.\n", "\n", "\n", "Remember that `slice_functions` only works with [`generate_statistics_from_csv()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_csv), so you will need to convert the dataframe to CSV."
] }, { "cell_type": "code", "execution_count": 34, "metadata": { "id": "5S9EOdt1UbMF" }, "outputs": [], "source": [ "def sliced_stats_for_slice_fn(slice_fn, approved_cols, dataframe, schema):\n", "    '''\n", "    generate statistics for the sliced data.\n", "\n", "    Parameters:\n", "        slice_fn : slicing definition\n", "        approved_cols: list of features to pass to the statistics options\n", "        dataframe: pandas dataframe to slice\n", "        schema: the schema\n", "\n", "    Returns:\n", "        slice_info_datasets: statistics for the sliced dataset\n", "    '''\n", "    # Set the StatsOptions\n", "    slice_stats_options = tfdv.StatsOptions(schema=schema,\n", "                                            slice_functions=[slice_fn],\n", "                                            infer_type_from_schema=True,\n", "                                            feature_whitelist=approved_cols)\n", "\n", "    # Convert Dataframe to CSV since `slice_functions` works only with `tfdv.generate_statistics_from_csv`\n", "    CSV_PATH = 'slice_sample.csv'\n", "    dataframe.to_csv(CSV_PATH)\n", "\n", "    # Calculate statistics for the sliced dataset\n", "    sliced_stats = tfdv.generate_statistics_from_csv(CSV_PATH, stats_options=slice_stats_options)\n", "\n", "    # Split the dataset using the previously defined split_datasets function\n", "    slice_info_datasets = split_datasets(sliced_stats)\n", "\n", "    return slice_info_datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With that, you can now use the helper functions to generate and visualize statistics for the sliced datasets."
] }, { "cell_type": "code", "execution_count": 35, "metadata": { "id": "F1_tkHX_UxwJ" }, "outputs": [ { "data": { "application/javascript": [ "\n", " if (typeof window.interactive_beam_jquery == 'undefined') {\n", " var jqueryScript = document.createElement('script');\n", " jqueryScript.src = 'https://code.jquery.com/jquery-3.4.1.slim.min.js';\n", " jqueryScript.type = 'text/javascript';\n", " jqueryScript.onload = function() {\n", " var datatableScript = document.createElement('script');\n", " datatableScript.src = 'https://cdn.datatables.net/1.10.20/js/jquery.dataTables.min.js';\n", " datatableScript.type = 'text/javascript';\n", " datatableScript.onload = function() {\n", " window.interactive_beam_jquery = jQuery.noConflict(true);\n", " window.interactive_beam_jquery(document).ready(function($){\n", " \n", " });\n", " }\n", " document.head.appendChild(datatableScript);\n", " };\n", " document.head.appendChild(jqueryScript);\n", " } else {\n", " window.interactive_beam_jquery(document).ready(function($){\n", " \n", " });\n", " }" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Statistics generated for:\n", "\n", "All Examples\n", "medical_specialty_Orthopedics\n", "medical_specialty_InternalMedicine\n", "medical_specialty_Cardiology\n", "medical_specialty_Family/GeneralPractice\n", "medical_specialty_Surgery-General\n", "medical_specialty_Emergency/Trauma\n", "medical_specialty_Nephrology\n", "medical_specialty_Surgery-Neuro\n", "medical_specialty_Oncology\n", "medical_specialty_Gastroenterology\n", "medical_specialty_Orthopedics-Reconstructive\n", "medical_specialty_ObstetricsandGynecology\n", "medical_specialty_Surgery-Cardiovascular/Thoracic\n", "medical_specialty_Radiologist\n", "medical_specialty_Urology\n", "medical_specialty_Surgery-Vascular\n", "medical_specialty_Hematology/Oncology\n", "medical_specialty_Neurology\n", "medical_specialty_Psychology\n", "medical_specialty_Psychiatry\n", 
"medical_specialty_PhysicalMedicineandRehabilitation\n", "medical_specialty_Pulmonology\n", "medical_specialty_Otolaryngology\n", "medical_specialty_Obsterics&Gynecology-GynecologicOnco\n", "medical_specialty_Endocrinology\n", "medical_specialty_Anesthesiology\n", "medical_specialty_Pediatrics-Endocrinology\n", "medical_specialty_Radiology\n", "medical_specialty_Pediatrics\n", "medical_specialty_Pediatrics-Pulmonology\n", "medical_specialty_Osteopath\n", "medical_specialty_Surgery-Plastic\n", "medical_specialty_Podiatry\n", "medical_specialty_Surgery-Thoracic\n", "medical_specialty_Rheumatology\n", "medical_specialty_Obstetrics\n", "medical_specialty_Pediatrics-AllergyandImmunology\n", "medical_specialty_Surgery-Cardiovascular\n", "medical_specialty_Anesthesiology-Pediatric\n", "medical_specialty_Pathology\n", "medical_specialty_Pediatrics-CriticalCare\n", "medical_specialty_PhysicianNotFound\n", "medical_specialty_Gynecology\n", "medical_specialty_AllergyandImmunology\n", "medical_specialty_Surgery-Maxillofacial\n", "medical_specialty_Hospitalist\n", "medical_specialty_Hematology\n", "medical_specialty_Surgeon\n", "medical_specialty_Proctology\n", "medical_specialty_InfectiousDiseases\n", "medical_specialty_Psychiatry-Child/Adolescent\n", "medical_specialty_SurgicalSpecialty\n", "medical_specialty_Ophthalmology\n", "medical_specialty_Surgery-Pediatric\n", "medical_specialty_Pediatrics-Neurology\n", "medical_specialty_Surgery-PlasticwithinHeadandNeck\n", "medical_specialty_OutreachServices\n", "medical_specialty_Pediatrics-Hematology-Oncology\n", "medical_specialty_Dentistry\n", "medical_specialty_Pediatrics-EmergencyMedicine\n", "medical_specialty_Psychiatry-Addictive\n", "medical_specialty_Surgery-Colon&Rectal\n", "medical_specialty_Pediatrics-InfectiousDiseases\n", "medical_specialty_Dermatology\n", "medical_specialty_Perinatology\n", "medical_specialty_SportsMedicine\n", "medical_specialty_Cardiology-Pediatric\n", "medical_specialty_Speech\n", 
"medical_specialty_Gastroenterology\n" ] }, { "data": { "text/html": [ "\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Generate slice function for the `medical_specialty` feature\n", "slice_fn = slicing_util.get_feature_value_slicer(features={'medical_specialty': None})\n", "\n", "# Generate stats for the sliced dataset\n", "slice_datasets = sliced_stats_for_slice_fn(slice_fn, approved_cols, dataframe=train_df, schema=schema)\n", "\n", "# Print name of slices for reference\n", "print('Statistics generated for:\\n')\n", "print('\\n'.join([sliced.datasets[0].name for sliced in slice_datasets]))\n", "\n", "# Display at index 10, which corresponds to the slice named `medical_specialty_Gastroenterology`\n", "display_stats_at_index(10, slice_datasets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are curious, try different slice indices to view the statistics of other groups. For instance, `index=5` corresponds to all `medical_specialty_Surgery-General` records. You can also try slicing through multiple features as shown in the ungraded lab.\n", "\n", "Another challenge is to implement your own helper functions. For instance, you can write a `display_stats_for_slice_name()` function so you don't have to determine the index of a slice. If done correctly, you can just call `display_stats_for_slice_name('medical_specialty_Gastroenterology', slice_datasets)` and it will generate the same result as `display_stats_at_index(10, slice_datasets)`." ] }, { "cell_type": "markdown", "metadata": { "id": "wJ5saC9eWvHx" }, "source": [ "\n", "## 9 - Freeze the Schema\n", "\n", "Now that the schema has been reviewed, you will store the schema in a file in its \"frozen\" state. 
This can be used to validate incoming data once your application goes live to your users.\n", "\n", "This is pretty straightforward using TensorFlow's `io` utils and TFDV's [`write_schema_text()`](https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/write_schema_text) function." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "id": "ydkL4DkIWn18" }, "outputs": [], "source": [ "# Create output directory\n", "OUTPUT_DIR = \"output\"\n", "file_io.recursive_create_dir(OUTPUT_DIR)\n", "\n", "# Use TensorFlow text output format pbtxt to store the schema\n", "schema_file = os.path.join(OUTPUT_DIR, 'schema.pbtxt')\n", "\n", "# write_schema_text expects the schema and the output path as parameters\n", "tfdv.write_schema_text(schema, schema_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After submitting this assignment, you can click the Jupyter logo in the upper left corner of the screen to check the Jupyter filesystem. The `schema.pbtxt` file should be inside the `output` directory.\n", "\n", "**Congratulations on finishing this week's assignment!** A lot of concepts were introduced, and you should now feel more familiar with using TFDV for inferring schemas, detecting anomalies, and other data-related tasks.\n", "\n", "**Keep it up!**" ] } ], "metadata": { "coursera": { "schema_names": [ "MLEPC2W1-1", "MLEPC2W1-2", "MLEPC2W1-3", "MLEPC2W1-4", "MLEPC2W1-5", "MLEPC2W1-6", "MLEPC2W1-7", "MLEPC2W1-8", "MLEPC2W1-9" ] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }