{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
">### 🚩 *Create a free WhyLabs account to get more value out of whylogs!*\n",
">*Did you know you can store, visualize, and monitor whylogs profiles with the [WhyLabs Observability Platform](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=employee)? Sign up for a [free WhyLabs account](https://whylabs.ai/whylogs-free-signup?utm_source=whylogs-Github&utm_medium=whylogs-example&utm_campaign=employee) to leverage the power of whylogs and WhyLabs together!*"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Employee Dataset - Usage Example"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/datasets/employee.ipynb)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This is an example demonstrating the usage of the Employee Dataset.\n",
"\n",
"For more information about the dataset itself, see the documentation at:\n",
"https://whylogs.readthedocs.io/en/latest/datasets/employee.html"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installing the datasets module"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Note: you may need to restart the kernel to use updated packages.\n",
"%pip install 'whylogs[datasets]'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading the Dataset\n",
"\n",
"You can load the dataset of your choice by calling it from the `datasets` module:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from whylogs.datasets import Employee\n",
"\n",
"dataset = Employee(version=\"base\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If no `version` parameter is passed, the default version is `base`."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This will create a folder named `whylogs_data` in the current directory, containing the CSV files for the Employee Dataset. If the files already exist, the module will not download them again."
]
},
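{
"cell_type": "markdown",
"metadata": {},
"source": [
"The skip-if-present behaviour described above can be sketched in plain Python. This is a minimal illustration, not whylogs' actual implementation: `fetch_if_missing` and `fake_download` are hypothetical names, and the download step is replaced by a stub that records each call."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tempfile\n",
"from pathlib import Path\n",
"\n",
"\n",
"def fetch_if_missing(filename, folder, download):\n",
"    # Return the local path for filename, calling download() only if the file is absent.\n",
"    target = Path(folder) / filename\n",
"    if not target.exists():\n",
"        target.parent.mkdir(parents=True, exist_ok=True)\n",
"        target.write_text(download(filename))\n",
"    return target\n",
"\n",
"\n",
"calls = []\n",
"\n",
"\n",
"def fake_download(name):\n",
"    # Hypothetical stand-in for a network request; records each call.\n",
"    calls.append(name)\n",
"    return 'employee_id,salary'\n",
"\n",
"\n",
"with tempfile.TemporaryDirectory() as tmp:\n",
"    folder = Path(tmp) / 'whylogs_data'\n",
"    first = fetch_if_missing('baseline.csv', folder, fake_download)\n",
"    second = fetch_if_missing('baseline.csv', folder, fake_download)\n",
"    same_path = first == second\n",
"\n",
"print(len(calls))  # 1: the second call found the file and skipped the download"
]
},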
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Discovering Information"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see which versions are available for a given dataset, call:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('base',)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Employee.describe_versions()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get an overall description of the dataset:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Employee Dataset\n",
"================\n",
"\n",
"The employee dataset contains annual salary information for employees of an american County. It contains features related to each employee, such as employee's department, gender, salary, and hiring date.\n",
"\n",
"The original data was sourced from the `employee_salaries` OpenML dataset, and can be found here: https://www.openml.org/d/42125. From the source data additional transformations were made, such as: data cleaning, feature creation and feature engineering.\n",
"\n",
"License:\n",
"CC0: Public Domain\n",
"\n",
"Usage\n",
"-----\n",
"\n",
"You can follow this guide to see how to use the ecommerce dataset:\n",
"\n",
".. toctree::\n",
" :maxdepth: 1\n",
"\n",
" ../examples/datasets/employee\n",
"\n",
"Versions and Data Partitions\n",
"----------------------------\n",
"\n",
"Currently the dataset contains one version: **base**. This dataset has no particular tasks defined, as it is aimed to explore data quality issues that are not necessarily related to ML.\n",
"The **base** version contains two partitions: **Baseline** and **Production**\n",
"\n",
"base\n"
]
}
],
"source": [
"print(Employee.describe()[:1000])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: the output is truncated to the first 1000 characters, since `describe()` prints a rather lengthy description."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting Baseline Data\n",
"\n",
"You can access data from two different partitions: the baseline dataset and production dataset.\n",
"\n",
"The baseline can be accessed as a whole, whereas the production dataset can be accessed in periodic batches, defined by the user.\n",
"\n",
"To get a `baseline` object, just call `dataset.get_baseline()`:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from whylogs.datasets import Employee\n",
"\n",
"dataset = Employee()\n",
"\n",
"baseline = dataset.get_baseline()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"`baseline` contains six attributes: one timestamp and five dataframes.\n",
"\n",
"- timestamp: the batch's timestamp (at the start)\n",
"- data: the complete dataframe\n",
"- features: input features\n",
"- target: output feature(s)\n",
"- prediction: output prediction and, possibly, features such as uncertainty, confidence, probability\n",
"- extra: metadata features that do not fit any of the previous categories, but still contain relevant information about the data.\n",
"\n",
"The Employee dataset is a non-ML dataset, so the `prediction` and `target` dataframes will be empty."
]
},
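{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough mental model of the attributes listed above, here is a minimal sketch in plain Python. `Batch` is a hypothetical stand-in, not the class whylogs returns. It uses lists of row dicts instead of dataframes, and the assignment of columns to `features` is an assumption made for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from dataclasses import dataclass, field\n",
"from datetime import datetime, timezone\n",
"\n",
"\n",
"@dataclass\n",
"class Batch:\n",
"    # Hypothetical stand-in for a batch object; not the whylogs class.\n",
"    timestamp: datetime\n",
"    data: list  # complete table as a list of row dicts\n",
"    feature_cols: list\n",
"    target_cols: list = field(default_factory=list)\n",
"    prediction_cols: list = field(default_factory=list)\n",
"    extra_cols: list = field(default_factory=list)\n",
"\n",
"    def _view(self, cols):\n",
"        # A view with no columns is empty, like an empty dataframe.\n",
"        return [{c: row[c] for c in cols} for row in self.data] if cols else []\n",
"\n",
"    @property\n",
"    def features(self):\n",
"        return self._view(self.feature_cols)\n",
"\n",
"    @property\n",
"    def target(self):\n",
"        return self._view(self.target_cols)\n",
"\n",
"    @property\n",
"    def prediction(self):\n",
"        return self._view(self.prediction_cols)\n",
"\n",
"    @property\n",
"    def extra(self):\n",
"        return self._view(self.extra_cols)\n",
"\n",
"\n",
"rows = [\n",
"    {'employee_id': 8894, 'department': 'POL', 'salary': 103506.00},\n",
"    {'employee_id': 6920, 'department': 'FRS', 'salary': 45261.00},\n",
"]\n",
"batch = Batch(\n",
"    timestamp=datetime(2023, 2, 16, tzinfo=timezone.utc),\n",
"    data=rows,\n",
"    feature_cols=['employee_id', 'department', 'salary'],\n",
")\n",
"print(batch.features == rows)  # True\n",
"print(batch.target, batch.prediction)  # [] []"
]
},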
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"datetime.datetime(2023, 2, 16, 0, 0, tzinfo=datetime.timezone.utc)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseline.timestamp"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>employee_id</th><th>gender</th><th>overtime_pay</th><th>department</th><th>position_title</th><th>date_first_hired</th><th>year_first_hired</th><th>salary</th><th>full_time</th><th>part_time</th><th>sector</th></tr>\n",
"    <tr><th>date</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>2023-02-16 00:00:00+00:00</th><td>8894</td><td>M</td><td>9136.78</td><td>POL</td><td>Police Sergeant</td><td>07/21/2003</td><td>2003</td><td>103506.00</td><td>1</td><td>0</td><td>Sector 3</td></tr>\n",
"    <tr><th>2023-02-16 00:00:00+00:00</th><td>6920</td><td>M</td><td>0.00</td><td>FRS</td><td>Firefighter/Rescuer III</td><td>12/12/2016</td><td>2016</td><td>45261.00</td><td>1</td><td>0</td><td>Sector 1</td></tr>\n",
"    <tr><th>2023-02-16 00:00:00+00:00</th><td>2265</td><td>F</td><td>0.00</td><td>LIB</td><td>Library Associate</td><td>06/27/1997</td><td>1997</td><td>25167.75</td><td>0</td><td>1</td><td>Sector 4</td></tr>\n",
"    <tr><th>2023-02-16 00:00:00+00:00</th><td>8790</td><td>M</td><td>0.00</td><td>OHR</td><td>Labor Relations Advisor</td><td>10/28/2001</td><td>2001</td><td>112899.00</td><td>1</td><td>0</td><td>Sector 3</td></tr>\n",
"    <tr><th>2023-02-16 00:00:00+00:00</th><td>7728</td><td>M</td><td>12516.95</td><td>DOT</td><td>Bus Operator</td><td>11/10/2014</td><td>2014</td><td>42053.42</td><td>1</td><td>0</td><td>Sector 4</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>employee_id</th><th>gender</th><th>overtime_pay</th><th>department</th><th>assignment_category</th><th>position_title</th><th>date_first_hired</th><th>year_first_hired</th><th>salary</th><th>full_time</th><th>part_time</th><th>sector</th></tr>\n",
"    <tr><th>date</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>2023-02-16 00:00:00+00:00</th><td>6309</td><td>F</td><td>0.00</td><td>HHS</td><td>Fulltime-Regular</td><td>Administrative Specialist I</td><td>02/08/2016</td><td>2016</td><td>59276.91</td><td>1</td><td>0</td><td>Sector 1</td></tr>\n",
"    <tr><th>2023-02-16 00:00:00+00:00</th><td>4078</td><td>M</td><td>19677.72</td><td>POL</td><td>Fulltime-Regular</td><td>Police Officer III</td><td>06/25/1990</td><td>1990</td><td>92756.70</td><td>1</td><td>0</td><td>Sector 3</td></tr>\n",
"    <tr><th>2023-02-16 00:00:00+00:00</th><td>2445</td><td>F</td><td>0.00</td><td>DEP</td><td>Fulltime-Regular</td><td>Planning Specialist III</td><td>06/30/2014</td><td>2014</td><td>80499.91</td><td>1</td><td>0</td><td>Sector 4</td></tr>\n",
"    <tr><th>2023-02-16 00:00:00+00:00</th><td>2548</td><td>F</td><td>0.00</td><td>REC</td><td>Fulltime-Regular</td><td>Recreation Specialist</td><td>03/24/2014</td><td>2014</td><td>69842.16</td><td>1</td><td>0</td><td>Sector 2</td></tr>\n",
"    <tr><th>2023-02-16 00:00:00+00:00</th><td>5949</td><td>M</td><td>45267.21</td><td>DGS</td><td>Fulltime-Regular</td><td>Property Manager II</td><td>05/07/1990</td><td>1990</td><td>99870.24</td><td>1</td><td>0</td><td>Sector 3</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>2023-02-16 00:00:00+00:00</th><td>8594</td><td>F</td><td>0.00</td><td>CCL</td><td>Fulltime-Regular</td><td>Confidential Aide</td><td>05/05/2003</td><td>2003</td><td>146664.49</td><td>1</td><td>0</td><td>Sector 4</td></tr>\n",
"    <tr><th>2023-02-16 00:00:00+00:00</th><td>3479</td><td>M</td><td>17711.08</td><td>FRS</td><td>Fulltime-Regular</td><td>Firefighter/Rescuer III</td><td>02/27/2012</td><td>2012</td><td>60618.00</td><td>1</td><td>0</td><td>Sector 1</td></tr>\n",
"    <tr><th>2023-02-16 00:00:00+00:00</th><td>6067</td><td>F</td><td>0.00</td><td>HHS</td><td>Parttime-Regular</td><td>School Health Room Technician I</td><td>08/06/2012</td><td>2012</td><td>36797.13</td><td>0</td><td>1</td><td>Sector 1</td></tr>\n",
"    <tr><th>2023-02-16 00:00:00+00:00</th><td>5788</td><td>M</td><td>9526.23</td><td>DLC</td><td>Fulltime-Regular</td><td>Liquor Store Clerk II</td><td>04/04/2000</td><td>2000</td><td>57760.61</td><td>1</td><td>0</td><td>Sector 2</td></tr>\n",
"    <tr><th>2023-02-16 00:00:00+00:00</th><td>4375</td><td>M</td><td>1020.28</td><td>DOT</td><td>Fulltime-Regular</td><td>Motor Pool Attendant</td><td>05/27/2008</td><td>2008</td><td>36493.52</td><td>1</td><td>0</td><td>Sector 4</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>916 rows × 12 columns</p>\n",
"