{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data engineering\n",
    "\n",
    "This assignment focuses on techniques for cleaning and transforming the data to handle challenges of missing, inaccurate, or incomplete data. Please refer to [Machine Learning productionization - Data engineering](#data-engineering) to learn more.\n",
    "\n",
    "Fill `____` pieces of the below implementation in order to pass the assertions.\n",
    "\n",
    "\n",
    "<!-- \n",
    "# removing rows with missing values from dataset: pandas.DataFrame dataset.dropna(inplace=True) # removing columns with ratio of missing values greater than a threshold dataset = dataset[dataset.columns[dataset.isnull().mean() <= THRESHOLD]]\n",
    "\n",
    "Imputation. If the number of missing values is not small enough to be removed and not large enough to be a substantial proportion of the variable entries, you can replace the missing values in a numerical variable with the mean/median of the non-missing entries and the missing values in a categorical variable with the mode, which is the most frequent entry of the variable.\n",
    "\n",
    "# imputing missing values in each column with the mean of the corresponding # column using scikit-learn from sklearn.impute import SimpleImputer imputer = SimpleImputer(strategy='mean') # transform the dataset imputed_dataset = imputer.fit_transform(dataset)\n",
    "\n",
    "Identifying the group of attributes that uniquely identify a single record is very important and needs validation from a domain expert. Removing duplicates on this group leaves you with distinct records necessary for model training. This group acts as a key to performing multiple aggregate and transformations operations on the dataset like calculating rolling mean, backfilling null values, missing value imputation (details on this in next point), etc.\n",
    "True duplicates, i.e., instances of the same data point, are usually removed. In this way, the increase of the sample weight on these points is eliminated, and the risk of any artificial inflation in the performance metrics is reduced -->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploring dataset\n",
    "\n",
    "> **Learning goal**: By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.\n",
    "\n",
    "In order to explore this functionality, we will import the modefined version of Python scikit-learn library's iconic dataset **Iris**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.datasets import load_iris\n",
    "import math\n",
    "\n",
    "iris_df = pd.read_csv('../../assets/data/modefined_sklearn_iris_dataset.csv', index_col=0)\n",
    "iris_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To start off, print the summary of a DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris_df.____"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{quizdown}\n",
    "\n",
    "## How many entries the Iris dataset has?\n",
    "\n",
    "> Please refer to the output of above cell.  \n",
    "\n",
    "- [ ] 50\n",
    "- [ ] 100\n",
    "- [x] 150\n",
    "- [ ] 200\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's check the actual content of the `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# displying first 5 rows of our iris_df\n",
    "iris_df____\n",
    "\n",
    "# in the first five rows, which one's spepal length is 5.0cm?\n",
    "assert iris_df.iloc[____, 0] == 5.0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Conversely, we can check the last few rows of the DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# displying last 5 rows of our `iris_df`.\n",
    "iris_df.____\n",
    "\n",
    "# in the last five rows, which one's spepal width is 2.5cm?\n",
    "assert iris_df.iloc[____, 1] == 2.5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Takeaway**: Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dealing with missing data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Missing data can cause inaccuracies as well as weak or biased results. Sometimes these can be resolved by a \"reload\" of the data, filling in the missing values with computation and code like Python, or simply just removing the value and corresponding data. There are numerous reasons for why data may be missing and the actions that are taken to resolve these missing values can be dependent on how and why they went missing in the first place.\n",
    "\n",
    "> **Learning goal**: By the end of this subsection, you should know how to replace or remove null values from DataFrames.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In pandas, the `isnull()` and `notnull()` methods are your primary methods for detecting null data. Both return Boolean masks over your data. We will be using numpy for NaN values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris_isnull_df = iris_df.isnull()\n",
    "\n",
    "print(iris_isnull_df)\n",
    "\n",
    "# find one row with missing value\n",
    "assert iris_isnull_df.iloc[____, ____] == True\n",
    "assert math.isnan(iris_df.iloc[____, ____]) == True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get all the rows with missing data\n",
    "iris_with_missing_value_df = iris_df____\n",
    "\n",
    "assert iris_with_missing_value_df.shape[0] == 16"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Dropping null values**: Beyond identifying missing values, pandas provides a convenient means `dropna` to remove null values from Series and DataFrames. (Particularly on large data sets, it is often more advisable to simply remove missing [NA] values from your analysis than deal with them in other ways.) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove all the rows with missing values\n",
    "iris_with_dropna_on_row_df = iris_df.____\n",
    "\n",
    "assert iris_with_dropna_on_row_df.shape[0] == 134"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove all the columns with missing values\n",
    "iris_with_dropna_on_column_df = iris_df.____\n",
    "\n",
    "assert iris_with_dropna_on_column_df.columns.shape[0] == 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove all the rows with 2 missing values\n",
    "iris_with_dropna_2_values_on_rows_df = iris_df.____\n",
    "\n",
    "assert iris_with_dropna_2_values_on_rows_df.shape[0] == 144\n",
    "\n",
    "# remove all the rows with 1 missing values\n",
    "iris_with_dropna_1_values_on_rows_df = iris_df.____\n",
    "\n",
    "assert iris_with_dropna_1_values_on_rows_df.shape[0] == 147"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Filling null values**: Depending on your dataset, it can sometimes make more sense to fill null values with valid ones rather than drop them. You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the Series or DataFrame with the missing values replaced with one of your choosing. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# fll all the missing values with 0\n",
    "iris_with_fillna_df = iris_df.____\n",
    "\n",
    "# get all the rows with missing data\n",
    "iris_with_missing_value_after_fillna_df = iris_with_fillna_df____\n",
    "\n",
    "assert iris_with_missing_value_after_fillna_df.shape[0] == 0\n",
    "assert iris_with_fillna_df.iloc[____, 3] == -1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# forward-fill null values, which is to use the last valid value to fill a null:\n",
    "iris_with_fillna_forward_df = iris_df.____\n",
    "\n",
    "# get all the rows with missing data\n",
    "iris_with_missing_value_after_fillna_forward_df = iris_with_fillna_forward_df____\n",
    "\n",
    "assert iris_with_missing_value_after_fillna_forward_df.shape[0] == 0\n",
    "assert float(iris_with_fillna_forward_df.iloc[3, 3]) == 0.2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# back-fill null values, which is to use the next valid value to fill a null:\n",
    "iris_with_fillna_back_df = iris_df.____\n",
    "\n",
    "# get all the rows with missing data\n",
    "iris_with_missing_value_after_fillna_back_df = iris_with_fillna_back_df____\n",
    "\n",
    "assert iris_with_missing_value_after_fillna_back_df.shape[0] == 0\n",
    "assert float(iris_with_fillna_back_df.iloc[3, 3]) == 0.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Removing duplicate data\n",
    "\n",
    "Data that has more than one occurrence can produce inaccurate results and usually should be removed. This can be a common occurrence when joining two or more datasets together. However, there are instances where duplication in joined datasets contain pieces that can provide additional information and may need to be preserved.\n",
    "\n",
    "> **Learning goal**: By the end of this subsection, you should be comfortable identifying and removing duplicate values from DataFrames.\n",
    "\n",
    "In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, pandas provides an easy means of detecting and removing duplicate entries."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Identifying duplicates**: You can easily spot duplicate values using the `duplicated`  method in pandas, which returns a Boolean mask indicating whether an entry in a DataFrame is a duplicate of an earlier one. Let's create another example DataFrame to see this in action."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris_isduplicated_df = iris_df.____\n",
    "\n",
    "print(iris_isduplicated_df)\n",
    "\n",
    "# find one row with duplicated value\n",
    "assert iris_isduplicated_df.iloc[____, ____] == True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Dropping duplicates**: `drop_duplicates` simply returns a copy of the data for which all of the duplicated values are False:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove all the rows with duplicated values\n",
    "iris_with_drop_duplicates_on_df = iris_df.drop_duplicates()\n",
    "\n",
    "assert iris_with_drop_duplicates_on_df.shape[0] == 143"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both `duplicated` and `drop_duplicates` default to consider all columns but you can specify that they examine only a subset of columns in your DataFrame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove all the rows with duplicated values on column 'petal width (cm)'\n",
    "iris_with_drop_duplicates_on_column_df = iris_df.____\n",
    "\n",
    "assert iris_with_drop_duplicates_on_column_df.shape[0] == 27"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Handle inconsistent data\n",
    "\n",
    "Depending on the source, data can have inconsistencies in how it’s presented. This can cause problems in searching for and representing the value, where it’s seen within the dataset but is not properly represented in visualizations or query results. Common formatting problems involve resolving whitespace, dates, and data types. Resolving formatting issues is typically up to the people who are using the data. For example, standards on how dates and numbers are presented can differ by country.\n",
    "\n",
    "> **Learning goal**: By the end of this subsection, you should know how to handle the inconsistent data format in the DataFrame."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's cleaning up the **4th** column `petal width (cm)` to make sure there's no data entry inconsistencies in it. Firstly, we will use a convenient method `unique` from pandas to check the unique values of this column\n",
    "\n",
    "In pandas, the `unique` method is a convenient way to unique values based on a hash table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "column_to_format = ____\n",
    "column_to_format_unique = column_to_format.____\n",
    "\n",
    "print(column_to_format_unique)\n",
    "\n",
    "# find one row with duplicated value\n",
    "assert column_to_format_unique.shape[0] == 27"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Regardless the `nan` value, you may find the numeric valus are in different precision. More specifically, `1.` or `1.5012` are not in the same precision as other numbers. We want to append tailing `0` to numbers like `1.`, and round numbers like `1.5012` to `1.5`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# firstly, let's apply `round`` to the values to make the precision all as .1f\n",
    "formatted_column = column_to_format.____\n",
    "\n",
    "print(formatted_column.unique())\n",
    "\n",
    "assert formatted_column.unique().shape[0] == 23"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# now, let's add tailing 0 if needed to make numbers like 1. to be 1.0. \n",
    "# You may need to filter the nan value while processing.\n",
    "formatted_column = formatted_column.____\n",
    "\n",
    "print(formatted_column.unique())\n",
    "\n",
    "assert formatted_column.unique().shape[0] == 23"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## At last\n",
    "\n",
    "Let's apply all the methods above to make the data to be clean."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove all rows with missing values\n",
    "no_missing_data_df = iris_df.____\n",
    "\n",
    "# remove all rows with duplicated values\n",
    "no_missing_dup_data_df = no_missing_data_df.____\n",
    "\n",
    "# apply the precision .1f to all the numbers\n",
    "cleand_df = no_missing_dup_data_df.____\n",
    "\n",
    "assert no_missing_data_df.shape[0] == 134\n",
    "assert no_missing_dup_data_df.shape[0] == 129\n",
    "assert cleand_df[cleand_df.columns[3]].unique().shape[0] == 22"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also, you could refer to below for more about how to handle data quality.\n",
    "\n",
    "- missing data - [pandas - Working with missing data](https://pandas.pydata.org/docs/user_guide/missing_data.html)\n",
    "- duplicate data - [pandas - Duplicate Labels](https://pandas.pydata.org/docs/user_guide/duplicates.html)\n",
    "- outlier\n",
    "  - [Ways to Detect and Remove the Outliers](https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba)\n",
    "  - [Outlier!!! The Silent Killer](https://www.kaggle.com/code/nareshbhat/outlier-the-silent-killer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Acknowledgments\n",
    "\n",
    "Thanks to Microsoft for creating the open source course [Data Science for Beginners](https://github.com/microsoft/Data-Science-For-Beginners). It contributes some of the content in this chapter."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.13 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13 (main, May 24 2022, 21:28:31) \n[Clang 13.1.6 (clang-1316.0.21.2)]"
  },
  "vscode": {
   "interpreter": {
    "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}