{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pandas Profiling: HCC Dataset\n",
    "Source of data: https://www.kaggle.com/datasets/mrsantos/hcc-dataset\n",
    "\n",
    "As modifiations have been introduced for the purpose of this use case, the .csv file is provided (hcc.csv)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "from ydata_profiling import ProfileReport"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the HCC Dataset\n",
    "df = pd.read_csv(\"hcc.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Produce and save the profiling report"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "original_report = ProfileReport(df, title=\"Original Data\")\n",
    "original_report.to_file(\"original_report.html\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis of \"Alerts\"\n",
    "Pandas Profiling alerts for the presence of 4 potential data quality problems:\n",
    "\n",
    "- `DUPLICATES`: 4 duplicate rows in data\n",
    "- `CONSTANT`: Constant value “999” in ‘O2’\n",
    "- `HIGH CORRELATION`: Several features marked as highly correlated\n",
    "- `MISSING`: Missing Values in ‘Ferritin’\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Removing Duplicate Rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop duplicate rows\n",
    "df_transformed = df.copy()\n",
    "df_transformed = df_transformed.drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Removing Irrelevant Features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove O2\n",
    "df_transformed = df_transformed.drop(columns=\"O2\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Missing Data Imputation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Impute Missing Values\n",
    "from sklearn.impute import SimpleImputer\n",
    "\n",
    "mean_imputer = SimpleImputer(strategy=\"mean\")\n",
    "df_transformed[\"Ferritin\"] = mean_imputer.fit_transform(\n",
    "    df_transformed[\"Ferritin\"].values.reshape(-1, 1)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Produce Comparison Report"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transformed_report = ProfileReport(df_transformed, title=\"Transformed Data\")\n",
    "comparison_report = original_report.compare(transformed_report)\n",
    "comparison_report.to_file(\"original_vs_transformed.html\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.10.8 ('feat-comp')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  },
  "vscode": {
   "interpreter": {
    "hash": "13390b9b50dde76c6c011e02183633aae7d8498993a6e6577a16e1b7cb8c7a8c"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}