{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas Profiling: HCC Dataset\n", "Source of data: https://www.kaggle.com/datasets/mrsantos/hcc-dataset\n", "\n", "Because modifications have been introduced for the purpose of this use case, the .csv file is provided (hcc.csv)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "from ydata_profiling import ProfileReport" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read the HCC Dataset\n", "df = pd.read_csv(\"hcc.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Produce and save the profiling report" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "original_report = ProfileReport(df, title=\"Original Data\")\n", "original_report.to_file(\"original_report.html\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis of \"Alerts\"\n", "Pandas Profiling raises alerts for 4 potential data quality problems:\n", "\n", "- `DUPLICATES`: 4 duplicate rows in data\n", "- `CONSTANT`: Constant value “999” in ‘O2’\n", "- `HIGH CORRELATION`: Several features marked as highly correlated\n", "- `MISSING`: Missing values in ‘Ferritin’\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Removing Duplicate Rows" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Drop duplicate rows\n", "df_transformed = df.copy()\n", "df_transformed = df_transformed.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Removing Irrelevant Features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Remove O2\n", "df_transformed = df_transformed.drop(columns=\"O2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Missing Data Imputation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Impute Missing Values\n", "from sklearn.impute import SimpleImputer\n", "\n", "mean_imputer = SimpleImputer(strategy=\"mean\")\n", "df_transformed[\"Ferritin\"] = mean_imputer.fit_transform(\n", "    df_transformed[\"Ferritin\"].values.reshape(-1, 1)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Produce Comparison Report" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "transformed_report = ProfileReport(df_transformed, title=\"Transformed Data\")\n", "comparison_report = original_report.compare(transformed_report)\n", "comparison_report.to_file(\"original_vs_transformed.html\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.8 ('feat-comp')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "vscode": { "interpreter": { "hash": "13390b9b50dde76c6c011e02183633aae7d8498993a6e6577a16e1b7cb8c7a8c" } } }, "nbformat": 4, "nbformat_minor": 2 }