{ "cells": [ { "cell_type": "markdown", "id": "75bdb225-9ba6-479b-b15a-8fde07034146", "metadata": {}, "source": [ "## Pandas Profiling: USA Air Pollution Data\n", "Source of data: https://data.world/data-society/us-air-pollution-data" ] }, { "cell_type": "markdown", "id": "0f376d03-b7de-4c38-9eaf-ad74dbfebca6", "metadata": {}, "source": [ "The autoreload instruction reloads modules automatically before code execution, which is helpful for the update below." ] }, { "cell_type": "code", "execution_count": null, "id": "b514dd38-2ebd-4c96-aed5-4e3695e20fa2", "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "id": "fd3ef034-d2d1-4e12-9c8d-7cd685bfe001", "metadata": {}, "source": [ "Make sure that we have the latest version of pandas-profiling." ] }, { "cell_type": "code", "execution_count": null, "id": "6bcf2de2-58bf-4995-8cf7-3145322b45f7", "metadata": {}, "outputs": [], "source": [ "%%capture\n", "import sys\n", "\n", "!{sys.executable} -m pip install -U pandas-profiling[notebook]\n", "!jupyter nbextension enable --py widgetsnbextension" ] }, { "cell_type": "markdown", "id": "c6d526f3-14fe-454e-8b9c-6662ceac81c3", "metadata": {}, "source": [ "You might want to restart the kernel now." ] }, { "cell_type": "markdown", "id": "da73be69-6618-463b-8e5d-b8fd4e4bfdfb", "metadata": {}, "source": [ "### Import libraries" ] }, { "cell_type": "code", "execution_count": null, "id": "b33a26ed-4e1e-4689-93ce-fa0f98f48e89", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "from ydata_profiling import ProfileReport\n", "from ydata_profiling.utils.cache import cache_file" ] }, { "cell_type": "markdown", "id": "1d7e1bf7-9891-4a8b-87e9-680225596f47", "metadata": {}, "source": [ "### Load and prepare the dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "7dab0b47-537d-4402-af71-1bdfd0cf6cdd", "metadata": {}, "outputs": [], "source": [ "file_name = cache_file(\n", " \"pollution_us_2000_2016.csv\",\n", " \"https://query.data.world/s/mz5ot3l4zrgvldncfgxu34nda45kvb\",\n", ")\n", "\n", "df = pd.read_csv(file_name, index_col=[0])\n", "\n", "# We will only consider the data from Arizone state for this example\n", "df = df[df[\"State\"] == \"Arizona\"]\n", "df[\"Date Local\"] = pd.to_datetime(df[\"Date Local\"])" ] }, { "cell_type": "markdown", "id": "cf4fc84f-6a82-43c7-b17f-24f1a56cb3e1", "metadata": {}, "source": [ "### Multi-entity time-series" ] }, { "cell_type": "markdown", "id": "01519464", "metadata": {}, "source": [ "The support to time series can be enabled by passing the parameter tsmode=True to the ProfileReport when its enabled, pandas profiling will try to identify time-dependent features using the feature's autocorrelation, which requires a sorted DataFrame or the definition of the `sortby` parameter.\n", "\n", "When a feature is identified as time series will trigger the following changes:\n", " - the histogram will be replaced by a line plot\n", " - the feature details will have a new tab with autocorrelation and partial autocorrelation plots\n", " - two new warnings: `NON STATIONARY` and `SEASONAL` (which indicates that the series may have seasonality)\n", "\n", "In cases where the data has multiple entities, as in this example, where we have different meteorological stations, each station can be interpreted as a time series, its necessary to filter the entities and profile each station separately." ] }, { "cell_type": "markdown", "id": "cc10f039", "metadata": {}, "source": [ "The following plot showcases the amount of data for each entity over time. In this case the data from the stations started being collected at the same period, and the data is collected hourly so they have the same amount of data per period." ] }, { "cell_type": "code", "execution_count": null, "id": "15e613a6", "metadata": {}, "outputs": [], "source": [ "from ydata_profiling.visualisation.plot import timeseries_heatmap\n", "\n", "timeseries_heatmap(dataframe=df, entity_column=\"Site Num\", sortby=\"Date Local\")" ] }, { "cell_type": "code", "execution_count": null, "id": "b29a7e78-d52d-458d-ac9a-e509ffd373d1", "metadata": {}, "outputs": [], "source": [ "# Return the profile per station\n", "for group in df.groupby(\"Site Num\"):\n", " # Running 1 profile per station\n", " profile = ProfileReport(\n", " group[1],\n", " tsmode=True,\n", " sortby=\"Date Local\",\n", " # title=f\"Air Quality profiling - Site Num: {group[0]}\"\n", " )\n", "\n", " profile.to_file(f\"Ts_Profile_{group[0]}.html\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }