{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Withings Sleep Analyzer Data Import"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "This example explains how to import and parse data retrieved from the Withings Health Mate app.\n",
    "    \n",
    "<b>Note</b>: This notebook is just to illustrate how to <i>generally</i> approach such a data wrangling problem. The full code (and much more!) is readily available in <code>BioPsyKit</code>: <code>biopsykit.sleep_analyzer.io.load_withings_sleep_analyzer_raw()</code>.\n",
    "</div>\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup and Helper Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ast import literal_eval\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "from fau_colors import cmaps\n",
    "\n",
    "import biopsykit as bp\n",
    "\n",
    "%matplotlib widget\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.close(\"all\")\n",
    "\n",
    "tz = \"Europe/Berlin\"\n",
    "\n",
    "palette = sns.color_palette(cmaps.faculties)\n",
    "sns.set_theme(context=\"notebook\", style=\"ticks\", font=\"sans-serif\", palette=palette)\n",
    "\n",
    "plt.rcParams[\"figure.figsize\"] = (8, 4)\n",
    "plt.rcParams[\"pdf.fonttype\"] = 42\n",
    "plt.rcParams[\"mathtext.default\"] = \"regular\"\n",
    "\n",
    "palette"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Import"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read Data from File"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load example data (or read the csv file into a dataframe using `pandas.read_csv()`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = bp.example_data.get_sleep_analyzer_raw_file_unformatted(data_source=\"heart_rate\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first want to get an impression how the data looks like by displaying the data. In Jupyter Notebooks, ending a cell with the *name of a variable* or *unassigned output of a statement*, Jupyter will ``display`` that variable (in a nice layout) without the need for a ``print`` statement. \n",
    "\n",
    "You can for example call ``data`` to display the or ``data.head()`` to display the beginning of the dataframe.\n",
    "\n",
    "We see that we have three columns: A 'start' column with timestamps, a 'duration' column and a 'value' column. We can read this data row-wise and as follows:\n",
    "Beginning at time 'start', we get the heart rate values in the 'value' column for a 'duration' per value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data type conversion\n",
    "\n",
    "All values are imported as strings, so we need to convert these into the correct data types:\n",
    "\n",
    "* The *String* timestamps in the 'start' column are converted into *datetime* objects that offer extensive functions for handling time series data\n",
    "* The lists in the 'duration' and 'value' columns are also stored as strings so we need to convert them into actual lists with numbers. Googling \"*pandas convert string to array*\" leads us to this StackOverflow post https://stackoverflow.com/questions/23119472/in-pandas-python-reading-array-stored-as-string, where the accepted answer suggests this:\n",
    "\n",
    "\n",
    "```\n",
    "    from ast import literal_eval\n",
    "    df['col2'] = df['col2'].apply(literal_eval)\n",
    "```\n",
    "\n",
    "In the end, we set the 'start' column as the new index of the dataframe and sort the data by the index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Before: {[type(value) for value in data.iloc[0]]}\")\n",
    "\n",
    "data[\"start\"] = pd.to_datetime(data[\"start\"])\n",
    "data[\"duration\"] = data[\"duration\"].apply(literal_eval)\n",
    "data[\"value\"] = data[\"value\"].apply(literal_eval)\n",
    "\n",
    "print(f\"After: {[type(value) for value in data.iloc[0]]}\")\n",
    "\n",
    "data = data.set_index(\"start\").sort_index()\n",
    "# rename index\n",
    "data.index.name = \"time\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our data now looks like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explode Arrays"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now want to convert the values stored in the arrays into single values. Googling \"*pandas convert list of values to rows*\" leads us to this StackOverflow post: https://stackoverflow.com/questions/39954668/how-to-convert-column-with-list-of-values-into-rows-in-pandas-dataframe. Here, we don't take the accepted answer, but the answer below:\n",
    "```\n",
    "    df.explode('column')\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Before Explode:\")\n",
    "display(data[\"value\"].head())\n",
    "print(\"\")\n",
    "print(\"After Explode:\")\n",
    "display(data[\"value\"].explode(\"value\").head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `pd.Series.explode()` function only works on one single column. If we want to apply this on multiple columns at once, we need to call `pd.DataFrame.apply()` and pass the function as argument to the apply function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_explode = data.apply(pd.Series.explode)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our dataframe now looks like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_explode.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, we now see that the timestamp is the same for each exploded value. The documentation of `explode()` says the following: \n",
    "\n",
    "`Transform each element of a list-like to a row, *replicating* index values`.\n",
    "\n",
    "To get the correct timestamps we would need to add the 'duration' values cumulatively to the timestamps. However, only summing up the values in 'duration' would not work, we need to perform this only within those timestamps that are the same. One way to achieve this is to group the data into subparts with the same timestamp using `pd.DataFrame.groupby` where we pass the index name (i.e. `time`) to group along. For that, we define our own function that is applied onto each group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def explode_timestamps(df):\n",
    "    # sum up the time durations and subtract the first value from it (so that we start from 0)\n",
    "    # dur_sum then looks like this: [0, 60, 120, 180, ...]\n",
    "    dur_sum = df[\"duration\"].cumsum() - df[\"duration\"].iloc[0]\n",
    "    # Add these time durations to the index timestamps.\n",
    "    # For that, we need to convert the datetime objects from the pandas DatetimeIndex into a float and add the time onto it\n",
    "    # (we first need to multiply it with 10^9 because the time in the index is stored in nanoseconds)\n",
    "    index_sum = df.index.values.astype(float) + 1e9 * dur_sum\n",
    "    # convert the float values back into a DatetimeIndex\n",
    "    df[\"time\"] = pd.to_datetime(index_sum)\n",
    "    # set this as index and convert it back into the right time zone\n",
    "    df = df.set_index(\"time\")\n",
    "    df = df.tz_localize(\"UTC\").tz_convert(tz)\n",
    "    # we don't need the duration column anymore so we can drop it\n",
    "    df = df.drop(columns=\"duration\")\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# call groupby and apply our custom function on each group\n",
    "df_hr = data_explode.groupby(\"time\", group_keys=False).apply(explode_timestamps)\n",
    "# rename the value column\n",
    "df_hr.columns = [\"heart_rate\"]\n",
    "\n",
    "df_hr"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filtering and plotting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Filter data by day"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assume we want to filter only data from a particular date, e.g. Oct 11 2020.\n",
    "\n",
    "\n",
    "For this, we can slice the index to only include data from this particular date by doing the following steps:\n",
    "\n",
    "* *Normalize* the `DateTimeIndex` (set every date to midnight)\n",
    "* Filter for the desired day\n",
    "* Slice the DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_hr_day = df_hr.loc[df_hr.index.normalize() == \"2020-10-11\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_hr_day"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot this data as example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "df_hr_day.plot(ax=ax)\n",
    "\n",
    "ax.legend().remove()\n",
    "ax.set_ylabel(\"Heart Rate [bpm]\")\n",
    "ax.set_xlabel(\"Time\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# That's it!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This code is also available in `BioPsyKit` and can be used like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sleep_data = bp.example_data.get_sleep_analyzer_raw_example()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sleep_data.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sleep_data[\"2020-10-10\"].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Only load a specific data source (in this case, our example data):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sleep_state_data = bp.example_data.get_sleep_analyzer_raw_file(\"sleep_state\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively: Load your own Sleep Analyzer raw data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# sleep_state_data = bp.io.sleep_analyzer.load_withings_sleep_analyzer_raw_file(\n",
    "#    \"<path-to-sleep-analyzer-raw-file.csv>\",\n",
    "#    data_source=\"sleep_state\"\n",
    "# )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sleep_state_data[\"2020-10-10\"].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "biopsykit",
   "language": "python",
   "name": "biopsykit"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}