{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2f71c854-bf7a-4844-931c-00a1e473fe94",
   "metadata": {},
   "source": [
    "# Dimension reduction for data visualization"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4c7581f-fb70-435d-976d-8c5a8b586d26",
   "metadata": {},
   "source": [
    "Non-linear dimension reductiona and clustering is a powerful tool to very fastly see where are the most significant differences between different spectra and what is driving these differences. \n",
    "We propose here to use UMAP as a dimension reduction algorithm and plot the 2D representation for different scenarios.\n",
    "We will start exactly like we did in the other notebook (data_analysis_with_pypam), so we can make sure we have the HMD downloaded."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4420218-2114-44d7-9017-0023520ba482",
   "metadata": {
    "tags": []
   },
   "source": [
    "## 0. Setup (install)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe031f58-fb63-43d0-a132-1a9052e8be90",
   "metadata": {},
   "source": [
    "First we need to install all the packages which we need to exectue this notebook. You don't need to do this if you're using mybinder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ec070b8-cbd0-4326-a674-843d5d4b8a41",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "!pip install pvlib \n",
    "!pip install lifewatch-pypam==0.3.2 \n",
    "!pip install minio"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21d56026-3d95-49bf-b7f1-4595a3134a33",
   "metadata": {
    "tags": []
   },
   "source": [
    "## 1. Download the data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43277afc-991c-4bc6-a1ca-a07901684e6d",
   "metadata": {},
   "source": [
    "We will download some processed HMB data stored in the cloud to give some examples of how can it be used. \n",
    "These data will be downloaded in this jupyterlab space, and you will be able to find them under the folder you specify here below, organized by station. \n",
    "Please change this line depending on where you want to store the data. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "755867a2-b6f9-4982-a25a-38fd27132864",
   "metadata": {},
   "source": [
    "We first start importing the packages we need for this part of the notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d18f1f5-45b8-4151-ae3c-e8119abf82f9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the necessary packages\n",
    "import minio\n",
    "import os\n",
    "import pathlib\n",
    "from datetime import datetime\n",
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39fb588b-4ac9-4827-8296-a0695362a3c0",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "⚠️ Downloading the data takes quite a while. When running this notebook on Binder, please consider the limitations and limit the maximum number of days to 10 days. When running on other environments you can load a full year if wanted (changed the start, end time and the max_days)\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc6042cb-6c68-437d-a0bd-ad31fc2ba3ff",
   "metadata": {},
   "outputs": [],
   "source": [
    "local_path = '../data'\n",
    "max_days = 10\n",
    "start_date_time = '2021-01-01'\n",
    "end_date_time = '2021-02-01'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a183bb3a-ad96-407d-bbc5-d7e6a7feb6f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Update end_date_time if defined temporal range exceeds max_days\n",
    "time_delta = datetime.fromisoformat(end_date_time) - datetime.fromisoformat(start_date_time)\n",
    "if time_delta.days > max_days:\n",
    "    end_date_time = str(datetime.fromisoformat(start_date_time) + timedelta(days=max_days))\n",
    "    print(f'end_date_time updated to {end_date_time}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c101d98-dc3f-4419-b5ed-39e90b865f34",
   "metadata": {},
   "source": [
    "We will then start defining a function to download the data ONLY from the period we are interested in which can be then reused for different stations (then we don't need to write as much!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "169221ed-8e08-441c-9db3-bfce28af4b2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# First we create a function to download data \n",
    "def download_data_station(station_name, \n",
    "                          client_obj, \n",
    "                          bucket_str, \n",
    "                          prefix_str, \n",
    "                          data_path,\n",
    "                          name_format,\n",
    "                          start_datetime, \n",
    "                          end_datetime):\n",
    "    start_datetime_obj = datetime.fromisoformat(start_datetime) \n",
    "    end_datetime_obj = datetime.fromisoformat(end_datetime)\n",
    "    station_folder = pathlib.Path(data_path).joinpath(station_name)\n",
    "    if not station_folder.exists():\n",
    "        os.mkdir(station_folder)\n",
    "    objects = list(client_obj.list_objects(bucket_str, prefix=prefix_str))\n",
    "    ct = 0\n",
    "    for i, obj in enumerate(objects):\n",
    "        object_name = obj.object_name\n",
    "        path_name = pathlib.Path(object_name).name\n",
    "        if (not path_name.startswith('.')) & path_name.endswith('.nc'):\n",
    "            match = re.findall(r\"_(\\d+)\", path_name)[-1]\n",
    "            file_date = datetime.strptime(match, name_format)\n",
    "            if (file_date >= start_datetime_obj) & (file_date <= end_datetime_obj):\n",
    "                download_path = data_path + '/' + station_name + '/' + pathlib.Path(object_name).name\n",
    "                if os.path.isfile(download_path):\n",
    "                    print('Already downloaded: ' + download_path)\n",
    "                else:\n",
    "                    print('Download ' + str(ct) + ' of ' + str(len(objects)) + ': ' + download_path)\n",
    "                    object_data = client.get_object(bucket, object_name)\n",
    "                    if not os.path.isdir(download_path):\n",
    "                        with open(str(download_path), 'wb') as file_data:\n",
    "                            for data in object_data:\n",
    "                                file_data.write(data)\n",
    "                    file_data.close()\n",
    "            else: \n",
    "                print('Ignored, out of selected period or not a netCDF file ' + path_name)\n",
    "        ct += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39ca11ae-cf1d-4558-94a9-431afcd664bd",
   "metadata": {},
   "source": [
    "Then we will download the data for 2 stations. \n",
    "For this workshop we have chosen MARS and NRS11, but feel free to change the station names here if you wish to play around with other stations from the same providers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3395df6d-18f6-438c-8213-5161852da329",
   "metadata": {},
   "source": [
    "Let's start with MARS. We first need to describe where the data are located"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9176172f-70eb-49a3-845b-1cd73cc0df1f",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Set up the download for MARS data\n",
    "client = minio.Minio( \"s3.us-west-2.amazonaws.com\", secure=False)\n",
    "\n",
    "bucket = 'pacific-sound-spectra'\n",
    "prefix = '2021/'\n",
    "station = 'MARS'\n",
    "name_format = '%Y%m%d'\n",
    "download_data_station(station, client, bucket, prefix, local_path, name_format=name_format, start_datetime=start_date_time, end_datetime=end_date_time)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5696783c-50d1-414d-9646-a5fed8c4f1e1",
   "metadata": {},
   "source": [
    "Then we move to the NRS11 data. We repreat the same process just with different parameters.\n",
    "These are the data we can find at [sanctsound](https://console.cloud.google.com/storage/browser/noaa-passive-bioacoustic/sanctsound?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false), daily millidecade bands. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "063dd84d-af33-430a-8b2f-8a8d10a0e568",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Set up the download for NRS11 data \n",
    "client = minio.Minio('storage.googleapis.com')\n",
    "bucket = 'noaa-passive-bioacoustic'\n",
    "station = 'NRS11'\n",
    "prefix = 'soundcoop/%s/' % station\n",
    "name_format = '%Y%m%d'\n",
    "download_data_station(station, client, bucket, prefix, local_path, name_format=name_format, start_datetime=start_date_time, end_datetime=end_date_time)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f17ddb3-d006-40df-ad45-1f0476060f47",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d8387c7-5092-4eb5-b956-a39fee1647b5",
   "metadata": {},
   "source": [
    "First we set some style guidelines for plotting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67119d83-1e51-413e-8a15-e0b93971200b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "# Clear out default notebook settings\n",
    "%config InlineBackend.print_figure_kwargs = {'bbox_inches':None}\n",
    "\n",
    "# Set figure size and layout\n",
    "plt.rcParams[\"figure.figsize\"] = [12.00, 5.00]\n",
    "plt.rcParams[\"figure.autolayout\"] = True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c12d7275-dbb1-4ebb-a4a6-4eb7571702cd",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Import pypam modules \n",
    "import pypam.utils\n",
    "import pypam.plots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8f9b39e-b763-43f7-9337-9979cb828cb6",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Import other tools\n",
    "import pathlib\n",
    "import numpy as np\n",
    "import xarray as xr\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36a04b8f-e06a-4a6a-872a-890f4b6835f9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "RESAMPLE_RESOLUTION = '1h' # we could also set it to 1d for daily resolution for example\n",
    "min_freq = 10\n",
    "max_freq = 10000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7425d0aa-3267-48c9-b009-322f357694ee",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Only get data from 2021\n",
    "def load_data_from_station_year(station, year, data_vars):\n",
    "    deployment_path = pathlib.Path(local_path).joinpath(station)\n",
    "    print('loading station %s...' % station)\n",
    "    aggregated_ds = pypam.utils.join_all_ds_output_deployment(deployment_path, data_vars=data_vars, datetime_coord='time',\n",
    "                                                              join_only_if_contains='_%s' % year, load=True,\n",
    "                                                              parallel=False, freq_band=[min_freq, max_freq],\n",
    "                                                              freq_coord='frequency',\n",
    "                                                              time_resample='1h',\n",
    "                                                              drop_variables=['psd_image_colormap',\n",
    "                                                                              'psd_image',\n",
    "                                                                              'percentile_image',\n",
    "                                                                              'percentile_image_colormap'])\n",
    "    return aggregated_ds # this assigns an xarray dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49cf2933-85b3-40bc-bbf6-32034eb3bea6",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "⚠️ Be patient, loading two full years can take a while...\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60e14dc0-092a-4443-9d88-5c7b5ec8c4c9",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Load the data for each station\n",
    "# Quality flag is only available for NRS11, not for MARS. We can decide to ignore it or load it by removing it from the data_vars list \n",
    "ds_collection = {}\n",
    "ds_collection['MARS'] = load_data_from_station_year('MARS', 2021, data_vars=['psd'])\n",
    "ds_collection['NRS11'] = load_data_from_station_year('NRS11', 2021, data_vars=['psd', 'quality_flag'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca8e496a-7a9d-4380-9e1d-6e2fcaee59a7",
   "metadata": {},
   "source": [
    "## Dimension reduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6cf9674-9abb-4916-8ebd-be28337192d3",
   "metadata": {},
   "source": [
    "For the dimension reduction we need an extra package, umap (PCA would also work, from sklearn but umap captures better non-linear relationships)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d4331ca-5c9c-48aa-92a7-843c6ee4fc87",
   "metadata": {},
   "source": [
    "You don't need to run this cell on mybinder, as this has been installed for you "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3f816a1-e6f1-4ff6-96b9-755dbd1644d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install umap-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "552ddc15-c3f1-4e9e-be21-44021c33b63b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import umap\n",
    "import pandas as pd\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c06b9544-baa3-4218-90b4-cfd395e53846",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select for both stations the maximum frequency\n",
    "ds1 = pypam.utils.select_frequency_range(ds_collection['NRS11'], \n",
    "                                   min_freq=min_freq, \n",
    "                                   max_freq=2000, freq_coord='frequency')\n",
    "ds2 = pypam.utils.select_frequency_range(ds_collection['MARS'], \n",
    "                                   min_freq=min_freq, \n",
    "                                   max_freq=2000, freq_coord='frequency')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c0c5470-b7f3-4a2b-8e3f-ea918a9f7de3",
   "metadata": {},
   "source": [
    "We first need to convert the data from xarray to pandas format, as this is what UMAP algorithm takes as an input"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b422161-acfb-4098-8d81-418df785a331",
   "metadata": {},
   "outputs": [],
   "source": [
    "df1 = ds1['psd'].to_pandas()\n",
    "frequency_columns = df1.columns\n",
    "df1['station'] = 'NRS11'\n",
    "\n",
    "df2 = ds2['psd'].to_pandas()\n",
    "df2.columns = frequency_columns\n",
    "df2['station'] = 'MARS'\n",
    "\n",
    "df = pd.concat([df1, df2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "458beb17-e476-4642-bb7e-2b4189c1ff3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "df['embedding'] = df[frequency_columns].values.tolist()\n",
    "df['month'] = df.index.month \n",
    "df['hour'] = df.index.hour\n",
    "df.to_pickle('df_example_soundcoop.pkl')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26a0f7f1-d272-44ac-9656-216f7beec9a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.dropna()\n",
    "df1 = df1.dropna()\n",
    "df2 = df2.dropna()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15eb7afb-f01f-4d58-9aeb-23b58c2c55ea",
   "metadata": {},
   "source": [
    "Then we perform the dimension reduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c90bc09-dcdc-4d6c-889e-feb8b136d874",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "⚠️ Be patient, this can take a while it lots of data are processed...\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5d0e665-8a5b-494b-8e9b-733510c4acba",
   "metadata": {},
   "outputs": [],
   "source": [
    "umap_box = umap.UMAP(n_components=2, n_neighbors=20, min_dist=0.05)\n",
    "umap_box.fit(df[frequency_columns].values)\n",
    "embedding = umap_box.transform(df[frequency_columns])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4b12fcc-e5cd-4d28-9a11-c2f10a313c34",
   "metadata": {},
   "source": [
    "We plot the results, coloring per station"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17e4374e-0100-4cf6-9f85-71c4d47d22c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1],\n",
    "                     s=2, alpha=0.9, hue=df['station'],\n",
    "                     legend=True)\n",
    "ax.set_facecolor('white')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e175b814-3aec-4a3b-9dc9-9d5561c1829e",
   "metadata": {},
   "source": [
    "We can perfom another dimension reduction, this time only for one station "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b50c699-38ee-487c-863a-541f36504cf2",
   "metadata": {},
   "outputs": [],
   "source": [
    "umap_box = umap.UMAP(n_components=2, n_neighbors=20, min_dist=0.05)\n",
    "umap_box.fit(df1[frequency_columns].values)\n",
    "embedding = umap_box.transform(df1[frequency_columns])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76f8a164-a0c9-4656-992e-2730995c266e",
   "metadata": {},
   "source": [
    "And we can plot it coloring it by hour for example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37b362cd-4ddd-40c8-9412-7e036616dcd6",
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1],\n",
    "                     s=2, alpha=0.9, hue=df1.index.hour,\n",
    "                     legend=True)\n",
    "ax.set_facecolor('white')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c8c8c81-1e73-49aa-a21d-0181fc3f0d40",
   "metadata": {},
   "source": [
    "Or by month"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3f6be64-f809-46e2-8103-11a6e3c0e7f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1],\n",
    "                     s=2, alpha=0.9, hue=df1.index.month,\n",
    "                     legend=True)\n",
    "ax.set_facecolor('white')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aaf8e09c-0898-4c3a-89c3-a4a86a6090cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "umap_box = umap.UMAP(n_components=2, n_neighbors=20, min_dist=0.05)\n",
    "umap_box.fit(df2[frequency_columns].values, y=df2.index.month)\n",
    "embedding = umap_box.transform(df2[frequency_columns])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "784f689a-3745-4947-91e8-5e2dd33dfb55",
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1],\n",
    "                     s=2, alpha=0.9, hue=df2.index.month,\n",
    "                     legend=True)\n",
    "ax.set_facecolor('white')\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}