{ "cells": [ { "cell_type": "markdown", "id": "4c710151", "metadata": {}, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ThomasAlbin/Astroniz-YT-Tutorials/blob/main/[ML1]-Asteroid-Spectra/2_data_parse.ipynb)" ] }, { "cell_type": "markdown", "id": "b092648c", "metadata": {}, "source": [ "# Step 2: Data Parsing\n", "\n", "This notebook takes now the downloaded files, parses and cleans the data and merges the taxonomy classification with the spectra data. The final dataset is stored in a Level 1 directory for further computations. Further, a first clean up is performed." ] }, { "cell_type": "code", "execution_count": 1, "id": "2287cf5c", "metadata": {}, "outputs": [], "source": [ "# Import standard libraries\n", "import glob\n", "import os\n", "import pathlib\n", "import re\n", "\n", "# Import installed libraries\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "c03d447f", "metadata": {}, "outputs": [], "source": [ "# Let's mount the Google Drive, where we store files and models (if applicable, otherwise work\n", "# locally)\n", "try:\n", " from google.colab import drive\n", " drive.mount('/gdrive')\n", " core_path = \"/gdrive/MyDrive/Colab/asteroid_taxonomy/\"\n", "except ModuleNotFoundError:\n", " core_path = \"\"" ] }, { "cell_type": "code", "execution_count": 3, "id": "d8799503", "metadata": {}, "outputs": [], "source": [ "# Get a sorted list of all spectra files (consider only the spfit files that have been explained in\n", "# the references)\n", "spectra_filepaths = sorted(glob.glob(os.path.join(core_path, \"data/lvl0/\", \"smass2/*spfit*\")))" ] }, { "cell_type": "markdown", "id": "f22a0c0f", "metadata": {}, "source": [ "## Asteroid Designation Separation\n", "\n", "The spectra data have 2 different naming conventions starting with an \"a\" and starting with an \"au\". The first case corresponds to asteroids with an actual designation number (like (1) Ceres). The second case contains only temporary (at the time of the data release) designations (like 1995 BM2). Later, these spectra data need to be joined with the taxonomy class file" ] }, { "cell_type": "code", "execution_count": 4, "id": "8da22f48", "metadata": {}, "outputs": [], "source": [ "# Separate the filepaths into designation and non-designation files\n", "des_file_paths = spectra_filepaths[:-8]\n", "non_file_paths = spectra_filepaths[-8:]\n", "\n", "# Convert the arrays to dataframes\n", "des_file_paths_df = pd.DataFrame(des_file_paths, columns=[\"FilePath\"])\n", "non_file_paths_df = pd.DataFrame(non_file_paths, columns=[\"FilePath\"])" ] }, { "cell_type": "code", "execution_count": 5, "id": "07c14372", "metadata": {}, "outputs": [], "source": [ "# Add now the designation / \"non\"-designation number\n", "des_file_paths_df.loc[:, \"DesNr\"] = des_file_paths_df[\"FilePath\"] \\\n", " .apply(lambda x: int(re.search(r'smass2/a(.*).spfit',\n", " x).group(1)))\n", "non_file_paths_df.loc[:, \"DesNr\"] = non_file_paths_df[\"FilePath\"] \\\n", " .apply(lambda x: re.search(r'smass2/au(.*).spfit',\n", " x).group(1))" ] }, { "cell_type": "markdown", "id": "ef12633a", "metadata": {}, "source": [ "## Taxonomy Classification\n", "\n", "Now we read the taxonomy classification file. Theoretically, the file has only 3 columns (asteroid name, Tholen & Bus Classification) with a rather large header. 
However, due to some formatting errors and the usage of white spaces as well as tabulator tabs, Pandas identifies 5 columns in total ...\n", "\n", "Since one cannot assign these \"unknown\" classes correctly to either Tholen nor Bus, the corresponding rows are deleted later." ] }, { "cell_type": "code", "execution_count": 6, "id": "a448d73d", "metadata": {}, "outputs": [], "source": [ "# Read the classification file\n", "asteroid_class_df = pd.read_csv(os.path.join(core_path, \"data/lvl0/\", \"Bus.Taxonomy.txt\"),\n", " skiprows=21,\n", " sep=\"\\t\",\n", " names=[\"Name\",\n", " \"Tholen_Class\",\n", " \"Bus_Class\",\n", " \"unknown1\",\n", " \"unknown2\"\n", " ]\n", " )\n", "\n", "# Remove white spaces\n", "asteroid_class_df.loc[:, \"Name\"] = asteroid_class_df[\"Name\"].apply(lambda x: x.strip()).copy()\n", "\n", "# Separate between designated and non-designated asteroid classes\n", "des_ast_class_df = asteroid_class_df[:1403].copy()\n", "non_ast_class_df = asteroid_class_df[1403:].copy()" ] }, { "cell_type": "code", "execution_count": 7, "id": "41948a1b", "metadata": {}, "outputs": [], "source": [ "# Now split the designated names and get the designation number (to link with the spfit files)\n", "des_ast_class_df.loc[:, \"DesNr\"] = des_ast_class_df[\"Name\"].apply(lambda x: int(x.split(\" \")[0]))\n", "\n", "# Merge with the spectra file paths\n", "des_ast_class_join_df = des_ast_class_df.merge(des_file_paths_df, on=\"DesNr\")\n", "\n", "# For the non designated names, one needs to remove the white space between number and name and\n", "# compare with the file paths\n", "non_ast_class_df.loc[:, \"DesNr\"] = non_ast_class_df[\"Name\"].apply(lambda x: x.replace(\" \", \"\"))\n", "\n", "# Merge with the spectra file paths\n", "non_ast_class_join_df = non_ast_class_df.merge(non_file_paths_df, on=\"DesNr\")" ] }, { "cell_type": "code", "execution_count": 8, "id": "350f32af", "metadata": {}, "outputs": [], "source": [ "# Merge now both datasets\n", "asteroids_df = pd.concat([des_ast_class_join_df, non_ast_class_join_df], axis=0)\n", "\n", "# Reset the index\n", "asteroids_df.reset_index(drop=True, inplace=True)\n", "\n", "# Remove the tholen class and both unknown columns\n", "asteroids_df.drop(columns=[\"Tholen_Class\", \"unknown1\", \"unknown2\"], inplace=True)\n", "\n", "# Drop now all rows that do not contains a Bus Class\n", "asteroids_df.dropna(subset=[\"Bus_Class\"], inplace=True)" ] }, { "cell_type": "markdown", "id": "ad72f8ea", "metadata": {}, "source": [ "## Read and store the Spectra into a dataframe" ] }, { "cell_type": "code", "execution_count": 9, "id": "2d36d956", "metadata": {}, "outputs": [], "source": [ "# Read and store the spectra\n", "asteroids_df.loc[:, \"SpectrumDF\"] = \\\n", " asteroids_df[\"FilePath\"].apply(lambda x: pd.read_csv(x, sep=\"\\t\",\n", " names=[\"Wavelength_in_microm\",\n", " \"Reflectance_norm550nm\"]\n", " )\n", " )\n", "# Reset the index\n", "asteroids_df.reset_index(drop=True, inplace=True)\n", "\n", "# Convert the Designation Number to string\n", "asteroids_df.loc[:, \"DesNr\"] = asteroids_df[\"DesNr\"].astype(str)" ] }, { "cell_type": "code", "execution_count": 10, "id": "b2e0e266", "metadata": {}, "outputs": [], "source": [ "# Create (if applicable) the level 1 directory\n", "pathlib.Path(os.path.join(core_path, \"data/lvl1\")).mkdir(parents=True, exist_ok=True)\n", "\n", "# Save the dataframe as a pickle file\n", "asteroids_df.to_pickle(os.path.join(core_path, \"data/lvl1/\", \"asteroids_merged.pkl\"), protocol=4)" ] } ], "metadata": { 
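{ "cell_type": "markdown", "id": "0f1e2d3c", "metadata": {}, "source": [ "## Optional: Quick Sanity Check\n", "\n", "The following cell is not part of the original parsing pipeline; it is a minimal, optional check that the merge worked as intended. It prints the number of spectra that carry a Bus class, the distribution of these classes, and the first rows of one parsed spectrum." ] }, { "cell_type": "code", "execution_count": null, "id": "5a6b7c8d", "metadata": {}, "outputs": [], "source": [ "# Optional sanity check (can be skipped): inspect the merged dataframe\n", "print(f\"Number of spectra with a Bus class: {len(asteroids_df)}\")\n", "\n", "# Number of spectra per Bus class\n", "print(asteroids_df[\"Bus_Class\"].value_counts())\n", "\n", "# First rows of one parsed spectrum (wavelength in micrometres, reflectance normalised at 550 nm)\n", "print(asteroids_df.loc[0, \"SpectrumDF\"].head())" ] } ], "metadata": {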
"interpreter": { "hash": "4cd7ab41f5fca4b9b44701077e38c5ffd31fe66a6cab21e0214b68d958d0e462" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }