{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4575537f",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ThomasAlbin/Astroniz-YT-Tutorials/blob/main/[ML1]-Asteroid-Spectra/3_data_enrichment.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72b515e8",
   "metadata": {},
   "source": [
    "# Step 3: Data Enrichment\n",
    "\n",
    "The goal of this step is not feature creation for an ML algorithm, but to enrich the asteroid dataframe with additional information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "d4987fa4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import standard libraries\n",
    "import os\n",
    "import pathlib\n",
    "\n",
    "# Import installed libraries\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "8751fcfc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Mount Google Drive, where files and models are stored, when running in Colab; otherwise\n",
    "# work locally\n",
    "try:\n",
    "    from google.colab import drive\n",
    "    drive.mount('/gdrive')\n",
    "    core_path = \"/gdrive/MyDrive/Colab/asteroid_taxonomy/\"\n",
    "except ModuleNotFoundError:\n",
    "    core_path = \"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "2b6e61df",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the level 1 dataframe\n",
    "asteroids_df = pd.read_pickle(os.path.join(core_path, \"data/lvl1/\", \"asteroids_merged.pkl\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3d7eee95",
   "metadata": {},
   "source": [
    "## Bus classification to Main group\n",
    "\n",
    "A great summary of asteroid classification schemes, the science behind them, and some historical context can be found [here](https://vissiniti.com/asteroid-classification/). A flow chart on that page (shown below) links the various classification schemes. On its right side, the chart merges into a general \"main group\". These groups are:\n",
    "\n",
    "- C: Carbonaceous asteroids\n",
    "- S: Silicaceous (stony) asteroids\n",
    "- X: Metallic asteroids\n",
    "- Other: Miscellaneous types of rare origin or composition, or of unknown composition, such as T-type asteroids\n",
    "\n",
    "[<img src=\"https://i2.wp.com/vissiniti.com/wp-content/uploads/2019/07/Asteroid-Classification-Chapman-Tholen-to-Bus-to-BusDeMeo-v4-1.jpg?ssl=1\">](https://vissiniti.com/asteroid-classification/)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "278bfa01",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a dictionary that maps each Bus class to its main group\n",
    "bus_to_main_dict = {\n",
    "                    'A': 'Other',\n",
    "                    'B': 'C',\n",
    "                    'C': 'C',\n",
    "                    'Cb': 'C',\n",
    "                    'Cg': 'C',\n",
    "                    'Cgh': 'C',\n",
    "                    'Ch': 'C',\n",
    "                    'D': 'Other',\n",
    "                    'K': 'Other',\n",
    "                    'L': 'Other',\n",
    "                    'Ld': 'Other',\n",
    "                    'O': 'Other',\n",
    "                    'R': 'Other',\n",
    "                    'S': 'S',\n",
    "                    'Sa': 'S',\n",
    "                    'Sk': 'S',\n",
    "                    'Sl': 'S',\n",
    "                    'Sq': 'S',\n",
    "                    'Sr': 'S',\n",
    "                    'T': 'Other',\n",
    "                    'V': 'Other',\n",
    "                    'X': 'X',\n",
    "                    'Xc': 'X',\n",
    "                    'Xe': 'X',\n",
    "                    'Xk': 'X'\n",
    "                   }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "92d373b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add a \"Main_Group\" column; Bus classes missing from the dictionary get the string \"None\"\n",
    "asteroids_df[\"Main_Group\"] = asteroids_df[\"Bus_Class\"].apply(\n",
    "    lambda x: bus_to_main_dict.get(x, \"None\")\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "805c350f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove the file path and Designation Number\n",
    "asteroids_df.drop(columns=[\"DesNr\", \"FilePath\"], inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "effe38d4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Bus_Class</th>\n",
       "      <th>SpectrumDF</th>\n",
       "      <th>Main_Group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1 Ceres</td>\n",
       "      <td>C</td>\n",
       "      <td>Wavelength_in_microm  Reflectance_norm550n...</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2 Pallas</td>\n",
       "      <td>B</td>\n",
       "      <td>Wavelength_in_microm  Reflectance_norm550n...</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3 Juno</td>\n",
       "      <td>Sk</td>\n",
       "      <td>Wavelength_in_microm  Reflectance_norm550n...</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4 Vesta</td>\n",
       "      <td>V</td>\n",
       "      <td>Wavelength_in_microm  Reflectance_norm550n...</td>\n",
       "      <td>Other</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5 Astraea</td>\n",
       "      <td>S</td>\n",
       "      <td>Wavelength_in_microm  Reflectance_norm550n...</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1334</th>\n",
       "      <td>1996 UK</td>\n",
       "      <td>Sq</td>\n",
       "      <td>Wavelength_in_microm  Reflectance_norm550n...</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1335</th>\n",
       "      <td>1996 VC</td>\n",
       "      <td>S</td>\n",
       "      <td>Wavelength_in_microm  Reflectance_norm550n...</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1336</th>\n",
       "      <td>1997 CZ5</td>\n",
       "      <td>S</td>\n",
       "      <td>Wavelength_in_microm  Reflectance_norm550n...</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1337</th>\n",
       "      <td>1997 RD1</td>\n",
       "      <td>Sq</td>\n",
       "      <td>Wavelength_in_microm  Reflectance_norm550n...</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1338</th>\n",
       "      <td>1998 WS</td>\n",
       "      <td>Sr</td>\n",
       "      <td>Wavelength_in_microm  Reflectance_norm550n...</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1339 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           Name Bus_Class                                         SpectrumDF  \\\n",
       "0       1 Ceres         C      Wavelength_in_microm  Reflectance_norm550n...   \n",
       "1      2 Pallas         B      Wavelength_in_microm  Reflectance_norm550n...   \n",
       "2        3 Juno        Sk      Wavelength_in_microm  Reflectance_norm550n...   \n",
       "3       4 Vesta         V      Wavelength_in_microm  Reflectance_norm550n...   \n",
       "4     5 Astraea         S      Wavelength_in_microm  Reflectance_norm550n...   \n",
       "...         ...       ...                                                ...   \n",
       "1334    1996 UK        Sq      Wavelength_in_microm  Reflectance_norm550n...   \n",
       "1335    1996 VC         S      Wavelength_in_microm  Reflectance_norm550n...   \n",
       "1336   1997 CZ5         S      Wavelength_in_microm  Reflectance_norm550n...   \n",
       "1337   1997 RD1        Sq      Wavelength_in_microm  Reflectance_norm550n...   \n",
       "1338    1998 WS        Sr      Wavelength_in_microm  Reflectance_norm550n...   \n",
       "\n",
       "     Main_Group  \n",
       "0             C  \n",
       "1             C  \n",
       "2             S  \n",
       "3         Other  \n",
       "4             S  \n",
       "...         ...  \n",
       "1334          S  \n",
       "1335          S  \n",
       "1336          S  \n",
       "1337          S  \n",
       "1338          S  \n",
       "\n",
       "[1339 rows x 4 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Show the final data set for anyone who is interested ...\n",
    "asteroids_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "ad7ba1dc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Wavelength_in_microm</th>\n",
       "      <th>Reflectance_norm550nm</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.44</td>\n",
       "      <td>0.9281</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.45</td>\n",
       "      <td>0.9388</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.46</td>\n",
       "      <td>0.9488</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.47</td>\n",
       "      <td>0.9572</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.48</td>\n",
       "      <td>0.9643</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.49</td>\n",
       "      <td>0.9716</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0.50</td>\n",
       "      <td>0.9788</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0.51</td>\n",
       "      <td>0.9859</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0.52</td>\n",
       "      <td>0.9923</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0.53</td>\n",
       "      <td>0.9955</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>0.54</td>\n",
       "      <td>0.9969</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>0.55</td>\n",
       "      <td>1.0000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>0.56</td>\n",
       "      <td>1.0040</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>0.57</td>\n",
       "      <td>1.0056</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>0.58</td>\n",
       "      <td>1.0037</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>0.59</td>\n",
       "      <td>1.0036</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>0.60</td>\n",
       "      <td>1.0044</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>0.61</td>\n",
       "      <td>1.0071</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>0.62</td>\n",
       "      <td>1.0107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>0.63</td>\n",
       "      <td>1.0113</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>0.64</td>\n",
       "      <td>1.0117</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>0.65</td>\n",
       "      <td>1.0127</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>0.66</td>\n",
       "      <td>1.0128</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>0.67</td>\n",
       "      <td>1.0124</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>0.68</td>\n",
       "      <td>1.0151</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>0.69</td>\n",
       "      <td>1.0160</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>0.70</td>\n",
       "      <td>1.0146</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>0.71</td>\n",
       "      <td>1.0178</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>0.72</td>\n",
       "      <td>1.0222</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>0.73</td>\n",
       "      <td>1.0216</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>0.74</td>\n",
       "      <td>1.0191</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>0.75</td>\n",
       "      <td>1.0179</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>0.76</td>\n",
       "      <td>1.0167</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>0.77</td>\n",
       "      <td>1.0149</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>0.78</td>\n",
       "      <td>1.0161</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>0.79</td>\n",
       "      <td>1.0176</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>0.80</td>\n",
       "      <td>1.0178</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37</th>\n",
       "      <td>0.81</td>\n",
       "      <td>1.0196</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38</th>\n",
       "      <td>0.82</td>\n",
       "      <td>1.0200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>0.83</td>\n",
       "      <td>1.0164</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40</th>\n",
       "      <td>0.84</td>\n",
       "      <td>1.0135</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41</th>\n",
       "      <td>0.85</td>\n",
       "      <td>1.0140</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42</th>\n",
       "      <td>0.86</td>\n",
       "      <td>1.0147</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43</th>\n",
       "      <td>0.87</td>\n",
       "      <td>1.0151</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44</th>\n",
       "      <td>0.88</td>\n",
       "      <td>1.0142</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>45</th>\n",
       "      <td>0.89</td>\n",
       "      <td>1.0146</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>46</th>\n",
       "      <td>0.90</td>\n",
       "      <td>1.0165</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>47</th>\n",
       "      <td>0.91</td>\n",
       "      <td>1.0181</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48</th>\n",
       "      <td>0.92</td>\n",
       "      <td>1.0200</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Wavelength_in_microm  Reflectance_norm550nm\n",
       "0                   0.44                 0.9281\n",
       "1                   0.45                 0.9388\n",
       "2                   0.46                 0.9488\n",
       "3                   0.47                 0.9572\n",
       "4                   0.48                 0.9643\n",
       "5                   0.49                 0.9716\n",
       "6                   0.50                 0.9788\n",
       "7                   0.51                 0.9859\n",
       "8                   0.52                 0.9923\n",
       "9                   0.53                 0.9955\n",
       "10                  0.54                 0.9969\n",
       "11                  0.55                 1.0000\n",
       "12                  0.56                 1.0040\n",
       "13                  0.57                 1.0056\n",
       "14                  0.58                 1.0037\n",
       "15                  0.59                 1.0036\n",
       "16                  0.60                 1.0044\n",
       "17                  0.61                 1.0071\n",
       "18                  0.62                 1.0107\n",
       "19                  0.63                 1.0113\n",
       "20                  0.64                 1.0117\n",
       "21                  0.65                 1.0127\n",
       "22                  0.66                 1.0128\n",
       "23                  0.67                 1.0124\n",
       "24                  0.68                 1.0151\n",
       "25                  0.69                 1.0160\n",
       "26                  0.70                 1.0146\n",
       "27                  0.71                 1.0178\n",
       "28                  0.72                 1.0222\n",
       "29                  0.73                 1.0216\n",
       "30                  0.74                 1.0191\n",
       "31                  0.75                 1.0179\n",
       "32                  0.76                 1.0167\n",
       "33                  0.77                 1.0149\n",
       "34                  0.78                 1.0161\n",
       "35                  0.79                 1.0176\n",
       "36                  0.80                 1.0178\n",
       "37                  0.81                 1.0196\n",
       "38                  0.82                 1.0200\n",
       "39                  0.83                 1.0164\n",
       "40                  0.84                 1.0135\n",
       "41                  0.85                 1.0140\n",
       "42                  0.86                 1.0147\n",
       "43                  0.87                 1.0151\n",
       "44                  0.88                 1.0142\n",
       "45                  0.89                 1.0146\n",
       "46                  0.90                 1.0165\n",
       "47                  0.91                 1.0181\n",
       "48                  0.92                 1.0200"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
    ],
   "source": [
    "# ... and also the spectrum of Ceres (selected by name and position, not by index label)\n",
    "asteroids_df.loc[asteroids_df[\"Name\"] == \"1 Ceres\", \"SpectrumDF\"].iloc[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "e181ee97",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create Level 2 directory and save the dataframe\n",
    "pathlib.Path(os.path.join(core_path, \"data/lvl2\")).mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "# Save the dataframe as a pickle file\n",
    "asteroids_df.to_pickle(os.path.join(core_path, \"data/lvl2/\", \"asteroids.pkl\"), protocol=4)"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "4cd7ab41f5fca4b9b44701077e38c5ffd31fe66a6cab21e0214b68d958d0e462"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}