{ "cells": [ { "cell_type": "markdown", "id": "4575537f", "metadata": {}, "source": [ "[](https://colab.research.google.com/github/ThomasAlbin/Astroniz-YT-Tutorials/blob/main/[ML1]-Asteroid-Spectra/3_data_enrichment.ipynb)" ] }, { "cell_type": "markdown", "id": "72b515e8", "metadata": {}, "source": [ "# Step 3: Data Enrichment\n", "\n", "This section is not about feature creation (for an ML algorithm), but to enrich the asteroid dataframe with more, additional information." ] }, { "cell_type": "code", "execution_count": 1, "id": "d4987fa4", "metadata": {}, "outputs": [], "source": [ "# Import standard libraries\n", "import os\n", "import pathlib\n", "\n", "# Import installed libraries\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "8751fcfc", "metadata": {}, "outputs": [], "source": [ "# Let's mount the Google Drive, where we store files and models (if applicable, otherwise work\n", "# locally)\n", "try:\n", " from google.colab import drive\n", " drive.mount('/gdrive')\n", " core_path = \"/gdrive/MyDrive/Colab/asteroid_taxonomy/\"\n", "except ModuleNotFoundError:\n", " core_path = \"\"" ] }, { "cell_type": "code", "execution_count": 3, "id": "2b6e61df", "metadata": {}, "outputs": [], "source": [ "# Read the level 1 dataframe\n", "asteroids_df = pd.read_pickle(os.path.join(core_path, \"data/lvl1/\", \"asteroids_merged.pkl\"))" ] }, { "cell_type": "markdown", "id": "3d7eee95", "metadata": {}, "source": [ "## Bus classification to Main group\n", "\n", "A great summary of asteroid classification schemas, the science behind it and some historical context can be found [here](https://vissiniti.com/asteroid-classification/). One flow chart shows the link between miscellaneous classification schemas. On the right side the flow chart merges into a general \"main group\". These groups are:\n", "\n", "- C: Carbonaceous asteroids\n", "- S: Silicaceous (stony) asteroids\n", "- X: Metallic asteroids\n", "- Other: Miscellaneous types of rare origin / composition; or even unknown composition like T-Asteroids\n", "\n", "[<img src=\"https://i2.wp.com/vissiniti.com/wp-content/uploads/2019/07/Asteroid-Classification-Chapman-Tholen-to-Bus-to-BusDeMeo-v4-1.jpg?ssl=1\">](https://vissiniti.com/asteroid-classification/)\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "278bfa01", "metadata": {}, "outputs": [], "source": [ "# Create a dictionary that maps the Bus Classification with the main group\n", "bus_to_main_dict = {\n", " 'A': 'Other',\n", " 'B': 'C',\n", " 'C': 'C',\n", " 'Cb': 'C',\n", " 'Cg': 'C',\n", " 'Cgh': 'C',\n", " 'Ch': 'C',\n", " 'D': 'Other',\n", " 'K': 'Other',\n", " 'L': 'Other',\n", " 'Ld': 'Other',\n", " 'O': 'Other',\n", " 'R': 'Other',\n", " 'S': 'S',\n", " 'Sa': 'S',\n", " 'Sk': 'S',\n", " 'Sl': 'S',\n", " 'Sq': 'S',\n", " 'Sr': 'S',\n", " 'T': 'Other',\n", " 'V': 'Other',\n", " 'X': 'X',\n", " 'Xc': 'X',\n", " 'Xe': 'X',\n", " 'Xk': 'X'\n", " }" ] }, { "cell_type": "code", "execution_count": 5, "id": "92d373b3", "metadata": {}, "outputs": [], "source": [ "# Create a new \"main group class\"\n", "asteroids_df.loc[:, \"Main_Group\"] = asteroids_df[\"Bus_Class\"].apply(lambda x:\n", " bus_to_main_dict.get(x, \"None\"))" ] }, { "cell_type": "code", "execution_count": 6, "id": "805c350f", "metadata": {}, "outputs": [], "source": [ "# Remove the file path and Designation Number\n", "asteroids_df.drop(columns=[\"DesNr\", \"FilePath\"], inplace=True)" ] }, { "cell_type": "code", "execution_count": 7, "id": "effe38d4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Name</th>\n", " <th>Bus_Class</th>\n", " <th>SpectrumDF</th>\n", " <th>Main_Group</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1 Ceres</td>\n", " <td>C</td>\n", " <td>Wavelength_in_microm Reflectance_norm550n...</td>\n", " <td>C</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2 Pallas</td>\n", " <td>B</td>\n", " <td>Wavelength_in_microm Reflectance_norm550n...</td>\n", " <td>C</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3 Juno</td>\n", " <td>Sk</td>\n", " <td>Wavelength_in_microm Reflectance_norm550n...</td>\n", " <td>S</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4 Vesta</td>\n", " <td>V</td>\n", " <td>Wavelength_in_microm Reflectance_norm550n...</td>\n", " <td>Other</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>5 Astraea</td>\n", " <td>S</td>\n", " <td>Wavelength_in_microm Reflectance_norm550n...</td>\n", " <td>S</td>\n", " </tr>\n", " <tr>\n", " <th>...</th>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <th>1334</th>\n", " <td>1996 UK</td>\n", " <td>Sq</td>\n", " <td>Wavelength_in_microm Reflectance_norm550n...</td>\n", " <td>S</td>\n", " </tr>\n", " <tr>\n", " <th>1335</th>\n", " <td>1996 VC</td>\n", " <td>S</td>\n", " <td>Wavelength_in_microm Reflectance_norm550n...</td>\n", " <td>S</td>\n", " </tr>\n", " <tr>\n", " <th>1336</th>\n", " <td>1997 CZ5</td>\n", " <td>S</td>\n", " <td>Wavelength_in_microm Reflectance_norm550n...</td>\n", " <td>S</td>\n", " </tr>\n", " <tr>\n", " <th>1337</th>\n", " <td>1997 RD1</td>\n", " <td>Sq</td>\n", " <td>Wavelength_in_microm Reflectance_norm550n...</td>\n", " <td>S</td>\n", " </tr>\n", " <tr>\n", " <th>1338</th>\n", " <td>1998 WS</td>\n", " <td>Sr</td>\n", " <td>Wavelength_in_microm Reflectance_norm550n...</td>\n", " <td>S</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>1339 rows × 4 columns</p>\n", "</div>" ], "text/plain": [ " Name Bus_Class SpectrumDF \\\n", "0 1 Ceres C Wavelength_in_microm Reflectance_norm550n... \n", "1 2 Pallas B Wavelength_in_microm Reflectance_norm550n... \n", "2 3 Juno Sk Wavelength_in_microm Reflectance_norm550n... \n", "3 4 Vesta V Wavelength_in_microm Reflectance_norm550n... \n", "4 5 Astraea S Wavelength_in_microm Reflectance_norm550n... \n", "... ... ... ... \n", "1334 1996 UK Sq Wavelength_in_microm Reflectance_norm550n... \n", "1335 1996 VC S Wavelength_in_microm Reflectance_norm550n... \n", "1336 1997 CZ5 S Wavelength_in_microm Reflectance_norm550n... \n", "1337 1997 RD1 Sq Wavelength_in_microm Reflectance_norm550n... \n", "1338 1998 WS Sr Wavelength_in_microm Reflectance_norm550n... \n", "\n", " Main_Group \n", "0 C \n", "1 C \n", "2 S \n", "3 Other \n", "4 S \n", "... ... \n", "1334 S \n", "1335 S \n", "1336 S \n", "1337 S \n", "1338 S \n", "\n", "[1339 rows x 4 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show the final data set for anyone who is interested ...\n", "asteroids_df" ] }, { "cell_type": "code", "execution_count": 8, "id": "ad7ba1dc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Wavelength_in_microm</th>\n", " <th>Reflectance_norm550nm</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.44</td>\n", " <td>0.9281</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.45</td>\n", " <td>0.9388</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.46</td>\n", " <td>0.9488</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.47</td>\n", " <td>0.9572</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.48</td>\n", " <td>0.9643</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>0.49</td>\n", " <td>0.9716</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>0.50</td>\n", " <td>0.9788</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>0.51</td>\n", " <td>0.9859</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>0.52</td>\n", " <td>0.9923</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>0.53</td>\n", " <td>0.9955</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>0.54</td>\n", " <td>0.9969</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>0.55</td>\n", " <td>1.0000</td>\n", " </tr>\n", " <tr>\n", " <th>12</th>\n", " <td>0.56</td>\n", " <td>1.0040</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td>0.57</td>\n", " <td>1.0056</td>\n", " </tr>\n", " <tr>\n", " <th>14</th>\n", " <td>0.58</td>\n", " <td>1.0037</td>\n", " </tr>\n", " <tr>\n", " <th>15</th>\n", " <td>0.59</td>\n", " <td>1.0036</td>\n", " </tr>\n", " <tr>\n", " <th>16</th>\n", " <td>0.60</td>\n", " <td>1.0044</td>\n", " </tr>\n", " <tr>\n", " <th>17</th>\n", " <td>0.61</td>\n", " <td>1.0071</td>\n", " </tr>\n", " <tr>\n", " <th>18</th>\n", " <td>0.62</td>\n", " <td>1.0107</td>\n", " </tr>\n", " <tr>\n", " <th>19</th>\n", " <td>0.63</td>\n", " <td>1.0113</td>\n", " </tr>\n", " <tr>\n", " <th>20</th>\n", " <td>0.64</td>\n", " <td>1.0117</td>\n", " </tr>\n", " <tr>\n", " <th>21</th>\n", " <td>0.65</td>\n", " <td>1.0127</td>\n", " </tr>\n", " <tr>\n", " <th>22</th>\n", " <td>0.66</td>\n", " <td>1.0128</td>\n", " </tr>\n", " <tr>\n", " <th>23</th>\n", " <td>0.67</td>\n", " <td>1.0124</td>\n", " </tr>\n", " <tr>\n", " <th>24</th>\n", " <td>0.68</td>\n", " <td>1.0151</td>\n", " </tr>\n", " <tr>\n", " <th>25</th>\n", " <td>0.69</td>\n", " <td>1.0160</td>\n", " </tr>\n", " <tr>\n", " <th>26</th>\n", " <td>0.70</td>\n", " <td>1.0146</td>\n", " </tr>\n", " <tr>\n", " <th>27</th>\n", " <td>0.71</td>\n", " <td>1.0178</td>\n", " </tr>\n", " <tr>\n", " <th>28</th>\n", " <td>0.72</td>\n", " <td>1.0222</td>\n", " </tr>\n", " <tr>\n", " <th>29</th>\n", " <td>0.73</td>\n", " <td>1.0216</td>\n", " </tr>\n", " <tr>\n", " <th>30</th>\n", " <td>0.74</td>\n", " <td>1.0191</td>\n", " </tr>\n", " <tr>\n", " <th>31</th>\n", " <td>0.75</td>\n", " <td>1.0179</td>\n", " </tr>\n", " <tr>\n", " <th>32</th>\n", " <td>0.76</td>\n", " <td>1.0167</td>\n", " </tr>\n", " <tr>\n", " <th>33</th>\n", " <td>0.77</td>\n", " <td>1.0149</td>\n", " </tr>\n", " <tr>\n", " <th>34</th>\n", " <td>0.78</td>\n", " <td>1.0161</td>\n", " </tr>\n", " <tr>\n", " <th>35</th>\n", " <td>0.79</td>\n", " <td>1.0176</td>\n", " </tr>\n", " <tr>\n", " <th>36</th>\n", " <td>0.80</td>\n", " <td>1.0178</td>\n", " </tr>\n", " <tr>\n", " <th>37</th>\n", " <td>0.81</td>\n", " <td>1.0196</td>\n", " </tr>\n", " <tr>\n", " <th>38</th>\n", " <td>0.82</td>\n", " <td>1.0200</td>\n", " </tr>\n", " <tr>\n", " <th>39</th>\n", " <td>0.83</td>\n", " <td>1.0164</td>\n", " </tr>\n", " <tr>\n", " <th>40</th>\n", " <td>0.84</td>\n", " <td>1.0135</td>\n", " </tr>\n", " <tr>\n", " <th>41</th>\n", " <td>0.85</td>\n", " <td>1.0140</td>\n", " </tr>\n", " <tr>\n", " <th>42</th>\n", " <td>0.86</td>\n", " <td>1.0147</td>\n", " </tr>\n", " <tr>\n", " <th>43</th>\n", " <td>0.87</td>\n", " <td>1.0151</td>\n", " </tr>\n", " <tr>\n", " <th>44</th>\n", " <td>0.88</td>\n", " <td>1.0142</td>\n", " </tr>\n", " <tr>\n", " <th>45</th>\n", " <td>0.89</td>\n", " <td>1.0146</td>\n", " </tr>\n", " <tr>\n", " <th>46</th>\n", " <td>0.90</td>\n", " <td>1.0165</td>\n", " </tr>\n", " <tr>\n", " <th>47</th>\n", " <td>0.91</td>\n", " <td>1.0181</td>\n", " </tr>\n", " <tr>\n", " <th>48</th>\n", " <td>0.92</td>\n", " <td>1.0200</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Wavelength_in_microm Reflectance_norm550nm\n", "0 0.44 0.9281\n", "1 0.45 0.9388\n", "2 0.46 0.9488\n", "3 0.47 0.9572\n", "4 0.48 0.9643\n", "5 0.49 0.9716\n", "6 0.50 0.9788\n", "7 0.51 0.9859\n", "8 0.52 0.9923\n", "9 0.53 0.9955\n", "10 0.54 0.9969\n", "11 0.55 1.0000\n", "12 0.56 1.0040\n", "13 0.57 1.0056\n", "14 0.58 1.0037\n", "15 0.59 1.0036\n", "16 0.60 1.0044\n", "17 0.61 1.0071\n", "18 0.62 1.0107\n", "19 0.63 1.0113\n", "20 0.64 1.0117\n", "21 0.65 1.0127\n", "22 0.66 1.0128\n", "23 0.67 1.0124\n", "24 0.68 1.0151\n", "25 0.69 1.0160\n", "26 0.70 1.0146\n", "27 0.71 1.0178\n", "28 0.72 1.0222\n", "29 0.73 1.0216\n", "30 0.74 1.0191\n", "31 0.75 1.0179\n", "32 0.76 1.0167\n", "33 0.77 1.0149\n", "34 0.78 1.0161\n", "35 0.79 1.0176\n", "36 0.80 1.0178\n", "37 0.81 1.0196\n", "38 0.82 1.0200\n", "39 0.83 1.0164\n", "40 0.84 1.0135\n", "41 0.85 1.0140\n", "42 0.86 1.0147\n", "43 0.87 1.0151\n", "44 0.88 1.0142\n", "45 0.89 1.0146\n", "46 0.90 1.0165\n", "47 0.91 1.0181\n", "48 0.92 1.0200" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# ... and also the spectrum of Ceres\n", "asteroids_df.loc[asteroids_df[\"Name\"] == \"1 Ceres\"][\"SpectrumDF\"][0]" ] }, { "cell_type": "code", "execution_count": 9, "id": "e181ee97", "metadata": {}, "outputs": [], "source": [ "# Create Level 2 directory and save the dataframe\n", "pathlib.Path(os.path.join(core_path, \"data/lvl2\")).mkdir(parents=True, exist_ok=True)\n", "\n", "# Save the dataframe as a pickle file\n", "asteroids_df.to_pickle(os.path.join(core_path, \"data/lvl2/\", \"asteroids.pkl\"), protocol=4)" ] } ], "metadata": { "interpreter": { "hash": "4cd7ab41f5fca4b9b44701077e38c5ffd31fe66a6cab21e0214b68d958d0e462" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }