{
"cells": [
{
"cell_type": "markdown",
"id": "4575537f",
"metadata": {},
"source": [
"[](https://colab.research.google.com/github/ThomasAlbin/Astroniz-YT-Tutorials/blob/main/[ML1]-Asteroid-Spectra/3_data_enrichment.ipynb)"
]
},
{
"cell_type": "markdown",
"id": "72b515e8",
"metadata": {},
"source": [
"# Step 3: Data Enrichment\n",
"\n",
"This section is not about feature creation (for an ML algorithm), but to enrich the asteroid dataframe with more, additional information."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "d4987fa4",
"metadata": {},
"outputs": [],
"source": [
"# Import standard libraries\n",
"import os\n",
"import pathlib\n",
"\n",
"# Import installed libraries\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8751fcfc",
"metadata": {},
"outputs": [],
"source": [
"# Let's mount the Google Drive, where we store files and models (if applicable, otherwise work\n",
"# locally)\n",
"try:\n",
" from google.colab import drive\n",
" drive.mount('/gdrive')\n",
" core_path = \"/gdrive/MyDrive/Colab/asteroid_taxonomy/\"\n",
"except ModuleNotFoundError:\n",
" core_path = \"\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2b6e61df",
"metadata": {},
"outputs": [],
"source": [
"# Read the level 1 dataframe\n",
"asteroids_df = pd.read_pickle(os.path.join(core_path, \"data/lvl1/\", \"asteroids_merged.pkl\"))"
]
},
{
"cell_type": "markdown",
"id": "3d7eee95",
"metadata": {},
"source": [
"## Bus classification to Main group\n",
"\n",
"A great summary of asteroid classification schemas, the science behind it and some historical context can be found [here](https://vissiniti.com/asteroid-classification/). One flow chart shows the link between miscellaneous classification schemas. On the right side the flow chart merges into a general \"main group\". These groups are:\n",
"\n",
"- C: Carbonaceous asteroids\n",
"- S: Silicaceous (stony) asteroids\n",
"- X: Metallic asteroids\n",
"- Other: Miscellaneous types of rare origin / composition; or even unknown composition like T-Asteroids\n",
"\n",
"[
](https://vissiniti.com/asteroid-classification/)\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "278bfa01",
"metadata": {},
"outputs": [],
"source": [
"# Create a dictionary that maps the Bus Classification with the main group\n",
"bus_to_main_dict = {\n",
" 'A': 'Other',\n",
" 'B': 'C',\n",
" 'C': 'C',\n",
" 'Cb': 'C',\n",
" 'Cg': 'C',\n",
" 'Cgh': 'C',\n",
" 'Ch': 'C',\n",
" 'D': 'Other',\n",
" 'K': 'Other',\n",
" 'L': 'Other',\n",
" 'Ld': 'Other',\n",
" 'O': 'Other',\n",
" 'R': 'Other',\n",
" 'S': 'S',\n",
" 'Sa': 'S',\n",
" 'Sk': 'S',\n",
" 'Sl': 'S',\n",
" 'Sq': 'S',\n",
" 'Sr': 'S',\n",
" 'T': 'Other',\n",
" 'V': 'Other',\n",
" 'X': 'X',\n",
" 'Xc': 'X',\n",
" 'Xe': 'X',\n",
" 'Xk': 'X'\n",
" }"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "92d373b3",
"metadata": {},
"outputs": [],
"source": [
"# Create a new \"main group class\"\n",
"asteroids_df.loc[:, \"Main_Group\"] = asteroids_df[\"Bus_Class\"].apply(lambda x:\n",
" bus_to_main_dict.get(x, \"None\"))"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "805c350f",
"metadata": {},
"outputs": [],
"source": [
"# Remove the file path and Designation Number\n",
"asteroids_df.drop(columns=[\"DesNr\", \"FilePath\"], inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "effe38d4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
| \n", " | Name | \n", "Bus_Class | \n", "SpectrumDF | \n", "Main_Group | \n", "
|---|---|---|---|---|
| 0 | \n", "1 Ceres | \n", "C | \n", "Wavelength_in_microm Reflectance_norm550n... | \n", "C | \n", "
| 1 | \n", "2 Pallas | \n", "B | \n", "Wavelength_in_microm Reflectance_norm550n... | \n", "C | \n", "
| 2 | \n", "3 Juno | \n", "Sk | \n", "Wavelength_in_microm Reflectance_norm550n... | \n", "S | \n", "
| 3 | \n", "4 Vesta | \n", "V | \n", "Wavelength_in_microm Reflectance_norm550n... | \n", "Other | \n", "
| 4 | \n", "5 Astraea | \n", "S | \n", "Wavelength_in_microm Reflectance_norm550n... | \n", "S | \n", "
| ... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
| 1334 | \n", "1996 UK | \n", "Sq | \n", "Wavelength_in_microm Reflectance_norm550n... | \n", "S | \n", "
| 1335 | \n", "1996 VC | \n", "S | \n", "Wavelength_in_microm Reflectance_norm550n... | \n", "S | \n", "
| 1336 | \n", "1997 CZ5 | \n", "S | \n", "Wavelength_in_microm Reflectance_norm550n... | \n", "S | \n", "
| 1337 | \n", "1997 RD1 | \n", "Sq | \n", "Wavelength_in_microm Reflectance_norm550n... | \n", "S | \n", "
| 1338 | \n", "1998 WS | \n", "Sr | \n", "Wavelength_in_microm Reflectance_norm550n... | \n", "S | \n", "
1339 rows × 4 columns

| | Wavelength_in_microm | Reflectance_norm550nm |
|---|---|---|
| 0 | 0.44 | 0.9281 |
| 1 | 0.45 | 0.9388 |
| 2 | 0.46 | 0.9488 |
| 3 | 0.47 | 0.9572 |
| 4 | 0.48 | 0.9643 |
| 5 | 0.49 | 0.9716 |
| 6 | 0.50 | 0.9788 |
| 7 | 0.51 | 0.9859 |
| 8 | 0.52 | 0.9923 |
| 9 | 0.53 | 0.9955 |
| 10 | 0.54 | 0.9969 |
| 11 | 0.55 | 1.0000 |
| 12 | 0.56 | 1.0040 |
| 13 | 0.57 | 1.0056 |
| 14 | 0.58 | 1.0037 |
| 15 | 0.59 | 1.0036 |
| 16 | 0.60 | 1.0044 |
| 17 | 0.61 | 1.0071 |
| 18 | 0.62 | 1.0107 |
| 19 | 0.63 | 1.0113 |
| 20 | 0.64 | 1.0117 |
| 21 | 0.65 | 1.0127 |
| 22 | 0.66 | 1.0128 |
| 23 | 0.67 | 1.0124 |
| 24 | 0.68 | 1.0151 |
| 25 | 0.69 | 1.0160 |
| 26 | 0.70 | 1.0146 |
| 27 | 0.71 | 1.0178 |
| 28 | 0.72 | 1.0222 |
| 29 | 0.73 | 1.0216 |
| 30 | 0.74 | 1.0191 |
| 31 | 0.75 | 1.0179 |
| 32 | 0.76 | 1.0167 |
| 33 | 0.77 | 1.0149 |
| 34 | 0.78 | 1.0161 |
| 35 | 0.79 | 1.0176 |
| 36 | 0.80 | 1.0178 |
| 37 | 0.81 | 1.0196 |
| 38 | 0.82 | 1.0200 |
| 39 | 0.83 | 1.0164 |
| 40 | 0.84 | 1.0135 |
| 41 | 0.85 | 1.0140 |
| 42 | 0.86 | 1.0147 |
| 43 | 0.87 | 1.0151 |
| 44 | 0.88 | 1.0142 |
| 45 | 0.89 | 1.0146 |
| 46 | 0.90 | 1.0165 |
| 47 | 0.91 | 1.0181 |
| 48 | 0.92 | 1.0200 |