{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5aa74260",
   "metadata": {},
   "source": [
    "# Feature Resolution and Residuals Experiment\n",
    "\n",
    "This notebook uses the SageWorks Science Workbench to quickly build an AWS® Machine Learning Pipeline with the AQSolDB public dataset. For this exercise we're going to look at the relationship between feature space and target values, specifically we're going to use SageWorks to help us identify areas where compounds that are close in feature space have significant differences in their target values (solubility in this case).\n",
    "<br><br>\n",
    "\n",
    "## Data\n",
    "AqSolDB: A curated reference set of aqueous solubility, created by the Autonomous Energy Materials Discovery [AMD] research group, consists of aqueous solubility values of 9,982 unique compounds curated from 9 different publicly available aqueous solubility datasets. AqSolDB also contains some relevant topological and physico-chemical 2D descriptors. Additionally, AqSolDB contains validated molecular representations of each of the compounds. This openly accessible dataset, which is the largest of its kind, and will not only serve as a useful reference source of measured and calculated solubility data, but also as a much improved and generalizable training data source for building data-driven models. (2019-04-10)\n",
    "\n",
    "Main Reference:\n",
    "https://www.nature.com/articles/s41597-019-0151-1\n",
    "\n",
    "Data Dowloaded from the Harvard DataVerse:\n",
    "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OVHAW8\n",
    "\n",
    "® Amazon Web Services, AWS, the Powered by AWS logo, are trademarks of Amazon.com, Inc. or its affiliates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "124e5180-b977-49e4-8e3f-2fc183f57c4a",
   "metadata": {},
   "source": [
    "import os\n",
    "os.environ[\"SAGEWORKS_CONFIG\"] = \"/Users/briford/.sageworks/ideaya_sandbox.json\""
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "77f0186f-c6ac-4dc7-a804-ce5d248d25b7",
   "metadata": {},
   "source": [
    "import sageworks\n",
    "import logging\n",
    "logging.getLogger(\"sageworks\").setLevel(logging.WARNING)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a7ae1c21",
   "metadata": {},
   "source": [
    "# We've already created a FeatureSet so just grab it a sample\n",
    "from sageworks.api.feature_set import FeatureSet\n",
    "fs = FeatureSet(\"test_sol_nightly_log_s\")\n",
    "full_df = fs.pull_dataframe()"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "97243583",
   "metadata": {},
   "source": [
    "full_df.head()"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "17b5ccca-0563-40e4-b0fd-9e3791912ec4",
   "metadata": {},
   "source": [
    "# df[\"class\"].value_counts()"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "bb268a27-6f18-432e-8eb5-43316f6174cf",
   "metadata": {},
   "source": [
    "# Sanity check our solubility and solubility_class\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "plt.rcParams['font.size'] = 12.0\n",
    "plt.rcParams['figure.figsize'] = 14.0, 5.0\n",
    "sns.set_theme(style='darkgrid')\n",
    "\n",
    "# Create a box plot\n",
    "ax = sns.boxplot(x='class', y='log_s', data=full_df, order = [2, 1, 0])\n",
    "plt.title('Solubility by Solubility Class')\n",
    "plt.xlabel('Solubility Class')\n",
    "plt.ylabel('Solubility')\n",
    "plt.show()"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "2d83c358-cd7c-4b6e-acb0-0e2e1d3ec3ae",
   "metadata": {},
   "source": [
    "# Let's get the current model metrics\n",
    "from sageworks.api.model import Model\n",
    "model = Model(\"test-sol-nightly-regression\")\n",
    "\n",
    "features = model.features()\n",
    "target = model.target()\n",
    "id_column = \"udm_mol_bat_id\"\n",
    "model.performance_metrics()"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "f797f3ba-c197-462a-ba6e-6fb0ea7de1a2",
   "metadata": {},
   "source": [
    "# Okay lets look at our reference model (already deployed to an Endpoint)\n",
    "from sageworks.api.endpoint import Endpoint\n",
    "end = Endpoint(\"test-sol-nightly-regression-end\")\n",
    "pred_df = end.auto_inference()"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "d5532a40-3301-455c-9e71-81e574e6c067",
   "metadata": {},
   "source": [
    "plot_predictions(pred_df)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "id": "5cfa6a10-671b-41f0-bc0d-b3c3f4b2082e",
   "metadata": {},
   "source": [
    "show_columns = [id_column, target, \"prediction\", \"residuals\", \"residuals_abs\", \"udm_asy_protocol\", \"smiles\"]\n",
    "high_residuals = pred_df[(pred_df[\"residuals_abs\"] > 2.0) & (pred_df[\"log_s\"] != -16.0)][show_columns]\n",
    "high_residuals"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "a2aa6e0e-7070-46cd-b6fe-0bb10afa2f3c",
   "metadata": {},
   "source": [
    "# Feature Spider"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ef5a1fc-29e9-43d4-95f1-8150df0d060b",
   "metadata": {},
   "source": [
    "# from sageworks.utils.endpoint_utils import fs_training_data\n",
    "# training_df = fs_training_data(end)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f8902562-61c2-4c22-8ccd-b6ea66cd6531",
   "metadata": {},
   "source": [
    "# training_df.shape"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "75262762-3c51-4c71-b8dc-a7e417230a41",
   "metadata": {},
   "source": [
    "from sageworks.algorithms.dataframe.feature_spider import FeatureSpider\n",
    "\n",
    "feature_spider = FeatureSpider(full_df, features=features, target_column=target, id_column=id_column)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "2eb41fdf-e131-463d-97f7-131a419c8c45",
   "metadata": {},
   "source": [
    "pred_df[\"confidence\"] = feature_spider.confidence_scores(pred_df, pred_df[\"prediction\"])"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "ea3548a2-eea7-4b25-8ec3-d3fd223ce310",
   "metadata": {},
   "source": [
    "plot_predictions(pred_df, color=\"confidence\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "442f18a7-12de-440e-ad41-4efc28d71ece",
   "metadata": {},
   "source": [
    "neighbor_info = feature_spider.neighbor_info(pred_df)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "id": "24781b4f-54dc-4ea9-9945-08c6353d145b",
   "metadata": {},
   "source": [
    "pd.set_option(\"display.width\", 1000)\n",
    "pd.set_option(\"display.max_colwidth\", 50)\n",
    "pd.set_option(\"display.max_rows\", None)\n",
    "# idya_high_residuals = high_residuals[high_residuals[\"udm_asy_protocol\"] == \"IDYA\"]\n",
    "# idya_high_residuals.sort_values(by='residuals_abs', ascending=False)\n",
    "high_residuals"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "id": "c245d447-804d-409b-93ae-e15259e9ed4f",
   "metadata": {},
   "source": [
    "pd.set_option(\"display.width\", 1000)\n",
    "pd.set_option(\"display.max_colwidth\", 150)\n",
    "pd.set_option(\"display.max_rows\", None)\n",
    "id = \"IDC-5965-1\"\n",
    "index = high_residuals[high_residuals[\"udm_mol_bat_id\"]==id].index[0]\n",
    "print(index)\n",
    "print(high_residuals.loc[index])\n",
    "neighbor_info.iloc[index]"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "5c078cd7-5938-4ed1-8e38-831590c95c24",
   "metadata": {},
   "source": [
    "# So What?\n",
    "Blah, blah, who cares? The real model works fine...\n",
    "\n",
    "# So lets switch gears and look at the predictions on the real model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "id": "4dcce03e-f43e-437c-bba6-60d1c42cfa42",
   "metadata": {},
   "source": [
    "logging.getLogger(\"sageworks\").setLevel(logging.INFO)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "id": "985a36a5-8dbd-4f63-8b7c-261ac0bc1107",
   "metadata": {},
   "source": [
    "nightly_end = Endpoint(\"solubility-class-0-nightly\")\n",
    "nightly_pred = nightly_end.auto_inference()"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "id": "ced12800-9b81-48c4-8161-3768bfff78ae",
   "metadata": {},
   "source": [
    "# ReInitialize the feature spider with \"class\" as the target column\n",
    "feature_spider = FeatureSpider(full_df, features=features, target_column=\"class\", id_column=id_column)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 132,
   "id": "ac8deac3-6e5b-491e-bf0c-52d6295dcc21",
   "metadata": {},
   "source": [
    "nightly_pred[\"confidence\"] = feature_spider.confidence_scores(nightly_pred, nightly_pred[\"prediction\"])\n",
    "nightly_pred[\"confidence\"].hist()"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 133,
   "id": "c5c8fb45-10b7-44d6-8b2b-a4e855a76efb",
   "metadata": {},
   "source": [
    "show_columns = [\"udm_mol_bat_id\", \"udm_asy_protocol\", \"udm_asy_cnd_format\", \"udm_asy_res_value\", \"class\", \"prediction\"]\n",
    "\n",
    "# Filter rows where 'class' and 'prediction' differ by at least 2\n",
    "filtered_df = nightly_pred[abs(nightly_pred[\"class\"] - nightly_pred[\"prediction\"]) >= 2][show_columns]\n",
    "filtered_df"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 136,
   "id": "44c7315b-38ee-4f17-ad3b-93c96c42bff3",
   "metadata": {},
   "source": [
    "pd.set_option(\"display.max_colwidth\", 150)\n",
    "id = \"IDC-26481-1\"\n",
    "index = nightly_pred[nightly_pred[\"udm_mol_bat_id\"]==id].index[0]\n",
    "print(index)\n",
    "feature_spider.neighbor_info(nightly_pred).iloc[index]"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69989265-beed-4942-b616-7f5aa3505a20",
   "metadata": {},
   "source": [
    "high_residuals = pred_df[pred_df[\"residuals_abs\"] > 4.0][show_columns + [\"confidence\"]]\n",
    "high_residuals"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "22b464a3-7168-405a-a75b-089d1f1f99bd",
   "metadata": {},
   "source": [
    "# Feature Resolution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb13eec7-74d9-490d-abf4-45355c0d63c3",
   "metadata": {},
   "source": [
    "from sageworks.algorithms.dataframe.feature_resolution import FeatureResolution\n",
    "\n",
    "# Grab Model and Get the target and feature columns\n",
    "m = Model(\"aqsol-mol-regression-full\")\n",
    "target_column = m.target()\n",
    "feature_columns = m.features()\n",
    "\n",
    "# Create the class and run the report\n",
    "resolution = FeatureResolution(df, features=feature_columns, target_column=target_column, id_column=\"id\")\n",
    "resolve_df = resolution.compute(within_distance=0.05, min_target_difference=0.5)\n",
    "print(resolve_df)"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "71443207-7f4b-4b51-ba69-88db885236f4",
   "metadata": {},
   "source": [
    "# Hypothesis\n",
    "There should be some overlap of the predictions with high residuals and the features that had resolution issues"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9ea82652-ef21-4d9b-8299-b6c9b376191f",
   "metadata": {},
   "source": [
    "show_columns = [\"id\", \"solubility\", \"prediction\", \"residuals\", \"smiles\"]\n",
    "high_residuals = pred_df[pred_df[\"residuals_abs\"] > 2.0][show_columns]\n",
    "high_residuals"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24cc6a9c-790e-4d4f-a8f9-3d8c0b91de1d",
   "metadata": {},
   "source": [
    "query_ids = [\"B-1027\", \"A-3099\"]\n",
    "show_columns = [\"id\", \"solubility\", \"prediction\", \"residuals\"]\n",
    "pred_df[pred_df[\"id\"].isin(query_ids)][show_columns]"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d879929d-6ce9-42a2-9f20-7b6ccea52d03",
   "metadata": {},
   "source": [
    "high_error_ids = set(high_residuals[\"id\"])\n",
    "feature_issue_ids = set(list(resolve_df[\"id\"]) + list(resolve_df[\"n_id\"]))"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4d4b9f9-5319-46e0-9baa-82c1460a2733",
   "metadata": {},
   "source": [
    "# Output the overlap\n",
    "print(f\"High Model Error Ids: {len(high_error_ids)}\")\n",
    "print(f\"Feature Resolution Ids: {len(feature_issue_ids)}\")\n",
    "print(f\"Overlap: {len(high_error_ids.intersection(feature_issue_ids))}\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a7faef3-d538-49d7-9581-f42ab1c6fa2c",
   "metadata": {},
   "source": [
    "feature_issue_ids - high_error_ids"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3930b47-cfcd-4d68-87ec-f2719aaab542",
   "metadata": {},
   "source": [
    "pred_df[pred_df[\"id\"]=='F-465'][show_columns]"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d6c7a8c-a9bf-47c3-9e91-964cecc7e8df",
   "metadata": {},
   "source": [
    "feature_issue_id"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ab82eb6-ea1e-4a91-8fb5-5a2770d563cb",
   "metadata": {
    "scrolled": true
   },
   "source": [
    "# Lets construct another model\n",
    "import xgboost as xgb\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Grab all the data and do a Train/Test split\n",
    "full_df = fs.pull_dataframe()\n",
    "train_df, test_df = train_test_split(full_df, test_size=0.2, random_state=42)\n",
    "\n",
    "\n",
    "X = train_df[features]\n",
    "y = train_df[target]\n",
    "\n",
    "# Train the main XGBRegressor model\n",
    "model = xgb.XGBRegressor()\n",
    "model.fit(X, y)\n",
    "\n",
    "# Run Predictions on the the hold out\n",
    "test_df[\"prediction\"] = model.predict(test_df[features])\n",
    "plot_predictions(test_df)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e777a36-9660-4eae-aa8c-b3f1d0f26f9b",
   "metadata": {},
   "source": [
    "len(full_df.columns)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b997f21b-d304-4b24-b16d-aefc4ff47dcb",
   "metadata": {},
   "source": [
    "from sageworks.algorithms.dataframe.feature_spider import FeatureSpider\n",
    "\n",
    "# Create the FeatureSpider class and run the various methods\n",
    "f_spider = FeatureSpider(full_df, features, id_column=\"id\", target_column=target, neighbors=5)\n",
    "coincident = f_spider.coincident(target_diff=1)\n",
    "print(\"COINCIDENT\")\n",
    "print(coincident)\n",
    "high_gradients = f_spider.high_gradients(within_distance=0.25, target_diff=2)\n",
    "print(\"\\nHIGH GRADIENTS\")\n",
    "print(high_gradients)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "258d142a-140b-44e4-aac1-568fa8cf3ef6",
   "metadata": {},
   "source": [
    "len([6913, 644, 1799, 9739, 2834, 5014, 8729, 8741, 5292, 1713, 320, 7360, 5698, 9923, 1355, 4305, 5713, 4820, 6874, 7004, 8165, 3174, 364, 1781, 6140, 4735])"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9eb0a2ef-3e12-449f-b215-56a245831021",
   "metadata": {},
   "source": [
    "from rdkit import Chem\n",
    "from rdkit.Chem import Draw\n",
    "from rdkit.Chem.Draw.rdMolDraw2D import SetDarkMode"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "909a363a-3b52-4686-ad46-f2ff6fee9343",
   "metadata": {},
   "source": [],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "84297198-6210-4d42-8292-01388a531049",
   "metadata": {},
   "source": [
    "test_df = FeatureSet(\"solubility_features_test"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b41daac-55b8-4fb1-8e83-002b821b1b8a",
   "metadata": {},
   "source": [
    "def show(id):\n",
    "    smile = df[df[\"id\"]==id][\"smiles\"].values[0]\n",
    "    print(smile)\n",
    "    _features = df[df[\"id\"]==id][features].values[0]\n",
    "    # print(_features)\n",
    "    _target = df[df[\"id\"]==id][target].values[0]\n",
    "    print(_target)\n",
    "    return Chem.MolFromSmiles(smile)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58e7ad49-94ac-4b1e-9cb4-e54f074a8230",
   "metadata": {},
   "source": [
    "show(\"IDC-10845-1\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b07c750-2fd9-47aa-887e-8d116bde6446",
   "metadata": {},
   "source": [
    "show(\"B-3121\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62bced13-f643-4ce9-87ed-afd928cc2fb3",
   "metadata": {},
   "source": [
    "show(\"B-962\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31b0e240-32eb-47a1-9b08-0b67f3a0faba",
   "metadata": {},
   "source": [
    "show(\"C-846\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6d2e9fc8-c252-4608-bce6-43e1969f7c7a",
   "metadata": {},
   "source": [
    "show(\"A-2377\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d3e8588-5e93-4ece-a3eb-74a501ed1196",
   "metadata": {},
   "source": [
    "show(\"B-3122\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3566dac-7332-4c4d-b0bd-90361c0b8de6",
   "metadata": {},
   "source": [
    "from sageworks.algorithms.dataframe.target_gradients import TargetGradients\n",
    "gradients = TargetGradients()\n",
    "grad_df = gradients.compute(df, features, target, min_target_distance=1.0)\n",
    "grad_df.head()"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aeb72f66-1e9b-40f1-8bf4-59a899f926f8",
   "metadata": {},
   "source": [
    "grad_df[grad_df[\"target_gradient\"] == float(\"inf\")]"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdfe49b0-f707-4e3b-9aa7-36445cd12dcb",
   "metadata": {},
   "source": [
    "import numpy as np\n",
    "finite_gradients = grad_df[grad_df[\"target_gradient\"].apply(np.isfinite)]\n",
    "finite_gradients.plot.hist(y=\"target_gradient\", bins=20, alpha=0.5, logy=True)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be589a78-aa23-4b6e-b5dc-0452819e90d9",
   "metadata": {},
   "source": [
    "# Remove Observations with Coincident Features and large target differences\n",
    "print(f\"Before Removal: {len(full_df)}\")\n",
    "indexes_to_remove = list(set(coincident).union(set(high_gradients)))\n",
    "clean_df = full_df.drop(indexes_to_remove).copy().reset_index(drop=True)\n",
    "print(f\"After Removal: {len(clean_df)}\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3925c895-aaf0-4825-92eb-c0da2f938209",
   "metadata": {},
   "source": [
    "f_spider = FeatureSpider(clean_df, features, id_column=\"id\", target_column=target)\n",
    "preds = f_spider.predict(clean_df)\n",
    "print(preds)\n",
    "coincident = f_spider.coincident(target_diff=1)\n",
    "print(\"COINCIDENT\")\n",
    "print(coincident)\n",
    "high_gradients = f_spider.high_gradients(within_distance=0.25, target_diff=2)\n",
    "print(\"\\nHIGH GRADIENTS\")\n",
    "print(high_gradients)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f34cac32-0093-40df-874a-85d5f02eee0d",
   "metadata": {},
   "source": [
    "print(f\"Before Removal: {len(clean_df)}\")\n",
    "indexes_to_remove = list(set(coincident).union(set(high_gradients)))\n",
    "clean_df = clean_df.drop(indexes_to_remove).copy().reset_index(drop=True)\n",
    "print(f\"After Removal: {len(clean_df)}\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7cb29db1-b204-4bf3-82b5-90c9388b210e",
   "metadata": {},
   "source": [
    "train_df, test_df = train_test_split(clean_df, test_size=0.2, random_state=42)\n",
    "\n",
    "\n",
    "X = clean_df[features]\n",
    "y = clean_df[target]\n",
    "\n",
    "# Train the main XGBRegressor model\n",
    "model = xgb.XGBRegressor()\n",
    "model.fit(X, y)\n",
    "\n",
    "# Run Predictions on the the hold out\n",
    "test_df[\"prediction\"] = model.predict(test_df[features])\n",
    "plot_predictions(test_df)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43adcdcc-f2d8-45e1-868c-f613dbf46d7b",
   "metadata": {},
   "source": [
    "f_spider = FeatureSpider(clean_df, features, id_column=\"id\", target_column=target)\n",
    "preds = f_spider.predict(clean_df)\n",
    "print(preds)\n",
    "coincident = f_spider.coincident(target_diff=0.1)\n",
    "print(\"COINCIDENT\")\n",
    "print(coincident)\n",
    "high_gradients = f_spider.high_gradients(within_distance=0.25, target_diff=0.1)\n",
    "print(\"\\nHIGH GRADIENTS\")\n",
    "print(high_gradients)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f10d7a38-35db-4337-8240-9ac601f2382b",
   "metadata": {},
   "source": [
    "print(f\"Before Removal: {len(clean_df)}\")\n",
    "indexes_to_remove = list(set(coincident).union(set(high_gradients)))\n",
    "clean_df = clean_df.drop(indexes_to_remove).copy().reset_index(drop=True)\n",
    "print(f\"After Removal: {len(clean_df)}\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b269a32-9df7-428b-b55f-a7cf9edebf44",
   "metadata": {},
   "source": [
    "clean_df"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7f3f797-1f1b-4d0a-95f4-0779f4a17445",
   "metadata": {},
   "source": [
    "# Spin up a test model so get some info\n",
    "target = model.target()\n",
    "print(target)\n",
    "features = model.features()\n",
    "print(features)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "196b8d79-5224-49f2-b870-6ad800e739a3",
   "metadata": {},
   "source": [
    "import xgboost as xgb\n",
    "xgb_model = xgb.XGBRegressor()\n",
    "\n",
    "# Grab our Features, Target and Train the Model\n",
    "y = clean_df[target]\n",
    "X = clean_df[features]\n",
    "xgb_model.fit(X, y)\n",
    "predictions = xgb_model.predict(clean_df[features])\n",
    "clean_df[\"prediction\"] = predictions"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "07f73a71-0ef8-4ced-976f-68d3948f6a41",
   "metadata": {},
   "source": [
    "plot_predictions(clean_df)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4b4ff9b5-a347-45a6-9853-5ac3d2897b83",
   "metadata": {},
   "source": [
    "clean_train_df, clean_test_df = train_test_split(clean_df, test_size=0.2, random_state=42)\n",
    "# Grab our Features, Target and Train the Model\n",
    "y = clean_train_df[target]\n",
    "X = clean_train_df[features]\n",
    "xgb_model.fit(X, y)\n",
    "predictions = xgb_model.predict(clean_test_df[features])\n",
    "clean_test_df[\"prediction\"] = predictions\n",
    "plot_predictions(clean_test_df)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6d34b1f-a30b-4c33-ab1d-89de0f2b6236",
   "metadata": {},
   "source": [
    "list(df[df[\"id\"].isin([\"C-2581\", \"B-13\"])][\"smiles\"])"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "09491069-aec3-4f1e-a973-94151a69dbe9",
   "metadata": {},
   "source": [
    "Chem.MolFromSmiles(\"CO[C@H]1[C@@H](C[C@@H]2CN3CCc4c([nH]c5cc(OC)ccc45)[C@H]3C[C@@H]2[C@@H]1C(=O)OC)OC(=O)c6cc(OC)c(OC)c(OC)c6\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "56976470-512b-4e84-82a9-5696370b7eeb",
   "metadata": {},
   "source": [
    "Chem.MolFromSmiles(\"COC1C(CC2CN3CCC4=C([NH]C5=C4C=CC(=C5)OC)C3CC2C1C(=O)OC)OC(=O)C6=CC(=C(OC)C(=C6)OC)OC\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7dcba58b-18b3-44c1-9c85-ec4ab505d950",
   "metadata": {},
   "source": [
    "df[df[\"id\"].isin([\"C-2581\", \"B-13\"])][features]"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a397f8c-3819-4575-a6c7-9b7c57dd6a52",
   "metadata": {},
   "source": [
    "df[df[\"id\"].isin([\"C-2581\", \"B-13\"])][target]"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "197be8b9-026f-4c34-9031-8e4098cbaf9b",
   "metadata": {},
   "source": [
    "df[df[\"id\"]==\"B-13\"]"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "185a1caa-77c2-42c4-b53b-f71ab54a938c",
   "metadata": {},
   "source": [
    "df[df[\"id\"].isin([\"C-846\", \"B-962\"])]"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cef3e534-c90d-40dd-94d8-b6ed6b429935",
   "metadata": {},
   "source": [
    "show(\"C-846\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4af06b74-0125-4f22-91e9-be9be99c092c",
   "metadata": {},
   "source": [
    "show(\"B-962\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb742921-cfc1-4722-af84-a86ac0f86d5f",
   "metadata": {},
   "source": [
    "Chem.MolFromSmiles(\"CCC(=O)OC(CC1=CC=CC=C1)(C(C)CN(C)C)C2=CC=CC=C2\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdb44d45-57a6-4461-8a84-5d30ae6a9c9f",
   "metadata": {},
   "source": [
    "Chem.MolFromSmiles(\"CCC(=O)O[C@@](Cc1ccccc1)([C@H](C)CN(C)C)c2ccccc2\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96a71382-69c4-4804-9c4f-912acaeea4d9",
   "metadata": {},
   "source": [
    "def show(id):\n",
    "    smile = df[df[\"id\"]==id][\"smiles\"].values[0]\n",
    "    print(smile)\n",
    "    _features = df[df[\"id\"]==id][features].values[0]\n",
    "    print(_features)\n",
    "    _target = df[df[\"id\"]==id][target].values[0]\n",
    "    print(_target)\n",
    "    return Chem.MolFromSmiles(smile)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8949d09-672e-4cc9-a05e-f60e242cd122",
   "metadata": {},
   "source": [
    "close_ids = [\"E-1200\", \"A-3473\", \"B-1665\"]\n",
    "show(\"E-1200\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee463c10-d1e8-4954-aebf-7e14000676c0",
   "metadata": {},
   "source": [
    "show(\"A-3473\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7aedb2fa-924a-42ad-9970-f0d53b2dd250",
   "metadata": {},
   "source": [
    "show(\"B-1665\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a4ed1d10-8960-4003-99b9-c594bdb41500",
   "metadata": {},
   "source": [
    "from rdkit import Chem\n",
    "\n",
    "# Create an RDKit molecule from a SMILES string\n",
    "smiles = \"CCC(=O)OC(CC1=CC=CC=C1)(C(C)CN(C)C)C2=CC=CC=C2\"\n",
    "mol = Chem.MolFromSmiles(smiles)\n",
    "\n",
    "# Assign stereochemistry using RDKit\n",
    "Chem.AssignStereochemistry(mol, cleanIt=True, force=True)\n",
    "\n",
    "# Find chiral centers and their configurations\n",
    "chiral_centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)\n",
    "\n",
    "# Print the results\n",
    "for center in chiral_centers:\n",
    "    index, configuration = center\n",
    "    print(f\"Atom index: {index}, Configuration: {configuration}\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d804d78c-0800-433b-8c69-40a799f463dc",
   "metadata": {},
   "source": [
    "# Create an RDKit molecule from a SMILES string\n",
    "smiles = \"CCC(=O)O[C@@](Cc1ccccc1)([C@H](C)CN(C)C)c2ccccc2\"\n",
    "mol = Chem.MolFromSmiles(smiles)\n",
    "\n",
    "# Assign stereochemistry using RDKit\n",
    "Chem.AssignStereochemistry(mol, cleanIt=True, force=True)\n",
    "\n",
    "# Find chiral centers and their configurations\n",
    "chiral_centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)\n",
    "\n",
    "# Print the results\n",
    "for center in chiral_centers:\n",
    "    index, configuration = center\n",
    "    print(f\"Atom index: {index}, Configuration: {configuration}\")"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f4d7a935-fb36-45d6-b89c-aca9486d2412",
   "metadata": {},
   "source": [
    "from sageworks.algorithms.dataframe.row_tagger import RowTagger\n",
    "row_tagger = RowTagger(\n",
    "    df,\n",
    "    features=features,\n",
    "    id_column=\"id\",\n",
    "    target_column=target,\n",
    "    min_dist=0.0,\n",
    "    min_target_diff=1.0,\n",
    ")\n",
    "data_df = row_tagger.tag_rows()\n",
    "print(data_df)"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "64304dd8-2e51-4cc7-b681-7685abb0663e",
   "metadata": {},
   "source": [
    "data_df[\"tags\"].value_counts()"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76e8faa0-b131-430c-8594-df3e937224d9",
   "metadata": {},
   "source": [
    "data_df[data_df[\"tags\"]=={\"htg\", \"coincident\"}]"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "f31162c1",
   "metadata": {},
   "source": [
    "# Helper Methods"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "90d26e96",
   "metadata": {},
   "source": [
    "# Helper to look at predictions vs target\n",
    "from math import sqrt\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "plt.rcParams['font.size'] = 12.0\n",
    "plt.rcParams['figure.figsize'] = 14.0, 5.0\n",
    "sns.set_theme(style='darkgrid')\n",
    "def plot_predictions(df, line=True, color=\"PredError\"):\n",
    "    \n",
    "    # Dataframe of the targets and predictions\n",
    "    target = 'Actual Solubility'\n",
    "    pred = 'Predicted Solubility'\n",
    "    df_plot = pd.DataFrame({target: df['log_s'], pred: df['prediction']})\n",
    "    \n",
    "    # Compute Error per prediction\n",
    "    if color == \"PredError\":\n",
    "        df_plot[\"PredError\"] = df_plot.apply(lambda x: abs(x[pred] - x[target]), axis=1)\n",
    "    else:\n",
    "        df_plot[color] = df[color]\n",
    "\n",
    "    #df_plot['error'] = df_plot.apply(lambda x: abs(x[pred] - x[target]), axis=1)\n",
    "    ax = df_plot.plot.scatter(x=target, y=pred, c=color, cmap='coolwarm', sharex=False)\n",
    "    \n",
    "    # Just a diagonal line\n",
    "    if line:\n",
    "        ax.axline((1, 1), slope=1, linewidth=2, c='black')\n",
    "        x_pad = (df_plot[target].max() - df_plot[target].min())/10.0 \n",
    "        y_pad = (df_plot[pred].max() - df_plot[pred].min())/10.0\n",
    "        plt.xlim(df_plot[target].min()-x_pad, df_plot[target].max()+x_pad)\n",
    "        plt.ylim(df_plot[pred].min()-y_pad, df_plot[pred].max()+y_pad)\n",
    "\n",
    "    "
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dfab1df1-ea45-43dd-a76b-33354f572d8e",
   "metadata": {},
   "source": [],
   "outputs": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}