{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5aa74260",
   "metadata": {},
   "source": [
    "# Building an AWS<sup>®</sup> ML Pipeline with Workbench\n",
    "\n",
    "<div style=\"padding: 20px\">\n",
    "<img width=\"1000\" alt=\"workbench_pipeline\" src=\"https://github.com/SuperCowPowers/workbench/assets/4806709/47cc5739-971c-48c3-9ef6-fd8370e3ec57\"></div>\n",
    "\n",
    "This notebook uses the Workbench Science Workbench to quickly build an AWS® Machine Learning Pipeline with the AQSolDB public dataset. This dataset aggregates aqueous solubility data for a large set of compounds.\n",
    "\n",
    "We're going to set up a full AWS Machine Learning Pipeline from start to finish. Since the Workbench Classes encapsulate, organize, and manage sets of AWS® Services, setting up our ML pipeline will be straight forward.\n",
    "\n",
    "Workbench also provides visibility into AWS services for every step of the process so we know exactly what we've got and how to use it.\n",
    "<br><br>\n",
    "\n",
    "## Data\n",
    "AqSolDB: A curated reference set of aqueous solubility, created by the Autonomous Energy Materials Discovery [AMD] research group, consists of aqueous solubility values of 9,982 unique compounds curated from 9 different publicly available aqueous solubility datasets. AqSolDB also contains some relevant topological and physico-chemical 2D descriptors. Additionally, AqSolDB contains validated molecular representations of each of the compounds. This openly accessible dataset, which is the largest of its kind, and will not only serve as a useful reference source of measured and calculated solubility data, but also as a much improved and generalizable training data source for building data-driven models. (2019-04-10)\n",
    "\n",
    "Main Reference:\n",
    "https://www.nature.com/articles/s41597-019-0151-1\n",
    "\n",
    "Data Dowloaded from the Harvard DataVerse:\n",
    "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OVHAW8\n",
    "\n",
    "\n",
    "\n",
    "## Workbench\n",
    "Workbench is a medium granularity framework that manages and aggregates AWS® Services into classes and concepts. When you use Workbench you think about DataSources, FeatureSets, Models, and Endpoints. Underneath the hood those classes handle all the details around updating and\n",
    "\n",
    "## Notebook\n",
    "This notebook uses the Workbench Science Workbench to quickly build an AWS® Machine Learning Pipeline.\n",
    "\n",
    "We're going to set up a full AWS Machine Learning Pipeline from start to finish. Since the Workbench Classes encapsulate, organize, and manage sets of AWS® Services, setting up our ML pipeline will be straight forward.\n",
    "\n",
    "Workbench also provides visibility into AWS services for every step of the process so we know exactly what we've got and how to use it.\n",
    "<br><br>\n",
    "\n",
    "® Amazon Web Services, AWS, the Powered by AWS logo, are trademarks of Amazon.com, Inc. or its affiliates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a7ae1c21",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Okay first we get our data into Workbench as a DataSource\n",
    "from workbench.api.data_source import DataSource"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97243583",
   "metadata": {},
   "outputs": [],
   "source": [
    "s3_path = 's3://workbench-public-data/comp_chem/aqsol_public_data.csv'\n",
    "data_source = DataSource(s3_path, 'aqsol_data')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31affdf1",
   "metadata": {},
   "source": [
    "<div style=\"float: right; padding: 20px\"><img src=\"images/aws_dashboard_aqsol.png\" width=600px\"></div>\n",
    "\n",
    "# So what just happened?\n",
    "Okay, so it was just a few lines of code but Workbench did the following for you:\n",
    "   \n",
    "- Transformed the CSV to a **Parquet** formatted dataset and stored it in AWS S3\n",
    "- Created an AWS Data Catalog database/table with the columns names/types\n",
    "- Athena Queries can now be done directly on this data in AWS Athena Console\n",
    "\n",
    "The new 'DataSource' will show up in AWS and of course the Workbench AWS Dashboard. Anyone can see the data, get information on it, use AWS® Athena to query it, and of course use it as part of their analysis pipelines."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b781d74",
   "metadata": {},
   "source": [
    "<div style=\"float: right; padding: 20px\"><img src=\"images/athena_query_aqsol.png\" width=600px\"></div>\n",
    "\n",
    "# Visibility and Easy to Use AWS Athena Queries\n",
    "Since Workbench manages a broad range of AWS Services it means that you get visibility into exactly what data you have in AWS. It also means nice perks like hitting the 'Query' link in the Dashboard Web Interface and getting a direct Athena console on your dataset. With AWS Athena you can use typical SQL statements to inspect and investigate your data.\n",
    "    \n",
    "**But that's not all!**\n",
    "    \n",
    "Workbench also provides API to directly query DataSources and FeatureSets right from the API, so lets do that now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "174e06f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_source.query('SELECT * from aqsol_data limit 5')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fe38834",
   "metadata": {},
   "source": [
    "# The AWS ML Pipeline Awaits\n",
    "Okay, so in a few lines of code we created a 'DataSource' (which is simply a set of orchestrated AWS Services) but now we'll go through the construction of the rest of our Machine Learning pipeline.\n",
    "\n",
    "<div style=\"padding: 20px\">\n",
    "<img width=\"900\" alt=\"workbench_pipeline\" src=\"https://github.com/SuperCowPowers/workbench/assets/4806709/47cc5739-971c-48c3-9ef6-fd8370e3ec57\"></div>\n",
    "\n",
    "## ML Pipeline\n",
    "- DataSource **(done)**\n",
    "- FeatureSet\n",
    "- Model\n",
    "- Endpoint (serves models)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4292590a",
   "metadata": {},
   "source": [
    "# Create a FeatureSet\n",
    "**Note:** Normally this is where you'd do a deep dive on the data/features, look at data quality metrics, redudant features and engineer new features. For the purposes of this notebook we're simply going to take the features given to us in the AQSolDB data from the Harvard Dataverse, those features are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "37674152",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_source.column_details()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c3bf3b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_source.to_features(\"aqsol_features\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09b88130",
   "metadata": {},
   "source": [
    "# New FeatureSet shows up in Dashboard\n",
    "Now we see our new feature set automatically pop up in our dashboard. FeatureSet creation involves the most complex set of AWS Services:\n",
    "- New Entry in AWS Feature Store\n",
    "- Specific Type and Field Requirements are handled\n",
    "- Plus all the AWS Services associated with DataSources (see above)\n",
    "\n",
    "The new 'FeatureSet' will show up in AWS and of course the Workbench AWS Dashboard. Anyone can see the feature set, get information on it, use AWS® Athena to query it, and of course use it as part of their analysis pipelines.\n",
    "\n",
    "<div style=\"padding: 20px\"><img src=\"images/dashboard_aqsol_features.png\" width=1000px\"></div>\n",
    "    \n",
    "**Important:** All inputs are stored to track provenance on your data as it goes through the pipeline. We can see the last field in the FeatureSet shows the input DataSource."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3943e7c0",
   "metadata": {},
   "source": [
    "# Publishing our Model\n",
    "**Note:** Normally this is where you'd do a deep dive on the feature set. For the purposes of this notebook we're simply going to take the features given to us and make a reference model that can track our baseline model performance for other to improve upon. :)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "010006a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from workbench.api.feature_set import FeatureSet\n",
    "from workbench.api.model import Model, ModelType\n",
    "\n",
    "# Compute our features\n",
    "feature_set = FeatureSet(\"aqsol_features\")\n",
    "feature_list = [\n",
    "    \"sd\",\n",
    "    \"ocurrences\",\n",
    "    \"molwt\",\n",
    "    \"mollogp\",\n",
    "    \"molmr\",\n",
    "    \"heavyatomcount\",\n",
    "    \"numhacceptors\",\n",
    "    \"numhdonors\",\n",
    "    \"numheteroatoms\",\n",
    "    \"numrotatablebonds\",\n",
    "    \"numvalenceelectrons\",\n",
    "    \"numaromaticrings\",\n",
    "    \"numsaturatedrings\",\n",
    "    \"numaliphaticrings\",\n",
    "    \"ringcount\",\n",
    "    \"tpsa\",\n",
    "    \"labuteasa\",\n",
    "    \"balabanj\",\n",
    "    \"bertzct\",\n",
    "]\n",
    "feature_set.to_model(\n",
    "    name=\"aqsol-regression\",\n",
    "    model_type=ModelType.REGRESSOR,\n",
    "    target_column=\"solubility\",\n",
    "    feature_list=feature_list,\n",
    "    description=\"AQSol Regression Model\",\n",
    "    tags=[\"aqsol\", \"regression\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e91676ba",
   "metadata": {},
   "source": [
    "<div style=\"float: right; padding: 20px\"><img src=\"images/model_screenshot.png\" width=500px\"></div>\n",
    "\n",
    "# Model is trained and published\n",
    "Okay we've clipped the output above to focus on the important bits. The Workbench model harness provides some simple model performance output\n",
    "\n",
    "- FIT/TRAIN: (8056, 35)\n",
    "- VALIDATiON: (1926, 35)\n",
    "- RMSE: 1.175\n",
    "- MAE: 0.784\n",
    "- R2 Score: 0.760\n",
    "\n",
    "The Workbench Dashboard also has a really spiffy model details page that gives a deeper dive on the feature importance and model performance metrics.\n",
    "\n",
    "**Note:** Model details is still WIP/Alpha version that we're working on :)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "981c9381",
   "metadata": {},
   "source": [
    "# Deploying an AWS Endpoint\n",
    "Okay now that are model has been published we can deploy an AWS Endpoint to serve inference requests for that model. Deploying an Endpoint allows a large set of servies/APIs to use our model in production."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a362f172",
   "metadata": {},
   "outputs": [],
   "source": [
    "m = Model(\"aqsol-regression\")\n",
    "m.to_endpoint(name=\"aqsol-regression-end\", tags=[\"aqsol\", \"regression\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04024783",
   "metadata": {},
   "source": [
    "# Model Inference from the Endpoint\n",
    "AWS Endpoints will bundle up a model as a service that responds to HTTP requests. The typical way to use an endpoint is to send a POST request with your features in CSV format. Workbench provides a nice DataFrame based interface that takes care of many details for you."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "289d3380",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the Endpoint\n",
    "from workbench.api.endpoint import Endpoint\n",
    "my_endpoint = Endpoint('aqsol-regression-end')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a1cdebe",
   "metadata": {},
   "source": [
    "# Model Provenance is locked into Workbench\n",
    "We can now look at the model, see what FeatureSet was used to train it and even better see exactly which ROWS in that training set where used to create the model. We can make a query that returns the ROWS that were not used for training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "a12b00ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get a DataFrame of data (not used to train) and run predictions\n",
    "table = feature_set.view(\"training\").table\n",
    "test_df = feature_set.query(f\"SELECT * FROM {table} where training = FALSE\")\n",
    "test_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "ed6c088a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Okay now use the Workbench Endpoint to make prediction on TEST data\n",
    "prediction_df = my_endpoint.predict(test_df)\n",
    "metrics = my_endpoint.regression_metrics('solubility', prediction_df)\n",
    "print(metrics)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "56792f4a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Lets look at the predictions versus actual values\n",
    "prediction_df[['id', 'solubility', 'prediction']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "a201c37e",
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_predictions(prediction_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2a20529",
   "metadata": {},
   "source": [
    "# Follow Up on Predictions\n",
    "Looking at the prediction plot above we can see that many predictions were close to the actual value but about 10 of the predictions were WAY off. So at this point we'd use Workbench to investigate those predictions, map them back to our FeatureSet and DataSource and see if there were irregularities in the training data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2358b668",
   "metadata": {},
   "source": [
    "# Wrap up: Building an AWS<sup>®</sup> ML Pipeline with Workbench\n",
    "\n",
    "<div style=\"float: right; padding: 20px\"><img width=\"450\" src=\"https://user-images.githubusercontent.com/4806709/266844238-df2f1b90-9e6f-4dbb-9490-ad75545e630f.png\"></div>\n",
    "\n",
    "\n",
    "\n",
    "This notebook used the Workbench Science Toolkit to quickly build an AWS® Machine Learning Pipeline with the AQSolDB public dataset. We built a full AWS Machine Learning Pipeline from start to finish.\n",
    "\n",
    "Workbench made it easy:\n",
    "- Visibility into AWS services for every step of the process.\n",
    "- Managed the complexity of organizing the data and populating the AWS services.\n",
    "- Provided an easy to use API to perform Transformations and inspect Artifacts.\n",
    "\n",
    "Using Workbench will minimizize the time and manpower needed to incorporate AWS ML into your organization. If your company would like to be a Workbench Alpha Tester, contact us at [workbench@supercowpowers.com](mailto:workbench@supercowpowers.com)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a5ac2c7",
   "metadata": {},
   "source": [
    "<br><br><br><br>\n",
    "<br><br><br><br>\n",
    "<br><br><br><br>\n",
    "<br><br><br><br>\n",
    "<br><br><br><br>\n",
    "<br><br><br><br>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f31162c1",
   "metadata": {},
   "source": [
    "# Helper Methods"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "90d26e96",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Helper to look at predictions vs target\n",
    "from math import sqrt\n",
    "import pandas as pd\n",
    "def plot_predictions(df, line=True):\n",
    "    \n",
    "    # Dataframe of the targets and predictions\n",
    "    target = 'Actual Solubility'\n",
    "    pred = 'Predicted Solubility'\n",
    "    df_plot = pd.DataFrame({target: df['solubility'], pred: df['prediction']})\n",
    "    \n",
    "    # Compute Error per prediction\n",
    "    df_plot['RMSError'] = df_plot.apply(lambda x: sqrt((x[pred] - x[target])**2), axis=1)\n",
    "    #df_plot['error'] = df_plot.apply(lambda x: abs(x[pred] - x[target]), axis=1)\n",
    "    ax = df_plot.plot.scatter(x=target, y=pred, c='RMSError', cmap='coolwarm', sharex=False)\n",
    "    \n",
    "    # Just a diagonal line\n",
    "    if line:\n",
    "        ax.axline((1, 1), slope=1, linewidth=2, c='black')\n",
    "        x_pad = (df_plot[target].max() - df_plot[target].min())/10.0 \n",
    "        y_pad = (df_plot[pred].max() - df_plot[pred].min())/10.0\n",
    "        plt.xlim(df_plot[target].min()-x_pad, df_plot[target].max()+x_pad)\n",
    "        plt.ylim(df_plot[pred].min()-y_pad, df_plot[pred].max()+y_pad)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "c09b6c26",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plotting defaults\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "#plt.style.use('seaborn-deep')\n",
    "#plt.style.use('seaborn-dark')\n",
    "plt.rcParams['font.size'] = 12.0\n",
    "plt.rcParams['figure.figsize'] = 14.0, 7.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dfab1df1-ea45-43dd-a76b-33354f572d8e",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}