{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction: Active Learning Cycle\n", "\n", "This notebook encompasses the comprehensive workflow of the active learning cycle, designed to iteratively improve the predictive accuracy of our model by intelligently selecting new data points for training. The cycle consists of the following steps:\n", "\n", "1. Load the Cleaned Data: Import the pre-processed and cleaned data set ready for integration and analysis.\n", "\n", "2. Integrate All Obtained Data: Combine data from various sources into a unified dataset for model training.\n", "\n", "3. Save the Integrated Data: Persist the integrated dataset to disk for reproducibility and future reference.\n", "\n", "4. Train a Predictive Model: Utilize the integrated data to train a machine learning model capable of predicting target variables with high accuracy.\n", "\n", "5. Save the Predictive Model: Save the trained model to ensure that it can be reused without retraining.\n", "\n", "6. Create a Grid in the Feature Space: Generate a comprehensive grid that spans the entire feature space under consideration.\n", "\n", "7. Extract a Subset of the Grid that is Experimentally Feasible: Identify and isolate parts of the grid that are viable for experimental verification based on predefined criteria.\n", "\n", "8. Predict the Target Value in the Subset: Use the trained model to predict target values for the experimentally feasible subset.\n", "\n", "9. Save the Predicted Properties of the Subset: Store the predictions to guide future experimental endeavors.\n", "\n", "10. Acquire Data Based on Predicted Properties: Select new data points for acquisition based on the model's predictions and a tiered acquisition strategy, aiming to maximize the model's learning.\n", "\n", "\n", "11. 
Save the Acquired Data: Document the newly acquired data points to refine the model in subsequent learning cycles.\n", "\n", "By following this structured approach, we aim to efficiently navigate the feature space, prioritize experimental efforts, and iteratively refine our model's predictive capability. This cycle is pivotal in harnessing the potential of active learning to address complex problems with an evolving data-driven strategy." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Libraries\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib widget\n", "\n", "import numpy as np \n", "import pandas as pd\n", "from datetime import datetime\n", "import scipy \n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import statsmodels.api as sm\n", "import random\n", "import sklearn\n", "from sklearn.manifold import TSNE\n", "from sklearn.metrics import f1_score, mean_squared_error, r2_score, make_scorer, mean_absolute_percentage_error, mean_absolute_error\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn import svm, neighbors\n", "from sklearn.svm import SVR\n", "from sklearn.model_selection import ShuffleSplit\n", "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn import preprocessing\n", "from sklearn.gaussian_process import GaussianProcessRegressor\n", "from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel, RBF, Matern, RationalQuadratic, ConstantKernel\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.neighbors import KNeighborsRegressor\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.metrics import mean_absolute_error\n", "import pandas as pd\n", "import plotly.express as px\n", "\n", "from tqdm import tqdm\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": 
"code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import os \n", "import inspect\n", "import sys\n", "\n", "script_path = inspect.getframeinfo(inspect.currentframe()).filename\n", "script_dir = os.path.dirname(os.path.abspath(script_path))\n", "sys.path.append(script_dir)\n", "\n", "import lithium_brine_ml as lbml" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Functions" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def load_from_dict(data_dict_address):\n", " '''This function loads dataframes from excel files and returns a dictionary of dataframes.\n", " \n", " Parameters:\n", " ----------\n", " data_dict_address: dictionary of addresses of excel files\n", " dictionary\n", " \n", " Returns:\n", " -------\n", " data_dict: dictionary of dataframes\n", " dictionary'''\n", " \n", " data_dict = {}\n", " \n", " # load dataframes from excel files:\n", " \n", " for key in data_dict_address.keys():\n", " \n", " data_dict[key] = pd.read_excel(data_dict_address[key])\n", " \n", " return data_dict\n", "\n", "\n", "def dataframe_format_correction(data_dict, label1, label2):\n", " '''\n", " This function changes the column names/labels to the same format for all dataframes.\n", " \n", " Parameters:\n", " ----------\n", " data_dict: dictionary of dataframes\n", " dictionary of pd.DataFrame\n", " \n", " label1: list of old column names\n", " list\n", " \n", " label2: list of new column names\n", " list\n", " \n", " Returns:\n", " -------\n", " data_corrected_labels: dictionary of dataframes with corrected labels\n", " dictionary of pd.DataFrame'''\n", " \n", " data_corrected_labels = {}\n", " \n", " # for each dataframe in the dictionary:\n", " # change the column names/labels to the same format for all dataframes\n", " # label1 list of old column names and label2 list of new column names\n", " \n", " for key in data_dict.keys():\n", " \n", " data_corrected_labels[key] = 
data_dict[key].rename(columns=dict(zip(label1, label2)))\n", "    \n", "    return data_corrected_labels\n", "\n", "\n", "def basic_clean(data_dict, eps=0.01):\n", "    '''\n", "    This function cleans the dataframes by dropping rows that contain NaN values and replacing 0 values of 'yield' with eps.\n", "    \n", "    Parameters:\n", "    ----------\n", "    data_dict: dictionary of dataframes\n", "        dictionary of pd.DataFrame\n", "    \n", "    eps: small number to replace 0 values\n", "        float\n", "    \n", "    Returns:\n", "    -------\n", "    data_cleaned: dictionary of cleaned dataframes\n", "        dictionary of pd.DataFrame\n", "    '''\n", "    data_cleaned = {}\n", "    \n", "    for key in data_dict.keys():\n", "        # copy after dropna so the assignment below does not write into a view of the input\n", "        data_cleaned[key] = data_dict[key].dropna().copy()\n", "        data_cleaned[key]['yield'] = data_cleaned[key]['yield'].apply(lambda x: x+eps if x==0 else x)\n", "    \n", "    return data_cleaned\n", "\n", "def data_integration(data_dict):\n", "    '''\n", "    This function integrates the dataframes in the dictionary into one dataframe.\n", "    \n", "    Parameters:\n", "    ----------\n", "    data_dict: dictionary of dataframes\n", "        dictionary of pd.DataFrame\n", "    \n", "    Returns:\n", "    -------\n", "    data_integrated: integrated dataframe\n", "        pd.DataFrame\n", "    '''\n", "    \n", "    data_integrated = pd.DataFrame()\n", "    \n", "    for key in data_dict.keys():\n", "        data_integrated = pd.concat([data_integrated, data_dict[key]])\n", "    \n", "    data_integrated = data_integrated.reset_index(drop=True)\n", "    \n", "    return data_integrated\n", "\n", "\n", "def fact_check(combination):\n", "    '''\n", "    Checks whether the nominated data (suggested experiment) satisfies the acquisition guidelines.\n", "    \n", "    Parameters:\n", "    ----------\n", "    combination: an array containing the experiment combinations. 
\n", "        The combination of data must follow the order below: \n", "        [{C} , {N}, {Li}, {T}]\n", "    Returns: \n", "    ----------\n", "    check: True if all the criteria are satisfied, False if not \n", "    \n", "    '''\n", "    #print(combination)\n", "    \n", "    c = combination[0]\n", "    n = combination[1]\n", "    l = combination[2]\n", "    \n", "    c_total_limit = 0.5 * n\n", "    \n", "    \n", "    if n > 6:\n", "        return False\n", "    \n", "    if c > c_total_limit:\n", "        return False\n", "    \n", "    if c > 2.5: \n", "        return False\n", "    \n", "    if l > 6 or l <= 0:\n", "        return False\n", "    \n", "    if c < (l/2):\n", "        return False\n", "    \n", "    #print('n: ', n, ' || n_total: ',n_total, ' || c: ', c, \" || c_total_limit: \", c_total_limit, \" || Li: \", l)\n", "    return True \n", "\n", "def df_scaler(fit_df, transform_df, labels_list = ['init_C', 'init_N', 'init_Li', 'T']):\n", "    '''\n", "    Receives two dataframes and scales the second dataframe using a scaler fitted on the first dataframe\n", "    Parameters:\n", "    ----------\n", "    fit_df: the dataframe to be used for fitting the scaler\n", "        pd.DataFrame\n", "    \n", "    transform_df: the dataframe to be scaled\n", "        pd.DataFrame\n", "    \n", "    labels_list: the list of labels to be used for scaling\n", "        list\n", "    \n", "    Returns:\n", "    ----------\n", "    scaled_df: the scaled data\n", "        np.ndarray\n", "    \n", "    '''\n", "    \n", "    # using the batch containing all the data scale the experimental grid\n", "    scaler = preprocessing.StandardScaler().fit(fit_df[labels_list])\n", "    scaled_df = scaler.transform(transform_df[labels_list])\n", "    \n", "    return scaled_df\n", "\n", "def entropy_measure(std_list):\n", "    '''\n", "    Calculates the differential entropy of a Gaussian predictive distribution from the standard deviation of the model prediction\n", "    \n", "    Parameters:\n", "    ------------\n", "    std_list: the list of the standard deviation of the model prediction\n", "        list\n", "    \n", "    Returns:\n", "    ------------\n", "    entropy: the entropy corresponding to each standard deviation\n", "        float or np.ndarray\n", "    '''\n", "    \n", "    std_list = 
np.array(std_list)\n", " # entropy = ln(std*root(2*pi*e))\n", " entropy = np.log(std_list * np.sqrt(2*np.pi*np.e))\n", " \n", " return entropy\n", "\n", "\n", "def binary_model_plot(model, scaler, range_a, range_b, val_c,\n", " label_a, label_b , label_c , T = 66,\n", " entropy_contour=False, std_contour=True, yield_contour=True):\n", " '''\n", " The function is used to plot the binary contour plot of the model prediction\n", "\n", " Parameters:\n", " ------------\n", " model: the model to predict the yield\n", " SKlearn model object\n", " \n", " scaler: the scaler used to scale the data\n", " SKlearn scaler object\n", " \n", " range_a: the range of the first parameter\n", " tuple\n", " \n", " range_b: the range of the second parameter\n", " tuple\n", " \n", " val_c: the value of the third parameter\n", " float\n", " \n", " label_a: the label of the first parameter\n", " string\n", " \n", " label_b: the label of the second parameter\n", " string\n", " \n", " label_c: the label of the third parameter\n", " string\n", " \n", " T: the temperature of the experiment\n", " float\n", " \n", " entropy_contour: if True, the entropy contour will be plotted\n", " bool\n", " \n", " std_contour: if True, the standard deviation contour will be plotted\n", " bool\n", " \n", " yield_contour: if True, the yield contour will be plotted\n", " bool\n", " \n", " Returns:\n", " ------------\n", " none\n", " \n", " '''\n", " \n", " a_list = np.linspace(range_a[0], range_a[1], 50)\n", " b_list = np.linspace(range_b[0], range_b[1], 50)\n", "\n", " a_mesh, b_mesh = np.meshgrid(a_list, b_list)\n", " yield_mesh = a_mesh.copy()\n", " std_mesh = a_mesh.copy()\n", " entropy_mesh = a_mesh.copy()\n", " \n", " for i in tqdm(range(len(a_list))):\n", " for j in range(len(b_list)):\n", " x_df = pd.DataFrame({label_a: a_mesh[i,j],\n", " label_b : b_mesh[i,j],\n", " label_c : [val_c],\n", " 'T':[T]})\n", " # it is crucial to rearrange the columns in the same order as the training data!!!\n", " x_df = 
x_df[['init_C', 'init_N', 'init_Li', 'T']]\n", "            \n", "            scaled_grid = scaler.transform(x_df)\n", "            yield_mesh[i,j], std_mesh[i,j] = model.predict(scaled_grid, return_std=True)\n", "            if entropy_contour:\n", "                entropy_mesh[i,j] = entropy_measure(std_mesh[i,j])\n", "    \n", "    plt.figure()\n", "    \n", "    plt.contourf(a_mesh, b_mesh, yield_mesh, 100, cmap = 'viridis')\n", "    cbar = plt.colorbar()\n", "    \n", "    if std_contour:\n", "        contour1 = plt.contour(a_mesh, b_mesh, std_mesh, colors = 'red')\n", "        plt.clabel(contour1, inline=True, fontsize=8)\n", "    \n", "    if yield_contour:\n", "        contour2 = plt.contour(a_mesh, b_mesh, yield_mesh, colors = 'black')\n", "        plt.clabel(contour2, inline=True, fontsize=8)\n", "    \n", "    if entropy_contour:\n", "        contour3 = plt.contour(a_mesh, b_mesh, entropy_mesh, colors = 'blue')\n", "        plt.clabel(contour3, inline=True, fontsize=8)\n", "    \n", "    plt.xlabel(label_a)\n", "    plt.ylabel(label_b)\n", "    cbar.ax.set_ylabel('Yield')\n", "    plt.title(label_c + \": \" + str(val_c))\n", "    # adjust the layout before showing the figure so the adjustment takes effect\n", "    plt.tight_layout()\n", "    plt.show()\n", "    \n", "    return\n", "\n", "\n", "def model_scaler_setup(model, train_df, labels_list = ['init_C', 'init_N', 'init_Li', 'T'], target = 'yield'):\n", "    '''\n", "    This function takes in a model, training data, and the labels and target of the training data and returns the trained model and scaler.\n", "    \n", "    Parameters:\n", "    ----------\n", "    model: sklearn model\n", "    \n", "    train_df: pandas dataframe\n", "    \n", "    labels_list: list of strings\n", "    \n", "    target: string\n", "    \n", "    Returns:\n", "    -------\n", "    model: fitted sklearn model \n", "    \n", "    scaler: fitted sklearn scaler\n", "    \n", "    '''\n", "    # Define the scaler \n", "    scaler = preprocessing.StandardScaler().fit(train_df[labels_list])\n", "    scaled_batch = scaler.transform(train_df[labels_list])\n", "    \n", "    # train the model on scaled data\n", "    model.fit(scaled_batch, train_df[target])\n", "    \n", "    return model, scaler\n", "\n", "# Create a grid for the 
prediction space:\n", "\n", "def create_exp_grid(label_a = \"init_C\", label_b = \"init_N\", label_c = \"init_Li\", label_d= \"T\",\n", "                    range_a = (0,2.5), range_b = (0,6), range_c = (0,6), range_d = (66,66)\n", "                    ):\n", "    '''\n", "    This function creates a grid of experimental conditions for the model to predict.\n", "    \n", "    Parameters:\n", "    ----------\n", "    label_a: string\n", "    \n", "    label_b: string\n", "    \n", "    label_c: string\n", "    \n", "    label_d: string\n", "    \n", "    range_a: tuple\n", "    \n", "    range_b: tuple\n", "    \n", "    range_c: tuple\n", "    \n", "    range_d: tuple\n", "    \n", "    Returns:\n", "    -------\n", "    exp_grid: pandas dataframe\n", "    \n", "    '''\n", "    \n", "    exp_grid = pd.DataFrame({label_a:[], label_b : [], label_c : [], label_d: []})\n", "    \n", "    for a in tqdm(np.linspace(range_a[0], range_a[1], 16)):\n", "        for b in np.linspace(range_b[0],range_b[1],37):\n", "            for c in np.linspace(range_c[0],range_c[1],37):\n", "                for d in np.linspace(range_d[0],range_d[1],1):\n", "                    x_df = pd.DataFrame({label_a:[a],\n", "                                         label_b : [b],\n", "                                         label_c : [c],\n", "                                         label_d: [d]})\n", "                    # DataFrame.append was removed in pandas 2.0; pd.concat with ignore_index=True\n", "                    # also gives the grid a unique 0..n-1 index, which grid_feasibility relies on\n", "                    exp_grid = pd.concat([exp_grid, x_df], ignore_index=True)\n", "    \n", "    return exp_grid\n", "    \n", "    \n", "def grid_feasibility(data_df, label_a = \"init_C\", label_b = \"init_N\", label_c = \"init_Li\", label_d= \"T\"):\n", "    '''This function takes in a dataframe of experimental conditions and returns it with an added boolean 'fact_check' column indicating whether each condition is feasible.\n", "    Parameters:\n", "    ----------\n", "    data_df: pandas dataframe\n", "    \n", "    label_a: string\n", "    \n", "    label_b: string\n", "    \n", "    label_c: string\n", "    \n", "    label_d: string\n", "    \n", "    Returns:\n", "    -------\n", "    data_df: pandas dataframe''' \n", "    fact_check_list = []\n", "    \n", "    for i in tqdm(range(len(data_df))):\n", "        fact_check_list.append(fact_check([data_df[label_a][i],\n", "                                           data_df[label_b][i],\n", "                                           data_df[label_c][i],\n", "                                           data_df[label_d][i]]))\n", "    \n", "    data_df['fact_check'] = fact_check_list\n", "    \n", "    return 
data_df\n", "\n", "\n", "def scale_dataframe(exp_grid, train_df, labels_list = ['init_C', 'init_N', 'init_Li', 'T']):\n", "    '''This function takes in a dataframe of experimental conditions and returns it with added scaled columns.\n", "    Parameters:\n", "    ----------\n", "    exp_grid: pandas dataframe\n", "    \n", "    train_df: pandas dataframe\n", "    \n", "    labels_list: list of strings\n", "    \n", "    Returns:\n", "    -------\n", "    exp_grid: pandas dataframe with scaled columns'''\n", "    \n", "    # Fit the scaler on the training data\n", "    scaler = preprocessing.StandardScaler().fit(train_df[labels_list])\n", "    \n", "    # scale the experimental grid\n", "    scaled_exp_grid = scaler.transform(exp_grid[labels_list])\n", "    \n", "    exp_grid['scl_'+labels_list[0]] = scaled_exp_grid[:,0]\n", "    exp_grid['scl_'+labels_list[1]] = scaled_exp_grid[:,1]\n", "    exp_grid['scl_'+labels_list[2]] = scaled_exp_grid[:,2]\n", "    exp_grid['scl_'+labels_list[3]] = scaled_exp_grid[:,3]\n", "    \n", "    return exp_grid\n", "\n", "\n", "\n", "def euclidean_distance(input_batch, vector, n_components = 3):\n", "    '''\n", "    Finds the minimum Euclidean distance between a batch of vectors (2D np.array) and a new vector (experiment, 1D np.array)\n", "    \n", "    Parameters:\n", "    -----------\n", "    input_batch: 2D np array (a set of vectors)\n", "    vector: 1D np array (a vector)\n", "    n_components: number of leading components used to compute the distance\n", "    \n", "    Returns:\n", "    -----------\n", "    min_distance: minimum Euclidean distance of the vector to the batch \n", "    \n", "    '''\n", "    \n", "    batch_np = np.array(input_batch)\n", "    vector_np = np.array(vector)\n", "    \n", "\n", "    dislocation = batch_np[0, 0:n_components] - vector_np[0:n_components]\n", "    min_distance = np.linalg.norm(dislocation)\n", "\n", "    for i in range(len(batch_np)):\n", "        dislocation = batch_np[i,0:n_components] - vector_np[0:n_components]\n", "        distance = np.linalg.norm(dislocation)\n", "        \n", "        if min_distance > distance:\n", "            min_distance = distance\n", "    \n", "    return min_distance\n", 
"\n", "\n", "\n", "\n", "def tierd_greedy_acquisition(exp_grid_df, train_df, tier1_label = 'std', tier2_label = 'gpr_yield', \n", " tier1_acquisition = 12, tier2_acquisition = 6, tier3_acquisition = 6, min_distance = 0.5):\n", " \"\"\"\n", " data_df: dataframe with columns 'std' and 'gpr_yield'\n", " tier1_label: column name for the first tier of acquisition\n", " tier2_label: column name for the second tier of acquisition\n", " n: number of samples to acquire\n", " \"\"\"\n", " acquisition_batch = []\n", " control_batch = train_df[['init_C', 'init_N', 'init_Li']].values\n", " temp_df = exp_grid_df.copy()\n", " # sort the dataframe by the first tier\n", " temp_df = temp_df.sort_values(by=tier1_label, ascending=False)\n", " temp_df.reset_index(drop=True)\n", " \n", " acquisition_counter = 0 \n", " row_counter = 0\n", " while acquisition_counter < tier1_acquisition:\n", " if euclidean_distance(control_batch, temp_df.iloc[row_counter, 0:3].values) > min_distance and temp_df.iloc[row_counter][\"fact_check\"]==True:\n", " acquisition_batch.append(temp_df.iloc[row_counter])\n", " control_batch = np.append(control_batch, [temp_df.iloc[row_counter, 0:3]], axis=0)\n", " acquisition_counter += 1\n", " \n", " row_counter += 1\n", " \n", " acquisition_counter = 0 \n", " row_counter = 0\n", " temp_df = temp_df.sort_values(by=tier2_label, ascending=False)\n", " temp_df.reset_index(drop=True)\n", " while acquisition_counter < tier2_acquisition:\n", " if euclidean_distance(control_batch, temp_df.iloc[row_counter, 0:3].values) > min_distance and temp_df.iloc[row_counter][\"fact_check\"]==True:\n", " acquisition_batch.append(temp_df.iloc[row_counter])\n", " control_batch = np.append(control_batch, [temp_df.iloc[row_counter, 0:3]], axis=0)\n", " acquisition_counter += 1\n", " \n", " row_counter += 1\n", " \n", " acquisition_counter = 0 \n", " while acquisition_counter < tier3_acquisition:\n", " row_counter = random.randint(0,len(exp_grid_df)-1)\n", " if 
euclidean_distance(control_batch, temp_df.iloc[row_counter, 0:3].values) > min_distance and temp_df.iloc[row_counter][\"fact_check\"]==True:\n", "            acquisition_batch.append(temp_df.iloc[row_counter])\n", "            control_batch = np.append(control_batch, [temp_df.iloc[row_counter, 0:3]], axis=0)\n", "            acquisition_counter += 1\n", "    \n", "    acquisition_df = pd.DataFrame(acquisition_batch, columns = ['init_C', 'init_N', 'init_Li', 'T', 'gpr_yield', 'std', 'fact_check'])\n", "    \n", "    acquisition_type = []\n", "    for i in range(len(acquisition_df)):\n", "        if i < tier1_acquisition:\n", "            acquisition_type.append(tier1_label)\n", "        elif i < tier1_acquisition + tier2_acquisition:\n", "            acquisition_type.append(tier2_label)\n", "        else:\n", "            acquisition_type.append('random')\n", "    \n", "    acquisition_df['acquisition'] = acquisition_type\n", "    \n", "    return acquisition_df\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 1- Load the Cleaned Data \n", "\n", "Import the pre-processed and cleaned data set ready for integration and analysis." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Instructions for users:\n", "# Update the file_paths list below with the actual paths to your cleaned data files.\n", "# These paths should point to the Excel files containing the data from each batch you wish to load.\n", "# Ensure the format is correct and consistent with your operating system's path conventions.\n", "# Example format for Windows: \"data\\\\clean\\\\batch0.xlsx\"\n", "# Example format for Unix/Linux: \"data/clean/batch0.xlsx\"\n", "\n", "file_paths = [\n", " 'data/clean/batch0.xlsx', # Update this path to your first batch file\n", " 'data/clean/batch1.xlsx', # Update this path to your second batch file\n", " 'data/clean/batch2.xlsx' # Update this path to your third batch file\n", " # Add more paths as needed for additional batches\n", "]\n", "\n", "# Initialize a list to store the loaded data from each batch\n", "data_batches = []\n", "\n", "# Loop through the file paths, loading each file's data into a DataFrame\n", "# and appending it to the data_batches list.\n", "for path in file_paths:\n", " # Load the Excel file into a DataFrame\n", " data = pd.read_excel(path)\n", " # Append the loaded data to the data_batches list\n", " data_batches.append(data)\n", "\n", "# At this point, data_batches contains a list of DataFrames,\n", "# each representing the data from one of the specified files.\n", "# You can now proceed with integrating and analyzing these DataFrames as needed.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 2- Integrate All Obtained Data\n", "\n", "\n", "Combine data from various sources into a unified dataset for model training.\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Instructions for users:\n", "# At this point, it's assumed that you have a list of DataFrames named `data_batches`,\n", "# each containing data 
from different batches as loaded in the previous step.\n", "\n", "# To integrate all the obtained data into a single DataFrame for model training,\n", "# we use the pandas concat function. This function effectively merges the list of\n", "# DataFrames into one, stacking them vertically (one on top of the other) by default.\n", "\n", "# Integrate all obtained data from the different batches\n", "data = pd.concat(data_batches, ignore_index=True)\n", "\n", "# The `ignore_index=True` parameter is used to reindex the new DataFrame. This is useful\n", "# when the original DataFrames have their own indices that might overlap; reindexing\n", "# prevents any potential issues with duplicate indices.\n", "\n", "# At this stage, `data` is a unified DataFrame containing all the data from the different batches.\n", "# This integrated dataset is now ready for further processing, such as cleaning, exploration,\n", "# and model training." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data inspection: " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
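<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Telescope_id</th>\n", "      <th>exp_id</th>\n", "      <th>init_C</th>\n", "      <th>init_N</th>\n", "      <th>init_Li</th>\n", "      <th>T</th>\n", "      <th>fini_Li</th>\n", "      <th>yield</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>CS-NRCan-014_A1</td>\n", "      <td>B0-0</td>\n", "      <td>0.5</td>\n", "      <td>4.5</td>\n", "      <td>1.0</td>\n", "      <td>66</td>\n", "      <td>0.720406</td>\n", "      <td>0.279594</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>CS-NRCan-014_A2</td>\n", "      <td>B0-1</td>\n", "      <td>1.0</td>\n", "      <td>4.5</td>\n", "      <td>1.0</td>\n", "      <td>66</td>\n", "      <td>0.660309</td>\n", "      <td>0.339691</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>CS-NRCan-014_A3</td>\n", "      <td>B0-2</td>\n", "      <td>1.5</td>\n", "      <td>4.5</td>\n", "      <td>1.0</td>\n", "      <td>66</td>\n", "      <td>0.735669</td>\n", "      <td>0.264331</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>CS-NRCan-014_A5</td>\n", "      <td>B0-3</td>\n", "      <td>1.0</td>\n", "      <td>6.0</td>\n", "      <td>1.0</td>\n", "      <td>66</td>\n", "      <td>0.621322</td>\n", "      <td>0.378678</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>CS-NRCan-014_A6</td>\n", "      <td>B0-4</td>\n", "      <td>1.5</td>\n", "      <td>6.0</td>\n", "      <td>1.0</td>\n", "      <td>66</td>\n", "      <td>0.655983</td>\n", "      <td>0.344017</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>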
" ], "text/plain": [ " Telescope_id exp_id init_C init_N init_Li T fini_Li yield\n", "0 CS-NRCan-014_A1 B0-0 0.5 4.5 1.0 66 0.720406 0.279594\n", "1 CS-NRCan-014_A2 B0-1 1.0 4.5 1.0 66 0.660309 0.339691\n", "2 CS-NRCan-014_A3 B0-2 1.5 4.5 1.0 66 0.735669 0.264331\n", "3 CS-NRCan-014_A5 B0-3 1.0 6.0 1.0 66 0.621322 0.378678\n", "4 CS-NRCan-014_A6 B0-4 1.5 6.0 1.0 66 0.655983 0.344017" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "dcf01336058c424b9fbf1135571fde95", "version_major": 2, "version_minor": 0 }, "image/png": "", "text/html": [ "\n", "
Figure\n", "
\n", " " ], "text/plain": [ "Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous …" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Assuming 'data' is your DataFrame\n", "\n", "# Set the size of each subplot with the 'height' parameter.\n", "# 'aspect' controls the width of each subplot relative to its height; aspect=1 means each plot is square.\n", "pairplot = sns.pairplot(data, height=1.2, aspect=1)\n", "\n", "# Adjust overall figure size if necessary\n", "plt.subplots_adjust(top=0.9)\n", "plt.suptitle('Pairplot of Data', fontsize=12) # Add a title to the figure\n", "\n", "# Show the plot\n", "plt.show()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 3- Save the Integrated Data:\n", "\n", "Persist the integrated dataset to disk for reproducibility and future reference." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from datetime import datetime\n", "\n", "# Assuming `data` is your integrated DataFrame and you want to save it as 'new_data_integrated'\n", "\n", "# Get the current date and time in the format Year-Month-Day_Hour-Minute-Second\n", "# This ensures that each file saved has a unique name based on the exact time it was saved.\n", "now = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')\n", "\n", "# Define the path and filename with the current timestamp\n", "# Adjust the directory path as needed for your project's file organization.\n", "file_path = f'data/clean/new_data_{now}.xlsx'\n", "\n", "# Save the DataFrame as an Excel file using the path and filename defined above\n", "# If you're using a Windows system, you might need to use double backslashes (\\\\) or raw string literals (r'path')\n", "data.to_excel(file_path, index=False)\n", "\n", "# Note: The `index=False` parameter is used to prevent pandas from writing row indices into the Excel file.\n", "# If your 
data's index carries meaningful information, you might want to omit this parameter.\n", "\n", "# This command will save your integrated data to the specified path with a timestamp,\n", "# making it easy to identify and access specific versions of your data.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 4- Train a Predictive Model\n", "\n", "Use the integrated data to train a machine learning model that predicts the target variable (yield)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Feature and target selection" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Selecting the features and the target from the dataset. The features are the initial concentrations\n", "# ('init_C', 'init_N', 'init_Li'); the temperature column 'T' is not used as a feature in this cell.\n", "# The target is the yield ('yield').\n", "x_train = data[['init_C', 'init_N', 'init_Li']]\n", "y_train = data['yield']\n", "\n", "# Scaling the features to standardize them. This is crucial for models that are sensitive to\n", "# the scale of the input features, such as Gaussian Process Regressors.\n", "scaler = preprocessing.StandardScaler().fit(x_train)\n", "x_train_scaled = scaler.transform(x_train)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Define the model and tune the hyperparameters" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 72 candidates, totalling 360 fits\n", " params mean_test_score \\\n", "0 {'alpha': 1e-05, 'kernel': 1**2} -0.219749 \n", "1 {'alpha': 1e-05, 'kernel': RBF(length_scale=1)} -0.108052 \n", "2 {'alpha': 1e-05, 'kernel': RBF(length_scale=0.5)} -0.108052 \n", "3 {'alpha': 1e-05, 'kernel': RBF(length_scale=10)} -0.187659 \n", "4 {'alpha': 1e-05, 'kernel': RBF(length_scale=100)} -0.187659 \n", ".. ... ... 
\n", "67 {'alpha': 100.0, 'kernel': RBF(length_scale=100)} -0.315294 \n", "68 {'alpha': 100.0, 'kernel': Matern(length_scale... -0.422865 \n", "69 {'alpha': 100.0, 'kernel': Matern(length_scale... -0.315298 \n", "70 {'alpha': 100.0, 'kernel': Matern(length_scale... -0.315299 \n", "71 {'alpha': 100.0, 'kernel': WhiteKernel(noise_l... -0.427234 \n", "\n", " std_test_score \n", "0 0.113909 \n", "1 0.092560 \n", "2 0.092560 \n", "3 0.163989 \n", "4 0.163989 \n", ".. ... \n", "67 0.069801 \n", "68 0.139160 \n", "69 0.069796 \n", "70 0.069796 \n", "71 0.141448 \n", "\n", "[72 rows x 3 columns]\n", "==================================================\n", "Best parameters for GPR: {'alpha': 0.01, 'kernel': Matern(length_scale=0.5, nu=1.5)}\n", "Best score: -0.05234413020655955\n" ] } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "from sklearn.gaussian_process import GaussianProcessRegressor\n", "from sklearn.gaussian_process.kernels import WhiteKernel, RBF, Matern, ConstantKernel as C\n", "from sklearn.metrics import make_scorer, mean_absolute_error\n", "import numpy as np\n", "\n", "# The choice of model and its hyperparameters can significantly affect the prediction performance.\n", "# We use the Gaussian Process Regressor (GPR), a powerful and flexible model for regression problems.\n", "\n", "# Defining a grid of hyperparameters allows us to explore a range of model configurations.\n", "# The 'kernel' parameter controls the shape of the process' covariance function,\n", "# and 'alpha' deals with the model's noise level. 
We'll use GridSearchCV to find the best combination.\n",
"param_grid = {\n",
"    'kernel': [C(1.0, (1e-2, 1e2)), RBF(1.0, (1e-2, 1e2)),\n",
"               RBF(0.5, (1e-2, 1e2)),\n",
"               RBF(10.0, (1e-2, 1e2)),\n",
"               RBF(100.0, (1e-2, 1e2)),\n",
"               Matern(length_scale=0.1),\n",
"               Matern(length_scale=0.5),\n",
"               Matern(length_scale=1),\n",
"               WhiteKernel()],\n",
"    'alpha': np.logspace(-5, 2, 8)\n",
"}\n",
"\n",
"# GridSearchCV systematically works through every combination of parameter values\n",
"# and cross-validates each to determine which configuration gives the best performance.\n",
"gpr = GaussianProcessRegressor()\n",
"\n",
"# Negated MAE: GridSearchCV maximizes the score, so values closer to zero are better.\n",
"mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)\n",
"\n",
"# Performing the grid search on the scaled training data.\n",
"grid_search = GridSearchCV(\n",
"    gpr,\n",
"    param_grid,\n",
"    cv=5,\n",
"    n_jobs=-1,\n",
"    scoring=mae_scorer,\n",
"    error_score='raise', # raise an exception if a score cannot be computed\n",
"    verbose=2 # print progress messages while the candidates are fitted\n",
")\n",
"\n",
"# Fit GridSearchCV to the data. 
This process can be time-consuming for large datasets or complex models\n", "# but is crucial for finding the most accurate model configuration.\n", "grid_search.fit(x_train_scaled, y_train)\n", "\n", "# Get the cross-validation results\n", "cv_results = pd.DataFrame(grid_search.cv_results_)\n", "print(cv_results[['params', 'mean_test_score', 'std_test_score']])\n", "print('='*50)\n", "best_idx = cv_results['mean_test_score'].argmax()\n", "print(f'Best parameters for GPR: {cv_results[\"params\"][best_idx]}')\n", "print('Best score: {}'.format(grid_search.best_score_))\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Best parameters for different methods are as follows:\n", "\n", "- Best parameters for GPR: {'alpha': 0.001, 'kernel': Matern(length_scale=10, nu=1.5)}\n", "- Best score: -0.053484038058140136\n", "\n", "- Best parameters for Decision Tree: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}\n", "- Best score for Decision Tree: 0.06063882006374226\n", "\n", "- Best parameters for Random Forest: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 500}\n", "- Best score for Random Forest: 0.055737128116578306\n", "\n", "- Best parameters for K-Nearest Neighbor: {'n_neighbors': 5, 'p': 2, 'weights': 'distance'}\n", "- Best score for K-Nearest Neighbor: 0.08994459562199185" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 4.3. Train the models" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
GaussianProcessRegressor(kernel=Matern(length_scale=0.5, nu=1.5),\n",
       "                         n_restarts_optimizer=5)
" ], "text/plain": [ "GaussianProcessRegressor(kernel=Matern(length_scale=0.5, nu=1.5),\n", " n_restarts_optimizer=5)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# After identifying the best parameters, we train a final model using these settings.\n", "# This model is then ready to make predictions on new, unseen data.\n", "\n", "#best_kernel = grid_search.best_params_['kernel']\n", "#model = GaussianProcessRegressor(kernel=best_kernel, alpha=grid_search.best_params_['alpha'], n_restarts_optimizer=5)\n", "#model.fit(x_train_scaled, y_train)\n", "\n", "#Example:\n", "model = GaussianProcessRegressor(kernel=Matern(length_scale=0.5, nu=1.5), n_restarts_optimizer=5)\n", "model.fit(x_train_scaled, y_train)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 5- Save the predictive model\n", "\n", "Save the trained model to ensure that it can be reused without retraining." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "#import pickle\n", "\n", "# Assuming `model` is your trained model object from the previous steps\n", "\n", "# Specify the filename for the saved model\n", "#model_filename = 'trained_model.pkl'\n", "\n", "# Open a file in write-binary (wb) mode to save the model\n", "#with open(model_filename, 'wb') as file:\n", "# pickle.dump(model, file)\n", "\n", "# The model is now saved to 'trained_model.pkl' and can be loaded later for predictions\n", "\n", "# Load the model from the file\n", "#with open(model_filename, 'rb') as file:\n", "# loaded_model = pickle.load(file)\n", "\n", "# `loaded_model` is now the same as the `model` object you saved earlier\n", "# You can use `loaded_model.predict()` to make predictions" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 6- Create a Grid in the Feature Space\n", "\n", "Generate a comprehensive grid that spans the entire feature space under consideration.\n" ] 
}, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [
"def create_exp_grid(label_a = \"init_C\", label_b = \"init_N\", label_c = \"init_Li\", label_d= \"T\",\n",
"                    range_a = (0,2.5), range_b = (0,6), range_c = (0,6), range_d = (66,66)\n",
"                    ):\n",
"    '''\n",
"    This function generates a comprehensive grid over the specified feature space.\n",
"    The four ranges are sampled with 16, 37, 37 and 1 evenly spaced points, respectively.\n",
"    \n",
"    Parameters:\n",
"    ----------\n",
"    label_a, label_b, label_c, label_d : str\n",
"        The labels of the features for which the grid is created.\n",
"    range_a, range_b, range_c, range_d : tuple\n",
"        The minimum and maximum values (inclusive) to create the grid for each feature.\n",
"    \n",
"    Returns:\n",
"    -------\n",
"    pd.DataFrame\n",
"        A DataFrame containing all combinations of the specified feature ranges.\n",
"    '''\n",
"    \n",
"    # Collect one single-row DataFrame per grid point and concatenate once at the end.\n",
"    # DataFrame.append was removed in pandas 2.0, and repeated appends are slow.\n",
"    rows = []\n",
"    \n",
"    for a in tqdm(np.linspace(range_a[0], range_a[1], 16)):\n",
"        for b in np.linspace(range_b[0],range_b[1],37):\n",
"            for c in np.linspace(range_c[0],range_c[1],37):\n",
"                for d in np.linspace(range_d[0],range_d[1],1):\n",
"                    x_df = pd.DataFrame({label_a:[a],\n",
"                                         label_b : [b],\n",
"                                         label_c : [c],\n",
"                                         label_d: [d]})\n",
"                    rows.append(x_df)\n",
"    \n",
"    # Every row keeps index 0 here; reset the index downstream if unique labels are needed.\n",
"    exp_grid = pd.concat(rows)\n",
"    \n",
"    return exp_grid\n"
] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 16/16 [00:10<00:00, 1.49it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ " init_C init_N init_Li T\n", "0 0.0 0.0 0.000000 66.0\n", "0 0.0 0.0 0.166667 66.0\n", "0 0.0 0.0 0.333333 66.0\n", "0 0.0 0.0 0.500000 66.0\n", "0 0.0 0.0 0.666667 66.0\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# Example usage of the function to create an experimental grid\n", "# Users should adjust the feature labels and ranges according to their specific datasets\n", "exp_grid = 
create_exp_grid()\n", "print(exp_grid.head()) # Display the first few rows of the grid\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [
"def scale_dataframe(exp_grid, train_df, labels_list = ['init_C', 'init_N', 'init_Li']):\n",
"    '''This function takes in a dataframe of experimental conditions and returns the scaled dataframe.\n",
"    Parameters:\n",
"    ----------\n",
"    exp_grid: pandas dataframe\n",
"        Experimental grid to scale; it is modified in place.\n",
"    train_df: pandas dataframe\n",
"        Training data used to fit the scaler.\n",
"    labels_list: list of strings\n",
"        Names of the columns to scale; they must exist in both dataframes.\n",
"    \n",
"    Returns:\n",
"    -------\n",
"    exp_grid: pandas dataframe with scaled columns'''\n",
"    \n",
"    # Fit the scaler on the training data only\n",
"    scaler = preprocessing.StandardScaler().fit(train_df[labels_list])\n",
"    \n",
"    # Scale the experimental grid with the training statistics\n",
"    scaled_exp_grid = scaler.transform(exp_grid[labels_list])\n",
"    \n",
"    # Store each scaled feature in a new column prefixed with 'scl_'\n",
"    for i, label in enumerate(labels_list):\n",
"        exp_grid['scl_'+label] = scaled_exp_grid[:,i]\n",
"    \n",
"    return exp_grid\n"
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " init_C init_N init_Li T scl_init_C scl_init_N scl_init_Li\n", "0 0.0 0.0 0.000000 66.0 -1.662726 -3.539546 -1.659431\n", "1 0.0 0.0 0.166667 66.0 -1.662726 -3.539546 -1.484481\n", "2 0.0 0.0 0.333333 66.0 -1.662726 -3.539546 -1.309531\n", "3 0.0 0.0 0.500000 66.0 -1.662726 -3.539546 -1.134581\n", "4 0.0 0.0 0.666667 66.0 -1.662726 -3.539546 -0.959630\n" ] } ], "source": [ "# Assuming 'x_train' is your training DataFrame which has already been defined\n", "# Here, we scale the experimental grid based on 'x_train'\n", "exp_grid = scale_dataframe(exp_grid, x_train, ['init_C', 'init_N', 'init_Li'])\n", "\n", "# Reset the index for clarity\n", "exp_grid.reset_index(drop=True, inplace=True)\n", "print(exp_grid.head()) # Display the first few 
rows of the scaled grid\n", "\n", "# Note to users:\n", "# Ensure that the training data ('x_train') has been properly prepared before using it to scale the experimental grid.\n", "# The experimental grid now matches the scale of your training data and is ready for predictive modeling.\n", "# Proceed to assess the experimental feasibility of each combination in the grid in the next step." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 7- Extract a Subset of the Grid that is Experimentally Feasible\n", "\n", "Identify and isolate parts of the grid that are viable for experimental verification based on predefined criteria." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [
"def fact_check(combination):\n",
"    '''\n",
"    Evaluates if the given combination of experimental conditions meets predefined feasibility criteria.\n",
"    \n",
"    Parameters:\n",
"    ----------\n",
"    combination : list\n",
"        List containing the experiment combinations in the order: [C, N, Li, T].\n",
"    \n",
"    Returns:\n",
"    -------\n",
"    bool\n",
"        True if all criteria are satisfied, False otherwise.\n",
"    \n",
"    The feasibility criteria used here are placeholders. Adjust them based on actual experimental constraints.\n",
"    '''\n",
"    \n",
"    # Extract individual feature values from the combination\n",
"    c, n, l = combination[0], combination[1], combination[2]\n",
"    \n",
"    # Define feasibility criteria based on chemical and experimental guidelines\n",
"    c_total_limit = 0.5 * n\n",
"    if n > 6 or c > c_total_limit or c > 2.5 or l > 6 or l <= 0 or c < (l / 2):\n",
"        return False # Criteria not met\n",
"    return True # Criteria met"
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [
"def grid_feasibility(data_df, label_a=\"init_C\", label_b=\"init_N\", label_c=\"init_Li\", label_d=\"T\"):\n",
"    '''\n",
"    Iterates over a DataFrame of experimental conditions, applying `fact_check` to each row.\n",
"    \n",
"    Parameters:\n",
"    ----------\n",
"    data_df : pd.DataFrame\n",
"        DataFrame containing experimental conditions.\n",
"    label_a, label_b, label_c, label_d : str\n",
"        Column labels for the features to check for feasibility.\n",
"    \n",
"    Returns:\n",
"    -------\n",
"    pd.DataFrame\n",
"        Original DataFrame with an added 'fact_check' column indicating feasibility.\n",
"    '''\n",
"    \n",
"    # List to store feasibility results\n",
"    fact_check_list = []\n",
"    \n",
"    # Check each row for feasibility\n",
"    for i in tqdm(data_df.index, desc=\"Assessing feasibility\"):\n",
"        combination = [data_df.loc[i, label_a], data_df.loc[i, label_b], data_df.loc[i, label_c], data_df.loc[i, label_d]]\n",
"        fact_check_list.append(fact_check(combination))\n",
"    \n",
"    # Add feasibility results to DataFrame\n",
"    data_df['fact_check'] = fact_check_list\n",
"    \n",
"    return data_df"
] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Assessing feasibility: 100%|██████████| 21904/21904 [00:00<00:00, 30415.60it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ " init_C init_N init_Li T scl_init_C scl_init_N scl_init_Li \\\n", "1444 
0.166667 0.333333 0.166667 66.0 -1.422222 -3.290527 -1.484481 \n", "1445 0.166667 0.333333 0.333333 66.0 -1.422222 -3.290527 -1.309531 \n", "1481 0.166667 0.500000 0.166667 66.0 -1.422222 -3.166018 -1.484481 \n", "1482 0.166667 0.500000 0.333333 66.0 -1.422222 -3.166018 -1.309531 \n", "1518 0.166667 0.666667 0.166667 66.0 -1.422222 -3.041508 -1.484481 \n", "\n", " fact_check \n", "1444 True \n", "1445 True \n", "1481 True \n", "1482 True \n", "1518 True \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# Apply the feasibility check to the experimental grid\n", "exp_grid = grid_feasibility(exp_grid)\n", "\n", "# Filter the grid for feasible experiments\n", "feasible_exp_grid = exp_grid[exp_grid['fact_check'] == True]\n", "\n", "# Display the first few rows of the feasible experimental grid\n", "print(feasible_exp_grid.head())\n", "\n", "# Instructions to Users:\n", "# The `feasible_exp_grid` DataFrame now contains only those combinations deemed experimentally viable.\n", "# This filtered grid should be used for subsequent model predictions and experimental planning.\n", "# Adjust the feasibility criteria in the `fact_check` function based on your specific experimental constraints and safety guidelines." ] }, { "cell_type": "code", "execution_count": 200, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " init_C init_N init_Li T scl_init_C scl_init_N scl_init_Li \\\n", "0 0.0 0.0 0.000000 66.0 -2.060409 -5.261654 -2.344928 \n", "1 0.0 0.0 0.166667 66.0 -2.060409 -5.261654 -2.121620 \n", "2 0.0 0.0 0.333333 66.0 -2.060409 -5.261654 -1.898313 \n", "3 0.0 0.0 0.500000 66.0 -2.060409 -5.261654 -1.675005 \n", "4 0.0 0.0 0.666667 66.0 -2.060409 -5.261654 -1.451697 \n", "\n", " scl_T fact_check \n", "0 0.805534 False \n", "1 0.805534 False \n", "2 0.805534 False \n", "3 0.805534 False \n", "4 0.805534 False " ] }, "execution_count": 200, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exp_grid.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 8- Predict the Target Value in the Subset\n", "\n", "Use the trained model to predict target values for the experimentally feasible subset." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " init_C init_N init_Li T scl_init_C scl_init_N scl_init_Li \\\n", "1444 0.166667 0.333333 0.166667 66.0 -1.422222 -3.290527 -1.484481 \n", "1445 0.166667 0.333333 0.333333 66.0 -1.422222 -3.290527 -1.309531 \n", "1481 0.166667 0.500000 0.166667 66.0 -1.422222 -3.166018 -1.484481 \n", "1482 0.166667 0.500000 0.333333 66.0 -1.422222 -3.166018 -1.309531 \n", "1518 0.166667 0.666667 0.166667 66.0 -1.422222 -3.041508 -1.484481 \n", "\n", " fact_check \n", "1444 True \n", "1445 True \n", "1481 True \n", "1482 True \n", "1518 True \n", " init_C init_N init_Li T scl_init_C scl_init_N scl_init_Li \\\n", "1444 0.166667 0.333333 0.166667 66.0 -1.422222 -3.290527 -1.484481 \n", "1445 0.166667 0.333333 0.333333 66.0 -1.422222 -3.290527 -1.309531 \n", "1481 0.166667 0.500000 0.166667 66.0 -1.422222 -3.166018 -1.484481 \n", "1482 0.166667 0.500000 0.333333 66.0 -1.422222 -3.166018 -1.309531 \n", "1518 0.166667 0.666667 0.166667 66.0 -1.422222 -3.041508 -1.484481 \n", "\n", " fact_check gpr_yield 
std \n", "1444 True -1.124478e-07 0.000010 \n", "1445 True 1.935593e-02 0.033434 \n", "1481 True -3.834630e-04 0.012515 \n", "1482 True 2.040830e-02 0.032492 \n", "1518 True -4.558765e-04 0.018514 \n" ] } ], "source": [
"# `model` is the regressor trained in step 4.\n",
"# Ensure your model has been trained and is ready for making predictions.\n",
"\n",
"# Filter the experimental grid for feasible experiments based on the 'fact_check' column.\n",
"# The explicit copy lets us add prediction columns without a SettingWithCopyWarning.\n",
"feasible_grid = exp_grid[exp_grid['fact_check']].copy()\n",
"\n",
"# Display the first few rows of the feasible experimental grid\n",
"print(feasible_grid.head())\n",
"\n",
"# Predict target values using the trained model for the scaled features of the feasible grid\n",
"# 'scl_init_C', 'scl_init_N', 'scl_init_Li' should be scaled versions of your features prepared earlier\n",
"# The model was fitted on a plain array, so a plain array is passed here as well.\n",
"y_pred, std = model.predict(feasible_grid[['scl_init_C', 'scl_init_N', 'scl_init_Li']].to_numpy(), return_std=True)\n",
"\n",
"# Add the predictions (yield) and standard deviations to the feasible experimental grid as new columns\n",
"feasible_grid['gpr_yield'] = y_pred\n",
"feasible_grid['std'] = std\n",
"\n",
"# Instructions to Users:\n",
"# - The feasible experimental grid now includes predicted yields ('gpr_yield') and their uncertainties ('std').\n",
"# - These predictions can be used to guide experimental planning, focusing efforts on areas of the feature space\n",
"# with high potential yield or interesting properties.\n",
"# - The 'std' column provides an estimate of the prediction uncertainty, useful for risk assessment and experimental prioritization.\n",
"\n",
"# Note:\n",
"# - Ensure that the feature columns in the prediction line match, in order, those used during model training.\n",
"# - If your model does not support `return_std`, you may need to adjust the code to omit standard deviation predictions.\n",
"# - Replace `model` with the variable name of your trained model if different.\n",
"\n",
"# 
Display the first few rows of the feasible grid with predictions\n", "print(feasible_grid.head())\n", "\n", "# Reminder:\n", "# - Prior to prediction, it's crucial to scale your experimental grid's features using the same scaler object\n", "# that was applied to the training data to ensure consistency between training and prediction.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 9- Save the Predicted Properties of the Subset\n", "\n", "Store the predictions to guide future experimental endeavors." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime # Ensure datetime is imported for timestamping\n", "import pandas as pd # Make sure pandas is available for DataFrame operations\n", "\n", "# Current date and time for timestamping the output file, ensuring uniqueness\n", "now = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')\n", "\n", "# Saving the feasible experimental grid with predictions to an Excel file\n", "# Adjust the file path and name as needed for your project structure and naming conventions\n", "file_path = 'data/generated/exp_grid_{}.xlsx'.format(now) # Constructs the file path with a timestamp\n", "\n", "# Save the DataFrame to an Excel file\n", "# Note: 'feasible_grid' holds the feasible conditions along with the 'gpr_yield' and 'std' columns added in step 8;\n", "# 'exp_grid' itself does not carry the predictions.\n", "feasible_grid.to_excel(file_path, index=False) # Set index=False to avoid saving the DataFrame index as a separate column\n", "\n", "# Instructions to Users:\n", "# - This step saves the feasible experimental grid, complete with your model's predictions and uncertainties, to an Excel file.\n", "# - The filename includes a timestamp to prevent overwriting previous files and to track when predictions were made.\n", "# - Use this saved file as a reference for planning and conducting future experiments. 
The 'gpr_yield' column indicates\n", "# the predicted yield for each set of conditions, while the 'std' column provides an estimate of prediction uncertainty.\n", "# - This file serves as a valuable resource for identifying promising areas of the feature space to explore experimentally.\n", "\n", "# Note:\n", "# - Ensure you have write permissions to the target directory\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# 10- Acquire Data Based on Predicted Properties:\n", "\n", "Select new data points for acquisition based on the model's predictions and a tiered acquisition strategy, aiming to maximize the model's learning.\n" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [
"def euclidean_distance(input_batch, vector, n_components = 3):\n",
"    '''\n",
"    Finds the minimum Euclidean distance between a batch of vectors (2D np.array) and a new vector (1D np.array).\n",
"    Only the first `n_components` entries of each vector are compared.\n",
"    \n",
"    Parameters:\n",
"    -----------\n",
"    input_batch: 2D np array (a set of vectors)\n",
"    vector: 1D np array (a vector)\n",
"    n_components: number of leading components used in the distance\n",
"    \n",
"    Returns:\n",
"    -----------\n",
"    min_distance: minimum Euclidean distance of the vector to the batch\n",
"    \n",
"    '''\n",
"    \n",
"    # Cast to float so that object-typed rows taken from a mixed-dtype DataFrame are handled\n",
"    batch_np = np.asarray(input_batch, dtype=float)\n",
"    vector_np = np.asarray(vector, dtype=float)\n",
"    \n",
"    # Distance from the vector to every row of the batch, computed in one vectorized step\n",
"    dislocations = batch_np[:, 0:n_components] - vector_np[0:n_components]\n",
"    min_distance = np.linalg.norm(dislocations, axis=1).min()\n",
"    \n",
"    return min_distance\n"
] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "def tierd_greedy_acquisition(exp_grid_df, train_df, tier1_label='std', tier2_label='gpr_yield', \n", " tier1_acquisition=12, tier2_acquisition=6, tier3_acquisition=6, 
min_distance=0.5):\n",
"    \"\"\"\n",
"    Selects new data points for acquisition based on the model's predictions and a tiered acquisition strategy.\n",
"    Tier 1 takes the rows with the largest `tier1_label` values, tier 2 the rows with the largest\n",
"    `tier2_label` values, and tier 3 draws rows at random.\n",
"    \n",
"    Parameters:\n",
"    ----------\n",
"    exp_grid_df : pd.DataFrame\n",
"        DataFrame with columns 'std' and 'gpr_yield' among others.\n",
"    train_df : pd.DataFrame\n",
"        DataFrame used for training the model, used to avoid selecting points too close to existing data.\n",
"    tier1_label, tier2_label : str\n",
"        Column names for sorting data points in tiered acquisition.\n",
"    tier1_acquisition, tier2_acquisition, tier3_acquisition : int\n",
"        Number of samples to acquire in each tier.\n",
"    min_distance : float\n",
"        Minimum Euclidean distance from any point in the training set, or from any point already selected,\n",
"        for a point to be eligible for acquisition.\n",
"    \n",
"    Returns:\n",
"    -------\n",
"    pd.DataFrame\n",
"        DataFrame of selected points for acquisition.\n",
"    \"\"\"\n",
"    \n",
"    acquisition_batch = []\n",
"    acquisition_types = []  # tier label of each selected point, in selection order\n",
"    control_batch = train_df[['init_C', 'init_N', 'init_Li']].values\n",
"    temp_df = exp_grid_df.copy()\n",
"    \n",
"    # Tier 1 Acquisition\n",
"    temp_df = temp_df.sort_values(by=tier1_label, ascending=False).reset_index(drop=True)\n",
"    \n",
"    acquisition_counter = 0\n",
"    row_counter = 0\n",
"    while acquisition_counter < tier1_acquisition and row_counter < len(temp_df):\n",
"        if euclidean_distance(control_batch, temp_df.iloc[row_counter, [0, 1, 2]].values) > min_distance and temp_df.iloc[row_counter][\"fact_check\"]:\n",
"            acquisition_batch.append(temp_df.iloc[row_counter])\n",
"            acquisition_types.append('tier1')\n",
"            control_batch = np.vstack((control_batch, temp_df.iloc[row_counter, [0, 1, 2]].values))\n",
"            acquisition_counter += 1\n",
"        row_counter += 1\n",
"\n",
"    # Tier 2 Acquisition\n",
"    temp_df = temp_df.sort_values(by=tier2_label, ascending=False).reset_index(drop=True)\n",
"    \n",
"    acquisition_counter = 0\n",
"    row_counter = 0\n",
"    while acquisition_counter < tier2_acquisition and row_counter < len(temp_df):\n",
"        if euclidean_distance(control_batch, temp_df.iloc[row_counter, [0, 1, 2]].values) > min_distance and temp_df.iloc[row_counter][\"fact_check\"]:\n",
"            acquisition_batch.append(temp_df.iloc[row_counter])\n",
"            acquisition_types.append('tier2')\n",
"            control_batch = np.vstack((control_batch, temp_df.iloc[row_counter, [0, 1, 2]].values))\n",
"            acquisition_counter += 1\n",
"        row_counter += 1\n",
"\n",
"    # Tier 3 Acquisition (Random)\n",
"    # Restart the counter so that tier 3 contributes its own points, and cap the number of\n",
"    # random draws so the loop terminates when too few eligible points remain.\n",
"    acquisition_counter = 0\n",
"    attempts = 0\n",
"    max_attempts = 10 * len(temp_df)\n",
"    while acquisition_counter < tier3_acquisition and attempts < max_attempts:\n",
"        attempts += 1\n",
"        row_counter = random.randint(0, len(temp_df) - 1)\n",
"        if euclidean_distance(control_batch, temp_df.iloc[row_counter, [0, 1, 2]].values) > min_distance and temp_df.iloc[row_counter][\"fact_check\"]:\n",
"            acquisition_batch.append(temp_df.iloc[row_counter])\n",
"            acquisition_types.append('random')\n",
"            control_batch = np.vstack((control_batch, temp_df.iloc[row_counter, [0, 1, 2]].values))\n",
"            acquisition_counter += 1\n",
"\n",
"    # Construct the acquisition DataFrame\n",
"    acquisition_df = pd.DataFrame(acquisition_batch).reset_index(drop=True)\n",
"    \n",
"    # Assign acquisition types, recorded as each point was selected\n",
"    acquisition_df['acquisition_type'] = acquisition_types\n",
"    \n",
"    return acquisition_df\n"
] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "ename": "KeyboardInterrupt", "evalue": "", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", "Cell \u001b[1;32mIn[53], line 2\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[38;5;66;03m# Corrected usage\u001b[39;00m\n\u001b[1;32m----> 2\u001b[0m acquisition_df \u001b[38;5;241m=\u001b[39m 
\u001b[43mtierd_greedy_acquisition\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfeasible_grid\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mx_train\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mmin_distance\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m0.75\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[0;32m 4\u001b[0m \u001b[38;5;66;03m# Display the selected points for acquisition\u001b[39;00m\n\u001b[0;32m 5\u001b[0m \u001b[38;5;28mprint\u001b[39m(acquisition_df\u001b[38;5;241m.\u001b[39mhead())\n", "Cell \u001b[1;32mIn[52], line 56\u001b[0m, in \u001b[0;36mtierd_greedy_acquisition\u001b[1;34m(exp_grid_df, train_df, tier1_label, tier2_label, tier1_acquisition, tier2_acquisition, tier3_acquisition, min_distance)\u001b[0m\n\u001b[0;32m 54\u001b[0m \u001b[38;5;28;01mwhile\u001b[39;00m acquisition_counter \u001b[38;5;241m<\u001b[39m tier3_acquisition:\n\u001b[0;32m 55\u001b[0m row_counter \u001b[38;5;241m=\u001b[39m random\u001b[38;5;241m.\u001b[39mrandint(\u001b[38;5;241m0\u001b[39m, \u001b[38;5;28mlen\u001b[39m(temp_df) \u001b[38;5;241m-\u001b[39m \u001b[38;5;241m1\u001b[39m)\n\u001b[1;32m---> 56\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m euclidean_distance(control_batch, \u001b[43mtemp_df\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43miloc\u001b[49m\u001b[43m[\u001b[49m\u001b[43mrow_counter\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m1\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m2\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m]\u001b[49m\u001b[38;5;241m.\u001b[39mvalues) \u001b[38;5;241m>\u001b[39m min_distance \u001b[38;5;129;01mand\u001b[39;00m temp_df\u001b[38;5;241m.\u001b[39miloc[row_counter][\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mfact_check\u001b[39m\u001b[38;5;124m\"\u001b[39m]:\n\u001b[0;32m 57\u001b[0m 
acquisition_batch\u001b[38;5;241m.\u001b[39mappend(temp_df\u001b[38;5;241m.\u001b[39miloc[row_counter])\n\u001b[0;32m 58\u001b[0m control_batch \u001b[38;5;241m=\u001b[39m np\u001b[38;5;241m.\u001b[39mvstack((control_batch, temp_df\u001b[38;5;241m.\u001b[39miloc[row_counter, [\u001b[38;5;241m0\u001b[39m, \u001b[38;5;241m1\u001b[39m, \u001b[38;5;241m2\u001b[39m]]\u001b[38;5;241m.\u001b[39mvalues))\n", "File \u001b[1;32mc:\\Users\\mousas10\\AppData\\Local\\anaconda3\\envs\\SI\\lib\\site-packages\\pandas\\core\\indexing.py:1067\u001b[0m, in \u001b[0;36m_LocationIndexer.__getitem__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 1065\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_is_scalar_access(key):\n\u001b[0;32m 1066\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj\u001b[38;5;241m.\u001b[39m_get_value(\u001b[38;5;241m*\u001b[39mkey, takeable\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_takeable)\n\u001b[1;32m-> 1067\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_getitem_tuple\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 1068\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 1069\u001b[0m \u001b[38;5;66;03m# we by definition only have the 0th axis\u001b[39;00m\n\u001b[0;32m 1070\u001b[0m axis \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maxis \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;241m0\u001b[39m\n", "File \u001b[1;32mc:\\Users\\mousas10\\AppData\\Local\\anaconda3\\envs\\SI\\lib\\site-packages\\pandas\\core\\indexing.py:1565\u001b[0m, in \u001b[0;36m_iLocIndexer._getitem_tuple\u001b[1;34m(self, tup)\u001b[0m\n\u001b[0;32m 1563\u001b[0m tup \u001b[38;5;241m=\u001b[39m 
\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_validate_tuple_indexer(tup)\n\u001b[0;32m 1564\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m suppress(IndexingError):\n\u001b[1;32m-> 1565\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_getitem_lowerdim\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtup\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 1567\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_getitem_tuple_same_dim(tup)\n", "File \u001b[1;32mc:\\Users\\mousas10\\AppData\\Local\\anaconda3\\envs\\SI\\lib\\site-packages\\pandas\\core\\indexing.py:991\u001b[0m, in \u001b[0;36m_LocationIndexer._getitem_lowerdim\u001b[1;34m(self, tup)\u001b[0m\n\u001b[0;32m 989\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m section\n\u001b[0;32m 990\u001b[0m \u001b[38;5;66;03m# This is an elided recursive call to iloc/loc\u001b[39;00m\n\u001b[1;32m--> 991\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mgetattr\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43msection\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mname\u001b[49m\u001b[43m)\u001b[49m\u001b[43m[\u001b[49m\u001b[43mnew_key\u001b[49m\u001b[43m]\u001b[49m\n\u001b[0;32m 993\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m IndexingError(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mnot applicable\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n", "File \u001b[1;32mc:\\Users\\mousas10\\AppData\\Local\\anaconda3\\envs\\SI\\lib\\site-packages\\pandas\\core\\indexing.py:1073\u001b[0m, in \u001b[0;36m_LocationIndexer.__getitem__\u001b[1;34m(self, key)\u001b[0m\n\u001b[0;32m 1070\u001b[0m axis \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39maxis \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;241m0\u001b[39m\n\u001b[0;32m 1072\u001b[0m maybe_callable \u001b[38;5;241m=\u001b[39m 
com\u001b[38;5;241m.\u001b[39mapply_if_callable(key, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mobj)\n\u001b[1;32m-> 1073\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_getitem_axis\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmaybe_callable\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43maxis\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[1;32mc:\\Users\\mousas10\\AppData\\Local\\anaconda3\\envs\\SI\\lib\\site-packages\\pandas\\core\\indexing.py:1616\u001b[0m, in \u001b[0;36m_iLocIndexer._getitem_axis\u001b[1;34m(self, key, axis)\u001b[0m\n\u001b[0;32m 1614\u001b[0m \u001b[38;5;66;03m# a list of integers\u001b[39;00m\n\u001b[0;32m 1615\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m is_list_like_indexer(key):\n\u001b[1;32m-> 1616\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_get_list_axis\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43maxis\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 1618\u001b[0m \u001b[38;5;66;03m# a single integer\u001b[39;00m\n\u001b[0;32m 1619\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 1620\u001b[0m key \u001b[38;5;241m=\u001b[39m item_from_zerodim(key)\n", "File \u001b[1;32mc:\\Users\\mousas10\\AppData\\Local\\anaconda3\\envs\\SI\\lib\\site-packages\\pandas\\core\\indexing.py:1587\u001b[0m, in \u001b[0;36m_iLocIndexer._get_list_axis\u001b[1;34m(self, key, axis)\u001b[0m\n\u001b[0;32m 1570\u001b[0m \u001b[38;5;250m\u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[0;32m 1571\u001b[0m \u001b[38;5;124;03mReturn Series values by list or array of integers.\u001b[39;00m\n\u001b[0;32m 1572\u001b[0m \n\u001b[1;32m (...)\u001b[0m\n\u001b[0;32m 1584\u001b[0m \u001b[38;5;124;03m`axis` 
can only be zero.\u001b[39;00m\n\u001b[0;32m 1585\u001b[0m \u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[0;32m 1586\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m-> 1587\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mobj\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_take_with_is_copy\u001b[49m\u001b[43m(\u001b[49m\u001b[43mkey\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43maxis\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 1588\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mIndexError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[0;32m 1589\u001b[0m \u001b[38;5;66;03m# re-raise with different error message\u001b[39;00m\n\u001b[0;32m 1590\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mIndexError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mpositional indexers are out-of-bounds\u001b[39m\u001b[38;5;124m\"\u001b[39m) \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01merr\u001b[39;00m\n", "File \u001b[1;32mc:\\Users\\mousas10\\AppData\\Local\\anaconda3\\envs\\SI\\lib\\site-packages\\pandas\\core\\series.py:945\u001b[0m, in \u001b[0;36mSeries._take_with_is_copy\u001b[1;34m(self, indices, axis)\u001b[0m\n\u001b[0;32m 936\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m_take_with_is_copy\u001b[39m(\u001b[38;5;28mself\u001b[39m, indices, axis\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m0\u001b[39m) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m Series:\n\u001b[0;32m 937\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"\u001b[39;00m\n\u001b[0;32m 938\u001b[0m \u001b[38;5;124;03m Internal version of the `take` method that sets the `_is_copy`\u001b[39;00m\n\u001b[0;32m 939\u001b[0m \u001b[38;5;124;03m attribute to keep track of the parent dataframe (using in indexing\u001b[39;00m\n\u001b[1;32m 
(...)\u001b[0m\n\u001b[0;32m 943\u001b[0m \u001b[38;5;124;03m See the docstring of `take` for full explanation of the parameters.\u001b[39;00m\n\u001b[0;32m 944\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[1;32m--> 945\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mtake\u001b[49m\u001b[43m(\u001b[49m\u001b[43mindices\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mindices\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43maxis\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43maxis\u001b[49m\u001b[43m)\u001b[49m\n", "File \u001b[1;32mc:\\Users\\mousas10\\AppData\\Local\\anaconda3\\envs\\SI\\lib\\site-packages\\pandas\\core\\series.py:929\u001b[0m, in \u001b[0;36mSeries.take\u001b[1;34m(self, indices, axis, is_copy, **kwargs)\u001b[0m\n\u001b[0;32m 921\u001b[0m warnings\u001b[38;5;241m.\u001b[39mwarn(\n\u001b[0;32m 922\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mis_copy is deprecated and will be removed in a future version. 
\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m 923\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mtake\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m always returns a copy, so there is no need to specify this.\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[0;32m 924\u001b[0m \u001b[38;5;167;01mFutureWarning\u001b[39;00m,\n\u001b[0;32m 925\u001b[0m stacklevel\u001b[38;5;241m=\u001b[39mfind_stack_level(),\n\u001b[0;32m 926\u001b[0m )\n\u001b[0;32m 927\u001b[0m nv\u001b[38;5;241m.\u001b[39mvalidate_take((), kwargs)\n\u001b[1;32m--> 929\u001b[0m indices \u001b[38;5;241m=\u001b[39m \u001b[43mensure_platform_int\u001b[49m\u001b[43m(\u001b[49m\u001b[43mindices\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 930\u001b[0m new_index \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mindex\u001b[38;5;241m.\u001b[39mtake(indices)\n\u001b[0;32m 931\u001b[0m new_values \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_values\u001b[38;5;241m.\u001b[39mtake(indices)\n", "\u001b[1;31mKeyboardInterrupt\u001b[0m: " ] } ], "source": [ "# Corrected usage\n", "acquisition_df = tierd_greedy_acquisition(feasible_grid, x_train, min_distance=0.75)\n", "\n", "# Display the selected points for acquisition\n", "print(acquisition_df.head())\n", "\n", "# Instructions to Users:\n", "# - This function helps in selecting new data points for acquisition based on the model's predictions and a tiered strategy.\n", "# - The tiered strategy involves selecting points based on their predicted standard deviation ('std'), predicted yield ('gpr_yield'),\n", "# and a random selection to explore the feature space.\n", "# - Ensure the 'exp_grid_df' passed to the function has been filtered for feasibility (`fact_check` == True) and includes\n", "# scaled features (`scl_init_C`, `scl_init_N`, `scl_init_Li`) used for prediction.\n", "# - The 'train_df' should include the same features as 'exp_grid_df' and 
represent the data used to train the model.\n", "# - Adjust the acquisition tiers and the\n", "\n" ] }, { "cell_type": "code", "execution_count": 208, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
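<table>\n", "<thead>\n", "<tr><th></th><th>init_C</th><th>init_N</th><th>init_Li</th><th>T</th><th>gpr_yield</th><th>std</th><th>fact_check</th><th>acquisition</th></tr>\n", "</thead>\n", "<tbody>\n",
"<tr><th>1444</th><td>0.166667</td><td>0.333333</td><td>0.166667</td><td>66.0</td><td>-0.400134</td><td>0.351154</td><td>True</td><td>std</td></tr>\n",
"<tr><th>1629</th><td>0.166667</td><td>1.166667</td><td>0.166667</td><td>66.0</td><td>-0.328078</td><td>0.289295</td><td>True</td><td>std</td></tr>\n",
"<tr><th>1814</th><td>0.166667</td><td>2.000000</td><td>0.166667</td><td>66.0</td><td>-0.251631</td><td>0.256215</td><td>True</td><td>std</td></tr>\n",
"<tr><th>1999</th><td>0.166667</td><td>2.833333</td><td>0.166667</td><td>66.0</td><td>-0.227632</td><td>0.228918</td><td>True</td><td>std</td></tr>\n",
"<tr><th>8659</th><td>1.000000</td><td>2.000000</td><td>0.166667</td><td>66.0</td><td>-0.339778</td><td>0.202977</td><td>True</td><td>std</td></tr>\n",
"<tr><th>2184</th><td>0.166667</td><td>3.666667</td><td>0.166667</td><td>66.0</td><td>-0.261588</td><td>0.193080</td><td>True</td><td>std</td></tr>\n",
"<tr><th>21868</th><td>2.500000</td><td>6.000000</td><td>0.166667</td><td>66.0</td><td>-0.433743</td><td>0.184640</td><td>True</td><td>std</td></tr>\n",
"<tr><th>11582</th><td>1.333333</td><td>2.833333</td><td>0.166667</td><td>66.0</td><td>-0.382915</td><td>0.180349</td><td>True</td><td>std</td></tr>\n",
"<tr><th>21675</th><td>2.500000</td><td>5.000000</td><td>5.000000</td><td>66.0</td><td>0.737172</td><td>0.178645</td><td>True</td><td>std</td></tr>\n",
"<tr><th>21897</th><td>2.500000</td><td>6.000000</td><td>5.000000</td><td>66.0</td><td>0.776501</td><td>0.175824</td><td>True</td><td>std</td></tr>\n",
"<tr><th>21646</th><td>2.500000</td><td>5.000000</td><td>0.166667</td><td>66.0</td><td>-0.569017</td><td>0.174424</td><td>True</td><td>std</td></tr>\n",
"<tr><th>17317</th><td>2.000000</td><td>4.000000</td><td>0.166667</td><td>66.0</td><td>-0.534577</td><td>0.172032</td><td>True</td><td>std</td></tr>\n",
"<tr><th>16265</th><td>1.833333</td><td>5.333333</td><td>3.666667</td><td>66.0</td><td>0.739930</td><td>0.086747</td><td>True</td><td>gpr_yield</td></tr>\n",
"<tr><th>18818</th><td>2.166667</td><td>4.500000</td><td>3.666667</td><td>66.0</td><td>0.708813</td><td>0.088322</td><td>True</td><td>gpr_yield</td></tr>\n",
"<tr><th>15931</th><td>1.833333</td><td>3.833333</td><td>3.500000</td><td>66.0</td><td>0.647225</td><td>0.120750</td><td>True</td><td>gpr_yield</td></tr>\n",
"<tr><th>13005</th><td>1.500000</td><td>3.000000</td><td>3.000000</td><td>66.0</td><td>0.549470</td><td>0.126346</td><td>True</td><td>gpr_yield</td></tr>\n",
"<tr><th>10448</th><td>1.166667</td><td>3.833333</td><td>2.333333</td><td>66.0</td><td>0.524431</td><td>0.062215</td><td>True</td><td>gpr_yield</td></tr>\n",
"<tr><th>18918</th><td>2.166667</td><td>5.000000</td><td>1.833333</td><td>66.0</td><td>0.455859</td><td>0.048620</td><td>True</td><td>gpr_yield</td></tr>\n",
"<tr><th>10658</th><td>1.166667</td><td>4.833333</td><td>0.333333</td><td>66.0</td><td>-0.126403</td><td>0.098772</td><td>True</td><td>random</td></tr>\n",
"<tr><th>8994</th><td>1.000000</td><td>3.500000</td><td>0.500000</td><td>66.0</td><td>-0.058022</td><td>0.107840</td><td>True</td><td>random</td></tr>\n",
"<tr><th>9473</th><td>1.000000</td><td>5.666667</td><td>0.166667</td><td>66.0</td><td>-0.139092</td><td>0.109784</td><td>True</td><td>random</td></tr>\n",
"<tr><th>5296</th><td>0.500000</td><td>5.333333</td><td>0.833333</td><td>66.0</td><td>0.220441</td><td>0.060282</td><td>True</td><td>random</td></tr>\n",
"<tr><th>8707</th><td>1.000000</td><td>2.166667</td><td>2.000000</td><td>66.0</td><td>0.408810</td><td>0.109876</td><td>True</td><td>random</td></tr>\n",
"<tr><th>3738</th><td>0.333333</td><td>4.500000</td><td>0.166667</td><td>66.0</td><td>-0.274444</td><td>0.137729</td><td>True</td><td>random</td></tr>\n",
"</tbody>\n", "</table>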
" ], "text/plain": [ " init_C init_N init_Li T gpr_yield std fact_check \\\n", "1444 0.166667 0.333333 0.166667 66.0 -0.400134 0.351154 True \n", "1629 0.166667 1.166667 0.166667 66.0 -0.328078 0.289295 True \n", "1814 0.166667 2.000000 0.166667 66.0 -0.251631 0.256215 True \n", "1999 0.166667 2.833333 0.166667 66.0 -0.227632 0.228918 True \n", "8659 1.000000 2.000000 0.166667 66.0 -0.339778 0.202977 True \n", "2184 0.166667 3.666667 0.166667 66.0 -0.261588 0.193080 True \n", "21868 2.500000 6.000000 0.166667 66.0 -0.433743 0.184640 True \n", "11582 1.333333 2.833333 0.166667 66.0 -0.382915 0.180349 True \n", "21675 2.500000 5.000000 5.000000 66.0 0.737172 0.178645 True \n", "21897 2.500000 6.000000 5.000000 66.0 0.776501 0.175824 True \n", "21646 2.500000 5.000000 0.166667 66.0 -0.569017 0.174424 True \n", "17317 2.000000 4.000000 0.166667 66.0 -0.534577 0.172032 True \n", "16265 1.833333 5.333333 3.666667 66.0 0.739930 0.086747 True \n", "18818 2.166667 4.500000 3.666667 66.0 0.708813 0.088322 True \n", "15931 1.833333 3.833333 3.500000 66.0 0.647225 0.120750 True \n", "13005 1.500000 3.000000 3.000000 66.0 0.549470 0.126346 True \n", "10448 1.166667 3.833333 2.333333 66.0 0.524431 0.062215 True \n", "18918 2.166667 5.000000 1.833333 66.0 0.455859 0.048620 True \n", "10658 1.166667 4.833333 0.333333 66.0 -0.126403 0.098772 True \n", "8994 1.000000 3.500000 0.500000 66.0 -0.058022 0.107840 True \n", "9473 1.000000 5.666667 0.166667 66.0 -0.139092 0.109784 True \n", "5296 0.500000 5.333333 0.833333 66.0 0.220441 0.060282 True \n", "8707 1.000000 2.166667 2.000000 66.0 0.408810 0.109876 True \n", "3738 0.333333 4.500000 0.166667 66.0 -0.274444 0.137729 True \n", "\n", " acquisition \n", "1444 std \n", "1629 std \n", "1814 std \n", "1999 std \n", "8659 std \n", "2184 std \n", "21868 std \n", "11582 std \n", "21675 std \n", "21897 std \n", "21646 std \n", "17317 std \n", "16265 gpr_yield \n", "18818 gpr_yield \n", "15931 gpr_yield \n", "13005 gpr_yield \n", "10448 
gpr_yield \n", "18918 gpr_yield \n", "10658 random \n", "8994 random \n", "9473 random \n", "5296 random \n", "8707 random \n", "3738 random " ] }, "execution_count": 208, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# reset_index() returns a new DataFrame; the result is not assigned here, so acquisition_df keeps its original grid indices.\n", "acquisition_df.reset_index(drop=True)\n", "# Display all selected points (head(1000) exceeds the batch size).\n", "acquisition_df.head(1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 11- Save the Acquired Data\n", "\n", " Save the newly selected data points to a timestamped Excel file so they can be used to refine the model in subsequent learning cycles." ] }, { "cell_type": "code", "execution_count": 209, "metadata": {}, "outputs": [], "source": [ "# Timestamp the file name so each acquisition batch is saved separately.\n", "now = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')\n", "\n", "# Forward slashes in the path work on Windows, macOS, and Linux.\n", "acquisition_df.to_excel('data/generated/TGA_batch_{}.xlsx'.format(now))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "chemdev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.16" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "d382266be4744d0754a97ce7ea21db92af52e4d5aa60fa44c3f1d1cf60cd486a" } } }, "nbformat": 4, "nbformat_minor": 2 }