## Tuning the hyperparameters of a neural network using EasyVVUQ and FabSim3

In this tutorial we will use the EasyVVUQ `Grid_Sampler` to perform a grid search on the hyperparameters of a simple Keras neural network model, trained to recognize hand-written digits. This is the famous MNIST data set, of which 4 input features (of size 28 x 28) are shown below. These are fed into a standard feed-forward neural network, which will predict the label 0-9.

The (Keras) neural network script is located in `mnist/keras_mnist.template`, which will form the input template for the EasyVVUQ encoder. We will assume you are familiar with the basic EasyVVUQ building blocks. If not, you can look at the [basic tutorial](https://github.com/UCL-CCS/EasyVVUQ/blob/dev/tutorials/basic_tutorial.ipynb).

![](mnist/mnist_feats.png)

We need EasyVVUQ, TensorFlow and the TensorFlow data sets to execute this tutorial. If you need to install these, uncomment the corresponding line below.

In [1]:
# !pip install easyvvuq
# !pip install tensorflow
# !pip install tensorflow_datasets

### FabSim3

While running on the localhost, we will use the [FabSim3](https://github.com/djgroen/FabSim3) automation toolkit for the data processing workflow, i.e. to move the UQ ensemble to/from the localhost. To connect EasyVVUQ with FabSim3, the [FabUQCampaign](https://github.com/wedeling/FabUQCampaign) plugin must be installed.

The advantage of this construction is that we could offload the ensemble to a remote supercomputer using this same script by simply changing the `MACHINE='localhost'` flag, provided that FabSim3 is set up on the remote resource.

For an example **without FabSim3**, see `tutorials/hyperparameter_tuning_tutorial.ipynb`.

For now, import the required libraries below. `fabsim3_cmd_api` is an interface to FabSim3 that allows the command-line FabSim3 commands to be executed from a Python script. It is stored locally in `fabsim3_cmd_api.py`.

In [2]:
import easyvvuq as uq
import os
import numpy as np

############################################
# Import the FabSim3 commandline interface #
############################################
import fabsim3_cmd_api as fab

We now set some flags:

In [3]:
# Work directory, where the EasyVVUQ directory will be placed
WORK_DIR = '/tmp'
# machine to run ensemble on
MACHINE = "localhost"
# target output filename generated by the code
TARGET_FILENAME = 'output.csv'
# EasyVVUQ campaign name
CAMPAIGN_NAME = 'grid_test'

# FabSim3 config name
CONFIG = 'grid_search'
# Use QCG PilotJob or not
PILOT_JOB = False

Most of these are self-explanatory. Here, `CONFIG` is the name of the script that gets executed for each sample, in this case `grid_search`, which is located in `FabUQCampaign/templates/grid_search`. It essentially just runs our Python code `hyper_param_tune.py`:

```
cd $job_results
$run_prefix

/usr/bin/env > env.log

python3 hyper_param_tune.py
```

Here, `hyper_param_tune.py` is generated by the EasyVVUQ encoder, see below. The flag `PILOT_JOB` regulates the use of the QCG PilotJob mechanism. If `True`, FabSim will submit the ensemble to the (remote) host as a QCG PilotJob, which essentially means that all individual jobs of the ensemble will get packaged into a single job allocation, thereby circumventing the limit on the maximum number of simultaneous jobs that is present on many supercomputers. For more info on the QCG PilotJob click [here](https://github.com/vecma-project/QCG-PilotJob). In this example we'll run the samples on the localhost (see `MACHINE`), and hence we set `PILOT_JOB=False`.

As is standard in EasyVVUQ, we now define the parameter space. In this case it consists of 4 hyperparameters. There is one hidden layer with `n_neurons` neurons, and a Dropout layer after both the input layer and the hidden layer, with dropout probabilities `dropout_prob_in` and `dropout_prob_hidden` respectively. We made the `learning_rate` tuneable as well.

In [4]:
params = {}
params["n_neurons"] = {"type":"integer", "default": 32}
params["dropout_prob_in"] = {"type":"float", "default": 0.0}
params["dropout_prob_hidden"] = {"type":"float", "default": 0.0}
params["learning_rate"] = {"type":"float", "default": 0.001}

These 4 hyperparameters appear as flags in the input template `mnist/keras_mnist.template`. Typically a template is generated from an input file used by some simulation code. In this case however, `mnist/keras_mnist.template` is directly our Python script, with the hyperparameters replaced by flags. For instance:

```python
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dropout($dropout_prob_in),
  tf.keras.layers.Dense($n_neurons, activation='relu'),
  tf.keras.layers.Dropout($dropout_prob_hidden),
  tf.keras.layers.Dense(10)
])
```

is simply the neural network construction part with flags for the dropout probabilities and the number of neurons in the hidden layer. The encoder reads the flags and replaces them with numeric values, and it subsequently writes the corresponding `target_filename=hyper_param_tune.py`:

In [5]:
encoder = uq.encoders.GenericEncoder('./mnist/keras_mnist.template', target_filename='hyper_param_tune.py')
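
To make the substitution step concrete, here is a minimal sketch that uses Python's standard `string.Template` on a two-line stand-in for the template. It only illustrates the idea of replacing `$flag` placeholders with values; it is not EasyVVUQ's own implementation, and the shortened template text is an assumption made for the example.

```python
from string import Template

# Two-line stand-in for the template; the real file is the full Keras script.
template_text = (
    "tf.keras.layers.Dropout($dropout_prob_in),\n"
    "tf.keras.layers.Dense($n_neurons, activation='relu'),\n"
)

# One set of hyperparameter values, as a sampler might supply them.
values = {"dropout_prob_in": 0.0, "n_neurons": 64}

# Replace every $flag with its numeric value.
encoded = Template(template_text).substitute(values)
print(encoded)
```

After substitution no `$` flags remain, so the resulting text is ordinary Python that can be executed as is.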

Now we create the first set of EasyVVUQ `actions` to create separate run directories and to encode the template:

In [6]:
# actions: create directories and encode input template, placing 1 hyper_param_tune.py file in each directory.
actions = uq.actions.Actions(
    uq.actions.CreateRunDirectory(root=WORK_DIR, flatten=True),
    uq.actions.Encode(encoder),
)

# create the EasyVVUQ main campaign object
campaign = uq.Campaign(
    name=CAMPAIGN_NAME,
    work_dir=WORK_DIR,
)

# add the param definitions and actions to the campaign
campaign.add_app(
    name=CAMPAIGN_NAME,
    params=params,
    actions=actions
)

As with the uncertainty-quantification (UQ) samplers, the `vary` dict is used to select which of the `params` we actually vary. Unlike the UQ samplers we do not specify an input probability distribution. This being a grid search, we simply specify a list of values for each hyperparameter. Parameters not in `vary`, but with a flag in the template, will be given the default value specified in `params`.

In [7]:
vary = {"n_neurons": [64, 128], "learning_rate": [0.005, 0.01, 0.015]}

**Note:** we are mixing integers and floats in the `vary` dict. Other data types (string, boolean) can also be used.
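
The interplay between `params` and `vary` can be sketched in plain Python: every run starts from the defaults, and only the varied parameters are overwritten. The helper `complete_run` below is hypothetical and only mirrors the behaviour described above; it is not part of EasyVVUQ.

```python
params = {
    "n_neurons": {"type": "integer", "default": 32},
    "dropout_prob_in": {"type": "float", "default": 0.0},
    "dropout_prob_hidden": {"type": "float", "default": 0.0},
    "learning_rate": {"type": "float", "default": 0.001},
}


def complete_run(params, point):
    """Start from the defaults and overwrite the varied parameters (hypothetical helper)."""
    run = {name: spec["default"] for name, spec in params.items()}
    run.update(point)
    return run


# One grid point: only n_neurons and learning_rate are varied.
run = complete_run(params, {"n_neurons": 64, "learning_rate": 0.005})
print(run)
```

Both dropout probabilities keep their default of 0.0 here, which matches the `dropout_prob_in` and `dropout_prob_hidden` columns in the results table further down.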

The `vary` dict is passed to the `Grid_Sampler`. As can be seen, it creates a tensor product of all 1D points specified in `vary`. If a single tensor product is not useful (e.g. because it creates combinations of parameters that do not make sense), you can also pass a list of different `vary` dicts. For even more flexibility you can also write the required parameter combinations to a CSV file, and pass it to the `CSV_Sampler` instead.

In [8]:
# create an instance of the Grid Sampler
sampler = uq.sampling.Grid_Sampler(vary)

# Associate the sampler with the campaign
campaign.set_sampler(sampler)

# print the points
print("There are %d points:" % (sampler.n_samples()))
sampler.points

There are 6 points:


[array([[64, 0.005],
        [64, 0.01],
        [64, 0.015],
        [128, 0.005],
        [128, 0.01],
        [128, 0.015]], dtype=object)]
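
The six points listed above are the Cartesian (tensor) product of the two value lists. A minimal sketch with `itertools.product` reproduces them, and also shows what combining several smaller `vary` dicts amounts to. The helper `grid_points` and the contents of `vary_list` are illustrative assumptions, not the sampler's internals.

```python
from itertools import product


def grid_points(vary):
    """Return every combination of the listed values as a dict (hypothetical helper)."""
    names = list(vary)
    return [dict(zip(names, combo)) for combo in product(*vary.values())]


vary = {"n_neurons": [64, 128], "learning_rate": [0.005, 0.01, 0.015]}
points = grid_points(vary)
print(len(points))  # 2 * 3 = 6 combinations

# Several smaller grids instead of one full tensor product:
vary_list = [
    {"n_neurons": [64], "learning_rate": [0.005, 0.01]},
    {"n_neurons": [128], "learning_rate": [0.015]},
]
restricted = [point for v in vary_list for point in grid_points(v)]
print(len(restricted))  # 2 + 1 = 3 combinations
```

The number of points in a full tensor product grows multiplicatively with every added hyperparameter, which is one reason to restrict the grid to combinations that are actually of interest.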

Run the `actions` (create the run directories, each containing a `hyper_param_tune.py` file):

In [9]:
###############################
# execute the defined actions #
###############################

campaign.execute().collate()

To run the ensemble, execute:

In [10]:
###################################################
# run the UQ ensemble using the FabSim3 interface #
###################################################

fab.run_uq_ensemble(CONFIG, campaign.campaign_dir, script='grid_search',
                    machine=MACHINE, PJ=PILOT_JOB)

# wait for job to complete
fab.wait(machine=MACHINE)

Executing fabsim localhost run_uq_ensemble:grid_search,campaign_dir=/tmp/grid_testrebm6ntq,script=grid_search,skip=0,PJ=False


2023-03-02 11:35:56.557670: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-02 11:35:56.725197: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-03-02 11:35:56.725224: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-03-02 11:35:57.644413: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-

2023-03-02 11:36:45.073851: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-03-02 11:36:45.073875: W tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: UNKNOWN ERROR (303)
2023-03-02 11:36:45.073894: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (wouter-XPS-13-7390): /proc/driver/nvidia/version does not exist
2023-03-02 11:36:45.074174: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-02 11:36:56.730036: I tensorflow/core/platform/cpu_featur

True

In [11]:
# check if all output files are retrieved from the remote machine, returns a Boolean flag
all_good = fab.verify(CONFIG, campaign.campaign_dir, TARGET_FILENAME, machine=MACHINE)

Executing fabsim localhost fetch_results
Executing fabsim localhost verify_last_ensemble:grid_search,campaign_dir=/tmp/grid_testrebm6ntq,target_filename=output.csv,machine=localhost


In [12]:
if all_good:
    # copy the results from the FabSim results dir to the EasyVVUQ results dir
    fab.get_uq_samples(CONFIG, campaign.campaign_dir, sampler.n_samples(), machine=MACHINE)
else:
    print("Not all samples executed correctly")
    import sys
    sys.exit()

Executing fabsim localhost get_uq_samples:grid_search,campaign_dir=/tmp/grid_testrebm6ntq,number_of_samples=6,skip=0


Briefly:

* `fab.run_uq_ensemble`: this command submits the ensemble to the (remote) host for execution. Under the hood it uses the FabSim3 `campaign2ensemble` subroutine to copy the run directories from `WORK_DIR` to the FabSim3 `SWEEP` directory, located in `config_files/grid_search/SWEEP`. From there the ensemble will be sent to the (remote) host.
* `fab.wait`: this will check every minute on the status of the jobs on the remote host, and sleep otherwise, halting further execution of the script. On the localhost this command doesn't do anything.
* `fab.verify`: this will execute the `verify_last_ensemble` subroutine to see if the output file `target_filename` for each run in the `SWEEP` directory is present in the corresponding FabSim3 results directory. Returns a boolean flag. `fab.verify` will also call the FabSim `fetch_results` method, which actually retrieves the results from the (remote) host. So, if you want to just get the results without verifying the presence of output files, call `fab.fetch_results(machine=MACHINE)` instead. However, if something went wrong on the (remote) host, this will cause an error later on, since not all required output files will have been transferred to the EasyVVUQ `WORK_DIR`.
* `fab.get_uq_samples`: copies the samples from the (local) FabSim results directory to the (local) EasyVVUQ campaign directory. It will not delete the results from the FabSim results directory. If you want to save space, you can delete the results on the FabSim side (see `results` directory in your FabSim home directory). You can also call `fab.clear_results(machine, name_results_dir)` to remove a specific FabSim results directory on a given machine.
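
The kind of check that `fab.verify` performs can be pictured with a small standalone sketch: walk over the run directories and report those that lack the target output file. The directory layout and the helper `missing_outputs` are assumptions made for illustration; the actual FabSim3 `verify_last_ensemble` subroutine may work differently.

```python
import tempfile
from pathlib import Path


def missing_outputs(results_dir, target_filename):
    """Return the names of run directories that lack the target output file."""
    return sorted(
        run_dir.name
        for run_dir in Path(results_dir).iterdir()
        if run_dir.is_dir() and not (run_dir / target_filename).exists()
    )


# Demo on a throw-away directory: run_2 has produced no output file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("run_1", "run_2", "run_3"):
        (Path(tmp) / name).mkdir()
    for name in ("run_1", "run_3"):
        (Path(tmp) / name / "output.csv").write_text("accuracy_train,accuracy_test\n")
    missing = missing_outputs(tmp, "output.csv")

print(missing)
```

An empty list corresponds to the case `all_good == True`; a non-empty list names the runs that would need to be inspected or resubmitted.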

#### Error handling

If `all_good == False`, something went wrong on the (remote) host, and `sys.exit()` is called in our example, giving you the opportunity to investigate what went wrong. It can happen that a (small) number of jobs did not get executed on the remote host for some reason, whereas most jobs did execute successfully. In this case simply resubmitting the failed jobs could be an option:

```python
fab.remove_succesful_runs(CONFIG, campaign.campaign_dir)
fab.resubmit_previous_ensemble(CONFIG, 'grid_search')
```

The first command removes all successful run directories from the `SWEEP` dir, i.e. those for which the output file `TARGET_FILENAME` has been found. For this to work, `fab.verify` must have been called. Then, `fab.resubmit_previous_ensemble` simply resubmits the runs that are present in the `SWEEP` directory, which by now only contains the failed runs. After the jobs have finished, call `fab.verify` again to see if `TARGET_FILENAME` is now present in the results directory for every run in the `SWEEP` dir.
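
The verify, remove, resubmit and wait steps can be wrapped in a bounded retry loop. The sketch below uses only the calls shown in this tutorial, with the argument lists exactly as they appear above; the function `resubmit_until_complete`, the `max_retries` limit and the `FakeFab` stand-in (used so the sketch runs without FabSim3) are assumptions for illustration.

```python
def resubmit_until_complete(fab, config, campaign_dir, target_filename,
                            script="grid_search", machine="localhost",
                            max_retries=3):
    """Resubmit failed runs until fab.verify succeeds or the retries run out."""
    all_good = fab.verify(config, campaign_dir, target_filename, machine=machine)
    retries = 0
    while not all_good and retries < max_retries:
        # Keep only the failed runs in the SWEEP directory, then resubmit them.
        fab.remove_succesful_runs(config, campaign_dir)
        fab.resubmit_previous_ensemble(config, script)
        fab.wait(machine=machine)
        all_good = fab.verify(config, campaign_dir, target_filename, machine=machine)
        retries += 1
    return all_good


class FakeFab:
    """Stand-in for fabsim3_cmd_api: verification succeeds on the third call."""

    def __init__(self):
        self.verify_calls = 0
        self.resubmits = 0

    def verify(self, config, campaign_dir, target_filename, machine="localhost"):
        self.verify_calls += 1
        return self.verify_calls >= 3

    def remove_succesful_runs(self, config, campaign_dir):
        pass

    def resubmit_previous_ensemble(self, config, script):
        self.resubmits += 1

    def wait(self, machine="localhost"):
        return True


fake = FakeFab()
ok = resubmit_until_complete(fake, "grid_search", "/tmp/campaign", "output.csv")
print(ok, fake.resubmits)
```

With the real module you would pass `fab` itself instead of `FakeFab()`. Bounding the number of retries avoids resubmitting forever when a run fails for a systematic reason, such as a bug in the script.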

Once we are sure we have all required output files, the role of FabSim is over, and we proceed with decoding the output files. In this case, our Python script wrote the training and test accuracy to a CSV file, hence we use the `SimpleCSV` decoder. 

**Note**: It is also possible to use a more flexible HDF5 format, by using `uq.decoders.HDF5` instead.
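
The part of the training script that writes the output file is not shown in this tutorial. As a rough sketch, assuming the file consists of a header row with the two column names followed by a single row of values, it could be written with the standard `csv` module as follows. The function `write_accuracies` is hypothetical, and an in-memory buffer replaces the real `output.csv` so the example has no side effects.

```python
import csv
import io


def write_accuracies(handle, accuracy_train, accuracy_test):
    """Write one header row and one row of accuracies (assumed file layout)."""
    writer = csv.writer(handle)
    writer.writerow(["accuracy_train", "accuracy_test"])
    writer.writerow([accuracy_train, accuracy_test])


# In the real script the handle would be open("output.csv", "w", newline="").
buffer = io.StringIO()
write_accuracies(buffer, 0.959267, 0.9544)

# Read the file back by column name, as a CSV decoder would.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(rows[0]["accuracy_test"])
```

The column names in the header must match the `output_columns` passed to the decoder below.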

In [13]:
#############################################
# All output files are present, decode them #
#############################################
output_columns = ["accuracy_train", "accuracy_test"]

decoder = uq.decoders.SimpleCSV(
    target_filename=TARGET_FILENAME,
    output_columns=output_columns)

actions = uq.actions.Actions(
    uq.actions.Decode(decoder),
)

campaign.replace_actions(CAMPAIGN_NAME, actions)

###########################
# Execute decoding action #
###########################

campaign.execute().collate()

data_frame = campaign.get_collation_result()
data_frame

Unnamed: 0_level_0,run_id,iteration,n_neurons,learning_rate,dropout_prob_in,dropout_prob_hidden,accuracy_train,accuracy_test
Unnamed: 0_level_1,0,0,0,0,0,0,0,0
0,1,0,64,0.005,0.0,0.0,0.959267,0.9544
1,2,0,64,0.01,0.0,0.0,0.974133,0.9653
2,3,0,64,0.015,0.0,0.0,0.979717,0.9712
3,4,0,128,0.005,0.0,0.0,0.963333,0.9592
4,5,0,128,0.01,0.0,0.0,0.978667,0.9718
5,6,0,128,0.015,0.0,0.0,0.98365,0.9744


Display the hyperparameters with the maximum test accuracy

In [14]:
print("Best hyperparameters with %.2f%% test accuracy:" % (data_frame['accuracy_test'].max().values * 100,))
data_frame.loc[data_frame['accuracy_test'].idxmax()][vary.keys()]

Best hyperparameters with 97.44% test accuracy:


Unnamed: 0_level_0,n_neurons,learning_rate
Unnamed: 0_level_1,0,0
5,128,0.015
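
Beyond the single best point, it can be informative to rank all configurations and to look at the gap between training and test accuracy. The sketch below does this in plain Python, with the values copied from the results table above rather than read from the campaign data frame.

```python
# Values copied from the collation result shown above.
results = [
    {"n_neurons": 64, "learning_rate": 0.005, "accuracy_train": 0.959267, "accuracy_test": 0.9544},
    {"n_neurons": 64, "learning_rate": 0.01, "accuracy_train": 0.974133, "accuracy_test": 0.9653},
    {"n_neurons": 64, "learning_rate": 0.015, "accuracy_train": 0.979717, "accuracy_test": 0.9712},
    {"n_neurons": 128, "learning_rate": 0.005, "accuracy_train": 0.963333, "accuracy_test": 0.9592},
    {"n_neurons": 128, "learning_rate": 0.01, "accuracy_train": 0.978667, "accuracy_test": 0.9718},
    {"n_neurons": 128, "learning_rate": 0.015, "accuracy_train": 0.98365, "accuracy_test": 0.9744},
]

# Sort from highest to lowest test accuracy.
ranked = sorted(results, key=lambda r: r["accuracy_test"], reverse=True)

for r in ranked:
    gap = r["accuracy_train"] - r["accuracy_test"]
    print("n_neurons=%d, learning_rate=%.3f, test=%.4f, train-test gap=%.4f"
          % (r["n_neurons"], r["learning_rate"], r["accuracy_test"], gap))

best = ranked[0]
```

Note that the two best configurations differ by less than 0.3 percentage points in test accuracy, and each was trained only once, so the ranking near the top should be read with some caution.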


## Executing a grid search on a remote host

To run the example script on a remote host, a number of changes must be made. Ensure that the remote host is defined in `machines.yml` in your FabSim3 directory, along with the user login information. Assuming we'll run the ensemble on the Eagle supercomputer at the Poznan Supercomputing and Networking Center, the entry in `machines_user.yml` could look similar to the following:

```
eagle_vecma:
  username: "<your_username>"
  home_path_template: "/tmp/lustre/<your_username>"
  budget: "plgvecma2021"
  cores: 1
  # job wall time for each job, format Days-Hours:Minutes:Seconds
  job_wall_time : "0-0:59:00" # job wall time for each single job without PJ
  PJ_size : "1" # number of requested nodes for PJ
  PJ_wall_time : "0-00:59:00" # job wall time for PJ
  modules:
    loaded: ["python/3.7.3"] 
    unloaded: [] 
```
Here:

* `home_path_template`: the remote root directory for FabSim3, such that for instance the results on the remote machine will be stored in `home_path_template/FabSim3/results`.
* `budget`: the name of the computational budget that you are allowed to use.
* `cores`: the number of cores to use *per run*. Our simple Keras script just needs a single core, but applications which already have some built-in parallelism will require more cores.
* `job_wall_time`: a time limit *per run*, and *without* the use of the QCG PilotJob framework.
* `PJ_size`: the number of *nodes*, in the case *with* the use of the QCG PilotJob framework.
* `PJ_wall_time`: a *total* time limit, and *with* the use of the QCG PilotJob framework.
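
Because `job_wall_time` applies per run while `PJ_wall_time` covers the whole ensemble, a rough estimate of the latter can help when filling in the entry. The sketch below assumes that runs execute in equal-length waves of `concurrent_runs` runs each and ignores scheduler and start-up overhead; both helpers and all numbers are hypothetical.

```python
import math


def format_wall_time(total_seconds):
    """Format seconds as Days-Hours:Minutes:Seconds, the layout used in the entry above."""
    days, rem = divmod(int(total_seconds), 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


def estimate_pj_wall_time(n_runs, seconds_per_run, concurrent_runs):
    """Rough total time in seconds if runs execute in equal-length waves (an assumption)."""
    waves = math.ceil(n_runs / concurrent_runs)
    return waves * seconds_per_run


# 6 runs of about 10 minutes each, 4 running at the same time: 2 waves.
total = estimate_pj_wall_time(n_runs=6, seconds_per_run=600, concurrent_runs=4)
print(format_wall_time(total))
```

In practice it is sensible to add a safety margin on top of such an estimate, since a job that exceeds its wall time is typically terminated by the scheduler.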

To automatically set up the SSH keys, and avoid having to log in manually for every run, execute the following from the command line:

```
fabsim eagle_vecma setup_ssh_keys
```

Once the remote machine is properly set up, we can just set:

```python
# Use QCG PilotJob or not
PILOT_JOB = False
# machine to run ensemble on
MACHINE = "eagle_vecma"
```

If you now re-run the example script, the ensemble will execute on the remote host, submitting each run as a separate job. By setting `PILOT_JOB=True`, all runs will be packaged in a single job.