---
aliases:
- /2018/06/organize-code-simulations-nersc
categories:
- hpc
- python
- nersc
date: 2018-06-20 18:00
layout: post
slug: organize-code-data-simulations-nersc
title: How to organize code and data for simulations at NERSC

---

I recently improved my strategy for organizing code and data for simulations run at NERSC,
so I am writing it down here for reference.

## Libraries

I mostly use Python (often with C/C++ extensions), so I first rely on the Anaconda
module maintained by NERSC, currently `python/3.6-anaconda-4.4`.

If I need many more packages I can create a conda environment, but to install just
one or two packages I prefer to add them to my `PYTHONPATH`.

I have core libraries that I rely on and often modify to run my simulations.
Those should be installed on Global Common Software, `/global/common/software/projectname`,
a filesystem specifically designed for accessing small files like Python packages.
I generally create a subfolder and reference it with an environment variable:

    export PREFIX=/global/common/software/projectname/zonca/python_prefix

Then I create an `env.sh` script in the source folder of the package (in Global Home) that loads
the environment:

    module load python/3.6-anaconda-4.4
    export PREFIX=/global/common/software/projectname/zonca/python_prefix
    export PATH=$PREFIX/bin:$PATH
    export LD_LIBRARY_PATH=$PREFIX/lib:$LD_LIBRARY_PATH
    export PYTHONPATH=$PREFIX/lib/python3.6/site-packages:$PYTHONPATH

Once I source `env.sh` in my login shell, this environment is automatically propagated to the
compute nodes when I submit a SLURM script, so I do not repeat any of these environment details in my SLURM scripts.
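
Since everything depends on `env.sh` having been sourced before submitting, a quick sanity check can save a failed job. The following is only a minimal sketch: the function names are hypothetical, and it assumes the `lib/pythonX.Y/site-packages` layout used in `env.sh` above.

```python
import os
import sys


def site_packages_dir(prefix, version_info=sys.version_info):
    """Build the site-packages path under a prefix, e.g. lib/python3.6/site-packages."""
    python_folder = "python{}.{}".format(version_info[0], version_info[1])
    return os.path.join(prefix, "lib", python_folder, "site-packages")


def check_environment(environ, version_info=sys.version_info):
    """Return a list of problems found; an empty list means the environment looks fine."""
    prefix = environ.get("PREFIX")
    if not prefix:
        return ["PREFIX is not set, was env.sh sourced?"]
    problems = []
    expected = site_packages_dir(prefix, version_info)
    if expected not in environ.get("PYTHONPATH", "").split(os.pathsep):
        problems.append("{} is missing from PYTHONPATH".format(expected))
    bin_dir = os.path.join(prefix, "bin")
    if bin_dir not in environ.get("PATH", "").split(os.pathsep):
        problems.append("{} is missing from PATH".format(bin_dir))
    return problems


if __name__ == "__main__":
    # Check the environment of the current shell (or of a running job)
    for problem in check_environment(os.environ):
        print("WARNING:", problem)
```

Running this both on a login node and as the first step of a job is a cheap way to confirm that the two environments match.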

Then I can install a package there with:

    python setup.py install --prefix=$PREFIX

or from pip:

    pip install apackage --prefix=$PREFIX

It is also common to install a newer version of a package which is already provided by
the base environment:

    pip install apackage --ignore-installed --upgrade --no-deps --prefix=$PREFIX
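
When a package exists both in the base environment and under `$PREFIX`, it is worth checking which copy Python actually picks up. Here is a small sketch using the standard library `importlib`; the helper names are hypothetical and `apackage` stands for whatever package was installed.

```python
import importlib.util
import os


def import_location(package_name):
    """Return the file a top-level package would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        return None
    return spec.origin


def comes_from_prefix(package_name, prefix):
    """True if the package that would be imported lives under the given prefix."""
    location = import_location(package_name)
    if location is None:
        return False
    # The trailing separator avoids matching /a/prefix_other when asking for /a/prefix
    prefix_dir = os.path.join(os.path.realpath(prefix), "")
    return os.path.realpath(location).startswith(prefix_dir)
```

For example, `comes_from_prefix("apackage", os.environ["PREFIX"])` should return `True` after the upgrade above, provided `PYTHONPATH` is set as in `env.sh`.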

## Simulations SLURM scripts and configuration files

I first create a repository on GitHub for my simulations and clone it to my home folder at NERSC.
I generally create a repository for each experiment, then I create a subfolder for each
type of simulation I am working on.

Inside a folder I create parameter files to configure my run and SLURM scripts to launch the
simulations, and I put everything under version control immediately. I often create a Pull Request
on GitHub and ask my collaborators to cross-check the configuration before I submit a run.

Smaller input data files, even binaries, can be added for convenience to the GitHub repository.

Once a run has been validated, inside the simulation type folder I create a subfolder `runs/201806_details_about_run` and
add a `README.md` that includes all the details about the simulation.
I also tag both the core library I depend on and the simulation repository with the same name, e.g.:

    git tag -a 201806_details_about_run -m "software version used for 201806_details_about_run"
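
Tagging two repositories by hand makes it easy to mistype one of the names, so the step could be scripted. This is just a sketch: the function names and repository paths are hypothetical, and it relies on `git -C <path>` to run the same `git tag` command shown above inside each repository.

```python
import subprocess


def tag_commands(tag, repositories):
    """Build one `git tag` command per repository, all with the same tag name."""
    message = "software version used for {}".format(tag)
    return [
        ["git", "-C", repo, "tag", "-a", tag, "-m", message]
        for repo in repositories
    ]


def tag_all(tag, repositories):
    """Run the tag commands, stopping at the first failure."""
    for command in tag_commands(tag, repositories):
        subprocess.run(command, check=True)
```

Calling `tag_all("201806_details_about_run", [core_library_path, simulation_repo_path])` would then tag both repositories consistently.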

In the `README.md` I also record the paths at NERSC of the input data and output results.
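
To keep these notes uniform across runs, the run folder and its `README.md` could be generated from a template. The layout of the file below is only an assumption about what is worth recording, and `create_run_readme` is a hypothetical helper.

```python
import datetime
import os

README_TEMPLATE = """# {run_name}

- Date: {date}
- Git tag: {run_name}
- Input data: {input_path}
- Output results: {output_path}

## Notes

{notes}
"""


def create_run_readme(simulation_dir, run_name, input_path, output_path,
                      notes="", date=None):
    """Create runs/<run_name>/README.md inside a simulation type folder."""
    run_dir = os.path.join(simulation_dir, "runs", run_name)
    os.makedirs(run_dir, exist_ok=True)
    if date is None:
        date = datetime.date.today().isoformat()
    readme_path = os.path.join(run_dir, "README.md")
    with open(readme_path, "w") as readme:
        readme.write(README_TEMPLATE.format(
            run_name=run_name, date=date, input_path=input_path,
            output_path=output_path, notes=notes,
        ))
    return readme_path
```

The generated file is a starting point; free-form details about the simulation still go in the notes section by hand.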

Then for future simulations I'll keep modifying the SLURM scripts and parameter files but always have
a reference to each previous version.

## Larger input data and output data

Larger input data and outputs are not suitable for version control and should live in a SCRATCH filesystem.
I always use the Global Scratch `$CSCRATCH`, which is available on both Edison and Cori and also
from the Jupyter Notebook environment at <https://jupyter.nersc.gov>.

I create a root folder for the project at:

    $CSCRATCH/projectname

Then a subfolder for each simulation type:

    $CSCRATCH/projectname/simulation_type_1
    $CSCRATCH/projectname/simulation_type_2

Then I symlink those inside the simulation repository as the folder `out/`:

    cd $HOME/projectname/simulation_type_1
    ln -s $CSCRATCH/projectname/simulation_type_1 out

Therefore I can set up my simulation software to save all results inside `out/201806_details_about_run`
and they are actually written to `$CSCRATCH`.
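
The folder creation and symlink can also be done in one step from Python, which is handy when setting up several simulation types. A minimal sketch follows; `link_scratch_dir` is a hypothetical helper and it refuses to touch anything that already exists under the link name unless it is the expected symlink.

```python
import os


def link_scratch_dir(repo_dir, scratch_dir, link_name="out"):
    """Create scratch_dir if needed and symlink it as repo_dir/link_name."""
    link_path = os.path.join(repo_dir, link_name)
    if os.path.islink(link_path):
        # Re-running is fine as long as the link already points to the right place
        if os.path.realpath(link_path) != os.path.realpath(scratch_dir):
            raise FileExistsError("{} already points elsewhere".format(link_path))
    elif os.path.exists(link_path):
        raise FileExistsError("{} exists and is not a symlink".format(link_path))
    else:
        os.symlink(scratch_dir, link_path)
    os.makedirs(scratch_dir, exist_ok=True)
    return link_path
```

Anything written to `out/201806_details_about_run` inside the repository then lands in the scratch folder. Passing a different `link_name` reuses the same helper for other linked folders.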

This setup makes it very convenient to regularly back up everything to tape using `cput`, which only transfers
files that are not already on tape, e.g.:

    cd $CSCRATCH
    hsi
    cput -R projectname

This is going to synchronize the backup on tape with the latest results on `$CSCRATCH`.
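
For regular backups the interactive session can be replaced by a single call. The sketch below assumes that `hsi` accepts the command to run as a command line argument instead of typing it at its prompt, and that the folder name contains no spaces; the function names are hypothetical.

```python
import subprocess


def backup_command(folder):
    """Build the non-interactive equivalent of typing `cput -R folder` inside hsi."""
    return ["hsi", "cput -R {}".format(folder)]


def backup_to_tape(scratch_root, folder):
    """Run the backup from scratch_root, so that `folder` is a relative path."""
    subprocess.run(backup_command(folder), cwd=scratch_root, check=True)
```

With this, `backup_to_tape(os.environ["CSCRATCH"], "projectname")` (after `import os`) would mirror the manual steps above.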

I do the same for input files:

    mkdir $CSCRATCH/projectname/input_simulation_type_1
    cd $HOME/projectname/simulation_type_1
    ln -s $CSCRATCH/projectname/input_simulation_type_1 input