---
title: 'distribute: Easy to use Resource Manager for Distributed Computing without Assumptions'
tags:
  - Rust
  - Python
  - computing
  - gpu
  - linux
authors:
  - name: Brooks Karlik
    orcid: 0000-0002-1431-5199
    equal-contrib: true
    corresponding: true
    affiliation: 1
  - name: Aditya Nair
    equal-contrib: false
    orcid: 0000-0002-8979-8420
    affiliation: 1
affiliations:
  - name: University of Nevada, Reno
    index: 1
date: 3 March 2023
bibliography: paper.bib
---

# Summary

Distributed computing is an essential element of advancing the computational sciences. Workload managers used by computing clusters help to meet the need for running large-scale simulations and efficiently sharing computational resources among multiple users [@reuther2016scheduler]. In this context, we introduce a tool that allocates computational resources effectively across the small clusters of computers commonly found in research groups, without requiring specialized hardware such as computing fabric or shared filesystems. Additionally, compute nodes within the cluster can serve as standard workstations for researchers: any ongoing job on a machine can be temporarily suspended so that normal research activities may proceed.

# Statement of need

Many research groups face challenges with inefficient utilization or over-scheduling of their computational resources. While some computers work for days to complete a queue of simulations, others sit unused. Dividing simulations manually among computers is time-consuming and often ineffective due to varying hardware configurations. Additionally, multiple individuals running jobs on a single computer can slow simulation progress for everyone involved.

To meet the growing challenges of managing compute resources among researchers, workload managers such as Slurm [@SLURM] and TORQUE [@TORQUE] quickly rose in popularity. These workload managers enforce strict upper bounds on memory, CPU time, and CPU cores for each submitted job. Leveraging known runtime and hardware constraints, these traditional workload managers optimize the scheduling of hundreds of concurrent jobs across thousands of CPU cores and terabytes of memory.

While Slurm and TORQUE provide elegant and powerful solutions for scientific computing, they are incompatible with the inexpensive off-the-shelf hardware present in many research labs. In order to schedule a compute task concurrently across hundreds of cores and dozens of individual computers, traditional workload managers require specialized memory and filesystems that enable communication between machines. Although this hardware is standard in conventional supercomputing clusters, requiring it leaves the collections of ordinary desktop computers found in research labs underserved.

Users with access to scientific computing clusters are usually given minimal privileges: only enough to start jobs and download the results. Since users do not have direct access to individual compute nodes, Slurm and TORQUE generally assume the nodes are used purely for computation and provide no method to temporarily halt a job on a single node. However, in a lab environment, desktops used for compute purposes often double as workstations for researchers, and the execution of jobs in the background may slow down day-to-day work.

# *distribute* Overview

Our tool, *distribute*, eliminates the need for specialized hardware such as memory fabric or shared filesystems, making almost no assumptions about the architecture of the computers. It schedules multiple jobs concurrently across several computers, as shown in \autoref{fig:computing-layout}, avoiding the simultaneous execution of a single job across multiple computers. Since the hardware that *distribute* targets also doubles as research workstations, we offer a simple interface for pausing the background execution of jobs so that normal research work can continue.

![Left: traditional horizontally scaling compute framework. Right: proposed compute framework.\label{fig:computing-layout}](./node_layout.png)

While Slurm and TORQUE provide simple interfaces to schedule a single large job, *distribute* focuses on the scheduling of tens to hundreds of small- to medium-sized jobs. In a *distribute* configuration file, a method for compiling a computational task ("solver") is specified along with a list of job names. Each job name specifies a list of files that serve as inputs to the solver. \autoref{fig:stateless-solver-formula} provides an overview of the *distribute* execution model. When each job is scheduled, *distribute* ensures that the solver is correctly compiled on each compute node and provided with access to the input files for the job. When the job is completed, the solver outputs are transported to and archived on the head node.
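As a minimal illustration of this model, the following sketch builds such a configuration for a small parameter sweep in plain Python. The keys (`solver`, `jobs`, `required_files`), field values, and file names are hypothetical and do not reflect *distribute*'s actual configuration schema.

```python
# Illustrative sketch only: the keys and file names below are hypothetical
# and do not correspond to distribute's real configuration format.
import json
from pathlib import Path

viscosities = [0.01, 0.05, 0.10]
resolutions = [64, 128]

jobs = []
for nu in viscosities:
    for n in resolutions:
        name = f"nu{nu}_n{n}"
        case_file = Path(f"{name}.json")
        # Each job receives its own input file describing the case to run.
        case_file.write_text(json.dumps({"viscosity": nu, "resolution": n}))
        jobs.append({"name": name, "required_files": [case_file.name]})

# A single solver definition is shared by every job in the batch.
config = {"solver": {"apptainer_image": "solver.sif"}, "jobs": jobs}
Path("jobs_config.json").write_text(json.dumps(config, indent=2))
```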
In *distribute*, the user is given two methods of providing a solver. In the first method, an Apptainer [@apptainer] image may be compiled and submitted alongside the configuration file. With this method, the user is able to verify that all dependencies are correctly supplied and that the container functions exactly as expected. Although less robust, the alternative method is to supply a Python script that compiles the solver at runtime on each compute node.

After the completion of job batches, *distribute* provides a simple method to query the archived outputs and selectively download files from the head node. With this approach, users avoid manually writing *rsync* scripts to download files stored on a compute cluster. Running a multidimensional parameter-space sweep may result in hundreds of jobs, and our tool provides a Python package to programmatically generate the configuration file. Furthermore, *distribute* can transpile its job execution configuration to the Slurm format, making it easy to run a batch of jobs previously executed on a *distribute* cluster of in-house machines on a larger cluster with hundreds of cores.

![The input-output model for *distribute* jobs. A stateless solver executes a computational workload and produces output.\label{fig:stateless-solver-formula}](./input_outputs.png)

# Acknowledgements

AGN acknowledges the support of the National Science Foundation AI Institute in Dynamic Systems (Award no. 2112085, PM: Dr. Shahab Shojaei-Zadeh).

# References