---
name: slurm-job-script-generator
description: >
  Generate correct, copy-pasteable SLURM sbatch job scripts and sanity-check HPC
  resource requests — configure nodes, MPI tasks, OpenMP threads, memory (per-node
  or per-cpu), GPUs, walltime, partitions, modules, and environment variables, with
  automatic detection of conflicting directives and oversubscription. Use when
  preparing a SLURM submission script, deciding between pure MPI and hybrid
  MPI+OpenMP layouts, standardizing #SBATCH directives across a team, debugging why
  a job won't launch or gets killed, or setting up GPU-accelerated simulation jobs,
  even if the user only says "I need to run this on the cluster" or "my job keeps
  getting killed."
allowed-tools: Read, Bash, Write, Grep, Glob
metadata:
  author: HeshamFS
  version: "1.1.0"
  security_tier: high
  security_reviewed: true
  tested_with:
    - claude-code
    - gemini-cli
    - vs-code-copilot
  eval_cases: 2
  last_reviewed: "2026-03-26"
---

# SLURM Job Script Generator

## Goal

Generate a correct, copy-pasteable SLURM job script (`.sbatch`) for running a simulation, and surface common configuration mistakes (bad walltime format, conflicting memory flags, oversubscription hints).

## Requirements

- Python 3.8+
- No external dependencies (Python standard library only)
- Works on Linux, macOS, and Windows (script generation only)

## Inputs to Gather

| Input | Description | Example |
|-------|-------------|---------|
| Job name | Short identifier for the job | `phasefield-strong-scaling` |
| Walltime | SLURM time limit | `00:30:00` |
| Partition | Cluster partition/queue (if required) | `compute` |
| Account | Project/account (if required) | `matsim` |
| Nodes | Number of nodes to allocate | `2` |
| MPI tasks | Total tasks, or tasks per node | `128` or `64` per node |
| Threads | CPUs per task (OpenMP threads) | `2` |
| Memory | `--mem` or `--mem-per-cpu` (cluster policy dependent) | `32G` |
| GPUs | GPUs per node (optional) | `4` |
| Working directory | Where the run should execute | `$SLURM_SUBMIT_DIR` |
| Modules | Environment modules to load (optional) | `gcc/12`, `openmpi/4.1` |
| Run command | The command to launch under SLURM | `./simulate --config cfg.json` |

## Decision Guidance

### MPI vs MPI+OpenMP layout

```
Does the code use OpenMP / threading?
├── NO  → Use MPI-only: cpus-per-task=1
└── YES → Use hybrid: set cpus-per-task = threads per MPI rank
          and export OMP_NUM_THREADS = cpus-per-task
```

**Rule of thumb:** if you see diminishing strong-scaling efficiency at high MPI ranks, try fewer ranks with more threads per rank (and measure).

### Memory flag selection

- Use **either** `--mem` (per node) **or** `--mem-per-cpu` (per CPU), not both.
- Follow your cluster’s documentation; some sites enforce one style.
- SLURM `--mem` units are integer MB by default, or an integer with suffix `K/M/G/T` (and `--mem=0` commonly means “all memory on node”).

## Script Outputs (JSON Fields)

| Script | Key Outputs |
|--------|-------------|
| `scripts/slurm_script_generator.py` | `results.script`, `results.directives`, `results.derived`, `results.warnings` |

## Workflow

1. Gather cluster constraints (partition/account, GPU policy, memory policy).
2. Choose a process layout (MPI-only vs hybrid MPI+OpenMP).
3. Generate the script with `slurm_script_generator.py`.
4. Inspect warnings (conflicts, suspicious layouts).
5. Save the generated script as `job.sbatch` (an illustrative sketch of such a script follows below).
6. Submit with `sbatch job.sbatch` and monitor with `squeue`.
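For orientation, here is a minimal sketch of what a generated hybrid MPI+OpenMP script might look like. The partition, module versions, memory value, and the `srun` launcher line are illustrative assumptions drawn from the inputs table above, not the literal output of `slurm_script_generator.py`; adapt them to your site.

```bash
#!/bin/bash
#SBATCH --job-name=phasefield-strong-scaling
#SBATCH --time=00:30:00
#SBATCH --partition=compute        # site-specific; confirm with your cluster docs
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=2
#SBATCH --mem=32G                  # per-node memory; use --mem-per-cpu instead if your site requires it

set -euo pipefail

module load gcc/12
module load openmpi/4.1

# Hybrid layout: one OpenMP thread per CPU allocated to each MPI rank
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

cd "${SLURM_SUBMIT_DIR}"
srun ./simulate --config cfg.json   # launcher shown as an assumption; some sites prefer mpirun
```

Whether the run command is wrapped in `srun`, `mpirun`, or left bare depends on the generator options and your MPI stack; treat the last line as a placeholder.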
## CLI Examples

```bash
# Preview a job script (prints to stdout)
python3 skills/hpc-deployment/slurm-job-script-generator/scripts/slurm_script_generator.py \
  --job-name phasefield \
  --time 00:10:00 \
  --partition compute \
  --nodes 1 \
  --ntasks-per-node 8 \
  --cpus-per-task 2 \
  --mem 16G \
  --module gcc/12 \
  --module openmpi/4.1 \
  -- \
  ./simulate --config config.json

# Write to a file and also emit structured JSON
python3 skills/hpc-deployment/slurm-job-script-generator/scripts/slurm_script_generator.py \
  --job-name phasefield \
  --time 00:10:00 \
  --nodes 1 \
  --ntasks 16 \
  --cpus-per-task 1 \
  --out job.sbatch \
  --json \
  -- \
  /bin/echo hello
```

## Conversational Workflow Example

**User**: I need an `sbatch` script for my MPI simulation. I want 2 nodes, 64 ranks per node, 2 OpenMP threads per rank, and 2 hours.

**Agent workflow**:

1. Confirm partition/account and whether GPUs are needed.
2. Generate a hybrid job script:
   ```bash
   python3 scripts/slurm_script_generator.py --job-name run --time 02:00:00 --nodes 2 --ntasks-per-node 64 --cpus-per-task 2 -- ./simulate
   ```
3. Explain the mapping:
   - Total ranks = 128
   - Threads per rank = 2 (`OMP_NUM_THREADS=2`)
4. If the user provides node core counts, sanity-check oversubscription using `--cores-per-node` (see the sketch below).
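The oversubscription check in step 4 is simple arithmetic and can also be verified by hand. A minimal sketch, assuming 64 physical cores per node (the `CORES_PER_NODE` value is a hypothetical hardware figure, not something the skill queries):

```bash
# Hand-check CPU oversubscription for the layout above (illustrative values)
NTASKS_PER_NODE=64
CPUS_PER_TASK=2
CORES_PER_NODE=64   # assumed node size; confirm against your cluster documentation

REQUESTED=$(( NTASKS_PER_NODE * CPUS_PER_TASK ))
echo "CPUs requested per node: ${REQUESTED} (available: ${CORES_PER_NODE})"
if (( REQUESTED > CORES_PER_NODE )); then
  echo "WARNING: each node is oversubscribed by $(( REQUESTED - CORES_PER_NODE )) CPUs"
fi
```

With 64 ranks per node and 2 threads per rank, this layout needs 128 CPUs per node, so on 64-core nodes the generator's `--cores-per-node` check should emit an oversubscription warning.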
## Error Handling

| Error | Cause | Resolution |
|-------|-------|------------|
| `time must be HH:MM:SS or D-HH:MM:SS` | Bad walltime format | Use `00:30:00` or `1-00:00:00` |
| `nodes must be positive` | Non-positive nodes | Provide `--nodes >= 1` |
| `Provide either --mem or --mem-per-cpu, not both` | Conflicting memory directives | Choose one memory style |
| `Provide a run command after --` | Missing launch command | Add `-- ./simulate ...` |

## Security

### Input Validation

- `--time` is validated against strict `HH:MM:SS` or `D-HH:MM:SS` format via regex
- `--nodes`, `--ntasks`, `--ntasks-per-node`, `--cpus-per-task`, `--gpus` are validated as positive integers with upper bounds
- `--mem` and `--mem-per-cpu` are validated against SLURM's accepted format (`[K|M|G|T]` suffix); providing both simultaneously is rejected
- `--job-name` is validated against `[a-zA-Z0-9_.-]+` (no shell metacharacters)
- `--partition` and `--account` are validated against safe-character allowlists
- `--module` values are validated to prevent shell injection (no `;`, `|`, `&`, backticks, or `$`)

### File Access

- The script reads no external files; all inputs are provided via CLI arguments
- `--out` writes the generated sbatch script to a single specified file path
- The generated script is a plain-text shell script with `#SBATCH` directives; it contains no dynamically generated code

### Tool Restrictions

- **Read**: Used to inspect script source, references, and existing job scripts
- **Bash**: Used to execute `slurm_script_generator.py` with explicit argument lists; the generated script itself is NOT executed by the agent
- **Write**: Used to save the generated `.sbatch` file; writes are scoped to the user's working directory
- **Grep/Glob**: Used to locate existing scripts, configs, and cluster documentation

### Safety Measures

- No `eval()`, `exec()`, or dynamic code generation
- All subprocess calls use explicit argument lists (no `shell=True`)
- The run command (after `--`) is included verbatim in the generated script but is never executed by the skill itself
- Module names are sanitized to prevent injection into `module load` directives
- Generated scripts use `set -euo pipefail` for safe shell execution on the cluster

## Limitations

- Does not query cluster hardware or site policies; it can only validate internal consistency.
- SLURM installations vary (GPU directives, QoS rules, partitions). Adjust directives for your site.

## References

- `references/slurm_directives.md` - Common `#SBATCH` directives and mapping tips

## Version History

- **v1.0.0** (2026-02-25): Initial SLURM job script generator