---
name: nemo-evaluator-sdk
description: Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Evaluation, NeMo, NVIDIA, Benchmarking, MMLU, HumanEval, Multi-Backend, Slurm, Docker, Reproducible, Enterprise]
dependencies: [nemo-evaluator-launcher>=0.1.25, docker]
---

# NeMo Evaluator SDK - Enterprise LLM Benchmarking

## Quick Start

NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).

**Installation**:

```bash
pip install nemo-evaluator-launcher
```

**Set API key and run evaluation**:

```bash
export NGC_API_KEY=nvapi-your-key-here

# Create minimal config
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

# Run evaluation
nemo-evaluator-launcher run --config-dir . --config-name config
```

**View available tasks**:

```bash
nemo-evaluator-launcher ls tasks
```

## Common Workflows

### Workflow 1: Evaluate Model on Standard Benchmarks

Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.

**Checklist**:

```
Standard Evaluation:
- [ ] Step 1: Configure API endpoint
- [ ] Step 2: Select benchmarks
- [ ] Step 3: Run evaluation
- [ ] Step 4: Check results
```

**Step 1: Configure API endpoint**

```yaml
# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
```

For self-hosted endpoints (vLLM, TRT-LLM):

```yaml
target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""  # No key needed for local
```

**Step 2: Select benchmarks**

Add tasks to your config:

```yaml
evaluation:
  tasks:
    - name: ifeval              # Instruction following
    - name: gpqa_diamond        # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN      # Some tasks need an HF token
    - name: gsm8k_cot_instruct  # Math reasoning
    - name: humaneval           # Code generation
```

**Step 3: Run evaluation**

```bash
# Run with config file
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config

# Override output directory
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o execution.output_dir=./my_results

# Limit samples for quick testing
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
```

**Step 4: Check results**

```bash
# Check job status
nemo-evaluator-launcher status

# List all runs
nemo-evaluator-launcher ls runs

# View results
cat results/<invocation_id>/<task_name>/artifacts/results.yml
```
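To skim every task's results in one pass, you can loop over the result files under the output directory. This is a minimal sketch; the `<invocation_id>/<task_name>/artifacts` layout is an assumption, so adjust the glob to match what your run actually produced.

```bash
# Sketch: print every per-task results.yml found under ./results.
# The directory layout in the glob is an assumption; adapt it to your output_dir.
for f in ./results/*/*/artifacts/results.yml; do
  echo "===== $f ====="
  cat "$f"
done
```

### Workflow 2: Run Evaluation on Slurm HPC Cluster

Execute large-scale evaluation on HPC infrastructure.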
**Checklist**:

```
Slurm Evaluation:
- [ ] Step 1: Configure Slurm settings
- [ ] Step 2: Set up model deployment
- [ ] Step 3: Launch evaluation
- [ ] Step 4: Monitor job status
```

**Step 1: Configure Slurm settings**

```yaml
# slurm_config.yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8
```

**Step 2: Set up model deployment**

```yaml
deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096

target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL auto-generated by deployment
```

**Step 3: Launch evaluation**

```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name slurm_config
```

**Step 4: Monitor job status**

```bash
# Check status (queries sacct)
nemo-evaluator-launcher status <invocation_id>

# View detailed info
nemo-evaluator-launcher info <invocation_id>

# Kill if needed
nemo-evaluator-launcher kill <invocation_id>
```

### Workflow 3: Compare Multiple Models

Benchmark multiple models on the same tasks for comparison.

**Checklist**:

```
Model Comparison:
- [ ] Step 1: Create base config
- [ ] Step 2: Run evaluations with overrides
- [ ] Step 3: Export and compare results
```

**Step 1: Create base config**

```yaml
# base_eval.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./comparison_results

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval
```

**Step 2: Run evaluations with model overrides**

```bash
# Evaluate Llama 3.1 8B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions

# Evaluate Mistral 7B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
```

For larger sweeps, a scripted version of these per-model runs is sketched at the end of this workflow.

**Step 3: Export and compare**

```bash
# Export each run to MLflow
nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow

# Export to local JSON
nemo-evaluator-launcher export <invocation_id> --dest local --format json

# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb
```
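When comparing more than two models, the per-model overrides from Step 2 can be generated in a loop. This is a sketch, not part of the launcher itself: the model IDs are illustrative, the NVIDIA Build URL is assumed, and writing each model's results into its own subdirectory is simply one way to keep runs separate.

```bash
# Sketch: sweep several models against the same base_eval config.
# Model IDs are illustrative; any OpenAI-compatible endpoint works.
MODELS=(
  "meta/llama-3.1-8b-instruct"
  "mistralai/mistral-7b-instruct-v0.3"
  "meta/llama-3.1-70b-instruct"
)
for MODEL in "${MODELS[@]}"; do
  nemo-evaluator-launcher run \
    --config-dir . \
    --config-name base_eval \
    -o target.api_endpoint.model_id="$MODEL" \
    -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
    -o execution.output_dir="./comparison_results/${MODEL//\//_}"  # slashes -> underscores
done
```

### Workflow 4: Safety and Vision-Language Evaluation

Evaluate models on safety benchmarks and VLM tasks.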
**Checklist**:

```
Safety/VLM Evaluation:
- [ ] Step 1: Configure safety tasks
- [ ] Step 2: Set up VLM tasks (if applicable)
- [ ] Step 3: Run evaluation
```

**Step 1: Configure safety tasks**

```yaml
evaluation:
  tasks:
    - name: aegis      # Safety harness
    - name: wildguard  # Safety classification
    - name: garak      # Security probing
```

**Step 2: Configure VLM tasks**

```yaml
# For vision-language models
target:
  api_endpoint:
    type: vlm  # Vision-language endpoint
    model_id: nvidia/llama-3.2-90b-vision-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: ocrbench  # OCR evaluation
    - name: chartqa   # Chart understanding
    - name: mmmu      # Multimodal understanding
```

## When to Use vs Alternatives

**Use NeMo Evaluator when:**

- You need **100+ benchmarks** from 18+ harnesses in one platform
- You run evaluations on **Slurm HPC clusters** or in the cloud
- You require **reproducible**, containerized evaluation
- You evaluate against **OpenAI-compatible APIs** (vLLM, TRT-LLM, NIMs)
- You need **enterprise-grade** evaluation with result export (MLflow, W&B)

**Use alternatives instead:**

- **lm-evaluation-harness**: Simpler setup for quick local evaluation
- **bigcode-evaluation-harness**: Focused only on code benchmarks
- **HELM**: Stanford's broader evaluation (fairness, efficiency)
- **Custom scripts**: Highly specialized domain evaluation

## Supported Harnesses and Tasks

| Harness | Task Count | Categories |
|---------|------------|------------|
| `lm-evaluation-harness` | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| `simple-evals` | 20+ | GPQA, MATH, AIME |
| `bigcode-evaluation-harness` | 25+ | HumanEval, MBPP, MultiPL-E |
| `safety-harness` | 3 | Aegis, WildGuard |
| `garak` | 1 | Security probing |
| `vlmevalkit` | 6+ | OCRBench, ChartQA, MMMU |
| `bfcl` | 6 | Function calling v2/v3 |
| `mtbench` | 2 | Multi-turn conversation |
| `livecodebench` | 10+ | Live coding evaluation |
| `helm` | 15 | Medical domain |
| `nemo-skills` | 8 | Math, science, agentic |

## Common Issues

**Issue: Container pull fails**

Ensure NGC credentials are configured:

```bash
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
```

**Issue: Task requires environment variable**

Some tasks need HF_TOKEN or JUDGE_API_KEY:

```yaml
evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN  # Maps the task's env var to a host env var name
```

**Issue: Evaluation timeout**

Increase parallelism or reduce samples:

```bash
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100
```

**Issue: Slurm job not starting**

Check Slurm account and partition:

```yaml
execution:
  account: correct_account
  partition: gpu
  qos: normal  # May need specific QOS
```

**Issue: Different results than expected**

Verify configuration matches reported settings:

```yaml
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0  # Deterministic
        num_fewshot: 5    # Check paper's fewshot count
```

## CLI Reference

| Command | Description |
|---------|-------------|
| `run` | Execute evaluation with config |
| `status <invocation_id>` | Check job status |
| `info <invocation_id>` | View detailed job info |
| `ls tasks` | List available benchmarks |
| `ls runs` | List all invocations |
| `export <invocation_id>` | Export results (mlflow/wandb/local) |
| `kill <invocation_id>` | Terminate running job |
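A typical session chains these commands. The sketch below assumes you substitute the invocation ID reported by `run` into the follow-up commands.

```bash
# Sketch of a typical CLI session; replace <invocation_id> with the ID from `run`.
nemo-evaluator-launcher run --config-dir . --config-name config
nemo-evaluator-launcher status <invocation_id>
nemo-evaluator-launcher export <invocation_id> --dest local --format json
```

## Configuration Override Examples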
```bash
# Override model endpoint
-o target.api_endpoint.model_id=my-model
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions

# Add evaluation parameters
-o +evaluation.nemo_evaluator_config.config.params.temperature=0.5
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=50

# Change execution settings
-o execution.output_dir=/custom/path
-o execution.mode=parallel

# Dynamically set tasks
-o 'evaluation.tasks=[{name: ifeval}, {name: gsm8k}]'
```

## Python API Usage

For programmatic evaluation without the CLI:

```python
from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams,
)

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=4,
    ),
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here",
    )
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
```

## Advanced Topics

**Multi-backend execution**: See [references/execution-backends.md](references/execution-backends.md)

**Configuration deep-dive**: See [references/configuration.md](references/configuration.md)

**Adapter and interceptor system**: See [references/adapter-system.md](references/adapter-system.md)

**Custom benchmark integration**: See [references/custom-benchmarks.md](references/custom-benchmarks.md)

## Requirements

- **Python**: 3.10-3.13
- **Docker**: Required for local execution
- **NGC API Key**: For pulling containers and using NVIDIA Build
- **HF_TOKEN**: Required for some benchmarks (GPQA, MMLU)

## Resources

- **GitHub**: https://github.com/NVIDIA-NeMo/Evaluator
- **NGC Containers**: nvcr.io/nvidia/eval-factory/
- **NVIDIA Build**: https://build.nvidia.com (free hosted models)
- **Documentation**: https://github.com/NVIDIA-NeMo/Evaluator/tree/main/docs