---
name: llama-cpp-runtime
description: "llama.cpp runtime/session control: use llama-cli and llama-server commands/flags to run local GGUF models and serve an API in a worker terminal. Trigger when the controller needs to operate llama.cpp like a human."
---

# llama.cpp Runtime

## Overview

Operate llama.cpp safely: run local GGUF models via `llama-cli` or serve an OpenAI-compatible API via `llama-server`.

## Session Safety

1) Confirm idle state
   - Snapshot and/or `status` the worker; do not intervene mid-run.
   - Only proceed when the worker is at a prompt or explicitly idle.

## Core Commands

Worked examples for both tools follow the Guardrails section.

### `llama-cli` (interactive/local runs)

- Run a local model file:
  - `llama-cli -m my_model.gguf`
- Download and run directly from Hugging Face:
  - `llama-cli -hf ggml-org/gemma-3-1b-it-GGUF`
- Conversation mode (if not auto-enabled):
  - `llama-cli -m model.gguf -cnv --chat-template chatml`

### `llama-server` (OpenAI-compatible API)

- Start a local server on port 8080:
  - `llama-server -m model.gguf --port 8080`
- Parallel decoding example:
  - `llama-server -m model.gguf -c 16384 -np 4`

## Guardrails

- Do not restart a worker mid-run.
- Use `llama-server` for API-style usage and `llama-cli` for interactive/local prompts.
- If the worker is not llama.cpp, switch to the model-specific runtime skill instead.
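
## Worked Examples

The sketches below are illustrative, not exhaustive. They assume a model file at `./model.gguf` and a free port 8080; both are placeholders, so adjust paths and flags for the actual worker.

A minimal one-shot `llama-cli` run: `-p` supplies the prompt, `-n` caps the number of generated tokens, and `-no-cnv` disables conversation mode so the process prints one completion and exits.

```sh
# One-shot generation: print a single completion, then exit.
# ./model.gguf is a placeholder path.
llama-cli -m ./model.gguf -no-cnv \
  -p "Explain what a GGUF file is in one sentence." \
  -n 64
```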
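
For API-style usage, `llama-server` exposes an OpenAI-compatible endpoint at `/v1/chat/completions`. A sketch of starting the server and querying it with `curl`; the request body follows the OpenAI chat format, and the server answers with the model it was launched with.

```sh
# In the worker terminal: serve the model on port 8080.
llama-server -m ./model.gguf --port 8080

# From a second shell: one chat completion via the OpenAI-compatible API.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Say hello in five words."}
        ],
        "max_tokens": 16
      }'
```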
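
On parallel decoding: with `-np`, the server splits the context window across concurrent request slots, so `-c 16384 -np 4` leaves roughly 4096 tokens of context per slot. This split is an assumption worth verifying against the installed llama.cpp version's docs.

```sh
# 4 concurrent request slots; each slot gets ~16384 / 4 = 4096 tokens of context.
llama-server -m ./model.gguf -c 16384 -np 4
```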