---
name: llamacpp
description: "Complete llama.cpp C/C++ API reference covering model loading, inference, text generation, embeddings, chat, tokenization, sampling, batching, KV cache, LoRA adapters, and state management. Triggers on: llama.cpp questions, LLM inference code, GGUF models, local AI/ML inference, C/C++ LLM integration, \"how do I use llama.cpp\", API function lookups, implementation questions, troubleshooting llama.cpp issues, and any llama-cpp or ggerganov/llama.cpp mentions."
---

# llama.cpp C API Guide

Comprehensive reference for the llama.cpp C API, documenting all non-deprecated functions and common usage patterns.

## Overview

llama.cpp is a C/C++ implementation for LLM inference with minimal dependencies and state-of-the-art performance. This skill provides:

- **Complete API Reference**: All non-deprecated functions, organized by category
- **Common Workflows**: Working examples for typical use cases
- **Best Practices**: Patterns for efficient and correct API usage

## Quick Start

See **[references/workflows.md](references/workflows.md)** for complete working examples.

Basic workflow:

1. `llama_backend_init()` - Initialize backend
2. `llama_model_load_from_file()` - Load model
3. `llama_init_from_model()` - Create context
4. `llama_tokenize()` - Convert text to tokens
5. `llama_decode()` - Process tokens
6. `llama_sampler_sample()` - Sample next token
7. Cleanup in reverse order

## When to Use This Skill

Use this skill when:

1. **API Lookup**: You need to find a specific function (e.g., "How do I load a model?", "What function creates a context?")
2. **Code Generation**: You're writing C code that uses llama.cpp
3. **Workflow Guidance**: You need to understand the steps for a task (e.g., text generation, embeddings, chat)
4. **Advanced Features**: You're working with batches, sequences, LoRA adapters, state management, or custom sampling
5. **Migration**: You're updating code from deprecated functions to the current API

## Core Concepts

### Key Objects

- **`llama_model`**: Loaded model weights and architecture
- **`llama_context`**: Inference state (KV cache, compute buffers)
- **`llama_batch`**: Input tokens and positions for processing
- **`llama_sampler`**: Token sampling configuration
- **`llama_vocab`**: Vocabulary and tokenizer
- **`llama_memory_t`**: KV cache memory handle

### Typical Flow

1. **Initialize**: `llama_backend_init()`
2. **Load Model**: `llama_model_load_from_file()`
3. **Create Context**: `llama_init_from_model()`
4. **Tokenize**: `llama_tokenize()`
5. **Process**: `llama_encode()` or `llama_decode()`
6. **Sample**: `llama_sampler_sample()`
7. **Generate**: Repeat steps 5-6
8. **Cleanup**: Free in reverse order
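A minimal end-to-end sketch of this flow, using the current (b7631-era) C API. The model path, prompt, greedy sampling, and 64-token cap are illustrative choices, and most error handling is trimmed for brevity:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "llama.h"

int main(void) {
    llama_backend_init();

    // Load the model and create an inference context.
    struct llama_model * model =
        llama_model_load_from_file("./model.gguf", llama_model_default_params());
    if (!model) { return 1; }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 2048;
    struct llama_context * ctx = llama_init_from_model(model, cparams);

    // Tokenize the prompt. A probe call with an empty buffer returns the
    // negated token count, which sizes the real buffer (see Common Patterns).
    const struct llama_vocab * vocab = llama_model_get_vocab(model);
    const char * prompt = "Once upon a time";
    int32_t n_tok = -llama_tokenize(vocab, prompt, (int32_t) strlen(prompt),
                                    NULL, 0, true, true);
    llama_token * tokens = malloc(n_tok * sizeof(llama_token));
    llama_tokenize(vocab, prompt, (int32_t) strlen(prompt), tokens, n_tok, true, true);

    // Greedy sampler chain.
    struct llama_sampler * smpl =
        llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    // Decode the prompt, then generate one token at a time.
    struct llama_batch batch = llama_batch_get_one(tokens, n_tok);
    llama_token tok;
    for (int i = 0; i < 64; i++) {
        if (llama_decode(ctx, batch) != 0) break;

        tok = llama_sampler_sample(smpl, ctx, -1);      // sample from last position
        if (llama_vocab_is_eog(vocab, tok)) break;      // stop at end-of-generation

        char piece[128];
        int32_t n = llama_token_to_piece(vocab, tok, piece, sizeof(piece), 0, true);
        if (n > 0) fwrite(piece, 1, n, stdout);

        batch = llama_batch_get_one(&tok, 1);           // feed the new token back
    }
    printf("\n");

    // Cleanup in reverse order of creation.
    llama_sampler_free(smpl);
    free(tokens);
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```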
## API Reference

The complete API documentation is split across six files for efficient, targeted loading. Start with **[references/api-core.md](references/api-core.md)**, which links to all other sections.

**API Files:**

- **[api-core.md](references/api-core.md)** (220 lines) - Initialization, parameters, model loading
- **[api-model-info.md](references/api-model-info.md)** (193 lines) - Model properties, architecture detection **NEW**
- **[api-context.md](references/api-context.md)** (412 lines) - Context, memory (KV cache), state management
- **[api-inference.md](references/api-inference.md)** (417 lines) - Batch operations, inference, tokenization, chat
- **[api-sampling.md](references/api-sampling.md)** (467 lines) - All 25+ sampling strategies plus the backend sampling API **NEW**
- **[api-advanced.md](references/api-advanced.md)** (359 lines) - LoRA adapters, performance, training

**Total:** 172 active, non-deprecated functions (b7631) across 6 organized files

### Quick Function Lookup

Most common: `llama_backend_init()`, `llama_model_load_from_file()`, `llama_init_from_model()`, `llama_tokenize()`, `llama_decode()`, `llama_sampler_sample()`, `llama_vocab_is_eog()`, `llama_memory_clear()`

See the six API reference files above, starting with **[references/api-core.md](references/api-core.md)**, for all 172 function signatures and detailed usage.

## Common Workflows

See **[references/workflows.md](references/workflows.md)** for 15 complete working examples: basic text generation, chat, embeddings, batch processing, multi-sequence decoding, LoRA, state save/load, custom sampling (XTC/DRY), encoder-decoder models, model detection, and memory management patterns.

## Best Practices

See **[references/workflows.md](references/workflows.md)** for detailed best practices. Key points:

- Always use the default parameter functions (`llama_model_default_params()`, etc.)
- Check return values for errors
- Free resources in reverse order of creation
- Handle dynamic buffer sizes for tokenization
- Query the actual context size after creation (`llama_n_ctx()`)
- Check for end-of-generation with `llama_vocab_is_eog()`

## Common Patterns

End-of-generation check (`llama_vocab_is_eog()`), logits retrieval (`llama_get_logits_ith()`), batch creation (`llama_batch_get_one()`), and tokenization buffer handling. See **[references/workflows.md](references/workflows.md)** for complete code examples.
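The tokenization pattern deserves a sketch here, since the negative-return convention is easy to miss: when the output buffer is too small, `llama_tokenize()` returns the negated number of tokens required. A minimal helper, assuming the b7631 vocab-based signature; the name `tokenize_dynamic` is ours, not part of the API:

```c
#include <stdlib.h>
#include <string.h>
#include "llama.h"

// Tokenize `text`, sizing the output buffer dynamically. Returns the token
// count and stores a malloc'd array in *out (caller frees), or -1 on failure
// (including, for simplicity, an empty tokenization).
static int32_t tokenize_dynamic(const struct llama_vocab * vocab,
                                const char * text, llama_token ** out) {
    const int32_t text_len = (int32_t) strlen(text);

    // Probe with an empty buffer: the negated result is the required size.
    int32_t n = -llama_tokenize(vocab, text, text_len, NULL, 0,
                                /*add_special=*/true, /*parse_special=*/true);
    if (n <= 0) {
        return -1;
    }

    llama_token * tokens = malloc(n * sizeof(llama_token));
    if (!tokens) {
        return -1;
    }

    n = llama_tokenize(vocab, text, text_len, tokens, n, true, true);
    if (n < 0) {
        free(tokens);
        return -1;
    }

    *out = tokens;
    return n;
}

// The other common patterns are one-liners:
//   float * logits = llama_get_logits_ith(ctx, -1);         // logits of last token
//   bool done      = llama_vocab_is_eog(vocab, token);      // end-of-generation?
//   struct llama_batch b = llama_batch_get_one(tokens, n);  // single-sequence batch
```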
## Troubleshooting

### Common Issues

**Model loading fails:**
- Verify the file path and GGUF format validity
- Check available RAM/VRAM against the model size
- Reduce `n_gpu_layers` if GPU memory is insufficient

**Tokenization returns a negative value:**
- The buffer is too small; reallocate with `-n` entries and retry
- See the tokenization pattern in [Common Patterns](#common-patterns)

**Decode/encode returns non-zero:**
- Verify batch initialization (`llama_batch_get_one()` or `llama_batch_init()`)
- Check context capacity (`llama_n_ctx()`)
- Ensure positions fall within the context window

**Silent failures / no output:**
- Check whether `llama_vocab_is_eog()` returns true on the first sampled token
- Verify sampler initialization
- Enable logging: `llama_log_set()`

**Performance issues:**
- Increase `n_threads` for CPU inference
- Set `n_gpu_layers` for GPU offloading
- Use a larger `n_batch` for prompt processing
- See [api-advanced.md](references/api-advanced.md) for the performance utilities

**Sliding Window Attention (SWA) issues:**
- For Mistral-style models with SWA, set `ctx_params.swa_full = true` to allow access to positions beyond the attention window
- Call `llama_model_n_swa(model)` to detect the SWA window size and configuration needs
- Symptom: token positions beyond the window size cause decode errors

**Per-sequence state errors:**
- Ensure the sequence ID matches when loading: `llama_state_seq_load_file(ctx, "file", dest_seq_id, ...)`
- Verify the token buffer is large enough for the loaded tokens
- Check that the sequence wasn't cleared or removed before loading state

**Model type detection:**
- Use `llama_model_has_encoder()` before assuming a decoder-only architecture
- For recurrent models (Mamba/RWKV), KV cache behavior differs from standard transformers
- Encoder-decoder models require an `llama_encode()` then `llama_decode()` workflow

For advanced issues: https://github.com/ggerganov/llama.cpp/discussions

## Resources

- **API Reference** (6 files, 2,086 lines total) - Complete API reference split by category for targeted loading:
  - [api-core.md](references/api-core.md) - Initialization, parameters, model loading
  - [api-model-info.md](references/api-model-info.md) - Model properties, architecture detection
  - [api-context.md](references/api-context.md) - Context, memory, state management
  - [api-inference.md](references/api-inference.md) - Batch, inference, tokenization, chat
  - [api-sampling.md](references/api-sampling.md) - All 25+ sampling strategies plus the backend sampling API
  - [api-advanced.md](references/api-advanced.md) - LoRA, performance, training
- **[references/workflows.md](references/workflows.md)** (1,616 lines) - 15 complete working examples: basic workflows (text generation, chat, embeddings, batching, sequences), intermediate (LoRA, state, sampling, encoder-decoder, memory), advanced features (XTC/DRY, per-sequence state, model detection), and production applications (interactive chat, streaming).

## Key Differences from Deprecated API

If you're updating old code:

- Use `llama_model_load_from_file()` instead of `llama_load_model_from_file()`
- Use `llama_model_free()` instead of `llama_free_model()`
- Use `llama_init_from_model()` instead of `llama_new_context_with_model()`
- Use the `llama_vocab_*()` functions instead of `llama_token_*()`
- Use the `llama_state_*()` functions instead of the deprecated state functions

See the API reference for complete mappings.
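A side-by-side sketch of the main renames, with the deprecated calls left in comments. The wrapper function and `path` parameter are illustrative, not from llama.cpp:

```c
#include "llama.h"

// Open and close a model using the current API; each step notes the
// deprecated call it replaces.
static void open_model(const char * path) {
    struct llama_model_params   mparams = llama_model_default_params();
    struct llama_context_params cparams = llama_context_default_params();

    // old: struct llama_model * model = llama_load_model_from_file(path, mparams);
    struct llama_model * model = llama_model_load_from_file(path, mparams);

    // old: struct llama_context * ctx = llama_new_context_with_model(model, cparams);
    struct llama_context * ctx = llama_init_from_model(model, cparams);

    // old: llama_token eos = llama_token_eos(model);
    llama_token eos = llama_vocab_eos(llama_model_get_vocab(model));
    (void) eos;

    llama_free(ctx);
    // old: llama_free_model(model);
    llama_model_free(model);
}
```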