--- name: ai-llm description: Production LLM engineering skill. Covers strategy selection (prompting vs RAG vs fine-tuning), dataset design, PEFT/LoRA, evaluation workflows, deployment handoff to inference serving, and lifecycle operations with cost/safety controls. --- # LLM Development & Engineering — Complete Reference Build, evaluate, and deploy LLM systems with **modern production standards**. This skill covers the full LLM lifecycle: - **Development**: Strategy selection, dataset design, instruction tuning, PEFT/LoRA fine-tuning - **Evaluation**: Automated testing, LLM-as-judge, metrics, rollout gates - **Deployment**: Serving handoff, latency/cost budgeting, reliability patterns (see `ai-llm-inference`) - **Operations**: Quality monitoring, change management, incident response (see `ai-mlops`) - **Safety**: Threat modeling, data governance, layered mitigations (NIST AI RMF: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf) **Modern Best Practices (2026)**: - Treat the model as a **component** with contracts, budgets, and rollback plans (not "magic"). - Separate **core concepts** (tokenization, context, training vs adaptation) from **implementation choices** (providers, SDKs). - Gate upgrades with repeatable evals and staged rollout; avoid blind model swaps. - **Cost-aware engineering**: Measure cost per successful outcome, not just cost per token; design tiering/caching early. - **Security-by-design**: Threat model prompt injection, data leakage, and tool abuse; treat guardrails as production code. **For detailed patterns:** See [Resources](#resources-best-practices--operational-patterns) and [Templates](#templates-copy-paste-ready) sections below. --- ## Quick Reference | Task | Tool/Framework | Command/Pattern | When to Use | |------|----------------|-----------------|-------------| | Choose architecture | Prompt vs RAG vs fine-tune | Start simple; add retrieval/adaptation only if needed | New products and migrations | | Model selection | Scoring matrix | Quality/latency/cost/privacy/license weighting | Provider changes and procurement | | **Cost optimization** | Tiered models + caching | Cascade routing, prompt caching, budget guardrails | Cost-sensitive production | | **Fine-tuning ROI** | ROI calculator | Break-even analysis, TCO comparison | Investment decisions | | Prompt contracts | Structured output + constraints | JSON schema, max tokens, refusal rules | Reliability and integration | | RAG integration | Hybrid retrieval + grounding | Retrieve → rerank → pack → cite → verify | Fresh/large corpora, traceability | | Fine-tuning | PEFT/LoRA (when justified) | Small targeted datasets + regression suite | Stable domains, repeated tasks | | Evaluation | Offline + online | Golden sets + A/B + canary + monitoring | Prevent regressions and drift | --- ## Decision Tree: LLM System Architecture ```text Building LLM application: [Architecture Selection] ├─ Need current knowledge? │ ├─ Simple Q&A? → Basic RAG (page-level chunking + hybrid retrieval) │ └─ Complex retrieval? → Advanced RAG (reranking + contextual retrieval) │ ├─ Need tool use / actions? │ ├─ Single task? → Simple agent (ReAct pattern) │ └─ Multi-step workflow? → Multi-agent (LangGraph, CrewAI) │ ├─ Static behavior sufficient? │ ├─ Quick MVP? → Prompt engineering (CI/CD integrated) │ └─ Production quality? → Fine-tuning (PEFT/LoRA) │ └─ Best results? 
   └─ Hybrid (RAG + Fine-tuning + Agents) → Comprehensive solution
```

**See [Decision Matrices](references/decision-matrices.md) for detailed selection criteria.**

---

## Cost-Quality Decision Framework

LLM spend is driven by usage-based inference (tokens/requests) plus supporting infra and engineering. Model selection is a **cost-quality-latency-risk tradeoff**.

### Model Tier Strategy

| Tier | Typical profile | Use For |
|------|-----------------|---------|
| **Value** | Small/fast models | High-volume, simple tasks |
| **Balanced** | General-purpose models | Most production workloads |
| **Premium** | Frontier/large models | Hardest tasks, low volume |

### Cost Optimization Levers

1. **Model tiering**: Route simple requests to cheaper models (often large savings at scale; routing sketch below)
2. **Prompt caching**: Reuse stable prefixes/context (provider-specific discounts and constraints)
3. **Prompt optimization**: Compress examples and instructions (typically meaningful token reduction)
4. **Output limits**: Set appropriate max_tokens (prevents runaway costs)

### When to Fine-Tune (ROI-Based)

Fine-tuning pays off when:

- **Volume justifies it**: >10k requests/month provides meaningful cost savings
- **Domain is stable**: Requirements unchanged for >6 months
- **Data exists**: >1,000 quality training examples available
- **Break-even achievable**: <12 months to recover investment (break-even sketch below)

**See [Cost Economics](references/cost-economics.md) for TCO modeling and [Fine-Tuning ROI Calculator](assets/selection/fine-tuning-roi-calculator.md) for investment analysis.**

---

## Core Concepts (Vendor-Agnostic)

- **Model classes**: encoder-only, decoder-only, encoder-decoder, multimodal; choose based on task and latency.
- **Tokenization & limits**: context window, max output, and prompt/template overhead drive both cost and tail latency.
- **Adaptation options**: prompting → retrieval → adapters (LoRA) → full fine-tune; choose by stability and ROI (LoRA: https://arxiv.org/abs/2106.09685).
- **Evaluation**: metrics must map to user value; report uncertainty and slice performance, not only global averages.
- **Governance**: data retention, residency, licensing, and auditability are product requirements (EU AI Act: https://eur-lex.europa.eu/eli/reg/2024/1689/oj; NIST GenAI Profile: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf).

## Implementation Practices (Tooling Examples)

- Use a **provider abstraction** (gateway/router) to enable fallbacks and staged upgrades.
- Instrument requests with tokens, latency, and error classes (OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/).
- Maintain **prompt/model registries** with versioning, changelogs, and rollback criteria.

## Do / Avoid

**Do**

- Do pin model + prompt versions in production, and re-run evals before any change (eval-gate sketch below).
- Do enforce budgets at the boundary: max tokens, max tools, max retries, max cost.
- Do plan for degraded modes (smaller model, cached answers, "unable to answer").

**Avoid**

- Avoid model sprawl (unowned variants with no eval coverage).
- Avoid blind upgrades based on anecdotal quality; require measured impact.
- Avoid training on production logs without consent, governance, and leakage controls.
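To make the tiering lever and the budget rules above concrete, here is a minimal TypeScript sketch of a request boundary that routes by tier, caps tokens/retries/cost, and degrades to a cheaper tier on failure. The `ChatClient` interface, model names, and per-token prices are illustrative assumptions, not any specific provider's API or pricing.

```typescript
// Illustrative sketch only: ChatClient, model names, and prices are assumptions.
interface ChatClient {
  complete(req: { model: string; prompt: string; maxTokens: number }): Promise<{
    text: string;
    inputTokens: number;
    outputTokens: number;
  }>;
}

type Tier = "value" | "balanced" | "premium";

// Hypothetical per-tier budgets; substitute your provider's models and pricing.
const TIERS: Record<Tier, { model: string; maxTokens: number; usdPer1kTokens: number }> = {
  value: { model: "small-fast-model", maxTokens: 512, usdPer1kTokens: 0.0005 },
  balanced: { model: "general-model", maxTokens: 1024, usdPer1kTokens: 0.003 },
  premium: { model: "frontier-model", maxTokens: 2048, usdPer1kTokens: 0.03 },
};

const MAX_RETRIES = 2; // budget: retries per tier
const MAX_COST_USD = 0.05; // budget: cost per request

export async function answer(client: ChatClient, prompt: string, startTier: Tier): Promise<string> {
  // Degraded-mode cascade: requested tier first, then cheaper tiers on failure.
  const cascade: Tier[] = ["premium", "balanced", "value"];
  for (const tier of cascade.slice(cascade.indexOf(startTier))) {
    const cfg = TIERS[tier];
    for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
      try {
        const res = await client.complete({ model: cfg.model, prompt, maxTokens: cfg.maxTokens });
        const costUsd = ((res.inputTokens + res.outputTokens) / 1000) * cfg.usdPer1kTokens;
        if (costUsd > MAX_COST_USD) {
          // Surface budget breaches instead of silently overspending.
          console.warn(`cost budget exceeded on ${cfg.model}: $${costUsd.toFixed(4)}`);
        }
        return res.text;
      } catch {
        // Retry within the tier, then fall through to the next (cheaper) tier.
      }
    }
  }
  return "Unable to answer right now."; // final degraded mode from the Do list
}
```

This boundary is also the natural place to emit OpenTelemetry spans with token/latency/error attributes, per the instrumentation practice above.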
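The first Do item ("re-run evals before any change") can be enforced mechanically. Below is a minimal sketch of a rollout gate over a golden set: the candidate must clear an absolute pass-rate floor and must not regress against the pinned baseline. `GoldenCase`, the thresholds, and the runner signature are all illustrative assumptions, not a particular eval framework's API.

```typescript
// Illustrative eval gate; GoldenCase, thresholds, and runner are assumptions.
interface GoldenCase {
  input: string;
  check: (output: string) => boolean; // deterministic grader for this case
}

export async function evalGate(
  runCandidate: (input: string) => Promise<string>,
  goldenSet: GoldenCase[],
  baselinePassRate: number, // measured for the pinned version on the same set
  opts = { floor: 0.9, maxRegression: 0.02 },
): Promise<{ pass: boolean; passRate: number }> {
  let passed = 0;
  for (const c of goldenSet) {
    if (c.check(await runCandidate(c.input))) passed++;
  }
  const passRate = passed / goldenSet.length;
  // Gate: absolute floor AND no meaningful regression vs. the baseline.
  const pass = passRate >= opts.floor && passRate >= baselinePassRate - opts.maxRegression;
  return { pass, passRate };
}
```

A fuller gate would also report per-slice pass rates (per the Evaluation concept above) and feed a canary rollout rather than a global switch.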
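The break-even criterion in the fine-tuning section reduces to simple arithmetic once you fix your assumptions. This sketch uses made-up numbers purely for illustration; real inputs come from your own traffic and current provider pricing (see the ROI calculator linked above).

```typescript
// Break-even months = one-time cost / monthly savings. All inputs are assumptions.
export function fineTuneBreakEvenMonths(p: {
  oneTimeCostUsd: number; // training runs + data work + eval engineering
  requestsPerMonth: number;
  baseCostPerRequestUsd: number; // current general-purpose model
  tunedCostPerRequestUsd: number; // tuned smaller model, hosting amortized in
}): number {
  const monthlySavingsUsd =
    p.requestsPerMonth * (p.baseCostPerRequestUsd - p.tunedCostPerRequestUsd);
  if (monthlySavingsUsd <= 0) return Infinity; // never breaks even
  return p.oneTimeCostUsd / monthlySavingsUsd;
}

// Hypothetical example: $15k one-time, 50k req/month, $0.010 → $0.002 per request
// → $400/month savings → 37.5 months to break even. That fails the <12-month
// criterion above, so such a project should stay on prompting/RAG for now.
```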
## When to Use This Skill Claude should invoke this skill when the user asks about: - LLM preflight/project checklists, production best practices, or data pipelines - Building or deploying RAG, agentic, or prompt-based LLM apps - Prompt design, chain-of-thought (CoT), ReAct, or template patterns - Troubleshooting LLM hallucination, bias, retrieval issues, or production failures - Evaluating LLMs: benchmarks, multi-metric eval, or rollout/monitoring - LLMOps: deployment, rollback, scaling, resource optimization - Technology stack selection (models, vector DBs, frameworks) - Production deployment strategies and operational patterns --- ## Scope Boundaries (Use These Skills for Depth) - **Prompt design & CI/CD** → [ai-prompt-engineering](../ai-prompt-engineering/SKILL.md) - **RAG pipelines & chunking** → [ai-rag](../ai-rag/SKILL.md) - **Search tuning (BM25, HNSW, hybrid)** → [ai-rag](../ai-rag/SKILL.md) - **Agent architectures & tools** → [ai-agents](../ai-agents/SKILL.md) - **Serving optimization/quantization** → [ai-llm-inference](../ai-llm-inference/SKILL.md) - **Production deployment/monitoring** → [ai-mlops](../ai-mlops/SKILL.md) - **Security/guardrails** → [ai-mlops](../ai-mlops/SKILL.md) --- ## Resources (Best Practices & Operational Patterns) Comprehensive operational guides with checklists, patterns, and decision frameworks: ### Core Operational Patterns - **[Cost Economics & Decision Frameworks](references/cost-economics.md)** - Cost modeling, unit economics, TCO analysis - Pricing/discount assumptions (verify against current provider docs) - Cost-quality tradeoff framework and decision matrix - Total Cost of Ownership (TCO) calculation - Fine-tuning ROI framework and break-even analysis - Prompt caching economics - Cost monitoring and budget guardrails - **[Project Planning Patterns](references/project-planning-patterns.md)** - Stack selection, FTI pipeline, performance budgeting - AI engineering stack selection matrix - Feature/Training/Inference (FTI) pipeline blueprint - Performance budgeting and goodput gates - Progressive complexity (prompt → RAG → fine-tune → hybrid) - **[Production Checklists](references/production-checklists.md)** - Pre-deployment validation and operational checklists - LLM lifecycle checklist (modern production standards) - Data & training, RAG pipeline, deployment & serving - Safety/guardrails, evaluation, agentic systems - Reliability & data infrastructure (DDIA-grade) - Weekly production tasks - **[Common Design Patterns](references/common-design-patterns.md)** - Copy-paste ready implementation examples - Chain-of-Thought (CoT) prompting - ReAct (Reason + Act) pattern - RAG pipeline (minimal to advanced) - Agentic planning loop - Self-reflection and multi-agent collaboration - **[Decision Matrices](references/decision-matrices.md)** - Quick reference tables for selection - RAG type decision matrix (naive → advanced → modular) - Production evaluation table with targets and actions - Model selection matrix (tier-based, vendor-agnostic) - Vector database, embedding model, framework selection - Deployment strategy matrix - **[Anti-Patterns](references/anti-patterns.md)** - Common mistakes and prevention strategies - Data leakage, prompt dilution, RAG context overload - Agentic runaway, over-engineering, ignoring evaluation - Hard-coded prompts, missing observability - Detection methods and prevention code examples ### Domain-Specific Patterns - **[LLMOps Best Practices](references/llmops-best-practices.md)** - Operational lifecycle and deployment patterns - 
**[Evaluation Patterns](references/eval-patterns.md)** - Testing, metrics, and quality validation - **[Prompt Engineering Patterns](references/prompt-engineering-patterns.md)** - Quick reference (canonical skill: [ai-prompt-engineering](../ai-prompt-engineering/SKILL.md)) - **[Agentic Patterns](references/agentic-patterns.md)** - Quick reference (canonical skill: [ai-agents](../ai-agents/SKILL.md)) - **[RAG Best Practices](references/rag-best-practices.md)** - Quick reference (canonical skill: [ai-rag](../ai-rag/SKILL.md)) **Note:** Each resource file includes preflight/validation checklists, copy-paste reference tables, inline templates, anti-patterns, and decision matrices. --- ## Templates (Copy-Paste Ready) Production templates by use case and technology: ### Selection & Governance - **[Model Selection Matrix](assets/selection/model-selection-matrix.md)** - Documented selection, scoring, licensing, and governance - **[Fine-Tuning ROI Calculator](assets/selection/fine-tuning-roi-calculator.md)** - Investment analysis, break-even, go/no-go decisions ### RAG Pipelines - **[Basic RAG](assets/rag-pipelines/template-basic-rag.md)** - Simple retrieval-augmented generation - **[Advanced RAG](assets/rag-pipelines/template-advanced-rag.md)** - Hybrid retrieval, reranking, contextual embeddings ### Prompt Engineering - **[Chain-of-Thought](assets/prompt-engineering/template-cot.md)** - Step-by-step reasoning pattern - **[ReAct](assets/prompt-engineering/template-react.md)** - Reason + Act for tool use ### Agentic Workflows - **[Reflection Agent](assets/agentic-workflows/template-reflection.md)** - Self-critique and improvement - **[Multi-Agent](assets/agentic-workflows/template-multi-agent.md)** - Manager-worker orchestration ### Data Pipelines - **[Data Quality](assets/data-pipelines/template-data-quality.md)** - Validation, deduplication, PII detection ### Deployment - **[LLM Deployment](assets/deployment/template-llm-deployment.md)** - Production deployment with monitoring ### Evaluation - **[Multi-Metric Evaluation](assets/evaluation/template-multi-metric.md)** - Comprehensive testing suite --- ## Shared Utilities (Centralized patterns — extract, don't duplicate) - [../software-clean-code-standard/utilities/llm-utilities.md](../software-clean-code-standard/utilities/llm-utilities.md) — Token counting, streaming, cost estimation - [../software-clean-code-standard/utilities/error-handling.md](../software-clean-code-standard/utilities/error-handling.md) — Effect Result types, correlation IDs - [../software-clean-code-standard/utilities/resilience-utilities.md](../software-clean-code-standard/utilities/resilience-utilities.md) — p-retry v6, circuit breaker for LLM API calls - [../software-clean-code-standard/utilities/logging-utilities.md](../software-clean-code-standard/utilities/logging-utilities.md) — pino v9 + OpenTelemetry integration - [../software-clean-code-standard/utilities/observability-utilities.md](../software-clean-code-standard/utilities/observability-utilities.md) — OpenTelemetry SDK, tracing, metrics - [../software-clean-code-standard/utilities/config-validation.md](../software-clean-code-standard/utilities/config-validation.md) — Zod 3.24+, secrets management for API keys - [../software-clean-code-standard/utilities/testing-utilities.md](../software-clean-code-standard/utilities/testing-utilities.md) — Test factories, fixtures, mocks - [../software-clean-code-standard/references/clean-code-standard.md](../software-clean-code-standard/references/clean-code-standard.md) — 
Canonical clean code rules (`CC-*`) for citation --- ## Trend Awareness Protocol **IMPORTANT**: For “best/latest” recommendations, verify recency using current sources (official docs/release notes/benchmarks). If you can’t browse, state assumptions and ask for timeframe + constraints. ### Trigger Conditions - "What's the best LLM model for [use case]?" - "What should I use for [RAG/fine-tuning/agents]?" - "What's the latest in LLM development?" - "Current best practices for [prompting/evaluation/deployment]?" - "Is [model/framework] still relevant in 2026?" - "[Model A] vs [Model B]?" or "[Framework A] vs [Framework B]?" - "Best vector database for [use case]?" - "What agent framework should I use?" ### Minimal Verification Checklist 1. Confirm user constraints: latency, cost, privacy/compliance, deployment target, and toolchain. 2. Check at least 2 authoritative sources from `data/sources.json` (provider docs, release notes, pricing/quotas, deprecations). 3. Prefer stable guidance (tradeoffs + decision criteria) over “one best model/framework”. ### What to Report After searching, provide: - **Current landscape**: What models/frameworks are popular NOW (not 6 months ago) - **Emerging trends**: New models, frameworks, or techniques gaining traction - **Deprecated/declining**: Models/frameworks losing relevance or support - **Recommendation**: Based on fresh data, not just static knowledge ### Example Topics (verify with fresh sources) - Latest frontier models (GPT-4.5, Claude 4, Gemini 2.x, Llama 4) - Agent frameworks (LangGraph, CrewAI, AutoGen, Semantic Kernel) - Vector databases (Pinecone, Qdrant, Weaviate, pgvector) - RAG techniques (contextual retrieval, agentic RAG, graph RAG) - Inference engines (vLLM, TensorRT-LLM, SGLang) - Evaluation frameworks (RAGAS, DeepEval, Braintrust) --- ## Related Skills This skill integrates with complementary Claude Code skills: ### Core Dependencies - **[ai-rag](../ai-rag/SKILL.md)** - Retrieval pipelines: chunking, hybrid search, reranking, evaluation - **[ai-prompt-engineering](../ai-prompt-engineering/SKILL.md)** - Systematic prompt design, evaluation, testing, and optimization - **[ai-agents](../ai-agents/SKILL.md)** - Agent architectures, tool use, multi-agent systems, autonomous workflows ### Production & Operations - **[ai-llm-inference](../ai-llm-inference/SKILL.md)** - Production serving, quantization, batching, GPU optimization - **[ai-mlops](../ai-mlops/SKILL.md)** - Deployment, monitoring, incident response, security, and governance --- ## External Resources See **[data/sources.json](data/sources.json)** for 50+ curated authoritative sources: - **Official LLM platform docs** - OpenAI, Anthropic, Gemini, Mistral, Azure OpenAI, AWS Bedrock - **Open-source models and frameworks** - HuggingFace Transformers, open-weight models, PEFT/LoRA, distributed training/inference stacks - **RAG frameworks and vector DBs** - LlamaIndex, LangChain 1.2+, LangGraph, LangGraph Studio v2, Haystack, Pinecone, Qdrant, Chroma - **Agent frameworks (examples)** - LangGraph, Semantic Kernel, AutoGen, CrewAI - **RAG innovations (examples)** - Graph-based retrieval, hybrid retrieval, online evaluation loops - **Prompt engineering** - Anthropic Prompt Library, Prompt Engineering Guide, CoT/ReAct patterns - **Evaluation and monitoring** - OpenAI Evals, HELM, Anthropic Evals, LangSmith, W&B, Arize Phoenix - **Production deployment** - Model gateways/routers, self-hosted serving, managed endpoints --- ## Usage ### For New Projects 1. 
Start with **[Production Checklists](references/production-checklists.md)** - Validate all pre-deployment requirements 2. Use **[Decision Matrices](references/decision-matrices.md)** - Select technology stack 3. Reference **[Project Planning Patterns](references/project-planning-patterns.md)** - Design FTI pipeline 4. Implement with **[Common Design Patterns](references/common-design-patterns.md)** - Copy-paste code examples 5. Avoid **[Anti-Patterns](references/anti-patterns.md)** - Learn from common mistakes ### For Troubleshooting 1. Check **[Anti-Patterns](references/anti-patterns.md)** - Identify failure modes and mitigations 2. Use **[Decision Matrices](references/decision-matrices.md)** - Evaluate if architecture fits use case 3. Reference **[Common Design Patterns](references/common-design-patterns.md)** - Verify implementation correctness ### For Ongoing Operations 1. Follow **[Production Checklists](references/production-checklists.md)** - Weekly operational tasks 2. Integrate **[Evaluation Patterns](references/eval-patterns.md)** - Continuous quality monitoring 3. Apply **[LLMOps Best Practices](references/llmops-best-practices.md)** - Deployment and rollback procedures --- ## Navigation Summary **Quick Decisions:** [Decision Matrices](references/decision-matrices.md) **Pre-Deployment:** [Production Checklists](references/production-checklists.md) **Planning:** [Project Planning Patterns](references/project-planning-patterns.md) **Implementation:** [Common Design Patterns](references/common-design-patterns.md) **Troubleshooting:** [Anti-Patterns](references/anti-patterns.md) **Domain Depth:** [LLMOps](references/llmops-best-practices.md) | [Evaluation](references/eval-patterns.md) | [Prompts](references/prompt-engineering-patterns.md) | [Agents](references/agentic-patterns.md) | [RAG](references/rag-best-practices.md) **Templates:** [assets/](assets/) - Copy-paste ready production code **Sources:** [data/sources.json](data/sources.json) - Authoritative documentation links ---