---
name: performant-ai
description: Strategies for high-performance AI/LLM systems (Context Management, Prompt Engineering, RAG, Inference Tuning).
triggers: [ai, llm, performance, context window, tokens, prompt engineering, rag, inference, latency]
tags: [coding, ai, architecture]
context_cost: medium
---

# Performant AI Skill

## Goal

Optimize the interaction, speed, and cost-effectiveness of LLM-based systems by mastering context management and inference strategies.

## Capabilities

### 1. Context Window Engineering

- **Context Pruning**: Implement logic to remove irrelevant or redundant tokens from the prompt to fit within limits and reduce cost.
- **Summarization Chains**: Use recursive summarization for long conversations or documents (sketched after this section).
- **Observation Masking**: Hide older or less critical data to keep the model's attention on the immediate task.

### 2. Efficient Prompting (Latency & Cost)

- **Few-Shot Optimization**: Keep examples to the bare minimum needed for accuracy.
- **Output Structuring**: Use JSON mode or structured outputs to reduce parsing errors and retry loops.
- **Prompt Compression**: Use tools or manual techniques to shorten instructions without losing semantic meaning.

### 3. RAG Optimization (Retrieval-Augmented Generation)

- **Chunking Strategy**: Tune chunk sizes and overlap for the specific domain (e.g., small chunks for semantic search, large chunks for summaries).
- **Hybrid Search**: Combine vector search (semantic) with keyword search (BM25) for higher precision (sketched after this section).
- **Re-ranking**: Use a secondary, smaller model to re-rank the top-K results before sending them to the expensive LLM.

### 4. Inference & Routing Strategies

- **Brain Mode Routing**: Arbitrate between "Local" models (faster/cheaper) and "Remote" models (complex/slower) based on task difficulty (sketched after this section).
- **Speculative Decoding**: Where supported, use a smaller model to draft tokens for the larger model to verify, speeding up generation.
- **Cache Hits**: Implement semantic caching (e.g., Redis) to reuse LLM responses for similar queries (sketched after this section).

### 5. Architectural Patterns

- **Self-Correction Loops**: Build reflection phases into the agent flow to catch errors early.
- **Asynchronous Agents**: Run independent research or tool calls in parallel to reduce perceived latency (Loki Mode).
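The following sketches illustrate a few of the capabilities above. First, a minimal sketch of the summarization-chain idea: keep recent turns verbatim and fold everything older into one summary message. The 3,000-token budget, the 4-chars-per-token heuristic, and the `summarize` helper are illustrative assumptions, not part of this skill.

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str

def count_tokens(text: str) -> int:
    # Assumption: rough ~4 chars/token heuristic; use a real tokenizer in production.
    return len(text) // 4

def summarize(text: str) -> str:
    # Hypothetical helper: a cheap LLM call that condenses `text`.
    raise NotImplementedError

def prune_history(history: list[Message], budget: int = 3000) -> list[Message]:
    """Keep the most recent turns verbatim; collapse older turns into a summary."""
    kept: list[Message] = []
    used = 0
    for msg in reversed(history):  # walk backwards so recent context survives intact
        cost = count_tokens(msg.content)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()
    older = history[: len(history) - len(kept)]
    if older:
        summary = summarize("\n".join(m.content for m in older))
        kept.insert(0, Message("system", f"Summary of earlier turns: {summary}"))
    return kept
```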
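One way to combine the hybrid-search and re-ranking bullets is reciprocal rank fusion (RRF) over the two rankings, followed by a cross-encoder pass on the fused candidates. `vector_search`, `bm25_search`, and `rerank_score` are hypothetical stand-ins for your vector DB, keyword index, and re-ranker model.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; k=60 is the commonly used constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def retrieve(query: str, top_k: int = 5) -> list[str]:
    semantic = vector_search(query, limit=20)  # hypothetical vector DB call
    keyword = bm25_search(query, limit=20)     # hypothetical BM25 index call
    candidates = reciprocal_rank_fusion([semantic, keyword])[:20]
    # Re-rank fused candidates with a small cross-encoder before the expensive LLM.
    candidates.sort(key=lambda doc_id: rerank_score(query, doc_id), reverse=True)
    return candidates[:top_k]
```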
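A deliberately simple sketch of the local-vs-remote arbitration. The length threshold, keyword signals, and both model names are illustrative assumptions; a production router would use a learned classifier or a self-reported task-complexity score.

```python
HARD_SIGNALS = ("refactor", "architecture", "multi-step", "prove")

def route_model(prompt: str) -> str:
    """Return a model ID: cheap local model unless the task looks hard."""
    looks_hard = len(prompt) > 2000 or any(s in prompt.lower() for s in HARD_SIGNALS)
    return "remote-large-model" if looks_hard else "local-small-model"
```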
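And a sketch of semantic caching: embed the query, reuse a stored answer when cosine similarity to a previous query clears a threshold. The in-process list stands in for Redis; `embed`, `call_llm`, and the 0.92 threshold are assumptions to tune against your own traffic.

```python
import math

_cache: list[tuple[list[float], str]] = []  # (query embedding, cached response)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cached_complete(query: str, threshold: float = 0.92) -> str:
    vec = embed(query)  # hypothetical embedding call
    for cached_vec, response in _cache:
        if cosine(vec, cached_vec) >= threshold:
            return response  # cache hit: skip the LLM call entirely
    response = call_llm(query)  # hypothetical LLM call
    _cache.append((vec, response))
    return response
```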
## Steps

1. **Token Audit**: Trace the token count of typical requests to find "bloat" in system prompts.
2. **Latency Mapping**: Break down Time-to-First-Token (TTFT) and total generation time.
3. **Retrieval Benchmark**: Measure the hit rate and recall of the RAG pipeline.
4. **Cost Projection**: Estimate monthly burn based on different model providers and context sizes.

## Deliverables

- `COST_OPTIMIZATION_REPORT_TEMPLATE.md`: Analysis of prompt efficiency and LLM token usage.
- `ARCHITECTURE_REVIEW_TEMPLATE.md`: Configuration for vector DB, chunking, and search weights.
- `SCALABILITY_ANALYSIS_TEMPLATE.md`: Logic table for local vs. remote model selection and context scaling.

## Security & Guardrails

### 1. Data Privacy

- **PII Masking**: Ensure no Personally Identifiable Information is sent to remote LLM providers without redaction or encryption.
- **Data Leakage**: Verify that RAG sources do not inadvertently expose unauthorized documents to the user.

### 2. Reliability

- **Hallucination Checks**: Require a verification step for critical facts generated by the LLM.
- **Fallback Logic**: Always have a conservative fallback if the primary LLM fails or hits rate limits.

### 3. Agent Guardrails

- **No Infinite Loops**: Enforce strict limits on agent reflection and self-healing cycles (max 5 attempts); see the sketch below.
- **Cost Ceiling**: Set token or dollar limits per session to prevent runaway autonomous spending.
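A minimal sketch combining both guardrails: a hard cap on attempts and a per-session dollar ceiling. `run_agent_step`, its result shape, and the $2.00 budget are hypothetical.

```python
MAX_ATTEMPTS = 5          # mirrors the "max 5 attempts" rule above
COST_CEILING_USD = 2.00   # illustrative per-session budget

def run_with_guardrails(task: str) -> str | None:
    spent = 0.0
    for attempt in range(MAX_ATTEMPTS):
        result, cost = run_agent_step(task, attempt)  # hypothetical agent call
        spent += cost
        if result.ok:  # hypothetical result object with .ok / .output
            return result.output
        if spent >= COST_CEILING_USD:
            raise RuntimeError(f"Cost ceiling hit after {attempt + 1} attempts")
    return None  # give up rather than loop forever
```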