---
name: litellm
description: When calling LLM APIs from Python code. When connecting to llamafile or local LLM servers. When switching between OpenAI/Anthropic/local providers. When implementing retry/fallback logic for LLM calls. When code imports litellm or uses completion() patterns.
---

# LiteLLM

Unified Python interface for calling 100+ LLM APIs using a consistent OpenAI format. Provides standardized exception handling, retry/fallback logic, and cost tracking across multiple providers.

## When to Use This Skill

Use this skill when:

- Integrating with multiple LLM providers through a single interface
- Routing requests to local llamafile servers using OpenAI-compatible endpoints
- Implementing retry and fallback logic for LLM calls
- Building applications requiring consistent error handling across providers
- Tracking LLM usage costs across different providers
- Converting between provider-specific APIs and OpenAI format
- Deploying LLM proxy servers with unified configuration
- Testing applications against both cloud and local LLM endpoints

## Core Capabilities

### Provider Support

LiteLLM supports 100+ providers through a consistent OpenAI-style API:

- **Cloud Providers**: OpenAI, Anthropic, Google, Azure, AWS Bedrock
- **Local Servers**: llamafile, Ollama, LocalAI, vLLM
- **Unified Format**: All requests use the OpenAI message format
- **Exception Mapping**: All provider errors map to OpenAI exception types

### Key Features

1. **Unified API**: Single `completion()` function for all providers
2. **Exception Handling**: All exceptions inherit from OpenAI types
3. **Retry Logic**: Built-in retry with configurable attempts
4. **Streaming Support**: Sync and async streaming for all providers
5. **Cost Tracking**: Automatic usage and cost calculation (see the sketch after this list)
6. **Proxy Mode**: Deploy a centralized LLM gateway
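Token usage and cost (feature 5 above) can be read straight off the response object. A minimal sketch; the cloud model name is only an example, and local llamafile models are typically absent from LiteLLM's pricing map, so `completion_cost` may not return a meaningful value for them:

```python
import litellm
from litellm import completion_cost

response = litellm.completion(
    model="gpt-4o-mini",  # example cloud model; any model in LiteLLM's pricing map works
    messages=[{"role": "user", "content": "Hello"}],
)

# Token usage is reported in OpenAI format on every response
print(response.usage.prompt_tokens, response.usage.completion_tokens)

# Cost is computed from LiteLLM's model pricing map
print(f"${completion_cost(completion_response=response):.6f}")
```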
## Installation

```bash
# Using pip
pip install litellm

# Using uv
uv add litellm
```

## Llamafile Integration

### Provider Configuration

All llamafile models MUST use the `llamafile/` prefix for routing:

```python
model = "llamafile/mistralai/mistral-7b-instruct-v0.2"
model = "llamafile/gemma-3-3b"
```

### API Base URL

The `api_base` MUST point to llamafile's OpenAI-compatible endpoint:

```python
api_base = "http://localhost:8080/v1"
```

**Critical Requirements**:

- Include the `/v1` suffix
- Do NOT add endpoint paths like `/chat/completions` (LiteLLM adds these automatically)
- The default llamafile port is 8080

### Environment Variable Configuration

```python
import os

os.environ["LLAMAFILE_API_BASE"] = "http://localhost:8080/v1"
```

## Basic Usage Patterns

### Synchronous Completion

```python
import litellm

response = litellm.completion(
    model="llamafile/mistralai/mistral-7b-instruct-v0.2",
    messages=[{"role": "user", "content": "Summarize this diff"}],
    api_base="http://localhost:8080/v1",
    temperature=0.2,
    max_tokens=80,
)
print(response.choices[0].message.content)
```

### Asynchronous Completion

```python
from litellm import acompletion
import asyncio

async def generate_message():
    response = await acompletion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Write a commit message"}],
        api_base="http://localhost:8080/v1",
        temperature=0.3,
        max_tokens=200,
    )
    return response.choices[0].message.content

result = asyncio.run(generate_message())
print(result)
```

### Async Streaming

```python
from litellm import acompletion
import asyncio

async def stream_response():
    response = await acompletion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        api_base="http://localhost:8080/v1",
        stream=True,
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

asyncio.run(stream_response())
```

### Embeddings

```python
from litellm import embedding
import os

os.environ["LLAMAFILE_API_BASE"] = "http://localhost:8080/v1"

response = embedding(
    model="llamafile/sentence-transformers/all-MiniLM-L6-v2",
    input=["Hello world"],
)
print(response)
```

## Exception Handling

### Import Pattern

All exceptions can be imported directly from `litellm`:

```python
from litellm import (
    BadRequestError,          # 400 errors
    AuthenticationError,      # 401 errors
    NotFoundError,            # 404 errors
    Timeout,                  # 408 errors (alias: openai.APITimeoutError)
    RateLimitError,           # 429 errors
    APIConnectionError,       # 500 errors / connection issues (default)
    ServiceUnavailableError,  # 503 errors
)
```

### Exception Types Reference

| Status Code | Exception Type                | Inherits from                | Description                 |
| ----------- | ----------------------------- | ---------------------------- | --------------------------- |
| 400         | `BadRequestError`             | openai.BadRequestError       | Invalid request             |
| 400         | `ContextWindowExceededError`  | litellm.BadRequestError      | Token limit exceeded        |
| 400         | `ContentPolicyViolationError` | litellm.BadRequestError      | Content policy violation    |
| 401         | `AuthenticationError`         | openai.AuthenticationError   | Auth failure                |
| 403         | `PermissionDeniedError`       | openai.PermissionDeniedError | Permission denied           |
| 404         | `NotFoundError`               | openai.NotFoundError         | Invalid model/endpoint      |
| 408         | `Timeout`                     | openai.APITimeoutError       | Request timeout             |
| 429         | `RateLimitError`              | openai.RateLimitError        | Rate limited                |
| 500         | `APIConnectionError`          | openai.APIConnectionError    | Default for unmapped errors |
| 500         | `APIError`                    | openai.APIError              | Generic 500 error           |
| 503         | `ServiceUnavailableError`     | openai.APIStatusError        | Service unavailable         |
| >=500       | `InternalServerError`         | openai.InternalServerError   | Unmapped 500+ errors        |

### Exception Attributes

All LiteLLM exceptions include:

- `status_code`: HTTP status code
- `message`: Error message
- `llm_provider`: Provider that raised the exception
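For instance, the token-limit case from the table maps to `ContextWindowExceededError`, whose attributes can be inspected directly. A minimal sketch (whether a local llamafile server actually reports this condition depends on the server; `very_long_prompt` is a placeholder):

```python
import litellm

very_long_prompt = "word " * 100_000  # placeholder oversized input

try:
    litellm.completion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": very_long_prompt}],
        api_base="http://localhost:8080/v1",
    )
except litellm.ContextWindowExceededError as e:
    # Subclass of BadRequestError, so catching the parent would also work
    print(e.status_code, e.llm_provider, e.message)
```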
### Exception Handling Example

```python
import litellm
import openai

try:
    response = litellm.completion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello"}],
        api_base="http://localhost:8080/v1",
        timeout=30.0,
    )
except openai.APITimeoutError as e:
    # LiteLLM exceptions inherit from OpenAI types
    print(f"Timeout: {e}")
except litellm.APIConnectionError as e:
    print(f"Connection failed: {e.message}")
    print(f"Provider: {e.llm_provider}")
```

### Alternative Import from litellm.exceptions

```python
import litellm
from litellm.exceptions import BadRequestError, AuthenticationError, APIError

try:
    response = litellm.completion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello"}],
        api_base="http://localhost:8080/v1",
    )
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
except BadRequestError as e:
    print(f"Bad request: {e}")
except APIError as e:
    print(f"API error: {e}")
```

### Checking If Exception Should Retry

```python
import litellm

try:
    response = litellm.completion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello"}],
        api_base="http://localhost:8080/v1",
    )
except Exception as e:
    if hasattr(e, "status_code"):
        should_retry = litellm._should_retry(e.status_code)
        print(f"Should retry: {should_retry}")
```

## Retry and Fallback Configuration

```python
from litellm import completion

response = completion(
    model="llamafile/gemma-3-3b",
    messages=[{"role": "user", "content": "Hello"}],
    api_base="http://localhost:8080/v1",
    num_retries=3,  # Retry 3 times on failure
    timeout=30.0,   # 30 second timeout
)
```
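Fallbacks across deployments are handled by LiteLLM's `Router`. The sketch below is a minimal example assuming a local llamafile server plus a cloud fallback; the deployment names (`local-gemma`, `cloud-fallback`) and the fallback model are illustrative, and the fallback requires valid credentials for that provider:

```python
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "local-gemma",
            "litellm_params": {
                "model": "llamafile/gemma-3-3b",
                "api_base": "http://localhost:8080/v1",
            },
        },
        {
            "model_name": "cloud-fallback",
            "litellm_params": {"model": "gpt-4o-mini"},
        },
    ],
    fallbacks=[{"local-gemma": ["cloud-fallback"]}],  # try the cloud deployment if the local one fails
    num_retries=2,
)

response = router.completion(
    model="local-gemma",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```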
## Proxy Server Configuration

For proxy deployments, use `config.yaml`:

```yaml
model_list:
  - model_name: commit-polish-model
    litellm_params:
      model: llamafile/gemma-3-3b         # add llamafile/ prefix
      api_base: http://localhost:8080/v1  # add api base for OpenAI compatible provider
```

## Application Integration Patterns

### Connection Verification Pattern

```python
import litellm
from litellm import APIConnectionError

def verify_llamafile_connection(api_base: str = "http://localhost:8080/v1") -> bool:
    """Check if llamafile server is running."""
    try:
        litellm.completion(
            model="llamafile/test",
            messages=[{"role": "user", "content": "test"}],
            api_base=api_base,
            max_tokens=1,
        )
        return True
    except APIConnectionError:
        return False
```

### Async Service Pattern

```python
import litellm
from litellm import acompletion, APIConnectionError
import asyncio

class AIService:
    """LiteLLM wrapper with llamafile routing."""

    def __init__(self, model: str, api_base: str, temperature: float = 0.3, max_tokens: int = 200):
        self.model = model
        self.api_base = api_base
        self.temperature = temperature
        self.max_tokens = max_tokens

    async def generate_commit_message(self, diff: str, system_prompt: str) -> str:
        """Generate a commit message using the LLM."""
        try:
            response = await acompletion(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": f"Generate a commit message for this diff:\n\n{diff}"},
                ],
                api_base=self.api_base,
                temperature=self.temperature,
                max_tokens=self.max_tokens,
            )
            return response.choices[0].message.content.strip()
        except APIConnectionError as e:
            raise RuntimeError(f"Failed to connect to llamafile server at {self.api_base}: {e.message}")
```

## Common Pitfalls to Avoid

1. **Missing `llamafile/` prefix**: Without the prefix, LiteLLM won't route requests to the OpenAI-compatible endpoint
2. **Wrong port**: Llamafile uses 8080 by default, not 8000
3. **Missing `/v1` suffix**: The API base must end with `/v1`
4. **Adding extra path segments**: Do NOT use `http://localhost:8080/v1/chat/completions` - LiteLLM adds the endpoint path automatically
5. **API key requirement**: No API key is needed for a local llamafile (use an empty string or any value if required by validation)

## Configuration Examples

### TOML Configuration

```toml
# ~/.config/commit-polish/config.toml
[ai]
model = "llamafile/gemma-3-3b"  # MUST have llamafile/ prefix
temperature = 0.3
max_tokens = 200
```

### Environment Variables

```bash
export LLAMAFILE_API_BASE="http://localhost:8080/v1"
export LITELLM_LOG="INFO"  # Set LiteLLM log level
```

## Related Skills

For comprehensive documentation on related tools:

- **llamafile**: Activate the llamafile skill using `Skill(command: "llamafile")` for llamafile server setup, model management, and local LLM deployment patterns
- **uv**: Activate the uv skill using `Skill(command: "uv")` for Python project management, dependency handling, and virtual environment workflows

## References

### Official Documentation

- [LiteLLM Documentation](https://docs.litellm.ai/) - Main documentation portal
- [Llamafile Provider Docs](https://docs.litellm.ai/docs/providers/llamafile) - Llamafile-specific configuration
- [Exception Mapping](https://docs.litellm.ai/docs/exception_mapping) - Complete exception reference
- [GitHub Repository](https://github.com/BerriAI/litellm) - Source code and examples

### Provider-Specific Documentation

- [Llamafile API Endpoints](https://github.com/Mozilla-Ocho/llamafile/blob/main/llama.cpp/server/README.md#api-endpoints) - Llamafile OpenAI-compatible API reference
- [Completion Streaming](https://docs.litellm.ai/docs/completion/stream) - Streaming implementation guide

### Version Information

- Documentation verified against: LiteLLM GitHub repository (main branch, accessed 2025-01-15)
- Python: 3.11+
- Llamafile: 0.9.3+