--- name: python-logging-best-practices description: Python logging with loguru, structlog, and orjson. TRIGGERS - loguru, structlog, structured logging allowed-tools: Read, Bash, Grep, Edit, Write --- # Python Logging Best Practices > **Self-Evolving Skill**: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues. ## When to Use This Skill Use this skill when: - Setting up Python logging for any service or script - Configuring structured JSONL logging for analysis - Implementing log rotation - Choosing between lightweight (zero-dep) and full-featured logging - Adding logging to containerized, systemd, or local applications ## Overview Unified reference for Python logging patterns optimized for machine readability (Claude Code analysis) and operational reliability. **Starts with the lightest viable approach and scales up only when needed.** ## Decision Heuristic: Start Light, Scale Up ``` Is it < 5 services on a single machine, < 1 event/sec? YES → Lightweight Pattern (print + JSONL telemetry) NO → Is it containerized / serverless? YES → stdout JSON (any library), no file rotation NO → Is OTel tracing required? YES → structlog + OTel NO → loguru (CLI tools) or stdlib RotatingFileHandler ``` | Approach | Use Case | Pros | Cons | | --------------- | ---------------------------------------------------- | --------------------------------------------- | -------------------------------------------- | | **Lightweight** | Small systemd services, self-hosted, single operator | Zero deps, journald integration, minimal code | No severity filtering, no per-module control | | `loguru` | CLI tools, scripts, local services | Zero-config, built-in rotation, great DX | External dep, not truly schema-enforced | | `structlog` | Production services, OTel integration | ContextVars, processor chains, OTel-native | Steeper learning curve | | `stdlib` | LaunchAgent daemons, zero-dep constraint | No dependencies, Python 3.14 `merge_extra` | More boilerplate, no structured defaults | | `Logfire` | AI/LLM observability, Pydantic apps | Built on OTel, token/cost tracking, SQL | SaaS dependency, newer ecosystem | --- ## Preferred: Lightweight Pattern (Zero Dependencies) **For: < 5 systemd services, single server, single operator. Battle-tested in production by [ccmax-monitor](https://github.com/terrylica/ccmax-monitor).** This pattern uses a **two-channel architecture**: - **Channel 1**: `print(flush=True)` → systemd journald (operational logs, human-readable) - **Channel 2**: Append-only JSONL file (structured telemetry, machine-readable) This maps to the 12-Factor App's "treat logs as event streams" principle. journald handles ops (rotation, filtering, metadata), while the JSONL file serves domain telemetry for post-mortem analysis. ### Architecture: Three-Concern Separation | Concern | Mechanism | Purpose | Lifecycle | | ------------------ | ------------------------------ | ------------------------------------------- | ---------------------------------- | | **Ops logging** | `print()` → journald | Human debugging, `journalctl -u service -f` | Managed by journald (auto-rotated) | | **Telemetry** | JSONL file (`telemetry.jsonl`) | Structured audit trail, AI/LLM analysis | Append-only, rotated by size | | **State recovery** | WAL file (optional) | Crash recovery for irreversible operations | Ephemeral, deleted on success | ### Complete Lightweight Example ```python """Append-only JSONL telemetry logger with size-based rotation. Zero external dependencies. Works with systemd journald for ops logging and a separate JSONL file for structured machine-readable telemetry. """ import json from datetime import datetime, timezone from pathlib import Path TELEMETRY_PATH = Path(__file__).parent / "telemetry.jsonl" MAX_SIZE = 10 * 1024 * 1024 # 10 MB BACKUP_COUNT = 3 # Keep 3 rotated backups (~30MB total) def log_event(event_type: str, data: dict) -> None: """Append a structured JSON line to telemetry.jsonl.""" entry = { "ts": datetime.now(timezone.utc).isoformat(), "type": event_type, **data, } line = json.dumps(entry, separators=(",", ":")) + "\n" try: try: if TELEMETRY_PATH.stat().st_size > MAX_SIZE: _rotate() except FileNotFoundError: pass with open(TELEMETRY_PATH, "a") as f: f.write(line) except OSError as e: # Fallback to stderr (captured by journald) print(f"[telemetry] write failed: {e}", file=__import__("sys").stderr, flush=True) def _rotate() -> None: """Rotate telemetry files: .jsonl → .jsonl.1 → .jsonl.2 → .jsonl.3""" for i in range(BACKUP_COUNT, 1, -1): src = TELEMETRY_PATH.with_suffix(f".jsonl.{i - 1}") dst = TELEMETRY_PATH.with_suffix(f".jsonl.{i}") if src.exists(): dst.unlink(missing_ok=True) src.rename(dst) backup = TELEMETRY_PATH.with_suffix(".jsonl.1") backup.unlink(missing_ok=True) TELEMETRY_PATH.rename(backup) # === Ops logging (goes to journald via stdout) === def log(msg: str) -> None: """Human-readable operational log line. Captured by journald.""" ts = datetime.now(timezone.utc).strftime("%H:%M:%S") print(f"[{ts}] {msg}", flush=True) ``` **Usage:** ```python # Operational (human reads via journalctl -u myservice -f) log("Refreshing token for account X") log("Switch: account A → account B (reason: 5h breach)") # Telemetry (machine reads via jq/DuckDB/Claude Code) log_event("token_refresh", {"account": "X", "expires_in_h": 8.0, "token_fp": "abc12345"}) log_event("account_switch", {"from": "A", "to": "B", "reason": "5h_breach"}) ``` ### Security: Token Fingerprinting (Not Regex Redaction) **Never pass secrets through the logging pipeline.** Log only a non-reversible fragment: ```python def _token_fingerprint(token: str) -> str: """Extract uniquely identifiable chars from a token's mid-section. The prefix (sk-ant-oat01-) and suffix (...AA) are common across tokens. Chars 14-22 (after the prefix) are the most unique per-token. Middle-slice avoids leaking type-prefix metadata that prefix-based approaches expose. """ if len(token) > 25: return token[14:22] return token[:8] if token else "" # Usage: log the fingerprint, never the token log_event("token_refresh", {"account": name, "token_fp": _token_fingerprint(token)}) ``` **Why this is superior to regex redaction filters:** | Approach | Security | Maintenance | Failure mode | | ------------------------------------------- | ----------------------------------------- | ----------------------------------------- | ------------------------------- | | **Token fingerprinting** (log only a slice) | Secret never enters logging pipeline | Zero — works with any token format | Cannot fail — nothing to redact | | Regex redaction filter | Secret passes through, filtered on output | Must update regexes for new token formats | Silent miss = secret in logs | This aligns with OWASP Logging Cheat Sheet: "Ensure that no sensitive data is included in log entries." Major platforms (AWS, Stripe, GitHub) use separate non-secret identifiers or partial token display — never full tokens with regex scrubbing. Regex filters remain useful as a **defense-in-depth backstop**, not a primary control. ### Health Endpoints as Observability For small deployments, **rich JSON health endpoints replace log aggregation**: ```python @app.get("/api/status") def status(): """White-box monitoring — current state on demand.""" return {"active_account": ..., "accounts": [...], "polled_at": ...} @app.get("/api/vault-health") def vault_health(): """Token health for all accounts.""" return {name: {"status": "healthy", "expires_in": "7.5h", ...} for ...} ``` This is the **Health Endpoint Monitoring Pattern** (Microsoft Azure Architecture Center) / **Health Check API Pattern** (microservices.io). The dashboard IS the monitoring tool — no Grafana/Prometheus needed. When the service itself serves its own operational state as structured JSON, you get: - **Real-time** current state (not delayed by log ingestion pipelines) - **Zero infrastructure** (no log shipper, storage, or query engine) - **AI-parseable** (Claude Code can `curl` and analyze directly) ### Post-Mortem with FOSS CLI Tools No log aggregation stack needed. These single-binary tools work directly on JSONL: ```bash # DuckDB — SQL analytics on JSONL (most powerful) duckdb -c "SELECT type, count(*) FROM read_json_auto('telemetry.jsonl') GROUP BY 1 ORDER BY 2 DESC" # jq — ad-hoc JSON filtering jq 'select(.type == "token_refresh")' telemetry.jsonl # journalctl — already exports JSONL natively journalctl -u ccmax-switcher -o json --since "1h ago" | jq 'select(.PRIORITY == "3")' # lnav — interactive terminal log viewer with SQL lnav telemetry.jsonl # llm (Simon Willison) — pipe to LLM for AI post-mortem journalctl -u myservice --since "2h ago" --priority=err -o json | llm "analyze root cause" ``` ### When to Upgrade Beyond Lightweight Upgrade to loguru/structlog when any of these become true: - **> 5 services** across multiple hosts (need trace IDs for correlation) - **> 10 events/sec** sustained (need async sinks, `orjson`) - **Multiple operators** who need per-module log level filtering - **Compliance requirements** that mandate structured audit trails with signatures - **Container/K8s deployment** (stdout JSON is the standard) --- ## Full-Featured: Loguru + JSONL Pattern For CLI tools, scripts, and services that benefit from a logging library: ### Log Rotation (ALWAYS CONFIGURE for local/CLI apps) ```python from loguru import logger logger.add( log_path, rotation="10 MB", retention="7 days", compression="gz" ) # stdlib alternative (zero-dep) from logging.handlers import RotatingFileHandler handler = RotatingFileHandler( log_path, maxBytes=100 * 1024 * 1024, # 100MB backupCount=5 ) ``` > **Container/serverless apps**: Skip file rotation entirely. Log to **stdout/stderr as JSON**. Let the container runtime handle collection and rotation. ### JSONL Format (Machine-Readable) ```python # One JSON object per line - jq-parseable {"timestamp": "2026-01-14T12:45:23.456Z", "level": "info", "message": "..."} ``` **File extension**: Always use `.jsonl` (not `.json` or `.log`) **Performance**: For >10k records/sec, use `orjson` instead of `json.dumps()`: ```python import orjson def json_formatter(record) -> str: log_entry = { ... } return orjson.dumps(log_entry).decode() ``` ### Regex Redaction (Defense-in-Depth) Use as a **backstop** alongside token fingerprinting, not as the primary control: ```python import re REDACT_PATTERNS = [ (re.compile(r'AKIA[0-9A-Z]{16}'), '[REDACTED_AWS_KEY]'), (re.compile(r'sk-[a-zA-Z0-9]{48}'), '[REDACTED_API_KEY]'), (re.compile(r'(?i)bearer\s+[a-zA-Z0-9._~+/=-]+'), '[REDACTED_BEARER]'), ] def redact_filter(record): for pattern, replacement in REDACT_PATTERNS: record["message"] = pattern.sub(replacement, record["message"]) return True logger.add(sink, filter=redact_filter) ``` ### Shutdown — Always Flush Enqueued Messages ```python import asyncio from loguru import logger async def main(): logger.add("app.jsonl", enqueue=True) await logger.complete() asyncio.run(main()) # Sync: logger.remove() ``` ### Complete Loguru + JSONL Example ```python #!/usr/bin/env python3 # /// script # requires-python = ">=3.14" # dependencies = ["loguru", "orjson"] # /// import re import sys from pathlib import Path from uuid import uuid4 import orjson from loguru import logger REDACT_PATTERNS = [ (re.compile(r'AKIA[0-9A-Z]{16}'), '[REDACTED_AWS_KEY]'), (re.compile(r'sk-[a-zA-Z0-9]{48}'), '[REDACTED_API_KEY]'), ] def json_formatter(record) -> str: log_entry = { "timestamp": record["time"].strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z", "level": record["level"].name.lower(), "component": record["function"], "operation": record["extra"].get("operation", "unknown"), "operation_status": record["extra"].get("status", None), "trace_id": record["extra"].get("trace_id"), "message": record["message"], "context": {k: v for k, v in record["extra"].items() if k not in ("operation", "status", "trace_id", "metrics")}, "metrics": record["extra"].get("metrics", {}), "error": None } if record["exception"]: exc_type, exc_value, _ = record["exception"] log_entry["error"] = { "type": exc_type.__name__ if exc_type else "Unknown", "message": str(exc_value) if exc_value else "Unknown error", } return orjson.dumps(log_entry).decode() def redact_filter(record): for pattern, replacement in REDACT_PATTERNS: record["message"] = pattern.sub(replacement, record["message"]) return True def setup_logger(app_name: str, log_dir: Path | None = None): logger.remove() logger.add(sys.stderr, format=json_formatter, filter=redact_filter, level="INFO") if log_dir is not None: log_dir.mkdir(parents=True, exist_ok=True) logger.add( str(log_dir / f"{app_name}.jsonl"), format=json_formatter, filter=redact_filter, rotation="10 MB", retention="7 days", compression="gz", level="DEBUG" ) return logger ``` ## Semantic Fields Reference | Field | Type | Purpose | | ------------------- | ------------- | ------------------------------------------------------ | | `timestamp` / `ts` | ISO 8601 | Event ordering (millisecond precision minimum) | | `level` / `type` | string | Severity or event type | | `component` / `svc` | string | Module, function, or service name | | `operation` | string | What action is being performed | | `operation_status` | string | started/success/failed/skipped | | `trace_id` | UUID4 or OTel | Correlation ID (OTel trace ID for production services) | | `message` | string | Human-readable description | | `context` | object | Operation-specific metadata | | `metrics` | object | Quantitative data (counts, durations) | | `error` | object/null | Exception details if failed | ## Related Resources - [Health Endpoint Monitoring Pattern](https://learn.microsoft.com/en-us/azure/architecture/patterns/health-endpoint-monitoring) - Microsoft Azure Architecture Center - [OWASP Logging Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html) - Security best practices - [Write-Ahead Log pattern](https://martinfowler.com/articles/patterns-of-distributed-systems/write-ahead-log.html) - Martin Fowler - [DuckDB JSON support](https://duckdb.org/docs/data/json/overview) - SQL analytics on JSONL - [lnav](https://lnav.org/) - Terminal log file navigator with SQL - [llm CLI](https://github.com/simonw/llm) - Pipe logs to LLMs for analysis - [structlog docs](https://www.structlog.org/) - Structured logging for production services - [Pydantic Logfire](https://pydantic.dev/logfire) - AI/LLM observability built on OTel - [Langfuse](https://langfuse.com/) - Open-source LLM observability (self-hostable) ## Anti-Patterns to Avoid 1. **Unbounded logs** - Always configure rotation (local) or stdout (container) 2. **Logging full secrets** - Use token fingerprinting; regex redaction is a backstop, not primary 3. **Adding loguru/structlog to < 5 low-volume services** - print + JSONL is sufficient; dependency is not free 4. **Bare except without logging** - Catch specific exceptions, log them 5. **Silent failures** - Log errors before suppressing 6. **`enqueue=True` without `logger.complete()`** - Silent log loss on shutdown 7. **`enqueue=True` with slow sinks** - Unbounded memory growth 8. **`json.dumps()` at >10k events/sec** - Use orjson for 2-10x speedup 9. **UUID4 trace IDs in OTel services** - Use OTel-propagated trace IDs 10. **Prometheus/Grafana for < 5 services** - Health endpoints + Uptime Kuma is sufficient 11. **Conflating WAL and telemetry** - WAL is for crash recovery (ephemeral), telemetry is for audit (permanent) --- ## Troubleshooting | Issue | Cause | Solution | | ----------------------------- | -------------------------------- | ---------------------------------------------------- | | loguru not found | Not installed | Run `uv add loguru` | | Logs not appearing | Wrong log level | Set level to DEBUG for troubleshooting | | Log rotation not working | Missing rotation config | Add rotation param to logger.add() | | JSONL parse errors | Malformed log line | Check for unescaped special characters | | OOM with enqueue=True | Unbounded internal queue | Monitor RSS; use structlog or avoid slow sinks | | Lost logs on shutdown | Missing logger.complete() | Call `await logger.complete()` or `logger.remove()` | | Slow JSONL serialization | Using stdlib json at high volume | Switch to `orjson.dumps().decode()` | | Secrets in logs | No fingerprinting | Log token slices, not full values | | journald not capturing output | Missing flush | Use `print(..., flush=True)` or `PYTHONUNBUFFERED=1` | | No alerts when services crash | No external monitor | Add Uptime Kuma or Gatus polling health endpoints | ## Post-Execution Reflection After this skill completes, check before closing: 1. **Did the command succeed?** — If not, fix the instruction or error table that caused the failure. 2. **Did parameters or output change?** — If the underlying tool's interface drifted, update Usage examples and Parameters table to match. 3. **Was a workaround needed?** — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround. Only update if the issue is real and reproducible — not speculative.