# Operations Guide ## 1. Observability Stack Startup ```bash docker compose -f docker-compose.observability.yml up -d ``` Access endpoints: - Prometheus: `http://localhost:9090` - Loki: `http://localhost:3100` - Tempo: `http://localhost:3200` - Alertmanager: `http://localhost:9093` ## 2. Core Alerts - `HighHttpP95Latency`: p95 > 1.5s for 5m - `IngestionFailureRateHigh`: ingestion failed ratio > 5% ## 3. Queue Backend Modes - Redis Stream: `APP_INGESTION_QUEUE_BACKEND=redis_stream` - RabbitMQ: `APP_INGESTION_QUEUE_BACKEND=rabbitmq` - DB polling fallback: `APP_INGESTION_QUEUE_BACKEND=db_polling` Terminal failures enter DLQ stream/queue. ## 4. Log Shipping - Application log file: `logs/knowledgeops-agent.log` - Promtail scrapes `logs/*.log` and pushes to Loki - Trace and request correlation fields: `trace_id`, `request_id`, `chat_id` ## 5. Nightly Regression ```bash python3 scripts/generate_eval_dataset.py python3 scripts/generate_eval_predictions.py python3 scripts/run_regression.py --dataset evaluation/dataset.large.json --predictions evaluation/predictions.generated.json --threshold 0.75 ``` ## 6. Performance Validation ```bash k6 run performance/k6/chat_ingestion_load.js -e BASE_URL=http://localhost:8080 k6 run performance/k6/distributed_chat_ingestion.js -e BASE_URL=http://localhost:8080 python3 performance/k6/generate_report.py --summary reports/performance/distributed-k6-summary.json ``` ## 7. Incident Triage Playbook 1. Verify app health endpoint and dependency availability. 2. Check ingestion queue lag and failed jobs. 3. Correlate logs by `trace_id`. 4. Review p95 latency and error spikes in Prometheus. 5. Trigger fallback or rollback if SLA continues to degrade.