---
name: enterprise-agent-ops
description: Operate long-lived agent workloads with observability, security boundaries, and lifecycle management.
origin: ECC
---

# Enterprise Agent Ops

Use this skill for cloud-hosted or continuously running agent systems that need operational controls beyond single CLI sessions.

## Operational Domains

1. runtime lifecycle (start, pause, stop, restart)
2. observability (logs, metrics, traces)
3. safety controls (scopes, permissions, kill switches)
4. change management (rollout, rollback, audit)

## Baseline Controls

- immutable deployment artifacts
- least-privilege credentials
- environment-level secret injection
- hard timeout and retry budgets
- audit log for high-risk actions

## Metrics to Track

- success rate
- mean retries per task
- time to recovery
- cost per successful task
- failure class distribution

## Incident Pattern

When failure spikes:

1. freeze new rollout
2. capture representative traces
3. isolate failing route
4. patch with smallest safe change
5. run regression + security checks
6. resume gradually

## Deployment Integrations

This skill pairs with:

- PM2 workflows
- systemd services
- container orchestrators
- CI/CD gates
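
As a rough illustration of the "hard timeout and retry budgets" control and a few of the metrics listed above (success rate, mean retries per task, failure class distribution), the sketch below wraps a task in a retry budget with a deadline and then summarizes a batch of results. The names `TaskResult`, `run_with_budget`, and `summarize` are hypothetical and not part of any agent framework. It also assumes the deadline is only checked between attempts; a truly hard timeout would need process-level enforcement, for example by the supervisor or orchestrator.

```python
import time
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class TaskResult:
    ok: bool
    retries: int                  # attempts made after the first one
    failure_class: Optional[str]  # None on success


def run_with_budget(
    task: Callable[[], object],
    max_retries: int,
    deadline_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> TaskResult:
    """Run task until it succeeds, the retry budget is spent, or the deadline passes.

    Assumption: the deadline is checked between attempts only; it does not
    interrupt an attempt that is already running.
    """
    start = clock()
    attempts = 0
    last_error: Optional[str] = None
    while attempts <= max_retries:
        if clock() - start >= deadline_s:
            return TaskResult(False, max(attempts - 1, 0), "timeout")
        attempts += 1
        try:
            task()
            return TaskResult(True, attempts - 1, None)
        except Exception as exc:
            # Use the exception type name as a coarse failure class.
            last_error = type(exc).__name__
    return TaskResult(False, attempts - 1, last_error)


def summarize(results: List[TaskResult]) -> Dict[str, object]:
    """Compute success rate, mean retries per task, and failure class counts."""
    total = len(results)
    if total == 0:
        return {"success_rate": 0.0, "mean_retries": 0.0, "failure_classes": {}}
    successes = sum(1 for r in results if r.ok)
    failures = Counter(r.failure_class for r in results if not r.ok)
    return {
        "success_rate": successes / total,
        "mean_retries": sum(r.retries for r in results) / total,
        "failure_classes": dict(failures),
    }
```

Keeping the budget logic separate from the summary makes it possible to emit each `TaskResult` to whatever metrics pipeline is already in place, and a shift in the failure class counts can serve as one signal for the "failure spikes" condition in the incident pattern.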