---
name: autonomous-operations
description: Autonomous homelab operations using OODA loop (Observe, Orient, Decide, Act) - use when reviewing system state, planning autonomous actions, or investigating operational issues
---

# Autonomous Operations Engine

## Overview

The autonomous operations engine implements an OODA (Observe, Orient, Decide, Act) loop that orchestrates all homelab components into a self-managing system. It leverages existing infrastructure:

- **Context Framework** - Historical patterns and preferences
- **Auto-Remediation** - Playbook-based fixes
- **Predictive Analytics** - Resource exhaustion forecasting
- **Backup Integration** - Snapshot-protected operations
- **Homelab Intelligence** - Health scoring and diagnostics

**Philosophy:** Proactive over reactive. Predict and prevent rather than detect and fix.

## When to Use

**Always use for:**
- Reviewing current autonomous operations state
- Understanding what actions the system would take
- Investigating why an action was/wasn't taken
- Adjusting autonomy settings
- Querying the decision log

**Triggers:**
- User asks "what would the system do about..."
- User asks about autonomous actions taken
- User wants to review pending actions
- User asks to adjust autonomy level

## Architecture: OODA Loop

```
┌─────────────────────────────────────────────────────────────────────┐
│                           OBSERVE                                   │
│  • Predictive analytics (resource exhaustion forecasts)             │
│  • Health scoring (homelab-intel.sh)                                │
│  • Drift detection (check-drift.sh)                                 │
│  • Backup health verification                                       │
└────────────────────────────┬────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                           ORIENT                                    │
│  • Historical patterns (issue-history.json)                         │
│  • User preferences (preferences.yml)                               │
│  • Service overrides (traefik: no auto-restart)                     │
│  • Current state (autonomous-state.json)                            │
└────────────────────────────┬────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                           DECIDE                                    │
│  Confidence = (prediction × 0.30) + (historical × 0.30)             │
│             + (impact × 0.20) + (rollback × 0.20)                   │
│                                                                     │
│  ┌────────────────────┬──────────────────────────────────┐          │
│  │ Confidence + Risk  │ Action                           │          │
│  ├────────────────────┼──────────────────────────────────┤          │
│  │ >90% + Low Risk    │ AUTO-EXECUTE                     │          │
│  │ >80% + Med Risk    │ NOTIFY + EXECUTE                 │          │
│  │ >70% + Any Risk    │ QUEUE (for approval)             │          │
│  │ <70%               │ ALERT ONLY                       │          │
│  └────────────────────┴──────────────────────────────────┘          │
└────────────────────────────┬────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                            ACT                                      │
│  1. Create pre-action snapshot (BTRFS)                              │
│  2. Execute via appropriate skill/playbook                          │
│  3. Validate outcome (health check)                                 │
│  4. Auto-rollback if validation fails                               │
│  5. Log decision and outcome                                        │
└─────────────────────────────────────────────────────────────────────┘
```

## Core Scripts

### autonomous-check.sh

Evaluates current system state and outputs recommended actions:

```bash
# Run assessment (outputs JSON)
~/containers/scripts/autonomous-check.sh

# Verbose output
~/containers/scripts/autonomous-check.sh --verbose

# Output to file
~/containers/scripts/autonomous-check.sh --output /tmp/assessment.json
```

**Output structure:**
```json
{
  "timestamp": "2025-11-29T22:00:00+01:00",
  "health_score": 94,
  "observations": [...],
  "recommended_actions": [
    {
      "id": "action-001",
      "type": "disk-cleanup",
      "reason": "Disk predicted 90% full in 8 days",
      "confidence": 0.92,
      "risk": "low",
      "decision": "auto-execute"
    }
  ],
  "pending_queue": []
}
```

### autonomous-execute.sh

Executes approved actions from the queue:

```bash
# Execute pending low-risk actions
~/containers/scripts/autonomous-execute.sh

# Execute specific action
~/containers/scripts/autonomous-execute.sh --action-id action-001

# Dry run (simulate only)
~/containers/scripts/autonomous-execute.sh --dry-run
```

### Emergency Controls

```bash
# Emergency stop all autonomous operations
~/containers/scripts/autonomous-execute.sh --stop

# Pause (stop acting, keep observing)
~/containers/scripts/autonomous-execute.sh --pause

# Resume operations
~/containers/scripts/autonomous-execute.sh --resume

# View current state
~/containers/scripts/autonomous-execute.sh --status
```

## State Files

### autonomous-state.json

Location: `~/.claude/context/autonomous-state.json`

Tracks operational state:
```json
{
  "enabled": true,
  "mode": "active",
  "paused": false,
  "circuit_breaker": {
    "triggered": false,
    "consecutive_failures": 0,
    "last_failure": null
  },
  "last_check": "2025-11-29T22:00:00+01:00",
  "pending_actions": [],
  "cooldowns": {
    "jellyfin.restart": "2025-11-29T20:00:00+01:00"
  },
  "statistics": {
    "actions_24h": 2,
    "actions_7d": 8,
    "success_rate": 0.95
  }
}
```

### decision-log.json

Location: `~/.claude/context/decision-log.json`

Audit trail of all decisions:
```json
{
  "decisions": [
    {
      "id": "decision-001",
      "timestamp": "2025-11-29T03:00:00+01:00",
      "action_type": "disk-cleanup",
      "trigger": "prediction: disk 90% in 9 days",
      "confidence": 0.92,
      "risk": "low",
      "decision": "auto-execute",
      "outcome": "success",
      "details": "Freed 15 GB, new prediction: 21 days",
      "duration_seconds": 45
    }
  ]
}
```

## Integration Points

### With Existing Playbooks

The engine routes actions to existing remediation playbooks:

| Action Type | Playbook | Risk |
|-------------|----------|------|
| disk-cleanup | `.claude/remediation/playbooks/disk-cleanup.yml` | Low |
| service-restart | `.claude/remediation/playbooks/service-restart.yml` | Low |
| drift-reconciliation | `.claude/remediation/playbooks/drift-reconciliation.yml` | Medium |
| resource-pressure | `.claude/remediation/playbooks/resource-pressure.yml` | Medium |

### With Preferences

Respects all settings in `~/.claude/context/preferences.yml`:

- `risk_tolerance` - Affects confidence thresholds
- `service_overrides` - Per-service restrictions (traefik: no auto-restart)
- `safety.max_restarts_per_hour` - Rate limiting
- `safety.restart_cooldown_seconds` - Cooldown periods

### With Backups

Every action is protected:

1. Pre-action snapshot created via `btrfs-snapshot-backup.sh`
2. Snapshot tagged: `autonomous-<action>-<timestamp>`
3. Auto-rollback if validation fails
4. Snapshot retained for 7 days minimum

## Decision Matrix

### Risk Classification

| Action | Data Risk | Availability Risk | Overall |
|--------|-----------|-------------------|---------|
| Disk cleanup | None (snapshot) | None | **LOW** |
| Service restart | None (stateless) | 2-5s downtime | **LOW** |
| Config reconciliation | None (snapshot) | Potential failure | **MEDIUM** |
| Multi-service change | Medium | Cascading risk | **HIGH** |

### Confidence Calculation

```
confidence = (
    prediction_confidence × 0.30 +    # How sure about the problem?
    historical_success × 0.30 +       # Has this fix worked before?
    impact_certainty × 0.20 +         # Are we sure about outcome?
    rollback_feasibility × 0.20       # Can we undo if it fails?
)
```

**Example: Disk cleanup**
```
prediction_confidence = 0.89  # Linear trend, high confidence
historical_success = 1.00     # 12/12 successful cleanups
impact_certainty = 0.95       # Known outcome
rollback_feasibility = 1.00   # Instant snapshot restore

confidence = 0.89×0.3 + 1.0×0.3 + 0.95×0.2 + 1.0×0.2
           = 0.267 + 0.300 + 0.190 + 0.200
           = 0.957 (95.7%)

Decision: AUTO-EXECUTE (>90% + low risk)
```

## Safety Controls

### Circuit Breaker

Automatically pauses after 3 consecutive failures:

```json
{
  "circuit_breaker": {
    "threshold": 3,
    "triggered": true,
    "consecutive_failures": 3,
    "last_failure": "2025-11-29T04:00:00+01:00",
    "reason": "service-restart failed 3 times"
  }
}
```

**Recovery:** Manual reset via `--resume` or automatically after 24 hours.

### Cooldowns

Prevent action storms:

- Service restart: 5 minute cooldown between restarts
- Disk cleanup: 1 hour cooldown
- Drift reconciliation: 15 minute cooldown

### Service Overrides

Critical services have special handling:

```yaml
# From preferences.yml
service_overrides:
  traefik:
    auto_restart: false         # Never auto-restart
    requires_confirmation: true
  authelia:
    auto_restart: false         # SSO is critical
    requires_confirmation: true
  prometheus:
    auto_restart: true          # Can auto-restart
    restart_timeout_seconds: 90
```

## Verification Feedback Loop

**NEW: Actions are verified and confidence scores are updated based on outcomes.**

This creates a learning system - successful actions increase autonomy, failed actions increase caution.

### Enhanced ACT Phase

The ACT phase now includes verification and confidence learning:

```
┌─────────────────────────────────────────────────────────────────────┐
│                       ACT (Enhanced)                                │
│                                                                     │
│  1. Create pre-action snapshot (BTRFS)                              │
│  2. Capture before-state (homelab-intel.sh --json)                  │
│  3. Execute via appropriate skill/playbook                          │
│  4. Wait for stabilization (10-30s depending on action)             │
│                                                                     │
│  5. VERIFY OUTCOME (NEW)                                            │
│     ├─ Run verify-autonomous-outcome.sh                             │
│     ├─ Compare before/after state                                   │
│     ├─ For services: run service-validator                          │
│     └─ Calculate verification confidence                            │
│                                                                     │
│  6. UPDATE CONFIDENCE (NEW)                                         │
│     ├─ Verified success: confidence +5%                             │
│     ├─ Warnings: confidence +2%                                     │
│     ├─ Failed verification: confidence -10%                         │
│     └─ Track in decision log                                        │
│                                                                     │
│  7. Auto-rollback if verification fails                             │
│  8. Log decision with verification details                          │
│  9. Update circuit breaker state                                    │
└─────────────────────────────────────────────────────────────────────┘
```

### Verification by Action Type

**disk-cleanup:**
```bash
# Verify disk space actually freed (>3% improvement)
~/containers/scripts/verify-autonomous-outcome.sh \
  disk-cleanup \
  /tmp/before-state.json \
  30  # wait 30s

# Success: Freed 12% → confidence +5%
# Minimal: Freed 1% → confidence +2% (warning)
# Failed: No space freed → confidence -10%, rollback
```

**service-restart:**
```bash
# Verify service healthy after restart
~/containers/scripts/verify-autonomous-outcome.sh \
  service-restart \
  /tmp/before-state-jellyfin.json \
  30

# Also run service-validator for comprehensive checks
~/.claude/skills/homelab-deployment/scripts/verify-deployment.sh \
  jellyfin \
  https://jellyfin.patriark.org \
  true

# Success: Service up + all checks pass → confidence +5%
# Warning: Service up but warnings → confidence +2%
# Failed: Service crashed or checks fail → confidence -10%, rollback
```

**drift-reconciliation:**
```bash
# Verify drift actually resolved
~/containers/scripts/verify-autonomous-outcome.sh \
  drift-reconciliation \
  /tmp/before-state.json \
  30

# Success: Drift status = MATCH → confidence +5%
# Failed: Drift still present → confidence -10%, rollback
```

**resource-pressure:**
```bash
# Verify memory/CPU pressure relieved
~/containers/scripts/verify-autonomous-outcome.sh \
  resource-pressure \
  /tmp/before-state.json \
  30

# Success: Pressure reduced >5% → confidence +5%
# Minimal: Pressure reduced <5% → confidence +2%
# Failed: Pressure not reduced → confidence -10%
```

### Confidence Learning

Confidence scores are tracked per action type and updated based on verification outcomes:

**Before verification (historical only):**
```json
{
  "action_type": "disk-cleanup",
  "base_confidence": 0.92,
  "historical_success_rate": 12/12,
  "recent_executions": []
}
```

**After 3 verified successes:**
```json
{
  "action_type": "disk-cleanup",
  "base_confidence": 0.97,  // +5% after each success
  "historical_success_rate": 15/15,
  "recent_executions": [
    {"date": "2026-01-01", "verified": true, "delta": +5},
    {"date": "2026-01-02", "verified": true, "delta": +5},
    {"date": "2026-01-03", "verified": true, "delta": +5}
  ],
  "trend": "increasing"
}
```

**After 1 failed verification:**
```json
{
  "action_type": "service-restart",
  "base_confidence": 0.79,  // -10% after failure
  "historical_success_rate": 14/15,
  "recent_executions": [
    {"date": "2026-01-01", "verified": true, "delta": +5},
    {"date": "2026-01-02", "verified": true, "delta": +5},
    {"date": "2026-01-03", "verified": false, "delta": -10}
  ],
  "trend": "stable (recent failure)"
}
```

### State File Updates

**autonomous-state.json** now includes verification tracking:

```json
{
  "enabled": true,
  "last_check": "2026-01-04T06:30:00+01:00",
  "verification": {
    "enabled": true,
    "last_verification": "2026-01-04T06:35:00+01:00",
    "verification_pass_rate_7d": 0.95,
    "failures_requiring_rollback": 0
  },
  "confidence_trends": {
    "disk-cleanup": {
      "current": 0.97,
      "trend": "increasing",
      "last_updated": "2026-01-03T06:30:00+01:00"
    },
    "service-restart": {
      "current": 0.89,
      "trend": "stable",
      "last_updated": "2026-01-02T06:30:00+01:00"
    },
    "drift-reconciliation": {
      "current": 0.82,
      "trend": "stable",
      "last_updated": "2026-01-01T06:30:00+01:00"
    }
  },
  "circuit_breaker": {
    "threshold": 3,
    "triggered": false,
    "consecutive_failures": 0
  }
}
```

**decision-log.json** now includes verification results:

```json
{
  "decisions": [
    {
      "id": "decision-042",
      "timestamp": "2026-01-04T06:30:00+01:00",
      "action_type": "service-restart",
      "service": "jellyfin",
      "trigger": "health check failing",
      "confidence": 0.87,
      "risk": "low",
      "decision": "auto-execute",
      "outcome": "success",
      "verification": {
        "status": "VERIFIED",
        "confidence": 95,
        "checks_passed": 6,
        "checks_warned": 1,
        "checks_failed": 0,
        "report_path": "/tmp/verification-jellyfin-1704351000.txt",
        "verification_time_seconds": 24
      },
      "confidence_delta": 5,
      "new_confidence": 0.92,
      "details": "Service restarted successfully, all verification checks passed",
      "duration_seconds": 45
    }
  ]
}
```

### Benefits of Verification Feedback

**1. Learning from Experience**
- Actions that consistently verify successfully → higher confidence → more autonomy
- Actions that fail verification → lower confidence → more caution/user approval

**2. Earlier Problem Detection**
- Verification catches issues immediately after action
- Auto-rollback prevents prolonged outages
- User sees verification reports in decision log

**3. Improved Decision Making**
- Confidence scores reflect actual success rates (not just historical)
- Recent failures appropriately reduce autonomy
- Recent successes appropriately increase autonomy

**4. Auditable Outcomes**
- Every action has verification report
- Can review why action succeeded/failed
- Track verification pass rates over time

### Example: Confidence Evolution

**Week 1: Initial deployment**
```
disk-cleanup confidence: 0.85 (based on historical data)
→ Execute action
→ Verify: 15% disk freed ✓
→ New confidence: 0.90 (+5%)
```

**Week 2: Continued success**
```
disk-cleanup confidence: 0.90
→ Execute action
→ Verify: 12% disk freed ✓
→ New confidence: 0.95 (+5%)
```

**Week 3: First failure**
```
disk-cleanup confidence: 0.95
→ Execute action
→ Verify: 0% disk freed ✗ (logs already rotated elsewhere)
→ Rollback triggered
→ New confidence: 0.85 (-10%)
```

**Week 4: Recovery**
```
disk-cleanup confidence: 0.85
→ Execute action
→ Verify: 8% disk freed ✓
→ New confidence: 0.90 (+5%)
```

**Result:** System learned that disk-cleanup success rate is ~75%, not 100%. Confidence stabilizes at ~0.90, which is more accurate than initial 0.95 assumption.

## Systemd Integration

### Timer: autonomous-check.timer

Runs daily assessment at 06:00 (after backups):

```ini
[Timer]
OnCalendar=*-*-* 06:00:00
RandomizedDelaySec=300
Persistent=true
```

### Service: autonomous-execute.service

Executes approved actions after check completes.

## Query Interface

### Via Claude Code

```
"What autonomous actions were taken this week?"
"Show me pending actions"
"Why didn't the system restart Jellyfin?"
"What's the current autonomous operations status?"
```

### Via Scripts

```bash
# Query decision log
~/containers/.claude/context/scripts/query-decisions.sh --last 7d

# Query pending actions
~/containers/.claude/context/scripts/query-decisions.sh --pending

# Query by action type
~/containers/.claude/context/scripts/query-decisions.sh --type disk-cleanup
```

## Reporting

### Weekly Intelligence Report Integration

The weekly report includes autonomous operations section:

```
## Autonomous Operations (Last 7 Days)

Actions Taken: 8
  - disk-cleanup: 2 (100% success)
  - service-restart: 5 (100% success)
  - drift-reconciliation: 1 (100% success)

Success Rate: 100%
Average Confidence: 91%

Pending Actions: 0
Circuit Breaker: Not triggered
```

### Discord Notifications

- Action execution: Notify on execute (configurable)
- Failures: Always notify
- Weekly summary: Included in intelligence report

## Troubleshooting

### Check Current State

```bash
# View state file
cat ~/.claude/context/autonomous-state.json | jq '.'

# View recent decisions
cat ~/.claude/context/decision-log.json | jq '.decisions[-5:]'

# Check if paused
cat ~/.claude/context/autonomous-state.json | jq '.paused'
```

### Circuit Breaker Triggered

```bash
# Check why
cat ~/.claude/context/autonomous-state.json | jq '.circuit_breaker'

# View failing action
cat ~/.claude/context/decision-log.json | jq '.decisions | map(select(.outcome == "failure")) | .[-3:]'

# Reset manually
~/containers/scripts/autonomous-execute.sh --resume
```

### Action Not Executing

Check in order:
1. Is autonomous operations enabled? (`enabled: true`)
2. Is it paused? (`paused: false`)
3. Is circuit breaker triggered?
4. Is the service in overrides with `auto_restart: false`?
5. Is there a cooldown active?
6. Is confidence below threshold?

## Success Metrics

- **Proactive ratio:** 80%+ issues prevented before impact
- **Success rate:** >95% of autonomous actions succeed
- **MTTR reduction:** Minutes instead of hours
- **Manual interventions:** <10% of issues require human action

---

**This skill orchestrates all homelab components into a self-managing system.**