# Agent Fleet Status

> Check the health of multiple machines via SSH. Get a unified dashboard for distributed systems.

## Description

Agent Fleet Status monitors the health of a fleet of machines — servers, Mac Minis, cloud instances, Raspberry Pis, or any SSH-accessible host. It checks uptime, disk, memory, CPU, running processes, cron jobs, and service health, then outputs a unified dashboard. Built for operators running distributed agent swarms, home labs, or multi-server deployments.

## Activation

This skill activates when:
- The user asks about the status of their servers or machines
- The user wants to check if a service is running on a remote host
- The user mentions fleet health, server status, or system monitoring
- The user asks "are my agents running?" or "what's the status of my machines?"

**Trigger phrases**: "fleet status", "check my servers", "machine health", "are my agents running", "server dashboard", "system status", "check uptime", "what's running on"

## Health Check Dimensions

### 1. Connectivity
- Can we reach the host via SSH?
- Latency (ms)
- Last successful connection

### 2. System Resources
| Check | Warning | Critical |
|-------|---------|----------|
| Disk usage | >80% | >95% |
| Memory usage | >85% | >95% |
| CPU load (5min avg) | >70% | >90% |
| Uptime | <1 day (recent reboot) | N/A |
| Swap usage | >50% | >80% |

### 3. Process Health
- Are expected processes running? (by name or PID file)
- Zombie processes count
- Top 5 processes by CPU/memory

### 4. Service Health
- HTTP endpoints responding? (status code + latency)
- Port checks (is the port open and accepting connections?)
- Cron job status (last run time, exit codes from logs)

### 5. Agent-Specific
- OpenClaw gateway status
- Ollama model loaded
- Custom agent processes
- Log file freshness (is the agent producing output?)

## Instructions

When asked to check fleet status, generate commands and/or output in this format:

```
# Fleet Status Dashboard
## Checked at: [timestamp]

---

### [HOSTNAME] — [IP] — [STATUS EMOJI removed: OK / WARN / CRITICAL / OFFLINE]

| Check | Value | Status |
|-------|-------|--------|
| SSH | Connected (XXms) | OK |
| Uptime | XX days | OK |
| Disk | XX% of XXG used | [OK/WARN/CRIT] |
| Memory | XX% of XXG used | [OK/WARN/CRIT] |
| CPU Load | X.XX (5m avg) | [OK/WARN/CRIT] |
| Swap | XX% used | [OK/WARN/CRIT] |

**Running Services**:
- [service]: PID XXXX, up Xh, CPU X%, MEM X%
- [service]: PID XXXX, up Xh, CPU X%, MEM X%

**Cron Jobs**: XX active, last run [time], [X failures in 24h]

**Alerts**:
- [Any warnings or critical issues]

---

### Fleet Summary

| Host | Status | Disk | Mem | CPU | Services | Alerts |
|------|--------|------|-----|-----|----------|--------|
| [host1] | OK | 45% | 62% | 0.8 | 5/5 | 0 |
| [host2] | WARN | 82% | 71% | 1.2 | 4/5 | 1 |
| [host3] | OFFLINE | — | — | — | — | SSH FAIL |
```

## SSH Commands Used

The skill generates and executes these commands per host:

```bash
# Connectivity
ssh -o ConnectTimeout=5 -o BatchMode=yes user@host "echo ok"

# System resources
ssh user@host "
  uptime;
  df -h / | tail -1;
  free -m | grep Mem;
  cat /proc/loadavg;
  swapon --show --bytes 2>/dev/null
"

# Process health
ssh user@host "
  ps aux --sort=-%mem | head -6;
  ps aux | grep -c Z  # zombie count
"

# Service checks
ssh user@host "
  pgrep -la ollama;
  pgrep -la openclaw;
  pgrep -la node;
  systemctl is-active [service] 2>/dev/null
"

# Cron status
ssh user@host "crontab -l 2>/dev/null | grep -v '^#' | wc -l"

# Log freshness
ssh user@host "find /var/log -name '*.log' -mmin -60 | head -5"
```

For macOS hosts, adjust commands:
```bash
# macOS disk
ssh user@host "df -h / | tail -1"

# macOS memory (no free command)
ssh user@host "vm_stat | head -5; sysctl hw.memsize"

# macOS processes
ssh user@host "ps aux -r | head -6"
```

## Fleet Configuration

Define your fleet as a simple list:

```yaml
fleet:
  - name: Omni
    host: localhost
    type: linux
    expected_services: [ollama, node]

  - name: BMO
    host: 192.168.1.98
    user: operator
    type: macos
    expected_services: [openclaw, ollama, node, n8n]

  - name: OCI
    host: 192.168.1.92
    user: macmini
    type: macos
    expected_services: [openclaw]

  - name: SailorsBot1
    host: 192.168.1.99
    user: operator
    type: macos
    expected_services: [repflow]
```

## Alerting Logic

Severity escalation:

1. **INFO**: All systems green. No action needed.
2. **WARN**: One or more hosts have warnings (disk >80%, high memory, service restart detected). Review within 24 hours.
3. **CRITICAL**: A host has critical resource usage or a key service is down. Act within 1 hour.
4. **OFFLINE**: A host is unreachable via SSH. Investigate immediately — could be network, crash, or power.

## Example

### Input
> Check status of BMO (192.168.1.98, operator, macOS) and OCI (192.168.1.92, macmini, macOS).

### Output
*(Generates SSH commands, executes them, and produces the dashboard table showing both machines' disk, memory, CPU, running services, cron job count, and any alerts.)*

---

*Built by [KOINO Capital](https://koino.capital) — Agentic growth systems that run while you sleep.*
*Want this running autonomously 24/7? [Deploy with KOINO](https://koino.capital/deploy)*