--- name: keldron-agent description: GPU monitoring with risk intelligence. Local + cloud fleet monitoring, health tracking, proactive alerts, and AI-powered fleet analytics. version: 2.0.0 emoji: "πŸ”₯" homepage: https://github.com/keldron-ai/keldron-agent metadata: openclaw: requires: bins: - curl - jq anyBins: - docker primaryEnv: "KELDRON_CLOUD_API_KEY" --- # Keldron Agent β€” GPU Monitoring with Risk Intelligence ## 1. Overview Keldron Agent is a vendor-neutral GPU monitoring agent that runs locally and exposes real-time telemetry and risk scores via a Prometheus endpoint. It supports Apple Silicon (M1–M5), NVIDIA consumer GPUs (RTX 3090/4090/5090), NVIDIA datacenter (H100/B200), AMD GPUs, and any Linux machine. **Dual mode:** Use the **local agent** on `localhost:9100` for fast, real-time, single-device queries (works offline). Use **Keldron Cloud** (`https://api.keldron.ai`) for fleet overview, historical telemetry, analytics, and proactive fleet monitoring when an API key is configured. **No sudo required on any platform** for the agent binary. On Linux, Docker may require `sudo` or membership in the `docker` group β€” see [Docker post-install](https://docs.docker.com/engine/install/linux-postinstall/) or rootless Docker if you hit permission errors. Use this skill when the user wants to: - Monitor GPU temperature, power, utilization, or memory - Get risk assessments for their GPU - Track power costs - Set up alerts or watch a fleet (via cloud polling) - Open the real dashboard (local or cloud) - Ask fleet, history, or analytics questions ### Mode detection (run at the start of an interaction) Field names **differ by endpoint** β€” see [Cloud API field names](#cloud-api-field-names-jq-reference) before writing `jq` filters. ```bash # Check 1: Is the local agent running? LOCAL_AGENT=$(curl -sf localhost:9100/healthz 2>/dev/null | jq -r '.status' 2>/dev/null) # Check 2: Is cloud configured? (env β†’ ~/.keldron/credentials from login β†’ YAML) CLOUD_KEY="${KELDRON_CLOUD_API_KEY:-}" CLOUD_ENDPOINT="" if [ -z "$CLOUD_KEY" ] && [ -f ~/.keldron/credentials ] && command -v jq &>/dev/null; then CLOUD_KEY=$(jq -r '.api_key // ""' ~/.keldron/credentials 2>/dev/null) CLOUD_ENDPOINT=$(jq -r '.endpoint // ""' ~/.keldron/credentials 2>/dev/null) fi if [ -z "$CLOUD_KEY" ]; then if command -v yq &>/dev/null; then CLOUD_KEY=$(yq '.cloud.api_key // ""' ~/.config/keldron/keldron-agent.yaml 2>/dev/null) CLOUD_ENDPOINT=$(yq '.cloud.endpoint // ""' ~/.config/keldron/keldron-agent.yaml 2>/dev/null) else CLOUD_KEY=$(grep -A3 'cloud:' ~/.config/keldron/keldron-agent.yaml 2>/dev/null \ | grep 'api_key:' | awk '{print $2}' | tr -d "\"'" | xargs 2>/dev/null) CLOUD_ENDPOINT=$(grep -A3 'cloud:' ~/.config/keldron/keldron-agent.yaml 2>/dev/null \ | grep 'endpoint:' | awk '{print $2}' | tr -d "\"'" | xargs 2>/dev/null) fi fi CLOUD_ENDPOINT="${CLOUD_ENDPOINT:-https://api.keldron.ai}" if [ -z "$CLOUD_KEY" ]; then echo "No cloud API key found. Run keldron-agent login, set KELDRON_CLOUD_API_KEY (non-interactive login or streaming), or add cloud.api_key to ~/.config/keldron/keldron-agent.yaml. Sign up at https://app.keldron.ai" fi # Check 3: Does cloud respond? (only if we have a key) CLOUD_OK="" if [ -n "$CLOUD_KEY" ]; then CLOUD_OK=$(curl -sf "${CLOUD_ENDPOINT}/health" 2>/dev/null | jq -r '.status' 2>/dev/null) fi ``` **Mode priority:** - **Both** (healthy local + cloud) β†’ Cloud for fleet, history, analytics, proactive fleet loops; local for instantaneous single-device metrics. - **Cloud only** β†’ Use cloud for everything that needs the API; mention local agent if they want lower-latency realtime on that machine. - **Local only** β†’ Use `localhost:9100` for realtime; naturally mention cloud for history, fleet, and alerts: *"That needs historical data β€” connect at app.keldron.ai."* - **Neither** β†’ Run the [Auto-setup flow](#3-auto-setup-flow). --- ## 2. Installation Prefer a [GitHub release](https://github.com/keldron-ai/keldron-agent/releases) binary (`keldron-agent`). To build from source, clone the repo and run `make build` (requires Go and Node.js for the full dashboard). ### Mac (Apple Silicon) ```bash curl -sfL https://github.com/keldron-ai/keldron-agent/releases/latest/download/keldron-agent-darwin-arm64 -o keldron-agent chmod +x keldron-agent ``` ### Linux (AMD64) ```bash curl -sfL https://github.com/keldron-ai/keldron-agent/releases/latest/download/keldron-agent-linux-amd64 -o keldron-agent chmod +x keldron-agent ``` ### Linux (ARM64) ```bash curl -sfL https://github.com/keldron-ai/keldron-agent/releases/latest/download/keldron-agent-linux-arm64 -o keldron-agent chmod +x keldron-agent ``` ### Linux (Docker) ```bash docker rm -f keldron-agent 2>/dev/null || true docker run -d --name keldron-agent --restart unless-stopped \ -p 9100:9100 -p 9200:9200 -p 8081:8081 \ -e KELDRON_OUTPUT_PROMETHEUS_HOST=0.0.0.0 \ -e KELDRON_API_HOST=0.0.0.0 \ -e KELDRON_HEALTH_BIND=0.0.0.0:8081 \ ghcr.io/keldron-ai/keldron-agent:latest ``` ### Verify installation ```bash ./keldron-agent --version ``` --- ## 3. Auto-setup flow **Trigger phrases:** "monitor my hardware", "set up monitoring", "install keldron", "get started", "help me set up" ### Step 1: Detect environment ```bash OS=$(uname -s) if [ "$OS" = "Darwin" ]; then CHIP=$(sysctl -n machdep.cpu.brand_string 2>/dev/null || echo "unknown") echo "Detected: macOS with $CHIP" elif command -v nvidia-smi &>/dev/null; then GPU=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1) echo "Detected: Linux with NVIDIA $GPU" else echo "Detected: Linux (generic thermal monitoring)" fi ``` ### Step 2: Check if agent is running ```bash if curl -sf localhost:9100/healthz | jq -e '.status == "healthy"' &>/dev/null; then echo "Keldron agent is already running. Skipping to Step 4." exit 0 fi ``` ### Step 3: Install agent Download the release binary for the OS/arch into the current directory, then start in local mode. Example: ```bash ARCH=$(uname -m) if [ "$OS" = "Darwin" ]; then BINARY="keldron-agent-darwin-arm64" elif [ "$ARCH" = "x86_64" ]; then BINARY="keldron-agent-linux-amd64" else BINARY="keldron-agent-linux-arm64" fi if [ "$OS" = "Darwin" ]; then curl -sfL "https://github.com/keldron-ai/keldron-agent/releases/latest/download/${BINARY}" -o keldron-agent chmod +x keldron-agent ./keldron-agent --local & sleep 3 fi if [ "$OS" = "Linux" ]; then if command -v docker &>/dev/null; then docker rm -f keldron-agent 2>/dev/null || true if ! docker run -d --name keldron-agent --restart unless-stopped \ -p 9100:9100 -p 9200:9200 -p 8081:8081 \ -e KELDRON_OUTPUT_PROMETHEUS_HOST=0.0.0.0 \ -e KELDRON_API_HOST=0.0.0.0 \ -e KELDRON_HEALTH_BIND=0.0.0.0:8081 \ ghcr.io/keldron-ai/keldron-agent:latest; then echo "Error: Failed to start keldron-agent container. Check Docker permissions and network." exit 1 fi else curl -sfL "https://github.com/keldron-ai/keldron-agent/releases/latest/download/${BINARY}" -o keldron-agent chmod +x keldron-agent ./keldron-agent --local & fi sleep 3 fi curl -sf localhost:9100/healthz | jq -e '.status == "healthy"' ``` Report initial readings (temperature, utilization, risk score) from [Quick status queries](#8-quick-status-queries-local-real-time) once healthy. ### Step 4: Offer cloud connection Guide the user conversationally: 1. **Account:** Ask: *Do you have a Keldron Cloud account? You can sign up free at https://app.keldron.ai* 2. **If yes:** *Run `keldron-agent login` β€” you can use email/password or paste your API key (option 2 in the menu).* 3. **If no:** *Sign up at https://app.keldron.ai (GitHub login available), then run `keldron-agent login`.* 4. **Verify:** *Run `keldron-agent whoami` to confirm you're connected.* 5. **Restart** the agent so it picks up credentials and begins streaming. If they are not yet interested, summarize value: ```bash if [ -n "$CLOUD_KEY" ]; then echo "Cloud already connected." else echo "Want to connect to Keldron Cloud?" echo " β€’ 180-day telemetry history" echo " β€’ Fleet analytics and device comparison" echo " β€’ Device health tracking" echo " β€’ Proactive fleet alerts" echo " β€’ Dashboard at app.keldron.ai" fi ``` ### Step 5 optional: env or YAML without CLI login **Prefer [Step 4](#step-4-offer-cloud-connection) (`keldron-agent login`) for normal setup.** Use this path when the user cannot run the interactive CLI (CI, containers without TTY) or explicitly wants config-file or env-only configuration. For non-interactive login, set `KELDRON_CLOUD_API_KEY` or pipe the key via stdin: `echo "$KEY" | keldron-agent login`. When setting a key starting with `kldn_`, use a temporary environment variable β€” do not echo or paste the full key into commands or transcripts (see [Rules](#13-rules)). ```bash # Store the user-provided key in a variable (do not inline the raw key) export CLOUD_KEY="" mkdir -p ~/.config/keldron # Add or update cloud config in agent YAML if [ -f ~/.config/keldron/keldron-agent.yaml ]; then if ! grep -q 'cloud:' ~/.config/keldron/keldron-agent.yaml; then cat >> ~/.config/keldron/keldron-agent.yaml << EOF cloud: enabled: true api_key: $CLOUD_KEY EOF else # cloud: section exists β€” update or insert api_key under the cloud: block awk -v key="$CLOUD_KEY" ' /^cloud:/ { in_cloud=1; found=0; print; next } in_cloud && /^[^ ]/ { if (!found) { print " api_key: " key; found=1 } in_cloud=0 } in_cloud && /^[[:space:]]+api_key:/ { $0=" api_key: " key; found=1 } { print } END { if (in_cloud && !found) print " api_key: " key } ' ~/.config/keldron/keldron-agent.yaml > ~/.config/keldron/keldron-agent.yaml.tmp \ && mv ~/.config/keldron/keldron-agent.yaml.tmp ~/.config/keldron/keldron-agent.yaml fi else cat > ~/.config/keldron/keldron-agent.yaml << EOF cloud: enabled: true api_key: $CLOUD_KEY EOF fi # Restart agent pkill -f './keldron-agent' sleep 2 ./keldron-agent --local & sleep 5 # Verify cloud connection (optional β€” uses env var, not raw key) curl -sf "${CLOUD_ENDPOINT}/v1/fleet/overview" \ -H "X-API-Key: $CLOUD_KEY" | jq '.total_devices' ``` Tell the user: *Cloud connected. Your device is streaming to Keldron Cloud. Dashboard: https://app.keldron.ai* --- ## 4. Cloud connection - **CLI (primary):** `keldron-agent login` β€” stores credentials under `~/.keldron/credentials`. For non-interactive use: set `KELDRON_CLOUD_API_KEY` or pipe via stdin. - **Environment variable:** `KELDRON_CLOUD_API_KEY` is also read when running the agent for cloud streaming (in addition to non-interactive login). - **Config file (alternative):** `~/.config/keldron/keldron-agent.yaml` under `cloud.api_key` (see [Step 5 optional](#step-5-optional-env-or-yaml-without-cli-login) for automation-only edits). - **HTTP header** for API calls: `X-API-Key: `. - **Base URL:** `https://api.keldron.ai` Never store or paste a full API key into this skill file. When confirming configuration, show at most the first 8 characters, e.g. `kldn_liv…`. ### Interaction patterns (cloud) | User says | Skill response | |-----------|------------------| | "connect to cloud" / "set up cloud" | Guide through `keldron-agent login` (email/password or paste API key); sign up at app.keldron.ai if needed. | | "am I connected to cloud?" | Have them run `keldron-agent whoami` (or use [Check 2](#mode-detection-run-at-the-start-of-an-interaction) for `CLOUD_KEY`). | | "log out of cloud" / "disconnect" | Run `keldron-agent logout`; note the agent falls back to local-only unless `KELDRON_CLOUD_API_KEY` or YAML still sets a key. | | "how do I get my API key?" | Sign in at app.keldron.ai β€” your API key is shown in the app. Or run `keldron-agent login` to authenticate without manually copying into YAML. | --- ## 5. Running the agent Start the agent in local mode: ```bash ./keldron-agent --local ``` The agent auto-detects hardware. Basic use needs no config. Verify it is running: ```bash curl -sf localhost:9100/healthz | jq -e '.status == "healthy"' ``` A non-zero exit means the agent is not healthy or not running. Metrics: ```bash curl -s localhost:9100/metrics | grep keldron_ ``` --- ## 6. Endpoints ### Local agent (localhost) | Port | Path | Description | |------|------|-------------| | 9100 | `/metrics` | Prometheus metrics (all `keldron_*` gauges) | | 9100 | `/healthz` | Liveness check (JSON) | | 9200 | `/` | Local web dashboard (embedded UI) | | 9100 | `/api/v1/status` | Agent version, device name, active adapters | ### Cloud API reference All cloud routes use TLS, base URL `$CLOUD_ENDPOINT` (defaults to `https://api.keldron.ai`), header `X-API-Key: $CLOUD_KEY`. | Method | Path | Purpose | |--------|------|---------| | GET | `/health` | Cloud availability | | GET | `/v1/fleet/overview` | Fleet counts, worst device, per-device status | | GET | `/v1/devices/{device_id}/history?window=12h` | Historical points (window: `12h`, `24h`, …) | | GET | `/v1/devices/{device_id}/health?window=24` | Device health (**window = integer hours**) | | GET | `/v1/analytics/fleet-health?window=7d` | Fleet analytics + `ai_brief` (window: `7d`, `24h`, …) | ### Dashboards - **Cloud connected:** Fleet: `https://app.keldron.ai/fleet` β€” Device: `https://app.keldron.ai/device/{device_id}` β€” Analytics: `https://app.keldron.ai/analytics` - **Local only:** `http://localhost:9200` β€” For fleet analytics and history, connect at `https://app.keldron.ai` When the user asks for a **dashboard**, **link** these URLs. Do not render ASCII art or fake terminal dashboards. ### Cloud API field names (jq reference) **Do not assume names match across endpoints.** | Concept | Fleet overview | History (points) | Device status (e.g. nested `current`) | Health | |---------|----------------|-------------------|--------------------------------------|--------| | Temperature | `temperature_primary` | `temperature` | `temperature_primary` | `idle_temp_c`, `peak_temp_c` | | Power | `power_draw` | `power` | `power_draw` | β€” | | Composite score | `composite_risk_score` | `composite_score` | `composite_risk_score` | β€” | | Severity | `severity_band` | `severity` | `severity_band` | β€” | | Thermal sub | β€” | `thermal_sub` | `thermal_sub_score` | β€” | | Efficiency | β€” | β€” | β€” | `perf_per_watt.value` (not `perf_per_watt.perf_per_watt`) | **Window formats** - History: string with unit β†’ `?window=12h`, `?window=24h` - Health: integer hours β†’ `?window=24` - Analytics: string with unit β†’ `?window=7d`, `?window=24h` **Health endpoint:** `thermal_stability` may be **omitted** when unavailable (not `null`). Use jq that tolerates a missing key. Use `idle_temp_c` / `peak_temp_c` β€” not `idle_baseline` / `peak_temp`. --- ## 7. Key metrics reference | Metric | Description | |--------|-------------| | `keldron_gpu_temperature_celsius` | GPU temperature in Celsius | | `keldron_risk_severity` | 0=normal, 1=warning, 2=critical | | `keldron_risk_composite` | Composite risk score (0–100) | | `keldron_risk_thermal` | Thermal risk score | | `keldron_risk_power` | Power risk score | | `keldron_risk_volatility` | Volatility risk score | | `keldron_risk_memory` | Memory-related risk score | | `keldron_power_cost_monthly` | Estimated power cost per month ($) | | `keldron_power_cost_daily` | Estimated power cost per day ($) | | `keldron_power_cost_hourly` | Estimated power cost per hour ($) | | `keldron_gpu_power_watts` | GPU power draw in watts | | `keldron_gpu_utilization_ratio` | GPU utilization 0–1 | | `keldron_gpu_memory_used_bytes` | GPU memory in use | | `keldron_gpu_memory_total_bytes` | GPU memory total | | `keldron_gpu_memory_pressure_ratio` | Memory pressure 0–1 | | `keldron_gpu_throttle_active` | 1 if throttled, 0 otherwise | | `keldron_system_swap_used_bytes` | System swap in use | | `keldron_agent_info` | Agent metadata (`device_model`, `device_name` labels) | --- ## 8. Quick status queries (local, real-time) ### "What's my GPU temperature?" ```bash curl -s localhost:9100/metrics | grep 'keldron_gpu_temperature_celsius{' | awk '{print $2}' ``` Extract the `device_model` label from the metric line. Report as: *Your {device_model} is at {value}Β°C.* ### "Is my GPU at risk?" ```bash curl -s localhost:9100/metrics | grep -E 'keldron_risk_(composite|severity|thermal|power|volatility|memory)' | grep -v '^#' ``` Parse `keldron_risk_composite` (0–100) and `keldron_risk_severity` (0=normal, 1=warning, 2=critical). Report composite, severity, and which sub-score is highest. Assessment thresholds: | Score | Assessment | |-------|------------| | <30 | Looking good | | 30–60 | Moderate β€” keep an eye on it | | 60–80 | Warning β€” consider reducing load | | >80 | Critical β€” take action now | ### "Give me a quick status" ```bash curl -s localhost:9100/metrics | grep -E 'keldron_(gpu_temperature|gpu_utilization|risk_composite|risk_severity|power_cost_monthly|gpu_memory_pressure)' | grep -v '^#' ``` Format: ```text 🌑️ Temperature: XXΒ°C ⚑ Utilization: XX% 🎯 Risk Score: XX/100 (severity) πŸ’° Monthly cost: $X.XX 🧠 Memory pressure: XX% ``` ### "What GPU do I have?" ```bash curl -s localhost:9100/metrics | grep 'keldron_agent_info' ``` Extract `device_model` and `device_name` from labels. Report: *You're running a {device_model} ({device_name}).* ### "How much is my GPU costing me?" ```bash curl -s localhost:9100/metrics | grep 'keldron_power_cost' | grep -v '^#' ``` Report hourly, daily, and monthly from `keldron_power_cost_*`. ### "How's my memory?" ```bash curl -s localhost:9100/metrics | grep -E 'keldron_gpu_memory|keldron_system_swap' | grep -v '^#' ``` Derive pressure from `keldron_gpu_memory_used_bytes` / `keldron_gpu_memory_total_bytes`. On Apple Silicon, high swap means the workload likely exceeds unified memory β€” suggest a smaller or quantized model. --- ## 9. Cloud queries If cloud is not configured or unreachable, say: *Fleet monitoring requires Keldron Cloud. Sign up at app.keldron.ai to see your whole fleet from anywhere.* For history-only questions without cloud: *I can only see real-time data locally. Connect to Keldron Cloud for historical queries β€” sign up at app.keldron.ai.* ### "How are my machines doing?" / Fleet overview Use **fleet overview** field names (`composite_risk_score`, `severity_band`, `temperature_primary`, …). ```bash FLEET=$(curl -s "${CLOUD_ENDPOINT}/v1/fleet/overview" \ -H "X-API-Key: $CLOUD_KEY") TOTAL=$(echo "$FLEET" | jq '.total_devices') NORMAL=$(echo "$FLEET" | jq '.devices_normal') WARNING=$(echo "$FLEET" | jq '.devices_warning') CRITICAL=$(echo "$FLEET" | jq '.devices_critical') WORST=$(echo "$FLEET" | jq -r '.worst_device_id') WORST_SCORE=$(echo "$FLEET" | jq '.worst_score | floor') ``` **Response style:** - All healthy: *All {total} devices healthy. Highest risk is {worst} at {score}. Fleet looks good.* - Any warning: *Heads up β€” {device} is at Warning (score {score}). {Details}. Other {healthy_count} devices are normal.* Include dashboard: `https://app.keldron.ai/fleet` - Any critical: *{device} is Critical (score {score}). Immediate attention needed.* Link `https://app.keldron.ai/fleet` and the device URL if known. ### "Show me a dashboard" - Cloud: *Your fleet dashboard: https://app.keldron.ai/fleet β€” Analytics: https://app.keldron.ai/analytics β€” Device detail: https://app.keldron.ai/device/{device_id}* - Local only: *Your local dashboard: http://localhost:9200 β€” Connect to Keldron Cloud at app.keldron.ai for fleet analytics and history.* ### "What happened overnight?" / "What happened while I was away?" History points use `temperature`, `composite_score`, `severity` (not the fleet overview names). ```bash curl -s "${CLOUD_ENDPOINT}/v1/devices/${DEVICE_ID}/history?window=12h" \ -H "X-API-Key: $CLOUD_KEY" | jq '{ points: .points | length, max_temp: [.points[].temperature] | max, min_temp: [.points[].temperature] | min, max_score: [.points[].composite_score] | max, any_warning: [.points[] | select(.severity != "normal")] | length }' ``` Interpret: quiet night vs events; cite min/max temp and whether any non-normal severities occurred. ### "Watch for an hour and report" (history) ```bash curl -s "${CLOUD_ENDPOINT}/v1/devices/${DEVICE_ID}/history?window=1h" \ -H "X-API-Key: $CLOUD_KEY" | jq '{ points: .points | length, max_temp: [.points[].temperature] | max, min_temp: [.points[].temperature] | min, avg_temp: (if (.points | length) > 0 then ([.points[].temperature] | add / length) else null end), max_score: [.points[].composite_score] | max, warnings: [.points[] | select(.severity != "normal")] | length }' ``` ### "Compare my devices" / "Which device is the worst?" ```bash curl -s "${CLOUD_ENDPOINT}/v1/analytics/fleet-health?window=7d" \ -H "X-API-Key: $CLOUD_KEY" | jq '{ flagged: .total_flagged, ai_brief: .ai_brief, devices: [.devices[] | { name: .hostname, score: .current.composite_risk_score, recovery: .current.recovery_seconds, temp: .current.temperature_primary, flags: [.flags[].message] }] }' ``` Report `ai_brief` when present β€” it is a natural-language summary. Link `https://app.keldron.ai/analytics`. ### "How's {device} trending?" / Device health `window` is **integer hours**. Use `idle_temp_c`, `peak_temp_c`, `perf_per_watt.value`. Handle missing `thermal_stability`. ```bash DEVICE_ENCODED=$(echo "$DEVICE_ID" | jq -sRr @uri) curl -s "${CLOUD_ENDPOINT}/v1/devices/${DEVICE_ENCODED}/health?window=24" \ -H "X-API-Key: $CLOUD_KEY" | jq '{ idle_temp_c: .idle_temp_c, peak_temp_c: .peak_temp_c, perf_per_watt: .perf_per_watt.value, thermal_stability: (if has("thermal_stability") then .thermal_stability else null end) }' ``` Report in plain language using idle/peak temps and efficiency. If cloud is down, fall back to local realtime metrics and avoid dumping raw errors. --- ## 10. Proactive fleet monitoring Use **cloud** polling β€” not `localhost:9100` loops β€” for "watch my fleet" / "alert me if anything changes". ### "Watch my fleet" / "Alert me if anything changes" Set `MAX_CHECKS` to limit the total number of iterations, including failed API calls (e.g., `MAX_CHECKS=60` for ~1 hour). Leave unset or `0` for unlimited. ```bash echo "Fleet monitoring active. Checking every 60 seconds via Keldron Cloud." echo "Press Ctrl+C to stop." trap 'echo ""; echo "Fleet monitoring stopped by user."; exit 0' INT PREV_WORST="" PREV_WORST_SCORE=0 CHECK_COUNT=0 MAX_CHECKS="${MAX_CHECKS:-0}" while true; do CHECK_COUNT=$((CHECK_COUNT + 1)) FLEET=$(curl -s "${CLOUD_ENDPOINT}/v1/fleet/overview" \ -H "X-API-Key: $CLOUD_KEY" 2>/dev/null) if [ -z "$FLEET" ]; then if [ "$MAX_CHECKS" -gt 0 ] && [ "$CHECK_COUNT" -ge "$MAX_CHECKS" ]; then echo "Reached $MAX_CHECKS checks. Fleet monitoring stopped by timeout." break fi sleep 60 continue fi WARNING=$(echo "$FLEET" | jq '.devices_warning // 0') CRITICAL=$(echo "$FLEET" | jq '.devices_critical // 0') WORST=$(echo "$FLEET" | jq -r '.worst_device_id') WORST_SCORE=$(echo "$FLEET" | jq '.worst_score // 0 | floor') if [ "${CRITICAL:-0}" -gt 0 ]; then DEVICE_INFO=$(echo "$FLEET" | jq -r '.devices[] | select(.severity_band == "critical") | "\(.hostname): score \((.composite_risk_score // 0) | floor), temp \((.temperature_primary // 0) | floor)Β°C"') echo "CRITICAL: $DEVICE_INFO" break fi if [ "${WARNING:-0}" -gt 0 ]; then DEVICE_INFO=$(echo "$FLEET" | jq -r '.devices[] | select(.severity_band == "warning") | "\(.hostname): score \((.composite_risk_score // 0) | floor), temp \((.temperature_primary // 0) | floor)Β°C"') echo "WARNING: $DEVICE_INFO" fi if [ -n "$PREV_WORST" ] && [ "$WORST" = "$PREV_WORST" ]; then SCORE_DELTA=$((WORST_SCORE - PREV_WORST_SCORE)) if [ "$SCORE_DELTA" -gt 20 ]; then echo "SPIKE: $WORST jumped from $PREV_WORST_SCORE to $WORST_SCORE" fi fi PREV_WORST="$WORST" PREV_WORST_SCORE=$WORST_SCORE if [ "$MAX_CHECKS" -gt 0 ] && [ "$CHECK_COUNT" -ge "$MAX_CHECKS" ]; then echo "Reached $MAX_CHECKS checks. Fleet monitoring stopped by timeout." break fi sleep 60 done ``` ### "Give me a morning report" ```bash ANALYTICS=$(curl -s "${CLOUD_ENDPOINT}/v1/analytics/fleet-health?window=24h" \ -H "X-API-Key: $CLOUD_KEY") FLEET=$(curl -s "${CLOUD_ENDPOINT}/v1/fleet/overview" \ -H "X-API-Key: $CLOUD_KEY") TOTAL=$(echo "$FLEET" | jq '.total_devices') NORMAL=$(echo "$FLEET" | jq '.devices_normal') AI_BRIEF=$(echo "$ANALYTICS" | jq -r '.ai_brief') FLAGGED=$(echo "$ANALYTICS" | jq '.total_flagged') echo "Morning Fleet Report" echo "--------------------" echo "Fleet: $TOTAL devices, $NORMAL healthy" echo "" echo "$AI_BRIEF" echo "" if [ "$FLAGGED" -gt 0 ]; then echo "$FLAGGED device(s) flagged β€” https://app.keldron.ai/analytics" else echo "All devices healthy." fi echo "" echo "Dashboard: https://app.keldron.ai" ``` --- ## 11. Configuration & management ### Changing config If the user needs to change settings (e.g. electricity rate), tell them: **Edit `~/.config/keldron/keldron-agent.yaml` β€” the agent picks up changes on restart.** Avoid complex sed one-liners for YAML edits; prefer manual edits, YAML-aware tools like `yq`, or scoped awk scripts. ### Multi-device / fleet All devices that stream to the same cloud account appear in the fleet automatically. ### Stop monitoring ```bash pkill -f './keldron-agent' ``` Confirm: *Agent stopped. GPU monitoring is off.* ### Restart the agent ```bash pkill -f './keldron-agent' sleep 2 ./keldron-agent --local & sleep 3 curl -s localhost:9100/healthz ``` Report the `healthz` response to confirm it is up. --- ## 12. Proactive behavior - After setup, suggest cloud and dashboards: *Your agent is running. Ask how your fleet is doing, open https://app.keldron.ai/fleet, or use local metrics on this machine.* - For **alert** / **watch** on a **fleet**, run the cloud polling loop in [Β§10](#10-proactive-fleet-monitoring); for single-machine **realtime** questions, use [Β§8](#8-quick-status-queries-local-real-time). --- ## 13. Rules - **Always check agent health first.** Before any local query, verify: `curl -sf localhost:9100/healthz | jq -e '.status == "healthy"'`. Non-zero exit = agent down. Offer to start it or guide setup. - **Auto-detect mode.** Check for cloud API key: `KELDRON_CLOUD_API_KEY`, then `~/.keldron/credentials` (from `keldron-agent login`), then `cloud.api_key` in `~/.config/keldron/keldron-agent.yaml`. Optionally confirm with `keldron-agent whoami`. Use cloud for fleet, history, analytics; local for realtime single-device. - **Guide setup, don't just explain.** When the agent or cloud is not configured, walk through steps and run commands. - **If metrics return 0 for temperature, the agent may still be warming up.** Wait 30 seconds and retry once. - **Always include severity assessment.** Report severity (normal/warning/critical) alongside numeric scores where applicable. - **Cloud upsell is helpful, not pushy.** When a question needs cloud (history, fleet comparison), mention it naturally: *That needs historical data β€” connect at app.keldron.ai.* - **Proactive monitoring uses the cloud API.** Background poll loops use `api.keldron.ai`, not localhost metrics. - **Link to dashboards.** Include relevant links: `app.keldron.ai/fleet`, `app.keldron.ai/analytics`, or `http://localhost:9200` for local. - **API key security.** Never echo the full API key. Show only the first 8 characters: *Configured with key kldn_liv…* - **Fallback gracefully.** Cloud unreachable β†’ fall back to local. Local unreachable β†’ guide setup. Never show raw errors to the user. - **On Apple Silicon, high swap = model too large.** If swap is high, suggest a smaller or quantized model. - **The agent never requires sudo.** Runs unprivileged on all platforms. Docker on Linux may need sudo or docker-group. - **Use metric labels.** Extract `device_model` and `device_name` from Prometheus labels for personalized responses. - **When the user says "alert me" or "watch" (fleet),** set up the cloud polling loop and execute it. - **When the user says "dashboard",** link to the real dashboard. Do not render ASCII art.