---
name: investigate-stuck-messages
description: Investigate stuck messages in relayer queue. Use when alerts mention "queue length > 0", to diagnose why messages are stuck, or to get message IDs for denylisting.
---

# Investigate Stuck Messages

Query the relayer API to investigate stuck messages, their retry counts, and error reasons.

## When to Use

1. **Alert-based triggers:**
   - Alert: "Known app context relayer queue length > 0 for 40m"
   - Any alert mentioning stuck messages in prepare queue
   - High retry counts for specific app contexts

2. **User request triggers:**
   - "Why are messages stuck for [app_context]?"
   - "Investigate stuck messages on [chain]"
   - "What's causing the queue alert?"
   - Pasting a Grafana alert URL

## Input Parameters

**Option 1: Grafana Alert URL (recommended)**

```
/investigate-stuck-messages https://abacusworks.grafana.net/alerting/grafana/cdg1ro5hi4vswb/view?tab=instances
```

**Option 2: Manual specification**

```
/investigate-stuck-messages app_context=EZETH/renzo-prod remote=linea
```

| Parameter     | Required | Default    | Description                                                           |
| ------------- | -------- | ---------- | --------------------------------------------------------------------- |
| `alert_url`   | No       | -          | Grafana alert URL (extracts app_context/remote from firing instances) |
| `app_context` | No\*     | -          | The app context (e.g., `EZETH/renzo-prod`, `oUSDT/production`)        |
| `remote`      | No\*     | -          | Destination chain name (e.g., `linea`, `ethereum`, `arbitrum`)        |
| `environment` | No       | `mainnet3` | Deployment environment                                                |

\*Either `alert_url` OR both `app_context` and `remote` must be provided.

## Workflow

### Step 1: Parse Input and Extract Alert Instances

**If Grafana alert URL provided:**

1. Extract the alert UID from the URL (e.g., `cdg1ro5hi4vswb` from `.../alerting/grafana/cdg1ro5hi4vswb/view`)
2. Query Prometheus directly for firing instances using `mcp__grafana__query_prometheus`:

   ```
   sum by (app_context, remote)(
       max_over_time(
           hyperlane_submitter_queue_length{
               queue_name="prepare_queue",
               app_context!~"Unknown|merkly_eth|merkly_erc20|helloworld|velo_message_module",
               hyperlane_context!~"rc|vanguard0|vanguard1|vanguard2|vanguard3|vanguard4|vanguard5",
               operation_status!~"Retry\\(ApplicationReport\\(.*\\)\\)|FirstPrepareAttempt",
               hyperlane_deployment="mainnet3",
           }[2m]
       )
   ) > 0
   ```

3. Extract `app_context` and `remote` labels from each result.

**If manual app_context/remote provided:**
Use the provided values directly.

### Step 2: Setup Port-Forward to Relayer

Check if port 9090 is already in use:

```bash
lsof -i :9090
```

If not in use, start port-forward in background:

```bash
kubectl port-forward omniscient-relayer-hyperlane-agent-relayer-0 9090 -n mainnet3 &
```

Wait a few seconds for the port-forward to establish.

### Step 3: Get Domain IDs for Chains

Look up domain IDs from the registry:

```bash
cat node_modules/.pnpm/@hyperlane-xyz+registry@*/node_modules/@hyperlane-xyz/registry/dist/chains//metadata.json | jq '.domainId'
```

Common domain IDs:

- ethereum: 1
- optimism: 10
- arbitrum: 42161
- polygon: 137
- base: 8453
- unichain: 130
- avalanche: 43114

### Step 4: Query Relayer API

For each destination chain, query the relayer API:

```bash
curl -s 'http://localhost:9090/list_operations?destination_domain=' > /tmp/.json
```

The response contains operations with:

- `id`: Message ID (H256)
- `operation.message.sender`: Sender address
- `operation.message.recipient`: Recipient address
- `operation.num_retries`: Number of retries (higher = more stuck)
- `operation.status`: Error status (e.g., `{"Retry": "ErrorEstimatingGas"}`)
- `operation.message.origin`: Origin domain ID
- `operation.message.destination`: Destination domain ID
- `operation.app_context`: App context name

### Step 5: Filter Messages by App Context

Look up the `app_context` in
`rust/main/app-contexts/mainnet_config.json`:

```bash
jq '.metricAppContexts[] | select(.name == "")' rust/main/app-contexts/mainnet_config.json
```

Filter API results to only include messages where:

- `operation.message.recipient` matches one of the `recipientAddress` values for that destination domain

**Important**: Addresses are padded to 32 bytes (H256 format).

### Step 6: Query GCP Logs for Actual Errors

**Calculate log freshness based on retry count:**

The relayer uses exponential backoff (see `calculate_msg_backoff` in `rust/main/agents/relayer/src/msg/pending_message.rs`):

| Retries | Backoff/retry | Cumulative Time | Freshness Flag    |
| ------- | ------------- | --------------- | ----------------- |
| 1-4     | 5s-1min       | ~2min           | `--freshness=1h`  |
| 5-24    | 3min          | ~1h             | `--freshness=3h`  |
| 25-39   | 5-26min       | ~5h             | `--freshness=12h` |
| 40-49   | 30min-1h      | ~12h            | `--freshness=24h` |
| 50-60   | 2-22h         | ~35h            | `--freshness=3d`  |
| 60+     | 22h+          | 35h+            | `--freshness=7d`  |

For each message ID, query GCP logs with calculated freshness:

```bash
gcloud logging read 'resource.type=k8s_container AND resource.labels.namespace_name=mainnet3 AND resource.labels.pod_name:omniscient-relayer AND jsonPayload.span.id: AND jsonPayload.fields.error:*' --project=abacus-labs-dev --limit=1 --format='value(jsonPayload.fields.error)' --freshness=
```

Extract the human-readable error from the response using `sed` (macOS compatible):

```bash
echo "$raw_error" | sed -n 's/.*execution reverted: \([^"]*\)".*/\1/p' | head -1
```

Common error patterns:

- `"execution reverted: Nonce already used"` → "Nonce already used"
- `"execution reverted: panic: arithmetic underflow"` → "Arithmetic underflow"

**Note**: Do not use `grep -P` as it's not available on macOS.

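
The retry-to-freshness table in Step 6 could be scripted roughly as follows. This is a minimal sketch, not part of the relayer tooling: `freshness_for_retries` is a hypothetical helper name, 0 retries is assumed to behave like the 1-4 row, and the boundary value 60 (which the table lists in both of its last two rows) is assumed to take the wider `7d` window.

```bash
# Map a retry count to a gcloud --freshness value, following the Step 6 table.
# Assumptions (not from the relayer source): 0 retries behaves like the 1-4 row,
# and 60 retries uses the wider 7d window because the table lists 60 twice.
freshness_for_retries() {
  # Reject anything that is not a non-negative integer.
  case "$1" in
    ''|*[!0-9]*)
      echo "retry count must be a non-negative integer: $1" >&2
      return 1
      ;;
  esac

  if [ "$1" -le 4 ]; then
    echo "1h"
  elif [ "$1" -le 24 ]; then
    echo "3h"
  elif [ "$1" -le 39 ]; then
    echo "12h"
  elif [ "$1" -le 49 ]; then
    echo "24h"
  elif [ "$1" -le 59 ]; then
    echo "3d"
  else
    echo "7d"
  fi
}
```

The result can be passed straight to the log query, for example `--freshness="$(freshness_for_retries 47)"`. Erring toward the wider window only makes the log search slower; it does not hide matching entries.
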
### Step 7: Present Investigation Results

Output a detailed summary table with **full message IDs** and **both error sources**:

```
## Investigation Results for [APP_CONTEXT]

### Summary
- Total stuck messages: X
- Destinations affected: [list]
- Reprepare reasons: ErrorEstimatingGas (N), CouldNotFetchMetadata (M)

### Messages
| Message ID | Retries | Reprepare Reason | Error | Origin |
|------------|---------|------------------|-----------|--------|
| `0xaa18ebc1c79345e6d24984a0b9a5ab66c968d128d46b2357b641e56e71b8d30c` | 47 | ErrorEstimatingGas | Nonce already used | optimism |
| `0xd6aeef7c092a88aa23ad53227aeb834ae731d059b3ce749db8451e761f3f15ac` | 47 | ErrorEstimatingGas | Nonce already used | arbitrum |

**Important**: Always show the full 66-character message ID (0x + 64 hex chars). Do not truncate.

### Error Analysis
[Explain based on the actual log errors found]

### Next Steps
To denylist these messages, run:
/denylist-stuck-messages app_context=APP_CONTEXT
```

**Column definitions:**

- **Reprepare Reason**: From `operation.status` in relayer API (e.g., ErrorEstimatingGas, CouldNotFetchMetadata)
- **Error**: Actual revert reason from GCP logs (e.g., "Nonce already used", "Arithmetic underflow")

### Step 8: Output Denylist Command

At the end of the investigation results, output the full denylist command:

```
### Next Steps
To denylist, run:
/denylist-stuck-messages 0xaa18ebc1c79345e6d24984a0b9a5ab66c968d128d46b2357b641e56e71b8d30c 0xd6aeef7c092a88aa23ad53227aeb834ae731d059b3ce749db8451e761f3f15ac app_context=APP_CONTEXT
```

Always use full message IDs, never truncated.

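
The "full message IDs only" rule from Steps 7 and 8 can be enforced mechanically before the denylist command is printed. The sketch below is illustrative only: `is_full_message_id` and `build_denylist_command` are hypothetical helper names, and the output simply mirrors the command format shown in Step 8.

```bash
# Succeeds only for a full message ID: "0x" followed by exactly 64 hex chars.
# Uses grep -E (not grep -P) so it also works on macOS.
is_full_message_id() {
  printf '%s\n' "$1" | grep -Eq '^0x[0-9a-fA-F]{64}$'
}

# Usage: build_denylist_command APP_CONTEXT MESSAGE_ID [MESSAGE_ID...]
# Prints the denylist command, or fails if any ID is truncated or malformed.
build_denylist_command() {
  if [ "$#" -lt 2 ]; then
    echo "usage: build_denylist_command APP_CONTEXT MESSAGE_ID..." >&2
    return 1
  fi

  app_context=$1
  shift

  cmd="/denylist-stuck-messages"
  for id in "$@"; do
    if ! is_full_message_id "$id"; then
      echo "refusing truncated or malformed message ID: $id" >&2
      return 1
    fi
    cmd="$cmd $id"
  done

  echo "$cmd app_context=$app_context"
}
```

Calling `build_denylist_command APP_CONTEXT` with the two IDs from the example table prints the same command as the Step 8 block, while a shortened ID such as `0xaa18` makes the helper fail instead of emitting a command that targets the wrong message.
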
## Error Status Reference

| Status                   | Meaning                                 | Action                                   |
| ------------------------ | --------------------------------------- | ---------------------------------------- |
| `ErrorEstimatingGas`     | Gas estimation failed (contract revert) | Usually denylist - contract won't accept |
| `CouldNotFetchMetadata`  | Can't get ISM metadata                  | Check validators, may resolve itself     |
| `ApplicationReport(...)` | App-specific error                      | Check the specific error message         |
| `GasPaymentNotFound`     | No IGP payment                          | May need manual relay with gas           |

## Error Handling

- **Port-forward fails**: Check kubectl context: `kubectl config current-context`
- **No messages found**: Queue may have cleared; alert may be stale
- **API returns error**: Check relayer pod: `kubectl get pods -n mainnet3 | grep relayer`
- **App context not found**: May be new/custom; ask user for sender/recipient addresses

## Prerequisites

- `kubectl` configured with access to mainnet cluster
- Grafana MCP server connected (for alert URL parsing)