# Voice AI Agent Skill Guide

This document teaches any AI coding assistant how to build a voice-enabled agent using Amazon Nova 2 Sonic, Strands Agents SDK, and Amazon Bedrock AgentCore Runtime. It is distilled from the insurance claims FNOL agent in this repository but applies to any domain where a voice interface submits data through an existing API.

The guide is tool-agnostic — it works with Claude Code, Cursor, Kiro, Cline, Windsurf, or any assistant that can read a markdown file.

---

## Quick Start

### Prerequisites

- AWS account with Bedrock model access for `amazon.nova-2-sonic-v1:0`
- Node.js 22.x, AWS CDK 2.235+, Python 3.12+, Docker Desktop
- AWS CLI configured with credentials

### Deploy and Verify

```bash
git clone https://github.com/aws-samples/serverless-eda-insurance-claims-processing.git
cd serverless-eda-insurance-claims-processing
npm install
npm run deploy           # Deploys all stacks including VoiceFnolStack
```

After deployment, the CDK output includes the AgentCore WebSocket endpoint ARN. The React frontend connects to this endpoint with SigV4-signed WebSocket URLs.

To verify the agent is running, check the AgentCore Runtime status in the AWS console under Amazon Bedrock > AgentCore > Runtimes.

---

## Architecture at a Glance

```
Browser (React)
  |  SigV4-signed WebSocket
  v
AgentCore Runtime (managed container hosting)
  |  Docker container (Python 3.13, ARM64)
  v
Strands BidiAgent + Nova 2 Sonic
  |  Tool calls
  v
Customer API (GET)  +  FNOL API (POST)
  |                      |
  v                      v
DynamoDB             EventBridge --> SQS --> Lambda --> IoT Core MQTT
(read policy)        (Claim.Requested)                  (Claim.Accepted/Rejected)
```

The voice agent is a new entry point into an existing event-driven backend. Everything downstream of the FNOL API — fraud detection, settlement, notification — runs unchanged. The agent integrates at the API boundary, not the event bus.

For the full blog post, see: [Extending an event-driven insurance claims application with Voice AI](https://aws.amazon.com/blogs/industries/)

---

## Core Concepts

### BidiAgent and Bidirectional Streaming

Strands `BidiAgent` manages a bidirectional audio stream between the client and Nova 2 Sonic. It accepts async callables for input and output (`websocket.receive_json` / `websocket.send_json`), wires them into its internal event loop, and dispatches tool calls as they arise. The agent is initialized once and reused across sessions — model loading happens at startup, not per connection.

### Nova 2 Sonic: Speech-to-Speech in a Single Pass

Nova 2 Sonic is not a wrapper around separate ASR and TTS services. The model performs speech understanding, reasoning, tool calling, and speech generation in a single bidirectional stream — raw PCM audio in, raw PCM audio out. Tone, hesitation, and emphasis reach the model directly. Barge-in detection is built into the model server-side. Polyglot voices (e.g., "tiffany") support mid-sentence language switching.

### AgentCore Runtime: Serverless Container Hosting

AgentCore Runtime hosts the agent container behind a single WebSocket endpoint. It handles authentication (SigV4), session routing, lifecycle management (5-min idle timeout, 1-hour max), and observability (CloudWatch Logs, X-Ray). Pay-as-you-go pricing charges only for active processing — I/O wait (waiting for Nova 2 Sonic or API responses) incurs no compute charge.

### Tool-Based Design

Each agent capability maps to a `@tool`-decorated function with a bounded responsibility. Tools call existing APIs via SigV4-signed HTTP requests. The agent holds no direct knowledge of databases, event buses, or downstream processing. This makes each tool independently testable.

---

## Build Your Own Voice Agent

### Step 1: Define the Agent

Create the agent with a `BidiNovaSonicModel` and a set of tools. The agent is a singleton — initialize once, reuse across WebSocket sessions.

```python
from strands.experimental.bidi.models.nova_sonic import BidiNovaSonicModel
from strands.experimental.bidi.agent import BidiAgent

def create_agent():
    model = BidiNovaSonicModel(
        model_id="amazon.nova-2-sonic-v1:0",
        client_config={"region": os.environ["AWS_REGION"]},
        provider_config={
            "audio": {
                "input_rate": 16000,   # 16kHz from browser microphone
                "output_rate": 24000,  # 24kHz Nova Sonic synthesis
                "format": "pcm",
                "voice": "tiffany"     # Polyglot voice with code-switching
            }
        }
    )
    return BidiAgent(
        model=model,
        tools=[your_tool_1, your_tool_2, stop_conversation],
        system_prompt=SYSTEM_PROMPT
    )
```

**System prompt rules:**
- Keep it conversational (3-5 sentences for the core persona)
- Use gender-appropriate pronouns for the selected voice
- Provide one-shot conversation examples for complex flows
- Do not use imperatives like "You must call tool X" — let the model decide tool timing
- Include safety-first guidance if the domain requires it (e.g., emergency assessment before data collection)

**Reference:** `lib/services/voice-fnol-agent/app/agent.py`

### Step 2: Build Tools

Tools are Python functions decorated with `@tool`. Each tool returns a dictionary.

**Basic tool:**

```python
from strands.tools import tool

@tool
async def your_lookup_tool(query: str) -> dict:
    """Retrieve data based on query."""
    # Call your API here
    return {"success": True, "data": result}
```

**Tool with context (for user identity):**

Use `@tool(context=True)` when the tool needs the caller's identity or session data. The `invocation_state` dictionary passed to `agent.run()` flows into every context-enabled tool.

```python
from strands import tool, ToolContext

@tool(context=True)
async def get_customer_info(tool_context: ToolContext) -> dict:
    """Retrieve customer information using authenticated identity."""
    cognito_id = tool_context.invocation_state['cognito_identity_id']
    # Use cognito_id to call your API
    return {"success": True, "customer": data}
```

**Tool with inputSchema (critical for Nova Sonic):**

Nova 2 Sonic constructs tool calls from audio, not text. It needs explicit field-level schemas with types and descriptions to map speech to structured parameters. Without `inputSchema`, the model cannot reliably map "it happened on Route 9 in Phoenix" to a nested location object.

```python
@tool(
    inputSchema={
        "type": "object",
        "properties": {
            "incident": {
                "type": "object",
                "description": "Incident details",
                "properties": {
                    "location": {
                        "type": "object",
                        "properties": {
                            "city": {"type": "string", "description": "City name"},
                            "state": {"type": "string", "description": "State abbreviation"},
                            "road": {"type": "string", "description": "Street or road name"}
                        },
                        "required": ["city", "state", "road"]
                    },
                    "description": {
                        "type": "string",
                        "description": "What happened and damage description"
                    }
                }
            }
        },
        "required": ["incident"]
    }
)
async def submit_data(incident: dict) -> dict:
    """Submit structured data to your API."""
    # POST to your endpoint
    return {"success": True}
```

**SigV4 helper for AWS API calls:**

```python
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

def get_sigv4_headers(url, method, region, body=""):
    session = boto3.Session()
    credentials = session.get_credentials()
    request = AWSRequest(method=method, url=url, data=body,
                         headers={"Content-Type": "application/json",
                                  "Host": url.split("/")[2]})
    SigV4Auth(credentials, "execute-api", region).add_auth(request)
    return dict(request.headers)
```

**Reference:** `lib/services/voice-fnol-agent/app/tools/`

### Step 3: Wire the WebSocket Handler

The `BedrockAgentCoreApp` class from the `bedrock_agentcore` package handles the WebSocket lifecycle. The handler wires the agent to the connection.

```python
from bedrock_agentcore import BedrockAgentCoreApp, RequestContext
from app.agent import get_agent

app = BedrockAgentCoreApp()

# Agent is a singleton — initialized once, reused across sessions
agent = get_agent()

@app.websocket
async def websocket_handler(websocket, context: RequestContext):
    # Extract custom headers (AgentCore lowercases them)
    cognito_identity_id = context.request_headers.get(
        'x-amzn-bedrock-agentcore-runtime-custom-cognitoidentityid')

    await websocket.accept()
    try:
        await agent.run(
            inputs=[websocket.receive_json],
            outputs=[websocket.send_json],
            invocation_state={'cognito_identity_id': cognito_identity_id}
        )
    except WebSocketDisconnect as e:
        if getattr(e, 'code', None) != 1000:
            logger.warning(f"Unexpected disconnect: {e}")
    finally:
        await agent.stop()
```

Key points:
- `inputs` and `outputs` accept any async callable — Strands wires them into bidirectional streaming
- `invocation_state` flows to every `@tool(context=True)` decorated tool
- Always call `agent.stop()` in `finally` to clean up resources

**Reference:** `lib/services/voice-fnol-agent/app/app_agentcore.py`

### Step 4: CDK Infrastructure

Two CDK resources define the agent deployment: `CfnRuntime` (the agent) and `CfnRuntimeEndpoint` (the WebSocket endpoint).

```typescript
import * as bedrockagentcore from "aws-cdk-lib/aws-bedrockagentcore";
import * as ecr_assets from "aws-cdk-lib/aws-ecr-assets";

// IAM role — trust principal MUST be bedrock-agentcore.amazonaws.com
const agentRole = new iam.Role(this, "AgentRole", {
  assumedBy: new iam.ServicePrincipal("bedrock-agentcore.amazonaws.com", {
    conditions: {
      StringEquals: { "aws:SourceAccount": account },
      ArnLike: { "aws:SourceArn": `arn:aws:bedrock-agentcore:${region}:${account}:*` }
    }
  })
});

// Docker image — AgentCore REQUIRES ARM64
const dockerImage = new ecr_assets.DockerImageAsset(this, "AgentImage", {
  directory: path.join(__dirname, "../"),
  platform: ecr_assets.Platform.LINUX_ARM64,
});

// AgentCore Runtime
const agentRuntime = new bedrockagentcore.CfnRuntime(this, "AgentRuntime", {
  agentRuntimeName: "my_voice_agent",
  roleArn: agentRole.roleArn,
  networkConfiguration: { networkMode: "PUBLIC" },
  agentRuntimeArtifact: {
    containerConfiguration: { containerUri: dockerImage.imageUri }
  },
  lifecycleConfiguration: {
    idleRuntimeSessionTimeout: 300,  // 5 min idle timeout
    maxLifetime: 3600                // 1 hour max
  },
  requestHeaderConfiguration: {
    requestHeaderAllowlist: [
      "X-Amzn-Bedrock-AgentCore-Runtime-Custom-CognitoIdentityId"
    ]
  }
});

// AgentCore Runtime Endpoint
const agentEndpoint = new bedrockagentcore.CfnRuntimeEndpoint(this, "AgentEndpoint", {
  name: "my_voice_agent_endpoint",
  agentRuntimeId: agentRuntime.ref,
});
agentEndpoint.addDependency(agentRuntime);
```

Key CDK notes:
- Import is `aws-cdk-lib/aws-bedrockagentcore` (not `aws-cdk-lib/aws-bedrock`)
- Trust principal is `bedrock-agentcore.amazonaws.com` (not `bedrock.amazonaws.com`)
- `requestHeaderAllowlist` headers must be prefixed with `X-Amzn-Bedrock-AgentCore-Runtime-Custom-`
- Without `requestHeaderAllowlist`, custom headers are stripped silently at the AgentCore boundary
- Grant `ecr:GetAuthorizationToken` on `*` and `ecr:BatchGetImage` on the repository

**Reference:** `lib/services/voice-fnol-agent/infra/voice-fnol-service.ts`

### Step 5: Frontend Audio

The frontend opens a SigV4-presigned WebSocket connection and streams PCM audio bidirectionally.

**SigV4 presigned WebSocket URL:**

```javascript
const signer = new SignatureV4({
  credentials: {
    accessKeyId: credentials.accessKeyId,
    secretAccessKey: credentials.secretAccessKey,
    sessionToken: credentials.sessionToken
  },
  region: "us-east-1",
  service: "bedrock-agentcore",   // NOT "bedrock"
  sha256: Sha256,
});
const signedRequest = await signer.presign(request, { expiresIn: 300 });
```

**Audio capture (16kHz PCM in):**
- Use `navigator.mediaDevices.getUserMedia()` with `sampleRate: 16000`, `channelCount: 1`
- Enable `echoCancellation` and `noiseSuppression`
- Convert Float32Array to Int16Array (PCM16) before sending

**Audio playback (24kHz PCM out):**
- Create `AudioContext` at 24kHz
- Schedule chunks at `nextPlayTime` to prevent gaps — do not call `source.start()` without a time parameter
- Track `activeSources` array for barge-in cancellation

**Barge-in handling:**
- Listen for `bidi_interruption` message type on the WebSocket
- On interruption: stop all active audio sources, clear the queue, reset `nextPlayTime` to `audioContext.currentTime`

**Reference:** `react-claims/src/utils.js` and `react-claims/src/components/`

---

## Common Pitfalls

### 1. Missing inputSchema on tools

**Symptom:** Nova Sonic fails to call the tool or sends malformed parameters.
**Cause:** Text-based LLMs infer parameter structure from docstrings; a speech model cannot.
**Fix:** Define explicit `inputSchema` with types and descriptions on every tool that accepts structured parameters.

### 2. Creating agent per WebSocket connection

**Symptom:** High latency on every new connection, excessive memory usage.
**Cause:** Model loading and tool registration happen inside the handler instead of at startup.
**Fix:** Initialize the agent once at module level (`agent = create_agent()`), reuse across sessions.

### 3. Audio playback gaps

**Symptom:** Choppy, stuttering audio output.
**Cause:** Calling `source.start()` without scheduling — each chunk plays immediately instead of after the previous one finishes.
**Fix:** Track `nextPlayTime` and schedule each chunk: `source.start(nextPlayTime); nextPlayTime += buffer.duration;`

### 4. Ignoring bidi_interruption events

**Symptom:** Agent audio continues playing after the customer starts speaking.
**Cause:** Frontend does not listen for `bidi_interruption` messages from Nova Sonic.
**Fix:** On interruption, stop all active audio sources, clear the queue, and reset `nextPlayTime`.

### 5. Wrong IAM trust principal

**Symptom:** AgentCore fails to assume the IAM role; Runtime creation fails.
**Cause:** Trust policy uses `bedrock.amazonaws.com` instead of `bedrock-agentcore.amazonaws.com`.
**Fix:** Set the service principal to `bedrock-agentcore.amazonaws.com` with `SourceAccount` and `SourceArn` conditions.

### 6. Custom headers stripped silently

**Symptom:** `tool_context.invocation_state` has no user identity; `get_customer_info` fails.
**Cause:** The header is not listed in `requestHeaderAllowlist` on the CfnRuntime resource.
**Fix:** Add the header name to `requestHeaderAllowlist`. Headers must be prefixed with `X-Amzn-Bedrock-AgentCore-Runtime-Custom-`.

### 7. Missing invocation_state in agent.run()

**Symptom:** `@tool(context=True)` tools receive empty context, cannot access user identity.
**Cause:** `agent.run()` is called without the `invocation_state` parameter.
**Fix:** Pass `invocation_state={"key": value}` to `agent.run()`.

---

## Adapting This Pattern

To build a voice agent for a different domain:

1. **Keep the skeleton.** The `BidiAgent` → `BedrockAgentCoreApp` → `CfnRuntime` → `CfnRuntimeEndpoint` pattern is domain-independent.

2. **Replace the tools.** Remove `get_customer_info`, `submit_to_fnol_api`, etc. Add tools for your domain — each should call one API endpoint and return a dictionary.

3. **Rewrite the system prompt.** Describe your agent's persona, conversation flow, and safety considerations. Keep it conversational (3-5 sentences for the core persona, then specific guidance).

4. **Define inputSchemas.** For every tool that accepts structured data, write an explicit JSON schema matching your API contract.

5. **Update CDK environment variables.** Replace `FNOL_API_ENDPOINT` and `CUSTOMER_API_ENDPOINT` with your API endpoints. Update IAM policies to grant `execute-api:Invoke` on your specific API Gateway resources.

6. **Adjust frontend.** Update the WebSocket connection URL and any custom headers. The audio capture/playback code is reusable as-is.

The voice infrastructure (AgentCore, Nova Sonic, SigV4 auth, audio streaming) stays identical. Only the tools, system prompt, and API endpoints change.

---

## Using This File With Your Coding Assistant

### Claude Code
Reference this file directly in your prompt, or add to your project's `.claude/` instructions:
```
# In your conversation
@agent-skills/VOICE_AGENT_SKILL.md build me a voice agent for appointment scheduling
```

### Cursor
Use `@agent-skills/VOICE_AGENT_SKILL.md` in chat, or copy the file into `.cursor/rules/` for automatic inclusion.

### Kiro
Copy to `.kiro/steering/voice-agent.md` and add frontmatter:
```yaml
---
inclusion: auto
description: Voice AI agent development guide
tags: [voice-ai, nova-sonic, strands, agentcore]
---
```

### Cline
Use `@file` reference in chat: `@agent-skills/VOICE_AGENT_SKILL.md`. Or add to `.clinerules` for automatic context.

### Windsurf
Open as a tab and `@`-reference in Cascade. Or add to Windsurf Rules for automatic inclusion.

### Generic / Manual
Paste the contents into your assistant's context window, or point it at the file path if it supports file reading.

---

## References

- [Amazon Bedrock AgentCore Runtime documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/agentcore.html)
- [Amazon Nova 2 Sonic model guide](https://docs.aws.amazon.com/nova/latest/userguide/nova-sonic.html)
- [Strands Agents SDK documentation](https://strandsagents.com/)
- [Strands BidiAgent (experimental)](https://strandsagents.com/latest/user-guide/concepts/model-providers/nova-sonic-bidi/)
- [Amazon Bedrock AgentCore pricing](https://aws.amazon.com/bedrock/agentcore/pricing/)
- [Blog: Extending an event-driven insurance claims application with Voice AI](https://aws.amazon.com/blogs/industries/)