---
name: agent-onboarding-checklist
description: Use when deploying a new agent into production. Checklist for onboarding agents.
author: Melisia Archimedes
url: https://hivedoctrine.com
mcp: https://hive-doctrine-mcp.vercel.app/mcp
---

---
title: "Agent Onboarding Checklist: Your First 7 Days"
author: Melisia Archimedes
collection: C4 Infrastructure
tier: pollen
price: free
version: 1.0
last_updated: 2026-03-09
audience: agent_operators
hive_doctrine_id: HD-1018
sources_researched: [agent deployment guides, production onboarding playbooks, DevOps best practices, operator community forums]
word_count: 942
---

# Agent Onboarding Checklist: Your First 7 Days

Most people try to build an agent in an afternoon. The good ones take a week. This is your roadmap from blank canvas to production-ready agent: seven days, seven milestones. Each day has a clear objective, a checklist, and a gate you must pass before moving forward.

## The 7-Day Framework

You're not optimising for speed here; you're optimising for **confidence**. By day 7, you'll know whether your agent works, what it costs, where it breaks, and what's next. You'll have 24 hours of staging logs. You'll have failure patterns. You'll have a go/no-go decision backed by data.

The framework splits into three phases:

1. **Foundation (Days 1–3):** Identity, model selection, system prompt, tool access boundaries
2. **Integration (Days 4–5):** MCP connections, tool testing, task validation
3. **Validation (Days 6–7):** Monitoring setup, staging deployment, launch readiness

---

## Day-by-Day Checklist

### Day 1: Define Purpose, Pick Your Model, Set Up API Access

**Objective:** Know what your agent does. Have credentials ready.

- [ ] Write your agent's SOUL.md (what it is, what it's not, non-negotiables)
- [ ] Document the primary use case in one sentence
- [ ] List 3–5 secondary use cases (things it could do but shouldn't)
- [ ] Choose your base model (Claude 3.5 Sonnet? Claude Haiku? GPT-4o? Llama?)
  - Document your reasoning: latency budget, cost per call, accuracy requirements
- [ ] Create API credentials for your chosen provider(s)
- [ ] Test authentication: run a single API call, verify response time and cost
- [ ] Set up a secrets manager or .env file (never hardcode keys)
- [ ] Document your model's context window and cost per 1k tokens
- [ ] Review the model's instruction-following ability for your use case

**Gate:** You have a SOUL.md, API credentials that work, and you've validated a single API call end to end, with its latency and cost recorded.

---

### Day 2: Write System Prompt, Define Tool Access, Set Boundaries

**Objective:** Your agent knows its constraints.

- [ ] Write a detailed system prompt (500–800 words)
  - What the agent is and why it exists
  - What it must not do
  - How it should handle errors, edge cases, user hostility
  - Tone and voice (clinical? friendly? urgent?)
- [ ] Define which tools the agent can call
  - Create a whitelist (not a blacklist)
  - Specify role-based access (can it modify production? Can it delete?)
- [ ] Write tool descriptions: what each tool does, when to use it, what could go wrong
- [ ] Set hard limits:
  - Max tool calls per session
  - Max API cost per interaction
  - Timeout thresholds (e.g., "if a tool takes >30s, fail fast")
  - Rate-limiting rules
- [ ] Design your failure mode playbook:
  - Agent hallucinating tool outputs → How do you detect and stop it?
  - Tool returning null/error → Agent fallback strategy?
  - Budget overrun → Kill switch or graceful degradation?
- [ ] Document your decision log (why you set boundaries this way)

**Gate:** System prompt is written and reviewed. Tool access is defined. Boundaries are enforced in code, not just documentation (a minimal sketch follows below).
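To make the "enforced in code" gate concrete, here is one way to wrap every tool call in a whitelist check plus hard limits. This is a minimal sketch: the names (`SessionLimits`, `run_tool`, the stub `search_docs` tool) and the numeric limits are illustrative assumptions, not part of any particular framework, so adapt them to wherever your agent actually dispatches tool calls.

```python
import time
from typing import Any, Callable, Dict

class BudgetExceeded(Exception):
    """Raised when a session crosses a hard limit."""

class SessionLimits:
    """Per-session hard limits (values here are placeholders, not recommendations)."""
    def __init__(self, max_tool_calls: int = 20, max_cost_usd: float = 0.50,
                 tool_timeout_s: float = 30.0) -> None:
        self.max_tool_calls = max_tool_calls
        self.max_cost_usd = max_cost_usd
        self.tool_timeout_s = tool_timeout_s
        self.calls = 0
        self.cost_usd = 0.0

# Whitelist, not blacklist: only tools named here can ever be called.
ALLOWED_TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "search_docs": lambda query: {"hits": [], "cost_usd": 0.002},  # stub tool for the sketch
}

def run_tool(name: str, args: Dict[str, Any], limits: SessionLimits) -> Dict[str, Any]:
    """Gate a single tool call through the whitelist, call cap, timeout, and cost budget."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool '{name}' is not whitelisted")
    if limits.calls >= limits.max_tool_calls:
        raise BudgetExceeded(f"tool-call limit of {limits.max_tool_calls} reached")
    start = time.monotonic()
    result = ALLOWED_TOOLS[name](**args)
    elapsed = time.monotonic() - start
    if elapsed > limits.tool_timeout_s:
        raise TimeoutError(f"'{name}' took {elapsed:.1f}s; fail fast instead of waiting")
    limits.calls += 1
    limits.cost_usd += result.get("cost_usd", 0.0)
    if limits.cost_usd > limits.max_cost_usd:
        raise BudgetExceeded(f"cost limit of ${limits.max_cost_usd:.2f} exceeded")
    return result
```

On a budget breach you can kill the session or degrade gracefully (for example, answer from existing context without further tool calls); the point is that the limit lives in code, where the agent cannot talk its way around it.

---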
### Day 3: Build the Memory Layer, Choose Context Strategy

**Objective:** Your agent remembers what matters.

- [ ] Choose your memory architecture:
  - **Context window:** Keep everything in the prompt (cheap, simple, limited to 8–200k tokens)
  - **Sliding buffer:** Keep the N most recent interactions plus a fixed episodic summary
  - **Vector store:** Embed all interactions, retrieve relevant context on each call
- [ ] Implement your chosen strategy
- [ ] Test memory retrieval under load (does it find the right context?)
- [ ] Design your summary function (if using buffer/vector):
  - How do you compress a 10-turn conversation into a 2-sentence summary?
  - What information is most valuable to keep?
- [ ] Set memory retention policy:
  - How long do you keep logs? (24 hours? 30 days? Forever?)
  - Do you store personally identifiable data? (Decide before day 1 of production)
- [ ] Implement cost tracking: log memory retrieval cost separately from inference cost

**Gate:** Memory layer is implemented, tested under realistic load, and cost-tracked.

---

### Day 4: Connect Tools via MCP, Test Individually

**Objective:** Your agent's hands work.

- [ ] Set up a Model Context Protocol (MCP) server (or an HTTP tool wrapper)
- [ ] For each tool:
  - [ ] Write the MCP/HTTP schema (inputs, outputs, errors)
  - [ ] Test the tool in isolation (don't call the agent yet)
  - [ ] Document failure modes
  - [ ] Verify timeout behaviour
  - [ ] Check rate-limiting
- [ ] Run a "tool gauntlet" test:
  - Happy path: call each tool with valid inputs
  - Error cases: invalid inputs, rate-limit conditions, timeouts
  - Edge cases: empty results, malformed responses, slow responses (>10s)
- [ ] Log all tool calls: timestamp, input, output, latency, cost
- [ ] Create a tool status dashboard (can you see which tools are slow or expensive?)

**Gate:** All tools pass the gauntlet. You have a tool status dashboard. Zero silent failures.

---

### Day 5: Run 50 Test Tasks, Score Performance

**Objective:** Measure what works and what breaks.

- [ ] Design 50 test tasks covering:
  - Happy path (30 tasks): typical use cases, realistic inputs
  - Error cases (10 tasks): malformed input, missing data, edge cases
  - Boundary cases (10 tasks): maximum complexity, maximum scope, ambiguous requests
- [ ] Run all 50 tasks and log the outputs:
  - Task ID, input, output, latency, cost, success/failure, failure reason
- [ ] Calculate your scorecard (a scoring sketch follows this day's gate):
  - Completion rate: % of tasks that succeeded (target: ≥90%)
  - Accuracy: % of successful tasks that were correct (target: ≥95%)
  - Cost per task: total spend ÷ 50 (budget check)
  - P95 latency: 95th-percentile response time
- [ ] Identify failure patterns:
  - Did certain task types fail more often (e.g., ambiguous requests)?
  - Did certain tools fail more often?
  - Did the agent misuse a tool?
- [ ] Tune the system prompt or tool definitions based on failures

**Gate:** Completion rate ≥85%, accuracy ≥90%, cost within budget, failure patterns documented and addressed.
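If each test task is logged as one JSON object per line, the Day 5 scorecard falls out of a short aggregation script. The sketch below assumes hypothetical field names (`success`, `correct`, `cost_usd`, `latency_s`) and file name; match them to whatever your Day 4 logging actually emits.

```python
import json
import math
from pathlib import Path

def p95(values: list) -> float:
    """95th-percentile value using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]

def scorecard(log_path: str) -> dict:
    """Aggregate a JSONL task log into completion rate, accuracy, cost per task, and P95 latency."""
    lines = Path(log_path).read_text().splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        raise ValueError("no task records found")
    succeeded = [r for r in records if r["success"]]
    correct = [r for r in succeeded if r["correct"]]
    return {
        "tasks": len(records),
        "completion_rate": len(succeeded) / len(records),
        "accuracy": len(correct) / len(succeeded) if succeeded else 0.0,
        "cost_per_task_usd": sum(r["cost_usd"] for r in records) / len(records),
        "p95_latency_s": p95([r["latency_s"] for r in records]),
    }

# Example: print(scorecard("day5_test_tasks.jsonl"))
```

Check the output against the gate (completion ≥85%, accuracy ≥90%, cost within budget) before moving on to Day 6.

---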
### Day 6: Add Monitoring, Set Up Logging and Alerts

**Objective:** You know when your agent is breaking.

- [ ] Set up structured logging (every agent action is logged as JSON)
  - Timestamp, agent ID, task ID, action, result, duration, cost
- [ ] Create monitoring dashboards:
  - Success rate (rolling 1-hour and 24-hour windows)
  - Cost trend (per hour, per day)
  - Latency distribution (p50, p95, p99)
  - Error rate by type (tool failure, timeout, hallucination, budget exceeded)
- [ ] Define alert thresholds:
  - Success rate drops below 85% in the last hour → page on-call
  - Cost per task exceeds budget by 2x → warn
  - P95 latency exceeds threshold → warn
  - Any "hallucination" detected → alert
- [ ] Set up log rotation (logs can grow unbounded)
- [ ] Test your alerting system with a simulated failure

**Gate:** Monitoring is live. You can see success rate, cost, and latency in real time. Alerts are tested and firing.

---

### Day 7: Deploy to Staging, Run 24 Hours, Go/No-Go Decision

**Objective:** Validate in the wild before production.

- [ ] Deploy the agent to a staging environment (not production)
- [ ] Run realistic, production-like traffic for 24 hours
  - Mix of happy path, edge cases, and error conditions
  - Similar load to what you expect in production
- [ ] Monitor continuously:
  - Are alerts firing? Are they useful or noisy?
  - Are logs parsing correctly?
  - Is cost tracking accurate?
- [ ] Review the 24-hour report:
  - Tasks completed: N. Success rate: X%. Accuracy: Y%. Cost: $Z.
  - Any unexpected failures?
  - Any silent failures (success reported but output wrong)?
  - Any performance surprises?
- [ ] Make your go/no-go decision:
  - **GO:** Metrics meet thresholds. Failure modes understood. Ready for production.
  - **NO-GO:** Metrics below threshold. Failure modes unresolved. Return to days 2–5, fix, re-test.

**Gate:** 24-hour staging run complete. Go/no-go decision documented and signed off.

---

## Go/No-Go Criteria

Your agent is **GO** for production if:

- ✅ Success rate ≥85% (tasks completed as intended)
- ✅ Accuracy ≥90% (correct outputs when a task succeeds)
- ✅ Cost per task within budget (or you've justified the overage)
- ✅ P95 latency meets the SLA (latency budget negotiated with stakeholders)
- ✅ Zero untraced failures (all errors logged and understood)
- ✅ Monitoring and alerting verified to work
- ✅ Runbook documented (how to restart, how to page on-call, how to roll back)
- ✅ SOUL.md, system prompt, tool schemas, and failure modes documented

If any box is unchecked, you're **NO-GO**. Fix the failing criterion. Return to the relevant day. Re-test. Re-gate.

---

## What's Next: Days 8–30

Your 7-day checklist gets you to production. Days 8–30 are about **learning and optimisation**. For the full 30-day playbook, covering prompt engineering, tool optimisation, cost reduction, and scaling patterns, see the **Agent Onboarding Playbook: Day 1 to Day 30** (Honey tier).

Days 8–14 focus on early production wins:

- Reducing cost per task by 30%
- Improving accuracy with targeted prompt tuning
- Adding new tools based on real-world failure patterns

Days 15–30 focus on scaling:

- Load testing (can your agent handle 10x traffic?)
- Multi-model strategies (when to use Haiku vs. Sonnet)
- Fine-tuning and caching (advanced optimisations)

Start with this 7-day checklist. Get to production. *Then* optimise.

---

**Last updated:** 2026-03-09 | **Author:** Melisia Archimedes | **Hive Doctrine ID:** HD-1018

---

*From The Hive Doctrine — hivedoctrine.com*
*Browse 116+ products: `claude mcp add --transport http hive-doctrine https://hive-doctrine-mcp.vercel.app/mcp`*
*The field, not the flower.*