There's a massive gap between a demo AI agent and one that can run in production. A demo calls the LLM API and returns a response. A production agent handles error states, manages memory across conversations, controls costs, and knows when to escalate to a human. After building dozens of production agents for fintech, SaaS, and healthcare companies, we've converged on an architecture pattern that consistently produces reliable agents. This guide walks through each layer — with enough technical detail for a CTO to evaluate whether their current approach will survive production traffic.
The Demo-to-Production Gap
Every AI agent demo looks impressive. The LLM is smart enough to handle a scripted scenario. But production traffic exposes every weakness:
- Users ask questions the agent wasn't designed for
- APIs fail, time out, or return unexpected data
- The agent enters infinite loops when it can't resolve a task
- Token costs spike when conversations go long
- The agent confidently gives wrong answers (hallucination)
- No one knows the agent made a mistake until a customer complains
Most 'AI agent' failures aren't model failures — they're architecture failures. The LLM is fine. The system around it is what breaks.
Production Agent Architecture: 5 Layers
A production agent is not a single LLM call. It's a system with five distinct layers, each with its own failure modes and design considerations:
- Orchestrator — the brain that plans and routes
- Tools — the actions the agent can take
- Memory — context management across conversations
- Evaluation — guardrails and quality checks
- Human-in-the-loop — escalation when the agent isn't confident
Each layer can be implemented simply or with sophistication depending on your use case. The key is having all five — skip any layer and you'll hit production issues within the first week.
Layer 1: The Orchestrator
The orchestrator decides what the agent should do next. In a simple agent, this is a single LLM call with a system prompt. In a production agent, it's a state machine with planning capabilities.
Key decisions the orchestrator makes:
- Should I use a tool, ask a clarifying question, or respond directly?
- What information do I need before I can complete this task?
- Am I confident enough to act, or should I escalate?
Implementation options: LangChain/LangGraph for Python-based agents, or custom orchestration with direct API calls for maximum control. We typically use LangGraph for complex multi-step agents and direct API calls for focused single-purpose agents.
Critical design choice: ReAct (interleaved reasoning and acting) vs. plan-then-execute. ReAct is simpler and works well for tasks of one to three steps. Plan-then-execute is better for complex workflows where the agent must coordinate multiple tools in sequence.
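The orchestrator's routing decision can be sketched as a small state machine. This is a minimal illustration, not a LangGraph implementation: the state fields, thresholds, and the stubbed `decide` function are all assumptions, and in production the final `RESPOND` branch would be an LLM routing call.

```python
from dataclasses import dataclass, field
from enum import Enum

class Action(Enum):
    RESPOND = "respond"
    USE_TOOL = "use_tool"
    CLARIFY = "clarify"
    ESCALATE = "escalate"

@dataclass
class AgentState:
    messages: list = field(default_factory=list)
    steps: int = 0
    confidence: float = 1.0

MAX_STEPS = 8               # illustrative hard limit on reasoning steps
CONFIDENCE_THRESHOLD = 0.7  # illustrative escalation threshold

def decide(state: AgentState) -> Action:
    # Deterministic guardrails run before the model gets a vote.
    if state.steps >= MAX_STEPS:
        return Action.ESCALATE
    if state.confidence < CONFIDENCE_THRESHOLD:
        return Action.ESCALATE
    # Placeholder: in a real agent this is an LLM call (or a LangGraph
    # conditional edge) choosing between RESPOND, USE_TOOL, and CLARIFY.
    return Action.RESPOND
```

Keeping the hard limits outside the LLM call matters: the model can be wrong about its own confidence, but it cannot talk its way past a step counter.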
Layer 2: Tool Use
Tools are the actions your agent can take — querying a database, calling an API, sending an email, updating a CRM record. The quality of your tool definitions determines how reliably the agent uses them.
Production tool design principles:
- Clear, unambiguous tool descriptions (the LLM reads these to decide which tool to use)
- Input validation before execution (don't let the LLM pass malformed data to your API)
- Timeout and retry logic for every external call
- Idempotency where possible (safe to retry without side effects)
- Output formatting that the LLM can parse reliably
Common mistake: giving the agent too many tools. An agent with 30 tools will misroute frequently. Start with 5-7 focused tools and expand based on actual usage data.
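A sketch of what these principles look like in code. The `lookup_order` tool schema, the ID format, and the retry parameters are hypothetical examples; the point is that validation and retry logic live outside the LLM, in plain code.

```python
import re
import time

# Hypothetical tool schema. The description is what the LLM reads when
# deciding whether to call this tool, so it spells out when NOT to use it.
LOOKUP_ORDER_TOOL = {
    "name": "lookup_order",
    "description": "Fetch one order by its ID. Use ONLY when the user "
                   "provides an order ID like ORD-123456.",
    "parameters": {"order_id": {"type": "string"}},
}

def validate_order_id(order_id: str) -> str:
    """Input validation: reject malformed LLM output before it hits the API."""
    if not re.fullmatch(r"ORD-\d{6}", order_id):
        raise ValueError(f"Rejected malformed order_id: {order_id!r}")
    return order_id

def call_with_retries(fn, *args, retries=2, backoff=0.5):
    """Retry transient failures with exponential backoff.
    Assumes fn is idempotent (safe to repeat without side effects)."""
    for attempt in range(retries + 1):
        try:
            return fn(*args)
        except TimeoutError:
            if attempt == retries:
                raise
            time.sleep(backoff * (2 ** attempt))
```

The validation step is the cheapest reliability win here: the LLM occasionally emits an order ID in the wrong format, and a one-line regex check turns a confusing downstream API error into an immediate, recoverable rejection.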
📥 Free download: Vietnam Offshore Development Cost Guide 2026
Includes real developer rates, project cost breakdowns, and a budget planning template. Used by 200+ startup founders.
Ready to build?
NKKTech delivers AI Development projects from $30K.
Fixed scope. Senior Vietnam engineers. 14-day kickoff.
Layer 3: Memory Management
Memory is what separates a chatbot from an agent. A production agent needs three types of memory:
- Conversation memory — the current session context. Challenge: token limits; a conversation that runs 50+ turns will exceed context windows. Solution: summarize older turns while keeping recent turns verbatim.
- User memory — persistent information about the user across sessions: their preferences, past issues, account details. Stored in a database and injected into context at conversation start.
- Working memory — intermediate state during multi-step tasks. If the agent is processing a five-step workflow and fails at step 3, working memory lets it resume from step 3 instead of starting over.
Implementation: Redis for session memory (fast, ephemeral), PostgreSQL for user memory (persistent, queryable), and in-context state objects for working memory.
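The summarize-older/keep-recent pattern for conversation memory is simple enough to sketch directly. `summarize` is stubbed here; in production it would be a call to a cheap model, and `KEEP_VERBATIM` is an assumed tuning knob, not a recommended value.

```python
KEEP_VERBATIM = 6  # most recent turns passed through untouched (illustrative)

def summarize(turns: list[str]) -> str:
    # Placeholder: a real implementation calls an inexpensive LLM here
    # and returns a compact natural-language summary.
    return f"[summary of {len(turns)} earlier turns]"

def build_context(turns: list[str]) -> list[str]:
    """Return the context window: one summary of old turns + recent verbatim."""
    if len(turns) <= KEEP_VERBATIM:
        return list(turns)
    older, recent = turns[:-KEEP_VERBATIM], turns[-KEEP_VERBATIM:]
    return [summarize(older)] + list(recent)
```

A common refinement is to re-summarize incrementally (fold each evicted turn into the existing summary) rather than re-summarizing the whole history every turn, which keeps the summarization cost constant per turn.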
Layer 4: Evaluation and Guardrails
This is the layer most teams skip — and the one that causes the most production incidents.
Runtime guardrails:
- Output validation: does the response match expected format?
- Hallucination detection: is the agent citing information that wasn't in the retrieved context?
- Toxicity and PII filters: is the response safe to show users?
- Confidence scoring: how certain is the agent about its response?
Offline evaluation:
- Weekly automated eval runs against a test dataset
- A/B testing for prompt changes (never deploy untested prompts to production)
- Regression testing: does the new model/prompt break previously working cases?
The evaluation layer is what lets you deploy with confidence. Without it, every deployment is a gamble.
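A minimal runtime guardrail combining three of the checks above: output-format validation, a crude citation-based hallucination check, and a confidence gate. The JSON schema (`answer`, `sources`, `confidence`) is an assumed response format, not a standard one.

```python
import json

REQUIRED_KEYS = {"answer", "sources", "confidence"}  # assumed response schema

def validate_response(raw: str):
    """Runtime guardrail: reject a response before it reaches the user.
    Returns (ok, parsed_response) on success or (False, reason) on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    if not data["sources"]:
        # Crude hallucination proxy: an answer with no retrieved sources
        # is treated as ungrounded and blocked.
        return False, "no sources cited"
    if data["confidence"] < 0.7:  # illustrative threshold
        return False, "low confidence"
    return True, data
```

When a check fails, the usual move is not to show the user an error but to retry with a corrective prompt, fall back to a safe canned response, or escalate, and to log every rejection so the offline eval suite learns from it.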
Layer 5: Human-in-the-Loop
Every production agent needs an escape hatch. When the agent isn't confident, it should seamlessly hand off to a human — with full conversation context, attempted actions, and the reason for escalation.
Escalation triggers:
- Confidence score below threshold (we typically use 0.7)
- User explicitly asks for a human
- Agent enters a loop (same action attempted 3+ times)
- Sensitive topics detected (billing disputes, legal questions, complaints)
- Task exceeds maximum step count
The handoff must be invisible to the user. No 'please hold while I transfer you.' The human sees the full conversation, the agent's reasoning, and can resume the conversation naturally.
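The trigger list above can be centralized in one function so every code path escalates consistently. This is a sketch: the keyword check for a human request is deliberately crude (production systems usually classify intent with a model), and sensitive-topic detection is omitted.

```python
def should_escalate(confidence: float, user_text: str,
                    recent_actions: list[str], step_count: int,
                    threshold: float = 0.7, max_steps: int = 10):
    """Return the escalation reason, or None if the agent should continue.
    The reason string is attached to the handoff so the human sees why."""
    if confidence < threshold:
        return "low_confidence"
    # Crude keyword check; a real system would classify intent instead.
    if "human" in user_text.lower():
        return "user_requested_human"
    # Loop detection: same action attempted 3+ times in a row.
    if len(recent_actions) >= 3 and len(set(recent_actions[-3:])) == 1:
        return "action_loop"
    if step_count > max_steps:
        return "max_steps_exceeded"
    return None
```

Returning a reason string rather than a boolean is the detail that makes the handoff useful: the human agent opens the conversation already knowing whether they are dealing with a stuck loop, a frustrated user, or a low-confidence answer.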
Cost Control: Preventing $50K/Month Bills
LLM API costs can spiral without controls. A single unmanaged agent handling 10,000 conversations/month can easily generate $20,000-$50,000 in API costs.
Cost control strategies:
- Model routing: use GPT-4/Claude for complex reasoning, GPT-3.5/Haiku for simple classification and routing. This alone can cut costs 60-70%.
- Caching: identical or near-identical queries should hit a cache, not the LLM. Semantic caching (matching by meaning, not exact text) catches 20-40% of queries in most support scenarios.
- Token budgets: hard limits per conversation, per user, per day. Alert when approaching limits. Kill conversations that enter infinite loops.
- Prompt optimization: shorter system prompts, efficient few-shot examples, structured output formats that minimize token usage.
- Batch processing: for non-real-time tasks, batch API calls during off-peak hours for lower rates.
We've seen companies reduce their LLM costs from $40,000/month to $8,000/month with these strategies alone — while handling the same volume.
Common Failure Modes
After building 30+ production agents, these are the failures we see most often:
- Hallucination loops: the agent generates a confident but wrong answer, then doubles down when challenged. Fix: retrieval-based answers with source citations, confidence thresholds.
- Infinite tool loops: the agent calls the same tool repeatedly with slightly different parameters, never resolving the task. Fix: maximum step limits, loop detection, human escalation.
- Cost spirals: a bug causes the agent to generate extremely long responses or make unnecessary API calls. Fix: token budgets, cost alerting, circuit breakers.
- Context overflow: long conversations exceed the model's context window, causing degraded performance. Fix: conversation summarization, sliding window approach.
- Silent failures: the agent returns a plausible-looking but wrong answer, and no one notices. Fix: evaluation layer, user feedback mechanisms, automated accuracy monitoring.
The common thread: these are all systems problems, not model problems. A better model won't fix them. Better architecture will.

10+ years building AI systems for Toyota, Sony, and Rakuten in Japan. Founded NKKTech in 2018 with a senior-only engineering model.
Ready to build with NKKTech?
Building an AI agent for production? Let's review your architecture. Book a free 30-minute technical consultation — we'll identify potential failure points before they hit your users.
Book a free consultation