Most "AI agent" demos shipped in 2024–2025 were impressive on Twitter and broken in production. The agent worked on the happy path, but the moment it hit a malformed tool response, a rate limit, an ambiguous user instruction, or a stale memory pointer, it either hung silently or produced confidently wrong output. After 30+ AI agent deployments across fintech, healthcare, sales-ops, and customer-service workloads at NKKTech, we've seen the same patterns repeat. This guide is the architecture playbook we wish we'd had at the start of 2024 — what to build, what to measure, what fails, and what it costs. It's written for engineering leads and CTOs who need to ship AI agents that work on day 30, not just day one.
What Makes an AI Agent "Production-Ready"
A demo agent is judged on whether it completes one happy-path task. A production agent is judged on five dimensions: correctness, reliability, latency, cost predictability, and observability. Miss any one and the system either fails customers or fails your finance team.
Correctness means the agent's output is acceptable on at least the 95th percentile of input variations — not just the 5–10 prompts you wrote during a design sprint. Reliability means it gracefully degrades when a tool times out, an LLM provider returns a 429, or memory state is partially corrupted; it should retry, fall back, or surface a useful error rather than freezing. Latency means p50 and p99 are both within your product's UX budget — usually under 2 seconds for chat-style interfaces, under 30 seconds for background workflows. Cost predictability means per-task cost is bounded and known in advance; an agent that costs $0.04 per task on average but occasionally costs $4 because of an infinite tool-calling loop is unshippable. Observability means every agent run is logged with structured spans (input, output, tools called, tokens used, latency per stage) so when something breaks at 3 a.m. on a Tuesday, you can diagnose in 10 minutes instead of three days.
If you're reading this and three of those five dimensions aren't being measured in your current agent stack, you don't have a production system — you have a long-running prototype. That's the most common state we encounter when clients hand us an internal AI initiative that "works in demo but keeps breaking."
Memory Architecture: Working, Short-Term, Long-Term
Agent memory is the single most common source of production bugs, because most tutorials treat it as a single "chat history" array. In real systems you need three distinct memory tiers, each with its own retention policy, retrieval mechanism, and budget.
Working memory is the current turn's scratchpad — the tool calls in flight, the partial reasoning chain, the variables the agent is currently manipulating. This lives entirely in the LLM's context window and is discarded at turn boundary. Budget: 4-16k tokens. Mechanism: structured fields injected into each LLM call.
Short-term memory is the current session's history — last 10-30 turns of conversation, recent tool results, user clarifications. This needs to be summarizable when the session runs long; we use rolling summarization (keep last 5 verbatim, summarize older into a paragraph) to prevent token explosion. Budget: 8-32k tokens. Mechanism: a structured session log persisted to Redis or PostgreSQL, re-injected into context with a configurable lookback window.
Long-term memory is everything that should survive across sessions — user preferences, learned facts about the user or their domain, the agent's accumulated knowledge of the user's project history. This must NOT live in the LLM context (too expensive and unreliable). Instead: store as structured records in PostgreSQL, indexed for retrieval by user+topic, and inject only the relevant subset into context via embedding-based or rule-based retrieval. Budget: zero context tokens by default; retrieval adds 500-2000 tokens when triggered. Mechanism: pgvector or Pinecone for semantic retrieval, plus a structured key-value store for hard facts ("user's company name", "user's preferred currency").
Get the tier boundaries wrong and you'll see two failure modes constantly: (a) the agent "forgets" things it should know (long-term memory failed to retrieve), or (b) the agent's response latency or cost explodes because everything is being stuffed into context. A well-architected memory system feels invisible — the agent remembers what matters and forgets what doesn't.
Tool Calling: Sync, Async, Parallel, and Streaming
Tool calling is where prototypes become real. A demo agent calls one tool, gets a result, responds. A production agent calls 3-7 tools per task on average, sometimes in parallel, sometimes with the next tool depending on the previous tool's result, and at least one of those tools will fail or time out.
Four patterns matter. Synchronous: the agent calls a tool, waits for the result, decides next action. Use for fast tools (database queries under 500ms, simple API calls). Asynchronous: the agent dispatches a long-running tool (a 30-second analysis, an external service that may take minutes) and continues with other work; result is reconciled on a later turn or via webhook. Critical for any tool that can take longer than your UX latency budget. Parallel: the agent identifies multiple independent tools needed for the same decision and calls them simultaneously; latency = max(tool latencies) instead of sum. Modern LLMs (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) all support parallel tool calling natively — most teams just don't use it because their orchestration framework doesn't surface it. Streaming: tool results come back as a stream rather than a complete payload (e.g., a long search result with paginated chunks); the agent can begin reasoning over partial results without waiting for completion.
For each tool you expose to the agent, you also need: a strict input schema (Pydantic, Zod) with descriptions on every field so the LLM can self-correct on malformed calls, an explicit timeout (we default to 10 seconds per tool call, max 30 seconds for known-slow tools), retry-with-backoff for transient failures (max 3 retries with exponential backoff), and a circuit breaker that disables the tool for the agent's current session after N consecutive failures so the agent doesn't keep retrying a broken downstream.
Real example from a NKKTech project: a B2B fintech agent that had to enrich a company across Apollo, LinkedIn Sales Nav, and Clearbit. The first version called all three in sequence — 8-12 second latency, 30% failure rate when any one was rate-limited. After we restructured to parallel calls with per-tool circuit breakers and fallback chains (Clearbit → Apollo → cached previous run), latency dropped to 2-3 seconds and the overall failure rate fell to under 2%. Same code budget, same LLM model — the win was entirely architectural.
Multi-Agent Orchestration: Hierarchical vs Network
Single-agent systems hit a complexity ceiling around 15-20 distinct tools or capabilities. Past that point, the agent starts making more tool-selection errors than productive moves, and your eval scores stop improving even as you add capabilities. The fix is decomposition — break the agent into specialized sub-agents.
Two orchestration topologies dominate production deployments. Hierarchical: a coordinator agent receives the user request, decomposes it into sub-tasks, and dispatches each sub-task to a specialist sub-agent ("data lookup specialist", "writer specialist", "compliance reviewer"). Sub-agents don't talk to each other; they all return to the coordinator, which assembles the final answer. Good when tasks have a clear hierarchy, when you want strict control over which sub-agent handles what, and when sub-agents need different tool permissions. Easier to evaluate (you can score each sub-agent independently).
Network: agents talk peer-to-peer, with no central coordinator. An agent can ask another agent for help, get a response, and continue its own reasoning. Good for collaborative workflows where the boundary between sub-tasks isn't fixed upfront — research-and-write workflows, debate or critique loops, multi-stakeholder negotiation simulations. Harder to evaluate, harder to debug, but more expressive.
In our experience 80% of B2B use cases work better with hierarchical orchestration — the coordinator pattern is more debuggable, the sub-agents stay focused, and the cost profile is more predictable. We default to hierarchical unless the use case explicitly requires multi-agent debate (e.g., a content-review workflow where two reviewer agents argue and a third decides).
A tactical note on framework choice: LangGraph, CrewAI, and Microsoft's AutoGen are the three most-deployed multi-agent frameworks in 2026. LangGraph is the most production-mature, with first-class observability and persistence; CrewAI is the easiest to start with but has hit production limits on memory and error handling for some clients; AutoGen is the most flexible but the steepest learning curve. We use LangGraph by default for new projects unless a client requires CrewAI for their internal stack.
📥 Free Download: Vietnam Offshore Dev Cost Guide 2026
Real developer rates, project cost breakdowns, and a budget planning template. Used by 200+ startup founders.
Ready to build?
NKKTech delivers AI Development projects from $30K.
Fixed scope. Senior Vietnam engineers. 14-day kickoff.
The Evaluation Framework Most Teams Skip
Most teams ship AI agents with no formal eval beyond "it worked when I tried 5 prompts." Six weeks later, when a customer reports a bad output, there's no way to know whether it was a regression, a new edge case, or the system has always been bad on that input type.
A real eval framework needs four components. A frozen eval set — 50-500 example tasks with expected outputs, curated by domain experts, that you run before every deployment. This set should include happy paths, known edge cases (especially the ones that caused production incidents), adversarial inputs (prompt injections, ambiguous instructions, contradictions), and inputs from underrepresented user segments (different languages, dialects, technical literacy levels). A scoring function for each task — sometimes exact match ("did the agent extract the right amount from the invoice?"), sometimes LLM-as-judge for free-form output ("is this customer response polite, accurate, and actionable?"), sometimes a custom domain function ("does the SQL the agent generated actually run and return the right rows?"). Component-level evals — score each agent stage separately: retrieval quality, tool-call accuracy, reasoning coherence, final-output quality. This is what lets you debug a regression to a specific stage instead of a vague "the agent got worse" feeling. Regression tracking — eval scores per release in a dashboard, with alerts when any metric drops more than 3% week over week.
We build eval frameworks for every production agent we ship. The investment is 2-5 days of engineering up front, plus 1-2 hours per week to curate new eval cases as production usage reveals new failure modes. The ROI is enormous: we've watched clients ship 8+ releases without breaking a single capability because they had a working eval, while clients without evals shipped 2-3 "silent regressions" per quarter and discovered them only after customer complaints.
If you take one architectural recommendation from this entire guide, take this one: build the eval framework before you ship the agent. It feels like overkill in week 2; by week 12 it's the only thing keeping your team sane.
Production Deployment: Latency, Cost, Reliability
Once the agent works in dev, you'll discover three production realities that demos never surface. LLM provider latency is bimodal — most calls complete in 2-4 seconds, but 5-10% of calls take 15-60 seconds for no apparent reason. Build for this from day one: streaming responses to the user even when the underlying agent isn't done, timeouts with graceful fallback ("the AI is taking longer than usual — here's what we have so far"), and aggressive parallelism so one slow tool doesn't block the whole turn.
Cost variance is also bimodal — most tasks cost $0.01-0.05, but pathological inputs (a user pastes a 50-page PDF; a tool-call loop misfires; the agent generates a 4k-token reasoning chain before responding) can cost 100-1000x more. Hard caps prevent the worst outcomes: max tokens per task (10-20k), max tool calls per task (15-20), max wall-clock time per task (60-120 seconds). When any cap is hit, the agent surfaces a graceful error and the run is killed before it can bill more.
Reliability requires defense in depth. We deploy production agents behind: a request queue (so traffic spikes don't overwhelm the LLM provider quota), a circuit breaker per LLM provider (if OpenAI is degraded, automatically failover to Anthropic for the same prompt with minimal quality loss), a structured observability layer (OpenTelemetry traces with custom spans for each agent stage, exported to Datadog or Grafana), and a kill switch (a single config toggle that disables all agent calls in case of a runaway incident, falling back to a non-AI path or a maintenance message).
Real numbers from a NKKTech project: a customer-support agent serving 12,000 tickets/month for a mid-market SaaS, originally deployed without circuit breakers or kill switch. When OpenAI had a 6-hour partial outage in November 2025, every ticket sat in a 30-second loop until timeout, the agent burned $1,800 in retries on broken calls, and Slack lit up with customer complaints. We rebuilt the deployment with the patterns above; during the next provider outage in February 2026, the agent automatically failed over to Claude, latency rose by 800ms, and not a single customer escalation was filed.
Failure Modes That Will Bite You (with Real Examples)
Eight failure modes account for almost every production agent incident we've debugged.
The infinite tool-call loop. The agent calls tool A, gets a result, decides it needs tool B, calls B, decides to re-call A with slight variation, ad infinitum. Prevention: hard cap on tool calls per task, plus an LLM-side instruction that says "if you call the same tool with the same arguments twice in this task, stop and respond with what you have."
The schema mismatch silent corruption. A tool's output schema changes (a field renames, a type widens), the agent doesn't notice because the LLM does "best effort" type coercion, and the output looks plausible but is wrong. Prevention: strict schema validation on every tool boundary (Pydantic, Zod), fail loud on mismatch.
The stale context lie. Long-term memory retrieves an outdated fact (the user's company name changed three weeks ago but the retrieval returned the old version), and the agent answers with conviction. Prevention: timestamp every long-term memory record; teach the agent (via system prompt) to flag potential staleness when a retrieved fact is older than X days.
The prompt injection escape. A user (or a tool's input data, like an inbound email) contains "ignore previous instructions and..." content. Some agents follow it. Prevention: input sanitization on user content + clear delimiter blocks in the system prompt + LLM models that have been RLHF-trained against this attack (GPT-4o, Claude 3.5 are robust; older or open-source models are weaker).
The currency / unit confusion. The agent sees "$120" and "120 VND" and treats them as the same magnitude, or confuses 1,000 USD with 1,000 JPY. Prevention: explicit type annotations on all monetary fields (Money type with currency code), and a final-step validator that flags when monetary outputs are missing a currency or are wildly out of expected range.
The hallucinated tool. The agent decides it needs to call a tool that doesn't exist, fabricates a plausible-looking tool name, and reports success. Prevention: strict allow-list of tools per agent (no dynamic tool registration mid-task), and a validation step that catches "unknown tool" errors and re-prompts the agent.
The over-confident no-retrieval response. The agent's long-term retrieval returns zero results (because the query was malformed, or the topic doesn't exist in memory), but the agent doesn't acknowledge the empty result and hallucinates an answer. Prevention: explicit "no results found, do you want to clarify your question?" handling in the orchestration layer.
The quiet rate-limit cascade. LLM provider returns 429, your retry logic kicks in, retries fail because you're still over quota, the task eventually times out, and the user sees "something went wrong" with no information. Prevention: 429 → fall back to a different LLM provider immediately, not retry against the same provider; if all providers fail, surface a specific error to the user ("AI service temporarily unavailable, please try again in a few minutes").
We've debugged at least one production incident for each of these eight modes in the last 18 months. Putting prevention in place from day one is 10x cheaper than discovering them in production.
Cost Optimization: Model Routing and Caching
Most production agents are running 70-90% over their optimal cost because every task uses the most capable (and most expensive) LLM. Two architectural changes routinely cut cost by 50-80% without measurable quality loss.
Model routing. Not every step needs GPT-4o or Claude 3.5 Sonnet. A simple classifier ("is this user message a greeting, a tool-call request, or a complex reasoning task?") can route trivial steps to GPT-4o-mini or Claude Haiku at 1/15th the cost, reserving the flagship model only for the reasoning-heavy steps. The classifier itself runs on the cheap model. Implementation: a small router LLM call at task start ($0.0001), then dispatch to the appropriate model for the body of the work. Caveat: model routing adds latency (one extra LLM call) and complexity (one more thing to monitor), so it's worth it only when your task volume is high (10k+ tasks/day) or your unit cost is sensitive (B2C product with thin margins).
Semantic caching. Many production agents see the same or similar queries repeatedly — "What's our refund policy?", "Convert this CSV to JSON", "Summarize this meeting transcript." Embedding-based caching lets you detect semantically similar requests and return the cached answer instead of re-running the agent. Implementation: hash + embed every successful task input; on a new task, check the cache for semantically similar inputs (cosine similarity > 0.95) and return the cached output if hit. Set TTL based on how often the underlying facts change (60 minutes for product-info questions, 1 minute for live-data queries). Real-world hit rate is 15-40% for customer-support and Q&A workloads.
Prompt caching. Anthropic and OpenAI both now support explicit prompt caching — mark stable parts of the system prompt (your tool catalogs, your retrieval results, your long-form instructions) as cached, and pay reduced tokens on subsequent calls within a 5-minute window. For agents with large stable prompts (4k+ tokens), this is a 60-80% reduction on the prompt portion of the bill. Almost free to implement; we enable it by default for every agent we ship.
A concrete cost-optimization story from a NKKTech project: an enterprise sales-research agent at a Series B SaaS was costing $7,400/month for 18,000 tasks. After implementing model routing (mini for classification + retrieval prep, flagship for synthesis) and semantic caching (28% hit rate on a workload heavy in repeated company lookups), monthly cost fell to $1,950 — a 74% reduction with zero quality drop on the client's eval set. The optimization took 5 engineering days; the payback was 11 days.
What to Build Next
If you're at the start of an AI agent project, the right sequence is: nail the eval framework first (week 1-2), then build single-agent capabilities one tool at a time with continuous eval (week 3-8), then add memory tiers as production usage demands them (week 6+), and finally consider multi-agent orchestration only if a single-agent system hits the 15-20 tool ceiling (week 12+). Skip ahead at your peril.
If you're already running an agent in production but it's flaky, prioritize in this order: (1) instrument observability so you can diagnose what's actually breaking, (2) build the eval framework so you can measure regressions, (3) add hard caps and circuit breakers so cost and latency stop being unbounded, (4) audit your memory architecture for the three-tier separation, (5) add model routing and prompt caching for cost.
If you'd rather have NKKTech build it for you — fixed-scope, senior engineers, in 10-16 weeks — book a free 30-minute discovery call. We work with companies in the US, Canada, Australia, Singapore, Japan, and Korea, contracts under Singapore law (via NKKTech Global Pte. Ltd.) or Vietnam law (via NKKTech Global JSC), whichever your procurement prefers. 120+ projects since 2018, 50+ senior engineers, ISO 9001 and 22301 certified. No drip sequences, no sales pressure — just a conversation about whether we're the right fit.
📥 Free Download: Vietnam Offshore Dev Cost Guide 2026
Real developer rates, project cost breakdowns, and a budget planning template. Used by 200+ startup founders.
Ready to build?
NKKTech delivers AI Development projects from $30K.
Fixed scope. Senior Vietnam engineers. 14-day kickoff.

10+ years building AI systems for Toyota, Sony, and Rakuten in Japan. Founded NKKTech in 2018 with a senior-only engineering model.
Thêm trong pillar này
Want to build this with NKKTech?
Building an AI agent and want a second opinion on your architecture? Book a free 30-minute call with a NKKTech senior engineer — not a sales rep — for a no-pitch technical review.
Book a Free Call