Almost every team building an AI agent starts an eval framework, runs it twice, and then quietly stops because adding new eval cases is tedious and the scoring code lives in a Jupyter notebook nobody touches. After shipping evals on 30+ production agents at NKKTech, we've learned what makes an eval framework actually get used past month two — and what makes it die. This guide is the four-component recipe we ship with every production agent. Each component takes 1–3 days of engineering, and together they're the difference between catching regressions before they ship and finding them via customer complaints.
Why Most Eval Frameworks Get Abandoned
The failure pattern is consistent: a senior engineer builds the eval scaffolding in week two of the project, writes 20 eval cases, and runs the suite a few times, and then the team stops adding cases as the agent's capabilities grow. By month three, the eval suite tests a fraction of what the agent actually does in production, regressions slip through undetected, and the team falls back to manual smoke-testing. The fix is structural, not motivational: make adding an eval case a 30-second action (paste a failed production case into a directory and the suite picks it up automatically), make running the evals part of CI (every PR runs the full suite, and a score drop blocks the merge), and make scores visible (a dashboard in the team's daily-standup channel, not buried in an internal wiki). The most successful frameworks we've shipped have all four components below. Drop any one and the framework is abandoned within a quarter.
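As a rough illustration of the "block merge on a score drop" rule, the sketch below compares a PR's overall eval score with a stored baseline and returns a process exit code that a CI job could pass to `sys.exit`. The function name `gate` and the `max_drop` parameter are hypothetical, and how the two scores are produced and stored is left out.

```python
def gate(baseline_score: float, pr_score: float, max_drop: float = 0.0) -> int:
    """Return a process exit code: 0 to allow the merge, 1 to block it.

    `max_drop` is an assumed tolerance; 0.0 blocks on any drop at all.
    """
    drop = baseline_score - pr_score
    if drop > max_drop:
        print(f"FAIL: eval score dropped {drop:.3f} "
              f"({baseline_score:.3f} -> {pr_score:.3f})")
        return 1
    print(f"OK: eval score {pr_score:.3f} (baseline {baseline_score:.3f})")
    return 0
```

A small non-zero `max_drop` may be worth considering if part of the suite is scored by an LLM judge, since those scores can vary slightly between runs.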
Component 1: The Frozen Eval Set
Curate 50–500 example tasks with expected outputs, stored as YAML or JSON files in your repo. Every example has: a unique ID, the input the agent receives, the expected output (or expected properties of the output if the format is free-form), and a category tag ("happy-path", "edge-case", "adversarial", "underrepresented-segment"). The set is frozen in the sense that examples don't change once added — if production behavior reveals a new edge case, you add a new example rather than modifying an existing one. This preserves the historical regression signal. We aim for: ~40% happy-path coverage of the most common task variants, ~30% edge cases (specifically the inputs that have caused production incidents), ~15% adversarial (prompt injection attempts, contradictory instructions), and ~15% underrepresented (different languages, dialects, technical literacy levels). The set lives in evals/cases/*.yaml in our standard project layout, and we have a CLI command (npm run eval:add) that scaffolds a new case from a copy-pasted production input.
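The case format described above could be loaded along the following lines. This is a minimal sketch, not our actual tooling: it reads JSON rather than YAML so that it needs only the Python standard library, and the names `EvalCase`, `load_cases`, and `category_mix` are hypothetical. It assumes each file holds exactly the four fields listed in the text.

```python
import json
from collections import Counter
from dataclasses import dataclass
from pathlib import Path

# Category tags named in the text above.
CATEGORIES = {"happy-path", "edge-case", "adversarial", "underrepresented-segment"}


@dataclass(frozen=True)  # frozen mirrors the "never modify an existing case" rule
class EvalCase:
    id: str
    input: str
    expected: str
    category: str


def load_cases(directory: Path) -> list[EvalCase]:
    """Load every *.json case in `directory`, rejecting bad or duplicate cases."""
    cases: list[EvalCase] = []
    seen: set[str] = set()
    for path in sorted(directory.glob("*.json")):
        # Raises TypeError if a file has missing or unexpected fields.
        case = EvalCase(**json.loads(path.read_text(encoding="utf-8")))
        if case.category not in CATEGORIES:
            raise ValueError(f"{path.name}: unknown category {case.category!r}")
        if case.id in seen:
            raise ValueError(f"{path.name}: duplicate id {case.id!r}")
        seen.add(case.id)
        cases.append(case)
    return cases


def category_mix(cases: list[EvalCase]) -> dict[str, float]:
    """Return the share of cases per category, to compare with the target mix."""
    if not cases:
        return {cat: 0.0 for cat in sorted(CATEGORIES)}
    counts = Counter(case.category for case in cases)
    return {cat: counts[cat] / len(cases) for cat in sorted(CATEGORIES)}
```

Because the loader globs the directory, dropping a new file into it is all that is needed for the suite to pick the case up, and `category_mix` gives a quick check against the 40/30/15/15 target.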
📥 Free Download: Vietnam Offshore Dev Cost Guide 2026
Real developer rates, project cost breakdowns, and a budget planning template. Used by 200+ startup founders.
Ready to build?
NKKTech delivers AI Development projects from $30K.
Fixed scope. Senior Vietnam engineers. 14-day kickoff.
Component 2: Scoring Functions (Per Task Type)
One scoring function per agent capability. Three patterns cover 90% of use cases. Exact match works when the agent's output is structured (a JSON object, an extracted field, a SQL query): the score is 1 if the output equals the expected output, 0 otherwise. Schema validation is a relaxed variant of the same pattern: the output must match the expected schema even if specific values differ. LLM-as-judge handles free-form outputs (customer support responses, marketing copy, summaries): pass the agent's output, the expected output, and a rubric ("is the response polite, accurate, and actionable?") to a separate LLM that returns a 1–5 score with reasoning, then rescale that score to [0,1]. Custom domain functions cover cases where correctness is computable but not match-based, e.g., "does this generated SQL actually execute and return the right number of rows?". Whichever pattern you use, the function takes (input, expected, actual) and returns a score in [0,1] plus a reasoning string. We log both, because the reasoning is what lets a human reviewer quickly understand why a score dropped.
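The shared scoring contract might look like the sketch below. All names (`Score`, `exact_match`, `schema_match`, `normalize_judge_score`) are hypothetical. The schema check is a deliberately simple assumption that compares top-level keys and value types only, and the LLM call itself is omitted: only the rescaling of a 1–5 judge rating onto [0,1] is shown.

```python
from typing import Any, NamedTuple


class Score(NamedTuple):
    value: float    # always in [0, 1]
    reasoning: str  # logged alongside the score for human review


def exact_match(task_input: str, expected: Any, actual: Any) -> Score:
    """Score 1 if the structured output equals the expected output, else 0."""
    if actual == expected:
        return Score(1.0, "output equals expected")
    return Score(0.0, f"expected {expected!r}, got {actual!r}")


def schema_match(task_input: str, expected: dict, actual: Any) -> Score:
    """Relaxed check: same top-level keys and value types; values may differ.

    Extra keys in `actual` are tolerated in this sketch.
    """
    if not isinstance(actual, dict):
        return Score(0.0, f"expected an object, got {type(actual).__name__}")
    missing = sorted(expected.keys() - actual.keys())
    if missing:
        return Score(0.0, f"missing keys: {missing}")
    wrong = sorted(k for k in expected if type(actual[k]) is not type(expected[k]))
    if wrong:
        return Score(0.0, f"wrong value types for keys: {wrong}")
    return Score(1.0, "all expected keys present with matching types")


def normalize_judge_score(raw: int, reasoning: str) -> Score:
    """Map a 1-5 judge rating onto [0, 1] so all scorers share one scale."""
    if not 1 <= raw <= 5:
        raise ValueError(f"judge score out of range: {raw}")
    return Score((raw - 1) / 4, reasoning)
```

Keeping every scorer on the same `(input, expected, actual)` signature and the same return type means the suite can pick a scorer per task type without special-casing how results are logged.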
Component 3: Component-Level Evals
End-to-end evals tell you the system got worse but not where. Component-level evals score each agent stage independently: retrieval quality (did the right documents come back?), tool-call accuracy (did the agent pick the right tool and pass the right arguments?), reasoning coherence (did the intermediate reasoning chain make sense?), and final-output quality (the metric you care about). When the end-to-end score drops, the component scores tell you which stage regressed. This is what turns a debugging report from "the agent feels worse" into "retrieval precision dropped 8% on legal documents after the embedding-model change in PR #1242." Implementation is straightforward: every agent stage emits structured telemetry (we use OpenTelemetry spans), and the eval suite scores each span type with the appropriate scoring function. Total investment: 1–2 days. Payback: in our experience, a regression caught at PR time costs roughly 10× less to fix than one caught after deployment.
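Once each span has been scored, localizing a regression is a comparison of per-stage averages. The sketch below assumes the span scores have already been extracted into `(stage, score)` pairs. The stage names, the function names, and the `tolerance` default are illustrative assumptions, and no OpenTelemetry API is used here.

```python
from collections import defaultdict
from statistics import mean


def stage_means(span_scores: list[tuple[str, float]]) -> dict[str, float]:
    """Average the per-span scores for each stage name."""
    by_stage: dict[str, list[float]] = defaultdict(list)
    for stage, score in span_scores:
        by_stage[stage].append(score)
    return {stage: mean(scores) for stage, scores in by_stage.items()}


def regressed_stages(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,  # assumed noise floor, not a value from the text
) -> list[tuple[str, float]]:
    """Return (stage, drop) pairs whose score fell by more than `tolerance`,
    largest drop first. Stages missing from `current` are skipped."""
    drops = [
        (stage, baseline[stage] - current[stage])
        for stage in baseline
        if stage in current and baseline[stage] - current[stage] > tolerance
    ]
    return sorted(drops, key=lambda item: item[1], reverse=True)
```

Sorting by the size of the drop puts the most likely culprit first, which is usually the stage worth inspecting before anything downstream of it.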
Component 4: Regression Tracking and Alerting
The fourth component is what gets the framework used day to day. Every CI run logs eval scores to a time-series store (we use simple Postgres tables; some teams use Datadog or Grafana). A dashboard shows score trends per metric per release. Alerts fire when any metric drops more than 3% week-over-week and are posted to the team's Slack channel with a link to the failing eval cases. We also include eval scores in PR descriptions automatically: if a PR drops the overall score, that's visible in the GitHub review UI, and reviewers know to look harder. For a full architectural treatment of how this fits with the broader agent system (memory, tool calling, multi-agent orchestration, deployment patterns), see our AI Agents in Production Architecture Guide. The eval framework is one of eight architectural decisions covered there.
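The alert rule can be sketched as a pure function over two weekly snapshots of metric scores. This version reads "3%" as a relative drop, which is an assumption; an absolute three-point drop would be an equally plausible reading. The function name and message format are hypothetical, and querying the time-series store and posting to Slack are left out.

```python
def week_over_week_alerts(
    last_week: dict[str, float],
    this_week: dict[str, float],
    threshold: float = 0.03,  # 3% relative drop, per the rule above
) -> list[str]:
    """Return one alert message per metric that dropped more than `threshold`."""
    alerts: list[str] = []
    for metric, previous in sorted(last_week.items()):
        current = this_week.get(metric)
        if current is None or previous <= 0:
            continue  # nothing to compare against
        drop = (previous - current) / previous
        if drop > threshold:
            alerts.append(
                f"{metric}: {previous:.3f} -> {current:.3f} (down {drop:.1%})"
            )
    return alerts
```

Returning messages instead of sending them keeps the rule easy to unit-test; the caller decides where the alerts go.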

10+ years building AI systems for Toyota, Sony, and Rakuten in Japan. Founded NKKTech in 2018 with a senior-only engineering model.
Want to build this with NKKTech?
Need help building an eval framework for an agent you've already shipped? Book a free 30-minute review — we'll look at your current testing approach and suggest a minimum viable eval suite for your top three use cases.