Designing Evals for Multi-Step LLM Agents Before Production

Most teams discover too late that evaluating an agent is not the same thing as evaluating a prompt.
A single-turn prompt can often be judged with a relatively clean input/output lens: did the model answer correctly, follow the format, avoid policy violations, and stay within latency and cost bounds? A multi-step agent is a different beast. It can choose the wrong subgoal, call the right tool with the wrong arguments, retrieve the wrong documents, write a flawed intermediate summary into memory, recover poorly from partial failures, or technically complete the task while violating constraints you care about in production. If you only evaluate the final answer, you miss most of the failure surface.
This usually becomes painfully obvious during staging. The demo looked good. A handful of happy-path test prompts worked. Then the agent hits a more realistic workflow: it plans too broadly, burns tokens on unnecessary tool calls, fetches stale knowledge, loops when a dependency times out, or contaminates later steps with a bad intermediate assumption. The team responds the way teams often do under pressure: tweak the system prompt, add more examples, maybe swap models, and re-run a few manual tests. Sometimes the changes help. Often they just move the failure around.
The core issue is not that the model is bad. The issue is that the evaluation strategy is too shallow for the system that was built.
If your production system includes planning, retrieval, tool use, memory, and stateful multi-turn behavior, your evaluation harness has to reflect that architecture. You need to measure the task at multiple layers, localize failure, simulate realistic environments, maintain regression suites, and define release gates that align with business risk. Otherwise, you are shipping a distributed system with no observability and calling it prompt engineering.
A realistic failure scenario
Consider a support operations agent used internally by a SaaS company. The agent is supposed to help human support reps resolve customer issues by:
- reading the customer conversation
- retrieving relevant policy and product docs
- checking account state via internal APIs
- proposing a resolution plan
- drafting a customer-safe response
- storing a short case summary for future follow-up
In prototype testing, the team evaluates on 50 manually written cases. The agent looks impressive. It resolves refund requests, identifies billing issues, and drafts polished replies. Leadership wants it in production.
Then pre-production testing with live-like traffic reveals the real behavior:
- The planner overuses retrieval, pulling five to ten documents even for simple known issues.
- It sometimes queries the billing API before confirming the account ID, leading to wrong-user lookups.
- Retrieved policy documents contain overlapping but slightly different refund rules; the agent anchors on an outdated policy.
- When the API times out, the agent hallucinates a likely account status instead of asking for retry or escalation.
- The final response is often fluent and plausible, so casual reviewers miss the fact that the intermediate reasoning violated support policy.
- The memory summary writes “customer eligible for refund” even when eligibility was uncertain, biasing the next agent turn.
If the only eval metric is “was the final drafted response acceptable,” you will undercount important failures. Some bad trajectories happen to end in an acceptable output. Some good trajectories fail because of one flaky tool call and should be treated differently from genuine reasoning failures. Some steps are expensive but harmless; others are rare but catastrophic.
This is the key pattern: agent quality is path-dependent. Final-answer grading is necessary, but nowhere near sufficient.
The pattern to identify: agents fail by trajectory, not just by answer
Single-turn prompt testing tends to collapse system quality into a single completion. Agent systems introduce a trajectory:
- interpret the task
- decompose into subgoals
- choose whether to retrieve or call tools
- decide tool order
- construct tool arguments
- interpret tool outputs
- update working memory
- continue, repair, or terminate
- produce final output
Every one of those stages can fail independently, and failures compound.
That means your eval design should answer at least five questions:
- Task success: Did the agent achieve the user’s objective under the relevant constraints?
- Trajectory quality: Did it follow an efficient, policy-compliant, and robust sequence of steps?
- Component quality: Which subsystem failed: planning, retrieval, tool choice, tool arguments, memory, guardrails, or response generation?
- Operational quality: What were the latency, cost, token, and tool-utilization characteristics?
- Reliability under variation: Does performance hold up across task types, environment conditions, and model/prompt/tool changes?
Once you see the system through that lens, the naive evaluation approach becomes obviously inadequate.
Why the naive approach fails
The common first pass for agent evals usually looks like this:
- collect a small set of representative prompts
- run them through the full agent
- have a human or an LLM judge the final answer
- maybe compare a few prompts or models
This is better than nothing, but it breaks down for multi-step systems for several reasons.
1. Final-answer-only scoring hides root causes
Two agent runs can both fail the same task for completely different reasons. One might fail because retrieval returned nothing relevant. Another might fail because the planner skipped retrieval entirely. A third might fail because a correct tool result was later overwritten by a bad memory summary.
Those imply different fixes:
- retrieval indexing changes
- prompt or policy changes for tool selection
- memory write schema and validation changes
If your eval only says “incorrect answer,” you cannot prioritize engineering work.
2. Happy-path prompts underrepresent production complexity
Manually authored prompts are usually cleaner than real user requests. They omit ambiguity, incomplete inputs, contradictory evidence, adversarial phrasing, and environment noise. Real users also create long-tail combinations of conditions that no one thought to handwrite.
Agents are more vulnerable than single-turn prompts here because they must make control-flow decisions under uncertainty. Ambiguity doesn’t just lower answer quality; it changes which path the agent takes.
3. Agent performance depends on environment behavior
A single-turn benchmark usually assumes the model is the system. In agents, the environment is part of the product:
- retrieval quality
- tool schema clarity
- API reliability
- memory store behavior
- cache freshness
- policy versions
- rate limits
- network delays
Without environment simulators and replayable tool traces, you cannot separate model behavior from infrastructure behavior.
4. Evals without cost and latency metrics incentivize bad behavior
An agent can increase final success by making more tool calls and spending more tokens. That does not mean it is production-ready. Sometimes the “better” version is financially or operationally unacceptable.
You need to measure:
- median and p95 wall-clock latency
- average number of steps
- average tokens per successful task
- average tool calls per task
- failure recovery rate
- cost per completed task
Otherwise you reward brute-force reasoning and over-retrieval.
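As a minimal sketch of how a few of the measurements above could be computed from logged runs: the record keys (`latency_ms`, `tool_calls`, `cost_usd`, `success`) are hypothetical names chosen for illustration, and the percentile uses the simple nearest-rank definition rather than any particular monitoring tool's method.

```python
import math
from statistics import median

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

def operational_summary(runs):
    """runs: list of dicts with the hypothetical keys latency_ms, tool_calls, cost_usd, success."""
    latencies = [r["latency_ms"] for r in runs]
    completed = [r for r in runs if r["success"]]
    total_cost = sum(r["cost_usd"] for r in runs)
    return {
        "median_latency_ms": median(latencies),
        "p95_latency_ms": percentile(latencies, 95),
        "avg_tool_calls": sum(r["tool_calls"] for r in runs) / len(runs),
        # Failed runs still cost money, so total spend is divided by completed tasks only.
        "cost_per_completed_task": total_cost / len(completed) if completed else float("inf"),
    }

runs = [
    {"latency_ms": 1200, "tool_calls": 2, "cost_usd": 0.02, "success": True},
    {"latency_ms": 1500, "tool_calls": 3, "cost_usd": 0.03, "success": True},
    {"latency_ms": 9000, "tool_calls": 8, "cost_usd": 0.10, "success": False},
    {"latency_ms": 1800, "tool_calls": 2, "cost_usd": 0.05, "success": True},
]
summary = operational_summary(runs)
```

Dividing total spend by completed tasks, rather than by all tasks, is a deliberate choice here: it makes a version that succeeds more often through brute force visibly more expensive per useful outcome.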
5. One-off reviews do not protect against regressions
Agent changes interact in surprising ways. A prompt improvement for planning may increase tool overuse. A cheaper model may reduce latency but silently worsen argument construction. A retrieval ranking tweak may help one domain while hurting another.
Without a stable regression suite and release gates, teams end up re-learning old failures.
A better approach: eval the architecture you actually built
The practical approach is to design the eval harness around the agent’s execution graph.
A strong evaluation system for multi-step agents usually has six layers:
- Task-level evals for end-to-end success
- Step-level evals for trajectory and failure localization
- Component evals for planning, retrieval, tools, memory, and guardrails
- Environment simulation for realistic, replayable conditions
- Regression suites for known failures and critical workflows
- Release gates tied to quality, cost, latency, and safety thresholds
Think of it as test infrastructure plus observability plus decision support.
1) Start with a task model, not just a prompt list
Before building metrics, define the task structure for your agent family.
For each task type, document:
- user goal
- required constraints and policies
- expected subgoals
- allowed and disallowed tools
- success criteria
- acceptable recovery behaviors
- failure severity
For example, in a support agent, “resolve a billing dispute” may require:
- identify account correctly
- inspect billing status
- retrieve current refund policy
- determine refund eligibility
- avoid issuing commitments without validation
- escalate when policy ambiguity remains
This decomposition becomes the basis for both human labeling and automated checks.
A useful artifact here is a task spec schema. At minimum:
```json
{
  "task_id": "billing_refund_eligibility_v3",
  "user_input": "Customer says they were double charged and wants a refund.",
  "context": {
    "account_id_present": false,
    "policy_version": "2026-01",
    "available_tools": ["crm_lookup", "billing_api", "policy_search"]
  },
  "required_subgoals": [
    "identify_customer_account",
    "check_recent_charges",
    "retrieve_current_refund_policy",
    "determine_eligibility_or_escalate"
  ],
  "hard_constraints": [
    "do_not_claim_refund_approved_without_billing_confirmation",
    "do_not_access_account_without_verified_identifier"
  ],
  "success_criteria": [
    "correct_account_or_request_missing_identifier",
    "correct_policy_applied",
    "final_response_safe_and_actionable"
  ]
}
```
This is boring engineering work, but it unlocks the rest. Without explicit task specs, eval devolves into subjective impressions.
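One way to keep specs like the one above from drifting into free-form notes is to load them through a small validator. The sketch below is an assumption about how that might look, not a prescribed format: the `TaskSpec` class and `load_task_spec` function are hypothetical names, and the only rule enforced is that the core fields exist and are non-empty.

```python
import json
from dataclasses import dataclass, field

REQUIRED_FIELDS = (
    "task_id", "user_input", "required_subgoals", "hard_constraints", "success_criteria",
)

@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    user_input: str
    required_subgoals: tuple
    hard_constraints: tuple
    success_criteria: tuple
    context: dict = field(default_factory=dict)

def load_task_spec(raw: str) -> TaskSpec:
    """Parse a JSON task spec and reject it if a core field is missing or empty."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_FIELDS if not data.get(key)]
    if missing:
        raise ValueError(f"task spec missing or empty fields: {missing}")
    return TaskSpec(
        task_id=data["task_id"],
        user_input=data["user_input"],
        required_subgoals=tuple(data["required_subgoals"]),
        hard_constraints=tuple(data["hard_constraints"]),
        success_criteria=tuple(data["success_criteria"]),
        context=data.get("context", {}),
    )

raw_spec = json.dumps({
    "task_id": "billing_refund_eligibility_v3",
    "user_input": "Customer says they were double charged and wants a refund.",
    "required_subgoals": ["identify_customer_account", "check_recent_charges"],
    "hard_constraints": ["do_not_access_account_without_verified_identifier"],
    "success_criteria": ["correct_policy_applied"],
})
spec = load_task_spec(raw_spec)
```

Rejecting incomplete specs at load time means every downstream check can assume that subgoals and constraints are present.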
2) Measure task decomposition quality explicitly
One major difference between agent evals and prompt evals is that the plan itself matters.
You do not necessarily need access to the model's chain-of-thought or hidden internal reasoning, but you do need inspectable structured actions or plan summaries that can be evaluated. These can come from explicit planner outputs, action logs, or normalized event traces.
Useful task decomposition metrics include:
Subgoal coverage
Did the agent address all required subgoals for the task?
Example:
- required: identify account, retrieve refund policy, inspect billing status
- observed: inspect billing status, draft reply
- score: 1/3 covered, two critical omissions (identify account, retrieve refund policy)
Subgoal ordering correctness
Did the agent do things in a valid order?
Common ordering failures:
- calling account tools before identity verification
- drafting final resolution before policy retrieval
- writing memory before conflict resolution
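Coverage and ordering are both easy to check mechanically once subgoals are logged by name. The following is a rough sketch under that assumption. The subgoal names are borrowed from the task spec example, while the `must_precede` rule format is a hypothetical convention introduced here.

```python
def subgoal_coverage(required, observed):
    """Return (fraction of required subgoals that were observed, list of missing subgoals)."""
    seen = set(observed)
    missing = [goal for goal in required if goal not in seen]
    return (len(required) - len(missing)) / len(required), missing

def ordering_violations(observed, must_precede):
    """must_precede: list of (earlier, later) pairs.

    A pair is flagged when `later` appears in the trajectory without
    `earlier` having appeared before it.
    """
    violations = []
    for earlier, later in must_precede:
        if later not in observed:
            continue
        later_idx = observed.index(later)
        if earlier not in observed[:later_idx]:
            violations.append((earlier, later))
    return violations

required = [
    "identify_customer_account",
    "check_recent_charges",
    "retrieve_current_refund_policy",
    "determine_eligibility_or_escalate",
]
observed = [
    "check_recent_charges",
    "identify_customer_account",
    "determine_eligibility_or_escalate",
]
rules = [
    ("identify_customer_account", "check_recent_charges"),
    ("retrieve_current_refund_policy", "determine_eligibility_or_escalate"),
]
coverage, missing = subgoal_coverage(required, observed)
violations = ordering_violations(observed, rules)
```

In this example the agent covers three of four subgoals and breaks both ordering rules: it checks charges before identifying the account, and it decides eligibility without ever retrieving the policy.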
Plan efficiency
How many unnecessary steps were taken?
Track:
- extra retrieval calls
- repeated tool invocations with minor argument changes
- loops or dead-end branches
- excessive clarification when enough evidence already exists
Decision-point correctness
At each branch, did the agent choose an appropriate next action?
Examples:
- ask a clarifying question vs guess
- use retrieval vs call transactional API
- retry tool vs escalate
- terminate vs continue searching
Recovery quality
When a step fails, does the agent recover appropriately?
For example:
- retries transient API failure once, then escalates
- notices retrieval contradiction and requests human review
- avoids fabricating missing tool results
These metrics give you a much better picture than a binary pass/fail.
3) Build a step-level failure taxonomy
If you want evals to drive engineering decisions, every failed run should be classifiable.
A practical step-level taxonomy looks something like this:
A. Task interpretation failures
- misunderstood user intent
- ignored constraints
- failed to detect ambiguity
B. Planning failures
- missing required subgoals
- invalid subgoal order
- premature termination
- unnecessary decomposition / overplanning
C. Retrieval failures
- retrieval skipped when needed
- low-relevance documents selected
- stale or conflicting documents not handled
- context window overfilled with noisy evidence
D. Tool selection failures
- wrong tool chosen
- tool omitted
- excessive tool usage
- unsafe tool chosen without required checks
E. Tool argument failures
- malformed arguments
- missing required parameters
- wrong entity resolution
- weak query formulation
F. Tool result interpretation failures
- misread tool output
- ignored structured error
- treated empty result as confirmation
- failed to detect contradiction across tools
G. Memory failures
- wrote incorrect summary
- omitted key unresolved issue
- retrieved stale memory
- over-relied on prior memory despite conflicting fresh evidence
H. Guardrail/compliance failures
- violated policy despite correct task completion
- leaked sensitive information
- exceeded action permissions
- lacked required escalation
I. Response generation failures
- final answer incorrect
- unsupported claim
- poor user communication
- wrong formatting or actionability
J. Operational failures
- timeout
- rate limit loop
- token explosion
- non-termination
You do not need perfect labeling fidelity at the start. Even a coarse taxonomy is enough to expose dominant failure modes.
Crucially, let multiple labels coexist. Agent failures are often cascades: bad retrieval causes wrong policy selection, which causes an incorrect memory write, which later affects the final response.
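A small sketch of what multi-label bookkeeping might look like, using the category letters from the taxonomy above. Treating the first label in each list as the earliest failure in the cascade is an assumption made for this example, not a rule the taxonomy itself imposes.

```python
from collections import Counter
from enum import Enum

class FailureLabel(Enum):
    TASK_INTERPRETATION = "A"
    PLANNING = "B"
    RETRIEVAL = "C"
    TOOL_SELECTION = "D"
    TOOL_ARGUMENTS = "E"
    TOOL_RESULT_INTERPRETATION = "F"
    MEMORY = "G"
    GUARDRAIL = "H"
    RESPONSE_GENERATION = "I"
    OPERATIONAL = "J"

def label_counts(labeled_runs):
    """labeled_runs: run_id -> ordered list of labels (first = earliest failure in the cascade).

    Returns two counters: runs in which each label appears at all, and runs
    in which each label is the earliest (assumed root-cause) failure.
    """
    appears_in = Counter()
    root_cause = Counter()
    for labels in labeled_runs.values():
        appears_in.update(set(labels))
        if labels:
            root_cause[labels[0]] += 1
    return appears_in, root_cause

labeled_runs = {
    "run_a": [FailureLabel.RETRIEVAL, FailureLabel.MEMORY, FailureLabel.RESPONSE_GENERATION],
    "run_b": [FailureLabel.RETRIEVAL, FailureLabel.RESPONSE_GENERATION],
    "run_c": [FailureLabel.TOOL_ARGUMENTS],
}
appears_in, root_cause = label_counts(labeled_runs)
```

Keeping the two counts separate matters: response generation shows up in two of these three runs, but it is never where the cascade starts.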
4) Use both synthetic and human-labeled datasets
Teams often swing too hard one way.
Some rely entirely on human-labeled examples. These are high quality but expensive, slow, and often too small.
Others generate huge synthetic sets and trust them too much. These scale well but can encode unrealistic distributions or leak the assumptions of the generation process.
In practice, you want a portfolio.
Human-labeled test sets
Use these for:
- high-risk workflows
- business-critical tasks
- ambiguous cases requiring domain judgment
- release gating
- calibrating automated judges
Your human-labeled set should include:
- canonical happy paths
- near-boundary policy cases
- ambiguous inputs
- adversarial or misleading phrasing
- partial-information cases
- tool failure and stale-data scenarios
For each case, annotate:
- expected outcome
- required subgoals
- acceptable tool sequences or disallowed actions
- severity if failed
- preferred escalation behavior
Synthetic test sets
Use these for:
- scale
- coverage of combinatorial edge cases
- stress testing planners and tool routing
- mutation testing of prompts, docs, and schemas
- regression expansion after incidents
Good synthetic generation patterns include:
Parameterized scenario generation
Create templates with varying entities, missing fields, policy conflicts, and tool outcomes.
Mutation-based generation
Take real or labeled examples and mutate:
- wording
- order of facts
- irrelevant distractors
- contradictory snippets
- missing identifiers
- noisy retrieved docs
Failure injection
Generate scenarios where tools return:
- timeout
- partial success
- empty results
- malformed fields
- stale versions
- conflicting outputs
Synthetic data is especially powerful for agent eval because the failure surface is combinatorial. You can simulate conditions humans would rarely write by hand.
The rule of thumb: use humans to define truth and severity, then use synthetic generation to broaden coverage around those truths.
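Parameterized generation and failure injection can be combined with a plain Cartesian product. This is only a sketch: the scenario fields, the outcome labels, and the `generate_scenarios` helper are illustrative names, and real generators would also vary wording, distractors, and policy conflicts.

```python
import itertools
import random

# Tool outcomes mirroring the failure-injection list above, plus the normal case.
TOOL_OUTCOMES = [
    "ok", "timeout", "partial_success", "empty_result",
    "malformed_fields", "stale_version", "conflicting_output",
]

def generate_scenarios(user_inputs, tools, seed=0, sample_size=None):
    """Cross user inputs, identifier presence, and one injected tool outcome per scenario."""
    scenarios = []
    combos = itertools.product(user_inputs, [True, False], tools, TOOL_OUTCOMES)
    for i, (text, has_id, tool, outcome) in enumerate(combos):
        scenarios.append({
            "scenario_id": f"syn_{i:04d}",
            "user_input": text,
            "account_id_present": has_id,
            "fault": {"tool": tool, "outcome": outcome},
        })
    if sample_size is not None:
        # A seeded sample keeps the subset reproducible across eval runs.
        scenarios = random.Random(seed).sample(scenarios, sample_size)
    return scenarios

scenarios = generate_scenarios(
    user_inputs=["I was charged twice", "Please refund my last invoice"],
    tools=["crm_lookup", "billing_api", "policy_search"],
)
```

Two inputs, two identifier states, three tools, and seven outcomes already yield 84 scenarios, which illustrates how quickly the space grows and why a seeded sample is useful for the fast suite.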
5) Build environment simulators, not just datasets
A static prompt-response benchmark is insufficient for agents because actions change future state.
You need replayable environments that simulate:
- tool outputs based on agent actions
- state transitions over time
- API failures and retries
- retrieval corpora and ranking behavior
- memory reads/writes
- permission boundaries
This does not have to be fancy at first. A good simulator can be a deterministic harness around mocked tools and corpora.
What a simulator should provide
Deterministic execution
The same test case should produce the same environment behavior unless you intentionally sample stochastic conditions.
Action logging
Every tool call, argument payload, retrieval query, memory write, and decision node should be traceable.
Configurable fault injection
Let tests specify:
- first API call times out, second succeeds
- retrieval index returns stale doc first
- memory store returns conflicting previous summary
State assertions
Allow checks like:
- no write to memory until eligibility determined
- no billing API call before identity verification
- no external action taken after policy conflict detected
This is how agent eval becomes closer to integration testing for distributed systems.
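To make the simulator requirements concrete, here is a deliberately small harness with scripted faults, action logging, and one state assertion. All names (`SimulatedEnv`, `ToolTimeout`, `not_called_before`, the `identity_verified` event) are hypothetical, and the tool responses are canned rather than computed from state.

```python
class ToolTimeout(Exception):
    """Raised by the simulator when a scripted timeout fault fires."""

class SimulatedEnv:
    def __init__(self, responses, faults=None):
        self.responses = responses  # tool name -> canned result
        # tool name -> queue of scripted outcomes, consumed one per call
        self.faults = {tool: list(queue) for tool, queue in (faults or {}).items()}
        self.log = []

    def record(self, event_type, **fields):
        self.log.append({"type": event_type, **fields})

    def call(self, tool, **args):
        self.record("tool_call", tool=tool, args=args)
        queue = self.faults.get(tool)
        if queue and queue.pop(0) == "timeout":
            self.record("tool_result", tool=tool, status="timeout")
            raise ToolTimeout(tool)
        self.record("tool_result", tool=tool, status="ok")
        return self.responses[tool]

def not_called_before(log, blocked_tool, required_event_type):
    """State assertion: blocked_tool must not be called before required_event_type is logged."""
    for event in log:
        if event["type"] == required_event_type:
            return True
        if event["type"] == "tool_call" and event.get("tool") == blocked_tool:
            return False
    return True

env = SimulatedEnv(
    responses={"billing_api": {"status": "paid"}},
    faults={"billing_api": ["timeout"]},  # first call times out, second succeeds
)
env.record("identity_verified")
try:
    result = env.call("billing_api", account_id="acct_1")
except ToolTimeout:
    result = env.call("billing_api", account_id="acct_1")  # a single retry
```

Because the fault queue is part of the test case rather than sampled at run time, the same scenario replays identically on every run.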
6) Define a scoring stack, not one metric
A production-worthy scorecard usually has multiple dimensions.
Here is a practical stack.
End-to-end task metrics
- task success rate
- constraint satisfaction rate
- escalation correctness rate
- user-visible answer quality
Trajectory metrics
- average steps per task
- required subgoal coverage
- invalid action rate
- unnecessary tool call rate
- recovery success rate
- loop/non-termination rate
Retrieval metrics
- retrieval-needed detection precision/recall
- top-k evidence relevance
- evidence sufficiency rate
- stale/conflict detection rate
Tool-use metrics
- tool selection accuracy
- argument correctness
- structured error handling rate
- retry appropriateness
Memory metrics
- memory write accuracy
- memory retrieval usefulness
- stale memory override rate
- contamination rate from incorrect prior summaries
Safety and guardrail metrics
- policy violation rate
- unsafe action attempt rate
- sensitive-data exposure rate
- human-escalation compliance
Operational metrics
- cost per task
- token usage by phase
- median and p95 latency
- tool latency contribution
- success under degraded dependencies
Not every team needs every metric on day one. But you do need enough to observe the main tradeoffs.
7) Use evals to drive model selection and architecture choices
One of the biggest mistakes in agent development is choosing models based only on general benchmarks or raw final-answer quality.
Different agent stages often benefit from different models.
For example:
- a smaller fast model may be enough for routing or classification
- a stronger model may be needed for long-horizon planning
- structured argument construction might do well on a model that is especially reliable with JSON/function calling
- synthesis of final user-facing output may tolerate a cheaper model if the earlier steps already constrained the evidence well
Your eval harness should support stage-specific ablations:
- planner model A vs B
- retrieval reranker on vs off
- tool schema version 1 vs 2
- memory summarizer prompt old vs new
- final response model small vs large
Then compare not just task success, but the whole scorecard.
A typical pattern looks like this:
- Model X improves task success by 3%
- but increases average steps by 25%
- and doubles token cost
- while only helping on a narrow subset of ambiguous planning cases
That may still be worth it for high-value workflows, but maybe only for those workflows. Evals let you route selectively instead of over-upgrading the whole stack.
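Comparing whole scorecards is mostly arithmetic. A sketch, assuming each configuration produces a flat dict of metrics: the metric names and figures below are made up to mirror the pattern above, and the 3% gain is read as three percentage points.

```python
def compare_scorecards(baseline, candidate):
    """Report absolute and relative change for every metric in the baseline scorecard."""
    report = {}
    for metric, old in baseline.items():
        new = candidate[metric]
        report[metric] = {
            "abs_delta": round(new - old, 4),
            # Relative change is undefined when the baseline value is zero.
            "rel_delta": round((new - old) / old, 4) if old else None,
        }
    return report

baseline = {"task_success": 0.80, "avg_steps": 8.0, "tokens_per_task": 4000}
candidate = {"task_success": 0.83, "avg_steps": 10.0, "tokens_per_task": 8000}
report = compare_scorecards(baseline, candidate)
```

Reporting both deltas avoids a common ambiguity: a three-point gain in task success reads very differently next to a 25% increase in steps and a 100% increase in tokens.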
8) Use eval data to tune prompts, tools, and guardrails differently
Not all failures are solved by prompt changes.
A good failure taxonomy will tell you what lever to pull.
When prompt changes are appropriate
- planner omits a required subgoal consistently
- final response fails formatting or user communication
- retrieval query formulation is weak but tool choice is correct
When model changes are appropriate
- structured argument construction is unreliable despite prompt hardening
- contradiction handling is weak across long contexts
- smaller model fails nuanced policy interpretation
When tool/interface changes are appropriate
- function schemas are ambiguous
- tools require too many inferred arguments
- tool outputs are hard to parse or overly verbose
- status/error fields are inconsistent across APIs
When guardrail tuning is appropriate
- unsafe actions are attempted in edge cases
- escalation thresholds are too lax or too strict
- sensitive data is overexposed in memory or response drafts
When architecture changes are appropriate
- one monolithic agent loops too much; split planner/executor
- memory writes are too unconstrained; add schema validation
- retrieval corpora are too broad; add scoped indices
- tool-use logic should become a deterministic policy for certain steps
This is where evals become operationally valuable. They should not merely tell you “version B is better.” They should tell you why.
Implementation details: a practical harness design
A robust harness does not need to be exotic. It does need consistent traces, structured artifacts, and repeatability.
Instrument the agent run as an event stream
Capture each run as structured events:
```json
{
  "run_id": "run_7821",
  "task_id": "billing_refund_eligibility_v3",
  "events": [
    {"t": 1, "type": "user_input", "content": "I was charged twice"},
    {"t": 2, "type": "plan", "subgoals": ["identify_account", "check_billing", "fetch_policy"]},
    {"t": 3, "type": "tool_call", "tool": "crm_lookup", "args": {"email": "..."}},
    {"t": 4, "type": "tool_result", "tool": "crm_lookup", "status": "missing_identifier"},
    {"t": 5, "type": "decision", "action": "ask_user_for_identifier"},
    {"t": 6, "type": "final_output", "content": "Please confirm your account email."}
  ],
  "metrics": {
    "total_tokens": 4120,
    "latency_ms": 6820,
    "tool_calls": 1
  }
}
```
Without structured traces, step-level eval becomes manual archaeology.
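With a trace in the shape shown above, per-run summaries reduce to a few list comprehensions. This sketch assumes that a `tool_result` status of `"ok"` denotes success; the trace example does not define the full status vocabulary, so treat that as a placeholder convention.

```python
def summarize_trace(run):
    """Derive step-level counts from a run's event stream."""
    events = sorted(run["events"], key=lambda e: e["t"])
    tool_calls = [e for e in events if e["type"] == "tool_call"]
    non_ok = [
        e for e in events
        if e["type"] == "tool_result" and e.get("status") != "ok"  # assumed success value
    ]
    return {
        "run_id": run["run_id"],
        "steps": len(events),
        "tool_calls": len(tool_calls),
        "tools_used": [e["tool"] for e in tool_calls],
        "non_ok_tool_results": len(non_ok),
        "terminated": any(e["type"] == "final_output" for e in events),
    }

run = {
    "run_id": "run_7821",
    "task_id": "billing_refund_eligibility_v3",
    "events": [
        {"t": 1, "type": "user_input", "content": "I was charged twice"},
        {"t": 2, "type": "plan", "subgoals": ["identify_account", "check_billing", "fetch_policy"]},
        {"t": 3, "type": "tool_call", "tool": "crm_lookup", "args": {"email": "..."}},
        {"t": 4, "type": "tool_result", "tool": "crm_lookup", "status": "missing_identifier"},
        {"t": 5, "type": "decision", "action": "ask_user_for_identifier"},
        {"t": 6, "type": "final_output", "content": "Please confirm your account email."},
    ],
}
summary = summarize_trace(run)
```

Recomputing counts such as `tool_calls` from the events, instead of trusting a separately logged metrics field, also gives a cheap consistency check on the instrumentation itself.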
Separate judges from production models when possible
If you use LLM-as-judge, avoid grading with the exact same configuration that generated the trajectory. Use one or more of:
- a different, stronger model for evaluation
- rubric-constrained evaluation prompts
- pairwise comparisons with human calibration
- deterministic code-based checks where possible
LLM judges are helpful for nuanced trajectory review, but should be anchored by explicit rubrics and periodically audited.
Prefer code checks for objective constraints
Examples of easy automated checks:
- prohibited tool called before verification
- required tool not called on certain task classes
- malformed function arguments
- memory written before task completion
- more than N repeated retrieval calls
- timeout not followed by retry/escalation policy
Reserve LLM judging for softer dimensions like:
- whether the final response is clear and actionable
- whether a plan addresses all relevant concerns in ambiguous cases
- whether the agent handled conflicting evidence reasonably
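Several of the objective checks listed earlier in this section (tool ordering, retrieval caps, timeout handling) can be written as a single pass over the event trace. The sketch below assumes hypothetical event conventions: a `decision` event with action `identity_verified` or `escalate`, and illustrative tool groupings in `PROTECTED_TOOLS` and `RETRIEVAL_TOOLS`.

```python
PROTECTED_TOOLS = {"billing_api"}    # assumed to require identity verification first
RETRIEVAL_TOOLS = {"policy_search"}  # assumed to count toward the retrieval cap

def check_trace(events, max_retrievals=3):
    """Return a list of human-readable violations found in one run's events."""
    violations = []
    verified = False
    retrievals = 0
    for i, event in enumerate(events):
        if event["type"] == "decision" and event.get("action") == "identity_verified":
            verified = True
        elif event["type"] == "tool_call":
            if event["tool"] in PROTECTED_TOOLS and not verified:
                violations.append(f"{event['tool']} called before verification")
            if event["tool"] in RETRIEVAL_TOOLS:
                retrievals += 1
        elif event["type"] == "tool_result" and event.get("status") == "timeout":
            # The very next event must be a retry of the same tool or an escalation.
            nxt = events[i + 1] if i + 1 < len(events) else {}
            retried = nxt.get("type") == "tool_call" and nxt.get("tool") == event["tool"]
            escalated = nxt.get("type") == "decision" and nxt.get("action") == "escalate"
            if not (retried or escalated):
                violations.append(f"timeout on {event['tool']} not followed by retry or escalation")
    if retrievals > max_retrievals:
        violations.append(f"{retrievals} retrieval calls exceed the limit of {max_retrievals}")
    return violations

bad_trace = [
    {"type": "tool_call", "tool": "billing_api"},
    {"type": "tool_result", "tool": "billing_api", "status": "timeout"},
    {"type": "final_output", "content": "Your refund has been approved."},
]
violations = check_trace(bad_trace)
```

The bad trace above trips two checks at once: the billing call precedes verification, and the timeout is followed directly by a final answer, which is the fabricated-status pattern described in the failure scenario.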
Version everything
At minimum, track versions for:
- model(s)
- prompts
- tool schemas
- retrieval index / corpus snapshot
- memory schema
- evaluator prompts/rubrics
- simulator configuration
Otherwise your results will be non-reproducible.
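One lightweight way to tie results to versions is to hash a manifest of everything listed above and store the fingerprint with each eval run. This is a sketch only; the manifest keys and version strings below are invented for illustration.

```python
import hashlib
import json

def manifest_fingerprint(manifest: dict) -> str:
    """Stable short hash of a version manifest, independent of key order."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

manifest = {
    "model": "planner-model-a",            # hypothetical identifiers throughout
    "prompt_version": "planner_v12",
    "tool_schema_version": "2",
    "corpus_snapshot": "policy_docs_2026_01",
    "memory_schema_version": "1",
    "judge_rubric_version": "rubric_v4",
    "simulator_config": "faults_default",
}
fingerprint = manifest_fingerprint(manifest)
```

Two eval runs with the same fingerprint are comparable; if the fingerprints differ, at least one versioned input changed and the manifest diff shows which.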
Regression suites: the safety net most teams build too late
Every production incident or near miss should become a permanent regression case.
Your regression suite should have categories such as:
- known policy edge cases
- prior hallucinated tool outputs
- stale doc retrieval failures
- memory contamination incidents
- timeout and retry loops
- formatting failures for downstream systems
- unsafe action attempts
Tag each case by:
- severity
- subsystem implicated
- workflow/domain
- customer impact
- whether it is release-blocking
A useful pattern is to maintain two suites:
Fast suite
- small
- deterministic
- runs on every PR or prompt change
- catches obvious regressions quickly
Full pre-release suite
- much larger
- includes simulator-heavy scenarios
- model comparison runs
- latency/cost profiling
- human spot checks on sampled failures
This mirrors how mature software teams use unit and integration suites.
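The tags above are what make the two-suite split mechanical. A sketch, assuming each regression case carries the listed tags plus a hypothetical `deterministic` flag: the fast suite takes only deterministic cases and prioritizes release-blocking and severe ones.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}  # illustrative severity scale

@dataclass(frozen=True)
class RegressionCase:
    case_id: str
    severity: str
    subsystem: str
    workflow: str
    customer_impact: str
    release_blocking: bool
    deterministic: bool  # hypothetical flag: safe to run without stochastic simulators

def fast_suite(cases, limit=50):
    """Pick deterministic cases, release-blocking first, then by severity."""
    eligible = [c for c in cases if c.deterministic]
    eligible.sort(key=lambda c: (not c.release_blocking, SEVERITY_ORDER.get(c.severity, 3)))
    return eligible[:limit]

cases = [
    RegressionCase("stale_policy_01", "major", "retrieval", "refunds", "wrong answer", False, True),
    RegressionCase("wrong_user_lookup_01", "critical", "tool_arguments", "billing", "privacy", True, True),
    RegressionCase("timeout_loop_01", "critical", "operational", "billing", "outage", True, False),
    RegressionCase("format_break_01", "minor", "response", "refunds", "cosmetic", False, True),
]
selected = [c.case_id for c in fast_suite(cases, limit=2)]
```

The timeout loop case is critical but non-deterministic in this example, so it stays in the full pre-release suite rather than running on every change.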
Release gates for agents
Do not ship based on vibes or a few demos.
Define explicit release gates by workflow and risk tier.
Example release gate structure:
Quality gates
- no regression on critical regression suite cases
- task success rate above threshold on tier-1 workflows
- subgoal coverage above threshold on required tasks
- unsafe action attempt rate below threshold
Operational gates
- p95 latency below agreed SLO
- average cost per task within budget band
- tool-call inflation below threshold versus prior release
- non-termination rate below threshold
Robustness gates
- passes simulator scenarios for timeout, stale docs, and contradictory evidence
- stable performance across three random seeds or repeated runs where stochasticity matters
- no critical failures in memory contamination tests
Human review gates
- sampled failures reviewed by domain expert
- judge calibration rechecked on current release candidate
- policy/compliance signoff for impacted flows
Not every workflow needs the same bar. A low-risk internal drafting assistant and a high-risk operations agent should have different thresholds.
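Gates like these are easiest to enforce when they are data rather than prose. The sketch below shows one possible encoding; the thresholds are placeholders rather than recommendations, and the metric names are assumptions about what a scorecard might contain.

```python
def evaluate_gates(scorecard, gates):
    """Return a list of failed gates; an empty list means the candidate may ship."""
    failures = []
    for metric, (direction, threshold) in gates.items():
        value = scorecard.get(metric)
        if value is None:
            # A metric that was never measured fails closed.
            failures.append(f"{metric}: not measured")
        elif direction == "min" and value < threshold:
            failures.append(f"{metric}: {value} is below {threshold}")
        elif direction == "max" and value > threshold:
            failures.append(f"{metric}: {value} is above {threshold}")
    return failures

# Illustrative thresholds for a high-risk (tier-1) workflow.
TIER1_GATES = {
    "task_success_rate": ("min", 0.90),
    "subgoal_coverage": ("min", 0.95),
    "unsafe_action_attempt_rate": ("max", 0.0),
    "p95_latency_ms": ("max", 15000),
    "non_termination_rate": ("max", 0.01),
}

candidate = {
    "task_success_rate": 0.93,
    "subgoal_coverage": 0.97,
    "unsafe_action_attempt_rate": 0.002,
    "p95_latency_ms": 12400,
}
failures = evaluate_gates(candidate, TIER1_GATES)
```

Treating an unmeasured metric as a failure is a design choice worth keeping: it prevents a release from passing simply because part of the eval suite did not run.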
How agent evals differ from single-turn prompt testing
It is worth stating this plainly because many teams try to reuse the same mental model.
Single-turn prompt testing asks: given this input, is the output good?
Agent evaluation asks:
- did the system choose the right sequence of actions?
- were those actions valid under uncertainty and environment constraints?
- did it use evidence correctly across steps?
- did it recover safely from failure?
- did it update state appropriately for future turns?
- did it do all of this at acceptable cost and latency?
That means agent evals need:
- trajectory visibility
- environment simulation
- stateful testing
- component attribution
- operational scorecards
- release gates tied to system risk
This is why a prompt playground is useful for ideation, but insufficient for production readiness.
Cost and latency tradeoffs: what to measure before you get surprised
Teams often optimize for success rate until finance or operations forces a reset.
Agent evals should help you quantify tradeoffs before launch.
Track per workflow:
- average tokens by stage: planning, retrieval summarization, tool interpretation, final response
- model cost by stage
- tool/API cost if applicable
- median/p95 latency by stage
- success lift from extra steps or larger models
Then ask practical questions:
- Is a 2% quality gain worth 3x token cost?
- Is an additional retrieval hop worth its cost if it improves success on only 5% of tasks?
- Can a cheaper model handle memory summarization without affecting downstream success?
- Should the large model be invoked only after a small-model attempt fails a confidence threshold?
Good eval infrastructure enables staged routing strategies, not just winner-take-all model selection.
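The staged-routing question has a simple expected-cost form. Assuming the small model always runs first and the large model runs only when the small attempt is rejected, expected cost per task is the small-model cost plus the escalation rate times the large-model cost. The prices below are invented for illustration.

```python
def staged_routing_cost(small_cost, large_cost, escalation_rate):
    """Expected cost per task when the large model runs only after the small attempt is rejected."""
    return small_cost + escalation_rate * large_cost

def break_even_escalation_rate(small_cost, large_cost):
    """Escalation rate above which staged routing costs more than always using the large model."""
    return 1 - small_cost / large_cost

# Hypothetical per-task costs in dollars.
small, large = 0.002, 0.02
expected = staged_routing_cost(small, large, escalation_rate=0.25)
break_even = break_even_escalation_rate(small, large)
```

With these numbers, staged routing stays cheaper than always calling the large model until roughly 90% of tasks escalate. The eval suite is what supplies the escalation rate, and it is also what tells you whether quality holds at that rate.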
A phased rollout plan for teams building this from scratch
If your current process is mostly manual prompt tests, do not try to build the perfect eval platform in one shot.
Phase 1: Basic end-to-end harness
- collect 50–100 real task examples
- define task specs and success criteria
- log traces for plans, tool calls, and outputs
- score task success plus a few operational metrics
Phase 2: Failure localization
- add step-level taxonomy
- add code-based checks for hard constraints
- label top failure modes
- create first regression suite from observed failures
Phase 3: Simulators and synthetic coverage
- mock tools and replay realistic state
- inject timeouts, stale docs, and contradictions
- generate synthetic variants around critical workflows
Phase 4: Release gating and model ablations
- define workflow-specific thresholds
- compare models/prompts/tools using the same suite
- track cost/latency alongside quality
- require regression pass before release
Phase 5: Continuous eval ops
- feed production incidents back into regression sets
- monitor drift in user tasks and tool behavior
- recalibrate judges and update task specs
- maintain dashboards by subsystem and workflow
This progression is achievable for most engineering teams and far more valuable than endlessly tuning prompts without observability.
Practical takeaways
If you are building a multi-step LLM agent, the biggest evaluation mistake is treating it like a smarter chatbot.
A production agent is a control-flow system built from models, tools, retrieval, memory, and policies. Your eval strategy needs to mirror that structure.
The most effective patterns are straightforward:
- define task specs with required subgoals and hard constraints
- instrument full trajectories, not just final outputs
- score decomposition quality and step-level behavior
- maintain a failure taxonomy to map failures to fixes
- combine human-labeled sets for truth with synthetic sets for coverage
- build deterministic environment simulators with fault injection
- track cost and latency as first-class metrics
- convert incidents into regression tests
- ship only through explicit release gates
And perhaps the most important point: evals are not just for ranking models. They are the mechanism by which you decide whether to change prompts, redesign tools, split agents, tune guardrails, or narrow workflow scope.
That is the difference between a team that demos agents and a team that operates them.
Before production, you do not need perfect confidence. You do need a harness that tells you, with enough honesty, where the system fails, why it fails, how expensive those failures are, and whether the latest change actually improved the product.
That is what good agent evals buy you: not certainty, but controlled learning.