Designing Evals for Multi-Step LLM Agents Before Production

Most teams discover too late that evaluating an agent is not the same thing as evaluating a prompt.

A single-turn prompt can often be judged with a relatively clean input/output lens: did the model answer correctly, follow the format, avoid policy violations, and stay within latency and cost bounds? A multi-step agent is a different beast. It can choose the wrong subgoal, call the right tool with the wrong arguments, retrieve the wrong documents, write a flawed intermediate summary into memory, recover poorly from partial failures, or technically complete the task while violating constraints you care about in production. If you only evaluate the final answer, you miss most of the failure surface.

This usually becomes painfully obvious during staging. The demo looked good. A handful of happy-path test prompts worked. Then the agent hits a more realistic workflow: it plans too broadly, burns tokens on unnecessary tool calls, fetches stale knowledge, loops when a dependency times out, or contaminates later steps with a bad intermediate assumption. The team responds the way teams often do under pressure: tweak the system prompt, add more examples, maybe swap models, and re-run a few manual tests. Sometimes the changes help. Often they just move the failure around.

The core issue is not that the model is bad. The issue is that the evaluation strategy is too shallow for the system that was built.

If your production system includes planning, retrieval, tool use, memory, and stateful multi-turn behavior, your evaluation harness has to reflect that architecture. You need to measure the task at multiple layers, localize failure, simulate realistic environments, maintain regression suites, and define release gates that align with business risk. Otherwise, you are shipping a distributed system with no observability and calling it prompt engineering.

A realistic failure scenario

Consider a support operations agent used internally by a SaaS company. The agent is supposed to help human support reps resolve customer issues by:

reading the customer conversation
retrieving relevant policy and product docs
checking account state via internal APIs
proposing a resolution plan
drafting a customer-safe response
storing a short case summary for future follow-up

In prototype testing, the team evaluates on 50 manually written cases. The agent looks impressive. It resolves refund requests, identifies billing issues, and drafts polished replies. Leadership wants it in production.

Then pre-production testing with live-like traffic reveals the real behavior:

The planner overuses retrieval, pulling five to ten documents even for simple known issues.
It sometimes queries the billing API before confirming the account ID, leading to wrong-user lookups.
Retrieved policy documents contain overlapping but slightly different refund rules; the agent anchors on an outdated policy.
When the API times out, the agent hallucinates a likely account status instead of asking for retry or escalation.
The final response is often fluent and plausible, so casual reviewers miss the fact that the intermediate reasoning violated support policy.
The memory summary writes “customer eligible for refund” even when eligibility was uncertain, biasing the next agent turn.

If the only eval metric is “was the final drafted response acceptable,” you will undercount important failures. Some bad trajectories happen to end in an acceptable output. Some good trajectories fail because of one flaky tool call and should be treated differently from genuine reasoning failures. Some steps are expensive but harmless; others are rare but catastrophic.

This is the key pattern: agent quality is path-dependent. Final-answer grading is necessary, but nowhere near sufficient.

The pattern to identify: agents fail by trajectory, not just by answer

Single-turn prompt testing tends to collapse system quality into a single completion. Agent systems introduce a trajectory:

interpret the task
decompose into subgoals
choose whether to retrieve or call tools
decide tool order
construct tool arguments
interpret tool outputs
update working memory
continue, repair, or terminate
produce final output

Every one of those stages can fail independently, and failures compound.

That means your eval design should answer at least five questions:

Task success: Did the agent achieve the user’s objective under the relevant constraints?
Trajectory quality: Did it follow an efficient, policy-compliant, and robust sequence of steps?
Component quality: Which subsystem failed: planning, retrieval, tool choice, tool arguments, memory, guardrails, or response generation?
Operational quality: What were the latency, cost, token, and tool-utilization characteristics?
Reliability under variation: Does performance hold up across task types, environment conditions, and model/prompt/tool changes?

Once you see the system through that lens, the naive evaluation approach becomes obviously inadequate.

Why the naive approach fails

The common first pass for agent evals usually looks like this:

collect a small set of representative prompts
run them through the full agent
have a human or an LLM judge the final answer
maybe compare a few prompts or models

This is better than nothing, but it breaks down for multi-step systems for several reasons.

1. Final-answer-only scoring hides root causes

Two agent runs can both fail the same task for completely different reasons. One might fail because retrieval returned nothing relevant. Another might fail because the planner skipped retrieval entirely. A third might fail because a correct tool result was later overwritten by a bad memory summary.

Those imply different fixes:

retrieval indexing changes
prompt or policy changes for tool selection
memory write schema and validation changes

If your eval only says “incorrect answer,” you cannot prioritize engineering work.

2. Happy-path prompts underrepresent production complexity

Manually authored prompts are usually cleaner than real user requests. They omit ambiguity, incomplete inputs, contradictory evidence, adversarial phrasing, and environment noise. Real users also create long-tail combinations of conditions that no one thought to handwrite.

Agents are more vulnerable than single-turn prompts here because they must make control-flow decisions under uncertainty. Ambiguity doesn’t just lower answer quality; it changes which path the agent takes.

3. Agent performance depends on environment behavior

A single-turn benchmark usually assumes the model is the system. In agents, the environment is part of the product:

retrieval quality
tool schema clarity
API reliability
memory store behavior
cache freshness
policy versions
rate limits
network delays

Without environment simulators and replayable tool traces, you cannot separate model behavior from infrastructure behavior.

4. Evals without cost and latency metrics incentivize bad behavior

An agent can increase final success by making more tool calls and spending more tokens. That does not mean it is production-ready. Sometimes the “better” version is financially or operationally unacceptable.

You need to measure:

median and p95 wall-clock latency
average number of steps
average tokens per successful task
average tool calls per task
failure recovery rate
cost per completed task

Otherwise you reward brute-force reasoning and over-retrieval.
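As a rough illustration of the measurements listed above, the sketch below aggregates per-run records into a small operational summary. The record fields (`latency_ms`, `cost_usd`, `steps`, `tool_calls`, `success`) are hypothetical names chosen for this example, and the p95 uses a simple nearest-rank rule rather than any particular library's interpolation.

```python
import math
from statistics import median

def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

def operational_summary(runs):
    """Aggregate hypothetical per-run records into operational metrics."""
    latencies = [r["latency_ms"] for r in runs]
    completed = [r for r in runs if r["success"]]
    total_cost = sum(r["cost_usd"] for r in runs)
    return {
        "median_latency_ms": median(latencies),
        "p95_latency_ms": percentile(latencies, 95),
        "avg_steps": sum(r["steps"] for r in runs) / len(runs),
        "avg_tool_calls": sum(r["tool_calls"] for r in runs) / len(runs),
        # Failed runs still cost money, so total spend is divided by completions.
        "cost_per_completed_task": (
            total_cost / len(completed) if completed else float("inf")
        ),
    }

runs = [
    {"latency_ms": 1000, "cost_usd": 0.02, "steps": 4, "tool_calls": 2, "success": True},
    {"latency_ms": 3000, "cost_usd": 0.05, "steps": 9, "tool_calls": 6, "success": False},
    {"latency_ms": 2000, "cost_usd": 0.03, "steps": 5, "tool_calls": 2, "success": True},
]
summary = operational_summary(runs)
```

Dividing total spend by completed tasks, rather than by all runs, is a deliberate choice here: it makes a version that succeeds slightly more often but wastes far more on failed attempts look as expensive as it really is.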

5. One-off reviews do not protect against regressions

Agent changes interact in surprising ways. A prompt improvement for planning may increase tool overuse. A cheaper model may reduce latency but silently worsen argument construction. A retrieval ranking tweak may help one domain while hurting another.

Without a stable regression suite and release gates, teams end up re-learning old failures.

A better approach: eval the architecture you actually built

The practical approach is to design the eval harness around the agent’s execution graph.

A strong evaluation system for multi-step agents usually has six layers:

Task-level evals for end-to-end success
Step-level evals for trajectory and failure localization
Component evals for planning, retrieval, tools, memory, and guardrails
Environment simulation for realistic, replayable conditions
Regression suites for known failures and critical workflows
Release gates tied to quality, cost, latency, and safety thresholds

Think of it as test infrastructure plus observability plus decision support.

1) Start with a task model, not just a prompt list

Before building metrics, define the task structure for your agent family.

For each task type, document:

user goal
required constraints and policies
expected subgoals
allowed and disallowed tools
success criteria
acceptable recovery behaviors
failure severity

For example, in a support agent, “resolve a billing dispute” may require:

identify account correctly
inspect billing status
retrieve current refund policy
determine refund eligibility
avoid issuing commitments without validation
escalate when policy ambiguity remains

This decomposition becomes the basis for both human labeling and automated checks.

A useful artifact here is a task spec schema. At minimum:

```json
{
  "task_id": "billing_refund_eligibility_v3",
  "user_input": "Customer says they were double charged and wants a refund.",
  "context": {
    "account_id_present": false,
    "policy_version": "2026-01",
    "available_tools": ["crm_lookup", "billing_api", "policy_search"]
  },
  "required_subgoals": [
    "identify_customer_account",
    "check_recent_charges",
    "retrieve_current_refund_policy",
    "determine_eligibility_or_escalate"
  ],
  "hard_constraints": [
    "do_not_claim_refund_approved_without_billing_confirmation",
    "do_not_access_account_without_verified_identifier"
  ],
  "success_criteria": [
    "correct_account_or_request_missing_identifier",
    "correct_policy_applied",
    "final_response_safe_and_actionable"
  ]
}
```

This is boring engineering work, but it unlocks the rest. Without explicit task specs, eval devolves into subjective impressions.
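One way to keep specs like the one above honest is to load them through a small validator, so that a spec with no subgoals or constraints fails loudly instead of silently producing a meaningless eval. The following is a minimal sketch; the `TaskSpec` class and `load_task_spec` function are hypothetical names, and the validation rules are assumptions about what a team might want to enforce.

```python
import json
from dataclasses import dataclass

# Keys every spec must carry; mirrors the example schema above.
REQUIRED_KEYS = ("task_id", "user_input", "required_subgoals",
                 "hard_constraints", "success_criteria")

@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    user_input: str
    required_subgoals: tuple
    hard_constraints: tuple
    success_criteria: tuple
    available_tools: tuple = ()

def load_task_spec(raw: str) -> TaskSpec:
    """Parse a JSON task spec and reject specs that are missing core fields."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"task spec missing keys: {missing}")
    for key in ("required_subgoals", "hard_constraints", "success_criteria"):
        if not data[key]:
            raise ValueError(f"task spec field {key!r} must not be empty")
    tools = data.get("context", {}).get("available_tools", [])
    return TaskSpec(
        task_id=data["task_id"],
        user_input=data["user_input"],
        required_subgoals=tuple(data["required_subgoals"]),
        hard_constraints=tuple(data["hard_constraints"]),
        success_criteria=tuple(data["success_criteria"]),
        available_tools=tuple(tools),
    )

raw_spec = json.dumps({
    "task_id": "billing_refund_eligibility_v3",
    "user_input": "Customer says they were double charged and wants a refund.",
    "context": {"available_tools": ["crm_lookup", "billing_api", "policy_search"]},
    "required_subgoals": ["identify_customer_account", "check_recent_charges"],
    "hard_constraints": ["do_not_access_account_without_verified_identifier"],
    "success_criteria": ["correct_policy_applied"],
})
spec = load_task_spec(raw_spec)
```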

2) Measure task decomposition quality explicitly

One major difference between agent evals and prompt evals is that the plan itself matters.

You do not necessarily need chain-of-thought or internal hidden reasoning. But you do need inspectable structured actions or plan summaries that can be evaluated. This can be done through explicit planner outputs, action logs, or normalized event traces.

Useful task decomposition metrics include:

Subgoal coverage

Did the agent address all required subgoals for the task?

Example:

required: identify account, retrieve refund policy, inspect billing status
observed: inspect billing status, draft reply
score: 1/3 covered, two required subgoals omitted
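Subgoal coverage is simple enough to compute directly from a trace once subgoals have stable identifiers. A minimal sketch, assuming observed subgoals have already been normalized to the same hypothetical names used in the task spec:

```python
def subgoal_coverage(required, observed):
    """Compare required subgoals against the subgoals observed in a run."""
    observed_set = set(observed)
    covered = [goal for goal in required if goal in observed_set]
    missing = [goal for goal in required if goal not in observed_set]
    coverage = len(covered) / len(required) if required else 1.0
    return {"covered": covered, "missing": missing, "coverage": coverage}

result = subgoal_coverage(
    required=["identify_account", "retrieve_refund_policy", "inspect_billing_status"],
    observed=["inspect_billing_status", "draft_reply"],
)
```

Returning the missing subgoals alongside the ratio matters more than the ratio itself: the list is what tells you whether the omission was critical.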

Subgoal ordering correctness

Did the agent do things in a valid order?

Common ordering failures:

calling account tools before identity verification
drafting final resolution before policy retrieval
writing memory before conflict resolution
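Ordering failures like these can often be expressed as "X must happen before Y" rules and checked against the action sequence. The sketch below is one simple way to do that; the action names and the `RULES` list are hypothetical, and real rules may need to be conditional on task type.

```python
def ordering_violations(actions, must_precede):
    """Return (before, after) pairs where `after` ran with no earlier `before`."""
    seen = set()
    violations = []
    for action in actions:
        for before, after in must_precede:
            if action == after and before not in seen:
                violations.append((before, after))
        seen.add(action)
    return violations

# Hypothetical precedence rules for a support workflow.
RULES = [
    ("verify_identity", "billing_api"),
    ("policy_search", "draft_final_resolution"),
]

trace = ["billing_api", "verify_identity", "policy_search", "draft_final_resolution"]
violations = ordering_violations(trace, RULES)
```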

Plan efficiency

How many unnecessary steps were taken?

Track:

extra retrieval calls
repeated tool invocations with minor argument changes
loops or dead-end branches
excessive clarification when enough evidence already exists

Decision-point correctness

At each branch, did the agent choose an appropriate next action?

Examples:

ask a clarifying question vs guess
use retrieval vs call transactional API
retry tool vs escalate
terminate vs continue searching

Recovery quality

When a step fails, does the agent recover appropriately?

For example:

retries transient API failure once, then escalates
notices retrieval contradiction and requests human review
avoids fabricating missing tool results
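The first recovery behavior above (retry once, then escalate) can be checked mechanically on an event trace. This is a sketch under assumptions: events are dicts with a `type` field, timeouts appear as a `tool_result` with `status` equal to `"timeout"`, and escalation appears as a `decision` event with a hypothetical `action` of `"escalate"`.

```python
def check_timeout_recovery(events, max_retries=1):
    """Return problems where a timeout is not followed by a retry or an escalation."""
    problems = []
    retries = {}  # tool name -> number of retries observed so far
    for i, event in enumerate(events):
        if event["type"] != "tool_result" or event.get("status") != "timeout":
            continue
        tool = event["tool"]
        following = events[i + 1] if i + 1 < len(events) else None
        if following is None:
            problems.append(f"{tool}: run ended right after a timeout")
        elif following["type"] == "tool_call" and following.get("tool") == tool:
            retries[tool] = retries.get(tool, 0) + 1
            if retries[tool] > max_retries:
                problems.append(f"{tool}: retried more than {max_retries} time(s)")
        elif not (following["type"] == "decision"
                  and following.get("action") == "escalate"):
            problems.append(
                f"{tool}: timeout followed by {following['type']}, not retry or escalation"
            )
    return problems

good_trace = [
    {"type": "tool_call", "tool": "billing_api"},
    {"type": "tool_result", "tool": "billing_api", "status": "timeout"},
    {"type": "tool_call", "tool": "billing_api"},
    {"type": "tool_result", "tool": "billing_api", "status": "timeout"},
    {"type": "decision", "action": "escalate"},
]
bad_trace = [
    {"type": "tool_call", "tool": "billing_api"},
    {"type": "tool_result", "tool": "billing_api", "status": "timeout"},
    {"type": "final_output", "content": "Your account is in good standing."},
]
```

A final output issued directly after a timeout, as in `bad_trace`, is a useful proxy for fabricated tool results: the agent answered without ever obtaining the data.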

These metrics give you a much better picture than a binary pass/fail.

3) Build a step-level failure taxonomy

If you want evals to drive engineering decisions, every failed run should be classifiable.

A practical step-level taxonomy looks something like this:

A. Task interpretation failures

misunderstood user intent
ignored constraints
failed to detect ambiguity

B. Planning failures

missing required subgoals
invalid subgoal order
premature termination
unnecessary decomposition / overplanning

C. Retrieval failures

retrieval skipped when needed
low-relevance documents selected
stale or conflicting documents not handled
context window overfilled with noisy evidence

D. Tool selection failures

wrong tool chosen
tool omitted
excessive tool usage
unsafe tool chosen without required checks

E. Tool argument failures

malformed arguments
missing required parameters
wrong entity resolution
weak query formulation

F. Tool result interpretation failures

misread tool output
ignored structured error
treated empty result as confirmation
failed to detect contradiction across tools

G. Memory failures

wrote incorrect summary
omitted key unresolved issue
retrieved stale memory
over-relied on prior memory despite conflicting fresh evidence

H. Guardrail/compliance failures

violated policy despite correct task completion
leaked sensitive information
exceeded action permissions
lacked required escalation

I. Response generation failures

final answer incorrect
unsupported claim
poor user communication
wrong formatting or actionability

J. Operational failures

timeout
rate limit loop
token explosion
non-termination

You do not need perfect labeling fidelity at the start. Even a coarse taxonomy is enough to expose dominant failure modes.

Crucially, let multiple labels coexist. Agent failures are often cascades: bad retrieval causes wrong policy selection, which causes an incorrect memory write, which later affects the final response.
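A taxonomy like this is easy to encode so that labels are consistent across reviewers and a single run can carry several of them. The sketch below mirrors the ten categories above; the `FailureLabel` enum and the sample run IDs are hypothetical.

```python
from collections import Counter
from enum import Enum

class FailureLabel(Enum):
    TASK_INTERPRETATION = "A"
    PLANNING = "B"
    RETRIEVAL = "C"
    TOOL_SELECTION = "D"
    TOOL_ARGUMENTS = "E"
    TOOL_RESULT_INTERPRETATION = "F"
    MEMORY = "G"
    GUARDRAIL = "H"
    RESPONSE_GENERATION = "I"
    OPERATIONAL = "J"

def label_frequencies(labeled_runs):
    """Count how many failed runs carry each label; a run may carry several."""
    counts = Counter()
    for labels in labeled_runs.values():
        counts.update(set(labels))  # count each label at most once per run
    return counts

labeled_runs = {
    # A cascade: bad retrieval led to a bad memory write and a wrong final answer.
    "run_1": [FailureLabel.RETRIEVAL, FailureLabel.MEMORY, FailureLabel.RESPONSE_GENERATION],
    "run_2": [FailureLabel.TOOL_ARGUMENTS],
    "run_3": [FailureLabel.RETRIEVAL],
}
frequencies = label_frequencies(labeled_runs)
```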

4) Use both synthetic and human-labeled datasets

Teams often swing too hard one way.

Some rely entirely on human-labeled examples. These are high quality but expensive, slow, and often too small.

Others generate huge synthetic sets and trust them too much. These scale well but can encode unrealistic distributions or leak the assumptions of the generation process.

In practice, you want a portfolio.

Human-labeled test sets

Use these for:

high-risk workflows
business-critical tasks
ambiguous cases requiring domain judgment
release gating
calibrating automated judges

Your human-labeled set should include:

canonical happy paths
near-boundary policy cases
ambiguous inputs
adversarial or misleading phrasing
partial-information cases
tool failure and stale-data scenarios

For each case, annotate:

expected outcome
required subgoals
acceptable tool sequences or disallowed actions
severity if failed
preferred escalation behavior

Synthetic test sets

Use these for:

scale
coverage of combinatorial edge cases
stress testing planners and tool routing
mutation testing of prompts, docs, and schemas
regression expansion after incidents

Good synthetic generation patterns include:

Parameterized scenario generation

Create templates with varying entities, missing fields, policy conflicts, and tool outcomes.

Mutation-based generation

Take real or labeled examples and mutate:

wording
order of facts
irrelevant distractors
contradictory snippets
missing identifiers
noisy retrieved docs

Failure injection

Generate scenarios where tools return:

timeout
partial success
empty results
malformed fields
stale versions
conflicting outputs

Synthetic data is especially powerful for agent eval because the failure surface is combinatorial. You can simulate conditions humans would rarely write by hand.
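Parameterized generation and failure injection combine naturally: enumerate the cross product of input variations and tool outcomes, and emit one scenario per combination. A minimal sketch, where the phrasings, the outcome names, and the `injected_tool_outcome` field are all illustrative assumptions:

```python
import itertools

# Hypothetical variation axes for one workflow.
PHRASINGS = [
    "I was charged twice this month.",
    "There are two identical charges on my card and I want one refunded.",
]
IDENTIFIER_PRESENT = [True, False]
BILLING_OUTCOMES = ["ok", "timeout", "empty_result", "stale_version"]

def generate_scenarios(base_task_id):
    """Build one scenario per combination of phrasing, identifier, and tool outcome."""
    scenarios = []
    combos = itertools.product(PHRASINGS, IDENTIFIER_PRESENT, BILLING_OUTCOMES)
    for i, (text, has_id, outcome) in enumerate(combos):
        scenarios.append({
            "task_id": f"{base_task_id}_syn_{i:03d}",
            "user_input": text,
            "context": {"account_id_present": has_id},
            "injected_tool_outcome": {"billing_api": outcome},
        })
    return scenarios

scenarios = generate_scenarios("billing_refund_eligibility_v3")
```

Even three small axes produce 16 cases here, which is why it usually makes sense to sample from the product for the fast suite and run the full product only before release.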

The rule of thumb: use humans to define truth and severity, then use synthetic generation to broaden coverage around those truths.

5) Build environment simulators, not just datasets

A static prompt-response benchmark is insufficient for agents because actions change future state.

You need replayable environments that simulate:

tool outputs based on agent actions
state transitions over time
API failures and retries
retrieval corpora and ranking behavior
memory reads/writes
permission boundaries

This does not have to be fancy at first. A good simulator can be a deterministic harness around mocked tools and corpora.

What a simulator should provide

Deterministic execution

The same test case should produce the same environment behavior unless you intentionally sample stochastic conditions.

Action logging

Every tool call, argument payload, retrieval query, memory write, and decision node should be traceable.

Configurable fault injection

Let tests specify:

first API call times out, second succeeds
retrieval index returns stale doc first
memory store returns conflicting previous summary

State assertions

Allow checks like:

no write to memory until eligibility determined
no billing API call before identity verification
no external action taken after policy conflict detected

This is how agent eval becomes closer to integration testing for distributed systems.
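A deterministic harness of the kind described above can start as a scripted mock tool plus a shared action log. The sketch below shows fault injection ("first API call times out, second succeeds") and one state assertion; `SimulatedTool` and `called_only_after` are hypothetical names, not part of any framework.

```python
class SimulatedTool:
    """Deterministic mock tool that replays a scripted list of responses."""

    def __init__(self, name, scripted_responses, log):
        self.name = name
        self._responses = list(scripted_responses)
        self._log = log  # action log shared across all simulated tools
        self._calls = 0

    def __call__(self, **args):
        # Replay responses in order; repeat the last one once the script runs out.
        index = min(self._calls, len(self._responses) - 1)
        response = self._responses[index]
        self._calls += 1
        self._log.append({"tool": self.name, "args": args, "status": response["status"]})
        return response

def called_only_after(log, tool, prerequisite):
    """True if `tool` is never called before the first call to `prerequisite`."""
    for entry in log:
        if entry["tool"] == prerequisite:
            return True
        if entry["tool"] == tool:
            return False
    return True  # `tool` was never called at all

log = []
crm_lookup = SimulatedTool("crm_lookup", [{"status": "ok", "account_id": "acct_1"}], log)
billing_api = SimulatedTool(
    "billing_api", [{"status": "timeout"}, {"status": "ok", "charges": 2}], log
)

# Stand-in for an agent: look up the account, hit a timeout, then retry.
account = crm_lookup(email="user@example.com")
first = billing_api(account_id=account["account_id"])
second = billing_api(account_id=account["account_id"])
```

Because the responses are scripted, the same test case yields the same environment behavior on every run, and the log gives state assertions something concrete to inspect.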

6) Define a scoring stack, not one metric

A production-worthy scorecard usually has multiple dimensions.

Here is a practical stack.

End-to-end task metrics

task success rate
constraint satisfaction rate
escalation correctness rate
user-visible answer quality

Trajectory metrics

average steps per task
required subgoal coverage
invalid action rate
unnecessary tool call rate
recovery success rate
loop/non-termination rate

Retrieval metrics

retrieval-needed detection precision/recall
top-k evidence relevance
evidence sufficiency rate
stale/conflict detection rate

Tool-use metrics

tool selection accuracy
argument correctness
structured error handling rate
retry appropriateness

Memory metrics

memory write accuracy
memory retrieval usefulness
stale memory override rate
contamination rate from incorrect prior summaries

Safety and guardrail metrics

policy violation rate
unsafe action attempt rate
sensitive-data exposure rate
human-escalation compliance

Operational metrics

cost per task
token usage by phase
median and p95 latency
tool latency contribution
success under degraded dependencies

Not every team needs every metric on day one. But you do need enough to observe the main tradeoffs.
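Some of these metrics are less obvious to compute than they sound. Take retrieval-needed detection precision/recall: one reasonable reading is to label each task with whether retrieval was actually needed, then compare that with whether the agent retrieved. A sketch under that assumption:

```python
def precision_recall(needed, retrieved):
    """Precision and recall of the agent's decision to retrieve.

    needed[i]    -> ground-truth label: did task i require retrieval?
    retrieved[i] -> observed behavior: did the agent retrieve on task i?
    """
    pairs = list(zip(needed, retrieved))
    true_pos = sum(1 for n, r in pairs if n and r)
    false_pos = sum(1 for n, r in pairs if not n and r)   # over-retrieval
    false_neg = sum(1 for n, r in pairs if n and not r)   # skipped retrieval
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

needed = [True, True, False, False, True]
retrieved = [True, True, True, False, False]
precision, recall = precision_recall(needed, retrieved)
```

Low precision here points at over-retrieval (a cost problem), while low recall points at skipped retrieval (a correctness problem), so the two numbers map to different fixes.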

7) Use evals to drive model selection and architecture choices

One of the biggest mistakes in agent development is choosing models based only on general benchmarks or raw final-answer quality.

Different agent stages often benefit from different models.

For example:

a smaller fast model may be enough for routing or classification
a stronger model may be needed for long-horizon planning
structured argument construction may call for a model that is especially reliable at JSON/function calling
synthesis of final user-facing output may tolerate a cheaper model if the earlier steps already constrained the evidence well

Your eval harness should support stage-specific ablations:

planner model A vs B
retrieval reranker on vs off
tool schema version 1 vs 2
memory summarizer prompt old vs new
final response model small vs large

Then compare not just task success, but the whole scorecard.

A typical pattern looks like this:

Model X improves task success by 3%
but increases average steps by 25%
and doubles token cost
while only helping on a narrow subset of ambiguous planning cases

That may still be worth it for high-value workflows, but maybe only for those workflows. Evals let you route selectively instead of over-upgrading the whole stack.

8) Use eval data to tune prompts, tools, and guardrails differently

Not all failures are solved by prompt changes.

A good failure taxonomy will tell you what lever to pull.

When prompt changes are appropriate

planner omits a required subgoal consistently
final response fails formatting or user communication
retrieval query formulation is weak but tool choice is correct

When model changes are appropriate

structured argument construction is unreliable despite prompt hardening
contradiction handling is weak across long contexts
smaller model fails nuanced policy interpretation

When tool/interface changes are appropriate

function schemas are ambiguous
tools require too many inferred arguments
tool outputs are hard to parse or overly verbose
status/error fields are inconsistent across APIs

When guardrail tuning is appropriate

unsafe actions are attempted in edge cases
escalation thresholds are too lax or too strict
sensitive data is overexposed in memory or response drafts

When architecture changes are appropriate

one monolithic agent loops too much; split planner/executor
memory writes are too unconstrained; add schema validation
retrieval corpora are too broad; add scoped indices
tool-use logic should become a deterministic policy for certain steps

This is where evals become operationally valuable. They should not merely tell you “version B is better.” They should tell you why.

Implementation details: a practical harness design

A robust harness does not need to be exotic. It does need consistent traces, structured artifacts, and repeatability.

Instrument the agent run as an event stream

Capture each run as structured events:

```json
{
  "run_id": "run_7821",
  "task_id": "billing_refund_eligibility_v3",
  "events": [
    {"t": 1, "type": "user_input", "content": "I was charged twice"},
    {"t": 2, "type": "plan", "subgoals": ["identify_account", "check_billing", "fetch_policy"]},
    {"t": 3, "type": "tool_call", "tool": "crm_lookup", "args": {"email": "..."}},
    {"t": 4, "type": "tool_result", "tool": "crm_lookup", "status": "missing_identifier"},
    {"t": 5, "type": "decision", "action": "ask_user_for_identifier"},
    {"t": 6, "type": "final_output", "content": "Please confirm your account email."}
  ],
  "metrics": {
    "total_tokens": 4120,
    "latency_ms": 6820,
    "tool_calls": 1
  }
}
```

Without structured traces, step-level eval becomes manual archaeology.
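Traces are only useful if they are internally consistent, so it can help to sanity-check each one before scoring it. The sketch below assumes the event shape shown above and checks three things: ordered timestamps, a `tool_calls` metric that matches the trace, and a result for every tool call. The function names are hypothetical.

```python
def tool_sequence(run):
    """Ordered list of tool names the agent called during the run."""
    return [ev["tool"] for ev in run["events"] if ev["type"] == "tool_call"]

def check_trace_consistency(run):
    """Return a list of structural problems found in a run trace."""
    problems = []
    times = [ev["t"] for ev in run["events"]]
    if times != sorted(times) or len(set(times)) != len(times):
        problems.append("event timestamps are not strictly increasing")
    counted = len(tool_sequence(run))
    reported = run["metrics"]["tool_calls"]
    if counted != reported:
        problems.append(
            f"metrics.tool_calls={reported} but trace has {counted} tool_call events"
        )
    pending = []  # tool calls still waiting for a result
    for ev in run["events"]:
        if ev["type"] == "tool_call":
            pending.append(ev["tool"])
        elif ev["type"] == "tool_result":
            if ev["tool"] in pending:
                pending.remove(ev["tool"])
            else:
                problems.append(f"tool_result for {ev['tool']} without a matching tool_call")
    problems.extend(f"tool_call to {tool} has no tool_result" for tool in pending)
    return problems

run = {
    "run_id": "run_7821",
    "events": [
        {"t": 1, "type": "user_input", "content": "I was charged twice"},
        {"t": 2, "type": "tool_call", "tool": "crm_lookup", "args": {}},
        {"t": 3, "type": "tool_result", "tool": "crm_lookup", "status": "missing_identifier"},
        {"t": 4, "type": "final_output", "content": "Please confirm your account email."},
    ],
    "metrics": {"tool_calls": 1},
}
problems = check_trace_consistency(run)
```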

Separate judges from production models when possible

If you use LLM-as-judge, avoid grading with the exact same configuration that generated the trajectory. Use one or more of the following:

a different, stronger model for evaluation
rubric-constrained evaluation prompts
pairwise comparisons with human calibration
deterministic code-based checks where possible

LLM judges are helpful for nuanced trajectory review, but should be anchored by explicit rubrics and periodically audited.

Prefer code checks for objective constraints

Examples of easy automated checks:

prohibited tool called before verification
required tool not called on certain task classes
malformed function arguments
memory written before task completion
more than N repeated retrieval calls
timeout not followed by retry/escalation policy
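Two of these checks, malformed function arguments and excessive repeated retrieval, are sketched below. The `TOOL_SCHEMAS` table is a hypothetical stand-in for real tool schemas (a production harness would more likely validate against the actual function-calling schema), and the limit of three retrieval calls is an arbitrary illustrative value.

```python
# Hypothetical required arguments and their expected types, per tool.
TOOL_SCHEMAS = {
    "crm_lookup": {"email": str},
    "billing_api": {"account_id": str},
    "policy_search": {"query": str},
}

def check_arguments(call):
    """Flag unknown tools, missing required arguments, and wrong argument types."""
    schema = TOOL_SCHEMAS.get(call["tool"])
    if schema is None:
        return [f"unknown tool {call['tool']}"]
    problems = []
    for name, expected_type in schema.items():
        if name not in call["args"]:
            problems.append(f"{call['tool']}: missing required argument {name}")
        elif not isinstance(call["args"][name], expected_type):
            problems.append(f"{call['tool']}: argument {name} has wrong type")
    return problems

def check_repeated_retrieval(calls, tool="policy_search", max_calls=3):
    """Flag runs that call the retrieval tool more than max_calls times."""
    count = sum(1 for call in calls if call["tool"] == tool)
    if count > max_calls:
        return [f"{tool} called {count} times (limit {max_calls})"]
    return []

def run_code_checks(calls):
    """Run all deterministic checks over a run's tool calls."""
    problems = []
    for call in calls:
        problems.extend(check_arguments(call))
    problems.extend(check_repeated_retrieval(calls))
    return problems

calls = [
    {"tool": "billing_api", "args": {"account_id": 42}},
    {"tool": "policy_search", "args": {}},
] + [{"tool": "policy_search", "args": {"query": "refund policy"}}] * 3
problems = run_code_checks(calls)
```

Checks like these are cheap, deterministic, and never need recalibration, which is the main argument for exhausting them before reaching for a judge model.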

Reserve LLM judging for softer dimensions like:

whether the final response is clear and actionable
whether a plan addresses all relevant concerns in ambiguous cases
whether the agent handled conflicting evidence reasonably

Version everything

At minimum, track versions for:

model(s)
prompts
tool schemas
retrieval index / corpus snapshot
memory schema
evaluator prompts/rubrics
simulator configuration

Otherwise your results will be non-reproducible.
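One lightweight way to enforce this is to stamp every eval result with a fingerprint of the full configuration manifest, so that two results are only compared when their fingerprints match or the differences are intentional. A minimal sketch; the manifest keys and version strings are hypothetical placeholders.

```python
import hashlib
import json

def config_fingerprint(manifest):
    """Stable short hash of a version manifest, independent of key order."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

manifest = {
    "model": "planner-model-a",          # hypothetical identifiers throughout
    "prompts": "prompts@v14",
    "tool_schemas": "schemas@v2",
    "retrieval_snapshot": "corpus-snapshot-07",
    "memory_schema": "memory@v3",
    "evaluator_rubrics": "rubrics@v5",
    "simulator_config": "sim@v9",
}
fingerprint = config_fingerprint(manifest)
```

Storing the manifest itself next to the fingerprint is still worthwhile; the hash only tells you that two configurations differ, not how.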

Regression suites: the safety net most teams build too late

Every production incident or near miss should become a permanent regression case.

Your regression suite should have categories such as:

known policy edge cases
prior hallucinated tool outputs
stale doc retrieval failures
memory contamination incidents
timeout and retry loops
formatting failures for downstream systems
unsafe action attempts

Tag each case by:

severity
subsystem implicated
workflow/domain
customer impact
whether it is release-blocking

A useful pattern is to maintain two suites:

Fast suite

small
deterministic
runs on every PR or prompt change
catches obvious regressions quickly

Full pre-release suite

much larger
includes simulator-heavy scenarios
model comparison runs
latency/cost profiling
human spot checks on sampled failures

This mirrors how mature software teams use unit and integration suites.

Release gates for agents

Do not ship based on vibes or a few demos.

Define explicit release gates by workflow and risk tier.

Example release gate structure:

Quality gates

no regression on critical regression suite cases
task success rate above threshold on tier-1 workflows
subgoal coverage above threshold on required tasks
unsafe action attempt rate below threshold

Operational gates

p95 latency below agreed SLO
average cost per task within budget band
tool-call inflation below threshold versus prior release
non-termination rate below threshold

Robustness gates

passes simulator scenarios for timeout, stale docs, and contradictory evidence
stable performance across three random seeds or repeated runs where stochasticity matters
no critical failures in memory contamination tests

Human review gates

sampled failures reviewed by domain expert
judge calibration rechecked on current release candidate
policy/compliance signoff for impacted flows

Not every workflow needs the same bar. A low-risk internal drafting assistant and a high-risk operations agent should have different thresholds.
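Gates become enforceable once they live in configuration rather than in someone's head. The sketch below evaluates a candidate's scorecard against per-tier thresholds; the tier names, metric names, and every threshold value are illustrative assumptions, not recommendations.

```python
# Illustrative thresholds only; real values depend on workflow and risk tier.
GATES = {
    "tier1": {
        "min": {"task_success_rate": 0.95, "subgoal_coverage": 0.98},
        "max": {"unsafe_action_rate": 0.0, "p95_latency_ms": 8000},
    },
    "tier3": {
        "min": {"task_success_rate": 0.85, "subgoal_coverage": 0.90},
        "max": {"unsafe_action_rate": 0.01, "p95_latency_ms": 15000},
    },
}

def evaluate_gates(scorecard, tier):
    """Return the list of gate failures for a scorecard; empty means release-ready."""
    gate = GATES[tier]
    failures = []
    for metric, floor in gate["min"].items():
        if scorecard[metric] < floor:
            failures.append(f"{metric}={scorecard[metric]} below minimum {floor}")
    for metric, ceiling in gate["max"].items():
        if scorecard[metric] > ceiling:
            failures.append(f"{metric}={scorecard[metric]} above maximum {ceiling}")
    return failures

candidate = {
    "task_success_rate": 0.93,
    "subgoal_coverage": 0.99,
    "unsafe_action_rate": 0.0,
    "p95_latency_ms": 7200,
}
tier1_failures = evaluate_gates(candidate, "tier1")
tier3_failures = evaluate_gates(candidate, "tier3")
```

The same candidate blocks a tier-1 release and passes tier 3, which is exactly the kind of workflow-specific outcome a single global threshold cannot express.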

How agent evals differ from single-turn prompt testing

It is worth stating this plainly because many teams try to reuse the same mental model.

Single-turn prompt testing asks: given this input, is the output good?

Agent evaluation asks:

did the system choose the right sequence of actions?
were those actions valid under uncertainty and environment constraints?
did it use evidence correctly across steps?
did it recover safely from failure?
did it update state appropriately for future turns?
did it do all of this at acceptable cost and latency?

That means agent evals need:

trajectory visibility
environment simulation
stateful testing
component attribution
operational scorecards
release gates tied to system risk

This is why a prompt playground is useful for ideation, but insufficient for production readiness.

Cost and latency tradeoffs: what to measure before you get surprised

Teams often optimize for success rate until finance or operations forces a reset.

Agent evals should help you quantify tradeoffs before launch.

Track per workflow:

average tokens by stage: planning, retrieval summarization, tool interpretation, final response
model cost by stage
tool/API cost if applicable
median/p95 latency by stage
success lift from extra steps or larger models

Then ask practical questions:

Is a 2% quality gain worth 3x token cost?
Does an additional retrieval hop materially improve success overall, or does it help on only 5% of tasks?
Can a cheaper model handle memory summarization without affecting downstream success?
Should the large model be invoked only after a small-model attempt fails a confidence threshold?

Good eval infrastructure enables staged routing strategies, not just winner-take-all model selection.
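The last question above describes a cascade: try the small model first and invoke the large one only when confidence is low. A minimal sketch follows, with stub callables standing in for real models. How the confidence score is obtained (a verifier, a self-reported score, a classifier) is left as an assumption, and the 0.8 threshold is arbitrary.

```python
def route(task, small_model, large_model, threshold=0.8):
    """Try the small model first; escalate when its confidence is below threshold.

    Each model is a callable returning (answer, confidence in [0, 1]).
    """
    answer, confidence = small_model(task)
    if confidence >= threshold:
        return {"answer": answer, "model": "small", "escalated": False}
    answer, _ = large_model(task)
    return {"answer": answer, "model": "large", "escalated": True}

# Stub models for illustration; real ones would wrap model API calls.
def small_model(task):
    return ("small-model summary", 0.9 if task == "routine" else 0.4)

def large_model(task):
    return ("large-model summary", 0.95)

routine = route("routine", small_model, large_model)
ambiguous = route("ambiguous", small_model, large_model)
```

Logging the `escalated` flag per run lets the eval harness report the escalation rate and the success lift it buys, which is what the cost question actually turns on.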

A phased rollout plan for teams building this from scratch

If your current process is mostly manual prompt tests, do not try to build the perfect eval platform in one shot.

Phase 1: Basic end-to-end harness

collect 50–100 real task examples
define task specs and success criteria
log traces for plans, tool calls, and outputs
score task success plus a few operational metrics

Phase 2: Failure localization

add step-level taxonomy
add code-based checks for hard constraints
label top failure modes
create first regression suite from observed failures

Phase 3: Simulators and synthetic coverage

mock tools and replay realistic state
inject timeouts, stale docs, and contradictions
generate synthetic variants around critical workflows

Phase 4: Release gating and model ablations

define workflow-specific thresholds
compare models/prompts/tools using the same suite
track cost/latency alongside quality
require regression pass before release

Phase 5: Continuous eval ops

feed production incidents back into regression sets
monitor drift in user tasks and tool behavior
recalibrate judges and update task specs
maintain dashboards by subsystem and workflow

This progression is achievable for most engineering teams and far more valuable than endlessly tuning prompts without observability.

Practical takeaways

If you are building a multi-step LLM agent, the biggest evaluation mistake is treating it like a smarter chatbot.

A production agent is a control-flow system built from models, tools, retrieval, memory, and policies. Your eval strategy needs to mirror that structure.

The most effective patterns are straightforward:

define task specs with required subgoals and hard constraints
instrument full trajectories, not just final outputs
score decomposition quality and step-level behavior
maintain a failure taxonomy to map failures to fixes
combine human-labeled sets for truth with synthetic sets for coverage
build deterministic environment simulators with fault injection
track cost and latency as first-class metrics
convert incidents into regression tests
ship only through explicit release gates

And perhaps the most important point: evals are not just for ranking models. They are the mechanism by which you decide whether to change prompts, redesign tools, split agents, tune guardrails, or narrow workflow scope.

That is the difference between a team that demos agents and a team that operates them.

Before production, you do not need perfect confidence. You do need a harness that tells you, with enough honesty, where the system fails, why it fails, how expensive those failures are, and whether the latest change actually improved the product.

That is what good agent evals buy you: not certainty, but controlled learning.