12 Metrics of Agent Reliability

Accuracy tells you when your agent succeeded. These metrics tell you how it will fail.

How Do We Know?

AI agents are not failing for lack of capability. They are failing for lack of reliability. Accuracy scores keep rising. Real-world deployments keep breaking. Princeton researchers identified twelve concrete metrics. Accuracy doesn't measure them, and can't substitute for them.

Four dimensions. Twelve metrics. One framework for deploying agents you can actually trust.

Source: Rabanser et al., Princeton University - arxiv:2602.16666  ·  "Towards a Science of AI Agent Reliability"
Dimension I
Consistency
Does the system behave the same way when run multiple times under the same conditions?
METRIC #01

Outcome Consistency

Cout
Dimension: Consistency Impact: High

What It Measures

Whether the agent succeeds or fails consistently on repeated attempts at the same task. An insurance claims agent that approves a claim on one run but denies the identical claim on the next creates legal liability and destroys user trust, regardless of average accuracy.

Failure Signals

  • Same input produces success one day and failure the next
  • Demos work; production doesn't. Or vice versa.
  • "It worked when I tested it" is a common support response
  • Error rates spike on identical task batches run at different times

How To Evaluate

Run each task K times. Measure pass∧k (success on ALL K runs) rather than pass@k (success on at least one run). Normalize outcome variance by p(1−p) to separate consistency from the agent's underlying capability level.
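The pass@k / pass∧k distinction can be computed directly from repeated-run logs. A minimal stdlib-Python sketch (the function name and the exact normalization are illustrative, not the paper's reference implementation):

```python
from statistics import mean

def outcome_consistency(task_runs):
    """task_runs[i] is the list of K boolean outcomes for task i."""
    pass_at_k = mean(1.0 if any(r) else 0.0 for r in task_runs)   # best case
    pass_and_k = mean(1.0 if all(r) else 0.0 for r in task_runs)  # worst case
    p = mean(mean(r) for r in task_runs)  # overall success rate
    # Per-task Bernoulli variance, averaged over tasks, normalized by
    # p(1-p): 0 = fully deterministic agent, 1 = an i.i.d. coin at rate p.
    raw_var = mean(mean(r) * (1 - mean(r)) for r in task_runs)
    norm_var = raw_var / (p * (1 - p)) if 0 < p < 1 else 0.0
    return {"pass@k": pass_at_k, "pass^k": pass_and_k,
            "normalized_variance": norm_var}
```

A fully deterministic agent scores 0 on normalized variance regardless of how capable it is, which is exactly the separation the metric is after.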

Common Failures

  • Reporting pass@k as a reliability metric: it measures best-case capability, not consistency
  • Testing with one run and declaring the system "validated"
  • Treating outcome variance as "acceptable model stochasticity"
METRIC #02

Trajectory Consistency: Distributional

Ctrajd
Dimension: Consistency Impact: Medium

What It Measures

Whether the agent uses similar types and frequencies of actions across repeated runs, even when it reaches the same conclusion. Does it always search before writing? Always verify before executing? The distribution of action types should be stable across runs.

Failure Signals

  • Agent sometimes reads files before writing, sometimes writes directly
  • Audit logs look completely different for identical tasks
  • Tool call counts fluctuate wildly across runs of the same workflow
  • Compliance teams cannot generate reproducible audit trails

How To Evaluate

Compare action type frequency distributions across K runs using distributional similarity measures. Flag high variance in "what the agent does" even when final outcomes match. Audit processes require stable action distributions.
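One candidate distributional similarity measure is Jensen-Shannon divergence, which is symmetric and, with base-2 logs, bounded in [0, 1]. A stdlib sketch (function names are illustrative):

```python
import math
from collections import Counter

def action_distribution(trace):
    """Turn a list of action names into a {action: frequency} distribution."""
    counts = Counter(trace)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.
    0 = identical action mixes, 1 = completely disjoint ones."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Computing pairwise divergence across K runs and flagging high values surfaces "same outcome, different behavior" drift that outcome-only checks miss.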

Common Failures

  • Measuring only final outcomes while ignoring behavioral consistency
  • Assuming "same result" means "same behavior"
  • Building compliance frameworks on output-only auditing
METRIC #03

Trajectory Consistency: Sequential

Ctrajs
Dimension: Consistency Impact: Medium

What It Measures

Whether the agent follows consistent action orderings across runs. Even when action types are similar, does it sometimes charge a card before checking inventory? Verify identity after taking action? Ordering inconsistencies create failure modes that are invisible to output-only evaluation.

Failure Signals

  • "Interrupted state" failures are unpredictable and hard to recover from
  • The same task creates different intermediate states across runs
  • Recovery procedures work sometimes but not consistently
  • Rollback mechanisms fail because state assumptions were violated

How To Evaluate

Compare action sequences using sequence similarity measures (e.g., edit distance on action traces). Define ordering invariants for critical processes (steps that must always precede others) and verify they hold across K runs.
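A sketch of both checks in stdlib Python: Levenshtein distance over action traces, plus a simple ordering-invariant predicate (the helper names and the first-occurrence semantics are assumptions for illustration):

```python
def edit_distance(a, b):
    """Levenshtein distance between two action traces (lists of action
    names): insertions + deletions + substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def ordering_invariant_holds(trace, first, second):
    """True if the first occurrence of `second` comes after the first
    occurrence of `first` (e.g. check_inventory before charge_card)."""
    if second not in trace:
        return True   # nothing to guard
    if first not in trace:
        return False  # guarded step ran without its prerequisite
    return trace.index(first) < trace.index(second)
```

Low average edit distance across K runs indicates stable orderings; a single invariant violation on a critical process is a failure regardless of the average.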

Common Failures

  • Assuming action ordering is irrelevant if the final outcome is correct
  • Skipping workflow analysis in favor of end-state verification only
  • Not defining or testing ordering invariants for high-stakes processes
METRIC #04

Resource Consistency

Cres
Dimension: Consistency Impact: Medium

What It Measures

Whether the agent's computational cost, latency, and API token consumption are predictable across identical tasks. An agent that costs $0.02 per task on Monday and $2.00 on Friday cannot be safely operated at scale, even with identical accuracy.

Failure Signals

  • P99 latency is 10× the median for no apparent reason
  • Monthly API costs vary 200% for the same workload
  • Task duration fluctuates from 3 seconds to 3 minutes unpredictably
  • Budget overruns occur on routine, repeat tasks

How To Evaluate

Measure latency, token consumption, and API cost variance across K runs. Report coefficient of variation, not just mean. Flag tasks where cost or latency variance exceeds 50% of mean as reliability risks.
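The coefficient of variation is a one-liner over run logs. A sketch, using the 50%-of-mean threshold above as the flagging rule (function and field names are illustrative):

```python
from statistics import mean, stdev

def resource_consistency(samples):
    """Coefficient of variation (stdev / mean) for per-run cost or
    latency samples across K identical runs. Flags runs whose spread
    exceeds 50% of the mean as a reliability risk."""
    cv = stdev(samples) / mean(samples)
    return {"mean": mean(samples), "cv": cv, "flagged": cv > 0.5}
```

Reporting cv alongside the mean makes the Monday-$0.02 / Friday-$2.00 failure mode visible in a dashboard; the mean alone hides it.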

Common Failures

  • Reporting mean cost and latency without variance as a reliability signal
  • Only measuring resource consumption in development environments
  • Treating cost spikes as infrastructure issues rather than agent behavior issues
Accuracy measures best-case performance.
Reliability measures what happens at scale.
Dimension II
Robustness
When conditions deviate from nominal, does the system degrade gracefully or fail abruptly?
METRIC #05

Fault Robustness

Rfault
Dimension: Robustness Impact: High

What It Measures

Whether the agent handles infrastructure failures gracefully: API timeouts, malformed responses, tool unavailability, partial data returns. A robust agent retries, falls back, or fails safely. A brittle agent enters an inconsistent state or silently abandons the task.

Failure Signals

  • Single API timeout crashes the entire task pipeline
  • Partial tool failures propagate into corrupted downstream results
  • Agent loops indefinitely on retryable errors
  • No behavioral distinction between transient and permanent failures

How To Evaluate

Inject controlled fault scenarios: tool timeouts, malformed API responses, partial data returns. Measure accuracy ratio between nominal and fault conditions. An agent with Rfault = 0.9 maintains 90% of nominal performance under infrastructure faults.
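Fault injection can start as a thin wrapper around a tool callable. A hedged sketch, assuming the harness controls which tool the agent sees and that run_task returns a boolean success (all names here are illustrative):

```python
import random
from statistics import mean

def with_faults(tool, timeout_rate, rng):
    """Wrap a tool so it raises TimeoutError with the given probability:
    a minimal fault-injection harness."""
    def faulty(*args, **kwargs):
        if rng.random() < timeout_rate:
            raise TimeoutError("injected fault")
        return tool(*args, **kwargs)
    return faulty

def fault_robustness(run_task, tasks, tool, timeout_rate=0.3, seed=0):
    """R_fault = success rate under injected faults / nominal success
    rate. run_task(task, tool) must return True/False and is expected
    to survive TimeoutError by retrying or failing safely."""
    rng = random.Random(seed)
    nominal = mean(run_task(t, tool) for t in tasks)
    degraded = mean(run_task(t, with_faults(tool, timeout_rate, rng))
                    for t in tasks)
    return degraded / nominal if nominal else 0.0
```

An agent with no retry logic collapses toward R_fault = 1 − timeout_rate; one with even a bounded retry loop stays close to 1.0.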

Common Failures

  • Testing only in ideal environments and calling the system "production-ready"
  • Building no retry or graceful fallback logic into agentic workflows
  • Treating all errors as fatal rather than categorizing by recoverability
METRIC #06

Environment Robustness

Renv
Dimension: Robustness Impact: High

What It Measures

Whether the agent maintains performance when its environment changes in semantically neutral ways: reordered JSON fields, renamed API parameters, changed date formats, evolved tool interfaces. Real production environments drift. Brittle agents fail silently when they do.

Failure Signals

  • Agent breaks when the database schema changes, even when the change is non-breaking
  • API versioning causes silent failures with no error messages
  • JSON field reordering produces different agent behavior
  • Locale or timezone changes cause unpredictable results

How To Evaluate

Apply semantic-preserving transformations to the environment: reorder fields, rename parameters, alter formats. Compare performance to nominal baseline. A production-ready agent should be indifferent to cosmetic environmental variation.
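Field reordering is the easiest of these transformations to automate. A sketch that recursively shuffles JSON key order while leaving the data untouched (the function name is illustrative):

```python
import json
import random

def reorder_json(payload, rng):
    """Recursively shuffle dict key order: a semantics-preserving
    transformation. The data is identical; only its presentation moves."""
    if isinstance(payload, dict):
        items = list(payload.items())
        rng.shuffle(items)
        return {k: reorder_json(v, rng) for k, v in items}
    if isinstance(payload, list):
        return [reorder_json(v, rng) for v in payload]
    return payload
```

Feeding the transformed payload back through the agent and comparing outcomes to the nominal baseline gives Renv for this one class of drift; parameter renames and format changes need analogous transforms.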

Common Failures

  • Assuming stable APIs and schemas across the lifetime of the deployment
  • Hardcoding field names, date formats, or ordering expectations
  • Not testing environment drift as a first-class reliability scenario
METRIC #07

Prompt Robustness

Rprompt
Dimension: Robustness Impact: High

What It Measures

Whether the agent performs consistently across semantically equivalent reformulations of instructions: rephrasings, translations, register variations. Real users don't speak like benchmark prompts. If "cancel my subscription" succeeds but "end my plan" fails, the agent cannot handle real traffic.

Failure Signals

  • Success rates drop significantly for non-English instructions
  • Formal versus informal phrasing produces different outcomes
  • Users from different backgrounds receive different results for equivalent requests
  • Paraphrase sensitivity is high but undocumented

How To Evaluate

Generate semantically equivalent prompt variants using paraphrasing, translation, and register changes. Measure accuracy ratio between original and variant prompts. Target Rprompt close to 1.0 for any production deployment serving real users.

Common Failures

  • Evaluating only on "canonical" prompt formulations from the benchmark
  • Ignoring cross-lingual reliability for multilingual deployments
  • Treating prompt sensitivity as "expected behavior" rather than a defect
A capable system can be unreliable.
A less capable system can be highly reliable.
They are not the same thing.
Dimension III
Predictability
Can the system recognize when it is likely to fail?
METRIC #08

Calibration

Pcal
Dimension: Predictability Impact: High

What It Measures

Whether the agent's stated confidence levels match its empirical success rates. An agent claiming 80% confidence should succeed roughly 80% of the time. Overconfident agents cause users to trust outputs they should verify. Underconfident agents generate unnecessary human review loops.

Failure Signals

  • High-confidence outputs are wrong at surprising frequency
  • Agent never abstains or flags uncertainty, even on clearly hard tasks
  • Confidence scores cluster near 90%+ regardless of task difficulty
  • Human reviewers constantly override "confident" agent decisions

How To Evaluate

Collect agent confidence scores and outcomes. Plot calibration curves (confidence vs. empirical accuracy). Use Expected Calibration Error (ECE) to quantify miscalibration. A well-calibrated agent's curve lies on the diagonal.
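ECE has a short reference implementation: bucket predictions by confidence, then take the size-weighted gap between each bucket's mean confidence and its empirical accuracy. A stdlib sketch with equal-width bins (the bin count is a free parameter):

```python
from statistics import mean

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE over (confidence, binary outcome) pairs; lower is better."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into top bin
        bins[idx].append((c, o))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = mean(c for c, _ in b)
            accuracy = mean(o for _, o in b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece
```

An agent that says "95% confident" on everything while being right half the time earns an ECE of 0.45, which is the signature of the clustered-confidence failure signal above.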

Common Failures

  • Not extracting or tracking agent confidence scores at all
  • Treating confidence scores as ornamental rather than decision-relevant
  • Optimizing for accuracy without testing calibration quality
METRIC #09

Discrimination

PAUROC
Dimension: Predictability Impact: Medium

What It Measures

Whether the agent's confidence scores successfully separate successes from failures. An agent can be systematically overconfident yet still assign higher confidence to tasks it completes correctly. PAUROC measures this ranking quality independently of calibration level.

Failure Signals

  • Confidence scores are identical for tasks the agent aces and tasks it fails
  • Review queues designed around "low-confidence" tasks don't catch real failures
  • Confidence-based routing provides no lift over random assignment
  • Human reviewers can't use confidence scores to prioritize their work

How To Evaluate

Compute AUROC of confidence scores predicting binary task success. AUROC = 0.5 means confidence is uninformative; AUROC = 1.0 means perfect separation. Target AUROC > 0.7 before building confidence-based routing into production workflows.
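AUROC for binary success has a direct rank interpretation: the probability that a randomly chosen success outscores a randomly chosen failure, with ties counting half. A stdlib sketch (quadratic in sample count, fine for evaluation-sized data):

```python
def auroc(confidences, outcomes):
    """AUROC of confidence scores predicting binary task success."""
    pos = [c for c, o in zip(confidences, outcomes) if o]
    neg = [c for c, o in zip(confidences, outcomes) if not o]
    if not pos or not neg:
        raise ValueError("need both successes and failures")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A single constant confidence value yields exactly 0.5, which is why confidence-based routing built on such scores performs no better than random assignment.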

Common Failures

  • Assuming calibration implies discrimination: they are independent properties
  • Building review workflows on confidence scores without validating AUROC
  • Using point estimates instead of distributional confidence representations
METRIC #10

Brier Score

Pbrier
Dimension: Predictability Impact: Medium

What It Measures

A proper scoring rule that jointly penalizes both miscalibration and poor discrimination. Unlike ECE or AUROC alone, the Brier Score provides a single holistic measure of predictive quality that incentivizes honest confidence reporting. Deviation from the true probability always worsens the score.

Failure Signals

  • Agent achieves good AUROC but poor calibration, or vice versa
  • Confidence reporting optimized for one dimension while the other degrades
  • No single unified metric tracks overall predictive quality across model versions
  • Model updates improve accuracy but silently degrade confidence quality

How To Evaluate

Compute Brier Score = mean((confidence − outcome)²) across tasks. Range [0,1]; lower is better. Use as the primary aggregate predictability metric in model selection and deployment decisions.
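The formula is short enough to inline. A sketch (the 0.25 earned by always answering "50% confident" is a useful sanity anchor):

```python
from statistics import mean

def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the binary
    outcome. 0 is perfect; 0.25 is the always-say-50% baseline."""
    return mean((c - o) ** 2 for c, o in zip(confidences, outcomes))
```

Because any deviation from the true probability worsens the score, tracking it across model versions catches the "accuracy up, confidence quality silently down" regression named above.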

Common Failures

  • Tracking only accuracy and AUROC without a proper scoring rule
  • Using Brier Score without also reporting its calibration/resolution decomposition
  • Treating confidence quality as a secondary concern to task accuracy
The accountability gap is not accidental.
It is what happens when accuracy becomes the only
thing anyone measures.
Dimension IV
Safety
When failures occur, how severe are the resulting consequences?
METRIC #11

Compliance

Scomp
Dimension: Safety Impact: Critical

What It Measures

Whether the agent adheres to predefined operational constraints: no PII exposure, no unauthorized actions, no boundary overreach. Compliance evaluates constraint adherence regardless of whether violations cause immediately observable harm.

Failure Signals

  • Agent executes actions outside its defined permission scope
  • Sensitive data appears in logs, outputs, or downstream systems
  • Agent bypasses confirmation steps for irreversible actions
  • Boundary violations are only discovered in post-incident reviews

How To Evaluate

Define constraint sets explicitly before deployment. Log all agent actions. Compute Scomp = (constrained tasks − violated tasks) / constrained tasks. Target Scomp = 1.0; any violation below this threshold is a production risk that must be addressed before scaling.
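The Scomp ratio falls out of the action logs once violations are machine-checkable. A sketch, assuming logs keyed by task id and a violation predicate defined before deployment (all names illustrative):

```python
def compliance_score(action_logs, is_violation):
    """S_comp = (constrained tasks - violated tasks) / constrained tasks.
    action_logs maps task id -> list of logged actions; is_violation is
    a predicate encoding the constraint set in enforceable code."""
    violated = sum(any(is_violation(a) for a in log)
                   for log in action_logs.values())
    total = len(action_logs)
    return (total - violated) / total
```

Encoding the constraint as a predicate rather than documentation is the point: the same function can gate actions at runtime and score compliance offline.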

Common Failures

  • Defining constraints in documentation rather than in enforceable code
  • Testing compliance only in sandbox environments with mocked tools
  • Treating compliance violations as isolated incidents rather than reliability signals
METRIC #12

Harm Severity

Sharm
Dimension: Safety Impact: Critical

What It Measures

Among tasks where constraint violations occur, how severe are the consequences? Returning results in the wrong sort order is benign. Executing an unintended DELETE statement is catastrophic. Sharm separates violation frequency from violation severity. It is the classical risk decomposition from safety-critical engineering.

Failure Signals

  • Violations cluster in high-consequence action categories: deletions, payments, communications
  • Agent has access to irreversible tools without proportional authorization safeguards
  • No severity classification exists for different violation types
  • Real-world incident severity is higher than safety tests predicted

How To Evaluate

Assign severity scores (0 = benign, 1 = catastrophic) to violation categories before deployment. Compute the safety score as 1 − P(violation) × E[severity | violation]. Report safety metrics separately from other dimensions. Safety is a hard constraint, not a trade-off.
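The risk decomposition is two lines of arithmetic once severities are assigned. A sketch, assuming a per-category severity table in [0, 1] (names illustrative):

```python
from statistics import mean

def safety_score(violations, total_tasks, severity):
    """1 - P(violation) * E[severity | violation]. `violations` lists
    the category of each observed violation; `severity` maps category
    to a pre-assigned score in [0, 1]."""
    if not violations:
        return 1.0
    p_violation = len(violations) / total_tasks
    expected_severity = mean(severity[v] for v in violations)
    return 1 - p_violation * expected_severity
```

Note how the decomposition keeps frequency and consequence separate: one catastrophic deletion at 1% frequency moves the score far more than many benign sort-order violations.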

Common Failures

  • Treating all violations as equally severe, ignoring the severity distribution
  • Granting access to irreversible capabilities without staged authorization
  • Reporting only violation rates without consequence analysis
Safety is not a dimension to average.
It is a hard constraint.
One catastrophic failure at 1% frequency is not acceptable
because the other 99% were fine.