12 Metrics of Agent Reliability

Accuracy tells you when your agent succeeded. These metrics tell you how it will fail.

How Do We Know?

AI agents are not failing for lack of capability. They are failing for lack of reliability. Accuracy scores keep rising. Real-world deployments keep breaking. Princeton researchers identified twelve concrete metrics. Accuracy doesn't measure them, and can't substitute for them.

Four dimensions. Twelve metrics. One framework for deploying agents you can actually trust.

Source: Rabanser et al., Princeton University - arxiv:2602.16666  ·  "Towards a Science of AI Agent Reliability"
Dimension I
Consistency
Does the system behave the same way when run multiple times under the same conditions?
METRIC #01

Outcome Consistency

Cout
Dimension: Consistency Impact: High

What It Measures

Whether the agent succeeds or fails consistently on repeated attempts at the same task. An insurance claims agent that approves a claim on one run but denies the identical claim on the next creates legal liability and destroys user trust, regardless of average accuracy.

Failure Signals

  • Same input produces success one day and failure the next
  • Demos work; production doesn't. Or vice versa.
  • "It worked when I tested it" is a common support response
  • Error rates spike on identical task batches run at different times

How To Evaluate

Run each task K times. Measure pass∧k (success on ALL K runs) rather than pass@k (success on at least one run). Normalize outcome variance by p(1−p) to separate consistency from the agent's underlying capability level.
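The pass@k / pass∧k distinction can be computed directly from repeated-run logs. A minimal stdlib-Python sketch (the function name and the exact normalization are illustrative, not the paper's reference implementation):

```python
from statistics import mean

def outcome_consistency(task_runs):
    """task_runs[i] is the list of K boolean outcomes for task i."""
    pass_at_k = mean(1.0 if any(r) else 0.0 for r in task_runs)   # best case
    pass_and_k = mean(1.0 if all(r) else 0.0 for r in task_runs)  # worst case
    p = mean(mean(r) for r in task_runs)  # overall success rate
    # Per-task Bernoulli variance, averaged over tasks, normalized by
    # p(1-p): 0 = fully deterministic agent, 1 = an i.i.d. coin at rate p.
    raw_var = mean(mean(r) * (1 - mean(r)) for r in task_runs)
    norm_var = raw_var / (p * (1 - p)) if 0 < p < 1 else 0.0
    return {"pass@k": pass_at_k, "pass^k": pass_and_k,
            "normalized_variance": norm_var}
```

A fully deterministic agent scores 0 on normalized variance regardless of how capable it is, which is exactly the separation the metric is after.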

Common Failures

  • Reporting pass@k as a reliability metric: it measures best-case capability, not consistency
  • Testing with one run and declaring the system "validated"
  • Treating outcome variance as "acceptable model stochasticity"
METRIC #02

Trajectory Consistency: Distributional

Ctrajd
Dimension: Consistency Impact: Medium

What It Measures

Whether the agent uses similar types and frequencies of actions across repeated runs, even when it reaches the same conclusion. Does it always search before writing? Always verify before executing? The distribution of action types should be stable across runs.

Failure Signals

  • Agent sometimes reads files before writing, sometimes writes directly
  • Audit logs look completely different for identical tasks
  • Tool call counts fluctuate wildly across runs of the same workflow
  • Compliance teams cannot generate reproducible audit trails

How To Evaluate

Compare action type frequency distributions across K runs using distributional similarity measures. Flag high variance in "what the agent does" even when final outcomes match. Audit processes require stable action distributions.
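One candidate distributional similarity measure is Jensen-Shannon divergence, which is symmetric and, with base-2 logs, bounded in [0, 1]. A stdlib sketch (function names are illustrative):

```python
import math
from collections import Counter

def action_distribution(trace):
    """Turn a list of action names into a {action: frequency} distribution."""
    counts = Counter(trace)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.
    0 = identical action mixes, 1 = completely disjoint ones."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Computing pairwise divergence across K runs and flagging high values surfaces "same outcome, different behavior" drift that outcome-only checks miss.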

Common Failures

  • Measuring only final outcomes while ignoring behavioral consistency
  • Assuming "same result" means "same behavior"
  • Building compliance frameworks on output-only auditing
METRIC #03

Trajectory Consistency: Sequential

Ctrajs
Dimension: Consistency Impact: Medium

What It Measures

Whether the agent follows consistent action orderings across runs. Even when action types are similar, does it sometimes charge a card before checking inventory? Verify identity after taking action? Ordering inconsistencies create failure modes that are invisible to output-only evaluation.

Failure Signals

  • "Interrupted state" failures are unpredictable and hard to recover from
  • The same task creates different intermediate states across runs
  • Recovery procedures work sometimes but not consistently
  • Rollback mechanisms fail because state assumptions were violated

How To Evaluate

Compare action sequences using sequence similarity measures (e.g., edit distance on action traces). Define ordering invariants for critical processes (steps that must always precede others) and verify they hold across K runs.
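A sketch of both checks in stdlib Python: Levenshtein distance over action traces, plus a simple ordering-invariant predicate (the helper names and the first-occurrence semantics are assumptions for illustration):

```python
def edit_distance(a, b):
    """Levenshtein distance between two action traces (lists of action
    names): insertions + deletions + substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def ordering_invariant_holds(trace, first, second):
    """True if the first occurrence of `second` comes after the first
    occurrence of `first` (e.g. check_inventory before charge_card)."""
    if second not in trace:
        return True   # nothing to guard
    if first not in trace:
        return False  # guarded step ran without its prerequisite
    return trace.index(first) < trace.index(second)
```

Low average edit distance across K runs indicates stable orderings; a single invariant violation on a critical process is a failure regardless of the average.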

Common Failures

  • Assuming action ordering is irrelevant if the final outcome is correct
  • Skipping workflow analysis in favor of end-state verification only
  • Not defining or testing ordering invariants for high-stakes processes
METRIC #04

Resource Consistency

Cres
Dimension: Consistency Impact: Medium

What It Measures

Whether the agent's computational cost, latency, and API token consumption are predictable across identical tasks. An agent that costs $0.02 per task on Monday and $2.00 on Friday cannot be safely operated at scale, even with identical accuracy.

Failure Signals

  • P99 latency is 10× the median for no apparent reason
  • Monthly API costs vary 200% for the same workload
  • Task duration fluctuates from 3 seconds to 3 minutes unpredictably
  • Budget overruns occur on routine, repeat tasks

How To Evaluate

Measure latency, token consumption, and API cost variance across K runs. Report coefficient of variation, not just mean. Flag tasks where cost or latency variance exceeds 50% of mean as reliability risks.
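The coefficient of variation is a one-liner over run logs. A sketch, using the 50%-of-mean threshold above as the flagging rule (function and field names are illustrative):

```python
from statistics import mean, stdev

def resource_consistency(samples):
    """Coefficient of variation (stdev / mean) for per-run cost or
    latency samples across K identical runs. Flags runs whose spread
    exceeds 50% of the mean as a reliability risk."""
    cv = stdev(samples) / mean(samples)
    return {"mean": mean(samples), "cv": cv, "flagged": cv > 0.5}
```

Reporting cv alongside the mean makes the Monday-$0.02 / Friday-$2.00 failure mode visible in a dashboard; the mean alone hides it.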

Common Failures

  • Reporting mean cost and latency without variance as a reliability signal
  • Only measuring resource consumption in development environments
  • Treating cost spikes as infrastructure issues rather than agent behavior issues
Accuracy measures best-case performance.
Reliability measures what happens at scale.
Dimension II
Robustness
When conditions deviate from nominal, does the system degrade gracefully or fail abruptly?
METRIC #05

Fault Robustness

Rfault
Dimension: Robustness Impact: High

What It Measures

Whether the agent handles infrastructure failures gracefully: API timeouts, malformed responses, tool unavailability, partial data returns. A robust agent retries, falls back, or fails safely. A brittle agent enters an inconsistent state or silently abandons the task.

Failure Signals

  • Single API timeout crashes the entire task pipeline
  • Partial tool failures propagate into corrupted downstream results
  • Agent loops indefinitely on retryable errors
  • No behavioral distinction between transient and permanent failures

How To Evaluate

Inject controlled fault scenarios: tool timeouts, malformed API responses, partial data returns. Measure accuracy ratio between nominal and fault conditions. An agent with Rfault = 0.9 maintains 90% of nominal performance under infrastructure faults.
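Fault injection can start as a thin wrapper around a tool callable. A hedged sketch, assuming the harness controls which tool the agent sees and that run_task returns a boolean success (all names here are illustrative):

```python
import random
from statistics import mean

def with_faults(tool, timeout_rate, rng):
    """Wrap a tool so it raises TimeoutError with the given probability:
    a minimal fault-injection harness."""
    def faulty(*args, **kwargs):
        if rng.random() < timeout_rate:
            raise TimeoutError("injected fault")
        return tool(*args, **kwargs)
    return faulty

def fault_robustness(run_task, tasks, tool, timeout_rate=0.3, seed=0):
    """R_fault = success rate under injected faults / nominal success
    rate. run_task(task, tool) must return True/False and is expected
    to survive TimeoutError by retrying or failing safely."""
    rng = random.Random(seed)
    nominal = mean(run_task(t, tool) for t in tasks)
    degraded = mean(run_task(t, with_faults(tool, timeout_rate, rng))
                    for t in tasks)
    return degraded / nominal if nominal else 0.0
```

An agent with no retry logic collapses toward R_fault = 1 − timeout_rate; one with even a bounded retry loop stays close to 1.0.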

Common Failures

  • Testing only in ideal environments and calling the system "production-ready"
  • Building no retry or graceful fallback logic into agentic workflows
  • Treating all errors as fatal rather than categorizing by recoverability
METRIC #06

Environment Robustness

Renv
Dimension: Robustness Impact: High

What It Measures

Whether the agent maintains performance when its environment changes in semantically neutral ways: reordered JSON fields, renamed API parameters, changed date formats, evolved tool interfaces. Real production environments drift. Brittle agents fail silently when they do.

Failure Signals

  • Agent breaks when the database schema changes, even when the change is non-breaking
  • API versioning causes silent failures with no error messages
  • JSON field reordering produces different agent behavior
  • Locale or timezone changes cause unpredictable results

How To Evaluate

Apply semantic-preserving transformations to the environment: reorder fields, rename parameters, alter formats. Compare performance to nominal baseline. A production-ready agent should be indifferent to cosmetic environmental variation.
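Field reordering is the easiest of these transformations to automate. A sketch that recursively shuffles JSON key order while leaving the data untouched (the function name is illustrative):

```python
import json
import random

def reorder_json(payload, rng):
    """Recursively shuffle dict key order: a semantics-preserving
    transformation. The data is identical; only its presentation moves."""
    if isinstance(payload, dict):
        items = list(payload.items())
        rng.shuffle(items)
        return {k: reorder_json(v, rng) for k, v in items}
    if isinstance(payload, list):
        return [reorder_json(v, rng) for v in payload]
    return payload
```

Feeding the transformed payload back through the agent and comparing outcomes to the nominal baseline gives Renv for this one class of drift; parameter renames and format changes need analogous transforms.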

Common Failures

  • Assuming stable APIs and schemas across the lifetime of the deployment
  • Hardcoding field names, date formats, or ordering expectations
  • Not testing environment drift as a first-class reliability scenario
METRIC #07

Prompt Robustness

Rprompt
Dimension: Robustness Impact: High

What It Measures

Whether the agent performs consistently across semantically equivalent reformulations of instructions: rephrasings, translations, register variations. Real users don't speak like benchmark prompts. If "cancel my subscription" succeeds but "end my plan" fails, the agent cannot handle real traffic.

Failure Signals

  • Success rates drop significantly for non-English instructions
  • Formal versus informal phrasing produces different outcomes
  • Users from different backgrounds receive different results for equivalent requests
  • Paraphrase sensitivity is high but undocumented

How To Evaluate

Generate semantically equivalent prompt variants using paraphrasing, translation, and register changes. Measure accuracy ratio between original and variant prompts. Target Rprompt close to 1.0 for any production deployment serving real users.

Common Failures

  • Evaluating only on "canonical" prompt formulations from the benchmark
  • Ignoring cross-lingual reliability for multilingual deployments
  • Treating prompt sensitivity as "expected behavior" rather than a defect
A capable system can be unreliable.
A less capable system can be highly reliable.
They are not the same thing.
Dimension III
Predictability
Can the system recognize when it is likely to fail?
METRIC #08

Calibration

Pcal
Dimension: Predictability Impact: High

What It Measures

Whether the agent's stated confidence levels match its empirical success rates. An agent claiming 80% confidence should succeed roughly 80% of the time. Overconfident agents cause users to trust outputs they should verify. Underconfident agents generate unnecessary human review loops.

Failure Signals

  • High-confidence outputs are wrong at surprising frequency
  • Agent never abstains or flags uncertainty, even on clearly hard tasks
  • Confidence scores cluster near 90%+ regardless of task difficulty
  • Human reviewers constantly override "confident" agent decisions

How To Evaluate

Collect agent confidence scores and outcomes. Plot calibration curves (confidence vs. empirical accuracy). Use Expected Calibration Error (ECE) to quantify miscalibration. A well-calibrated agent's curve lies on the diagonal.
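ECE has a short reference implementation: bucket predictions by confidence, then take the size-weighted gap between each bucket's mean confidence and its empirical accuracy. A stdlib sketch with equal-width bins (the bin count is a free parameter):

```python
from statistics import mean

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE over (confidence, binary outcome) pairs; lower is better."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into top bin
        bins[idx].append((c, o))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = mean(c for c, _ in b)
            accuracy = mean(o for _, o in b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece
```

An agent that says "95% confident" on everything while being right half the time earns an ECE of 0.45, which is the signature of the clustered-confidence failure signal above.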

Common Failures

  • Not extracting or tracking agent confidence scores at all
  • Treating confidence scores as ornamental rather than decision-relevant
  • Optimizing for accuracy without testing calibration quality
METRIC #09

Discrimination

PAUROC
Dimension: Predictability Impact: Medium

What It Measures

Whether the agent's confidence scores successfully separate successes from failures. An agent can be systematically overconfident yet still assign higher confidence to tasks it completes correctly. PAUROC measures this ranking quality independently of calibration level.

Failure Signals

  • Confidence scores are identical for tasks the agent aces and tasks it fails
  • Review queues designed around "low-confidence" tasks don't catch real failures
  • Confidence-based routing provides no lift over random assignment
  • Human reviewers can't use confidence scores to prioritize their work

How To Evaluate

Compute AUROC of confidence scores predicting binary task success. AUROC = 0.5 means confidence is uninformative; AUROC = 1.0 means perfect separation. Target AUROC > 0.7 before building confidence-based routing into production workflows.
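AUROC for binary success has a direct rank interpretation: the probability that a randomly chosen success outscores a randomly chosen failure, with ties counting half. A stdlib sketch (quadratic in sample count, fine for evaluation-sized data):

```python
def auroc(confidences, outcomes):
    """AUROC of confidence scores predicting binary task success."""
    pos = [c for c, o in zip(confidences, outcomes) if o]
    neg = [c for c, o in zip(confidences, outcomes) if not o]
    if not pos or not neg:
        raise ValueError("need both successes and failures")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A single constant confidence value yields exactly 0.5, which is why confidence-based routing built on such scores performs no better than random assignment.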

Common Failures

  • Assuming calibration implies discrimination: they are independent properties
  • Building review workflows on confidence scores without validating AUROC
  • Using point estimates instead of distributional confidence representations
METRIC #10

Brier Score

Pbrier
Dimension: Predictability Impact: Medium

What It Measures

A proper scoring rule that jointly penalizes both miscalibration and poor discrimination. Unlike ECE or AUROC alone, the Brier Score provides a single holistic measure of predictive quality that incentivizes honest confidence reporting. Deviation from the true probability always worsens the score.

Failure Signals

  • Agent achieves good AUROC but poor calibration, or vice versa
  • Confidence reporting optimized for one dimension while the other degrades
  • No single unified metric tracks overall predictive quality across model versions
  • Model updates improve accuracy but silently degrade confidence quality

How To Evaluate

Compute Brier Score = mean((confidence − outcome)²) across tasks. Range [0,1]; lower is better. Use as the primary aggregate predictability metric in model selection and deployment decisions.
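The formula is short enough to inline. A sketch (the 0.25 earned by always answering "50% confident" is a useful sanity anchor):

```python
from statistics import mean

def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the binary
    outcome. 0 is perfect; 0.25 is the always-say-50% baseline."""
    return mean((c - o) ** 2 for c, o in zip(confidences, outcomes))
```

Because any deviation from the true probability worsens the score, tracking it across model versions catches the "accuracy up, confidence quality silently down" regression named above.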

Common Failures

  • Tracking only accuracy and AUROC without a proper scoring rule
  • Using Brier Score without also reporting its calibration/resolution decomposition
  • Treating confidence quality as a secondary concern to task accuracy
The accountability gap is not accidental.
It is what happens when accuracy becomes the only
thing anyone measures.
Dimension IV
Safety
When failures occur, how severe are the resulting consequences?
METRIC #11

Compliance

Scomp
Dimension: Safety Impact: Critical

What It Measures

Whether the agent adheres to predefined operational constraints: no PII exposure, no unauthorized actions, no boundary overreach. Compliance evaluates constraint adherence regardless of whether violations cause immediately observable harm.

Failure Signals

  • Agent executes actions outside its defined permission scope
  • Sensitive data appears in logs, outputs, or downstream systems
  • Agent bypasses confirmation steps for irreversible actions
  • Boundary violations are only discovered in post-incident reviews

How To Evaluate

Define constraint sets explicitly before deployment. Log all agent actions. Compute Scomp = (constrained tasks − violated tasks) / constrained tasks. Target Scomp = 1.0; any violation below this threshold is a production risk that must be addressed before scaling.
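The Scomp ratio falls out of the action logs once violations are machine-checkable. A sketch, assuming logs keyed by task id and a violation predicate defined before deployment (all names illustrative):

```python
def compliance_score(action_logs, is_violation):
    """S_comp = (constrained tasks - violated tasks) / constrained tasks.
    action_logs maps task id -> list of logged actions; is_violation is
    a predicate encoding the constraint set in enforceable code."""
    violated = sum(any(is_violation(a) for a in log)
                   for log in action_logs.values())
    total = len(action_logs)
    return (total - violated) / total
```

Encoding the constraint as a predicate rather than documentation is the point: the same function can gate actions at runtime and score compliance offline.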

Common Failures

  • Defining constraints in documentation rather than in enforceable code
  • Testing compliance only in sandbox environments with mocked tools
  • Treating compliance violations as isolated incidents rather than reliability signals
METRIC #12

Harm Severity

Sharm
Dimension: Safety Impact: Critical

What It Measures

Among tasks where constraint violations occur, how severe are the consequences? Returning results in the wrong sort order is benign. Executing an unintended DELETE statement is catastrophic. Sharm separates violation frequency from violation severity. It is the classical risk decomposition from safety-critical engineering.

Failure Signals

  • Violations cluster in high-consequence action categories: deletions, payments, communications
  • Agent has access to irreversible tools without proportional authorization safeguards
  • No severity classification exists for different violation types
  • Real-world incident severity is higher than safety tests predicted

How To Evaluate

Assign severity scores (0 = benign, 1 = catastrophic) to violation categories before deployment. Compute the safety score as 1 − P(violation) × E[severity | violation]. Report safety metrics separately from other dimensions. Safety is a hard constraint, not a trade-off.
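The risk decomposition is two lines of arithmetic once severities are assigned. A sketch, assuming a per-category severity table in [0, 1] (names illustrative):

```python
from statistics import mean

def safety_score(violations, total_tasks, severity):
    """1 - P(violation) * E[severity | violation]. `violations` lists
    the category of each observed violation; `severity` maps category
    to a pre-assigned score in [0, 1]."""
    if not violations:
        return 1.0
    p_violation = len(violations) / total_tasks
    expected_severity = mean(severity[v] for v in violations)
    return 1 - p_violation * expected_severity
```

Note how the decomposition keeps frequency and consequence separate: one catastrophic deletion at 1% frequency moves the score far more than many benign sort-order violations.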

Common Failures

  • Treating all violations as equally severe, ignoring the severity distribution
  • Granting access to irreversible capabilities without staged authorization
  • Reporting only violation rates without consequence analysis
Safety is not a dimension to average.
It is a hard constraint.
One catastrophic failure at 1% frequency is not acceptable
because the other 99% were fine.