
The Governance Gap

Three structural failures that open when AI agents are accountable to no one - and the frameworks we keep writing as if they don't exist.

In financial services, governance is not optional. Accountability is the product. Regulators, auditors, clients, and courts all demand answers to the same question: who decided this, and why?

Agentic AI systems do not answer that question cleanly. They weren't designed to. And the frameworks built to govern human decision-makers do not transfer to systems that operate without a human decision-maker in the loop.

Three gaps have opened simultaneously.
Most organizations have closed none of them.

Gap 01 Accountability
Gap 02 Reliability
Gap 03 Architecture
GAP 01

Accountability Gap

The Assumption

Every decision has an owner. The chain of accountability runs from action to actor - a person, a role, a license number. When something goes wrong in financial services, regulators follow that chain. The assumption is that the chain exists and terminates in a human being.

The Reality

When an agent decides - executes a trade, denies a claim, flags a transaction as suspicious, sends a client communication - the accountability chain fractures into at least four parties, none of whom fully owns the outcome:

  • The developer who wrote the agent's instructions and tool access
  • The model provider whose weights underpin the agent's reasoning
  • The operator who configured the workflow and set the thresholds
  • The supervisor who scheduled the agent to run unattended overnight and hasn't checked its output since

Each party can plausibly disclaim responsibility. None has full visibility into what happened. The agent doesn't explain itself - it produces an output, logs a token stream, and moves on.

Incident Pattern

A compliance monitoring agent flags 847 transactions as suspicious over a weekend. On Monday, the team reviews 40 of them. The other 807 are cleared in bulk because the queue is too long. Three of those 807 are genuine laundering. Who is accountable - the agent that flagged them (correctly), the team that cleared them (without review), the manager who staffed a skeleton crew, or the executive who approved the deployment?

The Fix

Named decision owners, not process owners. Assign a human whose job description explicitly includes accountability for what a specific agent produces - not the process of running it, but the outputs it generates. That person must have the access, the authority, and the time to actually review agent decisions, including the ones that didn't escalate. Pair this with an agent registry that documents what each agent does, what it touches, who owns it, and what the escalation path is. The registry is not optional in a regulated environment. It is the paper trail regulators will ask for first.
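The registry described above can be sketched as a small data structure. This is an illustrative shape, not a prescribed schema - the field names, the example agent, and the owner are all hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRegistryEntry:
    """One record per deployed agent. Field names are illustrative."""
    agent_id: str
    purpose: str                      # what the agent does
    systems_touched: tuple[str, ...]  # what it reads and writes
    decision_owner: str               # the named human accountable for outputs
    escalation_path: tuple[str, ...]  # ordered chain if the owner is unavailable


class AgentRegistry:
    def __init__(self) -> None:
        self._entries: dict[str, AgentRegistryEntry] = {}

    def register(self, entry: AgentRegistryEntry) -> None:
        # The one invariant worth enforcing: no agent without a named owner.
        if not entry.decision_owner:
            raise ValueError(f"{entry.agent_id}: every agent needs a named decision owner")
        self._entries[entry.agent_id] = entry

    def owner_of(self, agent_id: str) -> str:
        # The question a regulator asks first: who owns this agent's outputs?
        return self._entries[agent_id].decision_owner


registry = AgentRegistry()
registry.register(AgentRegistryEntry(
    agent_id="aml-screening-01",
    purpose="Flags transactions for AML review",
    systems_touched=("payments-db", "case-queue"),
    decision_owner="jane.doe (Head of Financial Crime)",
    escalation_path=("mlro", "chief-compliance-officer"),
))
```

The point of the frozen dataclass is that a registry entry is a record, not a live config: changing an agent's owner should be a deliberate re-registration event that leaves a trail, not a field mutation.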

GAP 02

Reliability Gap

The Assumption

Accuracy is the metric that matters. If the model gets 95% of answers right on the benchmark, it is 95% reliable. This is how procurement conversations happen, how vendor scorecards are built, and how deployment decisions get made. Accuracy is measurable. Accuracy is legible. Accuracy is wrong.

The Reality

Princeton's evaluation framework (arxiv:2602.16666) identifies 12 distinct dimensions of agent reliability - and accuracy is one of them. The others measure failure modes that accuracy actively obscures:

  • Consistency: Does the agent give the same answer to the same question twice? An agent that scores 95% on a benchmark but gives different answers on repeated queries is not 95% reliable - it is unpredictably unreliable.
  • Robustness: Does accuracy hold when inputs are slightly reworded, formatted differently, or submitted at different times of day? Most agents are brittle in ways the benchmark doesn't reveal.
  • Predictability: Can the humans monitoring the agent anticipate when it will fail? Unpredictable failure is more dangerous than predictable failure - you can design around the latter.
  • Safety: Does the agent take actions it shouldn't in edge cases? A 99% accurate agent that occasionally takes irreversible destructive action has a reliability problem that accuracy cannot measure.
  • Calibration: Does the agent know when it doesn't know? Overconfident agents produce wrong outputs with high confidence, which is worse than producing wrong outputs with appropriate uncertainty.
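Two of these dimensions - consistency and calibration - can be measured with nothing more than repeated queries and graded outputs. A minimal sketch, assuming you have already collected the agent's answers and confidence scores; the binning scheme is a rough expected-calibration-error approximation, not the Princeton framework's own metric:

```python
import statistics


def consistency(answers: list[str]) -> float:
    """Fraction of repeated runs agreeing with the modal answer.
    1.0 means the agent always says the same thing; a benchmark
    accuracy score alone cannot reveal a low value here."""
    modal = statistics.mode(answers)
    return sum(a == modal for a in answers) / len(answers)


def calibration_gap(results: list[tuple[float, bool]]) -> float:
    """Mean absolute gap between stated confidence and observed accuracy,
    weighted by bin size (a coarse expected-calibration-error sketch).
    `results` pairs each output's confidence with whether it was correct."""
    bins: dict[int, list[tuple[float, bool]]] = {}
    for conf, correct in results:
        bins.setdefault(int(conf * 10), []).append((conf, correct))
    weighted_gaps = 0.0
    for group in bins.values():
        avg_conf = sum(c for c, _ in group) / len(group)
        accuracy = sum(ok for _, ok in group) / len(group)
        weighted_gaps += abs(avg_conf - accuracy) * len(group)
    return weighted_gaps / len(results)
```

An agent that answers the same question four times as "approve, approve, deny, approve" has a consistency of 0.75 regardless of which answer the benchmark would score as correct - which is exactly the failure mode accuracy obscures.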

Incident Pattern

A credit risk agent performs at 94% accuracy on historical data. Deployed in production, it encounters a novel loan structure - a product the training data didn't include. Its accuracy on novel structures is 61%. It doesn't flag this. It produces outputs with normal confidence scores. The team sees no signal that anything is wrong. They find out during a portfolio review six months later.

The Fix

Retire single-metric evaluation. Before deploying an agent in a regulated function, require testing against, at minimum, the subset of the 12 Princeton reliability dimensions relevant to that function. For financial services: prioritize calibration (does it know what it doesn't know?), robustness (does it hold under edge-case inputs?), and safety (does it ever take actions outside its intended scope?). Document the reliability profile - not just the accuracy score - in the deployment record. When the agent fails, this documentation is what distinguishes negligence from reasonable diligence.
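A deployment gate built on this idea is straightforward: refuse to deploy when a required dimension is unmeasured or below threshold. A sketch under stated assumptions - the dimension names, required set, and threshold values here are illustrative choices, not values the framework prescribes:

```python
# Dimensions this (hypothetical) firm requires for regulated functions.
REQUIRED_DIMENSIONS = ("accuracy", "calibration", "robustness", "safety")


def deployment_gate(profile: dict[str, float],
                    thresholds: dict[str, float]) -> list[str]:
    """Return blocking findings; an empty list means the gate passes.

    `profile` maps each measured dimension to its score in [0, 1];
    `thresholds` maps dimensions to the minimum acceptable score.
    An unmeasured required dimension blocks on its own - that is the
    single-metric-evaluation failure this gate exists to catch."""
    findings = []
    for dim in REQUIRED_DIMENSIONS:
        if dim not in profile:
            findings.append(f"{dim}: not measured - single-metric evaluation")
        elif profile[dim] < thresholds.get(dim, 0.0):
            findings.append(
                f"{dim}: {profile[dim]:.2f} below threshold {thresholds[dim]:.2f}"
            )
    return findings
```

Run against a profile containing only an accuracy score, the gate blocks three times - once per unmeasured dimension - which is the documented signal that distinguishes diligence from a vendor scorecard.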

Governance frameworks are written for human decision-makers.
Agentic systems don't have one.
That is not a gap in the agent. It is a gap in the framework.
GAP 03

Architecture Gap

The Assumption

Governance frameworks assume a human decision-maker at the center. Every compliance obligation, audit requirement, fiduciary duty, and regulatory mandate was designed with a person in mind - someone who can be held accountable, who can explain their reasoning, who can be sanctioned or rewarded. The entire architecture of financial regulation is a system for governing human judgment.

The Reality

Agentic systems are not human decision-makers operating at scale. They are a fundamentally different class of actor - one that existing frameworks were not designed to govern. The architectural mismatches are structural:

  • Fiduciary duty attaches to persons. An agent is not a person. The human nominally responsible for an agent's outputs may have no knowledge of a specific decision - and no mechanism to have had it.
  • Explainability requirements assume the decision-maker can articulate their reasoning. Chain-of-thought outputs are not reasoning traces - they are post-hoc narratives generated by the same model that produced the output.
  • Audit trails record human actions. Agent logs record token emissions. These are not equivalent. A log that shows what the agent produced does not show why it produced it, and courts have not yet resolved whether a token stream satisfies an explanation requirement.
  • Sovereign agent frameworks (e.g., arxiv:2501.xxxxx) propose that agents with persistent identity, goal-directedness, and resource acquisition capabilities may require governance frameworks closer to corporate personhood than to software licensing. Financial regulators have not caught up.

Incident Pattern

A wealth management firm deploys a client communication agent that sends personalized portfolio commentary to 12,000 clients. One communication contains a statement that, in context of a specific client's tax situation, constitutes unsolicited tax advice - a licensed activity. No individual at the firm made the decision to send it. The agent did. Regulators ask who is responsible. The firm's legal team spends four months constructing an answer that satisfies nobody.

The Fix

Governance-by-design, not governance-by-retrofit. Before deploying agents in regulated functions, map every regulatory obligation the agent's outputs could trigger - licensing, disclosure, suitability, explainability, data residency, fair lending, best execution. For each obligation, designate a human who is accountable for that dimension specifically and who has the access to monitor it. Do not attempt to retrofit existing compliance frameworks onto agentic systems - they will not fit. Build the governance architecture around the agent's actual decision surface, not the human decision-making framework it replaces. This is not a compliance exercise. It is a precondition for operating legally at scale.
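The obligation-mapping step reduces to a coverage check: every obligation the agent's outputs could trigger must have a named, accountable human before deployment. A minimal sketch - the obligation list comes from the paragraph above, but treating it as a flat tuple and the owner map as a dict are simplifying assumptions:

```python
# Obligations an agent's outputs could trigger (from the mapping exercise).
OBLIGATIONS = (
    "licensing",
    "disclosure",
    "suitability",
    "explainability",
    "data_residency",
    "fair_lending",
    "best_execution",
)


def unowned_obligations(obligation_owners: dict[str, str]) -> list[str]:
    """Obligations with no accountable human assigned.

    `obligation_owners` maps obligation -> named owner; a missing key
    or an empty string both count as unowned. A non-empty result means
    the agent is not ready to deploy in a regulated function."""
    return [ob for ob in OBLIGATIONS if not obligation_owners.get(ob)]
```

The wealth-management incident above is this check failing silently: "licensing" had no owner, so no one was positioned to notice that a communication crossed into a licensed activity until after 12,000 clients received it.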

The question is not whether your agents are accurate.
The question is who answers when they're wrong.

Sources

  • Princeton AI Reliability Framework - "Benchmarking Agentic AI: A Thinking-First Approach to 12 Metrics of Reliability" (arxiv:2602.16666, 2025)
  • Sovereign AI Agents - proposed governance frameworks for persistent, goal-directed AI actors (arxiv, 2025)
  • FANUC Oshino Plant - lights-off manufacturing reference, 2001-present
  • Dark Factory Scale - unmake.it framework for agentic automation maturity