Anthropic's Finance Agents and the 23% Problem

On May 5, 2026, Anthropic shipped ten finance agents — ready-to-run templates for the kind of work that eats an analyst's week: building pitchbooks, screening KYC files, closing the books at month-end. The launch material leads with a number. Claude Opus 4.7 is "state-of-the-art on financial tasks," topping Vals AI's Finance Agent benchmark at 64.37%.

That number is real. It's also from a superseded version of the benchmark.

The agents are a genuinely smart piece of productization, and for a slice of real work they'll earn their keep. But the headline oversells them, and the benchmark that actually mirrors the job tells a more sober story. Here's the practitioner's read.

What actually shipped

Ten agents, grouped into two families. Research and client coverage: pitch builder, meeting preparer, earnings reviewer, model builder, market researcher. Finance and operations: valuation reviewer, general-ledger reconciler, month-end closer, statement auditor, KYC screener.

The interesting part is the plumbing. Each agent ships three ways (as a plugin in Claude Cowork, as a plugin in Claude Code, and as a cookbook for Managed Agents) and connects out to the data terminals analysts already live in: FactSet, S&P Capital IQ, Morningstar, PitchBook, MSCI, LSEG. Pair that with full Microsoft 365 integration (Claude reading and writing across Excel, PowerPoint, Word, and Outlook) and you have something a desk can switch on in days rather than build over months.

So the productization is the innovation here, not the intelligence. Which makes the intelligence question the one worth asking.

Read the leaderboard properly

The 64.37% comes from Finance Agent v1.1. The current version is v2, and it was deliberately rebuilt to look less like a quiz and more like the job: questions reorganized around analyst workflows, stricter numeric tolerances, and a "dealbreaker-gated" grade where one wrong critical fact zeroes the whole answer.

On v2, the frontier bunches up — and not impressively:

Model	Finance Agent v2 (partial credit)
GPT-5.5	51.76%
Claude Opus 4.7	51.51%
Claude Sonnet 4.6	51.03%

No model clears 52%. And that's the generous grade. Under "All-Pass" scoring — every fact in the answer has to be right — the whole field drops below 40%.
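To make the gap between those grades concrete, here is a minimal sketch of how partial-credit, dealbreaker-gated, and All-Pass scoring could differ on the same answer. It is an illustration built from the one-line descriptions in this piece, not Vals AI's actual rubric: the `Fact` type, the function names, and the equal weighting of facts are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One graded claim in an answer (hypothetical structure)."""
    correct: bool
    critical: bool = False  # a "dealbreaker" if wrong


def partial_credit(facts: list[Fact]) -> float:
    """Fraction of facts that are right; assumes equal weights."""
    if not facts:
        raise ValueError("an answer needs at least one graded fact")
    return sum(f.correct for f in facts) / len(facts)


def dealbreaker_gated(facts: list[Fact]) -> float:
    """Partial credit, unless any critical fact is wrong, which zeroes the answer."""
    if any(f.critical and not f.correct for f in facts):
        return 0.0
    return partial_credit(facts)


def all_pass(facts: list[Fact]) -> float:
    """Full marks only if every fact is right."""
    if not facts:
        raise ValueError("an answer needs at least one graded fact")
    return 1.0 if all(f.correct for f in facts) else 0.0


# Same hit rate (3 of 4), but the miss lands in different places.
minor_slip = [Fact(True, critical=True), Fact(True), Fact(True), Fact(False)]
critical_slip = [Fact(False, critical=True), Fact(True), Fact(True), Fact(True)]

for name, answer in [("minor slip", minor_slip), ("critical slip", critical_slip)]:
    print(name, partial_credit(answer), dealbreaker_gated(answer), all_pass(answer))
# minor slip 0.75 0.75 0.0
# critical slip 0.75 0.0 0.0
```

Under these assumptions, both answers look identical on raw partial credit. The stricter grades are what separate a harmless miss from a fatal one.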

The category breakdown is where it gets pointed. v2 splits the work into nine analytical buckets, and the spread is enormous:

Category	Top score
Earnings analysis	~70%
General quantitative	~70%
General qualitative	~70%
Disclosure analysis	mid-60s
Market analysis	mid-60s
Adjustments	45–50%
Comparables	45–50%
Financial modeling	~23%
Precedents	~23%

Now line that up against the product. The two worst categories on the board — financial modeling and precedents, both stuck around 23% — are precisely what the model builder and valuation reviewer agents are sold to do. The tasks Anthropic is packaging as flagship agents are the tasks models are demonstrably worst at.

The pattern underneath is consistent: models do well on retrieval ("find this number, read this filing") and fall apart on multi-step synthesis ("build the model, run the precedents, defend the output"). The 70% categories are lookups. The 23% categories are judgment.

Why the hard tasks stay hard

This isn't a tuning problem that the next checkpoint fixes. Multi-step financial work compounds error: each step inherits the last step's mistakes, and a model with no source document will cheerfully fill the gap. Wall Street Prep's 2026 testing found that even leading tools "hallucinated significant portions of historical data" when asked to source figures themselves rather than work from supplied documents.
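The compounding is easy to put numbers on. The sketch below assumes each step succeeds independently with the same probability, which is a simplification, and the 95% per-step figure is purely illustrative rather than a measured result for any model.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is right, assuming independent steps."""
    return per_step_accuracy ** steps


# Illustrative only: a hypothetical 95%-per-step model on longer and longer chains.
for steps in (1, 5, 10, 20):
    print(steps, round(chain_success(0.95, steps), 3))
# 1 0.95
# 5 0.774
# 10 0.599
# 20 0.358
```

On these assumptions, a model that looks strong on any single step gets a twenty-step build entirely right barely a third of the time.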

And finance has an unusually brutal error profile. In most domains 99% accuracy is excellent. On a balance sheet it can be worthless:

"An accuracy rate of 99% yields 0% operational trust if the 1% error is a sign-convention inversion on a balance sheet or a temporal hallucination in a compliance report."

A model that's right 99 times and silently flips a sign on the hundredth hasn't saved you work — it's handed you a number you now have to fully re-derive to trust. That's why the Cambridge Centre for Alternative Finance's 2026 survey found 70% of firms and 70% of regulators naming model hallucination a top-two risk. The failure mode isn't being wrong loudly. It's being wrong quietly, inside a cell you didn't check.
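Some quiet errors can at least be made loud. As a minimal sketch, the check below tests the accounting identity (assets = liabilities + equity) on a model-produced balance sheet and, when it fails, flags any line whose flipped sign would explain the gap exactly. That works because inverting an item of size v moves the total by 2v. The figures, line names, and function names are hypothetical.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.01")  # assumed rounding tolerance


def balance_gap(assets: dict[str, Decimal], liabilities_and_equity: dict[str, Decimal]) -> Decimal:
    """Assets minus (liabilities + equity); zero when the sheet balances."""
    return sum(assets.values(), Decimal(0)) - sum(liabilities_and_equity.values(), Decimal(0))


def sign_flip_suspects(gap: Decimal, items: dict[str, Decimal]) -> list[str]:
    """Names of line items where a sign inversion would produce exactly this gap."""
    return [
        name
        for name, value in items.items()
        if value != 0 and abs(abs(gap) - 2 * abs(value)) <= TOLERANCE
    ]


# Hypothetical model output: treasury stock should be -40 (contra-equity) but came back +40.
assets = {"cash": Decimal("120"), "receivables": Decimal("80")}
liabilities_and_equity = {
    "debt": Decimal("90"),
    "retained_earnings": Decimal("150"),
    "treasury_stock": Decimal("40"),
}

gap = balance_gap(assets, liabilities_and_equity)
print(gap)  # -80
print(sign_flip_suspects(gap, {**assets, **liabilities_and_equity}))  # ['treasury_stock']
```

A check like this only catches errors that break the identity. A wrong number that still balances, or a misdated figure in a compliance report, passes straight through, which is why the human review stays.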

So what, practically

None of this makes the agents useless. It makes them specific. The honest deployment map:

Lean in on assembly and retrieval: pulling the raw data for comps, drafting the first pitchbook, reading a transcript and flagging changes, gathering a KYC file, prepping a meeting brief. This is where the 70% lives, and it's genuinely the grind that burns out junior analysts.
Keep humans on judgment: building the model that drives a decision, signing off on a valuation, anything where a single wrong number propagates silently. This is the 23%, and the cost of a quiet error is high.

The tell is that this is already how Anthropic ships them. Its KYC/AML agent is explicitly designed so the AI assembles the evidence and a human makes the call. The autonomous-analyst framing is in the marketing; the human-in-the-loop is in the product.

So the disruption is real, but it's aimed lower than the headline implies. These agents come for the junior-analyst treadmill — the assembling, the formatting, the first draft — not for the analyst's judgment. For now, the right mental model isn't "an analyst in a box." It's a fast, tireless first-year who produces a lot and needs every number checked.

Read the leaderboard, not the press release.

Sources: Anthropic — Agents for financial services, Vals AI — Finance Agent v2, Fortune.