Finance Agent Benchmark: AI Hits 55% — Useful but Not Reliable
Published Nov 10, 2025
The Finance Agent benchmark (2025-11-07) shows meaningful progress while exposing clear limits: Claude Sonnet 4.5 leads at 55.3%, excelling at simple retrieval and calculation yet failing at multi-step inference, tool control, and context retention. Agents can augment routine financial workflows such as data gathering and basic reporting, but nearly half of tasks still require human analysts.

Comparative benchmarks show much stronger results for specialized coding agents (Claude Code >72% on local tasks) than for autonomous research agents (~13.9% average), underscoring that domain specialization and real-world telemetry drive practical value.

The strategic priorities are clear: improve tool interfacing, multi-step reasoning, context switching, and error recovery, and adopt benchmarks that measure real-world impact rather than synthetic task completion. Scaling agentic AI across professional domains depends on these targeted advances and on continued human oversight.