RLI Reveals Agents Can't Automate Remote Work; Liability Looms

Published Nov 10, 2025

The Remote Labor Index benchmark (240 freelance projects, 6,000+ human hours, $140k payouts) finds frontier AI agents automate at most 2.5% of real remote work, with frequent failures—technical errors (18%), incomplete submissions (36%), sub-professional quality (46%), and inconsistent deliverables (15%). These empirical limits, coupled with rising legal scrutiny (e.g., the AI LEAD Act applying product-liability principles and mounting IP/liability risk for code-generating tools), compel an expectation reset. Organizations should treat agents as assistive tools, enforce human oversight and robust fallback processes, and maintain documentation of design, data, and error responses to mitigate legal exposure. Benchmarks like RLI provide measurable baselines; until performance improves materially, prioritize augmentation over replacement.

AI Automation Peaks at 2.5% in Freelance Projects Across Diverse Domains

  • Peak automation on real freelance projects: 2.5% (Manus); others: Grok 4 and Sonnet 4.5 at 2.1% each, GPT-5 at 1.7%, ChatGPT at 1.3%, Gemini 2.5 Pro at 0.8%
  • Benchmark scope: 240 projects across 23 domains; 6,000+ human hours; payouts $140,000+
  • Failure rates: poor quality 46%; incomplete submissions 36%; technical errors/unusable files 18%; inconsistency ~15%
  • Overall takeaway: current agentic AI delivers under 3% end-to-end automation on real remote work

Mitigating AI Risks: Balancing Automation, Legal Liability, and Quality Governance

  • Over-automation and delivery failure (risk priority: Highest)
  • Why important: RLI shows agents complete at most 2.5% of real remote projects end-to-end across domains; a minimal oversight-and-fallback sketch follows this list
  • Outlook: 2026 (TBD), anticipated, per RemoteLabor.ai (benchmark momentum)
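If the operating posture is "agents assist, humans approve," the governance advice above can be made concrete. The sketch below is a minimal illustration, not the RLI tooling or any mandated process; the helper names (agent_run, human_review, human_fallback) are assumptions standing in for whatever an organization already uses. It shows one way to require explicit human sign-off, route rejected or errored work to a human fallback, and keep a structured audit trail of design and error responses.

```python
import json
import logging
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent_governance")


@dataclass
class Deliverable:
    task_id: str
    content: str
    produced_by: str          # "agent" or "human_fallback"
    approved: bool = False
    reviewer: Optional[str] = None


def audit(event: str, **fields) -> None:
    """Append a structured, timestamped record to the audit log."""
    record = {"event": event, "ts": datetime.now(timezone.utc).isoformat(), **fields}
    log.info(json.dumps(record))


def gated_delivery(
    task_id: str,
    agent_run: Callable[[str], str],        # hypothetical agent call
    human_review: Callable[[str], bool],    # True only if professional-grade
    human_fallback: Callable[[str], str],   # human completes the work on rejection
    reviewer: str,
) -> Deliverable:
    """Run the agent, require explicit human approval, and fall back on failure."""
    audit("task_started", task_id=task_id)
    try:
        draft = agent_run(task_id)
        audit("agent_draft_produced", task_id=task_id, chars=len(draft))
    except Exception as exc:  # technical errors / unusable output
        audit("agent_error", task_id=task_id, error=repr(exc))
        draft = None

    if draft is not None and human_review(draft):
        audit("human_approved", task_id=task_id, reviewer=reviewer)
        return Deliverable(task_id, draft, "agent", approved=True, reviewer=reviewer)

    # Rejected or errored: route to the human fallback path and record why.
    audit("fallback_invoked", task_id=task_id, reviewer=reviewer)
    final = human_fallback(task_id)
    return Deliverable(task_id, final, "human_fallback", approved=True, reviewer=reviewer)
```

The point of the pattern is less the code than the records it leaves: every approval, error, and fallback is logged, which is the kind of documentation a product-liability framing would ask for.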

Accountability, Not Automation: Why Building Trust Is Key to AI Agents’ Success

Depending on where you sit, the RLI numbers are either a reality check or an unfair trial. Skeptics call 2.5% end-to-end automation a rounding error, proof that “agents” are still interns with Wi-Fi. Optimists counter that outcomes undercount the quiet wins—speedups on subtasks, knowledge retrieval, and draft generation that never show up as “project completion.” Builders argue the benchmark’s bar is set at professional-grade deliverables under real constraints—exactly the friction that agents must learn to survive. Meanwhile, the legal turn cuts through the hype: the AI LEAD Act reframes founders as manufacturers and “oops” as liability, a move labor advocates applaud and libertarians decry as innovation-chilling. Either way, the combined signal is blunt: automated freelancing isn’t here, and responsibility won’t be outsourced to the ether of multi-agent complexity.

The deeper play is to stop optimizing for replacement and start optimizing for accountability. If you measure what compounds—clean handoffs, auditable traces, deterministic retries, provenance-aware code and content—then 2.5% becomes a wedge that safely de-risks the other 97.5%. Benchmarks like RLI and liability regimes don’t slow progress; they set the fitness function for agents that can actually be paid, trusted, and defended in court. The surprising conclusion is that the fastest route to autonomy runs through boredom: make agents predictable, inspectable, and lawsuit-resistant first, and only then will capability scale without breaking trust. In other words, the next breakthrough isn’t a clever planner—it’s a contract: between agents and operators, logs and law, promises and proofs. When the system is built to be accountable by design, performance will follow—and the market will notice.
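What "auditable traces" and "deterministic retries" can look like in practice is sketched below, under stated assumptions: a single agent step exposed as a plain callable, a JSONL trace file, and illustrative names (traced_retry, content_hash) that are not any established library API. Determinism itself would come from pinning the step's sampling parameters (e.g., fixed seed, temperature 0), which is outside this sketch; what the sketch adds is a bounded retry loop where every attempt leaves a provenance record.

```python
import hashlib
import json
import time
from typing import Callable


def content_hash(text: str) -> str:
    """Stable fingerprint of an input or output, for provenance records."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def traced_retry(
    step: Callable[[str], str],      # hypothetical agent step; pin its seed/temperature for determinism
    prompt: str,
    max_attempts: int = 3,
    trace_path: str = "agent_trace.jsonl",
) -> str:
    """Retry a bounded number of times, appending an auditable trace record per attempt."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        record = {"attempt": attempt, "prompt_hash": content_hash(prompt), "ts": time.time()}
        try:
            output = step(prompt)
            record.update(status="ok", output_hash=content_hash(output))
            with open(trace_path, "a") as f:
                f.write(json.dumps(record) + "\n")
            return output
        except Exception as exc:
            last_error = exc
            record.update(status="error", error=repr(exc))
            with open(trace_path, "a") as f:
                f.write(json.dumps(record) + "\n")
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error
```

Nothing here makes an agent smarter; it makes its behavior inspectable after the fact, which is the accountability property the argument above says the market and the courts will reward.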