arxiv:2605.16679

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

Published on May 15

Submitted by Weiran Yao on May 19

actAVA AI



Abstract

Healthcare workflow benchmark challenges agents with policy-dense, multi-role, and multilateral interaction requirements, revealing significant performance gaps in automated enterprise applications.

AI-generated summary

End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density: decisions must be grounded in a large library of medical, insurance, and operational rules; multi-role composition: a single task requires the agent to play multiple roles with handoffs; and multilateral interaction: intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach. We introduce χ-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools. The agent must drive the case to a terminal status by making tool calls and writing the role's artifacts, guided by a managed-care operations handbook skill of 1,290+ documents. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.


Community

weirayao

Paper submitter about 5 hours ago

Today, we introduce CHI-Bench (Clinical Healthcare In-situ Benchmark), the first long-horizon healthcare benchmark for AI agents.

We built high-fidelity simulators for three live domains: Provider Prior Authorization, Payer Utilization Management, and Population Health Care Management. Each is instantiated as MCP servers that operate on patient, clinician, and insurer records.

Each trial in CHI-Bench runs an agent for 60-80 steps across four to six clinical stages, exposing 21 healthcare apps through 200+ MCP tools and a 1,279-document operations handbook. The trajectory, every artifact, and the final world state are evaluated with deterministic unit tests and an LLM judge that scores evidence grounding, consent, and cross-stage consistency.
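To make the grading setup described above more concrete, here is a minimal Python sketch of how deterministic checks and an LLM judge could be combined into one verdict. It is not the benchmark's actual harness: the `Trial` record, the `grade` function, the check names, and the all-must-pass rule are assumptions made for illustration, and the judge is a stub rather than a real model call. Only the three judge dimensions come from the post.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trial:
    """Hypothetical record of one agent run (field names are illustrative)."""
    trajectory: List[dict]        # ordered tool calls made by the agent
    artifacts: Dict[str, str]     # artifact name -> written text
    world_state: Dict[str, str]   # simulator state after the run


Check = Callable[[Trial], bool]
Judge = Callable[[Trial, str], bool]

# Judge dimensions named in the post.
RUBRIC = ["evidence grounding", "consent", "cross-stage consistency"]


def grade(trial: Trial, checks: Dict[str, Check], judge: Judge) -> dict:
    """Run deterministic checks, then judge each rubric dimension.

    Assumption: a trial passes only if every check and every judged
    dimension passes.
    """
    details = {name: bool(fn(trial)) for name, fn in checks.items()}
    for dimension in RUBRIC:
        details[f"judge:{dimension}"] = bool(judge(trial, dimension))
    return {"passed": all(details.values()), "details": details}


# Toy usage with made-up checks and a stub judge that approves everything.
trial = Trial(
    trajectory=[{"tool": "submit_request"}],
    artifacts={"determination_letter": "Approved per policy."},
    world_state={"case_status": "approved"},
)
checks = {
    "terminal_status": lambda t: t.world_state.get("case_status") == "approved",
    "letter_written": lambda t: bool(t.artifacts.get("determination_letter")),
    "used_tools": lambda t: len(t.trajectory) > 0,
}
report = grade(trial, checks, judge=lambda t, dimension: True)
print(report["passed"])
```

Keeping the deterministic checks and the judged dimensions in one `details` map makes it easy to see which part of a trial failed, which matters when a single missed consent step can sink an otherwise correct run.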

Results from 30 frontier agents on the leaderboard:

Best overall: Anthropic's Claude Code with Opus 4.6 — 28% pass@1.
Runner-up: OpenAI's Codex with GPT-5.5 — 21%.
By domain: utilization review 41%; care management 32%; prior-authorization paperwork 29%.
Reliability: no agent clears 20% when the same case is run three times.
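The gap between pass@1 and the three-run reliability figure is easier to see with a small calculation. The sketch below assumes the common definitions: pass@1 as the mean per-trial success rate, and strict pass^k as the fraction of cases where all k runs succeed. The benchmark's exact formulas may differ, and the case names and outcomes here are invented for illustration.

```python
from typing import Dict, List


def pass_at_1(outcomes: Dict[str, List[bool]]) -> float:
    """Mean per-trial success rate, averaged over cases."""
    per_case = [sum(runs) / len(runs) for runs in outcomes.values()]
    return sum(per_case) / len(per_case)


def pass_hat_k(outcomes: Dict[str, List[bool]], k: int) -> float:
    """Strict pass^k: fraction of cases where all k runs succeed."""
    for case, runs in outcomes.items():
        if len(runs) != k:
            raise ValueError(f"{case} has {len(runs)} runs, expected {k}")
    solved = [all(runs) for runs in outcomes.values()]
    return sum(solved) / len(solved)


# Invented outcomes: four cases, three runs each.
outcomes = {
    "case-a": [True, True, True],
    "case-b": [True, False, True],
    "case-c": [False, False, False],
    "case-d": [True, True, False],
}

print(round(pass_at_1(outcomes), 3))   # 0.583
print(pass_hat_k(outcomes, 3))         # 0.25
```

In this toy data the agent succeeds on more than half of individual runs yet solves only one case in four reliably, which is the kind of drop the reliability line describes.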