CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
Abstract
A healthcare workflow benchmark challenges agents with policy-dense, multi-role, and multilateral-interaction requirements, revealing significant performance gaps in automating enterprise workflows.
End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density, where decisions must be grounded in a large library of medical, insurance, and operational rules; multi-role composition, where a single task requires the agent to play multiple roles with handoffs; and multilateral interaction, where intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach. We introduce χ-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools. The agent must drive the case to a terminal status by making tool calls and writing each role's artifacts, guided by a managed-care operations handbook skill of 1,290+ documents. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.
Community
Today, we introduce CHI-Bench (Clinical Healthcare In-situ Benchmark), the first long-horizon healthcare benchmark for AI agents.
We built high-fidelity simulators for three live domains: Provider Prior Authorization, Payer Utilization Management, and Population Health Care Management, each instantiated as MCP servers that operate on patient, clinician, and insurer records.
Each trial in CHI-Bench runs an agent for 60-80 steps across four to six clinical stages, exposing 21 healthcare apps through 200+ MCP tools and a 1,279-document operations handbook. It evaluates the trajectory, every artifact, and the world state using deterministic unit tests and an LLM judge for evidence grounding, consent, and cross-stage consistency.
Results from 30 frontier agents on the leaderboard
- Best overall: Anthropic's Claude Code with Opus 4.6 — 28% pass@1.
- Runner-up: OpenAI's Codex with GPT-5.5 — 21%.
- By domain: utilization review 41%; care management 32%; prior-authorization paperwork 29%.
- Reliability: no agent clears 20% on strict pass^3, where the same case is run three times and all three runs must succeed.
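The gap between the pass@1 and three-run reliability figures above can be illustrated with a minimal sketch. It assumes that pass@1 is the mean per-trial success rate and that pass^k is estimated per task as C(c, k) / C(n, k) for c successes in n trials, then averaged over tasks. The post does not state the exact estimator CHI-Bench uses, and the function names and case IDs below are hypothetical.

```python
from math import comb


def pass_at_1(trials: dict[str, list[bool]]) -> float:
    """Mean per-trial success rate, averaged over tasks."""
    return sum(sum(r) / len(r) for r in trials.values()) / len(trials)


def pass_hat_k(trials: dict[str, list[bool]], k: int) -> float:
    """Estimated probability that k trials of the same task all succeed.

    Per task: C(c, k) / C(n, k), with c successes out of n trials.
    This estimator is an assumption, not taken from the CHI-Bench release.
    """
    total = 0.0
    for results in trials.values():
        n, c = len(results), sum(results)
        if n < k:
            raise ValueError("each task needs at least k trials")
        total += comb(c, k) / comb(n, k)
    return total / len(trials)


# Hypothetical outcomes: three runs for each of four cases.
trials = {
    "case-a": [True, True, True],
    "case-b": [True, False, True],
    "case-c": [False, False, False],
    "case-d": [True, True, False],
}

print(f"pass@1 = {pass_at_1(trials):.3f}")
print(f"pass^3 = {pass_hat_k(trials, 3):.3f}")
```

With exactly three runs per case, pass^3 reduces to the fraction of cases where every run succeeds, so a single failed run zeroes out that case. This is why the reliability figure can sit well below pass@1.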
CHI-Bench is open under Apache 2.0; the leaderboard accepts community submissions today.
🤖GitHub: https://github.com/actava-ai/chi-bench
🤗HuggingFace: https://huggingface.co/datasets/actava/chi-bench
🏆Leaderboard: https://actava.ai/benchmarks