TRL-Bench Collection TRL-Bench: cross-paradigm representation-level evaluation of tabular encoders. CTbench + Rbench + DLTE. • 4 items • Updated 29 days ago
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents Paper • 2604.07429 • Published Apr 8 • 121
FORGE:Fine-grained Multimodal Evaluation for Manufacturing Scenarios Paper • 2604.07413 • Published Apr 8 • 95
FORGE:Fine-grained Multimodal Evaluation for Manufacturing Scenarios Paper • 2604.07413 • Published Apr 8 • 95
CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents Paper • 2603.24440 • Published Mar 25 • 98