Submitted by HideOnBush 34 FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios University of Waterloo 3 2
Submitted by limuloo1999 18 RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details Zhejiang University 13 2
Submitted by taesiri 16 Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory · 23 authors
Submitted by Constant8868 5 ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion Beijing Jiaotong University 1
Submitted by Yongxin-Guo 5 Structured Causal Video Reasoning via Multi-Objective Alignment University of Western Australia 1
Submitted by taesiri 2 VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images · 6 authors
Submitted by taesiri - CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation · 13 authors 1
Submitted by coldhyuk - Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video Ulsan National Institute of Science and Technology 0 1