arxiv:2606.18953

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

Published on Jun 17

Submitted by Kinam Kim on Jun 29

Microsoft Research



Abstract

An object-centric residual reinforcement learning framework improves real-world vision-language-action model robustness through simulation-trained corrective policies that transfer zero-shot despite sim-to-real challenges.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Vision-Language-Action (VLA) models can generalize across diverse manipulation tasks, but their imitation-learning-based policies remain brittle in precise physical interactions due to compounding execution errors. Can a reinforcement learning policy trained purely in simulation improve the robustness of real-world VLAs zero-shot? Residual RL, which learns a corrective policy on top of a frozen VLA, offers a natural framework, but existing approaches face a fundamental sim-to-real dilemma: privileged-state methods require lossy distillation for deployment, image-based methods suffer from the visual domain gap, and real-world RL is costly and unsafe. We propose an object-centric residual RL framework that refines VLA actions using object poses, enabling a compact observation space that transfers consistently between simulation and reality. To align the two domains, we additionally replay the same teleoperation demonstrations in simulation to train a sim counterpart of the real-world VLA. The residual RL policy is trained only in simulation with pose noise injection and dropout, and transfers zero-shot to the real robot. Across five manipulation tasks on a real Franka Research 3 (FR3) robot, our method improves the success rate from 42% to 76% zero-shot, and the improved rollouts can be reused to retrain the base VLA for self-improvement without additional teleoperation. Project page: https://www.microsoft.com/en-us/research/articles/object-centric-residual-rl/
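The abstract mentions pose noise injection and dropout during simulation training but gives no details. The following is a minimal sketch of what such an observation corruption could look like. It assumes a 6-D object pose (xyz position plus a rotation vector), Gaussian noise, and a validity flag that is cleared when the pose is dropped. The function name `corrupt_pose`, the noise scales, and the dropout handling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def corrupt_pose(pose, rng, pos_std=0.005, rot_std=0.02, dropout_prob=0.1):
    """Return a noisy 7-D observation: 6-D pose followed by a validity flag.

    All parameter values are hypothetical defaults chosen for illustration.
    pose: array of shape (6,), xyz position (m) + rotation vector (rad).
    """
    pose = np.asarray(pose, dtype=np.float64)

    # Dropout: simulate a missed detection by zeroing the pose and
    # clearing the validity flag (one possible convention, assumed here).
    if rng.random() < dropout_prob:
        return np.concatenate([np.zeros(6), [0.0]])

    # Noise injection: independent Gaussian noise on position and rotation.
    noise = np.concatenate([
        rng.normal(0.0, pos_std, size=3),
        rng.normal(0.0, rot_std, size=3),
    ])
    return np.concatenate([pose + noise, [1.0]])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_pose = np.array([0.4, 0.0, 0.1, 0.0, 0.0, 1.57])
    print(corrupt_pose(true_pose, rng))
```

Training the residual policy on observations corrupted this way is one plausible route to tolerating the noisy or intermittent pose estimates a real pose tracker would produce.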


Community

kinam0252

Paper submitter about 3 hours ago

TL;DR: We make a frozen Vision-Language-Action (VLA) policy reliable on precise, contact-rich tasks by training a small residual correction policy with RL entirely in simulation. The residual conditions on object pose instead of raw pixels, so it transfers to the real robot zero-shot, with no new demos and no real-world RL.

Highlights:

Base VLA stays frozen; we only add a learned correction
Object pose, not images, so no visual sim-to-real gap
Success rate: sim 49% to 87%, real robot 42% to 76%, across 5 tasks
Residual rollouts distill back into the VLA for a stronger standalone policy
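
To make the "frozen base plus learned correction" idea concrete, here is a hedged sketch of one common way residual actions are composed: the residual output is bounded, scaled, and added to the base action. The additive form, the `scale` value, the action bounds, and the name `compose_action` are assumptions for illustration; the paper may compose actions differently.

```python
import numpy as np


def compose_action(base_action, residual, scale=0.1, low=-1.0, high=1.0):
    """Combine a frozen base policy's action with a bounded residual.

    Hypothetical helper: `scale` limits how far the residual can move
    the base action, and the sum is clipped to the action bounds.
    """
    base_action = np.asarray(base_action, dtype=np.float64)
    residual = np.asarray(residual, dtype=np.float64)

    # Bound the raw residual to [-1, 1], then shrink it so the correction
    # stays a small perturbation around the base action.
    correction = scale * np.clip(residual, -1.0, 1.0)

    # Keep the final command inside the valid action range.
    return np.clip(base_action + correction, low, high)


if __name__ == "__main__":
    base = [0.5, 0.0]        # stand-in for the frozen VLA's output
    residual = [1.0, -2.0]   # stand-in for the residual policy's output
    print(compose_action(base, residual))
```

Keeping `scale` small means an untrained residual leaves the base behavior nearly unchanged, which is a frequent motivation for this design in residual RL.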

🌐 Project page: https://www.microsoft.com/en-us/research/articles/object-centric-residual-rl/

Happy to answer questions, feedback welcome!
