DriveVLM-RL: pretrained policy checkpoints

Final SAC policy checkpoints for DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe and Deployable Autonomous Driving.

Policies are trained in CARLA Town02 with three independent seeds. The VLM reward components (CLIP, YOLOv8s, Qwen3-VL) are used only during training; these checkpoints are pure Stable-Baselines3 SAC policies and need no VLM at inference.

Code: https://github.com/zilin-huang/DriveVLM-RL
Paper: https://arxiv.org/abs/2603.18315
Project page: https://zilin-huang.github.io/DriveVLM-RL-website/

Checkpoints

Normal mode

Standard DriveVLM-RL policies (full dual-pathway reward).

File	Seed (run id)	Training steps
`drivevlm_rl_normal_seed1_20251029_115740.zip`	20251029_115740	980k
`drivevlm_rl_normal_seed2_20260423_120202.zip`	20260423_120202	1.16M
`drivevlm_rl_normal_seed3_20260423_120553.zip`	20260423_120553	1.07M

Extreme mode

Policies trained under the extreme reward setting with the safety penalty disabled (R_penalty = 0), used for the robustness study in the paper.

File	Seed (run id)	Training steps
`drivevlm_rl_extreme_seed1_20251104_141251.zip`	20251104_141251	1.02M
`drivevlm_rl_extreme_seed2_20260425_023939.zip`	20260425_023939	1.06M
`drivevlm_rl_extreme_seed3_20260425_023946.zip`	20260425_023946	1.03M
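
The file names in both tables appear to follow one pattern, `drivevlm_rl_<mode>_seed<k>_<run id>.zip`. A minimal sketch for splitting a name into its parts, assuming that pattern holds for every checkpoint (the `parse_checkpoint_name` helper and `CheckpointName` type are illustrative and not part of the DriveVLM-RL repo):

```python
import re
from typing import NamedTuple


class CheckpointName(NamedTuple):
    mode: str    # "normal" or "extreme"
    seed: int    # seed index within the mode
    run_id: str  # timestamp-style run id, e.g. "20251029_115740"


# Pattern inferred from the file names listed above (an assumption).
_PATTERN = re.compile(
    r"^drivevlm_rl_(?P<mode>normal|extreme)_seed(?P<seed>\d+)"
    r"_(?P<run_id>\d{8}_\d{6})\.zip$"
)


def parse_checkpoint_name(filename: str) -> CheckpointName:
    """Split a checkpoint file name into mode, seed index, and run id."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {filename!r}")
    return CheckpointName(match["mode"], int(match["seed"]), match["run_id"])


info = parse_checkpoint_name("drivevlm_rl_extreme_seed2_20260425_023939.zip")
print(info.mode, info.seed, info.run_id)
```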

Results

All numbers are mean ± std over 3 training seeds. Metrics: AS = average speed, TD = travel distance (m), RC = route completion, SR = success rate, CS = collision speed (collision severity; lower is safer). ↑ higher is better, ↓ lower is better. Bold marks the best driving method per column. Raw per-seed CSVs are in results/.
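
To make the aggregation concrete, the sketch below formats per-seed values as mean ± std. The model card does not say whether the sample or population standard deviation is used; the sample form (`statistics.stdev`) is assumed here. The per-seed values are made up for illustration and are not taken from results/.

```python
from statistics import mean, stdev


def summarize(per_seed: list[float]) -> str:
    """Format per-seed values as 'mean ± std' using the sample std (n - 1)."""
    return f"{mean(per_seed):.2f} ± {stdev(per_seed):.2f}"


# Hypothetical success rates for three seeds (illustrative only).
print(summarize([0.0, 0.0, 0.1]))  # prints "0.03 ± 0.06"
```

With three seeds the choice matters: the population standard deviation of the same values would round to 0.05 rather than 0.06.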

Main comparison (CARLA Town02, in-distribution)

Method	Venue	AS ↑	TD ↑	RC ↑	SR ↑	CS ↓
TIRL-SAC	TR-C'22	0.45 ± 0.77	1.49 ± 2.32	0.01 ± 0.01	0.00 ± 0.00	0.29 ± 0.50
Chen-SAC	T-ITS'22	24.32 ± 0.46	162.01 ± 17.67	0.49 ± 0.08	0.50 ± 0.10	16.04 ± 2.51
ASAP	RSS'23	11.53 ± 10.22	25.00 ± 24.92	0.12 ± 0.11	0.00 ± 0.00	7.07 ± 5.96
ChatScene-PPO	CVPR'24	14.78 ± 0.30	127.85 ± 10.39	0.44 ± 0.14	0.40 ± 0.10	6.05 ± 1.28
Revolve	ICLR'25	17.42 ± 0.80	134.37 ± 15.26	0.40 ± 0.12	0.40 ± 0.20	10.33 ± 2.25
Revolve-auto	ICLR'25	14.12 ± 3.07	129.14 ± 33.22	0.33 ± 0.12	0.40 ± 0.20	7.80 ± 1.06
VLM-SR	NeurIPS'23	0.06 ± 0.05	2.26 ± 1.26	0.01 ± 0.00	0.00 ± 0.00	0.66 ± 1.14
RoboCLIP	NeurIPS'23	0.13 ± 0.09	3.46 ± 2.32	0.02 ± 0.01	0.00 ± 0.00	0.01 ± 0.02
VLM-RM	ICLR'24	0.08 ± 0.01	3.60 ± 0.38	0.02 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
LORD	WACV'25	0.36 ± 0.59	4.10 ± 6.11	0.03 ± 0.03	0.00 ± 0.00	1.52 ± 2.63
VLM-RL	TR-C'25	14.38 ± 1.53	138.08 ± 16.68	0.51 ± 0.08	0.40 ± 0.00	10.09 ± 11.93
DriveVLM-RL (ours)	This work	14.54 ± 1.81	186.59 ± 14.00	0.57 ± 0.03	0.57 ± 0.15	1.75 ± 3.02

Several reward-only baselines (TIRL-SAC, VLM-SR, RoboCLIP, VLM-RM, LORD) collapse to a near-stationary policy (AS < 0.5, RC ≤ 0.03) and avoid collisions trivially, so their low CS is degenerate. Among methods that complete routes, DriveVLM-RL drives the farthest (TD 186.59), completes the largest fraction of the route (RC 0.57), succeeds most often (SR 0.57), and has the lowest collision severity (CS 1.75). Chen-SAC is faster (AS 24.32) but also has the highest collision severity (CS 16.04).
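
The split between near-stationary and driving policies can be made mechanical. The sketch below copies the mean RC and CS values from the main comparison table and treats any method with RC below 0.05 as near-stationary. That cutoff (`RC_THRESHOLD`) is an assumption chosen for illustration, not a criterion from the paper.

```python
# (method, mean RC, mean CS) copied from the main comparison table.
ROWS = [
    ("TIRL-SAC", 0.01, 0.29),
    ("Chen-SAC", 0.49, 16.04),
    ("ASAP", 0.12, 7.07),
    ("ChatScene-PPO", 0.44, 6.05),
    ("Revolve", 0.40, 10.33),
    ("Revolve-auto", 0.33, 7.80),
    ("VLM-SR", 0.01, 0.66),
    ("RoboCLIP", 0.02, 0.01),
    ("VLM-RM", 0.02, 0.00),
    ("LORD", 0.03, 1.52),
    ("VLM-RL", 0.51, 10.09),
    ("DriveVLM-RL", 0.57, 1.75),
]

RC_THRESHOLD = 0.05  # assumed cutoff for "near-stationary"

# Methods that barely move along the route.
degenerate = [name for name, rc, _ in ROWS if rc < RC_THRESHOLD]
# Remaining methods, ranked by collision severity.
driving = [(name, cs) for name, rc, cs in ROWS if rc >= RC_THRESHOLD]
safest = min(driving, key=lambda row: row[1])

print("near-stationary:", degenerate)
print("lowest CS among driving methods:", safest)
```

With this cutoff the near-stationary set is exactly the five baselines named above, and the lowest CS among the remaining methods is DriveVLM-RL's 1.75.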

Cross-town generalization (zero-shot, trained on Town02)

Town	Method	RC ↑	SR ↑	CS ↓
Town01	ChatScene-PPO	0.33 ± 0.04	0.30 ± 0.10	4.87 ± 0.48
Town01	VLM-RL	0.27 ± 0.08	0.03 ± 0.06	10.64 ± 3.24
Town01	DriveVLM-RL	0.21 ± 0.03	0.03 ± 0.06	1.59 ± 0.80
Town03	ChatScene-PPO	0.34 ± 0.06	0.10 ± 0.00	15.27 ± 4.49
Town03	VLM-RL	0.28 ± 0.04	0.07 ± 0.06	18.20 ± 8.46
Town03	DriveVLM-RL	0.38 ± 0.04	0.10 ± 0.00	10.97 ± 3.05
Town04	ChatScene-PPO	0.27 ± 0.02	0.10 ± 0.00	15.20 ± 3.16
Town04	VLM-RL	0.18 ± 0.02	0.17 ± 0.06	8.54 ± 4.16
Town04	DriveVLM-RL	0.18 ± 0.07	0.07 ± 0.12	3.57 ± 2.45
Town05	ChatScene-PPO	0.29 ± 0.09	0.07 ± 0.06	8.02 ± 7.06
Town05	VLM-RL	0.22 ± 0.06	0.00 ± 0.00	6.77 ± 11.73
Town05	DriveVLM-RL	0.30 ± 0.02	0.03 ± 0.06	3.57 ± 1.50

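DriveVLM-RL attains the lowest collision severity (CS) in every unseen town. It also has the highest route completion in Town03 and Town05, while ChatScene-PPO leads on RC in Town01 and Town04.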

Traffic density (CARLA Town02; Regular is the training density)

Density	Method	RC ↑	SR ↑	CS ↓
Empty	ChatScene-PPO	0.57 ± 0.00	0.90 ± 0.00	8.92 ± 4.04
Empty	VLM-RL	0.53 ± 0.12	0.77 ± 0.32	0.00 ± 0.00
Empty	DriveVLM-RL	0.57 ± 0.05	0.70 ± 0.10	0.00 ± 0.00
Regular	ChatScene-PPO	0.44 ± 0.14	0.40 ± 0.10	6.05 ± 1.28
Regular	VLM-RL	0.51 ± 0.08	0.40 ± 0.00	10.09 ± 5.93
Regular	DriveVLM-RL	0.57 ± 0.03	0.57 ± 0.15	1.75 ± 3.02
Dense	ChatScene-PPO	0.41 ± 0.09	0.20 ± 0.17	4.77 ± 1.10
Dense	VLM-RL	0.37 ± 0.07	0.27 ± 0.06	6.93 ± 1.63
Dense	DriveVLM-RL	0.46 ± 0.08	0.33 ± 0.15	2.28 ± 1.83

Extreme mode (safety penalty disabled, R_penalty = 0)

Method	AS ↑	TD ↑	RC ↑	SR ↑	CS ↓
ChatScene-PPO	15.20 ± 0.39	137.51 ± 12.42	0.38 ± 0.05	0.47 ± 0.12	3.91 ± 1.75
VLM-RL	14.52 ± 0.56	136.33 ± 31.86	0.49 ± 0.09	0.40 ± 0.10	4.59 ± 4.09
DriveVLM-RL	15.17 ± 1.85	149.69 ± 34.85	0.44 ± 0.03	0.50 ± 0.10	0.69 ± 1.09

Even with the explicit safety penalty removed, DriveVLM-RL keeps the lowest collision severity (CS 0.69) and the highest success rate (SR 0.50), which suggests that the dual-pathway semantic reward instills safe behavior on its own rather than relying on a hand-tuned penalty.

Usage

from huggingface_hub import hf_hub_download

ckpt = hf_hub_download("zilinhuang/DriveVLM-RL",
                       "drivevlm_rl_normal_seed1_20251029_115740.zip")
# then, in the DriveVLM-RL repo (vlm-rl env):
#   python eval/eval.py --model <ckpt> --config drivevlm_rl --town Town02 --density regular
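
Because the checkpoints are Stable-Baselines3 archives, a downloaded file can be sanity-checked before launching CARLA. Stable-Baselines3 saves models as a zip archive; the sketch below lists the archive members and reads the `_stable_baselines3_version` entry if it is present. The `describe_checkpoint` helper is illustrative, and the member name is an assumption about the Stable-Baselines3 save format rather than something stated in this model card.

```python
import zipfile


def describe_checkpoint(path: str) -> dict:
    """Return the archive members and, if recorded, the SB3 version string."""
    with zipfile.ZipFile(path) as archive:
        members = archive.namelist()
        version = None
        # Assumed member name; if it is absent, the version stays None.
        if "_stable_baselines3_version" in members:
            version = archive.read("_stable_baselines3_version").decode().strip()
    return {"members": members, "sb3_version": version}


# Usage, with the path returned by hf_hub_download:
#   print(describe_checkpoint(ckpt))
```

Loading the policy for inference would then presumably go through `stable_baselines3.SAC.load(ckpt)`, inside an environment whose observation and action spaces match the repo's CARLA setup.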
