DriveVLM-RL: pretrained policy checkpoints

Final SAC policy checkpoints for DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe and Deployable Autonomous Driving.

Policies are trained in CARLA Town02 with three independent seeds. The VLM reward components (CLIP, YOLOv8s, Qwen3-VL) are used only during training; these checkpoints are pure Stable-Baselines3 SAC policies and need no VLM at inference.
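
Because a Stable-Baselines3 checkpoint is an ordinary zip archive, a downloaded file can be sanity-checked without loading any VLM. The sketch below only looks at the archive's member names. The two expected entries (`data` and `policy.pth`) reflect SB3's usual save layout and are an assumption here, not something this card states; `looks_like_sb3_checkpoint` is a hypothetical helper name.

```python
import zipfile

# Entries that Stable-Baselines3 normally writes into a saved model archive.
# Assumption: this is SB3's usual layout; the model card does not list them.
EXPECTED_MEMBERS = {"data", "policy.pth"}


def looks_like_sb3_checkpoint(path_or_file):
    """Return True if the zip archive contains the expected SB3 entries."""
    with zipfile.ZipFile(path_or_file) as archive:
        members = set(archive.namelist())
    return EXPECTED_MEMBERS.issubset(members)
```

A check like this catches truncated or mislabeled downloads early. It does not validate the policy weights themselves.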

Checkpoints

Normal mode

Standard DriveVLM-RL policies (full dual-pathway reward).

File | Seed (run id) | Training steps
drivevlm_rl_normal_seed1_20251029_115740.zip | 20251029_115740 | 980k
drivevlm_rl_normal_seed2_20260423_120202.zip | 20260423_120202 | 1.16M
drivevlm_rl_normal_seed3_20260423_120553.zip | 20260423_120553 | 1.07M

Extreme mode

Policies trained under the extreme reward setting with the safety penalty disabled (R_penalty = 0), used for the robustness study in the paper.

File | Seed (run id) | Training steps
drivevlm_rl_extreme_seed1_20251104_141251.zip | 20251104_141251 | 1.02M
drivevlm_rl_extreme_seed2_20260425_023939.zip | 20260425_023939 | 1.06M
drivevlm_rl_extreme_seed3_20260425_023946.zip | 20260425_023946 | 1.03M
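
The file names listed above appear to follow one pattern: mode, seed index, then a run id that looks like a `YYYYMMDD_HHMMSS` timestamp. A minimal sketch of parsing that pattern follows. The regular expression is inferred from the listed names rather than documented by the authors, and `parse_checkpoint_name` is a hypothetical helper.

```python
import re
from datetime import datetime

# Pattern inferred from the listed file names (an assumption, not a spec).
CKPT_RE = re.compile(
    r"^drivevlm_rl_(?P<mode>normal|extreme)_seed(?P<seed>\d+)"
    r"_(?P<run_id>\d{8}_\d{6})\.zip$"
)


def parse_checkpoint_name(filename):
    """Split a checkpoint file name into mode, seed index, and run id."""
    match = CKPT_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognised checkpoint name: {filename}")
    return {
        "mode": match["mode"],
        "seed": int(match["seed"]),
        "run_id": match["run_id"],
        # Assumes the run id encodes a YYYYMMDD_HHMMSS timestamp.
        "timestamp": datetime.strptime(match["run_id"], "%Y%m%d_%H%M%S"),
    }


info = parse_checkpoint_name("drivevlm_rl_extreme_seed2_20260425_023939.zip")
print(info["mode"], info["seed"], info["timestamp"].isoformat())
```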

Results

All numbers are mean ± std over 3 training seeds. Metrics: AS = average speed, TD = travel distance (m), RC = route completion, SR = success rate, CS = collision speed (collision severity; lower is safer). ↑ higher is better, ↓ lower is better. Bold marks the best driving method per column. Raw per-seed CSVs are in results/.
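
As a worked example of the "mean ± std over 3 seeds" aggregation, the sketch below formats three per-seed values the way the table cells are written. It uses the sample standard deviation (n - 1 in the denominator); the card does not say which estimator was used, so that choice is an assumption. The input values are illustrative, not read from results/.

```python
from statistics import mean, stdev


def summarize(per_seed_values):
    """Format per-seed values as 'mean ± std' with two decimals."""
    # stdev() is the sample standard deviation (divides by n - 1).
    # Whether the card used sample or population std is an assumption.
    return f"{mean(per_seed_values):.2f} ± {stdev(per_seed_values):.2f}"


# Illustrative success rates for three seeds; NOT taken from results/.
print(summarize([0.3, 0.4, 0.5]))  # prints "0.40 ± 0.10"
```

With only three seeds the standard deviation is itself a noisy estimate, so small differences between rows should be read with care.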

Main comparison (CARLA Town02, in-distribution)

Method | Venue | AS ↑ | TD ↑ | RC ↑ | SR ↑ | CS ↓
TIRL-SAC | TR-C'22 | 0.45 ± 0.77 | 1.49 ± 2.32 | 0.01 ± 0.01 | 0.00 ± 0.00 | 0.29 ± 0.50
Chen-SAC | T-ITS'22 | 24.32 ± 0.46 | 162.01 ± 17.67 | 0.49 ± 0.08 | 0.50 ± 0.10 | 16.04 ± 2.51
ASAP | RSS'23 | 11.53 ± 10.22 | 25.00 ± 24.92 | 0.12 ± 0.11 | 0.00 ± 0.00 | 7.07 ± 5.96
ChatScene-PPO | CVPR'24 | 14.78 ± 0.30 | 127.85 ± 10.39 | 0.44 ± 0.14 | 0.40 ± 0.10 | 6.05 ± 1.28
Revolve | ICLR'25 | 17.42 ± 0.80 | 134.37 ± 15.26 | 0.40 ± 0.12 | 0.40 ± 0.20 | 10.33 ± 2.25
Revolve-auto | ICLR'25 | 14.12 ± 3.07 | 129.14 ± 33.22 | 0.33 ± 0.12 | 0.40 ± 0.20 | 7.80 ± 1.06
VLM-SR | NeurIPS'23 | 0.06 ± 0.05 | 2.26 ± 1.26 | 0.01 ± 0.00 | 0.00 ± 0.00 | 0.66 ± 1.14
RoboCLIP | NeurIPS'23 | 0.13 ± 0.09 | 3.46 ± 2.32 | 0.02 ± 0.01 | 0.00 ± 0.00 | 0.01 ± 0.02
VLM-RM | ICLR'24 | 0.08 ± 0.01 | 3.60 ± 0.38 | 0.02 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00
LORD | WACV'25 | 0.36 ± 0.59 | 4.10 ± 6.11 | 0.03 ± 0.03 | 0.00 ± 0.00 | 1.52 ± 2.63
VLM-RL | TR-C'25 | 14.38 ± 1.53 | 138.08 ± 16.68 | 0.51 ± 0.08 | 0.40 ± 0.00 | 10.09 ± 11.93
DriveVLM-RL (ours) | This work | 14.54 ± 1.81 | 186.59 ± 14.00 | 0.57 ± 0.03 | 0.57 ± 0.15 | 1.75 ± 3.02

Several reward-only baselines (TIRL-SAC, VLM-SR, RoboCLIP, VLM-RM, LORD) collapse to a near-stationary policy (AS < 0.5, RC ≤ 0.03). They avoid collisions trivially by barely moving, so their low CS is degenerate and not evidence of safe driving. Among the remaining methods, DriveVLM-RL travels the farthest (TD), completes the largest share of the route (RC), succeeds most often (SR), and has the lowest collision severity (CS).
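
The reading of the main table described here can be reproduced in a few lines: drop the near-stationary policies first, then compare CS among the rest. The sketch uses the AS, RC, and CS means from the main comparison table. The two cut-offs (`MIN_AS`, `MIN_RC`) are hypothetical thresholds chosen for illustration, not values from the paper.

```python
# (method, AS, RC, CS) means copied from the main comparison table.
ROWS = [
    ("TIRL-SAC", 0.45, 0.01, 0.29),
    ("Chen-SAC", 24.32, 0.49, 16.04),
    ("ASAP", 11.53, 0.12, 7.07),
    ("ChatScene-PPO", 14.78, 0.44, 6.05),
    ("Revolve", 17.42, 0.40, 10.33),
    ("Revolve-auto", 14.12, 0.33, 7.80),
    ("VLM-SR", 0.06, 0.01, 0.66),
    ("RoboCLIP", 0.13, 0.02, 0.01),
    ("VLM-RM", 0.08, 0.02, 0.00),
    ("LORD", 0.36, 0.03, 1.52),
    ("VLM-RL", 14.38, 0.51, 10.09),
    ("DriveVLM-RL", 14.54, 0.57, 1.75),
]

# Hypothetical cut-offs for "near-stationary" (assumptions for illustration).
MIN_AS, MIN_RC = 1.0, 0.05


def is_degenerate(avg_speed, route_completion):
    """Flag policies that barely move and barely progress along the route."""
    return avg_speed < MIN_AS and route_completion < MIN_RC


degenerate = [row[0] for row in ROWS if is_degenerate(row[1], row[2])]
moving = [row for row in ROWS if not is_degenerate(row[1], row[2])]
safest = min(moving, key=lambda row: row[3])  # lowest CS among moving policies

print(degenerate)
print(safest[0])
```

Without the filter, a plain minimum over CS would pick VLM-RM (0.00), which is why the degenerate rows have to be excluded before comparing safety.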

Cross-town generalization (zero-shot, trained on Town02)

Town | Method | RC ↑ | SR ↑ | CS ↓
Town01 | ChatScene-PPO | 0.33 ± 0.04 | 0.30 ± 0.10 | 4.87 ± 0.48
Town01 | VLM-RL | 0.27 ± 0.08 | 0.03 ± 0.06 | 10.64 ± 3.24
Town01 | DriveVLM-RL | 0.21 ± 0.03 | 0.03 ± 0.06 | 1.59 ± 0.80
Town03 | ChatScene-PPO | 0.34 ± 0.06 | 0.10 ± 0.00 | 15.27 ± 4.49
Town03 | VLM-RL | 0.28 ± 0.04 | 0.07 ± 0.06 | 18.20 ± 8.46
Town03 | DriveVLM-RL | 0.38 ± 0.04 | 0.10 ± 0.00 | 10.97 ± 3.05
Town04 | ChatScene-PPO | 0.27 ± 0.02 | 0.10 ± 0.00 | 15.20 ± 3.16
Town04 | VLM-RL | 0.18 ± 0.02 | 0.17 ± 0.06 | 8.54 ± 4.16
Town04 | DriveVLM-RL | 0.18 ± 0.07 | 0.07 ± 0.12 | 3.57 ± 2.45
Town05 | ChatScene-PPO | 0.29 ± 0.09 | 0.07 ± 0.06 | 8.02 ± 7.06
Town05 | VLM-RL | 0.22 ± 0.06 | 0.00 ± 0.00 | 6.77 ± 11.73
Town05 | DriveVLM-RL | 0.30 ± 0.02 | 0.03 ± 0.06 | 3.57 ± 1.50

DriveVLM-RL attains the lowest collision severity (CS) in every unseen town.
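
The per-town claim can be checked directly from the CS means in the cross-town table. The sketch below picks the method with the lowest mean CS in each town. It compares means only and ignores the ± std spread across seeds, which is a simplification.

```python
# CS means copied from the cross-town table: town -> {method: CS}.
CS_BY_TOWN = {
    "Town01": {"ChatScene-PPO": 4.87, "VLM-RL": 10.64, "DriveVLM-RL": 1.59},
    "Town03": {"ChatScene-PPO": 15.27, "VLM-RL": 18.20, "DriveVLM-RL": 10.97},
    "Town04": {"ChatScene-PPO": 15.20, "VLM-RL": 8.54, "DriveVLM-RL": 3.57},
    "Town05": {"ChatScene-PPO": 8.02, "VLM-RL": 6.77, "DriveVLM-RL": 3.57},
}

# For each town, the method whose mean CS is smallest.
safest_per_town = {
    town: min(cs, key=cs.get) for town, cs in CS_BY_TOWN.items()
}

for town, method in safest_per_town.items():
    print(town, method)
```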

Traffic density (CARLA Town02; Regular is the training density)

Density | Method | RC ↑ | SR ↑ | CS ↓
Empty | ChatScene-PPO | 0.57 ± 0.00 | 0.90 ± 0.00 | 8.92 ± 4.04
Empty | VLM-RL | 0.53 ± 0.12 | 0.77 ± 0.32 | 0.00 ± 0.00
Empty | DriveVLM-RL | 0.57 ± 0.05 | 0.70 ± 0.10 | 0.00 ± 0.00
Regular | ChatScene-PPO | 0.44 ± 0.14 | 0.40 ± 0.10 | 6.05 ± 1.28
Regular | VLM-RL | 0.51 ± 0.08 | 0.40 ± 0.00 | 10.09 ± 5.93
Regular | DriveVLM-RL | 0.57 ± 0.03 | 0.57 ± 0.15 | 1.75 ± 3.02
Dense | ChatScene-PPO | 0.41 ± 0.09 | 0.20 ± 0.17 | 4.77 ± 1.10
Dense | VLM-RL | 0.37 ± 0.07 | 0.27 ± 0.06 | 6.93 ± 1.63
Dense | DriveVLM-RL | 0.46 ± 0.08 | 0.33 ± 0.15 | 2.28 ± 1.83

Extreme mode (safety penalty disabled, R_penalty = 0)

Method | AS ↑ | TD ↑ | RC ↑ | SR ↑ | CS ↓
ChatScene-PPO | 15.20 ± 0.39 | 137.51 ± 12.42 | 0.38 ± 0.05 | 0.47 ± 0.12 | 3.91 ± 1.75
VLM-RL | 14.52 ± 0.56 | 136.33 ± 31.86 | 0.49 ± 0.09 | 0.40 ± 0.10 | 4.59 ± 4.09
DriveVLM-RL | 15.17 ± 1.85 | 149.69 ± 34.85 | 0.44 ± 0.03 | 0.50 ± 0.10 | 0.69 ± 1.09

Even with the explicit safety penalty removed, DriveVLM-RL keeps the lowest collision severity and the highest success rate, which suggests that the dual-pathway semantic reward itself instills safe behavior and does not depend on a hand-tuned penalty.

Usage

from huggingface_hub import hf_hub_download

# Download one checkpoint; hf_hub_download returns the local file path.
ckpt = hf_hub_download(repo_id="zilinhuang/DriveVLM-RL",
                       filename="drivevlm_rl_normal_seed1_20251029_115740.zip")
# Then, in the DriveVLM-RL repo (vlm-rl env), pass that path as <ckpt>:
#   python eval/eval.py --model <ckpt> --config drivevlm_rl --town Town02 --density regular
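
To repeat the evaluation across several towns, the command shown above can be assembled programmatically. The sketch below only builds and prints the command lines and does not execute them. The flag names are the ones in the usage comment; that eval/eval.py accepts the other town names from the cross-town table in the same way is an assumption, and `build_eval_cmd` is a hypothetical helper.

```python
import shlex


def build_eval_cmd(ckpt_path, town="Town02", density="regular"):
    """Return the argv list for one eval/eval.py run."""
    return [
        "python", "eval/eval.py",
        "--model", ckpt_path,
        "--config", "drivevlm_rl",
        "--town", town,
        "--density", density,
    ]


# Placeholder path; in practice use the value returned by hf_hub_download.
ckpt_path = "/path/to/drivevlm_rl_normal_seed1_20251029_115740.zip"

# Assumption: the script accepts these town names like it accepts Town02.
for town in ["Town01", "Town03", "Town04", "Town05"]:
    print(shlex.join(build_eval_cmd(ckpt_path, town=town)))
```

Building an argv list and joining it with `shlex.join` keeps paths with spaces correctly quoted when the printed line is pasted into a shell.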