arxiv:2606.20980

Robusto-2: Benchmarking Humans & VLMs for Autonomous Driving in Lima & New York City

Published on Jun 18

Submitted by Arturo Deza on Jun 23

Artificio

Abstract

This study compares how human drivers and VLMs answer visual question answering prompts about dashcam footage from Lima and New York City. Human and VLM responses diverge depending on question type, while geography has little measurable effect on the answers of either group.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

As Self-Driving Cars continue to expand internationally and use multi-modal systems such as VLMs as a cognitive backbone for their Action models, how well will these systems generalize to new settings, in particular out-of-distribution (OOD) edge-case scenarios in new geographies? In this paper, we study this open question through a full factorial analysis: we show dashcam footage collected from Lima and New York City to human drivers from Lima, human drivers from New York City, and VLMs, and prompt them with a variety of questions under a Visual Question Answering (VQA) paradigm. We pick these two cities because they are highly challenging driving locations where no Self-Driving Car company currently operates, and we ask questions that span 4 categories: Factual, Ratings, Counterfactual and Reasoning. We find that Humans and VLMs diverge in their responses, though this is modulated by the type of question asked, and that Humans answer similarly regardless of where they are from (Lima/NYC). To our surprise, we did not find a strong difference in answers (from Humans or VLMs) that was modulated by geography, likely because the scenarios are highly out-of-distribution. Our dataset is available at: https://huggingface.co/datasets/Artificio/robusto-2
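The factorial design described in the abstract can be sketched as a simple enumeration of conditions. This is only an illustration: the label strings and the `factorial_cells` helper are hypothetical, and treating question category as a third crossed factor is an assumption here (the abstract names respondent group and video city as the factors and lists the four categories separately).

```python
from itertools import product

# Factor levels taken from the abstract; the label strings are illustrative.
RESPONDENTS = ["Humans (Lima)", "Humans (NYC)", "VLMs"]
VIDEO_CITIES = ["Lima", "NYC"]
# Assumption: question category is crossed with the other two factors.
QUESTION_CATEGORIES = ["Factual", "Ratings", "Counterfactual", "Reasoning"]


def factorial_cells():
    """Return every (respondent, video city, question category) combination."""
    return list(product(RESPONDENTS, VIDEO_CITIES, QUESTION_CATEGORIES))


cells = factorial_cells()
print(len(cells))  # 3 respondent groups * 2 cities * 4 categories = 24
for respondent, city, category in cells[:4]:
    print(f"{respondent} | footage from {city} | {category}")
```

Under this layout, a geography effect would show up as a difference between cells that share a respondent group and question category but differ in video city.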

Community

ArturoDeza

Paper submitter about 4 hours ago

Excited to share the Robusto-2 dataset! Here we assess the cross-cultural and cross-system differences between Humans of Lima, Humans of NYC and VLMs when shown driving videos of both Lima and NYC. This is an important study as AV models have now shifted towards VLAs + VLMs and we are interested in understanding the geographic generalization gap! Videos, Human and VLM data are open-sourced!
