Robusto-2: Benchmarking Humans & VLMs for Autonomous Driving in Lima & New York City
Abstract
This research examines how humans and the VLMs behind self-driving car systems perform on visual question answering tasks across geographic locations. It finds that human and VLM responses diverge from each other depending on question type, while answers do not differ strongly by location.
As self-driving cars continue to expand internationally and use multi-modal systems such as VLMs as a cognitive backbone for their action models, how well will these systems generalize to new settings, in particular out-of-distribution (OOD) edge-case scenarios in new geographies? In this paper, we study this open question through a full factorial analysis: human drivers from Lima, human drivers from New York City, and VLMs are shown dashcam footage collected in Lima and New York City and prompted with a variety of questions under a Visual Question Answering (VQA) paradigm. We pick these two cities because they are highly challenging driving locations where no self-driving car company currently operates, and we ask questions that span 4 categories: Factual, Ratings, Counterfactual and Reasoning. We find that Humans and VLMs diverge in their responses, though this is modulated by the type of question asked, and that Humans answer similarly regardless of where they are from (Lima/NYC). To our surprise, we did not find a strong difference in answers (from Humans or VLMs) that was modulated by geography, likely due to the highly out-of-distribution nature of the scenarios. Our dataset is available at: https://huggingface.co/datasets/Artificio/robusto-2
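To make the shape of the factorial design concrete, here is a minimal Python sketch that enumerates its cells. The factor levels come from the abstract; the variable and function names are illustrative, and crossing the design with question category as a third factor is an assumption rather than something the abstract states.

```python
from itertools import product

# Factor levels taken from the abstract; names here are illustrative only.
RESPONDENTS = ["Humans (Lima)", "Humans (NYC)", "VLMs"]
VIDEO_CITIES = ["Lima", "NYC"]
# Assumption: question category is treated as a third crossed factor.
QUESTION_CATEGORIES = ["Factual", "Ratings", "Counterfactual", "Reasoning"]


def factorial_cells():
    """Return every (respondent, video city, question category) combination."""
    return list(product(RESPONDENTS, VIDEO_CITIES, QUESTION_CATEGORIES))


cells = factorial_cells()
print(len(cells))  # 3 respondent groups * 2 cities * 4 categories = 24
```

Each cell corresponds to one comparison condition, for example VLMs answering Counterfactual questions about Lima footage.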
Community
Excited to share the Robusto-2 dataset! Here we assess the cross-cultural and cross-system differences between Humans of Lima, Humans of NYC and VLMs when shown driving videos of both Lima and NYC. This is an important study as AV models have now shifted towards VLAs + VLMs and we are interested in understanding the geographic generalization gap! Videos, Human and VLM data are open-sourced!