arxiv:2601.10649

MINERVA-Cultural: A Benchmark for Cultural and Multilingual Long Video Reasoning

Published on Apr 7

Authors:

Abstract

A new multicultural and multilingual video reasoning benchmark is introduced, featuring human-generated annotations from diverse cultural videos across 18 global locales, revealing significant limitations of state-of-the-art video-language models in understanding visual cultural context.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce MINERVA-Cultural, a challenging benchmark for multicultural and multilingual video reasoning. MINERVA-Cultural comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, MINERVA-Cultural provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on MINERVA-Cultural requires a deeply situated understanding of visual cultural context. Furthermore, we leverage MINERVA-Cultural's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. MINERVA-Cultural will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file\#minerva-cultural

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2601.10649

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.10649 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.10649 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.10649 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.