arxiv:2606.05531

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

Published on Jun 4

Submitted by Mahdi Abootorabi on Jun 8

Qatar Computing Research Institute


Authors: Mohammad Mahdi Abootorabi, Omid Ghahroodi

Abstract

BloomBench presents a cognitively grounded bilingual multimodal benchmark for Vision-Language Models, revealing significant cognitive asymmetries and cross-lingual performance gaps in current models.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.


Community

aboots

Paper author Paper submitter about 21 hours ago

We're excited to introduce BloomBench 🌸, our bilingual (English–Arabic) multimodal benchmark for Vision–Language Models, accepted to ACL 2026 Findings!

Most VLM benchmarks give you a single accuracy number that hides where a model actually fails. BloomBench is built around Bloom's Taxonomy, organizing tasks by cognitive skill from basic recall up to creative synthesis, so you can pinpoint which cognitive levels your VLM struggles with instead of relying on one overall score.
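To make the contrast between one overall score and a per-level profile concrete, here is a minimal sketch. The records are toy data and the field names (`level`, `lang`, `correct`) are assumptions for illustration; they are not the actual BloomBench schema or real model results.

```python
from collections import defaultdict

# Toy per-item outcomes; the field names are illustrative, not the dataset schema.
results = [
    {"level": "Remember", "lang": "en", "correct": False},
    {"level": "Remember", "lang": "ar", "correct": False},
    {"level": "Understand", "lang": "en", "correct": True},
    {"level": "Understand", "lang": "ar", "correct": True},
    {"level": "Create", "lang": "en", "correct": True},
    {"level": "Create", "lang": "ar", "correct": False},
]


def accuracy_by(items, key=None):
    """Return accuracy per group; with key=None, a single overall score."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item in items:
        group = item[key] if key else "overall"
        totals[group] += 1
        hits[group] += int(item["correct"])
    return {group: hits[group] / totals[group] for group in totals}


overall = accuracy_by(results)             # one number for the whole run
per_level = accuracy_by(results, "level")  # one number per Bloom level
per_lang = accuracy_by(results, "lang")    # one number per language
print(overall, per_level, per_lang)
```

In this toy run the overall score sits in the middle, while the per-level view shows that every Remember item failed and every Understand item passed, which is the kind of gap a single number hides.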

BloomBench covers:

Full cognitive coverage: 7,747 bilingual image–question–answer items across 106 task types, spanning all six Bloom levels: Remember, Understand, Apply, Analyze, Evaluate, and Create.
True bilingual evaluation: parallel English and Arabic items, so we can measure cross-lingual reasoning instead of assuming English results carry over.
Two scoring methods: Regex-based Answer Extraction (RAE) and Likelihood-based Scoring (LBS). LBS exposes calibration gaps that RAE misses, showing how heavy instruction-tuning can optimize answer formatting at the cost of probabilistic calibration.
High-quality data: a semi-automated pipeline validated by a hybrid of LLM-as-judge and human review, reaching a 98.45% quality rate.
Key finding: current VLMs are strong at semantic understanding but weaker at factual recall, procedural application, and creative synthesis, with Arabic trailing English across the board.
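The difference between the two scoring methods can be sketched roughly as follows. This is an illustration under stated assumptions, not the benchmark's implementation: the regex patterns, the function names `extract_answer` and `likelihood_choice`, the multiple-choice letter format, and the hard-coded log-probabilities are all hypothetical. In practice the per-option log-likelihoods would come from the model, and the paper's exact LBS definition (for example, any length normalization) may differ.

```python
import math
import re


def extract_answer(text, options="ABCD"):
    """RAE-style scoring: pull an option letter out of free-form generated text."""
    patterns = [
        rf"answer\s*(?:is|:)?\s*\(?([{options}])\)?\b",  # "The answer is (B)"
        rf"^\s*\(?([{options}])\)?\s*(?:[.):]|$)",       # "B." or "(B)" at the start
    ]
    for pattern in patterns:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return match.group(1).upper()
    return None  # no recognizable answer format


def likelihood_choice(option_logprobs):
    """LBS-style scoring: pick the highest-scoring option, with a softmax confidence."""
    best = max(option_logprobs, key=option_logprobs.get)
    top = option_logprobs[best]
    # Subtract the max before exponentiating for numerical stability.
    normalizer = sum(math.exp(lp - top) for lp in option_logprobs.values())
    return best, 1.0 / normalizer


# Hypothetical model output and per-option log-likelihoods for one item.
generated = "The answer is (B)."
logprobs = {"A": -1.2, "B": -1.3, "C": -4.0, "D": -5.0}

rae_answer = extract_answer(generated)
lbs_answer, lbs_confidence = likelihood_choice(logprobs)
print(rae_answer, lbs_answer, round(lbs_confidence, 3))
```

In this made-up item the generated text is cleanly formatted and the regex extracts B, while the likelihoods put A marginally ahead with confidence near one half. A disagreement of this kind is only visible when both scores are computed.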

If you work on VLM evaluation, give it a try and let us know what you find. We'd love to hear your feedback 🚀