Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models
Abstract
BloomBench presents a cognitively grounded bilingual multimodal benchmark for Vision-Language Models, revealing significant cognitive asymmetries and cross-lingual performance gaps in current models.
Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.
Community
We're excited to introduce BloomBench 🌸, our bilingual (English–Arabic) multimodal benchmark for Vision–Language Models, accepted to ACL 2026 Findings!
Most VLM benchmarks give you a single accuracy number that hides where a model actually fails. BloomBench is built around Bloom's Taxonomy, organizing tasks by cognitive skill from basic recall up to creative synthesis, so you can pinpoint exactly which levels of reasoning depth your VLM struggles with, instead of relying on only one overall score.
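As a rough illustration of why a per-level breakdown is more informative than one overall score, here is a minimal Python sketch. It is not the BloomBench evaluation code: the record keys (`level`, `lang`, `correct`) and the toy data are assumptions made up for this example.

```python
from collections import defaultdict

def accuracy_by_level(records):
    """Group accuracy by (Bloom level, language).

    `records` is an iterable of dicts with hypothetical keys
    'level', 'lang', and 'correct' (not the dataset's real schema).
    """
    totals = defaultdict(lambda: [0, 0])  # (level, lang) -> [num_correct, num_items]
    for r in records:
        key = (r["level"], r["lang"])
        totals[key][0] += int(r["correct"])
        totals[key][1] += 1
    return {key: correct / count for key, (correct, count) in totals.items()}

# Toy records, invented purely for illustration.
records = [
    {"level": "Remember", "lang": "en", "correct": True},
    {"level": "Remember", "lang": "en", "correct": False},
    {"level": "Remember", "lang": "ar", "correct": False},
    {"level": "Understand", "lang": "en", "correct": True},
    {"level": "Understand", "lang": "ar", "correct": True},
]

scores = accuracy_by_level(records)
overall = sum(r["correct"] for r in records) / len(records)

print(f"overall accuracy: {overall:.2f}")
for (level, lang), acc in sorted(scores.items()):
    print(f"{level:<10} {lang}: {acc:.2f}")
```

In this toy data the overall accuracy of 0.60 hides that every miss falls in the Remember level, which is the kind of pattern a level-wise report makes visible.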
BloomBench covers:
- Full cognitive coverage: 7,747 bilingual image–question–answer items across 106 task types, spanning all six Bloom levels: Remember, Understand, Apply, Analyze, Evaluate, and Create.
- True bilingual evaluation: parallel English and Arabic items, so we can measure cross-lingual reasoning instead of assuming English results carry over.
- Two scoring methods: Regex-based Answer Extraction (RAE) and Likelihood-based Scoring (LBS). LBS exposes calibration gaps that RAE misses, showing how heavy instruction-tuning can optimize answer formatting at the cost of probabilistic calibration.
- High-quality data: a semi-automated pipeline validated through a hybrid of LLM-as-judge and human review, reaching a 98.45% quality rate.
- Key finding: current VLMs are strong at semantic understanding but weaker at factual recall, procedural application, and creative synthesis, with Arabic trailing English across the board.
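To make the difference between the two scoring methods concrete, here is a minimal Python sketch of how regex-based extraction and likelihood-based scoring can disagree on a multiple-choice item. Everything in it is an assumption for illustration: the regex, the four-option A-D format, the function names, and the log-likelihood values are hypothetical and are not taken from the BloomBench code.

```python
import math
import re

# Hypothetical, deliberately naive pattern: the first standalone
# capital letter A-D found in the generated text.
ANSWER_RE = re.compile(r"\b([A-D])\b")

def rae_predict(generated_text):
    """Regex-based extraction: read the option letter out of free text."""
    match = ANSWER_RE.search(generated_text)
    return match.group(1) if match else None

def lbs_predict(option_logprobs):
    """Likelihood-based scoring: pick the option with the highest log-likelihood."""
    return max(option_logprobs, key=option_logprobs.get)

def lbs_probabilities(option_logprobs):
    """Normalize option log-likelihoods into probabilities (softmax)."""
    top = max(option_logprobs.values())
    exps = {k: math.exp(v - top) for k, v in option_logprobs.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

# Invented example: the generated answer is cleanly formatted as "B",
# while the per-option log-likelihoods slightly favor "A".
generated = "The answer is B."
logprobs = {"A": -1.2, "B": -1.3, "C": -3.0, "D": -4.0}

rae_answer = rae_predict(generated)
lbs_answer = lbs_predict(logprobs)
probs = lbs_probabilities(logprobs)

print("RAE:", rae_answer)
print("LBS:", lbs_answer, {k: round(p, 3) for k, p in probs.items()})
```

The normalized probabilities also show how close the top two options are, which is information the extracted letter alone does not carry. A real extraction pattern would need to be stricter than this one, since a bare `A` at the start of a sentence would also match.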
If you work on VLM evaluation, give it a try and let us know what you find. We'd love to hear your feedback 🚀