Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
Abstract
Research reveals that native omni-modal large language models exhibit visual preference over text, with modality preference emerging progressively in mid-to-late layers and enabling diagnosis of cross-modal hallucinations.
Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify modality preference of OLLMs using a newly-curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the ``text-dominance'' of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid-to-late layers. Building upon these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference
Community
- Proposing a modality preference evaluation framework for OLLMs: Constructing a tri-modal semantic conflict dataset with quantitative metrics to systematically measure model modality preferences.
- Revealing the modality preference landscape of OLLMs: Unlike the ``text-dominance'' of traditional VLMs, most OLLMs exhibit a pronounced visual preference.
- Investigating the internal evolution patterns of modality preference: Employing layer-wise linear probing to reveal that modality preference signals are absent in shallow layers and gradually emerge in mid-to-late layers.
- Leveraging linear probes for hallucination detection: Discovering that hallucination generation is accompanied by abnormally elevated preference probability toward the interfering modality, enabling effective hallucination detection via linear probes.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos (2026)
- OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs (2026)
- Omni-Modal Dissonance Benchmark: Systematically Breaking Modality Consensus to Probe Robustness and Calibrated Abstention (2026)
- Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks (2026)
- Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs (2026)
- A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning (2026)
- Visual Attention Drifts,but Anchors Hold:Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2604.16902 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper