arxiv:2606.15378

Rethinking the Role of Efficient Attention in Hybrid Architectures

Published on Jun 13

Submitted by Chaojun XIAO on Jun 17

OpenBMB


Authors:

Ziqing Qiao

Abstract

Hybrid architectures combining full attention with efficient attention modules like sliding-window attention exhibit distinct scaling behaviors and optimization trajectories, with efficient attention primarily affecting the emergence speed of long-context capabilities rather than final performance.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. However, how these efficient modules shape model capabilities remains poorly understood. To address this gap, we conduct a systematic analysis across hybrid architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how fast long-context capability emerges, while different hybrids eventually converge to comparable long-context performance under sufficient training. Second, mechanistically, we show that long-range retrieval is mainly carried by full attention, whereas efficient attention shapes its optimization trajectory. This explains a counter-intuitive phenomenon we call Large-Window Laziness: larger SWA windows can delay the formation of retrieval heads in full-attention layers. Third, guided by this mechanism, we show that applying NoPE to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.
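The design in the abstract's third point can be sketched as a layer layout plus the two attention masks involved. This is a minimal illustration, not the authors' implementation: the names `LayerSpec`, `build_hybrid`, and `causal_mask`, the interleaving rule (`full_every`), and the window size are all assumptions chosen for the example, since the abstract does not state them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LayerSpec:
    kind: str                       # "full" or "swa"
    window: Optional[int]           # None means unrestricted causal attention
    use_positional_encoding: bool   # False corresponds to NoPE


def build_hybrid(num_layers: int, full_every: int, window: int) -> List[LayerSpec]:
    """Hypothetical layout: every `full_every`-th layer is full attention
    with NoPE; all other layers are SWA and keep positional encoding."""
    layers = []
    for i in range(num_layers):
        if (i + 1) % full_every == 0:
            layers.append(LayerSpec("full", None, use_positional_encoding=False))
        else:
            layers.append(LayerSpec("swa", window, use_positional_encoding=True))
    return layers


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.
    With a window, each query sees only its `window` most recent positions,
    itself included."""
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    for idx, spec in enumerate(build_hybrid(num_layers=8, full_every=4, window=128)):
        print(idx, spec)
```

In this sketch, only the full-attention layers can reach tokens beyond the window, which matches the abstract's claim that long-range retrieval is mainly carried by full attention. The ratio of SWA to full-attention layers here is arbitrary.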


Community

xcjthu

Paper submitter 2 days ago

noahml

1 day ago

Neat paper. The finding that efficient attention modules shape the optimization trajectory rather than just performance is a cool perspective. It’s wild that a larger sliding window can actually delay the formation of retrieval heads in full-attention layers.

Does this imply that the "Large-Window Laziness" phenomenon could be avoided by staggering the training of different layers, or is it an inherent trade-off for these hybrid architectures?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/4c48aa6d-94a7-4721-aea5-d84471ff70a9