Abstract
Attention mechanisms in Transformers can be reinterpreted as MLPs with dynamically predicted parameters, offering a linear-complexity alternative to explicit attention while maintaining sequence modeling performance.
Existing research largely attributes the global sequence modeling capability of Transformers to the explicit computation of attention weights, a process that inherently incurs quadratic computational complexity. In this work, we offer a novel perspective: we demonstrate that attention can be mathematically reframed as a Multi-Layer Perceptron (MLP) equipped with dynamically predicted parameters. Through this lens, we explain attention's global modeling power not as explicit token-wise aggregation, but as an implicit process where dynamically generated parameters act as a compressed representation of the global context. Inspired by this insight, we investigate a fundamental question: can we achieve Transformer-level global sequence modeling entirely through dynamic parameterization while maintaining linear complexity, effectively replacing explicit attention? To explore this, we design various dynamic parameter prediction strategies and integrate them into standard network layers. Extensive empirical studies on vision models demonstrate that dynamic parameterization can indeed serve as a highly effective, linear-complexity alternative to explicit attention, opening new pathways for efficient sequence modeling. Code is available at https://github.com/LeapLabTHU/WeightFormer.
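To make the core idea concrete, below is a minimal PyTorch sketch of one way dynamic parameterization could stand in for attention: token features are pooled into a global context vector (a linear-cost summary of the sequence), that vector predicts the weights of a small MLP, and the MLP is then applied to each token independently. The module name, mean pooling, and weight-predictor design are illustrative assumptions for exposition, not the paper's actual WeightFormer architecture.

```python
import torch
import torch.nn as nn

class DynamicParamMLP(nn.Module):
    """Sketch of a token mixer whose MLP weights are predicted from a
    pooled (global) summary of the sequence, giving O(N) cost per layer.
    Hypothetical design, not the paper's exact method."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.dim, self.hidden = dim, hidden
        # Predict the two weight matrices of a per-token MLP
        # from the global context vector.
        self.to_w1 = nn.Linear(dim, dim * hidden)
        self.to_w2 = nn.Linear(dim, hidden * dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        b, n, d = x.shape
        ctx = x.mean(dim=1)                            # global context, O(N)
        w1 = self.to_w1(ctx).view(b, d, self.hidden)   # dynamic first-layer weights
        w2 = self.to_w2(ctx).view(b, self.hidden, d)   # dynamic second-layer weights
        h = torch.relu(torch.bmm(self.norm(x), w1))    # per-token transform, O(N)
        return x + torch.bmm(h, w2)                    # residual connection

if __name__ == "__main__":
    # Usage: drop-in replacement for the attention block in a Transformer layer.
    layer = DynamicParamMLP(dim=64, hidden=128)
    tokens = torch.randn(2, 196, 64)                   # e.g. 14x14 ViT patches
    print(layer(tokens).shape)                         # torch.Size([2, 196, 64])
```

The point of the sketch is the complexity argument: no pairwise token-to-token weights are ever materialized, yet every token's update depends on the whole sequence through the predicted parameters, which act as the compressed global context described in the abstract.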
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Efficiency Follows Global-Local Decoupling (2026)
- Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers (2026)
- InfoMamba: An Attention-Free Hybrid Mamba-Transformer Model (2026)
- Reviving ConvNeXt for Efficient Convolutional Diffusion Models (2026)
- LaplacianFormer: Rethinking Linear Attention with Laplacian Kernel (2026)
- Adaptive Head Budgeting for Efficient Multi-Head Attention (2026)
- Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference (2026)