End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
Abstract
End-to-end training of autoregressive image models with joint reconstruction and generation optimization achieves state-of-the-art results on ImageNet 256x256 generation.
Autoregressive image modeling relies on visual tokenizers to compress images into compact latent representations. We design an end-to-end training pipeline that jointly optimizes reconstruction and generation, enabling direct supervision from generation results to the tokenizer. This contrasts with prior two-stage approaches that train tokenizers and generative models separately. We further investigate leveraging vision foundation models to improve 1D tokenizers for autoregressive modeling. Our autoregressive generative model achieves strong empirical results, including a state-of-the-art FID score of 1.48 without guidance on ImageNet 256x256 generation.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Semantic One-Dimensional Tokenizer for Image Reconstruction and Generation (2026)
- CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization (2026)
- MacTok: Robust Continuous Tokenization for Image Generation (2026)
- Semantic-Aware Prefix Learning for Token-Efficient Image Generation (2026)
- RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing (2026)
- End-to-End Training for Unified Tokenization and Latent Denoising (2026)
- TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders (2026)