IRIS: Iterative Recurrent Image Synthesis

A novel architecture for mobile-first, high-quality text-to-image generation that runs in under 3-4 GB of RAM

🚀 Train It Now!

Quick start: Download IRIS_Training_Notebook.ipynb, open it in Colab (or Kaggle), enable GPU, and run all cells. Trains end-to-end in ~2-3 hours on a free T4.

The notebook includes:

  • 📦 Auto-downloads architecture code from this repo
  • 🎨 Trains on Pokémon BLIP Captions dataset (833 image-caption pairs)
  • 🔬 Stage 1: Wavelet VAE training with frequency-aware loss
  • ⚡ Stage 2: Rectified Flow generator training with CLIP conditioning
  • 📊 Visualizations: reconstructions, generated samples, loss curves, GRFM internals
  • 💾 Checkpoint saving for continued training

🎯 Why IRIS?

Current image generation models face critical limitations:

| Problem | Current State | IRIS Solution |
|---|---|---|
| Too heavy for mobile | SD3: 2B params, FLUX: 12B params | 48-136M params, <600MB inference |
| Quadratic attention | O(N²) self-attention | O(N log N) Fourier + O(N) recurrence |
| Too many inference steps | 20-50 NFE typical | 1-4 steps with consistency distillation |
| Dated output quality | SD 1.5-era quality insufficient | Modern rectified flow + frequency-aware latent |
| Quantization degrades quality | INT4/INT8 drops aesthetics | Architecture-level efficiency, no quantization needed |
| No editing support | Separate heavy editing models | Iterative core naturally extends to editing |

πŸ—οΈ Architecture Overview

IRIS introduces a Prelude-Core-Coda architecture with shared-weight iterative refinement:

Text ──→ CLIP-L/14 ──→ text_tokens [77×768]

Image ──→ HaarDWT ──→ WaveletVAE ──→ z₀ [C×H/16×W/16]
                                     │
                                     ▼ (+ noise via Rectified Flow)
                              ┌─────────────┐
                              │   PRELUDE   │ ← 2 conv blocks (unique weights)
                              └──────┬──────┘
                                     │
                              ┌──────▼──────┐
                              │    CORE     │ ← GRFM + CrossAttn + FFN
                              │  (shared    │   Iterated 4-16× (same weights!)
                              │   weights)  │   Iteration-aware via adaLN
                              └──────┬──────┘
                                     │
                              ┌──────▼──────┐
                              │    CODA     │ ← 2 local-attention blocks
                              └──────┬──────┘
                                     │
                                     ▼ predicted velocity
                                     └──→ WaveletVAE Decode ──→ HaarIDWT ──→ Image
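
The control flow below is a minimal PyTorch sketch of the Prelude-Core-Coda loop: a unique-weight prelude, a single core block reused for every iteration (an iteration-index embedding stands in for the adaLN conditioning), and a unique-weight coda. Block contents and shapes are illustrative only and do not mirror iris_model.py.

import torch
import torch.nn as nn

class PreludeCoreCodaSketch(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # Prelude/Coda have their own weights; Core is one block reused every iteration.
        self.prelude = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.GELU(),
                                     nn.Conv2d(channels, channels, 3, padding=1))
        self.core = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.GELU(),
                                  nn.Conv2d(channels, channels, 3, padding=1))
        self.coda = nn.Conv2d(channels, channels, 3, padding=1)
        # Iteration-aware conditioning: one scale/shift pair per iteration index (adaLN stand-in).
        self.iter_embed = nn.Embedding(32, 2 * channels)

    def forward(self, z, num_iterations=8):
        h = self.prelude(z)
        for i in range(num_iterations):                     # same core weights reused each pass
            scale, shift = self.iter_embed(torch.tensor(i, device=z.device)).chunk(2)
            cond = h * (1 + scale.view(1, -1, 1, 1)) + shift.view(1, -1, 1, 1)
            h = h + self.core(cond)                         # residual refinement step
        return self.coda(h)                                 # predicted velocity

model = PreludeCoreCodaSketch()
v = model(torch.randn(1, 256, 16, 16), num_iterations=8)   # more iterations = more compute, same params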

🔬 Key Innovations

1. GRFM (Gated Recurrent Fourier Mixer): Novel Token Mixing

Three complementary pathways fused via learned adaptive gating:

  • Fourier Global Pathway (O(N log N)): RFFT2 → Block-diagonal MLP → SoftShrink → IRFFT2
  • Gated Linear Recurrence (O(N)): Bidirectional RG-LRU scan with variance-preserving updates
  • Manhattan Spatial Gate: Per-head learnable spatial decay D_{nm} = γ^Manhattan(n,m)
output = gate × x_fourier + (1 - gate) × x_recurrent + α × x_spatial
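
A hedged sketch of that fusion: the Fourier pathway is written out (RFFT2, a per-channel complex scale standing in for the block-diagonal MLP, SoftShrink, IRFFT2), while the recurrent and spatial pathways are left as placeholders so the gating stays readable. Names and shapes are illustrative, not the iris_model.py implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GRFMFusionSketch(nn.Module):
    def __init__(self, channels=64, shrink=0.01):
        super().__init__()
        # Per-channel complex spectral weight; the real block uses a block-diagonal MLP.
        self.freq_weight = nn.Parameter(torch.randn(channels, 2) * 0.02)
        self.gate_proj = nn.Conv2d(channels, channels, 1)    # learned adaptive gate
        self.alpha = nn.Parameter(torch.zeros(1))            # weight of the spatial pathway
        self.shrink = shrink

    def fourier_path(self, x):                               # O(N log N) global mixing
        B, C, H, W = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = spec * torch.view_as_complex(self.freq_weight).view(1, C, 1, 1)
        spec = torch.view_as_complex(F.softshrink(torch.view_as_real(spec), self.shrink))
        return torch.fft.irfft2(spec, s=(H, W), norm="ortho")

    def forward(self, x):
        x_fourier = self.fourier_path(x)
        x_recurrent = x   # placeholder: the real model runs a bidirectional RG-LRU scan here
        x_spatial = x     # placeholder: the real model applies a Manhattan-decay local window
        gate = torch.sigmoid(self.gate_proj(x))
        return gate * x_fourier + (1 - gate) * x_recurrent + self.alpha * x_spatial

y = GRFMFusionSketch(64)(torch.randn(2, 64, 16, 16))   # same shape in, same shape out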

2. Recurrent Depth Core (Huginn paradigm, novel for images)

  • Shared-weight core block iterated 4-16× (same model, adaptive quality!)
  • 4-layer block × 8 iterations = 32 effective layers from just 4 layers of params
  • 48M unique params → 270-524M effective capacity

3. Wavelet-Frequency Latent Space

  • Haar DWT preprocessing preserves frequency structure in latent space
  • 16× total spatial compression (lossless wavelet + learned VAE)
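
A minimal single-level 2D Haar DWT/IDWT pair (orthonormal form) illustrating the lossless split into LL/LH/HL/HH subbands that precedes the learned VAE; the repo's implementation may differ in subband layout and normalization.

import torch

def haar_dwt2(x):                                          # x: (B, C, H, W), H and W even
    s = 2 ** 0.5
    lo = (x[..., 0::2, :] + x[..., 1::2, :]) / s           # row lowpass
    hi = (x[..., 0::2, :] - x[..., 1::2, :]) / s           # row highpass
    ll = (lo[..., 0::2] + lo[..., 1::2]) / s
    lh = (lo[..., 0::2] - lo[..., 1::2]) / s
    hl = (hi[..., 0::2] + hi[..., 1::2]) / s
    hh = (hi[..., 0::2] - hi[..., 1::2]) / s
    return torch.cat([ll, lh, hl, hh], dim=1)              # (B, 4C, H/2, W/2)

def haar_idwt2(y):
    ll, lh, hl, hh = y.chunk(4, dim=1)
    s = 2 ** 0.5
    B, C, H2, W2 = ll.shape
    lo = ll.new_zeros(B, C, H2, 2 * W2)
    hi = ll.new_zeros(B, C, H2, 2 * W2)
    lo[..., 0::2], lo[..., 1::2] = (ll + lh) / s, (ll - lh) / s
    hi[..., 0::2], hi[..., 1::2] = (hl + hh) / s, (hl - hh) / s
    x = ll.new_zeros(B, C, 2 * H2, 2 * W2)
    x[..., 0::2, :], x[..., 1::2, :] = (lo + hi) / s, (lo - hi) / s
    return x

img = torch.randn(1, 3, 256, 256)
assert torch.allclose(haar_idwt2(haar_dwt2(img)), img, atol=1e-5)   # roundtrip is lossless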

4. Dual-Axis Recurrence (Novel)

  • Recurrence over noise schedule (diffusion) AND computational depth (core iterations)

📊 Model Variants

| Variant | Generator Params | Total Memory (fp16) | Mobile Fit |
|---|---|---|---|
| IRIS-Tiny | 19M | 545 MB | ✅ Ultra-mobile |
| IRIS-Small | 47M | 597 MB | ✅ Mobile |
| IRIS-Base | 135M | 760 MB | ✅ Consumer GPU |

🔧 Quick Start

from iris_model import create_iris_small
import torch

model = create_iris_small()
text_tokens = torch.randn(1, 77, 768)  # Replace with CLIP-L/14 embeddings

# Fast mobile inference (4 iterations, 4 steps)
images = model.generate(text_tokens, num_steps=4, num_iterations=4)

# Quality inference (8 iterations, 4 steps)
images = model.generate(text_tokens, num_steps=4, num_iterations=8)
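
The random text_tokens above are only a placeholder. Below is a sketch of producing real CLIP-L/14 token embeddings with Hugging Face transformers; the training notebook may wire CLIP up differently.

import torch
from transformers import CLIPTokenizer, CLIPTextModel

# CLIP-L/14 text encoder: 77 tokens, 768-dim hidden states
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

prompt = "a watercolor painting of a red fox"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    text_tokens = text_encoder(**tokens).last_hidden_state   # [1, 77, 768]

images = model.generate(text_tokens, num_steps=4, num_iterations=8)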

πŸ“ Mathematical Foundations

Rectified Flow Training

z_t = (1-t)·z₀ + t·ε,  v_target = ε - z₀
L = w(t) · ||v_θ(z_t, t, c) - v_target||²,  w(t) = t/(1-t)
t ~ Logit-Normal(0, 1)
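
A hedged sketch of one training step implementing exactly these equations; the model call signature (latent, t, text condition) and the clipping of w(t) near t = 1 are assumptions, not the repo's train_iris.py.

import torch

def rectified_flow_loss(model, z0, text_tokens):
    B = z0.shape[0]
    t = torch.sigmoid(torch.randn(B, device=z0.device))   # t ~ Logit-Normal(0, 1)
    t_ = t.view(B, 1, 1, 1)
    eps = torch.randn_like(z0)
    z_t = (1 - t_) * z0 + t_ * eps                         # noisy latent on the straight path
    v_target = eps - z0                                    # straight-line velocity target
    v_pred = model(z_t, t, text_tokens)                    # assumed signature: (latent, t, cond)
    w = (t_ / (1 - t_)).clamp(max=1e3)                     # w(t) = t/(1-t), clipped for stability
    return (w * (v_pred - v_target) ** 2).mean()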

GRFM Pathways

Fourier:    RFFT2 → BlockDiagMLP → SoftShrink(λ) → IRFFT2        [O(N log N)]
Recurrence: h_t = a_t ⊙ h_{t-1} + √(1-a_t²) ⊙ (i_t ⊙ x_t)        [O(N)]
Spatial:    D_{nm} = γ^(|row_n - row_m| + |col_n - col_m|)        [O(N×window)]
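
A reference (sequential) implementation of the variance-preserving recurrence line above; the actual model fuses this into a bidirectional scan over the flattened latent, and the gates a_t, i_t are predicted from the input rather than sampled as here.

import torch

def gated_linear_recurrence(x, a, i):
    """x, a, i: (B, T, D); a in (0, 1) is the decay gate, i in (0, 1) the input gate."""
    h = torch.zeros_like(x[:, 0])
    outputs = []
    for t in range(x.shape[1]):
        # h_t = a_t * h_{t-1} + sqrt(1 - a_t^2) * (i_t * x_t)
        h = a[:, t] * h + torch.sqrt(1 - a[:, t] ** 2) * (i[:, t] * x[:, t])
        outputs.append(h)
    return torch.stack(outputs, dim=1)

B, T, D = 2, 256, 64                      # e.g. a 16x16 latent flattened to 256 tokens
x = torch.randn(B, T, D)
a = torch.sigmoid(torch.randn(B, T, D))   # stand-in gates; normally predicted from x
i = torch.sigmoid(torch.randn(B, T, D))
y = gated_linear_recurrence(x, a, i)      # left-to-right pass; run on flipped x for bidirectional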

πŸ‹οΈ Training Recipe

| Stage | Data | Est. Cost |
|---|---|---|
| 1. VAE | ImageNet + CC3M | 20 GPU-hrs |
| 2. Class-Cond | ImageNet 256px | 100 GPU-hrs |
| 3. Text-Image | CC3M/CC12M | 200 GPU-hrs |
| 4. Aesthetic | JourneyDB | 50 GPU-hrs |
| 5. Distill | Self-distill | 30 GPU-hrs |

Total: 400 A100 GPU-hours ($1,600) | Stages 1-2 run on free Colab T4

📚 Research Foundations

| Concept | Source | How Used |
|---|---|---|
| Recurrent Depth | Huginn (arXiv:2502.05171) | Prelude-Core-Coda |
| Fourier Mixing | AFNO (arXiv:2111.13587) | GRFM pathway |
| Gated Recurrence | Griffin RG-LRU (arXiv:2402.19427) | GRFM pathway |
| Manhattan Decay | RMT (arXiv:2309.11523) | GRFM pathway |
| Wavelet Diffusion | WaveDiff (arXiv:2211.16152) | Latent space |
| Rectified Flow | RF (arXiv:2209.03003), SD3 | Training objective |
| Consistency Models | CM (arXiv:2303.01469) | Distillation |
| adaLN-Zero | DiT (arXiv:2212.09748) | Conditioning |
| Efficient Training | PixArt-α (arXiv:2310.00426) | Training recipe |
| Mobile Design | SnapGen (arXiv:2412.09619) | DWSConv, tiny VAE |

📄 Files

| File | Description |
|---|---|
| IRIS_Training_Notebook.ipynb | 🔥 Complete Colab/Kaggle training notebook |
| iris_model.py | Architecture implementation (~1200 lines) |
| train_iris.py | CLI training pipeline (all 5 stages) |
| test_iris.py | Validation test suite (9 tests, all passing) |
| ARCHITECTURE.md | Detailed math specification |

✅ Verified Properties

  • ✅ Haar DWT/IDWT roundtrip lossless (error < 1e-5)
  • ✅ WaveletVAE: 256×256 → 16×16 latent (48× compression)
  • ✅ GRFM forward/backward correct, all gradients flow
  • ✅ Variable iteration counts work (adaptive compute)
  • ✅ Full training step with rectified flow loss
  • ✅ End-to-end generation pipeline
  • ✅ IRIS-Tiny: 545 MB total inference (< 3 GB ✅)
  • ✅ IRIS-Small: 597 MB total inference (< 3 GB ✅)
  • ✅ 16× iteration gives 10.9× effective capacity

📜 License

Apache 2.0

@misc{iris2026,
  title={IRIS: Iterative Recurrent Image Synthesis for Mobile-First Image Generation},
  year={2026},
  note={Novel architecture: GRFM + Recurrent Depth + Wavelet Latent Space}
}