MicroForge: A Novel Mobile-First Image Generation Architecture
Recurrent Latent Planning × SSM-Conv Hybrid Backbone × Deep Compression
MicroForge is a genuinely new image generation architecture designed from scratch for consumer devices (3-4 GB RAM), trainable on a single 16 GB GPU. It combines the best ideas from recent research into an efficient, compact, editing-ready system.
Key numbers:
- MicroForge-tiny: 28M params, ~56 MB fp16, ~0.13s/image on CPU
- MicroForge-small: 114M params, ~228 MB fp16
- MicroForge-base: 193M params, ~386 MB fp16
- Editing-ready: Same backbone handles generation, editing, inpainting, super-res
Table of Contents
- Architecture Overview
- Paper Shortlist & Critique
- Module-by-Module Design
- Mathematical Formulation
- Training Objective
- Memory & Compute Budget
- Training Curriculum
- Mobile Deployment Plan
- Failure Mode Analysis
- Ablation Plan
- Editing Roadmap
- Quick Start
1. Architecture Overview
```
MicroForge Pipeline

Text ──► [Text Encoder (CLIP/TinyCLIP)] ──► text_emb, pooled
                                  │
                                  ▼
Noise z_T ──► [Recurrent Latent Planner]
               │  K=32 plan tokens (49 KB state)
               │  READ:   cross-attn(plan, z_t)  → O(K·N)
               │  REASON: self-attn(plan)        → O(K²)
               │  Self-conditioning from the previous step
               ▼
z_t ──► [SSM-Conv Hybrid Backbone] ◄── planner tokens
         │  Per block (×6/12/18):
         │    1. AdaLN-Group(z_t, t_emb + text_pool)
         │    2. BiSSM(zigzag scan)               → O(N)
         │    3. CrossAttn(z_t, text_emb ∥ plan)  → O(N·M)
         │    4. FFN(expansion=3)                 → O(N·D)
         │  Every K blocks: SharedMQA(z_t)  ← single shared instance
         ▼
v_pred = backbone(z_t, t, text, plan)
z_{t-Δt} = z_t - Δt · v_pred        (Euler ODE step)

z_0 ──► [DC-VAE Decoder (32× upsample)] ──► Image [3, H, W]

Editing mode (same backbone):
  z_input = [z_target_noise ∥ z_source]   (width-concat)
  Task token: [Generate] / [Edit] / [Inpaint] / [SR]
  No extra parameters needed
```
What's Novel
Recurrent Latent Planner (RLP): Persistent latent tokens that carry "memory" across denoising steps. The planner reasons at a higher level before the backbone commits to pixel changes. Inspired by RIN (Jabri et al., 2022), but moved from RIN's pixel-space setting into MicroForge's latent flow pipeline: plan tokens READ from the noised latent, REASON internally via self-attention, then inject guidance into the backbone via cross-attention. Self-conditioning carries plan state across steps.
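A minimal PyTorch sketch of the READ/REASON update and the self-conditioning gate described above. The class name, head count, and the learned (rather than text-derived) plan initialization are illustrative assumptions, not the `microforge/planner.py` API:

```python
import torch
import torch.nn as nn

class PlannerSketch(nn.Module):
    def __init__(self, num_plan_tokens=32, dim=256):
        super().__init__()
        # Learned initial plan (the real module initializes p_init from text)
        self.plan_init = nn.Parameter(torch.randn(num_plan_tokens, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)    # plan <- latent
        self.reason = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)  # plan <- plan
        self.gate = nn.Parameter(torch.zeros(1))  # w in p_t = sigmoid(w)*p_{t+1} + (1-sigmoid(w))*p_init

    def forward(self, latent_tokens, prev_plan=None):
        # latent_tokens: [B, N, dim] flattened z_t tokens; prev_plan: [B, K, dim] or None
        B = latent_tokens.shape[0]
        p_init = self.plan_init.unsqueeze(0).expand(B, -1, -1)
        if prev_plan is None:                      # first denoising step
            plan = p_init
        else:                                      # self-conditioning across steps
            g = torch.sigmoid(self.gate)
            plan = g * prev_plan + (1 - g) * p_init
        # READ: plan tokens attend to the noised latent, O(K*N)
        plan = plan + self.read(plan, latent_tokens, latent_tokens)[0]
        # REASON: plan tokens attend to each other, O(K^2)
        plan = plan + self.reason(plan, plan, plan)[0]
        return plan  # injected into the backbone via cross-attention
```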
SSM-Conv Hybrid Backbone: Replaces O(N²) self-attention with bidirectional SSM scanning (O(N)) plus local DWConv. One globally-shared lightweight MQA attention block provides in-context learning capability. This hybrid achieves the global receptive field of attention with linear complexity.
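A simplified sketch of one hybrid block (grouped AdaLN → DWConv + bidirectional scan → cross-attention → FFN). The per-channel EMA recurrence below is only a stand-in for the real selective SSM scan, and all module names and shapes are assumptions for illustration:

```python
import torch
import torch.nn as nn

class BiScanStandIn(nn.Module):
    """O(N) bidirectional recurrence h_i = a*h_{i-1} + (1-a)*x_i over the token axis."""
    def __init__(self, dim):
        super().__init__()
        self.decay = nn.Parameter(torch.zeros(dim))                    # per-channel decay logit
        self.dwconv = nn.Conv1d(dim, dim, 3, padding=1, groups=dim)    # local depthwise mixing

    def scan(self, x):                                                 # x: [B, N, D]
        a = torch.sigmoid(self.decay)                                  # decay in (0, 1)
        h, out = torch.zeros_like(x[:, 0]), []
        for i in range(x.shape[1]):
            h = a * h + (1 - a) * x[:, i]
            out.append(h)
        return torch.stack(out, dim=1)

    def forward(self, x):
        x = self.dwconv(x.transpose(1, 2)).transpose(1, 2)             # DWConv over tokens
        return self.scan(x) + self.scan(x.flip(1)).flip(1)             # forward + backward scan

class HybridBlockSketch(nn.Module):
    def __init__(self, dim=256, ctx_dim=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.ada = nn.Linear(ctx_dim, 6 * dim)                         # grouped AdaLN: 2x (scale, shift, gate)
        self.ssm = BiScanStandIn(dim)
        self.cross = nn.MultiheadAttention(dim, 4, batch_first=True)   # text_emb ∥ plan as context
        self.ffn = nn.Sequential(nn.Linear(dim, 3 * dim), nn.GELU(), nn.Linear(3 * dim, dim))

    def forward(self, z, cond, ctx):                                   # z: [B,N,D], cond: [B,C], ctx: [B,M,D]
        s1, b1, g1, s2, b2, g2 = self.ada(cond).chunk(6, dim=-1)
        z = z + g1.unsqueeze(1) * self.ssm(self.norm1(z) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1))
        z = z + self.cross(self.norm2(z), ctx, ctx)[0]
        z = z + g2.unsqueeze(1) * self.ffn(self.norm3(z) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1))
        return z
```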
Deep Compression VAE with Residual Shortcuts: 32× spatial compression using space-to-channel rearrangement as non-parametric skip connections. 512px → 16×16×32 latent = only 256 spatial tokens (vs 4096 in SD-VAE).
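A sketch of the space-to-channel residual shortcut for a single 2× downsampling stage, assuming PyTorch's `pixel_unshuffle` plus group-averaging to match channel counts; the class and argument names are illustrative, not the `microforge/vae.py` API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownBlockSketch(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)  # learned 2x downsample
        assert (c_in * 4) % c_out == 0
        self.group = (c_in * 4) // c_out          # shortcut channels averaged into each output channel

    def forward(self, x):                          # x: [B, c_in, H, W], H and W even
        shortcut = F.pixel_unshuffle(x, 2)         # [B, 4*c_in, H/2, W/2], no parameters
        B, _, H, W = shortcut.shape
        shortcut = shortcut.view(B, -1, self.group, H, W).mean(2)   # -> [B, c_out, H/2, W/2]
        return self.conv(x) + shortcut             # learned path only has to model the residual
```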
Editing by Design: DreamLite-style spatial concatenation enables generation, editing, inpainting, and super-resolution with zero extra parameters. The same backbone processes all tasks.
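A toy sketch of the spatial-concatenation interface, assuming 16×16 latents and a small learned task-token table (all names here are illustrative):

```python
import torch
import torch.nn as nn

task_ids = {"generate": 0, "edit": 1, "inpaint": 2, "superres": 3}
task_embed = nn.Embedding(4, 256)                  # one learned token per task, fed to the backbone

z_noise  = torch.randn(1, 32, 16, 16)              # noised target latent z_t
z_source = torch.randn(1, 32, 16, 16)              # clean source-image latent from the VAE encoder

# Width-concat doubles the token count (16x32 grid) but adds zero parameters;
# for pure generation, z_source is simply omitted and the grid stays 16x16.
z_input = torch.cat([z_noise, z_source], dim=-1)                 # [1, 32, 16, 32]
task_token = task_embed(torch.tensor([task_ids["edit"]]))        # [1, 256], prepended to the text context
```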
2. Paper Shortlist & Critique
A. Efficient Image Generation
| Paper | Problem Solved | What to Borrow | Limitations |
|---|---|---|---|
| SANA-Sprint (2503.09641) | 1-step generation, 0.6B params | Linear DiT + DC-AE latent + sCM+LADD distillation | Text encoder dominates memory |
| SnapGen (2412.09619) | Mobile T2I, 0.38B, iPhone 15 | Remove SA from high-res, MQA, expanded separable conv | No public weights |
| SnapGen++ (2601.08303) | 360ms/step iPhone, 0.4B | ASSA, elastic supernetwork, tiny VAE | Proprietary |
| DreamLite (2603.28713) | Mobile gen+edit unified | Spatial concat, task-progressive training | No public weights |
B. Subquadratic Backbones
| Paper | Problem Solved | What to Borrow | Limitations |
|---|---|---|---|
| DiMSUM (2411.04168) | Best FID with Mamba, 3× faster convergence | Wavelet+Mamba, shared attention block | Complex implementation |
| ZigMa (2403.13802) | Spatial continuity for SSM | Zigzag-8 scan, heterogeneous layers | Only class-conditional |
| LiT (2501.12976) | Pure linear DiT | DWConv inside linear attn, weight inheritance | Small quality drop at low res |
C. Compact Latent Spaces
| Paper | Problem Solved | What to Borrow | Limitations |
|---|---|---|---|
| DC-AE (2410.10733) | 32-128× compression | Residual space-to-channel shortcuts | High-channel needs bigger backbone |
| TiTok (2406.07550) | 32-128 1D tokens | Break 2D grid, proxy-code VQ | Resolution-fixed |
D. Editing Patterns
| Paper | Problem Solved | What to Borrow | Limitations |
|---|---|---|---|
| DreamLite (2603.28713) | Mobile gen+edit | Spatial concat (+14 GenEval vs channel) | Editing data at scale |
| FLUX Kontext (2506.15742) | Best editing quality | 3D RoPE offset, multi-reference | 12B, not mobile |
| RIN (2212.11972) | Decoupled computation | Latent tokens + cross-attn, self-cond | Pixel-space only |
3. Module-by-Module Design
Module A: Deep Compression VAE (microforge/vae.py)
32× spatial compression with space-to-channel residual shortcuts (DC-AE technique).
| Config | Channels | Latent C | Params | FP16 |
|---|---|---|---|---|
| tiny | [32,64,128,256] | 16 | 16M | 32 MB |
| small | [64,128,256,512] | 32 | 77M | 154 MB |
| base | [128,256,512,512] | 32 | 110M | 220 MB |
Module B: SSM-Conv Hybrid Backbone (microforge/backbone.py)
Bidirectional SSM + local DWConv + one globally-shared MQA attention.
| Config | Depth | Dim | Params | FP16 |
|---|---|---|---|---|
| tiny | 6 | 256 | 8M | 16 MB |
| small | 12 | 384 | 29M | 58 MB |
| base | 18 | 512 | 71M | 142 MB |
Module C: Recurrent Latent Planner (microforge/planner.py)
32 persistent plan tokens, 49 KB state per plan. O(K² + K·N) per layer.
Module D: Text Encoder (pluggable)
- Mobile: TinyCLIP ~60M
- Quality: CLIP-L ~428M
- Best: Gemma-2-2B ~2B
4. Mathematical Formulation
Rectified flow interpolation: z_t = (1-t)·z_0 + t·ε,  ε ~ N(0, I)
Velocity target: v* = ε - z_0
Training loss: L = E[w(t) · ||v_θ(z_t, t, c) - v*||²], where w(t) = 1/(1 + |2t - 1|)
Sampling (Euler step from t toward 0): z_{t-Δt} = z_t - Δt · v_θ(z_t, t, c)
Planner self-conditioning: p_t = σ(w)·p_{t+1} + (1 - σ(w))·p_init(text), with σ(w) a learned gate
CFG: v̂ = v_∅ + s·(v_c - v_∅), where v_c / v_∅ are the conditional / unconditional predictions and s is the guidance scale
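A worked sketch of the formulas above: the flow-matching loss with the mid-trajectory weighting, and a CFG Euler sampler. `model` stands for the full v_θ(z_t, t, c) call; the helper names and the assumed model signature are illustrative:

```python
import torch

def flow_loss(model, z0, cond):
    """L = E[w(t) * ||v_theta(z_t, t, c) - (eps - z0)||^2], with w(t) = 1/(1+|2t-1|)."""
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device)                       # t ~ U(0, 1)
    eps = torch.randn_like(z0)
    t_ = t.view(b, 1, 1, 1)
    z_t = (1 - t_) * z0 + t_ * eps                             # rectified-flow interpolation
    v_target = eps - z0                                        # velocity target v*
    w = 1.0 / (1.0 + (2 * t - 1).abs())                        # emphasize mid-trajectory timesteps
    err = (model(z_t, t, cond) - v_target).pow(2).mean(dim=(1, 2, 3))
    return (w * err).mean()

@torch.no_grad()
def sample(model, shape, cond, uncond, steps=4, cfg_scale=4.0, device="cpu"):
    """Euler ODE integration from t=1 (noise) to t=0, with classifier-free guidance."""
    z = torch.randn(shape, device=device)
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t = ts[i] * torch.ones(shape[0], device=device)
        v_c, v_u = model(z, t, cond), model(z, t, uncond)
        v = v_u + cfg_scale * (v_c - v_u)                      # v_hat = v_uncond + s*(v_cond - v_uncond)
        z = z + (ts[i + 1] - ts[i]) * v                        # (ts[i+1]-ts[i]) is negative, i.e. z - dt*v
    return z
```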
5. Training Objective
- Stage 1 (VAE): L1 + λ_KL·KL + LPIPS + GAN
- Stage 2-3 (Flow): w(t)·||v_θ - v*||²
- Stage 4 (KD): L_flow + λ_t·α(t)·||v_student - v_teacher||²
- Stage 5 (Edit): ||v_θ([z_t ∥ z_src], t, c_edit) - v*||²
- Stage 6 (Distill): ||f_θ(z_t, t) - f_{θ⁻}(z_t', t')||²
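A hedged sketch of the Stage 6 consistency-style target, assuming `f` predicts the clean latent and `f_ema` is a stop-gradient EMA copy of the student (θ⁻); the exact objective used may differ:

```python
import torch

def consistency_step(f_student, f_ema, z0, cond, dt=0.05):
    """Match the student at (z_t, t) to the EMA teacher one exact ODE step closer to z_0."""
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device) * (1 - dt) + dt        # t in [dt, 1]
    eps = torch.randn_like(z0)
    tv = t.view(b, 1, 1, 1)
    z_t = (1 - tv) * z0 + tv * eps                              # point on the flow trajectory
    z_prev = z_t - dt * (eps - z0)                              # one exact ODE step toward z_0
    with torch.no_grad():
        target = f_ema(z_prev, t - dt, cond)                    # teacher θ⁻ (EMA of the student)
    return (f_student(z_t, t, cond) - target).pow(2).mean()

# EMA teacher update after each optimizer step (0.999 is a typical decay choice):
# for p_ema, p in zip(f_ema.parameters(), f_student.parameters()):
#     p_ema.mul_(0.999).add_(p, alpha=0.001)
```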
6. Memory & Compute Budget
Total System Memory (FP16, no text encoder)
- Tiny: ~76 MB inference @ 512px
- Small: ~308 MB inference @ 512px
- Base: ~530 MB inference @ 512px
With TinyCLIP (+120 MB), the tiny config stays well under 500 MB.
7. Training Curriculum (16 GB GPU)
| Stage | Freeze | Train | Data | Res | Steps | LR | Time (T4) |
|---|---|---|---|---|---|---|---|
| 1. VAE | — | VAE | ImageNet-50K | 128→256 | 50K | 1e-4 | 6h |
| 2. Low-Res | VAE | Backbone+Plan | Synthetic 100K | 128→256 | 100K | 1e-4 | 12h |
| 3. High-Res | VAE | Backbone+Plan | Same + high-res | 256→512 | 50K | 5e-5 | 8h |
| 4. Distill | VAE | Backbone+Plan | Teacher cached | 512 | 30K | 2e-5 | 6h |
| 5. Edit | VAE | All (low LR) | IP2P+MagicBrush | 256→512 | 20K | 1e-5 | 4h |
8. Mobile Deployment
- Step distill to 4 steps (consistency/LADD)
- Export ONNX with static shapes
- INT8 weight quantization
- Convert to CoreML/NNAPI/QNN
- Profile on-device
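A minimal sketch of the ONNX export and INT8 weight-quantization steps above, reusing the Quick Start constructor; the backbone's forward signature, input names, and shapes are assumptions and may not match the real module:

```python
import torch
from onnxruntime.quantization import quantize_dynamic, QuantType
from microforge.backbone import MicroForgeBackbone

backbone = MicroForgeBackbone(latent_channels=16, config='tiny').eval()

# Dummy inputs with static shapes (tiny config, 512px -> 16x16 latent); purely illustrative
dummy = (torch.randn(1, 16, 16, 16),      # z_t
         torch.tensor([0.5]),             # timestep
         torch.randn(1, 77, 768))         # text/plan context
torch.onnx.export(backbone, dummy, "microforge_tiny.onnx",
                  input_names=["z_t", "t", "context"],
                  output_names=["v_pred"],
                  opset_version=17)        # no dynamic_axes: keep shapes static for mobile runtimes

# INT8 weights, fp32 activations
quantize_dynamic("microforge_tiny.onnx", "microforge_tiny_int8.onnx",
                 weight_type=QuantType.QInt8)
```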
9. Failure Modes
| Failure | Fix |
|---|---|
| SSM scan artifacts | More scan directions + larger DWConv |
| Planner collapse | Diversity loss on plan tokens |
| VAE blur | Reduce λ_KL + adversarial loss |
| Training instability | Grad clip=2.0 + separate SSM LR |
| Editing forgetting | Spatial concat + progressive training |
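For the planner-collapse fix, one possible "diversity loss on plan tokens" is a penalty on pairwise cosine similarity between the K plan tokens, sketched below under that assumption:

```python
import torch
import torch.nn.functional as F

def plan_diversity_loss(plan):                     # plan: [B, K, D]
    p = F.normalize(plan, dim=-1)
    sim = torch.bmm(p, p.transpose(1, 2))          # [B, K, K] cosine similarities
    K = plan.shape[1]
    off_diag = sim - torch.eye(K, device=plan.device)
    return off_diag.pow(2).mean()                  # added to the flow loss with a small weight
```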
10. Ablation Plan
| ID | Change | Expected |
|---|---|---|
| A1 | No Planner | FID 2-5% worse |
| A2 | Full attention (no SSM) | Better @256, worse @1024, 2-4× slower |
| A3 | No shared MQA | FID 1-3% worse |
| A4 | No DWConv in SSM | FID 2-4% worse |
| A5 | No self-conditioning | More step jitter |
| A6 | Full vs grouped adaLN | +46% params, marginal gain |
| A7 | f16 vs f32 vs f64 VAE compression | f32 (32×) is the sweet spot |
| A8 | Spatial vs channel concat | Spatial preserves gen quality |
11. Editing Roadmap
- ✓ Phase 1: Architecture supports spatial concatenation
- Phase 2: Image editing (InstructPix2Pix data)
- Phase 3: Inpainting (masked spatial concat)
- Phase 4: Super-resolution
- Phase 5: Style/reference (add IP-Adapter, +22M params)
- Phase 6: Local editing (region-aware planner)
12. Quick Start
```python
import torch

from microforge.vae import MicroForgeVAE
from microforge.backbone import MicroForgeBackbone
from microforge.planner import RecurrentLatentPlanner
from microforge.pipeline import MicroForgePipeline, SimpleTextEncoder

# Assemble the tiny config: DC-VAE, SSM-Conv backbone, latent planner, and a toy text encoder
vae = MicroForgeVAE(config='tiny')
backbone = MicroForgeBackbone(latent_channels=16, config='tiny')
planner = RecurrentLatentPlanner(num_plan_tokens=16, dim=256, text_dim=768, latent_channels=16)
text_enc = SimpleTextEncoder(embed_dim=768, num_layers=2)
pipeline = MicroForgePipeline(vae, backbone, text_enc, planner)

# Dummy token ids stand in for a real tokenizer; 4-step sampling at 256x256
tokens = torch.randint(0, 8192, (1, 10))
images = pipeline.text2img(tokens, height=256, width=256, num_steps=4)
```
License
MIT License
Citation
```bibtex
@software{microforge2025,
  title = {MicroForge: Mobile-First Image Generation with Recurrent Latent Planning},
  year  = {2025},
  url   = {https://huggingface.co/asdf98/microforge}
}
```