MicroForge: A Novel Mobile-First Image Generation Architecture
Recurrent Latent Planning × SSM-Conv Hybrid Backbone × Deep Compression
MicroForge is a genuinely new image generation architecture designed from scratch for consumer devices (3-4 GB RAM), trainable on a single 16 GB GPU. It combines the best ideas from recent research into an efficient, compact, editing-ready system.
Key numbers:
- MicroForge-tiny: 28M params, ~56 MB fp16, ~0.13s/image on CPU
- MicroForge-small: 114M params, ~228 MB fp16
- MicroForge-base: 193M params, ~386 MB fp16
- Editing-ready: Same backbone handles generation, editing, inpainting, super-res
Table of Contents
- Architecture Overview
- Paper Shortlist & Critique
- Module-by-Module Design
- Mathematical Formulation
- Training Objective
- Memory & Compute Budget
- Training Curriculum
- Mobile Deployment Plan
- Failure Mode Analysis
- Ablation Plan
- Editing Roadmap
- Quick Start
1. Architecture Overview
```
MicroForge Pipeline

Text ──► [Text Encoder (CLIP/TinyCLIP)] ──► text_emb, pooled
                                  │
                                  ▼
Noise z_T ──► [Recurrent Latent Planner]
               │  K=32 plan tokens (49 KB state)
               │  READ:   cross-attn(plan, z_t)  → O(K·N)
               │  REASON: self-attn(plan)        → O(K²)
               │  Self-conditioning from the previous step
               ▼
z_t ──► [SSM-Conv Hybrid Backbone] ◄── planner tokens
         │  Per block (×6/12/18):
         │    1. AdaLN-Group(z_t, t_emb + text_pool)
         │    2. BiSSM(zigzag scan)               → O(N)
         │    3. CrossAttn(z_t, text_emb ∥ plan)  → O(N·M)
         │    4. FFN(expansion=3)                 → O(N·D)
         │  Every K blocks: SharedMQA(z_t)  ← single shared instance
         ▼
v_pred = backbone(z_t, t, text, plan)
z_{t-Δt} = z_t - Δt · v_pred        (Euler ODE step)

z_0 ──► [DC-VAE Decoder (32× upsample)] ──► Image [3, H, W]

Editing mode (same backbone):
  z_input = [z_target_noise ∥ z_source]   (width-concat)
  Task token: [Generate] / [Edit] / [Inpaint] / [SR]
  No extra parameters needed
```
What's Novel
Recurrent Latent Planner (RLP): Persistent latent tokens that carry "memory" across denoising steps. The planner reasons at a higher level before the backbone commits to pixel changes. Inspired by RIN (Jabri et al., 2022), but moved from RIN's pixel-space setting into MicroForge's latent flow pipeline: plan tokens READ from the noised latent, REASON internally via self-attention, then inject guidance into the backbone via cross-attention. Self-conditioning carries plan state across steps.
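A minimal PyTorch sketch of the READ/REASON update and the self-conditioning gate described above. The class name, head count, and the learned (rather than text-derived) plan initialization are illustrative assumptions, not the `microforge/planner.py` API:

```python
import torch
import torch.nn as nn

class PlannerSketch(nn.Module):
    def __init__(self, num_plan_tokens=32, dim=256):
        super().__init__()
        # Learned initial plan (the real module initializes p_init from text)
        self.plan_init = nn.Parameter(torch.randn(num_plan_tokens, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)    # plan <- latent
        self.reason = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)  # plan <- plan
        self.gate = nn.Parameter(torch.zeros(1))  # w in p_t = sigmoid(w)*p_{t+1} + (1-sigmoid(w))*p_init

    def forward(self, latent_tokens, prev_plan=None):
        # latent_tokens: [B, N, dim] flattened z_t tokens; prev_plan: [B, K, dim] or None
        B = latent_tokens.shape[0]
        p_init = self.plan_init.unsqueeze(0).expand(B, -1, -1)
        if prev_plan is None:                      # first denoising step
            plan = p_init
        else:                                      # self-conditioning across steps
            g = torch.sigmoid(self.gate)
            plan = g * prev_plan + (1 - g) * p_init
        # READ: plan tokens attend to the noised latent, O(K*N)
        plan = plan + self.read(plan, latent_tokens, latent_tokens)[0]
        # REASON: plan tokens attend to each other, O(K^2)
        plan = plan + self.reason(plan, plan, plan)[0]
        return plan  # injected into the backbone via cross-attention
```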
SSM-Conv Hybrid Backbone: Replaces O(N²) self-attention with bidirectional SSM scanning (O(N)) plus local DWConv. One globally-shared lightweight MQA attention block provides in-context learning capability. This hybrid achieves the global receptive field of attention with linear complexity.
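A simplified sketch of one hybrid block (grouped AdaLN → DWConv + bidirectional scan → cross-attention → FFN). The per-channel EMA recurrence below is only a stand-in for the real selective SSM scan, and all module names and shapes are assumptions for illustration:

```python
import torch
import torch.nn as nn

class BiScanStandIn(nn.Module):
    """O(N) bidirectional recurrence h_i = a*h_{i-1} + (1-a)*x_i over the token axis."""
    def __init__(self, dim):
        super().__init__()
        self.decay = nn.Parameter(torch.zeros(dim))                    # per-channel decay logit
        self.dwconv = nn.Conv1d(dim, dim, 3, padding=1, groups=dim)    # local depthwise mixing

    def scan(self, x):                                                 # x: [B, N, D]
        a = torch.sigmoid(self.decay)                                  # decay in (0, 1)
        h, out = torch.zeros_like(x[:, 0]), []
        for i in range(x.shape[1]):
            h = a * h + (1 - a) * x[:, i]
            out.append(h)
        return torch.stack(out, dim=1)

    def forward(self, x):
        x = self.dwconv(x.transpose(1, 2)).transpose(1, 2)             # DWConv over tokens
        return self.scan(x) + self.scan(x.flip(1)).flip(1)             # forward + backward scan

class HybridBlockSketch(nn.Module):
    def __init__(self, dim=256, ctx_dim=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.ada = nn.Linear(ctx_dim, 6 * dim)                         # grouped AdaLN: 2x (scale, shift, gate)
        self.ssm = BiScanStandIn(dim)
        self.cross = nn.MultiheadAttention(dim, 4, batch_first=True)   # text_emb ∥ plan as context
        self.ffn = nn.Sequential(nn.Linear(dim, 3 * dim), nn.GELU(), nn.Linear(3 * dim, dim))

    def forward(self, z, cond, ctx):                                   # z: [B,N,D], cond: [B,C], ctx: [B,M,D]
        s1, b1, g1, s2, b2, g2 = self.ada(cond).chunk(6, dim=-1)
        z = z + g1.unsqueeze(1) * self.ssm(self.norm1(z) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1))
        z = z + self.cross(self.norm2(z), ctx, ctx)[0]
        z = z + g2.unsqueeze(1) * self.ffn(self.norm3(z) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1))
        return z
```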
Deep Compression VAE with Residual Shortcuts: 32× spatial compression using space-to-channel rearrangement as non-parametric skip connections. 512px → 16×16×32 latent = only 256 spatial tokens (vs 4096 in SD-VAE).
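A sketch of the space-to-channel residual shortcut for a single 2× downsampling stage, assuming PyTorch's `pixel_unshuffle` plus group-averaging to match channel counts; the class and argument names are illustrative, not the `microforge/vae.py` API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownBlockSketch(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)  # learned 2x downsample
        assert (c_in * 4) % c_out == 0
        self.group = (c_in * 4) // c_out          # shortcut channels averaged into each output channel

    def forward(self, x):                          # x: [B, c_in, H, W], H and W even
        shortcut = F.pixel_unshuffle(x, 2)         # [B, 4*c_in, H/2, W/2], no parameters
        B, _, H, W = shortcut.shape
        shortcut = shortcut.view(B, -1, self.group, H, W).mean(2)   # -> [B, c_out, H/2, W/2]
        return self.conv(x) + shortcut             # learned path only has to model the residual
```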
Editing by Design: DreamLite-style spatial concatenation enables generation, editing, inpainting, and super-resolution with zero extra parameters. The same backbone processes all tasks.
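A toy sketch of the spatial-concatenation interface, assuming 16×16 latents and a small learned task-token table (all names here are illustrative):

```python
import torch
import torch.nn as nn

task_ids = {"generate": 0, "edit": 1, "inpaint": 2, "superres": 3}
task_embed = nn.Embedding(4, 256)                  # one learned token per task, fed to the backbone

z_noise  = torch.randn(1, 32, 16, 16)              # noised target latent z_t
z_source = torch.randn(1, 32, 16, 16)              # clean source-image latent from the VAE encoder

# Width-concat doubles the token count (16x32 grid) but adds zero parameters;
# for pure generation, z_source is simply omitted and the grid stays 16x16.
z_input = torch.cat([z_noise, z_source], dim=-1)                 # [1, 32, 16, 32]
task_token = task_embed(torch.tensor([task_ids["edit"]]))        # [1, 256], prepended to the text context
```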
2. Paper Shortlist & Critique
A. Efficient Image Generation
| Paper | Problem Solved | What to Borrow | Limitations |
|---|---|---|---|
| SANA-Sprint (2503.09641) | 1-step generation, 0.6B params | Linear DiT + DC-AE latent + sCM+LADD distillation | Text encoder dominates memory |
| SnapGen (2412.09619) | Mobile T2I, 0.38B, iPhone 15 | Remove SA from high-res, MQA, expanded separable conv | No public weights |
| SnapGen++ (2601.08303) | 360ms/step iPhone, 0.4B | ASSA, elastic supernetwork, tiny VAE | Proprietary |
| DreamLite (2603.28713) | Mobile gen+edit unified | Spatial concat, task-progressive training | No public weights |
B. Subquadratic Backbones
| Paper | Problem Solved | What to Borrow | Limitations |
|---|---|---|---|
| DiMSUM (2411.04168) | Best FID with Mamba, 3× faster convergence | Wavelet+Mamba, shared attention block | Complex implementation |
| ZigMa (2403.13802) | Spatial continuity for SSM | Zigzag-8 scan, heterogeneous layers | Only class-conditional |
| LiT (2501.12976) | Pure linear DiT | DWConv inside linear attn, weight inheritance | Small quality drop at low res |
C. Compact Latent Spaces
| Paper | Problem Solved | What to Borrow | Limitations |
|---|---|---|---|
| DC-AE (2410.10733) | 32-128× compression | Residual space-to-channel shortcuts | High-channel needs bigger backbone |
| TiTok (2406.07550) | 32-128 1D tokens | Break 2D grid, proxy-code VQ | Resolution-fixed |
D. Editing Patterns
| Paper | Problem Solved | What to Borrow | Limitations |
|---|---|---|---|
| DreamLite (2603.28713) | Mobile gen+edit | Spatial concat (+14 GenEval vs channel) | Editing data at scale |
| FLUX Kontext (2506.15742) | Best editing quality | 3D RoPE offset, multi-reference | 12B, not mobile |
| RIN (2212.11972) | Decoupled computation | Latent tokens + cross-attn, self-cond | Pixel-space only |
3. Module-by-Module Design
Module A: Deep Compression VAE (microforge/vae.py)
32× spatial compression with space-to-channel residual shortcuts (DC-AE technique).
| Config | Channels | Latent C | Params | FP16 |
|---|---|---|---|---|
| tiny | [32,64,128,256] | 16 | 16M | 32 MB |
| small | [64,128,256,512] | 32 | 77M | 154 MB |
| base | [128,256,512,512] | 32 | 110M | 220 MB |
Module B: SSM-Conv Hybrid Backbone (microforge/backbone.py)
Bidirectional SSM + local DWConv + one globally-shared MQA attention.
| Config | Depth | Dim | Params | FP16 |
|---|---|---|---|---|
| tiny | 6 | 256 | 8M | 16 MB |
| small | 12 | 384 | 29M | 58 MB |
| base | 18 | 512 | 71M | 142 MB |
Module C: Recurrent Latent Planner (microforge/planner.py)
32 persistent plan tokens, 49 KB state per plan. O(K² + K·N) per layer.
Module D: Text Encoder (pluggable)
- Mobile: TinyCLIP ~60M
- Quality: CLIP-L ~428M
- Best: Gemma-2-2B ~2B
4. Mathematical Formulation
Rectified flow interpolation: z_t = (1-t)·z_0 + t·ε,  ε ~ N(0, I)
Velocity target: v* = ε - z_0
Training loss: L = E[w(t) · ||v_θ(z_t, t, c) - v*||²], where w(t) = 1/(1 + |2t - 1|)
Sampling (Euler step from t toward 0): z_{t-Δt} = z_t - Δt · v_θ(z_t, t, c)
Planner self-conditioning: p_t = σ(w)·p_{t+1} + (1 - σ(w))·p_init(text), with σ(w) a learned gate
CFG: v̂ = v_∅ + s·(v_c - v_∅), where v_c / v_∅ are the conditional / unconditional predictions and s is the guidance scale
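A worked sketch of the formulas above: the flow-matching loss with the mid-trajectory weighting, and a CFG Euler sampler. `model` stands for the full v_θ(z_t, t, c) call; the helper names and the assumed model signature are illustrative:

```python
import torch

def flow_loss(model, z0, cond):
    """L = E[w(t) * ||v_theta(z_t, t, c) - (eps - z0)||^2], with w(t) = 1/(1+|2t-1|)."""
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device)                       # t ~ U(0, 1)
    eps = torch.randn_like(z0)
    t_ = t.view(b, 1, 1, 1)
    z_t = (1 - t_) * z0 + t_ * eps                             # rectified-flow interpolation
    v_target = eps - z0                                        # velocity target v*
    w = 1.0 / (1.0 + (2 * t - 1).abs())                        # emphasize mid-trajectory timesteps
    err = (model(z_t, t, cond) - v_target).pow(2).mean(dim=(1, 2, 3))
    return (w * err).mean()

@torch.no_grad()
def sample(model, shape, cond, uncond, steps=4, cfg_scale=4.0, device="cpu"):
    """Euler ODE integration from t=1 (noise) to t=0, with classifier-free guidance."""
    z = torch.randn(shape, device=device)
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t = ts[i] * torch.ones(shape[0], device=device)
        v_c, v_u = model(z, t, cond), model(z, t, uncond)
        v = v_u + cfg_scale * (v_c - v_u)                      # v_hat = v_uncond + s*(v_cond - v_uncond)
        z = z + (ts[i + 1] - ts[i]) * v                        # (ts[i+1]-ts[i]) is negative, i.e. z - dt*v
    return z
```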
5. Training Objective
- Stage 1 (VAE): L1 + λ_KL·KL + LPIPS + GAN
- Stage 2-3 (Flow): w(t)·||v_θ - v*||²
- Stage 4 (KD): L_flow + λ_t·α(t)·||v_student - v_teacher||²
- Stage 5 (Edit): ||v_θ([z_t ∥ z_src], t, c_edit) - v*||²
- Stage 6 (Distill): ||f_θ(z_t, t) - f_{θ⁻}(z_t', t')||²
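A hedged sketch of the Stage 6 consistency-style target, assuming `f` predicts the clean latent and `f_ema` is a stop-gradient EMA copy of the student (θ⁻); the exact objective used may differ:

```python
import torch

def consistency_step(f_student, f_ema, z0, cond, dt=0.05):
    """Match the student at (z_t, t) to the EMA teacher one exact ODE step closer to z_0."""
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device) * (1 - dt) + dt        # t in [dt, 1]
    eps = torch.randn_like(z0)
    tv = t.view(b, 1, 1, 1)
    z_t = (1 - tv) * z0 + tv * eps                              # point on the flow trajectory
    z_prev = z_t - dt * (eps - z0)                              # one exact ODE step toward z_0
    with torch.no_grad():
        target = f_ema(z_prev, t - dt, cond)                    # teacher θ⁻ (EMA of the student)
    return (f_student(z_t, t, cond) - target).pow(2).mean()

# EMA teacher update after each optimizer step (0.999 is a typical decay choice):
# for p_ema, p in zip(f_ema.parameters(), f_student.parameters()):
#     p_ema.mul_(0.999).add_(p, alpha=0.001)
```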
6. Memory & Compute Budget
Total System Memory (FP16, no text encoder)
- Tiny: ~76 MB inference @ 512px
- Small: ~308 MB inference @ 512px
- Base: ~530 MB inference @ 512px
With TinyCLIP (+120 MB), the tiny config stays well under 500 MB.
7. Training Curriculum (16 GB GPU)
| Stage | Freeze | Train | Data | Res | Steps | LR | Time (T4) |
|---|---|---|---|---|---|---|---|
| 1. VAE | — | VAE | ImageNet-50K | 128→256 | 50K | 1e-4 | 6h |
| 2. Low-Res | VAE | Backbone+Plan | Synthetic 100K | 128→256 | 100K | 1e-4 | 12h |
| 3. High-Res | VAE | Backbone+Plan | Same + high-res | 256→512 | 50K | 5e-5 | 8h |
| 4. Distill | VAE | Backbone+Plan | Teacher cached | 512 | 30K | 2e-5 | 6h |
| 5. Edit | VAE | All (low LR) | IP2P+MagicBrush | 256→512 | 20K | 1e-5 | 4h |
8. Mobile Deployment
- Step distill to 4 steps (consistency/LADD)
- Export ONNX with static shapes
- INT8 weight quantization
- Convert to CoreML/NNAPI/QNN
- Profile on-device
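A minimal sketch of the ONNX export and INT8 weight-quantization steps above, reusing the Quick Start constructor; the backbone's forward signature, input names, and shapes are assumptions and may not match the real module:

```python
import torch
from onnxruntime.quantization import quantize_dynamic, QuantType
from microforge.backbone import MicroForgeBackbone

backbone = MicroForgeBackbone(latent_channels=16, config='tiny').eval()

# Dummy inputs with static shapes (tiny config, 512px -> 16x16 latent); purely illustrative
dummy = (torch.randn(1, 16, 16, 16),      # z_t
         torch.tensor([0.5]),             # timestep
         torch.randn(1, 77, 768))         # text/plan context
torch.onnx.export(backbone, dummy, "microforge_tiny.onnx",
                  input_names=["z_t", "t", "context"],
                  output_names=["v_pred"],
                  opset_version=17)        # no dynamic_axes: keep shapes static for mobile runtimes

# INT8 weights, fp32 activations
quantize_dynamic("microforge_tiny.onnx", "microforge_tiny_int8.onnx",
                 weight_type=QuantType.QInt8)
```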
9. Failure Modes
| Failure | Fix |
|---|---|
| SSM scan artifacts | More scan directions + larger DWConv |
| Planner collapse | Diversity loss on plan tokens |
| VAE blur | Reduce λ_KL + adversarial loss |
| Training instability | Grad clip=2.0 + separate SSM LR |
| Editing forgetting | Spatial concat + progressive training |
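For the planner-collapse fix, one possible "diversity loss on plan tokens" is a penalty on pairwise cosine similarity between the K plan tokens, sketched below under that assumption:

```python
import torch
import torch.nn.functional as F

def plan_diversity_loss(plan):                     # plan: [B, K, D]
    p = F.normalize(plan, dim=-1)
    sim = torch.bmm(p, p.transpose(1, 2))          # [B, K, K] cosine similarities
    K = plan.shape[1]
    off_diag = sim - torch.eye(K, device=plan.device)
    return off_diag.pow(2).mean()                  # added to the flow loss with a small weight
```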
10. Ablation Plan
| ID | Change | Expected |
|---|---|---|
| A1 | No Planner | FID 2-5% worse |
| A2 | Full attention (no SSM) | Better @256, worse @1024, 2-4× slower |
| A3 | No shared MQA | FID 1-3% worse |
| A4 | No DWConv in SSM | FID 2-4% worse |
| A5 | No self-conditioning | More step jitter |
| A6 | Full vs grouped adaLN | +46% params, marginal gain |
| A7 | f16 vs f32 vs f64 VAE compression | f32 (32×) is the sweet spot |
| A8 | Spatial vs channel concat | Spatial preserves gen quality |
11. Editing Roadmap
- ✓ Phase 1: Architecture supports spatial concatenation
- Phase 2: Image editing (InstructPix2Pix data)
- Phase 3: Inpainting (masked spatial concat)
- Phase 4: Super-resolution
- Phase 5: Style/reference (add IP-Adapter, +22M params)
- Phase 6: Local editing (region-aware planner)
12. Quick Start
```python
import torch

from microforge.vae import MicroForgeVAE
from microforge.backbone import MicroForgeBackbone
from microforge.planner import RecurrentLatentPlanner
from microforge.pipeline import MicroForgePipeline, SimpleTextEncoder

# Assemble the tiny config: DC-VAE, SSM-Conv backbone, latent planner, and a toy text encoder
vae = MicroForgeVAE(config='tiny')
backbone = MicroForgeBackbone(latent_channels=16, config='tiny')
planner = RecurrentLatentPlanner(num_plan_tokens=16, dim=256, text_dim=768, latent_channels=16)
text_enc = SimpleTextEncoder(embed_dim=768, num_layers=2)
pipeline = MicroForgePipeline(vae, backbone, text_enc, planner)

# Dummy token ids stand in for a real tokenizer; 4-step sampling at 256x256
tokens = torch.randint(0, 8192, (1, 10))
images = pipeline.text2img(tokens, height=256, width=256, num_steps=4)
```
License
MIT License
Citation
```bibtex
@software{microforge2025,
  title = {MicroForge: Mobile-First Image Generation with Recurrent Latent Planning},
  year  = {2025},
  url   = {https://huggingface.co/asdf98/microforge}
}
```