🔨 MicroForge: A Novel Mobile-First Image Generation Architecture

Recurrent Latent Planning × SSM-Conv Hybrid Backbone × Deep Compression

MicroForge is a genuinely new image generation architecture designed from scratch for consumer devices (3-4 GB RAM), trainable on a single 16 GB GPU. It combines the best ideas from recent research into an efficient, compact, editing-ready system.

Key numbers:

  • MicroForge-tiny: 28M params, ~56 MB fp16, ~0.13s/image on CPU
  • MicroForge-small: 114M params, ~228 MB fp16
  • MicroForge-base: 193M params, ~386 MB fp16
  • Editing-ready: Same backbone handles generation, editing, inpainting, super-res

Table of Contents

  1. Architecture Overview
  2. Paper Shortlist & Critique
  3. Module-by-Module Design
  4. Mathematical Formulation
  5. Training Objective
  6. Memory & Compute Budget
  7. Training Curriculum
  8. Mobile Deployment Plan
  9. Failure Mode Analysis
  10. Ablation Plan
  11. Editing Roadmap
  12. Quick Start

1. Architecture Overview

┌──────────────────────────────────────────────────────────────────┐
│                     MicroForge Pipeline                          │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Text ──→ [Text Encoder (CLIP/TinyCLIP)] ──→ text_emb, pooled    │
│                     │                                            │
│                     ▼                                            │
│  Noise z_T ──→ [Recurrent Latent Planner]                        │
│                  │  K=32 plan tokens (49 KB state)               │
│                  │  READ: cross-attn(plan, z_t) — O(K·N)         │
│                  │  REASON: self-attn(plan) — O(K²)              │
│                  │  Self-condition from previous step            │
│                  ▼                                               │
│  z_t ──→ [SSM-Conv Hybrid Backbone] ◄── planner_tokens           │
│           │ Per block (×6/12/18):                                │
│           │   1. AdaLN-Group(z_t, t_emb + text_pool)             │
│           │   2. BiSSM(zigzag scan) — O(N)                       │
│           │   3. CrossAttn(z_t, text_emb ∥ plan) — O(N·M)        │
│           │   4. FFN(expansion=3) — O(N·D)                       │
│           │ Every K blocks: SharedMQA(z_t) — single instance     │
│           ▼                                                      │
│  v_pred = backbone(z_t, t, text, plan)                           │
│  z_{t-1} = z_t - Δt · v_pred       (Euler ODE step)              │
│                                                                  │
│  z_0 ──→ [DC-VAE Decoder (32× upsample)] ──→ Image [3,H,W]       │
│                                                                  │
│  ┌─── Editing Mode (same backbone) ─────────────────────┐        │
│  │ z_input = [z_target_noise ∥ z_source] (width-concat) │        │
│  │ Task token: [Generate] / [Edit] / [Inpaint] / [SR]   │        │
│  │ No extra parameters needed                           │        │
│  └──────────────────────────────────────────────────────┘        │
└──────────────────────────────────────────────────────────────────┘

What's Novel

  1. Recurrent Latent Planner (RLP): Persistent latent tokens that carry "memory" across denoising steps. The planner reasons at a higher level before the backbone commits to pixel changes. Inspired by RIN (Jabri et al., 2022) but adapted for diffusion: plan tokens READ from the noised latent, REASON internally via self-attention, then inject guidance into the backbone via cross-attention. Self-conditioning carries plan state across steps.

  2. SSM-Conv Hybrid Backbone: Replaces O(N²) self-attention with bidirectional SSM scanning (O(N)) plus local DWConv. One globally-shared lightweight MQA attention block provides in-context learning capability. This hybrid achieves the global receptive field of attention with linear complexity.

  3. Deep Compression VAE with Residual Shortcuts: 32× spatial compression using space-to-channel rearrangement as non-parametric skip connections. 512px → 16×16×32 latent = only 256 spatial tokens (vs 4096 in SD-VAE).

  4. Editing by Design: DreamLite-style spatial concatenation enables generation, editing, inpainting, and super-resolution with zero extra parameters. The same backbone processes all tasks.


2. Paper Shortlist & Critique

A. Efficient Image Generation

| Paper | Problem Solved | What to Borrow | Failure Modes |
|---|---|---|---|
| SANA-Sprint (2503.09641) | 1-step generation, 0.6B params | Linear DiT + DC-AE latent + sCM+LADD distillation | Text encoder dominates memory |
| SnapGen (2412.09619) | Mobile T2I, 0.38B, iPhone 15 | Remove SA from high-res, MQA, expanded separable conv | No public weights |
| SnapGen++ (2601.08303) | 360ms/step iPhone, 0.4B | ASSA, elastic supernetwork, tiny VAE | Proprietary |
| DreamLite (2603.28713) | Mobile gen+edit unified | Spatial concat, task-progressive training | No public weights |

B. Subquadratic Backbones

| Paper | Problem Solved | What to Borrow | Failure Modes |
|---|---|---|---|
| DiMSUM (2411.04168) | Best FID with Mamba, 3× faster convergence | Wavelet+Mamba, shared attention block | Complex implementation |
| ZigMa (2403.13802) | Spatial continuity for SSM | Zigzag-8 scan, heterogeneous layers | Only class-conditional |
| LiT (2501.12976) | Pure linear DiT | DWConv inside linear attn, weight inheritance | Small quality drop at low res |

C. Compact Latent Spaces

| Paper | Problem Solved | What to Borrow | Failure Modes |
|---|---|---|---|
| DC-AE (2410.10733) | 32-128× compression | Residual space-to-channel shortcuts | High-channel needs bigger backbone |
| TiTok (2406.07550) | 32-128 1D tokens | Break 2D grid, proxy-code VQ | Resolution-fixed |

D. Editing Patterns

| Paper | Problem Solved | What to Borrow | Failure Modes |
|---|---|---|---|
| DreamLite (2603.28713) | Mobile gen+edit | Spatial concat (+14 GenEval vs channel) | Editing data at scale |
| FLUX Kontext (2506.15742) | Best editing quality | 3D RoPE offset, multi-reference | 12B, not mobile |
| RIN (2212.11972) | Decoupled computation | Latent tokens + cross-attn, self-cond | Pixel-space only |

3. Module-by-Module Design

Module A: Deep Compression VAE (microforge/vae.py)

32× spatial compression with space-to-channel residual shortcuts (DC-AE technique).

| Config | Channels | Latent C | Params | FP16 |
|---|---|---|---|---|
| tiny | [32, 64, 128, 256] | 16 | 16M | 32 MB |
| small | [64, 128, 256, 512] | 32 | 77M | 154 MB |
| base | [128, 256, 512, 512] | 32 | 110M | 220 MB |
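
The space-to-channel residual shortcut is easiest to see in code. The sketch below is a minimal illustration of the DC-AE-style trick under assumed layer sizes, not the actual MicroForgeVAE code: the non-parametric shortcut rearranges each 2×2 spatial neighborhood into channels and averages channel groups to match the learned downsampling path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceToChannelDown(nn.Module):
    """2x downsample block with a non-parametric space-to-channel residual
    shortcut (DC-AE style). Hypothetical layer, not the MicroForgeVAE source."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        # pixel_unshuffle(2) turns [B, C, H, W] into [B, 4C, H/2, W/2]
        assert (4 * in_ch) % out_ch == 0
        self.group = (4 * in_ch) // out_ch  # channels averaged per output channel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = F.pixel_unshuffle(x, 2)                          # [B, 4C, H/2, W/2]
        b, _, h, w = residual.shape
        residual = residual.view(b, -1, self.group, h, w).mean(2)   # [B, out_ch, H/2, W/2]
        return self.conv(x) + residual                              # learned path + shortcut

x = torch.randn(1, 32, 128, 128)
y = SpaceToChannelDown(32, 64)(x)   # -> [1, 64, 64, 64]
```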

Module B: SSM-Conv Hybrid Backbone (microforge/backbone.py)

Bidirectional SSM + local DWConv + one globally-shared MQA attention.

| Config | Depth | Dim | Params | FP16 |
|---|---|---|---|---|
| tiny | 6 | 256 | 8M | 16 MB |
| small | 12 | 384 | 29M | 58 MB |
| base | 18 | 512 | 71M | 142 MB |
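
The globally-shared MQA block can be sketched as follows. This is a hedged illustration (module name, head count, and dimensions are assumptions, not the repository code): multi-query attention keeps several query heads but a single shared key/value head, and one instance of the module is reused by every K-th backbone block instead of giving each block its own attention weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedMQA(nn.Module):
    """Lightweight multi-query attention: many query heads, one shared K/V head."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * self.head_dim)   # a single K and a single V head
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: [B, N, D]
        b, n, _ = x.shape
        q = self.q(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv(x).chunk(2, dim=-1)                        # [B, N, head_dim] each
        k = k.unsqueeze(1).expand(b, self.num_heads, n, self.head_dim)
        v = v.unsqueeze(1).expand(b, self.num_heads, n, self.head_dim)
        out = F.scaled_dot_product_attention(q, k, v)             # [B, H, N, head_dim]
        return self.proj(out.transpose(1, 2).reshape(b, n, -1))

shared_attn = SharedMQA(dim=256)       # instantiated once...
blocks_using_it = [shared_attn] * 3    # ...and referenced by every K-th block
```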

Module C: Recurrent Latent Planner (microforge/planner.py)

32 persistent plan tokens, 49 KB state per plan. O(K² + K·N) cost per layer.
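
A hedged sketch of one planner update and the self-conditioning blend; layer names, head count, and the gate parameterization are assumptions rather than the contents of microforge/planner.py.

```python
import torch
import torch.nn as nn

class PlannerStep(nn.Module):
    """One READ -> REASON planner update (illustrative sketch)."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.read = nn.MultiheadAttention(dim, heads, batch_first=True)    # plan attends to latent
        self.reason = nn.MultiheadAttention(dim, heads, batch_first=True)  # plan attends to itself
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, plan: torch.Tensor, latent_tokens: torch.Tensor) -> torch.Tensor:
        # READ: cross-attention, cost O(K*N) for K plan tokens and N latent tokens
        plan = plan + self.read(self.norm1(plan), latent_tokens, latent_tokens)[0]
        # REASON: self-attention over the K plan tokens only, cost O(K^2)
        p = self.norm2(plan)
        plan = plan + self.reason(p, p, p)[0]
        return plan

def self_condition(prev_plan, init_plan, gate_logit):
    """Blend the previous step's plan with a fresh text-initialized plan
    via a learned gate, matching the sigma(w) formula in Section 4."""
    w = torch.sigmoid(gate_logit)
    return w * prev_plan + (1 - w) * init_plan
```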

Module D: Text Encoder (pluggable)

  • Mobile: TinyCLIP ~60M
  • Quality: CLIP-L ~428M
  • Best: Gemma-2-2B ~2B

4. Mathematical Formulation

Rectified Flow: z_t = (1-t)·z_0 + t·ε

Velocity target: v* = ε - z_0

Training loss: L = E[w(t) · ||v_θ(z_t, t, c) - v*||²] where w(t) = 1/(1+|2t-1|)

Sampling (Euler step from noise at t=1 toward data at t=0): z_{t-Δt} = z_t - Δt · v_θ(z_t, t, c)

Planner self-conditioning: p_t = σ(w)·p_{t+1} + (1-σ(w))·p_init(text)

CFG: v̂ = v_∅ + s·(v_c - v_∅)
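
The formulation above translates into a short training/sampling sketch. It assumes a model(z_t, t, text_emb) interface and an illustrative guidance scale; it is not the repository's pipeline code.

```python
import torch

def flow_matching_loss(model, z0, text_emb):
    """Weighted rectified-flow loss: w(t) * ||v_pred - (eps - z0)||^2."""
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device)                    # t ~ U(0, 1)
    eps = torch.randn_like(z0)
    t_ = t.view(b, 1, 1, 1)
    z_t = (1 - t_) * z0 + t_ * eps                         # linear interpolation path
    v_target = eps - z0                                    # v* = eps - z0
    v_pred = model(z_t, t, text_emb)
    w = 1.0 / (1.0 + (2 * t_ - 1).abs())                   # w(t) = 1 / (1 + |2t - 1|)
    return (w * (v_pred - v_target) ** 2).mean()

@torch.no_grad()
def sample(model, shape, text_emb, null_emb, num_steps=4, cfg_scale=4.5, device="cpu"):
    """Euler sampler with classifier-free guidance, integrating from t=1 to t=0."""
    z = torch.randn(shape, device=device)
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t = ts[i].expand(shape[0])
        v_c = model(z, t, text_emb)
        v_u = model(z, t, null_emb)
        v = v_u + cfg_scale * (v_c - v_u)                  # CFG: v_null + s * (v_c - v_null)
        z = z - (ts[i] - ts[i + 1]) * v                    # Euler step toward t=0
    return z
```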


5. Training Objective

  • Stage 1 (VAE): L1 + λ_KL·KL + LPIPS + GAN
  • Stage 2-3 (Flow): w(t)·||v_θ - v*||²
  • Stage 4 (KD): L_flow + λ_t·α(t)·||v_student - v_teacher||²
  • Stage 5 (Edit): ||v_θ([z_t|z_src], t, c_edit) - v*||²
  • Stage 6 (Distill): ||f_θ(z_t, t) - f_{θ⁻}(z_t', t')||²
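
As one concrete illustration, the Stage 4 objective above could be computed roughly as below; the λ weight and the α(t) ramp are placeholder choices, not the values used in training.

```python
import torch

def stage4_kd_loss(v_student, v_teacher, v_target, t, lambda_kd=0.5):
    """Flow loss plus a teacher-matching term; alpha(t) here is a placeholder schedule."""
    flow = ((v_student - v_target) ** 2).mean()
    alpha = 1.0 - t.view(-1, 1, 1, 1)        # example: weight the teacher more near t=0
    kd = (alpha * (v_student - v_teacher.detach()) ** 2).mean()
    return flow + lambda_kd * kd
```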

6. Memory & Compute Budget

Total System Memory (FP16, no text encoder)

  • Tiny: ~76 MB inference @ 512px
  • Small: ~308 MB inference @ 512px
  • Base: ~530 MB inference @ 512px

With TinyCLIP (+~120 MB), the tiny config stays under 200 MB and even the small config stays under 500 MB.
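
The weight portion of these figures follows directly from the module tables in Section 3 (2 bytes per FP16 parameter); the remainder is planner weights plus runtime activations and latents, which are estimates rather than measured numbers.

```python
# FP16 weight memory from the per-module parameter counts listed in Section 3.
configs = {
    "tiny":  {"vae": 16e6, "backbone": 8e6},
    "small": {"vae": 77e6, "backbone": 29e6},
    "base":  {"vae": 110e6, "backbone": 71e6},
}
for name, mods in configs.items():
    weights_mb = sum(mods.values()) * 2 / 1e6
    print(f"{name}: ~{weights_mb:.0f} MB of FP16 weights (+ planner, latents, activations)")
# tiny ~48 MB, small ~212 MB, base ~362 MB of weights, consistent with the
# ~76 / ~308 / ~530 MB inference figures once planner state and activations are added.
```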


7. Training Curriculum (16 GB GPU)

| Stage | Freeze | Train | Data | Res | Steps | LR | Time (T4) |
|---|---|---|---|---|---|---|---|
| 1. VAE | — | VAE | ImageNet-50K | 128→256 | 50K | 1e-4 | 6h |
| 2. Low-Res | VAE | Backbone+Plan | Synthetic 100K | 128→256 | 100K | 1e-4 | 12h |
| 3. High-Res | VAE | Backbone+Plan | Same + high-res | 256→512 | 50K | 5e-5 | 8h |
| 4. Distill | VAE | Backbone+Plan | Teacher cached | 512 | 30K | 2e-5 | 6h |
| 5. Edit | VAE | All (low LR) | IP2P+MagicBrush | 256→512 | 20K | 1e-5 | 4h |
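
Keeping every stage within a single 16 GB card comes down to mixed precision plus gradient accumulation. A generic sketch follows; the loader/loss_fn interfaces, batch handling, and accumulation factor are assumptions rather than the project's trainer.

```python
import torch

def train_stage(model, loss_fn, loader, lr=1e-4, accum=4, steps=100_000):
    """Mixed-precision training loop with gradient accumulation and grad clipping."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scaler = torch.cuda.amp.GradScaler()
    step = 0
    for batch in loader:
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = loss_fn(model, batch) / accum            # scale for accumulation
        scaler.scale(loss).backward()
        if (step + 1) % accum == 0:
            scaler.unscale_(opt)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
            scaler.step(opt)
            scaler.update()
            opt.zero_grad(set_to_none=True)
        step += 1
        if step >= steps * accum:
            break
```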

8. Mobile Deployment

  1. Step distill to 4 steps (consistency/LADD)
  2. Export ONNX with static shapes
  3. INT8 weight quantization
  4. Convert to CoreML/NNAPI/QNN
  5. Profile on-device
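
Steps 2-3 might look roughly like this for the backbone; the forward signature, tensor shapes, file names, and opset version are illustrative assumptions, and the planner and VAE would be exported the same way.

```python
import torch
from onnxruntime.quantization import quantize_dynamic, QuantType
from microforge.backbone import MicroForgeBackbone

backbone = MicroForgeBackbone(latent_channels=16, config='tiny').eval()

# Step 2: export with static shapes (placeholder 512px -> 16x16 latent grid and
# a placeholder text-embedding shape; the real forward may take extra inputs).
z = torch.randn(1, 16, 16, 16)          # [B, latent_C, H/32, W/32]
t = torch.ones(1)
text = torch.randn(1, 77, 768)
torch.onnx.export(
    backbone, (z, t, text), "microforge_backbone.onnx",
    input_names=["z_t", "t", "text_emb"], output_names=["v_pred"],
    opset_version=17,
)

# Step 3: INT8 weight quantization of the exported graph.
quantize_dynamic("microforge_backbone.onnx", "microforge_backbone_int8.onnx",
                 weight_type=QuantType.QInt8)
```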

9. Failure Modes

| Failure | Fix |
|---|---|
| SSM scan artifacts | More scan directions + larger DWConv |
| Planner collapse | Diversity loss on plan tokens |
| VAE blur | Reduce λ_KL + adversarial loss |
| Training instability | Grad clip=2.0 + separate SSM LR |
| Editing forgetting | Spatial concat + progressive training |
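
For the training-instability row, the separate SSM learning rate can be set up with optimizer parameter groups, paired with the grad clip at max_norm=2.0 shown in the curriculum sketch. The "ssm" name-matching rule below is an assumption about how the backbone names its SSM modules.

```python
import torch

def build_optimizer(model, base_lr=1e-4, ssm_lr=5e-5):
    """Two parameter groups: SSM/state-space parameters get a lower learning rate."""
    ssm_params, other_params = [], []
    for name, p in model.named_parameters():
        (ssm_params if "ssm" in name.lower() else other_params).append(p)
    return torch.optim.AdamW([
        {"params": other_params, "lr": base_lr},
        {"params": ssm_params, "lr": ssm_lr},
    ], weight_decay=0.01)
```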

10. Ablation Plan

| ID | Change | Expected |
|---|---|---|
| A1 | No Planner | -2-5% FID |
| A2 | Full attention (no SSM) | Better @256, worse @1024, 2-4× slower |
| A3 | No shared MQA | -1-3% FID |
| A4 | No DWConv in SSM | -2-4% FID |
| A5 | No self-conditioning | More step jitter |
| A6 | Full vs grouped adaLN | +46% params, marginal gain |
| A7 | f16 vs f32 vs f64 VAE | f32 sweet spot |
| A8 | Spatial vs channel concat | Spatial preserves gen quality |

11. Editing Roadmap

  • ✅ Phase 1: Architecture supports spatial concatenation
  • Phase 2: Image editing (InstructPix2Pix data)
  • Phase 3: Inpainting (masked spatial concat)
  • Phase 4: Super-resolution
  • Phase 5: Style/reference (add IP-Adapter, +22M params)
  • Phase 6: Local editing (region-aware planner)
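
As a concrete picture of the Phase 1-3 mechanism, here is a hedged sketch of building the edit-mode input; the task-token vocabulary and mask convention are illustrative assumptions, not the pipeline's API.

```python
import torch

def build_edit_input(z_target_noise, z_source, task="edit", mask=None):
    """Width-concatenate the noised target latent with the clean source latent.
    For inpainting, the source is masked before concatenation."""
    task_ids = {"generate": 0, "edit": 1, "inpaint": 2, "sr": 3}   # illustrative vocabulary
    if task == "inpaint" and mask is not None:
        z_source = z_source * (1 - mask)          # hide the region to be repainted
    z_input = torch.cat([z_target_noise, z_source], dim=-1)        # [B, C, H, 2W]
    task_token = torch.tensor([task_ids[task]])
    return z_input, task_token

zt = torch.randn(1, 16, 16, 16)
zs = torch.randn(1, 16, 16, 16)
z_in, tok = build_edit_input(zt, zs, task="edit")   # z_in: [1, 16, 16, 32]
```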

12. Quick Start

```python
import torch
from microforge.vae import MicroForgeVAE
from microforge.backbone import MicroForgeBackbone
from microforge.planner import RecurrentLatentPlanner
from microforge.pipeline import MicroForgePipeline, SimpleTextEncoder

# Assemble the tiny configuration (see Section 3 for the per-module sizes).
vae = MicroForgeVAE(config='tiny')
backbone = MicroForgeBackbone(latent_channels=16, config='tiny')
planner = RecurrentLatentPlanner(num_plan_tokens=16, dim=256, text_dim=768, latent_channels=16)
text_enc = SimpleTextEncoder(embed_dim=768, num_layers=2)
pipeline = MicroForgePipeline(vae, backbone, text_enc, planner)

# Dummy prompt: 10 random token ids from the SimpleTextEncoder vocabulary.
tokens = torch.randint(0, 8192, (1, 10))
# Generate one 256x256 image with 4 sampling steps.
images = pipeline.text2img(tokens, height=256, width=256, num_steps=4)
```

License

MIT License

Citation

```bibtex
@software{microforge2025,
  title={MicroForge: Mobile-First Image Generation with Recurrent Latent Planning},
  year={2025},
  url={https://huggingface.co/asdf98/microforge}
}
```