JLT: Clean-Latent Prediction in Latent Diffusion Transformers
Funing Fu · Tenghui Wang · Guanyu Zhou · Junyong Cen · Qichao Zhu
JLT investigates whether predicting the clean latent is better than predicting velocity when training a diffusion transformer in latent space. With the same architecture, training settings, and FLUX.2 VAE representation, clean-latent prediction reaches FID 2.50 on ImageNet 256×256, compared with FID 6.56 for velocity prediction, a 62% relative reduction in FID.
This model is trained in the FLUX.2 VAE latent space with a clean-latent prediction target.
| Model | Target | FID-50K ↓ | IS ↑ |
|---|---|---|---|
| JLT-B/1 | x (clean) | 2.50 | 232.51 |
| DiT-B/1 | v (velocity) | 6.56 | 132.12 |
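
As a quick check of the arithmetic, the 62% figure can be reproduced from the two FID values in the table. The variable names below are illustrative only.

```python
# FID-50K values copied from the table above.
fid_clean = 2.50     # JLT-B/1, clean-latent (x) target
fid_velocity = 6.56  # DiT-B/1, velocity (v) target

# Relative reduction in FID when switching from the v target to the x target.
relative_reduction = (fid_velocity - fid_clean) / fid_velocity
print(f"{relative_reduction:.1%}")  # prints 61.9%
```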
Under the linear corruption path z_t = t * x + (1 - t) * epsilon, with x the clean latent and epsilon unit-variance Gaussian noise:
- Clean-latent prediction regresses x directly, attenuating low-variance latent directions.
- Velocity prediction regresses v = x - epsilon, adding an isotropic unit floor to all directions.

Key insight: velocity prediction amplifies low-variance latent directions while clean prediction attenuates them.
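
The effect of the two targets on per-direction variance can be illustrated with a small simulation. This is a minimal sketch, not the repository's code: it assumes independent zero-mean Gaussian latent directions and unit-variance noise, and the two standard deviations (1.0 and 0.1) are hypothetical values chosen only to contrast a high-variance direction with a low-variance one.

```python
import random
import statistics


def corrupt(x, eps, t):
    """Linear path z_t = t * x + (1 - t) * eps; t = 1 gives the clean latent."""
    return t * x + (1.0 - t) * eps


def targets(x, eps):
    """Return the (clean, velocity) regression targets for one sample."""
    return x, x - eps


# The velocity target is the time derivative of the path: d z_t / dt = x - eps.
x0, eps0 = 0.3, -1.2
finite_diff = (corrupt(x0, eps0, 0.75) - corrupt(x0, eps0, 0.25)) / 0.5

rng = random.Random(0)
n = 20_000
# Hypothetical per-direction standard deviations, for illustration only.
direction_std = {"high_variance": 1.0, "low_variance": 0.1}

results = {}
for name, std in direction_std.items():
    pairs = [targets(rng.gauss(0.0, std), rng.gauss(0.0, 1.0)) for _ in range(n)]
    var_x = statistics.pvariance([x for x, _ in pairs])
    var_v = statistics.pvariance([v for _, v in pairs])
    results[name] = (var_x, var_v)
    print(f"{name}: Var[x target] ~ {var_x:.3f}, Var[v target] ~ {var_v:.3f}")
```

Under these assumptions the clean target keeps each direction's own variance (about 1.0 and 0.01), while the velocity target has variance close to that value plus one (about 2.0 and 1.01). The low-variance direction therefore carries roughly 1% of the high-variance direction's target energy under x prediction but about half of it under v prediction, which is the amplification described above.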
| Component | Specification |
|---|---|
| Transformer Blocks | 12 |
| Hidden Dimension | 768 |
| Attention Heads | 12 |
| Parameters | 130M |
| Tokenizer | FLUX.2 VAE (frozen) |
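
The table's numbers can be sanity-checked with a rough parameter count. The sketch below assumes a DiT-style block (fused QKV plus output projection, an MLP with expansion ratio 4, and a six-way adaLN modulation layer) and ignores biases, norms, and embeddings; the actual JLT block layout is not stated here and may differ.

```python
hidden = 768
blocks = 12
heads = 12

# Per-head width implied by the table.
head_dim = hidden // heads

# ASSUMPTION: DiT-style block, weights only (no biases, norms, or embeddings).
attn_params = 4 * hidden**2   # Q, K, V, and output projections
mlp_params = 8 * hidden**2    # two linear layers with a 4x hidden expansion
adaln_params = 6 * hidden**2  # conditioning -> shift/scale/gate for attn and MLP

per_block = attn_params + mlp_params + adaln_params
total = blocks * per_block
print(head_dim, f"{total / 1e6:.1f}M")  # prints 64 127.4M
```

Under these assumptions the blocks account for about 127M weights, which is in line with the 130M total once embeddings and the output layer are included.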
huggingface-cli download dawn-neo/JLT checkpoint-last.pth
# Requires pre-encoded ImageNet latents and torch-fidelity
python main_jit.py \
--model JiT-B/1 --vae_type flux2 \
--data_path /path/to/imagenet_latents_256 --use_latent_cache \
--online_eval --eval_freq 1 --gen_bsz 128 --num_images 50000 \
--cfg 2.9 --num_sampling_steps 50 \
--resume checkpoint-last.pth --output_dir ./eval_output
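
To make the `--num_sampling_steps` and `--cfg` flags concrete, here is a minimal sketch of how a clean-latent predictor can drive sampling along the path z_t = t * x + (1 - t) * epsilon. It is not taken from the repository: the Euler solver, the `guide` formula (the common classifier-free guidance combination), and all function names are assumptions for illustration, and `main_jit.py` may use a different solver or guidance schedule.

```python
import random


def x_to_velocity(x_pred, z_t, t):
    """On the linear path, v = x - eps = (x - z_t) / (1 - t)."""
    return [(xp - z) / (1.0 - t) for xp, z in zip(x_pred, z_t)]


def guide(cond, uncond, scale):
    """Common classifier-free guidance mix (assumed form, e.g. scale = 2.9)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def euler_sample(predict_x, z0, num_steps=50):
    """Integrate dz/dt = v from t = 0 (noise) to t = 1 (clean) with Euler steps."""
    z = list(z0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stays in [0, 1 - dt], so 1 - t never reaches zero
        v = x_to_velocity(predict_x(z, t), z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


# Toy check with a hypothetical "oracle" that always predicts the same latent.
target = [0.5, -1.25, 2.0]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(lambda z, t: target, noise, num_steps=50)
print(max(abs(s - x) for s, x in zip(sample, target)))
```

With an oracle predictor the final Euler step lands on the predicted latent up to floating-point error, which is a convenient way to test the conversion from x prediction to velocity before plugging in a trained model.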
For full training and inference code, see the GitHub repository.
@article{fu2026jlt,
title={{JLT}: {C}lean-{L}atent {P}rediction in {L}atent {D}iffusion {T}ransformers},
author={Fu, Funing and Wang, Tenghui and Zhou, Guanyu and Cen, Junyong and Zhu, Qichao},
journal = {arXiv preprint arXiv:2605.27102},
year={2026}
}