CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models
Paper: arXiv:2412.13195
```python
import torch
from diffusers import DiffusionPipeline

# Switch device_map to "mps" for Apple devices
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    dtype=torch.bfloat16,
    device_map="cuda",
)
pipe.load_lora_weights("blurgy/CoMPaSS-FLUX.1")

prompt = "a photo of a laptop above a dog"
image = pipe(prompt).images[0]
image.save("laptop_above_dog.png")
```



A LoRA adapter that enhances spatial understanding capabilities of the FLUX.1 text-to-image diffusion model. This model demonstrates significant improvements in generating images with specific spatial relationships between objects.
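Spatial prompts of this kind follow a simple "object, relation, object" pattern. As an illustrative sketch (the object and relation lists below are hypothetical examples, not an official benchmark set), such prompts can be batch-constructed for evaluation:

```python
import itertools

# Hypothetical example lists; swap in your own objects and relations.
objects = ["laptop", "dog", "bicycle"]
relations = ["above", "below", "to the left of", "to the right of"]

def spatial_prompt(obj_a: str, relation: str, obj_b: str) -> str:
    """Format a prompt describing a spatial relation between two objects."""
    return f"a photo of a {obj_a} {relation} a {obj_b}"

# Every ordered pair of distinct objects, combined with every relation:
# 6 ordered pairs x 4 relations = 24 prompts.
prompts = [
    spatial_prompt(a, rel, b)
    for a, b in itertools.permutations(objects, 2)
    for rel in relations
]
```

Each resulting string (e.g. "a photo of a laptop above a dog") can be passed directly to the pipeline shown above.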
For ComfyUI, we provide a custom node with examples at comfyui-node-impl; use the ComfyUI-compatible LoRA checkpoint comfyui-checkpoint to get started. For everything else, see our GitHub repository.
Benchmark results, compared against the FLUX.1 base model:
| Metric | FLUX.1 | +CoMPaSS |
|---|---|---|
| VISOR uncond (⬆️) | 37.96% | 75.17% |
| T2I-CompBench Spatial (⬆️) | 0.18 | 0.30 |
| GenEval Position (⬆️) | 0.26 | 0.60 |
| FID (⬇️) | 27.96 | 26.40 |
| CMMD (⬇️) | 0.8737 | 0.6859 |
If you use this model in your research, please cite:
```bibtex
@inproceedings{zhang2025compass,
  title={CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models},
  author={Zhang, Gaoyang and Fu, Bingtao and Fan, Qingnan and Zhang, Qi and Liu, Runxing and Gu, Hong and Zhang, Huaqi and Liu, Xinguo},
  booktitle={ICCV},
  year={2025}
}
```
For questions about the model, please contact blurgy@zju.edu.cn.
Weights for this model are available in Safetensors format.
Download them in the Files & versions tab.
Base model: black-forest-labs/FLUX.1-dev