CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models
Paper • 2412.13195 • Published
```python
import torch
from diffusers import DiffusionPipeline

# Switch "cuda" to "mps" on Apple-silicon devices
pipe = DiffusionPipeline.from_pretrained(
    "blurgy/CoMPaSS-SD1.5", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "a photo of a laptop above a dog"
image = pipe(prompt).images[0]
image.save("laptop_above_dog.png")
```


[Project Page] [code] [arXiv]
A UNet that enhances the spatial understanding capabilities of the Stable Diffusion 1.5 text-to-image diffusion model. Compared to the base model, it shows significant improvements when generating images that specify spatial relationships between objects (e.g. "above", "to the left of").
See our GitHub repository to get started.
Compared with the base model, +CoMPaSS improves every spatial benchmark while also improving overall image quality:
| Metric | Stable Diffusion 1.5 | +CoMPaSS |
|---|---|---|
| VISOR uncond (⬆️) | 17.58% | 61.46% |
| T2I-CompBench Spatial (⬆️) | 0.08 | 0.35 |
| GenEval Position (⬆️) | 0.04 | 0.54 |
| FID (⬇️) | 12.82 | 10.89 |
| CMMD (⬇️) | 0.5548 | 0.3235 |
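The spatial benchmarks above (VISOR, T2I-CompBench Spatial, GenEval Position) score prompts that place one object in a stated relation to another, as in the snippet's "a photo of a laptop above a dog". A minimal sketch of enumerating such prompts for your own spot-checks — the object list and relation phrases here are illustrative assumptions, not the benchmarks' exact vocabularies:

```python
from itertools import permutations

# Illustrative choices, not the official benchmark lists
RELATIONS = ["above", "below", "to the left of", "to the right of"]
OBJECTS = ["dog", "laptop", "chair"]

def spatial_prompts(objects, relations):
    """Build 'a photo of a X <relation> a Y' for every ordered object pair."""
    return [
        f"a photo of a {a} {rel} a {b}"
        for a, b in permutations(objects, 2)  # ordered pairs, no self-pairs
        for rel in relations
    ]

prompts = spatial_prompts(OBJECTS, RELATIONS)
print(len(prompts))   # 6 ordered pairs x 4 relations = 24
print(prompts[0])     # a photo of a dog above a laptop
```

Each prompt can then be passed to `pipe(prompt)` as in the quick-start snippet; evaluating the generated layouts against the stated relation is what the benchmark scorers automate.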
If you use this model in your research, please cite:
```bibtex
@inproceedings{zhang2025compass,
  title={CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models},
  author={Zhang, Gaoyang and Fu, Bingtao and Fan, Qingnan and Zhang, Qi and Liu, Runxing and Gu, Hong and Zhang, Huaqi and Liu, Xinguo},
  booktitle={ICCV},
  year={2025}
}
```
For questions about the model, please contact blurgy@zju.edu.cn.
Weights for this model are available in Safetensors format.
Base model: runwayml/stable-diffusion-v1-5