Visual Generation Models
Diffusers-ready checkpoints for Scalable Interpolant Transformers (SiT), converted for local/offline use.
This root folder is a model collection that contains:
- SiT-S-2-256-diffusers
- SiT-B-2-256-diffusers
- SiT-L-2-256-diffusers
- SiT-XL-2-256-diffusers
- SiT-XL-2-512-diffusers

Each subfolder is a self-contained Diffusers model repo with:

- pipeline.py
- transformer/transformer_sit.py
- scheduler/scheduling_flow_match_sit.py
- transformer/diffusion_pytorch_model.safetensors
- vae/diffusion_pytorch_model.safetensors

Use paths relative to this root README:
| Model | Resolution | Local path |
|---|---|---|
| SiT-S/2 | 256x256 | ./SiT-S-2-256-diffusers |
| SiT-B/2 | 256x256 | ./SiT-B-2-256-diffusers |
| SiT-L/2 | 256x256 | ./SiT-L-2-256-diffusers |
| SiT-XL/2 | 256x256 | ./SiT-XL-2-256-diffusers |
| SiT-XL/2 | 512x512 | ./SiT-XL-2-512-diffusers |
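Each of these paths should contain the full Diffusers layout listed above (custom pipeline, transformer, scheduler, and VAE weights). As a quick sanity check before loading, you can verify that the expected files exist; this is a minimal sketch that assumes only the file layout described earlier in this README.

```python
from pathlib import Path

# Files each converted SiT repo in this collection is expected to contain,
# per the layout described above.
EXPECTED_FILES = [
    "pipeline.py",
    "transformer/transformer_sit.py",
    "scheduler/scheduling_flow_match_sit.py",
    "transformer/diffusion_pytorch_model.safetensors",
    "vae/diffusion_pytorch_model.safetensors",
]

def check_repo(model_path: str) -> None:
    root = Path(model_path)
    missing = [f for f in EXPECTED_FILES if not (root / f).is_file()]
    if missing:
        raise FileNotFoundError(f"{model_path} is missing: {missing}")
    print(f"{model_path}: all expected files present")

check_repo("./SiT-XL-2-512-diffusers")
```

The quickstart below loads the 512x512 checkpoint through the custom pipeline code shipped in each folder, which is why trust_remote_code=True is required: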
```python
import torch
from diffusers import DiffusionPipeline

model_path = "./SiT-XL-2-512-diffusers"  # change to any path in the table above
device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = DiffusionPipeline.from_pretrained(
    model_path,
    trust_remote_code=True,  # loads the custom SiT pipeline/transformer/scheduler code in this folder
).to(device)

generator = torch.Generator(device=device).manual_seed(0)

# ImageNet class example: 207 = golden retriever
result = pipe(
    class_labels=207,
    height=512,
    width=512,
    num_inference_steps=250,  # official SiT comparisons commonly use 250 steps
    guidance_scale=4.0,
    generator=generator,
)
image = result.images[0]
image.save("sit_xl_512_demo.png")
```
model_path = "./SiT-S-2-256-diffusers"
# model_path = "./SiT-B-2-256-diffusers"
# model_path = "./SiT-L-2-256-diffusers"
# model_path = "./SiT-XL-2-256-diffusers"
pipe = DiffusionPipeline.from_pretrained(model_path, trust_remote_code=True).to(device)
image = pipe(
class_labels=207,
height=256,
width=256,
num_inference_steps=250,
guidance_scale=4.0,
generator=generator,
).images[0]
image.save("sit_256_demo.png")
The table below summarizes widely cited SiT numbers from the official project materials for class-conditional ImageNet generation.
| Model / setting | Resolution | FID-50K (lower is better) |
|---|---|---|
| SiT-S (400K steps) | 256x256 | 57.6 |
| SiT-B (400K steps) | 256x256 | 33.5 |
| SiT-L (400K steps) | 256x256 | 17.2 |
| SiT-XL (400K steps) | 256x256 | 8.6 |
| SiT-XL (cfg=1.5, ODE) | 256x256 | 2.15 |
| SiT-XL (cfg=1.5, SDE, w(t)=sigma_t) | 256x256 | 2.06 |
| SiT-XL (sample showcase) | 512x512 | Not reported in the same benchmark table |
Note: FID depends on the training recipe, sampler choice (ODE/SDE), guidance scale, and evaluation protocol. Treat this table as a pointer to the official SiT reports, not as a guarantee that every conversion/export here reproduces those numbers.
If you use SiT in your work, please cite:
```bibtex
@inproceedings{ma2024sit,
  title={SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers},
  author={Ma, Nanye and Goldstein, Mark and Albergo, Michael S. and Boffi, Nicholas M. and Vanden-Eijnden, Eric and Xie, Saining},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024},
  note={Accepted to ECCV 2024}
}
```