Add SGLang serving instructions

#14
by MickJ - opened
Files changed (1)
  1. README.md +67 -26
README.md CHANGED
@@ -10,34 +10,11 @@ tags:
10
  - cosmos3
11
  - vllm
12
  - vllm-omni
 
 
13
  - diffusers
14
  - text, image, video, audio, and action generation
15
  - omnimodel
16
- countDownloads:
17
- - checkpoint.json
18
- - config.json
19
- - generation_config.json
20
- - model.safetensors.index.json
21
- - model_index.json
22
- - tokenizer.json
23
- - tokenizer_config.json
24
- - sound_tokenizer/config.json
25
- - sound_tokenizer/diffusion_pytorch_model.safetensors
26
- - text_tokenizer/tokenizer.json
27
- - text_tokenizer/tokenizer_config.json
28
- - transformer/config.json
29
- - transformer/diffusion_pytorch_model-00001-of-00007.safetensors
30
- - transformer/diffusion_pytorch_model-00002-of-00007.safetensors
31
- - transformer/diffusion_pytorch_model-00003-of-00007.safetensors
32
- - transformer/diffusion_pytorch_model-00004-of-00007.safetensors
33
- - transformer/diffusion_pytorch_model-00005-of-00007.safetensors
34
- - transformer/diffusion_pytorch_model-00006-of-00007.safetensors
35
- - transformer/diffusion_pytorch_model-00007-of-00007.safetensors
36
- - transformer/diffusion_pytorch_model.safetensors.index.json
37
- - vae/config.json
38
- - vae/diffusion_pytorch_model.safetensors
39
- - vision_encoder/config.json
40
- - vision_encoder/model.safetensors
41
  ---
42
 
43
  # **Cosmos 3: Omnimodal World Models for Physical AI**
@@ -192,6 +169,7 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
192
  - [PyTorch](https://github.com/nvidia/cosmos3)
193
  - [vLLM-Omni](https://github.com/vllm-project/vllm-omni)
194
  - [Hugging Face Diffusers](https://huggingface.co/docs/diffusers/en/index)
 
195
 
196
  **Supported Hardware Microarchitecture Compatibility:**
197
 
@@ -941,6 +919,69 @@ Example output:
941
 
942
  <video controls width="1280" height="720" src="https://huggingface.co/nvidia/Cosmos3-Nano/resolve/main/assets/example_t2v_diffusers_output.mp4"></video>
943
 
944
  ## Limitations
945
 
946
  Cosmos3 may produce imperfect outputs in challenging scenarios. Generation artifacts include temporal inconsistency, unstable camera or object motion, imprecise physical interactions, inaccurate audio-video synchronization, and action-state drift — especially in long-horizon or high-resolution outputs. Reasoning may also be incorrect: object states, causal relationships, spatial geometry, temporal ordering, agent intent, and future outcomes can be misinferred, and complex or long-context inputs may yield hallucinated entities, inconsistent interpretations, or implausible predictions. Because the model lacks an explicit physics simulator, 3D geometry, 4D space-time evolution, object permanence, contact dynamics, and physical laws are only approximated — producing artifacts such as disappearing or morphing objects, unrealistic collisions, and physically implausible motions. Quality further degrades in out-of-distribution environments, safety-critical edge cases, and domains underrepresented in training.
@@ -949,7 +990,7 @@ Cosmos3 outputs should not be treated as physically accurate simulation, reliabl
949
 
950
  ## Inference
951
 
952
- **Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers)
953
 
954
  **Test Hardware:** GB200 and H100
955
 
 
10
  - cosmos3
11
  - vllm
12
  - vllm-omni
13
+ - sglang
14
+ - sglang-diffusion
15
  - diffusers
16
  - text, image, video, audio, and action generation
17
  - omnimodel
18
  ---
19
 
20
  # **Cosmos 3: Omnimodal World Models for Physical AI**
 
169
  - [PyTorch](https://github.com/nvidia/cosmos3)
170
  - [vLLM-Omni](https://github.com/vllm-project/vllm-omni)
171
  - [Hugging Face Diffusers](https://huggingface.co/docs/diffusers/en/index)
172
+ - [SGLang](https://github.com/sgl-project/sglang)
173
 
174
  **Supported Hardware Microarchitecture Compatibility:**
175
 
 
919
 
920
  <video controls width="1280" height="720" src="https://huggingface.co/nvidia/Cosmos3-Nano/resolve/main/assets/example_t2v_diffusers_output.mp4"></video>
921
 
922
+ ### SGLang
923
+
924
+ [SGLang Diffusion](https://docs.sglang.io/docs/sglang-diffusion/index) can serve `nvidia/Cosmos3-Nano` through OpenAI-compatible image and video generation endpoints. Install SGLang from the main branch with diffusion dependencies, then start a server:
925
+
926
+ ```shell
927
+ git clone --branch main https://github.com/sgl-project/sglang.git
928
+ cd sglang
929
+ pip install -e "python[diffusion]"
930
+ pip install "cosmos-guardrail==0.3.1"
931
+
932
+ sglang serve --model-path nvidia/Cosmos3-Nano
933
+ ```
934
+
935
+ Cosmos 3 support in SGLang Diffusion currently requires the SGLang main branch. Switch to a stable SGLang release once Cosmos 3 support is included there.
936
+
937
+ For a video-specialized checkpoint, use `Cosmos3-Super-Image2Video` with multiple GPUs:
938
+
939
+ ```shell
940
+ sglang serve \
941
+ --model-path nvidia/Cosmos3-Super-Image2Video \
942
+ --num-gpus 4
943
+ ```
944
+
945
+ Supported SGLang endpoints:
946
+
947
+ | Mode | Endpoint | Notes |
948
+ | --- | --- | --- |
949
+ | Text to image | `POST /v1/images/generations` | Returns base64 image data by default |
950
+ | Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` |
951
+ | Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` |
952
+
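A rough sketch of the text-to-image row in the table above, not taken from the PR. It assumes the endpoint accepts an OpenAI-style JSON body and returns the image under `data[0].b64_json`, which is the usual OpenAI-compatible shape but is not confirmed by this diff. `save_first_image`, `SGLANG_URL`, and `RUN_SGLANG_EXAMPLE` are hypothetical names used only for illustration.

```shell
# Hypothetical helper: read an OpenAI-style images response on stdin and
# write the first base64-encoded image to the file given as $1.
# Assumption: the response looks like {"data": [{"b64_json": "..."}]}.
save_first_image() {
  python3 -c 'import base64, json, sys
payload = json.load(sys.stdin)
with open(sys.argv[1], "wb") as fh:
    fh.write(base64.b64decode(payload["data"][0]["b64_json"]))' "$1"
}

# Only contact the server when explicitly requested, so the helper can be
# sourced or tested without a running SGLang instance.
if [ "${RUN_SGLANG_EXAMPLE:-0}" = "1" ]; then
  curl -sS -X POST "${SGLANG_URL:-http://localhost:30000}/v1/images/generations" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "A small warehouse robot next to a blue box.", "size": "1280x720"}' \
    | save_first_image cosmos3_t2i_output.png
fi
```

Setting `RUN_SGLANG_EXAMPLE=1` sends the request to a server started as shown earlier; the `size` value is only an example and may need adjusting to what the checkpoint supports.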
953
+ Example text-to-video request:
954
+
955
+ ```shell
956
+ job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
957
+ --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
958
+ --form-string "negative_prompt=blurry, distorted, low quality" \
959
+ --form-string "size=1280x720" \
960
+ --form-string "num_frames=81" \
961
+ --form-string "fps=24" \
962
+ --form-string "num_inference_steps=35" \
963
+ --form-string "guidance_scale=4.0" \
964
+ --form-string "flow_shift=10.0" \
965
+ --form-string "seed=42" \
966
+ --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
967
+ | python -c 'import json, sys; print(json.load(sys.stdin)["id"])')
968
+
969
+ while true; do
970
+ status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" \
971
+ | python -c 'import json, sys; print(json.load(sys.stdin)["status"])')
972
+ [ "$status" = "completed" ] && break
973
+ [ "$status" = "failed" ] && exit 1
974
+ sleep 1
975
+ done
976
+
977
+ curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
978
+ -o cosmos3_t2v_output.mp4
979
+ ```
980
+
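The polling loop in the example above runs until the job completes or fails, so a stalled job would keep it waiting indefinitely. One possible variant, sketched here with a bounded number of attempts: `wait_for_job`, `get_video_status`, and `POLL_INTERVAL` are hypothetical names, and the `completed` / `failed` status strings are the ones used in the example above.

```shell
# Hypothetical helper: run a status command repeatedly until it prints
# "completed" or "failed", giving up after max_attempts polls.
# Returns 0 on completed, 1 on failed, 2 if the attempts run out.
wait_for_job() {
  max_attempts=$1
  shift
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    status=$("$@")
    case "$status" in
      completed) return 0 ;;
      failed) return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "${POLL_INTERVAL:-1}"
  done
  return 2
}

# Status command for a job created as in the example above;
# expects job_id to be set by the caller.
get_video_status() {
  curl -sS "http://localhost:30000/v1/videos/${job_id}" \
    | python3 -c 'import json, sys; print(json.load(sys.stdin)["status"])'
}

# Usage (about 10 minutes at the default 1 s interval):
#   wait_for_job 600 get_video_status && echo "ready"
```

Passing the status command as arguments keeps the retry logic separate from the HTTP call, so the same function can wrap any job type that reports these two terminal states.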
981
+ SGLang accepts Cosmos 3 request options including `max_sequence_length`, `flow_shift`, `extra_params.guardrails`, `extra_params.use_resolution_template`, and `extra_params.use_duration_template`. Video-to-video, video-with-sound, and action generation are not supported by SGLang yet.
982
+
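For the image-to-video mode listed in the endpoint table, a request might look like the sketch below. Only the `POST /v1/videos` endpoint and the `input_reference` field come from the text above; the `submit_i2v` function, the `SGLANG_URL` variable, and the file names are hypothetical, and the server is assumed to be running a checkpoint that supports image conditioning.

```shell
# Hypothetical helper: submit an image-to-video job.
#   $1 = path to the conditioning image, $2 = text prompt
# Prints the server's JSON response; returns 1 if the image file is missing.
submit_i2v() {
  image_path=$1
  prompt=$2
  if [ ! -f "$image_path" ]; then
    echo "missing conditioning image: $image_path" >&2
    return 1
  fi
  # --form-string sends the prompt literally; -F with @ uploads the file.
  curl -sS -X POST "${SGLANG_URL:-http://localhost:30000}/v1/videos" \
    --form-string "prompt=${prompt}" \
    -F "input_reference=@${image_path}"
}

# Usage (assumes the response carries an "id" field, as in the
# text-to-video example):
#   job_id=$(submit_i2v first_frame.png "The robot lifts the blue box." \
#     | python3 -c 'import json, sys; print(json.load(sys.stdin)["id"])')
```

The returned job can then be polled and downloaded the same way as a text-to-video job.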
983
+ For complete serving instructions and request examples, see the [Cosmos3 SGLang cookbook](https://docs.sglang.io/cookbook/diffusion/Cosmos/Cosmos3).
984
+
985
  ## Limitations
986
 
987
  Cosmos3 may produce imperfect outputs in challenging scenarios. Generation artifacts include temporal inconsistency, unstable camera or object motion, imprecise physical interactions, inaccurate audio-video synchronization, and action-state drift — especially in long-horizon or high-resolution outputs. Reasoning may also be incorrect: object states, causal relationships, spatial geometry, temporal ordering, agent intent, and future outcomes can be misinferred, and complex or long-context inputs may yield hallucinated entities, inconsistent interpretations, or implausible predictions. Because the model lacks an explicit physics simulator, 3D geometry, 4D space-time evolution, object permanence, contact dynamics, and physical laws are only approximated — producing artifacts such as disappearing or morphing objects, unrealistic collisions, and physically implausible motions. Quality further degrades in out-of-distribution environments, safety-critical edge cases, and domains underrepresented in training.
 
990
 
991
  ## Inference
992
 
993
+ **Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers), [SGLang](https://github.com/sgl-project/sglang), [SGLang Diffusion](https://docs.sglang.io/docs/sglang-diffusion/index)
994
 
995
  **Test Hardware:** GB200 and H100
996