Add pipeline tag, library name, and sample usage

#12
by nielsr HF Staff - opened
Files changed (3)
  1. README.md +46 -83
  2. sound_tokenizer.ckpt +3 -0
  3. sound_tokenizer.json +42 -0
README.md CHANGED
@@ -1,24 +1,22 @@
1
  ---
 
2
  license: other
3
  license_name: openmdw1.1-license
4
- license_link: >-
5
- https://openmdw.ai/license/1-1/
6
- library_name: cosmos
7
  tags:
8
- - nvidia
9
- - cosmos
10
- - cosmos3
11
- - vllm
12
- - vllm-omni
13
- - sglang
14
- - sglang-diffusion
15
- - diffusers
16
- - text, image, video, audio, and action generation
17
- - omnimodel
18
  ---
19
 
20
  # **Cosmos 3: Omnimodal World Models for Physical AI**
21
- **[Model Collection](https://huggingface.co/collections/nvidia/cosmos3)** | **[Code](https://github.com/nvidia/cosmos)** | **[White Paper](https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf)** | **[Website](https://research.nvidia.com/labs/cosmos-lab/cosmos3/)**
22
 
23
  [NVIDIA Cosmos™](https://github.com/nvidia/cosmos) is a world foundation model platform designed to accelerate the development of Physical AI by enabling machines to understand, simulate, and interact with the physical world across robotics, autonomous driving, and smart space environments, including industrial and factory-scale applications.
24
 
@@ -32,6 +30,37 @@ This model is ready for commercial and non-commercial use.
32
 
33
  **Model Developer:** NVIDIA
34
 
35
  ### Model Versions
36
  - Cosmos3-Nano:
37
  - Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.
@@ -169,7 +198,6 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
169
  - [PyTorch](https://github.com/nvidia/cosmos3)
170
  - [vLLM-Omni](https://github.com/vllm-project/vllm-omni)
171
  - [Hugging Face Diffusers](https://huggingface.co/docs/diffusers/en/index)
172
- - [SGLang](https://github.com/sgl-project/sglang)
173
 
174
  **Supported Hardware Microarchitecture Compatibility:**
175
 
@@ -819,7 +847,7 @@ client = openai.OpenAI(
819
  base_url="http://localhost:8000/v1",
820
  )
821
 
822
- response = client.chat.completions.create(
823
  model=client.models.list().data[0].id,
824
  messages=[
825
  {
@@ -924,71 +952,6 @@ Example output:
924
 
925
  <video controls width="1280" height="720" src="https://huggingface.co/nvidia/Cosmos3-Super/resolve/main/assets/example_t2v_diffusers_output.mp4"></video>
926
 
927
- ### SGLang
928
-
929
- [SGLang Diffusion](https://docs.sglang.io/docs/sglang-diffusion/index) can serve `nvidia/Cosmos3-Super` through OpenAI-compatible image and video generation endpoints. Install SGLang from the main branch with diffusion dependencies, then start the server:
930
-
931
- ```bash
932
- git clone --branch main https://github.com/sgl-project/sglang.git
933
- cd sglang
934
- pip install -e "python[diffusion]"
935
- pip install "cosmos-guardrail==0.3.1"
936
-
937
- sglang serve \
938
- --model-path nvidia/Cosmos3-Super \
939
- --num-gpus 4
940
- ```
941
-
942
- Cosmos 3 support in SGLang Diffusion currently requires the SGLang main branch. Switch to a stable SGLang release once Cosmos 3 support is included there.
943
-
944
- For the video-specialized checkpoint:
945
-
946
- ```bash
947
- sglang serve \
948
- --model-path nvidia/Cosmos3-Super-Image2Video \
949
- --num-gpus 4
950
- ```
951
-
952
- Supported SGLang endpoints:
953
-
954
- | Mode | Endpoint | Notes |
955
- | --- | --- | --- |
956
- | Text to image | `POST /v1/images/generations` | Returns base64 image data by default |
957
- | Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` |
958
- | Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` |
959
-
960
- Example text-to-video request:
961
-
962
- ```bash
963
- job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
964
- --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
965
- --form-string "negative_prompt=blurry, distorted, low quality" \
966
- --form-string "size=1280x720" \
967
- --form-string "num_frames=81" \
968
- --form-string "fps=24" \
969
- --form-string "num_inference_steps=35" \
970
- --form-string "guidance_scale=4.0" \
971
- --form-string "flow_shift=10.0" \
972
- --form-string "seed=42" \
973
- --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
974
- | python -c 'import json, sys; print(json.load(sys.stdin)["id"])')
975
-
976
- while true; do
977
- status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" \
978
- | python -c 'import json, sys; print(json.load(sys.stdin)["status"])')
979
- [ "$status" = "completed" ] && break
980
- [ "$status" = "failed" ] && exit 1
981
- sleep 1
982
- done
983
-
984
- curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
985
- -o cosmos3_super_t2v_output.mp4
986
- ```
987
-
988
- Video-to-video, video-with-sound, and action generation are not supported by SGLang yet.
989
-
990
- For complete serving instructions and request examples, see the [Cosmos3 SGLang cookbook](https://docs.sglang.io/cookbook/diffusion/Cosmos/Cosmos3).
991
-
992
  ## Limitations
993
 
994
  Cosmos3 may produce imperfect outputs in challenging scenarios. Generation artifacts include temporal inconsistency, unstable camera or object motion, imprecise physical interactions, inaccurate audio-video synchronization, and action-state drift — especially in long-horizon or high-resolution outputs. Reasoning may also be incorrect: object states, causal relationships, spatial geometry, temporal ordering, agent intent, and future outcomes can be misinferred, and complex or long-context inputs may yield hallucinated entities, inconsistent interpretations, or implausible predictions. Because the model lacks an explicit physics simulator, 3D geometry, 4D space-time evolution, object permanence, contact dynamics, and physical laws are only approximated — producing artifacts such as disappearing or morphing objects, unrealistic collisions, and physically implausible motions. Quality further degrades in out-of-distribution environments, safety-critical edge cases, and domains underrepresented in training.
@@ -997,7 +960,7 @@ Cosmos3 outputs should not be treated as physically accurate simulation, reliabl
997
 
998
  ## Inference
999
 
1000
- **Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers), [SGLang](https://github.com/sgl-project/sglang), [SGLang Diffusion](https://docs.sglang.io/docs/sglang-diffusion/index)
1001
 
1002
  **Test Hardware:** GB200 and H100
1003
 
@@ -1009,4 +972,4 @@ Please make sure you have proper rights and permissions for all input image and
1009
 
1010
  Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
1011
 
1012
- For more detailed information on ethical considerations for this model, please see the Model Card++ [Explainability](EXPLAINABILITY.md), [Bias](BIAS.md), [Safety & Security](SAFETY.md), and [Privacy](PRIVACY.md) subcards. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
 
1
  ---
2
+ library_name: diffusers
3
  license: other
4
  license_name: openmdw1.1-license
5
+ license_link: https://openmdw.ai/license/1-1/
6
+ pipeline_tag: any-to-any
 
7
  tags:
8
+ - nvidia
9
+ - cosmos
10
+ - cosmos3
11
+ - vllm
12
+ - vllm-omni
13
+ - diffusers
14
+ - text, image, video, audio, and action generation
15
+ - omnimodel
 
 
16
  ---
17
 
18
  # **Cosmos 3: Omnimodal World Models for Physical AI**
19
+ **[Paper Page](https://huggingface.co/papers/2606.02800)** | **[Model Collection](https://huggingface.co/collections/nvidia/cosmos3)** | **[Code](https://github.com/nvidia/cosmos)** | **[White Paper](https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf)** | **[Website](https://research.nvidia.com/labs/cosmos-lab/cosmos3/)**
20
 
21
  [NVIDIA Cosmos™](https://github.com/nvidia/cosmos) is a world foundation model platform designed to accelerate the development of Physical AI by enabling machines to understand, simulate, and interact with the physical world across robotics, autonomous driving, and smart space environments, including industrial and factory-scale applications.
22
 
 
30
 
31
  **Model Developer:** NVIDIA
32
 
+ ### Sample Usage
+
+ You can use the model with the [diffusers](https://github.com/huggingface/diffusers) library as shown below:
+
+ ```python
+ import torch
+ from diffusers import Cosmos3OmniPipeline
+ from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
+ from diffusers.utils import export_to_video
+
+ pipe = Cosmos3OmniPipeline.from_pretrained(
+     "nvidia/Cosmos3-Super",
+     torch_dtype=torch.bfloat16,
+     device_map="cuda",
+ )
+ pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=10.0)
+
+ result = pipe(
+     prompt="A mobile robot navigates a warehouse aisle and stops at a shelf.",
+     num_frames=189,
+     height=720,
+     width=1280,
+     fps=24,
+     num_inference_steps=35,
+     guidance_scale=6.0,
+     generator=torch.Generator(device="cuda").manual_seed(123),
+ )
+
+ export_to_video(result.video, "cosmos3_super_t2v.mp4", fps=24)
+ ```
+
64
  ### Model Versions
65
  - Cosmos3-Nano:
66
  - Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.
 
198
  - [PyTorch](https://github.com/nvidia/cosmos3)
199
  - [vLLM-Omni](https://github.com/vllm-project/vllm-omni)
200
  - [Hugging Face Diffusers](https://huggingface.co/docs/diffusers/en/index)
 
201
 
202
  **Supported Hardware Microarchitecture Compatibility:**
203
 
 
847
  base_url="http://localhost:8000/v1",
848
  )
849
 
850
+ response = client.chat.completions.create(
851
  model=client.models.list().data[0].id,
852
  messages=[
853
  {
 
952
 
953
  <video controls width="1280" height="720" src="https://huggingface.co/nvidia/Cosmos3-Super/resolve/main/assets/example_t2v_diffusers_output.mp4"></video>
954
 
955
  ## Limitations
956
 
957
  Cosmos3 may produce imperfect outputs in challenging scenarios. Generation artifacts include temporal inconsistency, unstable camera or object motion, imprecise physical interactions, inaccurate audio-video synchronization, and action-state drift — especially in long-horizon or high-resolution outputs. Reasoning may also be incorrect: object states, causal relationships, spatial geometry, temporal ordering, agent intent, and future outcomes can be misinferred, and complex or long-context inputs may yield hallucinated entities, inconsistent interpretations, or implausible predictions. Because the model lacks an explicit physics simulator, 3D geometry, 4D space-time evolution, object permanence, contact dynamics, and physical laws are only approximated — producing artifacts such as disappearing or morphing objects, unrealistic collisions, and physically implausible motions. Quality further degrades in out-of-distribution environments, safety-critical edge cases, and domains underrepresented in training.
 
960
 
961
  ## Inference
962
 
963
+ **Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers)
964
 
965
  **Test Hardware:** GB200 and H100
966
 
 
972
 
973
  Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
974
 
975
+ For more detailed information on ethical considerations for this model, please see the Model Card++ [Explainability](EXPLAINABILITY.md), [Bias](BIAS.md), [Safety & Security](SAFETY.md), and [Privacy](PRIVACY.md) subcards. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
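
The front-matter change to README.md above (adding `library_name: diffusers` and `pipeline_tag: any-to-any`, and dropping the SGLang tags) can be sanity-checked with a few lines of Python. This is a minimal sketch, not Hub tooling: `parse_front_matter` is a hypothetical helper that only understands flat `key: value` pairs and `- item` lists, which is all this card uses. A real check would use a YAML library.

```python
def parse_front_matter(card: str) -> dict:
    """Parse flat `key: value` pairs and `- item` lists between the `---` markers.

    Hypothetical helper for illustration; it is not a general YAML parser.
    """
    lines = card.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta, current_list = {}, None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing marker ends the metadata block
            break
        if stripped.startswith("- ") and current_list is not None:
            current_list.append(stripped[2:])
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            value = value.strip()
            if value:
                meta[key.strip()] = value
                current_list = None
            else:  # a key with no inline value starts a list
                current_list = meta.setdefault(key.strip(), [])
    return meta


# Front matter as it appears on the new side of the diff.
card = """---
library_name: diffusers
license: other
license_name: openmdw1.1-license
license_link: https://openmdw.ai/license/1-1/
pipeline_tag: any-to-any
tags:
- nvidia
- cosmos
- cosmos3
- vllm
- vllm-omni
- diffusers
- text, image, video, audio, and action generation
- omnimodel
---
"""

meta = parse_front_matter(card)
print(meta["library_name"], meta["pipeline_tag"], len(meta["tags"]))
```

Checking the parsed keys rather than grepping the raw text makes it harder to miss a key that was accidentally left inside a folded scalar such as the old `license_link: >-` form.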
sound_tokenizer.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6daeb68a219f3e86c0918f616d78b9ebf073f3d700df63ff1c02d214c081d72d
+ size 1985246007
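
The three lines added for sound_tokenizer.ckpt are a Git LFS pointer, not the checkpoint itself. As a rough sketch, the pointer can be read as space-separated `key value` lines to recover the hash algorithm and the real file size, which here comes to roughly 1.85 GiB. `parse_lfs_pointer` is a hypothetical helper written for this note.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer file into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


# Pointer contents copied from the diff above.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:6daeb68a219f3e86c0918f616d78b9ebf073f3d700df63ff1c02d214c081d72d\n"
    "size 1985246007\n"
)

algo, _, digest = pointer["oid"].partition(":")
size_gib = int(pointer["size"]) / 2**30  # bytes to GiB

print(algo, len(digest), f"{size_gib:.2f} GiB")
```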
sound_tokenizer.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "model_type": "autoencoder_v2",
+   "sampling_rate": 48000,
+   "stereo": true,
+   "use_wav_as_input": true,
+   "normalize_volume": true,
+   "hop_size": 1920,
+   "input_channels": 1,
+   "enc_type": "spec_convnext",
+   "enc_dim": 192,
+   "enc_intermediate_dim": 768,
+   "enc_num_layers": 12,
+   "enc_num_blocks": 2,
+   "enc_n_fft": 64,
+   "enc_hop_length": 16,
+   "enc_latent_dim": 128,
+   "enc_c_mults": [1, 2, 4],
+   "enc_strides": [4, 5, 6],
+   "enc_identity_init": false,
+   "enc_use_snake": true,
+   "dec_type": "oobleck",
+   "dec_dim": 320,
+   "dec_c_mults": [1, 2, 4, 8, 16],
+   "dec_strides": [2, 4, 5, 6, 8],
+   "dec_use_snake": true,
+   "dec_final_tanh": false,
+   "dec_out_channels": 2,
+   "dec_anti_aliasing": false,
+   "dec_use_nearest_upsample": false,
+   "dec_use_tanh_at_final": false,
+   "bottleneck_type": "vae",
+   "bottleneck": {"type": "vae"},
+   "activation": "snakebeta",
+   "snake_logscale": true,
+   "anti_aliasing": false,
+   "use_cuda_kernel": false,
+   "causal": false,
+   "padding_mode": "zeros",
+   "vocoder_input_dim": 64,
+   "latent_mean": null,
+   "latent_std": null
+ }
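
The stride and hop values in sound_tokenizer.json above appear to be mutually consistent, which a short arithmetic check can confirm. The sketch below rests on two assumptions that the config does not state: that the encoder's total downsampling is `enc_hop_length` times the product of `enc_strides`, and that the decoder's total upsampling is the product of `dec_strides`. Under those assumptions both factors equal `hop_size`, and the latent frame rate is `sampling_rate / hop_size`.

```python
import json
from math import prod

# Subset of sound_tokenizer.json, values copied from the diff above.
config = json.loads("""
{
  "sampling_rate": 48000,
  "hop_size": 1920,
  "enc_hop_length": 16,
  "enc_strides": [4, 5, 6],
  "dec_strides": [2, 4, 5, 6, 8]
}
""")

# Assumption: encoder downsampling = STFT hop length x product of encoder strides.
enc_factor = config["enc_hop_length"] * prod(config["enc_strides"])
# Assumption: decoder upsampling = product of decoder strides.
dec_factor = prod(config["dec_strides"])
# Latent frames per second of audio.
latent_rate_hz = config["sampling_rate"] / config["hop_size"]

print(enc_factor, dec_factor, latent_rate_hz)
```

If either factor disagreed with `hop_size`, the encoder and decoder would operate on different time grids, so this is a cheap check to run whenever the config is edited.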