nvidia/Cosmos3-Super

@@ -1,24 +1,22 @@
 ---
 license: other
 license_name: openmdw1.1-license
-license_link: >-
-  https://openmdw.ai/license/1-1/
-library_name: cosmos
 tags:
-  - nvidia
-  - cosmos
-  - cosmos3
-  - vllm
-  - vllm-omni
-  - sglang
-  - sglang-diffusion
-  - diffusers
-  - text, image, video, audio, and action generation
-  - omnimodel
 ---
 # **Cosmos 3: Omnimodal World Models for Physical AI**
-**[Model Collection](https://huggingface.co/collections/nvidia/cosmos3)** | **[Code](https://github.com/nvidia/cosmos)** | **[White Paper](https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf)** | **[Website](https://research.nvidia.com/labs/cosmos-lab/cosmos3/)**
 [NVIDIA Cosmos™](https://github.com/nvidia/cosmos) is a world foundation model platform designed to accelerate the development of Physical AI by enabling machines to understand, simulate, and interact with the physical world across robotics, autonomous driving, and smart space environments, including industrial and factory-scale applications.
@@ -32,6 +30,37 @@ This model is ready for commercial and non-commercial use.
 **Model Developer:** NVIDIA
 ### Model Versions
 - Cosmos3-Nano:
   - Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.
@@ -169,7 +198,6 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
 - [PyTorch](https://github.com/nvidia/cosmos3)
 - [vLLM-Omni](https://github.com/vllm-project/vllm-omni)
 - [Hugging Face Diffusers](https://huggingface.co/docs/diffusers/en/index)
-- [SGLang](https://github.com/sgl-project/sglang)
 **Supported Hardware Microarchitecture Compatibility:**
@@ -819,7 +847,7 @@ client = openai.OpenAI(
     base_url="http://localhost:8000/v1",
 )
-response = client.chat.completions.create(
     model=client.models.list().data[0].id,
     messages=[
         {
@@ -924,71 +952,6 @@ Example output:
 <video controls width="1280" height="720" src="https://huggingface.co/nvidia/Cosmos3-Super/resolve/main/assets/example_t2v_diffusers_output.mp4"></video>
-### SGLang
-[SGLang Diffusion](https://docs.sglang.io/docs/sglang-diffusion/index) can serve `nvidia/Cosmos3-Super` through OpenAI-compatible image and video generation endpoints. Install SGLang from the main branch with diffusion dependencies, then start the server:
-```bash
-git clone --branch main https://github.com/sgl-project/sglang.git
-cd sglang
-pip install -e "python[diffusion]"
-pip install "cosmos-guardrail==0.3.1"
-sglang serve \
-  --model-path nvidia/Cosmos3-Super \
-  --num-gpus 4
-```
-Cosmos 3 support in SGLang Diffusion currently requires the SGLang main branch. Switch to a stable SGLang release once Cosmos 3 support is included there.
-For the video-specialized checkpoint:
-```bash
-sglang serve \
-  --model-path nvidia/Cosmos3-Super-Image2Video \
-  --num-gpus 4
-```
-Supported SGLang endpoints:
-| Mode | Endpoint | Notes |
-| --- | --- | --- |
-| Text to image | `POST /v1/images/generations` | Returns base64 image data by default |
-| Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` |
-| Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` |
-Example text-to-video request:
-```bash
-job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
-  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
-  --form-string "negative_prompt=blurry, distorted, low quality" \
-  --form-string "size=1280x720" \
-  --form-string "num_frames=81" \
-  --form-string "fps=24" \
-  --form-string "num_inference_steps=35" \
-  --form-string "guidance_scale=4.0" \
-  --form-string "flow_shift=10.0" \
-  --form-string "seed=42" \
-  --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
-  | python -c 'import json, sys; print(json.load(sys.stdin)["id"])')
-while true; do
-  status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" \
-    | python -c 'import json, sys; print(json.load(sys.stdin)["status"])')
-  [ "$status" = "completed" ] && break
-  [ "$status" = "failed" ] && exit 1
-  sleep 1
-done
-curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
-  -o cosmos3_super_t2v_output.mp4
-```
-Video-to-video, video-with-sound, and action generation are not supported by SGLang yet.
-For complete serving instructions and request examples, see the [Cosmos3 SGLang cookbook](https://docs.sglang.io/cookbook/diffusion/Cosmos/Cosmos3).
 ## Limitations
 Cosmos3 may produce imperfect outputs in challenging scenarios. Generation artifacts include temporal inconsistency, unstable camera or object motion, imprecise physical interactions, inaccurate audio-video synchronization, and action-state drift — especially in long-horizon or high-resolution outputs. Reasoning may also be incorrect: object states, causal relationships, spatial geometry, temporal ordering, agent intent, and future outcomes can be misinferred, and complex or long-context inputs may yield hallucinated entities, inconsistent interpretations, or implausible predictions. Because the model lacks an explicit physics simulator, 3D geometry, 4D space-time evolution, object permanence, contact dynamics, and physical laws are only approximated — producing artifacts such as disappearing or morphing objects, unrealistic collisions, and physically implausible motions. Quality further degrades in out-of-distribution environments, safety-critical edge cases, and domains underrepresented in training.
@@ -997,7 +960,7 @@ Cosmos3 outputs should not be treated as physically accurate simulation, reliabl
 ## Inference
-**Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers), [SGLang](https://github.com/sgl-project/sglang), [SGLang Diffusion](https://docs.sglang.io/docs/sglang-diffusion/index)
 **Test Hardware:** GB200 and H100
@@ -1009,4 +972,4 @@ Please make sure you have proper rights and permissions for all input image and
 Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
-For more detailed information on ethical considerations for this model, please see the Model Card++ [Explainability](EXPLAINABILITY.md), [Bias](BIAS.md), [Safety & Security](SAFETY.md), and [Privacy](PRIVACY.md) subcards. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

 ---
+library_name: diffusers
 license: other
 license_name: openmdw1.1-license
+license_link: https://openmdw.ai/license/1-1/
+pipeline_tag: any-to-any
 tags:
+- nvidia
+- cosmos
+- cosmos3
+- vllm
+- vllm-omni
+- diffusers
+- text, image, video, audio, and action generation
+- omnimodel
 ---
 # **Cosmos 3: Omnimodal World Models for Physical AI**
+**[Paper Page](https://huggingface.co/papers/2606.02800)** | **[Model Collection](https://huggingface.co/collections/nvidia/cosmos3)** | **[Code](https://github.com/nvidia/cosmos)** | **[White Paper](https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf)** | **[Website](https://research.nvidia.com/labs/cosmos-lab/cosmos3/)**
 [NVIDIA Cosmos™](https://github.com/nvidia/cosmos) is a world foundation model platform designed to accelerate the development of Physical AI by enabling machines to understand, simulate, and interact with the physical world across robotics, autonomous driving, and smart space environments, including industrial and factory-scale applications.
 **Model Developer:** NVIDIA
+### Sample Usage
+You can use the model with the [diffusers](https://github.com/huggingface/diffusers) library as shown below:
+```python
+import torch
+from diffusers import Cosmos3OmniPipeline
+from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
+from diffusers.utils import export_to_video
+pipe = Cosmos3OmniPipeline.from_pretrained(
+    "nvidia/Cosmos3-Super",
+    torch_dtype=torch.bfloat16,
+    device_map="cuda",
+)
+pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=10.0)
+result = pipe(
+    prompt="A mobile robot navigates a warehouse aisle and stops at a shelf.",
+    num_frames=189,
+    height=720,
+    width=1280,
+    fps=24,
+    num_inference_steps=35,
+    guidance_scale=6.0,
+    generator=torch.Generator(device="cuda").manual_seed(123),
+)
+export_to_video(result.video, "cosmos3_super_t2v.mp4", fps=24)
+```
 ### Model Versions
 - Cosmos3-Nano:
   - Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.
 - [PyTorch](https://github.com/nvidia/cosmos3)
 - [vLLM-Omni](https://github.com/vllm-project/vllm-omni)
 - [Hugging Face Diffusers](https://huggingface.co/docs/diffusers/en/index)
 **Supported Hardware Microarchitecture Compatibility:**
     base_url="http://localhost:8000/v1",
 )
+response = client.chat.completions.create(
     model=client.models.list().data[0].id,
     messages=[
         {
 <video controls width="1280" height="720" src="https://huggingface.co/nvidia/Cosmos3-Super/resolve/main/assets/example_t2v_diffusers_output.mp4"></video>
 ## Limitations
 Cosmos3 may produce imperfect outputs in challenging scenarios. Generation artifacts include temporal inconsistency, unstable camera or object motion, imprecise physical interactions, inaccurate audio-video synchronization, and action-state drift — especially in long-horizon or high-resolution outputs. Reasoning may also be incorrect: object states, causal relationships, spatial geometry, temporal ordering, agent intent, and future outcomes can be misinferred, and complex or long-context inputs may yield hallucinated entities, inconsistent interpretations, or implausible predictions. Because the model lacks an explicit physics simulator, 3D geometry, 4D space-time evolution, object permanence, contact dynamics, and physical laws are only approximated — producing artifacts such as disappearing or morphing objects, unrealistic collisions, and physically implausible motions. Quality further degrades in out-of-distribution environments, safety-critical edge cases, and domains underrepresented in training.
 ## Inference
+**Acceleration Engine:** [PyTorch](https://pytorch.org/), [vLLM](https://github.com/vllm-project/vllm), [vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Hugging Face Diffusers](https://github.com/huggingface/diffusers)
 **Test Hardware:** GB200 and H100
 Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
+For more detailed information on ethical considerations for this model, please see the Model Card++ [Explainability](EXPLAINABILITY.md), [Bias](BIAS.md), [Safety & Security](SAFETY.md), and [Privacy](PRIVACY.md) subcards. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

sound_tokenizer.ckpt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6daeb68a219f3e86c0918f616d78b9ebf073f3d700df63ff1c02d214c081d72d
+size 1985246007

sound_tokenizer.json ADDED

	@@ -0,0 +1,42 @@

+{
+    "model_type": "autoencoder_v2",
+    "sampling_rate": 48000,
+    "stereo": true,
+    "use_wav_as_input": true,
+    "normalize_volume": true,
+    "hop_size": 1920,
+    "input_channels": 1,
+    "enc_type": "spec_convnext",
+    "enc_dim": 192,
+    "enc_intermediate_dim": 768,
+    "enc_num_layers": 12,
+    "enc_num_blocks": 2,
+    "enc_n_fft": 64,
+    "enc_hop_length": 16,
+    "enc_latent_dim": 128,
+    "enc_c_mults": [1, 2, 4],
+    "enc_strides": [4, 5, 6],
+    "enc_identity_init": false,
+    "enc_use_snake": true,
+    "dec_type": "oobleck",
+    "dec_dim": 320,
+    "dec_c_mults": [1, 2, 4, 8, 16],
+    "dec_strides": [2, 4, 5, 6, 8],
+    "dec_use_snake": true,
+    "dec_final_tanh": false,
+    "dec_out_channels": 2,
+    "dec_anti_aliasing": false,
+    "dec_use_nearest_upsample": false,
+    "dec_use_tanh_at_final": false,
+    "bottleneck_type": "vae",
+    "bottleneck": {"type": "vae"},
+    "activation": "snakebeta",
+    "snake_logscale": true,
+    "anti_aliasing": false,
+    "use_cuda_kernel": false,
+    "causal": false,
+    "padding_mode": "zeros",
+    "vocoder_input_dim": 64,
+    "latent_mean": null,
+    "latent_std": null
+}
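
The stride and hop values in `sound_tokenizer.json` appear to be mutually consistent, which a quick check can confirm. The sketch below is not part of the PR and rests on three assumptions that the config itself does not state: that `hop_size` is the number of audio samples per latent frame, that the encoder's total downsampling is `enc_hop_length` times the product of `enc_strides`, and that the decoder's total upsampling is the product of `dec_strides`. Under those assumptions both factors come out to 1920, matching `hop_size`, and the latent rate is 25 frames per second at 48 kHz.

```python
import json
from math import prod

# Subset of the sound_tokenizer.json fields shown above, copied verbatim.
config = json.loads("""
{
    "sampling_rate": 48000,
    "hop_size": 1920,
    "enc_hop_length": 16,
    "enc_strides": [4, 5, 6],
    "dec_strides": [2, 4, 5, 6, 8]
}
""")

# Assumption: hop_size is the number of audio samples per latent frame.
latent_rate_hz = config["sampling_rate"] / config["hop_size"]

# Assumption: encoder downsampling = spectrogram hop * product of strides.
encoder_downsample = config["enc_hop_length"] * prod(config["enc_strides"])

# Assumption: decoder upsampling = product of decoder strides.
decoder_upsample = prod(config["dec_strides"])

print(f"latent frames per second: {latent_rate_hz}")
print(f"encoder downsampling factor: {encoder_downsample}")
print(f"decoder upsampling factor: {decoder_upsample}")
```

If either factor differed from `hop_size`, the decoded waveform length would not match the input length, so a check like this can be a cheap sanity test when editing the config.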