mlx-community/MOSS-SoundEffect-v2.0-4bit
This model, mlx-community/MOSS-SoundEffect-v2.0-4bit, was converted to MLX format from OpenMOSS-Team/MOSS-SoundEffect-v2.0, a text-to-sound-effect diffusion pipeline (foley / ambience / creature / action audio, 48 kHz, up to 30 s) with a 1.3B Wan-style flow-matching DiT, a continuous 128-d DAC VAE (50 Hz latents), and a frozen Qwen3-1.7B text encoder.
Precision: DiT int4 (group_size 64, transformer-block Linears only; embeddings, time/text projections, head, and norms stay bf16), DAC-VAE fp32, Qwen3 text encoder bf16.
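To make "int4, group_size 64" concrete, here is a minimal pure-Python sketch of group-wise affine quantization: each run of 64 weights gets its own scale and offset, and every weight becomes a 4-bit code. It illustrates the general technique only, not the MLX kernel or this repo's conversion script. The function names are hypothetical, and the 16-bit scale and bias per group used in the size estimate is an assumption.

```python
# Illustrative group-wise affine quantization (hypothetical helper names).

def quantize_group(weights, bits=4):
    """Map one group of floats to unsigned integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1            # 15 distinct steps for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo             # lo acts as the per-group bias


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats from codes plus the group's scale and bias."""
    return [c * scale + bias for c in codes]


def quantize_row(row, group_size=64, bits=4):
    """Quantize a weight row in independent groups of `group_size` values."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


def effective_bits_per_weight(bits=4, group_size=64, meta_bits=16):
    """Storage cost per weight, assuming one scale and one bias of
    `meta_bits` each per group (an assumption, not a measured value)."""
    return bits + 2 * meta_bits / group_size
```

Under that assumption the quantized Linears cost about 4.5 bits per weight instead of 16. The layers kept in bf16 add to that, so the whole DiT shrinks by less than the per-layer ratio.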
Use with mlx
pip install moss-sfx-mlx # https://github.com/xocialize/moss-soundeffect-mlx
from moss_sfx_mlx.pipeline_mlx import MossSoundEffectPipeline
pipe = MossSoundEffectPipeline.from_pretrained("mlx-community/MOSS-SoundEffect-v2.0-4bit")
audio = pipe(prompt="a heavy wooden door creaks open slowly",
             seconds=5, num_inference_steps=100, cfg_scale=4.0, seed=0)
# audio: (1, 1, samples) mx.array at 48 kHz
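The snippet above stops at an in-memory array. One way to save it is sketched below, using only the standard library to write mono 16-bit PCM at 48 kHz. The helper name `write_wav_mono` is hypothetical, and the package may ship its own save utility; this version only assumes you can flatten the output to a sequence of floats in [-1, 1].

```python
import struct
import wave


def write_wav_mono(path, samples, sample_rate=48_000):
    """Write floats in [-1, 1] as a mono 16-bit PCM WAV; returns frames written."""
    frames = bytearray()
    count = 0
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))             # clip out-of-range values
        frames += struct.pack("<h", int(round(s * 32767)))
        count += 1
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)                           # mono
        wav.setsampwidth(2)                           # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return count
```

With the pipeline output, a call might look like `write_wav_mono("door.wav", np.array(audio).reshape(-1).tolist())`, assuming NumPy is installed and the array converts directly.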
Parity
Validated against the upstream PyTorch reference (fp32, CPU stream, per-module and end-to-end golden tensors; full suite in the GitHub repo):
End-to-end waveform vs PyTorch golden (10-step CFG denoise): max_abs < 1e-2 fp32
Full-DiT velocity field at production scale (T=1500): max_abs < 1e-2 fp32
DAC-VAE decode vs reference: max_abs < 1e-2 fp32 (no scale constant; the learned post_quant_conv is faithful)
Qwen3 hidden states: cosine 1.0, max_abs 4.4e-4 (fp32 accumulation floor)
int4 DiT per-pass cosine vs bf16 on identical injected inputs: 0.999425 (gate 0.99)
10-prompt perceptual A/B at 100 steps: passed human review (correct content, duration, no tonal artifacts)
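The two numeric metrics quoted above, max_abs and cosine, can be computed as in the sketch below for flattened tensors. It is a plain-Python illustration of the definitions, not the repo's test suite. The helper names are hypothetical, and the default thresholds simply mirror the gates listed above.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two flattened tensors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def passes_parity(candidate, golden, max_abs_gate=1e-2, cosine_gate=0.99):
    """True when both the max-abs and the cosine gate hold."""
    return (max_abs_diff(candidate, golden) < max_abs_gate
            and cosine_similarity(candidate, golden) >= cosine_gate)
```

The two metrics catch different failures. max_abs is sensitive to a single outlier sample, while cosine reflects overall direction and is the more natural gate for quantized outputs, where small errors are spread across every element.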
Performance (Apple M5 Max)
100 steps, cfg 4.0, full 30 s latent: 45 s wall clock, 12.2 GB peak memory; DiT shrinks 2.83 GB -> 0.83 GB.
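The figures on this card fit together arithmetically: 50 Hz latents over 30 s give the T=1500 sequence used in the parity test, and each latent frame covers 960 samples at 48 kHz. The sketch below spells that out. The helper names are hypothetical, and rounding to the nearest frame is an assumption about how a requested duration maps to a latent length.

```python
SAMPLE_RATE_HZ = 48_000   # output audio rate stated on this card
LATENT_RATE_HZ = 50       # DAC-VAE latent rate stated on this card


def latent_frames(seconds):
    """Number of latent frames the DiT denoises for a clip of this duration."""
    return int(round(seconds * LATENT_RATE_HZ))


def audio_samples(seconds):
    """Number of waveform samples produced for a clip of this duration."""
    return int(round(seconds * SAMPLE_RATE_HZ))


def real_time_factor(wall_clock_s, audio_s):
    """Seconds of compute per second of audio (below 1.0 is faster than real time)."""
    return wall_clock_s / audio_s


print(latent_frames(30))                   # 1500
print(audio_samples(30))                   # 1440000
print(SAMPLE_RATE_HZ // LATENT_RATE_HZ)    # 960 samples per latent frame
print(real_time_factor(45, 30))            # 1.5
```

By the same arithmetic, the 5 s example in the usage snippet corresponds to 250 latent frames and 240,000 samples.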
License
Apache-2.0, matching the upstream model, code, and all components.