LongCat-AudioDiT Env-TTS — 10000-step (independent noise / room-RIR / mic-IR augmentation)

Fine-tune of meituan-longcat/LongCat-AudioDiT-1B for the three-stream env-tts task: given a reference environment audio, a reference speaker audio, and three text streams (env caption / speaker caption / target speech text), generate target speech that places the target text in the referenced environment with the referenced speaker timbre.

This is the "rir" variant, trained with full acoustic-scene augmentation. Three augmentation dimensions are drawn as independent Bernoulli variables (p=0.5 each): additive noise, room-RIR reverb, and microphone-IR colouration. Enabled dimensions are applied in the order RIR → mic IR → noise, so the added noise is not coloured by either impulse response. The spk reference gets its own independent draw, while env and target share one realization (same noise clip, same RIR, same mic IR, same SNR/wet), i.e. one acoustic scene, so the model learns env-consistent generation. Compare with the -ablation (no augmentation) and -augment (spk-only noise+RIR) checkpoints to isolate the effect.
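The sampling scheme described above can be sketched roughly as follows. This is an illustration only, not the training repo's code: the names `Scene`, `draw_scene`, and `draw_example` and the pool sizes are hypothetical, and only the draw logic is shown (no audio is processed).

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Scene:
    """One sampled acoustic scene; None means that dimension is switched off."""
    rir_id: Optional[int]
    mic_ir_id: Optional[int]
    noise_id: Optional[int]
    snr_db: Optional[float]

    def chain(self) -> List[str]:
        # Fixed application order: RIR -> mic IR -> noise.
        # Noise comes last, so it is not filtered by either impulse response.
        steps = []
        if self.rir_id is not None:
            steps.append("rir")
        if self.mic_ir_id is not None:
            steps.append("mic_ir")
        if self.noise_id is not None:
            steps.append("noise")
        return steps


def draw_scene(rng: random.Random, p: float = 0.5,
               n_rir: int = 100, n_mic: int = 100, n_noise: int = 100) -> Scene:
    """Draw three independent Bernoulli(p) switches, then the scene parameters.

    The pool sizes (n_rir, n_mic, n_noise) are placeholders for illustration.
    """
    use_rir = rng.random() < p
    use_mic = rng.random() < p
    use_noise = rng.random() < p
    return Scene(
        rir_id=rng.randrange(n_rir) if use_rir else None,
        mic_ir_id=rng.randrange(n_mic) if use_mic else None,
        noise_id=rng.randrange(n_noise) if use_noise else None,
        snr_db=rng.uniform(-5.0, 15.0) if use_noise else None,
    )


def draw_example(rng: random.Random) -> dict:
    """env and target share one coupled draw; spk gets its own draw."""
    shared = draw_scene(rng)
    spk = draw_scene(rng)
    return {"env": shared, "target": shared, "spk": spk}
```

Because env and target reference the same `Scene` object, any clip, impulse response, or SNR chosen for one is by construction identical for the other, which is the coupling the paragraph describes.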

Differences from the base model

The transformer adds six learnable boundary tokens (three latent-space, three text-space):

latent sequence : [<boe>  z_env  <bos>  z_spk  <bon>  z_target]
text sequence   : [<boe_t> env_text_emb <bos_t> spk_text_emb <bon_t> target_text_emb]
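As a toy illustration of this layout, the sketch below interleaves three boundary markers with three streams. It is only a stand-in: in the model the boundary tokens are learnable embedding vectors and the streams are latent or text-embedding frames, whereas here plain strings are used, and the helper name `interleave` is hypothetical.

```python
from typing import List, Sequence


def interleave(boundaries: Sequence[str], streams: Sequence[Sequence[str]]) -> List[str]:
    """Return [b0, *s0, b1, *s1, ...]: each stream is preceded by its boundary token."""
    if len(boundaries) != len(streams):
        raise ValueError("need exactly one boundary token per stream")
    seq: List[str] = []
    for token, stream in zip(boundaries, streams):
        seq.append(token)
        seq.extend(stream)
    return seq


# Toy frames standing in for latent vectors (stream lengths may differ).
z_env = ["e0", "e1"]
z_spk = ["s0"]
z_target = ["t0", "t1", "t2"]

latent_seq = interleave(["<boe>", "<bos>", "<bon>"], [z_env, z_spk, z_target])
text_seq = interleave(
    ["<boe_t>", "<bos_t>", "<bon_t>"],
    [["env_txt"], ["spk_txt"], ["tgt_txt"]],
)
```

Each assembled sequence is three positions longer than the concatenated streams, one per boundary token.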

encode_multistream_text(env, spk, target, drop_env_text=…, drop_spk_text=…, drop_target_text=…) is the new text-encoding entry point. AudioDiTModel.forward(...) also accepts a pre-assembled prompt_latent in place of prompt_audio, so the inference path can feed the boundary-tokenized three-stream prompt directly.

Training summary

Field	Value
Steps	10000 (~1.7 epochs of the 379k-row train split)
Effective batch	16 × grad_accum 4 × 1 GPU = 64 rows / step
Learning rate	cosine 5e-5 (warmup 250)
AdamW	β₁=0.9, β₂=0.999, wd=0.01
EMA	disabled
LoRA	r=32, alpha=32, target = attn + ffn
Full-train	boundary tokens + AdaLN + text_conv + latent_embed + latent_cond_embedder + input_embed + output_proj + time_embed
Audio filter	target duration ∈ [3, 15] s
RMS normalize	three-stream independent to -23 dBFS (target_rms=0.0708); re-normalized after augmentation; peak-clip 0.5
Augmentation	three independent Bernoulli dims, p=0.5 each — noise (SNR ~ U[-5, 15] dB), room RIR (wet=1.0, tail 1.0 s), mic IR (peak-crop pre 5 ms + tail 0.10 s, unit-energy, aligned same-length convolve, RMS-matched, colour 1.0). Chain RIR → mic → noise. spk: independent draw; env+target: coupled draw (one shared acoustic scene).
Augment sources	noise/RIR streamed from DNS-Noise `noise`/`rir`; mic IRs from its `mic_ir` split (8,644 IRs: MADIR studio mics / MicIRP vintage / CTF device mics), sampled dataset-balanced (1/3 each) with off-axis weighting
Data	ChristianYang/Env-TTS-Clean
Final CFM loss	≈ 0.91 (mean of last 180 steps; near the augmentation-induced conditional-variance floor ≈ 0.6–0.9)
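Several figures in the table can be cross-checked with a few lines of arithmetic. The sketch below makes two assumptions: "379k" is taken as 379,000 rows, and SNR is taken to be defined on RMS amplitude (the table does not say how the training code measures it). The helper `snr_to_noise_gain` is hypothetical.

```python
# Effective batch: micro-batch x gradient accumulation x GPU count.
micro_batch, grad_accum, num_gpus = 16, 4, 1
rows_per_step = micro_batch * grad_accum * num_gpus

# Epochs covered by the run ("379k" assumed to mean 379,000 rows).
steps, train_rows = 10_000, 379_000
epochs = steps * rows_per_step / train_rows

# dBFS to linear RMS: rms = 10 ** (dBFS / 20).
target_dbfs = -23.0
target_rms = 10 ** (target_dbfs / 20)


def snr_to_noise_gain(speech_rms: float, noise_rms: float, snr_db: float) -> float:
    """Gain g such that 20*log10(speech_rms / (g * noise_rms)) equals snr_db.

    Assumes an RMS-based SNR definition; the training code may differ.
    """
    return speech_rms / (noise_rms * 10 ** (snr_db / 20))


# At the two ends of the SNR range U[-5, 15] dB, for equal-RMS speech and noise:
gain_low_snr = snr_to_noise_gain(target_rms, target_rms, -5.0)   # noise louder than speech
gain_high_snr = snr_to_noise_gain(target_rms, target_rms, 15.0)  # noise well below speech
```

Under these assumptions the numbers agree with the table: 64 rows per step, roughly 1.7 epochs, and a target RMS of about 0.0708.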

Evaluation

Generate-then-score evaluation on the held-out test split is pending for this checkpoint. For the suite (WER / speaker-sim / CLAP env-sim / audiobox PQ) and sibling-model numbers, see -ablation.

How to load

The model uses custom code in this repo, so pass trust_remote_code=True:

from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "ChristianYang/LongCat-AudioDiT-Env-TTS-1B-rir",
    trust_remote_code=True,
).cuda().eval()

tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder_model)

For end-to-end env-tts inference (three-stream prompt + ASR fallback for missing env/spk text) see the training repo's tasks/inference.py.

License

Inherits the original meituan-longcat/LongCat-AudioDiT-1B license.

Safetensors

Model size: 1B params
Tensor type: F32
