RVQ Proxy Network

A lightweight differentiable surrogate network that maps the Qwen3-TTS RVQ embedding space to high-level perceptual audio features: a speaker embedding, wav2vec2 content features, and a mel spectrogram.

Purpose

During voice conversion training, the standard pipeline (logits → argmax → RVQ tokens → decoder → waveform → feature extractors) is non-differentiable: gradients stop at the argmax, and decoding audio just to re-extract features is expensive. The RVQ Proxy replaces everything downstream of the logits with a tiny differentiable network, enabling end-to-end training without audio decoding.

model logits β†’ softmax β†’ E_soft β†’ RVQProxy β†’ speaker / wav2vec / mel
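
Here E_soft is the expected embedding under the softmax distribution rather than a hard codebook lookup, so gradients flow through the softmax instead of stopping at argmax. A minimal sketch of that step (the function name soft_embed, the codebook tensor, and the temperature tau are illustrative assumptions, not this package's API):

import torch
import torch.nn.functional as F

def soft_embed(logits: torch.Tensor, codebook: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """logits: [B, T, K] over K codebook entries; codebook: [K, D].
    Returns E_soft: [B, T, D], a differentiable probability-weighted
    mixture of code vectors replacing argmax -> token -> embedding lookup."""
    probs = F.softmax(logits / tau, dim=-1)  # [B, T, K], differentiable
    return probs @ codebook                  # [B, T, D]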

Architecture

  • Shared temporal encoder: 3-layer 1D conv (receptive field ~560ms) with GroupNorm + GELU
  • Speaker head: 2-layer MLP + mean pooling β†’ 2048-dim speaker embedding
  • Wav2vec head: Single linear projection β†’ 768-dim features
  • Mel head: 2-layer MLP β†’ 80-bin mel spectrogram

Parameters: ~6.7M
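
A rough sketch of how such a network might be wired in PyTorch; the kernel sizes, hidden width, and masked mean pooling below are illustrative assumptions, not the exact RVQProxy definition:

import torch
import torch.nn as nn

class RVQProxySketch(nn.Module):
    """Illustration of the layout described above; hyperparameters are guesses."""
    def __init__(self, input_dim=512, hidden=512, num_speaker_dims=2048):
        super().__init__()
        # Shared temporal encoder: 3 x (Conv1d -> GroupNorm -> GELU)
        self.encoder = nn.Sequential(
            nn.Conv1d(input_dim, hidden, kernel_size=5, padding=2),
            nn.GroupNorm(8, hidden), nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.GroupNorm(8, hidden), nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.GroupNorm(8, hidden), nn.GELU(),
        )
        # Speaker head: 2-layer MLP on mean-pooled features -> 2048 dims
        self.speaker_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, num_speaker_dims))
        # Wav2vec head: single linear projection -> 768 dims per frame
        self.wav2vec_head = nn.Linear(hidden, 768)
        # Mel head: 2-layer MLP -> 80 mel bins per frame
        self.mel_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 80))

    def forward(self, e_soft, mask=None):
        # Conv1d expects [B, C, T]; e_soft arrives as [B, T, C]
        h = self.encoder(e_soft.transpose(1, 2)).transpose(1, 2)  # [B, T, hidden]
        if mask is not None:  # mean-pool only over valid (unpadded) frames
            w = mask.unsqueeze(-1).float()
            pooled = (h * w).sum(1) / w.sum(1).clamp(min=1.0)
        else:
            pooled = h.mean(1)
        return {
            "speaker": self.speaker_head(pooled),  # [B, 2048]
            "wav2vec": self.wav2vec_head(h),       # [B, T, 768]
            "mel": self.mel_head(h),               # [B, T, 80]
        }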

Checkpoints

File                     Description
rvq_proxy_10k.pt         Best checkpoint (val speaker cosine = 0.9925)
rvq_proxy_10k_final.pt   Final checkpoint (epoch 20)

Both checkpoints include metadata (input_dim, num_speaker_dims) for easy loading.

Usage

import torch

from exiv.components.models.qwen3_tts.sern.rvq_proxy import RVQProxy

ckpt = torch.load("rvq_proxy_10k.pt", map_location="cpu")
proxy = RVQProxy(
    input_dim=ckpt["input_dim"],
    num_speaker_dims=ckpt["num_speaker_dims"]
)
proxy.load_state_dict(ckpt["proxy_state"])
proxy.eval().cuda()

# Forward pass
out = proxy(E_soft, mask=mask)  # E_soft: [B, T, 512]
speaker = out["speaker"]        # [B, 2048]
wav2vec = out["wav2vec"]        # [B, T, 768]
mel = out["mel"]                # [B, T, 80]
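
The mask is presumably a per-frame validity mask for padded batches; assuming a [B, T] boolean convention with True on valid frames, it can be built from sequence lengths like this (the length_mask helper is hypothetical, not part of the package):

import torch

def length_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    """lengths: [B] frame counts -> bool mask [B, max_len], True where valid."""
    return torch.arange(max_len, device=lengths.device)[None, :] < lengths[:, None]

mask = length_mask(torch.tensor([120, 95]), max_len=120)  # example batch of 2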

Requirements

  • PyTorch β‰₯ 2.0
  • See Exiv for full integration with Qwen3-TTS

License

MIT
