# RVQ Proxy Network
A lightweight differentiable surrogate network that maps from Qwen3-TTS RVQ embedding space to high-level perceptual audio features: speaker embedding, wav2vec2 content features, and mel spectrogram.
## Purpose

During voice conversion training, the standard pipeline (logits → argmax → RVQ tokens → decoder → waveform → feature extractors) is non-differentiable. The RVQ Proxy replaces this chain with a tiny differentiable network, enabling end-to-end training without audio decoding:

```
model logits → softmax → E_soft → RVQProxy → speaker / wav2vec / mel
```
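The soft embedding `E_soft` can be formed as a probability-weighted sum over the RVQ codebook, which keeps the path from logits to embeddings differentiable. A minimal sketch; the shapes and the `codebook` tensor here are illustrative assumptions, not the actual Qwen3-TTS internals:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: B=batch, T=frames, V=codebook size, D=embedding dim.
B, T, V, D = 2, 50, 1024, 512
logits = torch.randn(B, T, V)      # model logits over RVQ codebook entries
codebook = torch.randn(V, D)       # RVQ codebook embedding table (assumed)

# Soft, differentiable embedding: an expectation over codebook entries
# instead of a hard argmax lookup.
probs = F.softmax(logits, dim=-1)  # [B, T, V]
E_soft = probs @ codebook          # [B, T, D]
```

At inference time the hard `argmax` path is still used; the soft path exists only so gradients can reach the TTS model during training.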
## Architecture

- Shared temporal encoder: 3-layer 1D conv (receptive field ~560 ms) with GroupNorm + GELU
- Speaker head: 2-layer MLP + mean pooling → 2048-dim speaker embedding
- Wav2vec head: single linear projection → 768-dim features
- Mel head: 2-layer MLP → 80-bin mel spectrogram
Parameters: ~6.7M
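The description above can be sketched as a PyTorch module. This is a minimal sketch only: the stated output dims (2048 / 768 / 80) come from the text, but the hidden width, kernel sizes, and pooling order are assumptions:

```python
import torch
import torch.nn as nn

class RVQProxySketch(nn.Module):
    """Illustrative sketch of the architecture described above."""
    def __init__(self, input_dim=512, hidden=256, num_speaker_dims=2048):
        super().__init__()
        # Shared temporal encoder: 3-layer 1D conv + GroupNorm + GELU.
        layers, in_ch = [], input_dim
        for _ in range(3):
            layers += [nn.Conv1d(in_ch, hidden, kernel_size=5, padding=2),
                       nn.GroupNorm(8, hidden), nn.GELU()]
            in_ch = hidden
        self.encoder = nn.Sequential(*layers)
        # Speaker head: 2-layer MLP, mean-pooled over time.
        self.speaker_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, num_speaker_dims))
        self.wav2vec_head = nn.Linear(hidden, 768)  # single linear projection
        self.mel_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 80))

    def forward(self, e_soft, mask=None):
        # e_soft: [B, T, input_dim]; mask (optional): float [B, T].
        h = self.encoder(e_soft.transpose(1, 2)).transpose(1, 2)  # [B, T, H]
        spk = self.speaker_head(h)                                # [B, T, 2048]
        if mask is not None:  # masked mean over valid frames
            spk = (spk * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        else:
            spk = spk.mean(dim=1)                                 # [B, 2048]
        return {"speaker": spk,
                "wav2vec": self.wav2vec_head(h),                  # [B, T, 768]
                "mel": self.mel_head(h)}                          # [B, T, 80]

demo_out = RVQProxySketch()(torch.randn(2, 50, 512))
```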
## Checkpoints

| File | Description |
|---|---|
| `rvq_proxy_10k.pt` | Best checkpoint (val speaker cosine = 0.9925) |
| `rvq_proxy_10k_final.pt` | Final-epoch checkpoint (epoch 20) |
Both checkpoints include metadata (`input_dim`, `num_speaker_dims`) for easy loading.
## Usage

```python
from exiv.components.models.qwen3_tts.sern.rvq_proxy import RVQProxy
import torch

# Load checkpoint; stored metadata configures the model.
ckpt = torch.load("rvq_proxy_10k.pt", map_location="cpu")
proxy = RVQProxy(
    input_dim=ckpt["input_dim"],
    num_speaker_dims=ckpt["num_speaker_dims"],
)
proxy.load_state_dict(ckpt["proxy_state"])
proxy.eval().cuda()

# Forward pass. E_soft holds soft RVQ embeddings from the TTS model;
# mask is an optional padding mask over time frames.
out = proxy(E_soft, mask=mask)  # E_soft: [B, T, 512]
speaker = out["speaker"]        # [B, 2048]
wav2vec = out["wav2vec"]        # [B, T, 768]
mel = out["mel"]                # [B, T, 80]
```
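In a voice-conversion training loop, the proxy outputs can be matched against target features precomputed from reference audio. A hedged sketch: the loss weights, target names, and the stand-in tensors below are illustrative assumptions, not part of this repository:

```python
import torch
import torch.nn.functional as F

B, T = 2, 50
# Stand-ins for proxy outputs (in practice, out = proxy(E_soft)) and for
# target features extracted offline from reference audio.
out = {
    "speaker": torch.randn(B, 2048, requires_grad=True),
    "wav2vec": torch.randn(B, T, 768, requires_grad=True),
    "mel": torch.randn(B, T, 80, requires_grad=True),
}
speaker_tgt = torch.randn(B, 2048)
wav2vec_tgt = torch.randn(B, T, 768)
mel_tgt = torch.randn(B, T, 80)

# Perceptual losses on the proxy outputs; unit weights are an assumption.
loss = (
    1.0 - F.cosine_similarity(out["speaker"], speaker_tgt, dim=-1).mean()
    + F.l1_loss(out["wav2vec"], wav2vec_tgt)
    + F.l1_loss(out["mel"], mel_tgt)
)
loss.backward()  # gradients flow back through E_soft into the TTS model
```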
## Requirements

- PyTorch ≥ 2.0
- See Exiv for full integration with Qwen3-TTS
## License
MIT