# RVQ Proxy Network
A lightweight differentiable surrogate network that maps from Qwen3-TTS RVQ embedding space to high-level perceptual audio features: speaker embedding, wav2vec2 content features, and mel spectrogram.
## Purpose

During voice conversion training, the standard pipeline (logits → argmax → RVQ tokens → decoder → waveform → feature extractors) is non-differentiable. The RVQ Proxy replaces this chain with a tiny differentiable network, enabling end-to-end training without audio decoding:

```
model logits → softmax → E_soft → RVQProxy → speaker / wav2vec / mel
```
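The soft embedding `E_soft` can be formed as a probability-weighted sum over the RVQ codebook, which keeps the path from logits to embeddings differentiable. A minimal sketch; the shapes and the `codebook` tensor here are illustrative assumptions, not the actual Qwen3-TTS internals:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: B=batch, T=frames, V=codebook size, D=embedding dim.
B, T, V, D = 2, 50, 1024, 512
logits = torch.randn(B, T, V)      # model logits over RVQ codebook entries
codebook = torch.randn(V, D)       # RVQ codebook embedding table (assumed)

# Soft, differentiable embedding: an expectation over codebook entries
# instead of a hard argmax lookup.
probs = F.softmax(logits, dim=-1)  # [B, T, V]
E_soft = probs @ codebook          # [B, T, D]
```

At inference time the hard `argmax` path is still used; the soft path exists only so gradients can reach the TTS model during training.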
## Architecture

- Shared temporal encoder: 3-layer 1D conv (receptive field ~560 ms) with GroupNorm + GELU
- Speaker head: 2-layer MLP + mean pooling → 2048-dim speaker embedding
- Wav2vec head: single linear projection → 768-dim features
- Mel head: 2-layer MLP → 80-bin mel spectrogram
Parameters: ~6.7M
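The description above can be sketched as a PyTorch module. This is a minimal sketch only: the stated output dims (2048 / 768 / 80) come from the text, but the hidden width, kernel sizes, and pooling order are assumptions:

```python
import torch
import torch.nn as nn

class RVQProxySketch(nn.Module):
    """Illustrative sketch of the architecture described above."""
    def __init__(self, input_dim=512, hidden=256, num_speaker_dims=2048):
        super().__init__()
        # Shared temporal encoder: 3-layer 1D conv + GroupNorm + GELU.
        layers, in_ch = [], input_dim
        for _ in range(3):
            layers += [nn.Conv1d(in_ch, hidden, kernel_size=5, padding=2),
                       nn.GroupNorm(8, hidden), nn.GELU()]
            in_ch = hidden
        self.encoder = nn.Sequential(*layers)
        # Speaker head: 2-layer MLP, mean-pooled over time.
        self.speaker_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, num_speaker_dims))
        self.wav2vec_head = nn.Linear(hidden, 768)  # single linear projection
        self.mel_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 80))

    def forward(self, e_soft, mask=None):
        # e_soft: [B, T, input_dim]; mask (optional): float [B, T].
        h = self.encoder(e_soft.transpose(1, 2)).transpose(1, 2)  # [B, T, H]
        spk = self.speaker_head(h)                                # [B, T, 2048]
        if mask is not None:  # masked mean over valid frames
            spk = (spk * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        else:
            spk = spk.mean(dim=1)                                 # [B, 2048]
        return {"speaker": spk,
                "wav2vec": self.wav2vec_head(h),                  # [B, T, 768]
                "mel": self.mel_head(h)}                          # [B, T, 80]

demo_out = RVQProxySketch()(torch.randn(2, 50, 512))
```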
## Checkpoints

| File | Description |
|---|---|
| `rvq_proxy_10k.pt` | Best checkpoint (val speaker cosine = 0.9925) |
| `rvq_proxy_10k_final.pt` | Final-epoch checkpoint (epoch 20) |
Both checkpoints include metadata (`input_dim`, `num_speaker_dims`) for easy loading.
## Usage

```python
from exiv.components.models.qwen3_tts.sern.rvq_proxy import RVQProxy
import torch

# Load checkpoint; stored metadata configures the model.
ckpt = torch.load("rvq_proxy_10k.pt", map_location="cpu")
proxy = RVQProxy(
    input_dim=ckpt["input_dim"],
    num_speaker_dims=ckpt["num_speaker_dims"],
)
proxy.load_state_dict(ckpt["proxy_state"])
proxy.eval().cuda()

# Forward pass. E_soft holds soft RVQ embeddings from the TTS model;
# mask is an optional padding mask over time frames.
out = proxy(E_soft, mask=mask)  # E_soft: [B, T, 512]
speaker = out["speaker"]        # [B, 2048]
wav2vec = out["wav2vec"]        # [B, T, 768]
mel = out["mel"]                # [B, T, 80]
```
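In a voice-conversion training loop, the proxy outputs can be matched against target features precomputed from reference audio. A hedged sketch: the loss weights, target names, and the stand-in tensors below are illustrative assumptions, not part of this repository:

```python
import torch
import torch.nn.functional as F

B, T = 2, 50
# Stand-ins for proxy outputs (in practice, out = proxy(E_soft)) and for
# target features extracted offline from reference audio.
out = {
    "speaker": torch.randn(B, 2048, requires_grad=True),
    "wav2vec": torch.randn(B, T, 768, requires_grad=True),
    "mel": torch.randn(B, T, 80, requires_grad=True),
}
speaker_tgt = torch.randn(B, 2048)
wav2vec_tgt = torch.randn(B, T, 768)
mel_tgt = torch.randn(B, T, 80)

# Perceptual losses on the proxy outputs; unit weights are an assumption.
loss = (
    1.0 - F.cosine_similarity(out["speaker"], speaker_tgt, dim=-1).mean()
    + F.l1_loss(out["wav2vec"], wav2vec_tgt)
    + F.l1_loss(out["mel"], mel_tgt)
)
loss.backward()  # gradients flow back through E_soft into the TTS model
```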
## Requirements

- PyTorch ≥ 2.0
- See Exiv for full integration with Qwen3-TTS
## License
MIT