# X-VC
Official code release for **X-VC: Zero-shot Streaming Voice Conversion in Codec Space**.
## Environment Setup

### 1. Clone
```bash
git clone https://github.com/Jerrister/X-VC.git
cd X-VC
```
### 2. Create conda environment and install dependencies
```bash
conda create -n xvc python=3.10 -y
conda activate xvc
pip install -U pip
pip install -r requirements.txt
```
### 3. Prepare pretrained models

Download the following pretrained models:
- GLM-4-Voice-Tokenizer (for semantic tokenization)
- ERes2Net speaker encoder (for speaker feature extraction)
Then set the paths in `configs/xvc.yaml`, especially:

- `model.generator.semantic_encoder.encoder.from_pretrained`
- `model.generator.semantic_encoder.cfg`
- `model.generator.speaker_encoder.pretrained_dir`
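For reference, the corresponding section of `configs/xvc.yaml` might look like the sketch below. The key nesting is inferred from the dotted names above and the paths are placeholders; check the shipped config for the exact layout.

```yaml
model:
  generator:
    semantic_encoder:
      encoder:
        from_pretrained: /path/to/GLM-4-Voice-Tokenizer   # semantic tokenizer
      cfg: /path/to/semantic_encoder_config               # placeholder
    speaker_encoder:
      pretrained_dir: /path/to/ERes2Net                   # speaker encoder
```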
### 4. Prepare checkpoints
Put checkpoints under `ckpts/`, for example:

```
ckpts/
  xvc.pt
```
## Inference

### Single-pair Inference
```bash
bash scripts/infer_single.sh
```
Key arguments in this script:

- `current=0` for offline inference.
- `current>0` for streaming inference.
- `chunk` / `current` / `future` / `smooth` control streaming behavior.
Outputs are saved under `save_dir` (default: `outputs/xvc_single`).
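To make the streaming arguments concrete, here is a minimal, hypothetical sketch of chunk-wise processing with lookahead. `convert_chunk` is a placeholder for the model call, and the actual X-VC implementation may slice, overlap, and smooth frames differently; this only illustrates the general chunk/future pattern.

```python
def convert_chunk(current_frames, future_frames):
    """Placeholder conversion (identity). A real streaming model would also
    condition on the lookahead (future) frames before emitting this chunk."""
    return list(current_frames)

def stream_convert(frames, chunk=8, future=2):
    """Process `frames` in windows of `chunk`, peeking `future` frames ahead."""
    out = []
    for start in range(0, len(frames), chunk):
        cur = frames[start:start + chunk]                   # frames emitted now
        look = frames[start + chunk:start + chunk + future] # lookahead context
        out.extend(convert_chunk(cur, look))
    return out
```

Larger `chunk` and `future` values generally trade latency for quality, since the model sees more context before emitting each chunk.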
### Batch Offline Inference (SeedTTS-eval as example)
Use `scripts/batch_infer_seedtts_offline.sh`:

```bash
bash scripts/batch_infer_seedtts_offline.sh
```
This script reports:

- `saved_dir`
- `total_rtf`
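As a reminder of what the `total_rtf` metric means (this is the standard definition, not the script's actual implementation), the real-time factor is wall-clock processing time divided by audio duration:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration.
    RTF < 1 means the system runs faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds
```

For example, converting 20 s of audio in 5 s of compute gives an RTF of 0.25.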
### Batch Streaming Inference (SeedTTS-eval as example)
Use `scripts/batch_infer_seedtts_stream.sh`:

```bash
bash scripts/batch_infer_seedtts_stream.sh
```
This script reports:

- `saved_dir`
- `avg_latency_ms`
## Training

### Step 1: Prepare pretrained dependencies
Before training, prepare the required pretrained dependencies:
- SAC pretrained checkpoint(s) (for model initialization)
Then set the corresponding paths in `configs/xvc.yaml`, especially:

- `model.generator.checkpoint`
- `model.discriminator.checkpoint`
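These keys might be set as in the following sketch; the checkpoint filenames are placeholders, not actual release artifacts.

```yaml
model:
  generator:
    checkpoint: /path/to/sac_generator.pt       # SAC pretrained generator
  discriminator:
    checkpoint: /path/to/sac_discriminator.pt   # SAC pretrained discriminator
```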
### Step 2: Prepare training data
Organize your training/validation data in JSONL format and set:

- `datasets.train`
- `datasets.val`

in `configs/xvc.yaml`.
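A corresponding sketch in `configs/xvc.yaml` (the paths are placeholders):

```yaml
datasets:
  train: /path/to/train.jsonl   # JSONL manifest for training
  val: /path/to/val.jsonl       # JSONL manifest for validation
```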
### Step 3: Modify training configs
You can adjust training behavior in:

- `configs/xvc.yaml` (main training config)
- `configs/ds_stage2.json` (DeepSpeed config)
### Step 4: Start training
Use `scripts/train.sh`:

```bash
bash scripts/train.sh
```
Notes:

- The default training engine is DeepSpeed (`configs/ds_stage2.json`).
- The main experiment config is `configs/xvc.yaml`.
- Set your `WANDB_API_KEY` in `scripts/train.sh` before running if you use wandb logging.
## Data Format
The training config (`configs/xvc.yaml`) points to JSONL manifest files via:

- `datasets.train`
- `datasets.val`
Each JSONL line should be a JSON object.
Required fields:

- `target_utt`
- `source_wav_path`
- `target_wav_path`
Optional field:

- `source_utt`
Minimal example:

```json
{"source_utt": "utt_0001", "source_wav_path": "<path_to_source>", "target_utt": "utt_0002", "target_wav_path": "<path_to_target>"}
```
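One way to generate such manifest lines programmatically is sketched below; `manifest_line` is a hypothetical helper, not part of the repo, and simply follows the required/optional field list above.

```python
import json

def manifest_line(target_utt, source_wav_path, target_wav_path, source_utt=None):
    """Serialize one utterance pair as a JSONL manifest line.
    `source_utt` is optional, matching the data format described above."""
    entry = {
        "target_utt": target_utt,
        "source_wav_path": source_wav_path,
        "target_wav_path": target_wav_path,
    }
    if source_utt is not None:
        entry["source_utt"] = source_utt
    return json.dumps(entry)
```

Writing one such line per pair into a `.jsonl` file yields a manifest in the expected format.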
## Acknowledgements
This codebase builds upon open-source components from SAC and the broader audio generation ecosystem.
## Citation
If you find our work useful in your research, please consider citing:
```bibtex
@misc{zheng2026xvczeroshotstreamingvoice,
      title={X-VC: Zero-shot Streaming Voice Conversion in Codec Space},
      author={Qixi Zheng and Yuxiang Zhao and Tianrui Wang and Wenxi Chen and Kele Xu and Yikang Li and Qinyuan Chen and Xipeng Qiu and Kai Yu and Xie Chen},
      year={2026},
      eprint={2604.12456},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2604.12456},
}
```
## License
This project is licensed under the MIT License.