# X-VC
Official code release for **X-VC: Zero-shot Streaming Voice Conversion in Codec Space**.
## Environment Setup

### 1. Clone
```bash
git clone https://github.com/Jerrister/X-VC.git
cd X-VC
```
### 2. Create conda environment and install dependencies
```bash
conda create -n xvc python=3.10 -y
conda activate xvc
pip install -U pip
pip install -r requirements.txt
```
### 3. Prepare pretrained models

Download the following pretrained models:
- GLM-4-Voice-Tokenizer (for semantic tokenization)
- ERes2Net speaker encoder (for speaker feature extraction)
Then set the paths in `configs/xvc.yaml`, especially:

- `model.generator.semantic_encoder.encoder.from_pretrained`
- `model.generator.semantic_encoder.cfg`
- `model.generator.speaker_encoder.pretrained_dir`
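For reference, the corresponding section of `configs/xvc.yaml` might look like the sketch below. The key nesting is inferred from the dotted names above and the paths are placeholders; check the shipped config for the exact layout.

```yaml
model:
  generator:
    semantic_encoder:
      encoder:
        from_pretrained: /path/to/GLM-4-Voice-Tokenizer   # semantic tokenizer
      cfg: /path/to/semantic_encoder_config               # placeholder
    speaker_encoder:
      pretrained_dir: /path/to/ERes2Net                   # speaker encoder
```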
### 4. Prepare checkpoints
Put checkpoints under `ckpts/`, for example:

```
ckpts/
  xvc.pt
```
## Inference

### Single-pair Inference
```bash
bash scripts/infer_single.sh
```
Key arguments in this script:

- `current=0` for offline inference.
- `current>0` for streaming inference.
- `chunk` / `current` / `future` / `smooth` control streaming behavior.
Outputs are saved under `save_dir` (default: `outputs/xvc_single`).
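To make the streaming arguments concrete, here is a minimal, hypothetical sketch of chunk-wise processing with lookahead. `convert_chunk` is a placeholder for the model call, and the actual X-VC implementation may slice, overlap, and smooth frames differently; this only illustrates the general chunk/future pattern.

```python
def convert_chunk(current_frames, future_frames):
    """Placeholder conversion (identity). A real streaming model would also
    condition on the lookahead (future) frames before emitting this chunk."""
    return list(current_frames)

def stream_convert(frames, chunk=8, future=2):
    """Process `frames` in windows of `chunk`, peeking `future` frames ahead."""
    out = []
    for start in range(0, len(frames), chunk):
        cur = frames[start:start + chunk]                   # frames emitted now
        look = frames[start + chunk:start + chunk + future] # lookahead context
        out.extend(convert_chunk(cur, look))
    return out
```

Larger `chunk` and `future` values generally trade latency for quality, since the model sees more context before emitting each chunk.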
### Batch Offline Inference (SeedTTS-eval as example)
Use `scripts/batch_infer_seedtts_offline.sh`:

```bash
bash scripts/batch_infer_seedtts_offline.sh
```
This script reports:

- `saved_dir`
- `total_rtf`
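As a reminder of what the `total_rtf` metric means (this is the standard definition, not the script's actual implementation), the real-time factor is wall-clock processing time divided by audio duration:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration.
    RTF < 1 means the system runs faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds
```

For example, converting 20 s of audio in 5 s of compute gives an RTF of 0.25.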
### Batch Streaming Inference (SeedTTS-eval as example)
Use `scripts/batch_infer_seedtts_stream.sh`:

```bash
bash scripts/batch_infer_seedtts_stream.sh
```
This script reports:

- `saved_dir`
- `avg_latency_ms`
## Training

### Step 1: Prepare pretrained dependencies
Before training, prepare the required pretrained dependencies:
- SAC pretrained checkpoint(s) (for model initialization)
Then set the corresponding paths in `configs/xvc.yaml`, especially:

- `model.generator.checkpoint`
- `model.discriminator.checkpoint`
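These keys might be set as in the following sketch; the checkpoint filenames are placeholders, not actual release artifacts.

```yaml
model:
  generator:
    checkpoint: /path/to/sac_generator.pt       # SAC pretrained generator
  discriminator:
    checkpoint: /path/to/sac_discriminator.pt   # SAC pretrained discriminator
```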
### Step 2: Prepare training data
Organize your training/validation data in JSONL format and set:

- `datasets.train`
- `datasets.val`

in `configs/xvc.yaml`.
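A corresponding sketch in `configs/xvc.yaml` (the paths are placeholders):

```yaml
datasets:
  train: /path/to/train.jsonl   # JSONL manifest for training
  val: /path/to/val.jsonl       # JSONL manifest for validation
```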
### Step 3: Modify training configs
You can adjust training behavior in:

- `configs/xvc.yaml` (main training config)
- `configs/ds_stage2.json` (DeepSpeed config)
### Step 4: Start training
Use `scripts/train.sh`:

```bash
bash scripts/train.sh
```
Notes:

- The default training engine is DeepSpeed (`configs/ds_stage2.json`).
- The main experiment config is `configs/xvc.yaml`.
- Set your `WANDB_API_KEY` in `scripts/train.sh` before running if you use wandb logging.
## Data Format
The training config (`configs/xvc.yaml`) points to JSONL manifest files via:

- `datasets.train`
- `datasets.val`
Each JSONL line should be a JSON object.
Required fields:

- `target_utt`
- `source_wav_path`
- `target_wav_path`
Optional field:

- `source_utt`
Minimal example:

```json
{"source_utt": "utt_0001", "source_wav_path": "<path_to_source>", "target_utt": "utt_0002", "target_wav_path": "<path_to_target>"}
```
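One way to generate such manifest lines programmatically is sketched below; `manifest_line` is a hypothetical helper, not part of the repo, and simply follows the required/optional field list above.

```python
import json

def manifest_line(target_utt, source_wav_path, target_wav_path, source_utt=None):
    """Serialize one utterance pair as a JSONL manifest line.
    `source_utt` is optional, matching the data format described above."""
    entry = {
        "target_utt": target_utt,
        "source_wav_path": source_wav_path,
        "target_wav_path": target_wav_path,
    }
    if source_utt is not None:
        entry["source_utt"] = source_utt
    return json.dumps(entry)
```

Writing one such line per pair into a `.jsonl` file yields a manifest in the expected format.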
## Acknowledgements
This codebase builds upon open-source components from SAC and the broader audio generation ecosystem.
## Citation
If you find our work useful in your research, please consider citing:
```bibtex
@misc{zheng2026xvczeroshotstreamingvoice,
      title={X-VC: Zero-shot Streaming Voice Conversion in Codec Space},
      author={Qixi Zheng and Yuxiang Zhao and Tianrui Wang and Wenxi Chen and Kele Xu and Yikang Li and Qinyuan Chen and Xipeng Qiu and Kai Yu and Xie Chen},
      year={2026},
      eprint={2604.12456},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2604.12456},
}
```
## License
This project is licensed under the MIT License.