X-VC


Official code release for X-VC: Zero-shot Streaming Voice Conversion in Codec Space.

Environment Setup

1. Clone

git clone https://github.com/Jerrister/X-VC.git
cd X-VC

2. Create conda environment and install dependencies

conda create -n xvc python=3.10 -y
conda activate xvc
pip install -U pip
pip install -r requirements.txt

3. Prepare pretrained models

Prepare:

Then set paths in configs/xvc.yaml, especially:

  • model.generator.semantic_encoder.encoder.from_pretrained
  • model.generator.semantic_encoder.cfg
  • model.generator.speaker_encoder.pretrained_dir
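
Before running inference, it can help to confirm that these three keys are actually set. The sketch below is a minimal check and not part of the repository: it assumes the loaded config is a nested mapping whose levels follow the dotted key names listed above, and the helper names `get_by_dotted_key` and `find_unset_keys` are hypothetical.

```python
from typing import Any

# Dotted keys named in the setup step above.
REQUIRED_KEYS = [
    "model.generator.semantic_encoder.encoder.from_pretrained",
    "model.generator.semantic_encoder.cfg",
    "model.generator.speaker_encoder.pretrained_dir",
]


def get_by_dotted_key(cfg: dict, dotted_key: str) -> Any:
    """Walk a nested dict along a dotted key; return None if a level is missing."""
    node: Any = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def find_unset_keys(cfg: dict, keys=REQUIRED_KEYS) -> list:
    """Return the dotted keys whose value is missing or empty."""
    return [key for key in keys if not get_by_dotted_key(cfg, key)]
```

If PyYAML is installed, the config could be loaded with `yaml.safe_load` and passed to `find_unset_keys`; an empty result would mean all three paths have a value. Whether the paths exist on disk is a separate check.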

4. Prepare checkpoints

Put checkpoints under ckpts/, for example:

ckpts/
  xvc.pt

Inference

Single-pair Inference

Use scripts/infer_single.sh.

bash scripts/infer_single.sh

Key arguments in this script:

  • current=0 for offline inference.
  • current>0 for streaming inference.
  • chunk/current/future/smooth control streaming behavior.

Outputs are saved under save_dir (default: outputs/xvc_single).

Batch Offline Inference (SeedTTS-eval as example)

Use scripts/batch_infer_seedtts_offline.sh.

bash scripts/batch_infer_seedtts_offline.sh

This script reports:

  • saved_dir
  • total_rtf
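
For reading total_rtf: real-time factor (RTF) is conventionally processing time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below illustrates that convention only. How the script itself aggregates across utterances is not stated here, so summing times before dividing is an assumption, and the function names are hypothetical.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def aggregate_rtf(measurements) -> float:
    """Aggregate RTF over (processing_seconds, audio_seconds) pairs.

    Assumption: total processing time over total audio duration,
    rather than the mean of per-utterance RTFs.
    """
    measurements = list(measurements)
    total_processing = sum(proc for proc, _ in measurements)
    total_audio = sum(audio for _, audio in measurements)
    return real_time_factor(total_processing, total_audio)
```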

Batch Streaming Inference (SeedTTS-eval as example)

Use scripts/batch_infer_seedtts_stream.sh.

bash scripts/batch_infer_seedtts_stream.sh

This script reports:

  • saved_dir
  • avg_latency_ms

Training

Step 1: Prepare pretrained dependencies

Before training, prepare the required pretrained dependencies:

Then set corresponding paths in configs/xvc.yaml, especially:

  • model.generator.checkpoint
  • model.discriminator.checkpoint

Step 2: Prepare training data

Organize your training/validation data in JSONL format and set:

  • datasets.train
  • datasets.val

in configs/xvc.yaml.
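
A manifest in this format can be generated with a few lines of Python. The following is a rough sketch rather than a script shipped with the repository: it uses the field names from the Data Format section, and deriving utterance IDs from the file stem is an assumption made for illustration. The function name `write_manifest` is hypothetical.

```python
import json
from pathlib import Path


def write_manifest(pairs, out_path) -> int:
    """Write one JSON object per line for each (source_wav, target_wav) pair.

    Returns the number of lines written.
    """
    count = 0
    with Path(out_path).open("w", encoding="utf-8") as f:
        for source_wav, target_wav in pairs:
            record = {
                # Assumption: utterance IDs are taken from the file stems.
                "source_utt": Path(source_wav).stem,
                "source_wav_path": str(source_wav),
                "target_utt": Path(target_wav).stem,
                "target_wav_path": str(target_wav),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

The resulting file paths are what datasets.train and datasets.val would point to.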

Step 3: Modify training configs

You can adjust training behavior in:

Step 4: Start training

Use scripts/train.sh.

bash scripts/train.sh

Notes:

  • The default training engine is DeepSpeed (configs/ds_stage2.json).
  • The main experiment config is configs/xvc.yaml.
  • If you use wandb logging, set WANDB_API_KEY in scripts/train.sh before running.

Data Format

The training config, configs/xvc.yaml, points to the JSONL files through:

  • datasets.train
  • datasets.val

Each JSONL line should be a JSON object.

Required fields:

  • target_utt
  • source_wav_path
  • target_wav_path

Optional field:

  • source_utt

Minimal example:

{"source_utt":"utt_0001","source_wav_path":"<path_to_source>","target_utt":"utt_0002","target_wav_path":"<path_to_target>"}
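
The required fields above can be checked before training starts. This is a minimal validation sketch, not code from the repository: it only verifies that each line parses as a JSON object and contains the three required fields, and the function names are hypothetical.

```python
import json

# Required fields as listed above; source_utt is optional.
REQUIRED_FIELDS = ("target_utt", "source_wav_path", "target_wav_path")


def validate_manifest_line(line: str) -> list:
    """Return a list of problems for one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]
    return [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in record
    ]


def validate_manifest(path: str) -> dict:
    """Map 1-based line numbers to their problems; blank lines are skipped."""
    problems = {}
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            issues = validate_manifest_line(line)
            if issues:
                problems[lineno] = issues
    return problems
```

An empty dict from `validate_manifest` means every non-blank line has the required fields. It does not check that the wav paths exist.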

Acknowledgements

This codebase builds upon open-source components from SAC and the broader audio generation ecosystem.

Citation

If you find our work useful in your research, please consider citing:

@misc{zheng2026xvczeroshotstreamingvoice,
      title={X-VC: Zero-shot Streaming Voice Conversion in Codec Space}, 
      author={Qixi Zheng and Yuxiang Zhao and Tianrui Wang and Wenxi Chen and Kele Xu and Yikang Li and Qinyuan Chen and Xipeng Qiu and Kai Yu and Xie Chen},
      year={2026},
      eprint={2604.12456},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2604.12456},
}

License

This project is licensed under the MIT License.
