HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation
Abstract
HiFi-WaveGAN synthesizes high-fidelity 48kHz singing voices in real-time using an Extended WaveNet generator and multi-resolution discriminators with pulse extraction and spectrogram-phase loss.
Entertainment-oriented singing voice synthesis (SVS) requires a vocoder to generate high-fidelity (e.g. 48kHz) audio. However, most text-to-speech (TTS) vocoders cannot reconstruct the waveform well in this scenario. In this paper, we propose HiFi-WaveGAN to synthesize the 48kHz high-quality singing voices in real-time. Specifically, it consists of an Extended WaveNet served as a generator, a multi-period discriminator proposed in HiFiGAN, and a multi-resolution spectrogram discriminator borrowed from UnivNet. To better reconstruct the high-frequency part from the full-band mel-spectrogram, we incorporate a pulse extractor to generate the constraint for the synthesized waveform. Additionally, an auxiliary spectrogram-phase loss is utilized to approximate the real distribution further. The experimental results show that our proposed HiFi-WaveGAN obtains 4.23 in the mean opinion score (MOS) metric for the 48kHz SVS task, significantly outperforming other neural vocoders.
Models citing this paper 1
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper