Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model

Accepted by INTERSPEECH 2023


Abstract

Expressive human speech abounds with rich and flexible prosody variations. The prosody predictors in existing expressive speech synthesis methods mostly produce deterministic predictions, which are learned by directly minimizing the norm of the prosody prediction error. The unimodal nature of such predictions leads to a mismatch with the ground-truth distribution and harms the model's ability to make diverse predictions. We therefore propose a novel prosody predictor based on the denoising diffusion probabilistic model, taking advantage of its high-quality generative modeling and training stability. Experimental results confirm that the proposed prosody predictor outperforms the deterministic baseline in both the expressiveness and the diversity of its predictions, with even fewer network parameters.

Proposed prosody predictor & expressive TTS system

The proposed prosody predictor is a denoising diffusion probabilistic model (DDPM) over 3-dimensional data \(x_0\), whose three components are the phoneme-wise fundamental frequency (F0), energy, and duration (in number of frames).
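To make the data layout concrete, the following is a minimal NumPy sketch of the standard DDPM forward (noising) process and noise-prediction loss applied to a phoneme-by-3 prosody array. It is an illustration under stated assumptions, not the authors' implementation: the linear beta schedule, the number of diffusion steps, the epsilon-prediction objective, and all function names (`make_schedule`, `q_sample`, `noise_prediction_loss`) are hypothetical choices that this page does not specify.

```python
import numpy as np


def make_schedule(num_steps=100, beta_start=1e-4, beta_end=0.05):
    """Linear beta schedule (an assumption; the page does not state the schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas_cumprod = np.cumprod(1.0 - betas)  # \bar{alpha}_t
    return betas, alphas_cumprod


def q_sample(x0, t, alphas_cumprod, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(a_bar) * x0, (1 - a_bar) * I)."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    return x_t, noise


def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and injected noise."""
    return float(np.mean((predicted_noise - true_noise) ** 2))


rng = np.random.default_rng(0)
# x0: one row per phoneme, columns = (F0, energy, duration).
# Random values stand in for normalized prosody features.
x0 = rng.standard_normal((12, 3))
betas, alphas_cumprod = make_schedule()
x_t, noise = q_sample(x0, t=50, alphas_cumprod=alphas_cumprod, rng=rng)
```

In a real system, a neural network conditioned on the text encoding and the step index would predict `noise` from `x_t`, and `noise_prediction_loss` would be minimized over random steps `t`.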

As shown in Fig. 1, we combine the proposed DDPM-based prosody predictor with a pre-trained TTS backbone based on FastSpeech2 (FS2) to construct an expressive TTS system, in which the phoneme-wise F0, energy, and duration features serve as the prosody representation.


Fig.1: Proposed expressive TTS system architecture.

Evaluation

We evaluate the proposed method on a private Mandarin audiobook dataset containing 28.23 hours of speech audio. For comparison, we also train the original prosody predictors in FS2 with a mean squared error (MSE) loss as the deterministic baseline. This baseline contains 60.5% more network parameters than the proposed prosody predictor.
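The limitation of an MSE-trained deterministic predictor can be illustrated with a toy example. The sketch below uses synthetic data, not data from the paper: it assumes that the same text admits two equally plausible prosody realizations (a "rising" and a "falling" normalized F0 value) and finds the single prediction that minimizes MSE by grid search.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic bimodal targets: two plausible F0 values for the same input
# (toy numbers chosen for illustration only).
rising = rng.normal(loc=1.0, scale=0.1, size=500)
falling = rng.normal(loc=-1.0, scale=0.1, size=500)
targets = np.concatenate([rising, falling])

# Search for the single deterministic prediction with the lowest MSE.
candidates = np.linspace(-1.5, 1.5, 301)
mse = np.array([np.mean((targets - c) ** 2) for c in candidates])
best = candidates[np.argmin(mse)]

print(f"MSE-optimal prediction: {best:.2f}")
print(f"Mean of targets:        {targets.mean():.2f}")
```

The MSE-optimal value coincides with the mean of the targets, which lies between the two modes and matches neither. A generative predictor can instead sample from either mode.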

Expressiveness

Script | Proposed | Baseline | FS2 with Ground Truth Prosody | Ground Truth Audio (Reconstructed)
乱飞的火团儿腾在了空中,又变成了黑色的蝴蝶儿,妖冶地飘飘四散。
以往看到如此情景,是诗情画意的。今天,则是揪着心地疼啊。
汪半城夹起一只螃蟹,勾起了当年闯关东差点儿让螃蟹撑破了肚皮的事儿,便说给大家听。
前年,调署承德知县,本来是顺风顺水,不想却丢了官差。
小死老婆儿,儿子才当个县案首,就把你得瑟得没老没少的啊,想把你爹气死咋的?
我咋气他了?啊我咋气他了他外孙子当了县案首,报个喜儿咋还气着他了呢?
哼,一家儿去了四个,连一个秀才的毛儿都没摸着,还大夫第呢,丢不丢人?
啊,我咋不知道呐?他砦四海敢背着我干这缺德的事儿,看我回去怎么收拾他!

Diversity

We demonstrate the diversity of the proposed method by sampling multiple times for the same text. As shown in the following table, the predicted prosody varies across runs, resulting in diverse synthesized spectrograms. At the same time, the predicted prosody remains coherent with the text, preserving the naturalness of the synthesized audio at an acceptable level.
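Repeated sampling of this kind can be sketched with the standard DDPM ancestral sampler. The code below is a minimal illustration under stated assumptions, not the system used for the samples on this page: `toy_denoiser` is a hypothetical closed-form stand-in for the trained network (it is the optimal noise predictor when the data are Gaussian around a text-dependent mean), and the schedule, step count, and `text_mean` values are invented for the example.

```python
import numpy as np


def toy_denoiser(x_t, a_bar, text_mean, data_std=0.5):
    """Hypothetical stand-in for the trained network: the optimal noise
    estimate E[eps | x_t] when x_0 ~ N(text_mean, data_std^2 * I)."""
    var_t = a_bar * data_std ** 2 + (1.0 - a_bar)
    return np.sqrt(1.0 - a_bar) * (x_t - np.sqrt(a_bar) * text_mean) / var_t


def sample_prosody(text_mean, seed, num_steps=100):
    """DDPM ancestral sampling; a different seed gives a different prosody sample."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.05, num_steps)  # assumed linear schedule
    alphas = 1.0 - betas
    alphas_cumprod = np.cumprod(alphas)

    x = rng.standard_normal(text_mean.shape)  # start from Gaussian noise
    for t in range(num_steps - 1, -1, -1):
        eps_hat = toy_denoiser(x, alphas_cumprod[t], text_mean)
        mean = (x - betas[t] / np.sqrt(1.0 - alphas_cumprod[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Add fresh noise at every step except the last one.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean
    return x


# Eight phonemes, columns = (F0, energy, duration); values are illustrative.
text_mean = np.tile(np.array([0.2, -0.1, 0.5]), (8, 1))
runs = [sample_prosody(text_mean, seed) for seed in range(3)]
```

Each entry of `runs` is a different phoneme-by-3 prosody array for the same text, while all of them stay close to the text-dependent mean. This mirrors, in a toy setting, the behavior described above: variation across runs without losing coherence with the text.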

Script      
瞧着遭大罪吧!富老将军人是不错.
平均开五垧荒,累折裤衩带儿也完不成啊!
官兵屯丁脱了老棉袄,换上了汗衫子。
南风吹到了胸脯子上,像相好儿姑娘的小手儿在抚摩。
不小心被火舌舔着了翅膀儿的,一头扎进了火海。