
DictTTS-Demo

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Abstract

Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional effort from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and the countless languages worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model that uses an online dictionary website (prior information that already exists in natural language). Specifically, we design a semantics-to-pronunciation attention (S2PA) module to match the semantic patterns between the input text sequence and the prior semantics in the dictionary and obtain the corresponding pronunciations; the S2PA module can be easily trained with the end-to-end TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems. Further extensive analyses with different linguistic encoders demonstrate that each design in Dict-TTS is effective.
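To give an intuition for the dictionary-lookup idea behind S2PA, here is a toy sketch (not the actual S2PA module, which learns soft attention weights end-to-end): score each dictionary sense of a polyphonic character against the sentence context and take the best-matching sense's pronunciation. The miniature dictionary and all function names below are invented for illustration.

```python
# Toy illustration of dictionary-guided pronunciation selection.
# Each entry maps a polyphonic character to (sense gloss, pronunciation) pairs;
# this tiny dictionary is made up for the sketch.
TOY_DICT = {
    "乐": [
        ("happy joyful enjoy", "le4"),     # sense 1: happiness
        ("music musical instrument", "yue4"),  # sense 2: music
    ],
}

def overlap_score(context_tokens, gloss):
    """Count how many context tokens appear in the sense gloss."""
    gloss_tokens = set(gloss.split())
    return sum(1 for tok in context_tokens if tok in gloss_tokens)

def choose_pronunciation(char, context_tokens):
    """Pick the pronunciation whose sense gloss best matches the context."""
    senses = TOY_DICT[char]
    _gloss, pron = max(senses, key=lambda s: overlap_score(context_tokens, s[0]))
    return pron

# The same character resolves differently depending on semantic context:
print(choose_pronunciation("乐", ["classical", "music"]))     # yue4
print(choose_pronunciation("乐", ["joyful", "celebration"]))  # le4
```

In Dict-TTS this hard argmax is replaced by differentiable attention over dictionary entries, so the matching is learned jointly with the TTS model.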

Audio Samples

We provide audio samples generated by the TTS systems in our experiments on three datasets: BiaoBei (a Mandarin dataset), JSUT (a Japanese dataset), and CommonVoice-HK (a Cantonese dataset).

BiaoBei (Mandarin, single-speaker)

  1. zài shuǐ xuán de wǎng wài luàn bèng
     (Fish are swirling out in the water)
     Audio: GT, GT (voc.), Character, Bert emb., NLR, Phoneme (G2PM), Phoneme (pypinyin), Dict-TTS
  2. yōng huì yán , rén yǒu xìng , yǒu wàng xìng
     (Needless to say, people have memory and forgetfulness)
     Audio: GT, GT (voc.), Character, Bert emb., NLR, Phoneme (G2PM), Phoneme (pypinyin), Dict-TTS
  3. niǎo ér zhā zhā , zòu qǐ chén qū
     (Birds chirp and sing in the morning)
     Audio: GT, GT (voc.), Character, Bert emb., NLR, Phoneme (G2PM), Phoneme (pypinyin), Dict-TTS
  4. xiān shī de yán jiū yíng hàn
     (The research field of the master is vast and boundless)
     Audio: GT, GT (voc.), Character, Bert emb., NLR, Phoneme (G2PM), Phoneme (pypinyin), Dict-TTS

JSUT (Japanese, single-speaker)

  1. 末期試験に備えて本当に気合いを入れて勉強しなきゃ
     (I have to study hard in preparation for the final exam)
     Audio: GT, GT (voc.), Character, Phoneme (pyopenjtalk), Dict-TTS
  2. 計画をたてることとそれを実行する事は別問題です
     (Making a plan and executing it are separate issues)
     Audio: GT, GT (voc.), Character, Phoneme (pyopenjtalk), Dict-TTS
  3. 迷惑をおかけして申し訳ありません
     (Sorry for the inconvenience)
     Audio: GT, GT (voc.), Character, Phoneme (pyopenjtalk), Dict-TTS

CommonVoice-HK (Cantonese, multi-speaker)

  1. 有個老人去左牛池灣沐翠街食齋
     (The old man went to Mucui Street in Niuchi Bay to eat vegetarian food)
     Audio: GT, Character, Phoneme (pycantonese), Dict-TTS
  2. 呃得一時唔呃得一世
     (You can cheat for a while, but you can't cheat for a lifetime)
     Audio: GT, Character, Phoneme (pycantonese), Dict-TTS
  3. 試看前幾天街上男女的樣子
     (Just look at the men and women on the street a few days ago)
     Audio: GT, Character, Phoneme (pycantonese), Dict-TTS

Verification Code

You can run the following scripts to verify the G2P pipeline errors shown on this demo page. These checks were done on 20 May 2022; please install the corresponding versions of the G2P pipelines.

# pip install pypinyin==0.46.0
import pypinyin
res = pypinyin.pinyin("鸟儿喳喳,奏起晨曲")
print(res)
# [['niǎo'], ['ér'], ['zhā'], ['zhā'], [','], ['zòu'], ['qǐ'], ['chén'], ['qū']]

# pip install g2pM
from g2pM import G2pM
model = G2pM()
res = model("鸟儿喳喳,奏起晨曲", tone=True, char_split=True)
print(res)
# ['niao3', 'er2', 'cha1', 'zha1', ',', 'zou4', 'qi3', 'chen2', 'qu3']
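To pinpoint where the two pipelines above disagree, a small helper can normalize pypinyin's diacritic syllables to tone-numbered form and diff the two sequences. This helper is an ad hoc sketch (not part of either library); the two result lists are copied from the outputs above.

```python
import unicodedata

# Combining tone marks -> Mandarin tone numbers (macron=1, acute=2, caron=3, grave=4)
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030C": "3", "\u0300": "4"}

def to_numbered(syllable):
    """Convert a diacritic pinyin syllable (e.g. 'niǎo') to numbered form ('niao3')."""
    tone = ""
    letters = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    return "".join(letters) + tone

# Outputs copied from the two scripts above
pypinyin_out = ["niǎo", "ér", "zhā", "zhā", ",", "zòu", "qǐ", "chén", "qū"]
g2pm_out = ["niao3", "er2", "cha1", "zha1", ",", "zou4", "qi3", "chen2", "qu3"]

# Positions where the normalized pypinyin syllable differs from g2pM's
disagreements = [
    (i, a, b)
    for i, (p, b) in enumerate(zip(pypinyin_out, g2pm_out))
    if (a := to_numbered(p)) != b
]
print(disagreements)  # -> [(2, 'zha1', 'cha1'), (8, 'qu1', 'qu3')]
```

The diff surfaces the two disagreements visible above: 喳 (zhā vs. cha1) and 曲 (qū vs. qu3); the ground-truth reading of 曲 in 晨曲 is qǔ, so both pipelines make at least one error on this sentence.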