Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech
Abstract
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional effort from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and countless languages worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary (the existing prior information in the natural language). Specifically, we design a semantics-to-pronunciation attention (S2PA) module to match the semantic patterns between the input text sequence and the prior semantics in the dictionary and obtain the corresponding pronunciations. The S2PA module can be easily trained with the end-to-end TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems. Further extensive analyses with different linguistic encoders demonstrate that each design in Dict-TTS is effective.
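As a rough illustration of the matching idea behind S2PA, the sketch below scores a character's context vector against the gloss vectors of its dictionary entries and picks the pronunciation with the largest attention weight. It is not the Dict-TTS implementation: the function `s2pa_lookup`, the toy three-dimensional vectors, the scaled dot-product scoring, and the hard argmax are all assumptions made for the example. The polyphone 行 (xing2 "to walk", hang2 "row; profession") serves as the toy entry.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def s2pa_lookup(context_vec, entries):
    """Toy semantics-to-pronunciation lookup (hypothetical helper).

    context_vec: semantic vector of the character in its sentence.
    entries: list of (pronunciation, gloss_vec) pairs from a dictionary.
    Returns the pronunciation with the highest attention weight,
    together with all attention weights.
    """
    scale = math.sqrt(len(context_vec))
    scores = [dot(context_vec, gloss) / scale for _, gloss in entries]
    weights = softmax(scores)
    best = max(range(len(entries)), key=lambda i: weights[i])
    return entries[best][0], weights


# Made-up gloss vectors for the two readings of 行 (assumption, not real data).
ENTRIES = [
    ("xing2", [1.0, 0.0, 0.2]),  # "to walk; to travel"
    ("hang2", [0.0, 1.0, 0.1]),  # "row; profession", as in 银行 (bank)
]

# Made-up context vector that lies closer to the "profession" gloss.
context = [0.1, 0.9, 0.0]
pron, weights = s2pa_lookup(context, ENTRIES)
print(pron, [round(w, 3) for w in weights])
```

With these toy vectors the lookup selects hang2. A trained model would presumably keep the soft weights so that gradients can flow through the attention, which is what allows training without phoneme labels; the argmax here only keeps the example short.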
Audio Samples
We provide audio samples generated by the TTS systems in our experiments on three datasets: BiaoBei (a Mandarin dataset), JSUT (a Japanese dataset), and CommonVoice-HK (a Cantonese dataset).
BiaoBei (Mandarin, single-speaker)
鱼在水里打旋地往外乱蹦 (Fish are swirling out in the water)
Systems: GT, GT (voc.), Character, BERT emb., NLR, Phoneme (G2PM), Phoneme (pypinyin), Dict-TTS
毋庸讳言，人有记性，亦有忘性 (Needless to say, people have memory and forgetfulness)
Systems: GT, GT (voc.), Character, BERT emb., NLR, Phoneme (G2PM), Phoneme (pypinyin), Dict-TTS
鸟儿喳喳，奏起晨曲 (Birds chirp and sing in the morning)
Systems: GT, GT (voc.), Character, BERT emb., NLR, Phoneme (G2PM), Phoneme (pypinyin), Dict-TTS
先师的研究浩瀚无涯矣 (The research field of the master is vast and boundless)
Systems: GT, GT (voc.), Character, BERT emb., NLR, Phoneme (G2PM), Phoneme (pypinyin), Dict-TTS
JSUT (Japanese, single-speaker)
末期試験に備えて本当に気合いを入れて勉強しなきゃ (I have to study hard in preparation for the final exam)
Systems: GT, GT (voc.), Character, Phoneme (pyopenjtalk), Dict-TTS
計画をたてることとそれを実行する事は別問題です (Making a plan and executing it are separate issues)
Systems: GT, GT (voc.), Character, Phoneme (pyopenjtalk), Dict-TTS
迷惑をおかけして申し訳ありません (Sorry for the inconvenience)
Systems: GT, GT (voc.), Character, Phoneme (pyopenjtalk), Dict-TTS
CommonVoice-HK (Cantonese, multi-speaker)
有個老人去左牛池灣沐翠街食齋 (The old man went to Niuchi Bay Mucui Street to eat vegetarian)
Systems: GT, Character, Phoneme (pycantonese), Dict-TTS
呃得一時唔呃得一世 (You can cheat for a while, but you can't cheat for a lifetime)
Systems: GT, Character, Phoneme (pycantonese), Dict-TTS
試看前幾天街上男女的樣子 (Just look at the men and women on the street a few days ago)
Systems: GT, Character, Phoneme (pycantonese), Dict-TTS
Verification Code
You can try the following scripts to verify the errors made by the G2P pipelines shown on the demo page. This work was done on 20 May 2022. Please choose the corresponding versions of the G2P pipelines.
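Because the results depend on the package versions, it may help to record which G2P packages are installed before running anything. The sketch below uses only the Python standard library. The distribution names in `G2P_PACKAGES` are assumed to be the PyPI names of the pipelines mentioned on this page, and `installed_versions` is a hypothetical helper, not part of any of those packages.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names of the G2P pipelines used in the demo.
G2P_PACKAGES = ["pypinyin", "g2pM", "pyopenjtalk", "pycantonese"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(G2P_PACKAGES).items():
        print(f"{name}: {ver or 'not installed'}")
```

If a reported version was released after 20 May 2022, its output may differ from the errors shown in the samples above.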