
DictTTS-Demo

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Abstract

Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional effort from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and the countless languages worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model that uses an online dictionary website (prior information that already exists in natural language). Specifically, we design a semantics-to-pronunciation attention (S2PA) module to match the semantic patterns between the input text sequence and the prior semantics in the dictionary and obtain the corresponding pronunciations; the S2PA module can be easily trained with the end-to-end TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems. Further extensive analyses with different linguistic encoders demonstrate that each design in Dict-TTS is effective.
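To give an intuition for the dictionary-lookup idea behind S2PA, here is a toy sketch (not the actual S2PA module, which learns soft attention weights end-to-end): score each dictionary sense of a polyphonic character against the sentence context and take the best-matching sense's pronunciation. The miniature dictionary and all function names below are invented for illustration.

```python
# Toy illustration of dictionary-guided pronunciation selection.
# Each entry maps a polyphonic character to (sense gloss, pronunciation) pairs;
# this tiny dictionary is made up for the sketch.
TOY_DICT = {
    "乐": [
        ("happy joyful enjoy", "le4"),     # sense 1: happiness
        ("music musical instrument", "yue4"),  # sense 2: music
    ],
}

def overlap_score(context_tokens, gloss):
    """Count how many context tokens appear in the sense gloss."""
    gloss_tokens = set(gloss.split())
    return sum(1 for tok in context_tokens if tok in gloss_tokens)

def choose_pronunciation(char, context_tokens):
    """Pick the pronunciation whose sense gloss best matches the context."""
    senses = TOY_DICT[char]
    _gloss, pron = max(senses, key=lambda s: overlap_score(context_tokens, s[0]))
    return pron

# The same character resolves differently depending on semantic context:
print(choose_pronunciation("乐", ["classical", "music"]))     # yue4
print(choose_pronunciation("乐", ["joyful", "celebration"]))  # le4
```

In Dict-TTS this hard argmax is replaced by differentiable attention over dictionary entries, so the matching is learned jointly with the TTS model.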

Audio Samples

We provide audio samples generated by the TTS systems in our experiments on three datasets: BiaoBei (a Mandarin dataset), JSUT (a Japanese dataset), and CommonVoice-HK (a Cantonese dataset).

BiaoBei (Mandarin, single-speaker)

  1. zài shuǐ xuán de wǎng wài luàn bèng
     (Fish are swirling out in the water)
     Audio: GT, GT (voc.), Character, Bert emb., NLR, Phoneme (G2PM), Phoneme (pypinyin), Dict-TTS
  2. yōng huì yán , rén yǒu xìng , yǒu wàng xìng
     (Needless to say, people have memory and forgetfulness)
     Audio: GT, GT (voc.), Character, Bert emb., NLR, Phoneme (G2PM), Phoneme (pypinyin), Dict-TTS
  3. niǎo ér zhā zhā , zòu qǐ chén qū
     (Birds chirp and sing in the morning)
     Audio: GT, GT (voc.), Character, Bert emb., NLR, Phoneme (G2PM), Phoneme (pypinyin), Dict-TTS
  4. xiān shī de yán jiū yíng hàn
     (The research field of the master is vast and boundless)
     Audio: GT, GT (voc.), Character, Bert emb., NLR, Phoneme (G2PM), Phoneme (pypinyin), Dict-TTS

JSUT (Japanese, single-speaker)

  1. 末期試験に備えて本当に気合いを入れて勉強しなきゃ
     (I have to study hard in preparation for the final exam)
     Audio: GT, GT (voc.), Character, Phoneme (pyopenjtalk), Dict-TTS
  2. 計画をたてることとそれを実行する事は別問題です
     (Making a plan and executing it are separate issues)
     Audio: GT, GT (voc.), Character, Phoneme (pyopenjtalk), Dict-TTS
  3. 迷惑をおかけして申し訳ありません
     (Sorry for the inconvenience)
     Audio: GT, GT (voc.), Character, Phoneme (pyopenjtalk), Dict-TTS

CommonVoice-HK (Cantonese, multi-speaker)

  1. 有個老人去左牛池灣沐翠街食齋
     (The old man went to Mucui Street in Niuchi Bay to eat vegetarian food)
     Audio: GT, Character, Phoneme (pycantonese), Dict-TTS
  2. 呃得一時唔呃得一世
     (You can cheat for a while, but you can't cheat for a lifetime)
     Audio: GT, Character, Phoneme (pycantonese), Dict-TTS
  3. 試看前幾天街上男女的樣子
     (Just look at the men and women on the street a few days ago)
     Audio: GT, Character, Phoneme (pycantonese), Dict-TTS

Verification Code

You can run the following scripts to verify the G2P pipeline errors shown on this demo page. These checks were done on 20 May 2022; please install the corresponding versions of the G2P pipelines.

# pip install pypinyin==0.46.0
import pypinyin
res = pypinyin.pinyin("鸟儿喳喳,奏起晨曲")
print(res)
# [['niǎo'], ['ér'], ['zhā'], ['zhā'], [','], ['zòu'], ['qǐ'], ['chén'], ['qū']]

# pip install g2pM
from g2pM import G2pM
model = G2pM()
res = model("鸟儿喳喳,奏起晨曲", tone=True, char_split=True)
print(res)
# ['niao3', 'er2', 'cha1', 'zha1', ',', 'zou4', 'qi3', 'chen2', 'qu3']
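To pinpoint where the two pipelines above disagree, a small helper can normalize pypinyin's diacritic syllables to tone-numbered form and diff the two sequences. This helper is an ad hoc sketch (not part of either library); the two result lists are copied from the outputs above.

```python
import unicodedata

# Combining tone marks -> Mandarin tone numbers (macron=1, acute=2, caron=3, grave=4)
TONE_MARKS = {"\u0304": "1", "\u0301": "2", "\u030C": "3", "\u0300": "4"}

def to_numbered(syllable):
    """Convert a diacritic pinyin syllable (e.g. 'niǎo') to numbered form ('niao3')."""
    tone = ""
    letters = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    return "".join(letters) + tone

# Outputs copied from the two scripts above
pypinyin_out = ["niǎo", "ér", "zhā", "zhā", ",", "zòu", "qǐ", "chén", "qū"]
g2pm_out = ["niao3", "er2", "cha1", "zha1", ",", "zou4", "qi3", "chen2", "qu3"]

# Positions where the normalized pypinyin syllable differs from g2pM's
disagreements = [
    (i, a, b)
    for i, (p, b) in enumerate(zip(pypinyin_out, g2pm_out))
    if (a := to_numbered(p)) != b
]
print(disagreements)  # -> [(2, 'zha1', 'cha1'), (8, 'qu1', 'qu3')]
```

The diff surfaces the two disagreements visible above: 喳 (zhā vs. cha1) and 曲 (qū vs. qu3); the ground-truth reading of 曲 in 晨曲 is qǔ, so both pipelines make at least one error on this sentence.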