Paper

  • Emotion-Controllable Speech Synthesis Using Emotion Soft-Label and Fine-Grained Prosody Factors

Speech emotion is controlled simultaneously at two levels:

  • Coarse-grained control (e.g. emotion soft-label)
  • Fine-grained control (e.g. prosodic features)

Coarse-grained: Angry = 1.0

  • Angry(1.0) + Energy_mean(+0.3)
  • Angry(1.0) + Energy_mean(-0.3)
  • Angry(1.0) + Energy_range(+0.3)
  • Angry(1.0) + Energy_range(-0.3)
  • Angry(1.0) + Pitch_mean(+0.3)
  • Angry(1.0) + Pitch_mean(-0.3)
  • Angry(1.0) + Pitch_range(+0.3)
  • Angry(1.0) + Pitch_range(-0.3)

Coarse-grained: Neutral = 1.0

  • Neutral(1.0) + Energy_mean(+0.3)
  • Neutral(1.0) + Energy_mean(-0.3)
  • Neutral(1.0) + Energy_range(+0.3)
  • Neutral(1.0) + Energy_range(-0.3)
  • Neutral(1.0) + Pitch_mean(+0.3)
  • Neutral(1.0) + Pitch_mean(-0.3)
  • Neutral(1.0) + Pitch_range(+0.3)
  • Neutral(1.0) + Pitch_range(-0.3)
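
The sample grid above pairs one emotion soft-label with one prosody factor offset at a time. As a rough illustration only, the same grid could be enumerated as below. The dictionary layout and the names `build_conditions`, `soft_label`, and `prosody` are hypothetical and are not taken from the paper. The page also does not state the scale or units of the ±0.3 offsets, so they are treated here as opaque numbers.

```python
from itertools import product

# Values copied from the sample list on this page.
EMOTIONS = ["Angry", "Neutral"]
PROSODY_FACTORS = ["Energy_mean", "Energy_range", "Pitch_mean", "Pitch_range"]
OFFSETS = [+0.3, -0.3]  # scale/units not stated on the page (assumption: opaque numbers)


def build_conditions(emotions=EMOTIONS, factors=PROSODY_FACTORS, offsets=OFFSETS):
    """Enumerate (coarse, fine) control pairs in the order shown on the page.

    Hypothetical helper: the dict layout is an illustrative assumption,
    not the interface of the authors' system.
    """
    conditions = []
    for emotion, factor, offset in product(emotions, factors, offsets):
        conditions.append({
            "soft_label": {emotion: 1.0},   # coarse-grained control
            "prosody": {factor: offset},    # fine-grained control
            "name": f"{emotion}(1.0) + {factor}({offset:+.1f})",
        })
    return conditions


conditions = build_conditions()
for condition in conditions:
    print(condition["name"])
```

Each entry changes a single prosody factor while the soft-label stays fixed, which mirrors how the samples are organized: the listener can attribute any audible difference within an emotion group to that one factor.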