Paper
- Emotion-Controllable Speech Synthesis Using Emotion Soft-Label and Fine-Grained Prosody Factors
Speech emotion is controlled simultaneously by two kinds of conditioning:
- Coarse-grained control (e.g., an emotion soft-label)
- Fine-grained control (e.g., prosodic features)
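The combination of the two controls can be pictured as one conditioning vector: a soft-label over emotion categories, followed by per-factor prosody offsets. The sketch below is only an illustration under stated assumptions. The emotion inventory beyond Angry and Neutral, the normalization of the soft-label to sum to 1, the treatment of 0.0 as "no change", and the concatenated layout are all hypothetical and are not taken from the paper. The four prosody factors are the ones varied in the samples.

```python
from typing import Dict, List

# Hypothetical emotion inventory; only "angry" and "neutral" appear in the samples.
EMOTIONS: List[str] = ["neutral", "angry", "happy", "sad"]
# The four prosody factors varied in the samples.
PROSODY_FACTORS: List[str] = ["energy_mean", "energy_range", "pitch_mean", "pitch_range"]


def soft_label(weights: Dict[str, float]) -> List[float]:
    """Coarse-grained control: map emotion weights to a normalized soft-label vector."""
    unknown = set(weights) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    if any(w < 0 for w in weights.values()):
        raise ValueError("emotion weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one emotion weight must be positive")
    # Assumption: the soft-label is treated as a distribution that sums to 1.
    return [weights.get(e, 0.0) / total for e in EMOTIONS]


def prosody_offsets(offsets: Dict[str, float]) -> List[float]:
    """Fine-grained control: per-factor offsets; 0.0 is assumed to mean 'no change'."""
    unknown = set(offsets) - set(PROSODY_FACTORS)
    if unknown:
        raise ValueError(f"unknown prosody factors: {sorted(unknown)}")
    return [offsets.get(f, 0.0) for f in PROSODY_FACTORS]


def condition_vector(emotions: Dict[str, float], prosody: Dict[str, float]) -> List[float]:
    """Concatenate coarse and fine controls into one vector (hypothetical layout)."""
    return soft_label(emotions) + prosody_offsets(prosody)


# Angry(1.0) + Pitch_mean(+0.3)
print(condition_vector({"angry": 1.0}, {"pitch_mean": 0.3}))
# → [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0]
```

A soft-label also allows mixtures such as `{"angry": 0.5, "sad": 0.5}`. How an actual synthesizer consumes such a vector (embedding, concatenation with encoder outputs, etc.) is not described here and is left open.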
Coarse-grained: Angry = 1.0
- Angry(1.0) + Energy_mean(+0.3)
- Angry(1.0) + Energy_mean(-0.3)
- Angry(1.0) + Energy_range(+0.3)
- Angry(1.0) + Energy_range(-0.3)
- Angry(1.0) + Pitch_mean(+0.3)
- Angry(1.0) + Pitch_mean(-0.3)
- Angry(1.0) + Pitch_range(+0.3)
- Angry(1.0) + Pitch_range(-0.3)
Coarse-grained: Neutral = 1.0
- Neutral(1.0) + Energy_mean(+0.3)
- Neutral(1.0) + Energy_mean(-0.3)
- Neutral(1.0) + Energy_range(+0.3)
- Neutral(1.0) + Energy_range(-0.3)
- Neutral(1.0) + Pitch_mean(+0.3)
- Neutral(1.0) + Pitch_mean(-0.3)
- Neutral(1.0) + Pitch_range(+0.3)
- Neutral(1.0) + Pitch_range(-0.3)
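The samples follow a one-factor-at-a-time design: a one-hot soft-label (Angry or Neutral at 1.0) with a single prosody factor shifted by ±0.3. The sketch below enumerates that grid and reproduces the labels listed above. Holding every other factor at 0.0 is an assumption about how the unvaried factors are set. The function name `sample_grid` is hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

# Values taken from the sample lists.
EMOTIONS = ["Angry", "Neutral"]
FACTORS = ["Energy_mean", "Energy_range", "Pitch_mean", "Pitch_range"]
DELTAS = [0.3, -0.3]


def sample_grid(weight: float = 1.0) -> List[Tuple[str, Dict[str, float], Dict[str, float]]]:
    """Return (label, soft-label, prosody offsets) for each sample condition."""
    grid = []
    # product() iterates emotion, then factor, then delta: the same order as the lists.
    for emotion, factor, delta in product(EMOTIONS, FACTORS, DELTAS):
        label = f"{emotion}({weight:.1f}) + {factor}({delta:+.1f})"
        # Assumption: factors not being varied stay at 0.0 (no change).
        offsets = {f: (delta if f == factor else 0.0) for f in FACTORS}
        grid.append((label, {emotion: weight}, offsets))
    return grid


for label, _, _ in sample_grid():
    print(label)
```

The grid contains 2 emotions × 4 factors × 2 directions = 16 conditions, matching the two lists of eight samples.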