Supplementary material from "Exploring emotional prototypes in a high dimensional TTS latent space"

Paper: Submitted to INTERSPEECH 2021

Authors: Pol van Rijn, Silvan Mertes, Dominik Schiller, Peter M. C. Harrison, Pauline Larrouy-Maestri, Elisabeth André, Nori Jacoby

Abstract: Recent TTS systems are able to generate prosodically varied and realistic speech. However, it is unclear how this prosodic variation contributes to the perception of speakers' emotional states. Here we use the recent psychological paradigm 'Gibbs Sampling with People' to search the prosodic latent space in a trained GST Tacotron model to explore prototypes of emotional prosody. Participants are recruited online and collectively manipulate the latent space of the generative speech model in a sequentially adaptive way so that the stimulus presented to one group of participants is determined by the response of the previous groups. We demonstrate that (1) particular regions of the model's latent space are reliably associated with particular emotions, (2) the resulting emotional prototypes are well-recognized by a separate group of human raters, and (3) these emotional prototypes can be effectively transferred to new sentences. Collectively, these experiments demonstrate a novel approach to the understanding of emotional speech by providing a tool to explore the relation between the latent space of generative models and human semantics.



Supplementary figures

Supplementary figure S1

Figure S1: Schematic of the GST Tacotron architecture, based on Wang et al. (2018). The upper part shows the training procedure. The lower part shows two ways to create stimuli from the model. The first (see S1A) is identical to how the model was trained. The second, which is used in the experiment, is to set the attention weights directly.
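
The two ways of creating stimuli described in the caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, a plain scaled dot-product score, and randomly initialised tokens, and the names `style_from_reference` and `style_from_weights` as well as the sizes (10 tokens, 256 dimensions) are hypothetical. It only shows that the style embedding is a weighted sum of the style tokens, with weights that either come from attention over a reference embedding or are set by hand.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def style_from_reference(ref_embedding, tokens):
    """Way 1: derive attention weights from a reference embedding.

    Simplified to one head with scaled dot-product scores (an assumption).
    """
    scores = tokens @ ref_embedding / np.sqrt(tokens.shape[1])
    weights = softmax(scores)
    return weights, weights @ tokens


def style_from_weights(weights, tokens):
    """Way 2: bypass the reference audio and set the weights directly."""
    weights = np.asarray(weights, dtype=float)
    return weights @ tokens


# Hypothetical sizes and random values, for illustration only.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((10, 256))   # stand-in for a learned token bank
ref_embedding = rng.standard_normal(256)  # stand-in for a reference encoder output

weights_a, style_a = style_from_reference(ref_embedding, tokens)

manual_weights = np.zeros(10)
manual_weights[3] = 1.0                   # put all weight on one token
style_b = style_from_weights(manual_weights, tokens)
```

In this sketch, feeding the weights obtained from a reference back into `style_from_weights` reproduces the same style embedding, so the second route is a way of steering the same latent space without any reference audio.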

Final iterations

Text: Pick a card and slip it under the pack.
Audio samples: Iteration 0, Angry, Happy, Sad, Random
Text: The pencils have all been used.
Audio samples: Iteration 0, Angry, Happy, Sad, Random
Text: Much of the story makes good sense.
Audio samples: Iteration 0, Angry, Happy, Sad, Random

Transferred prosody

Text: Take the match and strike it against your shoe.
Audio samples: Angry, Happy, Sad
Text: The desk and both chairs were painted tan.
Audio samples: Angry, Happy, Sad
Text: She was waiting at my front lawn.
Audio samples: Angry, Happy, Sad
Text: Pack the records in a neat thin case.
Audio samples: Angry, Happy, Sad