Supplementary material from "Exploring emotional prototypes in a high dimensional TTS latent space"
Paper: Submitted to INTERSPEECH 2021
Authors: Pol van Rijn, Silvan Mertes, Dominik Schiller, Peter M. C. Harrison, Pauline Larrouy-Maestri,
Elisabeth André, Nori Jacoby
Abstract: Recent TTS systems are able to generate prosodically varied and realistic speech. However, it is
unclear how this prosodic variation contributes to the perception of speakers' emotional states. Here we use the
recent psychological paradigm 'Gibbs Sampling with People' to search the prosodic latent space in a trained GST
Tacotron model to explore prototypes of emotional prosody. Participants are recruited online and collectively
manipulate the latent space of the generative speech model in a sequentially adaptive way so that the stimulus
presented to one group of participants is determined by the response of the previous groups. We demonstrate that (1)
particular regions of the model's latent space are reliably associated with particular emotions, (2) the resulting
emotional prototypes are well-recognized by a separate group of human raters, and (3) these emotional prototypes can
be effectively transferred to new sentences. Collectively, these experiments demonstrate a novel approach to the
understanding of emotional speech by providing a tool to explore the relation between the latent space of generative
models and human semantics.
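The sequentially adaptive procedure described in the abstract can be illustrated with a toy simulation. The following is a minimal sketch, not the experiment's implementation. It assumes that each iteration updates a single latent dimension while the others stay fixed, and that a group's responses are aggregated with the median. The function names, the group size, the noise level, and the simulated "participants" (who simply answer near a fixed target point) are all hypothetical.

```python
import numpy as np


def simulated_group_response(state, dim, target, rng, group_size=5, noise=0.1):
    """Hypothetical stand-in for one group of participants.

    In a real experiment each participant would listen to stimuli rendered
    from `state` with coordinate `dim` varied and pick the best-matching value.
    Here each simulated participant answers near target[dim]; `state` is
    therefore unused. The group's choices are aggregated with the median.
    """
    choices = target[dim] + noise * rng.standard_normal(group_size)
    return float(np.median(choices))


def run_chain(n_dims, n_iterations, target, seed=0):
    """Run one chain; return the visited states, one row per iteration."""
    rng = np.random.default_rng(seed)
    state = rng.uniform(-1.0, 1.0, size=n_dims)  # random starting point
    history = [state.copy()]
    for iteration in range(n_iterations):
        dim = iteration % n_dims  # cycle through the latent dimensions
        # The previous group's response fixes what the next group starts from.
        state[dim] = simulated_group_response(state, dim, target, rng)
        history.append(state.copy())
    return np.array(history)


target = np.array([0.5, -0.5, 0.0])  # illustrative "prototype" location
history = run_chain(n_dims=3, n_iterations=6, target=target)
```

Each row of `history` differs from the previous row in at most one coordinate, which is the property that makes the stimulus for one group depend on the responses of the groups before it.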
Figure S1: Schematic of the GST Tacotron architecture, based on Wang et al. (2018). The upper part shows the training
procedure. The lower part shows two ways to create stimuli from the model. The first way (see S1A) is identical to
how the model was trained. The second way, which is used in the experiment, is to set the attention weights directly.
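The two ways of creating stimuli can be contrasted in a small numerical sketch. It rests on several simplifying assumptions that are not taken from the figure: the style embedding is a weighted sum of the style tokens, attention is single-head dot-product attention, and the token count (10) and dimensionality (256) are illustrative. All function and variable names are hypothetical and do not correspond to any particular GST Tacotron codebase.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - np.max(x))
    return e / e.sum()


def weights_from_reference(tokens, reference_embedding):
    """First way: derive attention weights from a reference embedding.

    Assumption: plain single-head dot-product attention.
    """
    return softmax(tokens @ reference_embedding)


def style_embedding(tokens, weights):
    """Combine the style tokens into one embedding using the given weights."""
    tokens = np.asarray(tokens, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if weights.shape != (tokens.shape[0],):
        raise ValueError("need exactly one weight per style token")
    return weights @ tokens


rng = np.random.default_rng(0)
tokens = rng.standard_normal((10, 256))         # stand-in for learned tokens
reference_embedding = rng.standard_normal(256)  # stand-in for encoder output

# First way: weights are computed from a reference, as during training.
w_ref = weights_from_reference(tokens, reference_embedding)
emb_from_reference = style_embedding(tokens, w_ref)

# Second way: weights are set directly, with no reference audio involved.
w_manual = np.zeros(10)
w_manual[3] = 1.0  # put all weight on one token
emb_manual = style_embedding(tokens, w_manual)
```

Setting the weights directly removes the dependence on reference audio, so the weights themselves become the coordinates that can be searched.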