Comparing monological and dialogical neural representations of dialogue history for predicting the acoustic parameters of an upcoming conversational turn
Abstract
During a conversation, speakers tend to influence each other’s production, a phenomenon known as convergence or entrainment. Establishing the dynamics of such adaptation in natural conversations is important for building our understanding of language-based social interaction. This work assesses the influence of both participants on the acoustic and prosodic parameters of an upcoming turn by predicting its energy, pitch and speech-rate values with a feature-based regression model. The model relies on a recurrent neural architecture and takes as input acoustic-prosodic parameters from the dialogue history, as well as linguistic embeddings of the lexical content of previous turns. In a set of experiments on the Switchboard corpus, we compare the model’s predictions when using a dialogical history (turns from both participants) vs. a monological history (only turns from the upcoming turn’s speaker), using two kinds of approaches based on acoustic-prosodic input, and an extension that additionally includes word embeddings. Results tend to show that the information contained in previous turns produced by both the speaker and their interlocutor reduces the error in predicting the current acoustic target variable by 7% on average. In addition, the prediction error decreases as the number of previous turns taken into account increases.
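To make the setup concrete, the sketch below illustrates (in PyTorch) one possible form of such a recurrent regression model: each previous turn is summarised as a feature vector (energy, pitch, speech rate, optionally concatenated with an embedding of its lexical content), the sequence of turn vectors is fed to a recurrent encoder, and a linear layer predicts the target acoustic value of the upcoming turn. This is a minimal illustrative sketch, not the authors' implementation; the class name, feature layout and hyper-parameters are assumptions.

```python
# Illustrative sketch (not the paper's actual model) of a recurrent regressor
# that predicts an acoustic target of the upcoming turn from the feature
# vectors of the previous turns.
import torch
import torch.nn as nn

class TurnHistoryRegressor(nn.Module):
    def __init__(self, n_acoustic=3, embed_dim=0, hidden=64):
        super().__init__()
        # Each history turn is one vector: [energy, pitch, speech rate],
        # optionally concatenated with a turn-level lexical embedding.
        self.rnn = nn.GRU(n_acoustic + embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)  # one scalar target, e.g. pitch

    def forward(self, history):            # history: (batch, n_turns, features)
        _, last_state = self.rnn(history)  # last_state: (layers, batch, hidden)
        return self.out(last_state[-1])    # predicted value for the next turn

# A monological history would keep only the upcoming speaker's turns;
# a dialogical history interleaves turns from both participants.
model = TurnHistoryRegressor(n_acoustic=3, embed_dim=0, hidden=64)
dialogical = torch.randn(8, 6, 3)    # 8 dialogues, 6 previous turns, 3 features
predicted_target = model(dialogical)  # shape: (8, 1)
```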
Keywords
- Convergence
- prediction
- acoustic features
- neural representations
- dialogue