Core Concepts
A novel speech synthesis pipeline that generates emotional and disfluent speech patterns in a zero-shot manner using a large language model, enabling more natural and relatable interactions for conversational AI systems.
Abstract
The work proposes humanizing machine communication by using a large language model to generate emotional and disfluent speech patterns in a zero-shot manner.
The key highlights are:
Contemporary conversational systems often lack the emotional depth and disfluent characteristics of human interactions, making them seem mechanical and less relatable.
The proposed method uses a large language model (GPT-4) to generate responses with varying levels of emotion and disfluency cues through careful prompt tuning, in a zero-shot fashion.
The generated text is then converted to speech using a rule-based approach that maps the emotional cues and disfluencies to corresponding speech patterns and sounds.
The method is evaluated in the context of a virtual patient scenario for SBIRT (Screening, Brief Intervention, and Referral to Treatment) training, where realistic emotional expression is crucial for effective healthcare training.
According to the reported experiments, the synthesized speech is almost indistinguishable from genuine human speech, which makes the interactions feel more personal and authentic.
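The prompting step in the highlights above can be illustrated with a minimal sketch. It only assembles a zero-shot instruction string and does not call any model. The level names, the prompt wording, and the names StyleSettings and build_system_prompt are assumptions made for illustration; the summary does not give the authors' actual prompt.

```python
from dataclasses import dataclass

# Hypothetical level names; the source only says the levels "vary".
LEVELS = ("none", "low", "medium", "high")


@dataclass(frozen=True)
class StyleSettings:
    emotion: str = "medium"
    disfluency: str = "low"


def build_system_prompt(persona: str, style: StyleSettings) -> str:
    """Assemble a zero-shot instruction: no example dialogues are included."""
    for name, level in (("emotion", style.emotion), ("disfluency", style.disfluency)):
        if level not in LEVELS:
            raise ValueError(f"unknown {name} level: {level!r}")
    # Illustrative wording only, not the prompt used by the authors.
    return "\n".join([
        f"You are role-playing {persona}.",
        f"Emotional intensity: {style.emotion}.",
        f"Disfluency level: {style.disfluency}.",
        "Mark non-verbal cues inline as short phrases such as 'sighs heavily' or 'whispers'.",
        "Mark stutters by repeating the first letter with a hyphen, as in 'l-like'.",
    ])


prompt = build_system_prompt(
    "a virtual patient in an SBIRT training session",
    StyleSettings(emotion="high", disfluency="medium"),
)
print(prompt)
```

Keeping the two levels as separate fields means emotion and disfluency can be varied independently, which matches the summary's description of "varying levels" of each.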
Stats
Plain response: "I am very sad, I would like to have a cup of coffee"
Response with emotional and disfluency cues: "sighs heavily I am very sad, whispers I would l-like to have a cup of coffee"
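To make the rule-based conversion step concrete, the following sketch splits a cue-annotated response like the second example above into an ordered list of speech events. The cue phrases come from the examples in this summary; the action labels (such as "insert_sound:sigh") and the function names are hypothetical, and the authors' actual rules are not given here.

```python
import re

# Cue phrases from this summary mapped to hypothetical rendering actions.
CUE_ACTIONS = {
    "sighs heavily": "insert_sound:sigh",
    "whispers": "set_style:whisper",
    "cries softly": "insert_sound:cry",
    "sobs": "insert_sound:sob",
    "bursts into tears": "insert_sound:cry",
    "looks down": "insert_pause",
    "nods slowly": "insert_pause",
}

# Longest phrases first so a longer cue is preferred over a shorter overlap.
CUE_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(c) for c in sorted(CUE_ACTIONS, key=len, reverse=True))
    + r")\b"
)

# A stutter is written as a repeated first letter plus hyphen, e.g. "l-like".
STUTTER_PATTERN = re.compile(r"\b([A-Za-z])-(\1[A-Za-z]*)")


def plan_speech(text):
    """Return ("cue", action) and ("speak", words) events in order."""
    events = []
    # re.split with one capturing group alternates: text, cue, text, cue, ...
    for i, piece in enumerate(CUE_PATTERN.split(text)):
        if i % 2 == 1:
            events.append(("cue", CUE_ACTIONS[piece]))
        else:
            words = piece.strip(" ,")
            if words:
                events.append(("speak", words))
    return events


def find_stutters(text):
    """Return the words that carry a stutter marker."""
    return [m.group(2) for m in STUTTER_PATTERN.finditer(text)]


example = "sighs heavily I am very sad, whispers I would l-like to have a cup of coffee"
for event in plan_speech(example):
    print(event)
print(find_stutters(example))
```

A downstream synthesizer could then walk the event list, playing a sound or switching voice style on each cue and speaking the text in between.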
Quotes
sighs heavily
whispers
cries softly
looks down
sobs
nods slowly
bursts into tears