
When Machines Speak with Feeling: Investigating Emotional Prosody, Authenticity, and Trust in AI vs. Human Voices

License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Abstract

Emotional prosody, the vocal cues that convey affect, profoundly influences how listeners interpret a speaker's intentions. We conducted two studies comparing AI- and human-generated emotional speech. In Study 1 (N=38), participants categorized five emotions (happy, sad, angry, neutral, fear) expressed by human voices and by an advanced text-to-speech (TTS) system. Human recordings yielded higher overall accuracy (79.82% vs. 72.65%) and were rated significantly more natural, an effect partially explained by micro-perturbations (e.g., jitter, shimmer) that enhanced perceived authenticity. In Study 2 (N=53), these validated stimuli were embedded in short scenarios, with each speaker labeled as either "human" or "AI." Even when participants heard identical clips, those told the speaker was human reported greater trust and empathy, resulting in higher donation and advice-following rates. Although contemporary TTS systems effectively convey broad affective states, explicit AI labeling reduces perceived credibility and social engagement, underscoring the critical role that preexisting expectations play in human–AI communication.
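
For readers unfamiliar with the perturbation measures mentioned above: jitter quantifies cycle-to-cycle variation in the fundamental period of the voice, and shimmer quantifies the corresponding variation in amplitude. A minimal sketch of how such measures can be extracted, using the parselmouth interface to Praat (an illustrative toolchain choice, not the authors' stated method; the file name is a placeholder):

```python
# Illustrative sketch: extracting local jitter and shimmer from a speech
# clip with parselmouth (a Python interface to Praat). This is not the
# authors' pipeline; "clip.wav" is a hypothetical stimulus file.
import parselmouth
from parselmouth.praat import call

sound = parselmouth.Sound("clip.wav")  # hypothetical stimulus file

# Detect glottal pulses within a typical speech pitch range (75-500 Hz).
point_process = call(sound, "To PointProcess (periodic, cc)", 75, 500)

# Local jitter: mean absolute difference between consecutive periods,
# divided by the mean period (Praat's standard parameter defaults).
jitter_local = call(point_process, "Get jitter (local)",
                    0, 0, 0.0001, 0.02, 1.3)

# Local shimmer: mean absolute difference between the amplitudes of
# consecutive periods, divided by the mean amplitude.
shimmer_local = call([sound, point_process], "Get shimmer (local)",
                     0, 0, 0.0001, 0.02, 1.3, 1.6)

print(f"jitter (local): {jitter_local:.4%}")
print(f"shimmer (local): {shimmer_local:.4%}")
```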