The authors introduce SALMON, a comprehensive evaluation suite for assessing the acoustic awareness of speech language models (SLMs). SALMON consists of two main tasks:
Acoustic Consistency: Evaluating whether SLMs can detect unnatural acoustic changes within a recording, such as changes in speaker, sentiment, background noise, or room acoustics.
Acoustic-Semantic Alignment: Assessing whether SLMs can align the acoustic properties of a recording (e.g., background noise, sentiment) with the semantic content of the spoken text.
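To make the Acoustic Consistency task concrete, here is a minimal sketch of how an inconsistent sample might be built by switching to a different recording partway through. It is an illustration only, not the authors' generation pipeline: the function name `splice_at_midpoint`, the midpoint switch, and the assumption that both recordings are equal-length, sample-aligned waveforms are all hypothetical simplifications.

```python
from typing import List, Sequence


def splice_at_midpoint(original: Sequence[float],
                       alternative: Sequence[float]) -> List[float]:
    """Build a toy 'inconsistent' waveform (hypothetical helper).

    The first half comes from `original` and the second half from
    `alternative`, e.g. the same utterance with a different speaker or
    different background noise. Both inputs are assumed to be aligned
    and of equal length.
    """
    if len(original) != len(alternative):
        raise ValueError("waveforms must have the same length")
    mid = len(original) // 2
    # Keep the original up to the midpoint, then switch sources.
    return list(original[:mid]) + list(alternative[mid:])


# Toy waveforms standing in for two renditions of the same utterance.
consistent = [1.0, 1.0, 1.0, 1.0]
other_condition = [2.0, 2.0, 2.0, 2.0]
inconsistent = splice_at_midpoint(consistent, other_condition)
```

In this toy setup, `consistent` plays the role of the "real" sample and `inconsistent` the modified one that a model should find less likely.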
SALMON covers a wide range of acoustic elements, including speaker identity, sentiment, background noise, and room impulse response. The benchmark uses a modeling-based metric: an SLM is scored on whether it assigns a higher likelihood to a "real" sample than to a modified, inconsistent version of it.
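The likelihood comparison can be sketched as a pairwise accuracy. The code below is a minimal illustration under stated assumptions, not the SALMON implementation: `pairwise_accuracy` and `toy_score` are hypothetical names, ties are counted as errors by choice, and the toy scorer is only a stand-in for the log-likelihood an SLM would assign to a recording.

```python
from typing import Callable, Sequence, Tuple

Audio = Sequence[float]


def pairwise_accuracy(score: Callable[[Audio], float],
                      pairs: Sequence[Tuple[Audio, Audio]]) -> float:
    """Fraction of (real, modified) pairs where the real sample scores higher.

    `score` is assumed to return a model likelihood (or log-likelihood)
    for a recording. Ties count as errors in this sketch.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for real, modified in pairs if score(real) > score(modified))
    return correct / len(pairs)


def toy_score(audio: Audio) -> float:
    """Stand-in for an SLM log-likelihood: penalizes abrupt sample jumps."""
    return -sum((b - a) ** 2 for a, b in zip(audio, audio[1:]))


pairs = [
    ([0.0, 0.1, 0.2, 0.3], [0.0, 0.1, 0.9, 1.0]),    # toy scorer prefers real
    ([0.5, 0.5, 0.5, 0.5], [0.5, 0.5, -0.5, -0.5]),  # toy scorer prefers real
    ([0.0, 1.0, 0.0], [0.0, 0.0, 0.0]),              # toy scorer prefers modified
]
accuracy = pairwise_accuracy(toy_score, pairs)
```

Because each pair has exactly one real sample, a scorer that guesses at random would land near 50% on this kind of metric, which is the natural baseline to compare against.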
The authors evaluate several popular SLMs, including TWIST, LAST, and pGSLM, on the SALMON benchmark. The results show that while humans easily achieve over 90% accuracy on most tasks, current SLMs struggle to model and identify basic acoustic inconsistencies, highlighting the need for further research in developing acoustically aware speech language models.
The authors provide the SALMON evaluation suite and generation pipeline, aiming to guide future SLM development towards jointly modeling semantic and acoustic aspects of speech.