Core Concept
Enhancing text-to-speech synthesis by controlling voice characteristics through prompt-based methods.
Summary
The paper describes the construction of a new paired corpus of speech and voice characteristics descriptions, intended for prompt-based control of voice characteristics in text-to-speech synthesis. It covers how the corpus was built: data collection and filtering, quality assurance, and manual annotation. The paper also proposes a model that retrieves speech from voice characteristics descriptions.
Structure:
- Introduction to Speech Production and TTS Challenges
- Methodology for Corpus Construction
- Quality Assurance Processes
- Manual Annotation of Voice Characteristics Descriptions
- Analysis of Corpus Diversity and Linguistic Features
- Training Algorithm for a Model That Retrieves Speech from Descriptions
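
As a rough illustration of what retrieving speech from a description involves, the sketch below ranks speech segments by cosine similarity to a description embedding. This is a generic retrieval pattern, not the paper's model: the summary does not specify the architecture or training objective, so the embeddings are toy vectors and every name (`cosine_similarity`, `retrieve`, `seg_a`, and so on) is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def retrieve(description_embedding, speech_embeddings, top_k=1):
    """Return the top_k (segment_id, score) pairs, best match first.

    speech_embeddings maps a segment id to its embedding. In a real system
    both embeddings would come from trained encoders (assumed, not shown).
    """
    scored = [
        (segment_id, cosine_similarity(description_embedding, embedding))
        for segment_id, embedding in speech_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional embeddings standing in for encoder outputs.
speech_embeddings = {
    "seg_a": [1.0, 0.0],
    "seg_b": [0.0, 1.0],
    "seg_c": [0.7, 0.7],
}
query = [0.9, 0.1]  # hypothetical embedding of a voice description
ranked = retrieve(query, speech_embeddings, top_k=2)
```

With these toy vectors, `ranked` lists `seg_a` first and `seg_c` second, because the query points mostly along the first axis.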
Statistics
"We construct an open corpus, Coco-Nut."
"The corpus consists of 54,610 speech segments from 1,523 videos."
"The MLM scores of collected segments are distributed around the peak of -3."
Quotes
"We propose a method to construct a paired corpus of speech and voice characteristics descriptions with a broad range of voice characteristics."
"We open-sourced the corpus, which is the first corpus that covers diverse in-the-wild voice characteristics."