Key Concepts
Attention-Constrained Inference (ACI) is a training-free method that enhances the robustness of decoder-only text-to-speech models by identifying and constraining attention maps responsible for alignment, without altering the model structure or retraining.
Summary
The paper introduces Attention-Constrained Inference (ACI), a training-free method to improve the robustness of decoder-only text-to-speech (TTS) models.
Key insights:
- The authors observe that certain attention heads in the decoder-only TTS model VALL-E exhibit diagonal patterns in their attention maps, indicating alignment between speech and text. They call these Alignment-Emerged Attention Maps (AEAMs).
- The authors propose the Attention Sweeping algorithm to identify AEAMs among all attention maps in the model.
- Based on the identified AEAMs, the authors introduce attention constraining strategies that apply constraining masks (CMasks) on the AEAMs during inference to guide the model towards monotonic synthesis.
- Experiments on VALL-E models of various configurations show that ACI can reduce the word error rate (WER) of synthesized speech by up to 20.5% in relative terms, while maintaining comparable naturalness and speaker similarity.
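To make the idea of a "diagonal" attention map concrete, here is a minimal sketch of how maps could be scored and filtered. It is not the paper's Attention Sweeping algorithm, whose details are not given in this summary. The names `diagonal_score` and `sweep`, the `band` and `threshold` parameters, and the assumption of a roughly linear speech-to-text alignment are all hypothetical choices made for illustration.

```python
def diagonal_score(attn, band=0.5):
    """Fraction of attention mass lying within `band` columns of an ideal diagonal.

    attn: list of rows, one per speech frame; each row holds weights over text tokens.
    Assumption (not from the paper): the ideal alignment is a straight diagonal.
    """
    n_frames = len(attn)
    n_tokens = len(attn[0])
    near = 0.0
    total = 0.0
    for t, row in enumerate(attn):
        # Text position the straight diagonal passes through at frame t.
        center = t * (n_tokens - 1) / max(n_frames - 1, 1)
        for j, w in enumerate(row):
            total += w
            if abs(j - center) <= band:
                near += w
    return near / total if total > 0 else 0.0


def sweep(attention_maps, threshold=0.6, band=0.5):
    """Return the (layer, head) keys whose maps score at least `threshold`,
    ordered from most to least diagonal."""
    scored = [(diagonal_score(m, band), key) for key, m in attention_maps.items()]
    return [key for score, key in sorted(scored, reverse=True) if score >= threshold]


# Toy maps: 6 speech frames attending over 3 text tokens.
diagonal_map = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
uniform_map = [[1 / 3, 1 / 3, 1 / 3] for _ in range(6)]
candidates = sweep({(0, 0): uniform_map, (1, 3): diagonal_map})
```

In this toy setup only the head keyed `(1, 3)` passes the threshold. A real sweep would average such a score over many utterances before deciding which heads count as AEAMs.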
The key steps are:
- Identify AEAMs using the Attention Sweeping algorithm.
- Apply constraining masks (CMasks) on the AEAMs during inference to guide the model towards monotonic synthesis.
- Evaluate the impact of ACI on robustness (WER) and synthesis quality (naturalness, speaker similarity).
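The constraining step can be sketched in the same spirit. The code below assumes a simple window mask around the current text position and applies it to already-normalized attention weights. The exact shape of the paper's CMasks is not described in this summary, so `apply_cmask`, `constrained_decode`, and the `back`/`ahead` window sizes are hypothetical. In a real model the mask would more likely be applied to attention logits before the softmax, and each masked row would influence the next generated frame, whereas here the rows are given up front.

```python
def apply_cmask(row, pos, back=0, ahead=1):
    """Zero out attention to text tokens outside [pos - back, pos + ahead],
    then renormalize so the remaining weights sum to 1."""
    lo = max(pos - back, 0)
    hi = min(pos + ahead, len(row) - 1)
    masked = [w if lo <= j <= hi else 0.0 for j, w in enumerate(row)]
    total = sum(masked)
    if total == 0.0:
        # Nothing survives inside the window: keep attending to the current token.
        masked[pos] = 1.0
        return masked
    return [w / total for w in masked]


def constrained_decode(rows, back=0, ahead=1):
    """Mask one attention row per speech frame, advancing the text position
    monotonically (it can stay or move forward, never backward)."""
    pos = 0
    out = []
    for row in rows:
        masked = apply_cmask(row, pos, back, ahead)
        out.append(masked)
        # Most-attended token inside the window; ties resolve to the earliest one.
        best = max(range(len(masked)), key=masked.__getitem__)
        pos = max(pos, best)
    return out


# Frame 0 tries to jump to the last token; the mask keeps it near the start.
rows = [[0.1, 0.1, 0.8], [0.2, 0.7, 0.1]]
constrained = constrained_decode(rows)
```

The window keeps the attended text position from skipping ahead or looping back, which is the kind of alignment failure that raises WER in unconstrained decoding.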
Statistics
With ACI, the word error rate (WER) of synthesized speech drops by up to 20.5% in relative terms on VALL-E models of various configurations.
The naturalness (measured by Predicted-MOS) and speaker similarity (measured by SECS) of the synthesized speech are comparable to the baseline without ACI.
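Because the WER figure is a relative reduction, not an absolute one, a short sketch of the arithmetic may help. The helper name `relative_wer_reduction` and the example numbers are illustrative assumptions; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Made-up numbers: a drop from 5.0% to 4.0% WER is 1 point absolute,
# which corresponds to a 20% relative reduction.
example = relative_wer_reduction(5.0, 4.0)
```

A relative figure depends on the baseline, so the same percentage can correspond to very different absolute improvements across model configurations.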
Quotes
"We discover that among the massive self-attention modules in the whole decoder, some particular heads in a particular early layer are responsible for emerging alignments in their attention maps (i.e., AEAMs)."
"We propose a training-free method, Attention-Constrained Inference (ACI), which seamlessly works on top of existing decoder-only TTS models, to detect AEAMs of a model and apply CMasks on AEAMs at the inference stage to guide the alignment to be monotonic and realize robust synthesis."