Core Concepts
Unimodal aggregation (UMA) enhances feature representations for non-autoregressive speech recognition, reducing both recognition errors and computational complexity.
Abstract:
Unimodal aggregation (UMA) is proposed to learn better feature representations for text tokens.
Frame-wise encoder features are integrated into token-level representations via learned aggregation weights.
Recognition performance improves over regular CTC.
Introduction:
Autoregressive vs. non-autoregressive methods in ASR.
Challenges in aligning input frames to text tokens.
Method:
Review of CTC and formulation of the proposed UMA method.
Model structure explained: encoder, unimodal aggregation module, and decoder.
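The aggregation step can be sketched as follows: each frame gets a scalar weight, segment boundaries are placed at local valleys of the weight sequence (so each segment carries a unimodal, rise-then-fall weight pattern), and frames within a segment are combined by a weighted average. This is a minimal NumPy sketch under those assumptions; the function name and the exact valley test are illustrative, not the paper's implementation.

```python
import numpy as np

def unimodal_aggregate(feats, weights):
    """Aggregate frame-wise features (T, D) into token-level features.

    feats:   (T, D) array of encoder frame features.
    weights: (T,) array of per-frame aggregation weights.

    Boundaries are set where the weight sequence stops falling and
    starts rising (a local valley), so each resulting segment has a
    unimodal weight pattern. Sketch only, not the paper's code.
    """
    T = len(weights)
    boundaries = [0]
    for t in range(1, T - 1):
        # valley: weight at t is not above its left neighbor
        # and strictly below its right neighbor
        if weights[t] <= weights[t - 1] and weights[t] < weights[t + 1]:
            boundaries.append(t)
    boundaries.append(T)

    # weighted average of the frames inside each segment
    segments = []
    for s, e in zip(boundaries[:-1], boundaries[1:]):
        w = weights[s:e]
        segments.append((w[:, None] * feats[s:e]).sum(0) / (w.sum() + 1e-8))
    return np.stack(segments)
```

The number of output rows equals the number of detected segments, which in the paper's setting should roughly match the number of text tokens.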
Experiments:
Evaluation on three Mandarin datasets: AISHELL-1, AISHELL-2, and HKUST.
Model configurations and comparison with other NAR methods.
Results and Analysis:
Example of UMA weights showing clear token segmentation.
Performance comparison across the datasets, where UMA outperforms the comparison NAR methods.
Conclusions:
UMA improves feature representations while reducing both recognition errors and computational complexity.
Stats
"Experiments are conducted on three Mandarin Chinese datasets."
"AISHELL-1 recorded with a high-fidelity microphone."
"HKUST dataset consists of spontaneous conversations during phone calls."
Quotes
"No explicit constraint is put on the aggregation weights αt."
"UMA outperforms all comparison NAR models."