Core Concepts
This paper introduces NAST-S2x, a non-autoregressive framework for end-to-end simultaneous speech-to-speech translation. Rather than chaining separate components through an intermediate text representation, as cascaded approaches do, it generates the translation directly from speech input, which avoids error propagation and delay accumulation while achieving high translation quality at low latency.
Stats
The acoustic unit sequence is 25 times longer than the corresponding text sequence on average.
NAST-S2x achieves high-quality simultaneous interpretation within a delay of less than 3 seconds.
NAST-S2x provides a 28× decoding speedup in offline generation.
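To put the length ratio and the speedup in concrete terms, the arithmetic can be sketched as below. This is only an illustration: the function names, the 20-token sentence, and the 14-second baseline are hypothetical values chosen for the example, not figures from the paper. Only the 25× average length ratio and the 28× offline speedup come from the stats above.

```python
# Illustrative arithmetic only; inputs marked "hypothetical" are not from the paper.

AVG_UNITS_PER_TEXT_TOKEN = 25   # from the stats: units are ~25x longer than text on average
OFFLINE_SPEEDUP = 28.0          # from the stats: 28x decoding speedup in offline generation


def expected_unit_length(text_tokens: int) -> int:
    """Approximate acoustic-unit sequence length for a given text length."""
    return text_tokens * AVG_UNITS_PER_TEXT_TOKEN


def sped_up_decoding_time(baseline_seconds: float) -> float:
    """Decoding time after applying the reported offline speedup to a baseline time."""
    return baseline_seconds / OFFLINE_SPEEDUP


text_tokens = 20          # hypothetical sentence length in text tokens
baseline_seconds = 14.0   # hypothetical autoregressive decoding time

print(expected_unit_length(text_tokens))        # about 500 acoustic units to generate
print(sped_up_decoding_time(baseline_seconds))  # 14.0 / 28 seconds
```

The length ratio is an average, so the unit count for any single sentence may differ from this estimate.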
Quotes
"However, pipeline methods often suffer from error propagation and delay accumulation."
"This characteristic allows it to be trained and tested directly from speech to acoustic units, thereby circumventing issues related to error propagation."
"In Offline-S2S, it matches the performance of the strong autoregressive baseline while providing a 28× inference speedup."