Core Concepts
Zipformer is a faster, more memory-efficient, and better-performing Transformer-based encoder for ASR, featuring a new encoder structure, block design, normalization layer, activation functions, and optimizer.
Abstract
Zipformer introduces a U-Net-like encoder structure in which the middle stacks operate at lower, downsampled frame rates. The re-designed block contains more modules and reuses attention weights across them for efficiency. BiasNorm, a simpler replacement for LayerNorm, retains some length information during normalization. The new activation functions SwooshR and SwooshL outperform Swish. The ScaledAdam optimizer converges faster and reaches better final performance than Adam. Extensive experiments on the LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate Zipformer's effectiveness.
Stats
Zipformer achieves state-of-the-art results on the LibriSpeech dataset.
Zipformer speeds up inference by over 50% compared to prior models.
Zipformer requires less GPU memory during training.
Quotes
"Modeling changes in Zipformer include a U-Net-like encoder structure with downsampling to lower frame rates."
"Our proposed BiasNorm allows us to retain length information in normalization."
"SwooshR and SwooshL activation functions work better than Swish in Zipformer."
"ScaledAdam achieves faster convergence and better performance than Adam."