Core Concepts
BEAST is an online joint beat and downbeat tracking system built on a streaming Transformer with contextual block processing and relative positional encoding. It reports an improvement of about 5 percentage points in F1-measure over the state-of-the-art online beat tracking model.
Abstract
The paper proposes BEAST, a novel online joint beat and downbeat tracking system based on a streaming Transformer. To handle online scenarios, BEAST applies contextual block processing in the Transformer encoder. It also adopts relative positional encoding in the attention layers of the streaming Transformer encoder to capture relative timing, which is critically important information in music.
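To make the idea of relative positional encoding concrete, here is a minimal NumPy sketch of causal single-head self-attention in which each score also depends on the distance between the query frame and the key frame. It is an illustration only: the function name, the argument names, and the simple "one embedding per distance" formulation are assumptions, not the formulation used in BEAST or in Transformer-XL.

```python
import numpy as np

def relative_self_attention(x, w_q, w_k, w_v, rel_emb):
    """Causal single-head self-attention with a relative-position term.

    Hypothetical helper for illustration; not taken from the paper.
    x:       (T, d) input frames
    w_q/k/v: (d, d) projection matrices
    rel_emb: (T, d) one embedding per distance i - j in [0, T - 1]
    """
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Content term: how well query i matches key j.
    content = q @ k.T                                   # (T, T)

    # dist[i, j] = i - j; future positions are clipped to 0 and masked below.
    idx = np.arange(T)
    dist = np.clip(idx[:, None] - idx[None, :], 0, T - 1)

    # Position term: query i scored against the embedding of distance i - j.
    position = np.einsum("id,ijd->ij", q, rel_emb[dist])

    scores = (content + position) / np.sqrt(d)

    # Causal mask: frame i may only attend to frames j <= i.
    scores = np.where(idx[None, :] <= idx[:, None], scores, -np.inf)

    # Row-wise softmax (the diagonal is always unmasked, so the max is finite).
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights
```

Because the position term is indexed by distance rather than by absolute frame index, the same embedding is reused wherever two frames are the same number of steps apart, which is what lets the model express patterns such as "one beat period ago" independently of where they occur in the stream.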
The key highlights and insights are:
BEAST utilizes the contextual block processing mechanism in the Transformer encoder to support online processing, where only the past and present input features are available.
BEAST adopts the relative attention mechanism in the streaming Transformer encoder, inspired by Transformer-XL and Music Transformer, to better capture the relative timing position information in music.
Experiments on benchmark datasets show that in a low-latency scenario with maximum latency under 50 ms, BEAST achieves an F1-measure of 80.04% for beat tracking and 46.78% for downbeat tracking, a substantial improvement of about 5 percentage points over the state-of-the-art online beat tracking model.
This is the first work to utilize the streaming Transformer in the Music Information Retrieval (MIR) domain.
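The block-wise online processing described in the highlights above can be sketched as follows. This is a simplified illustration, not the paper's mechanism: it only shows how a feature stream can be cut into blocks so that each block sees a limited amount of past context and no future blocks. The function names and the `block_size` and `left_blocks` parameters are hypothetical, and whatever BEAST does to carry context between blocks is not reproduced here.

```python
import numpy as np

def block_attention_mask(num_frames, block_size, left_blocks):
    """Return a boolean mask where mask[i, j] is True if frame i may attend to frame j.

    A frame sees every frame in its own block and in up to `left_blocks`
    preceding blocks, and nothing from later blocks.
    """
    block_id = np.arange(num_frames) // block_size
    diff = block_id[:, None] - block_id[None, :]
    return (diff >= 0) & (diff <= left_blocks)

def iter_blocks(features, block_size, left_blocks):
    """Yield (window, offset) pairs for streaming inference.

    `window` holds the past context followed by the current block;
    `offset` is the index inside `window` where the current block starts.
    """
    for start in range(0, len(features), block_size):
        ctx_start = max(0, start - left_blocks * block_size)
        end = min(start + block_size, len(features))
        yield features[ctx_start:end], start - ctx_start

# Usage: 5 frames, blocks of 2 frames, 1 block of left context.
frames = np.arange(5)
windows = [(w.tolist(), off) for w, off in iter_blocks(frames, 2, 1)]
```

In a scheme like this the algorithmic latency is bounded by the block duration, since a frame has to wait for the rest of its block before it is processed. Smaller blocks lower the latency but give the encoder less present-time context per step.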
Stats
Many deep learning models have achieved dominant performance on the offline beat tracking task.
Online beat tracking remains challenging.
In a low-latency scenario with maximum latency under 50 ms, BEAST achieves an F1-measure of 80.04% for beat tracking and 46.78% for downbeat tracking, a substantial improvement of about 5 percentage points over the state-of-the-art online beat tracking model.
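For readers unfamiliar with how an F1-measure is obtained for beat tracking, the following sketch shows one common way to compute it: an estimated beat counts as a hit if it falls within a tolerance window of an unmatched reference beat. The 70 ms tolerance and the greedy matching are assumptions made for illustration; the summary does not state the evaluation settings used in the paper.

```python
def beat_f_measure(reference, estimated, tolerance=0.07):
    """F1-measure between reference and estimated beat times (in seconds).

    Illustrative helper: the default tolerance of 0.07 s is an assumed,
    commonly used value, not a setting confirmed by the source.
    """
    ref = sorted(reference)
    est = sorted(estimated)
    hits = 0
    i = j = 0
    # Walk both sorted lists, matching each reference beat at most once.
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1   # estimate is too early to match any remaining reference
        else:
            i += 1   # reference beat was missed
    if hits == 0:
        return 0.0
    precision = hits / len(est)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)

# Usage: 3 of 5 estimates match 3 of 4 reference beats.
score = beat_f_measure([0.5, 1.0, 1.5, 2.0], [0.52, 1.0, 1.62, 2.05, 2.5])
```

In the usage example, precision is 3/5 and recall is 3/4, which gives an F1-measure of about 0.667.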
Quotes
"To deal with online scenarios, BEAST applies contextual block processing in the Transformer encoder."
"BEAST adopts the relative attention mechanism in the streaming Transformer encoder to better capture the relative timing position information in music."