The authors propose ASiT, a self-supervised learning framework that captures both local and global contextual information through group masked model learning and self-distillation, improving audio representations and achieving state-of-the-art performance on a range of audio classification tasks.
Transformers are adapted to the audio domain via self-supervised pretraining, boosting performance across these classification tasks.
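The combination described above, masking contiguous groups of spectrogram patches and distilling a student's predictions on the masked view toward a momentum teacher's predictions on the full view, can be illustrated with a minimal numpy sketch. This is a schematic, not the authors' implementation: the single-layer "encoder", the group size, the mask ratio, and the EMA momentum are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def group_mask(num_patches, group_size, mask_ratio, rng):
    # Group masked modelling: mask contiguous runs of patches rather than
    # independent single patches, forcing the model to use wider context.
    mask = np.zeros(num_patches, dtype=bool)
    target = int(num_patches * mask_ratio)
    while mask.sum() < target:
        start = rng.integers(0, num_patches)
        mask[start:start + group_size] = True
    return mask

def encoder(x, W):
    # Stand-in for a transformer encoder: one linear projection + softmax
    # producing a distribution over "prototype" classes per patch.
    logits = x @ W
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy spectrogram: 64 patches, each a 16-dim feature vector.
patches = rng.normal(size=(64, 16))
W_student = rng.normal(size=(16, 8)) * 0.1
W_teacher = W_student.copy()

mask = group_mask(64, group_size=4, mask_ratio=0.5, rng=rng)
masked_view = patches.copy()
masked_view[mask] = 0.0                       # student sees the masked view

p_student = encoder(masked_view, W_student)   # local: predict masked patches
p_teacher = encoder(patches, W_teacher)       # global: teacher sees full input

# Self-distillation loss: cross-entropy between teacher and student
# distributions, computed on the masked positions only.
loss = -(p_teacher[mask] * np.log(p_student[mask] + 1e-9)).sum(axis=-1).mean()

# Teacher is an exponential moving average of the student (no gradient).
momentum = 0.996
W_teacher = momentum * W_teacher + (1 - momentum) * W_student
```

In a real setup the student would be updated by gradient descent on `loss` while the teacher follows it via the EMA step, so the distillation target improves as training progresses.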