Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
Proposes a pipeline for contrastive language-audio pretraining that learns audio representations by pairing audio data with natural language descriptions.
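To make the contrastive objective concrete, the following is a minimal sketch of a symmetric audio-text contrastive loss of the kind commonly used in this setting. It is not the authors' implementation: the function names, the fixed temperature of 0.07, and the use of precomputed embeddings in place of real audio and text encoders are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Hypothetical helper: row i of audio_emb is assumed to pair with row i of text_emb.

    The temperature value is an illustrative assumption, not taken from the source.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    # Pairwise similarity matrix: entry (i, j) compares audio i with text j.
    logits = a @ t.T / temperature
    idx = np.arange(logits.shape[0])
    # Audio-to-text: each audio clip should pick out its own caption (row-wise softmax).
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-audio: each caption should pick out its own audio clip (column-wise softmax).
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)
```

Under these assumptions, a batch whose audio and text embeddings are correctly paired yields a lower loss than the same batch with the captions shuffled, which is the signal that pulls matching audio and text together in the shared embedding space.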