Key Concepts
OneEncoder is a lightweight framework that progressively aligns image, text, audio, and video modalities without relying on vast aligned datasets. It leverages frozen pretrained modality-specific encoders and a compact Universal Projection module to achieve efficient and cost-effective multimodal alignment.
Summary
The paper introduces OneEncoder, a lightweight framework for progressively aligning multiple modalities, including image, text, audio, and video. The key aspects of the method are:
Modality-specific encoders (e.g., ViT for images, BERT for text, Wav2Vec for audio, VideoMAE for video) are used as feature extractors and kept frozen during training.
A lightweight Universal Projection (UP) module is trained in the first step to align image and text modalities using a contrastive loss.
In subsequent steps, the pretrained UP module is frozen, and a compact Alignment Layer (AL) is trained to align new modalities (e.g., audio, video) with the already aligned modalities.
Modality tokens are used to facilitate efficient switching between modalities during the forward pass.
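The contrastive objective used in the first step can be illustrated with a minimal sketch. The summary does not say which contrastive loss is used, so this assumes a CLIP-style symmetric InfoNCE loss. The function names and the temperature of 0.07 are illustrative assumptions, not details from the paper. Pure Python is used to keep the example dependency-free; a real implementation would use a tensor library.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Assumed CLIP-style loss: matched pairs share the same batch index."""
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy 2-D embeddings standing in for projected image and text features.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs the loss is close to zero when each image embedding matches its paired text embedding and large when the pairs are swapped. This is the signal that would push a trainable projection toward a shared space while the encoders stay frozen.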
This progressive alignment approach allows OneEncoder to integrate new modalities at a low cost, without the need to retrain the entire framework. Experiments show that OneEncoder outperforms classical methods that train modality-specific encoders simultaneously, especially in scenarios with limited aligned datasets. OneEncoder demonstrates strong performance in various downstream tasks, such as classification, querying, and visual question answering.
Statistics
OneEncoder uses frozen pretrained encoders with a total of 196M parameters.
The Universal Projection (UP) module has 4M trainable parameters.
The Alignment Layer (AL) has 65,792 trainable parameters.
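To put these figures in proportion, the short calculation below estimates the share of parameters that are trainable at each step. It assumes the 196M figure covers all frozen encoders together and treats 4M as a rounded value, so the results are approximate. The last line records an arithmetic observation only: 65,792 equals the parameter count of a 256-to-256 linear layer with bias. Whether the AL actually has that shape is an assumption not confirmed by the text above.

```python
FROZEN_ENCODER_PARAMS = 196_000_000  # frozen pretrained encoders (from the text)
UP_PARAMS = 4_000_000                # Universal Projection, rounded (from the text)
AL_PARAMS = 65_792                   # Alignment Layer (from the text)

def trainable_fraction(trainable, frozen):
    """Share of all parameters that receive gradient updates."""
    return trainable / (trainable + frozen)

# Step 1: only the UP module is trained; the encoders are frozen.
step1 = trainable_fraction(UP_PARAMS, FROZEN_ENCODER_PARAMS)

# Later steps: the UP module is frozen too; only the AL is trained.
step2 = trainable_fraction(AL_PARAMS, FROZEN_ENCODER_PARAMS + UP_PARAMS)

# Hypothetical shape check (assumption): weights plus bias of a 256 -> 256 linear layer.
linear_256_params = 256 * 256 + 256

print(f"step 1 trainable share: {step1:.2%}")
print(f"later steps trainable share: {step2:.4%}")
print(f"256x256 linear layer with bias: {linear_256_params} parameters")
```

Under these assumptions, about 2% of the parameters are updated in the first step and well under 0.1% when a new modality is added.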
Quotes
"OneEncoder operates efficiently and cost-effectively, even in scenarios where vast aligned datasets are unavailable, due to its lightweight design."
"Using OneEncoder represents a balanced compromise between alignment performance and complexity, as it minimizes the number of parameters to tune."