A Lightweight Framework for Progressive Alignment of Image, Text, Audio, and Video Modalities
OneEncoder is a lightweight framework that progressively aligns image, text, audio, and video modalities without requiring large aligned datasets. It combines frozen, pretrained modality-specific encoders with a compact Universal Projection module to keep multimodal alignment efficient and inexpensive.
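
The arrangement described above (frozen per-modality encoders feeding one small shared projection) can be sketched in a few lines of plain Python. This is an illustrative toy, not the OneEncoder implementation: the class names `FrozenEncoder` and `UniversalProjection`, the dimensions, the random linear stand-ins for pretrained encoders, and the assumption that every encoder emits features of the same width are all hypothetical choices made for the example.

```python
import math
import random

random.seed(0)

FEATURE_DIM = 8  # assumed: every frozen encoder emits features of this width
SHARED_DIM = 4   # assumed: width of the shared embedding space


def random_matrix(rows, cols):
    """Small random matrix used as placeholder weights."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


class FrozenEncoder:
    """Hypothetical stand-in for a pretrained encoder; its weights are never updated."""

    def __init__(self, input_dim):
        self.weights = random_matrix(FEATURE_DIM, input_dim)
        self.trainable = False

    def __call__(self, raw_input):
        return matvec(self.weights, raw_input)


class UniversalProjection:
    """Hypothetical shared projection: one weight matrix reused by every modality."""

    def __init__(self):
        self.weights = random_matrix(SHARED_DIM, FEATURE_DIM)
        self.trainable = True

    def __call__(self, features):
        return l2_normalize(matvec(self.weights, features))

    def num_parameters(self):
        return SHARED_DIM * FEATURE_DIM


# One frozen encoder per modality (input sizes are arbitrary), one shared projection.
encoders = {
    "image": FrozenEncoder(16),
    "text": FrozenEncoder(12),
    "audio": FrozenEncoder(10),
    "video": FrozenEncoder(20),
}
projection = UniversalProjection()


def embed(modality, raw_input):
    """Encode with the frozen modality encoder, then map into the shared space."""
    return projection(encoders[modality](raw_input))


def cosine_similarity(a, b):
    """Dot product of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


image_vec = embed("image", [random.random() for _ in range(16)])
text_vec = embed("text", [random.random() for _ in range(12)])
print(len(image_vec), len(text_vec))  # both live in the same SHARED_DIM space
print(cosine_similarity(image_vec, text_vec))
```

In this sketch the only component marked trainable is the shared projection, which is what keeps the trainable parameter count small relative to the encoders; how the projection is actually trained, and in what order modalities are added, is not shown here.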