Enhancing Multimodal Large Language Model for Dynamic Audio-Visual Question Answering
CAT enhances Multimodal Large Language Models for dynamic audio-visual scenarios by aggregating question-related clues, training on mixed audio-visual datasets, and applying AI-assisted ambiguity-aware direct preference optimization to improve response quality.