Efficient Video Question Answering with Self-Adaptive Sampling on Image-Text Models
Two efficient frame-sampling methods, Most Implied Frames (MIF) and Most Dominant Frames (MDF), are proposed to improve the performance of image-text models on video question answering tasks.