The proposed ViLA network selects key frames from a video efficiently and aligns them with the input question, improving video question answering performance.
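
To make the idea of question-conditioned key-frame selection concrete, the sketch below ranks frames by how similar each frame embedding is to a question embedding and keeps the top `k`. This is an illustration under stated assumptions, not ViLA's actual method: the summary does not describe how ViLA scores or selects frames, so a fixed cosine-similarity ranking is swapped in here, and the names `cosine`, `select_key_frames`, `frame_embs`, `question_emb`, and `k` are hypothetical. The embeddings are assumed to come from some pretrained visual and text encoders that are not shown.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_key_frames(
    frame_embs: Sequence[Sequence[float]],
    question_emb: Sequence[float],
    k: int,
) -> List[int]:
    """Return indices of the k frames most similar to the question.

    Assumption: similarity ranking stands in for whatever selection
    mechanism the real network uses. Indices are returned in temporal
    order so downstream modules see frames in their original sequence.
    """
    if k <= 0:
        return []
    scores = [cosine(frame, question_emb) for frame in frame_embs]
    # Rank frame indices from most to least similar to the question.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top k, then restore temporal order.
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy 2-D embeddings: four frames and one question vector.
    frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    question = [1.0, 0.0]
    print(select_key_frames(frames, question, k=2))
```

Passing only the selected frames to the answering model reduces the number of visual tokens it has to process, which is where the efficiency gain of key-frame selection comes from in general. A learned selector would replace the fixed similarity score with one trained end to end for the question answering task.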