Core Concepts
This paper introduces a video captioning approach that uses pseudo-labeling and keyword refining to achieve performance comparable to fully supervised methods while significantly reducing the need for expensive human annotations.
Stats
The vocabulary size for MSVD is 12,800 words.
The vocabulary size for MSR-VTT is 28,485 words.
The vocabulary size for VATEX is 21,784 words.
The maximum number of features used for MSVD is 20 for both appearance/motion and objects.
The maximum number of features used for MSR-VTT is 30 for appearance/motion and 40 for objects.
The maximum number of features used for VATEX is 30 for both appearance/motion and objects.
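The per-dataset settings above can be collected into a small configuration table. The sketch below is illustrative only: the names DatasetConfig, CONFIGS, and pad_or_truncate are hypothetical, and the choice to truncate extra features and zero-pad missing ones is an assumption, since the summary states only the maximum counts and not how they are enforced.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class DatasetConfig:
    """Hypothetical container for the per-dataset statistics listed above."""
    vocab_size: int
    max_visual_feats: int  # appearance/motion features
    max_object_feats: int  # object features


# Values copied from the Stats section; the structure itself is an assumption.
CONFIGS = {
    "MSVD": DatasetConfig(vocab_size=12_800, max_visual_feats=20, max_object_feats=20),
    "MSR-VTT": DatasetConfig(vocab_size=28_485, max_visual_feats=30, max_object_feats=40),
    "VATEX": DatasetConfig(vocab_size=21_784, max_visual_feats=30, max_object_feats=30),
}


def pad_or_truncate(
    feats: Sequence[Sequence[float]], max_len: int, dim: int
) -> Tuple[List[List[float]], List[bool]]:
    """Return exactly max_len feature rows plus a mask marking the real ones.

    Assumed policy (not stated in the source): keep the first max_len rows
    and fill any remaining slots with zero vectors of length dim.
    """
    kept = [list(row) for row in feats[:max_len]]
    n_pad = max_len - len(kept)
    mask = [True] * len(kept) + [False] * n_pad
    kept.extend([0.0] * dim for _ in range(n_pad))
    return kept, mask
```

For example, calling pad_or_truncate(object_feats, CONFIGS["MSR-VTT"].max_object_feats, dim) would cap a video's object features at 40 rows. The returned mask lets a downstream attention layer ignore padded slots.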