Dynamic Saliency Prediction with Video Foundation Models
SalFoM is a novel encoder-decoder video transformer architecture for video saliency prediction. It uses a video foundation model (UMT) as its feature extractor and introduces a heterogeneous decoder that captures diverse spatio-temporal information.