Process long video sequences efficiently with a text-conditioned resampler, which improves performance across a range of tasks.
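To make the idea concrete, here is a minimal sketch of what a text-conditioned resampler could look like. The summary above gives no architectural detail, so this assumes a Perceiver-style design: a small, fixed set of latent queries is shifted by a text embedding and cross-attends over per-frame features. Every name here (`text_conditioned_resample`, `latents`, `w_q`, `w_k`, `w_v`) is hypothetical, and the weights are random rather than learned. It should be read as an illustration of the general technique, not as the method the summary refers to.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def text_conditioned_resample(frame_feats, text_emb, latents, w_q, w_k, w_v):
    """Compress T frame features into M tokens, steered by a text embedding.

    frame_feats: (T, d) per-frame features; T may be very large.
    text_emb:    (d,)   pooled text embedding (assumed to share dimension d).
    latents:     (M, d) latent queries (learned in practice, random here).
    """
    # Assumption: text conditioning is done by adding the text embedding
    # to every latent query before the query projection.
    queries = (latents + text_emb) @ w_q                 # (M, d)
    keys = frame_feats @ w_k                             # (T, d)
    values = frame_feats @ w_v                           # (T, d)
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])  # (M, T)
    attn = softmax(scores, axis=-1)                      # each row sums to 1 over frames
    return attn @ values                                 # (M, d)


rng = np.random.default_rng(0)
d, M = 16, 4
latents = rng.normal(size=(M, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
text_emb = rng.normal(size=d)

# The output size depends only on M, not on the number of frames T.
for T in (32, 512):
    frames = rng.normal(size=(T, d))
    out = text_conditioned_resample(frames, text_emb, latents, w_q, w_k, w_v)
    print(T, out.shape)
```

Under these assumptions, the efficiency comes from the fixed output length: any downstream model sees M tokens whether the video has 32 frames or 512, and the cross-attention cost grows linearly with the number of frames.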