Core Concepts
Freezing pretrained vision and/or text modules during the pretraining of vision-language transformer encoders can significantly reduce computational cost without substantial performance loss on downstream tasks.
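To make the idea concrete, the following is a minimal PyTorch sketch, not the paper's exact setup: the tower architectures, sizes, and names (TwoTowerEncoder, fusion_head) are illustrative stand-ins. It shows how freezing a pretrained module is typically done, by disabling its gradients so that only the remaining modules are updated, which is where the compute and memory savings come from.

```python
import torch
import torch.nn as nn

class TwoTowerEncoder(nn.Module):
    """Illustrative two-tower vision-language encoder (not the paper's model)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        # Stand-ins for pretrained unimodal encoders.
        self.vision_tower = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=4,
        )
        self.text_tower = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=4,
        )
        self.fusion_head = nn.Linear(2 * dim, dim)  # trainable cross-modal layer

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        v = self.vision_tower(image_tokens).mean(dim=1)  # pooled image features
        t = self.text_tower(text_tokens).mean(dim=1)     # pooled text features
        return self.fusion_head(torch.cat([v, t], dim=-1))

def freeze(module: nn.Module) -> None:
    """Disable gradient updates and keep frozen layers (dropout, norms) in eval mode."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()

model = TwoTowerEncoder()
freeze(model.vision_tower)   # frozen visual module
# freeze(model.text_tower)   # optionally freeze the text module as well

# Only parameters that still require gradients are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```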
Stats
Freezing the visual module in a two-tower model yielded nearly identical performance to the baseline on SNLI-VE and slightly better performance than the baseline on the reference resolution task.
The two-tower model with both modules frozen achieved comparable results to the baseline on the reference resolution task and only slightly lower accuracy on NLVR2.
One-tower models with over 100M parameters were outperformed by two-tower models with fewer than 40M parameters on the same downstream tasks.
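Continuing the sketch above, a parameter count like the 100M-vs-40M comparison can be checked directly on one's own model; this helper is generic and assumes the `model` defined in the earlier snippet.

```python
def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

total, trainable = count_parameters(model)
print(f"total: {total / 1e6:.1f}M, trainable after freezing: {trainable / 1e6:.1f}M")
```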