Transformers outperform naive baselines on spatial relation prediction by extracting spatial relationships from images more effectively.