Text-only language models can ground spatial relations using location tokens, outperforming vision-and-language models.