Incorporating visual grounding into language models improves learning efficiency and aligns with human language acquisition.
Visual grounding makes language modeling more efficient and yields more human-like representations.