The author explores whether large language models can comprehend visual signals without any fine-tuning, introducing a Vision-to-Language Tokenizer that represents images as sequences of language tokens the model can process directly.
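The core idea behind a vision-to-language tokenizer can be sketched as nearest-neighbor quantization: each image patch embedding is snapped to the closest token embedding in the (frozen) LLM's vocabulary, so the image becomes an ordinary token sequence. The sketch below is a toy illustration under that assumption; all names, shapes, and the random data are hypothetical, not the paper's actual method or API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumed, not from the paper):
# a small "LLM vocabulary" of token embeddings, and an "image"
# represented as 16 patch embeddings in the same space.
vocab_size, dim = 1000, 64
token_embeddings = rng.normal(size=(vocab_size, dim))

num_patches = 16
patch_embeddings = rng.normal(size=(num_patches, dim))

def tokenize_patches(patches, vocab):
    # For each patch, pick the vocabulary token whose embedding is
    # nearest in Euclidean distance; the image becomes token ids.
    dists = np.linalg.norm(patches[:, None, :] - vocab[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # shape: (num_patches,)

token_ids = tokenize_patches(patch_embeddings, token_embeddings)
print(token_ids.shape)  # (16,)
```

The resulting `token_ids` could then be fed to a frozen language model like any text, which is what lets the approach avoid fine-tuning.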