Multimodal small language models (MSLMs) such as Mipha-3B can outperform significantly larger multimodal models without requiring additional training data.