This paper introduces MID-M, a multimodal framework that utilizes a general-domain large language model (LLM) to process radiology data. The key aspects of the framework are:
Image Conversion: MID-M employs a domain-specific image classifier to convert images into textual descriptions, enabling the LLM to process multimodal data through a text-only interface.
General-Domain LLM: Instead of relying on heavily pre-trained and fine-tuned multimodal models, MID-M leverages the in-context learning capabilities of a general-domain LLM, which requires significantly fewer parameters; a minimal sketch of this pipeline follows the list.
Robustness to Data Quality: The authors systematically evaluate MID-M's performance under different levels of data corruption, simulating real-world scenarios where medical data can be incomplete or inconsistent. MID-M demonstrates strong robustness, outperforming task-specific and heavily pre-trained models when faced with low-quality data; a sketch of this corruption setup also follows the list.
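The following is a minimal sketch, in Python, of the two-stage pipeline described above: a domain-specific model turns an image into a textual description, and a general-domain LLM generates a report from a few-shot prompt built over such descriptions. The function names (describe_image, build_prompt, general_llm) and the canned strings are hypothetical stand-ins for illustration, not the authors' actual API.

```python
# Sketch of a MID-M-style pipeline: image -> textual description -> few-shot
# prompt -> general-domain LLM. Component names are illustrative assumptions.

from typing import List, Dict


def describe_image(image_path: str) -> str:
    """Stand-in for the domain-specific image classifier that converts an
    image into a textual description. Returns a canned string here purely
    for illustration."""
    return "Chest X-ray: cardiomegaly present; no pleural effusion."


def build_prompt(examples: List[Dict[str, str]], description: str) -> str:
    """Assemble a few-shot prompt so a text-only, general-domain LLM can use
    in-context learning over the converted image descriptions."""
    lines = []
    for ex in examples:
        lines.append(f"Image findings: {ex['description']}")
        lines.append(f"Report: {ex['report']}")
    lines.append(f"Image findings: {description}")
    lines.append("Report:")
    return "\n".join(lines)


def general_llm(prompt: str) -> str:
    """Placeholder for a call to any general-domain LLM's text-completion
    endpoint; no medical fine-tuning is assumed."""
    return "<generated report>"


if __name__ == "__main__":
    few_shot = [
        {"description": "Chest X-ray: clear lungs, normal heart size.",
         "report": "No acute cardiopulmonary abnormality."},
    ]
    desc = describe_image("study_001.png")
    print(general_llm(build_prompt(few_shot, desc)))
```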
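To make the robustness evaluation concrete, here is an illustrative way to simulate low-quality inputs by masking a fraction of tokens in the image-derived descriptions. This corruption scheme is an assumption for illustration; the paper's actual corruption procedure may differ.

```python
# Illustrative corruption of textual descriptions at increasing rates.
# The [MASK] replacement scheme is an assumed example, not the authors' method.

import random


def corrupt_description(description: str, corruption_rate: float, seed: int = 0) -> str:
    """Replace a fraction of tokens with a [MASK] placeholder to simulate
    incomplete or inconsistent medical data."""
    rng = random.Random(seed)
    tokens = description.split()
    corrupted = [
        "[MASK]" if rng.random() < corruption_rate else tok
        for tok in tokens
    ]
    return " ".join(corrupted)


if __name__ == "__main__":
    desc = "Chest X-ray: cardiomegaly present; no pleural effusion."
    for rate in (0.0, 0.2, 0.5):
        print(f"corruption={rate:.1f}: {corrupt_description(desc, rate)}")
```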
The experiments show that MID-M achieves comparable or superior performance to other state-of-the-art models, including those with substantially more parameters, while requiring significantly fewer resources for training and deployment. This highlights the potential of leveraging general-domain LLMs for specialized tasks, offering a sustainable and cost-effective alternative to traditional multimodal model development.