This paper introduces MID-M, a multimodal framework that utilizes a general-domain large language model (LLM) to process radiology data. The key aspects of the framework are:
Image Conversion: MID-M employs a domain-specific image classifier to convert images into textual descriptions, enabling the LLM to process multimodal data through a text-only interface (see the pipeline sketch after this list).
General-Domain LLM: Instead of relying on heavily pre-trained and fine-tuned multimodal models, MID-M leverages the in-context learning capabilities of a general-domain LLM, which requires significantly fewer parameters than such models.
Robustness to Data Quality: The authors systematically evaluate MID-M's performance under different levels of data corruption, simulating real-world scenarios in which medical data can be incomplete or inconsistent (a toy corruption sketch follows the pipeline example below). MID-M demonstrates exceptional robustness, outperforming task-specific and heavily pre-trained models on low-quality data.
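To make the data flow concrete, below is a minimal, self-contained Python sketch of a text-only multimodal pipeline in the spirit of MID-M. The helper names (classify_image, build_prompt, query_llm), the few-shot examples, and the returned strings are hypothetical stand-ins for illustration, not the authors' actual components or data.

```python
"""Minimal sketch of a MID-M-style text-only multimodal pipeline.

All names and strings here are hypothetical placeholders, not the
paper's implementation.
"""

from typing import List


def classify_image(image_path: str) -> str:
    """Stand-in for the domain-specific classifier: maps an image to a
    short textual description of its findings."""
    # A real system would run a vision model here; a fixed string keeps
    # the sketch self-contained and runnable.
    return "frontal chest X-ray; low lung volumes; no focal consolidation"


def build_prompt(description: str, examples: List[dict]) -> str:
    """Assemble an in-context-learning prompt: a few worked examples
    followed by the new case, all as plain text."""
    parts = []
    for ex in examples:
        parts.append(f"Image findings: {ex['findings']}\nReport: {ex['report']}\n")
    parts.append(f"Image findings: {description}\nReport:")
    return "\n".join(parts)


def query_llm(prompt: str) -> str:
    """Placeholder for a call to a general-domain LLM; the returned text
    is fabricated purely to keep the example runnable."""
    return "Low lung volumes without focal consolidation. No acute findings."


if __name__ == "__main__":
    few_shot_examples = [
        {
            "findings": "frontal chest X-ray; cardiomegaly; mild pulmonary edema",
            "report": "Enlarged cardiac silhouette with mild interstitial edema.",
        },
    ]
    findings = classify_image("patient_001.png")       # image -> text
    prompt = build_prompt(findings, few_shot_examples)  # text-only prompt
    print(query_llm(prompt))                            # general-domain LLM
```

In a real deployment the stub functions would be replaced by the domain-specific classifier and an actual LLM call; the point of the sketch is that the LLM only ever sees text.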
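The robustness evaluation corrupts inputs at varying levels; the toy function below illustrates one plausible way to simulate such degradation by randomly dropping tokens from a textual finding. The corruption scheme and rates shown are assumptions for illustration, not the authors' exact protocol.

```python
import random


def corrupt_findings(findings: str, corruption_rate: float, seed: int = 0) -> str:
    """Toy corruption: randomly drop a fraction of tokens to mimic
    incomplete or low-quality inputs."""
    rng = random.Random(seed)
    kept = [t for t in findings.split() if rng.random() > corruption_rate]
    return " ".join(kept) if kept else "[no findings available]"


if __name__ == "__main__":
    clean = "frontal chest X-ray; low lung volumes; no focal consolidation"
    for rate in (0.0, 0.3, 0.6):
        print(f"corruption={rate:.1f}: {corrupt_findings(clean, rate)}")
```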
The experiments show that MID-M achieves comparable or superior performance to other state-of-the-art models, including those with substantially more parameters, while requiring significantly fewer resources for training and deployment. This highlights the potential of leveraging general-domain LLMs for specialized tasks, offering a sustainable and cost-effective alternative to traditional multimodal model development.
Key insights distilled from the paper by Seonhee Cho, ... at arxiv.org, 05-06-2024
https://arxiv.org/pdf/2405.01591.pdf