Core Concepts
InternLM-XComposer2-4KHD is a pioneering large vision-language model that can process images with resolutions up to 4K HD, significantly expanding the capabilities of previous models in handling fine-grained visual content.
Abstract
The paper introduces InternLM-XComposer2-4KHD, a groundbreaking large vision-language model (LVLM) that can process images with resolutions up to 4K HD. This represents a significant advancement over previous LVLMs, which were typically limited to resolutions around 1500 × 1500 pixels.
Key highlights:
InternLM-XComposer2-4KHD supports a wide range of resolutions, from 336 pixels to 4K HD, making it applicable across a variety of real-world scenarios.
The model employs a dynamic image partitioning approach, which maintains the original aspect ratios of images while adaptively adjusting the patch layouts and counts. This allows the model to effectively handle high-resolution inputs.
To address the variability in patch configurations, the model introduces a newline token to clearly delineate the patch layouts, reducing training ambiguity and boosting performance.
Scaling the training resolution up to 4K HD leads to consistent performance improvements, suggesting the potential for further enhancing the model's capabilities by training on even higher resolutions.
Evaluation on 16 diverse benchmarks, including 5 challenging HD-OCR datasets, shows that InternLM-XComposer2-4KHD matches or even surpasses state-of-the-art closed-source APIs such as GPT-4V and Gemini Pro on 10 of the 16 benchmarks, despite having only 7B parameters.
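The dynamic partitioning and newline-token highlights above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes a square patch of 336 pixels (taken from the minimum supported resolution), a caller-supplied patch budget, and a newline marker appended after each row of tokens. The names `choose_layout`, `flatten_with_newlines`, `PATCH`, and `<nl>` are hypothetical.

```python
PATCH = 336  # assumed patch side length, inferred from the minimum resolution


def choose_layout(width, height, max_patches):
    """Pick a (cols, rows) patch grid that follows the image aspect ratio.

    Hypothetical helper: finds the largest column count whose matching
    row count keeps cols * rows within the patch budget.
    """
    if width <= 0 or height <= 0 or max_patches < 1:
        raise ValueError("width, height and max_patches must be positive")
    # Fallback for very tall images: one column, rows clamped to the budget.
    best = (1, min(max_patches, -(-height // width)))
    for cols in range(1, max_patches + 1):
        # Integer ceiling of cols * height / width keeps the aspect ratio.
        rows = (cols * height + width - 1) // width
        if cols * rows > max_patches:
            break  # the product only grows with cols, so stop here
        best = (cols, rows)
    return best


def flatten_with_newlines(grid, newline="<nl>"):
    """Flatten a 2D grid of tokens row by row, ending each row with a marker.

    The marker lets a 1D sequence still encode where each row ends, so
    different patch layouts stay distinguishable after flattening.
    """
    sequence = []
    for row in grid:
        sequence.extend(row)
        sequence.append(newline)
    return sequence


if __name__ == "__main__":
    cols, rows = choose_layout(3840, 1600, max_patches=55)  # illustrative budget
    print(cols, rows, (cols * PATCH, rows * PATCH))
    print(flatten_with_newlines([["a", "b"], ["c", "d"]]))
```

With the illustrative budget of 55, a 3840 × 1600 input maps to an 11 × 5 grid, so the image would be resized or padded to 3696 × 1680 pixels before being cut into patches. The sketch does not stop small images from being upscaled to fill the budget; a real pipeline would need its own policy for that.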
Stats
The model can handle images with resolutions up to 4K HD (3840 × 1600 pixels).
The model supports a wide range of resolutions, from 336 pixels to 4K HD.
Scaling the training resolution up to 4K HD leads to consistent performance improvements.
Quotes
"InternLM-XComposer2-4KHD shows superb capability that matches or even surpasses GPT-4V and Gemini Pro in 10 of the 16 benchmarks."
"Scaling the training resolution up to 4K standard results in a consistent improvement in performance, highlighting the potential for training even beyond 4K resolution."