MLLMs enhance visual-language representation learning by extending diverse captions for each image and addressing issues of hallucinations and monotonous language styles.
MLLMs can enhance visual-language representation learning by extending diverse captions for each image, improving performance in various tasks.