NPHardEval4V: Evaluating Multimodal Large Language Models' Reasoning Abilities
The authors introduce NPHardEval4V, a dynamic benchmark for assessing the reasoning abilities of Multimodal Large Language Models (MLLMs) by disentangling the effects of recognition and instruction-following from reasoning itself. The study reveals notable discrepancies in reasoning performance across models and underscores the need for further work on strengthening MLLMs' reasoning capabilities.
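To illustrate what disentangling these abilities could look like in an evaluation harness, here is a minimal Python sketch. The field names and the scoring scheme (conditioning the reasoning rate on recognition and format success) are illustrative assumptions, not the paper's actual metrics or code.

```python
from dataclasses import dataclass

@dataclass
class SampleResult:
    """Hypothetical per-sample outcomes for one MLLM response to a visual task instance."""
    recognized_ok: bool    # did the model correctly read the problem from the image?
    followed_format: bool  # did the output follow the required answer format?
    answer_correct: bool   # is the final answer to the task correct?

def aggregate(results: list[SampleResult]) -> dict[str, float]:
    """Report recognition, instruction-following, and reasoning as separate rates.

    The reasoning rate is computed only over samples where recognition and
    formatting succeeded, so perception or formatting failures are not
    conflated with reasoning failures (an assumed scoring choice).
    """
    n = len(results)
    recognition = sum(r.recognized_ok for r in results) / n
    following = sum(r.followed_format for r in results) / n
    eligible = [r for r in results if r.recognized_ok and r.followed_format]
    reasoning = (sum(r.answer_correct for r in eligible) / len(eligible)) if eligible else 0.0
    return {"recognition": recognition, "instruction_following": following, "reasoning": reasoning}

if __name__ == "__main__":
    demo = [
        SampleResult(True, True, True),
        SampleResult(True, True, False),
        SampleResult(False, True, False),  # perception failure, excluded from the reasoning rate
    ]
    print(aggregate(demo))
```

Reporting the three rates separately, rather than a single accuracy, is what lets a benchmark attribute low scores to weak reasoning rather than to visual recognition or instruction-following errors.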