Introduces the Chain-of-Action framework, which enhances question answering by addressing unfaithful hallucination and weak reasoning in complex tasks.
Current state-of-the-art large foundation models show varying strengths and weaknesses in multimodal reasoning, with no single model outperforming the others across all tasks. Detailed evaluation reveals room for improvement in geometric reasoning, in benefiting from multimodal input, and in grounding information retrieval.