Enhancing Zero-Shot Grounded Situation Recognition through Language Explainers
Using large language models as explainers can substantially improve zero-shot grounded situation recognition by strengthening the model's understanding of verbs, semantic roles, and nouns in complex visual scenes.