Evaluating Robustness of Large Language Models for PowerPoint Task Completion
The paper proposes PPTC-R, a benchmark for assessing how robustly large language models (LLMs) complete PowerPoint tasks, and reports that GPT-4 shows the strongest performance and robustness among the models evaluated. The study aims to provide insights for developing more robust LLMs.