Key Idea
Virtual agents can automate computer tasks, but current models struggle with visual understanding.
Abstract
OmniACT introduces a dataset and benchmark for assessing an agent's ability to generate executable programs from natural language task descriptions, covering a diverse set of desktop applications and web tasks. Because language model agents struggle with the visual cues carried by UI elements, the DetACT module converts UI screenshots into a structured code representation that downstream models can consume. GPT-4 outperforms the other baselines on the benchmark but still falls far short of the human baseline, as human evaluators complete the tasks with high proficiency. The authors point to stronger multimodal models as a direction for future work to close this gap.
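To make the screen-to-text step concrete, here is a minimal, hypothetical sketch of the kind of transformation a module like DetACT performs: UI elements detected on a screenshot (via OCR, icon detection, and similar signals) are serialized into plain text that a language model can condition on. The class and function names below are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UIElement:
    """One detected on-screen element: a text label or icon plus its location."""
    label: str                        # e.g. OCR text or an icon class name (assumed)
    bbox: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels

    def center(self) -> Tuple[int, int]:
        x0, y0, x1, y1 = self.bbox
        return ((x0 + x1) // 2, (y0 + y1) // 2)


def elements_to_prompt(elements: List[UIElement]) -> str:
    """Serialize detected UI elements into text a language model can read."""
    return "\n".join(
        f"{el.label}: {el.center()}" for el in elements
    )


# Dummy detections standing in for real OCR / icon-detector output.
detections = [
    UIElement("File", (10, 5, 48, 25)),
    UIElement("search_icon", (900, 8, 930, 30)),
    UIElement("Submit", (400, 560, 470, 590)),
]

print(elements_to_prompt(detections))
# File: (29, 15)
# search_icon: (915, 19)
# Submit: (435, 575)
```

A downstream model can then ground actions such as "click Submit" against the listed coordinates instead of reasoning over raw pixels.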
Statistics
GPT-4 achieves an action score of 11.60 on the benchmark.
Fine-tuning LLaMA-13B improves its sequence score from 4.80 to 8.92.
Fine-tuning Vicuna-13B improves its action score from 1.62 to 2.14.