Core Concepts
Combining supervised learning and reinforcement learning techniques to train agents that can efficiently navigate and complete tasks on web interfaces using large language models.
Abstract
This paper proposes a novel approach that combines supervised learning (SL) and reinforcement learning (RL) techniques to train agents for web navigation tasks using large language models (LLMs). The key contributions are:
- Insights into the capabilities of agents trained on user interactions to navigate diverse web interfaces.
- Introduction of a more robust evaluation that exposes the shortcomings of previous models due to their tendencies to memorize rather than comprehend the underlying HTML structure.
- Provision of directions to correct the memorization issue and establish more accurate, grounded results on the MiniWoB++ benchmark.
- Comprehensive analysis of current methods' limitations and exploration of more performant architectures, including the combination of fine-tuned T5 models with a multimodal CC-Net-inspired RL approach.
The authors find that fine-tuned T5 models outperform previous SL methods, achieving 43.58% average accuracy. However, the combined SL and RL approach, named CC-NeT5, suffers a performance drop during the RL phase due to covariate shift relative to the T5 model's training distribution. The paper highlights the need for further exploration of intermediate-sized models and multimodal architectures, and for addressing ethical considerations in web navigation automation.
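The two-phase recipe above (supervised fine-tuning on demonstrations, then RL fine-tuning on task reward) can be sketched in miniature. This is a toy illustration only, not the paper's implementation: a softmax policy over three actions stands in for the T5 model, and a bandit-style reward stands in for the MiniWoB++ task reward.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)  # toy policy parameters (stand-in for T5 weights)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Phase 1: supervised learning on demonstrations.
# Hypothetical expert data: the demonstrator always picks action 2.
for a in [2] * 200:
    p = softmax(logits)
    grad = -p
    grad[a] += 1.0          # gradient of log p(a) w.r.t. logits
    logits += 0.1 * grad    # cross-entropy gradient step

# Phase 2: REINFORCE fine-tuning on task reward.
# Reward 1.0 when the sampled action matches the task's target action.
for _ in range(200):
    p = softmax(logits)
    a = rng.choice(3, p=p)
    r = 1.0 if a == 2 else 0.0
    grad = -p
    grad[a] += 1.0
    logits += 0.05 * r * grad  # policy-gradient step

print(softmax(logits)[2])  # probability assigned to the rewarded action
```

If the RL phase sees states (here, rewards) distributed differently from the demonstrations, the policy can drift from its SL optimum, which is the covariate-shift failure mode the paper reports for CC-NeT5.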
Stats
The distribution of target elements in the recorded actions is concentrated in a few locations, which can be linked to salient elements in the DOM as displayed on the page.
Randomizing the reference numbers in all episodes forces the models to base predictions on element features rather than memorizing the distribution of salient elements.
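The randomization step can be sketched as follows. This is a minimal, hypothetical helper (the `ref="N"` attribute format mimics MiniWoB++-style DOM observations; the paper's exact preprocessing may differ): it permutes the reference numbers in an HTML observation so a model cannot memorize which ref values correspond to salient elements.

```python
import random
import re

def randomize_refs(html, seed=None):
    """Shuffle ref="N" attributes in a DOM-string observation.

    Replaces each distinct reference number with another number drawn
    from a random permutation of the same set, so element features
    (tag, text) remain the only stable prediction signal.
    """
    rng = random.Random(seed)
    unique = sorted(set(re.findall(r'ref="(\d+)"', html)), key=int)
    shuffled = unique[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(unique, shuffled))
    return re.sub(r'ref="(\d+)"',
                  lambda m: f'ref="{mapping[m.group(1)]}"', html)

obs = '<div ref="1"><button ref="2">OK</button><button ref="3">Cancel</button></div>'
print(randomize_refs(obs, seed=0))
```

Applied per episode with a fresh seed, this keeps the set of reference numbers intact while breaking any fixed mapping between a ref value and a page location.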
Removing action history has a larger impact on the T5-base model (2% drop) compared to the T5-large hierarchical model (0.2% drop), indicating the importance of hierarchical planning.
Quotes
"Our findings upon reproducing Gur et al.'s work [12] reveal memorization tendencies rather than genuine task understanding."
"Randomizing references resulted in performance drops, questioning the original claims but reaffirming the importance of data randomization in model training."