Core Concepts

This work initiates the theoretical analysis of interactive learning from hindsight instruction feedback, where an agent generates a response and receives the instruction that best describes that response, rather than expert supervision or rewards.

Abstract

The authors study an interactive learning setting where an agent interacts with the world over multiple rounds. In each round, the world presents the agent with an instruction and a context. The agent then generates a response and receives, from a teacher, a hindsight instruction that best describes the agent's response.
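The round structure above can be sketched as a simple interaction loop. This is a toy illustration only — the `World`, `Teacher`, and `Agent` classes, the integer instruction/response spaces, and the teacher's labeling rule are all hypothetical stand-ins, not the paper's actual setup:

```python
import random

class World:
    """Toy world: instructions and contexts drawn from small integer spaces."""
    def __init__(self, instructions, contexts, seed=0):
        self.instructions = instructions
        self.contexts = contexts
        self.rng = random.Random(seed)

    def sample(self):
        return self.rng.choice(self.instructions), self.rng.choice(self.contexts)

class Teacher:
    """Toy teacher: the hindsight instruction is a fixed function of
    (response, context). Purely illustrative labeling rule."""
    def describe(self, y, s):
        return (y + s) % 3

class Agent:
    """Toy agent that memorizes which response earned each hindsight instruction."""
    def __init__(self, responses, seed=0):
        self.responses = responses
        self.memory = {}  # (instruction, context) -> response
        self.rng = random.Random(seed)

    def respond(self, x, s):
        # Reuse a memorized response if one matches, else explore randomly.
        return self.memory.get((x, s), self.rng.choice(self.responses))

    def update(self, x_hat, y, s):
        # Hindsight relabeling: y is a correct response for instruction x_hat.
        self.memory[(x_hat, s)] = y

def interact(agent, teacher, world, num_rounds):
    successes = 0
    for _ in range(num_rounds):
        x, s = world.sample()            # world presents instruction and context
        y = agent.respond(x, s)          # agent generates a response
        x_hat = teacher.describe(y, s)   # teacher returns hindsight instruction
        agent.update(x_hat, y, s)        # agent learns from the relabeled pair
        successes += int(x_hat == x)     # response happened to match the request
    return successes
```

The key point the loop makes concrete: the agent never sees a reward or an expert response, only a relabeling of what it actually produced, yet memorizing those relabeled pairs is enough to start matching future instructions.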
The authors first prove a lower bound showing that in the general case, the regret of any algorithm must scale with the size of the agent's response space. To overcome this, they introduce a specialized setting where the underlying instruction-response distribution can be decomposed as a low-rank matrix. They propose an algorithm called LORIL for this setting and show that its regret scales with √T and depends on the intrinsic rank but does not depend on the size of the agent's response space.
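The low-rank assumption can be made concrete with a small numerical sketch: if the teacher's instruction probabilities come from an inner product of d-dimensional instruction features and response features, the resulting probability matrix has rank at most d, however large the instruction and response spaces are. The construction below is an assumption-laden illustration (context is suppressed, and the feature model is invented for the demo):

```python
import numpy as np

def low_rank_teacher(num_x, num_y, d, seed=0):
    """Build a toy teacher distribution P(x | y) whose probability matrix
    has rank at most d. Feature model is illustrative, not the paper's."""
    rng = np.random.default_rng(seed)
    phi = rng.random((num_x, d))   # instruction features
    psi = rng.random((num_y, d))   # response features
    scores = phi @ psi.T           # (num_x, num_y) nonnegative scores
    # Normalize each column so it is a distribution over instructions x.
    return scores / scores.sum(axis=0, keepdims=True)
```

Column-wise normalization is a diagonal scaling on the right, so it preserves the rank bound: a 50 x 40 matrix built with d = 3 still has rank at most 3, which is the kind of intrinsic dimension the regret bound depends on instead of |Y|.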
The authors provide experiments in two domains: a synthetic task that satisfies the low-rank assumption, and a grounded image selection task with natural language instructions where the assumption is violated. LORIL outperforms baselines even when the low-rank assumption does not hold, demonstrating the broader applicability of its insights.

Stats

The agent's response space Y can be exponentially large.
The intrinsic dimension d of the low-rank teacher model is much smaller than the size of the instruction space X, response space Y, and context space S.

Quotes

"We initiate the theoretical understanding of interactive learning from hindsight instruction."
"We first prove a lower bound showing that in the general case, the regret of any algorithm must scale polynomially with the size of the agent's response space."
"We introduce an algorithm called LORIL for this [low-rank] setting and show that its regret scales with √T and depends on the intrinsic rank but does not depend on the size of the agent's response space."

Key Insights Distilled From

by Dipendra Mis... at **arxiv.org** 04-16-2024

Deeper Inquiries

In settings where the low-rank assumption is violated but the teacher model still exhibits some underlying structure, LORIL can be extended by replacing the low-rank decomposition with a more flexible function class for modeling the teacher distribution, such as a deep neural network or another expressive model fit to the observed hindsight instructions. With a more expressive function class, the algorithm can adapt to a wider range of teacher models without the strict low-rank assumption.

Potential Limitations:
Limited Expressiveness: Hindsight instruction feedback may not capture the full complexity of the task compared to expert demonstrations or rewards. The instructions provided by the teacher may not cover all nuances or variations in the desired behavior.
Delayed Learning: Relying solely on hindsight instruction feedback can lead to slower learning compared to expert demonstrations or rewards. The agent needs to explore and receive feedback iteratively, which can prolong the learning process.
Risk of Error Propagation: If the teacher model provides incorrect hindsight instructions, there is a risk of propagating errors throughout the learning process, potentially leading to suboptimal policies.
Addressing Limitations:
Hybrid Approaches: Combining hindsight instruction feedback with expert demonstrations or rewards can mitigate the limitations. By incorporating multiple sources of feedback, the agent can benefit from the strengths of each approach.
Adaptive Exploration: Implementing adaptive exploration strategies within the algorithm can help balance exploration and exploitation, leading to more efficient learning from hindsight instruction feedback.
Regularization Techniques: Introducing regularization techniques to the learning process can help prevent overfitting to the provided hindsight instructions and promote generalization to unseen scenarios.

The insights from this work on interactive learning with hindsight instruction feedback can be applied to various areas of machine learning to enhance sample efficiency and performance:
Language Modeling: In language modeling tasks, incorporating hindsight instruction feedback can improve the training of models to generate text or responses based on given instructions. This can lead to more accurate and contextually relevant language generation systems.
Robotics: In robotics applications, leveraging hindsight instruction feedback can assist in training robots to perform tasks based on natural language commands. By learning from feedback on generated actions, robots can improve their understanding and execution of instructions.
Reinforcement Learning: The principles of interactive learning with hindsight instruction feedback can be integrated into reinforcement learning algorithms to enhance exploration strategies and policy optimization. By incorporating feedback on generated actions, RL agents can learn more efficiently in complex environments.
Personalized Recommendations: In recommendation systems, utilizing hindsight instruction feedback can enhance the personalization of recommendations based on user preferences and feedback. By learning from user interactions and feedback, recommendation systems can provide more tailored and relevant suggestions.
