Core Concepts
Agents trained with end-to-end deep reinforcement learning can master the challenging task of multi-agent robot soccer using only onboard egocentric RGB vision, without relying on external state estimation or depth sensing.
Abstract
This paper presents a method for training vision-based reinforcement learning (RL) agents to play one-vs-one robot soccer. The agents are trained entirely in simulation using MuJoCo physics and realistic rendering via Neural Radiance Fields (NeRFs), and then deployed zero-shot on physical Robotis OP3 humanoid robots.
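As a rough, non-authoritative illustration of the simulation side of this setup (the paper additionally relies on NeRF rendering for visual realism, which is not shown here), the sketch below uses the open-source mujoco Python bindings to step a physics model and render an egocentric RGB frame from a robot-mounted camera at each control step. The scene file name, camera name, image resolution, and substep count are assumptions for illustration only.

```python
# Minimal sketch: stepping a MuJoCo model and rendering an egocentric RGB
# observation each control step. The scene file, camera name ("head_camera"),
# 40x30 resolution, and substep count are illustrative assumptions.
import mujoco
import numpy as np

model = mujoco.MjModel.from_xml_path("op3_soccer_scene.xml")  # hypothetical scene file
data = mujoco.MjData(model)
renderer = mujoco.Renderer(model, height=30, width=40)

def control_step(policy_action: np.ndarray, n_substeps: int = 10) -> np.ndarray:
    """Apply joint targets, advance the physics, and return the onboard RGB frame."""
    data.ctrl[:] = policy_action                 # joint-level action from the policy
    for _ in range(n_substeps):                  # physics runs faster than control
        mujoco.mj_step(model, data)
    renderer.update_scene(data, camera="head_camera")
    return renderer.render()                     # (30, 40, 3) uint8 RGB observation
```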
The key highlights are:
The agents are trained end-to-end, mapping raw pixel observations from the onboard RGB camera directly to joint-level actions, without any simplifying assumptions or domain-specific architectural components.
The agents display strong performance and agility, comparable to state-based agents that have access to ground-truth information about the opponent, ball, and goal. This is achieved through memory-augmented policies and careful simulation-to-real transfer techniques; a minimal sketch of such a memory-augmented policy is given after these highlights.
The training pipeline enables the emergence of complex, long-horizon behaviors such as ball tracking, opponent awareness, and accurate shooting, without any explicit rewards for these skills. The agents learn to actively control their head camera to track the ball, even when it is occluded or out of view.
Quantitative analysis shows the vision-based agents achieve walking speed, turning speed, and kicking power comparable to those of state-based agents. In simulation their scoring ability is on par, but in the real world the vision-based agents suffer more from the reality gap.
The paper also investigates the benefits of training end-to-end from vision compared to distilling knowledge from state-based experts, finding that end-to-end training from vision leads to better performance.
Overall, this work demonstrates the potential of end-to-end deep RL for mastering challenging robotic tasks like multi-agent soccer using only onboard sensors, without relying on external state estimation or privileged information.
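To make the memory-augmented, pixels-to-joints idea from the highlights concrete, here is a minimal sketch of one plausible policy structure, assuming a small convolutional encoder feeding an LSTM that outputs joint-position targets. The layer sizes, 30x40 input resolution, and 20-dimensional action vector are illustrative assumptions rather than the architecture reported in the paper, and the real agents also receive proprioceptive inputs that are omitted here.

```python
# Minimal sketch of a memory-augmented pixels-to-joints policy (illustrative,
# not the paper's reported architecture): CNN encoder -> LSTM -> joint targets.
import torch
import torch.nn as nn

class VisionPolicy(nn.Module):
    def __init__(self, num_joints: int = 20, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(              # encodes a 3x30x40 RGB frame
            nn.Conv2d(3, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():                      # infer the flattened feature size
            feat = self.encoder(torch.zeros(1, 3, 30, 40)).shape[-1]
        self.lstm = nn.LSTM(feat, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_joints)  # joint-position targets

    def forward(self, frames, state=None):
        # frames: (batch, time, 3, 30, 40) sequence of egocentric RGB observations
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, state = self.lstm(feats, state)       # recurrent state carries memory of
        return torch.tanh(self.head(out)), state   # the ball/opponent when out of view
```

The recurrent state is what would let such a policy keep acting sensibly when the ball or opponent leaves the camera's field of view, which connects to the emergent ball-tracking and active head-control behaviors described above.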
Stats
The agents can walk at 0.52 ± 0.02 m/s and kick the ball at 1.95 ± 0.31 m/s.
In simulation, the vision-based agents have a scoring accuracy of 0.86 ± 0.04, compared to 0.82 ± 0.05 for state-based agents.
In the real world, the vision-based agents have a scoring accuracy of 0.4 ± 0.11, compared to 0.58 ± 0.07 for state-based agents.
Quotes
"To our knowledge, this paper constitutes a first demonstration of end-to-end training for multi-agent robot soccer, mapping raw pixel observations to joint-level actions, that can be deployed in the real world."
"Importantly, our approach does not involve any changes to the task or reward structure, makes no simplifying assumptions for state-estimation, and does not use any domain-specific architectural components."