Key Idea
The authors present Generative Observation Monte Carlo Tree Search (GO-MCTS), a method that performs MCTS search in the observation space of imperfect information games, using a transformer-based generative model to advance the search.
Summary
The paper introduces Generative Observation Monte Carlo Tree Search (GO-MCTS), a novel approach for planning in games of imperfect information. The key ideas are:
- Performing MCTS search in the observation space rather than the underlying state space, which avoids the need to know the true state.
- Using a transformer-based generative model to predict the next observation given the current observation history, allowing the search to advance without access to the true state (a minimal sketch of this sampling step follows the list).
- Demonstrating the effectiveness of this approach on several popular trick-taking card games: Hearts, Skat, and The Crew: The Quest for Planet Nine.
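To make the second bullet concrete, the sketch below shows how a next observation could be sampled token by token from a generative model over observation histories. This is an illustration rather than the authors' code: the `model` callable (token sequence in, next-token logits out), the `end_token` convention, and the token encoding are all assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def sample_next_observation(model, obs_history, end_token, rng, max_tokens=32):
    """Sample the tokens of the next observation, one token at a time.

    `model` is a hypothetical callable mapping a token sequence to next-token
    logits, `obs_history` is the flat token sequence of all observations seen
    so far, and `end_token` is assumed to mark the end of one observation.
    `rng` is a NumPy Generator (e.g. np.random.default_rng()).
    """
    tokens = list(obs_history)
    new_obs = []
    for _ in range(max_tokens):
        probs = softmax(model(tokens))
        tok = int(rng.choice(len(probs), p=probs))
        tokens.append(tok)
        new_obs.append(tok)
        if tok == end_token:
            break
    return new_obs
```

In this view, the generative model plays the role that a known transition function would play in a perfect information search.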
The authors first provide background on the challenges of applying traditional search algorithms to imperfect information games, particularly the large size of information sets and the difficulty of sampling relevant underlying states. They then introduce the GO-MCTS algorithm, which builds its search tree over observation histories and uses the transformer-based generative model to sample the next observation at each step of the search.
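The following sketch illustrates what a single GO-MCTS-style simulation might look like when the tree is built over observation histories instead of game states. The `Node` structure, the UCT constant, and the `env_model` helpers (`legal_actions`, `generate_observation`, `terminal_value`) are hypothetical stand-ins, not the paper's actual interfaces.

```python
import math

class Node:
    """A search node; children are indexed by the action taken from it."""
    def __init__(self):
        self.visits = 0
        self.value = 0.0      # running mean of simulation returns
        self.children = {}    # action -> Node

def uct_select(node, actions, c=1.4):
    """Pick the action maximizing the UCT score; unvisited actions first."""
    def score(a):
        child = node.children.get(a)
        if child is None or child.visits == 0:
            return float("inf")
        return child.value + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(actions, key=score)

def simulate(node, obs_history, env_model, depth=0, max_depth=52):
    """Run one simulation through observation space, expanding as needed.

    `env_model` bundles the assumed helpers:
      legal_actions(obs_history)                -> list of actions
      generate_observation(obs_history, action) -> sampled next observation
      terminal_value(obs_history)               -> payoff, or None if non-terminal
    """
    value = env_model.terminal_value(obs_history)
    if value is not None or depth >= max_depth:
        return 0.0 if value is None else value

    action = uct_select(node, env_model.legal_actions(obs_history))
    child = node.children.setdefault(action, Node())

    # The key step: instead of applying a known state transition, advance the
    # search by sampling the next observation from the generative model.
    next_obs = env_model.generate_observation(obs_history, action)
    value = simulate(child, obs_history + (next_obs,), env_model, depth + 1, max_depth)

    child.visits += 1
    child.value += (value - child.value) / child.visits
    node.visits += 1
    return value
```

Because every transition is a sampled observation rather than a known state, repeated simulations average over the model's uncertainty about hidden information.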
The authors also describe their approach to training the transformer model, using an iterative self-play process with a population-based method. This allows the model to be trained from scratch without access to expert data.
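A schematic version of such an iterative, population-based self-play loop is sketched below. The callables `train`, `play_games`, and `evaluate`, as well as the population size, game counts, and warm-starting rule, are illustrative assumptions rather than the authors' exact procedure.

```python
def iterative_self_play(init_model, train, play_games, evaluate,
                        generations=10, population_size=4, games_per_gen=1000):
    """Schematic population-based self-play loop (not the paper's exact recipe).

    Assumed callables:
      train(model, data)    -> new model fitted to observation sequences
      play_games(models, n) -> observation logs from n self-play games
      evaluate(model, pool) -> scalar strength estimate against the pool
    """
    population = [init_model]
    for _ in range(generations):
        # Generate fresh data by playing population members against each other.
        data = play_games(population, games_per_gen)

        # Fit a new generative observation model on the self-play data,
        # warm-starting from the current best member (an assumption).
        candidate = train(population[0], data)
        population.append(candidate)

        # Rank members against the current pool and keep only the strongest.
        strength = {id(m): evaluate(m, population) for m in population}
        population.sort(key=lambda m: strength[id(m)], reverse=True)
        population = population[:population_size]
    return population[0]
```

Starting the loop from an untrained model is what lets the approach avoid any dependence on expert data.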
The experimental results show that GO-MCTS outperforms strong baseline players in Hearts and The Crew, establishing new state-of-the-art results in those games. In Skat, GO-MCTS improves upon a weaker baseline but does not surpass the strong Kermit player. The authors also discuss the computational trade-off: the GO-MCTS player takes significantly longer per move than the baseline players.
Overall, the paper presents a novel and effective approach for planning in imperfect information games, with promising results that demonstrate the potential of transformer-based generative models in this domain.
Statistics
The authors report the following key metrics:
- In Hearts, the GO-MCTS player outperformed the xinxin baseline by 1.74 points on average, which extrapolates to a 31.0 point advantage in a game to 100 points.
- In Skat, the GO-MCTS player scored 9.84 points worse than the Kermit baseline, but this was a 6.47 point improvement over the ArgMaxVal* player.
- In The Crew, the GO-MCTS player achieved a significantly higher success rate than the ArgMaxVal* player across all 50 missions.
- The computational cost of GO-MCTS was much higher than the baselines': 25.6 seconds per turn in Hearts, 42 seconds per turn in Skat, and 5.9 seconds per turn in The Crew.
Quotes
"GO-MCTS works by using an approximation of the observation dynamics model to perform MCTS in this generated observation space."
"Transformers can be trained using only the raw observations as inputs, allowing the approach to be easily applied to new domains."
"We show that in all of these games, GO-MCTS is able to directly improve upon the final trained policy from the iterative learning process, and provides new state of the art results in Hearts and the Crew."