LightZero
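Last updated on 2023.12.07 LightZero-v0.0.3
LightZero is a lightweight, efficient, and easy-to-understand open-source algorithm library that combines MCTS and RL.
Background
Methods that combine Monte Carlo Tree Search (MCTS) with Deep Reinforcement Learning (DRL), represented by AlphaZero and MuZero, have reached superhuman performance in games such as Go and Atari, and have made encouraging progress in scientific domains such as protein structure prediction and the search for matrix multiplication algorithms. The figure below shows the development history of the MCTS algorithm family: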
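Overview
LightZero is an open-source algorithm toolkit that combines Monte Carlo Tree Search with reinforcement learning. It supports a range of MCTS-based RL algorithms and offers the following advantages:
- Lightweight.
- Efficient.
- Easy to understand.
LightZero aims to standardize the MCTS algorithm family in order to accelerate related research and applications. Benchmark presents a performance comparison of all currently implemented algorithms.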
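Features
Lightweight: LightZero integrates multiple MCTS algorithms and can solve decision-making problems with a variety of properties within a single lightweight framework.
Efficient: LightZero uses mixed heterogeneous computing programming to speed up the most time-consuming parts of MCTS algorithms.
Easy to understand: LightZero provides detailed documentation and algorithm framework diagrams for every integrated algorithm, helping users understand the core of each algorithm and compare algorithms under a common paradigm. LightZero also provides function call graphs and network structure diagrams for the code implementations, making it easier to locate key code.
Framework Structure
The figure above is the framework pipeline of LightZero. Its three core modules are briefly introduced below: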
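Model: Model defines the network structure. It includes the __init__ function, which initializes the network structure, and the forward function, which computes the network's forward pass.
Policy: Policy defines how the network is updated and how it interacts with the environment. It covers three processes: training (learn), data collection (collect), and evaluation (evaluate).
MCTS: MCTS defines the structure of the Monte Carlo search tree and how it interacts with Policy. MCTS is implemented in both Python and C++, in ptree and ctree respectively.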
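The division of labor between the three modules can be sketched in plain Python. This is a minimal illustration only, not LightZero's actual classes: the class bodies, the step_fn transition callback, the c_puct constant, and the depth-one search are all assumptions made to keep the example short. A real Model would be a PyTorch network and a real MCTS would expand a full tree.

```python
import math


class Model:
    """Hypothetical stand-in for a network (a real one would use PyTorch)."""

    def __init__(self, num_actions):
        # __init__ builds the network; this toy version has no weights.
        self.num_actions = num_actions

    def forward(self, observation):
        # forward maps an observation to (policy prior, value estimate).
        # Placeholder: uniform prior, value equal to the observation.
        prior = [1.0 / self.num_actions] * self.num_actions
        return prior, float(observation)


class Node:
    """Statistics that a PUCT-style search keeps for one action."""

    def __init__(self, prior):
        self.prior = prior
        self.visit_count = 0
        self.value_sum = 0.0

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0


class MCTS:
    """Depth-one search over root actions (a full search would recurse)."""

    def __init__(self, num_simulations=30, c_puct=1.25):
        self.num_simulations = num_simulations
        self.c_puct = c_puct  # assumed exploration constant

    def _score(self, child, total_visits):
        # Mean value plus an exploration bonus weighted by the prior.
        bonus = self.c_puct * child.prior * math.sqrt(total_visits + 1)
        return child.value() + bonus / (1 + child.visit_count)

    def search(self, model, observation, step_fn):
        prior, _ = model.forward(observation)
        children = [Node(p) for p in prior]
        for _ in range(self.num_simulations):
            total = sum(c.visit_count for c in children)
            action = max(range(len(children)),
                         key=lambda a: self._score(children[a], total))
            # step_fn is a hypothetical transition: (obs, action) -> next obs.
            _, value = model.forward(step_fn(observation, action))
            children[action].visit_count += 1
            children[action].value_sum += value
        return [c.visit_count for c in children]


class Policy:
    """Wires Model and MCTS together and exposes the three modes."""

    def __init__(self, model, mcts):
        self.model = model
        self.mcts = mcts

    def collect(self, observation, step_fn):
        # Collect mode: return the normalized visit distribution.
        counts = self.mcts.search(self.model, observation, step_fn)
        total = sum(counts)
        return [c / total for c in counts]

    def evaluate(self, observation, step_fn):
        # Evaluate mode: act greedily on visit counts.
        counts = self.mcts.search(self.model, observation, step_fn)
        return max(range(len(counts)), key=counts.__getitem__)

    def learn(self, batch):
        # Learn mode: a real policy would compute losses and update the model.
        raise NotImplementedError("training is out of scope for this sketch")
```

For example, with a toy transition where the next observation equals the chosen action index (`step_fn = lambda obs, a: a`), `Policy(Model(3), MCTS()).evaluate(0, step_fn)` concentrates its visits on the highest-valued action.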
For the file structure of LightZero, please refer to lightzero_file_structure.
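Integrated Algorithms
LightZero is an MCTS algorithm library implemented in PyTorch; its MCTS implementation also uses Cython and C++. The LightZero framework is built mainly on DI-engine.
The algorithms currently integrated in LightZero, and the environments each one supports, are shown in the table below: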
Env./Algo. | AlphaZero | MuZero | EfficientZero | Sampled EfficientZero | Gumbel MuZero | Stochastic MuZero |
---|---|---|---|---|---|---|
TicTacToe | ✔ | ✔ | 🔒 | 🔒 | ✔ | 🔒 |
Gomoku | ✔ | ✔ | 🔒 | 🔒 | ✔ | 🔒 |
Connect4 | ✔ | ✔ | 🔒 | 🔒 | 🔒 | 🔒 |
2048 | ✔ | ✔ | 🔒 | 🔒 | 🔒 | ✔ |
Chess | 🔒 | 🔒 | 🔒 | 🔒 | 🔒 | 🔒 |
Go | 🔒 | 🔒 | 🔒 | 🔒 | 🔒 | 🔒 |
CartPole | --- | ✔ | ✔ | ✔ | ✔ | ✔ |
Pendulum | --- | ✔ | ✔ | ✔ | ✔ | ✔ |
LunarLander | --- | ✔ | ✔ | ✔ | ✔ | ✔ |
BipedalWalker | --- | ✔ | ✔ | ✔ | ✔ | 🔒 |
Atari | --- | ✔ | ✔ | ✔ | ✔ | ✔ |
MuJoCo | --- | ✔ | ✔ | ✔ | 🔒 | 🔒 |
MiniGrid | --- | ✔ | ✔ | ✔ | 🔒 | 🔒 |
Bsuite | --- | ✔ | ✔ | ✔ | 🔒 | 🔒 |
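(1): "✔" means that the corresponding item is finished and well tested.
(2): "🔒" means that the corresponding item is on the waiting list (work in progress).
(3): "---" means that the algorithm does not support this environment.
Installation
You can install the latest version of LightZero from the GitHub source code with the following commands: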
git clone https://github.com/opendilab/LightZero.git
cd LightZero
pip3 install -e .
Please note that LightZero currently supports compilation only on Linux and macOS. We are actively working to extend this support to Windows.
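The platform restriction above can be checked before attempting an installation. The snippet below is a small sketch under stated assumptions: the helper name can_build and the SUPPORTED_SYSTEMS set are hypothetical and are not part of LightZero; only platform.system() from the Python standard library is relied on.

```python
import platform

# platform.system() reports macOS as "Darwin".
SUPPORTED_SYSTEMS = {"Linux", "Darwin"}


def can_build(system=None):
    """Return True if the given (or current) OS is one the README lists."""
    if system is None:
        system = platform.system()
    return system in SUPPORTED_SYSTEMS


if not can_build():
    print("LightZero compilation is currently supported only on Linux and macOS.")
```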
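Installation with Docker
We also provide a Dockerfile that sets up an environment with all the dependencies needed to run the LightZero library. The Docker image is based on Ubuntu 20.04 and installs Python 3.8 along with other necessary tools and libraries. The steps below show how to use the Dockerfile to build a Docker image, run a container from that image, and execute LightZero code inside the container.
Download the Dockerfile: The Dockerfile is located in the root directory of the LightZero repository. Download this file to your local machine.
Prepare the build context: Create a new empty directory on your local machine, move the Dockerfile into it, and navigate to that directory. This step avoids sending unnecessary files to the Docker daemon during the build.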
mkdir lightzero-docker
mv Dockerfile lightzero-docker/
cd lightzero-docker/
Build the Docker image: Use the following command to build the Docker image. Run it from the directory that contains the Dockerfile.
docker build -t ubuntu-py38-lz:latest -f ./Dockerfile .
Run a container from the image: Use the following command to start a container in interactive mode with a Bash shell.
docker run -it --rm ubuntu-py38-lz:latest /bin/bash
Execute LightZero code inside the container: Once you are inside the container, you can run the example Python script with the following command:
python ./LightZero/zoo/classic_control/cartpole/config/cartpole_muzero_config.py
Quick Start
Use the following commands to quickly train a MuZero agent on the CartPole environment:
cd LightZero
python3 -u zoo/classic_control/cartpole/config/cartpole_muzero_config.py
Use the following commands to quickly train a MuZero agent on the Pong environment:
cd LightZero
python3 -u zoo/atari/config/atari_muzero_config.py
Use the following commands to quickly train a MuZero agent on the TicTacToe environment:
cd LightZero
python3 -u zoo/board_games/tictactoe/config/tictactoe_muzero_bot_mode_config.py
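The three commands above differ only in the config script they run, so they can be wrapped in a small launcher. This is a convenience sketch rather than anything shipped with LightZero: the CONFIGS dictionary, build_command, and launch are hypothetical names, and the default repo_dir assumes the repository was cloned into a directory called LightZero. The script paths are the ones listed above.

```python
import subprocess

# Config scripts taken from the quick start commands above.
CONFIGS = {
    "cartpole": "zoo/classic_control/cartpole/config/cartpole_muzero_config.py",
    "pong": "zoo/atari/config/atari_muzero_config.py",
    "tictactoe": "zoo/board_games/tictactoe/config/tictactoe_muzero_bot_mode_config.py",
}


def build_command(env_name):
    """Return the argument list for training MuZero on the given environment."""
    # -u keeps Python's output unbuffered so training logs appear immediately.
    return ["python3", "-u", CONFIGS[env_name]]


def launch(env_name, repo_dir="LightZero"):
    """Run the training script from the repository root (assumed location)."""
    return subprocess.run(build_command(env_name), cwd=repo_dir, check=True)
```

Calling `launch("cartpole")` from the directory that contains the cloned repository is then equivalent to the first pair of commands.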
Baseline Algorithm Comparison
Baseline results of AlphaZero and MuZero on three board games (TicTacToe, Connect4, and Gomoku):
Baseline results of MuZero, MuZero w/ SSL, EfficientZero, and Sampled EfficientZero on three representative Atari environments with discrete action spaces:
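Baseline results of Sampled EfficientZero (with two policy representations, Factored and Gaussian) on five continuous action space environments (Pendulum-v1, LunarLanderContinuous-v2, BipedalWalker-v3, Hopper-v3, and Walker2d-v3): Factored Policy means that the agent learns a policy network that outputs a discrete distribution; after manual discretization, the action space sizes of the five environments are 11, 49 (7^2), 256 (4^4), 64 (4^3), and 4096 (4^6), respectively. Gaussian Policy means that the agent learns a policy network that directly outputs the parameters μ and σ of a Gaussian distribution.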
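The discretized action space sizes quoted above follow from bins^dim, where dim is the number of continuous action dimensions and bins is the number of values per dimension. The sketch below reproduces those sizes. The action dimensionalities are the standard Gym ones for these environments; the evenly spaced grid on [-1, 1] and the helper name discretize are assumptions for illustration, since the text does not say how the bins are placed.

```python
import itertools

# (action dimensions, bins per dimension); the bin counts are implied by
# the sizes 11, 7^2, 4^4, 4^3, and 4^6 quoted in the text.
ENVS = {
    "Pendulum-v1": (1, 11),
    "LunarLanderContinuous-v2": (2, 7),
    "BipedalWalker-v3": (4, 4),
    "Hopper-v3": (3, 4),
    "Walker2d-v3": (6, 4),
}


def discretize(action_dim, bins, low=-1.0, high=1.0):
    """Enumerate every joint action on an evenly spaced grid (assumed layout)."""
    step = (high - low) / (bins - 1)
    grid = [low + i * step for i in range(bins)]
    # The joint action set is the Cartesian product of the per-dimension grids.
    return list(itertools.product(grid, repeat=action_dim))


sizes = {name: len(discretize(dim, bins)) for name, (dim, bins) in ENVS.items()}
for name, size in sizes.items():
    print(name, size)
```

The exponential growth with dim is why Walker2d-v3, with six action dimensions, already needs 4096 discrete actions at only four bins per dimension.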
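Baseline results of Gumbel MuZero and MuZero under different numbers of simulations on four environments (PongNoFrameskip-v4, MsPacmanNoFrameskip-v4, Gomoku, and LunarLanderContinuous-v2):
Baseline results of Stochastic MuZero and MuZero on 2048 environments with different levels of stochasticity (num_chances=2/5):
Baseline results of MuZero w/ SSL combined with different exploration mechanisms on MiniGrid environments: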
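MCTS-Related Notes
Paper Notes
The following are detailed Chinese-language documents for the algorithms integrated in LightZero:
Algorithm Overview Diagrams
The following are framework overview diagrams for the algorithms integrated in LightZero:
MCTS-Related Papers
The following is a collection of MCTS-related papers. This section will be updated continuously to track the latest developments in MCTS.
Key Papers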
LightZero Implemented series
- 2018 Science AlphaZero: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play
- 2019 MuZero: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
- 2021 EfficientZero: Mastering Atari Games with Limited Data
- 2021 Sampled MuZero: Learning and Planning in Complex Action Spaces
- 2022 Stochastic MuZero: Planning in Stochastic Environments with A Learned Model
- 2022 Gumbel MuZero: Policy Improvement by Planning with Gumbel
AlphaGo series
- 2015 Nature AlphaGo Mastering the game of Go with deep neural networks and tree search
- 2017 Nature AlphaGo Zero Mastering the game of Go without human knowledge
- 2019 ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero
- 2023 Student of Games: A unified learning algorithm for both perfect and imperfect information games
MuZero series
- 2022 Online and Offline Reinforcement Learning by Planning with a Learned Model
- 2021 Vector Quantized Models for Planning
- 2021 Muesli: Combining Improvements in Policy Optimization.
MCTS Analysis
- 2020 Monte-Carlo Tree Search as Regularized Policy Optimization
- 2021 Self-Consistent Models and Values
- 2022 Adversarial Policies Beat Professional-Level Go AIs
- 2022 PNAS Acquisition of Chess Knowledge in AlphaZero.
MCTS Application
- 2023 Symbolic Physics Learner: Discovering governing equations via Monte Carlo tree search
- 2022 Nature Discovering faster matrix multiplication algorithms with reinforcement learning
- 2022 MuZero with Self-competition for Rate Control in VP9 Video Compression
- 2021 DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning
- 2019 Combining Planning and Deep Reinforcement Learning in Tactical Decision Making for Autonomous Driving
Other Papers
ICML
- Scalable Safe Policy Improvement via Monte Carlo Tree Search 2023
- Alberto Castellini, Federico Bianchi, Edoardo Zorzi, Thiago D. Simão, Alessandro Farinelli, Matthijs T. J. Spaan
- Key: safe policy improvement online using a MCTS based strategy, Safe Policy Improvement with Baseline Bootstrapping
- ExpEnv: Gridworld and SysAdmin
- Efficient Learning for AlphaZero via Path Consistency 2022
- Dengwei Zhao, Shikui Tu, Lei Xu
- Key: limited amount of self-plays, path consistency (PC) optimality
- ExpEnv: Go, Othello, Gomoku
- Visualizing MuZero Models 2021
- Joery A. de Vries, Ken S. Voskuil, Thomas M. Moerland, Aske Plaat
- Key: visualizing the value equivalent dynamics model, action trajectories diverge, two regularization techniques
- ExpEnv: CartPole and MountainCar, and internal state transition dynamics
- Convex Regularization in Monte-Carlo Tree Search 2021
- Tuan Dam, Carlo D'Eramo, Jan Peters, Joni Pajarinen
- Key: entropy-regularization backup operators, regret analysis, Tsallis entropy
- ExpEnv: synthetic tree, Atari
- Information Particle Filter Tree: An Online Algorithm for POMDPs with Belief-Based Rewards on Continuous Domains 2020
- Johannes Fischer, Ömer Sahin Tas
- Key: Continuous POMDP, Particle Filter Tree, information-based reward shaping, Information Gathering.
- ExpEnv: POMDPs.jl framework
- Code
- Retro*: Learning Retrosynthetic Planning with Neural Guided A* Search 2020
- Binghong Chen, Chengtao Li, Hanjun Dai, Le Song
- Key: chemical retrosynthetic planning, neural-based A*-like algorithm, ANDOR tree
- ExpEnv: USPTO datasets
- Code
ICLR
- Become a Proficient Player with Limited Data through Watching Pure Videos 2023
- Weirui Ye, Yunsheng Zhang, Pieter Abbeel, Yang Gao
- Key: pre-training from action-free videos, forward-inverse cycle consistency (FICC) objective based on vector quantization, pre-training phase, fine-tuning phase.
- ExpEnv: Atari
- Policy-Based Self-Competition for Planning Problems 2023
- Jonathan Pirnay, Quirin Göttl, Jakob Burger, Dominik Gerhard Grimm
- Key: self-competition, find strong trajectories by planning against possible strategies of its past self.
- ExpEnv: Traveling Salesman Problem and the Job-Shop Scheduling Problem.
- Explaining Temporal Graph Models through an Explorer-Navigator Framework 2023
- Wenwen Xia, Mincai Lai, Caihua Shan, Yao Zhang, Xinnan Dai, Xiang Li, Dongsheng Li
- Key: Temporal GNN Explainer, an explorer to find the event subsets with MCTS, a navigator that learns the correlations between events and helps reduce the search space.
- ExpEnv: Wikipedia and Reddit, Synthetic datasets
- SpeedyZero: Mastering Atari with Limited Data and Time 2023
- Yixuan Mei, Jiaxuan Gao, Weirui Ye, Shaohuai Liu, Yang Gao, Yi Wu
- Key: distributed RL system, Priority Refresh, Clipped LARS
- ExpEnv: Atari
- Efficient Offline Policy Optimization with a Learned Model 2023
- Zichen Liu, Siyi Li, Wee Sun Lee, Shuicheng YAN, Zhongwen Xu
- Key: Regularized One-Step Model-based algorithm for Offline-RL
- ExpEnv: Atari, BSuite
- Code
- Enabling Arbitrary Translation Objectives with Adaptive Tree Search 2022
- Wang Ling, Wojciech Stokowiec, Domenic Donato, Chris Dyer, Lei Yu, Laurent Sartran, Austin Matthews
- Key: adaptive tree search, translation models, autoregressive models
- ExpEnv: Chinese–English and Pashto–English tasks from WMT2020, German–English from WMT2014
- What's Wrong with Deep Learning in Tree Search for Combinatorial Optimization 2022
- Maximilian Böther, Otto Kißig, Martin Taraz, Sarel Cohen, Karen Seidel, Tobias Friedrich
- Key: Combinatorial optimization, open-source benchmark suite for the NP-hard MAXIMUM INDEPENDENT SET problem, an in-depth analysis of the popular guided tree search algorithm, compare the tree search implementations to other solvers
- ExpEnv: NP-hard MAXIMUM INDEPENDENT SET.
- Code
- Monte-Carlo Planning and Learning with Language Action Value Estimates 2021
- Youngsoo Jang, Seokin Seo, Jongmin Lee, Kee-Eung Kim
- Key: Monte-Carlo tree search with language-driven exploration, locally optimistic language value estimates
- ExpEnv: Interactive Fiction (IF) games
- Practical Massively Parallel Monte-Carlo Tree Search Applied to Molecular Design 2021
- Xiufeng Yang, Tanuj Kr Aasawat, Kazuki Yoshizoe
- Key: massively parallel Monte-Carlo Tree Search, molecular design, Hash-driven parallel search
- ExpEnv: octanol-water partition coefficient (logP) penalized by the synthetic accessibility (SA) and large Ring Penalty score.
- Watch the Unobserved: A Simple Approach to Parallelizing Monte Carlo Tree Search 2020
- Anji Liu, Jianshu Chen, Mingze Yu, Yu Zhai, Xuewen Zhou, Ji Liu
- Key: parallel Monte-Carlo Tree Search, partition the tree into sub-trees efficiently, compare the observation ratio of each processor
- ExpEnv: speedup and performance comparison on JOY-CITY game, average episode return on atari game
- Code
- Learning to Plan in High Dimensions via Neural Exploration-Exploitation Trees 2020
- Binghong Chen, Bo Dai, Qinjie Lin, Guo Ye, Han Liu, Le Song
- Key: meta path planning algorithm, exploits a novel neural architecture which can learn promising search directions from problem structures.
- ExpEnv: a 2d workspace with a 2 DoF (degrees of freedom) point robot, a 3 DoF stick robot and a 5 DoF snake robot
NeurIPS
- LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios 2023
- Yazhe Niu, Yuan Pu, Zhenjie Yang, Xueyan Li, Tong Zhou, Jiyuan Ren, Shuai Hu, Hongsheng Li, Yu Liu
- Key: the first unified benchmark for deploying MCTS/MuZero in general sequential decision scenarios.
- ExpEnv: ClassicControl, Box2D, Atari, MuJoCo, GoBigger, MiniGrid, TicTacToe, ConnectFour, Gomoku, 2048, etc.
- Large Language Models as Commonsense Knowledge for Large-Scale Task Planning 2023
- Zirui Zhao, Wee Sun Lee, David Hsu
- Key: world model (LLM) and the LLM-induced policy can be combined in MCTS, to scale up task planning.
- ExpEnv: multiplication, travel planning, object rearrangement
- Monte Carlo Tree Search with Boltzmann Exploration 2023
- Michael Painter, Mohamed Baioumy, Nick Hawes, Bruno Lacerda
- Key: Boltzmann exploration with MCTS, optimal actions for the maximum entropy objective do not necessarily correspond to optimal actions for the original objective, two improved algorithms.
- ExpEnv: the Frozen Lake environment, the Sailing Problem, Go
- Generalized Weighted Path Consistency for Mastering Atari Games 2023
- Dengwei Zhao, Shikui Tu, Lei Xu
- Key: Generalized Weighted Path Consistency, A weighting mechanism.
- ExpEnv: Atari
- Accelerating Monte Carlo Tree Search with Probability Tree State Abstraction 2023
- Yangqing Fu, Ming Sun, Buqing Nie, Yue Gao
- Key: probability tree state abstraction, transitivity and aggregation error bound
- ExpEnv: Atari, CartPole, LunarLander, Gomoku
- Planning for Sample Efficient Imitation Learning 2022
- Zhao-Heng Yin, Weirui Ye, Qifeng Chen, Yang Gao
- Key: Behavioral Cloning, Adversarial Imitation Learning (AIL), MCTS-based RL
- ExpEnv: DeepMind Control Suite
- Code
- Evaluation Beyond Task Performance: Analyzing Concepts in AlphaZero in Hex 2022
- Charles Lovering, Jessica Zosa Forde, George Konidaris, Ellie Pavlick, Michael L. Littman
- Key: AlphaZero’s internal representations, model probing and behavioral tests, how these concepts are captured in the network.
- ExpEnv: Hex
- Are AlphaZero-like Agents Robust to Adversarial Perturbations? 2022
- Li-Cheng Lan, Huan Zhang, Ti-Rong Wu, Meng-Yu Tsai, I-Chen Wu, Cho-Jui Hsieh
- Key: adversarial states, first adversarial attack on Go AIs
- ExpEnv: Go
- Monte Carlo Tree Descent for Black-Box Optimization 2022
- Yaoguang Zhai, Sicun Gao
- Key: Black-Box Optimization, how to further integrate sample-based descent for faster optimization.
- ExpEnv: synthetic functions for nonlinear optimization, reinforcement learning problems in MuJoCo locomotion environments, and optimization problems in Neural Architecture Search (NAS).
- Monte Carlo Tree Search based Variable Selection for High Dimensional Bayesian Optimization 2022
- Lei Song, Ke Xue, Xiaobin Huang, Chao Qian
- Key: a low-dimensional subspace via MCTS, optimizes in the subspace with any Bayesian optimization algorithm.
- ExpEnv: NAS-bench problems and MuJoCo locomotion
- Monte Carlo Tree Search With Iteratively Refining State Abstractions 2021
- Samuel Sokota, Caleb Ho, Zaheen Ahmad, J. Zico Kolter
- Key: stochastic environments, Progressive widening, abstraction refining
- ExpEnv: Blackjack, Trap, five by five Go.
- Deep Synoptic Monte Carlo Planning in Reconnaissance Blind Chess 2021
- Gregory Clark
- Key: imperfect information, belief state with an unweighted particle filter, a novel stochastic abstraction of information states.
- ExpEnv: reconnaissance blind chess
- POLY-HOOT: Monte-Carlo Planning in Continuous Space MDPs with Non-Asymptotic Analysis 2020
- Weichao Mao, Kaiqing Zhang, Qiaomin Xie, Tamer Başar
- Key: continuous state-action spaces, Hierarchical Optimistic Optimization
- ExpEnv: CartPole, Inverted Pendulum, Swing-up, and LunarLander.
- Learning Search Space Partition for Black-box Optimization using Monte Carlo Tree Search 2020
- Linnan Wang, Rodrigo Fonseca, Yuandong Tian
- Key: learns the partition of the search space using a few samples, a nonlinear decision boundary and learns a local model to pick good candidates.
- ExpEnv: MuJoCo locomotion tasks, Small-scale Benchmarks
- Mix and Match: An Optimistic Tree-Search Approach for Learning Models from Mixture Distributions 2020
- Matthew Faw, Rajat Sen, Karthikeyan Shanmugam, Constantine Caramanis, Sanjay Shakkottai
- Key: covariate shift problem, Mix&Match combines stochastic gradient descent (SGD) with optimistic tree search and model re-use (evolving partially trained models with samples from different mixture distributions)
- Code
Other Conference or Journal
- On Monte Carlo Tree Search and Reinforcement Learning Journal of Artificial Intelligence Research 2017.
- Sample-Efficient Neural Architecture Search by Learning Actions for Monte Carlo Tree Search IEEE Transactions on Pattern Analysis and Machine Intelligence 2022.
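Feedback and Contribution
If you have any questions or comments, you can open an issue directly on GitHub
or contact us by email ([email protected])
We appreciate all feedback, including feedback on the algorithms and system design. Your feedback and suggestions will make LightZero better.
Citation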
@misc{lightzero,
title={LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios},
author={Yazhe Niu and Yuan Pu and Zhenjie Yang and Xueyan Li and Tong Zhou and Jiyuan Ren and Shuai Hu and Hongsheng Li and Yu Liu},
year={2023},
eprint={2310.08348},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
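Acknowledgments
The implementation of this library is partly based on the following GitHub repositories. Many thanks for these pioneering works: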
- https://github.com/opendilab/DI-engine
- https://github.com/deepmind/mctx
- https://github.com/YeWR/EfficientZero
- https://github.com/werner-duvaud/muzero-general
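Special thanks to the following contributors for their contributions and support: @PaParaZz1, @karroyan, @nighood, @jayyoung0802, @timothijoe, @TuTuHuss, @HarryXuancy, @puyuan1996, @HansBug.
License
All code in this repository is released under the Apache License 2.0.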