Mastering Atari, Go, chess and shogi by planning with a learned model

Schrittwieser, Julian; Antonoglou, Ioannis; Hubert, Thomas; Simonyan, Karen; Sifre, Laurent; Schmitt, Simon; Guez, Arthur; Lockhart, Edward; Hassabis, Demis; Graepel, Thore; Lillicrap, Timothy; Silver, David

doi:10.1038/s41586-020-03051-4

Article
Published: 23 December 2020

Julian Schrittwieser¹,
Ioannis Antonoglou¹,²,
Thomas Hubert¹,
Karen Simonyan¹,
Laurent Sifre¹,
Simon Schmitt¹,
Arthur Guez¹,
Edward Lockhart¹,
Demis Hassabis¹,
Thore Graepel¹,²,
Timothy Lillicrap¹ &
David Silver¹,² (ORCID: orcid.org/0000-0002-5197-2892)

Nature volume 588, pages 604–609 (2020)

Abstract

Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess¹ and Go², where a perfect simulator is available. However, in real-world problems, the dynamics governing the environment are often complex and unknown. Here we present the MuZero algorithm, which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. The MuZero algorithm learns an iterable model that produces predictions relevant to planning: the action-selection policy, the value function and the reward. When evaluated on 57 different Atari games³ (the canonical video game environment for testing artificial intelligence techniques, in which model-based planning approaches have historically struggled⁴), the MuZero algorithm achieved state-of-the-art performance. When evaluated on Go, chess and shogi (canonical environments for high-performance planning), the MuZero algorithm matched, without any knowledge of the game dynamics, the superhuman performance of the AlphaZero algorithm⁵ that was supplied with the rules of the game.
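
To make the idea of an "iterable model" more concrete, the following is a minimal sketch and not the authors' implementation (their pseudocode is in pseudocode.py in the Supplementary Information). It assumes the model can be split into three hypothetical functions, here called `represent`, `dynamics` and `predict`, and it uses fixed toy arithmetic in place of trained neural networks. The class name `ToyModel`, the state size and all numeric constants are illustrative assumptions. The point is only that, after encoding one observation, the model can be stepped forward on its own for a sequence of hypothetical actions, emitting a reward, a value and a policy at each step without consulting the real environment.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class StepPrediction:
    """The three planning-relevant quantities named in the abstract."""
    reward: float
    value: float
    policy: List[float]


class ToyModel:
    """Hypothetical stand-in for a learned model (fixed maths, nothing is trained)."""

    def __init__(self, num_actions: int, state_size: int = 4) -> None:
        self.num_actions = num_actions
        self.state_size = state_size

    def represent(self, observation: Sequence[float]) -> List[float]:
        # Encode a raw observation into an internal (hidden) state.
        total = sum(observation)
        return [math.tanh(total * (i + 1) / self.state_size)
                for i in range(self.state_size)]

    def dynamics(self, state: Sequence[float], action: int) -> Tuple[List[float], float]:
        # Advance the hidden state given an action and predict the reward.
        shift = 0.1 * (action + 1)
        next_state = [math.tanh(s + shift * (i + 1)) for i, s in enumerate(state)]
        reward = sum(next_state) / len(next_state)
        return next_state, reward

    def predict(self, state: Sequence[float]) -> Tuple[List[float], float]:
        # Predict a policy (softmax over toy logits) and a scalar value.
        total = sum(state)
        logits = [total * (a + 1) / self.num_actions for a in range(self.num_actions)]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in logits]
        norm = sum(exps)
        return [e / norm for e in exps], math.tanh(total)


def unroll(model: ToyModel, observation: Sequence[float],
           actions: Sequence[int]) -> List[StepPrediction]:
    """Iterate the model over hypothetical actions, starting from one observation."""
    state = model.represent(observation)
    predictions = []
    for action in actions:
        state, reward = model.dynamics(state, action)
        policy, value = model.predict(state)
        predictions.append(StepPrediction(reward, value, policy))
    return predictions


if __name__ == "__main__":
    toy = ToyModel(num_actions=3)
    for step, p in enumerate(unroll(toy, [0.2, -0.1, 0.5], [0, 2, 1]), start=1):
        print(step, round(p.reward, 3), round(p.value, 3),
              [round(x, 3) for x in p.policy])
```

In a planner, a tree search would call functions like these repeatedly to evaluate candidate action sequences. The sketch omits the search itself and any training procedure.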

**Fig. 1: Planning, acting and training with a learned model.**

**Fig. 2: Evaluation of MuZero throughout training in chess, shogi, Go and Atari.**

**Fig. 3: Evaluations of MuZero on Go, all 57 Atari games and Ms. Pac-Man.**

Data availability

MuZero is trained only on data generated by MuZero itself; no external data were used to produce the results presented in the article. Data for all figures and tables presented are available in JSON format in the Supplementary Information.

Code availability

The Arcade Learning Environment³ is available open source at https://github.com/mgbellemare/Arcade-Learning-Environment. The Go and chess environments are available open source in OpenSpiel⁵² at https://github.com/deepmind/open_spiel. The pseudocode for the MuZero algorithm can be found in the file pseudocode.py in the Supplementary Information. All the neural architecture details and hyperparameters are described in Methods.

References

1. Campbell, M., Hoane, A. J. Jr & Hsu, F.-h. Deep Blue. Artif. Intell. 134, 57–83 (2002).
2. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
3. Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).
4. Machado, M. et al. Revisiting the arcade learning environment: evaluation protocols and open problems for general agents. J. Artif. Intell. Res. 61, 523–562 (2018).
5. Silver, D. et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018).
6. Schaeffer, J. et al. A world championship caliber checkers program. Artif. Intell. 53, 273–289 (1992).
7. Brown, N. & Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359, 418–424 (2018).
8. Moravčík, M. et al. Deepstack: expert-level artificial intelligence in heads-up no-limit poker. Science 356, 508–513 (2017).
9. Vlahavas, I. & Refanidis, I. Planning and Scheduling Technical Report (EETN, 2013).
10. Segler, M. H., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
11. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction 2nd edn (MIT Press, 2018).
12. Deisenroth, M. & Rasmussen, C. PILCO: a model-based and data-efficient approach to policy search. In Proc. 28th International Conference on Machine Learning, ICML 2011 465–472 (Omnipress, 2011).
13. Heess, N. et al. Learning continuous control policies by stochastic value gradients. In NIPS’15: Proc. 28th International Conference on Neural Information Processing Systems Vol. 2 (eds Cortes, C. et al.) 2944–2952 (MIT Press, 2015).
14. Levine, S. & Abbeel, P. Learning neural network policies with guided policy search under unknown dynamics. Adv. Neural Inf. Process. Syst. 27, 1071–1079 (2014).
15. Hafner, D. et al. Learning latent dynamics for planning from pixels. Preprint at https://arxiv.org/abs/1811.04551 (2018).
16. Kaiser, L. et al. Model-based reinforcement learning for Atari. Preprint at https://arxiv.org/abs/1903.00374 (2019).
17. Buesing, L. et al. Learning and querying fast generative models for reinforcement learning. Preprint at https://arxiv.org/abs/1802.03006 (2018).
18. Espeholt, L. et al. IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In Proc. International Conference on Machine Learning, ICML Vol. 80 (eds Dy, J. & Krause, A.) 1407–1416 (2018).
19. Kapturowski, S., Ostrovski, G., Dabney, W., Quan, J. & Munos, R. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations (2019).
20. Horgan, D. et al. Distributed prioritized experience replay. In International Conference on Learning Representations (2018).
21. Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming 1st edn (John Wiley & Sons, 1994).
22. Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games 72–83 (Springer, 2006).
23. Wahlström, N., Schön, T. B. & Deisenroth, M. P. From pixels to torques: policy learning with deep dynamical models. Preprint at http://arxiv.org/abs/1502.02251 (2015).
24. Watter, M., Springenberg, J. T., Boedecker, J. & Riedmiller, M. Embed to control: a locally linear latent dynamics model for control from raw images. In NIPS’15: Proc. 28th International Conference on Neural Information Processing Systems Vol. 2 (eds Cortes, C. et al.) 2746–2754 (MIT Press, 2015).
25. Ha, D. & Schmidhuber, J. Recurrent world models facilitate policy evolution. In NIPS’18: Proc. 32nd International Conference on Neural Information Processing Systems (eds Bengio, S. et al.) 2455–2467 (Curran Associates, 2018).
26. Gelada, C., Kumar, S., Buckman, J., Nachum, O. & Bellemare, M. G. DeepMDP: learning continuous latent space models for representation learning. Proc. 36th International Conference on Machine Learning: Volume 97 of Proc. Machine Learning Research (eds Chaudhuri, K. & Salakhutdinov, R.) 2170–2179 (PMLR, 2019).
27. van Hasselt, H., Hessel, M. & Aslanides, J. When to use parametric models in reinforcement learning? Preprint at https://arxiv.org/abs/1906.05243 (2019).
28. Tamar, A., Wu, Y., Thomas, G., Levine, S. & Abbeel, P. Value iteration networks. Adv. Neural Inf. Process. Syst. 29, 2154–2162 (2016).
29. Silver, D. et al. The predictron: end-to-end learning and planning. In Proc. 34th International Conference on Machine Learning Vol. 70 (eds Precup, D. & Teh, Y. W.) 3191–3199 (JMLR, 2017).
30. Farahmand, A. M., Barreto, A. & Nikovski, D. Value-aware loss function for model-based reinforcement learning. In Proc. 20th International Conference on Artificial Intelligence and Statistics: Volume 54 of Proc. Machine Learning Research (eds Singh, A. & Zhu, J) 1486–1494 (PMLR, 2017).
31. Farahmand, A. Iterative value-aware model learning. Adv. Neural Inf. Process. Syst. 31, 9090–9101 (2018).
32. Farquhar, G., Rocktaeschel, T., Igl, M. & Whiteson, S. TreeQN and ATreeC: differentiable tree planning for deep reinforcement learning. In International Conference on Learning Representations (2018).
33. Oh, J., Singh, S. & Lee, H. Value prediction network. Adv. Neural Inf. Process. Syst. 30, 6118–6128 (2017).
34. Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012).
35. He, K., Zhang, X., Ren, S. & Sun, J. Identity mappings in deep residual networks. In 14th European Conference on Computer Vision 630–645 (2016).
36. Hessel, M. et al. Rainbow: combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence (2018).
37. Schmitt, S., Hessel, M. & Simonyan, K. Off-policy actor-critic with shared experience replay. Preprint at https://arxiv.org/abs/1909.11583 (2019).
38. Azizzadenesheli, K. et al. Surprising negative results for generative adversarial tree search. Preprint at http://arxiv.org/abs/1806.05780 (2018).
39. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
40. OpenAI. OpenAI Five. OpenAI https://blog.openai.com/openai-five/ (2018).
41. Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019).
42. Jaderberg, M. et al. Reinforcement learning with unsupervised auxiliary tasks. Preprint at https://arxiv.org/abs/1611.05397 (2016).
43. Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).
44. Kocsis, L. & Szepesvári, C. Bandit based Monte-Carlo planning. In European Conference on Machine Learning 282–293 (Springer, 2006).
45. Rosin, C. D. Multi-armed bandits with episode context. Ann. Math. Artif. Intell. 61, 203–230 (2011).
46. Schadd, M. P., Winands, M. H., Van Den Herik, H. J., Chaslot, G. M.-B. & Uiterwijk, J. W. Single-player Monte-Carlo tree search. In International Conference on Computers and Games 1–12 (Springer, 2008).
47. Pohlen, T. et al. Observe and look further: achieving consistent performance on Atari. Preprint at https://arxiv.org/abs/1805.11593 (2018).
48. Schaul, T., Quan, J., Antonoglou, I. & Silver, D. Prioritized experience replay. In International Conference on Learning Representations (2016).
49. Cloud TPU. Google Cloud https://cloud.google.com/tpu/ (2019).
50. Coulom, R. Whole-history rating: a Bayesian rating system for players of time-varying strength. In International Conference on Computers and Games 113–124 (2008).
51. Nair, A. et al. Massively parallel methods for deep reinforcement learning. Preprint at https://arxiv.org/abs/1507.04296 (2015).
52. Lanctot, M. et al. OpenSpiel: a framework for reinforcement learning in games. Preprint at http://arxiv.org/abs/1908.09453 (2019).

Acknowledgements

We thank L. Bennett, O. Smith and C. Apps for organizational assistance; K. Kavukcuoglu for reviewing the paper; T. Anthony, M. Lai, N. Tomasev, U. Paquet and S. Ghaisas for many discussions; and the rest of the DeepMind team for their support.

Author information

These authors contributed equally: Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, David Silver

Authors and Affiliations

DeepMind, London, UK
Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap & David Silver
University College London, London, UK
Ioannis Antonoglou, Thore Graepel & David Silver

Contributions

J.S., I.A., T.H. and D.S. designed the MuZero algorithm with advice from A.G., K.S., L.S., E.L., T.L. and T.G.; J.S., I.A., T.H. and S.S. implemented the MuZero program, ran experiments and analysed data. D.S., J.S., I.A. and T.H. wrote the paper with contributions from A.G., K.S., L.S., E.L., T.L., T.G. and D.H.

Corresponding author

Correspondence to David Silver.

Ethics declarations

Competing interests

DeepMind filed Greek patent GR20200100037 on 28 January 2020, covering the MuZero algorithm described in this paper, listing the authors J.S., I.A. and T.H. as inventors. The other authors declare no competing interests.

Additional information

Peer review information: Nature thanks Jaap van den Herik and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

This file contains Supplementary Figures S1-S5 and Supplementary Tables S1-S2.

Supplementary Data

The ZIP file contains Supplementary Data.

About this article

Cite this article

Schrittwieser, J., Antonoglou, I., Hubert, T. et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588, 604–609 (2020). https://doi.org/10.1038/s41586-020-03051-4

Received: 03 April 2020
Accepted: 07 October 2020
Published: 23 December 2020
Issue Date: 24 December 2020
DOI: https://doi.org/10.1038/s41586-020-03051-4