Submitted: 28 June 2023
Posted: 29 June 2023
Abstract
Keywords:
1. Introduction
2. Approximate dynamic programming methods
2.1. Problem description
- Direct policy evaluation: Find the parameter vector that minimizes the weighted distance between the cost vector of the policy and its approximation, i.e. compute the projection of the cost vector onto the approximation subspace. We approximate this projection by Monte Carlo simulation. Nonlinear architectures, e.g. deep neural networks, can also be introduced in the approximation.
- Indirect policy evaluation:
  - Solve the projected equation (11) (Galerkin approximation [16,18]) by simulation. Equation (11) is equivalent to driving the expected temporal difference error (TDE), measured in the $\xi$-weighted norm, down to 0 [14].
    - TD learning: a stochastic iterative algorithm for solving (11).
    - Least-squares temporal difference learning (LSTD): the TD fixed point method [14].
    - Least-squares policy evaluation (LSPE): a simulation-based form of projected iteration; essentially the projected iteration plus simulation noise.
  - Bellman equation error (Bellman residual [14]) method: minimize the weighted norm of the Bellman equation error of the approximation.
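The three families listed above can be compared exactly on a small model before any simulation is introduced. The following is a minimal sketch, assuming a linear architecture and a known model; the chain `P`, the costs `g`, the discount factor `alpha`, the feature matrix `Phi` and all variable names are illustrative assumptions, not quantities taken from this paper.

```python
import numpy as np

# Hypothetical 4-state Markov chain under a fixed policy (illustrative numbers).
P = np.array([[0.50, 0.50, 0.00, 0.00],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.00, 0.00, 0.50, 0.50]])
g = np.array([1.0, 0.0, 0.0, 2.0])      # expected one-stage cost per state
alpha = 0.9                              # discount factor
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])             # feature matrix (states x features)

n = P.shape[0]
I = np.eye(n)

# Stationary distribution xi: solve xi P = xi together with sum(xi) = 1.
M = np.vstack([P.T - I, np.ones((1, n))])
rhs = np.concatenate([np.zeros(n), [1.0]])
xi = np.linalg.lstsq(M, rhs, rcond=None)[0]
Xi = np.diag(xi)

def xi_norm(v):
    """xi-weighted Euclidean norm."""
    return float(np.sqrt(v @ Xi @ v))

# Exact cost vector of the policy: J = g + alpha * P J.
J = np.linalg.solve(I - alpha * P, g)

# Direct evaluation: weighted least-squares projection of J onto span(Phi).
r_direct = np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi @ J)

# Indirect evaluation (TD fixed point): Phi' Xi (I - alpha P) Phi r = Phi' Xi g.
r_td = np.linalg.solve(Phi.T @ Xi @ (I - alpha * P) @ Phi, Phi.T @ Xi @ g)

# Bellman residual: minimize || (I - alpha P) Phi r - g ||_xi.
Psi = (I - alpha * P) @ Phi
r_br = np.linalg.solve(Psi.T @ Xi @ Psi, Psi.T @ Xi @ g)

def bellman_residual(r):
    """xi-weighted norm of Phi r - (g + alpha P Phi r)."""
    return xi_norm(Phi @ r - (g + alpha * P @ Phi @ r))

for name, r in [("direct", r_direct), ("TD fixed point", r_td),
                ("Bellman residual", r_br)]:
    print(f"{name:17s} error={xi_norm(J - Phi @ r):.4f} "
          f"residual={bellman_residual(r):.4f}")
```

By construction, the direct solution has the smallest $\xi$-weighted approximation error and the Bellman residual solution has the smallest residual norm; the TD fixed point generally lies between the two on both measures, which is the trade-off discussed in [14].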
2.2. Direct policy evaluation
2.3. Indirect policy evaluation
2.3.1. Stochastic algorithms: a general view
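- Stochastic approximation (SA) approach: starting from an initial point, update the iterate sample by sample, combining each new sample with the current iterate through a sequence of weights.
- Monte Carlo estimation (MCE) approach: Form Monte Carlo estimates of $b$ and $A$, then solve the resulting estimated system by a direct method or an iterative method. The estimates can be built incrementally.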
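The two approaches above can be sketched side by side on a single simulated trajectory. The example below is a sketch under stated assumptions: the chain, costs, features and step-size schedule are hypothetical, the SA update is instantiated as TD(0), and the MCE system is taken to be the sample version of the projected equation (the LSTD system). It is not claimed to match the notation of this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-state chain, costs and features (illustrative numbers).
P = np.array([[0.50, 0.50, 0.00, 0.00],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.00, 0.00, 0.50, 0.50]])
g = np.array([1.0, 0.0, 0.0, 2.0])
alpha = 0.9
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
n, s = Phi.shape

def simulate(num_steps, start=0):
    """Sample one trajectory of states from the chain P."""
    states = np.empty(num_steps + 1, dtype=int)
    states[0] = start
    for k in range(num_steps):
        states[k + 1] = rng.choice(n, p=P[states[k]])
    return states

states = simulate(20000)

A_hat = np.zeros((s, s))   # MCE: running average estimate of A
b_hat = np.zeros(s)        # MCE: running average estimate of b
r_sa = np.zeros(s)         # SA: current iterate

for k in range(len(states) - 1):
    i, j = states[k], states[k + 1]
    phi_i, phi_j = Phi[i], Phi[j]

    # MCE: incremental running averages, one sample at a time.
    A_hat += (np.outer(phi_i, phi_i - alpha * phi_j) - A_hat) / (k + 1)
    b_hat += (phi_i * g[i] - b_hat) / (k + 1)

    # SA (here TD(0)): move the iterate along the sampled temporal difference.
    gamma = 0.05 * 100.0 / (100.0 + k)   # assumed diminishing step size
    td_error = g[i] + alpha * (phi_j @ r_sa) - phi_i @ r_sa
    r_sa += gamma * td_error * phi_i

# MCE: solve the estimated system by a direct method.
r_mce = np.linalg.solve(A_hat, b_hat)

print("SA  (TD(0)) parameters:", r_sa)
print("MCE (LSTD)  parameters:", r_mce)
```

The design difference is visible in the loop: SA keeps only the current iterate and needs a step-size schedule, whereas MCE accumulates the estimates of $A$ and $b$ and defers the linear solve, which could equally be done by an iterative method.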
2.3.2. Temporal difference (TD) learning
2.3.3. Least-squares temporal difference (LSTD) learning
2.3.4. Least-squares policy evaluation
2.3.5. Bellman equation error methods
2.4. Least-squares policy iteration (LSPI)
2.5. A unified oblique projection view
3. Krylov subspace methods
3.1. Projection process
3.2. Krylov subspaces
| Name | Problem | Property |
|---|---|---|
| CG | SPD A | |
| SYMMLQ | | |
| MINRES | | |
| GMRES | general A | |
| BiCG | general A | Does not solve an optimization problem |
| QMR | general A | no |
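To make the CG row of the table concrete, the following is a minimal sketch of the conjugate gradient method for a symmetric positive definite (SPD) matrix. The function name `conjugate_gradient`, its parameters and the random test matrix are assumptions made for illustration. In exact arithmetic the CG iterates minimize the $A$-norm of the error over the Krylov subspace, so the recorded $A$-norm errors should be non-increasing.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Plain CG for SPD A; returns the final iterate and all iterates."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    r = b - A @ x              # residual
    p = r.copy()               # search direction
    rs_old = r @ r
    max_iter = n if max_iter is None else max_iter
    history = [x.copy()]
    for _ in range(max_iter):
        if np.sqrt(rs_old) <= tol:
            break
        Ap = A @ p
        step = rs_old / (p @ Ap)
        x = x + step * p
        r = r - step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p   # new A-conjugate direction
        rs_old = rs_new
        history.append(x.copy())
    return x, history

# Hypothetical well-conditioned SPD test problem.
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = B.T @ B + 6.0 * np.eye(6)
b = rng.standard_normal(6)

x_cg, history = conjugate_gradient(A, b)
x_true = np.linalg.solve(A, b)

# A-norm of the error at every iterate.
a_norm_errors = [float(np.sqrt((x - x_true) @ A @ (x - x_true))) for x in history]
print(a_norm_errors)
```

Note that the policy evaluation matrices of Section 2 are in general nonsymmetric, so CG does not apply to them directly; methods from the "general A" rows are the relevant ones there.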
3.3. The Arnoldi algorithm for orthonormal basis
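A minimal sketch of the Arnoldi process with modified Gram-Schmidt orthogonalization is given below, assuming a dense matrix and a fixed number of steps `m`; the function name `arnoldi`, the breakdown threshold and the random test matrix are illustrative assumptions. The process builds an orthonormal basis $V$ of the Krylov subspace generated by $A$ and $v$, together with an upper Hessenberg matrix $H$ satisfying $A V_m = V_{m+1} H$.

```python
import numpy as np

def arnoldi(A, v, m):
    """Return V (n x (m+1)) with orthonormal columns and H ((m+1) x m)."""
    n = v.shape[0]
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(m):
        w = A @ V[:, j]
        # Modified Gram-Schmidt against all previous basis vectors.
        for i in range(j + 1):
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:
            # Breakdown: the Krylov subspace is invariant under A.
            return V[:, :j + 1], H[:j + 1, :j + 1]
        V[:, j + 1] = w / H[j + 1, j]
    return V, H

# Hypothetical nonsymmetric test matrix.
rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8))
v = rng.standard_normal(8)
m = 4

V, H = arnoldi(A, v, m)
print(np.linalg.norm(A @ V[:, :m] - V @ H))   # Arnoldi relation residual
```

The basis and the Hessenberg matrix are the ingredients that projection methods such as GMRES work with; the sketch stops at building them.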
4. Connections
5. Conclusion and discussion
References
- Dimitri P. Bertsekas, Dynamic Programming and Stochastic Control, lecture notes, Fall 2015, MIT.
- Dimitri P. Bertsekas, Proximal algorithms and temporal differences for large linear systems: extrapolation, approximation, and simulation, Report LIDS-P-3205, MIT; arXiv preprint arXiv: 1610.05427, 2016.
- Justin A. Boyan, Technical update: Least-squares temporal difference learning, Machine Learning, 49(2-3):233-246, 2002.
- Steven J. Bradtke and Andrew G. Barto, Linear least-squares algorithms for temporal difference learning, Machine Learning, 22:33-57, 1996.
- Gene H. Golub and Charles F. Van Loan, Matrix Computations, 4th Edn., The Johns Hopkins University Press, Baltimore, 2013, Chapter 11.
- Michail G. Lagoudakis and Ronald Parr, Least-squares policy iteration, Journal of Machine Learning Research, 4:1107-1149, 2003.
- Alessandro Lazaric, Mohammad Ghavamzadeh and Rémi Munos, Finite-sample analysis of LSTD, In Proceedings of the 27th International Conference on Machine Learning, 2010.
- Jörg Liesen and Zdeněk Strakoš, Krylov Subspace Methods: Principles and Analysis, Oxford University Press, 2015, Chapters 1 and 2.
- Angelia Nedić and Dimitri P. Bertsekas, Least-squares policy evaluation algorithms with linear function approximation, Discrete Event Dynamic Systems: Theory and Applications, 13(1-2):79-110, 2003.
- Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield and Michael L. Littman, An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning, Proceedings of the 25th international conference on Machine learning, pp.752-759, Helsinki, Finland, July 05-09, 2008.
- Marek Petrik, An analysis of Laplacian methods for value function approximation in MDPs, Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp.2574-2579, Hyderabad, India, January 06-12, 2007.
- Pascal Poupart and Craig Boutilier, VDCBPI: an approximate scalable algorithm for large scale POMDPs, Advances in Neural Information Processing Systems 17 (NIPS-2004), pp.1081-1088, MIT Press, 2005.
- Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc., New York, NY, 1994, Chapter 6.
- Bruno Scherrer, Should one compute the temporal difference fixed point or minimize the Bellman residual: the unified oblique projection view, Proceedings of International Conference on Machine Learning, Haifa, Israel, pp.959-966, 2010.
- K. Senda, S. Hattori, T. Hishinuma and T. Kohda, Acceleration of reinforcement learning by policy evaluation using nonstationary iterative method, IEEE Trans. Cybern., 44(12):2696-2705, Dec. 2014.
- Csaba Szepesvári, Least squares temporal difference learning and Galerkin’s method, Mini-Workshop: Mathematics of Machine Learning, Oberwolfach Reports, European Mathematical Society, Oberwolfach, pp. 2385-2388, 2011.
- John N. Tsitsiklis and Benjamin Van Roy, An analysis of temporal-difference learning with function approximation, IEEE Trans. Automat. Contr., 42(5):674-690, 1997.
- Huizhen Yu and Dimitri P. Bertsekas, Error bounds for approximations from projected linear equations, Mathematics of Operations Research, 35(2):306-329, 2010.
- Sitao Luan, Mingde Zhao, Xiao-Wen Chang, Doina Precup (2019). Break the ceiling: Stronger multi-scale deep graph convolutional networks. Advances in neural information processing systems, 32.
- Sitao Luan, Xiao-Wen Chang, Doina Precup (2019). Revisit Policy Optimization in Matrix Form. arXiv preprint arXiv:1909.09186.
- Sitao Luan, Mingde Zhao, Chenqing Hua, Xiao-Wen Chang, Doina Precup (2020). Complete the missing half: Augmenting aggregation filtering with diversification for graph convolutional networks. arXiv preprint arXiv:2008.08844.
- Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, Doina Precup (2021). Is Heterophily A Real Nightmare For Graph Neural Networks To Do Node Classification?. arXiv preprint arXiv:2109.05641.
- Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, Doina Precup (2022). Revisiting heterophily for graph neural networks. arXiv preprint arXiv:2210.07606.
- Sitao Luan, Chenqing Hua, Minkai Xu, Qincheng Lu, Jiaqi Zhu, Xiao-Wen Chang, Jie Fu, Jure Leskovec, Doina Precup (2023). When Do Graph Neural Networks Help with Node Classification: Investigating the Homophily Principle on Node Distinguishability. arXiv preprint arXiv:2304.14274.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).