Submitted:
17 January 2024
Posted:
19 January 2024
Abstract
Keywords:
1. Introduction and Motivation
2. Background
2.1. Explaining the Cartpole Scenario
- , where .
- , where .
- .
- , where .
- , where .
- .
- , where .
3. Overcooked-AI: A Multi-Agent Cooperative Environment
- Player 1: Position (5, 1) - Facing (1, 0) - Holding Soup
- Player 2: Position (1, 3) - Facing (-1, 0) - Holding Onion
- Not owned objects: Soup at (4, 0) - with 1 onion and 0 cooking time.
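The textual state description above can be modelled with a small data structure. The following is only an illustrative sketch: the `Player` class and its `describe` method are hypothetical names introduced here, not part of the Overcooked-AI API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Player:
    """Hypothetical container for one player's part of the state."""
    position: Tuple[int, int]
    facing: Tuple[int, int]
    holding: Optional[str] = None  # name of the held object, if any

    def describe(self, index: int) -> str:
        # Render the player in the same style as the list above.
        held = f"Holding {self.holding}" if self.holding else "Holding nothing"
        return f"Player {index}: Position {self.position} - Facing {self.facing} - {held}"


players = [
    Player(position=(5, 1), facing=(1, 0), holding="Soup"),
    Player(position=(1, 3), facing=(-1, 0), holding="Onion"),
]
lines = [p.describe(i) for i, p in enumerate(players, start=1)]
for line in lines:
    print(line)
```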
4. Building a Policy Graph for the Trained Agent
- -
- (Off), when .
- -
- (Finished), when .
- -
- (Cooking), when .
- -
- (Waiting), when .
- Greedy Policy Graph: The output of the algorithm is a directed graph. For each state, the agent takes the most probable action. Therefore, not all of the agent's interactions are present in the graph, only the most probable action from each node. The determinism of this agent could be interesting from an explainability perspective, as it is intuitively more interpretable to analyse a single action than a probability distribution.
- Stochastic Policy Graph: The output of the algorithm is a multi-directed graph. For each state, the agent records the action probability for all actions. As such, each state has multiple possible actions, each with its associated probability, as well as a different probability distribution over future states for each action. This representation is more faithful to the original agent, since the original agent may have been stochastic, or its behaviour may not be fully captured by the ’most probable action’ whenever the discretiser does not capture all the information of the original state.
- : The surrogate agent picks an action using weights from the probability distribution of s in the PG.
- diff: The agent picks an action using weights from the probability distribution of in the PG.
- diff: The agent picks a random action with a uniform distribution.
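The difference between the stochastic and the greedy policy graph can be sketched as follows. This is a minimal illustration, assuming that the agent's behaviour has already been recorded as `(state, action, next_state)` triples over discretised states; the function names and the string state labels are hypothetical and do not come from the original implementation.

```python
from collections import Counter, defaultdict


def build_policy_graph(trajectories):
    """Estimate a stochastic policy graph from observed episodes.

    trajectories: list of episodes, each a list of
    (state, action, next_state) triples over discretised states.
    """
    action_counts = defaultdict(Counter)      # state -> action frequencies
    transition_counts = defaultdict(Counter)  # (state, action) -> next-state frequencies
    for episode in trajectories:
        for state, action, next_state in episode:
            action_counts[state][action] += 1
            transition_counts[(state, action)][next_state] += 1

    def normalise(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    policy = {s: normalise(c) for s, c in action_counts.items()}
    transitions = {sa: normalise(c) for sa, c in transition_counts.items()}
    return policy, transitions


def greedy_view(policy):
    """Keep only the most probable action of each state."""
    return {s: max(probs, key=probs.get) for s, probs in policy.items()}


episodes = [
    [("pot_off", "interact", "pot_waiting")],
    [("pot_off", "stay", "pot_off"), ("pot_off", "interact", "pot_waiting")],
]
policy, transitions = build_policy_graph(episodes)
greedy = greedy_view(policy)
```

In this toy example the stochastic graph keeps both `interact` and `stay` for `pot_off`, while the greedy view retains only `interact`.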
5. Explainability Algorithm
- What will you do when you are in state region X (footnote 5)?
- When do you perform action a?
- Why did you not perform action a in state s?
5.1. What Will You Do When You Are in State Region X?
- : The policy graph generation algorithm has seen one or more states during training, and it is likely that we can extract the probability of choosing among each of the accessible actions from them (footnote 7).
- , but one or more similar states are found ( dist): the policy graph has never seen any state in X so we rely on a measure of similarity to another state to extrapolate (as in case 1).
- and no similar state is found: Returns a uniform probability distribution over all the actions.
| Algorithm 1 What will you do when you are in state region X? |
|
Input: Policy Graph , Action Set A, Set of states X, Distance threshold
Output: Explanation of policy behavior in X per action
if then
distdist ▹ The set of states in V closest to X
end if
if dist then
else
for all do
end for
end if
return
|
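The three cases described for this question can be sketched in Python. This is a hedged illustration rather than the authors' implementation: states are assumed to be tuples of predicate values, the distance is assumed to be the number of differing predicates (as suggested in footnote 6), and averaging the distributions of the matched states is an assumption made here for simplicity.

```python
def dist(s1, s2):
    """Assumed distance: number of predicates whose value differs."""
    return sum(1 for a, b in zip(s1, s2) if a != b)


def what_will_you_do(policy, actions, region, threshold):
    """Return an action distribution for the set of states `region`.

    policy: dict mapping each known state to {action: probability}.
    """
    # Case 1: some states of the region are present in the policy graph.
    known = [s for s in region if s in policy]
    if not known and region:
        # Case 2: fall back to the closest known states within the threshold.
        scored = [(min(dist(v, x) for x in region), v) for v in policy]
        within = [(d, v) for d, v in scored if d <= threshold]
        if within:
            closest = min(d for d, _ in within)
            known = [v for d, v in within if d == closest]
    if not known:
        # Case 3: nothing similar is known, so answer with a uniform distribution.
        return {a: 1.0 / len(actions) for a in actions}
    # Average the per-state distributions (an assumption of this sketch).
    return {a: sum(policy[s].get(a, 0.0) for s in known) / len(known) for a in actions}


policy = {
    ("Off", "Onion"): {"interact": 0.8, "stay": 0.2},
    ("Cooking", "Nothing"): {"stay": 1.0},
}
actions = ["interact", "stay"]
seen = what_will_you_do(policy, actions, [("Off", "Onion")], threshold=1)
similar = what_will_you_do(policy, actions, [("Off", "Tomato")], threshold=1)
unknown = what_will_you_do(policy, actions, [("Off", "Tomato")], threshold=0)
```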
5.2. When Do You Perform Action a?
| Algorithm 2 When do you perform action a? |
|
Input: Policy Graph , Target action a
Output: Set of target states where target a is the dominant action, Set of non-target states
for all do
if then
else
end if
end for
return
|
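A minimal sketch of this partition is given below, assuming the policy graph is available as a dictionary from states to action probabilities. The function name and the example states are hypothetical.

```python
def when_do_you_perform(policy, target_action):
    """Split known states by whether `target_action` is their dominant action.

    policy: dict mapping each state to {action: probability}.
    Returns (target_states, non_target_states).
    """
    target_states, non_target_states = [], []
    for state, probs in policy.items():
        dominant = max(probs, key=probs.get)  # most probable action in this state
        if dominant == target_action:
            target_states.append(state)
        else:
            non_target_states.append(state)
    return target_states, non_target_states


policy = {
    ("Off", "Onion"): {"interact": 0.9, "stay": 0.1},
    ("Cooking", "Nothing"): {"stay": 0.7, "interact": 0.3},
    ("Finished", "Dish"): {"interact": 1.0},
}
target, non_target = when_do_you_perform(policy, "interact")
```

The two returned sets can then be compared predicate by predicate to summarise the conditions under which the action is chosen.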
5.3. Why Did You Not Perform Action a in State s?
| Algorithm 3 Why did you not perform action a in state s? |
|
Input: Policy Graph , Target Action a, Previous State , Distance threshold
Output: Explanation of the difference between the current state and the state region where a is performed, and an explanation of where a is performed locally.
for all dist do
if then
else
end if
end for
return
|
- : The states within a distance threshold to s are gathered, and filtered to those where action a is the most likely (). The output is the list of differences . If , no explanation is given and it is suggested to increase the threshold.
- but dist: the state substitutes s in the algorithm above.
- and no similar state is found: no explanation is given due to lack of information.
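The three cases above can be sketched as follows. This is an illustrative sketch under stated assumptions: states are tuples of predicate values, `diff` reports the predicates that change between two states, the distance is the number of such differences, and the return conventions (`None` for lack of information, an empty list when the threshold should be increased) are choices made here, not taken from the original implementation.

```python
def diff(s1, s2):
    """Predicates that differ, as {index: (value_in_s1, value_in_s2)}."""
    return {i: (a, b) for i, (a, b) in enumerate(zip(s1, s2)) if a != b}


def why_not(policy, target_action, state, threshold):
    """Explain why `target_action` was not performed in `state`.

    Returns a list of (nearby_state, differences) pairs where the target
    action is dominant, or None when no information is available.
    """
    if state not in policy:
        # Case 2: substitute the closest known state, if it is close enough.
        nearest = min(policy, key=lambda v: len(diff(v, state)), default=None)
        if nearest is None or len(diff(nearest, state)) > threshold:
            return None  # Case 3: no similar state is known
        state = nearest
    explanations = []
    for other, probs in policy.items():
        if other == state or len(diff(state, other)) > threshold:
            continue
        if max(probs, key=probs.get) == target_action:
            explanations.append((other, diff(state, other)))
    return explanations  # an empty list suggests increasing the threshold


policy = {
    ("Off", "Onion"): {"interact": 0.9, "stay": 0.1},
    ("Off", "Nothing"): {"stay": 0.7, "interact": 0.3},
}
explained = why_not(policy, "interact", ("Off", "Nothing"), threshold=1)
too_strict = why_not(policy, "interact", ("Off", "Nothing"), threshold=0)
no_info = why_not(policy, "interact", ("Cooking", "Dish"), threshold=1)
```

Here `explained` reports that the agent would have interacted had it been holding an onion instead of nothing.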
5.4. Can We Rely on These Explanations?
6. Validating the Policy Graph
- : Picks an action using weights from the probability distribution in the PG.
- , but a similar state is found: Same as case 1 but using the similar state.
- and a similar state is not found: Pick a random action.
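The action selection of the surrogate agent can be sketched as below. It is a minimal illustration, assuming tuple-valued states, a predicate-count distance and a similarity threshold; `pick_action` and its parameters are hypothetical names.

```python
import random


def dist(s1, s2):
    """Assumed distance: number of predicates whose value differs."""
    return sum(1 for a, b in zip(s1, s2) if a != b)


def pick_action(policy, actions, state, threshold, rng=random):
    """Choose an action for `state` following the three cases above."""
    if state in policy:
        probs = policy[state]  # Case 1: the state is in the policy graph
    else:
        nearest = min(policy, key=lambda v: dist(v, state), default=None)
        if nearest is not None and dist(nearest, state) <= threshold:
            probs = policy[nearest]  # Case 2: reuse a similar state
        else:
            return rng.choice(actions)  # Case 3: uniformly random action
    names = list(probs)
    weights = [probs[a] for a in names]
    return rng.choices(names, weights=weights, k=1)[0]


policy = {("Off", "Onion"): {"interact": 1.0}}
rng = random.Random(0)
known = pick_action(policy, ["interact", "stay"], ("Off", "Onion"), 1, rng)
similar = pick_action(policy, ["interact", "stay"], ("Off", "Tomato"), 1, rng)
fallback = pick_action(policy, ["stay"], ("Cooking", "Dish"), 1, rng)
```

Passing a seeded `random.Random` instance keeps validation runs reproducible.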
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| XAI | Explainable Artificial Intelligence |
| GDPR | General Data Protection Regulation |
| RL | Reinforcement Learning |
| XRL | Explainable Reinforcement Learning |
| PG | Policy Graph |
| MARL | Multi-Agent Reinforcement Learning |
| SFT | Soft Decision Tree |
| LIME | Local Interpretable Model-agnostic Explanations |
| SHAP | SHapley Additive exPlanations |
| PPO | Proximal Policy Optimization |
| TL | Transferred Learning |
| STD | Standard Deviation |
| NS | New States |
References
- Li, B.O.; Qi, P.; Liu, B.O.; Di, S.; Liu, J.; Pei, J.; Yi, J.; Zhou, B. Trustworthy AI: From Principles to Practices 2021. [2110.01167]. [CrossRef]
- Omeiza, D.; Webb, H.; Jirotka, M.; Kunze, L. Explanations in Autonomous Driving: A Survey. IEEE Transactions on Intelligent Transportation Systems 2021, pp. 1–21. arXiv:2103.05154 [cs]. [CrossRef]
- Rosenfeld, A.; Richardson, A. Explainability in human–agent systems. Autonomous Agents and Multi-Agent Systems 2019, 33, 673–705. [CrossRef]
- Longo, L.; Goebel, R.; Lecue, F.; Kieseberg, P.; Holzinger, A. Explainable artificial intelligence: Concepts, applications, research challenges and visions. In Proceedings of the International Cross-Domain Conference for Machine Learning and Knowledge Extraction. Springer, 2020, pp. 1–16. [CrossRef]
- Goodman, B.; Flaxman, S. European Union regulations on algorithmic decision-making and a “right to explanation”. AI magazine 2017, 38, 50–57. [CrossRef]
- Madiega, T. Artificial intelligence act. European Parliament: European Parliamentary Research Service 2021.
- Dafoe, A.; Hughes, E.; Bachrach, Y.; Collins, T.; McKee, K.R.; Leibo, J.Z.; Larson, K.; Graepel, T. Open Problems in Cooperative AI. arXiv preprint arXiv:2012.08630 2020. [CrossRef]
- Hayes, B.; Shah, J.A. Improving robot controller transparency through autonomous policy explanation. In Proceedings of the 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2017, pp. 303–312. [CrossRef]
- Climent, A.; Gnatyshak, D.; Alvarez-Napagao, S. Applying and Verifying an Explainability Method Based on Policy Graphs in the Context of Reinforcement Learning. In Artificial Intelligence Research and Development; IOS Press, 2021; pp. 455–464. [CrossRef]
- Vila, M.; Gnatyshak, D.; Tormos, A.; Alvarez-Napagao, S. Testing Reinforcement Learning Explainability Methods in a Multi-agent Cooperative Environment. Artificial Intelligence Research and Development 2022, 356, 355–364.
- Krajna, A.; Brcic, M.; Lipic, T.; Doncevic, J. Explainability in reinforcement learning: perspective and position. arXiv preprint arXiv:2203.11547 2022. [CrossRef]
- Coppens, Y.; Efthymiadis, K.; Lenaerts, T.; Nowé, A.; Miller, T.; Weber, R.; Magazzeni, D. Distilling deep reinforcement learning policies in soft decision trees. In Proceedings of the IJCAI 2019 workshop on explainable artificial intelligence, 2019, pp. 1–6.
- Juozapaitis, Z.; Koul, A.; Fern, A.; Erwig, M.; Doshi-Velez, F. Explainable reinforcement learning via reward decomposition. In Proceedings of the IJCAI/ECAI Workshop on explainable artificial intelligence, 2019.
- Ribeiro, M.T.; Singh, S.; Guestrin, C. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pp. 1135–1144. [CrossRef]
- Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30; Guyon, I.; Luxburg, U.V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; Garnett, R., Eds.; Curran Associates, Inc., 2017; pp. 4765–4774.
- Greydanus, S.; Koul, A.; Dodge, J.; Fern, A. Visualizing and understanding atari agents. In Proceedings of the International conference on machine learning. PMLR, 2018, pp. 1792–1801.
- Sloman, S. Causal models: How people think about the world and its alternatives; Oxford University Press, 2005.
- Halpern, J.Y.; Pearl, J. Causes and Explanations: A Structural-Model Approach — Part 1: Causes 2013. [1301.2275]. [CrossRef]
- Madumal, P.; Miller, T.; Sonenberg, L.; Vetere, F. Explainable reinforcement learning through a causal lens. In Proceedings of the AAAI conference on artificial intelligence, 2020, Vol. 34, pp. 2493–2500. Issue: 03. [CrossRef]
- Kulkarni, T.D.; Narasimhan, K.; Saeedi, A.; Tenenbaum, J. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in neural information processing systems 2016, 29.
- Shu, T.; Xiong, C.; Socher, R. Hierarchical and interpretable skill acquisition in multi-task reinforcement learning. arXiv preprint arXiv:1712.07294 2017. [CrossRef]
- Zambaldi, V.; Raposo, D.; Santoro, A.; Bapst, V.; Li, Y.; Babuschkin, I.; Tuyls, K.; Reichert, D.; Lillicrap, T.; Lockhart, E.; et al. Relational Deep Reinforcement Learning, 2018. arXiv:1806.01830. [CrossRef]
- Sarkar, B.; Talati, A.; Shih, A.; Sadigh, D. PantheonRL: A MARL Library for Dynamic Training Interactions, 2021. arXiv:2112.07013 [cs]. [CrossRef]
- Carroll, M.; Shah, R.; Ho, M.K.; Griffiths, T.; Seshia, S.; Abbeel, P.; Dragan, A. On the utility of learning about humans for human-ai coordination. Advances in neural information processing systems 2019, 32.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms, 2017. arXiv:1707.06347 [cs]. [CrossRef]
- Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research 2021, 22, 1–8.
- Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.; Schrittwieser, J.; Quan, J.; Gaffney, S.; Petersen, S.; Simonyan, K.; Schaul, T.; Hasselt, H.; Silver, D.; Lillicrap, T.; Calderone, K.; Keet, P.; Brunasso, A.; Lawrence, D.; Ekermo, A.; Repp, J.; Tsing, R. StarCraft II: A New Challenge for Reinforcement Learning, 2017. arXiv:1708.04782 [cs]. [CrossRef]
- Suarez, J.; Du, Y.; Isola, P.; Mordatch, I. Neural MMO: A Massively Multiagent Game Environment for Training and Evaluating Intelligent Agents, 2019. arXiv:1903.00784 [cs, stat]. [CrossRef]
- Lowe, R.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems 2017, 30.
- Munikoti, S.; Agarwal, D.; Das, L.; Halappanavar, M.; Natarajan, B. Challenges and Opportunities in Deep Reinforcement Learning With Graph Neural Networks: A Comprehensive Review of Algorithms and Applications. IEEE Transactions on Neural Networks and Learning Systems 2023, pp. 1–21. [CrossRef]
| 1 | Vila, M., Gnatyshak, D., Tormos, A. & Alvarez-Napagao, S.: Testing Reinforcement Learning Explainability Methods in a Multi-agent Cooperative Environment. Published in: Artificial Intelligence Research and Development 355 A. Cortés et al. (Eds.) © 2022 The authors and IOS Press. This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0). doi:10.3233/FAIA220358 |
| 2 | |
| 3 | |
| 4 |
E.g. if we have the states and , then diff. |
| 5 | E.g. all states where pot_state(Finished). |
| 6 | This distance function can depend heavily on the environment and on the predicate set chosen. An in-depth analysis of possible functions is out of the scope of this paper, but it is part of future work. For the sake of proof-of-concept, the distance function we have chosen for the work presented in this paper consists in: let and , we define distdiff, where diff is the function defined in Section 4. For example, this measure for the states in Figure 6 would be dist as only four predicates change value between them. |
| 7 | This is only likely since a state s may have been visited very few times, and the estimation of probability may be little informed, in which case we would consider the other options in the list. |
| 8 | |
| 9 | |
| 10 |










| Method | Horizon | Scope | Timing | Environment | Policy | Agents | Description |
| Coppens et al. [12] | Reac. | Global | P-H | Stoch. | Stoch. | Single | Binary decision trees, value heatmap images |
| Juozapaitis et al. [13] | Reac. | Local | P-H | Stoch. | Deter. | Single | Decomposed reward diagrams and images |
| Greydanus et al. [16] | Reac. | Global | P-H | Deter. | Deter. | Multi- | Attention saliency maps |
| Madumal et al. [19] | Proac. | Local | P-H | Stoch. | Stoch. | Multi- | Counterfactual text explanations |
| Kulkarni et al. [20] | Proac. | Local | Intr. | Stoch. | Stoch. | Multi- | Attention saliency maps |
| Zambaldi et al. [22] | Proac. | Local | Intr. | Stoch. | Stoch. | Multi- | Counterfactual text explanations |
| Policy graphs | Proac. | Global | P-H | Stoch. | Stoch. | Sing./Mult. | Behaviour graphs, text explanations, transparent agent version |
| Layout | Mean reward | Std. |
| simple | 387.87 | 25.33 |
| unident_s | 757.71 | 53.03 |
| random0 | 395.01 | 54.43 |
| random1 | 266.01 | 48.11 |
| random3 | 62.5 | 5.00 |


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).