Submitted: 06 April 2024
Posted: 17 April 2024
Abstract
Keywords:
1. Introduction
1.1. Background
1.2. Objective
- What research methodologies are applied in studies on learning to communicate with MARL?
- What problems or challenges arise when applying these research methodology designs?
2. Methods
2.1. Searching for Relevant Papers
2.2. Inclusion Criteria
- Publications written in English
- The title and abstract indicate that the research involves both MARL and learning to communicate
- Has an extensive methods section (at least half a page)
- Explicitly specifies that a quantitative or qualitative research approach has been used
- Uses empirical data (primary or secondary) for the data analysis
- For qualitative papers, the title or abstract hints that the study has a qualitative nature. Given the scarcity of such papers in the analyzed field, a paper was also reviewed if it mentioned interviews, participants, questionnaires, etc.
2.3. Exclusion Criteria
- The study is itself a literature review
- The absence of a methods section
- The methodological approach does not correspond with the specified data type (qualitative or quantitative)
- The study only references the field of MARL without contributing to it
- Theoretical or conceptual papers without empirical research
- Short papers or workshop papers
2.4. Analysis Strategies
- Approach: In MARL, most works follow one of the field's well-known algorithms. We checked whether each paper used one of these methods or developed a new model.
- Performance: In MARL, performance is assessed with various measures, such as the amount of reward collected or the amount of error. Performance reporting is essential when comparing several algorithms, so we evaluated whether the proposed method outperforms the alternatives consistently across all experiments or only in some of them.
- Number of environments: The number of environments affects validity, as it can show that the paper's proposal wins consistently across multiple settings. If a paper uses only one dataset, its validity is limited, since we cannot anticipate how the method behaves in situations it has not encountered.
- Scalability: The number of agents plays a decisive role when reviewing results in MARL. As the number of agents increases, the environment comes closer to real-world conditions and the proposed algorithm gains validity. We investigated whether scalability was examined, i.e., whether experiments were run with different numbers of agents. For this criterion, we used ordinal scores to indicate how well each paper meets it: 0 means there is no sign of scalability, 1 means there are some indications of scalability, and 2 implies scalability was demonstrated by testing the algorithm on several environments and/or with different numbers of agents.
- Data availability: Public datasets/environments that anyone can use make the research more reliable, since everyone can reproduce the data used in the study. However, when a paper modifies part of a dataset/environment, reliability decreases unless the changes are clearly explained.
- Code availability: Available code supports reliability, as one can verify that the code produces the same results and that the implementation matches the stated hypothesis.
- Comparison to others (baseline models): Comparing against the most recent baseline models in the field supports reliability. Conversely, if a paper omits a state-of-the-art method from its comparison, it loses reliability, since we cannot be sure the proposal is better than other recent work.
- Experimental setup: Data science has a standard research procedure, and following it improves the reliability of a study. We checked whether each paper contains detailed information on the data types, how the data were prepared, and how the model was built and evaluated. Ordinal scores indicate how thoroughly each paper details its test settings, data collection, implementation, and model evaluation: 0 means very little explanation is given, 1 means details are given, and 2 implies no ambiguity is left for reproducing the results. (A minimal encoding of this scoring rubric is sketched below.)
3. Results
3.1. Quantitative Papers
3.2. Qualitative Papers
4. Discussion
4.1. Insights from Analyzing the Quantitative Studies
4.2. Insights from Analyzing the Qualitative Studies
4.3. Reflections
4.4. Limitations
References
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
- Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489. [CrossRef]
- Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., ... & Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354-359. [CrossRef]
- OpenAI. (2018). OpenAI Five. https://blog.openai.com/openai-five/
- Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W. M., ... & Silver, D. (2019). AlphaStar: Mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/
- Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11), 1238-1274.
- Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., ... & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint. arXiv:1509.02971.
- Brown, N., & Sandholm, T. (2017, January). Libratus: The Superhuman AI for No-Limit Poker. In IJCAI (pp. 5226-5228).
- Brown, N., & Sandholm, T. (2019). Superhuman AI for multiplayer poker. Science, 365(6456), 885-890. [CrossRef]
- Shalev-Shwartz, S., Shammah, S., & Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint. arXiv:1610.03295.
- Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533. [CrossRef]
- Busoniu, L., Babuska, R., & De Schutter, B. (2008). A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2), 156-172. [CrossRef]
- Adler, J. L., & Blue, V. J. (2002). A cooperative multi-agent transportation management and route guidance system. Transportation Research Part C: Emerging Technologies, 10(5-6), 433-454. [CrossRef]
- Wang, S., Wan, J., Zhang, D., Li, D., & Zhang, C. (2016). Towards smart factory for industry 4.0: a self-organized multi-agent system with big data based feedback and coordination. Computer Networks, 101, 158-168. [CrossRef]
- Lee, J. W., & Zhang, B. T. (2002). Stock trading system using reinforcement learning with cooperative agents. In Proceedings of the Nineteenth International Conference on Machine Learning (pp. 451-458).
- Lee, J. W., Park, J., Jangmin, O., Lee, J., & Hong, E. (2007). A multiagent approach to Q-learning for daily stock trading. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 37(6), 864-877. [CrossRef]
- Cortes, J., Martinez, S., Karatas, T., & Bullo, F. (2004). Coverage control for mobile sensing networks. IEEE Transactions on Robotics and Automation, 20(2), 243-255. [CrossRef]
- Choi, J., Oh, S., & Horowitz, R. (2009). Distributed learning and cooperative control for multi-agent systems. Automatica, 45(12), 2802-2814. [CrossRef]
- Castelfranchi, C. (2001). The theory of social functions: challenges for computational social science and multi-agent learning. Cognitive Systems Research, 2(1), 5-38. [CrossRef]
- Leibo, J. Z., Zambaldi, V., Lanctot, M., Marecki, J., & Graepel, T. (2017). Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint. arXiv:1702.03037.
- Stone, P., & Veloso, M. (1998). Towards collaborative and adversarial learning: A case study in robotic soccer. International Journal of Human-Computer Studies, 48(1), 83-104. [CrossRef]
- Kirby, S. (2002). Natural language from artificial life. Artificial Life, 8(2), 185-215. [CrossRef]
- Wagner, K., Reggia, J. A., Uriagereka, J., & Wilkinson, G. S. (2003). Progress in the simulation of emergent communication and language. Adaptive Behavior, 11(1), 37-69. [CrossRef]
- Okoli, C., & Schabram, K. (2010). A guide to conducting a systematic literature review of information systems research. Sprouts. Concordia University, Canada.
- Freire, J., Fuhr, N., & Rauber, A. (2016). Reproducibility of data-oriented experiments in e-science (Dagstuhl Seminar 16041). In Dagstuhl Reports (Vol. 6, No. 1). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. [CrossRef]
- Guba, E. G. (1981). Criteria for assessing the trustworthiness of naturalistic inquiries. ECTJ, 29(2), 75-91. [CrossRef]
- Pandey, S. C., & Patnaik, S. (2014). Establishing reliability and validity in qualitative inquiry: A critical examination. Jharkhand Journal of Development and Management Studies, 12(1), 5743-5753.
- Foerster, J. N., Assael, Y. M., De Freitas, N., & Whiteson, S. (2016). Learning to communicate with deep multi-agent reinforcement learning. arXiv preprint. arXiv:1605.06676.
- Jorge, E., Kågebäck, M., Johansson, F. D., & Gustavsson, E. (2016). Learning to play guess who? and inventing a grounded language as a consequence. arXiv preprint. arXiv:1611.03218.
- Sukhbaatar, S., & Fergus, R. (2016). Learning multiagent communication with backpropagation. Advances in neural information processing systems, 29, 2244-2252.
- Havrylov, S., & Titov, I. (2017). Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. arXiv preprint. arXiv:1705.11192.
- Das, A., Kottur, S., Moura, J. M., Lee, S., & Batra, D. (2017). Learning cooperative visual dialog agents with deep reinforcement learning. In Proceedings of the IEEE international conference on computer vision (pp. 2951-2960).
- Mordatch, I., & Abbeel, P. (2018, April). Emergence of grounded compositional language in multi-agent populations. In Thirty-second AAAI conference on artificial intelligence. [CrossRef]
- Jiang, J., Dun, C., Huang, T., & Lu, Z. (2018). Graph convolutional reinforcement learning. arXiv preprint. arXiv:1810.09202.
- Celikyilmaz, A., Bosselut, A., He, X., & Choi, Y. (2018). Deep communicating agents for abstractive summarization. arXiv preprint. arXiv:1803.10357.
- Das, A., Gervet, T., Romoff, J., Batra, D., Parikh, D., Rabbat, M., & Pineau, J. (2019, May). Tarmac: Targeted multi-agent communication. In International Conference on Machine Learning (pp. 1538-1546). PMLR.
- Cogswell, M., Lu, J., Lee, S., Parikh, D., & Batra, D. (2019). Emergence of compositional language with deep generational transmission. arXiv preprint. arXiv:1904.09067.
- Kottur, S., Moura, J. M., Lee, S., & Batra, D. (2017). Natural language does not emerge 'naturally' in multi-agent dialog. arXiv preprint. arXiv:1706.08502.
- Tucker, M., Li, H., Agrawal, S., Hughes, D., Sycara, K., Lewis, M., & Shah, J. A. (2021). Emergent Discrete Communication in Semantic Spaces. Advances in Neural Information Processing Systems, 34.
- Strouse, D. J., McKee, K. R., Botvinick, M., Hughes, E., & Everett, R. (2021). Collaborating with Humans without Human Data. arXiv preprint. arXiv:2110.08176.
- Miura, S., Cohen, A. L., & Zilberstein, S. (2021, August). Maximizing legibility in stochastic environments. In 2021 30th IEEE International Conference on Robot & Human Interactive Communication (RO-MAN) (pp. 1053-1059). IEEE.
- Woodward, M. P., & Wood, R. J. (2012). Framing Human-Robot Task Communication as a POMDP. arXiv preprint. arXiv:1204.0280.
- Wang, N., Pynadath, D. V., Rovira, E., Barnes, M. J., & Hill, S. G. (2018, April). Is it my looks? or something i said? the impact of explanations, embodiment, and expectations on trust and performance in human-robot teams. In International Conference on Persuasive Technology (pp. 56-69). Springer, Cham.
- Buehler, M. C., Adamy, J., & Weisswange, T. H. (2021). Theory of Mind Based Assistive Communication in Complex Human Robot Cooperation. arXiv preprint. arXiv:2109.01355.
- Braun, V., Clarke, V., Boulton, E., Davey, L., & McEvoy, C. (2020). The online survey as a qualitative research tool. International Journal of Social Research Methodology, 1-14.
- SAGE. (2022). Methods map: Content analysis. http://methods.sagepub.com/methodsmap/content-analysis, last accessed 8 January 2022.
- Beikmohammadi, A., Khirirat, S., & Magnússon, S. (2024). On the Convergence of Federated Learning Algorithms without Data Similarity. arXiv preprint. arXiv:2403.02347.
- Beikmohammadi, A., Khirirat, S., & Magnússon, S. (2024). Distributed Momentum Methods Under Biased Gradient Estimations. arXiv preprint. arXiv:2403.00853.
- Beikmohammadi, A., Khirirat, S., & Magnússon, S. (2024). Compressed Federated Reinforcement Learning with a Generative Model. [CrossRef]
- Beikmohammadi, A., & Magnússon, S. (2024). Accelerating actor-critic-based algorithms via pseudo-labels derived from prior knowledge. Information Sciences, 661, 120182. [CrossRef]
- Beikmohammadi, A., & Magnússon, S. (2023, May). Comparing NARS and Reinforcement Learning: An Analysis of ONA and Q-Learning Algorithms. In International Conference on Artificial General Intelligence (pp. 21-31). Cham: Springer Nature Switzerland.
- Beikmohammadi, A., & Magnússon, S. (2023). Human-inspired framework to accelerate reinforcement learning. arXiv preprint. arXiv:2303.08115.
- Beikmohammadi, A., & Magnússon, S. (2023, May). TA-Explore: Teacher-assisted exploration for facilitating fast reinforcement learning. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems (pp. 2412-2414).
| Stage | Query | Search Engine/Database | Search Results (# Papers) |
|---|---|---|---|
| Initial stage | mainQRY | Google Scholar | 201 |
| Initial stage | mainQRY | Scopus | 13 |
| Final stage | mainQRY AND ("participants" OR "interview" OR "questionnaire" OR "quantitative" OR "qualitative") | Google Scholar | 69 |
| Final stage | mainQRY AND ("participant*" OR "interview" OR "questionnaire" OR "quanti*" OR "quali*") | Scopus | 2 |
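The final-stage search strings are simply the base query (left as the placeholder mainQRY, exactly as in the table) ANDed with an OR-group of methodological terms; Scopus additionally supports wildcard terms. A minimal Python sketch of assembling these strings, assuming nothing beyond what the table shows:

```python
# Methodological filter terms taken directly from the table above.
METHOD_TERMS_SCHOLAR = ["participants", "interview", "questionnaire",
                        "quantitative", "qualitative"]
METHOD_TERMS_SCOPUS = ["participant*", "interview", "questionnaire",
                       "quanti*", "quali*"]  # Scopus supports wildcards

def final_stage_query(main_qry: str, terms: list[str]) -> str:
    """AND the base query with an OR-group of quoted methodological terms."""
    or_group = " OR ".join(f'"{t}"' for t in terms)
    return f"{main_qry} AND ({or_group})"

# "mainQRY" is the authors' placeholder for their base query.
print(final_stage_query("mainQRY", METHOD_TERMS_SCOPUS))
# mainQRY AND ("participant*" OR "interview" OR "questionnaire" OR "quanti*" OR "quali*")
```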
| Paper | Goal | Approach |
|---|---|---|
| Foerster et al. [28] | Learning a binary (in execution mode) communication protocol | DRQN |
| Jorge et al. [29] | Learning to play Guess Who? with two agents (asker, answerer) | Based on [28] |
| Sukhbaatar and Fergus [30] | Learning continuous communication between a dynamically changing set of agents for fully cooperative tasks | New model |
| Havrylov and Titov [31] | Learning to communicate with sequences of discrete symbols (referential game) | LSTM |
| Das et al. [32] | Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning | VGG-16, LSTM |
| Mordatch and Abbeel [33] | Formulating the joint discovery of action and communication protocols for agents as a reinforcement learning problem | LSTM |
| Jiang et al. [34] | Modeling multi-agent environment as a graph | New model |
| Celikyilmaz et al. [35] | Addressing the challenges of representing a long document for abstractive summarization by deep communicating agents in an encoder-decoder architecture | B-LSTM, Attention |
| Das et al. [36] | Proposing an architecture for multi-agent reinforcement learning that allows targeted continuous communication between agents via a sender-receiver soft attention mechanism and multiple rounds of collaborative reasoning. | Actor-critic, GRU |
| Cogswell et al. [37] | Developing an implicit model of cultural transmission and compositionality in deep neural dialog agents, where language is transmitted from generation to generation because it helps agents achieve their goals | Based on [38] |
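Several of the approaches above pass messages between agents inside the model itself. As one concrete illustration, the sketch below implements the core of the continuous-communication idea in Sukhbaatar and Fergus [30]: each agent's next hidden state combines its own state with the mean of the other agents' states. The dimensions, weight matrices, and tanh nonlinearity are illustrative choices, not the exact published architecture.

```python
import numpy as np

def commnet_step(h, W_h, W_c):
    """One communication round in the spirit of CommNet [30].

    h:   (n_agents, d) current hidden states
    W_h: (d, d) weight applied to an agent's own state
    W_c: (d, d) weight applied to the communication vector
    """
    n = h.shape[0]
    # c_i = mean of the OTHER agents' hidden states (sum minus own, over n-1)
    c = (h.sum(axis=0, keepdims=True) - h) / (n - 1)
    return np.tanh(h @ W_h + c @ W_c)

rng = np.random.default_rng(0)
h = rng.normal(size=(3, 8))            # 3 agents, 8-dim hidden states
W_h, W_c = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
h_next = commnet_step(h, W_h, W_c)     # states after one communication round
```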
| Performance Metric | Paper(s) |
|---|---|
| Normalized rewards | Foerster et al. [28], Das et al. [32], Mordatch and Abbeel [33], Jiang et al. [34] |
| Failure rates / Win rates | Sukhbaatar and Fergus [30], Havrylov and Titov [31], Das et al. [36] |
| Mean error | Sukhbaatar and Fergus [30] |
| Loss | Havrylov and Titov [31] |
| Accuracy, precision, and recall | Cogswell et al. [37] |
| Custom measure | Jorge et al. [29], Jiang et al. [34], Celikyilmaz et al. [35] |
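For reference, the two most common metrics in the table can be computed as sketched below. The exact normalization of rewards varies between the surveyed papers, so the bounds-based version here is only one common choice, not the procedure of any specific paper.

```python
import numpy as np

def normalized_reward(returns, r_min, r_max):
    """Scale episode returns into [0, 1] given environment-specific reward bounds."""
    return (np.asarray(returns) - r_min) / (r_max - r_min)

def win_rate(outcomes):
    """Fraction of episodes won, given a boolean success flag per episode."""
    return float(np.mean(np.asarray(outcomes, dtype=float)))

print(normalized_reward([2.0, 5.0, 8.0], r_min=0.0, r_max=10.0))  # [0.2 0.5 0.8]
print(win_rate([True, False, True, True]))                        # 0.75
```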
| Paper | Number of Environments | Scalability (Ordinal, 0–2) |
|---|---|---|
| Foerster et al. [28] | 2 | 1 |
| Jorge et al. [29] | 1 | 0 |
| Sukhbaatar and Fergus [30] | 5 | 2 |
| Havrylov and Titov [31] | 1 | 0 |
| Das et al. [32] | 1 | 0 |
| Mordatch and Abbeel [33] | 1 | 0 |
| Jiang et al. [34] | 1 | 2 |
| Celikyilmaz et al. [35] | 1 | 2 |
| Das et al. [36] | 4 | 2 |
| Cogswell et al. [37] | 2 | 0 |
| Data Availability | Paper(s) |
|---|---|
| Created a new dataset | Cogswell et al. [37] |
| Used a public dataset | Foerster et al. [28], Havrylov and Titov [31], Celikyilmaz et al. [35], Das et al. [36] |
| Implemented a new environment | Foerster et al. [28], Jorge et al. [29], Sukhbaatar and Fergus [30], Das et al. [32], Mordatch and Abbeel [33] |
| Used an existing environment | Sukhbaatar and Fergus [30], Jiang et al. [34], Das et al. [36], Cogswell et al. [37] |
| Code Availability | Paper(s) |
|---|---|
| Not available | Havrylov and Titov [31], Das et al. [32], Mordatch and Abbeel [33], Celikyilmaz et al. [35], Das et al. [36] |
| Provided pseudo-code | Foerster et al. [28], Cogswell et al. [37] |
| Available | Foerster et al. [28], Jorge et al. [29], Sukhbaatar and Fergus [30], Jiang et al. [34], Cogswell et al. [37] |
| Paper | Number of Comparisons with Baseline Models | Number of Internal Comparisons | Experimental Setup (Ordinal, 0–2) |
|---|---|---|---|
| Foerster et al. [28] | 1 | 4 | 2 |
| Jorge et al. [29] | 0 | 5 | 1 |
| Sukhbaatar and Fergus [30] | 3 | 0 | 2 |
| Havrylov and Titov [31] | 0 | 4 | 0 |
| Das et al. [32] | 0 | 2 | 1 |
| Mordatch and Abbeel [33] | 0 | 2 | 0 |
| Jiang et al. [34] | 3 | 2 | 2 |
| Celikyilmaz et al. [35] | 7 | 7 | 1 |
| Das et al. [36] | 4 | 3 | 1 |
| Cogswell et al. [37] | 5 | 0 | 0 |
| Research Design | Paper | Goal |
|---|---|---|
| Quasi-experimental | Tucker et al. [39] | To investigate human judgments of robot, agent, or human actions using a dynamic survey |
| Quasi-experimental | Strouse et al. [40] | To test how effectively the FCP agents collaborate with humans in a zero-shot setting |
| Quasi-experimental | Miura et al. [41] | To investigate whether using legibility as an objective improves the interpretability of agents' goals by humans |
| Quasi-experimental | Woodward and Wood [42] | To evaluate whether the proposed POMDP representation produces robots that are robust to teacher error, accurately infer task details, and are perceived as intelligent |
| Quasi-experimental | Wang et al. [43] | To investigate the impact of a robot's embodiment, its explanations, and its promise to learn from mistakes on trust and team performance |
| Experimental study | Buehler et al. [44] | To evaluate the benefits of assistive communication on task performance between robot and human |
| Data Collection Method | Paper | Details | Method of Recording |
|---|---|---|---|
| Online survey | Tucker et al. [39] | 253 participants via Amazon Mechanical Turk | Online answers |
| Online survey | Miura et al. [41] | 26 participants via Amazon Mechanical Turk; the only requirement for participation was the ability to read English | Online answers |
| Online survey | Woodward and Wood [42] | 26 participants, undergraduate and graduate students aged 18 to 31 (mean age 22); four participants were randomly selected for the "human robot" role, leaving the rest for the "teacher" role | Online answers |
| Online survey + Questionnaire | Wang et al. [43] | 61 participants from a higher-education military school in the United States (14 women, 39 men, ages 18-23) | Online answers |
| Online survey + Questionnaire | Strouse et al. [40] | 114 participants from Prolific, an online participant recruitment platform (37.7% female, 59.6% male, 1.8% nonbinary; median age 25–34); the study ended with an open-ended question asking for feedback on participants' partners | Online answers |
| Observational study + Questionnaire | Buehler et al. [44] | 14 participants, randomly divided into two groups: one started with an assisted trial, the other unassisted; participants had no prior experience with the task | Recorded actions |
| Paper | Limitations (Mentioned or Identified) |
|---|---|
| Tucker et al. [39] | Inadequate analysis; lack of verification of human judgments |
| Strouse et al. [40] | Inadequate analysis |
| Miura et al. [41] | It is not always possible to significantly improve legibility over policies maximizing the underlying rewards; the initial evaluations are limited to MazeWorld instances using the BST belief update |
| Woodward and Wood [42] | Inadequate analysis; lack of a full explanation of how data were collected; lack of a full explanation of the test scenario |
| Wang et al. [43] | Lack of a full explanation of how data were collected |
| Buehler et al. [44] | Inadequate explanation of the questionnaire; lack of a full explanation of how data were collected; lack of a full explanation of the test scenario |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
