Submitted: 26 July 2024
Posted: 29 July 2024
Abstract
Keywords:
1. Introduction
3. Reinforcement Learning Algorithms
3.1. Improved DQN Normalized Advantage Function Algorithm (DQN-NAF)
3.2. Deep Deterministic Policy Gradient Algorithm (DDPG)
4. Construction of a Digital Twin Reinforcement Learning Environment
4.1. Coal Seam Environment Construction
4.2. Digital Twin Scene for Shearer
4.3. Digital Twin Scene for Hydraulic Supports
4.4. Digital Twin Scene for Scraper Conveyor
5. Digital Twin Reinforcement Learning Training
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
| DQN-NAF algorithm |
|---|
| 1: Initialize the online Neural Network Qπ and Target Neural Network Qπ'. |
| 2: For episode = 1 to M do |
| 3: Initialize action exploration noise η. |
| 4: While the state has not reached a terminal state do |
| 5: Input the state into the online Neural Network Qπ, select action a = φ(s) + η. |
| 6: Agent performs action a, receives immediate reward r, and the environment state transitions to s'. |
| 7: Store sampled experience (s, a, s', r) in the experience pool. |
| 8: If the data in the experience pool exceeds the size of the training batch |
| 9: Sample (s, a, s', r) from the experience pool. |
| 10: Input the state and action into the online Neural Network Qπ to obtain the lower-triangular matrix L(s) (whose diagonal is kept positive), the action vector φ(s), and the state value V(s), giving Qπ(s, a) = V(s) − (1/2)(a − φ(s))^T L(s)L(s)^T (a − φ(s)). |
| 11: Input state s' into the Target Neural Network Qπ' to obtain the value V(s') and y = r + γV(s'). |
| 12: Update the parameters of the online Neural Network by regression, minimizing the mean squared error between Qπ(s, a) and y. |
| 13: With σ being a coefficient less than 1, gradually update the parameters of the Target Neural Network toward the online Neural Network: θ' ← σθ + (1 − σ)θ', where θ and θ' are the parameters of Qπ and Qπ'. |
| 14: End if |
| 15: End while |
| 16: End for |
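
The quadratic advantage construction in step 10 is what lets this DQN variant handle continuous actions: the network outputs V(s), φ(s), and the entries of a lower-triangular L(s), and because the advantage term is a negative quadratic in a, φ(s) is always the greedy action. Below is a minimal PyTorch sketch of such a network together with the step-13 soft update. It illustrates the standard NAF construction rather than the authors' implementation; the class name, layer sizes, `soft_update` helper, and the tanh action squashing are assumptions.

```python
import torch
import torch.nn as nn

class NAFNetwork(nn.Module):
    """Q(s, a) = V(s) - (1/2)(a - φ(s))^T L(s)L(s)^T (a - φ(s))."""

    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.action_dim = action_dim
        self.base = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)            # state value V(s)
        self.action = nn.Linear(hidden, action_dim)  # greedy action φ(s)
        # Entries of the lower-triangular matrix L(s).
        self.l_entries = nn.Linear(hidden, action_dim * (action_dim + 1) // 2)

    def forward(self, s, a):
        h = self.base(s)
        v = self.value(h)
        phi = torch.tanh(self.action(h))             # assumes actions in [-1, 1]
        entries = self.l_entries(h)
        # Exponentiating the diagonal keeps it positive, so P(s) = L(s)L(s)^T
        # is positive definite and the advantage term is never positive.
        L = torch.diag_embed(entries[:, : self.action_dim].exp())
        rows, cols = torch.tril_indices(self.action_dim, self.action_dim, offset=-1)
        L[:, rows, cols] = entries[:, self.action_dim :]
        P = L @ L.transpose(1, 2)
        d = (a - phi).unsqueeze(-1)                   # (batch, action_dim, 1)
        advantage = -0.5 * (d.transpose(1, 2) @ P @ d).reshape(-1, 1)
        return v + advantage, v, phi                  # Q(s, a), V(s), φ(s)

def soft_update(target, online, sigma=0.005):
    # Step 13: θ' ← σθ + (1 − σ)θ' for every parameter pair.
    for tp, op in zip(target.parameters(), online.parameters()):
        tp.data.copy_(sigma * op.data + (1 - sigma) * tp.data)
```

Since V(s') from the target copy of this network gives the regression target y = r + γV(s') directly (step 11), no separate actor network is needed, which is the main structural difference from DDPG below.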
| DDPG algorithm |
|---|
| 1: Initialize online Network parameters θQ and θμ for Critic and Actor. |
| 2: Initialize Target Network parameters θQ' = θQ and θμ' = θμ for Critic and Actor. |
| 3: For episode = 1 to M do |
| 4: Initialize action exploration noise η, and state s. |
| 5: While the state has not reached a terminal state do |
| 6: Input the state into the Actor's online Neural Network, select action a = μ(s) + η. |
| 7: The Agent performs the action a, receives immediate reward r, and the environmental state transitions to s'. |
| 8: Store the sampled experience (s, a, s', r) in the experience pool. |
| 9: If the data in the experience pool exceeds the training batch size |
| 10: Sample (s, a, s', r) from the experience pool. |
| 11: Compute the target y = r + γQ'(s', μ'(s')) using the Target Networks, and use the mean squared error loss between Q(s, a) and y to update the parameters of the Critic's online Neural Network. |
| 12: Use the sampled policy gradient ∇θμ J ≈ (1/N) Σ ∇a Q(s, a)|a=μ(s) ∇θμ μ(s) to update the Actor's online Neural Network. |
| 13: Update the Target Neural Networks of both the Critic and Actor, with σ being a coefficient less than 1: θQ' ← σθQ + (1 − σ)θQ', θμ' ← σθμ + (1 − σ)θμ'. |
| 14: End if |
| 15: End while |
| 16: End for |
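
The loop body (steps 11 to 13) amounts to one Critic regression, one policy-gradient step, and two soft target updates. The sketch below condenses them into a single update function, assuming PyTorch, pre-built Actor and Critic modules (the Critic taking a state-action pair), and one optimizer per network; all names are illustrative rather than the authors' code, and terminal-state masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, sigma=0.005):
    s, a, s_next, r = batch  # tensors sampled from the experience pool (step 10)

    # Step 11: regress the Critic toward y = r + γQ'(s', μ'(s')).
    # (Terminal-state masking with a done flag is omitted here.)
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Step 12: sampled policy gradient. Minimizing -Q(s, μ(s)) ascends
    # ∇a Q(s, a)|a=μ(s) ∇θμ μ(s) with respect to the Actor parameters.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 13: soft updates θ' ← σθ + (1 − σ)θ' for both Target Networks.
    for target, online in ((critic_target, critic), (actor_target, actor)):
        for tp, op in zip(target.parameters(), online.parameters()):
            tp.data.copy_(sigma * op.data + (1 - sigma) * tp.data)
```

The target computation sits under torch.no_grad() so that only the online networks receive gradients, mirroring the separation between the regression in step 11 and the slow target tracking in step 13.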
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).