Submitted:
26 June 2023
Posted:
26 June 2023
Abstract
Keywords:
1. Introduction
3. NUMA-BTLP Algorithm
3.1. Data dependencies considerations
3.2. Data structures used in the implementation
- The main thread which executes the main function is the root of the generation tree and forms the first level of the tree
- Threads that are created in the main function via a pthread_create call [5] are sons of the root and form the second level of the tree
- Threads that are created in the functions executed by the second-level threads (the so-called attached functions) form the third level of the tree, and so on until the last level is formed.
- The main thread which executes the main function is the root of the communication tree and forms the first level of the tree
- If the execution thread that is a candidate for the communication tree is determined to be autonomous or postponed, the candidate is added to the communication tree as a son of its parent from the generation tree
- If the execution thread is side-by-side, the candidate is added as a son of every thread already in the communication tree with which the candidate is in a side-by-side relation.
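The construction rules listed above can be illustrated with a minimal C sketch. It is not the paper's implementation: the type names (`thread_info`, `comm_tree`), the fixed capacities, and the function names are all hypothetical. The generation tree is represented only by the `gen_parent` field. The sketch assumes that a candidate is attached under every occurrence of the relevant thread already in the communication tree, and that a side-by-side candidate whose partners are not yet in the tree is simply not placed.

```c
#include <assert.h>

#define MAX_NODES 64  /* hypothetical capacity of the communication tree */
#define MAX_PEERS 4   /* hypothetical cap on side-by-side partners */

typedef enum { AUTONOMOUS, SIDE_BY_SIDE, POSTPONED } thread_type;

/* One execution thread as seen by the static analysis (hypothetical layout). */
typedef struct {
    int id;               /* thread id; 0 is the main thread */
    int gen_parent;       /* parent in the generation tree, -1 for main */
    thread_type type;
    int peers[MAX_PEERS]; /* ids of threads it is side-by-side with */
    int n_peers;
} thread_info;

typedef struct {
    int thread_id;
    int parent;           /* index of the parent node, -1 for the root */
} comm_node;

typedef struct {
    comm_node nodes[MAX_NODES];
    int n_nodes;
} comm_tree;

/* Append a node and return its index. */
int comm_add_node(comm_tree *t, int thread_id, int parent)
{
    assert(t->n_nodes < MAX_NODES);
    t->nodes[t->n_nodes].thread_id = thread_id;
    t->nodes[t->n_nodes].parent = parent;
    return t->n_nodes++;
}

/* The main thread is the root and forms the first level. */
void comm_init(comm_tree *t, int main_id)
{
    t->n_nodes = 0;
    comm_add_node(t, main_id, -1);
}

/* Return 1 if thread_id is a side-by-side partner of candidate c. */
int is_peer(const thread_info *c, int thread_id)
{
    for (int i = 0; i < c->n_peers; i++)
        if (c->peers[i] == thread_id)
            return 1;
    return 0;
}

/* Insert candidate c and return the number of nodes added.
 * Autonomous/postponed: son of each occurrence of its generation parent.
 * Side-by-side: son of each occurrence of each partner already present. */
int comm_insert(comm_tree *t, const thread_info *c)
{
    int existing = t->n_nodes; /* only scan nodes present before insertion */
    int added = 0;
    for (int i = 0; i < existing; i++) {
        int tid = t->nodes[i].thread_id;
        int attach = (c->type == SIDE_BY_SIDE) ? is_peer(c, tid)
                                               : (tid == c->gen_parent);
        if (attach) {
            comm_add_node(t, c->id, i);
            added++;
        }
    }
    return added;
}
```

Under these assumptions, a thread can appear several times in the communication tree, once per thread it communicates with, so a traversal of the tree visits every communicating pair.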
4. Mapping of NUMA-BTDM Algorithm
4.1. Getting the hardware architecture details of the underlying architecture
4.2. Mapping autonomous threads
4.3. Mapping side-by-side threads
4.4. Mapping postponed threads
5. Materials and Methods
6. Results
7. Discussion
- The ability to allow parallel applications written in C that use the Pthreads library [5] to customize and control thread mapping based on static characteristics of the code, by inserting pthread_setaffinity_np calls [5] into the LLVM IR of the input code after each pthread_create call [5]. Thus, the mapping is not random
- The paper defines original static criteria for classifying threads into three categories and defines the categories
- Mapping of threads depending on their type. The autonomous threads are distributed uniformly across cores, which satisfies the balance criterion of balanced data locality. A side-by-side thread is mapped on the same cores as every other thread with which it is considered side-by-side, allowing better data locality [4]
- The definition of the static criteria for classifying the execution threads into three categories, and the classification itself. If two threads are data dependent (i.e. the data sent to one thread is used in the execution of the other thread), they are classified as side-by-side [4]. If a thread has no data dependencies with any other thread, the thread is of type autonomous [4]. If a thread has data dependencies with its parent thread only, the thread type is postponed [4]. The data dependencies are revealed by the NUMA-BTLP algorithm [3], which is implemented in LLVM but is not yet part of it
- Mapping execution threads based on their type. The execution of autonomous threads is spread uniformly across cores, which satisfies the balance criterion of balanced data locality [4]. A side-by-side thread is allocated for execution on each of the cores on which the threads it is side-by-side with are mapped. This ensures optimized data locality [4]. The postponed threads are mapped to the least loaded core so far, once they are identified while traversing the communication tree. The distribution of postponed threads, like the distribution of autonomous threads, also ensures balanced execution
- Integrating the implementation of the classification and mapping algorithms in a modern compiler infrastructure such as LLVM
- Using two trees, a generation tree and a communication tree, in mapping the execution threads. The communication tree describes the data dependencies between threads and the generation tree describes the generation of the execution threads [4]. The way in which the communication tree is constructed is novel. The rules for constructing the tree are the following: any autonomous or postponed thread is added as a son thread to every occurrence in the communication tree of its parent in the generation tree, and every side-by-side thread is added as a son thread to every thread with which it is in a side-by-side relation. When the communication tree is constructed in this manner, one can find out how threads communicate by traversing the tree.
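The classification criteria and the per-type mapping rules summarized in the list above can be sketched in C as follows. This is an illustrative sketch under stated assumptions, not the NUMA-BTLP or NUMA-BTDM implementation: the names (`classify`, `mapper`, `map_autonomous`, and so on), the core count `N_CORES`, and the bitmask encoding of dependencies and cores are hypothetical. "Uniform" distribution of autonomous threads is approximated here by round-robin, and core load is approximated by the number of threads assigned so far. On Linux, each resulting core mask would be converted to a `cpu_set_t` with `CPU_SET` and passed to `pthread_setaffinity_np` after the corresponding `pthread_create` call.

```c
#include <assert.h>
#include <stdint.h>

#define N_CORES 4 /* hypothetical core count; a real tool would query it */

typedef enum { AUTONOMOUS, SIDE_BY_SIDE, POSTPONED } thread_type;

/* deps: bit i is set if the thread has a data dependency with thread i.
 * No dependencies -> autonomous; only the parent -> postponed;
 * anything else -> side-by-side. */
thread_type classify(uint32_t deps, int parent_id)
{
    assert(parent_id >= 0 && parent_id < 32);
    if (deps == 0)
        return AUTONOMOUS;
    if (deps == (UINT32_C(1) << parent_id))
        return POSTPONED;
    return SIDE_BY_SIDE;
}

typedef struct {
    int load[N_CORES]; /* number of threads assigned to each core so far */
    int next_rr;       /* next core for round-robin placement */
} mapper;

void mapper_init(mapper *m)
{
    for (int c = 0; c < N_CORES; c++)
        m->load[c] = 0;
    m->next_rr = 0;
}

/* Record one more thread on every core present in mask. */
static void occupy(mapper *m, uint32_t mask)
{
    for (int c = 0; c < N_CORES; c++)
        if (mask & (UINT32_C(1) << c))
            m->load[c]++;
}

/* Autonomous threads: spread over the cores in round-robin order
 * (an assumption standing in for "uniform" distribution). */
uint32_t map_autonomous(mapper *m)
{
    uint32_t mask = UINT32_C(1) << m->next_rr;
    m->next_rr = (m->next_rr + 1) % N_CORES;
    occupy(m, mask);
    return mask;
}

/* Side-by-side threads: union of the cores of all partners.
 * This sketch assumes at least one partner is already mapped. */
uint32_t map_side_by_side(mapper *m, const uint32_t *partner_masks, int n)
{
    uint32_t mask = 0;
    for (int i = 0; i < n; i++)
        mask |= partner_masks[i];
    assert(mask != 0);
    occupy(m, mask);
    return mask;
}

/* Postponed threads: the least loaded core so far (lowest index on ties). */
uint32_t map_postponed(mapper *m)
{
    int best = 0;
    for (int c = 1; c < N_CORES; c++)
        if (m->load[c] < m->load[best])
            best = c;
    uint32_t mask = UINT32_C(1) << best;
    occupy(m, mask);
    return mask;
}
```

Keeping the mapping decision as a plain bitmask separates the policy from the Pthreads call, so the policy can be checked without pinning any real thread.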
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- LLVM Compiler Infrastructure Project, Available online: https://llvm.org/ (accessed: 4th of February 2021).
- Știrb, I., NUMA-BTDM: A thread mapping algorithm for balanced data locality on NUMA systems. In 2016 17th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), December 2016, pp. 317-320. [CrossRef]
- Știrb, I., NUMA-BTLP: A static algorithm for thread classification. In 2018 5th International Conference on Control, Decision and Information Technologies (CoDIT), April 2018, pp. 882-887. [CrossRef]
- Știrb, I., 2018. Extending NUMA-BTLP Algorithm with Thread Mapping Based on a Communication Tree. Computers, 7(4), p.66. [CrossRef]
- pthreads(7) - Linux manual page, Available online: https://man7.org/linux/man-pages/man7/pthreads.7.html (accessed: 4th of February 2021).
- Finkel, H., Poliakoff, D., Camier, J.S. and Richards, D.F., Clangjit: Enhancing c++ with just-in-time compilation. In 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), November 2019, pp. 82-95. [CrossRef]
- López-Gómez, J., Fernández, J., Astorga, D.D.R., Vassilev, V., Naumann, A. and García, J.D., Relaxing the one definition rule in interpreted C++. In Proceedings of the 29th International Conference on Compiler Construction, February 2020, pp. 212-222. [CrossRef]
- Auler, R. and Borin, E., 2013. A LLVM Just-in-Time Compilation Cost Analysis. Technical Report 13-2013 IC-UNICAMP.
- Ansel, J., Marchenko, P., Erlingsson, U., Taylor, E., Chen, B., Schuff, D.L., Sehr, D., Biffle, C.L. and Yee, B., Language-independent sandboxing of just-in-time compilation and self-modifying code. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, June 2011, pp. 355-366. [CrossRef]
- Diener, M., Cruz, E.H., Navaux, P.O., Busse, A. and Heiß, H.U., Communication-aware process and thread mapping using online communication detection. Parallel Computing, 43, 2015, pp. 43-63. [CrossRef]
- Chishti Z., Powell M. D., Vijaykumar T. N., Optimizing Replication, Communication, and Capacity Allocation in CMPs. In ACM SIGARCH Computer Architecture News, 2005, Volume 33 (2), pp. 357–368. [CrossRef]
- Cruz, E.H., Diener, M., Pilla, L.L. and Navaux, P.O., Hardware-assisted thread and data mapping in hierarchical multicore architectures. ACM Transactions on Architecture and Code Optimization (TACO), 13(3), 2016, pp.1-28. [CrossRef]
- Wang, W., Dey, T., Mars, J., Tang, L., Davidson, J.W. and Soffa, M.L., Performance analysis of thread mappings with a holistic view of the hardware resources. In 2012 IEEE International Symposium on Performance Analysis of Systems & Software, April 2012, pp. 156-167. [CrossRef]
- Mallach, S. and Gutwenger, C., Improved scalability by using hardware-aware thread affinities. In Facing the multicore-challenge: aspects of new paradigms and technologies in parallel computing, 2010, pp.29-41. [CrossRef]
- Diener, M., Cruz, E.H., Alves, M.A., Alhakeem, M.S., Navaux, P.O. and Heiß, H.U., Locality and balance for communication-aware thread mapping in multicore systems. In Euro-Par 2015: Parallel Processing: 21st International Conference on Parallel and Distributed Computing, Vienna, Austria, August 24-28, 2015, Proceedings 21, 2015, pp. 196-208. [CrossRef]
- Știrb, I. Reducerea consumului de energie și a timpului de execuție prin optimizarea comunicării între firele de execuție și prin localizarea echilibrată a datelor la execuția programelor paralele, pe sisteme NUMA, Politehnica University of Timișoara, Timișoara, 9th of December 2020.


| Real Benchmark | Optimization in W/s (UMA) | Optimization in W/s (NUMA) | Optimization in % (UMA) | Optimization in % (NUMA) |
|---|---|---|---|---|
| cpu-x | 1.29 | 0.19 | 67.63 | 10.75 |
| cpu | 7.56 | 0.9 | 15.79 | 1.88 |
| flops | 0.3 | 0.6 | 0.41 | 0.82 |
| context switch | 0.77 | 0.32 | 1.57 | 0.76 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).