Submitted: 09 October 2025
Posted: 09 October 2025
Abstract
Keywords:
1. Introduction
2. System Design
2.1. Kernel Pool
2.2. Runtime Profiler
2.3. Decision Engine
3. Methodology
3.1. Profiling Metrics
- Latency (L): Average execution time of the kernel, measured in milliseconds.
- Throughput (T): Number of operations per second (e.g., images/s).
- Memory Footprint (M): Amount of on-chip/off-chip memory consumed.
- Energy Consumption (E): Estimated energy cost measured via hardware counters or external sensors.
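As a minimal sketch of how the four metrics above might be recorded and compared, the snippet below stores one profile per kernel and folds it into a single scalar cost. The names `KernelProfile` and `weighted_cost`, the normalization against a reference profile, and the default weights are all assumptions made for illustration; the source does not specify how the metrics are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelProfile:
    """One profiling sample for a kernel (hypothetical record layout)."""
    latency_ms: float        # L: average execution time in milliseconds
    throughput_ops: float    # T: operations per second (e.g., images/s)
    memory_mb: float         # M: memory footprint (unit assumed to be MB)
    energy_j: float          # E: estimated energy cost in joules


def weighted_cost(p, ref, w_l=0.4, w_t=0.2, w_m=0.2, w_e=0.2):
    """Lower is better. Each term is normalized against a reference profile.

    Throughput is inverted because higher throughput is better.
    The weights are illustrative assumptions, not values from the paper.
    """
    return (
        w_l * p.latency_ms / ref.latency_ms
        + w_t * ref.throughput_ops / p.throughput_ops
        + w_m * p.memory_mb / ref.memory_mb
        + w_e * p.energy_j / ref.energy_j
    )


# Hypothetical numbers, for illustration only.
baseline = KernelProfile(latency_ms=10.0, throughput_ops=100.0, memory_mb=64.0, energy_j=0.5)
candidate = KernelProfile(latency_ms=7.0, throughput_ops=140.0, memory_mb=80.0, energy_j=0.4)
candidate_cost = weighted_cost(candidate, baseline)
```

Under this scheme the reference profile always scores about 1.0, so any candidate scoring below 1.0 is preferable to the reference for the chosen weights.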
3.2. Decision Strategies
Heuristic-based Selection
Multi-Armed Bandit (MAB)
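One way a bandit-style selector could be realized is sketched below, treating each candidate kernel as an arm and observed latency as the cost to minimize. This uses a simple epsilon-greedy policy, which is an assumption; the source does not state which bandit algorithm is used. The class name `EpsilonGreedySelector`, the kernel names, and the simulated latencies are hypothetical.

```python
import random


class EpsilonGreedySelector:
    """Epsilon-greedy bandit over kernels; minimizes mean observed latency."""

    def __init__(self, kernels, epsilon=0.1, seed=0):
        self.kernels = list(kernels)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {k: 0 for k in self.kernels}
        self.mean_latency = {k: 0.0 for k in self.kernels}

    def select(self):
        # Try every kernel once before relying on the estimates.
        for k in self.kernels:
            if self.counts[k] == 0:
                return k
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.kernels)
        return min(self.kernels, key=lambda k: self.mean_latency[k])

    def update(self, kernel, latency_ms):
        # Incremental running mean of the observed latency for this kernel.
        self.counts[kernel] += 1
        n = self.counts[kernel]
        self.mean_latency[kernel] += (latency_ms - self.mean_latency[kernel]) / n


def simulate(rounds=300):
    # Hypothetical kernels and latencies, for illustration only.
    true_latency_ms = {"direct": 5.0, "winograd": 3.0, "im2col_gemm": 4.0}
    noise = random.Random(1)
    selector = EpsilonGreedySelector(true_latency_ms)
    for _ in range(rounds):
        kernel = selector.select()
        measured = true_latency_ms[kernel] + noise.uniform(-0.3, 0.3)
        selector.update(kernel, measured)
    return selector
```

In this simulation the measurement noise is bounded below the gap between kernels, so the estimated-best kernel matches the truly fastest one. With noisier measurements, a confidence-bound policy such as UCB1 may be a better fit.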
Learning-based Predictor
3.3. Overhead Control
4. Experimental Setup and Results
4.1. Experimental Setup
- Embedded GPU: NVIDIA Jetson Xavier NX (8 GB RAM, 21 TOPS).
- Mobile CPU: ARM Cortex-A76 (8 cores, 2.2 GHz).
- Cloud GPU: NVIDIA A100 (40 GB HBM2).
- Static: A baseline with pre-selected kernels (TensorRT/TVM default).
- DKS: Our proposed dynamic kernel selection framework.
- Oracle: An upper bound where the best kernel is chosen with perfect foresight.
4.2. Results
5. Discussion
5.1. Practical Implications
5.2. Scalability and Generalization
5.3. Limitations
- Profiling overhead: Although lightweight, profiling incurs additional runtime cost. Future work should explore more efficient methods such as hardware-assisted telemetry.
- Kernel diversity: The effectiveness of DKS depends on the diversity and quality of available kernels. In scenarios with limited kernel implementations, adaptivity benefits are constrained.
- Decision complexity: As the number of kernels increases, the decision space expands. Efficient exploration strategies are needed to avoid performance regressions.
5.4. Future Directions
- Hybrid optimization: Combining offline compiler autotuning with online kernel adaptivity.
- Cross-layer adaptivity: Extending kernel selection beyond single operators to pipeline-level optimization (e.g., fused attention blocks).
- QoS-aware policies: Incorporating application-level constraints such as deadlines, quality-of-service (QoS), or user experience metrics.
- Hardware integration: Leveraging hardware counters, DVFS (Dynamic Voltage and Frequency Scaling), and energy models to improve kernel selection accuracy with minimal overhead.
6. Conclusion

| Model and metric | Static | DKS | Oracle |
|---|---|---|---|
| ResNet-50 Latency (ms) | 42.1 | 26.0 | 24.5 |
| MobileNetV3 Latency (ms) | 18.7 | 12.5 | 11.8 |
| BERT-base Latency (ms) | 115.3 | 71.2 | 68.5 |
| ResNet-50 Energy (J) | 1.42 | 1.03 | 0.98 |
| MobileNetV3 Energy (J) | 0.61 | 0.46 | 0.44 |
| BERT-base Energy (J) | 3.92 | 2.77 | 2.69 |
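
The table can be summarized with two derived quantities: the relative reduction of DKS over Static, and the share of the Static-to-Oracle gap that DKS closes. The sketch below computes both from the tabulated values; the function names and the "gap closed" measure are introduced here for illustration and are not part of the original evaluation.

```python
# Values copied from the results table: (Static, DKS, Oracle).
RESULTS = {
    "ResNet-50 Latency (ms)": (42.1, 26.0, 24.5),
    "MobileNetV3 Latency (ms)": (18.7, 12.5, 11.8),
    "BERT-base Latency (ms)": (115.3, 71.2, 68.5),
    "ResNet-50 Energy (J)": (1.42, 1.03, 0.98),
    "MobileNetV3 Energy (J)": (0.61, 0.46, 0.44),
    "BERT-base Energy (J)": (3.92, 2.77, 2.69),
}


def reduction_pct(static, dks):
    """Relative reduction of DKS versus the Static baseline, in percent."""
    return 100.0 * (static - dks) / static


def oracle_gap_closed_pct(static, dks, oracle):
    """Share of the Static-to-Oracle improvement that DKS achieves, in percent."""
    return 100.0 * (static - dks) / (static - oracle)


def summarize(results):
    """Map each table row to (reduction %, oracle gap closed %)."""
    return {
        name: (reduction_pct(s, d), oracle_gap_closed_pct(s, d, o))
        for name, (s, d, o) in results.items()
    }


if __name__ == "__main__":
    for name, (red, gap) in summarize(RESULTS).items():
        print(f"{name}: {red:.1f}% reduction, {gap:.1f}% of oracle gap closed")
```

Computed this way, DKS reduces latency by roughly 33% to 38% and energy by roughly 25% to 29% relative to Static, and it closes about 88% to 94% of the gap to the Oracle.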
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).