Preprint Article Version 1 Preserved in Portico This version is not peer-reviewed

Efficient Resource Management for Deep Learning Applications with Virtual Containers

Version 1 : Received: 1 November 2020 / Approved: 2 November 2020 / Online: 2 November 2020 (10:38:31 CET)

How to cite: Zheng, W. Efficient Resource Management for Deep Learning Applications with Virtual Containers. Preprints 2020, 2020110020. https://doi.org/10.20944/preprints202011.0020.v1

Abstract

The explosion of data has transformed the world, since far more information is available for collection and analysis than ever before. To extract valuable information from this data along different dimensions, various deep learning models have been developed in recent years. Although these models have demonstrated a strong capability to improve products and services in various applications, training them remains a time-consuming and resource-intensive process. The cloud, one of the most powerful computing infrastructures, is now used for this training. However, managing cloud computing resources so that training runs efficiently still challenges current techniques. For example, general resource scheduling approaches, such as spread priority and balanced resource schedulers, do not work well with deep learning workloads. Moreover, the resource allocation problem on a cluster can be divided into two subproblems: (1) local resource optimization, which improves the resource configuration of a single machine; and (2) global resource optimization, which improves cluster-wide resource allocation. In this thesis, we propose two novel container schedulers, FlowCon and SpeCon, which address these two subproblems respectively and are designed specifically to optimize the performance of short-lived deep learning applications in the cloud. FlowCon focuses on the resource configuration of a single node in a cluster. We show that it improves both the completion time of deep learning tasks and resource utilization, and reduces the completion time of a specific job by up to 42.06% without sacrificing the overall system time. SpeCon targets cluster-wide resource configuration: it speculatively migrates slow-growing models to release resources for fast-growing ones. In our experiments, SpeCon improves makespan by up to 24.7% compared to current approaches.

Keywords

Cloud Resource Management; Container Scheduling; Deep Learning Applications

Subject

Computer Science and Mathematics, Software
