Preprint Article Version 1 Preserved in Portico This version is not peer-reviewed

SpeCon: Speculative Container Scheduling for Deep Learning Applications in a Kubernetes Cluster

Version 1 : Received: 25 October 2020 / Approved: 27 October 2020 / Online: 27 October 2020 (07:38:57 CET)

How to cite: Mao, Y.; Fu, Y.; Zheng, W.; Cheng, L.; Liu, Q. SpeCon: Speculative Container Scheduling for Deep Learning Applications in a Kubernetes Cluster. Preprints 2020, 2020100534 (doi: 10.20944/preprints202010.0534.v1).

Abstract

In the past decade, we have witnessed a dramatic increase in the volume of data collected from varied sources. This explosion of data has transformed the world, as more information is available for collection and analysis than ever before. To extract value from it, various machine and deep learning models have been developed, e.g., CNNs [1] and RNNs [2], to study data from different perspectives. While data-driven applications improve countless products, training models for hyperparameter tuning is still time-consuming and resource-intensive. Cloud computing provides infrastructure support for training deep learning applications. Cloud service providers, such as Amazon Web Services [3], create isolated virtual environments (virtual machines and containers) for clients, who share physical resources, e.g., CPU and memory. On the cloud, resource management schemes are implemented to enable better sharing among users and boost system-wide performance. However, general scheduling approaches, such as spread-priority and balanced-resource schedulers, do not work well with deep learning workloads. In this project, we propose SpeCon, a novel container scheduler that is optimized for short-lived deep learning applications. Built on container technologies such as Docker [5] and the Kubernetes orchestrator [4], SpeCon analyzes the common characteristics of training processes. We design a suite of algorithms to monitor the progress of training and speculatively migrate slow-growing models to release resources for fast-growing ones. Extensive experiments demonstrate that SpeCon improves the completion time of an individual job by up to 41.5%, system-wide completion time by 14.8%, and makespan by 24.7%.
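The core idea described above — monitoring each job's training progress and speculatively flagging slow-growing models for migration — can be illustrated with a minimal sketch. All names, the window size, and the growth threshold here are hypothetical illustrations, not details taken from the paper:

```python
# Hypothetical sketch of SpeCon-style speculative scheduling: track each
# training job's recent loss curve and mark jobs whose improvement rate
# has fallen below a threshold as candidates for migration, so the
# resources they hold can be released to fast-growing jobs.

from collections import deque


class JobMonitor:
    """Tracks the recent loss values of one training job."""

    def __init__(self, window=5):
        self.losses = deque(maxlen=window)

    def record(self, loss):
        self.losses.append(loss)

    def growth_rate(self):
        # Average per-step loss reduction over the window;
        # a higher value means the model is still improving quickly.
        if len(self.losses) < 2:
            return float("inf")  # too early to judge
        return (self.losses[0] - self.losses[-1]) / (len(self.losses) - 1)


def select_migration_candidates(monitors, threshold=0.01):
    """Return job ids whose growth rate fell below the threshold."""
    return [job_id for job_id, m in monitors.items()
            if m.growth_rate() < threshold]


# Usage: job-a has nearly converged (slow growth), job-b is still improving.
monitors = {"job-a": JobMonitor(), "job-b": JobMonitor()}
for loss in [0.50, 0.49, 0.489, 0.4885, 0.4883]:
    monitors["job-a"].record(loss)
for loss in [0.90, 0.70, 0.55, 0.45, 0.38]:
    monitors["job-b"].record(loss)

print(select_migration_candidates(monitors))  # ['job-a']
```

In the paper's setting, a candidate job would then be migrated (e.g., checkpointed and rescheduled on a less contended node) so that its freed CPU and memory benefit jobs whose models are still growing quickly.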

Subject Areas

Container Scheduling; Resource Management; Docker; Kubernetes
