Efficient Resource Management for Deep Learning Applications with Virtual Containers

The explosion of data has transformed the world, since much more information is available for collection and analysis than ever before. To extract valuable information from the data in different dimensions, various deep learning models have been developed in the past years. Although these models have demonstrated their strong capability to improve products and services in various applications, training them is still a time-consuming and resource-intensive process. Presently, the cloud, one of the most powerful computing infrastructures, is widely used for training. However, managing cloud computing resources so that training runs efficiently still challenges current techniques. For example, general resource scheduling approaches, such as spread-priority and balanced-resource schedulers, do not work well with deep learning workloads. Moreover, the resource allocation problem on a cluster can be divided into two subproblems: (1) local resource optimization: improving the resource configuration of a single machine; (2) global resource optimization: improving the cluster-wide resource allocation. In this thesis, we propose two novel container schedulers, FlowCon and SpeCon, that address these two subproblems respectively and are specifically designed to optimize the performance of short-lived deep learning applications in the cloud. FlowCon focuses on the resource configuration of a single node in a cluster; we show that it efficiently improves the completion time and resource utilization of deep learning tasks, reducing the completion time of a specific job by up to 42.06\% without sacrificing overall system time. SpeCon targets cluster-wide resource configuration and speculatively migrates slow-growing models to release resources for fast-growing ones. Based on our experiments, SpeCon improves makespan by up to 24.7\% compared to current approaches.

GPUs (graphics processing units) were developed, and the computational ability to process data became much faster [1]. Besides, as data exploded, many large open datasets were launched, such as ImageNet [2], Open Images Dataset [3] and CIFAR-10 [4]. As a result, the development of deep learning has been accelerated, and various new neural network architectures like AlexNet [5], GoogLeNet [6] and ResNet [7] have been proposed. Training such models at scale may rely on dedicated hardware systems, where the cost of the whole system is about 4.1 million USD [9]. This approach significantly improves training time; however, deploying hardware like this is very expensive, and it is unnecessary and unaffordable for the general public or small companies. Even a single training job requires a large amount of resources, let alone the multiple concurrent jobs that are common in industry environments.
Given this circumstance, the cloud appears to be a better option for improving the hardware condition.
Many cloud computing service providers, such as Microsoft and AWS, offer large-scale shared computing resources, and users are able to create traditional virtual machines and/or virtual containers and run deep learning training jobs inside them. In the cloud environment, various jobs run concurrently and compete for computing resources such as CPU and memory.
To efficiently allocate computing resources, scheduling algorithms are essential for systems.
There are some commonly used schedulers, such as Docker Swarm and Kubernetes [10], which provide service discovery, load balancing, and other container and cluster management services.
These general resource-management platforms can balance workloads across containers in the cloud; however, they are not designed specifically for deep learning jobs and fail to take into account the characteristics of the training process. For a deep learning training task, the goal is to reduce the loss iteratively until it converges to a minimum. The convergence speed is not static; on the contrary, it generally decreases with time. At the beginning of training, the loss drops significantly. After this rapid decline phase, the loss converges to a relatively stable value and the efficiency of the training process decreases. It turns out that although the resource usage stays the same for each unit of the training process, the gain varies over time. To address this problem, we consider the following two aspects:
• Local optimization: real-time resource configuration on a single machine inside a cluster.
• Global optimization: cluster-wide resource configuration in a cluster of containers.
We propose solutions for these two subproblems respectively. We first propose FlowCon, which aims to improve the efficiency of multiple deep learning training applications running in a containerized cloud environment and to accelerate the overall system makespan through real-time resource allocation on a single machine. Different from current systems that use a fixed configuration, FlowCon is a real-time resource allocation system that monitors resource usage and training progress, categorizes training jobs into different phases based on growth efficiency, and dynamically allocates resources to each job according to its phase on each node in the cluster.
To improve resource allocation throughout a cluster, we propose a second container workflow scheduler for deep learning applications called SpeCon, a Speculative Container scheduler, which aims to accelerate multiple running tasks in a cloud cluster. With a fixed or static configuration on the worker nodes, existing schedulers terminate or migrate a container when a worker node is overwhelmed, which retrains the deep learning model from the beginning and significantly extends the overall system makespan. Similar to FlowCon, SpeCon categorizes a training job into different phases and decides whether a job has converged or not. For a converged job, SpeCon first gathers system-wide resource-usage information as well as the existing jobs on each node, and then selects the most desirable host for migration using our proposed algorithms.
The container reallocation in SpeCon releases resources to fast-growing jobs and therefore improves overall system performance. When all training jobs have converged, SpeCon keeps monitoring the system to rebalance the workloads.
The main contributions of this thesis are summarized as follows:
• To optimize computing cluster resource utilization, we divide the problem into two subproblems, resource management in a single-machine environment and resource management in a cluster-wide environment, and propose two efficient resource schedulers, FlowCon [12] and SpeCon, to address these subproblems respectively.
• We analyze the characteristics of training processes for various deep learning models and conduct experiments to study the overhead of saving and resuming jobs in a Kubernetes system. In addition, we introduce the concept of growth efficiency, a measure of the magnitude of the change in the loss function per unit of computing resource.
• FlowCon is designed to elastically allocate (or deallocate) resources to (or from) each learning job at run time on a single machine by monitoring the growth efficiency of deep learning jobs, allowing jobs to converge more quickly without significant scheduling overhead.
• We propose SpeCon with a suite of algorithms that monitor the evaluation functions of deep learning models on the worker side; on the manager side, SpeCon collects cluster-wide information to calculate a weighted score for each worker and select the most desirable host for the migration.
Furthermore, it rebalances workloads when all containers have converged. Training deep learning models is time-consuming and resource-intensive [32]: many models need several days, or even weeks, to produce the final result. A powerful hardware system is crucial for supporting these time-consuming and resource-intensive novel architectures. Some researchers have recognized these problems and worked on reducing resource intensity by adjusting the architecture itself. In works [33]-[37], different approaches were tried, such as reducing filter sizes, removing layers, and exploiting linear structure to speed up training. Although these works show good results at certain points, they are still not commonly used tools in either academia or industry. Our systems can be applied to any type of deep learning training job, and our goal is to directly optimize training performance without modifying model structures, for broader applicability.
Deploying deep learning training jobs to the cloud gives them access to massive computing resources.
Different tasks share the CPUs and memory of physical machines by utilizing virtual machines, and the cluster servers on the cloud provide more resources to the tasks. Due to the design of virtual machines, however, creating a virtual machine requires a large amount of resources, and it also gives rise to provisioning delay [38], since each virtual machine runs its own guest operating system. To address this problem, many works [39], [40] have attempted to reduce the delay, but the provisioning cost cannot be fully avoided and still wastes virtual machine resources. Another shortcoming of virtual machines is the cost of migration. Virtual machine migration is a commonly used technique that moves virtual instances from one physical machine to another for system-management purposes such as load balancing, fault management and low-level system maintenance [41].
However, the heavy-weight hard disk storage of virtual machines causes poor migration performance.
Container technology is designed to address these limitations of virtual machines. By design, containers run on a common host operating system while each container remains isolated from the others. Containers are lighter weight than virtual machines, and the operating time of container migration is much less than that of virtual machine migration [42]. With the rapid development of containers, many systems have been proposed for container scheduling. For instance, [43] proposed an efficient container scheduler, ECSched, which models the scheduling problem as a minimum-cost flow problem and can thus make high-quality placement decisions quickly for concurrent deployment requests. Multiopt [44] is a multi-objective container scheduler that takes five key factors into consideration, including CPU usage, memory usage, time consumption, association and clustering, and establishes a composite function by combining the scoring functions of each factor. PIVOT [45] introduced a cost-aware scheduler that enables data-intensive tasks to run and scale in the cloud immediately and cost-effectively. Although these novel scheduling algorithms perform well in container placement, they all target general workloads. In contrast, our systems specifically target deep learning applications.
Several previous works focused on container scheduling algorithms for deep learning tasks.
TRADL [46] provides a resource optimizer based on a user-defined target loss. Once the training process reaches the target loss, TRADL shifts CPU and memory from that job to other training jobs that still need resources to reach their targets. The system is well designed to monitor training progress and resource usage, and it can allocate resources in real time. However, TRADL requires a user to predefine the target loss. Gandiva [47] uses the predictability of jobs to efficiently time-slice GPUs, hence achieving low latency; it self-checks job performance and dynamically migrates jobs to better adapt to the GPUs and improve cluster efficiency. However, it does not consider the progress of the deep learning training itself.
ProCon [48] is limited in readjusting containerized deep learning applications since it does not support container migration.
Our solution specifically targets resource management for containerized deep learning applications in the cloud environment. We aim to optimize resource configuration in two settings: (1) computing resource utilization on a single machine, improving resource allocation between the different tasks that share resources on that machine/node; (2) global resources across the cluster, using migration to reallocate computing resources among running tasks. These settings are addressed respectively by our two efficient resource management systems, FlowCon and SpeCon.

BACKGROUND AND MOTIVATION
In this chapter, we study the training processes of containerized deep learning applications and motivate our works with experiments.

Training A Deep Learning Model
Given a deep learning algorithm, model training is the parameter-tuning process that approaches the global optimum of a predefined evaluation function. Initially, parameters are randomly selected for the model. The model is iteratively fed with mini-batches of training data, where each mini-batch contains features and labels. The model is then evaluated with the evaluation function, which informs users how far off it is from the target. The returned value of this function is recorded along with the model parameters, and together they propagate backward through the neural network architecture.
In the next round of the training process, the algorithm updates and adjusts parameters, aiming to achieve better performance. Finally, a well-trained model is generated with a parameter combination that optimizes the evaluation function. Figure 3.1 plots the loss of three models over training, where the x-axis is the cumulative time. We observe that within the first 20% of the total training time, 65%, 98%, and 90% of the maximum loss reduction has been achieved for VAE, Bi-RNN and GRU respectively. Since they are trained individually on a physical machine, resources are fully occupied by each model without competition. Therefore, for the same amount of resources, the training gain and efficiency decrease as time goes on.
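The iterative process above can be sketched framework-agnostically. The following minimal gradient-descent loop in plain Python is an illustration only; the quadratic loss and learning rate are assumptions, not the models or settings studied in this thesis:

```python
# Minimal sketch of iterative model training: a 1-D parameter is tuned by
# gradient descent to minimize a predefined evaluation (loss) function.
# The quadratic loss and fixed learning rate are illustrative assumptions.

def evaluation(w, target=3.0):
    """Evaluation function: squared distance from the (hypothetical) optimum."""
    return (w - target) ** 2

def gradient(w, target=3.0):
    return 2.0 * (w - target)

def train(w0=0.0, lr=0.1, iterations=50):
    w = w0
    history = []                      # record the loss at each iteration
    for _ in range(iterations):
        history.append(evaluation(w))
        w -= lr * gradient(w)         # parameter update for the next round
    return w, history

w, history = train()
# Loss shrinks fast early and flattens later, mirroring the trend in Fig. 3.1.
early_drop = history[0] - history[10]
late_drop = history[10] - history[20]
print(early_drop > late_drop)  # True: most of the reduction happens early
```

The decreasing per-interval gain in `history` is exactly the behavior that motivates measuring training progress per unit of time and resource.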

Motivation for FlowCon
To deploy applications into a production environment, it is difficult to achieve resilience and scalability using only a single compute node. Generally, a cluster (cloud) is used to provide the infrastructure for running a large set of containers at scale. Many toolkits have been designed for container orchestration in cluster environments, such as Docker Swarm and Kubernetes.
In current containerized cloud systems, running containers compete for resources freely, and the system maintains fairness among all of them on a worker node. Alternatively, users can set an upper limit on each of the containers when initializing them. However, these mechanisms are not optimal for deep learning tasks, for two main reasons. (1) Most models don't need to be perfect in the distant future; they just have to be good in the near future. Suppose we have a set of deep learning tasks running on containers within a cluster. In some settings, such as real-time data analytics, a model would be frequently requested by applications (e.g., for prediction) even before convergence is reached.
In this case, training the model to an acceptable (rather than perfect) level of accuracy is the most important.
(2) More commonly, some learning tasks converge faster than others, and their models can reach an acceptable state with fewer iterations (i.e., less time). If we want to train all models to a usable state while minimizing wait time, simply maintaining fairness among all tasks will result in resource waste. This is because jobs that are already in an acceptable state continue to utilize as many resources as those with many optimizations left to do, even though the nearly-converged jobs make only small gains in their loss function per unit of computing resource.
Returning to Figure 3.1, each model runs inside a container on the same physical node. It can be seen that the GRU training job reaches 0.03% loss at 27% of the time; when it completes training, the loss has decreased to 0.02%. This observation indicates that within the first 27% of the total time, the model achieves 99% of its minimum, while the remaining 73% of the time yields only the last 1% of improvement.
When other learning tasks run in parallel on the same node and seek computational resources, it is reasonable to shift part of the computational resources occupied by the GRU training job to those tasks. In this work, we propose FlowCon for resource management on a single machine to accomplish this goal.

Motivation for SpeCon
After single machine/node optimization, we consider cluster-wide resource management since nodes are running inside a cluster. As analyzed above, a learning model, at any given iteration, contains the neural network design (the algorithm itself) and values of the network parameters that it has trained.
That is to say, the model's progress can be saved during and after training. When necessary, the saved model can be resumed from exactly where it left off.
We experimentally study the overhead of saving and resuming learning models, comparing TensorFlow and PyTorch [53], where the training jobs run inside a Kubernetes container. When saving a model in TensorFlow, four files are generated internally.
• Meta graph file: This is a protocol buffer which saves the complete Tensorflow graph.
• Index file: It stores the list of variable names and shapes saved.
• Data file: This is the file which stores values of all the weights, biases, gradients and all the other variables.
• Checkpoint file: It keeps a record of the latest checkpoint files saved. When restoring the model, the system reads these files and restarts from where it stopped.
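The save/resume mechanism can be sketched in a framework-agnostic way. The following Python illustration uses a single pickle file rather than TensorFlow's four-file format described above; the parameter values and file name are hypothetical:

```python
# Framework-agnostic sketch of checkpointing: save a model's parameters and
# iteration counter to disk, then resume from exactly where training stopped.
# Real frameworks (e.g. TensorFlow's meta graph / index / data / checkpoint
# files) use richer formats; a single pickle file suffices for illustration.
import os
import pickle
import tempfile

def save_checkpoint(path, params, iteration):
    with open(path, "wb") as f:
        pickle.dump({"params": params, "iteration": iteration}, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

ckpt = os.path.join(tempfile.mkdtemp(), "model.ckpt")
save_checkpoint(ckpt, params={"w": 2.97}, iteration=1200)

state = load_checkpoint(ckpt)   # restart from where training stopped
print(state["iteration"])       # 1200
```

The cost of serializing and deserializing this state is precisely the saving/resuming overhead investigated in the experiments below.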
Next, we investigate the overhead with different deep learning models.

EFFICIENT RESOURCE MANAGEMENT SCHEMES
In this chapter, we introduce two system-level resource schedulers that address the features of deep learning applications. These two schemes target the two subproblems of cluster resource management respectively: resource configuration on a single machine and cluster-wide resource configuration. Section 4.1 presents the system design of FlowCon, a containerized resource manager targeting the single-machine environment, and Section 4.2 introduces SpeCon, a cluster-wide resource scheduler.

The FlowCon System
In this section, we present the design of FlowCon in detail, including its architecture, system modules, and the optimization problem it solves.

FlowCon Architecture
In a typical cluster of containers, e.g., Docker Swarm or Kubernetes, there are multiple managers and workers in the system. Managers accept specifications from users and are responsible for reconciling the desired state with the actual cluster state; workers are responsible for running jobs. Although managers have a global view of the workers in the system, FlowCon runs on the worker side to avoid overwhelming the manager, which is responsible for collecting status information from all workers and assigning jobs to them. In our FlowCon system, managers only interact with the container pools on the workers, which store information about all running containers. With this design, the overhead of running FlowCon is distributed over the entire cluster.

FlowCon Modules
As demonstrated in Figure 4.1, FlowCon consists of three modules: a Container Monitor, a Worker Monitor and an Executor. Each module runs independently and exchanges information about the jobs inside the containers as well as the worker status. Their functionality is detailed below.

Container Monitor
FlowCon focuses on the containers that provide various machine learning services.

Executor
The Executor is a key module that collects and analyzes evaluation-function values and resource-usage data on the worker. At each interval, it calculates the required parameters using the data from the Container Monitor and executes an algorithm (described in chapter 5) to update the resource configuration of each container. Upon receiving a report from a listener, the Executor interrupts the current interval and runs the algorithm to update resource assignments based on the new state of the container pool.

FlowCon System Optimization Problem
FlowCon aims to improve resource efficiency. In current cloud systems, a resource is generally considered efficiently used as long as it is in use. However, for a system running various deep learning applications, the term "in use" fails to accurately reflect efficiency, as illustrated in Figure 3.1.
Based on the characteristics of deep learning applications, we introduce a new definition of efficiency based on an application's evaluation function. Given a system with a set of running containers, $\{c_{id}\}$, each container uses its own evaluation function $E_{c_{id}}(t)$ to assess its machine learning model (e.g., loss reduction or inception score). For each model, based on its $E(t)$, we define the progress score of container $c_{id}$ as

$$P_{c_{id}}(t_i) = \frac{E_{c_{id}}(t_{i-1}) - E_{c_{id}}(t_i)}{t_i - t_{i-1}}, \quad (4.1)$$

where $t_i - t_{i-1}$ is the measurement interval; the value $P_{c_{id}}(t_i)$ is called the per-second progress within the interval.
Here, $P_{c_{id}}(t_i)$ reflects the model's training progress over a given time interval. To account for the resources used towards that progress, we propose the growth efficiency of each container $c_{id}$ with an active deep learning job. $G_{c_{id},r_i}(t_i)$ in Eq. 4.2 represents growth efficiency with respect to different types of resources (e.g., CPU, memory, network I/O and block I/O), denoted by $r_i$:

$$G_{c_{id},r_i}(t_i) = \frac{P_{c_{id}}(t_i)}{U_{c_{id},r_i}(t_i)}, \quad (4.2)$$

where $U_{c_{id},r_i}(t_i)$ is a function that returns the average usage of resource $r_i$ by $c_{id}$ within the interval. FlowCon aims to maximize the sum of growth efficiency over the whole system in each interval, where each learning model has its own evaluation function that can be computed in real time. Assume that there are $n$ containers, each running one job in the system, and let $R_{i\max}$ denote the overall resource capacity for $r_i$. Then, our performance optimization problem can be formalized as

$$\mathcal{P}: \; \max \sum_{id=1}^{n} G_{c_{id},r_i}(t_i) \quad \text{s.t.} \; \sum_{id=1}^{n} U_{c_{id},r_i}(t_i) \le R_{i\max}.$$
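The two metrics can be computed directly from an evaluation-function log. This Python sketch assumes that progress is the loss reduction per second divided by average resource usage, as described above; the numbers are illustrative, not measurements from this thesis:

```python
# Sketch of FlowCon's metrics: per-second progress is the loss reduction
# over an interval, and growth efficiency divides that progress by the
# average usage of one resource r_i in the same interval.
# All numeric values below are illustrative assumptions.

def progress(e_prev, e_curr, t_prev, t_curr):
    """Per-second progress P within the interval [t_prev, t_curr]."""
    return (e_prev - e_curr) / (t_curr - t_prev)

def growth_efficiency(e_prev, e_curr, t_prev, t_curr, avg_usage):
    """Growth efficiency G with respect to one resource (e.g. CPU cores)."""
    return progress(e_prev, e_curr, t_prev, t_curr) / avg_usage

# Early-phase job: loss 0.90 -> 0.40 in 10 s on an average of 2.0 CPU cores.
g_fast = growth_efficiency(0.90, 0.40, 0, 10, avg_usage=2.0)
# Near-converged job: loss 0.050 -> 0.048 in 10 s on the same 2.0 cores.
g_slow = growth_efficiency(0.050, 0.048, 0, 10, avg_usage=2.0)

print(g_fast > g_slow)  # True: same resources, far less gain near convergence
```

The gap between `g_fast` and `g_slow` is what the optimization problem exploits: shifting resources toward high-G containers raises the system-wide sum.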

SpeCon System Design
Although FlowCon achieves significant performance improvement in the presence of various deep learning workloads, its system design focuses on a single-machine (or single-node) configuration. In a real industry environment, we need to deploy tasks on the cloud; thus, we further introduce SpeCon to take advantage of the cluster environment and achieve global optimization.
SpeCon is implemented on top of Kubernetes, a dominant container-orchestration platform designed to automate the deployment, scaling, and management of containerized applications. In this section, we present the system architecture of SpeCon in detail, including its key modules and their functionalities as well as design logic.

Kubernetes Framework
Generally, a cluster of containers comprises managers and workers. Managers are responsible for system management, such as resource allocation and failure handling. Workers are in charge of hosting containers and executing workloads. In a Kubernetes cluster of containers, a pod is a group of one or more containers which are hosted on the same worker and share the same lifecycle as well as storage resources.
Typically, there are six basic units in a Kubernetes cluster, as described below; the Kubelet and Service Proxy reside on the workers.
• API Server: It is the main management point of the entire cluster that processes REST operations, validates them, and updates the corresponding objects in storage.
• Controller Manager: It runs controllers, which are the background threads that handle routine tasks.
• Default Scheduler: It watches for newly created Pods that have no node assigned and is responsible for the placement of pods on workers.
• etcd: It is a distributed data storage solution that stores all the data, e.g. configuration, state, and metadata.
• Kubelet: It is responsible for maintaining a set of pods, which are composed of one or more containers, on a local system.
• Service Proxy: It maintains network rules on nodes, e.g., implementing a form of virtual IP for services.

SpeCon System Architecture
The four modules are as follows:
• Worker Monitor: It resides on the manager, which is responsible for overall system management. It monitors all workers in the cluster and keeps track of the number of running jobs and the resource availability on each of them. Moreover, it processes messages from workers and communicates with the SpeCon Scheduler for further actions.
• SpeCon Scheduler: It works on the manager side. When the Worker Monitor decides that a training job needs to be migrated to another worker node, the scheduler uses data gathered by the Worker Monitor to execute an algorithm that calculates a score for each worker, and then selects the most desirable node to host the migrated training job (details in chapter 5).
• SpeCon Messager: It generates heartbeat messages exchanged between managers and workers. For example, each worker periodically sends a message to notify the manager of the number of training jobs hosted on it and their categories. Additionally, whenever a job needs to be migrated, the worker uses this channel to notify the manager immediately.
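The scheduler's host selection can be sketched as scoring each worker from cluster-wide data and picking the best one. The feature names and weights below are hypothetical assumptions for illustration; the actual scoring algorithm is given in chapter 5:

```python
# Hypothetical sketch of weighted-score host selection: compute a score per
# worker from data gathered by the Worker Monitor and pick the most
# desirable host for a migrated job. Features and weights are assumptions.

def worker_score(num_jobs, cpu_free, mem_free,
                 w_jobs=0.5, w_cpu=0.3, w_mem=0.2):
    # Fewer running jobs and more free resources -> higher score.
    return -w_jobs * num_jobs + w_cpu * cpu_free + w_mem * mem_free

def select_host(workers):
    """workers: {name: (num_jobs, cpu_free_fraction, mem_free_fraction)}"""
    return max(workers, key=lambda name: worker_score(*workers[name]))

workers = {
    "worker-1": (4, 0.10, 0.20),   # busy node
    "worker-2": (1, 0.70, 0.60),   # lightly loaded node
}
print(select_host(workers))  # worker-2
```

Keeping the scoring on the manager side matches the design above: workers only report their state over the Messager channel, and the manager decides.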
The training jobs in SpeCon write their progress information into the log system. In particular, SpeCon stores the logs in a Persistent Volume (PV [54]), a piece of storage in the cluster with a lifecycle independent of any individual Pod that uses it. This provides stability to the SpeCon system.

Solution of FlowCon
In this section, we present the design of the elastic container configuration algorithms in FlowCon, which adjusts resource assignment for containers at run time.

Resource Configuration for Containers
In a traditional cluster of virtual machines (VMs), each VM is assigned a fixed amount of resources (e.g., CPU cores and memory) when the guest operating system is installed. While dedicating a VM to each job enables better isolation, a cluster of VMs fails to efficiently utilize resources for deep learning jobs, given their characteristics discussed in Chapter 4.
In a cluster of containers, system administrators have the option to create, configure, and reconfigure containers in real time. If containers are started without a specific resource limit, they compete for resources at runtime just like processes in an operating system. However, the resource plan can be updated at any time after initialization. For example, the command docker update <options> container_id can reset the resource limits as desired. Sample options include --cpus for the number of cores, --cpu-rt-runtime for the CPU real-time runtime in microseconds, --memory for memory usage in MB, --blkio-weight for a relative weight of block I/O, etc. Note that limits set by the docker update command are soft limits: when a container does not fully utilize its allocated resources, the unused portion can be utilized by others.
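A controller can construct such a docker update invocation programmatically. This Python sketch only builds the argument list (executing it would require a running Docker daemon, e.g. via subprocess.run); the container name is hypothetical:

```python
# Sketch: build a `docker update` command that resets a container's soft
# resource limits at runtime. Flag names follow the Docker CLI; the
# container name "job42" is a hypothetical example.

def build_update_cmd(container_id, cpus=None, memory_mb=None, blkio_weight=None):
    cmd = ["docker", "update"]
    if cpus is not None:
        cmd += ["--cpus", str(cpus)]          # number of CPU cores
    if memory_mb is not None:
        cmd += ["--memory", f"{memory_mb}m"]  # memory limit in MB
    if blkio_weight is not None:
        cmd += ["--blkio-weight", str(blkio_weight)]  # relative block-I/O weight
    cmd.append(container_id)
    return cmd

print(build_update_cmd("job42", cpus=1.5, memory_mb=512))
# ['docker', 'update', '--cpus', '1.5', '--memory', '512m', 'job42']
```

In a live deployment, a scheduler like FlowCon would issue such commands each interval to enact its new resource plan.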

Resource Assignment in FlowCon
In a cluster of containers, the manager accepts jobs from users and selects a worker to host the containers. Containers compete for resources such as memory and CPU when they run on the same worker. By default, each container is assigned the same priority, resulting in uniform resource distribution among all containers on the worker. This sharing mechanism yields acceptable performance; however, as we have discussed, it fails to consider the characteristics of deep learning applications. In comparison, FlowCon utilizes a growth-efficiency based method, as presented in Algorithm 1, to dynamically update the resource allocation of each active container.
As shown in Line 1 of Algorithm 1, each W_i first receives the following parameters from its manager: time t, threshold α and algorithm interval itval. It then initializes three lists: CL, W L and N L. While growth efficiency remains high, FlowCon uses an exponential back-off scheme to double the value of itval in order to reduce the overhead of running the algorithm (Line 17). Once the growth efficiency falls below the preset threshold, FlowCon applies the following rules:
• Each container in the CL has its resource limit set based on its growth during the time interval.
  - If growth is exceedingly small, which is common after convergence, the resource limit is set to a lower bound to prevent abnormal behavior caused by limited resources (Line 22).
• The resource limits of containers in the W L remain unchanged (Line 24).
• Allocate more resources to containers in the N L (Line 26).
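The interval handling described above can be sketched as follows. The initial interval value and the exact reset condition are assumptions for illustration; Algorithm 1 defines the precise behavior:

```python
# Sketch of the exponential back-off on the scheduling interval: while
# growth efficiency stays above the threshold alpha, itval doubles to reduce
# scheduling overhead; a state change (or efficiency dropping below alpha)
# resets it to the initial value. All numeric values are assumptions.

INITIAL_ITVL = 5.0  # seconds (assumed initial interval)

def next_interval(itval, growth_efficiency, alpha, state_changed=False):
    if state_changed or growth_efficiency < alpha:
        return INITIAL_ITVL        # reset: rerun the algorithm promptly
    return itval * 2               # back off: fewer algorithm runs

itval = INITIAL_ITVL
for g in [0.9, 0.8, 0.7]:          # efficiency stays above alpha = 0.1
    itval = next_interval(itval, g, alpha=0.1)
print(itval)                       # 40.0 after three doublings

print(next_interval(itval, 0.05, alpha=0.1))  # 5.0: reset below threshold
```

Doubling keeps steady-state overhead logarithmic in elapsed time, while the reset guarantees a prompt reaction whenever the container pool changes.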

Listeners in FlowCon
The Container Monitor provides information that allows Algorithm 1 to dynamically allocate resources based on the growth efficiency of each container and to reduce scheduling overhead with an exponential back-off scheme. However, there is latency between the time a worker's state changes (e.g., a new container is initiated) and the point at which resources can be reallocated. To address this issue, FlowCon deploys lightweight background listeners to track container states in real time.
With the same set of parameters, Algorithm 2 presents the workflow of the listener on W_i. First, it initializes CL, W L, N L and itval, and uses i to record the number of listener iterations (Line 1). In the i-th iteration, it uses the function T(i) to fetch the total number of containers on W_i (Line 2). In every run after the first, the listener calculates the difference c between the two most recent iterations (Lines 3-4). If c > 0, there are c new containers active in the system, so the listener stops, and the algorithm finds the c_id of the new containers and adds them to the N L (Lines 5-7). In the meantime, it resets itval to its original value in order to break the exponential back-off scheme, then runs Algorithm 1 to update the resource allocation and increments the iteration number i (Lines 8-9). The case c < 0 indicates that some containers have completed their jobs. The algorithm then finds the relevant containers by their c_id, removes them from their associated category (N L, CL or W L) and releases their resources (Lines 10-15). Finally, it resets itval, runs Algorithm 1 and increments the iteration number i.
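The listener's count-diff logic can be sketched in a few lines. The container counts below are simulated, and the event labels are illustrative stand-ins for the NL insertions and resource releases performed by Algorithm 2:

```python
# Sketch of the listener's container-count diff: T(i) returns the total
# number of containers on worker W_i at iteration i. A positive diff means
# new containers joined (added to NL, itval reset); a negative diff means
# jobs finished (removed from their category, resources released).
# The simulated counts are illustrative assumptions.

def listener_events(counts):
    """Yield ('new', c) or ('done', c) for consecutive container counts."""
    events = []
    prev = counts[0]
    for total in counts[1:]:
        c = total - prev
        if c > 0:
            events.append(("new", c))      # c new containers -> add to NL
        elif c < 0:
            events.append(("done", -c))    # c containers finished -> release
        prev = total
    return events

print(listener_events([3, 3, 5, 4]))
# [('new', 2), ('done', 1)]
```

Because only a count comparison runs per iteration, the listener stays lightweight; the heavier Algorithm 1 runs only when the diff is nonzero.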

SpeCon Solution Design
As mentioned in chapter 4, SpeCon exercises global control over computing resources in a cluster rather than focusing on a single machine. SpeCon aims to improve system performance by migrating slow-growing jobs to other workers to release resources for the fast-growing ones, through (1) identifying slow-growing jobs; (2) selecting the most desirable host; and (3) rebalancing the workers. Table 5.1 summarizes the parameters and functions used in SpeCon.
• w_i ∈ W: the worker ID in the worker set
• c_i ∈ C: the container ID in the container set
• E_{w_i,c_j}(t): the value of the evaluation function of c_j on w_i at time t
• G_{w_i,c_j}(t): the growth of the evaluation function of c_j on w_i at time t
• R(w_i, t): the resource consumption of w_i at time t
• P C: the progressing category set
• W C: the watching category set
• CC: the converged category set
• α: the categorization threshold
• the scoring set: stores the weighted score of each worker
• bf: the balance factor for a uniform distribution

Identifying Slow-Growing Jobs
In SpeCon, we consider a cluster of containers hosted on workers. Inside the cluster, we have managers (denoted as M). In addition, we denote C as the set of containers, with c_i ∈ C, and W as the set of workers, with w_i ∈ W. We use c_m ∈ w_n to denote that a container c_m is running on worker w_n.
In our problem setting, training jobs run inside containers, where each container hosts one specific job. Consequently, each container c_i can be seen as one particular training job in our setting. As analyzed in the previous sections, each job has a predefined evaluation function, and during the whole training process the values of this function form a time series. When queried in the middle of an iteration, the previous value is returned. Based on the query time, Equation 5.1 presents the growth of a training job in a given interval, $t_2 - t_1$:

$$G_{w_i,c_j}(t_2) = \frac{E_{w_i,c_j}(t_1) - E_{w_i,c_j}(t_2)}{t_2 - t_1}, \quad (5.1)$$
where c j is a container that runs a training job on w i with E w i ,c j (t) as the evaluation function.
According to G_{w_i,c_j}(t), SpeCon classifies each container c_j into three categories.
• Progressing Containers (P C): The jobs inside these containers are still in the fast growing stage such that their evaluation function is progressing quickly.
• Watching Containers (WC): When a training job joins the WC category, the gain of its evaluation function is slowing down. In this zone, the gain starts jittering and, depending on the particular model, the job may speed up again.
• Converged Containers (CC): If training jobs in WC continue slowing down, they are inserted into the Converged Containers category, which suggests that their evaluation functions are slow-growing and unlikely to speed up again.
Based on the above analysis, we develop Algorithm 3 in SpeCon to keep track of the progress of training jobs on each worker. Note that, in order to avoid unnecessary network traffic and maintain scalability, the containers in SpeCon are reallocated at most once.
As shown in Line 1, Algorithm 3 on each worker first fetches key parameters from the system.
The categorization interval (t m − t m−1 ) determines how frequently a worker checks containers' log files, which store return values of evaluation functions. The categorization threshold, α, is a percentage value that SpeCon uses to decide whether the progress is growing quickly. Then, in Line 2, the algorithm initializes parameters such that all containers are inserted into P C when they join a worker and their migration indicator is set to false at the beginning.
At time t_m, it reads the current value of the evaluation function and calculates the gain over the previous categorization interval, t_m − t_{m−1} (Lines 3-5). If the gain is less than both the gain of the previous round and the threshold α, SpeCon updates the category of this container according to the following conditions (Lines 6-16).
• If the container is currently in category PC (not WC or CC), it is removed from PC and inserted into WC. This means the job has started slowing down but is not yet stable.
• If the container is currently in category WC (not PC or CC), it is removed from WC and inserted into CC. This indicates that the job has started its convergence process.
• If the container is currently in category CC (not PC or WC), the gain during the interval is still decreasing and the job has converged. In this case, the container remains in the CC category and the worker calls the SpeCon messenger to send a migration request for (c_j, w_i) to the manager.
In the scenario where the current gain is less than α but larger than the previous gain, the container stays in its current category. This is because, as a training job proceeds, the model randomly selects and updates parameters at each iteration, which leads to an unstable trend (bouncing growth values). In this stage, the container remains in the same category and waits for the next round (Lines 17-18).
When the gain becomes larger than the threshold, we reset the container's category to PC, which accommodates a sudden change of the evaluation function and possible errors from the previous rounds (Lines 19-22).
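The category transitions of Algorithm 3 can be sketched as a small state machine. This is a simplified sketch under stated assumptions: the names are illustrative, and the gain values are assumed to be computed per Equation 5.1.

```python
PC, WC, CC = "progressing", "watching", "converged"

def categorize(category, gain, prev_gain, alpha):
    """One categorization round for a single container (sketch of Algorithm 3).

    Returns (new_category, request_migration). A container demotes one step
    (PC -> WC -> CC) only when its gain dropped below both the previous
    round's gain and the threshold alpha; a gain above alpha resets it to PC.
    """
    if gain >= alpha:
        return PC, False          # sudden progress: reset to progressing
    if gain >= prev_gain:
        return category, False    # jittering: wait for the next round
    if category == PC:
        return WC, False          # starting to slow down, not yet stable
    if category == WC:
        return CC, False          # convergence process has started
    return CC, True               # still converged: ask the manager to migrate
```

Note the ordering of the checks: the reset-to-PC rule takes precedence, then the jitter rule, and only then the one-step demotion, matching the three conditions described above.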

Scheduling on Slow Moving Jobs
The container monitor that runs on each worker keeps tracking the evaluation functions and collects data that is stored in a persistent volume. Whenever the manager receives a request from a worker, it responds to the reallocation message, e.g., (w_i, c_j), by executing Algorithm 4. The main objective of Algorithm 4 is to select the most desirable worker node to host this container, c_j.
SpeCon utilizes a weighted scoring algorithm to rank worker nodes. As shown in Line 1 of Algorithm 4, it first initializes parameters, including a candidate set (T) used to store target workers, which is initially empty. Additionally, SpeCon maintains predefined weights for each category as well as resource-consumption functions.
For each worker node in the cluster, it calculates a score based on the number of containers in each category and the category weights. SpeCon aims to improve efficiency by allocating more resources to fast-growing jobs and limiting resources for slow-growing ones. With this objective in mind, the weights satisfy w_pc > w_wc > w_cc, giving priority to progressing containers (Lines 2-3).
Given the scores of the workers, the algorithm identifies the workers with the minimum score.
Those workers are inserted into the candidate set; if multiple workers share the minimum score, there is more than one candidate (Lines 4-7). The algorithm then checks whether w_i, the current host of c_j, is in the set, and executes the following logic.
• If w_i is in the candidate set, worker w_i is returned, which means that container c_j should continue running on w_i. In this case, the algorithm marks the container as migrated (Lines 8-10). This avoids unnecessary overhead caused by migration.
• If w i is not in the candidate set, the algorithm takes two branches.
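The weighted scoring and candidate selection described above can be sketched as follows. The weight values (2, 1.5, 1) are those used in the evaluation setup; breaking ties by the lowest resource consumption R(w_j, t) follows Algorithm 4, but the data layout is an assumption of this sketch.

```python
def select_worker(current_host, workers, w_pc=2.0, w_wc=1.5, w_cc=1.0):
    """Pick a destination worker for a slow-growing container (sketch of
    Algorithm 4).

    `workers` maps a worker id to the number of containers it hosts in each
    category plus its current resource consumption, e.g.
    {"w1": {"pc": 2, "wc": 1, "cc": 0, "load": 0.6}, ...}.
    """
    # Score each worker; progressing containers weigh the most.
    scores = {w: w_pc * s["pc"] + w_wc * s["wc"] + w_cc * s["cc"]
              for w, s in workers.items()}
    best = min(scores.values())
    candidates = [w for w, sc in scores.items() if sc == best]
    if current_host in candidates:
        return current_host  # stay put: avoids migration overhead
    # Otherwise migrate to the candidate with the lowest resource consumption.
    return min(candidates, key=lambda w: workers[w]["load"])
```

Because progressing containers carry the largest weight, a worker crowded with fast-growing jobs scores high and is avoided, steering converged containers toward workers whose jobs are already slowing down.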

Rebalancing Workloads in the Cluster
Together with Algorithms 3 and 4, SpeCon distributes the workload based on real-time returns from the containers' evaluation functions as well as the current resource consumption on the workers. At this stage, however, it could make inaccurate decisions due to missing information about finishing times. For instance, if ∀c_i ∈ C, c_i.migrated = True, all jobs have converged and the workload distribution in the cluster is fixed, which could result in an imbalanced cluster. As an intuitive example, suppose all 4 jobs have converged in a 2-node cluster, where W_1 hosts c_1 and c_2 and W_2 contains c_3 and c_4. If c_1 and c_2 complete before c_3 and c_4, W_1 runs without any workload.
In a real cloud environment, it is challenging and costly to obtain an accurate finish time due to varying implementations, dynamic workloads and resource competition. In our solution, SpeCon incorporates the converged duration, the length of time between when a container is marked as converged and the current time, to compare active containers and redistribute the workload in the cluster using Algorithm 5. The algorithm prepares the required parameters in Line 1. Then, it calculates the sum of active jobs on workers across the cluster and computes the converged duration for each container c_j on each worker w_i. Each converged duration, d_{c_j}, is inserted into a set D (Lines 2-6). If there are no active jobs on a particular worker w_i, its id is inserted into the candidate set T, which stores the workers that can take more workload (Lines 7-8). Given the sum and |W|, the balance factor, bf, is obtained in Line 9; its value is based on a uniform distribution.
Depending on the number of workers in the candidate set T , Algorithm 5 takes the following two branches.
• If T is nonempty, at least one worker is idle. For each idle worker w_i in T, the algorithm assigns the container c_j with the smallest d_{c_j} to w_i, provided the number of active containers on w_i is less than bf, and marks c_j as rebalanced (Lines 10-18).
• If T is empty, every worker runs some active containers. Then, SpeCon enumerates all workers and finds those that host fewer jobs than bf − 1, which indicates that they can hold at least one more job. These workers are inserted into T (Lines 19-22). Then, the algorithm finds the container c_j with the smallest d_{c_j} that runs on a worker w_j ∉ T (Lines 23-25); this avoids assigning c_j to its current host. With a nonempty T, SpeCon assigns the previously found c_j to a w_i that has room for an additional container and marks this container as rebalanced (Lines 26-31).
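Under the stated assumptions (converged durations d_{c_j} are known and bf = total active jobs / |W|), the two branches of Algorithm 5 can be sketched as one rebalancing pass. The data structures are illustrative, not SpeCon's implementation.

```python
def rebalance(placement, durations):
    """One rebalancing pass (sketch of Algorithm 5).

    `placement` maps worker -> list of active container ids; `durations`
    maps container -> converged duration d_c.  Returns one move as
    (container, source, destination), or None if no move is warranted.
    """
    total = sum(len(cs) for cs in placement.values())
    bf = total / len(placement)  # balance factor: uniform distribution
    # Branch 1: some worker is completely idle.
    idle = [w for w, cs in placement.items() if not cs]
    # Branch 2: otherwise, workers hosting fewer than bf - 1 jobs can take one more.
    targets = idle or [w for w, cs in placement.items() if len(cs) < bf - 1]
    if not targets:
        return None
    dest = targets[0]
    # Pick the container with the smallest converged duration that does NOT
    # already run on a target worker (avoids moving c_j to its current host).
    movable = [(durations[c], c, w) for w, cs in placement.items()
               if w not in targets for c in cs]
    if not movable:
        return None
    _, c, src = min(movable)
    return c, src, dest
```

In the two-node example above, once c_1 and c_2 finish, W_1 becomes idle, so branch 1 fires and whichever of c_3, c_4 has the smallest converged duration is moved to W_1.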

SYSTEM EVALUATION
In this chapter, we evaluate FlowCon and SpeCon separately with large-scale experiments and present the evaluation results.

Evaluation of FlowCon
In this section, we evaluate the performance of FlowCon through a set of experiments carried out in the cloud.

Experiment Setup and Evaluation Metrics
There are two key parameters in FlowCon: (1) α, the threshold for classifying jobs into NL, WL and CL; and (2) itval, the interval for running Algorithm 1. We evaluate the performance of FlowCon with different parameter configurations and compare it with the original Docker system (denoted as NA in this subsection) using the following three scenarios:
• Fixed scheduling: the time to launch a job is controlled by the administrator.
• Random scheduling: the launch times are randomized to simulate random submissions of jobs by users in a real cluster.
• Scalability: we evaluate FlowCon with an increased number of learning jobs.
The following three metrics are considered in our experiments.
• Overall makespan: the total length of the schedule for all the jobs in the system.
• Individual job completion time: the completion time of each individual job in the system.
• CPU usage: since all of our tested deep learning models are computation-intensive jobs, we focus on analyzing CPU usage to better understand FlowCon.
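The first two metrics can be computed directly from per-job start and finish timestamps. A minimal sketch (the bookkeeping format is an assumption, not FlowCon's actual logging):

```python
def makespan(jobs):
    """Overall makespan: span from the earliest start to the latest finish.

    `jobs` maps a job name to a (start, finish) pair in seconds.
    """
    starts = [s for s, _ in jobs.values()]
    finishes = [f for _, f in jobs.values()]
    return max(finishes) - min(starts)

def completion_times(jobs):
    """Individual completion time of each job: finish minus start."""
    return {name: f - s for name, (s, f) in jobs.items()}
```

Note the distinction the metrics draw: the makespan measures the whole schedule, so one long-running job can dominate it even when most individual completion times improve.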

Fixed Scheduling Results
We fix the schedule of three jobs: VAE on Pytorch starts at 0s, MNIST on Pytorch begins at 40s, and MNIST on Tensorflow launches at 80s. We test our system with different input parameters to understand its performance in various settings.
Makespan: It can be observed that different itval values have a small effect on the makespan (dominated by VAE), and FlowCon improves the makespan by 1% to 5% compared to NA. This is particularly evident when α = 5%, as the makespans are 386.1s, 372.4s, 384.8s, 389.0s, 388.1s and 394.0s, respectively. The reason lies in the fact that although FlowCon constrains the resource limits of some jobs, thus slowing them down, the jobs that were allocated more resources finish more quickly, reducing the overlap between jobs.
Individual: Individual completion times are summarized in Table 6.2, comparing FlowCon with NA. We can see that FlowCon performs better than NA in all the parameter settings. When α = 10% with itval = 60, the performance improvement is the smallest, only 3.1%. This is because the value of itval is large, and the algorithms need more time to adjust the resource plan for jobs with a large interval. When we fix the value of itval to 20 and vary the value of α, the time reduction generally decreases as α increases. The explanation is that jobs stay longer in NL for α = 1%, causing the algorithm to make updates more frequently. For α = 15%, jobs stay longer in CL, in which the limits are set to 1, and running tasks compete for resources freely.
CPU usage: Figure 6.5 and Figure 6.6 illustrate the detailed CPU usage of FlowCon (α = 5% with itval = 20) and NA in the presence of the three jobs, respectively. The results in Figure 6.6 verify that, without any configuration (NA), the system distributes CPU resources equally among active jobs. For example, from 40s to 80s and from 180s to 280s, the CPU usages of VAE (Pytorch) and MNIST (Pytorch) are approximately equivalent. In comparison, Figure 6.5 shows that FlowCon can dynamically set the upper resource limit for each job (the resource usage also reflects each job's growth efficiency). Specifically, when MNIST (Pytorch) is launched at 40s, FlowCon takes two actions: (1) it sets VAE's (Pytorch) resource limit to 0.25 since it is growing slowly, and (2) it sets MNIST's (Pytorch) resource limit to 1, allowing it maximum resources. In this case, VAE (Pytorch) receives 25% while MNIST (Pytorch) uses 75% of the total resources.
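The 25%/75% split above follows from treating each job's limit as a cap rather than a fixed share. A sketch of that accounting (an illustrative model only; FlowCon's actual enforcement is delegated to the container runtime):

```python
def effective_shares(limits):
    """Distribute one unit of CPU among jobs whose limits act as caps.

    Jobs capped below 1 keep at most their cap; the leftover is split
    evenly among the uncapped (limit == 1) jobs.  Assumes every job
    demands as much CPU as it is allowed.
    """
    capped = {j: l for j, l in limits.items() if l < 1.0}
    uncapped = [j for j, l in limits.items() if l >= 1.0]
    leftover = 1.0 - sum(capped.values())
    shares = dict(capped)
    if uncapped:
        for j in uncapped:
            shares[j] = leftover / len(uncapped)
    return shares
```

With limits {VAE: 0.25, MNIST: 1.0}, this model reproduces the 25%/75% split observed in Figure 6.5.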

Random Scheduling Results
For the random scheduling case, we use five different deep learning models, LSTM-CFC, VAE, VAET, MNIST, and GRU, in our experiments. We randomly select a starting time point from 0s to 200s to submit each training job, and the corresponding jobs are marked as Job-1, Job-2, Job-3, Job-4, and Job-5, respectively, in the following results. Makespan: Figure 6.7 shows the system makespan results. Similar to the fixed scheduling case, the results demonstrate, once again, that FlowCon improves the overall makespan by 1% to 5%. Given the same resource availability, FlowCon achieves the makespan reduction by reducing the overlap between jobs.
Individual: Considering the completion time of each individual job in Figure 6.7, we can observe that FlowCon reduces the completion time of 4, 5, 4, and 4 out of the 5 learning jobs for the cases α = 3%, itval = 30; α = 3%, itval = 60; α = 5%, itval = 30; and α = 5%, itval = 60, respectively. The biggest loss happens at the fourth job (denoted as Job-4; others are named similarly) with α = 5% and itval = 60. There, Job-4 completes in 472.4s, which is 11.80% slower than NA, whose completion time is 422.5s. The reason is that although the resource allocation to Job-4 greatly decreases when Job-5 begins, the interval of itval = 60 prevents FlowCon from immediately reassigning the resources (vs. 489.4s). Therefore, a smaller value of itval allows FlowCon to reassign the resources more quickly.
CPU usage: The random schedule with a larger number of jobs produces more challenges for resource assignments. Figure 6.8 and Figure 6.9 present the CPU usage of FlowCon with α = 3% and itval = 30 and N A in a system of 5 randomly submitted jobs. Unlike Figure 6.6, the resource usage illustrated in Figure 6.9 is not equally distributed. For example, from 50s to 80s and from 650s to 730s, the first and the second job are active and use 19% and 79% of the resources, respectively.
The reason is that: although the first job (LSTM-CFC) is running alone, it does not maximize the CPU usage (e.g., from 0s to 50s as in Figure 6.9). Generally, when a container cannot maximize its resource limit, a portion of the resources may be wasted depending on other jobs in the system.
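The wasted-resource effect described above can be made concrete with a simple model in which each job is bounded by both its soft limit and its actual demand. This is a simplified sketch, not FlowCon's accounting:

```python
def cluster_usage(limits, demands):
    """Per-job CPU usage when each job is bounded by both its soft limit
    and its actual demand (fractions of one machine's CPU).

    Returns (usage dict, wasted fraction of the machine's CPU).
    """
    usage = {j: min(limits[j], demands[j]) for j in limits}
    wasted = max(0.0, 1.0 - sum(usage.values()))
    return usage, wasted
```

For instance, a job running alone with limit 1.0 but an intrinsic demand of only 0.6 leaves 40% of the CPU idle, which matches the behavior of LSTM-CFC from 0s to 50s in Figure 6.9.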

Scalability of FlowCon
In this subsection, we conduct experiments to evaluate the performance of FlowCon at a larger scale. As we can see from Figure 6.11, FlowCon gains a lot in growth efficiency at the very beginning. The reason lies in the fact that, in FlowCon, Job-2 does not need to compete for resources freely due to the upper resource limit that applies to every job. The more resources are allocated to it, the faster it grows. Even when Job-3 joins the system, the resources for Job-2 are not reduced by much since it is still in WL. In comparison, in NA, every job has the same priority and jobs compete for resources whenever they become available. When Job-2 converges in FlowCon, more resources are moved to newer jobs due to a smaller limit value, which results in a loss compared to NA at the time point 320s. We find a different trend in Figure 6.12. There, in the first 2 iterations, Job-6 in FlowCon records slightly lower values of growth efficiency. This is because FlowCon needs more time to update the configuration when there are 5 active jobs in the system.
From Figure 6.14, we can see the jitter with NA. The jitter is a result of uncontrolled resource competition: whenever there is an idle slot, the system allocates resources to the first job in the queue. FlowCon also produces jitter; however, the resource usage of each container is much smoother by comparison. This is because FlowCon imposes a soft upper resource limit on the containers, which reduces the room for free competition. The majority of jitter in FlowCon occurs in the interval between 0s and 200s, during which jobs are submitted to the system randomly. After a container joins the system, the resource assignment for each container is updated to reflect the change in system status.

15-Job Experiments
We further increase the number of jobs to 15. Again, jobs are randomly submitted to the system during the interval 0s to 200s. As the number of concurrent jobs increases, so does the degree of competition for resources. The results are presented in Figure 6.15. There, we find the same trend as before: a few jobs' completion times increase, but these increments are quite small, e.g., Job-5's completion time increases the most, by only 5.7%. In comparison, for the other 11 jobs, the degree of reduction ranges from 1.2% to 11.9%, and the largest reduction occurs in Job-10, from 308.1s to 271.4s.

Evaluation of SpeCon
In this section, we evaluate the performance of SpeCon through extensive experiments executed in the cloud.

Experimental Setup
Implementation, Testbed and Workloads
SpeCon is integrated into Kubernetes 1.17 as plug-in modules that reside on both managers and workers. It receives tasks from the manager, directs the given tasks to workers, monitors the evaluation functions, and balances the workloads.
We build a testbed on an NSF Cloudlab [55] datacenter hosted at the University of Utah.
Specifically, we use multiple M510 machines as our physical nodes, each containing two 8-core Intel Xeon D-1548 processors at 2.0 GHz, 64GB of ECC memory, and 256GB of NVMe flash storage. The following two clusters have been built to evaluate SpeCon.
The workloads are implemented with Pytorch (P) [53] and Tensorflow (T) [52]. As shown in Table 6.3, we choose 5 different models on these two platforms as our workloads, which execute inside containers.

Evaluation Metrics
The training processes of deep learning applications are computationally intensive. They are more sensitive to CPU powers than memory spaces and network bandwidth. The following evaluation metrics are considered in our experiments.
• Completion time: The completion time of each job in the cluster.
• Average completion time: The average completion time among the jobs in the cluster.
• Makespan: The total length of the schedule for all the jobs in the cluster.
When multiple models are training on the same worker, the jobs create a highly dynamic computing environment in terms of resource competition. Therefore, we design the following submission schedules to ensure a comprehensive evaluation.
• Fixed Schedule: The time to launch containers follows a fixed interval. It simulates an administrator controlled cloud environment.
• Random Schedule: The time to launch a job is randomly selected within an interval. It simulates a user-specified, first-come-first-serve cloud environment.
As for the weight of each category utilized by Algorithm 4, we use w_pc = 2, w_wc = 1.5, and w_cc = 1. Additionally, we compare SpeCon with the default scheduler in Kubernetes, denoted as DS in the rest of this subsection.

Fixed Schedule
First, we conduct experiments with a fixed schedule such that containers with the same model image are submitted to the system one by one at a fixed interval. In this experiment, we use the VAE model and a 50-second interval. Furthermore, we set α = 0.01 and |t_m − t_{m−1}| = 30 for the SpeCon algorithms. As shown in Fig. 6.16, SpeCon outperforms DS in terms of completion time. The largest gain is found in Job-17, whose completion time is reduced from 348.7s to 217.6s, a 37.6% reduction.
There are 11 jobs out of 20 that see a reduction in completion time. With SpeCon, the containers can be migrated according to their training progress and the system-wide workload distribution. In Fig. 6.22, we mark a job that is migrated due to its slow-growing progress as "Job-ID-M", while "Job-ID-R" indicates that the job is migrated as part of system-wide rebalancing. For example, Job-1 was originally running on Worker-4 and was added to the CC category at 155.9s by Algorithm 3. Then, Job-1 was passed to Algorithm 4.
Finally, we conduct an experiment with a larger cluster: 1 manager and 8 workers. More workers give SpeCon more options, which leads to challenges when reallocating containers.