Job Scheduling in the Cloud: A Tale on Fairness and Efficiency

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Job Scheduling in the Cloud: A Tale on Fairness and Efficiency"

By

Mr. Chen CHEN


Abstract

With the rapid growth of data volume and application complexity, it has become 
common to host large-scale computations in clusters of distributed servers. 
In shared production clusters, job scheduling is of paramount importance to 
cluster performance. The two basic scheduling objectives are efficiency and 
fairness: an ideal scheduler should deliver fast job response while avoiding 
starvation by guaranteeing a worst-case service quality to each job.

For inter-job scheduling, efficiency and fairness conflict with each other, 
leading to a dilemma: either predictable performance at the expense of long 
response times, or minimum mean response time at the risk of starvation. It is 
therefore critical to develop resource scheduling strategies that do well on 
both fronts. In this regard, we make the following contributions.

First, we present Cluster Fair Queuing (CFQ), a scheduling mechanism that 
minimizes the mean job response time while ensuring predictable performance. It 
works by preferentially offering resources to the jobs that finish earliest 
under an idealized fair sharing policy. Second, we reveal that service isolation 
is crucial for both fairness and efficiency, yet is not guaranteed even when 
jobs are assigned high priorities. We identify the underlying reasons and 
propose Speculative Slot Reservation to achieve service isolation, which 
reserves slots if and only if doing so is appropriate given a job's internal 
dependencies. Third, we observe that the marginal benefit of additional 
resources varies significantly across jobs, and propose Performance-Aware Fair 
(PAF) scheduling, which reallocates certain resources for better overall 
efficiency while ensuring near-optimal fairness.
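
The CFQ idea of ordering jobs by their finish time under idealized fair sharing 
can be illustrated with a toy sketch. It is not the mechanism from the thesis: 
it assumes a single resource, jobs that all arrive at time 0, known job sizes, 
and equal shares, and the function names are hypothetical.

```python
def fair_finish_times(sizes, capacity=1.0):
    """Finish time of each job if all jobs share one resource equally.

    Assumption: every job arrives at time 0 and its size is known.
    """
    finish = {}
    now = 0.0
    served = 0.0            # work each still-active job has received so far
    active = len(sizes)
    # Under equal sharing, smaller jobs always finish first.
    for job, size in sorted(sizes.items(), key=lambda kv: kv[1]):
        # Each active job progresses at rate capacity / active.
        now += (size - served) * active / capacity
        served = size
        finish[job] = now
        active -= 1
    return finish


def cfq_like_schedule(sizes, capacity=1.0):
    """Run jobs one at a time, earliest idealized fair finish time first."""
    fair = fair_finish_times(sizes, capacity)
    order = sorted(sizes, key=lambda job: fair[job])
    actual = {}
    now = 0.0
    for job in order:
        now += sizes[job] / capacity   # the job gets the whole resource
        actual[job] = now
    return order, fair, actual


order, fair, actual = cfq_like_schedule({"a": 3.0, "b": 1.0, "c": 2.0})
print(order)    # jobs in the order they are served
print(fair)     # finish times under idealized fair sharing
print(actual)   # finish times under the prioritized schedule
```

In this toy setting no job finishes later than it would under fair sharing, 
while the shorter jobs finish much earlier, which lowers the mean response time.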

For intra-job scheduling, however, fairness in allocating workloads across 
distributed workers, i.e., load balancing, can help improve efficiency. We 
apply that insight to distributed deep learning applications, which can suffer 
significant performance degradation when running in heterogeneous clusters. 
Specifically, we propose a new worker-coordination scheme, called Load-balanced 
Bulk Synchronous Parallel (LB-BSP), that adaptively adjusts workers' loads 
based on how fast each worker progresses to achieve fast distributed deep 
learning.
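
As a rough illustration of load balancing across heterogeneous workers, the 
sketch below splits a global batch in proportion to each worker's measured 
speed, so that all workers reach the synchronization barrier at about the same 
time. Treating a worker's load as its batch size is an assumption made here for 
illustration, and the function name and the speed figures are hypothetical.

```python
def balance_batches(speeds, global_batch):
    """Split global_batch across workers in proportion to their speed.

    speeds: worker name -> measured throughput in samples per second
            (assumed to be known; a real system would have to estimate it).
    """
    total = sum(speeds.values())
    ideal = {w: global_batch * s / total for w, s in speeds.items()}
    batches = {w: int(x) for w, x in ideal.items()}   # round down first
    leftover = global_batch - sum(batches.values())
    # Hand the remaining samples to the workers with the largest fractions.
    by_fraction = sorted(ideal, key=lambda w: ideal[w] - batches[w],
                         reverse=True)
    for w in by_fraction[:leftover]:
        batches[w] += 1
    return batches


speeds = {"fast": 300.0, "mid": 200.0, "slow": 100.0}
batches = balance_batches(speeds, 120)
# Per-iteration time is set by the last worker to finish its batch.
balanced_time = max(batches[w] / speeds[w] for w in speeds)
equal_time = max(40 / speeds[w] for w in speeds)   # 120 split evenly
print(batches, balanced_time, equal_time)
```

With an even split the slowest worker holds everyone back at the barrier; with 
the proportional split the iteration time in this example is halved.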


Date:			Wednesday, 18 July 2018

Time:			3:00pm - 5:00pm

Venue:			Room 3494
 			Lifts 25/26

Chairman:		Prof. Ming Yi Hung (ACCT)

Committee Members:	Prof. Bo Li (Supervisor)
 			Prof. Wei Wang (Supervisor)
 			Prof. Pan Hui
 			Prof. Qian Zhang
 			Prof. Jiang Xu (ECE)
 			Prof. Cong Wang (Computer Science, CityU)


**** ALL are Welcome ****