Job Scheduling in the Cloud: A Tale on Fairness and Efficiency
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Job Scheduling in the Cloud: A Tale on Fairness and Efficiency"

By

Mr. Chen CHEN

Abstract

With the rapid growth of data volume and application complexity, it has become prevalent to host large-scale computations in clusters of distributed servers. In shared production clusters, job scheduling is of paramount importance to cluster performance. The two basic scheduling objectives are efficiency and fairness: an ideal scheduler should deliver fast job response while avoiding starvation by guaranteeing a worst-case service quality to each job.

For inter-job scheduling, efficiency and fairness conflict with each other, leading to a dilemma: either predictable performance at the expense of long response times, or minimum mean response time at the risk of starvation. It is therefore critical to develop resource scheduling strategies that do well on both counts. In this regard, we make the following contributions.

First, we present Cluster Fair Queuing (CFQ), a scheduling mechanism that minimizes the mean job response time while ensuring predictable performance. It works by preferentially offering resources to the jobs that would finish earliest under an idealized fair sharing policy.

Second, we reveal that service isolation is crucial for both fairness and efficiency, yet it is not guaranteed even when jobs are assigned high priorities. We identify the underlying reasons and propose Speculative Slot Reservation to achieve service isolation; it reserves slots only when doing so is appropriate given the job's internal dependencies.

Third, we observe that the marginal benefit of additional resources varies significantly across jobs, and we propose Performance-Aware Fair (PAF) scheduling, which reallocates certain resources for better overall efficiency while ensuring near-optimal fairness.
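As a rough illustration of the ordering idea behind CFQ described above, the following minimal Python sketch computes when each job would finish under an idealized fair sharing policy and then ranks jobs by that finish time. It is not the thesis's implementation. The function names, the fluid equal-share model, and the assumption that all jobs arrive at time zero are simplifications introduced here for illustration only.

```python
def fair_share_finish_times(work, capacity):
    """Finish time of each job under idealized (fluid) fair sharing.

    Assumptions made for this sketch: every job arrives at time 0, and
    the cluster capacity is split equally among the jobs still running.
    `work` maps a job name to its total work; `capacity` is the work
    the whole cluster completes per unit of time.
    """
    order = sorted(work, key=work.get)  # smaller jobs drain first
    finish = {}
    now = 0.0
    served = 0.0  # work already completed by every still-active job
    for index, job in enumerate(order):
        active = len(order) - index
        # Each active job progresses at rate capacity / active.
        now += (work[job] - served) * active / capacity
        served = work[job]
        finish[job] = now
    return finish


def priority_order(work, capacity):
    """Rank jobs so that the earliest fair-share finisher comes first."""
    finish = fair_share_finish_times(work, capacity)
    return sorted(finish, key=finish.get)


if __name__ == "__main__":
    jobs = {"a": 6, "b": 2, "c": 4}  # hypothetical job sizes
    print(fair_share_finish_times(jobs, capacity=2))
    print(priority_order(jobs, capacity=2))
```

With simultaneous arrivals this ranking coincides with smallest-job-first; staggered arrival times, which this sketch does not model, would make the two orderings differ.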
For intra-job scheduling, however, fairness in allocating workloads across distributed workers, i.e., load balancing, can help improve efficiency. We apply this insight to distributed deep learning applications, which can suffer significant performance degradation when running in heterogeneous clusters. Specifically, we propose a new worker-coordination scheme, called Load-balanced Bulk Synchronous Parallel (LB-BSP), that adaptively adjusts workers' loads based on their processing capabilities to achieve fast distributed deep learning.

Date: Wednesday, 18 July 2018

Time: 3:00pm - 5:00pm

Venue: Room 3494, Lifts 25/26

Chairman: Prof. Ming Yi Hung (ACCT)

Committee Members:
Prof. Bo Li (Supervisor)
Prof. Wei Wang (Supervisor)
Prof. Pan Hui
Prof. Qian Zhang
Prof. Jiang Xu (ECE)
Prof. Cong Wang (Computer Science, CityU)

**** ALL are Welcome ****