Cluster Resource Scheduling: A Tale on Fairness and Efficiency
PhD Thesis Proposal Defence

Title: "Cluster Resource Scheduling: A Tale on Fairness and Efficiency"

by

Mr. Chen CHEN

Abstract:

With the burst of large-scale data analytics, it has become prevalent to run data-parallel jobs in clusters of distributed servers. For shared production clusters that host a variety of workloads, the resource scheduler is of paramount importance to cluster performance. A scheduler has two basic objectives, efficiency and fairness: an ideal scheduler should provide optimal mean job response time while guaranteeing predictable job performance by avoiding starvation. In practice, however, performance optimality and predictability conflict with each other under today's cluster schedulers, leading to a dilemma: either obtain predictable performance at the expense of long response times, or run the risk of starving some jobs to achieve minimal mean response time. It is therefore critical to develop scheduling strategies that do well in both worlds, which is the focus of this thesis. Specifically, we make the following three key contributions.

First, we present Cluster Fair Queuing (CFQ), a cluster resource scheduling mechanism that minimizes mean job response time while achieving predictable performance. Noting the inefficiency of instantaneous fair sharing, CFQ works by preferentially offering resources to the jobs that finish earliest under an idealized fair sharing policy.

Second, we reveal that service isolation for dependent data-parallel jobs is crucial for both fairness and efficiency, yet it is not guaranteed in scenarios with fine-grained resource sharing, even when the jobs are assigned high priorities. We identify the underlying reasons and propose Speculative Slot Reservation to achieve service isolation, by judiciously reserving slots for the downstream computations of jobs with internal dependencies.

Third, we observe an interesting property of data-parallel jobs, Demand Elasticity: some jobs can run with far fewer resources than they ideally need without noticeable performance degradation. We then propose Performance-Aware Fair (PAF) scheduling to exploit this property, speeding up overall job performance while ensuring near-optimal fairness. PAF works by iteratively transferring resources from certain jobs to others that can utilize those resources more efficiently, as long as the former jobs are not penalized beyond a small threshold.

Date: Tuesday, 24 April 2018
Time: 3:00pm - 5:00pm
Venue: Room 4475 (lifts 25/26)

Committee Members:
Prof. Bo Li (Supervisor)
Dr. Wei Wang (Supervisor)
Prof. Lei Chen (Chairperson)
Dr. Ke Yi

**** ALL are Welcome ****
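
To illustrate the CFQ selection rule described in the abstract, below is a minimal sketch of preferring the job that would finish earliest under an idealized (processor-sharing) fair policy. The names (Job, fair_share_finish_times, cfq_pick), the single-resource capacity model, and the example numbers are assumptions for illustration only and are not taken from the thesis.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    remaining_work: float  # e.g. CPU-seconds of work left (assumed unit)

def fair_share_finish_times(jobs, capacity):
    """Simulate idealized fair sharing: capacity is split equally among
    active jobs; when a job finishes, its share is redistributed."""
    finish = {}
    active = {j.name: j.remaining_work for j in jobs}
    now = 0.0
    while active:
        share = capacity / len(active)      # equal per-job service rate
        nxt = min(active, key=active.get)   # least remaining work finishes next
        dt = active[nxt] / share            # time until it completes
        now += dt
        for name in active:                 # all active jobs drain at the same rate
            active[name] -= share * dt
        finish[nxt] = now
        del active[nxt]
    return finish

def cfq_pick(jobs, capacity):
    """Offer freed resources to the job with the earliest completion time
    under the idealized fair-share simulation."""
    finish = fair_share_finish_times(jobs, capacity)
    return min(jobs, key=lambda j: finish[j.name])

if __name__ == "__main__":
    jobs = [Job("A", 30.0), Job("B", 10.0), Job("C", 50.0)]
    print(cfq_pick(jobs, capacity=10.0).name)   # -> "B"

In this toy example, job B finishes first under idealized fair sharing, so it is the first to be offered resources; the real CFQ mechanism is, of course, defined in the thesis itself.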