PhD Thesis Proposal Defence
Title: "Cluster Resource Scheduling: A Tale on Fairness and Efficiency"
by
Mr. Chen CHEN
Abstract:
With the burst of large-scale data analytics, it has become quite
prevalent to run data-parallel jobs in clusters of distributed servers. In
shared production clusters that host a variety of workloads, the resource
scheduler is of paramount importance to cluster performance. For the
scheduler, the two basic objectives are efficiency and fairness: an ideal
scheduler should minimize the mean job response time while guaranteeing
predictable job performance by avoiding starvation.
In practice, however, performance optimality and predictability conflict
with each other under today's cluster schedulers, leading to a dilemma:
either obtain predictable performance at the expense of long response
times, or run the risk of starving some jobs to achieve minimal mean
response time. It is therefore critical to develop scheduling strategies
that do well on both fronts, which is the focus of this thesis.
Specifically, we make the following three key contributions.
First, we present Cluster Fair Queuing (CFQ), a cluster resource
scheduling mechanism that minimizes the mean job response time while
achieving predictable performance. Noticing the inefficiency of
instantaneous fair sharing, CFQ works by preferentially offering resources
to the jobs that finish earliest under an idealized fair sharing policy.
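For intuition, the following minimal Python sketch illustrates the idea of
scheduling by fair-share finish times under a single-resource,
known-job-size simplification; the processor-sharing simulation, job
sizes, and function names are illustrative assumptions rather than the
actual CFQ mechanism.

```python
# A minimal single-resource sketch, assuming known job sizes: estimate each
# job's finish time under idealized fair sharing (processor sharing), then
# offer resources to jobs in that order. Not the thesis's actual mechanism.

def fair_share_finish_times(remaining):
    """Simulate idealized fair sharing of one unit of capacity per time unit."""
    remaining = dict(remaining)                # job name -> remaining work
    finish, now = {}, 0.0
    while remaining:
        share = 1.0 / len(remaining)           # equal instantaneous share per job
        dt = min(remaining.values()) / share   # time until some job completes
        now += dt
        for job in list(remaining):
            remaining[job] -= share * dt
            if remaining[job] <= 1e-9:
                finish[job] = now
                del remaining[job]
        # at least one job finishes in each iteration, so the loop terminates
    return finish

def cfq_order(remaining):
    """Offer resources preferentially to jobs that would finish earliest
    under idealized fair sharing."""
    finish = fair_share_finish_times(remaining)
    return sorted(remaining, key=finish.get)

if __name__ == "__main__":
    jobs = {"A": 4.0, "B": 1.0, "C": 2.0}   # hypothetical remaining work
    print(cfq_order(jobs))                  # -> ['B', 'C', 'A']
```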
Second, we reveal that service isolation for dependent data-parallel jobs
is crucial for both fairness and efficiency, yet it is not guaranteed
under fine-grained resource sharing, even when the jobs are assigned high
priorities. We identify the underlying reasons and propose Speculative
Slot Reservation to achieve service isolation by judiciously reserving
slots for the downstream computations of jobs with internal dependencies.
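The following minimal Python sketch illustrates the intuition of reserving
slots for a dependent job's downstream stage while its upstream stage is
still running; the job and cluster model, the reservation fraction, and
all names are illustrative assumptions rather than the proposed mechanism.

```python
# A minimal sketch, assuming a slot-based cluster and a two-stage (e.g.
# map/reduce-like) job model: while a dependent job's upstream stage is still
# running, set aside slots for its downstream stage so that fine-grained
# sharing cannot hand them to other jobs. Not the proposed mechanism itself.

from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    upstream_pending: int        # e.g. map tasks not yet finished
    downstream_demand: int       # e.g. reduce tasks it will eventually need
    has_dependency: bool = True

@dataclass
class Cluster:
    free_slots: int
    reserved: dict = field(default_factory=dict)   # job name -> reserved slots

    def reserve_for_downstream(self, job, fraction=0.5):
        """Speculatively reserve part of the downstream demand while the
        upstream stage is still running (illustrative policy)."""
        if job.has_dependency and job.upstream_pending > 0:
            want = int(job.downstream_demand * fraction)
            grant = min(want, self.free_slots)
            self.reserved[job.name] = self.reserved.get(job.name, 0) + grant
            self.free_slots -= grant

    def slots_available_to(self, job):
        """Slots this job may use: its own reservation plus unreserved slots.
        Other jobs only see the unreserved free slots."""
        return self.reserved.get(job.name, 0) + self.free_slots

if __name__ == "__main__":
    cluster = Cluster(free_slots=10)
    dag_job = Job("J1", upstream_pending=4, downstream_demand=6)
    cluster.reserve_for_downstream(dag_job)
    print(cluster.reserved, cluster.free_slots)     # {'J1': 3} 7
```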
Third, we observe an interesting property of data-parallel jobs, which we
call Demand Elasticity: some jobs can run with far fewer resources than
they ideally need without noticeable performance degradation. We then
propose Performance-Aware Fair (PAF) scheduling to exploit this property,
speeding up overall job performance while ensuring near-optimal fairness.
PAF works by iteratively transferring resources from certain jobs to
others that can use those resources more efficiently, as long as the
donating jobs are not slowed down by more than a small threshold.
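The following minimal Python sketch illustrates the intuition of
exploiting demand elasticity: starting from an equal allocation, resources
are iteratively moved from an elastic job to one that benefits more, as
long as the donor's estimated response time stays within a small threshold
of its fair-share value. The job model (work with a parallelism cap), the
5% threshold, and all names are illustrative assumptions rather than the
actual PAF algorithm.

```python
# A minimal sketch, assuming jobs are described by total work and a
# parallelism cap beyond which extra slots do not help (demand elasticity).
# Not the thesis's PAF algorithm.

def response_time(job, alloc):
    usable = min(alloc, job["max_parallelism"])
    return float("inf") if usable == 0 else job["work"] / usable

def paf_allocate(jobs, capacity, threshold=0.05, rounds=1000):
    """jobs: dict name -> {'work', 'max_parallelism'}; returns name -> slots."""
    alloc = {name: capacity // len(jobs) for name in jobs}      # fair baseline
    fair_rt = {name: response_time(jobs[name], alloc[name]) for name in jobs}
    for _ in range(rounds):
        moved = False
        # Receiver: the job that benefits most from one extra slot.
        recv = max(jobs, key=lambda n: response_time(jobs[n], alloc[n])
                                       - response_time(jobs[n], alloc[n] + 1))
        for donor in jobs:
            if donor == recv or alloc[donor] <= 1:
                continue
            # Donate only if the donor stays within the slowdown threshold.
            if response_time(jobs[donor], alloc[donor] - 1) <= fair_rt[donor] * (1 + threshold):
                alloc[donor] -= 1
                alloc[recv] += 1
                moved = True
                break
        if not moved:
            break
    return alloc

if __name__ == "__main__":
    jobs = {
        "elastic":   {"work": 10.0,  "max_parallelism": 4},    # needs few slots
        "inelastic": {"work": 100.0, "max_parallelism": 50},   # scales further
    }
    print(paf_allocate(jobs, capacity=20))   # -> {'elastic': 4, 'inelastic': 16}
```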
Date: Tuesday, 24 April 2018
Time: 3:00pm - 5:00pm
Venue: Room 4475
(lifts 25/26)
Committee Members: Prof. Bo Li (Supervisor)
Dr. Wei Wang (Supervisor)
Prof. Lei Chen (Chairperson)
Dr. Ke Yi
**** ALL are Welcome ****