OPTIMIZE RESOURCE SCHEDULING IN MULTI-TENANT CLUSTERS AT SCALE
PhD Thesis Proposal Defence

Title: "OPTIMIZE RESOURCE SCHEDULING IN MULTI-TENANT CLUSTERS AT SCALE"

by

Mr. Qizhen WENG

Abstract:

With the rise of cloud computing over the past few decades, there has been a trend toward using large-scale shared clusters of commodity machines to serve multiple user groups. Such multi-tenant clusters are usually highly heterogeneous, and their workloads are widely diverse. In this dissertation, we aim to improve workload performance and reduce cluster operating costs by optimizing resource scheduling.

In clusters for online cloud services, long-running applications (LRAs) deployed in containers are prevalent and have the highest priority. However, placing LRA containers is known to be difficult: they often have sophisticated performance interactions (e.g., resource interference and I/O dependencies) that existing constraint-based schedulers cannot easily evaluate quantitatively. Fortunately, we find that modern reinforcement learning (RL) techniques offer an appealing solution for LRA scheduling. We propose Metis, a general-purpose RL-based scheduler that learns to optimally place LRA containers and scales to production clusters using hierarchical learning techniques.

On the other hand, shared clusters running diverse machine learning (ML) workloads are usually equipped with Graphics Processing Units (GPUs) of different generations. However, the characteristics of such environments remain largely unexplored. We therefore present a comprehensive trace study of a typical enterprise ML-as-a-Service (MLaaS) cloud and discuss its scheduling opportunities and challenges using benchmarks and simulations.
We not only show that GPU sharing and task recurrence can be leveraged to improve cluster efficiency, but also reveal the presence of hard-to-schedule tasks, load imbalance across heterogeneous machines, potential CPU bottlenecks, and other issues that call for further work on resource scheduling design.

Date: Friday, 29 July 2022

Time: 4:00pm - 6:00pm

Zoom Meeting: https://hkust.zoom.us/j/91058544752?pwd=NkZhc3VUWC9hMVJPK3F5bjZmM3dtZz09

Committee Members:
Dr. Wei Wang (Supervisor)
Prof. Qian Zhang (Chairperson)
Prof. Kai Chen
Prof. Bo Li

**** ALL are Welcome ****