Learning to Schedule Long-Running Applications in Shared Container Clusters
PhD Thesis Proposal Defence

Title: "Learning to Schedule Long-Running Applications in Shared Container Clusters"

by

Mr. Luping WANG

Abstract:

Online cloud services are increasingly deployed as long-running applications (LRAs) in containers. Placing LRA containers is known to be difficult, as they often have sophisticated resource interferences and I/O dependencies. Existing schedulers rely on operators to manually express the container scheduling requirements as placement constraints, and strive to satisfy as many constraints as possible. Such schedulers, however, fall short in performance: placement constraints provide only qualitative scheduling guidelines, and minimizing constraint violations does not necessarily result in optimal performance.

In this work, we present Metis, a general-purpose scheduler that learns to optimally place LRA containers using deep reinforcement learning (RL) techniques. This eliminates the complex manual specification of placement constraints and offers, for the first time, concrete quantitative scheduling criteria. As directly training an RL agent does not scale, we develop a novel hierarchical learning technique that decomposes a complex container placement problem into a hierarchy of subproblems with significantly reduced state and action spaces. We show that many subproblems have similar structures and can hence be solved by training a unified RL agent offline. This work has been accepted to the IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC20).

In a follow-up work, we present another scheduler, George, which achieves high-quality container performance subject to the operation constraints. Specifically, we design a tailored constrained policy optimization algorithm that projects the performance-improving training direction onto a safe zone where the operation constraints can be satisfied.
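As an illustration only, the idea of projecting a performance-improving direction onto a safe zone can be pictured with a single linearized constraint. The sketch below is not George's algorithm, whose details are not given in this abstract. It assumes the safe zone is the half-space of update directions d with cost_grad . d <= slack, and applies the standard Euclidean projection onto that half-space. The function name `project_to_safe_zone` and the parameters `reward_grad`, `cost_grad`, and `slack` are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def project_to_safe_zone(
    reward_grad: Sequence[float],
    cost_grad: Sequence[float],
    slack: float,
) -> List[float]:
    """Project reward_grad onto the half-space {d : cost_grad . d <= slack}.

    Assumption (not from the abstract): one operation constraint, linearized
    around the current policy. `cost_grad` is the gradient of the constraint
    cost and `slack` is the remaining constraint budget.
    """
    violation = dot(cost_grad, reward_grad) - slack
    norm_sq = dot(cost_grad, cost_grad)
    # Already inside the safe zone, or the constraint does not depend on the
    # parameters (zero gradient), so no projection can be applied.
    if violation <= 0 or norm_sq == 0:
        return list(reward_grad)
    # Remove just enough of the component along cost_grad to land on the
    # boundary of the half-space.
    scale = violation / norm_sq
    return [g - scale * c for g, c in zip(reward_grad, cost_grad)]


if __name__ == "__main__":
    # The raw direction would raise the constraint cost by 1.0, but only 0.5
    # of budget remains, so the first component is trimmed.
    print(project_to_safe_zone([1.0, 1.0], [1.0, 0.0], 0.5))
```

In this toy case the projected direction keeps all of the improvement that is orthogonal to the constraint gradient and gives up only the part that would exceed the budget.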
We provide a theoretical proof that the algorithm guarantees an effective, stable, and safe learning process. Furthermore, to achieve timely decision-making, George transfers and temporally reuses the learned knowledge between sequential LRA scheduling events. By inheriting the previous knowledge and adapting it to the next decision-making process using Transfer Learning (TL) methods, George’s model training effort can be dramatically reduced. This work is under submission to ACM/IEEE SC 2021.

Date: Wednesday, 7 April 2021

Time: 4:00pm - 6:00pm

Zoom Meeting: https://hkust.zoom.us/j/5767775326

Committee Members: Prof. Bo Li (Supervisor)
                   Dr. Yangqiu Song (Chairperson)
                   Prof. Lei Chen
                   Dr. Qiong Luo

**** ALL are Welcome ****