Flow Scheduling for Parallel Computing Applications in Datacenters
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Flow Scheduling for Parallel Computing Applications in Datacenters"

By

Mr. Li CHEN

Abstract

Distributed and parallel computing systems are cornerstones of this era of Big Data, machine learning, and artificial intelligence. Such systems span hundreds or thousands of machines in one or more datacenters, so as to cope with ever-expanding data volumes and the increasing complexity of models and problems. Most of the recent and important applications, such as web search, business analytics, recommendation systems, and deep neural networks, run on clusters of thousands of machines, for both small companies and large enterprises. At such scale, communication between machines is a bottleneck, and the scheduling of communication sessions within applications (that is, network flows) is a key factor in accelerating these applications.

This thesis focuses on optimizing flow-level scheduling in datacenters, namely its three essential aspects: information collection, scheduling algorithms, and scheduling decision enforcement.

We begin with the design of flow-level information collection systems for parallel computing applications. We then study three important but previously ignored scheduling problems in real-world datacenter applications:

* Scheduling with incomplete information: Scheduling general flows for applications without knowledge of flow size, such as database query/response. Such flows cannot be handled by existing flow schedulers that depend solely on size information. We adapt the multilevel feedback queue from operating systems to network flows, and develop a queueing-theory model to determine the optimal parameter settings.

* Scheduling heterogeneous flows with diverging objectives: Flows from user-facing applications, such as web search, have completion time constraints (deadlines). They coexist with general flows that have no such constraints. We identify and abstract this type of problem as mix-flow scheduling. For this problem, we find that state-of-the-art flow schedulers cannot achieve the objectives of the different types of flows at the same time. We approach the problem with a systematic formulation, and derive a control-theoretic solution using Lyapunov optimization techniques.

* Scheduling with erroneous information: Machine learning techniques are increasingly popular for inferring flow information, especially for groups of flows in the data-shuffling stage of parallel computing applications. However, machine learning results are not always accurate. We therefore design an error-tolerant scheduling algorithm to mitigate the impact of prediction errors.

Finally, we enforce the scheduling decisions efficiently using application-level, operating system kernel, and switch-based mechanisms.

We present the proposed solutions for each problem, and demonstrate their effectiveness via simulations and experiments using the enforcement mechanisms. Our work has been integrated into a comprehensive flow scheduling framework, Chukonu. It has already been deployed at small scale in production datacenters of large Internet service companies, such as Tencent and Huawei.

Date: Friday, 27 April 2018
Time: 3:00pm - 5:00pm
Venue: Room 3494
       Lifts 25/26

Chairman: Prof. Jianwei Sun (CHEM)

Committee Members:
Prof. Kai Chen (Supervisor)
Prof. Bo Li
Prof. Wei Wang
Prof. Chin-Tau Lea (ECE)
Prof. Wing-Cheong Lau (Inf Engg, CityU)

**** ALL are Welcome ****