Flow Scheduling for Parallel Computing Applications in Datacenters

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Flow Scheduling for Parallel Computing Applications in 
Datacenters"

By

Mr. Li CHEN


Abstract

Distributed and parallel computing systems are cornerstones of this era of 
Big Data, machine learning, and artificial intelligence. Such systems span 
hundreds or thousands of machines in one or more datacenters, so as to 
cope with ever-expanding data volumes and the increasing complexity of 
models and problems. Most recent and important applications, such as web 
search, business analytics, recommendation systems, and deep neural 
networks, run on clusters of thousands of machines for both small 
companies and large enterprises. At such scale, communication between 
machines becomes a bottleneck, and the scheduling of communication 
sessions within applications (or network flows) is a key factor in 
accelerating these applications.

This thesis focuses on optimizing flow-level scheduling in datacenters, 
specifically its three essential aspects: information collection, 
scheduling algorithms, and enforcement of scheduling decisions. We begin 
with the design of flow-level information collection systems for parallel 
computing applications. We then study three important but previously 
ignored scheduling problems in real-world datacenter applications:

* Scheduling with incomplete information: Scheduling general flows for 
applications without knowledge of flow size, such as database 
query/response. Such flows cannot be handled by existing flow schedulers 
that depend solely on size information. We adapt the multilevel feedback 
queue from operating systems to network flows, and develop a queueing 
theory model to determine the optimal parameter settings.

* Scheduling heterogeneous flows with diverging objectives: Flows from 
user-facing applications, such as web search, have completion time 
constraints (deadlines). They coexist with general flows without such 
constraints. We identify and abstract this type of problem as mix-flow 
scheduling. For this problem, we find that state-of-the-art flow 
schedulers cannot achieve the objectives of different types of flows at 
the same time. We approach this problem with a systematic formulation, and 
derive a control-theoretic solution using Lyapunov optimization techniques.

* Scheduling with erroneous information: Machine learning techniques are 
increasingly popular for inferring flow information, especially for groups 
of flows in the data-shuffling stage of parallel computing applications. 
However, machine learning results are not always accurate. Thus, we design 
an error-tolerant scheduling algorithm to mitigate the impact of 
prediction errors.

Finally, we enforce the scheduling decisions efficiently using 
application-level, operating system kernel, and switch-based mechanisms. 
We present the proposed solutions for each problem, and demonstrate their 
effectiveness via simulations and experiments using the enforcement 
mechanisms. Our work has been integrated into a comprehensive flow 
scheduling framework, Chukonu. It has already been deployed at small scale 
in production datacenters of large Internet service companies, such as 
Tencent and Huawei.


Date:			Friday, 27 April 2018

Time:			3:00pm - 5:00pm

Venue:			Room 3494
 			Lifts 25/26

Chairman:		Prof. Jianwei Sun (CHEM)

Committee Members:	Prof. Kai Chen (Supervisor)
 			Prof. Bo Li
 			Prof. Wei Wang
 			Prof. Chin-Tau Lea (ECE)
 			Prof. Wing-Cheong Lau (Inf Engg, CityU)


**** ALL are Welcome ****