TOWARDS OPTIMAL DELAY AND THROUGHPUT IN DATA-PARALLEL COMPUTING CLUSTERS

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "TOWARDS OPTIMAL DELAY AND THROUGHPUT IN DATA-PARALLEL COMPUTING 
CLUSTERS"

By

Mr. Jingjie JIANG


Abstract

Data-parallel computing frameworks are designed to support the processing of 
large volumes of data in computing clusters for big data analytics, such as 
search engines, personalized recommendation, video analytics and graph 
processing. Due to the distributed nature of big data analytics, computation 
and network resources are both critical factors in improving individual job 
performance and overall system throughput. There is a pressing need to 
coordinate the allocation of network bandwidth and the scheduling of 
computation tasks.

This thesis handles the allocation of both network and computation resources 
through delay-aware bandwidth allocation schemes and network-aware task 
scheduling frameworks. Specifically, we make the following three 
contributions.

First, we design Tailor, a dynamic monitoring and routing system to reduce 
network transfer times between successive computation stages of a job (captured 
as coflow completion time). Tailor is transparent to data-parallel applications 
and requires minimal modifications to end hosts. For clusters where only edge 
networks experience severe and persistent congestion, we identify the 
non-trivial tradeoff between coflow performance and network utilization. 
Through in-depth analysis, we show that achieving work conservation is 
insufficient to maximize the utilization of access links. We propose a 
hierarchical bandwidth allocation framework, Adia, that maximizes link 
utilization while achieving near-optimal coflow performance.

Second, we propose to incorporate network awareness into task scheduling, since 
network communication remains the determining factor for job performance even 
with state-of-the-art bandwidth allocation schemes. By 
introducing a novel network-aware queueing model, we decouple the usage of 
network and computation resources and thus accurately capture the total 
processing time of each task. We then propose a network-aware scheduling 
algorithm, Adrestia, and prove that it is throughput-optimal given the demand for 
network and computation resources as prior knowledge.

Third, we propose an online scheduling framework, Symbiosis, that identifies 
resource imbalance and coordinates computation-bound and network-bound tasks 
in a large cluster, with the objective of utilizing all types of resources in 
the cluster while achieving optimal system throughput. Symbiosis 
provides both a substrate and an application programming interface (API) to 
support existing task schedulers in data analytics frameworks. Being 
network-aware, our framework accounts for both network and computation 
resources, making task scheduling and bandwidth allocation decisions based on 
live analytics of cluster states. We have implemented Symbiosis on top of Spark 
and demonstrated that it improves both delay and throughput in a real-world 
cloud testbed using diverse analytic workloads.


Date:			Wednesday, 2 August 2017

Time:			3:00pm - 5:00pm

Venue:			Room 1511
 			Lifts 25/26

Chairman:		Prof. David Cook (ECON)

Committee Members:	Prof. Bo Li (Supervisor)
 			Prof. Kai Chen
 			Prof. Wei Wang
 			Prof. Michael Wong (PHYS)
 			Prof. Jianliang Xu (Comp. Sci., Baptist U)


**** ALL are Welcome ****