TOWARDS OPTIMAL DELAY AND THROUGHPUT IN DATA-PARALLEL COMPUTING CLUSTERS
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "TOWARDS OPTIMAL DELAY AND THROUGHPUT IN DATA-PARALLEL COMPUTING CLUSTERS"

By

Mr. Jingjie JIANG

Abstract

Data-parallel computing frameworks are designed to support the processing of large volumes of data in computing clusters for big data analytics, such as search engines, personalized recommendation, video analytics and graph processing. Due to the distributed nature of big data analytics, computation and network resources are the two most critical factors in improving individual job performance and overall system throughput. There is therefore a pressing need to coordinate the allocation of network bandwidth with the scheduling of computation tasks. This thesis handles the allocation of both network and computation resources through delay-aware bandwidth allocation schemes and network-aware task scheduling frameworks. Specifically, we make the following three contributions.

First, we design Tailor, a dynamic monitoring and routing system that reduces network transfer times between successive computation stages of a job (captured as coflow completion time). Tailor is transparent to data-parallel applications and requires minimal modifications to end-hosts. For clusters where only edge networks experience severe and persistent congestion, we identify a non-trivial tradeoff between coflow performance and network utilization. Through in-depth analysis, we show that achieving work conservation is insufficient for maximizing the utilization of access links. We propose a hierarchical bandwidth allocation framework, Adia, that maximizes link utilization while achieving near-optimal coflow performance.

Secondly, we propose to incorporate network awareness into task scheduling, since network communication remains the determining factor for job performance even with state-of-the-art bandwidth allocation schemes. By introducing a novel network-aware queueing model, we decouple the usage of network and computation resources and thus accurately capture the total processing time of each task. We then propose a network-aware scheduling algorithm, Adrestia, and prove that it is throughput-optimal, assuming the demand for network and computation resources is known a priori.

Last but not least, we propose an online scheduling framework, Symbiosis, that identifies resource imbalance and coordinates computation-bound and network-bound tasks in a large cluster. Its objective is to utilize all types of resources in the cluster so as to achieve optimal system throughput. Symbiosis provides both a substrate and an application programming interface (API) to support existing task schedulers in data analytics frameworks. With network awareness, our framework fully considers network and computation resources, making task scheduling and bandwidth allocation decisions based on live analytics of cluster states. We have implemented Symbiosis on top of Spark and demonstrated that it improves both delay and throughput in a real-world cloud testbed using diverse analytic workloads.

Date: Wednesday, 2 August 2017
Time: 3:00pm - 5:00pm
Venue: Room 1511 (Lifts 25/26)

Chairman: Prof. David Cook (ECON)

Committee Members:
Prof. Bo Li (Supervisor)
Prof. Kai Chen
Prof. Wei Wang
Prof. Michael Wong (PHYS)
Prof. Jianliang Xu (Comp. Sci., Baptist U)

**** ALL are Welcome ****