Observable and Economical Dataflow Computation in Datacenters

PhD Thesis Proposal Defence


Title: "Observable and Economical Dataflow Computation in Datacenters"

by

Mr. Huangshi TIAN


Abstract:

With the proliferation of data emerges a myriad of dataflow frameworks. 
When they are deployed in a datacenter and productized as a service, their 
performance and cost become two primary concerns. However, performance 
issues prevail in dataflow computation. Their diagnosis is complicated by 
the heterogeneity of dataflow frameworks because the frameworks differ in 
underlying design, application domain, and computation complexity. This 
makes it hard for service providers and users to debug and locate the 
problems. A side effect of performance issues is higher resource costs as 
the datacenter operator cannot easily determine the appropriate allocation 
that could guarantee stable performance, thus leading to unwanted resource 
waste.

To tackle the challenges of performance and cost, the dissertation first 
characterizes dataflow computation in a large datacenter by analyzing a 
recently released workload trace. It examines the static properties of job 
DAGs and the runtime characteristics of their task execution. Statically, 
the DAGs are discovered to exhibit high artificiality when compared with 
random graphs. The dependent tasks may have significant variability in 
resource usage and duration, even for recurring tasks. The results confirm 
the challenge of performance debugging and resource allocation.

To diagnose performance issues, the dissertation enables resource 
observability in dataflow computation by proposing CrystalPerf, a new 
approach that learns to 
characterize the performance of dataflow computation based on code 
analysis. It requires no code instrumentation and applies to a wide 
variety of dataflow frameworks. Our key insight is that the source code of 
an operation contains learnable syntactic and semantic patterns that 
reveal how it uses resources. Our approach establishes a 
performance-resource model that, given a dataflow program, infers 
automatically how much time each operation has spent on each resource 
(e.g., CPU, network, disk) from past execution traces and the program 
source code, using machine learning techniques. Extensive evaluations and 
real-world case studies show that CrystalPerf can predict job performance 
and accurately detect runtime bottlenecks of DAG jobs.

To reduce resource costs, the dissertation proposes Owl, an overcommitted 
scheduler for executing dataflow computation on serverless platforms. It 
achieves high utilization without compromising performance through a dual 
approach. (1) For less-invoked functions, it allocates resources to the 
sandboxes using a usage-based heuristic, keeps monitoring their performance, 
and remedies any detected degradation. (2) For frequently-invoked 
functions, Owl profiles the interference patterns among collocated 
functions and places the sandboxes under the guidance of profiles. Owl 
further consolidates idle sandboxes to reduce resource waste. We prototype 
Owl in our production system and implement a representative benchmark 
suite to evaluate it. The results demonstrate that the prototype could 
reduce VM cost by 43.80% and effectively mitigate latency degradation, 
with negligible overhead incurred.


Date:			Wednesday, 11 May 2022

Time:                  	9:30am - 11:30am

Zoom Meeting:		https://hkust.zoom.us/j/6943077680

Committee Members:	Dr. Wei Wang (Supervisor)
  			Dr. Charles Zhang (Chairperson)
 			Prof. Qiong Luo
 			Dr. Shuai Wang


**** ALL are Welcome ****