The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Application-Aware Communication Optimization for Distributed Systems"
By
Mr. Xudong LIAO
Abstract:
As modern datacenters scale to support increasingly complex and
data-intensive applications, overall system efficiency remains a persistent
optimization goal across the computing stack. While decades of research have
pushed the boundaries of computation, communication continues to be treated
largely as an infrastructure-level concern, abstracted away from application
semantics. This paradigm, however, falls short in contemporary systems where
communication patterns are tightly coupled with workload behaviors and
runtime dynamics, ultimately limiting the ability to optimize system
performance in a holistic and principled manner.
This dissertation advocates for a shift toward application-aware
communication optimization: a design paradigm that leverages application
characteristics to guide communication scheduling, resource allocation, and
interconnect configuration. By embracing this principle, we show that
distributed systems can achieve significantly improved performance,
scalability, and responsiveness across a broad range of scenarios.
We begin with Pallas, a rack-scale CPU scheduling system targeting
microsecond-level services. Pallas introduces an in-network workload shaping
mechanism that partitions mixed workloads into homogeneous shards at the
top-of-rack switch. This design enables simple yet near-optimal scheduling
within each server and reduces tail latency under dynamic load patterns.
Pallas demonstrates that proactive, application-aware scheduling at the
network level can effectively improve datacenter responsiveness.
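
To make the workload-shaping idea concrete, the following is a minimal
illustrative sketch in Python. The abstract does not specify Pallas's
interfaces, so every name here (Request, SERVER_SHARDS, shard_for) is
hypothetical; the sketch only shows the general shape of the technique:
classify requests by expected service time at the top-of-rack switch and
route each class to its own server shard, so each server sees a homogeneous
stream and simple FIFO scheduling is near-optimal.

# Hedged sketch of in-network workload shaping; all names are assumptions,
# not Pallas's actual API.
from collections import namedtuple

Request = namedtuple("Request", ["req_id", "service", "est_runtime_us"])

# Hypothetical shard table: each shard is a disjoint set of servers that
# handles one homogeneous workload class (short vs. long, per service).
SERVER_SHARDS = {
    ("kv_get", "short"): ["server-01", "server-02"],
    ("kv_get", "long"):  ["server-03"],
    ("search", "short"): ["server-04", "server-05"],
    ("search", "long"):  ["server-06"],
}

def classify(req):
    """Coarse runtime class; a real switch would use a compact rule table."""
    return "short" if req.est_runtime_us < 100 else "long"

def shard_for(req, rr_state):
    """Pick a server within the matching homogeneous shard, round-robin
    across the shard's servers for load balance."""
    key = (req.service, classify(req))
    servers = SERVER_SHARDS[key]
    idx = rr_state.get(key, 0)
    rr_state[key] = (idx + 1) % len(servers)
    return servers[idx]

if __name__ == "__main__":
    rr = {}
    for r in [Request(1, "kv_get", 20), Request(2, "search", 500)]:
        print(r.req_id, "->", shard_for(r, rr))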
Second, we present Herald, a training system for deep learning
recommendation models (DLRMs). Herald leverages the sparse and predictable
access patterns of embedding layers to perform location-aware input
assignment and dynamic communication plan generation. As a result, it
significantly reduces redundant data transfers and accelerates training.
Herald exemplifies how application semantics at the model layer can inform
efficient communication scheduling in machine learning (ML) training
pipelines.
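
As an illustration of location-aware input assignment, the sketch below
shows one plausible greedy rule; Herald's actual algorithms are not given
in the abstract, and the placement map and capacity rule are assumptions.
The idea: route each training sample to the worker that already stores
most of the embedding rows it needs, so fewer rows cross the network.

# Hedged sketch of location-aware input assignment; hypothetical names.
# PLACEMENT maps each embedding ID to the worker holding its row.
PLACEMENT = {0: "w0", 1: "w0", 2: "w1", 3: "w1", 4: "w1"}

def assign_sample(embedding_ids, load, capacity):
    """Greedy: prefer the worker holding the most needed rows, subject to
    a per-worker batch capacity so assignment stays load-balanced."""
    hits = {}
    for eid in embedding_ids:
        w = PLACEMENT[eid]
        hits[w] = hits.get(w, 0) + 1
    for w, _ in sorted(hits.items(), key=lambda kv: -kv[1]):
        if load[w] < capacity:
            load[w] += 1
            return w
    # Fall back to the least-loaded worker if preferred ones are full.
    w = min(load, key=load.get)
    load[w] += 1
    return w

if __name__ == "__main__":
    load = {"w0": 0, "w1": 0}
    batch = [[0, 1, 2], [2, 3, 4], [0, 4]]
    for i, ids in enumerate(batch):
        print("sample", i, "->", assign_sample(ids, load, capacity=2))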
Third, we propose MixNet, a runtime reconfigurable optical-electrical
interconnect architecture designed for large-scale Mixture-of-Experts (MoE)
training. MixNet regionally adapts its physical topology to match evolving
communication patterns across training iterations. By fusing the flexibility
of optical switching with the reach of electrical fabrics, MixNet approaches
the performance of ideal topologies while maintaining practical
cost-efficiency and scalability. MixNet demonstrates that fine-grained,
application-aware topology reconfiguration can unlock new trade-offs in
distributed ML interconnect design.
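
The abstract does not describe MixNet's reconfiguration policy, but one
plausible shape of the idea can be sketched as follows (all names and the
greedy rule are assumptions): given a predicted inter-region traffic matrix
for upcoming iterations, assign the limited optical circuits to the
heaviest region pairs, letting the remaining traffic ride the electrical
fabric.

# Hedged sketch of demand-driven optical circuit planning; hypothetical.
def plan_optical_circuits(traffic, ports_per_region):
    """traffic: {(a, b): bytes}. Returns region pairs to connect optically,
    constrained by each region's optical port budget."""
    used = {}
    circuits = []
    for (a, b), vol in sorted(traffic.items(), key=lambda kv: -kv[1]):
        if used.get(a, 0) < ports_per_region and used.get(b, 0) < ports_per_region:
            circuits.append((a, b))
            used[a] = used.get(a, 0) + 1
            used[b] = used.get(b, 0) + 1
    return circuits

if __name__ == "__main__":
    demand = {("r0", "r1"): 900, ("r0", "r2"): 300, ("r1", "r2"): 700}
    print(plan_optical_circuits(demand, ports_per_region=1))
    # -> [('r0', 'r1')]; r0-r2 and r1-r2 traffic stays on the electrical fabric.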
Date: Thursday, 25 September 2025
Time: 4:00pm - 6:00pm
Venue: Room 2408
Lifts 17-18
Chairman: Prof. Weiping LI (MATH)
Committee Members: Prof. Kai CHEN (Supervisor)
Prof. Song GUO
Dr. Binhang YUAN
Prof. Wei ZHANG (ECE)
Dr. Hong XU (CUHK)