Towards Efficient and Scalable RDMA Networking for Datacenter Applications
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Towards Efficient and Scalable RDMA Networking for Datacenter Applications"
By
Miss Wenxue LI
Abstract:
Remote Direct Memory Access (RDMA) has become a cornerstone of high-speed
networking in modern datacenters. As applications such as AI training and HPC
continue to scale, datacenter networks demand both higher performance and
stronger scalability. However, existing RDMA techniques face key limitations,
including sluggish congestion handling, inflexible communication semantics,
coarse-grained load balancing, and poor network scalability, which
collectively constrain datacenter efficiency.
This thesis addresses these challenges with four contributions. First, we
propose FlowSail to enable timely congestion handling. By adopting hop-by-hop
flow regulation without requiring per-flow queues, it achieves sub-RTT
responsiveness while remaining practical for deployment. Second, we design
Cepheus, which leverages programmable switches to extend RDMA semantics from
one-to-one to one-to-many. Through in-network connection bridging and signal
aggregation, it minimizes transmission hops and maximizes bandwidth
utilization for one-to-many communication. Third, we present FLB, an
efficient load balancing scheme for RDMA networks. By employing separate
control actions for different flows, FLB enables flexible rerouting during
normal conditions while avoiding congestion spreading when congestion occurs.
Finally, we introduce DCP, which revisits RDMA reliability for lossy fabrics.
By integrating lightweight switch-assisted packet trimming with redesigned
RDMA NIC reliability logic, it enables fast and precise loss recovery under
per-packet multipath transmission, making RDMA transmission over lossy
networks both scalable and efficient.
Together, these contributions advance RDMA networking by improving congestion
handling, enriching communication semantics, enhancing the granularity of
load balancing, and enabling scalable transmission over lossy fabrics,
thereby strengthening the foundation for future datacenter networks.
Date: Thursday, 27 November 2025
Time: 4:00pm - 6:00pm
Venue: Room 5501
Lifts 25/26
Chairman: Dr. Yue ZHENG (ACCT)
Committee Members: Prof. Kai CHEN (Supervisor)
Dr. Binhang YUAN
Prof. Song GUO
Prof. Wei ZHANG (ECE)
Dr. Lei YANG (PolyU)