Towards Efficient and Scalable RDMA Networking for Datacenter Applications

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Towards Efficient and Scalable RDMA Networking for Datacenter Applications"

By

Miss Wenxue LI


Abstract:

Remote Direct Memory Access (RDMA) has become a cornerstone of high-speed 
networking in modern datacenters. As applications such as AI training and HPC 
continue to scale, datacenter networks demand both higher performance and 
stronger scalability. However, existing RDMA techniques face key limitations, 
including sluggish congestion handling, inflexible communication semantics, 
coarse-grained load balancing, and poor network scalability, which 
collectively constrain datacenter efficiency.

This thesis addresses these challenges with four contributions. First, we 
propose FlowSail to enable timely congestion handling. By adopting hop-by-hop 
flow regulation without requiring per-flow queues, it achieves sub-RTT 
responsiveness while remaining practical for deployment. Second, we design 
Cepheus, which leverages programmable switches to extend RDMA semantics from 
one-to-one to one-to-many. Through in-network connection bridging and signal 
aggregation, it minimizes transmission hops and maximizes bandwidth 
utilization for one-to-many communication. Third, we present FLB, an 
efficient load balancing scheme for RDMA networks. By employing separate 
control actions for different flows, FLB enables flexible rerouting under 
normal conditions while avoiding congestion spreading when congestion occurs. 
Finally, we introduce DCP, which revisits RDMA reliability for lossy fabrics. 
By integrating lightweight switch-assisted packet trimming with redesigned 
RDMA NIC reliability logic, it achieves fast and precise loss recovery under 
per-packet multipath transmission, enabling scalable and efficient RDMA 
transmission over lossy networks.

Together, these contributions advance RDMA networking by improving congestion 
handling, enriching communication semantics, enhancing the granularity of 
load balancing, and enabling scalable transmission over lossy fabrics, 
thereby strengthening the foundation for future datacenter networks.



Date:                   Thursday, 27 November 2025

Time:                   4:00pm - 6:00pm

Venue:                  Room 5501
                        Lifts 25/26

Chairman:               Dr. Yue ZHENG (ACCT)

Committee Members:      Prof. Kai CHEN (Supervisor)
                        Dr. Binhang YUAN
                        Prof. Song GUO
                        Prof. Wei ZHANG (ECE)
                        Dr. Lei YANG (PolyU)