Advancing High-Performance and Scalable RDMA NICs for Datacenters

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Advancing High-Performance and Scalable RDMA NICs for Datacenters"

By

Mr. Zilong WANG


Abstract:

Datacenter applications are increasingly driving demand for high-speed
networks, which are expected to be high-performance and scalable. Remote
Direct Memory Access (RDMA) has become the de facto standard for
high-speed networks in modern datacenters, providing high throughput, low 
latency, and low CPU overhead through architectural innovations including 
kernel bypass and transport offload.

However, commercial RDMA NICs (RNICs) suffer from scalability issues in 
general scenarios and performance issues in specific ones. Specifically, 
commercial RNICs support only a small number of high-performance connections
(the connection scalability issue) and rely on lossless, limited-scale network
fabrics (the network scalability issue). Moreover, in bandwidth-intensive large
language model (LLM) training scenarios, they fail to achieve maximum 
throughput and optimal link bandwidth utilization. This thesis describes my 
research efforts in advancing high-performance and scalable RNICs for 
datacenters by improving their scalability in general use cases and enhancing 
their performance in specific scenarios through protocol and hardware 
architecture innovations.

First, we improve connection scalability by designing SRNIC, a scalable RDMA 
NIC architecture. SRNIC minimizes memory requirements through careful 
protocol and architecture co-design, all while maintaining high performance, 
low CPU overhead, and efficient selective retransmission mechanisms. SRNIC is 
our first attempt towards more scalable and performant, next-generation RDMA 
designs, and we hope it can inspire the redesign of a new RDMA specification 
for lossy networks.

Second, we further enhance network scalability by designing Tassel, a fast, 
scalable, and accurate rate limiter for RNICs. Tassel supports accurate rate 
limiting for tens of thousands of flows at ultra-high packet rates. This helps
RNICs' congestion control mechanisms smooth traffic, prevent traffic
bursts, and mitigate network congestion, thereby enabling broader RDMA 
deployment and enhancing overall network scalability. We hope this can 
inspire more innovative protocols in high-speed networks and domain-specific 
networks, aiming for congestion-free datacenters.

Finally, we improve the performance of RNICs in specific scenarios by 
designing AINIC, a streamlined RNIC tailored for LLM training. AINIC is 
high-performance and lightweight, meeting the extreme performance demands of 
AI applications. It follows the insight that simplifying RNICs can actually
improve performance while reducing chip complexity. We hope that our work
will inspire further development of application-specific RNIC designs,
leading to more efficient and scalable network architectures for emerging 
technologies like LLMs.


Date:                   Tuesday, 11 February 2025

Time:                   3:00pm - 5:00pm

Venue:                  Room 4472
                        Lifts 25/26

Chairman:               Prof. Wing Hung KI (ECE)

Committee Members:      Prof. Kai CHEN (Supervisor)
                        Prof. Song GUO
                        Prof. Qiong LUO
                        Prof. Wei ZHANG (ECE)
                        Dr. Hong XU (CUHK)