Advancing High-Performance and Scalable RDMA NICs for Datacenters

PhD Thesis Proposal Defence


Title: "Advancing High-Performance and Scalable RDMA NICs for Datacenters"

by

Mr. Zilong WANG


Abstract:

Datacenter applications are driving growing demand for high-speed 
networks, which are expected to be high-performance and scalable. 
Remote Direct Memory Access (RDMA) has become the de facto standard for 
high-speed networks in modern datacenters, providing high throughput, low 
latency, and low CPU overhead through architectural innovations including 
kernel bypass and transport offload. However, commercial RDMA NICs (RNICs) 
suffer from scalability issues in general scenarios and performance issues in 
specific ones. Specifically, commercial RNICs support only a small number of 
high-performance connections (the connection scalability issue) and rely on 
lossless, limited-scale network fabrics (the network scalability issue). Moreover, 
in bandwidth-intensive large language model (LLM) training scenarios, they fail 
to achieve maximum throughput and optimal link bandwidth utilization.

This thesis describes my research efforts in advancing high-performance and 
scalable RNICs for datacenters by improving their scalability in general use 
cases and enhancing their performance in specific scenarios through protocol 
and hardware architecture innovations.

First, we improve connection scalability by designing SRNIC, a scalable RDMA 
NIC architecture. SRNIC minimizes memory requirements through careful protocol 
and architecture co-design, all while maintaining high performance, low CPU 
overhead, and efficient selective retransmission mechanisms. SRNIC is our first 
attempt towards more scalable and performant next-generation RDMA designs, and 
we hope it can inspire a redesigned RDMA specification for lossy 
networks.

Second, we further enhance network scalability by designing Tassel, a fast, 
scalable, and accurate rate limiter for RNICs. Tassel supports accurate rate 
limiting for tens of thousands of flows at ultra-high packet rates. This helps 
RNICs' congestion control mechanisms to smooth traffic, prevent traffic bursts, 
and mitigate network congestion, thereby enabling broader RDMA deployment and 
enhancing overall network scalability. We hope this can inspire more innovative 
protocols in high-speed networks and domain-specific networks, aiming for 
congestion-free datacenters.

Finally, we address the performance limitations of commercial RNICs in LLM 
training scenarios by designing AINIC, a streamlined RNIC tailored for LLM 
training. AINIC is high-performance and easy to scale up, meeting the extreme 
performance demands of AI applications. It follows the insight that simplifying 
RNICs can actually improve performance and architecture scalability. We hope 
that our work will inspire further development of application-specific RNIC 
designs, leading to more efficient and scalable network architectures for 
emerging technologies like LLMs.


Date:                   Wednesday, 23 October 2024

Time:                   10:00am - 12:00 noon

Venue:                  Room 4475
                        Lifts 25/26

Committee Members:      Prof. Kai Chen (Supervisor)
                        Dr. Wei Wang (Chairperson)
                        Prof. Song Guo
                        Dr. Binhang Yuan