Advancing High-Performance and Scalable RDMA NICs for Datacenters
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Advancing High-Performance and Scalable RDMA NICs for Datacenters"
By
Mr. Zilong WANG
Abstract:
Datacenter applications are increasingly driving demand for high-speed
networks, which are expected to be high-performance and scalable. Currently,
Remote Direct Memory Access (RDMA) has become the de-facto standard for
high-speed networks in modern datacenters, providing high throughput, low
latency, and low CPU overhead through architectural innovations including
kernel bypass and transport offload.
However, commercial RDMA NICs (RNICs) suffer from scalability issues in
general scenarios and performance issues in specific ones. Specifically,
commercial RNICs support only a small number of high-performance connections
(connection scalability issue), and rely on lossless, limited-scale network
fabrics (network scalability issue). Moreover, in bandwidth-intensive large
language model (LLM) training scenarios, they fail to achieve maximum
throughput and optimal link bandwidth utilization. This thesis describes my
research efforts in advancing high-performance and scalable RNICs for
datacenters by improving their scalability in general use cases and enhancing
their performance in specific scenarios through protocol and hardware
architecture innovations.
First, we improve connection scalability by designing SRNIC, a scalable RDMA
NIC architecture. SRNIC minimizes memory requirements through careful
protocol and architecture co-design, all while maintaining high performance,
low CPU overhead, and efficient selective retransmission mechanisms. SRNIC is
our first attempt towards more scalable and performant, next-generation RDMA
designs, and we hope it can inspire the redesign of a new RDMA specification
for lossy networks.
Second, we further enhance network scalability by designing Tassel, a fast,
scalable, and accurate rate limiter for RNICs. Tassel supports accurate rate
limiting for tens of thousands of flows at ultra-high packet rates. This helps
RNICs' congestion control mechanisms to smooth traffic, prevent traffic
bursts, and mitigate network congestion, thereby enabling broader RDMA
deployment and enhancing overall network scalability. We hope this can
inspire more innovative protocols in high-speed networks and domain-specific
networks, aiming for congestion-free datacenters.
Finally, we improve the performance of RNICs in specific scenarios by
designing AINIC, a streamlined RNIC tailored for LLM training. AINIC is
high-performance and lightweight, meeting the extreme performance demands of
AI applications. It follows an insight that simplifying RNICs can actually
improve performance while reducing chip complexity. We hope that our work
will inspire further development of application-specific RNIC designs,
leading to more efficient and scalable network architectures for emerging
technologies like LLMs.
Date: Tuesday, 11 February 2025
Time: 3:00pm - 5:00pm
Venue: Room 4472
Lifts 25/26
Chairman: Prof. Wing Hung KI (ECE)
Committee Members: Prof. Kai CHEN (Supervisor)
Prof. Song GUO
Prof. Qiong LUO
Prof. Wei ZHANG (ECE)
Dr. Hong XU (CUHK)