Advancing High-Performance and Scalable RDMA NICs for Datacenters
PhD Thesis Proposal Defence
Title: "Advancing High-Performance and Scalable RDMA NICs for Datacenters"
by
Mr. Zilong WANG
Abstract:
Datacenter applications are increasingly driving the demand for high-speed
networks, which are expected to be high-performance and scalable. Currently,
Remote Direct Memory Access (RDMA) has become the de-facto standard for
high-speed networks in modern datacenters, providing high throughput, low
latency, and low CPU overhead through architectural innovations including
kernel bypass and transport offload. However, commercial RDMA NICs (RNICs)
suffer from scalability issues in general scenarios and performance issues in
specific ones. Specifically, commercial RNICs support only a small number of
high-performance connections (connection scalability issue), and rely on
lossless, limited-scale network fabrics (network scalability issue). Moreover,
in bandwidth-intensive large language model (LLM) training scenarios, they fail
to achieve maximum throughput and optimal link bandwidth utilization.
This thesis describes my research efforts in advancing high-performance and
scalable RNICs for datacenters by improving their scalability in general use
cases and enhancing their performance in specific scenarios through protocol
and hardware architecture innovations.
First, we improve connection scalability by designing SRNIC, a scalable RDMA
NIC architecture. SRNIC minimizes memory requirements through careful protocol
and architecture co-design, all while maintaining high performance, low CPU
overhead, and efficient selective retransmission mechanisms. SRNIC is our first
attempt towards more scalable and performant, next-generation RDMA designs, and
we hope it can inspire the redesign of a new RDMA specification for lossy
networks.
Second, we further enhance network scalability by designing Tassel, a fast,
scalable, and accurate rate limiter for RNICs. Tassel supports accurate rate
limiting for tens of thousands of flows at ultra-high packet rates. This helps
RNICs' congestion control mechanisms to smooth traffic, prevent traffic bursts,
and mitigate network congestion, thereby enabling broader RDMA deployment and
enhancing overall network scalability. We hope this can inspire more innovative
protocols in high-speed networks and domain-specific networks, aiming for
congestion-free datacenters.
Finally, we address the performance limitations of commercial RNICs in LLM
training scenarios by designing AINIC, a streamlined RNIC tailored for LLM
training. AINIC is high-performance and easy to scale up, meeting the extreme
performance demands of AI applications. It follows the insight that simplifying
RNICs can actually improve performance and architecture scalability. We hope
that our work will inspire further development of application-specific RNIC
designs, leading to more efficient and scalable network architectures for
emerging technologies like LLMs.
Date: Wednesday, 23 October 2024
Time: 10:00am - 12:00noon
Venue: Room 4475
Lifts 25/26
Committee Members: Prof. Kai Chen (Supervisor)
Dr. Wei Wang (Chairperson)
Prof. Song Guo
Dr. Binhang Yuan