Advancing High-Performance and Scalable RDMA NICs for Datacenters
PhD Thesis Proposal Defence

Title: "Advancing High-Performance and Scalable RDMA NICs for Datacenters"

by

Mr. Zilong WANG

Abstract:

Datacenter applications are increasingly driving demand for high-speed networks that are both high-performance and scalable. Remote Direct Memory Access (RDMA) has become the de facto standard for high-speed networking in modern datacenters, providing high throughput, low latency, and low CPU overhead through architectural innovations including kernel bypass and transport offload. However, commercial RDMA NICs (RNICs) suffer from scalability issues in general scenarios and performance issues in specific ones. Specifically, commercial RNICs support only a small number of high-performance connections (the connection scalability issue) and rely on lossless, limited-scale network fabrics (the network scalability issue). Moreover, in bandwidth-intensive large language model (LLM) training scenarios, they fail to achieve maximum throughput and optimal link bandwidth utilization.

This thesis describes my research efforts in advancing high-performance and scalable RNICs for datacenters by improving their scalability in general use cases and enhancing their performance in specific scenarios through protocol and hardware architecture innovations.

First, we improve connection scalability by designing SRNIC, a scalable RDMA NIC architecture. SRNIC minimizes memory requirements through careful protocol and architecture co-design while maintaining high performance, low CPU overhead, and efficient selective retransmission. SRNIC is our first attempt towards more scalable and performant next-generation RDMA designs, and we hope it can inspire a new RDMA specification for lossy networks.

Second, we further enhance network scalability by designing Tassel, a fast, scalable, and accurate rate limiter for RNICs. Tassel supports accurate rate limiting for tens of thousands of flows at ultra-high packet rates. This helps RNICs' congestion control mechanisms smooth traffic, prevent traffic bursts, and mitigate network congestion, thereby enabling broader RDMA deployment and enhancing overall network scalability. We hope this can inspire more innovative protocols for high-speed and domain-specific networks, aiming for congestion-free datacenters.

Finally, we address the performance limitations of commercial RNICs in LLM training scenarios by designing AINIC, a streamlined RNIC tailored for LLM training. AINIC is high-performance and easy to scale up, meeting the extreme performance demands of AI applications. It follows the insight that simplifying RNICs can actually improve performance and architectural scalability. We hope that our work will inspire further development of application-specific RNIC designs, leading to more efficient and scalable network architectures for emerging technologies like LLMs.

Date: Wednesday, 23 October 2024
Time: 10:00am - 12:00noon
Venue: Room 4475, Lifts 25/26

Committee Members:
Prof. Kai Chen (Supervisor)
Dr. Wei Wang (Chairperson)
Prof. Song Guo
Dr. Binhang Yuan