Deep Learning Workload Management in Large-Scale GPU Clusters
PhD Qualifying Examination

Title: "Deep Learning Workload Management in Large-Scale GPU Clusters"

by

Mr. Lingyun YANG

Abstract:

In the past decade, deep learning (DL) has advanced rapidly and achieved remarkable performance in a variety of application domains. Large tech companies build large-scale heterogeneous computing clusters equipped with GPUs to accelerate the development of DL models. Compared to high-performance computing (HPC) and big data analytics workloads, DL workloads exhibit different characteristics, such as gang scheduling and resource heterogeneity, which bring new challenges and opportunities for cluster resource management. Efficiently managing DL workloads can improve resource utilization, lower operational costs, and reduce energy consumption.

This survey reviews recent research efforts on GPU cluster management tailored for DL training and inference workloads. We first summarize how DL workloads are integrated into GPU clusters and their common characteristics. Then we present prior works according to their optimization goals: resource utilization, job efficiency, and fairness among multiple tenants. We hope this survey can shed light on system optimization for GPU cluster management and facilitate future industry-oriented designs.

Date: Thursday, 18 August 2022
Time: 4:00pm - 6:00pm
Zoom Meeting: https://hkust.zoom.us/j/93975876687?pwd=d0xRcmVpYWgwTDNwQnJENGF5K0Ftdz09

Committee Members:
Dr. Wei Wang (Supervisor)
Prof. Kai Chen (Chairperson)
Prof. Bo Li
Prof. Qian Zhang

**** ALL are Welcome ****