Sharing the Deep Learning Cluster Network
MPhil Thesis Defence

Title: "Sharing the Deep Learning Cluster Network"

By

Mr. Jingrong CHEN

Abstract

The performance bottleneck of distributed deep learning training (DLT) is shifting from computation to communication as GPUs get faster and model sizes grow larger. Despite continuous efforts in communication optimization, prior research focuses mostly on a single job. Such practice ignores the diverse network demands across different DLT jobs and the heterogeneous computing resource demands of workers and aggregators; as a result, it can double the communication time, waste significant network resources, and provide no performance guarantees. Our goal is to design a novel framework that enables efficient network resource sharing and minimizes the average completion time of DLT jobs.

We present DeepScheduler to achieve this goal. At its core, a dedicated communication layer consisting of aggregators across all machines in the cluster allows a job to "borrow" network resources from other jobs. Furthermore, DeepScheduler makes several algorithmic innovations in inter-job interference minimization and prioritization, de-colocating aggregators and workers to optimize the average DLT job completion time. We have implemented DeepScheduler and evaluated it on a small-scale testbed with NVIDIA V100 GPUs and a 40G RDMA network. Testbed experiments show that DeepScheduler speeds up DLT jobs by 1.72x through de-colocation and outperforms NCCL by up to 1.8x.

Date: Thursday, 6 August 2020
Time: 2:00pm - 4:00pm
Zoom meeting: https://hkust.zoom.us/j/9534912643

Committee Members:
Dr. Kai Chen (Supervisor)
Prof. Qian Zhang (Chairperson)
Dr. Qiong Luo

**** ALL are Welcome ****