MPhil Thesis Defence


Title: "Sharing the Deep Learning Cluster Network"

By

Mr. Jingrong CHEN



Abstract

The performance bottleneck of distributed deep learning training (DLT) is 
shifting from computation to communication as GPUs get faster and model 
sizes grow larger. Despite continuous efforts in communication optimization, 
prior research focuses mostly on a single job. Such practice ignores the 
diverse network demands across different DLT jobs and the heterogeneous 
computing resource demands of workers and aggregators; as a result, it can 
double the communication time, waste a significant amount of network 
resources, and provides no guarantees on performance objectives.

Our goal is to design a novel framework that enables efficient network 
resource sharing and minimizes the average completion time of DLT jobs. We 
present DeepScheduler to achieve this goal. At its core, a dedicated 
communication layer, composed of aggregators on all machines throughout the 
cluster, allows a job to "borrow" network resources from other jobs. 
Furthermore, DeepScheduler makes several algorithmic innovations in inter-job 
interference minimization and prioritization, de-colocating aggregators and 
workers to optimize the average DLT job completion time. We have implemented 
DeepScheduler and evaluated it on a small-scale testbed with NVIDIA V100 GPUs 
and a 40 Gbps RDMA network. Testbed experiments show that DeepScheduler 
speeds up DLT jobs by 1.72x through de-colocation and outperforms NCCL by up 
to 1.8x.


Date:  			Thursday, 6 August 2020

Time:			2:00pm - 4:00pm

Zoom meeting:		https://hkust.zoom.us/j/9534912643

Committee Members:	Dr. Kai Chen (Supervisor)
  			Prof. Qian Zhang (Chairperson)
 			Dr. Qiong Luo


**** ALL are Welcome ****