Towards Scalable Deep Learning with Communication Optimizations
PhD Thesis Proposal Defence

Title: "Towards Scalable Deep Learning with Communication Optimizations"

by

Mr. Lin ZHANG

Abstract:

With the rapid growth of data and model sizes, it has become prevalent to parallelize the training of large deep neural networks (DNNs) across clusters of distributed devices. While distributed training schemes enable large-scale deep learning applications, they introduce extensive communication over the network. This communication overhead often consumes a significant portion of the training time, resulting in a severe performance bottleneck. Communication optimization has therefore attracted much attention from both academia and industry as a way to improve system scalability. We notice that state-of-the-art data-parallel training frameworks, such as PyTorch-DDP and Horovod, only optimize the all-reduce communication for gradient aggregation; they neither consider other collective communication alternatives nor support novel training algorithms such as second-order methods. To address these limitations, our research objective is to optimize all forms of communication overhead in data-parallel training systems.

First, we present DeAR, a novel distributed training mechanism that decouples the all-reduce primitive into two operators to enable fine-grained communication scheduling. By doing so, we can overlap the first operation with the back-propagation computation and the second operation with the feed-forward computation, hiding more of the communication for gradient aggregation. Moreover, we propose a dynamic tensor fusion algorithm in DeAR that uses Bayesian optimization to judiciously determine which tensors should be fused to improve training efficiency. Extensive experiments show that DeAR achieves up to 83% speedup over state-of-the-art solutions on a 64-GPU cluster connected by 10Gb/s Ethernet.

Second, we extend existing distributed training systems to support second-order methods, notably distributed K-FAC (D-KFAC) algorithms. We find that D-KFAC algorithms require computing and communicating a large volume of second-order information in the form of Kronecker factors (KFs), posing new challenges for communication optimization. To address this, we present smart parallel D-KFAC (SPD-KFAC), which pipelines the computation and communication of KFs and load-balances the workload of inverting them. Next, we propose placement-aware D-KFAC (PAD-KFAC), with efficient communication and optimal tensor placement scheduling, to eliminate the redundant communication in our prior SPD-KFAC design. Our experimental results show that PAD-KFAC achieves up to 36% speedup over state-of-the-art D-KFAC algorithms and outperforms its SGD counterpart in end-to-end training time on a 64-GPU cluster.

Date: Friday, 2 June 2023
Time: 2:00pm - 4:00pm
Venue: Room 3494 (lifts 25/26)

Committee Members:
Prof. Bo Li (Supervisor)
Prof. Qian Zhang (Chairperson)
Prof. Kai Chen
Dr. Wei Wang

**** ALL are Welcome ****
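For readers unfamiliar with the decoupling idea described in the abstract, the following minimal sketch illustrates how a single all-reduce can be split into a reduce-scatter (launched asynchronously so it overlaps with back-propagation) and a deferred all-gather (overlapped with feed-forward computation). It is an illustrative example only, not the DeAR implementation: the helper names start_gradient_aggregation and finish_gradient_aggregation are hypothetical, and it assumes torch.distributed has already been initialized (e.g. via torchrun) and that each gradient's element count is divisible by the number of workers (real systems pad or fuse tensors to guarantee this).

    import torch
    import torch.distributed as dist

    def start_gradient_aggregation(grad: torch.Tensor):
        """Stage 1: launch an asynchronous reduce-scatter as soon as the
        gradient is produced, so it overlaps with the rest of
        back-propagation."""
        world_size = dist.get_world_size()
        flat = grad.reshape(-1)
        shard = torch.empty(flat.numel() // world_size,
                            dtype=grad.dtype, device=grad.device)
        work = dist.reduce_scatter(shard, list(flat.chunk(world_size)),
                                   op=dist.ReduceOp.SUM, async_op=True)
        return shard, work

    def finish_gradient_aggregation(grad: torch.Tensor, shard, work):
        """Stage 2: all-gather the averaged shards just before the parameter
        is needed again, so the communication overlaps with feed-forward
        computation."""
        world_size = dist.get_world_size()
        work.wait()
        shard.div_(world_size)              # turn the summed shard into an average
        gathered = [torch.empty_like(shard) for _ in range(world_size)]
        dist.all_gather(gathered, shard)    # reassemble the full averaged gradient
        grad.copy_(torch.cat(gathered).reshape(grad.shape))

In DeAR itself, which of these per-tensor operations are fused into larger buffers is decided dynamically with Bayesian optimization, as noted in the abstract.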