The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "TOWARDS SCALABLE DEEP LEARNING WITH COMMUNICATION OPTIMIZATIONS"

By

Mr. Lin ZHANG


Abstract:

With the explosive growth of data and model sizes, it has become prevalent to
parallelize the training of deep neural networks (DNNs) across a cluster of
distributed devices, which, however, introduces extensive communication
overheads. In this thesis, we study both system-level and algorithm-level
communication optimization techniques to improve training efficiency.

First, existing data-parallel training systems rely on the all-reduce primitive
for gradient aggregation, which achieves only sub-optimal training performance.
We present DeAR, which decouples the all-reduce primitive into two operators to
enable fine-grained communication scheduling, and then applies dynamic tensor
fusion to derive an optimal schedule.
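
A minimal sketch of the first idea, assuming PyTorch's torch.distributed API
(DeAR's actual scheduling and tensor fusion logic is omitted): the all-reduce
is split into a reduce-scatter and an all-gather that are issued
asynchronously, so each half can be overlapped with backpropagation or the
next forward pass independently.

    import torch
    import torch.distributed as dist

    # Decoupled all-reduce: reduce-scatter + all-gather, each issued
    # asynchronously so it can be scheduled (overlapped) independently.
    def decoupled_allreduce(grad: torch.Tensor) -> torch.Tensor:
        world_size = dist.get_world_size()
        numel = grad.numel()
        pad = (-numel) % world_size                      # pad to split evenly
        flat = torch.cat([grad.flatten(), grad.new_zeros(pad)])
        shard = torch.empty(flat.numel() // world_size,
                            dtype=grad.dtype, device=grad.device)

        # Stage 1: each worker obtains one reduced shard of the gradient
        # (can be overlapped with the remaining backward computation).
        work = dist.reduce_scatter_tensor(shard, flat,
                                          op=dist.ReduceOp.SUM, async_op=True)
        work.wait()
        shard /= world_size                              # average across workers

        # Stage 2: gather the reduced shards back into the full gradient
        # (can be deferred and overlapped with the next forward pass).
        work = dist.all_gather_into_tensor(flat, shard, async_op=True)
        work.wait()
        grad.copy_(flat[:numel].view_as(grad))
        return grad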

Second, many gradient compression algorithms have been proposed to compress the
communicated data in synchronous stochastic gradient descent (S-SGD) and thus
accelerate distributed training, but we find that they fail to outperform S-SGD
in most cases. To this end, we propose ACP-SGD, which largely reduces the
compression and communication overheads and enjoys three system optimizations
(all-reduce, pipelining, and tensor fusion).
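
A minimal sketch of the general pattern (illustrative only, not the actual
ACP-SGD algorithm): an all-reduce-compatible compressor that selects the same
random coordinate subset on every worker via a shared seed and keeps a local
error-feedback buffer, so the compressed gradients can still be aggregated
with a single all-reduce and pipelined or fused like ordinary gradients.

    import torch
    import torch.distributed as dist

    class SharedMaskCompressor:
        """Shared-mask sparsification with error feedback (illustrative)."""

        def __init__(self, numel: int, ratio: float = 0.01, seed: int = 42):
            g = torch.Generator().manual_seed(seed)  # same seed on all workers
            k = max(1, int(numel * ratio))
            self.idx = torch.randperm(numel, generator=g)[:k]
            self.error = None                        # error-feedback buffer

        def compressed_allreduce(self, flat_grad: torch.Tensor) -> torch.Tensor:
            if self.error is None:
                self.error = torch.zeros_like(flat_grad)
            idx = self.idx.to(flat_grad.device)
            corrected = flat_grad + self.error
            payload = corrected[idx].contiguous()    # small tensor to send
            self.error = corrected
            self.error[idx] = 0.0                    # keep the unsent residual
            dist.all_reduce(payload, op=dist.ReduceOp.SUM)
            payload /= dist.get_world_size()
            out = torch.zeros_like(flat_grad)
            out[idx] = payload                       # decompress the average
            return out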

Third, we study second-order methods such as distributed K-FAC (D-KFAC) for
training DNNs, as they exploit curvature information to accelerate the training
process. However, D-KFAC incurs extensive computation and communication for the
curvature information. We present smart parallel D-KFAC (SPD-KFAC) and
placement-aware D-KFAC (PAD-KFAC), which accelerate D-KFAC with efficient
pipelining and optimal tensor placement scheduling techniques, respectively.
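
For context, K-FAC approximates each layer's Fisher block by a Kronecker
product of two small covariance matrices and preconditions the gradient with
their inverses (standard K-FAC notation, with a damping term gamma):

  \[
    F_l \;\approx\; A_{l-1} \otimes G_l,
    \qquad
    \Delta W_l \;=\; (G_l + \gamma I)^{-1}\,\nabla_{W_l} L\,(A_{l-1} + \gamma I)^{-1},
  \]

where A_{l-1} is the covariance of the layer inputs and G_l is the covariance
of the pre-activation gradients. Computing, inverting, and communicating these
per-layer factors is the overhead that the pipelining and placement scheduling
above are designed to hide.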

Fourth, we present a memory- and time-efficient second-order algorithm named
Eva, with two novel techniques: 1) we approximate the curvature information
with two small stochastic vectors to reduce memory and communication costs, and
2) we derive an efficient update formula that avoids explicitly computing
matrix inverses, addressing the high computation overhead.
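
As background, the linear-algebra identity enabling such inverse-free updates
is the Sherman-Morrison formula: if a curvature factor is approximated by a
damped rank-one matrix built from a small vector a (our notation here), its
inverse has a closed form,

  \[
    (\gamma I + a a^{\top})^{-1}
      \;=\; \frac{1}{\gamma}\left( I - \frac{a a^{\top}}{\gamma + a^{\top} a} \right),
  \]

so applying the preconditioner requires only inner products and rank-one
corrections rather than an explicit matrix inversion; Eva's exact update rule
is derived in the thesis.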


Date:                   Friday, 18 August 2023

Time:                   2:00pm - 4:00pm

Venue:                  Room 3494 (Lifts 25/26)

Chairperson:            Prof. Yong HUANG (CHEM)

Committee Members:      Prof. Bo LI (Supervisor)
                        Prof. Yangqiu SONG
                        Prof. Qian ZHANG
                        Prof. Weichuan YU (ECE)
                        Prof. Yuanqing ZHENG (PolyU)


**** ALL are Welcome ****