A Survey on Efficient Distributed Machine Learning in Large-scale ML Clusters

PhD Qualifying Examination


Title: "A Survey on Efficient Distributed Machine Learning in Large-scale ML
Clusters"

by

Mr. Kaiqiang XU


Abstract:

The demand for artificial intelligence (AI) has increased significantly in
recent decades. Machine learning (ML) jobs, which are known to require
significant computational resources and time, must be executed in a
distributed manner on large computing clusters equipped with specialized
hardware accelerators.

To address these resource demands, distributed ML systems manage ML clusters
that are specifically configured to execute ML tasks. These systems introduce
unique challenges at two operational levels: (1) At the individual job level,
there is a need to efficiently parallelize the training processes across the
distributed system to create a unified ML model. (2) At the cluster level, it
is essential to allocate resources strategically among concurrent ML jobs to
improve the overall efficiency of the cluster.

This survey provides an overview of the research trajectory and the state of
the art in ML systems, offering insights into distributed training techniques
and cluster scheduling strategies. Finally, we discuss the challenges and
potential future developments in ML system architectures designed for large
models (e.g., LLMs).


Date:                   Wednesday, 15 November 2023

Time:                   4:30pm - 6:30pm

Venue:                  Room 5504
                        lifts 25/26

Committee Members:      Prof. Kai Chen (Supervisor)
                        Prof. Gary Chan (Chairperson)
                        Dr. Qifeng Chen
                        Dr. Binhang Yuan


**** ALL are Welcome ****