A Survey on Efficient Distributed Machine Learning in Large-scale ML Clusters
PhD Qualifying Examination

Title: "A Survey on Efficient Distributed Machine Learning in Large-scale ML Clusters"

by Mr. Kaiqiang XU

Abstract:

The demand for artificial intelligence (AI) has significantly increased in recent decades. Machine learning (ML) jobs, which are known to demand significant computational resources and time, require distributed execution in large computing clusters that are equipped with specialized hardware accelerators. To address these resource demands, distributed ML systems manage ML clusters that are specifically configured to execute ML tasks. These systems introduce unique challenges at two operational levels: (1) At the individual job level, there is a need to efficiently parallelize the training process across the distributed system to produce a unified ML model. (2) At the cluster level, it is essential to allocate resources strategically among concurrent ML jobs to improve the overall efficiency of the cluster.

This survey provides an overview of the research path as well as the state of the art in ML systems, offering insights into distributed training techniques and cluster scheduling strategies. At the end, we discuss the challenges and potential future developments in ML system architectures designed for large models (e.g., LLMs).

Date: Wednesday, 15 November 2023
Time: 4:30pm - 6:30pm
Venue: Room 5504 (lifts 25/26)

Committee Members:
Prof. Kai Chen (Supervisor)
Prof. Gary Chan (Chairperson)
Dr. Qifeng Chen
Dr. Binhang Yuan

**** ALL are Welcome ****