Efficient and Accessible Systems for Machine Learning: From Clusters to Jobs
PhD Thesis Proposal Defence

Title: "Efficient and Accessible Systems for Machine Learning: From Clusters to Jobs"

by

Mr. Kaiqiang XU

Abstract:

Large-scale machine learning (ML) computing for AI involves highly distributed, parallelized workloads with intricate patterns. Efficiently managing and executing these workloads using existing computing abstractions and mechanisms presents significant challenges to underlying systems. To address these complexities and optimize performance, specialized ML systems are designed to meet the unique demands of ML computing. This paper explores my research in ML systems from two key perspectives: cluster-level multi-job orchestration and individual job efficiency.

At the cluster level, SING introduces a full-stack GPU cluster management architecture that optimizes ML job scheduling and execution in shared multi-tenant environments, enhancing usability and fairness while mitigating resource underutilization. Complementing this, GREEN presents a carbon-efficient ML cluster scheduler that dynamically aligns workloads with greener energy periods, effectively balancing job completion times and carbon emissions without compromising cluster-wide performance.

At the individual job level, Sequoia optimizes secure distributed data processing for ML with an extensible compiler framework and efficiently schedules execution across data owners, reducing both programming complexity and execution time. Meanwhile, G3 introduces a scalable distributed system for graph neural network training, employing hybrid parallelism, locality-aware partitioning, and multi-level pipelining to enable efficient full-graph training on billion-edge graphs.

The new abstractions, parallelization strategies, and resource scheduling algorithms introduced in this dissertation address the increasing complexity of ML workloads and the constraints of existing infrastructure. By improving the efficiency and accessibility of ML systems, this research lays a foundation for the wider adoption of AI technologies.

Date: Monday, 9 December 2024

Time: 2:00pm - 4:00pm

Venue: Room 3494
Lifts 25/26

Committee Members:
Prof. Kai Chen (Supervisor)
Prof. Qiong Luo (Chairperson)
Dr. Yangqiu Song
Dr. Binhang Yuan