Efficient and Accessible Systems for Machine Learning: From Clusters to Jobs

PhD Thesis Proposal Defence


Title: "Efficient and Accessible Systems for Machine Learning: From Clusters 
to Jobs"

by

Mr. Kaiqiang XU


Abstract:

Large-scale machine learning (ML) computing for AI involves highly 
distributed, parallelized workloads with intricate patterns. Efficiently 
managing and executing these workloads using existing computing abstractions 
and mechanisms presents significant challenges to underlying systems. To 
address these complexities and optimize performance, specialized ML systems 
are designed to meet the unique demands of ML computing.

This thesis explores my research in ML systems from two key perspectives: 
cluster-level multi-job orchestration and individual job efficiency. At the 
cluster level, SING introduces a full-stack GPU cluster management 
architecture that optimizes ML job scheduling and execution in shared 
multi-tenant environments, enhancing usability and fairness while mitigating 
resource underutilization. Complementing this, GREEN presents a 
carbon-efficient ML cluster scheduler that dynamically aligns workloads with 
greener energy periods, effectively balancing job completion times and carbon 
emissions without compromising cluster-wide performance. At the individual 
job level, Sequoia optimizes secure distributed data processing for ML with 
an extensible compiler framework and efficiently schedules execution across 
data owners, reducing both programming complexity and execution time. 
Meanwhile, G3 introduces a scalable distributed system for graph neural 
network training, employing hybrid parallelism, locality-aware partitioning, 
and multi-level pipelining to enable efficient full-graph training on 
billion-edge graphs.

The new abstractions, parallelization strategies, and resource scheduling 
algorithms introduced in this dissertation address the increasing complexity 
of ML workloads and the constraints of existing infrastructure. By improving 
the efficiency and accessibility of ML systems, this research lays a 
foundation for the wider adoption of AI technologies.


Date:                   Monday, 9 December 2024

Time:                   2:00pm - 4:00pm

Venue:                  Room 3494
                        Lifts 25/26

Committee Members:      Prof. Kai Chen (Supervisor)
                        Prof. Qiong Luo (Chairperson)
                        Dr. Yangqiu Song
                        Dr. Binhang Yuan