Efficient and Accessible Systems for Machine Learning: From Clusters to Jobs
PhD Thesis Proposal Defence
Title: "Efficient and Accessible Systems for Machine Learning: From Clusters
to Jobs"
by
Mr. Kaiqiang XU
Abstract:
Large-scale machine learning (ML) computing for AI involves highly
distributed, parallelized workloads with intricate patterns. Efficiently
managing and executing these workloads using existing computing abstractions
and mechanisms presents significant challenges to underlying systems. To
address these complexities and optimize performance, specialized ML systems
are designed to meet the unique demands of ML computing.
This thesis explores my research in ML systems from two key perspectives:
cluster-level multi-job orchestration and individual job efficiency. At the
cluster level, SING introduces a full-stack GPU cluster management
architecture that optimizes ML job scheduling and execution in shared
multi-tenant environments, enhancing usability and fairness while mitigating
resource underutilization. Complementing this, GREEN presents a
carbon-efficient ML cluster scheduler that dynamically aligns workloads with
greener energy periods, effectively balancing job completion times and carbon
emissions without compromising cluster-wide performance. At the individual
job level, Sequoia optimizes secure distributed data processing for ML with
an extensible compiler framework, and efficiently schedules execution across
data owners, reducing both programming complexity and execution time.
Meanwhile, G3 introduces a scalable distributed system for graph neural
network training, employing hybrid parallelism, locality-aware partitioning,
and multi-level pipelining to enable efficient full-graph training on
billion-edge graphs.
The new abstractions, parallelization strategies, and resource scheduling
algorithms introduced in this dissertation address the increasing complexity
of ML workloads and the constraints of existing infrastructure. By improving
the efficiency and accessibility of ML systems, this research lays a
foundation for the wider adoption of AI technologies.
Date: Monday, 9 December 2024
Time: 2:00pm - 4:00pm
Venue: Room 3494
Lifts 25/26
Committee Members: Prof. Kai Chen (Supervisor)
Prof. Qiong Luo (Chairperson)
Dr. Yangqiu Song
Dr. Binhang Yuan