PhD Qualifying Examination


Title: "Accommodating LLM Service over Heterogeneous Computational Resources"

by

Mr. Ran YAN


Abstract:

Generative inference serving and model training are crucial components of 
contemporary applications of large language models (LLMs). Due to their 
intensive computational demands, state-of-the-art LLM inference services 
and training tasks are usually hosted in centralized data centers equipped 
with homogeneous high-performance GPUs, which can be very expensive. The 
high cost of such deployments potentially limits the application and 
advancement of this technology. In this survey, we explore an alternative 
approach that deploys inference and training tasks across heterogeneous 
GPUs to enable more flexible and efficient resource utilization. However, 
given the heterogeneity of GPU hardware specifications and the large space 
of potential parallel strategies, effectively accommodating LLM service 
over heterogeneous resources is extremely challenging. Even without the 
heterogeneity factors (i.e., in homogeneous settings), identifying an 
efficient parallel configuration still requires significant effort. To 
outline future research directions, we first review state-of-the-art work 
on scheduling LLM inference and training, and then analyze potential 
research avenues.


Date:                   Monday, 17 February 2025

Time:                   10:00am - 12:00noon

Venue:                  Room 2408
                        Lifts 25/26

Committee Members:      Dr. Binhang Yuan (Supervisor)
                        Dr. Dongdong She (Chairperson)
                        Dr. Zili Meng
                        Dr. Wei Wang