Advancing Training and Inference Efficiency in Large-Scale Models
PhD Thesis Proposal Defence
Title: "Advancing Training and Inference Efficiency in Large-Scale Models"
by
Mr. Shih-yang LIU
Abstract:
While large-scale deep learning models, particularly Large Language Models
(LLMs), have achieved unprecedented performance across a wide range of tasks,
their rapidly increasing model size, computational cost, and inference latency
present significant challenges for both deployment and training. These
challenges are especially pronounced in resource-constrained environments and
real-time applications, where memory bandwidth and capacity, computational throughput,
and reasoning efficiency become critical bottlenecks. As a result, improving
the efficiency of training and inference has become a central focus in the
research community and an essential component for enabling practical and
scalable AI systems.
This thesis presents a comprehensive investigation of efficiency in
large-scale models, dividing the improvements into training and inference
phases. Spanning the spectrum from model compression to reasoning
acceleration, we first outline methods designed to improve inference
efficiency, such as quantization, pruning, and token-length reduction. We
then survey prior work that addresses training efficiency, specifically
focusing on parameter-efficient fine-tuning. Building on this foundation, we
introduce several of our innovative methodologies that elevate the efficiency
of the entire pipeline, from training to final inference, across diverse model
architectures and inference paradigms.
First, we address the inference efficiency of Vision Transformers and propose
Oscillation-Free Quantization. This quantization-aware training technique
eliminates the instability caused by oscillatory weight updates, enabling
stable and accurate quantization even at extremely low bitwidths. Second, we
introduce LLM-FP4, a post-training quantization method designed to enhance the
inference efficiency of large language models. This approach achieves superior
trade-offs between accuracy and efficiency through the careful co-design of
numerical formats and quantization strategies. Third, we propose Eigenspace
Low-Rank Approximation, a weight-space inference acceleration technique that
combines quantization, pruning, and low-rank decomposition. This method
enables accuracy recovery for compressed models without the need for
additional fine-tuning. By operating within the eigenspace, the framework
effectively restores performance lost during aggressive compression while
avoiding the overhead of expensive retraining. Fourth, we shift our focus to
activation-space efficiency, using reinforcement learning to improve the
model's reasoning efficiency. We propose Group Reward-Decoupled Normalization
Policy Optimization (GDPO), a multi-reward reinforcement learning algorithm
that teaches the model to simultaneously minimize reasoning tokens and
maximize accuracy. By incentivizing higher intelligence per token and
curtailing redundant reasoning steps, this approach significantly reduces
inference latency and computational cost during autoregressive generation.
Finally, we address training efficiency by focusing on low-rank
parameter-efficient fine-tuning. We propose Weight-Decomposed Low-Rank
Adaptation (DoRA), a technique that improves training efficiency by
decomposing pretrained weights into magnitude and direction components and
applying low-rank updates to the direction. This framework enhances
training stability, accelerates convergence, and improves parameter
efficiency, enabling highly cost-effective adaptation of large language models
for various downstream tasks.
Date: Tuesday, 10 March 2026
Time: 3:00pm - 5:00pm
Venue: Room 2132C
Lift 22
Committee Members: Prof. Tim Cheng (Supervisor)
Dr. Yangqiu Song (Chairperson)
Dr. Dan Xu