Advancing Training and Inference Efficiency in Large-Scale Models
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Advancing Training and Inference Efficiency in Large-Scale Models"
By
Mr. Shih-yang LIU
Abstract:
While large-scale deep learning models, particularly Large Language Models
(LLMs), have achieved unprecedented performance across a wide range of tasks,
their rapidly increasing model size, computational cost, and inference
latency present significant challenges for both deployment and training.
These challenges are especially pronounced in resource-constrained
environments and real-time applications, where memory bandwidth/size,
computational throughput, and reasoning efficiency become critical
bottlenecks. As a result, improving the efficiency of training and inference
has become a central focus in the research community and an essential
component for enabling practical and scalable AI systems.
This thesis presents a comprehensive investigation of efficiency in
large-scale models, categorizing efficiency improvements into training and inference
phases. Spanning the spectrum from model compression to reasoning
acceleration, we first outline methods designed to improve inference
efficiency, such as quantization, pruning, and token-length reduction. We
then survey prior work that addresses training efficiency, specifically
focusing on parameter-efficient fine-tuning. Following this foundation, we
introduce several of our own methodologies that improve the efficiency
of the entire pipeline, from training to final inference, across diverse
model architectures and inference paradigms.
First, we address the inference efficiency of Vision Transformers and propose
Oscillation-Free Quantization. This quantization-aware training technique
eliminates the instability caused by oscillatory weight updates, enabling
stable and accurate quantization even at extremely low bitwidths. Next, we
introduce LLM-FP4, a post-training quantization method designed to enhance
the inference efficiency of large language models. This approach achieves
superior trade-offs between accuracy and efficiency through the careful
co-design of numerical formats and quantization strategies. Third, we propose
Eigenspace Low-Rank Approximation, a weight-space inference acceleration
technique that combines quantization, pruning, and low-rank decomposition.
This method enables accuracy recovery for compressed models without the need
for additional fine-tuning. By operating within the eigenspace, the framework
effectively restores performance lost during aggressive compression while
avoiding the overhead of expensive retraining. Fourth, we shift our focus to
activation-space efficiency by leveraging reinforcement learning to enhance
model reasoning efficiency. We propose Group Reward-Decoupled Normalization
Policy Optimization (GDPO), a multi-reward reinforcement learning algorithm
that helps teach the model to simultaneously minimize reasoning tokens and
maximize accuracy. By incentivizing higher intelligence per token and
curtailing redundant reasoning steps, this approach significantly reduces
inference latency and computational cost during autoregressive generation.
Finally, we address training efficiency by focusing on low-rank-based
parameter-efficient fine-tuning. We propose Weight-Decomposed Low-Rank
Adaptation (DoRA), a novel technique that improves training efficiency by
partitioning weight updates into low-rank components. This framework enhances
training stability, accelerates convergence, and improves parameter
efficiency, enabling highly cost-effective adaptation of large language
models for various downstream tasks.
Date: Tuesday, 16 June 2026
Time: 3:30pm - 5:30pm
Venue: Room 3494
Lifts 25/26
Chairman: Dr. Shiheng WANG (ACCT)
Committee Members: Prof. Tim CHENG (Supervisor)
Dr. Yangqiu SONG
Dr. Dan XU
Prof. Jun ZHANG (ECE)
Dr. Jing LI (PolyU)