Advancing Training and Inference Efficiency in Large-Scale Models

PhD Thesis Proposal Defence


Title: "Advancing Training and Inference Efficiency in Large-Scale Models"

by

Mr. Shih-yang LIU


Abstract:

While large-scale deep learning models, particularly Large Language Models 
(LLMs), have achieved unprecedented performance across a wide range of tasks, 
their rapidly increasing model size, computational cost, and inference latency 
present significant challenges for both deployment and training. These 
challenges are especially pronounced in resource-constrained environments and 
real-time applications, where memory bandwidth and capacity, computational 
and reasoning efficiency become critical bottlenecks. As a result, improving 
the efficiency of training and inference has become a central focus in the 
research community and an essential component for enabling practical and 
scalable AI systems.

This thesis presents a comprehensive investigation of efficiency in 
large-scale models, organizing these enhancements into the training and 
inference phases. Spanning the spectrum from model compression to reasoning 
acceleration, we first outline methods designed to improve inference 
efficiency, such as quantization, pruning, and token-length reduction. We 
then survey prior work on training efficiency, focusing specifically on 
parameter-efficient fine-tuning. Building on this foundation, we introduce 
several of our novel methodologies that improve the efficiency of the entire 
pipeline, from training to final inference, across diverse model 
architectures and inference paradigms.
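As a running illustration of the quantization methods surveyed above, the 
core operation can be sketched as a symmetric uniform "fake" quantizer 
(quantize, then dequantize). The function name, per-tensor scaling, and bit 
width below are illustrative choices, not the specific designs proposed in 
this thesis:

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric uniform fake quantization: round weights to a small
    integer grid, then map back to floats, as done inside
    quantization-aware training and post-training quantization."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for signed 4-bit
    scale = np.abs(w).max() / qmax        # per-tensor scale factor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                      # dequantized weights

w = np.random.randn(4, 4).astype(np.float32)
w_q = fake_quantize(w, bits=4)
err = float(np.abs(w - w_q).max())        # bounded by about scale / 2
```

At 4 bits every weight collapses onto at most 16 grid points, which is what 
makes the oscillation and format-design issues studied in the following 
chapters so consequential.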

First, we address the inference efficiency of Vision Transformers and propose 
Oscillation-Free Quantization. This quantization-aware training technique 
eliminates the instability caused by oscillatory weight updates, enabling 
stable and accurate quantization even at extremely low bitwidths.

Next, we introduce LLM-FP4, a post-training quantization method designed to 
enhance the inference efficiency of large language models. This approach 
achieves superior trade-offs between accuracy and efficiency through the 
careful co-design of numerical formats and quantization strategies.

Third, we propose Eigenspace Low-Rank Approximation, a weight-space inference 
acceleration technique that combines quantization, pruning, and low-rank 
decomposition. This method enables accuracy recovery for compressed models 
without additional fine-tuning: by operating within the eigenspace, the 
framework effectively restores performance lost during aggressive compression 
while avoiding the overhead of expensive retraining.

Fourth, we shift our focus to activation-space efficiency by leveraging 
reinforcement learning to enhance model reasoning efficiency. We propose 
Group Reward Decoupled Normalization Policy Optimization (GDPO), a 
multi-reward reinforcement learning algorithm that teaches the model to 
simultaneously minimize reasoning tokens and maximize accuracy. By 
incentivizing higher intelligence per token and curtailing redundant 
reasoning steps, this approach significantly reduces inference latency and 
computational cost during autoregressive generation.

Finally, we address training efficiency by focusing on low-rank 
parameter-efficient fine-tuning. We propose Weight-Decomposed Low-Rank 
Adaptation (DoRA), a novel technique that decomposes pretrained weights into 
magnitude and directional components and applies low-rank updates to the 
direction. This framework enhances training stability, accelerates 
convergence, and improves parameter efficiency, enabling highly 
cost-effective adaptation of large language models for various downstream 
tasks.
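The weight decomposition behind this style of low-rank adaptation can be 
sketched in a few lines. The shapes, the column-wise magnitude, and the 
zero-initialized low-rank factor below follow common LoRA practice and are a 
minimal illustration, not the thesis's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2                        # layer shape, adapter rank

W0 = rng.standard_normal((d_out, d_in))         # frozen pretrained weight
m = np.linalg.norm(W0, axis=0, keepdims=True)   # trainable per-column magnitude
B = np.zeros((d_out, r))                        # trainable low-rank factors;
A = rng.standard_normal((r, d_in)) * 0.01       # B = 0 means no change at start

V = W0 + B @ A                                  # low-rank directional update
W = m * V / np.linalg.norm(V, axis=0, keepdims=True)  # renormalize and rescale
```

Because B starts at zero, the adapted weight W initially reproduces W0 
exactly; training then adjusts only m, B, and A, a small fraction of the 
full parameter count.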


Date:                   Tuesday, 10 March 2026

Time:                   3:00pm - 5:00pm

Venue:                  Room 2132C
                        Lift 22

Committee Members:      Prof. Tim Cheng (Supervisor)
                        Dr. Yangqiu Song (Chairperson)
                        Dr. Dan Xu