The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Towards Efficient and Effective Inference for Large-Scale Models"

By

Mr. Xijie HUANG


Abstract:

While large-scale deep learning models have achieved remarkable success in
natural language and vision tasks, their growing computational demands and
model sizes necessitate efficient inference, particularly on edge devices
with memory bandwidth constraints. To address this efficiency bottleneck,
model compression techniques, such as quantization, pruning, knowledge
distillation, and low-rank decomposition, have been extensively studied in
the research community and widely adopted in various AI applications.

In this thesis, we start with an introduction to the principles of model
compression and inference acceleration for large-scale models, including
Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), Large
Language Models (LLMs), and Diffusion Models (DMs), and a discussion of the
associated challenges. In the following chapters, we present novel methods
to enhance the efficiency and effectiveness of inference across these
architectures.

First, we focus on the inference efficiency of CNNs and propose Stochastic
Differentiable Quantization (SDQ). In our SDQ framework, the optimal
mixed-precision strategy is learned by optimizing differentiable bitwidth
parameters during stochastic quantization (a toy sketch of this idea appears
after the abstract).

Second, we turn to the inference efficiency of ViTs and propose an effective
Variation-aware ViT Quantization (VVTQ) method, which combines
module-dependent quantization and scaling, variation-aware knowledge
distillation, and oscillation-aware bin regularization.

Third, we improve the inference efficiency of LLMs by addressing the
activation outlier problem. We propose RoLoRA, the first LoRA-based scheme
for effective weight-activation quantization. RoLoRA uses rotation to
eliminate outliers and introduces rotation-aware fine-tuning to preserve the
outlier-free property of rotated LLMs (see the rotation sketch after the
abstract).

Fourth, we improve both the reasoning efficiency and effectiveness of LLMs
with a coarse-to-fine prompt pruner named CoT-Influx. The CoT-Influx pruner
first selects important Chain-of-Thought (CoT) candidates and then prunes
uninformative tokens to fit the context window (see the pruning sketch after
the abstract).

Lastly, we build an efficient text-to-image (T2I) diffusion model, SnapGen,
that generates high-resolution, high-quality images on mobile platforms. A
cross-architecture knowledge distillation scheme guides the training of
SnapGen, and we further enable few-step generation by integrating
adversarial distillation.
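
To make the methods above concrete, three self-contained toy sketches in
PyTorch follow. They are illustrative reconstructions of the core ideas
rather than the thesis implementations; all class, function, and variable
names (SDQLinear, prune_cot_prompt, the bitwidth choices, dimensions, and
scores) are invented for illustration.

The first sketch shows mixed-precision learning in the spirit of SDQ: a
layer holds differentiable logits over candidate bitwidths, a stochastic
(Gumbel-softmax) sample mixes the corresponding fake-quantized weights, and
gradients flow back into the logits, so the bitwidth assignment itself is
optimized by gradient descent.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Toy sketch (not the thesis code): learn a bitwidth per layer by
    # backpropagating through a stochastic mixture of quantized weights.
    class SDQLinear(nn.Module):
        def __init__(self, in_features, out_features, bit_choices=(2, 4, 8)):
            super().__init__()
            self.linear = nn.Linear(in_features, out_features)
            self.bit_choices = bit_choices
            # Differentiable bitwidth parameters: one logit per candidate.
            self.bit_logits = nn.Parameter(torch.zeros(len(bit_choices)))

        def quantize(self, w, bits):
            # Uniform symmetric fake quantization with a straight-through
            # estimator so gradients still reach the latent weights.
            qmax = 2.0 ** (bits - 1) - 1
            scale = w.abs().max() / qmax
            wq = torch.round(w / scale).clamp(-qmax, qmax) * scale
            return w + (wq - w).detach()

        def forward(self, x):
            # Stochastic, differentiable selection over candidate bitwidths.
            probs = F.gumbel_softmax(self.bit_logits, tau=1.0)
            w = self.linear.weight
            wq = sum(p * self.quantize(w, b)
                     for p, b in zip(probs, self.bit_choices))
            return F.linear(x, wq, self.linear.bias)

    layer = SDQLinear(16, 8)
    layer(torch.randn(4, 16)).sum().backward()
    print(layer.bit_logits.grad)  # the bitwidth logits receive gradients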
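
The second sketch illustrates the rotation idea behind RoLoRA: for an
orthogonal matrix Q, W x = (W Q)(Q^T x), so the rotation can be folded into
the weights offline while the rotated activations spread an outlier's energy
across all channels, which is what makes low-bit weight-activation
quantization tractable. The dimensions and the injected outlier here are
arbitrary.

    import torch

    # Toy sketch (not the RoLoRA code): an orthogonal rotation removes
    # activation outliers without changing the layer's output.
    torch.manual_seed(0)
    d = 512
    x = torch.randn(d, dtype=torch.float64)
    x[7] = 100.0                      # one outlier channel, as seen in LLMs
    W = torch.randn(64, d, dtype=torch.float64) / d ** 0.5

    # Random orthogonal rotation from a QR decomposition.
    Q, _ = torch.linalg.qr(torch.randn(d, d, dtype=torch.float64))
    x_rot = Q.T @ x                   # outlier energy is spread out
    W_rot = W @ Q                     # rotation folded into the weight

    print(float(x.abs().max()), float(x_rot.abs().max()))  # 100.0 vs far less
    print(torch.allclose(W @ x, W_rot @ x_rot))            # True: output kept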
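
The third sketch is a hypothetical coarse-to-fine pruner in the spirit of
CoT-Influx: stage one keeps the highest-scoring CoT examples, stage two
keeps only the most informative tokens within the token budget. The
whitespace tokenizer and the externally supplied importance scores are
simplifications; how the actual pruner scores shots and tokens is outside
this sketch.

    # Toy sketch (not the CoT-Influx code) of coarse-to-fine prompt pruning.
    def prune_cot_prompt(cot_examples, example_scores, token_scores, budget):
        # Stage 1 (coarse): greedily keep the highest-scoring CoT examples
        # under a loose cap; stage 2 tightens the fit at the token level.
        order = sorted(range(len(cot_examples)),
                       key=lambda i: -example_scores[i])
        kept, used = [], 0
        for i in order:
            n = len(cot_examples[i].split())
            if used + n <= 2 * budget:
                kept.append(i)
                used += n
        kept.sort()  # restore the original example order

        # Stage 2 (fine): keep the `budget` most informative tokens,
        # preserving their original order in the prompt.
        tokens = [(tok, token_scores[i][j])
                  for i in kept
                  for j, tok in enumerate(cot_examples[i].split())]
        top = sorted(range(len(tokens)), key=lambda k: -tokens[k][1])[:budget]
        return " ".join(tokens[k][0] for k in sorted(top))

    shots = ["Q: 2+3? Think: 2+3=5. A: 5", "Q: 4*4? Think: 4*4=16. A: 16"]
    ex_scores = [0.9, 0.4]
    tok_scores = [[1.0] * len(s.split()) for s in shots]
    print(prune_cot_prompt(shots, ex_scores, tok_scores, budget=10))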


Date:                   Monday, 7 July 2025

Time:                   2:00pm - 4:00pm

Venue:                  Room 5501
                        Lifts 25/26

Chairman:               Prof. Kin Fai Ellick WONG (MGMT)

Committee Members:      Prof. Tim CHENG (Supervisor)
                        Dr. Junxian HE
                        Dr. Dan XU
                        Prof. Chi Ying TSUI (ECE)
                        Prof. Ping LUO (HKU)