The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Towards Efficient and Effective Inference for Large-Scale Models"
By
Mr. Xijie HUANG
Abstract:
While large-scale deep learning models have achieved remarkable success in
natural language and vision tasks, their growing computational demands and
model sizes necessitate efficient inference, particularly on edge devices
with memory bandwidth constraints. To address this efficiency
bottleneck, model compression techniques, such as quantization, pruning,
knowledge distillation, and low-rank decomposition, have been extensively
studied in the research community and widely adopted in various AI
applications.
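Among these techniques, quantization maps floating-point weights to low-bit
integers. As a generic illustration only (not one of the schemes proposed in
the thesis), the sketch below shows symmetric per-tensor INT8 quantization,
where the reconstruction error is bounded by half the quantization step:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(1)
w = rng.normal(size=(64, 64)).astype(np.float32)

q, s = quantize_int8(w)
w_hat = q.astype(np.float32) * s             # dequantized reconstruction

# Rounding error per element is at most half a quantization step (scale / 2).
assert np.abs(w - w_hat).max() <= s / 2 + 1e-6
```

Mixed-precision methods such as SDQ go further by choosing a different
bitwidth per layer or module rather than fixing INT8 everywhere.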
In this thesis, we first introduce the principles and discuss the challenges
of model compression and inference acceleration
for large-scale models, including Convolutional Neural Networks (CNNs),
Vision Transformers (ViTs), Large Language Models (LLMs), and Diffusion
Models (DMs). In the following chapters, we will present novel methods to
enhance the efficiency and effectiveness of inference across these
architectures.
First, we focus on the inference efficiency of CNNs and propose Stochastic
Differentiable Quantization (SDQ). In our SDQ framework, the optimal
mixed-precision strategy is learned by optimizing differentiable bitwidth
parameters during stochastic quantization. Second, we turn to
the challenges in ViTs' inference efficiency. We propose an effective
Variation-aware ViT Quantization (VVTQ), which includes module-dependent
quantization and scaling, variation-aware knowledge distillation, and
oscillation-aware bin regularization. Third, we improve the inference
efficiency of LLMs by addressing the activation outlier problem. We propose
RoLoRA, the first LoRA-based scheme for effective weight-activation
quantization. RoLoRA utilizes rotation for outlier elimination and proposes
rotation-aware fine-tuning to preserve the outlier-free characteristics in
rotated LLMs. Fourth, we improve both the reasoning efficiency and
effectiveness of LLMs using a coarse-to-fine prompt pruner named CoT-Influx.
The CoT-Influx pruner first selects important Chain-of-Thought (CoT)
candidates and then prunes uninformative tokens to fit the context
window. Lastly, we build an efficient text-to-image (T2I) diffusion model,
SnapGen, that generates high-resolution, high-quality images on mobile
platforms. A cross-architecture knowledge distillation scheme is proposed to
guide the training of SnapGen, and we further enable few-step generation by
integrating adversarial distillation.
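To give intuition for the rotation idea behind outlier elimination, the toy
sketch below (an illustrative assumption based on the standard
orthogonal-rotation trick, not necessarily the exact construction used in
RoLoRA) shows that rotating activations while counter-rotating weights
leaves a linear layer's output unchanged, while spreading an outlier
channel's mass across all channels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations with one strong outlier channel, as often seen in LLMs.
x = rng.normal(size=(4, 8))
x[:, 3] *= 50.0                       # channel 3 dominates the dynamic range
W = rng.normal(size=(8, 8))           # linear layer weights: y = x @ W

# Random orthogonal rotation R, built via QR decomposition (R.T @ R == I).
R, _ = np.linalg.qr(rng.normal(size=(8, 8)))

x_rot = x @ R                         # rotate activations
W_rot = R.T @ W                       # counter-rotate weights

# The layer output is mathematically unchanged: (x @ R) @ (R.T @ W) == x @ W.
assert np.allclose(x_rot @ W_rot, x @ W)

# The outlier is now spread across channels, which typically shrinks the
# dynamic range that a weight-activation quantizer must cover.
print("max |x| before:", np.abs(x).max(), " after:", np.abs(x_rot).max())
```

Because the rotation is folded into the weights, the network computes the
same function, but the rotated activations are much friendlier to low-bit
quantization.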
Date: Monday, 7 July 2025
Time: 2:00pm - 4:00pm
Venue: Room 5501
Lifts 25/26
Chairman: Prof. Kin Fai Ellick WONG (MGMT)
Committee Members: Prof. Tim CHENG (Supervisor)
Dr. Junxian HE
Dr. Dan XU
Prof. Chi Ying TSUI (ECE)
Prof. Ping LUO (HKU)