Towards Efficient and Effective Inference for Large-Scale Models
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Towards Efficient and Effective Inference for Large-Scale Models"

By

Mr. Xijie HUANG

Abstract:

While large-scale deep learning models have achieved remarkable success in natural language and vision tasks, their growing computational demands and model sizes necessitate efficient inference, particularly on edge devices with memory bandwidth constraints. To address this efficiency bottleneck, model compression techniques such as quantization, pruning, knowledge distillation, and low-rank decomposition have been extensively studied in the research community and widely adopted in various AI applications.

This thesis begins with an introduction to the principles of, and a discussion of the challenges in, model compression and inference acceleration for large-scale models, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), Large Language Models (LLMs), and Diffusion Models (DMs). The following chapters present novel methods to enhance the efficiency and effectiveness of inference across these architectures.

First, we focus on the inference efficiency of CNNs and propose Stochastic Differentiable Quantization (SDQ). In our SDQ framework, the optimal mixed-precision strategy is learned by optimizing differentiable bitwidth parameters during stochastic quantization.

Second, we turn to the challenges in ViT inference efficiency. We propose an effective Variation-aware ViT Quantization (VVTQ), which combines module-dependent quantization and scaling, variation-aware knowledge distillation, and oscillation-aware bin regularization.

Third, we improve the inference efficiency of LLMs by solving the activation outlier problem. We propose RoLoRA, the first LoRA-based scheme for effective weight-activation quantization.
RoLoRA utilizes rotation for outlier elimination and proposes rotation-aware fine-tuning to preserve the outlier-free characteristics of rotated LLMs.

Fourth, we improve both the reasoning efficiency and effectiveness of LLMs using a coarse-to-fine prompt pruner, named CoT-Influx. The CoT-Influx pruner first selects important Chain-of-Thought (CoT) candidates and then prunes uninformative tokens to fit the context window.

Lastly, we build an efficient text-to-image (T2I) diffusion model, SnapGen, which generates high-resolution, high-quality images on mobile platforms. A cross-architecture knowledge distillation scheme is proposed to guide the training of SnapGen, and we further enable few-step generation by integrating adversarial distillation.

Date: Monday, 7 July 2025
Time: 2:00pm - 4:00pm
Venue: Room 5501 (Lifts 25/26)

Chairman: Prof. Kin Fai Ellick WONG (MGMT)

Committee Members:
Prof. Tim CHENG (Supervisor)
Dr. Junxian HE
Dr. Dan XU
Prof. Chi Ying TSUI (ECE)
Prof. Ping LUO (HKU)