Efficient and Effective Quantization for Large-Scale Models
PhD Thesis Proposal Defence

Title: "Efficient and Effective Quantization for Large-Scale Models"

by

Mr. Xijie HUANG

Abstract:

Despite the outstanding performance of large-scale deep learning models across language and vision tasks, their growing computation and model size have increased the demand for efficient deployment. To address the heavy computation and parameter costs, quantization of weights and activations has been the most widely used technique, as it offers promising affinity across different hardware architectures. However, existing quantization methods fall short in either efficiency or effectiveness, which largely limits their generalizability in practical applications.

In this thesis proposal, we start with an introduction to the concepts and principles of quantization, followed by an in-depth survey of existing quantization techniques for various large-scale models, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Large Language Models (LLMs). The subsequent sections focus on how to improve the efficiency and effectiveness of quantization algorithms for a specific large-scale model or a combination of them.

First, we propose an effective mixed-precision quantization method named Stochastic Differentiable Quantization (SDQ), specifically designed for CNNs. In the SDQ framework, the optimal mixed-precision quantization strategy is learned via a set of differentiable bitwidth parameters that act as probability factors during stochastic quantization.

Then, we focus on the challenges in ViT quantization. We propose an effective variation-aware ViT quantization method based on an in-depth analysis of quantization sensitivity, contrasting CNNs with transformers, and monitoring weight oscillation during training. To address the challenges presented by variation, our variation-aware quantization technique includes module-dependent quantization and scaling, variation-aware knowledge distillation, and oscillation-aware bin regularization.

Lastly, we improve the quantization effectiveness of LLMs by solving the activation outlier problem. We propose RoLoRA, the first LoRA-based scheme for effective weight-activation quantization. RoLoRA uses rotation to eliminate outliers and introduces rotation-aware fine-tuning to preserve the outlier-free characteristics of rotated LLMs.

Date:   Thursday, 24 April 2025

Time:   2:00pm - 4:00pm

Venue:  Room 4621
        Lifts 31/32

Committee Members:  Prof. Tim Cheng (Supervisor)
                    Dr. Dan Xu (Chairperson)
                    Prof. Gary Chan
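
Background note: as a minimal illustration of the weight/activation quantization operation referenced throughout the abstract, the sketch below shows symmetric uniform quantization of a tensor. The function and parameter names are illustrative assumptions for exposition only; they are not the candidate's SDQ, variation-aware, or RoLoRA methods.

import torch

def quantize_symmetric(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Illustrative symmetric uniform quantization (sketch, not the thesis method).

    Maps real values onto an integer grid in [-(2^(b-1) - 1), 2^(b-1) - 1]
    with a single per-tensor scale, then dequantizes back to floats.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax          # per-tensor scale
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)  # round to integer grid
    return q * scale                                      # dequantized values

# Example: quantize a random weight tensor to 4 bits and inspect the error
w = torch.randn(64, 64)
w_q = quantize_symmetric(w, num_bits=4)
print((w - w_q).abs().mean())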