Efficient and Effective Quantization for Large-Scale Models

PhD Thesis Proposal Defence


Title: "Efficient and Effective Quantization for Large-Scale Models"

by

Mr. Xijie HUANG


Abstract:

Despite the outstanding performance of large-scale deep learning models 
across language and vision tasks, the growth in computation and model 
size has increased the demand for efficient deployment. To reduce the 
heavy computation and parameter costs, quantization of weights and 
activations has become the most widely adopted technique, owing to its 
favorable compatibility with a broad range of hardware architectures. 
However, existing quantization methods fall short in either efficiency 
or effectiveness, which largely limits their applicability in practice. 
In this thesis proposal, we will start with an introduction to the 
concepts and principles of quantization, followed by an in-depth survey 
of existing quantization techniques for various large-scale models, 
including Convolutional Neural Networks (CNNs), Vision Transformers 
(ViTs), and Large Language Models (LLMs). In the subsequent sections, we 
will focus on improving the efficiency and effectiveness of quantization 
algorithms for each of these large-scale models or their combinations.

First, we propose an effective mixed-precision quantization method named 
Stochastic Differentiable Quantization (SDQ), designed specifically for 
CNNs. In the SDQ framework, the optimal mixed-precision quantization 
strategy is learned through a set of differentiable bitwidth parameters 
that serve as probability factors during stochastic quantization (see 
the illustrative sketch below). Next, we turn to the challenges of ViT 
quantization. We propose an effective variation-aware ViT quantization 
method based on an in-depth analysis of quantization sensitivity, 
contrasting CNNs with Transformers and monitoring weight oscillation 
during training. To address the challenges posed by such variation, our 
technique combines module-dependent quantization and scaling, 
variation-aware knowledge distillation, and oscillation-aware bin 
regularization. Lastly, we improve the quantization effectiveness of 
LLMs by addressing the activation outlier problem. We propose RoLoRA, 
the first LoRA-based scheme for effective weight-activation 
quantization. RoLoRA applies rotation to eliminate outliers and 
introduces rotation-aware fine-tuning to preserve the outlier-free 
characteristics of rotated LLMs.
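
For illustration only, the following is a minimal sketch of the SDQ 
idea, assuming a PyTorch-style formulation: candidate bitwidths are 
weighted by a learnable probability vector derived from differentiable 
bitwidth logits, and a straight-through estimator carries gradients 
through the rounding step. The names (StochasticDifferentiableQuantizer, 
bit_candidates, bit_logits) are hypothetical and are not taken from the 
thesis implementation.

    # Hypothetical SDQ-style sketch: learnable bitwidth probabilities with a
    # straight-through estimator. Illustrative only, not the thesis code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class StochasticDifferentiableQuantizer(nn.Module):
        def __init__(self, bit_candidates=(2, 4, 8)):
            super().__init__()
            self.bit_candidates = bit_candidates
            # Differentiable bitwidth parameters: one logit per candidate.
            self.bit_logits = nn.Parameter(torch.zeros(len(bit_candidates)))

        @staticmethod
        def _uniform_quantize(w, bits):
            # Symmetric uniform quantization; the straight-through estimator
            # lets gradients pass through the non-differentiable rounding.
            qmax = 2 ** (bits - 1) - 1
            scale = w.abs().max().clamp(min=1e-8) / qmax
            w_q = torch.round(w / scale).clamp(-qmax, qmax) * scale
            return w + (w_q - w).detach()

        def forward(self, w):
            # Probabilities over candidate bitwidths, learned jointly with w.
            probs = F.softmax(self.bit_logits, dim=0)
            # Differentiable relaxation: expected quantized weight under the
            # learned bitwidth distribution (a stochastic variant would sample
            # one bitwidth per step from these probabilities instead).
            return sum(p * self._uniform_quantize(w, b)
                       for p, b in zip(probs, self.bit_candidates))

    # Usage: both the weights and the bitwidth logits receive gradients.
    quantizer = StochasticDifferentiableQuantizer()
    w = nn.Parameter(torch.randn(64, 64))
    loss = quantizer(w).pow(2).mean()
    loss.backward()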


Date:                   Thursday, 24 April 2025

Time:                   2:00pm - 4:00pm

Venue:                  Room 4621
                        Lifts 31/32

Committee Members:      Prof. Tim Cheng (Supervisor)
                        Dr. Dan Xu (Chairperson)
                        Prof. Gary Chan