PhD Thesis Proposal Defence
Title: "Efficient and Effective Quantization for Large-Scale Models"
by
Mr. Xijie HUANG
Abstract:
Despite the outstanding performance of large-scale deep learning models
across language and vision tasks, the growth in computation and model size
has raised the demand for efficient deployment. To reduce the heavy
computation and parameter costs, quantization of weights and activations has
become the most widely used technique, as it offers good affinity with a
wide range of hardware architectures. However, existing quantization methods
fall short in either efficiency or effectiveness, which largely limits their
applicability. In this thesis proposal, we start with an introduction to the
concepts and principles of quantization, followed by an in-depth survey of
existing quantization techniques for various large-scale models, including
Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Large
Language Models (LLMs). In the subsequent sections, we focus on how to
improve the efficiency and effectiveness of quantization algorithms for
these large-scale models, individually or in combination.
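As a minimal illustration of the basic principle, the sketch below shows
symmetric uniform fake-quantization of a tensor in PyTorch; the function
name and the per-tensor scaling choice are illustrative assumptions rather
than details from the proposal.

    import torch

    def uniform_quantize(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
        """Symmetric uniform fake-quantization: snap x to a discrete grid and map back."""
        qmax = 2 ** (bits - 1) - 1                     # e.g. 127 for 8 bits
        scale = x.abs().max().clamp(min=1e-8) / qmax   # per-tensor scale factor
        x_int = torch.clamp(torch.round(x / scale), -qmax, qmax)
        return x_int * scale                           # dequantized values used in simulation

    w = torch.randn(64, 64)
    for b in (8, 4, 2):
        err = (w - uniform_quantize(w, bits=b)).abs().mean()
        print(f"{b}-bit mean absolute error: {err:.4f}")  # error grows as bitwidth shrinks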
First, we propose an effective mixed-precision quantization method named
Stochastic Differentiable Quantization (SDQ), designed specifically for
CNNs. In the SDQ framework, the optimal mixed-precision quantization
strategy is learned through a set of differentiable bitwidth parameters that
act as probability factors during stochastic quantization.
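A minimal sketch of how such a scheme could look is given below, assuming
learnable logits over candidate bitwidths, stochastic sampling of one
bitwidth per step, and a straight-through estimator; the class name and all
implementation details are illustrative assumptions, not the actual SDQ
code.

    import torch
    import torch.nn as nn

    class StochasticBitwidthQuant(nn.Module):
        """Illustrative sketch: sample a bitwidth from learnable probabilities, then fake-quantize."""
        def __init__(self, candidate_bits=(2, 4, 8)):
            super().__init__()
            self.candidate_bits = candidate_bits
            # Differentiable parameters acting as probability factors over bitwidths.
            self.logits = nn.Parameter(torch.zeros(len(candidate_bits)))

        def forward(self, w: torch.Tensor) -> torch.Tensor:
            probs = torch.softmax(self.logits, dim=0)
            idx = torch.multinomial(probs, 1).item()      # stochastic bitwidth choice
            bits = self.candidate_bits[idx]
            qmax = 2 ** (bits - 1) - 1
            scale = w.abs().max().clamp(min=1e-8) / qmax
            w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
            # Straight-through estimator: forward returns w_q, backward passes gradients
            # to w; the probs[idx] factor keeps the sampled branch differentiable
            # with respect to the bitwidth logits.
            return (w + (w_q - w).detach()) * probs[idx] / probs[idx].detach()

    quant = StochasticBitwidthQuant()
    w = torch.randn(32, 32, requires_grad=True)
    loss = quant(w).pow(2).mean()
    loss.backward()
    print(quant.logits.grad)  # gradients reach the bitwidth probabilities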
Then, we focus on the challenges of quantizing ViTs. We propose an effective
variation-aware ViT quantization method based on an in-depth analysis of
quantization sensitivity, contrasting CNNs with transformers and monitoring
weight oscillation during training. To address the challenges posed by this
variation, our variation-aware quantization technique combines
module-dependent quantization and scaling, variation-aware knowledge
distillation, and oscillation-aware bin regularization.
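As an illustration of the last ingredient, the sketch below shows one
plausible form of a bin regularizer that penalizes the distance between
latent weights and their quantization-bin centers, which discourages
oscillation between neighbouring bins; the formulation and the function name
are illustrative assumptions rather than the exact method.

    import torch

    def bin_regularization(w: torch.Tensor, bits: int = 4, strength: float = 1e-4) -> torch.Tensor:
        """Illustrative penalty pulling latent weights toward their quantization-bin centers."""
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        bin_centers = (torch.clamp(torch.round(w / scale), -qmax, qmax) * scale).detach()
        # Weights far from a bin center tend to flip between neighbouring bins;
        # penalizing that distance encourages them to settle.
        return strength * (w - bin_centers).pow(2).mean()

    w = torch.randn(16, 16, requires_grad=True)
    task_loss = w.pow(2).mean()                     # stand-in for the real training loss
    total_loss = task_loss + bin_regularization(w)
    total_loss.backward()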
Lastly, we improve the quantization effectiveness of LLMs by addressing the
activation outlier problem. We propose RoLoRA, the first LoRA-based scheme
for effective weight-activation quantization. RoLoRA uses rotation to
eliminate outliers and introduces rotation-aware fine-tuning to preserve the
outlier-free characteristics of the rotated LLMs.
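The intuition behind the rotation step can be illustrated with a small toy
example (not the RoLoRA code): multiplying activations by an orthogonal
matrix and folding its inverse into the weights leaves the layer output
unchanged while spreading outlier energy across channels, which makes the
activations easier to quantize.

    import torch

    torch.manual_seed(0)
    d = 256
    x = torch.randn(8, d)
    x[:, 7] *= 50.0                      # emulate a channel-wise activation outlier
    w = torch.randn(d, d) / d ** 0.5     # weight of a linear layer with output x @ w

    # Random orthogonal rotation; rotation-based methods often use Hadamard matrices,
    # but a random orthogonal matrix suffices for illustration.
    rot, _ = torch.linalg.qr(torch.randn(d, d))

    x_rot = x @ rot                      # rotate the activations
    w_rot = rot.t() @ w                  # fold the inverse rotation into the weights

    print(torch.allclose(x @ w, x_rot @ w_rot, atol=1e-3))  # layer output is preserved
    print(x.abs().max().item(), x_rot.abs().max().item())   # peak activation magnitude typically drops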
Date: Tuesday, 22 April 2025
Time: 2:00pm - 4:00pm
Venue: Room 4621
Lifts 31/32
Committee Members: Prof. Tim Cheng (Supervisor)
Dr. Dan Xu (Chairperson)
Prof. Gary Chan