PhD Thesis Proposal Defence
Title: "Towards Efficiently Building Trustworthy Language Models"
by
Mr. Ning LU
Abstract:
Language models have revolutionized natural language processing, with large
language models like ChatGPT, Llama, and DeepSeek achieving impressive
performance across various tasks. Despite their success, these models face
issues with trustworthiness, including weak robustness and the risk of
generating unsafe content. In this thesis proposal, we aim to enhance the
trustworthiness of language models through efficient methods. Specifically, we propose three efficient approaches to improve their robustness and safety: exploiting word frequency information, prompting language models for automatic data augmentation, and applying a training-free weight modification technique.
First, we propose an $n$-gram frequency descent training method that enhances
model robustness without relying on gradient computations, thereby reducing
overall training time. This approach is motivated by a systematic analysis of
word-level adversarial attacks, which reveals that such attacks often use
words or phrases with lower $n$-gram frequencies. To address this, we
construct low-frequency text sequences and incorporate them into the training data, closing the frequency gap that such attacks exploit.
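
As a rough sketch of this idea, one could count $n$-gram frequencies over a tokenized corpus and greedily swap words for synonyms that lower the local frequency. The corpus format, synonym table, and function names below are illustrative assumptions, not the implementation used in the thesis:

    from collections import Counter

    def ngram_counts(corpus, n=2):
        # Count n-gram frequencies over a tokenized corpus (a list of token lists).
        counts = Counter()
        for tokens in corpus:
            counts.update(zip(*(tokens[i:] for i in range(n))))
        return counts

    def low_frequency_variant(tokens, synonyms, counts):
        # Greedily replace each word with the candidate (itself or a synonym)
        # whose surrounding bigrams are least frequent in the corpus.
        out = list(tokens)
        for i, word in enumerate(out):
            candidates = [word] + synonyms.get(word, [])
            def local_freq(cand):
                left = counts[(out[i - 1], cand)] if i > 0 else 0
                right = counts[(cand, out[i + 1])] if i + 1 < len(out) else 0
                return left + right
            out[i] = min(candidates, key=local_freq)
        return out

Variants produced this way sit in the low-frequency region that such attacks favour, so mixing them into the training data exposes the model to that region without any gradient-based adversarial search.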
Second, we propose a prompt-based LLM adversarial example generation
approach, which constructs universal prompts that guide LLMs to generate
high-quality, transferable adversarial examples. We employ a local
combinatorial optimization algorithm to iteratively optimize in-context
examples. This approach eliminates the need for costly model training and enables efficient adversarial data generation; the generated examples can then be used to enhance the robustness of language models.
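
A minimal sketch of such a local combinatorial search follows, assuming a hypothetical score() callback that inserts the candidate in-context examples into the universal prompt, queries the LLM, and returns the resulting attack success rate:

    import random

    def optimize_in_context_examples(pool, k, score, iters=50, seed=0):
        # Local search: start from k random in-context examples, then repeatedly
        # propose single-example swaps, keeping any swap that improves the score.
        rng = random.Random(seed)
        current = rng.sample(pool, k)
        best = score(current)
        for _ in range(iters):
            i = rng.randrange(k)
            replacement = rng.choice([p for p in pool if p not in current])
            trial = current[:i] + [replacement] + current[i + 1:]
            trial_score = score(trial)
            if trial_score > best:
                current, best = trial, trial_score
        return current, best

Because only score() touches the target LLM, each iteration costs a handful of forward queries and never computes gradients or updates model weights.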
Third, we propose a training-free weight modification technique that enhances
content safety. The method operates on the differences between model weights before and after fine-tuning, without relying on data augmentation. We formulate weight selection as a knapsack problem, using a utility-to-safety ratio to rank and select weights. The method achieves strong performance
in terms of safety and utility through efficient weight modification,
including delta weight selection and safety compensation.
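
One way to picture the selection step is a greedy ratio heuristic for the underlying knapsack problem. The per-block utility and safety-cost scores below stand in for whatever estimates the method actually uses; this is an illustrative sketch, not the thesis's algorithm:

    def select_delta_weights(utility, safety_cost, budget):
        # Greedy knapsack heuristic: rank candidate delta-weight blocks by
        # their utility-to-safety ratio, then keep blocks while the cumulative
        # safety cost stays within the budget.
        ranked = sorted(range(len(utility)),
                        key=lambda i: utility[i] / max(safety_cost[i], 1e-8),
                        reverse=True)
        kept, spent = [], 0.0
        for i in ranked:
            if spent + safety_cost[i] <= budget:
                kept.append(i)
                spent += safety_cost[i]
        return kept  # indices of the fine-tuning deltas to retain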
Collectively, these proposed approaches provide a data- and compute-efficient framework for building trustworthy language models.
Date: Thursday, 12 February 2026
Time: 10:00am - 11:30am
Venue: Room 2132C (Lift 22)
Committee Members: Prof. Cunsheng Ding (Supervisor)
Dr. Qi Wang (Co-Supervisor, SUSTECH)
Prof. Ke Yi (Chairperson)
Dr. Xiaojuan Ma