Towards Efficiently Building Trustworthy Language Models

PhD Thesis Proposal Defence


Title: "Towards Efficiently Building Trustworthy Language Models"

by

Mr. Ning LU


Abstract:

Language models have revolutionized natural language processing, with large 
language models like ChatGPT, Llama, and DeepSeek achieving impressive 
performance across various tasks. Despite their success, these models face 
issues with trustworthiness, including weak robustness and the risk of 
generating unsafe content. In this thesis proposal, we aim to enhance the 
trustworthiness of language models through efficient methods. Specifically, we 
propose three approaches to improve the robustness and safety of language 
models: exploiting word-frequency information, prompting language models for 
automatic data augmentation, and applying a training-free weight modification 
technique.

First, we propose an $n$-gram frequency descent training method that enhances 
model robustness without relying on gradient computations, thereby reducing 
overall training time. This approach is motivated by a systematic analysis of 
word-level adversarial attacks, which reveals that such attacks often use 
words or phrases with lower $n$-gram frequencies. To address this, we 
construct low-frequency text sequences and incorporate them into the training 
data to enhance the robustness of the models.
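The frequency analysis underlying this approach can be sketched as follows. This is a minimal illustration, not the proposal's implementation: the toy corpus, the helper names, and the frequency threshold are all assumptions made for the example.

```python
from collections import Counter

def ngram_frequencies(corpus, n=2):
    """Count n-gram occurrences over a whitespace-tokenized corpus."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def low_frequency_ngrams(text, counts, n=2, threshold=0):
    """Return the n-grams in `text` whose corpus frequency is at or
    below `threshold` -- candidates an adversarial edit may have introduced."""
    tokens = text.lower().split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return [g for g in grams if counts[g] <= threshold]

# Toy corpus standing in for the training data.
corpus = [
    "the movie was great",
    "the movie was fun",
    "the plot was great",
]
counts = ngram_frequencies(corpus, n=2)

# A word-level substitution ("great" -> "splendid") produces a bigram
# never seen in the corpus, which is the signal the method exploits.
rare = low_frequency_ngrams("the movie was splendid", counts, n=2)
```

Sequences containing such rare n-grams can then be added to the training data, which is the augmentation step the paragraph describes.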

Second, we propose a prompt-based LLM adversarial example generation 
approach, which constructs universal prompts that guide LLMs to generate 
high-quality, transferable adversarial examples. We employ a local 
combinatorial optimization algorithm to iteratively optimize in-context 
examples. This approach eliminates the need for costly model training and 
enables efficient adversarial data generation, thereby enhancing the 
robustness of language models.
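The local combinatorial optimization over in-context examples can be sketched as a simple hill-climbing loop. This is a schematic under stated assumptions: the candidate pool, the swap move, and the scoring function (in practice something like the attack success rate of the adversarial examples an LLM generates from the prompt) are placeholders, not the proposal's actual algorithm.

```python
import random

def optimize_in_context_examples(pool, k, score_fn, iters=200, seed=0):
    """Greedy local search: start from a random k-subset of candidate
    in-context examples and repeatedly try single-element swaps,
    keeping any swap that improves score_fn."""
    rng = random.Random(seed)
    current = rng.sample(pool, k)
    best = score_fn(current)
    for _ in range(iters):
        i = rng.randrange(k)
        candidate = rng.choice(pool)
        if candidate in current:  # keep the k examples distinct
            continue
        trial = current.copy()
        trial[i] = candidate
        s = score_fn(trial)
        if s > best:  # accept only improving swaps
            current, best = trial, s
    return current, best

# Toy run: integers stand in for candidate in-context examples, and
# `sum` stands in for a prompt-quality score.
pool = list(range(10))
examples, best = optimize_in_context_examples(pool, 3, score_fn=sum)
```

Because each step only evaluates one local swap, the search needs no gradients or model training, which is what makes the prompt construction cheap.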

Third, we propose a training-free weight modification technique that enhances 
content safety. This method operates on the difference between model weights 
before and after fine-tuning, without relying on data augmentation. We 
formulate weight selection as a knapsack problem, in which a utility-to-safety 
ratio is used to rank and choose weights. Through efficient weight 
modification, including delta weight selection and safety compensation, the 
method achieves strong performance in terms of both safety and utility.
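The knapsack-style selection can be sketched with a standard greedy ratio heuristic. This is an illustration only: the block names, the utility and safety scores, and the budget are invented for the example, and the proposal's actual selection procedure may differ.

```python
def select_delta_weights(deltas, budget):
    """Greedy knapsack heuristic: each delta-weight block carries an
    estimated utility gain and safety cost (assumed positive); keep the
    blocks with the highest utility-to-safety ratio until the safety
    budget is exhausted."""
    ranked = sorted(deltas, key=lambda d: d["utility"] / d["safety"],
                    reverse=True)
    chosen, spent = [], 0.0
    for d in ranked:
        if spent + d["safety"] <= budget:
            chosen.append(d["name"])
            spent += d["safety"]
    return chosen

# Toy delta-weight blocks with hypothetical scores.
deltas = [
    {"name": "attn.0", "utility": 4.0, "safety": 2.0},
    {"name": "mlp.0",  "utility": 3.0, "safety": 3.0},
    {"name": "attn.1", "utility": 2.0, "safety": 1.0},
]
kept = select_delta_weights(deltas, budget=3.0)
```

Only the selected delta weights would then be applied to the base model, with the remainder handled by the safety compensation step the paragraph mentions.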

Collectively, these proposed approaches provide a data- and compute-efficient 
framework for building trustworthy language models.


Date:                   Thursday, 12 February 2026

Time:                   10:00am - 11:30am

Venue:                  Room 2132C
                        Lift 22

Committee Members:      Prof. Cunsheng Ding (Supervisor)
                        Dr. Qi Wang (Co-Supervisor, SUSTECH)
                        Prof. Ke Yi (Chairperson)
                        Dr. Xiaojuan Ma