Towards Efficiently Building Trustworthy Language Models

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Towards Efficiently Building Trustworthy Language Models"

By

Mr. Ning LU


Abstract:

Language models have revolutionized natural language processing, with large
language models like ChatGPT showing strong performance across many tasks and
being widely applied in practice. Despite their success, they still face
challenges in trustworthiness, such as limited robustness and the risk of
producing unsafe content. This thesis aims to improve the trustworthiness of
language models through efficient methods. Specifically, we propose three
efficient approaches to enhance the robustness and safety of language models:
utilizing word frequency information, prompting language models for automatic
data augmentation, and modifying model weights without additional training.

First, we propose an n-gram frequency descent training method that improves
robustness without requiring gradient computations, thereby reducing training
time. It is based on the observation that adversarial attacks often use
low-frequency words. We therefore generate low-frequency text and add it to
the training data to strengthen model robustness.
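As a rough illustration of the idea of generating lower-frequency text, the
sketch below counts n-grams in a tokenized corpus and picks the single word
substitution that most reduces a sentence's total n-gram frequency. It is not
the thesis's algorithm: the function names, the scoring rule (a plain sum of
counts), and the candidate substitution table are all assumptions made for
this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def ngram_counts(corpus: List[List[str]], n: int) -> Counter:
    """Count every n-gram (as a tuple of tokens) in a tokenized corpus."""
    counts: Counter = Counter()
    for tokens in corpus:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def frequency_score(tokens: List[str], counts: Counter, n: int) -> int:
    """Sum the corpus counts of all n-grams in the sentence (assumed scoring rule)."""
    return sum(counts[tuple(tokens[i:i + n])] for i in range(len(tokens) - n + 1))


def lowest_frequency_variant(
    tokens: List[str],
    candidates: Dict[int, List[str]],
    counts: Counter,
    n: int,
) -> List[str]:
    """Try each single-word substitution and keep the one with the lowest score.

    `candidates` maps a token position to hypothetical replacement words
    (for example synonyms); how they are obtained is outside this sketch.
    """
    best, best_score = list(tokens), frequency_score(tokens, counts, n)
    for pos, words in candidates.items():
        for word in words:
            variant = tokens[:pos] + [word] + tokens[pos + 1:]
            score = frequency_score(variant, counts, n)
            if score < best_score:  # strict descent: only accept a lower score
                best, best_score = variant, score
    return best


corpus = [
    ["the", "movie", "was", "good"],
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "good"],
]
counts = ngram_counts(corpus, n=1)
augmented = lowest_frequency_variant(
    ["the", "movie", "was", "good"], {1: ["film"], 3: ["great"]}, counts, n=1
)
```

No gradients are involved at any point: the variant is chosen from corpus
statistics alone, and it would then be added to the training data.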

Second, we introduce a prompt-based approach for generating adversarial
examples with LLMs. By designing universal prompts and using a local search
to refine in-context examples, this method efficiently produces high-quality
adversarial data without training, thereby enhancing model robustness.

Third, we present a training-free weight modification method to improve content
safety. It adjusts model weights before and after fine-tuning without data
augmentation. We formulate weight selection as a knapsack problem and rank
weights by their utility-to-safety ratio, achieving strong performance on both
safety and utility.
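The ratio-based ranking can be pictured with the standard greedy heuristic for
a knapsack problem. The sketch below is only an illustration under assumed
definitions: each candidate weight (or weight group) has a hypothetical utility
gain and a hypothetical positive safety cost, and items are selected in
descending ratio order until a safety budget is exhausted. How the thesis
actually measures utility and safety is not stated in this abstract.

```python
from typing import List


def greedy_select(utility: List[float], safety_cost: List[float], budget: float) -> List[int]:
    """Select item indices by descending utility-to-safety-cost ratio.

    Assumes every safety cost is strictly positive. This is the classic
    greedy knapsack heuristic, which is not guaranteed to be optimal.
    """
    order = sorted(
        range(len(utility)),
        key=lambda i: utility[i] / safety_cost[i],
        reverse=True,
    )
    chosen: List[int] = []
    spent = 0.0
    for i in order:
        # Take the item only if it still fits within the remaining budget.
        if spent + safety_cost[i] <= budget:
            chosen.append(i)
            spent += safety_cost[i]
    return chosen


# Hypothetical scores for three candidate weight groups.
selected = greedy_select(
    utility=[6.0, 10.0, 12.0],
    safety_cost=[1.0, 2.0, 3.0],
    budget=5.0,
)
```

With these made-up numbers the ratios are 6, 5, and 4, so the first two groups
are taken and the third is skipped because it would exceed the budget.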

Collectively, these approaches provide a data- and compute-efficient
framework for building trustworthy language models. Building on the
theoretical and empirical findings of this thesis, we identify many directions
for future exploration.


Date:                   Tuesday, 12 May 2026

Time:                   2:00pm - 4:00pm

Venue:                  Room 3494
                        Lifts 25/26

Chairman:               Prof. Jun ZHANG (ECE)

Committee Members:      Prof. Cunsheng DING (Supervisor)
                        Dr. Qi WANG (Co-supervisor, SUSTech)
                        Prof. Ke YI
                        Dr. Mingxun ZHOU
                        Dr. Maosheng XIONG (MATH)
                        Prof. Chengqi ZHANG (PolyU)