Enhancing and Hardening Neural Code Model

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Enhancing and Hardening Neural Code Model"

By

Mr. Zongjie LI


Abstract:

With the rapid advancement of deep learning technologies, neural code models 
have achieved remarkable success, facilitating significant breakthroughs 
across various code-related applications. Leveraging powerful computational 
resources and massive training data, these models demonstrate sophisticated 
capabilities in understanding, analyzing, and generating diverse programming 
code. Unlike models primarily designed for natural language tasks, code 
models are typically engineered for integration into various productivity 
scenarios and practical development workflows. Consequently, developing 
neural code models with high accuracy, reliability, and freedom from 
potential intellectual property risks has become imperative.

This thesis focuses on designing and developing neural code models 
through four key aspects: 1) enhancing model performance through data 
augmentation and architectural improvements, 2) refining output consistency 
through code structure and semantic analysis, 3) incorporating verifiable 
watermarks to protect intellectual property, and 4) synthesizing 
domain-specific datasets for code models. In our first contribution, we 
present a framework that leverages compiler-generated Intermediate 
Representation (IR) code for data augmentation, enabling improved embeddings 
that support various downstream code applications. To further enhance code 
generation capabilities, our second work introduces CCTEST, a system that 
inserts context-free code snippets to detect and rectify inconsistencies. In 
our third work, we exploit programming language semantics and token 
distribution characteristics to embed verifiable watermarks in model outputs, 
thereby enhancing model security and intellectual property protection. In our 
fourth work, we propose a novel approach to synthesizing domain-specific 
datasets for fine-tuning code models, addressing the challenges of data 
scarcity and quality in specialized domains.


Date:                   Tuesday, 29 July 2025

Time:                   1:00pm - 3:00pm

Venue:                  Room 5501
                        Lifts 25/26

Chairman:               Dr. Terence Tsz Wai WONG (CBE)

Committee Members:      Dr. Shuai WANG (Supervisor)
                        Dr. Junxian HE
                        Prof. Fangzhen LIN
                        Dr. Yi YANG (ISOM)
                        Dr. Dongliang MU (HUST)