Enhance Binary Analysis Tooling
PhD Thesis Proposal Defence
Title: "Enhance Binary Analysis Tooling"
by
Mr. Wai Kin WONG
Abstract:
Binary analysis is fundamental to modern cybersecurity, empowering critical
applications such as vulnerability discovery, malware detection, and patch
analysis. However, binaries are typically presented as low-level assembly
instructions, which are inherently difficult to interpret and analyze. To
assist analysts, automated tools such as Binary Code Similarity Analysis
(BCSA) tools and decompilers are indispensable, yet they suffer from
significant limitations. State-of-the-art BCSA tools, while powerful,
frequently exhibit high false-positive rates due to architectural
limitations in their underlying deep neural network models. Similarly,
leading decompilers prioritize human readability over programmatic utility,
producing pseudocode that is often syntactically incorrect and
non-recompilable, thereby hindering automated downstream analysis. This
thesis introduces novel methodologies to address these distinct yet related
challenges, enhancing the reliability of BCSA and the utility of
decompilers.
Our first work addresses the high false-positive rate of DNN-based BCSA
techniques. We introduce BinAug, a model-agnostic, post-processing framework
that mitigates this issue without requiring expensive model retraining.
Observing that DNN models often generate low-quality embeddings or overfit
specific patterns, BinAug re-ranks similarity scores based on features
derived from the binary functions under comparison. In black-box and
white-box evaluations, BinAug consistently improves the performance of
state-of-the-art BCSA tools by an average of 2.38% and 6.46%, respectively.
Furthermore, it enhances the F1 score for the crucial downstream task of
binary software component analysis by an average of 5.43% and 7.45% in the
same settings.
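The post-processing idea described above can be pictured with a small sketch: blend the DNN similarity score with a cheap similarity computed over features of the compared functions, then re-rank the candidates. The feature set, the linear blending, and the weight `alpha` below are our own illustrative assumptions, not BinAug's actual design.

```python
# Hypothetical sketch of model-agnostic score re-ranking; the features
# (instruction count, basic-block count, call count) and the weighting
# scheme are illustrative assumptions, not BinAug's implementation.

def rerank(query_feats, candidates, alpha=0.7):
    """Re-rank (name, feats, dnn_score) candidates by blending the DNN
    score with a lightweight structural-feature similarity."""
    def feat_sim(a, b):
        sims = []
        for k in a:
            hi = max(a[k], b[k], 1)  # guard against division by zero
            sims.append(1 - abs(a[k] - b[k]) / hi)
        return sum(sims) / len(sims)

    rescored = [
        (name, alpha * dnn_score + (1 - alpha) * feat_sim(query_feats, feats))
        for name, feats, dnn_score in candidates
    ]
    return sorted(rescored, key=lambda x: x[1], reverse=True)


query = {"insns": 120, "blocks": 14, "calls": 5}
cands = [
    # A structurally dissimilar function that the DNN overconfidently scores high:
    ("f_overfit", {"insns": 30, "blocks": 3, "calls": 0}, 0.95),
    # The true match, with a slightly lower raw DNN score:
    ("f_true", {"insns": 118, "blocks": 13, "calls": 5}, 0.90),
]
print(rerank(query, cands)[0][0])  # f_true
```

Because the adjustment only reads scores and function features, it slots in after any embedding model, which is the sense in which such a step is model-agnostic and needs no retraining.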
Our second work enables the programmatic use of decompiler outputs through
Recompilable Decompilation. We present DecLLM, an iterative repair framework
that leverages off-the-shelf Large Language Models (LLMs) to automatically
correct decompiler outputs into compilable C code. Unlike existing
approaches that focus on readability, DecLLM employs a novel feedback loop
that integrates both static compilation errors and dynamic runtime behavior
as oracles to guide the LLM's repair process. Evaluated on C benchmarks and
real-world binaries, DecLLM successfully renders approximately 70% of
originally non-recompilable decompiler outputs into valid, compilable code.
Furthermore, we demonstrate that this recompilable code maintains semantic
consistency for CodeQL-based vulnerability analysis when compared to
ground-truth source code. For the remaining 30% of challenging cases, we
conduct an in-depth analysis to inform future improvements in
decompilation-oriented LLM techniques.
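The feedback loop described above can be sketched generically: an oracle (static compilation or dynamic behavior checking) judges the current candidate, and its feedback is handed back to the LLM for another repair round. The interfaces and the toy oracle/LLM stand-ins below are our own assumptions for illustration, not DecLLM's actual design.

```python
# Hypothetical sketch of an oracle-guided iterative repair loop; the
# function signatures and toy stand-ins are illustrative assumptions.

def repair(code, oracle, ask_llm, max_rounds=5):
    """Iteratively feed oracle feedback (e.g. compiler errors or runtime
    mismatches) back to an LLM until the candidate passes the oracle."""
    for _ in range(max_rounds):
        ok, feedback = oracle(code)
        if ok:
            return code
        code = ask_llm(code, feedback)
    return None  # still failing after max_rounds; flag for manual analysis

# Toy demonstration: a static "oracle" that rejects a decompiler artifact
# token, and a fake "LLM" that repairs it.
def toy_oracle(code):
    if "__undefined" in code:
        return False, "error: '__undefined' undeclared"
    return True, ""

def toy_llm(code, feedback):
    return code.replace("__undefined", "0")

fixed = repair("int x = __undefined;", toy_oracle, toy_llm)
print(fixed)  # int x = 0;
```

In a real pipeline the oracle would invoke an actual compiler and execute test inputs, and `ask_llm` would call an off-the-shelf LLM with the errors embedded in the prompt; the loop structure itself is the part the abstract describes.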
Date: Wednesday, 2 July 2025
Time: 11:00am - 1:00pm
Venue: Room 5501
Lifts 25/26
Committee Members: Dr. Shuai Wang (Supervisor)
Dr. Dongdong She (Chairperson)
Dr. Lionel Parreaux