Enhance Binary Analysis Tooling

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Enhance Binary Analysis Tooling"

By

Mr. Wai Kin WONG


Abstract:

Binary analysis is fundamental to modern cybersecurity, empowering critical 
applications such as vulnerability discovery, malware detection, and patch 
analysis. However, binary code is often presented in the form of low-level 
assembly instructions, which are inherently difficult to interpret and 
analyze. To assist analysts, automated tools such as Binary Code Similarity 
Analysis (BCSA) tools and decompilers are indispensable, yet they suffer from 
significant limitations. State-of-the-art BCSA tools, while powerful, 
frequently exhibit high false-positive rates due to architectural limitations 
in their underlying deep neural network models. Similarly, leading 
decompilers prioritize human readability over programmatic utility, producing 
pseudocode that is often syntactically incorrect and non-recompilable, 
thereby hindering automated downstream analysis. This thesis introduces novel 
methodologies to address these distinct yet related challenges, enhancing the 
reliability of BCSA and the utility of decompilers.

Our first work presents an attack that perturbs software in executable format 
to deceive DNN-based binary code matching. Unlike prior attacks, which mostly 
change non-functional code components to generate adversarial programs, our 
approach applies several semantics-preserving transformations directly to 
the control flow graph of binary code, making it particularly effective at 
deceiving DNNs. To speed up the process, we design a framework that 
leverages gradient-based or hill-climbing-based optimization to generate 
adversarial examples in both white-box and black-box settings. We evaluate 
our attack against two popular DNN-based binary code matching tools, Asm2Vec 
and NCC, and achieve reasonably high success rates. Our attack on an 
industrial-strength DNN-based binary code matching service, BinaryAI, shows 
that it can fool remote APIs in challenging black-box settings with an 
average success rate of over 16.2%. Furthermore, we show that the generated 
adversarial programs can be used to improve the robustness of the two 
white-box models, Asm2Vec and NCC, reducing the attack success rates by 
17.3% and 6.8% while preserving stable, if not better, standard accuracy.
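
As a rough illustration of the black-box, hill-climbing side of such an 
attack, the sketch below greedily applies whichever transformation lowers a 
similarity score the most. Everything in it is a toy assumption made for 
illustration: a "program" is just a tuple of basic-block sizes, embed() and 
make_similarity() stand in for a DNN model, and the two transformations are 
hypothetical. It is not the framework described in the thesis.

```python
from typing import Callable, Sequence, Tuple

# Toy stand-in for a binary: instruction count per basic block (assumption).
Program = Tuple[int, ...]

def embed(p: Program) -> Tuple[int, int]:
    """Hypothetical 'model': summarizes a program as (block count, largest block)."""
    return (len(p), max(p))

def make_similarity(reference: Program) -> Callable[[Program], float]:
    """Black-box score in (0, 1]; 1.0 means the candidate matches the reference."""
    ref = embed(reference)
    def similarity(p: Program) -> float:
        e = embed(p)
        return 1.0 / (1.0 + abs(e[0] - ref[0]) + abs(e[1] - ref[1]))
    return similarity

def split_largest_block(p: Program) -> Program:
    """Split the largest block in two (in a real binary: insert a jump)."""
    i = max(range(len(p)), key=lambda k: p[k])
    if p[i] < 2:
        return p
    half = p[i] // 2
    return p[:i] + (half, p[i] - half) + p[i + 1:]

def add_unreachable_block(p: Program) -> Program:
    """Append a one-instruction block that is never executed."""
    return p + (1,)

def hill_climb(
    program: Program,
    transforms: Sequence[Callable[[Program], Program]],
    similarity: Callable[[Program], float],
    threshold: float,
    max_steps: int = 50,
) -> Tuple[Program, float, int]:
    """Greedily apply the transformation that lowers the score the most."""
    current, score, steps = program, similarity(program), 0
    while score > threshold and steps < max_steps:
        best = min((t(current) for t in transforms), key=similarity)
        best_score = similarity(best)
        if best_score >= score:  # local optimum: no transformation helps
            break
        current, score, steps = best, best_score, steps + 1
    return current, score, steps

original: Program = (8, 4)
similarity = make_similarity(original)
adversarial, score, steps = hill_climb(
    original, [split_largest_block, add_unreachable_block], similarity, threshold=0.1
)
```

The loop only needs query access to the score, which is what makes this 
style of search usable against a remote API; a white-box attacker could 
instead use model gradients to pick the next transformation.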

Our second work addresses the high false-positive rate of DNN-based BCSA 
techniques. We introduce BinAug, a model-agnostic, post-processing framework 
that mitigates this issue without requiring expensive model retraining. 
Observing that DNN models often generate low-quality embeddings or overfit 
to specific patterns, BinAug re-ranks similarity scores based on features 
derived from the binary functions under comparison. In black-box and 
white-box evaluations, BinAug consistently improves the performance of 
state-of-the-art BCSA tools by an average of 2.38% and 6.46%, respectively. 
Furthermore, it enhances the F1 score for the crucial downstream task of 
binary software component analysis by an average of 5.43% and 7.45% in the 
same settings.
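
The re-ranking idea can be sketched as follows. This is a minimal 
illustration under stated assumptions, not BinAug itself: the feature names 
("blocks", "calls"), the ratio-based feature_similarity(), and the mixing 
weight alpha are all hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]  # hypothetical non-negative per-function features

def feature_similarity(a: Features, b: Features) -> float:
    """Mean of per-feature min/max ratios, in [0, 1]; 1.0 means identical."""
    keys = sorted(set(a) & set(b))
    if not keys:
        return 0.0
    ratios = []
    for k in keys:
        hi = max(a[k], b[k])
        ratios.append(1.0 if hi == 0 else min(a[k], b[k]) / hi)
    return sum(ratios) / len(ratios)

def rerank(
    query: Features,
    candidates: List[Tuple[str, float, Features]],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Blend each model score with feature similarity, then sort descending.

    candidates holds (name, model_score, features); alpha is an assumed weight.
    """
    rescored = [
        (name, (1 - alpha) * model_score + alpha * feature_similarity(query, feats))
        for name, model_score, feats in candidates
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

query = {"blocks": 10, "calls": 4}
candidates = [
    # The model scores this tiny function highest, but its features disagree.
    ("false_positive", 0.95, {"blocks": 2, "calls": 0}),
    ("true_match", 0.90, {"blocks": 9, "calls": 4}),
]
ranked = rerank(query, candidates)
```

Because the adjustment happens after the model has produced its scores, a 
scheme of this kind needs no retraining and can sit behind any model that 
returns a ranked candidate list.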

Our third work enables the programmatic use of decompiler outputs through 
Recompilable Decompilation. We present DecLLM, an iterative repair framework 
that leverages off-the-shelf Large Language Models (LLMs) to automatically 
correct decompiler outputs into compilable C code. Unlike existing approaches 
that focus on readability, DecLLM employs a novel feedback loop that 
integrates both static compilation errors and dynamic runtime behavior as 
oracles to guide the LLM's repair process. Evaluated on C benchmarks and 
real-world binaries, DecLLM successfully renders approximately 70% of 
originally non-recompilable decompiler outputs into valid, compilable code. 
Furthermore, we demonstrate that this recompilable code maintains semantic 
consistency for CodeQL-based vulnerability analysis when compared to 
ground-truth source code. For the remaining 30% of challenging cases, we 
conduct an in-depth analysis to inform future improvements in 
decompilation-oriented LLM techniques.
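
The feedback loop can be pictured with the sketch below. It is a toy under 
stated assumptions rather than DecLLM's implementation: compile_check, 
run_tests, and ask_llm are hypothetical callables, and the three toy_* 
functions are string-matching stand-ins for a real compiler, a real test 
harness, and a real LLM.

```python
from typing import Callable, Optional, Tuple

Oracle = Callable[[str], Optional[str]]  # returns an error message, or None if OK

def repair_loop(
    code: str,
    compile_check: Oracle,
    run_tests: Oracle,
    ask_llm: Callable[[str, str], str],
    max_rounds: int = 5,
) -> Tuple[Optional[str], int]:
    """Repair code until both oracles pass; returns (code or None, rounds used)."""
    for rounds_used in range(max_rounds + 1):
        feedback = compile_check(code)      # static oracle first
        if feedback is None:
            feedback = run_tests(code)      # dynamic oracle only if it compiles
        if feedback is None:
            return code, rounds_used
        if rounds_used < max_rounds:
            code = ask_llm(code, feedback)  # feed the error back for a new attempt
    return None, max_rounds

def toy_compile_check(code: str) -> Optional[str]:
    """Stand-in compiler: rejects a decompiler-specific type name."""
    if "__int64" in code:
        return "error: unknown type name '__int64'"
    return None

def toy_run_tests(code: str) -> Optional[str]:
    """Stand-in test harness: flags one known-wrong expression."""
    if "a - b" in code:
        return "test failed: add(2, 3) returned -1, expected 5"
    return None

def toy_llm(code: str, feedback: str) -> str:
    """Stand-in for an LLM call: applies a canned fix per error message."""
    if "__int64" in feedback:
        return code.replace("__int64", "long long")
    if "expected 5" in feedback:
        return code.replace("a - b", "a + b")
    return code

decompiled = "__int64 add(__int64 a, __int64 b) { return a - b; }"
fixed, rounds = repair_loop(decompiled, toy_compile_check, toy_run_tests, toy_llm)
```

Checking compilation before runtime behavior keeps each round's feedback 
focused on a single failure, and the round limit bounds the cost of cases 
the model cannot repair.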


Date:                   Wednesday, 17 September 2025

Time:                   4:00pm - 6:00pm

Venue:                  Room 5501
                        Lifts 25/26

Chairman:               Prof. Yongshun CAI (SOSC)

Committee Members:      Dr. Shuai WANG (Supervisor)
                        Prof. Shing-Chi CHEUNG
                        Dr. Binhang YUAN
                        Dr. Chao TANG (ACCT)
                        Dr. Lwin Khin SHAR (SMU)