Towards Industrial-Scale Software Binary Analysis
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Towards Industrial-Scale Software Binary Analysis"
By
Mr. Anshunkang ZHOU
Abstract:
At the heart of modern computing lies a fundamental component: the software
binary, a complex sequence of ones and zeroes that determines how specific
tasks are executed on a computer. Ensuring the security and correctness of
software binaries is of paramount importance, as vulnerabilities can have
far-reaching consequences, potentially affecting financial systems and even
human lives. However, analyzing software binaries at an industrial scale
remains a significant challenge, primarily due to the difficulty of meeting
three essential design requirements: rigor, non-intrusiveness, and
scalability.
This dissertation proposes a binary-centric solution to enhance
industrial-scale software binary analysis, addressing these three
requirements by contributing to several fundamental binary analysis
techniques. Together, they constitute a systematic software binary analysis
framework that can be seamlessly integrated into modern software development
lifecycles, enabling early detection and prevention of defects. Our
approaches have had tangible real-world impact: they have been deployed by
major companies and organizations for daily software quality checks and have
helped identify hundreds of high-risk defects in both industrial software
products and open-source projects.
The first part of our research focuses on the problem of obtaining analyzable
code representations from software binaries in a non-intrusive way. The
problem is challenging due to information loss during compilation, and
existing binary lifters still cannot produce sufficiently precise code for rigorous
static analysis, even in the presence of debug information. To solve the
problem, we present a new binary lifter, PLANKTON, which features two new
algorithms that can fill the gaps between the low- and high-level code to
produce high-quality LLVM intermediate representations (IRs) from binaries
with debug information, enabling full-fledged static analysis with minor
precision loss. PLANKTON yields static analysis results comparable to those
of traditional compilation-interference solutions, differing by only 17.2%
while being much more practical, and it outperforms existing lifters by 76.9%
on average.
The second part of our research solves the scalability issue of existing
lifters. We found that the root cause of the issue is the inherent
"monolithic" design that performs all lifting stages on a single LLVM module,
which entails a global environment that enforces sequential dependences
between any two transformations on IRs, thus limiting parallelism. To
solve the issue, we propose DIATOM, the first parallel binary lifter powered
by a new "polylithic" design, which decomposes the monolithic LLVM module
into partitions to perform fully parallelized binary lifting. At the same
time, it leverages lightweight data-flow summaries and type-aware IR
linking to avoid soundness loss caused by separating dependent code
fragments. Large-scale experiments show that DIATOM achieves an average
speedup of 7.45× and a maximum speedup of 16.8× over a traditional monolithic
binary lifter, while still maintaining lifting soundness.
The third part of our research aims to solve the problem of effective binary
similarity analysis, which is extremely useful since it provides rich
information about an unknown binary, such as identifying reused libraries.
The problem is challenging because heavy compiler optimizations can make
semantically similar binaries differ enormously in syntax. To tackle
the challenge, we propose ARCTURUS, a new technique that can achieve high
code coverage and high accuracy simultaneously by manipulating program
execution under the guidance of code reachability, which we found is nearly
invariant across optimizations. Experimental results show that ARCTURUS
achieves an average precision of 87.8% with 100% block coverage,
outperforming the compared methods by 38.4% on average.
The final part of our research investigates the problem of efficiently
discovering exploitable bugs through parallel fuzzing. Specifying efficient
parallel fuzzing strategies for programs with different characteristics is
challenging due to the difficulty of reasoning about fuzzing runtime
statically. To tackle the challenge, we propose KRAKEN, a new
program-adaptive parallel fuzzer that improves fuzzing efficiency through
dynamic strategy optimization. Experimental results show that KRAKEN can
achieve 54.7% more code coverage and find 70.2% more bugs in the given time
compared with existing parallel fuzzers.
Date: Thursday, 9 October 2025
Time: 10:00 am - 12:00 noon
Venue: Room 5501
Lifts 25/26
Chairman: Dr. Darwin CHOI (FINA)
Committee Members: Prof. Charles ZHANG (Supervisor)
Dr. Dongdong SHE
Dr. Shuai WANG
Prof. Jun ZHANG (ECE)
Prof. Xiangyu ZHANG (Purdue University)