Towards Industrial-Scale Software Binary Analysis

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Towards Industrial-Scale Software Binary Analysis"

By

Mr. Anshunkang ZHOU


Abstract:

At the heart of modern computing lies a fundamental component: the software 
binary, a complex sequence of ones and zeroes that determines how specific 
tasks are executed on a computer. Ensuring the security and correctness of 
software binaries is of paramount importance, as vulnerabilities can have 
far-reaching consequences, potentially affecting financial systems and even 
human lives. However, analyzing software binaries at an industrial scale 
remains a significant challenge, primarily due to the difficulty of meeting 
three essential design requirements: rigor, non-intrusiveness, and 
scalability.

This dissertation proposes a binary-centric solution to enhance 
industrial-scale software binary analysis, addressing these three 
requirements by contributing to several fundamental binary analysis 
techniques. Together, they constitute a systematic software binary analysis 
framework that can be seamlessly integrated into modern software development 
lifecycles, enabling early detection and prevention of defects. Our 
approaches have had tangible real-world impact: they have been deployed by 
major companies and organizations for daily software quality checks and have 
helped identify hundreds of high-risk defects in both industrial software 
products and open-source projects.

The first part of our research focuses on the problem of obtaining analyzable 
code representations from software binaries in a non-intrusive way. The 
problem is challenging because information is lost during compilation, and 
existing binary lifters still cannot produce code precise enough for rigorous 
static analysis, even in the presence of debug information. To solve the 
problem, we present a new binary lifter, PLANKTON, which features two new 
algorithms that can fill the gaps between the low- and high-level code to 
produce high-quality LLVM intermediate representations (IRs) from binaries 
with debug information, enabling full-fledged static analysis with minor 
precision loss. PLANKTON yields static analysis results comparable to those 
of traditional compilation interference solutions, differing by only 17.2% 
while being much more practical, and outperforms existing lifters by 76.9% 
on average.

The second part of our research solves the scalability issue of existing 
lifters. We found that the root cause of the issue is the inherent 
"monolithic" design that performs all lifting stages on a single LLVM module, 
which entails a global environment that enforces sequential dependences 
between any two transformations on IRs, thus limiting parallelism. To 
solve the issue, we propose DIATOM, the first parallel binary lifter powered 
by a new "polylithic" design, which decomposes the monolithic LLVM module 
into partitions to perform fully parallelized binary lifting. At the same 
time, it leverages lightweight data-flow summaries and type-aware IR 
linking to avoid soundness loss caused by separating dependent code 
fragments. Large-scale experiments show that DIATOM achieves an average 
speedup of 7.45× and a maximum speedup of 16.8× over a traditional monolithic 
binary lifter, while still maintaining lifting soundness.

The third part of our research aims to solve the problem of effective binary 
similarity analysis, which is extremely useful since it provides rich 
information about an unknown binary, such as identifying reused libraries. 
The problem is challenging, as heavy compiler optimizations can cause 
semantically similar binaries to differ greatly in syntax. To tackle 
the challenge, we propose ARCTURUS, a new technique that can achieve high 
code coverage and high accuracy simultaneously by manipulating program 
execution under the guidance of code reachability, which we found to be nearly 
invariant across optimizations. Experimental results show that ARCTURUS 
achieves an average precision of 87.8% with 100% block coverage, 
outperforming the compared methods by 38.4% on average.

The final part of our research investigates the problem of efficiently 
discovering exploitable bugs through parallel fuzzing. Specifying efficient 
parallel fuzzing strategies for programs with different characteristics is 
challenging due to the difficulty of reasoning about fuzzing runtime 
statically. To tackle the challenge, we propose KRAKEN, a new 
program-adaptive parallel fuzzer that improves fuzzing efficiency through 
dynamic strategy optimization. Experimental results show that KRAKEN 
achieves 54.7% more code coverage and finds 70.2% more bugs within the given 
time than existing parallel fuzzers.


Date:                   Thursday, 9 October 2025

Time:                   10:00am - 12:00noon

Venue:                  Room 5501
                        Lifts 25/26

Chairman:               Dr. Darwin CHOI (FINA)

Committee Members:      Prof. Charles ZHANG (Supervisor)
                        Dr. Dongdong SHE
                        Dr. Shuai WANG
                        Prof. Jun ZHANG (ECE)
                        Prof. Xiangyu ZHANG (Purdue University)