Towards Industrial-Scale Software Binary Analysis
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Towards Industrial-Scale Software Binary Analysis"

By

Mr. Anshunkang ZHOU

Abstract:

At the heart of modern computing lies a fundamental component: the software binary, a complex sequence of ones and zeroes that determines how specific tasks are executed on a computer. Ensuring the security and correctness of software binaries is of paramount importance, as vulnerabilities can have far-reaching consequences, potentially affecting financial systems and even human lives. However, analyzing software binaries at an industrial scale remains a significant challenge, primarily because it is difficult to meet three essential design requirements: rigor, non-intrusiveness, and scalability.

This dissertation proposes a binary-centric solution to enhance industrial-scale software binary analysis, addressing these three requirements by contributing to several fundamental binary analysis techniques. Together, these techniques constitute a systematic software binary analysis framework that can be seamlessly integrated into modern software development lifecycles, enabling early detection and prevention of defects. Our approaches have demonstrated tangible, real-world impact: they have been deployed by major companies and organizations for daily software quality checks and have helped identify hundreds of high-risk defects in both industrial software products and open-source projects.

The first part of our research focuses on the problem of obtaining analyzable code representations from software binaries in a non-intrusive way. The problem is challenging because information is lost during compilation, and existing binary lifters still cannot produce sufficiently precise code for rigorous static analysis, even in the presence of debug information.
To solve this problem, we present a new binary lifter, PLANKTON, which features two new algorithms that fill the gaps between low- and high-level code to produce high-quality LLVM intermediate representations (IRs) from binaries with debug information, enabling full-fledged static analysis with minor precision loss. PLANKTON produces static analysis results comparable to those of traditional compilation-interference solutions, with only 17.2% differences, while being much more practical, and it outperforms existing lifters by 76.9% on average.

The second part of our research addresses the scalability issue of existing lifters. We found that the root cause is their inherently "monolithic" design, which performs all lifting stages on a single LLVM module. This entails a global environment that enforces sequential dependences between any two transformations on IRs, thus limiting parallelism. To solve the issue, we propose DIATOM, the first parallel binary lifter, powered by a new "polylithic" design that decomposes the monolithic LLVM module into partitions so that binary lifting can be fully parallelized. At the same time, it leverages lightweight data-flow summaries and type-aware IR linking to avoid the soundness loss caused by separating dependent code fragments. Large-scale experiments show that DIATOM achieves an average speedup of 7.45× and a maximum speedup of 16.8× over a traditional monolithic binary lifter while maintaining lifting soundness.

The third part of our research addresses the problem of effective binary similarity analysis, which is extremely useful because it provides rich information about an unknown binary, such as which libraries it reuses. The problem is challenging, as heavy compiler optimizations can cause semantically similar binaries to differ greatly in syntax.
To tackle this challenge, we propose ARCTURUS, a new technique that achieves high code coverage and high accuracy simultaneously by manipulating program execution under the guidance of code reachability, which we found to be nearly invariant across optimizations. Experimental results show that ARCTURUS achieves an average precision of 87.8% with 100% block coverage, outperforming the compared methods by 38.4% on average.

The final part of our research investigates the problem of efficiently generating exploitable bugs through parallel fuzzing. Specifying efficient parallel fuzzing strategies for programs with different characteristics is challenging because it is difficult to reason about fuzzing runtime statically. To tackle this challenge, we propose KRAKEN, a new program-adaptive parallel fuzzer that improves fuzzing efficiency through dynamic strategy optimization. Experimental results show that KRAKEN achieves 54.7% more code coverage and finds 70.2% more bugs in the given time than existing parallel fuzzers.

Date: Thursday, 9 October 2025
Time: 10:00am - 12:00 noon
Venue: Room 5501, Lifts 25/26

Chairman: Dr. Darwin CHOI (FINA)

Committee Members:
Prof. Charles ZHANG (Supervisor)
Dr. Dongdong SHE
Dr. Shuai WANG
Prof. Jun ZHANG (ECE)
Prof. Xiangyu ZHANG (Purdue University)