More about HKUST
Advanced Binary Similarity Analysis and Its Downstream Applications
PhD Thesis Proposal Defence Title: "Advanced Binary Similarity Analysis and Its Downstream Applications" by Mr. Huaijin WANG Abstract: Given multiple pieces of binary code, binary similarity analysis (BSA) techniques detect their similarities and differences. When the source code is unavailable, understanding the functionality of a piece of binary code is difficult since it is designed for machines with poor readability. By utilizing already-analyzed binary code, the similarities and differences between pieces of binary code can help an analyzer understand their functionalities. Nowadays, BSA techniques empower many real-world applications, including patch analysis, malware detection, vulnerability searching, and software composition analysis. Many BSA tools were developed during the last two decades, all focusing on overcoming the challenges of analyzing binary code introduced by the code compilation process. Due to diverse architectures, various compilers, optimizations, and sophisticated obfuscations, the pieces of binary code compiled from the same source code can be completely different, e.g., the instruction sequence and control-flow graph can be changed significantly. Therefore, recent BSA techniques aim to extract the semantics from binary code, which remains unchanged even with distinct compilation configurations. However, existing approaches for extracting semantic features are either program-analysis-based and unscalable or deep-neural-network-based (DNN-based) with a high false positive rate. This thesis demonstrates our novel designs of combining program-analysis-based and DNN-based methods for BSA and presents the performance of our works in various scenarios. Our first work addressed the high false positive rate problem of DNN-based BSA techniques with acceptable overhead. We noticed that DNN-based BSA techniques frequently score a piece of relatively-dissimilar binary code with a similarity higher than the binary code compiled from the same source code, resulting in many false positive cases. When the BSA tool assigns the highest similarity score between a vulnerable function and its patched version, an analyzer likely treats the binary as patched and causes fatal problems. Therefore, we propose our low-cost equivalence checking tool, BinUSE. It utilizes under-constrained symbolic execution to explore paths from a function’s entry until an external function call. By comparing each path’s invoked external functions and symbolic constraints, we can trim irrelative functions and raise the top-1 accuracy by 11%. Our second work aims at designing a usable BSA tool for the most challenging setting, analyzing obfuscated binary code. Most existing BSA techniques did not evaluate their accuracy on the obfuscated binary code, whereas some obfuscation-resilient BSA techniques depend on symbolic execution and constraint solving, which are inefficient. We further evaluated several de facto DNN-based BSA tools on obfuscated binary code. Compared with their evaluation results on normally compiled binaries, their top-1 accuracy reduced from 74% to 13%. We realized that existing DNN-based BSA techniques still heavily rely on structural information like instruction sequences and control-flow graphs. Hence, we design sem2vec. It first extracts symbolic expressions from binary code via under-constrained symbolic execution, then feeds those obfuscation-resilient features into the DNN model. Our evaluation proves sem2vec’s obfuscation-resilient capability with 52.2% top-1 while analyzing heavily-obfuscated software. Date: Monday, 13 March 2023 Time: 3:00pm - 5:00pm Venue: Room 4472 lifts 25/26 Committee Members: Dr. Shuai Wang (Supervisor) Prof. Shing-Chi Cheung (Chairperson) Dr. Lionel Parreaux Dr. Jiasi Shen **** ALL are Welcome ****