Advanced Binary Similarity Analysis and Its Downstream Applications
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Advanced Binary Similarity Analysis and Its Downstream Applications"

By

Mr. Huaijin WANG

Abstract:

Given multiple pieces of binary code, binary similarity analysis (BSA) techniques detect their similarities and differences. BSA enables security experts to analyze suspicious software (e.g., malware) without source code. The similarities and differences between input binary code and already-analyzed binary code help an analyst understand the functionality of the input. Nowadays, BSA techniques empower many real-world applications, including patch analysis, malware detection, vulnerability searching, and software composition analysis.

Many BSA tools have been developed during the last two decades, all focusing on overcoming the challenges that the compilation process introduces into binary code analysis. Due to diverse architectures, various compilers, optimizations, and sophisticated obfuscations, pieces of binary code compiled from the same source code can be completely different, e.g., the instruction sequence and control-flow graph can change significantly. Therefore, recent BSA techniques aim to extract the semantics of binary code, which remain unchanged even under distinct compilation configurations. However, existing approaches for extracting semantic features are either program-analysis-based approaches, which are unscalable, or deep-neural-network-based (DNN-based) methods, which suffer from high false positive rates. This thesis presents novel designs that combine program-analysis-based and DNN-based methods for BSA and improve the precision, recall, and F1 score for various tasks, including binary code clone detection, vulnerability detection, and software composition analysis.

Our first work addresses the high false positive rate problem of DNN-based BSA techniques with a low-cost equivalence checking technique, namely BinUSE. It utilizes under-constrained symbolic execution (USE) to explore paths from a function's entry until an external function call. By comparing each path's invoked external functions and symbolic constraints, we can prune irrelevant functions and raise the top-1 accuracy by 11%.

Our second work aims at improving the robustness of DNN-based BSA techniques. Unlike existing DNN-based BSA techniques, which rely on structural information such as control-flow graphs, we design sem2vec to produce function-level embeddings by learning from semantic features. It first efficiently extracts semantic features from binary code via USE, then uses a DNN model to learn the semantics directly. sem2vec is superior in learning semantics, achieving 52.2% top-1 accuracy when analyzing heavily obfuscated and optimized binaries. In comparison, the top-1 accuracy of a commercial BSA solution is 13%.

Our third work explores applying BSA techniques in software composition analysis (SCA). SCA aims to detect open-source software (OSS) usage in a given program, which is a critical step for software security and license compliance. We propose an SCA pipeline empowered by BSA techniques for stripped binaries. Moreover, we design three enhancements that raise the accuracy by utilizing high-precision features, global knowledge (i.e., call graph and binary layout), and dynamic weights. The enhanced accuracy surpasses that of a de facto commercial binary-based SCA tool.

Our fourth work aims to build the first privacy-preserving SCA framework, which is empowered by source and binary code similarity analysis. We investigate the privacy leakage of five SCA forms and establish a privacy-preserving SCA framework, SafeSCA, based on a multi-party cryptographic protocol.
We design three filters to reduce the overhead of encrypted computation to 12.5% of the original cost. Our evaluation shows that SafeSCA achieves the best accuracy among all evaluated SCA tools, including the de facto commercial SCA tool (i.e., BinaryAI), with the best privacy guarantee.

Date: Tuesday, 12 December 2023

Time: 4:00pm - 6:00pm

Venue: Room 3494, Lifts 25/26

Chairman: Prof. Iam Keong SOU (PHYS)

Committee Members:
Prof. Shuai WANG (Supervisor)
Prof. Lionel PARREAUX
Prof. Charles ZHANG
Prof. Zili MENG (ECE)
Prof. Kehuan ZHANG (CUHK)

**** ALL are Welcome ****