Advanced Binary Similarity Analysis and Its Downstream Applications

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Advanced Binary Similarity Analysis and Its Downstream Applications"

By

Mr. Huaijin WANG


Abstract:

Given multiple pieces of binary code, binary similarity analysis (BSA)
techniques detect their similarities and differences. BSA enables security
experts to analyze suspicious software (e.g., malware) without source code:
the similarities and differences between an input binary and already-analyzed
binary code help an analyst understand the input's functionality. Nowadays,
BSA techniques empower many real-world applications, including patch analysis,
malware detection, vulnerability searching, and software composition analysis.

Many BSA tools were developed during the last two decades, all focusing on
overcoming the challenges of analyzing binary code introduced by the code
compilation process. Due to diverse architectures, various compilers,
optimizations, and sophisticated obfuscations, pieces of binary code compiled
from the same source code can differ completely; for example, their
instruction sequences and control-flow graphs can change significantly.
Therefore, recent BSA techniques aim to extract the semantics of binary code,
which remain unchanged across compilation configurations. However, existing
approaches for extracting semantic features are either program-analysis-based,
which scale poorly, or deep-neural-network-based (DNN-based), which suffer
from high false-positive rates. This thesis presents novel designs that
combine program-analysis-based and DNN-based methods for BSA and improve
precision, recall, and F1 score on various tasks, including binary code clone
detection, vulnerability detection, and software composition analysis.

Our first work, BinUSE, addresses the high false-positive rate of DNN-based
BSA techniques with a low-cost equivalence-checking technique. It uses
under-constrained symbolic execution (USE) to explore paths from a function's
entry until an external function call. By comparing each path's invoked
external functions and symbolic constraints, we can prune irrelevant functions
and raise the top-1 accuracy by 11%.
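
The pruning idea can be illustrated with a minimal sketch. It is not BinUSE's
implementation: the Path record, the string form of the constraints, and the
names shares_path and prune are hypothetical, and plain equality stands in for
real constraint-equivalence checking. A candidate proposed by the DNN ranking
is kept only if at least one of its paths matches a path of the query function.

```python
from typing import NamedTuple

class Path(NamedTuple):
    """One explored path (hypothetical representation, for illustration)."""
    external_calls: tuple[str, ...]   # external functions invoked along the path
    constraints: frozenset[str]       # normalized symbolic constraints as strings

def shares_path(a: list[Path], b: list[Path]) -> bool:
    """True if some path of `a` is identical to some path of `b`."""
    return bool(set(a) & set(b))

def prune(query: list[Path], ranked: list[tuple[str, list[Path]]]) -> list[str]:
    """Keep the incoming ranking order, dropping candidates with no matching path."""
    return [name for name, paths in ranked if shares_path(query, paths)]

# Toy data: the top-ranked candidate calls a different external function.
query = [Path(("malloc",), frozenset({"arg0 > 0"}))]
ranked = [
    ("f_wrong", [Path(("free",), frozenset())]),
    ("f_right", [Path(("malloc",), frozenset({"arg0 > 0"}))]),
]
survivors = prune(query, ranked)
```

In this toy input the candidate ranked first is discarded, so the correct
function moves to the top of the list.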

Our second work aims to improve the robustness of DNN-based BSA techniques.
Unlike existing DNN-based BSA techniques, which rely on structural
information such as control-flow graphs, we design sem2vec to produce
function-level embeddings by learning from semantic features. It first
efficiently extracts semantic features from binary code via USE, then uses a
DNN model to learn the semantics directly. sem2vec is superior in learning
semantics, achieving 52.2% top-1 accuracy when analyzing heavily obfuscated
and optimized binaries. In comparison, the top-1 accuracy of a commercial BSA
solution is 13%.
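
As a reading aid, top-1 accuracy over function-level embeddings can be
computed as sketched below. The sketch assumes cosine similarity as the
ranking score and uses made-up two-dimensional vectors; the abstract does not
state which similarity measure or embedding size sem2vec uses, and the names
cosine and top1_accuracy are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def top1_accuracy(queries, pool):
    """queries and pool map function names to embeddings.

    A query counts as a hit when the most similar pool entry has the same name,
    i.e. it is the same source function under another compilation setting.
    """
    hits = 0
    for name, q in queries.items():
        best = max(pool, key=lambda cand: cosine(q, pool[cand]))
        hits += (best == name)
    return hits / len(queries)

# Toy embeddings (assumed values): "memcpy" is deliberately mis-ranked.
queries = {"memcpy": [1.0, 0.1], "strlen": [0.2, 1.0], "parse": [0.9, 0.8]}
pool = {"memcpy": [0.9, 0.0], "strlen": [0.0, 1.0], "parse": [1.0, 0.1]}
acc = top1_accuracy(queries, pool)
```

Here two of the three queries retrieve the correct function first, so the
top-1 accuracy of the toy example is 2/3.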

Our third work explores applying BSA techniques to software composition
analysis (SCA). SCA aims to detect open-source software (OSS) usage in a given
program, which is a critical step for software security and license compliance.
We propose an SCA pipeline empowered by BSA techniques for stripped binaries.
Moreover, we design three enhancements that raise accuracy by utilizing
high-precision features, global knowledge (i.e., call graph and binary
layout), and dynamic weights. The enhanced accuracy surpasses that of a de
facto commercial binary-based SCA tool.

Our fourth work aims to build the first privacy-preserving SCA framework, which
is empowered by source and binary code similarity analysis. We investigate the
privacy leakage of five forms of SCA and establish a privacy-preserving SCA
framework, SafeSCA, based on a multi-party cryptographic protocol. We design
three filters to reduce the overhead of encrypted computation to 12.5% of the
original cost. Our evaluation shows that SafeSCA achieves the best accuracy
among all evaluated SCA tools, including the de facto commercial SCA tool
(i.e., BinaryAI), with the best privacy guarantee.


Date:                   Tuesday, 12 December 2023

Time:                   4:00pm - 6:00pm

Venue:                  Room 3494
                        Lifts 25/26

Chairman:               Prof. Iam Keong SOU (PHYS)

Committee Members:      Prof. Shuai WANG (Supervisor)
                        Prof. Lionel PARREAUX
                        Prof. Charles ZHANG
                        Prof. Zili MENG (ECE)
                        Prof. Kehuan ZHANG (CUHK)


**** ALL are Welcome ****