A Survey on Software Embedding Techniques and Their Downstream Applications

PhD Qualifying Examination


Title: "A Survey on Software Embedding Techniques and Their Downstream 
Applications"

by

Mr. Huaijin WANG


Abstract:

An embedding technique maps an object to a numeric vector. Unlike a hash, the
vector is meant to capture the semantic meaning of the object, and a
well-designed embedding can substantially benefit downstream applications by
accelerating training or improving performance. In natural language processing
(NLP), embedding techniques have developed rapidly, with examples including
word2vec [Mikolov et al., 2013a, Mikolov et al., 2013b], BPEmb [Heinzerling and
Strube, 2017] and HBMP [Talman et al., 2019]. It is therefore natural to
consider applying embedding techniques to software analysis. The intuition is
that source code is the language of programmers, and binary code is the
language of machines.

Many works on software embedding have been published in recent years. They can
be categorized into binary code embedding and source code embedding techniques.
Binary code embedding techniques use information provided by binary analysis
tools (e.g., disassemblers) to capture the semantics of machine code. Existing
works embed code at various granularities (i.e., byte, instruction, basic
block, and function) for tasks such as binary code similarity detection and
malware classification. Source code contains richer information (e.g., names
and types) than binary code. Hence, researchers expect source code embedding
techniques to assist software development, in tasks such as method name
recommendation and automatic program repair.

Improving the quality of software embeddings requires selecting meaningful
information and carefully designing encoding methods, and various applications
are used to evaluate embedding quality. This survey first introduces
frequently used analysis techniques and their applications. We then summarize
the open problems and introduce existing solutions. Finally, we suggest
potential future directions. We believe this survey sheds light on our future
work on software embedding.


Date:			Thursday, 27 January 2022

Time:			1:00pm - 3:00pm

Zoom Meeting:		https://hkust.zoom.us/j/9236191239

Committee Members:	Dr. Shuai Wang (Supervisor)
 			Dr. Dimitris Papadopoulos (Chairperson)
 			Prof. Shing-Chi Cheung
 			Dr. Charles Zhang


**** ALL are Welcome ****