A Survey on Software Embedding Techniques and Their Downstream Applications
PhD Qualifying Examination

Title: "A Survey on Software Embedding Techniques and Their Downstream Applications"

by

Mr. Huaijin WANG

Abstract:

An embedding technique maps an object to a numeric vector. Unlike a hash, the vector is meant to capture the semantic meaning of the object, and a well-designed embedding can substantially benefit downstream applications by accelerating training or improving performance. In natural language processing (NLP), embedding techniques such as word2vec [Mikolov et al., 2013a, Mikolov et al., 2013b], BPEmb [Heinzerling and Strube, 2017], and HBMP [Talman et al., 2019] have developed rapidly. It is therefore natural to consider applying embedding techniques to software analysis. The intuition is that source code is the language of programmers and binary code is the language of machines.

Many works producing software embeddings have been published in recent years. They can be categorized into binary code embedding and source code embedding techniques. Binary code embedding techniques use information provided by binary analysis tools (e.g., disassemblers) to capture the semantics of machine code. Existing works embed code at various granularities (byte, instruction, basic block, and function) for tasks such as binary code similarity and malware classification. Source code contains richer information (e.g., names and types) than binary code. Hence, researchers expect source code embedding techniques to assist software development tasks such as method name recommendation and automatic program repair.

Improving the quality of software embeddings requires selecting meaningful information and carefully designing the encoding method, and various downstream applications are used to evaluate embedding quality. This survey first introduces frequently used analysis techniques and their applications. We then summarize the open problems and introduce existing solutions. Finally, we provide advice on potential future directions. We believe this survey sheds light on our future work on software embedding.

Date: Thursday, 27 January 2022

Time: 1:00pm - 3:00pm

Zoom Meeting: https://hkust.zoom.us/j/9236191239

Committee Members:
Dr. Shuai Wang (Supervisor)
Dr. Dimitris Papadopoulos (Chairperson)
Prof. Shing-Chi Cheung
Dr. Charles Zhang

**** ALL are Welcome ****