Automatic Techniques for Code Example Generation
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Automatic Techniques for Code Example Generation"

By

Mr. Xiaodong Gu

Abstract

Developers often wonder how to implement a particular piece of program functionality, and code examples are very helpful in this regard. Over the years, many approaches have been proposed to generate code examples. Existing approaches often treat queries and source code as textual documents and use information retrieval (IR) models to retrieve code snippets that match a given query.

However, conventional code example generation approaches face the following major challenges. First, they rely on a bag-of-words assumption and cannot recognize high-level features of queries and source code. Second, source code and natural language queries are heterogeneous. Existing approaches rely mainly on the textual similarity between source code and the natural language query, and they lack a mapping of high-level semantics between the two. Moreover, the generated code examples may be redundant and project-specific, so there is a need to generate succinct, high-coverage code examples.

To address these challenges, in this thesis we propose three machine learning based approaches to the generation of code examples. Instead of matching keywords, our approaches learn the deep semantics of queries and code snippets.

We first propose a technique, DeepAPI, which generates API usage sequences via deep learning. DeepAPI adapts a neural language model named RNN Encoder-Decoder. Given a corpus of annotated API sequences, i.e., pairs of an API sequence and its natural language annotation, DeepAPI trains the language model to encode each sequence of words (the annotation) into a fixed-length context vector and to decode an API sequence from that context vector. Then, in response to an API-related user query, it generates API sequences by consulting the neural language model.
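The bag-of-words limitation named in the first challenge can be made concrete with a small sketch. It is an illustration only, not part of the thesis: the two queries, the function names, and the use of cosine similarity over raw word counts are all assumptions chosen for the example. It shows that a word-count model scores two queries with opposite meanings as identical, whereas the word sequences themselves (the input a sequence model such as an RNN encoder consumes) still differ.

```python
from collections import Counter
import math


def bag_of_words(text: str) -> Counter:
    """Count word occurrences, discarding word order (hypothetical helper)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing words
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two illustrative queries with the same words but opposite intent.
query_a = "convert int to string"
query_b = "convert string to int"

# Under the bag-of-words assumption the two queries are indistinguishable.
same_bag = bag_of_words(query_a) == bag_of_words(query_b)
similarity = cosine(bag_of_words(query_a), bag_of_words(query_b))

# As ordered token sequences, the queries are different.
same_sequence = query_a.split() == query_b.split()

print(same_bag, similarity, same_sequence)
```

Any retrieval model built purely on such counts would rank the same snippets for both queries, which is the kind of gap that order-aware, learned representations are meant to close.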
Furthermore, we propose a technique, DeepCodeHow, to generate code examples by searching an existing code corpus. To bridge the lexical gap between queries and source code, DeepCodeHow jointly embeds code snippets and natural language descriptions into a high-dimensional vector space. With this unified vector representation, code snippets semantically related to a natural language query can be retrieved according to their vectors.

Finally, to generate succinct and high-coverage examples, we design a code example selection technique named CodeKernel. CodeKernel leverages a machine learning technique named Graph Kernel. It represents code snippets as object usage graphs and embeds the graphs into a high-level vector space. With the graph embedding, CodeKernel clusters similar graphs and selects a typical graph from each cluster as the code example.

We empirically evaluate our techniques on a large-scale code corpus collected from GitHub. The experimental results show that our proposed techniques effectively generate relevant code examples and outperform conventional IR-based approaches.

Date: Friday, 30 June 2017

Time: 3:00pm - 5:00pm

Venue: Room 2612A
       Lifts 31/32

Chairman: Prof. Huihe Qiu (MAE)

Committee Members: Prof. Sunghun Kim (Supervisor)
                   Prof. Frederick Lochovsky
                   Prof. Xiaojuan Ma
                   Prof. Yiwen Wang (ECE)
                   Prof. Alice Oh (Comp. Sci., KAIST)

**** ALL are Welcome ****