Towards Good Utilisation of Crowd (Stack Overflow) Wisdom
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Towards Good Utilisation of Crowd (Stack Overflow) Wisdom"

By

Mr. Fuxiang CHEN

Abstract

Stack Overflow (SO), established in 2008, is a community question answering forum tailored to developers, who use it widely and actively. Today, SO holds more than 40 million questions and answers, and this number is expected to grow over time. Despite the huge amount of invaluable information residing in SO, taking full advantage of it has been challenging, mainly because unstructured natural language text is interleaved with the code snippets embedded in each post. To utilise this crowd wisdom effectively, in this thesis we propose three novel works that leverage SO to help developers improve their productivity.

In the first work, we propose mining SO to help developers debug their code. Our approach finds defective code fragments in developers’ software projects by detecting code clones between the code snippets in SO questions and the project code, and then processes these clones to triangulate the source code anomalies inside the projects. Our approach reveals 189 warnings, of which 171 (90.5%) are confirmed by developers from eight high-quality and well-maintained projects. We also compared the confirmed bugs with the results of three popular static analysis tools (FindBugs, JLint and PMD). Of the 171 bugs identified by our approach, FindBugs detected only six, whereas JLint and PMD detected none.

In the second work, we propose highlighting problem-cause and solution summary sentences in answer posts to guide developers in reading the answers.
A recent survey revealed that the majority of non-native English speaking developers have trouble understanding English text and source code, as programming languages are all English-based, and that they prefer more visuals in QA sites such as SO to help them understand the content more easily. Separately, it has been reported that the irrelevance and redundancy of SO answers may inhibit developers’ ability to retrieve information from SO efficiently. We also observed that in many SO answers, a single sentence can represent the high-level description of the problem-cause or solution of the question asked. We thus propose highlighting both problem-cause and solution summary sentences in SO answer posts to guide developers in their reading. Our technique comprises ensemble models of extractive summarization techniques that detect salient sentences. Compared with other extractive summarization methods, including the state-of-the-art, our approach consistently outperforms them, with relative improvements of between 13.41% and 40.91% for problem-cause summarization and between 4.12% and 40.28% for solution summarization. A user study was also conducted with developers, and most of them reported that the extracted summaries are accurate and help them read the answers better.

In the third work, we propose generating SQL statements automatically from natural language. Using natural language to program has been a long-cherished dream. Existing works on generating SQL queries from natural language are conditioned on either some given table schema or relational databases. We analyzed real-world developers’ data management issues in SO and found that these scenarios make up only a tiny portion of the myriad problems developers face. In this work, we propose end-to-end, general-purpose natural language to SQL (NL2SQL) statement generation using an SO dataset.
Our method also incorporates a denoising module that can be applied to correct SQL syntax errors in the generated SQL queries regardless of the NL2SQL generation model used. Experiments show that the proposed NL2SQL yields more syntactically correct queries (up to 43% more using a Seq2Seq model) in most cases.

Date: Tuesday, 7 August 2018

Time: 10:30am - 12:30pm

Venue: Room 5560, Lifts 27/28

Chairman: Prof. Xinghua Zheng (ISOM)

Committee Members:
Prof. Sunghun Kim (Supervisor)
Prof. Andrew Horner
Prof. Frederick Lochovsky
Prof. Eric Nelson (HUMA)
Prof. Doo-Hwan Bae (KAIST)

**** ALL are Welcome ****