More about HKUST
Scalable Detection of Similar Code: Techniques and Applications
Speaker: Lingxiao JIANG Department of Computer Science University of California, Davis Title: "Scalable Detection of Similar Code: Techniques and Applications" Date: Friday, 13 March, 2009 Time: 2:00pm - 3:00pm Venue: Lecture Theater E (Cheung On Tak Lecture Theater) HKUST Abstract: Similar code, commonly known as code clones, often occur in large programs because of various software development practices, such as copying and pasting code and n-version programming. Studies show that code duplication can incur higher maintenance costs and lead to subtle errors. Identifying similar code thus has many important applications, such as program understanding, refactoring, optimization, and bug detection. In this talk, I will present three pieces of my work on techniques and applications of scalable detection of both syntactic and semantic code clones. First, I will present Deckard, a scalable and accurate tree-based code clone detection technique. The key insight is to represent syntax trees of a program as structure-preserving characteristic vectors in the Euclidean space and employ efficient hashing algorithms to cluster these vectors. Our experiments showed that Deckard scales to billions of lines of code with few false positives. Also, Deckard is language-generic, applicable to any language with a formally specified grammar. Second, we will look at a novel application of Deckard to bug detection. In particular, I will describe a general notion of context-based inconsistencies as strong indicators of clone-related bugs and the application of Deckard to identify such inconsistencies. Many previously unknown bugs in large projects such as the Linux kernel and Eclipse were discovered. These bugs exhibit diverse characteristics and are difficult to detect with any single previous bug detection technique. Third, I will describe EqMiner, the first scalable technique to detect functionally equivalent code for understanding code duplication at the semantic level. Inspired by Schwartz's randomized polynomial identity testing, EqMiner uses automated random testing to quickly determine the functional equivalence of arbitrary code fragments automatically extracted from a large program. Evaluated on the Linux kernel, EqMiner discovered many functionally equivalent code fragments that are syntactically different. I will conclude this talk by discussing future opportunities and challenges related to code clone detection. ******************** Biography: Lingxiao Jiang is a Ph.D. candidate in Computer Science at University of California, Davis, where he specializes in software engineering. His current research focuses on techniques and tools for improving software quality and developer productivity. He received his M.S. in Applied Mathematics and B.S. in Information Science from Peking University.