Scalable Detection of Similar Code: Techniques and Applications

Speaker:	Lingxiao JIANG
		Department of Computer Science
		University of California, Davis

Title:		"Scalable Detection of Similar Code: Techniques and
		 Applications"

Date:		Friday, 13 March, 2009

Time:		2:00pm - 3:00pm

Venue:		Lecture Theater E (Cheung On Tak Lecture Theater)
		HKUST

Abstract:

Similar code, commonly known as code clones, often occur in large programs
because of various software development practices, such as copying and
pasting code and n-version programming. Studies show that code duplication
can incur higher maintenance costs and lead to subtle errors. Identifying
similar code thus has many important applications, such as program
understanding, refactoring, optimization, and bug detection. In this talk,
I will present three pieces of my work on techniques and applications of
scalable detection of both syntactic and semantic code clones.

First, I will present Deckard, a scalable and accurate tree-based code
clone detection technique. The key insight is to represent syntax trees of
a program as structure-preserving characteristic vectors in the Euclidean
space and employ efficient hashing algorithms to cluster these vectors.
Our experiments showed that Deckard scales to billions of lines of code
with few false positives. Also, Deckard is language-generic, applicable to
any language with a formally specified grammar.

Second, we will look at a novel application of Deckard to bug detection.
In particular, I will describe a general notion of context-based
inconsistencies as strong indicators of clone-related bugs and the
application of Deckard to identify such inconsistencies. Many previously
unknown bugs in large projects such as the Linux
kernel and Eclipse were discovered. These bugs exhibit diverse
characteristics and are difficult to detect with any single previous bug
detection technique.

Third, I will describe EqMiner, the first scalable technique to detect
functionally equivalent code for understanding code duplication at the
semantic level. Inspired by Schwartz's randomized polynomial identity
testing, EqMiner uses automated random testing to quickly determine the
functional equivalence of arbitrary code fragments automatically extracted
from a large program. Evaluated on the Linux kernel, EqMiner discovered
many functionally equivalent code fragments that are syntactically
different.

I will conclude this talk by discussing future opportunities and
challenges related to code clone detection.


********************
Biography:

Lingxiao Jiang is a Ph.D. candidate in Computer Science at University of
California, Davis, where he specializes in software engineering. His
current research focuses on techniques and tools for improving software
quality and developer productivity. He received his M.S. in Applied
Mathematics and B.S. in Information Science from Peking University.