The Data Linkage Project

Speaker:	Dr. Dongwon LEE
		College of Information Sciences and Technology (IST)
		Pennsylvania State University

Title:		"The Data Linkage Project"

Date:		Thursday, 21 December 2006

Time:		2:00pm - 3:00pm

Venue:		Room 3315 (via lift nos. 17/18)
		HKUST

Abstract:

I revisit the extensively studied traditional (record) linkage problem of
identifying matching entities in a collection, and argue that novel
solutions are needed to cope with new challenges.  In particular, in the
talk, I report some preliminary results in four directions:

(1) Googled linkage: When one cannot determine well whether two entities
match due to a lack of evidence, we propose to use the Web as
the ultimate knowledge source,

(2) Parallel linkage: By turning existing data linkage solutions into
parallel programs, one can achieve substantial speed-up. However, due to
the intricate interplay of match vs. merge operations in data linkage, the
parallelization requires a careful design,

(3) Group linkage: When entities to match are no longer simple records but
have internal properties or structures (such as groups), exploiting them
can bring a significant improvement to the data linkage, and

(4) Adaptive linkage: Often, in existing data linkage solutions, various
parameters need to be set once (by users) and do not change during the
execution. However, by adaptively changing the values to maximize
objective functions, one can substantially reduce false negatives.

This is joint work with:

Ergin Elmacioglu (Penn State), Min-Yen Kan (NUS), Jaewoo Kang (Korea U.),
Hung-sik Kim (Penn State), Nick Koudas (U. Toronto), Byung-Won On (Penn
State), Jian Pei (Simon Fraser U.), Divesh Srivastava (AT&T Labs --
Research), Yee Fan Tan (NUS), Su Yan (Penn State)


*************
Biography:

Dongwon LEE has been an assistant professor at the Pennsylvania State
University, College of IST, USA, since 2002. He obtained a BS from Korea
University in 1993, an MS from Columbia University in 1995, and a PhD from
UCLA in 2002, all in Computer Science.  Between his MS and PhD, from 1995
to 1997, he worked at AT&T Bell Labs as a programmer. His research
interests include Databases and Data Mining, Digital Library and
Bibliometrics, and XML and Semantic Web Services. He has (co-)authored
more than 60 scholarly articles in conferences or journals, and received
the Best Paper Award at the ER conference in 2000, the IBM Eclipse
Innovation Award in 2004 and 2006, and the Microsoft SciData Award in
2005, among others.