The Data Linkage Project
Speaker: Dr. Dongwon LEE
College of Information Sciences and Technology (IST)
Pennsylvania State University

Title: "The Data Linkage Project"
Date: Thursday, 21 December 2006
Time: 2:00pm - 3:00pm
Venue: Room 3315 (via lift nos. 17/18), HKUST

Abstract:

I revisit the extensively studied traditional (record) linkage problem of identifying matching entities in a collection, and argue that novel solutions are needed to cope with new challenges. In particular, in this talk I report some preliminary results in four directions:

(1) Googled linkage: When one cannot determine whether two entities match because of a lack of evidence, we propose to use the Web as the ultimate knowledge source.
(2) Parallel linkage: By turning existing data linkage solutions into parallel programs, one can achieve substantial speed-up. However, due to the intricate interplay of match vs. merge operations in data linkage, the parallelization requires a careful design.
(3) Group linkage: When the entities to match are no longer simple records but have internal properties or structures (such as groups), exploiting them can bring a significant improvement to data linkage.
(4) Adaptive linkage: Often, in existing data linkage solutions, various parameters are set once (by users) and do not change during execution. However, by adaptively changing the values to maximize objective functions, one can substantially reduce false negatives.

This is joint work with: Ergin Elmacioglu (Penn State), Min-Yen Kan (NUS), Jaewoo Kang (Korea U.), Hung-sik Kim (Penn State), Nick Koudas (U. Toronto), Byung-Won On (Penn State), Jian Pei (Simon Fraser U.), Divesh Srivastava (AT&T Labs -- Research), Yee Fan Tan (NUS), Su Yan (Penn State)

*************

Biography:

Dongwon LEE has been an assistant professor at the Pennsylvania State University, College of IST, USA, since 2002.
He obtained a BS from Korea University in 1993, an MS from Columbia University in 1995, and a PhD from UCLA in 2002, all in Computer Science. Between his MS and PhD, from 1995 to 1997, he worked at AT&T Bell Labs as a programmer. His research interests include Databases and Data Mining, Digital Library and Bibliometrics, and XML and Semantic Web Services. He has (co-)authored more than 60 scholarly articles in conferences and journals, and received the Best Paper Award at the ER conference in 2000, the IBM Eclipse Innovation Award in 2004 and 2006, and the Microsoft SciData Award in 2005, among others.