Entity Resolution for Hidden Web Data
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Entity Resolution for Hidden Web Data"
By
Miss Xiaoheng Xie
Abstract
Entity resolution (ER) identifies and merges records judged to represent
the same real-world entity. With the development of the Internet, ER for
hidden Web data has become increasingly important in many real-world
applications such as online search engines and web data integration.
Hidden Web data often originates from different data sources that
usually have different schemas. As a consequence, there is no single most
efficient way to compare and merge records from different schemas.
Moreover, existing techniques that put all records together under a
unified schema are often not suitable.
In this thesis, we investigate ER methods for hidden Web data using a
multi-schema approach. That is, we keep the data under the original
schemas instead of placing them under a unified schema. Based on the
multi-schema structure, we propose a validity-ensured and order-sensitive
(VEOS) pair-wise ER method. In the remainder of the thesis, we
first propose two techniques for improving the performance of the VEOS
method. Since duplicates that exist in the same data source may adversely
affect recall performance, the first technique applies an expanding window
to VEOS to enhance the recall performance. To reduce the number of record
pair comparisons, our second technique separates the records in large data
sources into several blocks, so that only records in the blocks with the
same key values need to be compared. Then, we propose an efficient ER
method for online query data integration, which self-trains the schema
fields (attributes) to set appropriate weights, so that more
representative attributes are used in the ER process.
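As a rough illustration of the blocking idea described above, the following
Python sketch groups records by a blocking key and generates candidate pairs
only within each block. It is a generic key-based blocking example under
stated assumptions, not the thesis's actual algorithm: the record layout, the
choice of "year" as the blocking key, and the function names block_by_key and
candidate_pairs are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(records, key_fn):
    """Group records into blocks that share the same blocking-key value."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return blocks


def candidate_pairs(records, key_fn):
    """Yield only the record pairs that fall into the same block."""
    for block in block_by_key(records, key_fn).values():
        yield from combinations(block, 2)


# Hypothetical records; the fields and values are made up for illustration.
records = [
    {"id": 1, "title": "Data Integration", "year": 2010},
    {"id": 2, "title": "data integration", "year": 2010},
    {"id": 3, "title": "Entity Resolution", "year": 2011},
    {"id": 4, "title": "Entity resolution", "year": 2011},
    {"id": 5, "title": "Web Search", "year": 2012},
]

# Assumed blocking key: the "year" field.
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(records, lambda r: r["year"])]
all_pairs = len(list(combinations(records, 2)))
print(pairs, "candidate pairs instead of", all_pairs)
```

With five records, comparing every pair would need ten comparisons, while
blocking on the assumed key leaves only the pairs that share a key value. The
trade-off is that true duplicates with different key values are never compared,
so the choice of key matters.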
Through extensive experiments on real online data sets from different
domains and on synthetic data sets, we demonstrate the scalability of the
ER algorithms, the efficiency of the advanced VEOS approaches, and the
effectiveness of our proposed ER method for online querying.
Date: Thursday, 6 September 2012
Time: 2:00pm – 4:00pm
Venue: Room 3501
Lifts 25/26
Chairman: Prof. Weichuan Yu (ECE)
Committee Members: Prof. Frederick Lochovsky (Supervisor)
Prof. Dik-Lun Lee
Prof. Qiong Luo
Prof. Rong Zheng (ISOM)
Prof. Felix Naumann (Univ. of Potsdam)
**** ALL are Welcome ****