Entity Resolution for Hidden Web Data
PhD Thesis Proposal Defence
Title: "Entity Resolution for Hidden Web Data"
by
Miss Xiaoheng Xie
ABSTRACT:
Entity resolution (ER) identifies and merges records judged to represent the
same real-world entity. With the development of the Internet, ER for hidden Web
data has become increasingly important in many real-world applications, such as
online search engines and web data integration. Hidden Web data often
originates from different data sources that usually have different schemas. As
a consequence, there is no single most efficient way to compare and merge
records from different schemas, and existing techniques that place all records
under a unified schema are often unsuitable.
In this thesis, we investigate ER methods for hidden Web data using a
multi-schema approach. That is, we keep the data under the original schemas
instead of placing them under a unified schema. First, we propose two
techniques for improving the performance of the validity-ensured and
order-sensitive (VEOS) multi-schema ER method. To improve recall, we apply an
expanding window to VEOS that focuses on the duplicates existing within the
same schema. To reduce the number of record pair comparisons, we separate the
records in large data sets into several blocks, so that only records in blocks
with the same key value need to be compared. We then propose an efficient ER
method for online query data integration by self-training the schema fields
(attributes) so as to set appropriate weights.
Through extensive experiments on real online data sets from different domains
and on some reasonable synthetic data sets, we demonstrate the scalability of
the ER algorithms, the efficiency of the advanced VEOS approaches, and the
effectiveness of our proposed ER method for online querying.
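
The blocking step mentioned in the abstract can be illustrated with a minimal
sketch. This is a generic key-based blocking example, not the thesis
implementation: the record layout, the `blocking_key` function (first three
letters of a name field) and the sample records are all assumptions made for
illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the lower-cased name.
    # A real system would choose a key suited to its data.
    return record["name"].strip().lower()[:3]


def candidate_pairs(records, key_fn=blocking_key):
    # Group record indices by blocking key.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key_fn(rec)].append(idx)
    # Only records that share a block are paired for comparison.
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


# Illustrative sample data (not from the thesis).
records = [
    {"name": "Canon EOS 5D"},
    {"name": "canon eos 5d mark ii"},
    {"name": "Nikon D700"},
    {"name": "Nikon D-700"},
    {"name": "Sony A900"},
]

pairs = candidate_pairs(records)
all_pairs = list(combinations(range(len(records)), 2))
print(len(pairs), "candidate pairs instead of", len(all_pairs))
```

With these five sample records, only the pairs inside the two non-trivial
blocks are compared, rather than every pair of records.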
Date: Thursday, 19 April 2012
Time: 2:00pm - 4:00pm
Venue: Room 3408 (lifts 17/18)
Committee Members: Prof. Frederick Lochovsky (Supervisor)
                   Prof. Dik-Lun Lee (Chairperson)
                   Dr. Lei Chen
                   Dr. Raymond Wong
**** ALL are Welcome ****