Entity Resolution for Hidden Web Data

PhD Thesis Proposal Defence


Title: "Entity Resolution for Hidden Web Data"

by

Miss Xiaoheng Xie


ABSTRACT:

Entity resolution (ER) identifies and merges records judged to represent the 
same real-world entity. With the growth of the Internet, ER for hidden Web
data has become increasingly important in real-world applications such as
online search engines and Web data integration. Hidden Web data often
originates from different data sources, which usually have different schemas.
As a consequence, no single method is most efficient for comparing and merging
records across schemas, and existing techniques that place all records under a
unified schema are often unsuitable.

In this thesis, we investigate ER methods for hidden Web data using a 
multi-schema approach. That is, we keep the data under the original schemas 
instead of placing them under a unified schema. First, we propose two
techniques for improving the performance of the validity-ensured and
order-sensitive (VEOS) multi-schema ER method. To improve recall, we apply an
expanding window to VEOS that targets duplicates within the same schema. To
reduce the number of record-pair comparisons, we partition the records of
large data sets into blocks, so that only records in blocks with the same key
value need to be compared. Second, we propose an efficient ER method for
online query data integration that sets appropriate weights for the schema
fields (attributes) through self-training.
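
As an illustration of the blocking idea only, the following minimal Python
sketch groups records by a key and generates comparison pairs only within a
block. The record layout, the field names and the blocking_key function (the
first three characters of the lowercased name) are assumptions made for this
example and are not taken from the thesis.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three characters of the lowercased name.
    return record["name"].strip().lower()[:3]


def candidate_pairs(records):
    """Yield pairs of record ids whose records share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    for ids in blocks.values():
        # Only records inside the same block are paired for comparison.
        yield from combinations(ids, 2)


# Made-up sample records, for illustration only.
records = [
    {"id": 1, "name": "Canon EOS 550D"},
    {"id": 2, "name": "canon eos 550d kit"},
    {"id": 3, "name": "Nikon D90"},
    {"id": 4, "name": "Nikon D-90 body"},
]

pairs = sorted(candidate_pairs(records))
```

With these four sample records, comparing every record against every other
would need six comparisons; pairing only within blocks leaves two.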

Through extensive experiments on real online data sets from different domains
and on reasonable synthetic data sets, we demonstrate the scalability of the
ER algorithms, the efficiency of the enhanced VEOS approaches and the
effectiveness of our proposed ER method for online querying.


Date:                   Thursday, 19 April 2012

Time:                   2:00pm - 4:00pm

Venue:                  Room 3408
                        Lifts 17/18

Committee Members:      Prof. Frederick Lochovsky (Supervisor)
                        Prof. Dik-Lun Lee (Chairperson)
                        Dr. Lei Chen
                        Dr. Raymond Wong


**** ALL are Welcome ****