Entity Resolution for Hidden Web Data
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Entity Resolution for Hidden Web Data"

By Miss Xiaoheng Xie

Abstract

Entity resolution (ER) identifies and merges records judged to represent the same real-world entity. With the development of the Internet, ER for hidden Web data has become increasingly important in many real-world applications, such as online search engines and web data integration. Hidden Web data often originates from different data sources that usually have different schemas. As a consequence, there is no single most efficient way to compare and merge records from different schemas. Moreover, existing techniques that put all records together under a unified schema are often not suitable.

In this thesis, we investigate ER methods for hidden Web data using a multi-schema approach. That is, we keep the data under the original schemas instead of placing them under a unified schema. Based on this multi-schema structure, we propose a pairwise ER method called validity-ensured and order-sensitive (VEOS).

In the rest of the thesis, we first propose two techniques for improving the performance of the VEOS method. Since duplicates that exist in the same data source may adversely affect recall, the first technique applies an expanding window to VEOS to improve recall. To reduce the number of record pair comparisons, the second technique separates the records in large data sources into several blocks, so that only records in blocks with the same key values need to be compared. We then propose an efficient ER method for online query data integration, which self-trains the schema fields (attributes) to set appropriate weights, so that more representative attributes are used in the ER process.
Through extensive experiments on real online data sets from different domains and on reasonable synthetic data sets, we demonstrate the scalability of the ER algorithms, the efficiency of the advanced VEOS approaches, and the effectiveness of our proposed ER method for online querying.

Date: Thursday, 6 September 2012
Time: 2:00pm – 4:00pm
Venue: Room 3501
       Lifts 25/26

Chairman: Prof. Weichuan Yu (ECE)

Committee Members:
Prof. Frederick Lochovsky (Supervisor)
Prof. Dik-Lun Lee
Prof. Qiong Luo
Prof. Rong Zheng (ISOM)
Prof. Felix Naumann (Univ. of Potsdam)

**** ALL are Welcome ****
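
Illustrative note: the blocking idea mentioned in the abstract (compare only records whose blocks share the same key value) can be sketched generically as follows. This is not the thesis's VEOS implementation. The record layout, the `blocking_key` function, and its choice of key (the first three letters of a title) are hypothetical and chosen only to make the sketch runnable.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the lower-cased title.
    return record["title"].strip().lower()[:3]


def candidate_pairs(records, key_fn=blocking_key):
    # Group records into blocks by key value.
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    # Only records inside the same block are paired for comparison.
    for block in blocks.values():
        yield from combinations(block, 2)


# Made-up example records, for illustration only.
records = [
    {"id": 1, "title": "Database Systems"},
    {"id": 2, "title": "database systems, 2nd ed."},
    {"id": 3, "title": "Operating Systems"},
    {"id": 4, "title": "Data Mining"},
]

pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(records)]
print(pairs)
```

With four records, an exhaustive comparison would examine six pairs. Under this assumed key, only the three pairs inside the shared block are examined. The trade-off is that true duplicates whose keys differ are never compared, so the choice of key affects recall.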