More about HKUST
BIG DATA ANALYTICS WITH NOVEL TOP-K QUERY PROCESSING AND CLASSIFICATION
The Hong Kong University of Science and Technology Department of Computer Science and Engineering PhD Thesis Defence Title: "BIG DATA ANALYTICS WITH NOVEL TOP-K QUERY PROCESSING AND CLASSIFICATION" By Mr. Peng PENG Abstract In the era of big data, with the dramatic explosion in both the number of records and the number of attributes, making decisions becomes harder and harder than before. Traditionally, top-k query processing was applied for dealing with the problem of multi-criteria decision making. A user issues a top-k query by involving his/her utility function as the input and expects to be recommended k items for the selection. However, in the case that the utility function is unknown, top-k queries cannot be utilized to capture the users’ requirements. Recently, researchers proposed several novel top-k queries such as k-representative skyline queries and k-regret queries, which are regarded as the solutions in the case that the utility function is unknown. Nevertheless, it is still far away from the end of the story. Due to the complexity issue, most of these novel top-k queries (without utility functions as inputs) cannot be directly applied to the large-scale data scenario. Specifically, the algorithms for answering these novel top-k queries cannot be easily modified to run in a parallel and distributed platform. Another problem is that most existing top-k queries are either related to an exact utility function or independent of the users’ utility function. In a more general case, we may have the user’s information partially. That is, even a user may not be able to provide an exact utility function, it is possible to obtain his/her partial information which can be used as the input of the queries so as to improve the quality of the query answers. In all, we conclude these two problems as the scalability problem and the general personalization problem. Motivated by these observations, I propose two directions to address the scalability and the personalization issue. On one hand, it is possible to extend those traditional techniques for top-k query processing in the large-scale data scenario. On the other hand, we could design a new type of top-k queries such that each newly proposed top-k query can be originally answered through a distributed computing platform and incorporates users’ information into the answers. In my thesis, I give an emphasis on the solutions towards the above two directions. Lastly, I include my research results on the interactive top-k query processing for solving a subproblem in classification. In particular, I apply the idea of top-k query processing to sample a training dataset of size k in classification, one of the most fundamental problems in machine learning and data mining. When constructing a training dataset for classification, a good sampling strategy is extremely crucial for the quality of the training dataset. As a result, I introduce an importance weighted sampling strategy to obtain the set for labeling as well as the analysis on the sample complexity of the proposed algorithm. Date: Wednesday, 29 July 2015 Time: 2:00pm - 4:00pm Venue: Room 2132C Lift 19 Chairman: Prof. Johnny Sin (ECE) Committee Members: Prof. Raymond Wong (Supervisor) Prof. Lei Chen Prof. Qiong Luo Prof. Yu-Hsing Wang (CIVL) Prof. Xuemin Lin (Univ. of NSW, Australia) **** ALL are Welcome ****