BIG DATA ANALYTICS WITH NOVEL TOP-K QUERY PROCESSING AND CLASSIFICATION

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "BIG DATA ANALYTICS WITH NOVEL TOP-K QUERY PROCESSING AND 
CLASSIFICATION"

By

Mr. Peng PENG


Abstract

In the era of big data, with the dramatic explosion in both the number of 
records and the number of attributes, making decisions becomes harder and 
harder than before. Traditionally, top-k query processing was applied for 
dealing with the problem of multi-criteria decision making. A user issues 
a top-k query by involving his/her utility function as the input and 
expects to be recommended k items for the selection. However, in the case 
that the utility function is unknown, top-k queries cannot be utilized to 
capture the users’ requirements. Recently, researchers proposed several 
novel top-k queries such as k-representative skyline queries and k-regret 
queries, which are regarded as the solutions in the case that the utility 
function is unknown.

Nevertheless, it is still far away from the end of the story. Due to the 
complexity issue, most of these novel top-k queries (without utility 
functions as inputs) cannot be directly applied to the large-scale data 
scenario. Specifically, the algorithms for answering these novel top-k 
queries cannot be easily modified to run in a parallel and distributed 
platform. Another problem is that most existing top-k queries are either 
related to an exact utility function or independent of the users’ utility 
function. In a more general case, we may have the user’s information 
partially. That is, even a user may not be able to provide an exact 
utility function, it is possible to obtain his/her partial information 
which can be used as the input of the queries so as to improve the quality 
of the query answers. In all, we conclude these two problems as the 
scalability problem and the general personalization problem.

Motivated by these observations, I propose two directions to address the 
scalability and the personalization issue. On one hand, it is possible to 
extend those traditional techniques for top-k query processing in the 
large-scale data scenario. On the other hand, we could design a new type 
of top-k queries such that each newly proposed top-k query can be 
originally answered through a distributed computing platform and 
incorporates users’ information into the answers. In my thesis, I give an 
emphasis on the solutions towards the above two directions.

Lastly, I include my research results on the interactive top-k query 
processing for solving a subproblem in classification. In particular, I 
apply the idea of top-k query processing to sample a training dataset of 
size k in classification, one of the most fundamental problems in machine 
learning and data mining. When constructing a training dataset for 
classification, a good sampling strategy is extremely crucial for the 
quality of the training dataset. As a result, I introduce an importance 
weighted sampling strategy to obtain the set for labeling as well as the 
analysis on the sample complexity of the proposed algorithm.


Date:			Wednesday, 29 July 2015

Time:			2:00pm - 4:00pm

Venue:			Room 2132C
 			Lift 19

Chairman:		Prof. Johnny Sin (ECE)

Committee Members:	Prof. Raymond Wong (Supervisor)
 			Prof. Lei Chen
 			Prof. Qiong Luo
 			Prof. Yu-Hsing Wang (CIVL)
 			Prof. Xuemin Lin (Univ. of NSW, Australia)


**** ALL are Welcome ****