Big Data Analytics with Novel Top-k Query Processing and Classification
PhD Thesis Proposal Defence
Title: "Big Data Analytics with Novel Top-k Query Processing and
Classification"
by
Mr. Peng PENG
Abstract:
In the era of big data, the dramatic growth in both the number of
records and the number of attributes makes decision making increasingly
difficult. Traditionally, top-k query processing has been used to address
the problem of multi-criteria decision making. However, a traditional
top-k query treats the utility function as the expression of the users'
requirements, so it cannot capture those requirements when the utility
function is unknown. Recently, researchers have proposed several novel
top-k queries, such as k-representative skyline queries and k-regret
queries, which are regarded as better solutions when the utility function
is unknown.
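For readers unfamiliar with these notions, the following is a minimal
Python sketch, not taken from the thesis. It assumes linear utility
functions, attributes where larger values are better, and a toy dataset;
the function names are hypothetical. It shows a classical top-k query with
a known utility function, the skyline, and the regret ratio that k-regret
queries try to keep small. A real k-regret query considers all utility
functions in a class, whereas this sketch only checks a supplied sample of
weight vectors.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def utility(p: Point, weights: Sequence[float]) -> float:
    """Linear utility of a point (an assumption of this sketch)."""
    return sum(w * x for w, x in zip(weights, p))


def top_k(points: Sequence[Point], weights: Sequence[float], k: int) -> List[Point]:
    """Classical top-k: the k points with the highest utility."""
    return sorted(points, key=lambda p: utility(p, weights), reverse=True)[:k]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b if a is at least as good everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Points not dominated by any other point (quadratic-time scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def max_regret_ratio(subset: Sequence[Point], points: Sequence[Point],
                     weight_sample: Sequence[Sequence[float]]) -> float:
    """Worst relative utility loss over the sampled weight vectors when the
    user sees only `subset` instead of the whole dataset."""
    worst = 0.0
    for w in weight_sample:
        best_all = max(utility(p, w) for p in points)
        best_sub = max(utility(p, w) for p in subset)
        worst = max(worst, (best_all - best_sub) / best_all)
    return worst


# Toy data: two attributes, both normalised to [0, 1].
data = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]
print(top_k(data, (0.5, 0.5), 1))
print(skyline(data))
print(max_regret_ratio([(0.6, 0.6)], data, [(1.0, 0.0), (0.0, 1.0)]))
```

The top-k query needs the weights as input, while the skyline and the
regret ratio do not depend on any single utility function, which is why
they are useful when the utility function is unknown.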
Nevertheless, this is far from the end of the story. Due to their
computational complexity, most of these novel top-k queries (which do not
take utility functions as inputs) cannot be directly applied to
large-scale data. Specifically, the algorithms for answering these queries
cannot be easily modified to run on a parallel and distributed platform.
Another problem is that most existing top-k queries ignore the users'
requirements/information when the utility function is unknown. In general,
even if a user cannot provide an exact utility function, it is possible to
obtain partial information from him/her, which can be used as input to the
queries to improve the quality of the query answers. In the following, we
propose two directions for addressing the scalability and personalization
issues.
On the one hand, it is possible to extend the traditional techniques for
top-k query processing to the large-scale data scenario. On the other
hand, we could design new types of top-k queries that can be answered
natively on a distributed computing platform and that incorporate users'
information into the answers. In my thesis proposal, I focus mainly on
solutions along these two directions.
Lastly, I include my research results on an application of top-k
query processing. In particular, I extend the idea of top-k query
processing to sampling a training dataset of size k for classification,
one of the most fundamental problems in machine learning and data mining.
Classification can also be studied in a big data environment. When
constructing a training dataset for classification, a good sampling
strategy is crucial to the quality of the training dataset. Therefore, a
new type of top-k query can be applied here to return k representative
data points from the dataset.
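As a rough illustration of picking k representative points as a training
sample, the sketch below uses greedy farthest-point selection. This is a
well-known generic heuristic used here as a stand-in; it is not the query
proposed in the thesis, and the function name and toy data are
hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def k_representatives(points: Sequence[Point], k: int) -> List[Point]:
    """Greedy farthest-point selection: start from the first point, then
    repeatedly add the point farthest from everything chosen so far."""
    if k <= 0 or not points:
        return []
    chosen = [points[0]]
    # Distance from every point to its nearest chosen representative.
    nearest = [math.dist(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        idx = max(range(len(points)), key=nearest.__getitem__)
        if nearest[idx] == 0.0:
            break  # every remaining point duplicates a chosen one
        chosen.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return chosen


# Toy data: two tight pairs and one isolated point.
sample = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(k_representatives(sample, 3))
```

With k = 3 the selection takes one point from each region of the toy data
rather than two near-duplicates, which is the kind of coverage a
representative training sample is meant to provide.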
Date: Wednesday, 6 May 2015
Time: 2:00pm - 4:00pm
Venue: Room 3494 (lifts 25/26)
Committee Members: Dr. Raymond Wong (Supervisor)
Dr. Huamin Qu (Chairperson)
Dr. Lei Chen
Dr. Qiong Luo
**** ALL are Welcome ****