Big Data Analytics with Novel Top-k Query Processing and Classification
PhD Thesis Proposal Defence
Title: "Big Data Analytics with Novel Top-k Query Processing and Classification"
by Mr. Peng PENG
Abstract:
In the era of big data, with the dramatic growth in both the number of records and the number of attributes, making decisions is harder than ever. Traditionally, top-k query processing has addressed the problem of multi-criteria decision making, with a utility function serving as the expression of the users' requirements. When the utility function is unknown, however, a traditional top-k query cannot capture those requirements. Recently, researchers have proposed several novel top-k queries, such as k-representative skyline queries and k-regret queries, which are regarded as better solutions when the utility function is unknown.
Nevertheless, this is far from the end of the story. Because of their complexity, most of these novel top-k queries (which do not take a utility function as input) cannot be applied directly to large-scale data. Specifically, the algorithms for answering them cannot easily be adapted to run on a parallel and distributed platform. Another problem is that most existing top-k queries ignore the users' requirements and information when the utility function is unknown. In general, even if a user cannot provide an exact utility function, it is possible to obtain partial information from him/her, which can be used as query input to improve the quality of the answers.
In the following, we propose two directions for addressing the scalability and personalization issues. On the one hand, traditional techniques for top-k query processing can be extended to the large-scale data scenario. On the other hand, we can design a new type of top-k query that is built from the outset to be answered on a distributed computing platform and that incorporates users' information into its answers.
In my thesis proposal, I focus mainly on solutions along the two directions above. Lastly, I present my research results on an application of top-k query processing. In particular, I extend the idea of top-k query processing to sampling a training dataset of size k for classification, one of the most fundamental problems in machine learning and data mining, which can itself be studied in a big data environment. When constructing a training dataset for classification, a good sampling strategy is crucial to the quality of the training dataset. Therefore, a new type of top-k query can be applied here to return k representative data points from the dataset.
Date: Wednesday, 6 May 2015
Time: 2:00pm - 4:00pm
Venue: Room 3494 (lifts 25/26)
Committee Members: Dr. Raymond Wong (Supervisor)
Dr. Huamin Qu (Chairperson)
Dr. Lei Chen
Dr. Qiong Luo
**** ALL are Welcome ****