More about HKUST
TWO POPULAR QUERIES IN MASSIVE MULTIDIMENSIONAL DATASETS
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "TWO POPULAR QUERIES IN MASSIVE MULTIDIMENSIONAL DATASETS"
By
Miss Wuman Luo
Abstract
The arrival of cyber-physical system era is changing data analysis in many
ways. Driven by the advances in Internet and sensor techniques, the amount of
multidimensional contents, such as images, trajectories, video clips, etc., has
grown to an unprecedented level. Supporting multidimensional objects in large
scale requires significant extensions from traditional databases. One critical
issue is indexing and query processing. In this thesis, we discuss two
important queries in massive multidimensional datasets: frequent path finding
and high-dimensional similarity join. In the context of big data, these two
queries are challenging due two reasons: 1) they both contain expensive
comparison operations, and 2) the complex structures of their interested data
complicate the index design. To address these issues, we conduct theoretical
analysis, advanced algorithm design, as well as extensive experiments on
large-scale datasets.
First, we address the problem of frequent path finding by proposing a new
problem named the Most Frequent Path with Temporal Constraints (MFPTC).
Specifically, given a set of time periods T, a source s, and a destination d,
MFPTC query aims at computing the Most Frequent Path (MFP) from s to d during
T. This query not only reveals the common routing preferences of the past
travelers, but also considers the time effectiveness of MFP. To reflect the
common sense notion of frequent paths, we further propose a novel MFP
definition that satisfies three key properties: suffix-optimal,
length-insensitive, and bottleneck-free. Assuming that all trajectories have
been matched to a road network, we solve the MFPTC problem in two major steps.
First, we develop efficient footmark construction algorithms to calculate the
path frequencies by constructing a weighted sub-graph of the road network
called footmark graph. Specifically, to deal with very large datasets, we
devise two index structures to reduce the number of trajectories need to be
fetched from the disk. Second, we propose a dynamic programming algorithm to
search the MFP in the footmark graph using a "more frequent" relation defined
over path frequencies.
Second, we study how to perform parallel high-dimensional similarity joins
(HDSJs) in MapReduce efficiently in the MapReduce paradigm. Particularly, we
propose a cost model to demonstrate that it is important to take both
communication cost and computation cost into account as dimensionality and data
volume increases. To this end, we propose DAA (Dimension Aggregation
Approximation), an efficient compression approach that can help significantly
reduce both these costs when performing parallel HDSJs. Moreover, we design
DAA-based parallel HDSJ algorithms which can scale up to massive data sizes and
very high dimensionality.
Finally, we conduct comprehensive experiments using real large-scale datasets.
The number of trajectories for path finding is up to 10 million and the average
trajectory length (in terms of the number of vertices) is 21. The number of
image vectors for join is up to 1 million and the dimensionality of the vectors
is up to ten thousand. The results show that our proposed algorithms have
desirable performance in the context of large-scale datasets.
Date: Thursday, 29 November 2012
Time: 10:15am - 12:15pm
Venue: Room 2404
Lifts 17/18
Chairman: Prof. Ping Gao (CBME)
Committee Members: Prof. Lionel Ni (Supervisor)
Prof. Lei Chen
Prof. Qiong Luo
Prof. Furong Gao (CBME)
Prof. Qing Li (Comp. Sci., CityU)
**** ALL are Welcome ****