TWO POPULAR QUERIES IN MASSIVE MULTIDIMENSIONAL DATASETS

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "TWO POPULAR QUERIES IN MASSIVE MULTIDIMENSIONAL DATASETS"

By

Miss Wuman Luo


Abstract

The arrival of cyber-physical system era is changing data analysis in many 
ways. Driven by the advances in Internet and sensor techniques, the amount of 
multidimensional contents, such as images, trajectories, video clips, etc., has 
grown to an unprecedented level. Supporting multidimensional objects in large 
scale requires significant extensions from traditional databases. One critical 
issue is indexing and query processing. In this thesis, we discuss two 
important queries in massive multidimensional datasets: frequent path finding 
and high-dimensional similarity join. In the context of big data, these two 
queries are challenging due two reasons: 1) they both contain expensive 
comparison operations, and 2) the complex structures of their interested data 
complicate the index design. To address these issues, we conduct theoretical 
analysis, advanced algorithm design, as well as extensive experiments on 
large-scale datasets.

First, we address the problem of frequent path finding by proposing a new 
problem named the Most Frequent Path with Temporal Constraints (MFPTC). 
Specifically, given a set of time periods T, a source s, and a destination d, 
MFPTC query aims at computing the Most Frequent Path (MFP) from s to d during 
T. This query not only reveals the common routing preferences of the past 
travelers, but also considers the time effectiveness of MFP. To reflect the 
common sense notion of frequent paths, we further propose a novel MFP 
definition that satisfies three key properties: suffix-optimal, 
length-insensitive, and bottleneck-free. Assuming that all trajectories have 
been matched to a road network, we solve the MFPTC problem in two major steps. 
First, we develop efficient footmark construction algorithms to calculate the 
path frequencies by constructing a weighted sub-graph of the road network 
called footmark graph. Specifically, to deal with very large datasets, we 
devise two index structures to reduce the number of trajectories need to be 
fetched from the disk. Second, we propose a dynamic programming algorithm to 
search the MFP in the footmark graph using a "more frequent" relation defined 
over path frequencies.

Second, we study how to perform parallel high-dimensional similarity joins 
(HDSJs) in MapReduce efficiently in the MapReduce paradigm. Particularly, we 
propose a cost model to demonstrate that it is important to take both 
communication cost and computation cost into account as dimensionality and data 
volume increases. To this end, we propose DAA (Dimension Aggregation 
Approximation), an efficient compression approach that can help significantly 
reduce both these costs when performing parallel HDSJs. Moreover, we design 
DAA-based parallel HDSJ algorithms which can scale up to massive data sizes and 
very high dimensionality.

Finally, we conduct comprehensive experiments using real large-scale datasets. 
The number of trajectories for path finding is up to 10 million and the average 
trajectory length (in terms of the number of vertices) is 21. The number of 
image vectors for join is up to 1 million and the dimensionality of the vectors 
is up to ten thousand. The results show that our proposed algorithms have 
desirable performance in the context of large-scale datasets.


Date:			Thursday, 29 November 2012

Time:			10:15am - 12:15pm

Venue:			Room 2404
 			Lifts 17/18

Chairman:		Prof. Ping Gao (CBME)

Committee Members:	Prof. Lionel Ni (Supervisor)
 			Prof. Lei Chen
 			Prof. Qiong Luo
 			Prof. Furong Gao (CBME)
                         Prof. Qing Li (Comp. Sci., CityU)


**** ALL are Welcome ****