More about HKUST
TWO POPULAR QUERIES IN MASSIVE MULTIDIMENSIONAL DATASETS
The Hong Kong University of Science and Technology Department of Computer Science and Engineering PhD Thesis Defence Title: "TWO POPULAR QUERIES IN MASSIVE MULTIDIMENSIONAL DATASETS" By Miss Wuman Luo Abstract The arrival of cyber-physical system era is changing data analysis in many ways. Driven by the advances in Internet and sensor techniques, the amount of multidimensional contents, such as images, trajectories, video clips, etc., has grown to an unprecedented level. Supporting multidimensional objects in large scale requires significant extensions from traditional databases. One critical issue is indexing and query processing. In this thesis, we discuss two important queries in massive multidimensional datasets: frequent path finding and high-dimensional similarity join. In the context of big data, these two queries are challenging due two reasons: 1) they both contain expensive comparison operations, and 2) the complex structures of their interested data complicate the index design. To address these issues, we conduct theoretical analysis, advanced algorithm design, as well as extensive experiments on large-scale datasets. First, we address the problem of frequent path finding by proposing a new problem named the Most Frequent Path with Temporal Constraints (MFPTC). Specifically, given a set of time periods T, a source s, and a destination d, MFPTC query aims at computing the Most Frequent Path (MFP) from s to d during T. This query not only reveals the common routing preferences of the past travelers, but also considers the time effectiveness of MFP. To reflect the common sense notion of frequent paths, we further propose a novel MFP definition that satisfies three key properties: suffix-optimal, length-insensitive, and bottleneck-free. Assuming that all trajectories have been matched to a road network, we solve the MFPTC problem in two major steps. First, we develop efficient footmark construction algorithms to calculate the path frequencies by constructing a weighted sub-graph of the road network called footmark graph. Specifically, to deal with very large datasets, we devise two index structures to reduce the number of trajectories need to be fetched from the disk. Second, we propose a dynamic programming algorithm to search the MFP in the footmark graph using a "more frequent" relation defined over path frequencies. Second, we study how to perform parallel high-dimensional similarity joins (HDSJs) in MapReduce efficiently in the MapReduce paradigm. Particularly, we propose a cost model to demonstrate that it is important to take both communication cost and computation cost into account as dimensionality and data volume increases. To this end, we propose DAA (Dimension Aggregation Approximation), an efficient compression approach that can help significantly reduce both these costs when performing parallel HDSJs. Moreover, we design DAA-based parallel HDSJ algorithms which can scale up to massive data sizes and very high dimensionality. Finally, we conduct comprehensive experiments using real large-scale datasets. The number of trajectories for path finding is up to 10 million and the average trajectory length (in terms of the number of vertices) is 21. The number of image vectors for join is up to 1 million and the dimensionality of the vectors is up to ten thousand. The results show that our proposed algorithms have desirable performance in the context of large-scale datasets. Date: Thursday, 29 November 2012 Time: 10:15am - 12:15pm Venue: Room 2404 Lifts 17/18 Chairman: Prof. Ping Gao (CBME) Committee Members: Prof. Lionel Ni (Supervisor) Prof. Lei Chen Prof. Qiong Luo Prof. Furong Gao (CBME) Prof. Qing Li (Comp. Sci., CityU) **** ALL are Welcome ****