Efficient Processing of Complex Join Queries on the Cloud

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Efficient Processing of Complex Join Queries on the Cloud"

By

Mr. Xiaofei ZHANG


Abstract

The join operation is one of the most expressive and expensive data analytic
tools in traditional database systems. Along with the exponential growth
of various data collections, NoSQL data storage has risen as the 
prevailing solution for Big Data. However, without the strong support of
heavy indexing, the join operator becomes even more crucial and challenging
for querying or mining massive data. There have been intensive studies of
different types of join operations over distributed data, e.g., similarity
join, set join and fuzzy join, all of which focus on efficient join query
evaluation by exploiting the massive parallelism of the MapReduce computing
framework on the Cloud platform.
However, the multi-way generalized join problem, which is referred to as
the complex join in this thesis, has not yet been thoroughly explored. The
main challenge of the complex join lies in, given a number of processing
units, mapping a complex join query to a number of parallel tasks and
executing them in a well-scheduled sequence, such that the total
processing time span is minimized. In this thesis, we demonstrate how our
complex join solution can be applied to query processing in various data
analytic scenarios, e.g., querying RDF data and pattern matching over
graph data. To summarize, our study covers the following four aspects:

1) We propose a cost-model-based RDF join processing solution using
MapReduce and a general-purpose optimization strategy;

2) We propose a novel representation of RDF data on Cloud platforms,
based on which we propose an I/O-efficient strategy to evaluate SPARQL
queries as quickly as possible;

3) We study the problem of efficient processing of multi-way Theta-join 
queries using MapReduce from a cost-effective perspective;

4) We develop a complete solution framework for join-based efficient 
analysis over distributed graphs using the distance join query as an 
example.

We validate our solutions through extensive experiments and discuss
several interesting research directions for complex join processing on
the Cloud.


Date:			Wednesday, 19 June 2013

Time:			11:00am – 1:00pm

Venue:			Room 3494
 			Lifts 25/26

Chairman:		Prof. Qing Li (ISOM)

Committee Members:	Prof. Lei Chen (Supervisor)
 			Prof. Dik-Lun Lee
 			Prof. Ke Yi
 			Prof. Yeou-Koung Tung (CIVL)
                        Prof. Jianliang Xu (Comp. Sci., HKBU)


**** ALL are Welcome ****