More about HKUST
Approximate Join Processing by Random Sampling
PhD Thesis Proposal Defence
Title: "Approximate Join Processing by Random Sampling"
by
Mr. Bin WU
Abstract:
For many of today’s data-driven analytical tasks, users often need to pose
ad hoc complex join queries involving multiple relational tables over
gigabytes or even terabytes of data. When the data set is large, computing
exact answers can be time-consuming. An important observation is that
these analytical queries often do not require exact results. Therefore,
approximate query processing techniques have been extensively studied in
the literature, which offer users a nice and flexible tradeoff between
query efficiency and accuracy. While the problem is relatively well solved
when the query is posed on a single table, current solutions for complex
join queries involving multiple tables are still far from satisfactory.
In this thesis proposal, we set to address this latter issue, making
progress on both the theory and practice of approximate join processing
through the use of random sampling.
Firstly, we study the problem in a simplified setting, that of estimating
the number triangles in a large graph, which is self-join over the three
copies of the same relation with two attributes. This problem has been
extensively studied in the streaming model, which by definition requires
at least one pass over the entire data set. Nevertheless, we observe that
the ideas in many of these streaming algorithms can be coupled with random
sampling to yield sublinear-time algorithms. We make these random sampling
triangle counting algorithms more explicit, and present an experimental
and analytical comparison of different approaches, identifying the best
performers among a number of candidates.
Next, we study the general problem of approximate processing of complex
join queries. We generalize our triangle sampling algorithm to arbitrary
join queries, by performing random walks over the underlying join graph.
We also design an optimizer that chooses the optimal plan for conducting
the random walks without having to collect any statistics a priori.
Extensive experiments using the TPC-H benchmark have demonstrated the
superior performance of wander join over previous works.
Finally, we have implemented our algorithm in a number of popular database
systems, demonstrating the practicality of the algorithm and its
performance in full-fledged systems. This includes PostgreSQL, a widely
used opensource database system, SparkSQL, the next-generation parallel
data processing platform designed to scale, and PL/SQL, so that our
algorithm can be provided as an add-on package to commercial database
systems whose kernel code is not accessible.
Date: Tuesday, 17 January 2017
Time: 9:30am - 11:30am
Venue: Room 5510
(lifts 25/26)
Committee Members: Dr. Ke Yi (Supervisor)
Dr. Qiong Luo(Chairperson)
Prof. Lei Chen
Dr. Raymond Wong
**** ALL are Welcome ****