More about HKUST
Approximate Join Processing by Random Sampling
The Hong Kong University of Science and Technology Department of Computer Science and Engineering PhD Thesis Defence Title: "Approximate Join Processing by Random Sampling" By Mr. Bin WU Abstract For many of today’s data-driven analytical tasks, users often need to pose ad hoc complex join queries involving multiple relational tables over gigabytes or even terabytes of data. When the data set is large, computing exact answers can be time-consuming. An important observation is that these analytical queries often do not require exact results. Therefore, approximate query processing techniques have been extensively studied in the literature, which offer users a nice and flexible tradeoff between query efficiency and accuracy. While the problem is relatively well solved when the query is posed on a single table, current solutions for complex join queries involving multiple tables are still far from satisfactory. In this thesis proposal, we set to address this latter issue, making progress on both the theory and practice of approximate join processing through the use of random sampling. Firstly, we study the problem in a simplified setting, that of estimating the number of triangles in a large graph, which is self-join over the three copies of the same relation with two attributes. This problem has been extensively studied in the streaming model, which by definition requires at least one pass over the entire data set. Nevertheless, we observe that the ideas in many of these streaming algorithms can be coupled with random sampling to yield sublinear-time algorithms. We make these random sampling triangle counting algorithms more explicit, and present an experimental and analytical comparison of different approaches, identifying the best performers among a number of candidates. Next, we study the general problem of approximate processing of complex join queries. We generalize our triangle sampling algorithm to arbitrary join queries, by performing random walks over the underlying join graph. We also design an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori. Extensive experiments using the TPC-H benchmark have demonstrated the superior performance of wander join over previous works. Finally, we have implemented our algorithm in a number of popular database systems, demonstrating the practicality of the algorithm and its performance in full-fledged systems. This includes PostgreSQL, a widely used opensource database system, SparkSQL, the next-generation parallel data processing platform designed to scale, and PL/SQL, so that our algorithm can be provided as an add-on package to commercial database systems whose kernel code is not accessible. Date: Wednesday, 31 May 2017 Time: 10:00am - 12:00noon Venue: Room 4475 Lifts 25/26 Chairman: Prof. Jianan Qu (ECE) Committee Members: Prof. Ke Yi (Supervisor) Prof. Lei Chen Prof. Qiong Luo Prof. Jiheng Zhang (IELM) Prof. Xuemin Lin (Comp. Sci. & Engg., U of NSW) **** ALL are Welcome ****