More about HKUST
Join Algorithms: From External Memory to the BSP
PhD Thesis Proposal Defence
Title: "Join Algorithms: From External Memory to the BSP"
by
Miss Xiao HU
Abstract:
Database systems have been traditionally disk-based, which had motivated the
extensive study on external memory (EM) algorithms. However, as RAMs continue
to get larger and cheaper, modern distributed database systems are increasingly
adopting a main memory based, shared-nothing architecture, exemplified by
systems like Spark and Flink. These systems can be abstracted by the bulk
synchronous parallel (BSP) model (with variants like the MPC model and the
MapReduce model), and there has been a strong revived interest in designing BSP
algorithms for handling large amounts of data. With hard disks starting to fade
away from the picture, EM algorithms may now seem less relevant. However, we
observe that the recently developed join algorithms under the BSP model have a
high degree of resemblance with their counterparts in the EM model. In this
proposal, we study the relationships between the EM and BSP model, and present
a general theoretical framework for converting EM algorithms to the BSP.
More specifically, the current state of art for BSP algorithm design is still
akin to that of PRAM (a fundamental mode in parallel computing but seldom
directly used in parallel algorithm), i.e., one has to specify, in each round,
what each BSP processor should compute and to which other processors messages
should be sent. Instead, work-depth models have enjoyed much more popularity,
as they relieve the algorithm designers and programmers from worrying about how
various tasks should be assigned to each of the processors. Meanwhile, they
also make it easy to study the fundamental parallel complexity of the
algorithm, namely work and depth, which are irrelevant to the number of
processors available. Seeing the impact of the work-depth models on parallel
algorithm design, we propose an EM work-depth model, by incorporating elements
from the EM model into an internal memory work-depth model (we choose the
multi-thread model). We show that algorithms designed in this model can be
optimally simulated by the BSP if possible at all, and illustrate how it can be
used to more easily design BSP algorithms by parallelizing the corresponding EM
algorithms. This simulation result is quite friendly to many problems, in
particular those based on divide-and-conquer. The primary target problem we
have investigated is the join problem, which is the most central operation in
relational databases. By parallelizing the existing EM algorithms, we are able
to obtain new, better, and simpler join algorithms in the BSP. This means that
EM algorithms, which were traditionally designed to optimize disk I/O, can also
be used in today's main memory only, shared-nothing systems.
Date: Monday, 23 April 2018
Time: 4:00pm - 6:00pm
Venue: Room 2463
(lifts 25/26)
Committee Members: Dr. Ke Yi (Supervisor)
Prof. Mordecai Golin (Chairperson)
Dr. Sunil Arya
Prof. Siu-Wing Cheng
**** ALL are Welcome ****