More about HKUST
Join Algorithms: From External Memory to the BSP
PhD Thesis Proposal Defence Title: "Join Algorithms: From External Memory to the BSP" by Miss Xiao HU Abstract: Database systems have been traditionally disk-based, which had motivated the extensive study on external memory (EM) algorithms. However, as RAMs continue to get larger and cheaper, modern distributed database systems are increasingly adopting a main memory based, shared-nothing architecture, exemplified by systems like Spark and Flink. These systems can be abstracted by the bulk synchronous parallel (BSP) model (with variants like the MPC model and the MapReduce model), and there has been a strong revived interest in designing BSP algorithms for handling large amounts of data. With hard disks starting to fade away from the picture, EM algorithms may now seem less relevant. However, we observe that the recently developed join algorithms under the BSP model have a high degree of resemblance with their counterparts in the EM model. In this proposal, we study the relationships between the EM and BSP model, and present a general theoretical framework for converting EM algorithms to the BSP. More specifically, the current state of art for BSP algorithm design is still akin to that of PRAM (a fundamental mode in parallel computing but seldom directly used in parallel algorithm), i.e., one has to specify, in each round, what each BSP processor should compute and to which other processors messages should be sent. Instead, work-depth models have enjoyed much more popularity, as they relieve the algorithm designers and programmers from worrying about how various tasks should be assigned to each of the processors. Meanwhile, they also make it easy to study the fundamental parallel complexity of the algorithm, namely work and depth, which are irrelevant to the number of processors available. Seeing the impact of the work-depth models on parallel algorithm design, we propose an EM work-depth model, by incorporating elements from the EM model into an internal memory work-depth model (we choose the multi-thread model). We show that algorithms designed in this model can be optimally simulated by the BSP if possible at all, and illustrate how it can be used to more easily design BSP algorithms by parallelizing the corresponding EM algorithms. This simulation result is quite friendly to many problems, in particular those based on divide-and-conquer. The primary target problem we have investigated is the join problem, which is the most central operation in relational databases. By parallelizing the existing EM algorithms, we are able to obtain new, better, and simpler join algorithms in the BSP. This means that EM algorithms, which were traditionally designed to optimize disk I/O, can also be used in today's main memory only, shared-nothing systems. Date: Monday, 23 April 2018 Time: 4:00pm - 6:00pm Venue: Room 2463 (lifts 25/26) Committee Members: Dr. Ke Yi (Supervisor) Prof. Mordecai Golin (Chairperson) Dr. Sunil Arya Prof. Siu-Wing Cheng **** ALL are Welcome ****