Output-optimal Massively Parallel Streaming Joins

MPhil Thesis Defence


Title: "Output-optimal Massively Parallel Streaming Joins"

By

Mr. Serafeim PAPADIAS


Abstract

The advent of big data has caused huge, rapid and volatile data streams to 
emerge, pushing the research community to design both real-time Distributed 
Stream Processing Systems (DSPSs) and streaming algorithms that run on top of 
those systems. DSPSs must exhibit a variety of features, such as high-throughput 
and low-latency processing of data streams. In the first part of this thesis, 
we present the state-of-the-art DSPSs and describe the features that make 
them unique. In the second part, we focus on the problem of join processing in 
the streaming context. Specifically, we present the first output-optimal join 
algorithm for stream join processing, called Streaming Randomized HyperCube 
(SRHC). The algorithm operates optimally in the presence of high skew, 
considering both the input and the output sizes of the join, a feature that 
makes it quite suitable for many-to-many joins. Finally, we implement SRHC on 
top of Flink and evaluate its efficiency against state-of-the-art join 
algorithms through experiments on both synthetic and real datasets.


Date:			Wednesday, 5 September 2018

Time:			3:00pm - 5:00pm

Venue:			Room 5566
 			Lifts 27-28

Committee Members:	Dr. Ke Yi (Supervisor)
 			Dr. Raymond Wong (Chairperson)
 			Dr. Qiong Luo


**** ALL are Welcome ****