BUILDING SUMMARIES OVER STREAMING DATA

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "BUILDING SUMMARIES OVER STREAMING DATA"

By

Mr. Ge LUO


Abstract

Many of today’s applications generate huge amounts of data in a streaming 
fashion. The data often has to be processed in real time with limited 
memory space. On the other hand, it has been observed that in most cases, 
the value density of these data is rather low, and they can usually be 
compacted in a small summary that can be used to serve queries on the 
original data without losing much accuracy. Therefore, building various 
small summaries over streaming data has received much attention in both 
the algorithms and database communities in recent years.

This thesis proposal makes three contributions to this area. First, we 
continue the study of building quantile summaries over streaming data. 
This is a fundamental problem that has been studied for over 30 years, yet 
there has been limited formal comparison of the competing methods, and no 
comprehensive study of their performance. In this proposal, we remedy this 
deficit by providing a taxonomy of different methods, and describe 
efficient implementations. In doing so, we also propose and analyze 
variations that have not been explicitly studied before, yet which turn 
out to perform the best.

Second, we revisit another classical problem, which is to build a 
piecewise linear application (PLA) over streaming time series data. Prior 
work addressed two versions of the problem, where either the PLA consists 
of disjoint segments, or it is required to be a continuous piecewise 
linear function. However, we observe that neither minimizes the true 
representation size of the PLA, i.e., the number of parameters required to 
represent it. In this proposal, we design an algorithm that generates the 
optimal PLA in terms of representation size while meeting the prescribed 
max-error guarantee.

Thirdly, we propose a new type of summaries, called persistent data 
sketching. All existing data summaries are updated upon the arrival of 
every element in the stream, thus destroying its previous version. A 
persistent sketch, on the other hand, preserves all its previous versions 
as it is updated over time. Every update creates a new version, while all 
the versions are kept compactly so that any previous version can still be 
queried. This gives the summary the ability to answer queries about the 
stream at any prior time.

Finally we study a simplified version of above persistent data sketching 
problem which only need to answer historical queries in the cache-register 
model. We will first show a theoretical reduction between sliding window 
summary and persistent summary, and then propose a more practical 
persistent summary for this model.


Date:			Tuesday, 15 December 2015

Time:			10:00am - 12:00noon

Venue:			Room 5566
 			Lifts 27/28

Chairman:		Prof. Chii Shang (CIVL)

Committee Members:	Prof. Ke Yi (Supervisor)
 			Prof. Lei Chen
 			Prof. Siu-Wing Cheng
 			Prof. Yang Xiang (MATH)
 			Prof. Wing-Kai Hon (National Tsinghu U., Taiwan)


**** ALL are Welcome ****