Title: Beyond Simple Aggregates: Indexing for Summary Queries

Speaker: Zhewei Wei, HKUST

Time/Date: Friday, April 8, 11-11:30a
Location: Room 3501

Abstract:

Database queries can be broadly classified into two categories:
reporting queries and aggregation queries. The former retrieves a
collection of records from the database that match the query's
conditions, while the latter returns an aggregate, such as count, sum,
average, or max (min), of a particular attribute of these records.
Aggregation queries are especially useful in business intelligence and
data analysis applications where users are interested not in the
actual records, but some statistics of them. They can also be executed
much more efficiently than reporting queries, by embedding properly
precomputed aggregates into an index.

However, reporting and aggregation queries provide only two extremes
for exploring the data. Data analysts often need more insight into the
data distribution than what those simple aggregates provide, and yet
certainly do not want the sheer volume of data returned by reporting
queries. In this paper, we design indexing techniques that allow for
extracting a statistical summary of all the records in the query.  The
summaries we support include frequent items, quantiles, various
sketches, and wavelets, all of which are of central importance in
massive data analysis.  Our indexes require linear space and extract a
summary with the optimal or near-optimal query cost.

============================================================

Title: Sampling Based Algorithms for Quantile Computation in Sensor
Networks

Speaker: Zengfeng Huang, HKUST

Time/Date: Friday, April 8, 11:30a-12p
Location: Room 3501

Abstract:

We study the problem of computing approximate quantiles in large-scale
sensor networks communication-efficiently, a problem previously
studied by Greenwald and Khana and Shrivastava et al.  Their
algorithms have a total communication cost of O(k log^2 n/eps) and O(k
log u/eps), respectively, where k is the number of nodes in the
network, n is the total size of the data sets held by all the nodes, u
is the universe size, and o is the required approximation error. In
this paper, we present a sampling based quantile computation algorithm
with O(sqrt{kh}/eps) total communication (h is the height of the
routing tree), which grows sublinearly with the network size except in
the pathological case h = Theta(k). In our experiments on both
synthetic and real data sets, this improvement translates into a 10 to
100-fold communication reduction for achieving the same accuracy in
the computed quantiles. Meanwhile, the maximum individual node
communication of our algorithm is no higher than that of the previous
two algorithms.