Sublinear Algorithms for Massive Data

PhD Thesis Proposal Defence


Title: "Sublinear Algorithms for Massive Data"

by

Mr. Di CHEN


Abstract:

Sublinear algorithms address the rapid growth in data volume with a simple 
yet powerful premise, that useful tasks can be performed with even fewer 
resources than required to simply store the data.

This thesis studies randomized algorithms for massive data. We either 
devise new algorithms, or improve analysis on existing algorithms, 
resulting in meaningful theoretical guarantees with sublinear space or 
communication.

First, we present an algorithm that uses sublinear communication to 
perform set reconciliation under a `noisy data' model, where two data 
points shall be considered `the same' when the distance between them is 
small, modelling tiny perturbations caused in data due to some form of 
noise.

The second is a 1-pass streaming algorithm for estimating the number of 
distinct entities in the same noisy data model. Geometrically, this may be 
seen as counting sparsely placed clusters of small diameter, using space 
that is sublinear in the number of clusters.

Finally, we give an improved analysis of random Fourier features (RFF) for 
the Gaussian kernel, a popular data-oblivious embedding of kernel 
functions. In contrast to competing techniques that are typically 
data-dependent, RFF can be used under the data stream setting. Our work 
justifies the use of RFF in a variety of learning problems under the 
Gaussian kernel distance.


Date:			Friday, 23 September 2016

Time:                  	9:00am - 11:00am

Venue:                  Room 4475
                         (lifts 25/26)

Committee Members:	Prof. Mordecai Golin (Supervisor)
 			Dr. Ke Yi (Supervisor)
  			Dr. Raymond Wong (Chairperson)
  			Dr. Sunil Arya


**** ALL are Welcome ****