More about HKUST
Incomplete Data Analysis in Smart City Applications
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Incomplete Data Analysis in Smart City Applications"
By
Mr. Siyuan Liu
Abstract
Incomplete data in my work is defined as the data with extremely limited
samples or data observed, which brings big challenges to data mining and
real applications. Such extremely limited sample data would obviously give
us terrible bias and inaccurate results. In my work, I studied three real
applications in smart city and proposed novel algorithms and methodology
to tackle the incomplete data analysis problem. Application one: Given one
over ten thousand of the whole set of vehicles in a city, how can we still
retrieve the vehicle distribution and detect the hot spots or crowded
areas in the city? The traditional density-based clustering methods work
not well because of the very limited and errorable vehicle density or
location information. Hence we need new algorithms to handle such
incomplete data, in terms of accuracy and scalability. On the other hand,
the vehicle traces are typical spatio-temporal data, which requires
efficient approaches. In this paper, we have an interesting observation
that the vehicle speed can indicate the crowdedness of a given area. In
other words, if a given area is very crowded, then the vehicles' speed in
this area is low; while if this area is not crowded, then the vehicles'
speed in this area prefers high. As such the mobility of samples is
naturally incorporated and a novel non-density-based clustering method is
developed, called mobility-based clustering. Several key factors beyond
the vehicle crowdedness have been identified and techniques to compensate
these effects are proposed. We evaluate the performance of mobility-based
clustering based on real traffic situations. Experimental results show
that using 0.3 % of vehicles as the samples, mobility-based clustering can
accurately identify hot spots which can hardly be obtained by the latest
representative algorithm UMicro. Application two: Given one over one
hundred thousand of the whole set of population in a city, how can we
still retrieve the population distribution and detect the population
bursts in the city? To address the difficulties of lacking real population
data, we take the advantage of mobile phone networks, which offer enormous
spatiotemporal communication data among persons. More importantly, we find
the fact that we can utilize these mobile phone data to infer and
approximate the population data. Thus, we can study the population burst
detection problem taking advantages of the unique features hidden in the
mobile phone data. In this work, we present PBD, a system to conduct
Population Burst Detection. First, we propose an effective clustering
method, correlation-based clustering, to cluster the incomplete location
information from the mobile phone data. Then, we design an adaptive
parameter-free detection method, R-scan, to capture the distributed
dynamic bursts. Finally, we devise an efficient algorithm, BT-miner, to
retrieve burst trajectories. The experimental results on real life mobile
phone data confirm the effectiveness and efficiency of the proposed
algorithms. Application three: We study a very interesting and practical
problem, pattern matching in a distributed mobile environment. Pattern
matching is a well-known problem and extensive research has been conducted
for performing effective and efficient search. However, previous
approaches assume that data are centrally stored, which is not the case in
a mobile environment (e.g., mobile phone networks), where one person's
pattern could be separately stored in a number of different stations, and
such a local pattern is incomplete compared with the global pattern. A
simple solution to pattern matching over a mobile environment is to
collect all the data distributed in base stations to a data center and
conduct pattern matching at the data center afterwards. Clearly, such a
simple solution will raise huge amount of communication cost, which may
cause the traffic congestion brought by the limited wireless bandwidth to
be even worse. Therefore, a communication efficient and search effective
solution is necessary. In this work, we propose a novel solution based on
a well-designed Weighted Bloom Filter (WBF), called, Distributed
Incomplete pattern matching (DI-matching), to find target patterns over a
distributed mobile environment. Specifically, to save communication cost
and ensure pattern matching in distributed incomplete patterns, we use WBF
to encode a query pattern and disseminate the encoded data to each base
station. Each base station conducts a local pattern search according to
the received WBF. Only qualified IDs and corresponding weights in each
base station are sent to the data center for aggregation and verification.
Through non-trivial theoretical analysis and extensive empirical
experiments on a real city-scale mobile networks data set, we demonstrate
the effectiveness and efficiency of our proposed solutions.
Date: Thursday, 18 August 2011
Time: 3:00pm – 5:00pm
Venue: Room 3501
Lifts 25/26
Chairman: Prof. Hong-Kam Lo (CIVL)
Committee Members: Prof. Lionel Ni (Supervisor)
Prof. Lei Chen
Prof. Shing-Chi Cheung
Prof. Guohua Chen (CBME)
Prof. Xiaohua Jia (Comp. Sci., CityU)
**** ALL are Welcome ****