The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Hadoop-based Storage System for Big Spatio-Temporal Data Analytics"
By
Mr. Haoyu Tan
Abstract
During the past decade, various GPS-equipped devices have generated a
tremendous amount of data with time and location information, which we
refer to as big spatio-temporal data. As the data continues to grow, it
will outgrow the capabilities of any serial processing technique, making
it necessary to perform data analytics in parallel.
There are two main paradigms for large-scale data processing: parallel
relational database management systems (RDBMS) and MapReduce. The debate
over which paradigm is superior lasted for several years and led to the
widely accepted view that each has advantages in different aspects. It was
once believed that RDBMS delivers better performance while MapReduce
scales out more easily owing to its emphasis on fault-tolerant design.
However, recent work from both sides demonstrates that techniques used in
one paradigm can be incorporated into the other to remedy its
deficiencies. In the context of our research, we use Hadoop, an
open-source implementation of MapReduce and related components, to perform
spatio-temporal data analytics. The main consideration is that Hadoop
provides a low-level application programming interface (API) that is more
flexible for implementing complex data mining algorithms than the
structured query language (SQL) supported by RDBMS.
In this thesis, we first describe the design and implementation of CloST,
a scalable big spatio-temporal data storage system that supports data
analytics using Hadoop. The main objective of CloST is to avoid scanning
the whole dataset when a spatio-temporal range is given. To this end, we
propose a novel data model that gives special treatment to three core
attributes: object ID, location, and time. Based on this data model, CloST
hierarchically partitions data using all core attributes, which enables
efficient parallel processing of spatio-temporal range scans. Guided by
the data characteristics, we devise a compact storage structure that
reduces the storage size by an order of magnitude. In addition, we propose
scalable bulk loading algorithms capable of incrementally adding new data
into the system. We then address the problem of creating secondary indexes
in CloST in parallel. In particular, we present the design of a general
framework for parallel R-tree packing using Hadoop. This framework packs
the R-tree one level at a time, from the bottom up. For the lower levels,
which contain a large number of rectangles, we propose a partition-based
algorithm for parallel packing. We also discuss two spatial partitioning
methods that can efficiently handle heavily skewed datasets.
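To make the bottom-up packing idea concrete, the following is a minimal
serial sketch of level-by-level R-tree packing using Sort-Tile-Recursive
(STR) style grouping. All names here (pack_level, pack_tree, mbr) are
illustrative and not taken from CloST; the thesis parallelizes the
per-level packing step with Hadoop, which this sketch does not show.

```python
# Hypothetical sketch: build an R-tree bottom up by repeatedly packing
# the rectangles of one level into parent nodes of capacity B.
import math


def mbr(rects):
    """Minimum bounding rectangle of (xmin, ymin, xmax, ymax) tuples."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def pack_level(rects, capacity):
    """Pack one level: group child rectangles into parent nodes.

    STR-style packing: sort rectangles by x-center into vertical slices,
    then sort each slice by y-center and cut it into nodes of at most
    `capacity` children. Returns (parent_mbrs, groups).
    """
    n = len(rects)
    num_nodes = math.ceil(n / capacity)
    slices = math.ceil(math.sqrt(num_nodes))
    per_slice = slices * capacity
    by_x = sorted(rects, key=lambda r: (r[0] + r[2]) / 2)
    groups = []
    for i in range(0, n, per_slice):
        strip = sorted(by_x[i:i + per_slice],
                       key=lambda r: (r[1] + r[3]) / 2)
        for j in range(0, len(strip), capacity):
            groups.append(strip[j:j + capacity])
    return [mbr(g) for g in groups], groups


def pack_tree(leaf_rects, capacity):
    """Pack levels sequentially, bottom up, until a single root remains."""
    levels = [leaf_rects]
    while len(levels[-1]) > 1:
        parents, _ = pack_level(levels[-1], capacity)
        levels.append(parents)
    return levels  # levels[0] = leaves, levels[-1] = [root MBR]
```

In the parallel setting described in the abstract, each vertical slice (or
spatial partition) of a lower level can be packed by an independent
reducer, and the resulting parent rectangles feed the next level.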
To evaluate the performance, we conduct extensive experiments on large
real datasets of up to 200 GB containing up to 2 billion spatial objects.
The results show that CloST achieves fast data loading, high scalability
in query processing, a high data compression ratio, and good performance
in building very large secondary spatial indexes (R-trees).
Date: Thursday, 29 November 2012
Time: 12:30pm - 2:30pm
Venue: Room 3588
Lifts 27/28
Chairman: Prof. Zongjin Li (CIVL)
Committee Members: Prof. Lionel Ni (Supervisor)
Prof. Lei Chen
Prof. Qiong Luo
Prof. Furong Gao (CBME)
Prof. Qing Li (Comp. Sci., CityU)
**** ALL are Welcome ****