PhD Thesis Proposal Defence
Title: "Large-Scale In-Memory Data Processing"
by
Mr. Zhiqiang MA
Abstract:
As cloud-based computation grows to be an increasingly important paradigm,
providing a general computational interface and data substrate to support
datacenter-scale programming has become an imperative research agenda.
Traditional cloud computing technologies, such as MapReduce, use disk-based
file systems as the system-wide substrate for data storage and sharing. A
distributed file system provides a global name space and stores data
persistently, but it also introduces significant overhead. Several recent
systems use DRAM to store data and tremendously improve the performance of
cloud computing systems. However, both our own experience and related work
indicate that a simple substitution of distributed DRAM for the file system
does not provide a solid and viable foundation for data storage and processing
in the datacenter environment, and the capacity of such systems is limited by
the amount of physical memory in the cluster.
We view the unified physical memory of many hosts as a solid data substrate
for large-scale, efficient data processing in cloud-based systems. We
investigate the limitations of the traditional file system-based MapReduce,
using parallel project compilation as a probing case with moderate-size data
and dependences among numerous computational steps.
propose organizing the in-memory data processing in many compute nodes by
presenting programmers a illusion of a big virtual machine, and design a new
instruction set architecture, i0, to unify myriads of compute nodes to form a
big virtual machine called MAchine ZEro (MAZE), and present programmers the
view of a single computer where thousands of tasks run concurrently in a large,
unified, and snapshotted memory space. i0 and MAZE form the foundation of the
Layer Zero system which provides a generate substrate for cloud computing. The
Layer Zero provides a simple yet scalable programming model and mitigates the
scalability bottleneck of traditional distributed shared memory systems. Along
with an efficient execution engine, the capacity of a Layer Zero can scale up
to support large clusters. We have implemented and tested Layer Zero on four
platforms, and our evaluation shows that Layer Zero has excellent performance
and scalability. On the other hand, the simple substitution of distributed DRAM
for the file system does not fulfill the needs of many data storage and
processing applications in the datacenter environment: the capacity of such
systems is limited by the amount of physical memory in the cluster, and they
do not provide data persistence mechanisms. We propose an improved data
substrate that unifies the physical memory and disk resources of many compute
nodes into a system-wide substrate for large-scale data processing. The
substrate provides a general memory-based abstraction, takes advantage of DRAM
in the system to accelerate computation, and, transparently to programmers,
scales the system to handle large datasets by swapping data to disks and
remote servers.
The memory-based data substrate can also provide a solid foundation for data
storage systems such as key/value stores.
Date: Wednesday, 9 April 2014
Time: 2:00pm - 4:00pm
Venue: Room 3501 (lifts 25/26)
Committee Members: Dr. Lin Gu (Supervisor)
Dr. Kai Chen (Chairperson)
Dr. Ke Yi
Prof. Qian Zhang
**** ALL are Welcome ****