Large-Scale In-Memory Data Processing
PhD Thesis Proposal Defence

Title: "Large-Scale In-Memory Data Processing"

by

Mr. Zhiqiang MA

Abstract:

As cloud-based computation grows to be an increasingly important paradigm, providing a general computational interface and data substrate to support datacenter-scale programming has become an imperative research agenda. Traditional cloud computing technologies, such as MapReduce, use disk-based file systems as the system-wide substrate for data storage and sharing. A distributed file system provides a global name space and stores data persistently, but it also introduces significant overhead. Several recent systems store data in DRAM instead and tremendously improve the performance of cloud computing systems. However, both our own experience and related work indicate that simply substituting distributed DRAM for the file system does not provide a solid and viable foundation for data storage and processing in the datacenter environment, and the capacity of such systems is limited by the amount of physical memory in the cluster.

We view the unified physical memory of many hosts as a solid data substrate for large-scale, efficient data processing in cloud-based systems. We investigate the limitations of MapReduce, a traditional file system-based system, using parallel project compilation as a probing case: a workload with moderate-size data and dependences among numerous computational steps. We propose organizing in-memory data processing on many compute nodes by presenting programmers with the illusion of a big virtual machine. To this end, we design a new instruction set architecture, i0, that unifies myriads of compute nodes into a big virtual machine called MAchine ZEro (MAZE) and presents programmers with the view of a single computer where thousands of tasks run concurrently in a large, unified, and snapshotted memory space. i0 and MAZE form the foundation of the Layer Zero system, which provides a general substrate for cloud computing. Layer Zero offers a simple yet scalable programming model and mitigates the scalability bottleneck of traditional distributed shared memory systems. Along with an efficient execution engine, the capacity of a Layer Zero system scales up to support large clusters. We have implemented and tested Layer Zero on four platforms, and our evaluation shows that it delivers excellent performance and scalability.

On the other hand, simply substituting distributed DRAM for the file system does not fulfill the needs of many data storage and processing applications in the datacenter environment: the capacity of such systems is limited by the amount of physical memory in the cluster, and they do not provide data persistence mechanisms. We propose an improved data substrate that unifies the physical memory and disk resources of many compute nodes to form a system-wide data substrate for large-scale data processing. The substrate provides a general memory-based abstraction, takes advantage of DRAM in the system to accelerate computation, and, transparently to programmers, scales the system to handle large datasets by swapping data to disks and remote servers. The memory-based data substrate can also provide a solid foundation for data storage systems such as key/value stores.

Date:     Wednesday, 9 April 2014

Time:     2:00pm - 4:00pm

Venue:    Room 3501 (Lifts 25/26)

Committee Members:  Dr. Lin Gu (Supervisor)
                    Dr. Kai Chen (Chairperson)
                    Dr. Ke Yi
                    Prof. Qian Zhang

**** ALL are Welcome ****