A Survey of Column-Oriented Storage Techniques in Read-Optimized Data Warehouse Systems
PhD Qualifying Examination

Title: "A Survey of Column-Oriented Storage Techniques in Read-Optimized Data Warehouse Systems"

by

Mr. Jiangchuan Zheng

Abstract:

Most traditional DBMSs store records row by row. Historically, the row-store layout was chosen not merely for technical simplicity, but because typical transactional workloads access data at the granularity of entities. With the emergence of big data, however, has come another kind of query that is more analytical in nature: it does not care about the details of particular entities, but targets high-level statistical information that helps with data mining tasks in a warehouse environment. Analytical workloads are read-intensive, attribute-focused and big-data-oriented, in sharp contrast with transactional queries. Given these characteristics, the write-optimized row-store layout is no longer the best choice, and a redesign of the physical layer is needed.

In recent years, the column-oriented storage structure has gained popularity in both the research and industrial communities. By organizing tabular data column by column in the physical layer, column-stores outperform row-stores in processing analytical workloads, as they need only access the relevant attributes. Advantages of column-stores over row-stores include high I/O efficiency, greater opportunities for compression and high flexibility in adapting to dynamic workloads. Nevertheless, quite a few challenges exist, ranging from tuple reconstruction to compression-based query execution.

In this survey, we review major research results towards building a high-performance, analytics-oriented column-store warehouse system. We start with a description of the storage layout and execution engine in C-Store, an open-source column-store system. We then delve into several key issues in column-store systems, such as compression, tuple reconstruction and materialization strategies. We summarize key challenges and typical solutions, and describe from a system perspective how they help improve the performance of analytical workload processing. We also review major issues in applying column-store techniques in distributed environments such as MapReduce. Finally, we end this survey with some conclusions and future directions.

Date: Friday, 17 February 2012
Time: 2:00pm - 4:00pm
Venue: Room 3301A (lifts 17/18)

Committee Members:
Prof. Lionel Ni (Supervisor)
Dr. Qiong Luo (Chairperson)
Dr. Lei Chen
Dr. Lin Gu

**** ALL are Welcome ****