STORAGE OPTIMIZATION FOR LARGE WIDE TABLES IN HADOOP
MPhil Thesis Defence

Title: "STORAGE OPTIMIZATION FOR LARGE WIDE TABLES IN HADOOP"

By

Mr. Wei LI

Abstract

Recent advances in data warehousing technologies are enabling the storage and processing of extremely large data sets. In view of this opportunity, the leading cross-bank settlement institute in China is looking to extract more business intelligence from the large-volume historical transaction data it has accumulated over more than 10 years. Although Hive, a mature open-source data warehousing solution built on top of Hadoop, has been adopted in production, the efficiency of data storage and processing is suboptimal because the system lacks advanced customization and optimization.

Specifically, overlapping fractions of the original data set are materialized to different tables, introducing inter-table redundancy. Columns inside a table have also been added and changed in an inconsistent manner over the more than 10-year history of the data sets, resulting in intra-table redundancy. Multiple user groups with various levels of needs use different processing engines to access the data sets, causing cross-platform redundancy and making system migration difficult.

Based on these observations, we propose an optimization design that is transparent to all users of the system. It exploits the inter- and intra-table redundancy to improve space efficiency. It also employs a cross-platform row columnar storage format to further improve space efficiency and make the data accessible to multiple processing engines. We apply our optimizations to the institute's system and conduct extensive experiments on one month of the historical transaction data. The results show orders-of-magnitude improvements in both data storage and processing efficiency.

Date: Friday, 8 May 2015

Time: 3:00pm - 5:00pm

Venue: Room 5508, Lifts 25/26

Committee Members: Prof. Lionel Ni (Supervisor)
                   Dr. Qiong Luo (Chairperson)
                   Dr. Lei Chen

**** ALL are Welcome ****