STORAGE OPTIMIZATION FOR LARGE WIDE TABLES IN HADOOP

MPhil Thesis Defence


Title: "STORAGE OPTIMIZATION FOR LARGE WIDE TABLES IN HADOOP"

By

Mr. Wei LI


Abstract

Recent advances in data warehousing technologies are enabling the storage and 
processing of extremely large data sets. In view of this opportunity, the 
leading cross-bank settlement institution in China is seeking more business 
intelligence from the large volume of historical transaction data it has 
accumulated over more than 10 years. Although Hive, a mature open-source data 
warehousing solution built on top of Hadoop, has been adopted in production, 
the efficiency of data storage and processing is suboptimal because the system 
lacks advanced customization and optimization. 
Specifically, overlapping fractions of the original data set are materialized 
into different tables, introducing inter-table redundancy. Columns have also 
been added and changed inconsistently over the more than 10-year history of 
the data sets, resulting in intra-table redundancy. Multiple 
user groups with various levels of needs use different processing engines to 
access the data sets, causing cross-platform redundancy and difficulty in 
system migration. Based on these observations, we propose an optimization 
design that is transparent to all users of the system. It exploits the inter- 
and intra-table redundancy to improve space efficiency. It also employs a 
cross-platform row columnar storage format to further improve the space 
efficiency and make the data accessible to multiple processing engines. We 
apply our optimizations to the production system and conduct extensive 
experiments on one month of the historical transaction data. The results show 
orders-of-magnitude improvements in both data storage and processing efficiency.


Date:			Friday, 8 May 2015

Time:			3:00pm - 5:00pm

Venue:			Room 5508
 			Lifts 25/26

Committee Members:	Prof. Lionel Ni (Supervisor)
 			Dr. Qiong Luo (Chairperson)
 			Dr. Lei Chen


**** ALL are Welcome ****