MPhil Thesis Defence


"The Development of a Structural Index Tree for Processing XML Data"

By

Mr. Sheung-Chak Cheng


Abstract

XML has become the de facto standard for data exchange over the Web.
However, XML data contain a lot of duplicate structures, resulting in much
redundant processing on path navigation when querying the data. We propose
an index structure, the Structure Index Tree (SIT), which is defined on an
XML document. The SIT eliminates duplicate structures arising from the
equivalent branches and subtrees in an XML document by merging them into a
concise structure. Query evaluation is thus speeded up as a result of
pruned search space.

The SIT, as well as many other existing indexes that are defined based on
the partition of equivalent classes of nodes in an XML tree, is only
efficient for the evaluation of structural queries. In the presence of
value-based predicates, equivalent nodes in the same partition become
distinguishable by their data contents. To enhance the applicability of
the index, we propose a lattice structure on the SIT (SIT-lattice). A
SIT-lattice element (SLE) is an index of an arbitrary subset of paths in
the document. Since paths represent the structure of the XML data and each
text node is associated with a unique path, we can define an SLE to filter
out both irrelevant structures and text nodes. We propose a set of SLE
operations and devise efficient techniques to generate SLEs that can be
tailored towards query workloads.

We apply the SIT in the design and implementation of an XML compressor,
XQzip, to query compressed XML data. XQzip uses a block-compression
strategy which strikes a balance of the compression strategies adopted in
existing unqueriable and queriable compressors. In querying the compressed
data, we devise a linear-time evaluation algorithm based on the SIT. We
also employ the decompressed blocks to manage a buffer pool, which
effectively avoids repeated decompression as well as caches data for query
evaluation.

We evaluate the SIT-lattice and XQzip on a wide spectrum of benchmark XML
datasets and synthetic datasets. On average, the SIT can reduce the search
space 140 times for most real datasets. The results also show that node
selection using the SIT is approximately twice as fast as that using a
similar index on XML data. For the SIT-lattice, our experiments show that
SLEs help to speed up significantly the evaluation of path queries with
value-based and aggregation-based conditions. We also demonstrate that
SLEs are able to support effective querying over very large XML documents
in memory-limited hand-held devices. Finally, we show that XQzip achieves
a compression ratio 17% better and a querying time 13 times less than
those of another well-known queriable XML compressor.


Date:				Wednesday, 18 August 2004

Time:				2:00p.m.-4:00p.m.

Venue:				Room 1402
				Lifts 25-26

Committee Members:		Prof. Wilfred Ng (Supervisor)
				Prof. Dik-Lun Lee (Chairperson)
				Prof. Hongjun Lu


**** ALL are Welcome ****