MPhil Thesis Defence "The Development of a Structural Index Tree for Processing XML Data" By Mr. Sheung-Chak Cheng Abstract XML has become the de facto standard for data exchange over the Web. However, XML data contain a lot of duplicate structures, resulting in much redundant processing on path navigation when querying the data. We propose an index structure, the Structure Index Tree (SIT), which is defined on an XML document. The SIT eliminates duplicate structures arising from the equivalent branches and subtrees in an XML document by merging them into a concise structure. Query evaluation is thus speeded up as a result of pruned search space. The SIT, as well as many other existing indexes that are defined based on the partition of equivalent classes of nodes in an XML tree, is only efficient for the evaluation of structural queries. In the presence of value-based predicates, equivalent nodes in the same partition become distinguishable by their data contents. To enhance the applicability of the index, we propose a lattice structure on the SIT (SIT-lattice). A SIT-lattice element (SLE) is an index of an arbitrary subset of paths in the document. Since paths represent the structure of the XML data and each text node is associated with a unique path, we can define an SLE to filter out both irrelevant structures and text nodes. We propose a set of SLE operations and devise efficient techniques to generate SLEs that can be tailored towards query workloads. We apply the SIT in the design and implementation of an XML compressor, XQzip, to query compressed XML data. XQzip uses a block-compression strategy which strikes a balance of the compression strategies adopted in existing unqueriable and queriable compressors. In querying the compressed data, we devise a linear-time evaluation algorithm based on the SIT. We also employ the decompressed blocks to manage a buffer pool, which effectively avoids repeated decompression as well as caches data for query evaluation. We evaluate the SIT-lattice and XQzip on a wide spectrum of benchmark XML datasets and synthetic datasets. On average, the SIT can reduce the search space 140 times for most real datasets. The results also show that node selection using the SIT is approximately twice as fast as that using a similar index on XML data. For the SIT-lattice, our experiments show that SLEs help to speed up significantly the evaluation of path queries with value-based and aggregation-based conditions. We also demonstrate that SLEs are able to support effective querying over very large XML documents in memory-limited hand-held devices. Finally, we show that XQzip achieves a compression ratio 17% better and a querying time 13 times less than those of another well-known queriable XML compressor. Date: Wednesday, 18 August 2004 Time: 2:00p.m.-4:00p.m. Venue: Room 1402 Lifts 25-26 Committee Members: Prof. Wilfred Ng (Supervisor) Prof. Dik-Lun Lee (Chairperson) Prof. Hongjun Lu **** ALL are Welcome ****