The Hong Kong University of Science and Technology Department of Computer Science PhD Thesis Defence "Structural Join: Processing Algorithms and Size Estimation" By Mr. Wei Wang Abstract Extensible Markup Language (XML) has become the de facto standard for data representation and exchange over the Internet. It has enabled and stimulated a great multitude of research and applications. However, unique features of XML and its query languages have posed great challenges to efficient managing and querying large volumes of XML data. This, in turn, hinders progress of XML research and the development of applications based on XML. This dissertation makes the attempt to enhance the efficiency of XML database management system by advanced query processing and optimization techniques. More specifically, we studied one widely accepted core operator in XML query processing, structural join. A structural join takes two sets of XML nodes as input and returns pairs of nodes such that a special ancestor-descendant relationship holds between them. We first proposed a novel coding scheme, PBiTree coding, based on which an efficient and robust structural query processing framework is developed. The PBiTree code enables efficient checking of the ancestor-descendant relationship between two nodes solely based on their PBiTree codes. We developed algorithms in the framework that are optimized for various combinations of physical settings. In particular, the newly proposed partitioning based algorithms can process structural joins efficiently without sorting or indexing. Experimental results demonstrate that the structural join processing algorithms based on the proposed coding scheme outperform existing algorithms significantly. When structural join is used in XML query processing, an essential issue is to estimate the size of its result for selecting optimal query processing plan. We studied the result size estimation problem of structural joins, and proposed two models, the interval model and the position model, under which the original estimation problem can be converted into estimating the size of a spatial join and a relational equijoin respectively. A set of estimation methods based on the histogram and sampling techniques were developed, which have not only high accuracy but also theoretical guarantees on the estimation. Comprehensive performance studies were conducted. The results demonstrated that the accuracy and robustness of our newly proposed estimation methods outperforms those of the previous method up to an order of magnitude. Date: Monday, 26 January 2004 Time: 2:30p.m.-4:30p.m. Venue: Room 2302 Lifts 17-18 Chairman: Prof. Ki Ling Cheung (ISMT) Committee Members: Prof. Hongjun Lu (Supervisor) Prof. Frederick Lochovsky Prof. Dimitris Papadias Prof. James S.H. Kwok (ISMT) Prof. Jeffrey Xu Yu (Sys Engg & Engg Mgmt, CUHK) **** ALL are Welcome ****