A Novel Scalable Join Processor over Large RDF Graphs with Linkage Information Aware
MPhil Thesis Defence

Title: "A Novel Scalable Join Processor over Large RDF Graphs with Linkage Information Aware"

By Mr. Yincheng Lin

Abstract

RDF (Resource Description Framework), developed by the W3C, is a data description format for the Semantic Web. As the Semantic Web develops, the RDF data integrated from many sources keeps growing. Because of its large volume and schema-free nature, efficient processing remains a major challenge in RDF data management. Much research has been carried out to address this problem. The property table approach tries to discover correlations among predicates and stores related data in the same table, so that queries can be executed just as in a relational database. The column store approach focuses on each individual predicate: it partitions the RDF data into different tables based on the corresponding predicates and builds indices for each table. RDF-3X, a RISC-style engine for managing RDF data efficiently, keeps the original triple format of the RDF data, stores the triples directly, and builds indices over all possible permutations.

In this thesis, we go a step further to discover potential properties of RDF data and make full use of them to process queries efficiently. More specifically, we introduce:

1) Two linkage structures: star linkage and chain linkage. We extract this structure information, store it separately, and build aggregated indices on it.
2) For data that does not contain structure information, we store it in different tables based on the predicate, similar to a column store. The key difference between our storage and the column store is that we do not treat all predicates equally. We observe that some predicates are multi-valued. For such predicates, instead of a B+ tree index, we use a local bitmap index, which is better suited to them and improves query performance.
3) To obtain a high-performance query plan, we introduce a more complex and more accurate selectivity estimation that incurs no extra time cost compared with traditional estimation.

We evaluate our approach on two different RDF datasets, Billion Triple Challenge and Yago, and develop different kinds of queries. Compared with RDF-3X and MonetDB, our approach performs better, especially for queries with star linkage or chain linkage information.

Date: Wednesday, 24 August 2011
Time: 2:00pm – 4:00pm
Venue: Room 5509, Lifts 25/26

Committee Members:
Dr. Lei Chen (Supervisor)
Dr. Ke Yi (Chairperson)
Dr. Charles Zhang

**** ALL are Welcome ****