|M.Sc Thesis||Department of Computer Science|
|Supervisor:||Prof. Shmueli Oded|
|Full Thesis text|
XML is one of the primary encoding schemes for data and knowledge. We investigate incremental physical data clustering in systems that store XML documents using a native format. We formulate the XML clustering problem as an augmented (with sibling edges) tree partitioning problem and propose the PIXSAR (Practical Incremental XML Sibling Augmented Reclustering) algorithm for incrementally clustering XML documents. We show the fundamental importance of workload-driven dynamically rearranging storage. We also touch upon extending PIXSAR for handling insertions and deletions. PIXSAR incrementally executes reclustering operations on selected subgraphs of the global augmented document tree. The subgraphs are implied by significant changes in the workload. As the workload changes, PIXSAR incrementally adjusts the XML data layout so as to better fit the workload. PIXSAR's main parameters are the radius, in pages, of the augmented portion to be reclustered and the way reclustering is triggered.
We also adapted PIXSAR to operate on a physical disk. Further we suggest novel architectures that utilize PIXSAR in multi-level storage. We propose a specific algorithm for one of these architectures.
We use an experimental data clustering system that includes a disk simulator and a File System simulator for storing native XML data. We use a novel method for 'exporting' the Saxon query processor into our setting. Experimental results indicate that using PIXSAR significantly reduces the number of page faults (counting ALL page faults incurred while querying the document as well as those during maintenance operations) thereby resulting in improved query performance (often by 20%-40%). We also conducted extensive experiments on a physical disk. Interestingly, the improvements observed on the real disk are much larger (often by more than 50%) than those predicted by simulations.
In addition to the XML augmented tree, there are also indices. The kind of index we consider is based on a XPath expression and it consists of index entries pointing to XML target nodes. Using such index entries one jumps directly to target nodes. Often, target XML nodes are accessed in temporal proximity and hence, for paging reasons, it is beneficial to store them on the same disk page. An extension to the PIXSAR algorithm, called iPIXSAR, is proposed. It extends PIXSAR so as to make storing decisions of target XML nodes based on possible membership in more than one tree. Experimental results show that in the presence of indices iPIXSAR is superior to PIXSAR by at times up to 8%.