טכניון מכון טכנולוגי לישראל
הטכניון מכון טכנולוגי לישראל - בית הספר ללימודי מוסמכים  
Ph.D Thesis
Ph.D StudentShnaiderman Lila
SubjectParallelism in Querying and Storage for Large XML
and Graph Databases
DepartmentDepartment of Computer Science
Supervisor Professor Oded Shmueli
Full Thesis textFull thesis text - English Version


Abstract

XML is based on a tree-structured data model. Naturally, the most popular XML querying language (XPath) uses patterns of selection predicates on multiple elements related by a tree structure. These are abstracted by twig patterns. Finding all occurrences of such a twig pattern in an XML database is a core operation for XML query processing.

Lately, large amounts of data are modeled and stored as graphs in order to express complex data relationships and for allowing machine learning methods to easily use the stored information. As a result, query processing on graph structures is becoming an important component in real-world applications. The most commonly used query format is that of tree pattern queries.

Due to the continuous growth of data, the efficiency of database operations becomes of paramount importance. Parallelism is the approach most commonly used to improve efficiency. Parallelism may be realized by using Multi-core-based processing. GPGPU (General Purpose Graphics Processing Unit) computing is an innovative approach that may be applied as well for realizing massive Single Instruction Multi-Data (SIMD) parallelism.

The subject of this research is parallelism in querying and storage for large native XML databases and Graph databases using different parallel platforms (multi-core CPUs and GPUs). In this research I have designed and implemented four different algorithms.

The first two algorithms are Parallel Path Stack algorithm (PPS) and Parallel Twig Stack algorithm (PTS). PPS and PTS are efficient algorithms for matching XML query twig patterns in a parallel multi-threaded computing platform. These algorithms employ a sophisticated search technique for limiting stream processing to specific subtrees. I have also designed and implemented a novel parallel scheme for running PPS and PTS on Non-Main-Memory-Resident (NMMR) documents.

The third algorithm is GPU-Twig, for matching twig patterns in large XML documents, using a GPU. GPU-Twig uses the data and task parallelism of the GPU to perform memory-intensive tasks whereas the CPU is used to perform I/O and resource management.


The fourth algorithm is GGQ (GPU Graph Data Base Query), for processing tree pattern queries on graph databases, using a GPU.  GGQ is novel in that processors identifiers (IDs) are used to determine the choices made in matching the tree pattern to actual database graph nodes and edges. As the space of possibilities that can be represented by an ID is limited, methods are presented to practically increase this space.

I conducted extensive experimentation with all these algorithms using known and especially constructed benchmarks. I compared PPS and PTS to the standard (sequential) PathStack and Twig Stack algorithms in terms of runtime (to completion). I present the results of an extensive experimentation of the GPU-Twig algorithm on large XML documents using the DBLP and XMark benchmarks. I experimented with the GGQ algorithm on different graph databases. The experiments stress-test the cases in which the ID space is limited. Experimental results indicate that all the above algorithms reduce the running time of queries in comparison to other relevant algorithms under various settings and scenarios.