M.Sc Thesis | |

M.Sc Student | Ben Moshe Sagi |
---|---|

Subject | Using Property Testing for Efficient Detection of Nearly-Sorted relations |

Department | Department of Computer Science |

Supervisor | ASSOCIATE PROF. Eldar Fischer |

Full Thesis text |

Many relational
operations are best performed when the relations are stored sorted over the
relevant attributes (e.g. the common attributes in a natural join operation). However,
generally relations are not stored sorted because it is expensive to maintain
them this way (and impossible whenever there is more than one relevant sort
key). Still, many times
relations turn out to be nearly-sorted, where most tuples are close to their
place in the order.

This state can
result from "leftover sortedness", where originally sortedrelations
were updated, or were combined into interim results when evaluating a complex query.
It can also
result from weak correlations between attribute values.

Currently, nearly-sorted
relations are treated the same as unsorted relations, and when relational operations
are evaluated for them, a generic algorithm is used. Yet, many
operations can be computed more efficiently by an algorithm that exploits this
near-ordering.

However, to
consistently benefit from using such algorithms the system should also refrain
from using the wrong algorithm for relations which happen not to be sorted at
all. Thus, an
efficient test is required, i.e., a very fast approximation algorithm for
establishing whether a given relation is sufficiently nearly-sorted.

In this paper,
we provide the theoretical foundations for improving query evaluation over possibly
nearly-sorted relations.

First we
formally define what it means for a relation to be nearly-sorted, and show how
operations over such relations, such as natural join, set operations and
sorting, can be executed significantly more efficiently using an algorithm that
we provide. If a relation is
nearly-sorted enough, then it can be sorted using two sequential reads of the
relation, and writing no intermediate data to disk. We then construct
efficient probabilistic tests for approximating the degree of the
near-sortedness of a relation without having to read an entire file. The role of our
algorithms in a database management system setting is illustrated as soon as
the theoretical foundation is laid out.

Finally, we show
how our approach can also benefit distributed systems and systems that use a
solid-state drive.