טכניון מכון טכנולוגי לישראל
הטכניון מכון טכנולוגי לישראל - בית הספר ללימודי מוסמכים  
M.Sc Thesis
M.Sc StudentBen Moshe Sagi
SubjectUsing Property Testing for Efficient Detection of
Nearly-Sorted relations
DepartmentDepartment of Computer Science
Supervisor Professor Eldar Fischer
Full Thesis textFull thesis text - English Version


Abstract

Many relational operations are best performed when the relations are stored sorted over the relevant attributes (e.g. the common attributes in a natural join operation). However, generally relations are not stored sorted because it is expensive to maintain them this way (and impossible whenever there is more than one relevant sort key). Still, many times relations turn out to be nearly-sorted, where most tuples are close to their place in the order.
This state can result from "leftover sortedness", where originally sortedrelations were updated, or were combined into interim results when evaluating a complex query. It can also result from weak correlations between attribute values.
Currently, nearly-sorted relations are treated the same as unsorted relations, and when relational operations are evaluated for them, a generic algorithm is used. Yet, many operations can be computed more efficiently by an algorithm that exploits this near-ordering.
However, to consistently benefit from using such algorithms the system should also refrain from using the wrong algorithm for relations which happen not to be sorted at all. Thus, an efficient test is required, i.e., a very fast approximation algorithm for establishing whether a given relation is sufficiently nearly-sorted.
In this paper, we provide the theoretical foundations for improving query evaluation over possibly nearly-sorted relations.
First we formally define what it means for a relation to be nearly-sorted, and show how operations over such relations, such as natural join, set operations and sorting, can be executed significantly more efficiently using an algorithm that we provide. If a relation is nearly-sorted enough, then it can be sorted using two sequential reads of the relation, and writing no intermediate data to disk. We then construct efficient probabilistic tests for approximating the degree of the near-sortedness of a relation without having to read an entire file. The role of our algorithms in a database management system setting is illustrated as soon as the theoretical foundation is laid out.
Finally, we show how our approach can also benefit distributed systems and systems that use a solid-state drive.