M.Sc Thesis

M.Sc StudentShinkar Tal
SubjectClustering-Correcting Codes for DNA-based Storage Systems
DepartmentDepartment of Computer Science
Supervisor ASSOCIATE PROF. Eitan Yaakobi
Full Thesis textFull thesis text - English Version


A new family of codes, called clustering-correcting codes, is presented in this thesis. This family of codes is motivated by the special structure of the data that is stored in DNA-based storage systems. The data stored in these systems has the form of unordered sequences, also called strands, and every strand is synthesized thousands to millions of times, where some of these copies are read back during sequencing.

In the context of storing data, we refer to a DNA strand as a linear sequence over the alphabet {A, C, G, T}. In 2012 and 2013 two pioneering DNA based storage systems were implemented by Church et al. and Goldman et al., respectively. Those systems paved the way for multiple studies in which synthetic DNA molecules are used as a volume for storing data. The different systems share similar architecture principles. The writing channel is a chemical process named synthesis that generates artificial DNA strands. The reading is done using a technology called DNA sequencing. The length of the strands is limited to few hundreds symbols hence the DNA pool consists of multiple strands in order to store reasonable amount of data.

Due to the unordered structure of the strands, an important task in the decoding process is to place them in their correct order. This is usually accomplished by allocating part of the strand for an index. However, in the presence of errors in the index field, important information on the order of the strands may be lost. Clustering-correcting codes ensure that if the distance between the index fields of two strands is small, their data fields have large distance. It is shown how this property enables to place the strands together in their correct clusters even in the presence of errors.

Clustering-correcting codes are parametric in τi, τd. Those two values are chosen in respect to a DNA-storage system that defines the maximum number of errors that can occur in the index field and in the data field. Those values are denoted by ti, td, respectively.  We show that for τd > 4td and a suitable choice of τi clustering-correcting capabilities can be achieved. This is due to the fact that for two strands read with the same index (indi, uk), (indi, uj), it holds that dH(uk, uj) ≤ td td = 2td, where dH(x, y) is the Hamming distance between two vectors x and y While on the other hand for two different (indi, ui), (indj, uj) strands that satisfy the constraint, hence dH(indi, indj) ≤ τi, it holds that dH(ui, uj) ≥ τd - 2td > 2td.

This property allows to differentiate the strands that belong in the same cluster and those that do not. In some cases, it is also possible to move the mis-clustered strands to their correct clusters.  We present lower and upper bounds on the size of clustering-correcting codes and an explicit construction of these codes which uses only a single symbol of redundancy.  The results are first presented for the Hamming metric and are then extended for the edit distance.