M.Sc Thesis

M.Sc StudentSabary Omer
SubjectReconstruction Algorithms for DNA-Based Storage Systems
DepartmentDepartment of Computer Science
Supervisor ASSOCIATE PROF. Eitan Yaakobi
Full Thesis textFull thesis text - English Version


In the trace reconstruction problem a length-n string x, yields a collection of noisy copies, called traces, y1,?, yt where each yi is independently obtained from x by passing through a deletion channel, which deletes every symbol with some fixed probability pd. Suppose the input string x is arbitrary. In the trace reconstruction problem, the main goal is to determine the required minimum number of i.i.d traces in order to reconstruct x with high probability. The trace reconstruction problem can be extended to the model where each trace is a result of x passing through a deletion-insertion-substitution channel. Here, in addition to deletions, each symbol can be switched with some substitution probability ps, and for each j, with probability pi, a symbol is inserted before the j-th symbol of x.

Motivated by the storage channel of DNA, this work is focused on another variation of the trace reconstruction problem, which is referred by the DNA reconstruction problem. The setup is similar to the trace reconstruction problem. A length-n string x is transmitted t times over the deletion-insertion-substitution channel and generates t traces y1,?, yt. A DNA reconstruction algorithm is a mapping R: (åq*)tq* which receives the t traces y1,?, yt as an input and produces x^, an estimation of x. The goal in the DNA reconstruction problem is to minimize de(x, x^), i.e., the edit distance between the original string and the algorithm's estimation.

This work consists of three parts. The first part presents a software tool to support quality control of synthetic DNA libraries. This part includes a comprehensive study and characterization of the errors in the DNA storage channel.

The second part studies the maximum likelihood (ML) decoder for two deletion channels. In this part, we calculate the error and the failure probability of the ML decoder, and also simulate it to verify our results.

         In the third part we present several new algorithms for the DNA reconstruction problem and for the deletion DNA reconstruction problem. While most of the previous algorithms look locally on each symbol and use a symbol-wise majority technique, our algorithms look globally on the entire sequence of the traces and use dynamic programming algorithms to estimate the original sequence. This global approach builds upon the algorithms to find the shortest common supersequence and the longest common subsequence of a set of given sequences. Our algorithms also differ from previous results by introducing less limitations on the input, and more than that, they perform well even for error probabilities as high as 0.27.

Lastly, we tested our algorithms on simulated data and on data from previous DNA experiments. We compared the edit error rates of our algorithms with the edit error rates of previous algorithms. In all of the tests our algorithms presented significant improvements of the edit error rates.