CSoI Presents Keynote I Research Workshop for Students and Postdocs
"Optimal Whole Genome Shotgun Assembly: From Simple Models to Complex Data"
David Tse, UC Berkeley
DNA sequencing is the basic workhorse of modern-day biology and medicine. Shotgun sequencing is the dominant technique: many short fragments, called reads, are extracted from random locations in the DNA sequence, and these reads are then assembled to reconstruct the original sequence. Two basic questions arise. What are the minimum read length and the minimum number of reads required for reliable reconstruction? And what is an optimal assembly algorithm that achieves these minima? In earlier work, we provided a complete solution to these questions for DNA sequences modeled as asymptotically long i.i.d. strings. But real DNA sequences have much more complex repeat statistics. Rather than attempting to model the DNA data more accurately, we instead derive upper and lower performance bounds directly in terms of repeat statistics that can be measured from DNA data, and design an assembly algorithm that performs close to the lower bound for a wide range of DNA datasets. These results form the basis of a systematic, data-driven approach to designing optimal assembly algorithms.
This is joint work with Guy Bresler and Ma'ayan Bresler.
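To make the setup in the abstract concrete, the following is a minimal Python sketch of the shotgun sampling model and of one simple repeat statistic that can be measured from a sequence. It is an illustration under stated assumptions, not the assembly algorithm or the bounds from the talk: the function names (sample_reads, is_covered, longest_repeat_length), the uniform choice of read start positions on a linear sequence, and the choice of "longest repeated substring" as the statistic are all hypothetical choices made for this example.

```python
import random


def sample_reads(genome, read_length, num_reads, rng):
    """Draw fixed-length reads at uniformly random start positions.

    Assumption: the genome is linear (not circular) and reads are error-free.
    Returns a list of (start, read) pairs; a real assembler would not know
    the start positions, they are kept here only to check coverage.
    """
    last_start = len(genome) - read_length
    starts = [rng.randint(0, last_start) for _ in range(num_reads)]
    return [(s, genome[s:s + read_length]) for s in starts]


def is_covered(genome_length, reads):
    """Return True if every position lies inside at least one read."""
    covered = [False] * genome_length
    for start, read in reads:
        for i in range(start, start + len(read)):
            covered[i] = True
    return all(covered)


def longest_repeat_length(sequence):
    """Length of the longest substring that occurs at least twice.

    Occurrences may overlap. Brute force: grow the candidate length k
    one step at a time until no substring of length k repeats.
    """
    length = 0
    while length < len(sequence):
        k = length + 1
        seen = set()
        found = False
        for i in range(len(sequence) - k + 1):
            sub = sequence[i:i + k]
            if sub in seen:
                found = True
                break
            seen.add(sub)
        if not found:
            break
        length = k
    return length


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    genome = "".join(rng.choice("ACGT") for _ in range(2000))
    reads = sample_reads(genome, read_length=50, num_reads=400, rng=rng)
    print("covered:", is_covered(len(genome), reads))
    print("longest repeat:", longest_repeat_length(genome))
```

Varying read_length and num_reads in this sketch shows how coverage depends on both parameters, while longest_repeat_length depends only on the sequence itself, which is the sense in which a repeat statistic can be measured directly from data.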