Arratia, Richard, Bollobás, Béla, Coppersmith, Don and Sorkin, Gregory B.
(2000)
*Euler circuits and DNA sequencing by hybridization.*
Discrete Applied Mathematics, 104 (1-3).
pp. 63-96.
ISSN 0166-218X

## Abstract

Sequencing by hybridization is a method of reconstructing a long DNA string — that is, figuring out its nucleotide sequence — from knowledge of its short substrings. Unique reconstruction is not always possible, and the goal of this paper is to study the number of reconstructions of a random string. For a given string, the number of reconstructions is determined by the pattern of repeated substrings; in an appropriate limit substrings will occur at most twice, so the pattern of repeats is given by a pairing: a string of length 2n in which each symbol occurs twice. A pairing induces a 2-in, 2-out graph, whose directed edges are defined by successive symbols of the pairing — for example the pairing ABBCAC induces the graph with edges AB, BB, BC, and so forth — and the number of reconstructions is simply the number of Euler circuits in this 2-in, 2-out graph. The original problem is thus transformed into one about pairings: to find the number fk(n) of n-symbol pairings having k Euler circuits. We show how to compute this function, in closed form, for any fixed k, and we present the functions explicitly for k=1,…,9. The key is a decomposition theorem: the Euler “circuit number” of a pairing is the product of the circuit numbers of “component” sub-pairings. These components come from connected components of the “interlace graph”, which has the pairing's symbols as vertices, and edges when symbols are “interlaced”. (A and B are interlaced if the pairing has the form ABAB or BABA.) We carry these results back to the original question about DNA strings, and provide a total variation distance upper bound for the approximation error. We perform an asymptotic enumeration of 2-in, 2-out digraphs to show that, for a typical random n-pairing, the number of Euler circuits is of order no smaller than 2n/n, and the expected number is asymptotically at least e−1/22n−1/n. Since any n-pairing has at most 2n−1 Euler circuits, this pinpoints the exponential growth rate.

Item Type: | Article |
---|---|

Official URL: | http://www.elsevier.com/wps/find/journaldescriptio... |

Additional Information: | © 2000 Elsevier Science B.V. |

Divisions: | Management |

Subjects: | Q Science > QA Mathematics |

Sets: | Research centres and groups > Management Science Group Departments > Management |

Date Deposited: | 13 Apr 2011 14:29 |

Last Modified: | 20 Feb 2021 01:14 |

URI: | http://eprints.lse.ac.uk/id/eprint/35505 |

### Actions (login required)

View Item |