- Home
- Documents
*DNA Sequence Assembly and Multiple Sequence Alignment 1 DNA Sequence Assembly Assembly of short DNA*

prev

next

out of 34

View

0Download

0

Embed Size (px)

DNA Sequence Assembly and Multiple Sequence Alignment by an Eulerian Path

Approach

Yu Zhang∗

Department of Mathematics University of Southern California

Los Angeles, CA 90089-1113 Phone: 213-821-2231

yuzhang@usc.edu

Michael S.Waterman

Department of Biological Sciences University of Southern California

Los Angeles, CA 90089-1340 Phone: 213-740-2409 Fax: 213-740-2437

msw@usc.edu

22 May 2003

1

Running Head: Eulerian Assembly and Multiple Alignment

Corresponding Author:

Yu Zhang

Mailing Address:

USC, Department of Mathematics 1042 W. 36th Place, DRB289 Los Angeles, CA 90089-1113

Phone: 213-821-2231

Fax: 213-740-2437

E-mail: yuzhang@usc.edu

1

Abstract

We describe an Eulerian path approach to the DNA fragment as-

sembly that was originated by Idury and Waterman 1995, and then

advanced by Pevzner et al. 2001b. This combinatorial approach by-

passes the traditional “overlap-layout-consensus” approach and suc-

cessfully resolved some of the troublesome repeats in practical assem-

bly projects. The assembly results by the Eulerian path approach are

accurate, and its computation is significantly more efficient than other

assembly programs.

As an extension, we use the Eulerian path idea to address the multiple

sequence alignment problem. In particular, we have as a goal align-

ing thousands of sequences simultaneously, which is computationally

exorbitant for all existing alignment algorithms. As a beginning, we

focus on DNA sequence alignment. Our method can align hundreds

of DNA sequences within minutes with high accuracy, and its compu-

tational time is linear to the number of sequences. We demonstrate

its performance by alignments of simulated sequences and by an ap-

plication in a resequencing project of Arabidopsis thaliana.

Although having some weaknesses including aligning gap-rich regions,

the Eulerian path approach is distinguished from other existing al-

gorithms in solving either fragment assembly or multiple alignment

2

problems, and hence provides a new perspective in solving these prob-

lems.

1 DNA Sequence Assembly

Assembly of short DNA fragments (500-1000bp) generated by shotgun

sequencing is a widely used technique for sequencing large genomes,

including the human genome. The most popular framework of DNA

fragment assembly algorithms in the past 25 years is the “overlap-

layout-consensus” approach. All high quality DNA fragments are first

compared to each other for possible overlaps; then a layout is made by

arranging all DNA fragments into relative positions and orientations

according to the overlap information; finally a multiple alignment is

computed to obtain a consensus sequence that will be used as the

genomic sequence. The main difficulty with this framework in addi-

tion to the computation required, comes from the fact that genomic

sequences always contain large amount of repeat regions accumulated

along their evolutionary history. In particular, repeats that are longer

than the fragment length and have > 98% identity are hard to dis-

tinguish from true overlaps, and hence finding a correct path in the

layout step is difficult.

Surprisingly, a 15-year-old computational model from DNA arrays

provides the basis for a novel approach to assembly. Sequencing by

3

Hybridization (SBH) is conceptually analogous to DNA fragment as-

sembly by regarding each DNA fragment as a k-tuple. Idury and

Waterman (Idury and Waterman 1995) mimicked SBH procedure by

breaking each DNA fragment of length n into n-k+1 overlapping k-

tuples and hybridized all k-tuples in silicon such that a DNA fragment

assembly problem is mapped into a SBH problem. A de Bruijn Graph

is constructed in their SBH approach where each edge represents a k-

tuple from fragments; two edges share a common vertex if they share a

common (k-1)-tuple; identical k-tuples share the same edge. Repeats

and sequencing errors make the simple de Bruijn graph extremely

tangled and the solution of DNA fragment assembly from such a com-

plex graph remains challenging. Pevzner et al. took the SBH idea

and provided some additional features (Pevzner et al. 2001b). In-

stead of building an overlap graph in the traditional “overlap-layout-

consensus” framework, the Eulerian path approach builds a de Bruijn

graph that represent all fragments and their relationships in a much

simpler way. In addition, the difficulty resulting from the repeats and

sequencing errors often are much easier to conquer in the de Bruijn

graph structure.

Assume a DNA sequence consists of four unique subsequences that

are separated by one triple repeat, as shown in figure 1a. The tradi-

tional “overlap-layout-consensus” approach builds an overlap graph by

4

regarding every fragment as a vertex, and two vertices are connected

if two corresponding fragments overlap. Figure 1b shows the overlap

graph in our example. The assembly problem is thus finding a path

in the overlap graph that visits every vertex exactly once, a Hamilton

path problem that is well known NP-complete. On the contrary, the

Eulerian path approach breaks all fragments into k-tuples and builds

a de Bruijn graph as described above (figure 1c). A conceptually ideal

de Bruijn graph is shown in figure 1d, a much simpler representation

of repeats than the overlap graph. The most important advantage of

this representation is that the assembly problem is now finding a path

that visits every edge exactly once, an Eulerian path problem that has

linear-time solutions (Fleischner 1990).

The implementation of the Eulerian path approach to DNA fragment

assembly problems is called EULER. Traditional algorithms postpone

the error correction until the last consensus step, while EULER applies

error correction at the beginning of assembly (this is the innovation

of Pevzner et al.). Without knowing the finished genomic sequence

or even the layout of fragments, error correction is still possible by

approximating the k-tuple spectrum and the result is usually accurate

enough. After error correction, EULER constructs a de Bruijn graph,

and stores the fragment information in corresponding edges. This

fragment information is the fundamental difference between EULER

5

and SBH where such information is unavailable. EULER’s superpath

idea can successfully solve many repeats by using fragment informa-

tion. Finally, EULER outputs an Eulerian path from the de Bruijn

graph that represents the finished genomic sequence.

1.1 Error Correction

Sequencing errors make the de Bruijn graph a tangle of erroneous

edges, thus very difficult to solve. EULER’s error correction proce-

dure reduces 97% errors from the original DNA fragments and makes

the data almost error-free. The idea is based on an approximation

to the spectrum of real genomic sequence, i.e. to find a collection

of k-tuples all of which are from the genomic sequence instead of se-

quencing errors. With sequencing errors, many k-tuples are erroneous

and many “true” k-tuples are missing. EULER calls a k-tuple solid if

it appears in more than M fragments, otherwise it is a weak k-tuple.

The error correction problem is thus to transform the spectrum of the

original DNA fragments into the spectrum of a genomic sequence by

changing weak k-tuples to solid k-tuples. Without knowing the real

genomic sequence, one natural criteria of error correction is to mini-

mize the total number of distinct k-tuples in the spectrum. One error

in a fragment will create at most 2k (including the reverse complement

part) erroneous k-tuples in the spectrum, or 2d (d

affected. EULER uses a greedy approach to look for error corrections

that reduce the number of weak k-tuples by 2k or 2d.

The Neisseria meningitidis (NM) sequencing project (Parkhill 2000),

one of the most “difficult-to-assemble” and “repeat-rich” bacterial

genome completed so far, was used to demonstrate the efficiency of

EULER’s error correction method. The NM genome contains 2,184,406

nucleotides, with 126 long nearly perfect repeats up to 3,832 bp in

length. The sequencing project resulted in 53,263 fragments (cover-

age 9.7), with 255,631 sequencing errors in total. EULER corrected

97.7% errors and made the original sequencing data almost error-free

with 0.11 errors per fragment (Pevzner et al. 2001a).

1.2 Construction of De Bruijn Graph

EULER constructs a de Bruijn graph from the corrected DNA frag-

ments in the following way: given a set of DNA fragments, EULER

breaks each fragment and its reverse complement into overlapping k-

tuples; each k-tuple represents a directed edge in the graph and the

direction of an edge is the direction of reading a k-tuple; tuple posi-

tions and fragment indices are stored in the corresponding edg