State of the art assembly

The sheer amount of data collected implies that the assembly itself must be done by computer-based methods. The above mentioned error rates and repetitive properties of DNA require the algorithms to tolerate faults and seek alternatives in a given solution space. Wang and Jiang (1994) showed that the assembly problem - even using error free representations of the true sequence - is NP complete. This means that the volume of data can only be assembled by approximating strategies, relying on heuristic algorithms that are well-behaved in both time and space complexity.

The evolution of assembly strategies clearly moved away from simply using base sequences toward approaches using additional information like, e.g., coverage analysis, sequence orientation, quality and probability values and template identity. A common characteristic to all existing assemblers is that they rely on the quality values with which the bases were previously attributed. Within this process which is named ``base-calling'', an error probability is computed by the base caller program to express the confidence with which the called base is thought to be the true base. The positive aspect is the possibility for assemblers to decide in favour of the best, most probable bases when a discrepancy occurs. The negative aspects of the most widely used current base callers are 1) the fact that base probability values sometimes cannot be computed accurately2enough and 2) their inability to write confidence values for optional, uncalled bases at the same place, which would help assembly algorithms to compute better alternative solutions.

The common method for assembling DNA and RNA therefore consists of using fault-tolerant algorithms that produce a basic sequence which then has to be reviewed and manually corrected by human experts, the so called ``finishing''. Incorrectly assembled sequences must be dismantled and reassembled at different places. For large scale sequencing projects, this slow and very inefficient method contradicts the principle of parsimony and represents the most important bottle-neck and cause of errors, even if powerful finishing tools are now available (see Staden et al. (1997); Gordon et al. (1998); Lawrence et al. (1994)). But errors still happen much too frequently and especially sequencing projects of higher organisms are affected by problems as their genomes consistently contain more and longer repetitive regions that create new levels of complexity in the assembly process. Recent studies from Cheung et al. (2003) confirmed this by an in-depth analysis of the human genome sequence assembly: as of June 2002, ``1.28% of the sequences in the 3,043.1 megabases of the genome are likely to be involved in sequence misalignment errors''.

Bastien Chevreux 2006-05-11