Aim of the thesis

The essential criterion in the design of an assembler is the quality of the final result. Although different groups may have different - sometimes arbitrary - sets of acceptance criteria for quality aspects of assemblies, the target accuracy of 1 error in 10,000 finished bases in DNA sequence set by the Human Genome Project can be seen as most demanding, especially when assembling sequences from higher eukaryotic organisms. An assembly system has to support this target by making cautious use of the available data and increase automation by reducing human involvement in correcting assembly errors. This has to happen through ensuring a firm base and good building blocks in the beginning of an assembly.

The aim of this thesis is therefore centred at reducing assembly errors caused by repetitive sequences as well as increasing the reliability of consensus sequences derived from automatically assembled projects.

To achieve this goal, the first key point of the approach developed in this thesis is the usage of all additionally gathered information like, e.g., repeat tags, template insert sizes, quality values, original signal traces, etc. as checking mechanisms. The information is used to confirm and optimise the basic assembly and to identify possibilities to discern between different occurrences of repetitive sequences. The second key point in the algorithms designed is the combination of the assembler with the capabilities of an automatic editor. Both the assembler and the automatic editor are separate programs and can run separately, but the task of assembly and finishing is so closely related that both parts can include routines from each other (see also Chevreux et al. (2000); Pfisterer and Wetter (1999)).

In this combined information analysis and integration process, the signal-analysis aided assembler promises two substantial advantages compared to a sequential-base-caller-and-assembler strategy:

The assembler gains the ability to perform signal analysis on partly assembled data. Analysing experimental signal data at precise points with a given hypothesis 'in mind' helps to discern possible base caller errors from errors due to misassemblies, especially in problematic regions like repeats where simple base probabilities alone could not help.
During the assembly process, sequences in preliminary builds can be automatically edited to increase their quality if the signal data supports the hypothesis of an error that occurred in the base calling step. The edited sequences in turn can be used to increase assembly quality in the ongoing assembly process.

As the underlying problem of string assembly has been shown to be NP-hard by Armen and Stein (1995), heuristic algorithms represent the only possibility to solve the challenges induced by assembly projects. This thesis presents the developed strategy of using all available assembly data together with the major assets of signal analysis and automatic editing as a tremendously useful and versatile policy for improving existing heuristic algorithms.

In chapter 2, the present thesis first presents and formally defines the problems posed by sequencing and describes the reasons for the complexity of this task. Chapter 3 presents the theory developed and the algorithms implemented for a new type of assembler that combines - and substantially extends - the strengths of existing assembly approaches while mimicking and automating some assembly analysis strategies used by human experts. Methods and algorithms both used in bioinformatics and developed for this thesis are described in chapter 4, together with their advantages and disadvantages for different applications in sequence assembly. The focus is to show both theoretically and practically how relatively simple, yet well behaved and highly effective algorithms can be used to build an assembler that handles large and complex real-world shotgun projects. Chapter 5 presents the results obtained with the assembler developed in this thesis in comparison with the two most widely used genome assembly systems at the time of the study. Furthermore, results for assembly of EST projects are presented. Conclusions of this work are presented in the last chapter (6) together with some remarks on additional work that could further improve the assembly system. Appendix A contains some thoughts and insights on implementational issues gained through this research project. Appendix B shows the documentation for the 2.2.8 version as of October 2004 for the mira³ and miraEST assembler which has an accurate description on usage modes. The last but nevertheless important appendix 6 contains acknowledgements to people who helped this work become a success.

In closing this introductionary chapter, it must be noted that the work presented within this thesis focuses on the assembler part of the assembler $\leftrightarrow$ automatic editor combination. Conditions an automatic editor system has to comply with when providing assistance to an assembler will also be explained although detailed descriptions of the automatic editor algorithms are not subject of this work.⁴

Bastien Chevreux 2006-05-11