Automatic editing

Up to this point, the mira assembler has placed all reads into contigs, forming singlets for reads that could not be assembled anywhere. Because the aligned sequences still contain discrepancies, further methods are needed to deal with base call errors that may be present in the reads and that can cause discrepancies or misassemblies in the assembly. This is done entirely by an incorporated version of the automatic editor developed by Pfisterer and Wetter (1999).

Previous solutions like the one presented by Xu et al. (1995) used only a dynamic programming algorithm and a majority vote to adjudicate conflicting base positions. The main advantage of using an automatic editor is that its decisions are based on the original trace signals and not on a majority assumption. Different assumptions (hypotheses) about what could have gone wrong during the sequencing or base-calling process are established and inspected.

Although the exact methods and algorithms of the editor are not the subject of this thesis, a short summary of the strategy used is nevertheless included here to give an overview of the operations performed:

The editor steps through the contigs column by column and searches for discrepancies between the bases of a column. Once a discrepancy is found, the editor builds an enlarged error region in which it tests different base-calling error hypotheses for the reads present at this position. Enlarging the error region is necessary because clustered errors, through an awkward multiple alignment, tend to obscure the true nature of the errors occurring in different reads. The most probable complex error hypotheses are then split into atomic fault hypotheses (AFH), each AFH describing one insertion, deletion or base change operation needed in one read to correct the whole error region. Each AFH is tested by applying up to 30 different quality measures - depending on the type of atomic error hypothesis - directly to the underlying trace signal of a read.
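The first two steps described above, finding discrepant columns and enlarging them into error regions, can be illustrated with a minimal Python sketch. It is not the editor's actual implementation: the function names, the '.' marker for uncovered positions, the '*' gap character and the fixed flank width are all assumptions made for this illustration, and the criteria the real editor uses to enlarge a region are not described here.

```python
from typing import List, Tuple

def discrepancy_columns(alignment: List[str]) -> List[int]:
    """Return the indices of columns in which the covering reads disagree.

    Assumed conventions (hypothetical): all rows have equal length,
    '.' marks positions a read does not cover, '*' is a gap character.
    """
    width = len(alignment[0])
    columns = []
    for i in range(width):
        bases = {row[i] for row in alignment if row[i] != '.'}
        if len(bases) > 1:
            columns.append(i)
    return columns

def error_regions(columns: List[int], width: int,
                  flank: int = 2) -> List[Tuple[int, int]]:
    """Enlarge each discrepant column by `flank` columns on both sides
    and merge regions that overlap or touch.

    The fixed flank is an assumption of this sketch, chosen only to show
    how clustered discrepancies end up in one common region.
    """
    regions: List[Tuple[int, int]] = []
    for c in sorted(columns):
        start = max(0, c - flank)
        end = min(width - 1, c + flank)
        if regions and start <= regions[-1][1] + 1:
            # Overlaps or is adjacent to the previous region: merge.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions

# Toy multiple alignment of four reads (made-up data).
alignment = [
    "ACGTTACG*TAGGCTA",
    "ACGTTACGGTAGGCTA",
    "ACGTAACG*TAGGCTA",
    "....TACG*TAGCCTA",
]
columns = discrepancy_columns(alignment)
regions = error_regions(columns, len(alignment[0]), flank=2)
```

With a flank of 2, the three discrepant columns of the toy alignment fall into a single region, so hypotheses for all of them would be examined together; with a flank of 1 they would be treated as three separate regions.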

The signal quality measures analysed can be roughly classified into four categories: peak shapes, peak positions, peak distances and peak intensities. A neural network then extracts the relevant decision information from the calculated measures and decides whether the atomic fault hypothesis can be confirmed by the trace signals.

Although error hypotheses in regions involving complex base shuffling are less likely to be confirmed, they will still be applied as edits if they are the only possible solution that is supported by the trace signals. Using this method gives the assembler maximum assurance that editing decisions do not contradict the underlying trace signals. The editor can therefore - in contrast to simpler programs like ReAligner presented by Anson and Myers (1997) - also be seen as an improved method of realigning sequences to improve consensus quality.

Bastien Chevreux 2006-05-11