
Finding previously unknown repeats

Correct handling of repeats is among the most difficult problems an assembler has to solve. This section gives a short introduction to the different types of repeats and the current methods to find and adjudicate them, and presents the method used in this assembler.

Repeat types

Repeats can be classified in the following categories:
  1. Simple multiple bases, where the repeat consists of a specific base present k times. Example: AAAAAAAA... Because of their usually short nature (the repeat will be far shorter than the average length of a read), simple multiple base repeats are not really problematic for an assembler unless they are very frequent. For longer stretches, the reported number of bases can be off by one or two. This is due to the difficulty any base-caller has in correctly separating bases in trace signal stretches containing more than 5 to 6 identical bases when the trace quality is sub-optimal.
  2. Micro-satellites, where the repeat consists of a small number of bases present k times and where some copies might contain point mutations. Example: multicopy CG in CGCGCGAGCGCG... or multicopy CAG in the sequence CAGCAGCTGCAG... Micro-satellites, too, are far shorter than an average read and therefore mostly unproblematic to assemble.
  3. Short repeats, where copies of a medium number of bases are separated by non-repetitive subsequences. The copies show an identity between 70% and 100%. In the human genome, for example, ALU repeats are very common. Although short repeats are mostly shorter than the average read length, the sheer number of occurrences and the sometimes considerable identity can occasionally lead to misassemblies.
  4. Long repeats, which are subsequences spanning up to several kilobases, where the repeat is present in at least two locations and the identity ranges from around 50% to 100%. Long repeats are the most difficult cases to assemble, especially if the identity exceeds 90% to 95% and the repeat itself is longer than the average read length.
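To make the first two categories concrete, the following sketch shows how type 1 and type 2 repeats could be located in a raw sequence. The function names, thresholds and the restriction to perfect tandem copies are illustrative assumptions, not part of the assembler described here.

```python
import re

def find_simple_repeats(seq, min_run=5):
    """Type 1: runs of a single base (homopolymers) of at least min_run bases."""
    pattern = r"(.)\1{%d,}" % (min_run - 1)
    return [(m.start(), m.group()) for m in re.finditer(pattern, seq)]

def find_microsatellites(seq, unit_len=2, min_copies=3):
    """Type 2: short units repeated in tandem (perfect copies only, for brevity)."""
    hits = []
    i = 0
    while i + unit_len * min_copies <= len(seq):
        unit = seq[i:i + unit_len]
        copies = 1
        # extend the tandem run as long as the next unit matches exactly
        while seq[i + copies * unit_len:i + (copies + 1) * unit_len] == unit:
            copies += 1
        if copies >= min_copies and len(set(unit)) > 1:
            hits.append((i, unit, copies))
            i += copies * unit_len
        else:
            i += 1
    return hits
```

As the text notes, neither routine is critical for assembly correctness; such repeats are short enough to be anchored by their non-repetitive neighbourhood.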

Repeats of types 1 and 2 need no special handling routines, as they are enclosed by (mostly) non-repetitive subsequences which ensure the correct placement of the read within an assembly. Repeats of type 3 (standard short repeats) are sometimes harder to place as they are generally longer than repeats of types 1 and 2. But they have the considerable advantage that standard repeats are well known sequences, documented throughout literature and databases. Consequently, they can be searched for and tagged in the single reads before the assembly process takes place, giving the assembler the possibility to use the additional information gained during preprocessing.

From an assembler's point of view, the most annoying repeats are those of type 4. Segmental duplications - as an example for this type - are a special case of extremely large repeats, sometimes spanning several tens or even hundreds of kilobases. They play a fundamental role both in genomic diseases and gene evolution. Mutation and natural selection of duplicate copies of genes can diversify protein function, which explains why they are now seen as one of the primary forces in evolutionary change (Eichler (2001)). Bailey et al. (2001) note that they typically range in size between 1 and 200 kilobases and often contain special sequence features such as high-copy repeats and gene sequences with intron-exon structure. Another interesting - but from the viewpoint of an assembler rather annoying - recent discovery is the fact that, citing Delcher et al. (2002), ``chromosome-scale inversions are a common evolutionary phenomenon in bacteria'' and that some plants like Arabidopsis thaliana contain large scale duplications on the chromosome level. On a similar level of annoyance is the fact that in grass genomes like rice, ``most of the repeats are attributable to nested retrotransposons in the intergenic regions between the genes'' (Wang et al. (2002)). Eichler (2001) observes that ``exceptional duplicated regions underlie exceptional biology''. For algorithms trying to resolve the assembly problem, they induce difficulties for in-silico computation and result in underrepresentation and misassembly of duplicated sequences in assembled genomes.

Existing approaches

The most difficult task for an assembler consists in finding long repeats in an assembly and preventing reads from being assembled at wrong locations within a contig. Suggestions to surmount this problem are sparse in the literature and can be classified into two main approaches.

The first approach consists of relying on base probabilities only and preventing the alignment of reads that show too many discrepancies in high probability areas. This method is quick and its sensitivity can be easily adjusted. The advantages, however, are outweighed by the disadvantages this method inherently has: the assembler must rely solely on the ability of the base-caller's analysis algorithms to correctly adjudicate each base from the trace signal alone. As good as current base-callers are nowadays, this cannot be guaranteed. Errors happen in the base-calling process, and if the sensitivity of the assembler is set too high, the specificity of the repeat misassembly prevention mechanism decreases sharply: many non-repetitive reads will not align because their errors reach the repeat recognition threshold. Reads will thus not align although they might otherwise match perfectly, which in turn handicaps the assembler when trying to build long contigs.
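A minimal sketch of this first approach, assuming phred-style base qualities, might look as follows. The function names and the rejection threshold are hypothetical, chosen only to illustrate the sensitivity/specificity trade-off described above: lowering `max_hq_mismatches` raises sensitivity to repeats but also rejects more genuinely overlapping reads that merely contain base-calling errors.

```python
def phred_to_error_prob(q):
    """Convert a phred-style quality value to its base error probability."""
    return 10 ** (-q / 10)

def reject_overlap(mismatches, quality_threshold=30, max_hq_mismatches=1):
    """Reject an overlap if too many mismatches occur where BOTH reads
    call their base with high quality. `mismatches` is a list of
    (quality_in_read_a, quality_in_read_b) pairs, one per mismatch column.
    Illustrative sketch, not the assembler's actual code."""
    hq = sum(1 for qa, qb in mismatches
             if qa >= quality_threshold and qb >= quality_threshold)
    return hq > max_hq_mismatches
```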

The second approach to repeat location assumes that the shotgun process produces uniformly distributed reads across the target genome. The solution to the long repeat problem then consists in analysing read coverage in overlap graphs and rearranging the read assembly so that the reads are distributed as uniformly as possible in the assembly (Kececioglu and Myers (1992)). The main problem of this method is the assumption of uniform read distribution itself. A shotgun process is a stochastic method to gain reads from a genome. As in every stochastic process approaching a uniform distribution, uniformity cannot be guaranteed throughout each segment of the genome. Additionally, chemical properties of the DNA itself sometimes inhibit correct DNA duplication during the different cloning stages of the shotgun process, leading to skewed read distributions. In summary, assuming a uniform distribution is a working hypothesis that cannot be relied upon as the only criterion.
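Under the uniform-distribution assumption, per-window read coverage is approximately Poisson distributed around its mean, so repeats collapsed into a single location show up as windows with significantly elevated coverage. The following sketch illustrates the idea; the window representation and the three-sigma cutoff are assumptions chosen for illustration, not taken from the cited work.

```python
from math import sqrt

def flag_overcovered_windows(coverage, sigmas=3.0):
    """Flag windows whose read coverage exceeds mean + sigmas * sqrt(mean).
    For a Poisson distribution the standard deviation is sqrt(mean), so
    such windows are candidate collapsed repeats -- subject to the caveats
    in the text about non-uniform cloning and sequencing.
    `coverage` is a list of per-window read counts; names are illustrative."""
    mean = sum(coverage) / len(coverage)
    cutoff = mean + sigmas * sqrt(mean)
    return [i for i, c in enumerate(coverage) if c > cutoff]
```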


Locating repeats through error pattern analysis

The assembler developed combines both methods described above with information on template insert sizes and a fault pattern analysis algorithm. Contrary to the methods presented by Huang (1996) and Kececioglu and Myers (1992), the approach described in this section is able to handle complex repeat patterns which have more than two copies with extremely strong similarity. Here again, the use of an automatic editor during the assembly - which performs edits based on trace evidence only - is a major asset, as it permits duplicating the approach human finishers use when editing contigs.

Figure 37: Example of a misassembly due to previously unknown long repeats. Dark bases are discrepancies between reads and the actual consensus. The upper picture shows the initial alignment built by the assembler, the lower picture shows the same contig after the automatic editor made the corrections it could answer for. Observe the two heavy discrepancy columns the automatic editor left untouched: the assembler will tag the bases of these columns as Possible Repeat Marker Bases (PRMB), dismantle the contig and reassemble the reads contained within.
\includegraphics[width=\textwidth]{figures/u13rep_mincol_f}

A very important factor for any human finisher - when searching for misalignments due to repeats - is the observable circumstance that errors in reads which cause a drop in the alignment quality normally do not mass at specific column positions. Repeats causing misalignments, however, will show up as massive column discrepancies between bases of different reads that simply cannot be edited away. The human finisher performs a search for patterns - like those shown in figure 37 - on a symbolic level in an assembly to detect misassemblies.

Figure 38: Resolved repeat problem: with the additional knowledge of possible repeat marker bases, the assembler is able to find the right solution when assembling repeats. The upper picture shows an alignment containing reads from figure 37 that were previously misassembled but are now at the right place. All the reads have been edited automatically at least once before they were reassembled. Observe that there are still a lot of discrepancies and base calling uncertainties contained in the reads. The lower picture shows the same alignment after automatic editing of the second assembly pass. This demonstrates that the correctly assembled repeats enabled the automatic editor to correct more errors and increase the quality of the assembly.
\includegraphics[width=\textwidth]{figures/u13rep_mincol_r}

The method developed is based on symbolic pattern recognition of column discrepancies in alignments to recognise long repeats and unmarked short repeats. For each column in an alignment, the method uses the same algorithms as for computing a consensus quality (presented earlier in section 4.4.5). But instead of computing a consensus, each column which contains contradicting bases with a group quality surpassing a predefined threshold (e.g. 30, which translates to an error probability of at most 0.001 for each base) is marked as potentially dangerous. By analysing the frequency of dangerous columns within a certain window length, the repeat detection algorithm can find and mark those columns that exceed an expected occurrence frequency.
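The column classification step can be sketched as follows. The group-quality combination rule used here (taking the maximum base quality of each agreeing group) is a deliberate simplification for illustration; the actual computation follows the consensus quality algorithm of section 4.4.5, and all names are hypothetical.

```python
def group_quality(qualities):
    """Combined quality of the reads agreeing on one base in a column.
    Simplified here to the best single-base quality in the group; the
    thesis text uses the consensus quality computation of section 4.4.5."""
    return max(qualities)

def is_dangerous_column(column, threshold=30):
    """A column is 'potentially dangerous' when at least two contradicting
    base groups each reach the quality threshold (30 ~ error prob. 0.001).
    `column` maps a base to the quality values of reads calling that base."""
    strong = [b for b, quals in column.items()
              if group_quality(quals) >= threshold]
    return len(strong) >= 2
```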

Once most of the trivial base calling errors have been corrected by the automatic editor, even a single marked discrepancy column can be seen as a hint for a repeat misalignment if the coverage is high enough and the area has been built with reads sequenced from both strands of the DNA (see again figure 37). The bases allowing discrimination of reads belonging to different repeats are then tagged as Possible Repeat Marker Bases (PRMB) by the assembler. Contigs containing misassemblies are immediately dismantled and reassembled; during the subsequent reassembly, no discrepancy in alignments involving these bases will be allowed, and hence misassemblies are prevented.
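The effect of PRMB tagging on the subsequent reassembly can be illustrated with a small sketch: any overlap contradicting a tagged base is rejected outright, regardless of its overall alignment score. The function name, parameters and data layout are hypothetical, not the assembler's actual interface.

```python
def overlap_allowed(read_a, read_b, offset, prmb_positions_a):
    """Return False if the two reads disagree at any Possible Repeat
    Marker Base of read_a covered by the overlap. `offset` is the
    position in read_a where read_b starts; `prmb_positions_a` lists
    tagged column indices in read_a. Illustrative sketch only."""
    for pos in prmb_positions_a:
        j = pos - offset
        if 0 <= j < len(read_b) and read_a[pos] != read_b[j]:
            return False  # discrepancy at a repeat marker base: forbid overlap
    return True
```

Because a PRMB discrepancy vetoes the overlap no matter how good the rest of the alignment is, reads from near-identical repeat copies are kept apart even when they differ in only a single column.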

The reason for completely dismantling the contigs containing repeat-induced errors is the unpredictable effect the misaligned reads had on the alignment process. The simplest assumption would be that the misaligned reads can be inserted at another position of the assembly. However, in some cases the misaligned reads change the whole assembly layout and contig structure and lead to a totally different assembly. As misassemblies are best prevented by the interaction of the pathfinder and contig objects already described, the most sensible thing to do is to let these algorithms redo the assembly using the additional knowledge gained in this step. Figure 38 continues the example from figure 37: the misassembled repetitive reads had single base columns marked as Possible Repeat Marker Bases and were subsequently reassembled at a totally different position, leading to a substantially different (and correct) assembly than the previous attempt.

Bastien Chevreux 2006-05-11