Data preprocessing and input

Strictly speaking, data preprocessing does not belong to the actual assembler, as almost every laboratory has its own means of defining 'good' quality within reads and already uses existing programs to perform this task.¹⁶ But as this preprocessing step directly influences the quality of the results obtained during the assembly, defining the scope of the expected data is desirable. Moreover, it can explain the strategies implemented to handle data that may have been preprocessed incorrectly.

The most important part of the sequenced fragments (apart from the target sequence itself) is the sequencing vector data, which will invariably be found at the start of each read and sometimes, for short inserts, at the end. These parts of any cloned sequence must be marked or removed from an assembly, as they would otherwise contaminate the 'real' sequence that is to be determined. Programs like LUCY, presented by Chou and Holmes (2001), go to great lengths to remove vector sequences, perform quality trimming and even compare the sequences produced by several different base-calling programs from the same chromatogram file to define what they call the 'final clean range' (or high confidence region, HCR, in terms of this thesis). In analogy to the terms used in the GAP4 package, this thesis will refer to marked or removed parts as 'hidden' data (Staden et al. (1997)); other terms frequently used are 'masked out' or 'clipped' data.

Errors occurring during the base-calling step, or simply quality problems with a clone, can introduce spurious errors into the resulting sequences. These in turn sometimes interfere with the ability of preprocessing programs to correctly recognise and clip the offending sequence parts. Therefore the mira and miraEST assemblers developed during this thesis incorporate a number of routines across all steps of the assembly that 'save' sequences that were incorrectly preprocessed. This section gives only a brief algorithmic overview of the implemented methods; please refer to the program documentation in appendix A for a full description of all available options. The routines that were implemented and that can be used by the assembler are:

Standard quality clipping routines:
Clipping is done with a modified sliding window approach known from the literature (Staden et al. (1997); Chou and Holmes (2001)), where a window of a defined length l is slid across the sequence until the average of the quality values attains a threshold t. Usual values for this procedure are l = 30 and t = 20 when using log-quality values as described in section 2.2.1. An additional backtracking step is implemented to search for the optimal cutoff point within the window once the stop criterion has been reached, discarding bases with quality values below the threshold. This is performed from both sides of the sequence.
Pooling masked areas at sequence tails:
Parts of sequences that were masked (X'ed out) by other preprocessing programs sometimes contain small areas of between 1 and 30 non-masked nucleotides within the masked area due to, e.g., low quality data or the usage of slightly differing sequencing vectors. If requested, the assembler will merge such masked areas when the non-masked sections do not exceed a given length. For example, the sequence XXXXATXXXXXXXXXX... becomes XXXXXXXXXXXXXXXX...
Clipping of sequencing vector relicts (while differentiating them from possible splice variants):
This is done by generating hit/miss histograms of subsequence alignments between all the sequences. In an alignment of two sequences, it is normally to be expected that two neighbouring subsequences of one sequence are also neighbours in the other sequence. If this is the case, a 'hit' is counted; if not, a 'miss'. The good quality middle parts will have a high ratio of consecutive subsequence alignment hits versus 'unexpected' misses within a sequence histogram. Vector leftovers at the ends of sequences, in contrast, will have a very low ratio of hits to misses. The beginning or end of such a vector fraction is marked by a relatively sharp change in the ratio (a 'cliff'), which can easily be detected.
Unfortunately, in EST projects, different splice variants of eukaryotic genes produce the same effect within the histograms. Changes in the hit/miss ratio are therefore searched for only within a given window at the start and end of the 'good' sequence parts (usually between 1 and 20 bases), so that only vector relicts present there are caught.
Uncovering and tagging of poly-A and poly-T bases at sequence ends in EST projects:
Unlike other specialised transcript assemblers like pta (Paracel (2002c)), the algorithms of the assembler developed in this thesis differentiate between different splice variants present in an assembly. They therefore include poly-A / poly-T bases when aligning EST sequences. The assembler will recover those areas by comparing masked sequences with the original counterpart and uncover exactly the poly-A/T stretches present at the end of the sequences by a simple but fault-tolerant base-by-base comparison algorithm. These stretches will furthermore be tagged with assembly-internal meta information to help the algorithms in the splice detection task.
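
The pooling of masked areas described in the list above can be illustrated with a minimal Python sketch. It is not the code used in mira; the function name pool_masked_areas and its parameters max_gap and mask are assumptions made for this illustration, and the sketch merges every short non-masked island enclosed by masked characters rather than restricting itself to the sequence tails.

```python
import re


def pool_masked_areas(seq: str, max_gap: int = 30, mask: str = "X") -> str:
    """Mask short non-masked islands that are enclosed by masked characters.

    Hypothetical helper: `max_gap` is the longest island that still gets
    merged, `mask` must be a single masking character.
    """
    m = re.escape(mask)
    # An island is a run of 1..max_gap non-mask characters that is directly
    # preceded and directly followed by a mask character.
    island = re.compile(rf"(?<={m})[^{m}]{{1,{max_gap}}}(?={m})")
    # Replace each island by the same number of mask characters so that
    # all positions in the read keep their coordinates.
    return island.sub(lambda hit: mask * len(hit.group()), seq)


if __name__ == "__main__":
    print(pool_masked_areas("XXXXATXXXXXXXXXX"))
```

Islands longer than max_gap, as well as unmasked sequence that is bounded by masked characters on one side only, are left untouched by this sketch.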

A high confidence region (HCR) of bases within every read is selected through quality clipping as an anchor point for the next phases. Existing base callers (ABI, PHRED, TraceTuner and others) detect bases and rate their quality quite accurately, and their performance keeps improving, but the bases of a called sequence always become increasingly uncertain towards the ends of a read. The information contained in these end regions, although potentially valuable, can be an impediment in the early phases of an assembly process because it brings in too much noise. It is therefore marked as a low confidence region (LCR) for cautious use in the assembly process.
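
How an HCR could be selected by sliding window quality clipping is sketched below in Python. This is a simplified illustration under stated assumptions, not the modified routine implemented in the assembler: the function names clip_left and high_confidence_region are hypothetical, the defaults follow the usual values l = 30 and t = 20 given above, and the backtracking step is reduced to skipping the bases below the threshold at the outer edge of the first window whose average reaches the threshold.

```python
from typing import Optional, Sequence, Tuple


def clip_left(quals: Sequence[int], window: int = 30,
              threshold: int = 20) -> Optional[int]:
    """Return the index of the first base kept, or None if no window passes."""
    n = len(quals)
    if n < window:
        return None
    total = sum(quals[:window])
    start = 0
    # Slide the window until its average quality attains the threshold.
    while total < threshold * window:
        if start + window >= n:
            return None  # reached the end without a good window
        total += quals[start + window] - quals[start]
        start += 1
    # Backtracking inside the window: discard leading bases below threshold.
    # A window whose average passes always contains a base >= threshold.
    cut = start
    while quals[cut] < threshold:
        cut += 1
    return cut


def high_confidence_region(quals: Sequence[int], window: int = 30,
                           threshold: int = 20) -> Optional[Tuple[int, int]]:
    """Return the HCR as a half-open interval (start, end), or None."""
    left = clip_left(quals, window, threshold)
    if left is None:
        return None
    # Clip from the other side by running the same scan on the reversed read.
    right_offset = clip_left(quals[::-1], window, threshold)
    return left, len(quals) - right_offset


if __name__ == "__main__":
    qualities = [5] * 10 + [40] * 50 + [5] * 10
    print(high_confidence_region(qualities))  # prints (10, 60)
```

Everything outside the returned interval would then be treated as LCR in the sense of this section.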

The following list shows the types of data the assembler works with. Any item except the sequence and the vector clippings can be left out, although omitting it will reduce the efficiency of the assembler:

the initial trace data, representing the gel electrophoresis signal;
the called nucleic acid sequence;
position specific confidence values for the called bases of the nucleic acid sequence;
a stretch in each sequence marked as HCR;
general properties like direction of the clone read and name of the sequencing template etc.;
special sequence properties in different regions of a read (like sequencing vector, known standard repeat sequence and known SNP sites etc.) that have been tagged or marked.
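
As an illustration of how these inputs fit together, the following Python sketch models one read as a record in which only the sequence and the vector clipping are mandatory. All class and field names (Read, Tag, vector_clip and so on) are assumptions chosen for this example and do not reflect the internal data structures of the assembler.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Tag:
    """A marked region of a read, e.g. vector, known repeat or SNP site."""
    start: int
    end: int
    kind: str


@dataclass
class Read:
    """Input data for one read; only the first three fields are required."""
    name: str
    sequence: str                              # called nucleic acid sequence
    vector_clip: Tuple[int, int]               # half-open range free of vector
    qualities: Optional[List[int]] = None      # per-base confidence values
    trace_file: Optional[str] = None           # initial trace data, if kept
    hcr: Optional[Tuple[int, int]] = None      # high confidence region
    forward: Optional[bool] = None             # direction of the clone read
    template: Optional[str] = None             # name of sequencing template
    tags: List[Tag] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Confidence values are position specific: one value per base.
        if (self.qualities is not None
                and len(self.qualities) != len(self.sequence)):
            raise ValueError("need exactly one quality value per base")


if __name__ == "__main__":
    read = Read(name="example_read", sequence="ACGTACGT", vector_clip=(2, 8))
    print(read.hcr, read.tags)
```

Optional fields default to None or an empty list, mirroring the statement that missing data is tolerated at the cost of efficiency.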

Bastien Chevreux 2006-05-11