... representations1
also called fragments (see Myers (1995)) or readings (reads)
... accurately2
e.g. for new electrophoresis methods
...mira3
which is an acronym for Mimicking Intelligent Read Assembly
... work.4
The automatic editor is the subject of a thesis to be presented by Thomas Pfisterer
... character.5
e.g., the symbol 'W' for an uncertainty between A (adenine) and T (thymine)
... orientation6
the experimentally gained sequences have a 50% chance of being in reverse complement orientation
... traces7
Mean length of useful sequences gathered on ABI 3730 machines at The Institute for Genomic Research (TIGR) in 2003; personal communication from Bill Niermann (Investigator at TIGR), April 2004
... bases.8
insertions and deletions are commonly referred to as indels
... reads''.9
The term contig is derived from ``contiguous sequence''
... Green10
PHRAP is an acronym for PHil's Revised Assembly Program; see also http://www.phrap.org/
...fuguization11
this refers to the rather compact genome of the puffer fish (Fugu rubripes) which is largely devoid of large copy repeats.
... enough.12
e.g. for new electrophoresis methods
... operations13
Specialised hardware for this type of operation starts at approximately EUR 500,000
... information14
which they call ``double-barreled data''
... mind'15
e.g. 'could the base G at position 235 in read 4 be replaced by an A?' (because the overall consensus at this position of the other reads suggests this possibility)
... task.16
For example, quality clipping and the removal of sequencing vector and cosmid vector can be controlled by the PREGAP4 environment provided with the GAP4 package (Bonfield et al. (1995b); Staden (1996); Bonfield and Staden (1996)) or by the LUCY program. Parts of these tasks can also be done with cross_match, provided by the PHRAP package, or with other packages such as PFP from Paracel (Paracel (2002a)).
... biosciences17
see also Myers (1991)
... sequence18
It is again of no consequence which sequence is in reverse complement direction to the other, as both will be searched with the reverse complement pattern of the other one.
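The symmetry can be checked with a minimal sketch. It is not taken from the assembler itself: the helper name `reverse_complement` and the two toy reads are assumptions made purely for illustration. If one read matches the other only in reverse complement orientation, the match is found whichever of the two is reverse-complemented:

```python
# Complement table for the four bases; 'N' stays 'N'.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

# Two toy reads: read_b stems from the opposite strand of read_a.
read_a = "TTACGGATCCA"
read_b = reverse_complement("CGGATC")

# No match when both reads are taken in forward direction.
forward_hit = read_b in read_a

# Searching read_a with the reverse complement of read_b finds the overlap,
# and so does searching the reverse complement of read_a with read_b.
hit_a = reverse_complement(read_b) in read_a
hit_b = read_b in reverse_complement(read_a)

print(forward_hit, hit_a, hit_b)
```

Exact substring search stands in here for the actual pattern search; the point is only that the result does not depend on which of the two reads is taken as reversed.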
... ZEBRA19
ZEBRA is not an acronym; the algorithm was so named because it produces 'bands' in memory which resemble the stripe pattern of the African zebra
... number20
which is $4^k$ for $k$ Ns.
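As a rough illustration of this count (a sketch only, not code from the assembler; the function name `expand_ns` is hypothetical): each N can stand for any of the four bases, so a pattern containing k Ns expands to four to the power of k concrete sequences.

```python
from itertools import product

BASES = "ACGT"

def expand_ns(pattern):
    """Return every concrete sequence obtained by replacing each 'N' with A, C, G or T."""
    n_positions = [i for i, c in enumerate(pattern) if c == "N"]
    expansions = []
    # One combination of bases per possible assignment of the Ns.
    for combo in product(BASES, repeat=len(n_positions)):
        seq = list(pattern)
        for pos, base in zip(n_positions, combo):
            seq[pos] = base
        expansions.append("".join(seq))
    return expansions

variants = expand_ns("ANGN")
print(len(variants))  # two Ns, hence 4**2 = 16 sequences
```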
... sequences.21
see also the paper from Pearson (1998) for a review on empirical statistical estimates for sequence similarity searches
... exemplarily.22
The graphical display of the resulting bit vectors vaguely resembles an African zebra, hence the algorithm's name.
... devised23
for example see Grice et al. (1997); Chao et al. (1994) for an overview and Chao et al. (1995) for an application
... mismatches.24
but perhaps one or several alignments of a base against an 'N'
... reads25
Chimeric reads, as described in section 2.2.2, must be considered as garbage.
... contigs.26
Of course, a single read itself cannot be called a contig. But putting it into the same data structure (a contig object) as the other, assembled reads is a convenient way to keep unassembled reads in a database.
... projects27
cosmid, BAC or even whole genome size
... project.28
for example to close gaps
... occurring.29
e.g. the infamous AG problem known from the ABI 373 and 377 machines, where a G preceded by an A is often indistinct
... positions.30
although chemistry together with the sequencing direction of a read might play a minor role in the type of errors generated, this has no real impact on the error distribution itself
... threshold31
working with relatively high score ratios beginning with 80%
... routines.32
This is one of the more prominent places where it shows that the EST assembler is a sibling of its genome counterpart.
...mira33
MIRA: Mimicking Intelligent Read Assembly
... sequences34
see Notredame (2002) for a review of state-of-the-art algorithms, Thompson et al. (1999a) and Lassmann and Sonnhammer (2002) for an evaluation of some of these tools, and Morgenstern et al. (2003) for the description of a web-based solution
... GAP4/cycle35
GAP4/cycle is a script performing several GAP4 assemblies with decreasing strictness
... test.36
PHRAP uses base qualities for the assembly, MIRA/EdIt can use them if present and GAP4/cycle does not
... 2.0.137
TraceTuner is from Paracel Inc.
... capabilities38
It is widely known that the speed of a program depends mainly on the quality of the algorithms it is based on. However, good compilers can squeeze a considerable amount of execution speed out of optimal algorithms by optimising them at the machine level, sometimes by a factor of 3 or more.
... EGCS39
which was later promoted to the official GCC, until the new GCC 3 compiler lineage appeared in mid-2001
... .o''40
mind the blank between the asterisk and .o
... hierarchy41
especially into a single-rooted hierarchy