|
Sequence assembly with MIRA2
Bastien Chevreux
November 30, 2006
Version 2.6.0
mira
- a genome and EST sequence assembly system.
Table of Contents
mira
[-project=<name>]
[-fasta[=<filename>] | -caf[=<filename>] | -phd[=<filename>]]
[-genomedraft | -genomenormal | -genomeaccurate]
[-mappingdraft | -mappingnormal | -mappingaccurate]
[-clippinglight | -clippingnormal | -clippingheavy]
[-highlyrepetitive]
[-highqualitydata]
[-borg]
[-estmode]
[-horrid]
[-params=<filename>]
[-GENERAL:arguments]
[-STRAIN/BACKBONE:arguments]
[-ASSEMBLY:arguments]
[-DATAPROCESSING:arguments]
[-CLIPPING:arguments]
[-SKIM:arguments]
[-ALIGN:arguments]
[-CONTIG:arguments]
[-EDIT:arguments]
[-DIRECTORY:arguments]
[-FILE:arguments]
[-OUTPUT:arguments]
mira
is a DNA sequence data assembly program (or "fragment
assembler") for whole genome and EST projects. mira
assembles reads /
sequences gained by gel or capillary electrophoresis experiments into
contiguous sequences (contigs). Input can be in various formats like Staden
experiment (EXP), Sanger CAF, FASTA or PHD file.
If present, base qualities in phred(1)
style and SCF signal
electrophoresis trace files are used to adjudicate between or even correct
contradictory stretches of bases in reads by either the integrated automatic
EdIt
editor (written by Thomas Pfisterer) or the assembler itself.
Ancillary data containing additional information helpful to the assembly as is
contained in, e.g. NCBI traceinfo XML files or Staden EXP files, is also
honoured.
mira
was conceived especially with the problem of repeats in genomic
data and SNPs in EST data in mind. Considerable effort was made to develop a
number of strategies -- ranging from standard clone-pair size restrictions to
discovery and marking of base positions discriminating the different repeats /
SNPs -- to ensure that repetitive elements are correctly resolved and that
misassemblies do not occur.
The resulting assembly can be written in different standard formats like CAF,
Staden GAP4 directed assembly, ACE, HTML, FASTA, simple text or transposed
contig summary (TCS) files. These can easily be imported into numerous
finishing tools or further evaluated with simple scripts.
The MIRA acronym stands for Mimicking
Intelligent
Read
Assembly
and the program pretty well does what its
acronym says. The aim of mira
is to build the best possible assembly by
- having a full overview on the whole project at any time of the
assembly, i.e. knowledge of all possible read-pairs in a project,
- using high confidence regions (HCRs) of several aligned read-pairs to
start contig building at a good anchor point of a contig, extending
clipped regions of reads on a 'can justify ' basis.
- using all available data present at the time of assembly, i.e.,
instead of relying on sequence and base confidence values only, the
assembler will profit from trace files containing electrophoresis
signals, tags marking possible special attributes of DNA,
information on specific insert sizes of read-pairs etc.
- having intelligent contig objects accept or refuse reads based on the
rate of unexplainable errors introduced into the consensus
- discovering and analysing of possible repeats differentiated only by
single nucleotide polymorphisms. The important bases for discriminating
different repetitive elements are tagged.
- using the possibility given by the integrated automatic editor to
correct errors present in contigs (and subsequently) reads by generating and
verifying complex error hypotheses through analysis of trace signals in
several reads covering the same area of a consensus,
- iteratively extending reads (and subsequently) contigs based on
a) additional information gained by overlapping read pairs in contigs and
b) corrections made by the automated editor.
mira
-- written by Bastien Chevreux -- is part of a bigger project
which also contains the automated editor / finisher EdIt
package --
written by Thomas Pfisterer. The strength of mira
and EdIt
is
the automatic interaction of both packages which produces assemblies with less
work for human finishers to be done. This is the documentation belonging to
the 2.6.0 release of the mira
assembler.
MIRA2 has been rewritten in large parts; compared to the previous versions it
supports a lot more in- and output files, got more accurate in the assembly,
has better support for standard cases, learned to assemble EST sequences, has
now a mode where it can assemble to pre-existing sequences (backbones) and is
quite faster.
I'd like to thank everybody who reported bugs to me, pointed out problems,
sent ideas and suggestions they encountered while using the predecessors.
Please continue to do so, the feedback made this second version possible.
mira
has two basic working modes: genome or EST. From version 2.4 on,
there is only executable which supports both modes. However, the name of the
executable defines the working mode:
- mira(1) for assembly of genomic data as well as assembly
of EST data from one strain/organism
and
- miraEST(1) for assembly of EST data from different
strains and SNP detection within this assembly.
Therefore, miraEST(1)
is usually realised as a link to the mira(1)
executable.
Parameters can be given on the command line or via parameter files. Since
version 2.6.0, mira
knows two basic parameter types: quick switches
and extensive switches.
- quick switches, also dubbed DWIM switches (for
'Do-What-I-Mean'),
are easy to
use-and-combine switches activating parameter collections for predefined
tasks that will suit most people's needs.
- extensive switches offer a way to set about any possible
parameter to configure mira
for any kind of special need. While the
format of extensive switches might look a little bit strange, it is borrowed
from the SGI C compiler options and allows both compact command lines as
well as readable and / or script generated parameter files.
The mira(1)
command line options accept several arguments and allow a
user to specify a setting for each argument. To specify multiple arguments,
use colons to separate each argument on the command line. You may either use
the long or the short form of each argument, the later being given in
brackets.
As example, a typical call of mira(1)
using quick switches and some
tweaking with extended switches on the command line could look like this:
mira -fasta -genomedraft -ALIGN:min_relative_score=70 -GENERAL:use_template_information=yes -CONTIG:defaultinsertsizeminimum=500:defaultinsertsizemaximum=2500
or in short form
mira -fasta -genomedraft -AL:mrs=70 -GE:uti=yes -CO:dismin=500:dismax=2500
Please note that it is also perfectly legal to write
mira -fasta -genomedraft -AL:mrs=70 -GE:uti=yes -CO:dismin=500 -CO:dismax=2500
(just for those of you who want to use mira
in scripted environments
and / or write scripts for it).
There are example parameter files included in the distribution which
show how to format parameters in files.
These switches are 'Do-What-I-Mean'
parameter collections for predefined tasks which should suit most people's
needs.
For some switches, the order of appearance in the command line (or parameter
file) is important. It is generally a good idea to place them in the order as
described in this documentation, e.g. always write -genomedraft -clippingheavy
and not -clippingheavy -genomedraft.
This is true for
the following switches:
- [-genomedraft | -genomenormal | -genomeaccurate]
-
One-stop-switches for three quality grades of de-novo genome assembly.
Draft
is quick-and-dirty, suited to get a first look on approximate
coverage of a running project. Should not be used for anything else.
Normal
is the default parameter set of mira
that is able to
tackle most genomes. A bit slower than the draft version, but includes such
options as read extension and vector remnant clipping. Accurate
is
still slower than the normal mode but should be used for genomes that pose a
problem to the normal mode.
- [-mappingdraft | -mappingnormal | -mappingaccurate]
- Work like
the [-genome...]
switches except they are to be used when performing
mapping assemblies against given backbone sequences.
- [-clippinglight | -clippingmedium | -clippingheavy]
- Three
clipping grade modifiers, from light
clipping when working with well
preprocessed sequences to heavy
clipping when the sequences that are
being assembled had only sloppy or no preprocessing.
Note 1: the light
version is already included in the
[-genome...]
and [-mapping...]
switches.
Note 2: it is recommended to perform a thorough preprocessing (clipping
sequencing vector stretches, clipping of low quality bases, tagging standard
repeats etc.) before assembling sequences. The clipping routines of
mira
are more optimised to cope with last remnants of wrongly
preprocessed sequences than with sequences having had no pre-processing at
all.
- [-highlyrepetitive]
- A modifier switch for genome data that is
deemed to be highly repetitive. The assemblies will run slower due to more
iterative cycles that give mira
a chance to resolve nasty repeats.
- [-highqualitydata]
- A modifier switch when the sequences that are
used are of exceptional quality. mira
will then bump up a few quality
parameters which should lead to less false positives in the repeat and SNP
detection routines.
The following switches can be placed anywhere on the command line without
interfering with other switches:
- [-fasta | -fasta=<filename>]
- Sets parameters suited for
loading sequences from FASTA files. The version with =<filename> will
also set the input file to the given filename.
- [-phd | -phd=<filename>]
- Sets parameters suited for
loading sequences from PHD files. The version with =<filename> will also
set the input file to the given filename.
- [-caf | -caf=<filename>]
- Sets parameters suited for
loading sequences from CAF files. The version with =<filename> will also
set the input file to the given filename.
- [-project=<name>]
- Default is mira.
Defines the
project name for this assembly. The project name automatically influences
the name of input and output files / directories. E.g. in the default
setting, the file names for the output of the assembly in FASTA format would
be "mira_out.fasta" and "mira_out.fasta.qual". Setting the project name to
"MyProject" would generate "MyProject_out.fasta" and
"MyProject_out.fasta.qual". See also -FILE:
and
-DIRECTORY:
for a list of names that are influenced.
- [-params|-parameterfile=<filename>]
- Loads parameters from the
filename given. Allows a maximum of 10 levels of recursion, i.e. a -params
option appearing within a file that loads other parameter files (though I
cannot think of useful applications with more than 3 levels).
Please note that the switches -estmode, -horrid, -borg
imply that quality values are available in the assembly. If
not, these switches will not work properly and the results obtained not the
intended ones. These switches are ususally not combined with order dependent
switches from above.
- [-estmode]
- Switches mira
to a good initial preset for
assembling EST data. Note that this is not needed (and even
counterproductive) when used with miraEST.
- [-horrid]
- Sets a number of parameters useful when dealing with
really horrid data sets. Useful means that parameters are chosen to so that
time and memory consumption do not explode beyond all hope of the program
returning. Note that MIRA will return in most cases useful assemblies with
this switch, but these might not be as optimised as with normal operation.
The definition of "horrid" being a bit flexible, this is what I would call
horrid: (a) a genomic projects with more that 2.000 reads that all seem to
align partly to each other but have different repetitive structures or (b)
EST clusters with a few thousand almost similar reads.
- [-borg]
- Sets a bunch of parameters to have mira
try to
assemble as many reads as possible. Will probably slow down the assembly
process and use more memory.
"We are MIRA of borg. You will be assembled, resistance is futile!"
Note: A double dash (e.g. --params) may also be used instead of a single one
in front of these switches.
Examples for using these switches can be found documentation file describing
mira
usage.
General options control the type of assembly to be performed and other
switches not belonging anywhere else.
- project(pro)=string
- Same as the quick switch [-project].
- load_job(lj)=fofnexp, fasta, caf, phd, fofnphd
- Default is
fofnexp.
Defines whether to load and assemble EXP files from a
file of filenames ("mira_in.fofn"), load and assemble FASTA sequences
("mira_in.fasta") and their qualities ("mira_in.fasta.qual"), load and
assemble sequences or qualities from a phd file ("mira_in.phd") or to load
a project from a CAF file ("mira_in.caf") and assemble or eventually
reassemble it. Note: fofnphd currently not available.
- filecheck_only(fo)=on|yes|1, off|no|0
- Default is
no.
If set to yes, the project will not be assembled and no
assembly output files will be produced. Instead, the project files will only
be loaded. This switch is useful for checking consistency of input files.
- merge_xmltraceinfo(mxti)=on|yes|1, off|no|0
- Default
is no.
Some file formats above (FASTA, PHD or even CAF and EXP)
possibly do not contain all the info necessary or useful for each read of an
assembly. Should additional information -- like clipping positions etc. --
be available in a XML trace info file in NCBI format (see
File formats),
then set this option to yes
and it will
be merged to the data loaded. See also -FILE:
for the name of the
XML file to load.
Please note: quality clippings given here will override quality clippings
loaded earlier or performed by mira.
Minimum clippings will still be
made by the program, though.
- readnaming_scheme(rns)=sanger, tigr
- Default is
sanger.
Defines the centre naming scheme for read suffixes.
Currently, only Sanger Institute and TIGR naming schemes are
supported out of the box.
How to choose: please read the documentation available at the different
centres or ask your sequence provider. In a nutshell, Sanger scheme is
"somename.[pqsfrw][12][bckdeflmnpt][a|b|c|..."
(e.g.
U13a08f10.p1ca), TIGR scheme is "somenameTF*|TR*|TA*"
(e.g. GCPBN02TF or GCPDL68TABRPT103A58B).
- external_quality(eq)=none, SCF
- Default is SCF.
Defines the source format for reading qualities from external sources.
Normally takes effect only
when these are not present in the format
of the load_job project (EXP and FASTA can have them, CAF and PHD must have
them).
- external_quality_override(eqo)=on|yes|1, off|no|0
- Default
is no,
only takes effect when load_job is fofnexp.
Defines whether or not the qualities from the external source override the
possibly loaded qualities from the load_job project. This might be of use
in case some post-processing software fiddles around with the quality values
of the input file but one wants to have the original ones.
- discard_read_on_eq_error(droeqe)=on|yes|1, off|no|0
- Default
is yes.
Should there be a major mismatch between the external
quality source and the sequence (e.g.: the base sequence read from a SCF
file does not match the originally read base sequence), should the read be
excluded from assembly or not. If not, it will use the qualities it had
before trying to load the external qualities (either default qualities or
the ones loaded from the original source).
- use_template_information(uti)=on|yes|1, off|no|0
-
Default is Yes.
Two reads sequenced from the same clone template
form a read pair with a known minimum and maximum distance. This feature
will definitively help for contigs containing lots of repeats. Set this to
'yes' if your data contains information on insert sizes.
Information on insert sizes can be given via the SI tag in EXP files (for
each read pair individually), or for the whole project using
-CO:dismin
and -CO:dismax
(see below).
- est_startstep(ess)=1 <= integer <= 4
- Default is
1.
Controls the starting step of the EST assembly and is
therefore only useful in miraEST.
EST assembly is a three step process, each with different settings to the
assembly engine, with the result of each step being saved to disk. If
results of previous steps are present in a directory, one can easily "play
around" with different setting for subsequent steps by reusing the results
of the previous steps and directly starting with step two or three.
- print_date(ps)=on|yes|1, off|no|0
- Default
is yes.
Controls whether date and time are printed out during
the assembly. Suppressing it is not useful in normal operation, only when
debugging or benchmarking.
General options for controlling strain and backbone information
- load_straindata(lsd)=on|yes|1, off|no|0
- Default is
no.
Straindata is a key value file, one read per line. First the
name of the read, then the strain name of the organism the read comes
from. It is used by the program to differentiate different types of SNPs
appearing in organisms and classifying them.
- load_backbone(lb)=on|yes|1, off|no|0
- Default
is no.
A backbone is a sequence (or a previous assembly) that is
used as template for the current assembly. The current assembly process will
assemble reads first to those loaded backbone contigs before creating new
contigs.
This feature is helpful for assembling against previous (and already
possibly edited) assembly iterations, or to make a comparative assembly of
two very closely related organisms. Please read "very closely related" as
in: only SNP mutations or short indels present.
- startbackboneusage_inloop(sbuil)=0 < integer
- Default is
3.
When assembling against backbones, this parameter defines the
loop iteration (see -AS:nol)
from which on the backbones will be
really used. In the loops preceeding this number, the non-backbone reads
will be assembled together as if no backbones existed. This allows
mira
to correctly spot repetitive stretches that
differ by single bases and tag them accordingly.
Rule of thumb: if backbones belong to same strain as reads to assemble, set
to 1.
If backbones are a different strain, then set
-SB:sbuil
to 1 lower than -AS:nol
(example: nol=4 and
sbuil=3).
- backbone_strain_name(bsn)=string
- Default is
<empty string>.
Defines the name of the strain that the
backbone sequences have.
- backbone_filetype(bft)=fasta, caf, gbf
- Default is
fasta.
Defines the filetype of the backbone file given.
Currently (2.6.0 ) only FASTA, CAF and GBF files are supported.
When GBF (GenBank files, also named .gbk) files are loaded, the features
within theses files are automatically transformed into Staden compatible
tags and get passed through the assembly.
- backbone_raillength(brl)=1000 <= integer <= 3000
-
Default is 2500.
Parameter for the internal sectioning size of
the backbone. Extremely repetitive sequences may require reducing the
default value, but the default value should work well in 99.9% of all
cases.
- backbone_basequality(bbq)=-1 <= integer <= 100
-
Default is -1.
Defines the default quality that the
backbone sequences have if they came without quality values in their files
(like in GBF format or when FASTA is used without .qual files). A value of
-1 mira
to use the same default quality for backbones as for reads.
- also_build_new_contigs(abnc)=on|yes|1, off|no|0
- Default
is yes.
Standard mode of the assembler is to assemble available
reads to a backbone and make new contigs with remaining reads. If this
option is set to no,
the reads that cannot be assembled into
exsiting contigs are put as singlets into the assembly, not forming new
contigs.
General options for controlling the assembly.
- minimum_read_length(mrl)=integer >= 20
- Default is
40.
Minimum length that reads must have to be considered for the
assembly. Shorter sequences will be filtered out at the beginning of the
process and won't be present in the final project.
- num_of_loops(nol)=integer > 0
- Default is 3.
Defines how many iterations of the whole assembly process are done. Rule of
thumb: quick and dirty assembly: use 1 (not recommended). Assembly using
read extensions and / or automatic contig editing (-DP:ure
and
-ED:ace,
see below): at least 2. The recommended setting is 3 or
higher, as some knowledge generated by the assembler can be used only from
the third iteration on. More than 3 loops might be useful for projects
containing many repetitive elements. See also -AS:rbl
and
-AS:mr
below for loop parameters that affect the assembly and
disentanglement of possible repeats.
- skim_each_loop(sel)=on|yes|1, off|no|0
- Default is
yes.
Defines whether the skim algorithm (and with it also the
recalculation of Smith-Waterman alignments) is called inbetween each main
loop. If set to no,
skimming is done only when needed by the
workflow: either when read extensions are searched for (-DP:ure)
or when possible vector leftovers are to be clipped (-CL:pvc).
Setting this option to yes
is highly recommended, setting it to
no
only for quick and dirty assemblies.
- rmb_break_loopsmax(rbl)=integer > 0
- Default is
2.
Defines how many times a contig can be rebuilt during a main
assembly loop (-AS:nol)
if misassemblies due to possible repeats
are found.
- spoiler_detection(sd)=on|yes|1, off|no|0
- Default is
yes
for mira
and no
miraEST.
A spoiler
can be either a chimeric read or it is a read with long parts of unclipped
vector sequence still included (that was too long for the -CL:pvc
vector leftover clipping routines). A spoiler typically prevents contigs to
be joined, MIRA will cut them back so that they represent no more harm to
the assembly.
Recommended for assemblies of mid- to high-coverage genomic assemblies, not
recommended for assemblies of ESTs as one might loose splice variants with
that,
A minimum number of two assembly loops (-AS:nol)
must be run for
this option to take effect.
- sd_last_loop_only(sdllo)=on|yes|1, off|no|0
-
Default is yes.
Defines whether the spoiler detection algorithms
are run only for the last loop or for all loops (-AS:nol).
Takes effect only if spoiler detection (-AS:sd)
is on.
- use_genomic_pathfinder(ugpf)=on|yes|1, off|no|0
-
Default is yes.
MIRA has two different pathfinder algorithms it
chooses from to find its way through the (more or less) complete set of
possible sequence overlaps: a genomic and an EST pathfinder. The genomic
looks a bit into the future of the assembly and tries to stay on safe
grounds using a maximum of information already present in the contig that is
being built. The EST version on the contrary will directly jump at the
complex cases posed by very similar repetitive sequences and try to solve
those first and is willing to fall down to brute force when really bad cases
(like, e.g., coverage with thousands of sequences) are
encountered.
Generally, the genomic pathfinder will also work quite well with EST
sequences (but might get slowed down a lot in pathological cases), while the
EST algorithm does not work so well on genomes. If in doubt, leave on
yes
for genome projects and set to no
for EST
projects.
- use_emergency_search_stop(uess)=on|yes|1, off|no|0
-
Default is yes.
Another important switch if you plan to assemble
non-normalised EST libraries, where some ESTs may reach coverages of several
hundreds or thousands of reads. This switch lets MIRA save a lot of
computational time when aligning those extremely high coverage areas (but
only there), at the expense of some accuracy.
- ess_partnerdepth(esspd)=integer > 0
- Default is
500.
Defines the number of potential partners a read must have
for MIRA switching into emergency search stop mode for that read.
- use_max_contig_buildtime(umcbt)=on|yes|1,off|no|0
-
Default is no.
Defines whether there is an upper limit of time
to be used to build one contig. Set this to yes in EST assemblies where you
think that extremely high coverages occur. Less useful for assembly of
genomic sequences.
- buildtime_in_seconds(bts)=integer > 0
- Default is
10000.
Depending on -AS:umcbt
above, this number
defines the time in seconds alloted to building one contig.
Options for controlling some data processing during the assembly.
- use_read_extension(ure)=on|yes|1, off|no|0
- Default
is yes.
mira
expects the sequences it is given to be
quality clipped. During the assembly though, it will try to extend reads
into the clipped region by analysing Smith-Waterman alignments between reads
that were found to be valid. Only the right clip is extended though, the
left clip (most of the time containing sequencing vector) is never touched.
- read_extension_window_length(rewl)=integer > 0
- Default
is 30.
Only takes effect when -DP:ure
(see above) is
set to yes.
The read extension routines use a sliding window
approach on Smith-Waterman alignments. This parameter defines the window
length.
- read_extension_with_maxerrors(rewme)=integer > 0
-
Default is 2.
Only takes effect when -DP:ure
(see
above) is set to yes.
The read extension routines use a sliding
window approach on Smith-Waterman alignments. This parameter defines the
number maximum number of errors (=disagreements) between two alignment in
the given window.
- first_extension_in_loop(feil)=integer >= 0
- Default is
0.
Only takes effect when -DP:ure
(see above) is set
to yes.
The read extension routines can be called before
assembly and/or after each assembly loop (see -AS:nol).
This
parameter defines the first loop in which the read extension routines are
called. The default of 0
tells mira
to extend the reads
the first time before the first assembly loop.
- last_extension_in_loop(leil)=integer >= 0
- Default is
0.
Only takes effect when -DP:ure
(see above) is set
to yes.
The read extension routines can be called before
assembly and/or after each assembly loop (see -AS:nol).
This
parameter defines the last loop in which the read extension routines are
called. The default of 0
tells mira
to extend the reads
the last time before the first assembly loop.
- tag_polyat_at_ends(tpae)=on|yes|1, off|no|0
-
Default is no.
This option is useful in EST assembly. Poly-AT
stretches at end of reads that were not correctly masked or clipped in
preprocessing steps from external programs get tagged here. The assembler
will not use these stretches for critical operations. Additionally, the tags
do provide a good visual anchor when looking at the assembly with different
programs.
- polybase_window_len(pbwl)=integer > 0
- Default is
7.
Only takes effect when -DP:tpae
(see above) is set
to yes.
Defines the window length within which all bases (except
the maximum number of errors allowed, see below) must be either A or T to be
considered a polybase stretch.
- polybase_window_maxerrors(pbwme)=integer > 0
- Default is
2.
Only takes effect when -DP:tpae
(see above) is set
to yes.
Defines the maximum number of errors allowed in a given
window length (see above) so that a stretch is considered to be a polybase
stretch. The distribution of these errors is not important.
- polybase_window_gracedistance(pbwgd)=integer > 0
- Default
is 9.
Only takes effect when -DP:tpae
(see above) is
set to yes.Defines
the number of bases from the end of a
sequence (if masked: from the end of the masked area) within which a
polybase stretch is looked for without finding one.
Controls for clipping options: when and how sequences should be clipped.
- possible_vector_clip(pvc)=on|yes|1, off|no|0
-
Default is yes.
mira
will try to identify possible
sequencing vector relicts present at the start of a sequence and clip them
away. These relicts are usually a few bases long and were not correctly
removed from the sequence in data preprocessing steps of external programs.
You might want to turn off this option if you know (or think) that your data
contains a lot of repeats and the option below to fine tune the clipping
behaviour does not give the expected results.
- pvc_maxlenallowed(pvcmla)=integer >= 0
- Default is
18.
The clipping of possible vector relicts option works quite
well. Unfortunately, especially the bounds of repeats or differences in EST
splice variants sometimes show the same alignment behaviour than
possible sequencing vector relicts and could therefore also be clipped.
To refrain the vector clipping from mistakenly clip repetitive regions or
EST splice variants, this option puts an upper bound to the number of bases
a potential clip is allowed to have. If the number of bases is below or
equal to this threshold, the bases are clipped. If the number of bases
exceeds the threshold, the clip is NOT
performed.
Setting the value to 0 turns off the threshold, i.e., clips are then always
performed if a potential vector was found.
- quality_clip(qc)=on|yes|1, off|no|0
- Default is
no,
but is set automatically to yes
when using the
quickmode switches -fasta
or -phd
(can be turned off again by
subsequent options afterwards). This will let mira
perform its own
quality clipping before sequences are entered into the assembly. The clip
function performed is a sequence end window quality clip with back iteration
to get a maximum number of bases as useful sequence. Note that the bases
clipped away here can still be used afterwards if there is enough evidence
supporting their correctness when the option -DP:ure
is turned
on.
- qc_minimum_quality(qcmq)=integer >= 15 and <= 35
-
Default is 20.
This is the minimum quality bases in a window
require to be accepted. Please be cautious not to take too extreme values
here, because then the clipping will be too lax or too harsh. Values below
15 and higher than 30-35 are not recommended.
- qc_window_length(qcwl)=integer >= 10
- Default is
30.
This is the length of a window in bases for the quality
clip.
- maskedbases_clip(mbc)=on|yes|1, off|no|0
- Default is
yes.
This will let mira
perform a 'clipping' of bases
that were masked out (replaced with the character X). It is generally not a
good idea to use mask bases to remove unwanted portions of a sequence, the
EXP file format and the NCBI traceinfo format have excellent possibilities
to circumvent this. But because a lot of preprocessing software are built
around cross_match,
scylla-
and phrap-style
of base
masking, the need arised for mira
to be able to handle this, too.
mira
will look at the start and end of each sequence to see whether
there are masked bases that should be 'clipped'.
- mbc_gap_size(mbcgs)=integer >= 0
- Default is
20.
While performing the clip of masked bases, mira
will
look if it can merge larger chunks of masked bases that are a maximum of
-CL:mbcgs
apart.
- mbc_max_front_gap(mbcmfg)=integer >= 0
- Default is
40.
While performing the clip of masked bases at the start of a
sequence, mira
will allow up to this number of unmasked bases
in front of a masked stretch.
- mbc_max_end_gap(mbcmeg)=integer >= 0
- Default is
60.
While performing the clip of masked bases at the end of a
sequence, mira
will allow up to this number of unmasked bases
behind a masked stretch.
- ensure_minimum_left_clip(emlc)=on|yes|1, off|no|0
-
Default is yes.
If on, ensures a minimum left clip on each read
according to the parameters in -CL:mlcr:smlc
- minimum_left_clip_required(mlcr)=integer >= 0
- Default is
25.
If -CL:emlc
is on, checks whether there is a
left clip which length is at least the one specified here.
- set_minimum_left_clip(smlc)=integer >= 0
- Default is
30.
If -CL:emlc
is on and actual left clip is <
-CL:mlcr,
set left clip of read to the value given here.
Options that control the behaviour of the initial fast match finding
algorithm. Matches found here will be confirmed later in the alignment phase.
- initial_match(im)=integer >= 1
- Default is 4.
Controls how many exact word matches must be present in two sequences at the
exact same spacing inbetween them to further analyse a the potential hit.
Increasing this value save a little bit of time, but values > 8 are likely
to loose possible weak overlaps occurring at the end of reads. After passing
this test, (-SK:mc,
see below) constitutes the next stage.
- match_confirm(mc)=integer >= 1
- Default is 10.
Controls the number of exact word matches that need to be present in two
sequences whose spacing inbetween them is "approximately" the same. Same
restrictions to increasing this value apply as to -SK:im.
-SK:pr
(see below) constitutes the next (and last) stage of the
SKIM pipeline.
- percent_required(pr)=integer >= 1
- Default is
25.
Controls the relative percentage of exact word matches in an
approximate overlap that has to be reached to accept this overlap as
possible match. Increasing this number will decrease the number of possible
alignments that have to be checked by Smith-Waterman later on in the
assembly, but tit also will lead to the rejection of weaker overlaps (i.e.
overlaps that contain a higher number of mismatches).
- maxhits_perread(mhpr)=integer >= 1
- Default is
200.
Controls the maximum number of possible hits one read can
maximally transport to the Smith-Waterman alignment phase. If more potential
hits are found, only the best ones are taken.
This is an important option for tackling projects which contain
extreme
assembly conditions. For example, 5000 reads that are all
very similar would generate around 40 to 50 million possible alignments
(forward and reverse complement). Setting then this parameter to 200 reduces
the number of alignments to check to around 1.5-2 million.
As the assembly increases in loops (-AS:nol),
different
combinations of possible hits will be checked, always the probably best ones
first. So the accuracy of the assembly should only suffer when lowering this
number too much.
The align options control the behaviour of the Smith-Waterman alignment
routines. Only read pairs which are confirmed here may be included into
contigs. Affects both the checking of possible alignments found by SKIM as
well as the phase when reads are integrated into a contig.
- bandwidth_in_percent(bip)=integer>0 and <=100
- Default is
15.
The banded Smith-Waterman alignment uses this percentage
number to compute the bandwidth it has to use when computing the alignment
matrix. E.g., expected overlap is 150 bases, bip=10 -> the banded SW will
compute a band of 15 bases to each side of the expected alignment diagonal,
thus allowing up to 15 unbalanced inserts / deletes in the alignment.
INCREASING AND DECREASING THIS NUMBER: increase:
will find more
non-optimal alignments, but will also increase SW runtime between linear and
^2.decrease:
the other way round, might miss a few bad
alignments but gaining speed.
- bandwidth_min(bmin)=integer>0
- Default is
25.
Minimum bandwidth in bases to each side.
- bandwidth_max(bmax)=integer>0
- Default is
50.
Maximum bandwidth in bases to each side.
- min_overlap(mo)=integer>0
- Default is 15.
Minimum
number of overlapping bases needed in an alignment of two sequences to be accepted.
- min_score(ms)=integer>0
- Default is 15.
Describes the minimum score of an overlap to be taken into account for
assembly. mira
uses a default scoring scheme for SW align: each match
counts 1, a match with an N counts 0, each mismatch with a non-N base -1 and
each gap -2. Take a bigger score to weed out a number of chance matches, a
lower score to perhaps find the single (short) alignment that might join two
contigs together (at the expense of computing time and memory).
- min_relative_score(mrs)=integer>0 and <=100
- Default is
65.
Describes the min % of matching between two reads to be
considered for assembly. Increasing this number will save memory, but one
might loose possible alignments. I propose a maximum of 80 here.
Decreasing below 55% will make memory and time consumption probably
explode.
- extra_gap_penalty(egp)=on|yes|1, off|no|0
- Default
is no.
Defines whether or not to increase penalties applied
to alignments containing long gaps. Setting this to 'yes' might help in
projects with frequent repeats. On the other hand, it is definitively
disturbing when assembling very long reads containing multiple long indels
in the called base sequence ... although this should not happen in the first
place and is a sure sign for problems lying ahead.
When in doubt, set it to yes
for EST projects and de-novo genome
assembly, set it to no
for assembly of closely related strains
(assembly against a backbone).
When set to no,
it is recommended to have -CO:amgb
and -CO:amgbemc
both set to yes.
- egp_level(egpl)=low|0, medium|1, high|2, est_splitsplices|10
- Default is low.
Has no effect if
extra_gap_penalty is off. Defines an extra penalty applied to 'long' gaps.
There are these
are predefined levels: low - use this if you expect your base caller
frequently misses 2 or more bases. medium - use this if your base caller is
expected to frequently miss 1 to 2 bases. high - use this if your base
caller does not frequently miss more than 1 base.
For some stages of
the EST assembly process, a special value est_splitsplices
is used.
- egp_level(megpp)=0<=integer<=0
- Default is
100.
Has no effect if extra_gap_penalty is off. Defines the
maximum extra penalty in percent applied to 'long' gaps.
The contig options control the behaviour of the contig objects.
- name_prefix(np)=string
- Default is mira.
Contigs
will have this string prepended to their names.
- analysis(an)=none, text, signal
- Default is
signal.
When adding reads to a contig, dangerous regions can
get an extra integrity check.
none = no extra check.
text = check is only text-based.
signal = check is signal based, if the SCF trace is not available, fallback is 'text'.
For the time being, only regions tagged as ALUS or REPT in the experiment
file are considered dangerous.
- reject_on_drop_in_relscore(rodirs)=integer>0 and <=100
-
Default is 15.
When adding reads to a contig, reject the reads
if the drop in the quality of the consensus is > the given value in %.
Lower values mean stricter checking.
This value is doubled should a read be entered that has a template partner
(a read pair) at the right distance.
- danger_max_error_rate(dmer)=integer>=1 and <=100
- Default is
1.
When adding reads to a contig, reject the reads if the error
in zones known as dangerous exceeds the given value in %. Lower values
mean stricter checking in these danger zones.
For the time being, only regions tagged as ALUS or REPT in the experiment
file are considered dangerous.
- mark_repeats(mr)=on|yes|1, off|no|0
- Default is
yes.
One of the most important switches in MIRA: if set to
yes,
MIRA will try to resolve misassemblies due to repeats by
identifying single base stretch differences and tag those critical bases as
RMB (Repeat Marker Base, weak or strong). This switch is also needed when
MIRA is run in EST mode to identify possible inter-, intra- and
intra-and-interorganism SNPs.
- assume_snp_instead_repeat(asir)=on|yes|1, off|no|0
- Default is no.
Only takes effect when -CO:mr
(see above) is set to
yes,
effect is also dependent on the fact whether strain data
(see --SB:lsd)
is present or not.
Usually, mira
will mark bases that differentiate between repeats when
a conflict occurs between reads that belong to one strain. If the conflict
occurs between reads belonging to different strains, they are marked as
SNP. However, if this switch is set to yes,
conflict within a
strain are also marked as SNP.
This switch is mainly used in assemblies of ESTs, it should not be set for
genomic assembly.
- min_reads_per_group(mrpg)=integer >= 2
- Default is
2.
Only takes effect when -CO:mr
(see above) is set
to yes.
This defines the minimum number of reads in a group that
are needed for the RMB (Repeat Marker Bases) or SNP detection routines to be
triggered. A group is defined by the reads carrying the same nucleotide for
a given position, i.e., an assembly with mrpg=2 will need at least two
times two reads with the same nucleotide (having at least a quality as
defined in -CO:mgqrt)
to be recognised as repeat marker or a SNP.
Setting this to a low number increases sensitivity, but might produce a few
false positives, resulting in reads being thrown out of contigs because of
falsely identified possible repeat markers (or wrongly recognised as SNP).
- min_groupqual_for_rmb_tagging(mgqrt)=integer >= 25
- Default is
30.
Takes only effect when -CO:mr
is set to
yes.
This defines the minimum quality of a group of bases to be
taken into account as potential repeat marker. The lower the number, the
more sensitive you get, but lowering below 25 is not recommended as a lot of
wrongly called bases can have a quality approaching this value and you'd end
up with a lot of false positives. The higher the overall coverage of your
project, the better, and the higher you can set this number. A value of 35
will probably remove all false positives, a value of 40 will probably never
show false positives.
- endread_mark_exclusion_area(emea)=integer >= 0
- Default
is 15.
Takes only effect when -CO:mr
is set to
yes.
Using the end of sequences of Sanger type shotgun
sequencing is always a bit risky, as wrongly called bases tend to crowd
there or some sequencing vector relicts hang around. It is even more risky
to use these stretches for detecting possible repeats, so one can define an
exclusion area where the bases are not used when determining whether a
mismatch is due to repeats or not.
- also_mark_gap_bases(amgb)=on|yes|1, off|no|0
- Default is yes.
Determines whether columns containing gap bases (indels) are also tagged.
- also_mark_gap_bases_even_multicolumn(amgbemc)=on|yes|1, off|no|0
- Default is yes.
Takes effect only when -CO:amgb
is set to
yes.
Determines whether multiple columns containing gap bases
(indels) are also tagged.
- also_mark_gap_bases_need_both_strands(amgbnbs)=on|yes|1, off|no|0
- Default is yes.
Takes effect only when
-CO:amgb
is set to yes.
Determines whether both for
tagging columns containing gap bases, both strands.need to have a gap.
Setting this to no
is not recommended except when working in
desperately low coverage situations.
- defaultinsertsizeminimum(dismin)=integer >= 0
- Default is
500.
The minimum distance that read pairs may be
apart. There is an additional error margin of 10% subtracted from
this value during internal computations.
- defaultinsertsizemaximum(dismax)=integer >= 0
- Default is
5000.
The maximum distance that read pairs may be
apart. There is an additional error margin of 10% added to
this value during internal computations.
General options for controlling the integrated automatic editor.
- automatic_contig_editing(ace)=on|yes|1, off|no|0
-
Default is no.
Once contigs have been build, mira
can
call a built-in version of the automatic contig editor EdIt.
EdIt
will try to resolve discrepancies in the contig by performing
trace analysis and correct even hard to resolve errors. This option is
always useful, but especially in conjunction with -GE:nol
and
-GE:ure
(see above).
Notice: the current development version has a memory leak in the editor,
therefore the option is not automatically turned on.
- strict_editing_mode(sem)=on|yes|1, off|no|0
-
Default is yes.
If set to yes, the automatic editor will not
take error hypotheses with a low probability into account, even if all the
requirements to make an edit are fulfilled.
- confirmation_threshold(ct)=integer, 0 < x <= 100
-
Default is 50.
The higher this value, the more strict the
automatic editor will apply its internal rule set. Going below 40 is not
recommended.
General options for controlling where to find or where to write data.
- gap4da=<directoryname>
- Default is gap4da.
Defines the extension of the directory where mira
will write the
result of an assembly ready to import into the Staden package (GAP4) in
Direct Assembly format. The name of the directory will then be
<projectname>_.<extension>
- exp=<directoryname>
- Default is ..
Defines the
directory where mira
should search for experiment files (EXP).
- scf=<directoryname>
- Default is ..
Defines
the directory where mira
should search for SCF files.
- log=<directoryname>
- Default is miralog.
Defines the directory where mira
will write some log files to. Note
that the name of the actual project will be prepended.
The file options allows you to define your own input and output files.
- fofnexpin(fei)=string
- Default is
<projectname>_in.fofn.
Defines the file of filenames where
the names of the EXP files of a project are located.
- fofnphdin(fpi)=string
- Default is
<projectname>_in.fofn.
Defines the file of filenames where
the names of the PHD files of a project are located. Note: this is currently
not available.
- phdin(pi)=string
- Default is
<projectname>_in.phd.
Defines the file of where all the
sequences of a project are in PHD format.
- fastain(fai)=string
-
Default is mira_in.fasta.
Defines the fasta file to load
sequences of a project from.
- fastaqualin(fqi)=string
- Default is
mira_in.fasta.qual.
Defines the fasta file to load base
qualities of a project from. Although the order of reads in the quality file
does not need to be the same as in the fasta or fofn (although it saves a
bit of time if they are).
projects).
- cafin(ci)=string
-
Default is mira_in.caf.
Defines the file to load a CAF project
from. Filename must end with '.caf'.
- straindatain(sdi)=string
- Default is
mira_straindata_in.txt.
Defines the file to load straindata
from. Only used in EST projects (miraEST).
- xmltraceinfoin(xtii)=string
- Default is
mira_xmltraceinfo_in.xml.
Defines the file to load a trace
info file in XML format from. This can be used both when merging XML data to
loaded files or when loading a project from an XML trace info file.
- cafout(co)=string
- Default is mira_out.caf.
Defines the file in CAF format to save an assembled project to. Filename
must end with '.caf'.
Options for controlling which results to write to which type of files.
Additionally, a few options allow output customisation of textual
alignments (in text and HTML files).
There are 3 types of results: result, temporary results and extra temporary
results. One probably needs only the results. Temporary and extra temporary
results are written while building different stages of a contig and are given
as convenience for trying to find out why mira
set some RMBs or
disassembled some contigs.
Output can be generated in these formats: CAF, Gap4 Directed Assembly, FASTA,
ACE, HTML and simple text. Please note that the ACE output is experimental as
I don't have the necessary programs (phrap, consed) to verify the output.
(Anyone who wants to sponsor them? :-)
Naming conventions of the files follow the rules described in section
Input / Output,
subsection Filenames.
- output_result_caf(orc)=on|yes|1, off|no|0
- Default is
yes.
- output_result_gap4da(org)=on|yes|1, off|no|0
- Default is
yes.
- output_result_fasta(orf)=on|yes|1, off|no|0
- Default is
yes.
- output_result_ace(ora)=on|yes|1, off|no|0
- Default is
no.
- output_result_txt(ort)=on|yes|1, off|no|0
- Default is
yes.
- output_result_tcs(ors)=on|yes|1, off|no|0
- Default is
yes.
- output_result_html(orh)=on|yes|1, off|no|0
- Default is
no.
- output_tmpresult_caf(otc)=on|yes|1, off|no|0
- Default is
no.
- output_tmpresult_gap4da(otg)=on|yes|1, off|no|0
- Default is
no.
- output_tmpresult_fasta(otf)=on|yes|1, off|no|0
- Default is
no.
- output_tmpresult_ace(ota)=on|yes|1, off|no|0
- Default is
no.
- output_tmpresult_txt(ott)=on|yes|1, off|no|0
- Default is
no.
- output_result_tcs(ots)=on|yes|1, off|no|0
- Default is
no.
- output_tmpresult_html(oth)=on|yes|1, off|no|0
- Default is
no.
- output_exttmpresult_caf(oetc)=on|yes|1, off|no|0
- Default is
no.
- output_exttmpresult_gap4da(oetg)=on|yes|1, off|no|0
- Default is
no.
- output_exttmpresult_fasta(oetf)=on|yes|1, off|no|0
- Default is
no.
- output_exttmpresult_ace(oeta)=on|yes|1, off|no|0
- Default is
no.
- output_exttmpresult_txt(oett)=on|yes|1, off|no|0
- Default is
no.
- output_exttmpresult_html(oeth)=on|yes|1, off|no|0
- Default is
no.
- text_chars_per_line(tcpl)=integer > 0
- Default is
60.
When producing an output in text format
(-OUT:ort|ott|oett),
this parameter defines how many bases
each line of an alignment should contain.
- html_chars_per_line(tcpl)=integer > 0
- Default is
60.
When producing an output in HTML format,
(-OUT:orh|oth|oeth),
this parameter defines how many bases
each line of an alignment should contain.
- text_endgap_fillchar(tegfc)=<single character>
- Default is
(a blank). When producing an output in text format
(-OUT:ort|ott|oett),
endgaps are filled up with this
character.
- html_endgap_fillchar(hegfc)=<single character>
- Default is
(a blank). When producing an output in HTML format
(-OUT:orh|oth|oeth),
endgaps are filled up with this
character.
- <projectname>_in.fofn
- File of filenames containing the
names of the experiment or phd files to assemble when the
-GE:lj=FOFNEXP
or -GE:lj=FOFNPHD
option is used. One filename per line, blank lines accepted, lines
starting with a hash (#) treated as comment lines, nothing else. Use
-FI:fofnin(fei)
or -FI:fofnin(fpi)
to
change the default name.
- <projectname>_in.phd
- File containing the sequences (and
their qualities) to assemble in PHD format.
- <projectname>_in.fasta
- File containing sequences and ...
- <projectname>_in.fasta.qual
- ... file containing quality
values of sequences for the assembly in FASTA format.
- <projectname>_in.caf
- File containing the sequences (and
their qualities) to assemble in CAF format. This format also may
contain the result of an assembly (the contig consensus sequences).
- <projectname>_out.<type>
- Assembled project written in
type = (gap4da
/ caf
/ ace
/ fasta
/ html
/ tcs
/ text)
format by mira,
final result. Type
gap4da
is a directory containing experiment files and a file of
filenames (called 'fofn'), all other types are files. gap4da,
caf,
ace
contain the complete assembly information suitable
for import into different post-processing tools (gap4,
consed
and others). html
and text
contain visual
representations of the assembly suited for viewing in browsers or as simple
text file. tcs
is a summary of a contig suited for "quick" analyses
from command-line tools or even visual inspection.
fasta
contains the contig consensus sequences (and .fasta.qual the
consensus qualities). Please note that they come in two flavours:
padded
and unpadded.
The padded versions may
contains stars (*) denoting gap base positions where there was some minor
evidence for additional bases, but not strong enough to be considered as a
real base. Unpadded versions have these gaps removed. Padded versions have
an additional postfix .padded,
while unpadded versions do not have a
special postfix.
- <projectname>_info_contigreadlist.txt
- This file contains
information which reads have been assembled into which contigs (or
singlets).
- <projectname>_info_contigstats.txt
- This file contains
statistics about the contigs themselves, their length, average consensus
quality, number of reads, maximum and average coverage, average read length,
number of A, C, G, T, N, X and gaps in consensus.
- <projectname>_info_consensustaglist.txt
- This file contains
information about the tags (and their position) that are present in the
consensus of a contig.
- <projectname>_info_readstooshort
- A list containing the
names of those reads that have been sorted out of the assembly before any
processing started only due to the fact that they were too short.
- <projectname>_info_readtaglist.txt
- This file contains
information about the tags and their position that are present in each
read. The read positions are given relative to the forward direction of the
sequence (i.e. as it was entered into the the assembly).
- <projectname>_error_reads_invalid
- A list of sequences that
have been found to be invalid due to various reasons (given in the output of
the assembler).
- EXP
- Standard experiment files used in genome sequencing. Correct
EXP files are expected. Especially the ID record (containing the id of the
reading) and the LN record (containing the name of the corresponding trace
file) should be correctly set. See
http://www.sourceforge.net/projects/staden/
for links to online format
description.
- SCF
- The Staden trace file format that has established itself as
compact standard replacement for the much bigger ABI files. See
http://www.sourceforge.net/projects/staden/
for links to
online format description.
The SCF files should be V2-8bit, V2-16bit, V3-8bit or V3-16bit and can be
packed with compress or gzip.
- CAF
- Common Assembly Format (CAF) developed by the Sanger
Centre. http://www.sanger.ac.uk/Software/formats/CAF/
provides a
description of the format and some software documentation as well as the
source for compiling caf2gap
and gap2caf
(thanks to Rob Davies
for this).
- ACE
- The assembly file format used mainly by phrap
and
consed.
Support for .ace output is currently only in test status in
mira
as documentation on that format is ... sparse and I currently
don' have access to consed to verify my assumptions.
Using consed,
you will need to load projects with -nophd to view
them. Tags are rudimentary supported, this should improve over time. You
might also want to try clview
(http://www.tigr.org/tdb/tgi/software/)
from TIGR to look at .ace files.
- HTML
- Hypertext Markup Language. Projects written in HTML format
can be viewed directly with any table capable browser. Display is even
better if the browser knows style sheets (CSS).
- FASTA
- A simple format for sequence data, see
http://www.ncbi.nlm.nih.gov/BLAST/fasta.html.
An often used extension
of that format is used to also store quality values in a
similar fashion, these files have a .fasta.qual ending.
Mira
writes two kinds of FASTA files for results: padded
and
unpadded.
The difference is that the padded version still contains
the gap (pad) character (an asterisk) at positions in the consensus where
some of the reads apparently had some more bases than others but where the
consensus routines decided that to treat them as artefacts. The
unpadded
version has the gaps removed.
- PHD
- This file type originates from the phred
base caller
and contains basically -- along with some other status information -- the
base sequence, the base quality values and the peak indices, but not the
sequence traces itself.
- GBF, GBK
- GenBank file format as used at the NCBI to describe
sequences. mira
is able to read this format for using sequences as
backbones in an assembly. Features of the GenBank format are also
transferred automatically to Staden compatible tags.
- traceinfo.XML
- XML based file with information relating to
traces. Used at the NCBI and ENSEMBL trace archive to store additional
information (like clippings, insert sizes etc.) for projects. See
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc&m=main&s=rfc
for a description of the fields used.
- TCS
- Transpose Contig Summary. A text file as written by
mira
which gives a summary of a contig, one line per base. Nicely
suited for "quick" analyses from command line tools or even visual
inspection.
The actual stage of the assembly is written to STDOUT, giving status messages
on what mira
is actually doing.
Some debugging information might also be written to STDOUT if mira
generates error messages.
During the assembly, mira
might dump some messages to standard error.
Basically, three error classes exist:
- WARNING
- Messages in this error class do not stop the assembly but are
meant as an information to the user. In some rare cases these errors are due
to (an always possible) error in the I/O routines of mira,
but
nowadays they are mostly due to unexpected (read: wrong) input data and can
be traced back to errors in the preprocessing stages. If these errors arise,
you definitively DO
want to check how and why these errors came
into
those files in the first place.
Frequent cause for warnings include missing SCF files, SCF files containing
known quirks, EXP files containing known quirks etc.
- FATAL
- Messages in this error class actually stop the assembly. These
are mostly due to missing files that mira
needs or to very garbled
(wrong) input data.
Frequent causes include naming an experiment file in the 'file of filenames'
that could not be found on the disk, same experiment file twice in the
project, suspected errors in the EXP files, etc.
- INTERNAL
- These are true programming errors that were caught by internal
checks. Should this happen, please mail the output of STDOUT and STDERR to
the authors.
During assembly, mira
will write a whole lot of log files which all
will be placed into a subdirectory named "<projectname>_miralog"
Please do not delete the project where errors happened. I will get
in touch with you for additional information that might possibly be present in
temporary files and the log file. They are not deleted in cases like this.
mira
uses and sets a couple of tags during the assembly process. That
is, if information is known before the assembly, it can be stored in tags (in
the EXP and CAF formats) and will be used in the assembly.
- ALUS, REPT
- Sequence stretches tagged as ALUS (ALU Sequence) or REPT
(general repetitive sequence) will be handled with extreme care during the
assembly process. The allowed error rate after automatic contig editing
within these stretches is normally far below the general allowed error rate,
leading to much higher stringency during the assembly process and
subsequently to a better repeat resolving in many cases.
- SRM, WRM
- Strong
Repeat
Marker
and
Weak
Repeat
Marker.
These tags are set in two
flavours: as SRMr
and WRMr
when set in reads, and as
SRMc
and WRMc
when set in the consensus. These tags are
used on an individual per base basis for each read. They denote bases that
have been identified as crucial for resolving repeats, often denoting a
single SNP within several hundreds or thousands of bases. While a SRM is
quite certain, the WRM really is either weak (there wasn't enough
comforting information in the vicinity to be really sure) or not supported
by other probable repeat markers in the vicinity.
- SAO, SRO, SIO
- SNP
intrA
Organism,
SNP
R
Organism,
SNP
Intra
and
inter Organism.
As for SRM and WRM, these tags have a r
appended when set in reads and a c
appended when set in the
consensus. These tags denote SNP positions.
- F...
- GenBank features as described in GBF/GBK files or set in the
Staden package are used to make some SNP impact analysis on genes.
- other
- Other tags in reads will be read and passed through the assembly
without being changed and they currently do not influence the assembly
process.
- SRM, WRM
- See "Tags read (and used)" above for a description what
these tags mean.
mira
will automatically set these tags when it encounters repeats and
will tag exactly those bases that can be used to
discern the differences.
Seeing such a tag in the consensus means that mira
was not able to
finish the disentanglement of that special repeat stretch or that it found a
new one in one of the last loops without having the opportunity to resolve
the problem.
- SAO, SRO, SIO
- See "Tags read (and used)" above for a description what
these tags mean.
mira
will automatically set these tags when it encounters SNPs and
will tag exactly those bases that can be used to discern the differences.
They denote SNPs as they occur within an organism (SAO),
between two or more organisms (SRO) or within and between organisms (SIO).
Seeing such a tag in the consensus means that mira
set this as a
valid SNP in the assembly loop. Seeing such tags only in reads (but not in
the consensus) shows that in a previous loop, mira
thought these
bases to be SNPs but that in later loops, this SNP does not appear anymore
(perhaps due to resolving misassemblies).
- ED_C, ED_I, ED_D
- EDit Change, EDit Insertion, EDit Deletion. These tags
are set by the integrated automatic editor EdIt and show which edit actions have
been performed.
The assembly of ESTS can be done in two ways when using the MIRA2 system: by
using miraEST
or mira.
When using miraEST,
an automated
pipeline is run that is able to assemble transcripts cleanly according to
given organism strains. Afterwards, an integrated SNP analysis highlights the
exact nature of mutations within the strain transcripts. mira
can also
be used to assemble ESTs, but it is recommended only for projects that do not
need to sort transcripts according to strains.
miraEST
is a pipeline that reconstructs the pristine mRNA transcript
sequences gathered in EST sequencing projects, which can be a reliable basis
for subsequent analysis steps like clustering or exon analysis. This means
that even genes that contain only one transcripted SNP on different alleles
are first treated as different transcripts. The optional last step of the
assembly process can be configured as a simple clusterer that can assemble
transcripts containing the same exon sequence -- but only differ in SNP
positions -- into one consensus sequence. Such SNPs can then be analysed,
classified and reliably assigned to their corresponding mRNA transcriptome
sequence. How ever, it is important to note that miraEST
is an
assembler and not a full blown clustering tool.
Generally speaking, miraEST
is a three-stage assembly system that was
designed to catch SNPs in different strains and reconstruct the mRNA present
in those strains. That is, one really should have different strains to analyse
(and the information provided to the assembler) to make the most out of
miraEST.
Here is a quick overview on what miraEST
does:
- Step 1
- assemble everything together, not caring about strain
information. Potential SNPs are not treated as SNPs, but as possible repeat
marker bases and are tagged as such (temporarily) to catch each and every
possible sequence alignment which might be important later. As a result of
this stage, the following information is written out:
- Into "step1_snpsinSTRAIN_<strainname>.caf" all the sequences of a
given strain that are in contigs (can be aligned with at least one other
sequence) - also, all sequences that are singlets BUT have been tagged
previously as containing tagged bases showing that they aligned previously
(even to other strains) but were torn apart due to the SNP bases.
- Into "step1_nonsnps_remain.caf" all the singlets of all the strains
that have no such tag if and only if no strain information is provided
Obviously, if one did not provide strain information to the assembly of step
1, all the sequences belong to the same strain (named "default"). The CAF
files generated in this step are the input sequences for the next step.
- Step 2
- Now, miraEST
assembles each strain independently from
each other. Again, sequences containing SNPs are torn apart into different
contigs (or singlets) to give a clean representation of the "really
sequenced" ESTs. In the end, each of the contigs (or singlets) coming out of
the assemblies for the strains is a representation of the mRNA that was
floating around the given cell/strain/organism. The results of this step are
written out into one big file (step2_reads.caf) and a new straindata that
goes along with those results (step2_straindata.txt).
- Step 3
- miraEST
takes the result of the previous step (which
should now be clean transcripts) and assembles them together, *this time*
allowing transcripts from different strains with different SNP bases to be
assembled together. The result is then written to step3_out.* files and
directories.
miraEST
can also be used for EST data of a single strain or when no
strain information is available. In this case, it will cleanly sort out
transcripts of almost identical genes or, when eukaryotic ESTs are assembled,
according to their respective allele when these contain mutations.
Like the normal mira,
miraEST
keeps track on a lot of things and
writes out quite a lot of additional information files after each step.
Results and and additional information of step 1 are stored in "step1_*"
files. Results and information of step 2 are in "<strainname>_*" files
(and directories). For step 3, it's "step3_*" again.
Each step of miraEST
can be configured by a parameter file. These files
understand the same parameters as for mira.
However, they are presently
the only
way to configure miraEST,
i.e., miraEST
does not honour parameters given on the command line. Additionally, the
parameter files must
be named me_step1par,
me_step2.par
and me_step3.par.
The parameter files that
are present in the distribution are the ones currently used by default. By
simply copying them into the assembly directory (keep the name of the files),
miraEST
will load those instead of using built in defaults.
For the time being, the pipeline of miraEST
is not as flexible as
mira
itself. For example, every input and output file must have the
prefix ``step<number>''. Also, a straindata file must be present for step
1 (step1_straindata_in.txt),
but it can be an empty file.
Using mira
in EST projects is a bit more bare bone that the
miraEST
pipeline, but it is quite useful to get a first impression of a
given data set or when used in projects that have no strain or only one
strain.
It is recommended to use -estmode
as first argument on the command
line to switch to a good initial default and then eventually adapt with own
settings.
To use mira
itself, one doesn't need very much:
- the sequences in EXP, CAF, PHD, or FASTA format (ideally preprocessed)
- optionally: ancillary information in NCBI traceinfo XML format;
ancillary information about strains in tab delimited format
- mira, some memory and disk space
Viewing the results or preprocessing the sequences (these are just examples,
there are a lot more packages available):
- A browser who understands tables is needed to view the HTML output. A
browser knowing style sheets (CSS) is recommended, as different tags will be
highlighted. Konqueror, Opera, Mozilla, Netscape and Internet Explorer all do
fine, lynx is not really ... optimal.
- A text viewer for the different textual output files.
- You'll want GAP4 (generally speaking: the Staden package) to preprocess
the sequences, visualise and eventually rework the results when using gap4da
output. The Staden package comes with a fully featured sequence preparing
and annotating engine (pregap4)
that is very useful to preprocess
your data (conversion between file types,
quality clipping, tagging etc.).
See http://www.sourceforge.net/projects/staden/
for further
information and also a possibility to download precompiled binaries for
different platforms.
- phred(basecaller) - cross_match(sequence comparison and
filtering) - phrap(assembler)
- consed(assembly
viewer and
editor). This is another package that can be used for this type of job, but
requires more programming work. The fact that sequence stretches are masked
out (overwritten with the character X) if they shouldn't be used in an
assembly doesn't really help and is considered harmful (but it works).
See http://www.phrap.org/
for further information.
- Viewing .ace output without consed can be done with clview
(http://www.tigr.org/tdb/tgi/software/)
from TIGR.
- Paracel (http://www.paracel.com) offered useful tools for working
with sequences: TraceTuner (modified and improved basecaller based on
phred),
PFP (sequence filtering package) and PGA / PTA (genome and
EST assembler based on the CAP4 assembler) and the corresponding viewers.
The filtering package fortunately knows two modes of operation (masking
bases and tagging them) and therefore integrates nicely into existing
workflows based on the masking mode while offering advanced features
like tagging (through XML files) when used in that mode.
As Paracel closed shop, I do not know whether these package can still be
obtained.
As always, most of the time a combination of several different packages is
possible.
For re-assembling projects that were edited in gap4, one will also nee the
gap2caf
converter. The source for this is available at http://www.sanger.ac.uk/Software/formats/CAF/.
Memory consumption has decreased a bit since the V2.2.x versions.
- As a rough rule of thumb: for every base in a sequence you will have to
constantly provide somewhere between 15 to 20 bytes of RAM.
- A certain amount of temporary RAM is used for Smith-Waterman matches.
This amount depends on the size of your reads and is a square to their
length. Maximum usage examples: 4 MB for reads of length 1000, 16 MB for
length 2000, 100 MB for length 5000
NEW
since 2.5.4: Memory consumption during the fast read comparison
part (SKIM) has decreased dramatically. The partitioned SKIM algorithm now
uses a constant amount of memory to do the comparisons (currently set to
something between 6 and 7 MB).
As an example, 2GB of RAM (and some disk swap) are sufficient to assemble a
bacterium with 10 mega-bases and an 8x coverage.
The times given here are only approximate and were gathered on my small home
development box (Athlon 2400+) using a single processor and
some
debug code compiled in, somewhat slowing down the whole process. Here are the
times for a non-normalised (thus highly repetitive) EST project, 7651 reads
with a mean length of 450 used bases,
- The fast filtering algorithm performs about 1 million sequence
comparisons per second (26 seconds).
- Banded Smith-Waterman performs around 1000 sequence alignments per second
(with a 15% band to each side, which is quite generous), 2:13 for about
135000 alignment checks.
- Assembling the whole project in one main assembly loop, including
resolving very high coverage contigs (>500 sequences) in multiple passes and
splitting them into different SNP and splice variants took about 22 minutes.
Assembling for example a small genomic project with 720 reads forming a 35k
bases contig in three main loop iterations, resolving minor repeat
misassemblies, full read extension and automatic contig editing takes 56
seconds.
File Input / Output:
- mira can only read unedited EXP files.
- due to the fact that the Sanger Centre currently doesn't offer caf2gap
and gap2caf precompiled (and compiling them is a non-trivial task), the
option to reassemble GAP4 projects after working on them is currently not
possible.
- ACE output is still a bit shaky and might contain errors, especially in
the tag format.
- There sometimes is a (rather important) memory leak occurring while using
the integrated editor. I have not been able to trace the reason yet.
Assembly process:
- The routines for determining Repeat Marker Bases (RMB)
are sometimes too sensitive, which sometimes leads to excessive base tagging
and preventing right assemblies in subsequent assembly processes. The
parameters you should look at for this problem are
-CO:mrc:nrz:mgqrt:mgqwpc.
Also look at -CL:pvc
and
-CO:emea
if you have a lot of sequencing vector relicts at the
end of the sequences.
- mira cannot work (yet) with EXP files resulting from GAP4 that
already have been edited. If you want to reassemble an edited GAP4 project,
convert it to CAF format and use the -CL:lj=CAF
option.
- As also explained earlier, mira relies on sequencing vector being
recognised in preprocessing steps by other programs. Sometimes, when a whole
stretch of bases is not correctly marked as sequencing vector, the reads
might not be aligned into a contig although they might otherwise match quite
perfectly. You can use -CL:pvc
and -CO:emea
to address
problem with incomplete clipping of sequencing vectors. Also having the
assembler work with less strict parameters may help out of this.
- mira has been developed to assemble shotgun sequencing or EST
sequencing data. There are no explicit limitations concerning length or
number of sequences. However, there are a few implicit assumptions that were
made while writing portions of the code:
- Sequence data produced by electrophoresis rarely surpasses 1000 usable
bases and I never heard of more than 1100. The fast filtering SKIM relies
on the fact that sequences will never exceed 10000 bases in length.
- The next problem that might arise with 'unnatural' long sequence reads
will be my implementation of the Smith-Waterman alignment routines. I use
a banded version with linear running time (linear to the bandwidth) but
quadratic space usage. So, comparing two 'reads' of length 5000 will result
in memory usage of 100MB. I know that this could be considered as a flaw.
On the other hand - unless someone comes up with electrophoresis producing
reads with more than 2000 usable bases - I see no real need to change this
as long as there are more important things on the TODO list. Of course, if
anyone is willing to contribute a fast banded SW alignment routine which
runs in linear time and space, just feel free to contact the authors.
- Current data structures allow for a worst case read coverage of maximally
16384 reads on top of the other. Anyone who wants to comment on that?
- the 32-bit Linux version is limited by the memory made available by the
Linux kernel (somewhere around 2.3 to 2.7GB).
- the 64-bit Linux version has no implicit limits, although I think that
the maximum number of bases of all reads may not surpass 2.147.483.648
bases. With that, even small eukaryotes should be no problem.
- mira is not (yet) multi-threaded, but even bigger bacteria can be
assembled in 6 hours on a current hardware platform.
Note: Versions with uneven minor versions (e.g. 1.1.x, 1.3.x, ..., 2.1.x, ...
etc.) are development versions which might be unstable in parts (although I
don't think so). But to catch possible bugs, development versions of
mira
are distributed with tons of internal checks compiled into the
code, making it somewhere between 10% and 50% slower than it could be.
These are some of the topics on my TODO list for the next revisions to come:
- Making parts of the process multi-threaded (currently in work)
- Done as of 2.5.12! Hidden data (parts of a read that have been clipped
off by quality clipping) is almost optimally used by the assembler, but it
considerably slows down the process when searching for that. I am working on
optimising it.
- Others nifty ideas that I have not completely thought out yet.
To avoid the "garbage-in, garbage-out" problematic, mira
uses a 'high
quality alignments first' contig building strategy. This means that the
assembler will start with those regions of sequences that have been marked as
good quality (high confidence region - HCR) with low error probabilities (the
clipping must have been done by the base caller or other preprocessing
programs, e.g. pregap4)
and then gradually extends the alignments as
errors in different reads are resolved through error hypothesis verification
and signal analysis.
This assembly approach relies on some of the automatic editing functionality
provided by the EdIt package which has been integrated in parts within
mira.
This is an approximate overview on the steps that are executed while
assembling:
- All the experiment / phd / fasta sequences that act as input are
loaded (or the CAF project). Qualities for the bases are loaded from the
FASTA or SCF if needed.
- The high confidence region (HCR) of each read is compared with a quick
algorithm to the HCR of every other read to see if it could match and
have overlapping parts (this is the 'SKIM' filter).
- All the reads which could match are being checked with an adapted
Smith-Waterman alignment algorithm (banded version). Obvious mismatches are
rejected, the accepted alignments form one or several alignment graphs.
- Optional pre-assembly read extension step: mira tries to extend
HCR of reads by analysing the read pairs from the previous
alignment. This is a bit shaky as reads in this step have not been edited
yet, but it can help. Go back to step 2.
- A contig gets made by building a preliminary partial path through the
alignment graph (through in-depth analysis up to a given level) and then
adding the most probable overlap candidates to a given contig. Contigs may
reject reads if these introduce to many errors in the existing consensus.
Errors in regions known as dangerous (for the time being only ALUS and REPT)
get additional attention by performing simple signal analysis when alignment
discrepancies occur.
- Optional: the contig can be analysed and corrected by the automatic
editor (EdIt).
- Long term repeats are searched for, bases in reads of different repeats
that have been assembled together but differ sufficiently (for EdIT so that
they didn't get edited and by phred quality value) get tagged with RMB.
- Go back to step 5 if there are reads present that have not been
assembled into contigs.
- Optional post-assembly read-extension: mira extends the HCR of
the reads that have been assembled into the contigs.
- Optional: Detection of spoiler reads that prevent joining of
contigs. Remedy by shortening them.
- Optional: Write out a checkpoint assembly file and go back to step 2.
- The resulting project is written out to different output files and
directories.
There are no restrictions whatsoever built into the executable, neither in
time nor in the number of sequences. The only limit is given by the available
memory.
There are two kind of versions floating around: production and development.
Production versions are from the stable branch of the source code. These
versions are available for download on the web site of MIRA.
Development versions are from the development branch of the source tree. These
are also made available to the public and are distributed for users to test
out new functionality or to track down bugs or errors that might arise at a
given location. Release candidates (rc) also fall into the development
versions: they are usually the last versions of a given development branch.
They are made public to make the whole environment stable and iron out
inconsistencies.
Any version of this software and documentation is provided 'as-is', without
any express or implied warranty. In no event will the author be held liable
for any damages arising from the use of this software. Every kind of using the
software is only allowed with the acceptance of this license. 'Using'
certainly defines installing and running the software, it also includes
copying or providing the software or a service built upon its functionalities
for third-parties.
The author disclaims all warranties with regard to this software, in all
versions. Use it at your own risk.
- Production versions
- Permission to use, copy and distribute
production versions of this software and its documentation for any purpose is
hereby granted without fee, provided that this copyright and notice appears
in all copies.
- Development versions
- Permission to use development versions of this
software and its documentation for research purposes is hereby granted
without fee. Copying and redistribution of development versions or using
them for commercial projects is explicitly prohibited.
© 1997-2000 for MIRA V1.x and EdIt: Bastien Chevreux, Thomas Pfisterer,
Deutsches Krebsforschungszentrum Heidelberg -- Dept. of Molecular Biophysics.
© 2001-2006 for MIRA V2.x: Bastien Chevreux.
All rights reserved.
MIRA uses the excellent Expat library to parse XML files. Expat is Copyright
© 1998, 1999, 2000 Thai Open Source Software Center Ltd and Clark
Cooper as well as Copyright ©
2001, 2002 Expat maintainers.
See http://www.libexpat.org/
and
http://sourceforge.net/projects/expat/
for more information on Expat.
Bastien Chevreux (mira):
bach@chevreux.org
mira
can use automatic editing routines which were written by Thomas
Pfisterer (EdIt):
t.pfisterer@dkfz-heidelberg.de
WWW: http://www.chevreux.org/projects_mira.html
If you find this software useful, please send me a postcard. If postcards are
not available, a treasure chest full of spanish doubloons, gold and
jewellery will do nicely, thank you.
Please use these citations:
- mira
- Chevreux, B., Wetter, T. and Suhai, S. (1999): Genome Sequence Assembly Using Trace Signals and Additional Sequence Information.
Computer Science and Biology: Proceedings of the German
Conference on Bioinformatics (GCB) 99, pp. 45-56.
- miraEST
- Chevreux, B., Pfisterer, T., Drescher, B., Driesel, A. J.,
Müller, W. E., Wetter, T. and Suhai, S. (2004): Using the miraEST Assembler for Reliable and Automated mRNA Transcript Assembly and SNP Detection in Sequenced ESTs.
Genome Research, 14(6)
EdIt(1),
gap4(1),
pregap4(1),
ttuner(1),
scylla(1),
pga(1),
pta(1),
cap4(1),
phred(1),
phrap(1),
cross_match(1),
clview(1),
consed(1),
caf2gap(1),
gap2caf(1),
compress(1)
and gzip(1).

|