Sequence assembly with MIRA2

Bastien Chevreux

August 1, 2007

Version 2.8.2

mira - a genome and EST sequence assembly system.

Table of Contents

Synopsis

mira [-project=<name>] [-fasta[=<filename>] | -caf[=<filename>] | -phd[=<filename>]] [-genomedraft | -genomenormal | -genomeaccurate] [-mappingdraft | -mappingnormal | -mappingaccurate] [-clippinglight | -clippingnormal | -clippingheavy] [-highlyrepetitive] [-highqualitydata] [-borg] [-estmode] [-horrid] [-params=<filename>] [-GENERAL:arguments] [-STRAIN/BACKBONE:arguments] [-ASSEMBLY:arguments] [-DATAPROCESSING:arguments] [-CLIPPING:arguments] [-SKIM:arguments] [-ALIGN:arguments] [-CONTIG:arguments] [-EDIT:arguments] [-DIRECTORY:arguments] [-FILE:arguments] [-OUTPUT:arguments]

Description

mira is a multi-pass DNA sequence data assembly program (or "fragment assembler") for whole genome and EST projects. mira assembles reads / sequences gained by gel or capillary electrophoresis experiments into contiguous sequences (contigs). Input can be in various formats like Staden experiment (EXP), Sanger CAF, FASTA or PHD file.

If present, base qualities in phred(1) style and SCF signal electrophoresis trace files are used to adjudicate between or even correct contradictory stretches of bases in reads by either the integrated automatic EdIt editor (written by Thomas Pfisterer) or the assembler itself. Ancillary data containing additional information helpful to the assembly as is contained in, e.g. NCBI traceinfo XML files or Staden EXP files, is also honoured.

mira was conceived especially with the problem of repeats in genomic data and SNPs in EST data in mind. Considerable effort was made to develop a number of strategies -- ranging from standard clone-pair size restrictions to discovery and marking of base positions discriminating the different repeats / SNPs -- to ensure that repetitive elements are correctly resolved and that misassemblies do not occur.

The resulting assembly can be written in different standard formats like CAF, Staden GAP4 directed assembly, ACE, HTML, FASTA, simple text or transposed contig summary (TCS) files. These can easily be imported into numerous finishing tools or further evaluated with simple scripts.

The MIRA acronym stands for Mimicking Intelligent Read Assembly and the program pretty well does what its acronym says (well, most of the time anyway). The aim of mira is to build the best possible assembly by

  1. having a full overview on the whole project at any time of the assembly, i.e. knowledge of almost all possible read-pairs in a project,
  2. using high confidence regions (HCRs) of several aligned read-pairs to start contig building at a good anchor point of a contig, extending clipped regions of reads on a 'can be justified' basis.
  3. using all available data present at the time of assembly, i.e., instead of relying on sequence and base confidence values only, the assembler will profit from trace files containing electrophoresis signals, tags marking possible special attributes of DNA, information on specific insert sizes of read-pairs etc.
  4. having intelligent contig objects accept or refuse reads based on the rate of unexplainable errors introduced into the consensus
  5. discovering and analysing of possible repeats differentiated only by single nucleotide polymorphisms. The important bases for discriminating different repetitive elements are tagged.
  6. using the possibility given by the integrated automatic editor to correct errors present in contigs (and subsequently) reads by generating and verifying complex error hypotheses through analysis of trace signals in several reads covering the same area of a consensus,
  7. iteratively extending reads (and subsequently) contigs based on
    a) additional information gained by overlapping read pairs in contigs and
    b) corrections made by the automated editor.

mira -- written by Bastien Chevreux -- is part of a bigger project which also contains the automated editor / finisher EdIt package -- written by Thomas Pfisterer. The strength of mira and EdIt is the automatic interaction of both packages which produces assemblies with less work for human finishers to be done. This is the documentation belonging to the 2.8.2 release of the mira assembler.

MIRA2 has been rewritten in large parts; compared to the previous versions it supports a lot more in- and output files, got more accurate in the assembly, has better support for standard cases, learned to assemble EST sequences, has now a mode where it can assemble to pre-existing sequences (backbones) and is much faster.

I'd like to thank everybody who reported bugs to me, pointed out problems, sent ideas and suggestions they encountered while using the predecessors. Please continue to do so, the feedback made this second version possible.

Working modes

mira has three basic working modes: genome, EST or EST-reconstruction-and-SNP-detection. From version 2.4 on, there is only executable which supports all modes. The name with which this executable is called defines the working mode:

  1. mira(1) for assembly of genomic data as well as assembly of EST data from one strain/organism
    and
  2. miraEST(1) for assembly of EST data from different strains (or organisms) and SNP detection within this assembly.
Therefore, miraEST(1) is usually realised as a link to the mira(1) executable.

Parameters

Parameters can be given on the command line or via parameter files. Since version 2.6.0, mira knows two basic parameter types: quick switches and extensive switches.

  1. quick switches, also dubbed DWIM switches (for 'Do-What-I-Mean'), are easy to use-and-combine switches activating parameter collections for predefined tasks that will suit most people's needs.
  2. extensive switches offer a way to set about any possible parameter to configure mira for any kind of special need. While the format of extensive switches might look a little bit strange, it is borrowed from the SGI C compiler options and allows both compact command lines as well as readable and / or script generated parameter files.

The mira(1) command line options accept several arguments and allow a user to specify a setting for each argument. To specify multiple arguments, use colons to separate each argument on the command line. You may either use the long or the short form of each argument, the later being given in brackets.

As example, a typical call of mira(1) using quick switches and some tweaking with extended switches on the command line could look like this:

mira -fasta -genomedraft -ALIGN:min_relative_score=70 -GENERAL:use_template_information=yes -CONTIG:defaultinsertsizeminimum=500:defaultinsertsizemaximum=2500

or in short form

mira -fasta -genomedraft -AL:mrs=70 -GE:uti=yes -CO:dismin=500:dismax=2500

Please note that it is also perfectly legal to write

mira -fasta -genomedraft -AL:mrs=70 -GE:uti=yes -CO:dismin=500 -CO:dismax=2500

(just for those of you who want to use mira in scripted environments and / or write scripts for it).

There are example parameter files included in the distribution which show how to format parameters in files.

Quick switches

These switches are 'Do-What-I-Mean' parameter collections for predefined tasks which should suit most people's needs.

Order dependent quick switches
For some switches, the order of appearance in the command line (or parameter file) is important. It is generally a good idea to place them in the order as described in this documentation, e.g. always write -genomedraft -clippingheavy and not -clippingheavy -genomedraft. This is true for the following switches:
[-genomedraft | -genomenormal | -genomeaccurate]
One-stop-switches for three quality grades of de-novo genome assembly. Draft is quick-and-dirty, suited to get a first look on approximate coverage of a running project. Should not be used for anything else. Normal is the default parameter set of mira that is able to tackle most genomes. A bit slower than the draft version, but includes such options as read extension and vector remnant clipping. Accurate is still slower than the normal mode but should be used for genomes that pose a problem to the normal mode.
[-mappingdraft | -mappingnormal | -mappingaccurate]
Work like the [-genome...] switches except they are to be used when performing mapping assemblies against given backbone sequences.
[-clippinglight | -clippingmedium | -clippingheavy]
Three clipping grade modifiers, from light clipping when working with well preprocessed sequences to heavy clipping when the sequences that are being assembled had only sloppy or no preprocessing.
Note 1: the light version is already included in the [-genome...] and [-mapping...] switches.
Note 2: it is recommended to perform a thorough preprocessing (clipping sequencing vector stretches, clipping of low quality bases, tagging standard repeats etc.) before assembling sequences. The clipping routines of mira are more optimised to cope with last remnants of wrongly preprocessed sequences than with sequences having had no pre-processing at all.
[-highlyrepetitive]
A modifier switch for genome data that is deemed to be highly repetitive. The assemblies will run slower due to more iterative cycles that give mira a chance to resolve nasty repeats.
[-highqualitydata]
A modifier switch when the sequences that are used are of exceptional quality. mira will then bump up a few quality parameters which should lead to less false positives in the repeat and SNP detection routines.

Order independent quick switches
The following switches can be placed anywhere on the command line without interfering with other switches:
[-fasta | -fasta=<filename>]
Sets parameters suited for loading sequences from FASTA files. The version with =<filename> will also set the input file to the given filename.
[-phd | -phd=<filename>]
Sets parameters suited for loading sequences from PHD files. The version with =<filename> will also set the input file to the given filename.
[-caf | -caf=<filename>]
Sets parameters suited for loading sequences from CAF files. The version with =<filename> will also set the input file to the given filename.
[-project=<name>]
Default is mira. Defines the project name for this assembly. The project name automatically influences the name of input and output files / directories. E.g. in the default setting, the file names for the output of the assembly in FASTA format would be "mira_out.fasta" and "mira_out.fasta.qual". Setting the project name to "MyProject" would generate "MyProject_out.fasta" and "MyProject_out.fasta.qual". See also -FILE: and -DIRECTORY: for a list of names that are influenced.
[-params|-parameterfile=<filename>]
Loads parameters from the filename given. Allows a maximum of 10 levels of recursion, i.e. a -params option appearing within a file that loads other parameter files (though I cannot think of useful applications with more than 3 levels).

Other quick switches
Please note that the switches -estmode, -horrid, -borg imply that quality values are available in the assembly. If not, these switches will not work properly and the results obtained not the intended ones. These switches are ususally not combined with order dependent switches from above.
[-estmode]
Switches mira to a good initial preset for assembling EST data. Note that this is not needed (and even counterproductive) when used with miraEST.
[-horrid]
Sets a number of parameters useful when dealing with really horrid data sets. Useful means that parameters are chosen to so that time and memory consumption do not explode beyond all hope of the program returning. Note that MIRA will return in most cases useful assemblies with this switch, but these might not be as optimised as with normal operation.
The definition of "horrid" being a bit flexible, this is what I would call horrid: (a) a genomic projects with more that 2.000 reads that all seem to align partly to each other but have different repetitive structures or (b) EST clusters with a few thousand almost similar reads.
[-borg]
Sets a bunch of parameters to have mira try to assemble as many reads as possible. Will probably slow down the assembly process and use more memory.
"We are MIRA of borg. You will be assembled, resistance is futile!"

Note: A double dash (e.g. --params) may also be used instead of a single one in front of the quick switches.

Examples for using these switches can be found documentation file describing mira usage.

Extensive switches

Extensive switches open up the full panoply of possibilities the MIRA assembler offers. This ranges from fine-tuning assemblies with the quick switches from above to setting parameters in a way so that mira is suited also for very special assembly cases.

-GENERAL (-GE)
General options control the type of assembly to be performed and other switches not belonging anywhere else.
project(pro)=string
Same as the quick switch [-project].
load_job(lj)=fofnexp, fasta, caf, phd, fofnphd
Default is fofnexp. Defines whether to load and assemble EXP files from a file of filenames ("mira_in.fofn"), load and assemble FASTA sequences ("mira_in.fasta") and their qualities ("mira_in.fasta.qual"), load and assemble sequences or qualities from a phd file ("mira_in.phd") or to load a project from a CAF file ("mira_in.caf") and assemble or eventually reassemble it. Note: fofnphd currently not available.
filecheck_only(fo)=on|yes|1, off|no|0
Default is no. If set to yes, the project will not be assembled and no assembly output files will be produced. Instead, the project files will only be loaded. This switch is useful for checking consistency of input files.
merge_xmltraceinfo(mxti)=on|yes|1, off|no|0
Default is no. Some file formats above (FASTA, PHD or even CAF and EXP) possibly do not contain all the info necessary or useful for each read of an assembly. Should additional information -- like clipping positions etc. -- be available in a XML trace info file in NCBI format (see File formats), then set this option to yes and it will be merged to the data loaded. See also -FILE: for the name of the XML file to load.
Please note: quality clippings given here will override quality clippings loaded earlier or performed by mira. Minimum clippings will still be made by the program, though.
readnaming_scheme(rns)=sanger, tigr
Default is sanger. Defines the centre naming scheme for read suffixes. Currently, only Sanger Institute and TIGR naming schemes are supported out of the box.
How to choose: please read the documentation available at the different centres or ask your sequence provider. In a nutshell, Sanger scheme is "somename.[pqsfrw][12][bckdeflmnpt][a|b|c|..." (e.g. U13a08f10.p1ca), TIGR scheme is "somenameTF*|TR*|TA*" (e.g. GCPBN02TF or GCPDL68TABRPT103A58B).
external_quality(eq)=none, SCF
Default is SCF. Defines the source format for reading qualities from external sources. Normally takes effect only when these are not present in the format of the load_job project (EXP and FASTA can have them, CAF and PHD must have them).
external_quality_override(eqo)=on|yes|1, off|no|0
Default is no, only takes effect when load_job is fofnexp. Defines whether or not the qualities from the external source override the possibly loaded qualities from the load_job project. This might be of use in case some post-processing software fiddles around with the quality values of the input file but one wants to have the original ones.
discard_read_on_eq_error(droeqe)=on|yes|1, off|no|0
Default is yes. Should there be a major mismatch between the external quality source and the sequence (e.g.: the base sequence read from a SCF file does not match the originally read base sequence), should the read be excluded from assembly or not. If not, it will use the qualities it had before trying to load the external qualities (either default qualities or the ones loaded from the original source).
use_template_information(uti)=on|yes|1, off|no|0
Default is Yes. Two reads sequenced from the same clone template form a read pair with a known minimum and maximum distance. This feature will definitively help for contigs containing lots of repeats. Set this to 'yes' if your data contains information on insert sizes.
Information on insert sizes can be given via the SI tag in EXP files (for each read pair individually), or for the whole project using -CO:dismin and -CO:dismax (see below).
est_startstep(ess)=1 <= integer <= 4
Default is 1. Controls the starting step of the EST assembly and is therefore only useful in miraEST.
EST assembly is a three step process, each with different settings to the assembly engine, with the result of each step being saved to disk. If results of previous steps are present in a directory, one can easily "play around" with different setting for subsequent steps by reusing the results of the previous steps and directly starting with step two or three.
print_date(ps)=on|yes|1, off|no|0
Default is yes. Controls whether date and time are printed out during the assembly. Suppressing it is not useful in normal operation, only when debugging or benchmarking.

-STRAIN/BACKBONE (-SB)
General options for controlling strain and backbone information
load_straindata(lsd)=on|yes|1, off|no|0
Default is no. Straindata is a key value file, one read per line. First the name of the read, then the strain name of the organism the read comes from. It is used by the program to differentiate different types of SNPs appearing in organisms and classifying them.
load_backbone(lb)=on|yes|1, off|no|0
Default is no. A backbone is a sequence (or a previous assembly) that is used as template for the current assembly. The current assembly process will assemble reads first to those loaded backbone contigs before creating new contigs.
This feature is helpful for assembling against previous (and already possibly edited) assembly iterations, or to make a comparative assembly of two very closely related organisms. Please read "very closely related" as in: only SNP mutations or short indels present.
startbackboneusage_inpass(sbuip)=0 < integer
Default is 3. When assembling against backbones, this parameter defines the pass iteration (see -AS:nop) from which on the backbones will be really used. In the passes preceeding this number, the non-backbone reads will be assembled together as if no backbones existed. This allows mira to correctly spot repetitive stretches that differ by single bases and tag them accordingly.
Rule of thumb: if backbones belong to same strain as reads to assemble, set to 1. If backbones are a different strain, then set -SB:sbuib to 1 lower than -AS:nop (example: nop=4 and sbuip=3).
backbone_strain_name(bsn)=string
Default is <empty string>. Defines the name of the strain that the backbone sequences have.
backbone_filetype(bft)=fasta, caf, gbf
Default is fasta. Defines the filetype of the backbone file given. Currently (2.8.2 ) only FASTA, CAF and GBF files are supported.
When GBF (GenBank files, also named .gbk) files are loaded, the features within theses files are automatically transformed into Staden compatible tags and get passed through the assembly.
backbone_raillength(brl)=1000 <= integer <= 3000
Default is 2500. Parameter for the internal sectioning size of the backbone. Extremely repetitive sequences may require reducing the default value, but the default value should work well in 99.9% of all cases.
backbone_basequality(bbq)=-1 <= integer <= 100
Default is -1. Defines the default quality that the backbone sequences have if they came without quality values in their files (like in GBF format or when FASTA is used without .qual files). A value of -1 mira to use the same default quality for backbones as for reads.
also_build_new_contigs(abnc)=on|yes|1, off|no|0
Default is yes. Standard mode of the assembler is to assemble available reads to a backbone and make new contigs with remaining reads. If this option is set to no, the reads that cannot be assembled into exsiting contigs are put as singlets into the assembly, not forming new contigs.

-ASSEMBLY (-AS)
General options for controlling the assembly.
minimum_read_length(mrl)=integer >= 20
Default is 40. Minimum length that reads must have to be considered for the assembly. Shorter sequences will be filtered out at the beginning of the process and won't be present in the final project.
num_of_passes(nop)=integer > 0
Default is 3. Defines how many iterations of the whole assembly process are done. Rule of thumb: quick and dirty assembly: use 1 (not recommended). Assembly using read extensions and / or automatic contig editing (-DP:ure and -ED:ace, see below): at least 2. The recommended setting is 3 or higher, as some knowledge generated by the assembler can be used only from the third iteration on. More than 3 passes might be useful for projects containing many repetitive elements. See also -AS:rbl and -AS:mr below for parameters that affect the assembly and disentanglement of possible repeats.
skim_each_pass(sep)=on|yes|1, off|no|0
Default is yes. Defines whether the skim algorithm (and with it also the recalculation of Smith-Waterman alignments) is called inbetween each main pass. If set to no, skimming is done only when needed by the workflow: either when read extensions are searched for (-DP:ure) or when possible vector leftovers are to be clipped (-CL:pvc).
Setting this option to yes is highly recommended, setting it to no only for quick and dirty assemblies.
rmb_break_loops(rbl)=integer > 0
Default is 2. Defines the maximum number of times a contig can be rebuilt during a main assembly passes (-AS:nop) if misassemblies due to possible repeats are found.
spoiler_detection(sd)=on|yes|1, off|no|0
Default is yes for mira and no miraEST. A spoiler can be either a chimeric read or it is a read with long parts of unclipped vector sequence still included (that was too long for the -CL:pvc vector leftover clipping routines). A spoiler typically prevents contigs to be joined, MIRA will cut them back so that they represent no more harm to the assembly.
Recommended for assemblies of mid- to high-coverage genomic assemblies, not recommended for assemblies of ESTs as one might loose splice variants with that,
A minimum number of two assembly passes (-AS:nop) must be run for this option to take effect.
sd_last_pass_only(sdlpo)=on|yes|1, off|no|0
Default is yes. Defines whether the spoiler detection algorithms are run only for the last pass or for all passes (-AS:nop).
Takes effect only if spoiler detection (-AS:sd) is on.
base_default_quality(bdq)=integer >= 0
Default is 10. Defines the default base quality of reads that have no quality read from file.
use_genomic_pathfinder(ugpf)=on|yes|1, off|no|0
Default is yes. MIRA has two different pathfinder algorithms it chooses from to find its way through the (more or less) complete set of possible sequence overlaps: a genomic and an EST pathfinder. The genomic looks a bit into the future of the assembly and tries to stay on safe grounds using a maximum of information already present in the contig that is being built. The EST version on the contrary will directly jump at the complex cases posed by very similar repetitive sequences and try to solve those first and is willing to fall down to brute force when really bad cases (like, e.g., coverage with thousands of sequences) are encountered.
Generally, the genomic pathfinder will also work quite well with EST sequences (but might get slowed down a lot in pathological cases), while the EST algorithm does not work so well on genomes. If in doubt, leave on yes for genome projects and set to no for EST projects.
use_emergency_search_stop(uess)=on|yes|1, off|no|0
Default is yes. Another important switch if you plan to assemble non-normalised EST libraries, where some ESTs may reach coverages of several hundreds or thousands of reads. This switch lets MIRA save a lot of computational time when aligning those extremely high coverage areas (but only there), at the expense of some accuracy.
ess_partnerdepth(esspd)=integer > 0
Default is 500. Defines the number of potential partners a read must have for MIRA switching into emergency search stop mode for that read.
use_max_contig_buildtime(umcbt)=on|yes|1,off|no|0
Default is no. Defines whether there is an upper limit of time to be used to build one contig. Set this to yes in EST assemblies where you think that extremely high coverages occur. Less useful for assembly of genomic sequences.
buildtime_in_seconds(bts)=integer > 0
Default is 10000. Depending on -AS:umcbt above, this number defines the time in seconds alloted to building one contig.

-DATAPROCESSING (-DP)
Options for controlling some data processing during the assembly.
use_read_extension(ure)=on|yes|1, off|no|0
Default is yes. mira expects the sequences it is given to be quality clipped. During the assembly though, it will try to extend reads into the clipped region by analysing Smith-Waterman alignments between reads that were found to be valid. Only the right clip is extended though, the left clip (most of the time containing sequencing vector) is never touched.
read_extension_window_length(rewl)=integer > 0
Default is 30. Only takes effect when -DP:ure (see above) is set to yes. The read extension routines use a sliding window approach on Smith-Waterman alignments. This parameter defines the window length.
read_extension_with_maxerrors(rewme)=integer > 0
Default is 2. Only takes effect when -DP:ure (see above) is set to yes. The read extension routines use a sliding window approach on Smith-Waterman alignments. This parameter defines the number maximum number of errors (=disagreements) between two alignment in the given window.
first_extension_in_pass(feip)=integer >= 0
Default is 0. Only takes effect when -DP:ure (see above) is set to yes. The read extension routines can be called before assembly and/or after each assembly pass (see -AS:nop). This parameter defines the first pass in which the read extension routines are called. The default of 0 tells mira to extend the reads the first time before the first assembly pass.
last_extension_in_pass(leip)=integer >= 0
Default is 0. Only takes effect when -DP:ure (see above) is set to yes. The read extension routines can be called before assembly and/or after each assembly pass (see -AS:nop). This parameter defines the last pass in which the read extension routines are called. The default of 0 tells mira to extend the reads the last time before the first assembly pass.
tag_polyat_at_ends(tpae)=on|yes|1, off|no|0
Default is no. This option is useful in EST assembly. Poly-AT stretches at end of reads that were not correctly masked or clipped in preprocessing steps from external programs get tagged here. The assembler will not use these stretches for critical operations. Additionally, the tags do provide a good visual anchor when looking at the assembly with different programs.
polybase_window_len(pbwl)=integer > 0
Default is 7. Only takes effect when -DP:tpae (see above) is set to yes. Defines the window length within which all bases (except the maximum number of errors allowed, see below) must be either A or T to be considered a polybase stretch.
polybase_window_maxerrors(pbwme)=integer > 0
Default is 2. Only takes effect when -DP:tpae (see above) is set to yes. Defines the maximum number of errors allowed in a given window length (see above) so that a stretch is considered to be a polybase stretch. The distribution of these errors is not important.
polybase_window_gracedistance(pbwgd)=integer > 0
Default is 9. Only takes effect when -DP:tpae (see above) is set to yes.Defines the number of bases from the end of a sequence (if masked: from the end of the masked area) within which a polybase stretch is looked for without finding one.

-CLIPPING (-CL)
Controls for clipping options: when and how sequences should be clipped.
possible_vector_clip(pvc)=on|yes|1, off|no|0
Default is yes. mira will try to identify possible sequencing vector relicts present at the start of a sequence and clip them away. These relicts are usually a few bases long and were not correctly removed from the sequence in data preprocessing steps of external programs.
You might want to turn off this option if you know (or think) that your data contains a lot of repeats and the option below to fine tune the clipping behaviour does not give the expected results.

pvc_maxlenallowed(pvcmla)=integer >= 0
Default is 18. The clipping of possible vector relicts option works quite well. Unfortunately, especially the bounds of repeats or differences in EST splice variants sometimes show the same alignment behaviour than possible sequencing vector relicts and could therefore also be clipped.
To refrain the vector clipping from mistakenly clip repetitive regions or EST splice variants, this option puts an upper bound to the number of bases a potential clip is allowed to have. If the number of bases is below or equal to this threshold, the bases are clipped. If the number of bases exceeds the threshold, the clip is NOT performed.
Setting the value to 0 turns off the threshold, i.e., clips are then always performed if a potential vector was found.
quality_clip(qc)=on|yes|1, off|no|0
Default is no, but is set automatically to yes when using the quickmode switches -fasta or -phd (can be turned off again by subsequent options afterwards). This will let mira perform its own quality clipping before sequences are entered into the assembly. The clip function performed is a sequence end window quality clip with back iteration to get a maximum number of bases as useful sequence. Note that the bases clipped away here can still be used afterwards if there is enough evidence supporting their correctness when the option -DP:ure is turned on.
qc_minimum_quality(qcmq)=integer >= 15 and <= 35
Default is 20. This is the minimum quality bases in a window require to be accepted. Please be cautious not to take too extreme values here, because then the clipping will be too lax or too harsh. Values below 15 and higher than 30-35 are not recommended.
qc_window_length(qcwl)=integer >= 10
Default is 30. This is the length of a window in bases for the quality clip.
maskedbases_clip(mbc)=on|yes|1, off|no|0
Default is yes. This will let mira perform a 'clipping' of bases that were masked out (replaced with the character X). It is generally not a good idea to use mask bases to remove unwanted portions of a sequence, the EXP file format and the NCBI traceinfo format have excellent possibilities to circumvent this. But because a lot of preprocessing software are built around cross_match, scylla- and phrap-style of base masking, the need arised for mira to be able to handle this, too. mira will look at the start and end of each sequence to see whether there are masked bases that should be 'clipped'.
mbc_gap_size(mbcgs)=integer >= 0
Default is 20. While performing the clip of masked bases, mira will look if it can merge larger chunks of masked bases that are a maximum of -CL:mbcgs apart.
mbc_max_front_gap(mbcmfg)=integer >= 0
Default is 40. While performing the clip of masked bases at the start of a sequence, mira will allow up to this number of unmasked bases in front of a masked stretch.
mbc_max_end_gap(mbcmeg)=integer >= 0
Default is 60. While performing the clip of masked bases at the end of a sequence, mira will allow up to this number of unmasked bases behind a masked stretch.
ensure_minimum_left_clip(emlc)=on|yes|1, off|no|0
Default is yes. If on, ensures a minimum left clip on each read according to the parameters in -CL:mlcr:smlc
minimum_left_clip_required(mlcr)=integer >= 0
Default is 25. If -CL:emlc is on, checks whether there is a left clip which length is at least the one specified here.
set_minimum_left_clip(smlc)=integer >= 0
Default is 30. If -CL:emlc is on and actual left clip is < -CL:mlcr, set left clip of read to the value given here.

-SKIM (-SK)
Options that control the behaviour of the initial fast all-against-all read comparison algorithm. Matches found here will be confirmed later in the alignment phase. The new SKIM3 algorithm that is in place since version 2.7.4 uses a hash based algorithm that works similarly to SSAHA (see Ning Z, Cox AJ, Mullikin JC; ``SSAHA: a fast search method for large DNA databases.''; Genome Res. 2001;11;1725-9)

The major differences of SKIM3 and SSAHA are:

  1. the word length n of a hash can be up to 30 bases (in 64 bit versions or MIRA)
  2. a parameter controls the stepping increment s with which hashes are generated. This allows for a more fine grained search as matches are now found with at least n+s equal bases instead of the SSAHA 2n
  3. SKIM3 uses a fixed amount of RAM that is independent of the word size. E.g., SSAHA would need 4 Exabyte to work with word length of 30 bases ... SKIM3 just takes a couple of hundred MB (but is a bit slower than SSAHA).

The parameters for SKIM3:

bases_per_hash(bph)=integer >= 1
Default is 14 on 32 bit systems and 16 on 64 bit systems. Controls the number of consecutive bases n which are used as a word hash. The higher the value, the faster the search. The lower the value, the more weak matches are found. Values below 10 are not recommended.
hash_saving_step(hss)=integer >= 1
Default is 4. This is a parameter controlling the stepping increment s with which hashes are generated. This allows for a more fine grained search as matches are now found with at least n+s (see -OptArg{-SK:bph}) equal bases instead of the SSAHA 2n. The higher the value, the faster the search. The lower the value, the more weak matches are found.
percent_required(pr)=integer >= 1
Default is 50. Controls the relative percentage of exact word matches in an approximate overlap that has to be reached to accept this overlap as possible match. Increasing this number will decrease the number of possible alignments that have to be checked by Smith-Waterman later on in the assembly, but it also might lead to the rejection of weaker overlaps (i.e. overlaps that contain a higher number of mismatches).
maxhits_perread(mhpr)=integer >= 1
Default is 200. Controls the maximum number of possible hits one read can maximally transport to the Smith-Waterman alignment phase. If more potential hits are found, only the best ones are taken.
This is an important option for tackling projects which contain extreme assembly conditions. For example, 5000 reads that are all very similar would generate around 40 to 50 million possible alignments (forward and reverse complement). Setting then this parameter to 200 reduces the number of alignments to check to around 1.5-2 million.
As the assembly increases in passes (-AS:nop), different combinations of possible hits will be checked, always the probably best ones first. So the accuracy of the assembly should only suffer when lowering this number too much.

-ALIGN (-AL)
The align options control the behaviour of the Smith-Waterman alignment routines. Only read pairs which are confirmed here may be included into contigs. Affects both the checking of possible alignments found by SKIM as well as the phase when reads are integrated into a contig.
bandwidth_in_percent(bip)=integer>0 and <=100
Default is 15. The banded Smith-Waterman alignment uses this percentage number to compute the bandwidth it has to use when computing the alignment matrix. E.g., expected overlap is 150 bases, bip=10 -> the banded SW will compute a band of 15 bases to each side of the expected alignment diagonal, thus allowing up to 15 unbalanced inserts / deletes in the alignment. INCREASING AND DECREASING THIS NUMBER: increase: will find more non-optimal alignments, but will also increase SW runtime between linear and ^2.decrease: the other way round, might miss a few bad alignments but gaining speed.
bandwidth_min(bmin)=integer>0
Default is 25. Minimum bandwidth in bases to each side.
bandwidth_max(bmax)=integer>0
Default is 50. Maximum bandwidth in bases to each side.
min_overlap(mo)=integer>0
Default is 15. Minimum number of overlapping bases needed in an alignment of two sequences to be accepted.
min_score(ms)=integer>0
Default is 15. Describes the minimum score of an overlap to be taken into account for assembly. mira uses a default scoring scheme for SW align: each match counts 1, a match with an N counts 0, each mismatch with a non-N base -1 and each gap -2. Take a bigger score to weed out a number of chance matches, a lower score to perhaps find the single (short) alignment that might join two contigs together (at the expense of computing time and memory).
min_relative_score(mrs)=integer>0 and <=100
Default is 65. Describes the min % of matching between two reads to be considered for assembly. Increasing this number will save memory, but one might loose possible alignments. I propose a maximum of 80 here. Decreasing below 55% will make memory and time consumption probably explode.
extra_gap_penalty(egp)=on|yes|1, off|no|0
Default is no. Defines whether or not to increase penalties applied to alignments containing long gaps. Setting this to 'yes' might help in projects with frequent repeats. On the other hand, it is definitively disturbing when assembling very long reads containing multiple long indels in the called base sequence ... although this should not happen in the first place and is a sure sign for problems lying ahead.
When in doubt, set it to yes for EST projects and de-novo genome assembly, set it to no for assembly of closely related strains (assembly against a backbone).
When set to no, it is recommended to have -CO:amgb and -CO:amgbemc both set to yes.
egp_level(egpl)=low|0, medium|1, high|2, est_splitsplices|10
Default is low. Has no effect if extra_gap_penalty is off. Defines an extra penalty applied to 'long' gaps. There are these are predefined levels: low - use this if you expect your base caller frequently misses 2 or more bases. medium - use this if your base caller is expected to frequently miss 1 to 2 bases. high - use this if your base caller does not frequently miss more than 1 base.
For some stages of the EST assembly process, a special value est_splitsplices is used.
egp_level(megpp)=0<=integer<=0
Default is 100. Has no effect if extra_gap_penalty is off. Defines the maximum extra penalty in percent applied to 'long' gaps.

-CONTIG (-CO)
The contig options control the behaviour of the contig objects.
name_prefix(np)=string
Default is mira. Contigs will have this string prepended to their names.
analysis(an)=none, text, signal
Default is signal. When adding reads to a contig, dangerous regions can get an extra integrity check.
  none = no extra check.
  text = check is only text-based.
  signal = check is signal based, if the SCF trace is not available, fallback is 'text'.
For the time being, only regions tagged as ALUS or REPT in the experiment file are considered dangerous.
reject_on_drop_in_relscore(rodirs)=integer>0 and <=100
Default is 15. When adding reads to a contig, reject the reads if the drop in the quality of the consensus is > the given value in %. Lower values mean stricter checking. This value is doubled should a read be entered that has a template partner (a read pair) at the right distance.
danger_max_error_rate(dmer)=integer>=1 and <=100
Default is 1. When adding reads to a contig, reject the reads if the error in zones known as dangerous exceeds the given value in %. Lower values mean stricter checking in these danger zones.
For the time being, only regions tagged as ALUS or REPT in the experiment file are considered dangerous.
mark_repeats(mr)=on|yes|1, off|no|0
Default is yes. One of the most important switches in MIRA: if set to yes, MIRA will try to resolve misassemblies due to repeats by identifying single base stretch differences and tag those critical bases as RMB (Repeat Marker Base, weak or strong). This switch is also needed when MIRA is run in EST mode to identify possible inter-, intra- and intra-and-interorganism SNPs.
assume_snp_instead_repeat(asir)=on|yes|1, off|no|0
Default is no. Only takes effect when -CO:mr (see above) is set to yes, effect is also dependent on the fact whether strain data (see --SB:lsd) is present or not. Usually, mira will mark bases that differentiate between repeats when a conflict occurs between reads that belong to one strain. If the conflict occurs between reads belonging to different strains, they are marked as SNP. However, if this switch is set to yes, conflict within a strain are also marked as SNP.
This switch is mainly used in assemblies of ESTs, it should not be set for genomic assembly.
min_reads_per_group(mrpg)=integer >= 2
Default is 2. Only takes effect when -CO:mr (see above) is set to yes. This defines the minimum number of reads in a group that are needed for the RMB (Repeat Marker Bases) or SNP detection routines to be triggered. A group is defined by the reads carrying the same nucleotide for a given position, i.e., an assembly with mrpg=2 will need at least two times two reads with the same nucleotide (having at least a quality as defined in -CO:mgqrt) to be recognised as repeat marker or a SNP. Setting this to a low number increases sensitivity, but might produce a few false positives, resulting in reads being thrown out of contigs because of falsely identified possible repeat markers (or wrongly recognised as SNP).
min_groupqual_for_rmb_tagging(mgqrt)=integer >= 25
Default is 30. Takes only effect when -CO:mr is set to yes. This defines the minimum quality of a group of bases to be taken into account as potential repeat marker. The lower the number, the more sensitive you get, but lowering below 25 is not recommended as a lot of wrongly called bases can have a quality approaching this value and you'd end up with a lot of false positives. The higher the overall coverage of your project, the better, and the higher you can set this number. A value of 35 will probably remove all false positives, a value of 40 will probably never show false positives.
endread_mark_exclusion_area(emea)=integer >= 0
Default is 15. Takes only effect when -CO:mr is set to yes. Using the end of sequences of Sanger type shotgun sequencing is always a bit risky, as wrongly called bases tend to crowd there or some sequencing vector relicts hang around. It is even more risky to use these stretches for detecting possible repeats, so one can define an exclusion area where the bases are not used when determining whether a mismatch is due to repeats or not.
also_mark_gap_bases(amgb)=on|yes|1, off|no|0
Default is yes. Determines whether columns containing gap bases (indels) are also tagged.
also_mark_gap_bases_even_multicolumn(amgbemc)=on|yes|1, off|no|0
Default is yes. Takes effect only when -CO:amgb is set to yes. Determines whether multiple columns containing gap bases (indels) are also tagged.
also_mark_gap_bases_need_both_strands(amgbnbs)=on|yes|1, off|no|0
Default is yes. Takes effect only when -CO:amgb is set to yes. Determines whether both for tagging columns containing gap bases, both strands.need to have a gap. Setting this to no is not recommended except when working in desperately low coverage situations.
defaultinsertsizeminimum(dismin)=integer >= 0
Default is 500. The minimum distance that read pairs may be apart. There is an additional error margin of 10% subtracted from this value during internal computations.
defaultinsertsizemaximum(dismax)=integer >= 0
Default is 5000. The maximum distance that read pairs may be apart. There is an additional error margin of 10% added to this value during internal computations.

-EDIT (-ED)
General options for controlling the integrated automatic editor.
automatic_contig_editing(ace)=on|yes|1, off|no|0
Default is no. Once contigs have been build, mira can call a built-in version of the automatic contig editor EdIt. EdIt will try to resolve discrepancies in the contig by performing trace analysis and correct even hard to resolve errors. This option is always useful, but especially in conjunction with -AS:nop and -GE:ure (see above).
Notice: the current development version has a memory leak in the editor, therefore the option is not automatically turned on.
strict_editing_mode(sem)=on|yes|1, off|no|0
Default is yes. If set to yes, the automatic editor will not take error hypotheses with a low probability into account, even if all the requirements to make an edit are fulfilled.
confirmation_threshold(ct)=integer, 0 < x <= 100
Default is 50. The higher this value, the more strict the automatic editor will apply its internal rule set. Going below 40 is not recommended.

-DIRECTORY (-DIR, -DI)
General options for controlling where to find or where to write data.
gap4da=<directoryname>
Default is gap4da. Defines the extension of the directory where mira will write the result of an assembly ready to import into the Staden package (GAP4) in Direct Assembly format. The name of the directory will then be <projectname>_.<extension>
exp=<directoryname>
Default is .. Defines the directory where mira should search for experiment files (EXP).
scf=<directoryname>
Default is .. Defines the directory where mira should search for SCF files.
log=<directoryname>
Default is miralog. Defines the directory where mira will write some log files to. Note that the name of the actual project will be prepended.

-FILE (-FI)
The file options allows you to define your own input and output files.
fofnexpin(fei)=string
Default is <projectname>_in.fofn. Defines the file of filenames where the names of the EXP files of a project are located.
fofnphdin(fpi)=string
Default is <projectname>_in.fofn. Defines the file of filenames where the names of the PHD files of a project are located. Note: this is currently not available.
phdin(pi)=string
Default is <projectname>_in.phd. Defines the file of where all the sequences of a project are in PHD format.
fastain(fai)=string
Default is mira_in.fasta. Defines the fasta file to load sequences of a project from.
fastaqualin(fqi)=string
Default is mira_in.fasta.qual. Defines the fasta file to load base qualities of a project from. Although the order of reads in the quality file does not need to be the same as in the fasta or fofn (although it saves a bit of time if they are). projects).
cafin(ci)=string
Default is mira_in.caf. Defines the file to load a CAF project from. Filename must end with '.caf'.
straindatain(sdi)=string
Default is mira_straindata_in.txt. Defines the file to load straindata from. Only used in EST projects (miraEST).
xmltraceinfoin(xtii)=string
Default is mira_xmltraceinfo_in.xml. Defines the file to load a trace info file in XML format from. This can be used both when merging XML data to loaded files or when loading a project from an XML trace info file.
cafout(co)=string
Default is mira_out.caf. Defines the file in CAF format to save an assembled project to. Filename must end with '.caf'.

-OUTPUT (-OUT)
Options for controlling which results to write to which type of files. Additionally, a few options allow output customisation of textual alignments (in text and HTML files).
There are 3 types of results: result, temporary results and extra temporary results. One probably needs only the results. Temporary and extra temporary results are written while building different stages of a contig and are given as convenience for trying to find out why mira set some RMBs or disassembled some contigs.
Output can be generated in these formats: CAF, Gap4 Directed Assembly, FASTA, ACE, HTML and simple text. Please note that the ACE output is experimental as I don't have the necessary programs (phrap, consed) to verify the output. (Anyone who wants to sponsor them? :-)
Naming conventions of the files follow the rules described in section Input / Output, subsection Filenames.

output_result_caf(orc)=on|yes|1, off|no|0
Default is yes.
output_result_gap4da(org)=on|yes|1, off|no|0
Default is yes.
output_result_fasta(orf)=on|yes|1, off|no|0
Default is yes.
output_result_ace(ora)=on|yes|1, off|no|0
Default is no.
output_result_txt(ort)=on|yes|1, off|no|0
Default is yes.
output_result_tcs(ors)=on|yes|1, off|no|0
Default is yes.
output_result_html(orh)=on|yes|1, off|no|0
Default is no.

output_tmpresult_caf(otc)=on|yes|1, off|no|0
Default is no.
output_tmpresult_gap4da(otg)=on|yes|1, off|no|0
Default is no.
output_tmpresult_fasta(otf)=on|yes|1, off|no|0
Default is no.
output_tmpresult_ace(ota)=on|yes|1, off|no|0
Default is no.
output_tmpresult_txt(ott)=on|yes|1, off|no|0
Default is no.
output_result_tcs(ots)=on|yes|1, off|no|0
Default is no.
output_tmpresult_html(oth)=on|yes|1, off|no|0
Default is no.

output_exttmpresult_caf(oetc)=on|yes|1, off|no|0
Default is no.
output_exttmpresult_gap4da(oetg)=on|yes|1, off|no|0
Default is no.
output_exttmpresult_fasta(oetf)=on|yes|1, off|no|0
Default is no.
output_exttmpresult_ace(oeta)=on|yes|1, off|no|0
Default is no.
output_exttmpresult_txt(oett)=on|yes|1, off|no|0
Default is no.
output_exttmpresult_html(oeth)=on|yes|1, off|no|0
Default is no.

text_chars_per_line(tcpl)=integer > 0
Default is 60. When producing an output in text format (-OUT:ort|ott|oett), this parameter defines how many bases each line of an alignment should contain.
html_chars_per_line(tcpl)=integer > 0
Default is 60. When producing an output in HTML format, (-OUT:orh|oth|oeth), this parameter defines how many bases each line of an alignment should contain.
text_endgap_fillchar(tegfc)=<single character>
Default is (a blank). When producing an output in text format (-OUT:ort|ott|oett), endgaps are filled up with this character.
html_endgap_fillchar(hegfc)=<single character>
Default is (a blank). When producing an output in HTML format (-OUT:orh|oth|oeth), endgaps are filled up with this character.

Input / Output

Directories

Since version 2.7.4, mira now puts all files it generates into three directories:

Filenames

Input
The input files must be placed (or linked to) in the directory from which mira is called.
<projectname>_in.fofn
File of filenames containing the names of the experiment or phd files to assemble when the -GE:lj=FOFNEXP or -GE:lj=FOFNPHD option is used. One filename per line, blank lines accepted, lines starting with a hash (#) treated as comment lines, nothing else. Use -FI:fofnin(fei) or -FI:fofnin(fpi) to change the default name.
<projectname>_in.phd
File containing the sequences (and their qualities) to assemble in PHD format.
<projectname>_in.fasta
File containing sequences and ...
<projectname>_in.fasta.qual
... file containing quality values of sequences for the assembly in FASTA format.
<projectname>_in.caf
File containing the sequences (and their qualities) to assemble in CAF format. This format also may contain the result of an assembly (the contig consensus sequences).

Output
These result output files and subdirectories are placed in in the <projectname>_results directory after a run of mira.
<projectname>_out.<type>
Assembled project written in type = (gap4da / caf / ace / fasta / html / tcs / text) format by mira, final result. Type gap4da is a directory containing experiment files and a file of filenames (called 'fofn'), all other types are files. gap4da, caf, ace contain the complete assembly information suitable for import into different post-processing tools (gap4, consed and others). html and text contain visual representations of the assembly suited for viewing in browsers or as simple text file. tcs is a summary of a contig suited for "quick" analyses from command-line tools or even visual inspection.
fasta contains the contig consensus sequences (and .fasta.qual the consensus qualities). Please note that they come in two flavours: padded and unpadded. The padded versions may contains stars (*) denoting gap base positions where there was some minor evidence for additional bases, but not strong enough to be considered as a real base. Unpadded versions have these gaps removed. Padded versions have an additional postfix .padded, while unpadded versions do not have a special postfix.

Assembly statistics and information files
These information files are placed in in the <projectname>_info directory after a run of mira.
<projectname>_info_contigreadlist.txt
This file contains information which reads have been assembled into which contigs (or singlets).
<projectname>_info_contigstats.txt
This file contains statistics about the contigs themselves, their length, average consensus quality, number of reads, maximum and average coverage, average read length, number of A, C, G, T, N, X and gaps in consensus.
<projectname>_info_consensustaglist.txt
This file contains information about the tags (and their position) that are present in the consensus of a contig.
<projectname>_info_readstooshort
A list containing the names of those reads that have been sorted out of the assembly before any processing started only due to the fact that they were too short.
<projectname>_info_readtaglist.txt
This file contains information about the tags and their position that are present in each read. The read positions are given relative to the forward direction of the sequence (i.e. as it was entered into the the assembly).
<projectname>_error_reads_invalid
A list of sequences that have been found to be invalid due to various reasons (given in the output of the assembler).

File formats

EXP
Standard experiment files used in genome sequencing. Correct EXP files are expected. Especially the ID record (containing the id of the reading) and the LN record (containing the name of the corresponding trace file) should be correctly set. See http://www.sourceforge.net/projects/staden/ for links to online format description.
SCF
The Staden trace file format that has established itself as compact standard replacement for the much bigger ABI files. See http://www.sourceforge.net/projects/staden/ for links to online format description.
The SCF files should be V2-8bit, V2-16bit, V3-8bit or V3-16bit and can be packed with compress or gzip.
CAF
Common Assembly Format (CAF) developed by the Sanger Centre. http://www.sanger.ac.uk/Software/formats/CAF/ provides a description of the format and some software documentation as well as the source for compiling caf2gap and gap2caf (thanks to Rob Davies for this).
ACE
The assembly file format used mainly by phrap and consed. Support for .ace output is currently only in test status in mira as documentation on that format is ... sparse and I currently don' have access to consed to verify my assumptions.
Using consed, you will need to load projects with -nophd to view them. Tags are rudimentary supported, this should improve over time. You might also want to try clview (http://www.tigr.org/tdb/tgi/software/) from TIGR to look at .ace files.
HTML
Hypertext Markup Language. Projects written in HTML format can be viewed directly with any table capable browser. Display is even better if the browser knows style sheets (CSS).
FASTA
A simple format for sequence data, see http://www.ncbi.nlm.nih.gov/BLAST/fasta.html. An often used extension of that format is used to also store quality values in a similar fashion, these files have a .fasta.qual ending.
Mira writes two kinds of FASTA files for results: padded and unpadded. The difference is that the padded version still contains the gap (pad) character (an asterisk) at positions in the consensus where some of the reads apparently had some more bases than others but where the consensus routines decided that to treat them as artefacts. The unpadded version has the gaps removed.
PHD
This file type originates from the phred base caller and contains basically -- along with some other status information -- the base sequence, the base quality values and the peak indices, but not the sequence traces itself.
GBF, GBK
GenBank file format as used at the NCBI to describe sequences. mira is able to read this format for using sequences as backbones in an assembly. Features of the GenBank format are also transferred automatically to Staden compatible tags.
traceinfo.XML
XML based file with information relating to traces. Used at the NCBI and ENSEMBL trace archive to store additional information (like clippings, insert sizes etc.) for projects. See http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc&m=main&s=rfc for a description of the fields used.
TCS
Transpose Contig Summary. A text file as written by mira which gives a summary of a contig, one line per base. Nicely suited for "quick" analyses from command line tools or even visual inspection.

STDOUT

The actual stage of the assembly is written to STDOUT, giving status messages on what mira is actually doing.

Some debugging information might also be written to STDOUT if mira generates error messages.

STDERR

During the assembly, mira might dump some messages to standard error. Basically, three error classes exist:
WARNING
Messages in this error class do not stop the assembly but are meant as an information to the user. In some rare cases these errors are due to (an always possible) error in the I/O routines of mira, but nowadays they are mostly due to unexpected (read: wrong) input data and can be traced back to errors in the preprocessing stages. If these errors arise, you definitively DO want to check how and why these errors came into those files in the first place.
Frequent cause for warnings include missing SCF files, SCF files containing known quirks, EXP files containing known quirks etc.
FATAL
Messages in this error class actually stop the assembly. These are mostly due to missing files that mira needs or to very garbled (wrong) input data.
Frequent causes include naming an experiment file in the 'file of filenames' that could not be found on the disk, same experiment file twice in the project, suspected errors in the EXP files, etc.
INTERNAL
These are true programming errors that were caught by internal checks. Should this happen, please mail the output of STDOUT and STDERR to the authors.

Log files

During assembly, mira will write a whole lot of log files which all will be placed into a subdirectory named "<projectname>_miralog"

Please do not delete the project where errors happened. I will get in touch with you for additional information that might possibly be present in temporary files and the log file. They are not deleted in cases like this.

Tags used in the assembly by MIRA and EdIt

mira uses and sets a couple of tags during the assembly process. That is, if information is known before the assembly, it can be stored in tags (in the EXP and CAF formats) and will be used in the assembly.

Tags read (and used)

Tags set (and used)

Assembling ESTs

The assembly of ESTS can be done in two ways when using the MIRA2 system: by using miraEST or mira. When using miraEST, an automated pipeline is run that is able to assemble transcripts cleanly according to given organism strains. Afterwards, an integrated SNP analysis highlights the exact nature of mutations within the strain transcripts. mira can also be used to assemble ESTs, but it is recommended only for projects that do not need to sort transcripts according to strains.

Using miraEST for EST assembly

miraEST is a pipeline that reconstructs the pristine mRNA transcript sequences gathered in EST sequencing projects, which can be a reliable basis for subsequent analysis steps like clustering or exon analysis. This means that even genes that contain only one transcripted SNP on different alleles are first treated as different transcripts. The optional last step of the assembly process can be configured as a simple clusterer that can assemble transcripts containing the same exon sequence -- but only differ in SNP positions -- into one consensus sequence. Such SNPs can then be analysed, classified and reliably assigned to their corresponding mRNA transcriptome sequence. How ever, it is important to note that miraEST is an assembler and not a full blown clustering tool.

Generally speaking, miraEST is a three-stage assembly system that was designed to catch SNPs in different strains and reconstruct the mRNA present in those strains. That is, one really should have different strains to analyse (and the information provided to the assembler) to make the most out of miraEST. Here is a quick overview on what miraEST does:

    Step 1
    assemble everything together, not caring about strain information. Potential SNPs are not treated as SNPs, but as possible repeat marker bases and are tagged as such (temporarily) to catch each and every possible sequence alignment which might be important later. As a result of this stage, the following information is written out:

    Obviously, if one did not provide strain information to the assembly of step 1, all the sequences belong to the same strain (named "default"). The CAF files generated in this step are the input sequences for the next step.
    Step 2
    Now, miraEST assembles each strain independently from each other. Again, sequences containing SNPs are torn apart into different contigs (or singlets) to give a clean representation of the "really sequenced" ESTs. In the end, each of the contigs (or singlets) coming out of the assemblies for the strains is a representation of the mRNA that was floating around the given cell/strain/organism. The results of this step are written out into one big file (step2_reads.caf) and a new straindata that goes along with those results (step2_straindata.txt).
    Step 3
    miraEST takes the result of the previous step (which should now be clean transcripts) and assembles them together, *this time* allowing transcripts from different strains with different SNP bases to be assembled together. The result is then written to step3_out.* files and directories.

miraEST can also be used for EST data of a single strain or when no strain information is available. In this case, it will cleanly sort out transcripts of almost identical genes or, when eukaryotic ESTs are assembled, according to their respective allele when these contain mutations.

Like the normal mira, miraEST keeps track on a lot of things and writes out quite a lot of additional information files after each step. Results and and additional information of step 1 are stored in "step1_*" directories. Results and information of step 2 are in "<strainname>_*" directories. For step 3, it's "step3_*" again.

Each step of miraEST can be configured by a parameter file. These files understand the same parameters as for mira. However, they are presently the only way to configure miraEST, i.e., miraEST does not honour parameters given on the command line. Additionally, the parameter files must be named me_step1par, me_step2.par and me_step3.par. The parameter files that are present in the distribution are the ones currently used by default. By simply copying them into the assembly directory (keep the name of the files), miraEST will load those instead of using built in defaults.

For the time being, the pipeline of miraEST is not as flexible as mira itself. For example, every input and output file must have the prefix ``step<number>''. Also, a straindata file must be present for step 1 (step1_straindata_in.txt), but it can be an empty file.

Using mira for EST assembly

Using mira in EST projects is a bit more bare bone that the miraEST pipeline, but it is quite useful to get a first impression of a given data set or when used in projects that have no strain or only one strain.

It is recommended to use -estmode as first argument on the command line to switch to a good initial default and then eventually adapt with own settings.

Requirements

To use mira itself, one doesn't need very much:

Viewing the results or preprocessing the sequences (these are just examples, there are a lot more packages available):

As always, most of the time a combination of several different packages is possible.

For re-assembling projects that were edited in gap4, one will also nee the gap2caf converter. The source for this is available at http://www.sanger.ac.uk/Software/formats/CAF/.

Speed and memory considerations

Memory consumption has decreased a bit since the V2.2.x versions.

As a rough rule of thumb, 2GB of RAM (and some disk swap) should be sufficient to assemble a bacterium with 10 mega-bases and an 8x coverage (Sanger type reads).

NEW since 2.7.4: The new SKIM3 algorithm (initial all-against-all read comparison) is now approximately 60 times faster that the SKIM algorithms of earlier versions. E.g. SKIMming of 53,000 Sanger type shotguns reads now takes a minute instead of 62 minutes.

The times given below are only approximate and were gathered on my home development box (Athlon 4800+) using a single core and minimal debug code compiled in, somewhat slowing down the whole process.

Example 1: a small genomic project with 720 reads forming 35k bases of contig sequences in three main pass iterations, resolving minor repeat misassemblies, full read extension and automatic contig editing takes 21 seconds.

Example 2: a bacterial genomic project with two very closely related strains, 53000 reads bases forming a bit more than 3 megabases of contig sequences for each strain. Using the -genomeaccurate Do-What-I-Mean switch (four main passes, read extension, clipping of vector remnants), resolving repeat misassemblies (mostly rna stretches, but also some very closely related genes) takes 2hrs and 30 minutes.

Example 3: Here are the times for miraEST in a non-normalised (thus very repetitive) EST project, 9747 reads with a mean length of 674 used bases,

The three steps of miraEST (each one again subdivided in a number of MIRA passes), including resolving very high coverage contigs (>500 sequences) in multiple passes and splitting them into different SNP and splice variants takes about 20 minutes.

Known Problems / Bugs

File Input / Output:
  1. mira can only read unedited EXP files.
  2. ACE output is still a bit shaky and might contain errors, especially in the tag format.
  3. There sometimes is a (rather important) memory leak occurring while using the assembly integrated editor. I have not been able to trace the reason yet.

Assembly process:

  1. The routines for determining Repeat Marker Bases (RMB) are sometimes too sensitive, which sometimes leads to excessive base tagging and preventing right assemblies in subsequent assembly processes. The parameters you should look at for this problem are -CO:mrc:nrz:mgqrt:mgqwpc. Also look at -CL:pvc and -CO:emea if you have a lot of sequencing vector relicts at the end of the sequences.

Caveats

Note: Versions with uneven minor versions (e.g. 1.1.x, 1.3.x, ..., 2.1.x, ... etc.) are development versions which might be unstable in parts (although I don't think so). But to catch possible bugs, development versions of mira are distributed with tons of internal checks compiled into the code, making it somewhere between 10% and 50% slower than it could be.

TODOs

These are some of the topics on my TODO list for the next revisions to come:
  1. 454 assembly. Please have a look at the development version (2.9.x) of mira where assembly of 454 data is tested.
  2. Making parts of the process multi-threaded (currently in work)
  3. Others nifty ideas that I have not completely thought out yet.

Working principles

To avoid the "garbage-in, garbage-out" problematic, mira uses a 'high quality alignments first' contig building strategy. This means that the assembler will start with those regions of sequences that have been marked as good quality (high confidence region - HCR) with low error probabilities (the clipping must have been done by the base caller or other preprocessing programs, e.g. pregap4) and then gradually extends the alignments as errors in different reads are resolved through error hypothesis verification and signal analysis.

This assembly approach relies on some of the automatic editing functionality provided by the EdIt package which has been integrated in parts within mira.

This is an approximate overview on the steps that are executed while assembling:

  1. All the experiment / phd / fasta sequences that act as input are loaded (or the CAF project). Qualities for the bases are loaded from the FASTA or SCF if needed.
  2. The high confidence region (HCR) of each read is compared with a quick algorithm to the HCR of every other read to see if it could match and have overlapping parts (this is the 'SKIM' filter).
  3. All the reads which could match are being checked with an adapted Smith-Waterman alignment algorithm (banded version). Obvious mismatches are rejected, the accepted alignments form one or several alignment graphs.
  4. Optional pre-assembly read extension step: mira tries to extend HCR of reads by analysing the read pairs from the previous alignment. This is a bit shaky as reads in this step have not been edited yet, but it can help. Go back to step 2.
  5. A contig gets made by building a preliminary partial path through the alignment graph (through in-depth analysis up to a given level) and then adding the most probable overlap candidates to a given contig. Contigs may reject reads if these introduce to many errors in the existing consensus. Errors in regions known as dangerous (for the time being only ALUS and REPT) get additional attention by performing simple signal analysis when alignment discrepancies occur.
  6. Optional: the contig can be analysed and corrected by the automatic editor (EdIt).
  7. Long term repeats are searched for, bases in reads of different repeats that have been assembled together but differ sufficiently (for EdIT so that they didn't get edited and by phred quality value) get tagged with RMB.
  8. Go back to step 5 if there are reads present that have not been assembled into contigs.
  9. Optional post-assembly read-extension: mira extends the HCR of the reads that have been assembled into the contigs.
  10. Optional: Detection of spoiler reads that prevent joining of contigs. Remedy by shortening them.
  11. Optional: Write out a checkpoint assembly file and go back to step 2.
  12. The resulting project is written out to different output files and directories.

Versions, Licenses, Disclaimer and Copyright

There are no restrictions whatsoever built into the executable, neither in time nor in the number of sequences. The only limit is given by the available memory.

Versions

There are two kind of versions that can be compiled: production and development.

Production versions are from the stable branch of the source code. These versions are available for download on the web site of MIRA.

Development versions are from the development branch of the source tree. These are also made available to the public and should be compiled by users who want to test out new functionality or to track down bugs or errors that might arise at a given location. Release candidates (rc) also fall into the development versions: they are usually the last versions of a given development branch.

License

MIRA falls under the GPL (currently version 2).

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA

Copyright

©  1997-2000 Deutsches Krebsforschungszentrum Heidelberg -- Dept. of Molecular Biophysics and Bastien Chevreux (for MIRA) and Thomas Pfisterer (for EdIt)
©  2001-2007 for MIRA V2.x: Bastien Chevreux.
All rights reserved.

External libraries

MIRA uses the excellent Expat library to parse XML files. Expat is Copyright ©  1998, 1999, 2000 Thai Open Source Software Center Ltd and Clark Cooper as well as Copyright ©   2001, 2002 Expat maintainers.
See http://www.libexpat.org/ and http://sourceforge.net/projects/expat/ for more information on Expat.

Author

Bastien Chevreux (mira): bach@chevreux.org

mira can use automatic editing routines which were written by Thomas Pfisterer (EdIt): t.pfisterer@dkfz-heidelberg.de

WWW:   http://www.chevreux.org/projects_mira.html

Miscellaneous

If you find this software useful, please send the author a postcard. If postcards are not available, a treasure chest full of spanish doubloons, gold and jewellery will do nicely, thank you.

Citations

Please use these citations:
    mira
    Chevreux, B., Wetter, T. and Suhai, S. (1999): Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB) 99, pp. 45-56.
    miraEST
    Chevreux, B., Pfisterer, T., Drescher, B., Driesel, A. J., Müller, W. E., Wetter, T. and Suhai, S. (2004): Using the miraEST Assembler for Reliable and Automated mRNA Transcript Assembly and SNP Detection in Sequenced ESTs. Genome Research, 14(6)

See Also

EdIt(1), gap4(1), pregap4(1), ttuner(1), scylla(1), pga(1), pta(1), cap4(1), phred(1), phrap(1), cross_match(1), clview(1), consed(1), caf2gap(1), gap2caf(1), compress(1) and gzip(1).