Sequence assembly with MIRA2
Bastien Chevreux
August 1, 2007
Version 2.8.2
mira
- a genome and EST sequence assembly system.
Table of Contents
This guide assumes that you have basic working knowledge of Unix systems, know
the basic principles of sequencing (and sequence assembly) and what assemblers
do.
The mira
assembler can be installed easily by copying the binaries into
a directory that is contained in your path. Alternatively, add the directory
you have extracted mira
into to your path
environment variable.
This example assumes that you have a few sequences in FASTA format that have
been preprocessed - that is, where sequencing vector has been cut back or
masked out - and concatenated into one file. If quality values are also
present in a fasta like format, so much the better.
We need to give a name to our project: throughout this example, we will
assume that the sequences we are working with are from
Bacillus chocorafoliensis
(or short: Bchoc);
a well known,
chocolate-adoring bug from the Bacillus
family. Our project will
therefore be named 'bchoc'.
The following steps will allow to quickly start a simple assembly:
- mkdir bchoc_assembly1
- cd bchoc_assembly1
- cp /path/where/the/data/resides/sequences.fasta bchoc_in.fasta
- (optional) cp /path/where/the/data/resides/qualities.someextension bchoc_in.fasta.qual
- mira -project=bchoc -fasta -genomenormal
Explanation:
we created a directory for the assembly, copied the
sequences into it (to make things easier for us, we named the file directly in
a format suitable for mira
to load it automatically) and, optionally, we
also copied quality values for the sequences into the same directory. As last
step, we started mira
with options telling it that
- our project is named 'bchoc' and hence, input and output files will have
this as prefix;
- the data is in a FASTA formatted file;
- the data should be assembled with a parameter combination suited for
assembly of most genomes.
By giving mira
the project name 'bchoc' (-project=bchoc) and naming
sequence file with an appropriate extension _in.fasta,
mira
automatically loaded that file for assembly. When there are
additional quality values available (bchoc_in.fasta.qual),
these are also automatically loaded and used for the assembly.
Once mira
has completed the assembly, a number of directories will have
appeared in the directory within which mira
was called. These
directories (and files within) contain - of course - the results of the
assembly itself as well as general information and statistics on the results.
The following directories will have be created:
- bchoc_results: this directory contains all the output files of
the assembly in different formats. there are many different formats that can
be chosen from the command line: fasta, caf, ace, gap4 directed assembly,
simple text etc.
- bchoc_info: this directory contains information files of the
final assembly. They provide statistics as well as, e.g., information
(easily parseable by scripts) on which read is found in which contig etc.
- bchoc_log: this directory contains log files and temporary
assembly files. It can be safely removed after an assembly as there may be
easily a few GB of data in there that are not normally not needed anymore.
The following files in bchoc_results
contain results of the assembly in
different formats:
- bchoc_out.txt: this file contains in a human readable
format the aligned assembly results, where all input sequences are shown
in the context of the contig they were assembled into. This file is just
meant as a quick way for people to have a look at their assembly without
specialised alignment finishing tools.
- bchoc_out.padded.fasta: this file contains as FASTA
sequence the consensus of the contigs that were assembled in the process.
Positions in the consensus containing gaps (also called 'pads', denoted by
an asterisk) are still present. The computed consensus qualities are in the
corresponding bchoc_out.padded.fasta.qual
file.
- bchoc_out.unpadded.fasta: as above, this file contains
as FASTA sequence the consensus of the contigs that were assembled in the
process, put positions in the consensus containing gaps were removed. The
computed consensus qualities are in the corresponding
bchoc_out.unpadded.fasta.qual
file.
- bchoc_out.caf: this is the result of the assembly in CAF
format, which can be further worked on with, e.g., tools from the
caftools
package from the Sanger Centre and later on be imported
into, e.g., the Staden gap4
assembly and finishing tool.
- bchoc_out.ace: this is the result of the assembly in ACE
format. This format can be read by viewers like the TIGR clview
or by
consed
from the phred/phrap/consed package.
- bchoc_out.gap4da: this directory contains the result of
the assembly suited for the direct assembly
import of the Staden
gap4
assembly viewer and finishing tool.
The following files in bchoc_info
contain statistics and other results
of the assembly:
- bchoc_info_callparameters.txt: This file contains the
parameters as given on the mira
command line when the assembly was
started.
- bchoc_info_contigstats.txt: This file contains in tabular
format statistics about the contigs themselves, their length, average
consensus quality, number of reads, maximum and average coverage, average
read length, number of A, C, G, T, N, X and gaps in consensus.
- bchoc_info_contigreadlist.txt: This file contains
information which reads have been assembled into which contigs (or
singlets).
- bchoc_info_consensustaglist.txt: This file contains
information about the tags (and their position) that are present in the
consensus of a contig.
- bchoc_info_readstooshort: A list containing the names of those
reads that have been sorted out of the assembly only due to the fact that
they were too short, before any processing started.
- bchoc_info_readtaglist.txt: This file contains
information about the tags and their position that are present in each read.
The read positions are given relative to the forward direction of the
sequence (i.e. as it was entered into the the assembly).
- bchoc_error_reads_invalid: A list of sequences
that have been found to be invalid due to various reasons (given in the
output of the assembler).
Mira
can be used in many different ways: building assemblies from
scratch, performing reassembly on existing projects, assembling sequences from
closely related strains, assembling sequences against an existing backbone
(mapping assembly), etc.pp. Mira
comes with a number of quick switches,
i.e., switches that turn on parameter combinations which should
be suited for most needs.
E.g.: mira -project=foobar -fasta -genomeaccurate -highlyrepetitive
The line above will tell mira
that our project will have the general
name foobar
and that the sequences are to be loaded from FASTA files,
the sequence input file being named foobar_in.fasta
(and sequence
quality file, if available, foobar_in.fasta.qual.
Then, mira
will perform a genome assembly in accurate
grade, additionally being
prepared for the genome containing nasty repeats. The result files will be in
a directory named foobar_results,
statistics about the assembly will be
available in the foobar_info
directory like, e.g., a summary of contig
statistics in foobar_info/foobar_info_contigstats.txt.
E.g.: mira -fasta -project=foobar -mappingaccurate -highlyrepetitive
This is the same as the previous example except mira
will perform a
mapping assembly of the sequences against a backbone sequence(s). mira
will therefore additionally load the backbone sequence(s) from the file
foobar_backbone_in.fasta
(FASTA being the default type of backbone
sequence to be loaded).
E.g.: mira -fasta -project=foobar -mappingaccurate -highlyrepetitive -SB:bft=gbf
As above, except we have mixed some extensive switches
into the lot
to tell mira
to load the backbone(s) from a GenBank format file (GBF).
mira
will therefore load the backbone sequence(s) from the file
foobar_backbone_in.gbf.
Note that the GBF file can also contain
multiple entries, i.e., it can be a GBFF file.
A simple GAP4
project will do nicely. Please take care of the
following: You need already preprocessed experiment / fasta / phd files, i.e.,
at least the sequencing vector should have been tagged (in EXP files) or
masked out (FASTA or PHD files). It would be nice if some kind of not too lazy
quality clipping had also been done for the EXP files, pregap4
should
do this for you.
- Step 1
- Create a file of filenames (named "mira_in.fofn")
for
the project you wish to assemble. The file of filenames should contain the
newline separated names of the EXP-files and nothing else.
- Step 2
- Execute the mira
assembly, eventually using command line
options
or output redirection:
/path/to/the/mira/package/mira
or simply
mira
if mira
is in a directory which is in your PATH. The result of the
assembly will now be in directory named mira_results
where you will
find mira_out.caf,
mira_out.html
etc. or in gap4 direct
assembly format in the mira_out.gap4da
subdirectory.
- Step 3a
- Have a quick look at how the project looks like by loading the
file mira_results/mira_out.html
into your favourite web browser or
...
- Step 3b
- ... change to the gap4da directory:
cd mira_results/mira_out.gap4da
start gap4:
gap4
choose the menu 'File->New' and enter a name for your new database (like
'demo'). Then choose the menu 'Assembly->Directed assembly'. Enter the
text 'fofn' in the entry labelled "Input readings from List or file name"
and enter the text 'failures' into the entry labelled "Save failures to List
or file name".
Press "OK".
That's it.
- Step 3c
- As an alternative to step 3b, one can use the caf2gap
converter (see below)
caf2gap -project demo -version 0 -ace mira_out.caf
gap4 DEMO.0
Out-of-the box example
Mira
comes with a few really small toy project to test usability on a
given system. Go to the minidemo directory and follow the instructions given
in the section for own projects above, but start with step 2. Eventually, you
might want to start mira
while redirecting the output to a file for
later analysis.
It is sometimes wanted to reassemble a project that has already been edited,
for example when hidden data in reads has been uncovered. The canonical way to
do this is by using CAF files as data exchange format and the caf2gap
and gap2caf
converters available from the Sanger Centre
(http://www.sanger.ac.uk/Software/formats/CAF/).
- Step 1
- Convert your GAP4 project with the gap2caf tool. Assuming that
the assembly is in the GAP4 database CURRENT.0,
convert it with the
gap2caf
tool:
gap2caf -project CURRENT -version 0 -ace > newstart_in.caf
The name "newstart" will be the project name of the new assembly project.
- Step 2
- Start mira
with the -caf option and tell it the name of
your new reassembly project:
mira -caf -project=newstart
(and other options at will. E.g. -genomeaccurate)
- Step 3
- Convert the resulting CAF file newstart_out.caf
to GAP
database format as explained above.
caf2gap -project reassembled -version 0 -ace newstart_out.caf
Then start gap4
gap4 REASSEMBLED.0
Warning: The project will be completely reassembled, contig joins or breaks
that have been made in the GAP4 database will be lost, you will get an
entirely new assembly with what mira
determines to be the best
assembly.
One of the most useful features of mira
is the ability to assemble
against already existing sequences or contigs (also called a mapping
assembly). The parameters that control the behaviour of the assembly in these
cases are in the -STRAIN/BACKBONE
section of the parameters.
Please have a look at the example in the minidemo/bbdemo2
directory
which maps sequences from C.jejuni RM1221
against (parts of) the genome
of C.jejuni NCTC1168.
There are a few things to consider when using backbone sequences:
- Backbone sequences can be as long as needed! They are not subject to
normal read length constraints of a maximum of 10k bases. That is, if one
wants to load one or several entire chromosome of a bacteria as backbone
sequence(s), this is just fine.
- Backbone sequences will not be reversed! They will always appear in
forward direction in the output of the assembly. Please note: if the
backbone sequence consists of a CAF file that contain contigs which contain
reversed reads, then the contigs themselves will be in forward direction.
But the reads they contain that are in reverse complement direction will of
course also stay reverse complement direction.
- Backbone sequences will not not be assembled together! That is, if a
sequence of the backbones has a perfect overlap with another backbone
sequence, the will still not be merged.
- Reads are assembled to backbones in a first come, first served
strategy. Suppose you have two identical backbones and a number of reads
that match perfectly in the assembly. Then all the reads will be assembled
to the first backbone sequence while the second one gets none.
- Only in backbones loaded from CAF files: contigs made out of single
reads (singlets) loose their status as backbones and will be returned to the
normal read pool for the assembly process. That is, these sequences will be
assembled to other backbones or with each other.
Examples for using backbone sequences:
- Example 1: assume you have a genome of an existing organism. From that,
a mutant has been made by mutagenesis and you are skimming the genome in
shotgun mode for mutations. You would generate for this a straindata file
that gives the name of the mutant strain to the newly sequenced reads and
simply assemble those against your existing genome, using the following
parameters:
-SB:lsd=yes:lb=yes:bsn=<nameOriginalStrain>:bft=<caf,fasta,gbf>.
When loading backbones from CAF, the qualities of the consensus bases will
be calculated by mira
according normal consensus computing rules.
When loading backbones from FASTA or GBF, one can set the expected overall
quality of the sequences (e.g. 1 error in 1000 bases = quality of 30) with
-SB:bbq=30.
It is recommended to have the backbone quality at
least as high as the -CO:mgqrt
value, so that mira
can
automatically detect and report SNPs.
- Example 2: suppose that you are in the process of performing a shotgun
sequencing and you want to determine the moment when you have got enough
reads. One could make a complete assembly each day when new sequences
arrive. However, starting with genomes the size of mid-sized bacteria, this
becomes prohibitive from the computational point of view. A quick and
efficient way to resolve this problem is to use the CAF file of the previous
assembly as backbone and simply add the new reads to the pool. The number of
singlets remaining after the assembly versus the total number of reads of
the project is a good measure for the coverage of the project.
- Example 3: in EST assembly with miraEST, existing cDNA sequences
can also be useful when added to the project during step 3 (in the file
step3_in.par).
They will provide a framework to which mRNA-contigs
built in previous steps will be assembled against, allowing for a fast
evaluation of the results. Additionally, they provide a direction for the
assembled sequences so that one does not need to invert single contigs by
hand afterwards.