Sequence assembly with MIRA2

Bastien Chevreux

August 1, 2007

Version 2.8.2

mira - a genome and EST sequence assembly system.

Table of Contents

This guide assumes that you have basic working knowledge of Unix systems, know the basic principles of sequencing (and sequence assembly) and what assemblers do.

Installation

The mira assembler can be installed easily by copying the binaries into a directory that is contained in your path. Alternatively, add the directory you have extracted mira into to your path environment variable.

Quick start for the impatient

This example assumes that you have a few sequences in FASTA format that have been preprocessed - that is, where sequencing vector has been cut back or masked out - and concatenated into one file. If quality values are also present in a fasta like format, so much the better.

We need to give a name to our project: throughout this example, we will assume that the sequences we are working with are from Bacillus chocorafoliensis (or short: Bchoc); a well known, chocolate-adoring bug from the Bacillus family. Our project will therefore be named 'bchoc'.

Preparing and starting an assembly from scratch with FASTA files

The following steps will allow to quickly start a simple assembly:
  1. mkdir bchoc_assembly1
  2. cd bchoc_assembly1
  3. cp /path/where/the/data/resides/sequences.fasta bchoc_in.fasta
  4. (optional) cp /path/where/the/data/resides/qualities.someextension bchoc_in.fasta.qual
  5. mira -project=bchoc -fasta -genomenormal

Explanation: we created a directory for the assembly, copied the sequences into it (to make things easier for us, we named the file directly in a format suitable for mira to load it automatically) and, optionally, we also copied quality values for the sequences into the same directory. As last step, we started mira with options telling it that

By giving mira the project name 'bchoc' (-project=bchoc) and naming sequence file with an appropriate extension _in.fasta, mira automatically loaded that file for assembly. When there are additional quality values available (bchoc_in.fasta.qual), these are also automatically loaded and used for the assembly.

Looking at results

Once mira has completed the assembly, a number of directories will have appeared in the directory within which mira was called. These directories (and files within) contain - of course - the results of the assembly itself as well as general information and statistics on the results.

The following directories will have be created:

The following files in bchoc_results contain results of the assembly in different formats:

The following files in bchoc_info contain statistics and other results of the assembly:

Calling mira from the command line

Mira can be used in many different ways: building assemblies from scratch, performing reassembly on existing projects, assembling sequences from closely related strains, assembling sequences against an existing backbone (mapping assembly), etc.pp. Mira comes with a number of quick switches, i.e., switches that turn on parameter combinations which should be suited for most needs.

E.g.: mira -project=foobar -fasta -genomeaccurate -highlyrepetitive

The line above will tell mira that our project will have the general name foobar and that the sequences are to be loaded from FASTA files, the sequence input file being named foobar_in.fasta (and sequence quality file, if available, foobar_in.fasta.qual. Then, mira will perform a genome assembly in accurate grade, additionally being prepared for the genome containing nasty repeats. The result files will be in a directory named foobar_results, statistics about the assembly will be available in the foobar_info directory like, e.g., a summary of contig statistics in foobar_info/foobar_info_contigstats.txt.

E.g.: mira -fasta -project=foobar -mappingaccurate -highlyrepetitive

This is the same as the previous example except mira will perform a mapping assembly of the sequences against a backbone sequence(s). mira will therefore additionally load the backbone sequence(s) from the file foobar_backbone_in.fasta (FASTA being the default type of backbone sequence to be loaded).

E.g.: mira -fasta -project=foobar -mappingaccurate -highlyrepetitive -SB:bft=gbf

As above, except we have mixed some extensive switches into the lot to tell mira to load the backbone(s) from a GenBank format file (GBF). mira will therefore load the backbone sequence(s) from the file foobar_backbone_in.gbf. Note that the GBF file can also contain multiple entries, i.e., it can be a GBFF file.

Usage examples

Assembly from scratch with GAP4 and EXP files

A simple GAP4 project will do nicely. Please take care of the following: You need already preprocessed experiment / fasta / phd files, i.e., at least the sequencing vector should have been tagged (in EXP files) or masked out (FASTA or PHD files). It would be nice if some kind of not too lazy quality clipping had also been done for the EXP files, pregap4 should do this for you.
    Step 1
    Create a file of filenames (named "mira_in.fofn") for the project you wish to assemble. The file of filenames should contain the newline separated names of the EXP-files and nothing else.
    Step 2
    Execute the mira assembly, eventually using command line options or output redirection:
    /path/to/the/mira/package/mira
    or simply
    mira
    if mira is in a directory which is in your PATH. The result of the assembly will now be in directory named mira_results where you will find mira_out.caf, mira_out.html etc. or in gap4 direct assembly format in the mira_out.gap4da subdirectory.
    Step 3a
    Have a quick look at how the project looks like by loading the file mira_results/mira_out.html into your favourite web browser or ...
    Step 3b
    ... change to the gap4da directory:
    cd mira_results/mira_out.gap4da
    start gap4:
    gap4
    choose the menu 'File->New' and enter a name for your new database (like 'demo'). Then choose the menu 'Assembly->Directed assembly'. Enter the text 'fofn' in the entry labelled "Input readings from List or file name" and enter the text 'failures' into the entry labelled "Save failures to List or file name". Press "OK".
    That's it.
    Step 3c
    As an alternative to step 3b, one can use the caf2gap converter (see below)
    caf2gap -project demo -version 0 -ace mira_out.caf
    gap4 DEMO.0

Out-of-the box example
Mira comes with a few really small toy project to test usability on a given system. Go to the minidemo directory and follow the instructions given in the section for own projects above, but start with step 2. Eventually, you might want to start mira while redirecting the output to a file for later analysis.

Reassembly of GAP4 edited projects

It is sometimes wanted to reassemble a project that has already been edited, for example when hidden data in reads has been uncovered. The canonical way to do this is by using CAF files as data exchange format and the caf2gap and gap2caf converters available from the Sanger Centre (http://www.sanger.ac.uk/Software/formats/CAF/). Warning: The project will be completely reassembled, contig joins or breaks that have been made in the GAP4 database will be lost, you will get an entirely new assembly with what mira determines to be the best assembly.

Using existing sequences (backbones) to assemble against

One of the most useful features of mira is the ability to assemble against already existing sequences or contigs (also called a mapping assembly). The parameters that control the behaviour of the assembly in these cases are in the -STRAIN/BACKBONE section of the parameters.

Please have a look at the example in the minidemo/bbdemo2 directory which maps sequences from C.jejuni RM1221 against (parts of) the genome of C.jejuni NCTC1168.

There are a few things to consider when using backbone sequences:

  1. Backbone sequences can be as long as needed! They are not subject to normal read length constraints of a maximum of 10k bases. That is, if one wants to load one or several entire chromosome of a bacteria as backbone sequence(s), this is just fine.
  2. Backbone sequences will not be reversed! They will always appear in forward direction in the output of the assembly. Please note: if the backbone sequence consists of a CAF file that contain contigs which contain reversed reads, then the contigs themselves will be in forward direction. But the reads they contain that are in reverse complement direction will of course also stay reverse complement direction.
  3. Backbone sequences will not not be assembled together! That is, if a sequence of the backbones has a perfect overlap with another backbone sequence, the will still not be merged.
  4. Reads are assembled to backbones in a first come, first served strategy. Suppose you have two identical backbones and a number of reads that match perfectly in the assembly. Then all the reads will be assembled to the first backbone sequence while the second one gets none.
  5. Only in backbones loaded from CAF files: contigs made out of single reads (singlets) loose their status as backbones and will be returned to the normal read pool for the assembly process. That is, these sequences will be assembled to other backbones or with each other.

Examples for using backbone sequences: