Genome and EST sequencing projects

Today's large-scale genome sequencing efforts are a semi-industrial process (Dear et al. (1998)) and produce enormous quantities of data. Nearly all of them are based, in one way or another, on the chain-termination dideoxy method published by Sanger et al. (1977). However, the gel or capillary electrophoresis used can determine at most about 1,000 to 1,500 bases in one run, and the high-quality stretch with low error probabilities for the called bases often covers only the first 400 to 500 bases. Current sequencing strategies for a larger contiguous DNA sequence (contig) or for a whole genome - ranging anywhere between 20 kilobases (kb) and 3,000 megabases (Mb) - therefore require an indirect approach. Basically, the given DNA is fragmented into hundreds or thousands of overlapping subclones (Durbin and Dear (1998)), the subclones are analysed by fluorescent-dye electrophoresis, and they are subsequently reassembled in silico. This computer-based reconstruction of DNA (or RNA) from fragments is called ``the assembly problem''.
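To make the fragment-and-reassemble idea concrete, the following is a minimal sketch in Python. It is not the method of any assembler cited here: it assumes error-free reads, a single strand (no reverse complements), no repeats, and a simple greedy merge of the longest suffix-prefix overlap. The names `fragment`, `overlap`, `greedy_assemble` and the parameter `min_len` are hypothetical and chosen for illustration only.

```python
def fragment(seq: str, read_len: int, step: int) -> list[str]:
    """Cut seq into overlapping windows (idealised, error-free 'reads')."""
    return [seq[i:i + read_len] for i in range(0, len(seq) - read_len + 1, step)]


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Toy assembler (assumption: no errors, no repeats, one strand).

    Repeatedly merges the pair of sequences with the longest overlap
    until no overlap of at least min_len remains.
    """
    # Drop exact duplicates and reads fully contained in another read.
    contigs = list(dict.fromkeys(reads))
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    target = "ATGGCGTGCAATGCCTTAGA"  # made-up 20-base example sequence
    reads = fragment(target, read_len=8, step=4)
    print(reads)
    print(greedy_assemble(reads))
```

With real data the problem is much harder than this sketch suggests: base-calling errors, repeated regions and reads from both strands all break the simple exact-overlap assumption used above.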

On the way to understanding all genes of an organism, it is now clear that the genome sequence alone may not be enough, especially if the organism shows a high degree of complexity, as e.g. mammals do. Therefore, analysis of the genome must be supported by an understanding of its transcription in cells - the transcriptome. Projects that focus on sequencing mRNA transcripts are also called EST projects, as they analyse Expressed Sequence Tags. This direct RNA sequencing remains - citing Camargo et al. (2001) - the ``most definitive approach to the elucidation of transcripts''. At the same time, Bonfield et al. (1998) conclude that ``direct sequencing is required to define the precise location and nature of any [mutational] change'', as this method ensures the highest reliability and quality in defining single nucleotide polymorphisms (SNPs). EST projects thus constitute a perfect opportunity both to elucidate the transcriptome and to analyse the mutational polymorphisms contained therein, especially in cross-species EST analyses, as was shown in Chevreux et al. (2004).

Bastien Chevreux 2006-05-11