NGS Read Mapping


Introduction

Next-generation sequencing (NGS) data are now the standard way to produce genomic and transcriptomic knowledge about an organism, and they are massively produced thanks to their affordable cost. Mapping short reads against a reference genome is typically the first step in analyzing such next-generation sequencing data, and it should be as accurate as possible. Because of the high number of reads to handle, numerous sophisticated algorithms have been developed in the last three years to tackle this problem, and many mapping tools now exist. These tools usually have their own specificities, and deciding which one to use for a given application is a vexing question. A very recent paper (Ruffalo et al. p.2790-2796) presents a comparative analysis of six mapping tools run on the human genome. Their criteria for comparing the performance of the mapping tools are based on the quality scores, computed by the different tools, of the retrieved mappings.

We thus address the following questions: Are the tools capable of systematically mapping a read that occurs exactly (with no mismatch) in the reference genome? Can they always do so for a read having as many errors as the maximum number of mismatches allowed in the alignments? For reads occurring at several positions, do they retrieve all the occurrences or only a subset? Do the reads reported as unique really occur only once along the genome? As we will see, the answer is not always positive, so it is important to know the limitations of each tool.

The tools can be divided into two main categories according to the type of algorithm they are based on: hash table based algorithms (indexing either the reads or the reference genome) and Burrows-Wheeler Transform based algorithms (see Table 2). MPscan uses an intermediate approach based on suffix trees.

Algorithmic overview

In this section, our aim is to describe the algorithms on which mapping tools are based. Although these algorithms are complex, we have tried to make them as clear as possible. The beginning of each section broadly describes the methods and the structures; a more in-depth description then follows.

Basic algorithms

Let us suppose, for the sake of convenience, that all the reads have the same size, say 36 nucleotides. The most straightforward way of finding all the occurrences of a read, if no gap is allowed, consists in “sliding” the read along the genome sequence and noting the positions where there is a perfect match. Unfortunately, although conceptually simple, this algorithm has complexity O(LG Lr Nr), where LG is the size of the genome sequence, Lr the size of a read, and Nr the number of reads. When gaps are allowed, one has to resort to a classical dynamic programming algorithm, such as the Needleman-Wunsch algorithm, whose complexity is also O(LG Lr Nr). Algorithms with this complexity are far too slow for aligning several hundred million reads against the human genome, as is now commonplace. Therefore, to be efficient, all the methods must rely on some sort of pre-computing. For instance, it ...
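To make the naive “sliding” search concrete, here is a minimal Python sketch of gapless exact matching; the function and variable names are illustrative and do not come from any particular mapping tool:

```python
def find_exact_occurrences(genome: str, read: str) -> list[int]:
    """Return all start positions where `read` matches `genome` exactly."""
    positions = []
    l_g, l_r = len(genome), len(read)
    # Slide the read along the genome and compare at every offset.
    # Each comparison costs up to L_r character checks, so one read costs
    # O(L_G * L_r), and N_r reads cost O(L_G * L_r * N_r) in total.
    for i in range(l_g - l_r + 1):
        if genome[i:i + l_r] == read:
            positions.append(i)
    return positions

# Toy example: the 4-mer "ACGT" occurs twice in this short sequence.
print(find_exact_occurrences("ACGTTACGT", "ACGT"))  # → [0, 5]
```

On a 3 Gb genome with hundreds of millions of reads, this per-read scan is exactly the cost the pre-computed index structures described below are designed to avoid.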