Advertisement
Research Article

DistMap: A Toolkit for Distributed Short Read Mapping on a Hadoop Cluster

  • Ram Vinay Pandey,

    Affiliation: Institut für Populationsgenetik, Vetmeduni Vienna, Vienna, Austria

    X
  • Christian Schlötterer mail

    christian.schloetterer@vetmeduni.ac.at

    Affiliation: Institut für Populationsgenetik, Vetmeduni Vienna, Vienna, Austria

    X
  • Published: August 23, 2013
  • DOI: 10.1371/journal.pone.0072614

Abstract

With the rapid and steady increase of next generation sequencing data output, the mapping of short reads has become a major data analysis bottleneck. On a single computer, it can take several days to map the vast quantity of reads produced from a single Illumina HiSeq lane. In an attempt to ameliorate this bottleneck we present a new tool, DistMap - a modular, scalable and integrated workflow to map reads in the Hadoop distributed computing framework. DistMap is easy to use, currently supports nine different short read mapping tools and can be run on all Unix-based operating systems. It accepts reads in FASTQ format as input and provides mapped reads in a SAM/BAM format. DistMap supports both paired-end and single-end reads thereby allowing the mapping of read data produced by different sequencing platforms. DistMap is available from http://code.google.com/p/distmap/

Introduction

Next Generation Sequencing (NGS) technologies have revolutionized biological research by producing an unprecedented amount of data in a cost effective manner. The rapid progress in NGS technology, however, has resulted in a subsequent increase in data volume that currently outpaces advances in computational power [1]. Hence there is a growing need for new software solutions that minimize the impact of increasing data volume on workflow duration. The first step in nearly all NGS workflows is the mapping of short sequence reads to a reference genome. As it may take several days to map NGS read data on a single computer, mapping is a potential bottleneck for most workflows. For this reason tools and software which support distributed computing are a powerful means to expedite mapping and the subsequent workflow.

At present several tools exist which support distributed mapping, each having specificities which may limit more general usage. For instance, Crossbow [2], Eoulsan [3] and FX [4], are designed for specific types of analyses, do not produce mapping output in SAM/BAM [5] format, and only support paired-end mapping. Similarly, SEAL [6] only supports paired-end BWA mapping using a sinlge version (0.5.8c), and is restricted to Linux operating systems. Other virtual machine workflows include Bio-Linux [7], Cloud Bio-Linux [8] and Galaxy [9]. Bio-Linux and Cloud Bio-Linux include a collection of useful bioinformatics tools, and the later has integrated MapReduce tools such as Crossbow [2], CloudBurst [10], SEAL [6] and bcbio-nextgen [11] but they have restrictions on mapper version and do not produce an output in SAM/BAM format. Galaxy [9] is another powerful workflow system but supports only BWA and bowtie mapping and imposes version restrictions for both the mapper and the reference sequence.

Here we present a new tool, DistMap, a modular, scalable and user-friendly workflow, which facilitates the mapping of short reads on a Hadoop cluster [14]. DistMap supports nine different mappers – the most available in a single distributed mapping tool at present – and cover a wide range of NGS applications. Furthermore, DistMap allows mapping of paired-end and single-end reads from different sequencing platforms and produces mapping output in BAM/SAM file format, such that it can be used in conjunction with distributed downstream analytical tools such as GATK [12], and Hadoop-BAM [13]. Finally, the workflow has been developed specifically for adoption by non-expert users who wish to benefit from distributed NGS read mapping.

Results and Discussion

DistMap provides an integrated workflow for short read mapping against a user-specified reference genome. The whole workflow can be run with a single Perl command. This workflow is equipped with various customized parameters and provides detailed guidelines for its implementation. All components of DistMap and their inputs have been summarized in Figure 1. An overview of the key features are described below.

thumbnail

Figure 1. The workflow of DistMap.

The DistMap workflow has 7 modules. Modul 1 is to index reference fasta file. Module 2-6 are mandetory for each new FASTQ files. The DistMap entire workflow can be executed at once and if required each module can be executed one by one.

doi:10.1371/journal.pone.0072614.g001

An integrated workflow

DistMap provides a complete workflow that includes several modules. Each module has a start and stop point (shown in Figure 1). The user is provided with the flexibility to either execute the whole workflow or to execute individual modules one after another.

Job Scheduler support

DistMap is the only workflow, which supports the different job schedulers currently available for a Hadoop cluster. It supports three schedulers 1) FIFO [14] 2) Fair scheduler [15] and 3) Capacity scheduler [16]. The user can specify the scheduler information with the option -- hadoop-scheduler within the Perl command.

Custom queue and pool support

DistMap is designed to manage the short read mapping on a large scale, whereby many custom queues and pools are available within the Hadoop cluster. Unlike other distributed mapping tools which only utilize a default queue, DistMap can also use customized queues and pools to run multiple mapping jobs in parallel. The queue assignment can be done with the option –queue-name to the Perl command.

Custom job prioritization support

DistMap supports a top-level job prioritization. The user can directly set job priority via the DistMap command line with the option –job-priority. Five priority levels are currently supported: VERY_LOW, LOW, NORMAL, HIGH and VERY_HIGH.

Mapper flexibility

The current version of DistMap (v 1.0) supports nine different mappers, covering a broad spectrum of NGS applications. Table 1 provides the corresponding version of all mapping software that has been used for testing. DistMap does not impose a version restriction for any mapper, whereby the user can download and compile any version for direct implementation within their specific workflow.

Mapper supported in DistMapVersion testedApplication
BWA0.5.8cShort read mapping
bowtie0.12.7Short read mapping
Bowtie22.0.6Short read mapping
GSNAP2012-07-20RNA-Seq Alignment DNA methylation
SOAP2.20Short read mapping
STAR2.2.0cRNA-Seq Alignment
Bismarkv 0.7.7Bisulfite mapping
BSMAP2.73Bisulfite mapping
TopHat2.0.6RNA-Seq mapping

Table 1. List of mappers supported in current version of DistMap.

Nine short read mappers for a wide range of NGS applications are supported. DistMap can use any version of these mappers.

Flexibility and transparency

DistMap imposes no restriction on the assembly version of the reference genome for mapping. Since DistMap does genome indexing itself during the execution of the workflow it is possible to map short reads with any reference genome assembly. DistMap collects all input files and parameters from a local computer and returns the final output to a local output directory as a single SAM or BAM file. There is no need to install the DistMap source code or mapper executables on all working nodes. The entire DistMap workflow can be run at once or in step-by-step fashion, such that the user can start from any step in the workflow. DistMap supports paired-end, single-end and mixed mapping of FASTQ formatted reads produced from various sequencing platforms. DistMap archives the genome index as a *. tgz file in the local output directory such that it can be re-used in subsequent mapping and thereby avoids unnecessarily re-indexing of the same genomes.

Scalability

DistMap has no restriction on NGS data handling; it is specifically designed to map Gigabytes or Terabytes of sequencing data with a single command. The speed of DistMap scales with cluster size. Several different datasets ranging in size were used to test the scalability of the DistMap workflow on a Hadoop cluster (see Table 2)

Number of read pairs (million)Size (GB)Read length (bp)
52.46100
5024.7100
10049.4100
20098.82100
500247.04100

Table 2. DistMap evaluation input datasets.

These NGS data were generated from Illumina HiSeq sequencer. 100bp paired-end reads from Drosophila melanogaster genomes.

DistMap evaluation

To evaluate the performance of DistMap we used paired-end genomic reads of 2x100bp from a Drosophila melanogaster pooled sequencing project [17] (see Table 2 for further information). The Hadoop cluster (version 1.0.3) consisted of 13 nodes running Mac OSX 10.6.8. For each node 10 CPUs, 32 GB RAM and 1TB SATA hard disk space was made available. All computer nodes were connected via Gigabit Ethernet.

We evaluated the performance of DistMap by comparing the execution time of four different mappers (BWA, bowtie, GSNAP and SOAP) for multiple datasets ranging in size, each time using a single worker node and a fixed set of parameters (see Figure 2 and Table 3). We found that DistMap execution time scales almost linearly with increase of input size, whereas single node execution increased exponentially. In particular, as data size increases beyond 500 million reads, DistMap had between 20-fold to 80-fold reduction in processing time relative to a single computer for the mappers tested. We reason that the non-linear scaling observed for the single node results from non-parallelized steps in the mapping procedure. For example, BWA only uses a single processor for the sampe/samse module, regardless of the number of processors on that computer. In contrast, DistMap can make use of all cluster nodes and their processors by splitting the data into subsets, which are later reassembled. Thus, by distributing mapping across several computers, DistMap avoids probable workflow bottlenecks caused by mapping on a single computer.

thumbnail

Figure 2. Evaluation of DistMap execution time with increase of data size.

Execution times were measured for different mappers using DistMap (red line) and compared to a single node (blue line). Datasets of different size (5, 50, 100, 200 and 500 million read pairs) from different pool-Seq experiments were used to estimate the scalability of DistMap. The hardware configuration of the single node was a Mac OSX 10.6.8 computer with 10 CPU, 32 GB RAM and 1TB Disk space available for mapping. The Hadoop cluster consists of 13 worker nodes with the same configuration. (a) BWA mapping (b) bowtie mapping, (c) GSNAP mapping, (d) mapping. All mappers were run with the same default parameters and datasets.

doi:10.1371/journal.pone.0072614.g002
Number of read pairs (million)MapperDistMap time (hour, 13 nodes)Mapper time (hour, 1 node)
5BWA0.120.47
50BWA0.704.22
100BWA1.408.33
200BWA2.7819.35
500BWA6.0582.73*
5bowtie0.100.18
50bowtie0.301.77
100bowtie0.683.50
200bowtie0.977.10
500bowtie3.4322.38
5GSNAP0.10.63
50GSNAP0.575.62
100GSNAP1.0711.12
200GSNAP2.1722.13
500GSNAP4.97120.75*
5SOAP0.080.52
50SOAP0.475.48
100SOAP0.879.2
200SOAP1.6819.63
500SOAP4.8849.4

Table 3. Comparison of running time in hours of different mappers

on a single node with 10 cores and DistMap running on 13 nodes.

Reproducibility and accuracy

We estimated the reproducibility, fault tolerance and data security of DistMap by mapping 5 million read pairs multiple times for each of the BWA, bowtie, GSNAP and SOAP mappers. Fault tolerance and data security were specifically tested by randomly deactivating nodes during an active mapping job. Even with random node deactivation, however, mapping results were found to be identical across the five independent runs for each mapper. Indeed, because DistMap makes a minimum of three copies of each data block, each distributed to a different node, data loss is highly improbable.

Comparison with existing tools

DistMap can be run on any Unix system as a single Perl command and has many user-friendly features, which enable both advanced and non-expert users to map NGS reads using distributed computing. The workflow can be executed as a whole or individual modules can be executed one after another. A comparison of the most important features of DistMap to other available tools is given in Table 4.

FeaturesCrossbow v 1.2.0SEAL v 0.1.0Eoulsan v 1.1.6FX v 1.0.4DistMap v 1.0Galaxy
SAM outputnonononoyesyes
BAM outputnonononoyesyes
Pair-end Mappingyesyesyesyesyesyes
Single-end Mappingyesnononoyesyes
Bisulfite Mappingnonononoyesno
Installation requiredyesyesyesyesnoyes
Dependencyyesyesyesyesnoyes
Operating systemAll UnixOnly LinuxOnly LinuxAll OSAll UnixAll OS
Mappers (complied)bowtieBWABWA, bowtie, SOAP, GSNAPGSNAPBWA, bowtie, Bowtie2, GSNAP, SOAP, STAR, TopHat, TopHat2, Bismark, BSMAPbowtie, BWA
Mapper version dependencynoyesyesyesnoyes
Reference sequence version dependencynononononoyes
Custom queue/pool supportnonononoyesno
Fair scheduler and Capacity schedulernonononoyesno

Table 4. Comparison of various features of DistMap and other tools for distributed mapping.

Materials and Methods

Implementation

The DistMap was developed in Perl 5.8.8 [18] to map millions of short reads produced from a single NGS experiment in a distributed manner. The mapping module of DistMap is based on the MapReduce programming algorithm [19], which runs on the Hadoop cluster. There is no dependency required to use DistMap except a working Hadoop cluster. The minimum input requirements to run DistMap on any Unix operating system are (1) executables of the required mapping software, (2) MergeSamFiles.jar and SortSam.jar, two jar files from PICARD [20], (3) access to a Hadoop cluster, (4) FASTQ formatted reads (paired-end, single end or mixed) and (5) a FASTA formatted reference genome.

Hadoop and MapReduce

Hadoop is an open source software framework, which can be installed and run on commodity computers and enables large-scale distributed data analysis. There are two components of Hadoop (1) a fault-tolerant and robust Hadoop Distributed File System (HDFS) and (2) MapReduce: a java-based API, which enables parallel computing across all nodes of a cluster.

Currently nine different mappers are supported in DistMap: BWA [21], bowtie [22], Bowtie2 [23], GSNAP [24], SOAP [25], STAR [26], Bismark [27], BSMAP [28] and TopHat [29], thereby supporting a wide range of biological applications (Table 1). While TopHat could be run on DistMap, we do not recommend this since the distributed mapping interferes with the identification of splice sites by TopHat. DistMap supports two different implementations of customized queuing systems: Capacity scheduler [26] and pool: Fair scheduler [27].

DistMap components

The DistMap workflow consists of seven main modules which can be executed either end-to-end by a single command, or each module can be executed by giving appropriate command line flags. The whole workflow is summarized in Figure 1.

DistMap input and output

DistMap workflow takes the input from a local computer, performs the mapping of the reads on a Hadoop cluster, and stores the final output file on the local computer. No direct interaction between the user and the Hadoop cluster is needed. DistMap requires reference sequence data in FASTA format and short read data in FASTQ format. If the user has to map many datasets in a single DistMap run then it can be done via command line by using the option –input multiple times. The final output of DistMap is a single SAM or BAM file without any filtering. The user can request either a SAM or a BAM output file by using the DistMap command line option –output-format.

Availability, installation and usage

DistMap is an open-source tool and is freely available for all researchers. The source code of the DistMap can be downloaded from http://distmap.googlecode.com/files/Dist​Map_v1.0.tar.gz.

The user manual of DistMap is available on http://code.google.com/p/distmap/wiki/Ma​nual.

Since DistMap was developed with easy use for non-expert researchers in mind, we provide a step-by-step guide to set up a Hadoop cluster on Linux computers http://code.google.com/p/distmap/wiki/Se​tupHadoopLinux and on Macintosh computers http://code.google.com/p/distmap/wiki/Se​tupHadoopMacintosh.

Conclusions

DistMap is a user-friendly, modular, and integrated workflow for distributed mapping of NGS-generated short reads on a Hadoop cluster. Since in most NGS applications mapping is an essential and highly time intensive step, we believe that DistMap will be greatly expedite this process and the subsequent workflow. In comparison to other tools, DistMap stands out for its generality and flexibility, supporting nine different mappers that facilitate a range of different NGS-based analyses. The availability of multiple mappers means that DistMap can be readily integrated into many existing workflows without having to incorporate a new mapper. Similarly, the SAM/BAM output format was chosen to be compatible with the most widely used downstream analytical tools. Furthermore, unlike other distributed read mapping tools, DistMap supports customized queuing, multiple job schedulers, and job prioritization within a queue. Finally, DistMap was built to be accessible to novice and advanced users alike, being executed via a single Perl command. To help facilitate its use an extensive user manual and the step-by-step instructions for setting up a Hadoop cluster on Macintosh or Linux computers are provided alongside the source code.

Acknowledgments

We are grateful to E. Versace for helpful discussions and suggestions on manuscript. We thank all members of the ‘Institut für Populationsgenetik’ for the DistMap testing and feedback. Special thanks to R. Tobler and T. Hill for revising the manuscript.

Author Contributions

Conceived and designed the experiments: RVP CS. Performed the experiments: RVP CS. Analyzed the data: RVP CS. Contributed reagents/materials/analysis tools: RVP CS. Wrote the manuscript: RVP CS.

References

  1. 1. Taylor R (2010) An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics 11(Suppl 12): S1. doi:10.1186/1471-2105-11-S10-P1.
  2. 2. Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL (2009) Searching for SNPs with cloud computing. Genome Biol 10(11): R134. doi:10.1186/gb-2009-10-11-r134. PubMed: 19930550.
  3. 3. Jourdren L, Bernard M, Dillies MA, Le Crom S (2012) Eoulsan: a cloud computing-based framework facilitating high throughput sequencing analyses. Bioinformatics 28(11): 1542–1543. doi:10.1093/bioinformatics/bts165. PubMed: 22492314.
  4. 4. Hong D, Rhie A, Park SS, Lee J, Ju YS et al. (2012) FX: an RNA-Seq analysis tool on the cloud. Bioinformatics 28(5): 721–723. doi:10.1093/bioinformatics/bts023. PubMed: 22257667.
  5. 5. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J et al. (2009) The sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25: 2078–2079. doi:10.1093/bioinformatics/btp352. PubMed: 19505943.
  6. 6. Pireddu L, Leo S, Zanetti G (2011) SEAL a distributed short read mapping and duplicate removal tool. Bioinformatics 27(15): 2159–2160. doi:10.1093/bioinformatics/btr325. PubMed: 21697132.
  7. 7. Field D, Tiwari B, Booth T, Houten S, Swan D et al. (2006) Open software for biologists: from famine to feast. Nat Biotechnol 24(7): 801-803. doi:10.1038/nbt0706-801. PubMed: 16841067.
  8. 8. Krampis K, Booth T, Chapman B, Tiwari B, Bicak M et al. (2012) Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics 13: 42. doi:10.1186/1471-2105-13-42. PubMed: 22429538.
  9. 9. Goecks J, Nekrutenko A, Taylor J, Galaxy Team (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11(8): R86. doi:10.1186/gb-2010-11-8-r86. PubMed: 20738864.
  10. 10. Schatz MC (2009) CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 25: 1363-1369. doi:10.1093/bioinformatics/btp236. PubMed: 19357099.
  11. 11. bcbio-nextgen. Available: . https://github.com/chapmanb/bcbio-nextge​n . Accessed 2013 Jun 14.
  12. 12. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K et al. (2010) The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20(9): 1297–1303. doi:10.1101/gr.107524.110. PubMed: 20644199.
  13. 13. Niemenmaa M, Kallio A, Schumacher A, Klemelä P, Korpelainen E et al. (2012) Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics 28(6): 876–877. doi:10.1093/bioinformatics/bts054. PubMed: 22302568.
  14. 14. Apache. Hadoop. Available: . http://hadoop.apache.org/ . Accessed 2013 Jun 14.
  15. 15. Capacity scheduler. Available: . http://hadoop.apache.org/docs/r1.2.0/cap​acity_scheduler.html . Accessed 2013 Jun 14.
  16. 16. Fair scheduler. Available: . http://hadoop.apache.org/docs/r1.2.0/fai​r_scheduler.html . Accessed 2013 Jun 14.
  17. 17. Kofler R, Orozco-terWengel P, De Maio N, Pandey RV, Nolte V et al. (2011) PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals. PLOS ONE 6: e15925. doi:10.1371/journal.pone.0015925. PubMed: 21253599.
  18. 18. Perl . Available: http://www.perl.org. Accessed 2013 June 14.
  19. 19. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design & Implementation: 6-8 December 2004; San Francisco, California, USA. Volume 6. . New York, NY, USA: ACM . pp. 137-150.
  20. 20. PICARD. Available: http://picard.sourceforge.net. Accessed 2013 June 14.
  21. 21. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754–1760. doi:10.1093/bioinformatics/btp324. PubMed: 19451168.
  22. 22. Langmead B, Trapnell C, Pop M, Salzberg SL (2009b) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3): R25. doi:10.1186/gb-2009-10-3-r25. PubMed: 19261174.
  23. 23. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4): 357-359. doi:10.1038/nmeth.1923. PubMed: 22388286.
  24. 24. Wu TD, Nacu S (2010) Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26: 873–881. doi:10.1093/bioinformatics/btq057. PubMed: 20147302.
  25. 25. Li R, Li Y, Kristiansen K, Wang J (2008) SOAP: short oligonucleotide alignment program. Bioinformatics 24(5): 713–714. doi:10.1093/bioinformatics/btn025. PubMed: 18227114.
  26. 26. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1): 15–21. doi:10.1093/bioinformatics/bts635. PubMed: 23104886.
  27. 27. Krueger F, Andrews SR (2011) Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 27(11): 1571–1572. doi:10.1093/bioinformatics/btr167. PubMed: 21493656.
  28. 28. Xi Y, Li W (2009) BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 10: 232. doi:10.1186/1471-2105-10-232. PubMed: 19635165.
  29. 29. Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9): 1105-1111. doi:10.1093/bioinformatics/btp120. PubMed: 19289445.