The authors have declared that no competing interests exist.
Conceived and designed the experiments: JRK SS PB. Performed the experiments: JRK SS. Analyzed the data: JRK SS. Contributed reagents/materials/analysis tools: DRM MA JL WC HC QP BL JQ JW. Wrote the paper: JRK SS PB.
MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at
The emerging field of metagenomics has enabled researchers to study the structure, dynamics, and functionality of uncultured microbial communities. Processing the vast amounts of metagenomics data usually involves quality-controlling raw sequence reads, aligning them to reference databases, and assembling them into longer contigs prior to predicting genes. Several packages are available for processing and analyzing metagenomics data, either as web- and cloud-based services or stand-alone computational pipelines
As exemplified by recent clinical, large-scale, and ongoing studies (e.g., the Human and Earth Microbiome Projects), the use of high-throughput sequencing (HTS) data can be expected to increase considerably in terms of both data volume and scope of application
To address these issues, we have developed MOCAT, a metagenomics assembly and gene prediction toolkit for both small and large-scale processing of metagenomic data produced by the Illumina sequencing technology.
The main pipeline is divided into five major steps: (i) quality trimming and filtering of raw reads, (ii) optional mapping to remove, extract, and/or quantify reads matching a reference database, (iii) assembly, (iv) assembly revision, and (v) gene prediction (
Metagenomic samples are collected and sequenced. The raw sequence reads are given as input to the pipeline and processed in modular steps, resulting in metagenome assemblies and predicted genes. Arrows extending to the right from boxes indicate input to various downstream analyses. Statistics from each step are summarized into multi-sheet Excel documents as well as queryable SQLite databases.
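The modular layout of the five steps can be illustrated with a minimal sketch. It is not MOCAT code (MOCAT is implemented in Perl); the step labels, the `Step` class, and `run_pipeline` are hypothetical names chosen for illustration, and each step only records that it ran.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Step:
    name: str                                # hypothetical label, not a MOCAT option
    run: Callable[[List[str]], List[str]]    # takes and returns the run history
    optional: bool = False


def make_step(name: str, optional: bool = False) -> Step:
    # Placeholder step: it only appends its own name to the history.
    def run(history: List[str]) -> List[str]:
        return history + [name]
    return Step(name, run, optional)


# The five major steps in the order given in the text; only mapping is optional.
STEPS = [
    make_step("quality_trim_filter"),
    make_step("map_to_reference", optional=True),
    make_step("assembly"),
    make_step("assembly_revision"),
    make_step("gene_prediction"),
]


def run_pipeline(steps: List[Step], skip: Iterable[str] = ()) -> List[str]:
    """Run the steps in order, skipping only steps marked optional."""
    skipped = set(skip)
    history: List[str] = []
    for step in steps:
        if step.name in skipped:
            if not step.optional:
                raise ValueError(f"step '{step.name}' is not optional")
            continue
        history = step.run(history)
    return history
```

Because each step is an ordinary entry in a list, exchanging the program behind one step amounts to replacing that entry, which mirrors the exchangeability described for the pipeline.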
The individual processing steps in MOCAT were benchmarked using three different data sets: 124 published human gut metagenomic samples
Read quality trimming and filtering can greatly improve the length and accuracy of metagenomic assemblies
Additionally, to reduce base composition-biases that commonly occur in HTS data
MOCAT also supports the FastQC package, for evaluating raw read quality statistics (
In the second step, reads can be mapped to reference sequences in order to extract or remove reads from the original data set as well as to calculate base or read coverages. For example, reads from a human fecal metagenomic sample can be mapped to the provided human genome database (hg19, Genome Reference Consortium Human Reference 37) to remove reads of human origin using SOAPAligner2
Here, we estimated the taxonomic composition of the simulated metagenome by mapping reads to the set of original reference genomes (
The abundances observed by mapping reads to reference genomes and the expected abundances correlate with a Pearson correlation coefficient of 0.95 (base and read counts). Circles represent genomes with multiple strains from one species and squares represent genomes with only one strain within the species. All but one of the observations deviating from the diagonal are strains from the same species. These strains are either over- or underrepresented because reads are mapped to other closely related strains in addition to the strain of origin. Dashed lines highlight two examples where a high sequence similarity between strains (99.9% and 98.7% for the
When estimating taxonomic composition of the HMP mock community, reads were mapped to reference sequences of the community (
The estimated abundances using qPCR and by mapping reads to reference genomes correlate with a Pearson correlation coefficient of 0.75 (base counts) and 0.83 (read counts).
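For readers who wish to reproduce this kind of comparison, the following is a minimal sketch of how a Pearson correlation coefficient between two abundance profiles can be computed. The function names are illustrative and not part of MOCAT, and no values from this study are reproduced.

```python
import math
from typing import List, Sequence


def relative_abundance(counts: Sequence[float]) -> List[float]:
    """Convert per-genome base or read counts into fractions that sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Under these assumptions, `pearson(relative_abundance(observed_counts), relative_abundance(expected_counts))` would yield one coefficient per count type (base or read counts), where both arguments list the genomes in the same order.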
In the assembly step, a new version (1.06) of SOAPdenovo
The accuracy of metagenomic assemblies was assessed using data from the simulated metagenome and the mock community. We used the percentage of predicted complete genes aligning to the reference sequences of origin as a proxy for correctly assembled scaftigs (contigs that were extended and linked using the paired-end information of sequencing reads). For the simulated metagenome, this value was 95.2% (12,385 complete genes predicted); for the mock community, it was 89.3% (1,042 complete genes predicted). The lower number of predicted complete genes in the mock community may be explained by the relatively low number of high-quality reads used in the assembly for this metagenome.
The effect of using variable kmer sizes, rather than a fixed kmer size, in the assembly step was evaluated using the 124 gut metagenomes. Estimating kmer sizes at run-time for each individual metagenome, rather than using a fixed kmer size across all samples, improved the number and frequency of complete gene calls as well as the overall average gene length (column 1 in
Quality metric | Improvement compared to fixed kmer = 23, no assembly revision (%) | Improvement compared to fixed kmer = 23, revised assembly (%) |
Number of complete genes | 8.1 | 10.2 |
Number of complete genes/Mbp | 4.6 | 18.5 |
Average gene length | 1.7 | 1.8 |
Gene prediction metrics improve when using an automatically estimated kmer size in SOAPdenovo and with assembly revision (correction of base errors, short indels, and chimeric contigs), compared to a fixed kmer size of 23 in SOAPdenovo and no assembly revision. The kmer size is estimated as the closest odd number greater than half the average read length for a sample. Numbers reported are the percent improvement of the respective quality metric. The calculated kmer for each sample is given in
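The kmer estimation rule stated above can be written as a short sketch. This is an illustration rather than the MOCAT implementation; the function name is hypothetical, and "greater than" is assumed to be strict, so an average read length of 46 (half of which is exactly 23) yields 25 rather than 23.

```python
import math


def estimate_kmer(avg_read_length: float) -> int:
    """Smallest odd integer strictly greater than half the average read length."""
    if avg_read_length <= 0:
        raise ValueError("average read length must be positive")
    half = avg_read_length / 2
    k = math.floor(half) + 1   # smallest integer strictly greater than half
    if k % 2 == 0:             # step up to the next odd number if needed
        k += 1
    return k
```

For example, under this reading a sample with an average read length of 44 bases receives a kmer size of 23, and one with an average read length of 75 bases receives 39.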
In the assembly revision step, a feature independent of the utilized assembly packages, MOCAT can revise existing paired-end read assemblies by aligning the reads to assembled scaftigs using the gap-tolerant BWA aligner
Finally, protein coding genes on the metagenomes are predicted using either the default component Prodigal
The functionality and versatility of the pipeline have been demonstrated using an artificial mock community metagenome, a simulated metagenome with 100 species, and 124 human gut metagenomes. Based on parameter exploration and data-driven parameter optimization at run-time, the MOCAT pipeline can process metagenomes in a standardized and automated way while improving the quality of assembly and gene prediction compared to using default parameters for the supported programs. To date, MOCAT has additionally been used to process and assemble hundreds of host-associated and ocean metagenomes within the scope of the MetaHIT
MOCAT is implemented in Perl and installed by extracting the package and executing one script, which downloads the default external software used by the pipeline and sets up the software. This reduces the otherwise tedious process of downloading all the individual components, a common drawback of in-house pipelines
A new project is quickly set up, requiring only single- or paired-end FastQ-formatted sequencing read files
A queuing system enables processing of a large number of samples in parallel. If one is present, MOCAT integrates all processing steps with the SGE and PBS queuing systems. If no queuing system is available, MOCAT processes samples serially on the machine on which it was executed.
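The choice between queued and serial execution can be sketched as follows. This is a simplified illustration and not how MOCAT itself is implemented: the script name `process_sample.sh`, the `SAMPLE` variable, and both function names are hypothetical. The sketch relies only on the fact that SGE and PBS both provide a `qsub` command that accepts `-v` to pass environment variables to a job. Commands are planned but not executed.

```python
import shutil
from typing import List, Optional, Sequence


def detect_qsub() -> Optional[str]:
    """Return the path to qsub if a queuing system client is installed."""
    return shutil.which("qsub")


def plan_jobs(
    samples: Sequence[str],
    script: str,                # hypothetical per-sample job script
    qsub: Optional[str],
) -> List[List[str]]:
    """Build one command per sample: submitted via qsub, or run directly."""
    commands = []
    for sample in samples:
        if qsub is not None:
            # Queued: each sample becomes an independent job that the
            # scheduler may run in parallel with the others.
            commands.append([qsub, "-v", f"SAMPLE={sample}", script])
        else:
            # Serial fallback: the caller runs these commands one by one.
            commands.append(["env", f"SAMPLE={sample}", "sh", script])
    return commands
```

A caller could pass `detect_qsub()` as the `qsub` argument and hand each planned command to `subprocess.run`; with a queuing system the calls return as soon as the jobs are submitted, whereas in the fallback case each sample is processed to completion before the next one starts.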
MOCAT runs on 64-bit UNIX systems and can be freely downloaded at
Data for the simulated metagenome are publicly available at
Raw reads for the 124 human gut microbiomes were downloaded from the EBI homepage (accession number ERA000116,
The three datasets were processed by the
Estimated taxonomic compositions for the simulated metagenome and the mock community were calculated in three steps. First, quality trimmed and filtered reads from the mock community were screened against a FASTA-file with Illumina adapter sequences (
Assembly and gene prediction, on the simulated metagenome and mock community, were performed using the
The 124 human gut microbiomes were processed with and without 5′ trimming. 5′ trimmed reads were assembled using SOAPdenovo
Complete commands for processing the simulated metagenome and mock community in MOCAT are bundled with the installation of the pipeline.
We wish to thank the MetaHIT consortium and members of the Bork group, especially Siegfried Schloissnig, for fruitful discussions and code improvements.