Conceived and designed the experiments: DRZ EB. Performed the experiments: DRZ. Analyzed the data: DRZ GKM. Contributed reagents/materials/analysis tools: EHM. Wrote the paper: DRZ EB.
EB has a small number of shares (<$5,000) in Illumina from previous consultancy with Solexa.
Despite the short length of their reads, micro-read sequencing technologies have shown their usefulness for
We present a novel heuristic algorithm, Pebble, which uses paired-end read information to resolve repeats and scaffold contigs to produce large-scale assemblies. In simulations, we can achieve weighted median scaffold lengths (N50) of above 1 Mbp in Bacteria and above 100 kbp in more complex organisms. Using real datasets we obtained a 96 kbp N50
These algorithms extend the utility of short read only assemblies into large complex genomes. They have been implemented and made available within the open-source Velvet short-read
Next-generation sequencing (NGS) technologies, thanks to their high throughput and cost effectiveness, present many experimental opportunities for molecular biologists. These tools are gaining wide popularity, for example in SNP detection
The NGS platforms currently on the market can be divided into two categories. Firstly, 454 sequencers
Despite this difficulty, assembly software has been developed to solve this problem, such as EULER-SR
To overcome this obstacle, paired-end reads have been widely considered as a promising solution. Sequencing both ends of a DNA fragment of known length produces not only two sequences but also their relative placement information, which can be exploited to constrain the space of possible assemblies. This approach has been studied in projects such as ARACHNE
EULER-DB
The SHORTY algorithm
In our previous paper
In a subsequent paper, Hossain
We describe here a new read-pair resolution algorithm, Pebble, which exploits the knowledge of insert lengths to resolve more complex situations. This use of insert lengths not only allows the algorithm to localise mate-pairs much more efficiently using only short unique contigs, but is also instrumental in resolving the sequence of complex repeat copies.
Another approach to resolving repeats is through mixing long and short reads, to improve assemblies by benefiting from the respective advantages of the different technologies. We present a simple method, named Rock Band, to exploit sparse long read datasets within a short-read assembly to resolve repeats and extend contigs.
We tested our solutions on simulated datasets then extended to experimental data from
The Pebble and Rock Band algorithms were designed to function within the Velvet assembler, and as such are based on the de Bruijn graph structure presented in detail in
To resolve repeats with confidence it is necessary to determine which contigs in the assembly are unique. Various methods based on topology have been used
Pebble tries to connect the unique nodes identified previously, but using the paired-end information. For each unique node, chosen in an arbitrary order, it iteratively estimates distances from that node, extends it to the next unique node, then merges the distance information provided by both nodes.
Before resolving repeats, it is necessary to organise read-pair information in a condensed and convenient structure. For any two nodes in the graph, Velvet enumerates the reads or mate-pairs which connect them. Using the maximum likelihood estimator (MLE) described in
This approach makes a number of simplifying assumptions. Firstly, it supposes that each insert library has a normal length distribution. Secondly, it only accounts for read-pairs which successfully connect two given nodes, thus biasing the insert length distribution by interval censoring. To be exhaustive, the MLE would have to consider the likelihood of the read-pairs which fail to connect the two nodes. However, this formalization would be much more expensive to calculate, because of the quadratic number of operations, as the information of each read-pair would have to be integrated in many MLE calculations simultaneously.
Before trying to extend a unique node
The SHORTY
From a unique node A, Pebble starts by incorporating all the distances relative to this node found in the primary scaffold. For every unique node B which is connected to A, Pebble then follows the primary connections associated to B, thus flagging secondary neighbours of A. Assuming that all the nodes are laid out on a line, we can estimate that the distance from A to C is equal to the distance from A to B, minus that from B to A.
The secondary scaffold takes into account orientation by using algebraic distances. Contigs which are upstream of the reference point are assigned negative distances, whereas contigs downstream are assigned positive ones. This notation allows Pebble to subtract distances without any constraint.
However, derived distance estimates have to be handled with caution, because secondary neighbours are not necessarily unique. It is possible that the distance estimate from a unique contig to a repeated one is actually based on the distances from the unique node to separate copies of the repeat. This means that distances in the secondary scaffold tend to be fuzzy.
Once Pebble has determined a set of distance estimates from node
At each iteration, it searches through the outgoing arcs, and chooses the direct neighbour which is putatively closest to the starting point. To avoid loops, priority is systematically given to nodes which have not yet been visited, or have been visited the fewest amount of times.
If a path is found to the next unique node, then the sequence associated to the path and the sequence of the destination node are appended to the starting node. All the read information from the destination is transposed onto the starting node. The destination node can then be deleted.
Reads belonging to the path are left in place. By default the nodes along the path are considered to be potential repeats. This is why Pebble does not track the position reads within resolved repeats.
The advantage of the de Bruijn graph over the more traditional overlap graph is to allow the mixture of read-lengths. If long reads are available, they can be easily used to connect the nodes of the graph after error correction, using a simple procedure called Rock Band. The main idea is that if all the long reads which go out of one unique node go consistently to another unique node and vice-versa, then both nodes can be confidently merged.
Examining every unique node in an arbitrary order, Rock Band iteratively tries to extend that node to the next unique node. It enumerates the long reads which go out of that node, and follows them to the next unique node, as described in
In this simplified diagram, contigs are represented as boxes, and long reads as thick curved lines. Instead of performing a pair-wise comparison of reads, the algorithm only examines the reads going out of unique node A. Two of the reads (1 and 2) go to node B. Node 3 is disregarded because it is not confirmed by another read. The algorithm then examines the reads going into node B. They all come from node A, except read 4, which is disregarded because unconfirmed, and read 5 which is not in contradiction with the assembly of contigs A and B. Finally, read 6, despite its overlap with the other reads, is disregarded throughout the analysis, as it goes through neither nodes A nor B.
At the next iteration, some of the reads used previously to find a path may come to an end. However, merging the nodes involves also merging the sets of reads, possibly incorporating new reads which allows the process to continue.
We tested our software on simulated datasets generated from genomic regions from 4 different species, namely
Randomly placed reads were generated according to three different scenarios. In the first case, perfect reads were generated from the reference sequence. In the second, single substitution base-pair errors were inserted at a rate of 1%. Finally, we simulated a diploid sequencing project. We generated an alternate version of the reference where artificial SNPs were introduced randomly at a rate of 1 per 500 bp. One half of the reads was generated from the original reference, the other from the modified reference. All reads contained random errors as above.
In the first simulation, we tested various insert lengths, from 100 bp to 12kb in short read format (35 bp) alone. The results in
Final scaffold N50 as a function of insert length, in four different species and three different simulations scenarios. The horizontal red lines represent the initial N50 after error removal and before repeat resolution. The dashed blue lines represent the highest possible N50, namely the length of the sequence being sampled.
Pebble can integrate multiple read pair libraries, but the length of the scaffolds produced is governed mainly by the size of the longest library as long as the variance in the insert length is reasonable, in practice around 10% of the insert length. Larger variance in insert sizes degrades the N50.
In the second experiment, we tested various mixtures of long and short reads. The results, shown in
Final contig N50 as a function of long read concentration, in four different species, and three different simulation scenarios. The length of the long reads is represented by the colour of the curves: 100 (black) 200 (red) 400 (green) 500 (blue) and 1000 bp (light blue).
However there is a fixed ceiling for each read length, due to the repeat structure of long repeats in genomes. The level and behaviour of this ceiling varies according to the genome and the length of the long reads. For
In the case of reads with errors, at very high coverage depths the N50 diminishes. This would be due to the accumulation of errors which become more difficult to distinguish from genuine sequence, and then prevent the Rock Band algorithm from resolving repeats.
Higher N50 contig lengths were obtained in this experiment (∼1 Mbp in bacteria, ∼100 Kbp in human) than in the previous simulations with mixed length reads. This difference is due to the fact that inserts can be much longer than long reads. Pebble is able to effectively exploit these very long insert lengths through the construction of the secondary scaffold. However, for comparable lengths, high density paired-end reads consistently produce slightly longer scaffolds than long reads, as shown on
The red and black curves represent the final scaffold N50 after the execution of the Rock Band or Pebble algorithms respectively, as a function of long read or insert length, in four different species and three different simulations scenarios, as described in
Pebble was tested on actual Illumina data, generated from
Only 15 short contigs did not align to the reference, representing a total of 26 kbp. 10 of those contigs represented the copies of a common repeat, which did not align perfectly to the reference genome. The other contigs only aligned partially to the reference, and presented novel sequence, possibly due to minor contamination. For example, one of them aligned to a
If scaffolding is turned off then the contig N50 goes down to 24 kbp. However, many of the buffers are very short. If the contigs are only broken up at buffers which are estimated to be strictly longer than 1 bp, then the N50 of the “near-contigs” becomes 48 kbp
These results were compared to those obtained with EULER-USR
The details of this comparison are in
Assembly | N50 length (kbp) | Maximum length (kbp) | Contig or scaffold count | Coverage (%) |
EULER contigs | 40 | 215 | 598 | 103.0 |
Velvet contigs | 24 | 134 | 595 | 97.9 |
Velvet near-contigs | 48 | 157 | 430 | 97.9 |
EULER scaffolds | 51 | 215 | 620 | 107.8 |
Velvet scaffolds | 96 | 314 | 274 | 98.4 |
Comparison of Velvet and Euler assemblies. The contig or scaffold count corresponds to the number of contigs or scaffolds longer than 100 bp. Near-contigs are defined as scaffolds which are broken up only if the distance between two contigs is estimated to be strictly greater than 1 bp.
Finally, Pebble was tested on Illumina data from a ferret BAC. 690,494 36+36 bp read pairs were generated with a 270 bp insert length. This assembly produced a single contig 147.362 bp long, which aligned with a 0.02% global error rate. These results show that microreads can be used to obtain contigs of significant length even on complex genomes.
Pebble is a new de Bruijn graph algorithm which can use read pairs to extend genomic assembly contigs using the dense paired-end information produced by next-generation sequencing platforms. We tested this method on simulated as well as experimental datasets. The length weighted median contig length (N50) in simulations is within that reached in traditional “draft” assemblies, and for all genomes is greater than the median gene length. Although not the focus of this paper, a light coverage of longer insert lengths (such as fosmid end pairs) will be able to easily super-scaffold these scaffold-contigs into even longer collinear components. Thus using only short reads, useful assemblies can be generated for complex organisms.
The RockBand algorithm fills a similar role using single long reads added to the genome. Long reads in this context might be from 454 data, or traditional Sanger reads, with not enough coverage for assembly by other means. In many cases there are already partial long read datasets which can be leveraged now with the addition of short reads. For reads as long as short-read paired-end inserts, Rock Band and Pebble produces comparable N50s. However, Pebble is also able to resolve repeats using longer insert lengths. This study confirms the intuition of many genomic researchers that the generation of longer length inserts (3 Kbp-10 Kbp) will be very valuable for resolving genomic features.
The Pebble algorithm requires tracking of each short read through the error correction processes of tip clipping and TourBus described in our previous paper. This tracking information produces a large memory requirement, with 120 Mbp assemblies needing 50 to 100 GB of physical memory. This means that currently the main limitation for using Velvet on larger genomes are engineering (rather than algorithmic) issues. The ABySS de Bruijn graph system works on a distributed memory system and can handle larger genomes
In this paper we have explored these algorithms by simulation and with two real datasets. These algorithms have been available in partial form in the Velvet package for around one year, and a variety of other groups have already reported good results with these methods, such as
After establishing for each unique node a list of its primary neighbours, Pebble removes connections between unique nodes which are not supported by enough evidence. To start with, it discards distance estimates which were derived from less than a given cutoff (by default 4) of mate-pairs. Assuming that the distance estimate is correct, it then estimates the expected number of mate-pairs which should connect the two contigs (cf. below). If the observed number is below 10% of that expected values, then the distance estimate is deleted.
For each pair of contigs connected by read pairs, and for each insert library, we call
We first define a few variables:
We finally obtain an estimate of the expected number
As two unique nodes are being merged into one, it is vital to also merge the corresponding scaffolding information, to keep the process going. This operation is directly based on the original construction of the local scaffold, since the formula allows us to incrementally add information, both from primary and derived distances.
Admittedly this use of the formula ignores the fact that the derived distances around one node are not necessarily independent from those around a neighbouring node. Nonetheless, this bias is neglected for the sake of computational speed.
It is not always possible to extend a node to another unique contig. This can be due to coverage gaps, or to anomalies detected by Pebble as described below. However, the search for a path between two contigs is asymmetric, depending on which of the two the local scaffold was built around.
Whenever Pebble estimates that two contigs, using read-pair information, should be neighbours, but fails to find a path from both ends, it finally scaffolds the two together. They are merged as if a path were found, but separated by a buffer sequence of undetermined nucleotides (marked as N's). The length of this buffer is equal to the distance estimate between the nodes.
Because Pebble is a greedy algorithm, special care must be taken to avoid creating misassemblies. These can be detected through inconsistencies in contig distances. When extending a unique contig, Pebble checks that it terminates on the nearest unique contig in the local scaffold, and did not step directly to a farther one. Such inconsistencies can be caused by the variance in the distance estimator, especially when using mate-pair libraries with high insert length variance. However, for the sake of precaution, Pebble stops whenever it meets such an occurrence.
Another pitfall of this heuristic depth-first search is looping within highly repetitive regions of the graph before eventually finding the correct path out. In its current implementation, Pebble does not try to resolve these regions, but simply jumps over them.
To prevent looping, Pebble stops whenever it visits the same node twice, without having visited a new node in between.
We denote by
We assume that the multiplicity within each contig's
According to the Central Limit Theorem, the mean of all the
To ensure specificity, we impose a uniqueness cutoff of
To reduce calculations, long reads are clipped at their ends so that all their tips are mapped onto unique nodes. This clipping can potentially completely erase long reads which are apparently included in repeated regions. Nonetheless, this is quite rare, as in our experience, even within eukaryotic repeats short unique regions can be found.
To reduce noise, a path between two unique nodes is neglected if it is not validated by two reads or more. When confronted with discrepancies between long reads, Rock Band interrupts its progression, and flags the corresponding node as non-unique.
Random 35 bp paired-end short reads were generated at a constant coverage of 50x. The simulated insert lengths were either: 100, 200, 300, 400, 500, 600, 700, 800, 1600, 3200, 6400, or 12800 bp long. The actual insert lengths were randomly varied around the expected value with a standard deviation of 10% of that length.
In this set of simulations, random 35 bp short reads were generated at a constant coverage of 50x. Random long reads were generated at varying coverage values, namely all integer values from 1 to 20x. The long reads were either 100, 200, 400, 500 or 1000 bp long.
The 36 bp reads from
To set these parameters, we ran several assemblies without any options other than a varying k-mer length, and chose the one which produced the highest N50 length. From that preliminary assembly, 13x was the mode of the contig coverage distribution, and 7x was chosen as just above half the previous value. Finally, the insert length was determined by the fragment length selection performed by the authors of the experiment
The contigs were aligned to the reference using exonerate
The ferret Illumina dataset is available in the Short Read Archive under the accession number SRA009025. The 36 bp reads correspond to sample 149, run on lanes 3 and 4 of flowcell 30GTEAAXX (2 December 2008). The BAC corresponds to clone CH237-509L18, accession number AC170700. Velvet was run with a hash length of 31 bp and the following parameters, “-cov_cutoff 20 -exp_cov 40 -ins_length 270”.
Velvet is implemented in C, and has been tested on various Linux, MacOS and Sparc/Solaris systems. It is freely available under GPL 2 license at
This additional text provides more details on the methods described in the main manuscript.
(0.23 MB DOC)
Many thanks to Hatice Ozel Abaan for making and sequencing the pooled BAC libraries and to the Illumina company for early access to reagents, to David Studholme and Eric Kemen for providing the