May 15, 2014

Comparison of SOAPdenovo, SPAdes and IDBA-Hybrid in 147 Vibrio cholerae genomes

I have been using SOAPdenovo as my main assembler for 6 years. It is very fast and can produce good assemblies with high N50 values and low error rates in half an hour, even on a single processor. The other widely used assembler is Velvet. SOAPdenovo outperformed Velvet in many comparisons I have done on bacterial genomes. However, things are changing, and maybe it is time for me to use a new assembler. Multiple publications have claimed that many new assemblers are much better than SOAPdenovo. One of them, SPAdes, seems to have been shown to be the best, and many of my colleagues have suggested it to me. Thus, I decided to compare SOAPdenovo and SPAdes.

Here I also included another assembler, IDBA-Hybrid. The standard version of IDBA, IDBA-UD, seems to perform at a similar level according to a comparison done by the authors of SPAdes. IDBA-Hybrid fits my needs in population genetics perfectly. It is a reference-accelerated assembler, and the authors said "The experiments showed it outperforms all existing de novo or hybrid assembly algorithms". It is clear that IDBA-Hybrid at least outperforms IDBA-UD when you can find a closely related reference. However, I cannot find any data on the performance of IDBA-Hybrid online. Thus, I hope this comparison can give me an overview of its capability.

DATA to compare

I used Vibrio cholerae as an example. Some of the data are from the Sanger Institute and many others are from BGI. It is not a good dataset for representing all bacteria, but it seems to be a good representative of Gram-negative ones. V. cholerae has two chromosomes with an average total size of about 4 Mb, and carries several kinds of repetitive regions as well as a long, complex superintegron region.

Three combinations of libraries were included in the dataset. 73 samples have 54 bp reads with a 300 bp insert size. 6 samples have 100 bp reads with a 300 bp insert size. 68 samples have 100 bp reads with a 500 bp paired-end library and a 6 kb mate-pair library.


IDBA_hybrid and SPAdes were run with default settings. SOAPdenovo was run with 4 combinations of parameters: k-mer 23-43 with a 2X k-mer cutoff, k-mer 23-43 with a 4X k-mer cutoff, k-mer 41-63 with a 2X k-mer cutoff and k-mer 41-63 with a 4X k-mer cutoff. After assembly, the set of scaffolds with the highest N50 value was kept for each sample, and intra-scaffold gaps were filled by GapCloser from the same package. Although there were 4 x 147 runs for SOAPdenovo, they finished in about 1 day on two processors. IDBA_hybrid finished in 2 days and SPAdes finished after 5 days.
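
To make the "keep the scaffolds with the highest N50" step concrete, here is a minimal Python sketch. It is not the script I actually ran: the function names `n50` and `best_run` are made up for illustration, and the scaffold lengths are toy numbers, not real assembly output. It only shows how N50 is computed from a list of scaffold lengths and how the best of the four SOAPdenovo parameter combinations could be picked for one sample.

```python
from typing import Dict, List, Tuple


def n50(lengths: List[int]) -> int:
    """Return the N50 of a set of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def best_run(runs: Dict[Tuple[str, int], List[int]]) -> Tuple[str, int]:
    """Return the (k-mer range, k-mer cutoff) key whose scaffolds have the highest N50."""
    return max(runs, key=lambda key: n50(runs[key]))


# Toy scaffold lengths for one hypothetical sample (illustrative values only).
runs = {
    ("23-43", 2): [100, 80, 60, 40, 20],
    ("23-43", 4): [150, 90, 30, 20, 10],
    ("41-63", 2): [120, 120, 40, 10, 10],
    ("41-63", 4): [200, 50, 30, 10, 10],
}

if __name__ == "__main__":
    for key, lengths in runs.items():
        print(key, "N50 =", n50(lengths))
    print("kept:", best_run(runs))
```

In practice the lengths would be read from each run's scaffold FASTA file; selecting purely on N50 ignores misassemblies, which is why the error rates are evaluated separately afterwards.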


All the values in the table are averages over all the samples with the same libraries. Calculations were done with QUAST 2.3.
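
The per-library averaging can be sketched as below. This is an assumed illustration, not the actual processing script: the function `average_by_library`, the library labels and the numbers are placeholders, and parsing of the QUAST reports is left out.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def average_by_library(rows: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average one metric over all samples that share the same library combination."""
    grouped = defaultdict(list)
    for library, value in rows:
        grouped[library].append(value)
    return {library: mean(values) for library, values in grouped.items()}


# Placeholder (library label, metric value) pairs; not real results.
rows = [
    ("54bp_PE300", 100.0),
    ("54bp_PE300", 200.0),
    ("100bp_PE300", 300.0),
]

if __name__ == "__main__":
    print(average_by_library(rows))
```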


This result is very different from many comparisons that have been done previously. The contigs and scaffolds generated by SOAPdenovo are comparable with those from SPAdes. SOAPdenovo actually outperforms SPAdes in terms of N50 values for short reads (54 bp). It still has major problems: a high number of local misassemblies, and high indel rates with short reads. Even so, this comparison shows that SOAPdenovo is still a good assembler that can be used in first-line applications, especially given its speed advantage over the other assemblers tested.

The other assembler, IDBA_hybrid, performs extremely well, with low error rates and high N50 values in samples with low coverage and short reads. In some extra samples with only 10-15X coverage, IDBA_hybrid is the only assembler that can give me meaningful assemblies. However, IDBA_hybrid does not work as well as I expected in high-coverage samples and with mate-pair libraries.

