January 10, 2014

What coverage is needed for a “good” assembly of a bacterial genome?

I have been asked this question several times recently, along with "how much data is required for genome assembly, and will more data give a better assembly?"

I looked at these questions in a little more detail. To go about answering them I used data from a bacterium that had previously been sequenced and closed using shotgun cloning and Sanger sequencing. It has since been re-sequenced using Illumina PE reads (2 x 250 bp), giving an average coverage of 150x. The genome size of the bacterium is ~2.36 Mbp.
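
Average coverage here is just total bases sequenced divided by genome size. As a quick sanity check, here is a minimal sketch (the file names are hypothetical, and it assumes uncompressed FASTQ with the standard 4-line record):

    # count reads across both files of the pair (4 lines per FASTQ record),
    # then estimate fold coverage for a ~2.36 Mbp genome of 250 bp reads
    READS=$(( $(cat read1.fq read2.fq | wc -l) / 4 ))
    echo "scale=1; $READS * 250 / 2360000" | bc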

I first randomly subsampled this total read pool to produce sub-samples corresponding to average coverages from 15x to 150x. Each sample was assembled de novo with SPAdes, and basic assembly properties were assessed with QUAST.
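
A sketch of the sort of loop this involves (file names, seed and tool options are illustrative rather than the exact commands run; note the same seed is given to seqtk for both files so that read pairing is preserved):

    # FRAC is the fraction of the full ~150x read pool, so 0.1 gives ~15x
    for FRAC in 0.1 0.2 0.4 0.6 0.8 1.0; do
        seqtk sample -s100 read1.fq $FRAC > sub1_$FRAC.fq
        seqtk sample -s100 read2.fq $FRAC > sub2_$FRAC.fq
        spades.py -1 sub1_$FRAC.fq -2 sub2_$FRAC.fq -o spades_$FRAC
        quast.py spades_$FRAC/contigs.fasta -o quast_$FRAC
    done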

Basic assembly properties are shown below.

N50 - the length such that contigs of this length or longer contain at least half of the total length of all contigs.
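
To make the definition concrete, here is a minimal sketch of an N50 calculation (not how QUAST computes it; it assumes a hypothetical file with one contig length per line):

    # sort lengths largest first, then walk down until half the total is reached
    sort -rn contig_lengths.txt | awk '
        { total += $1; len[NR] = $1 }
        END { for (i = 1; i <= NR; i++) {
                  running += len[i]
                  if (running >= total / 2) { print "N50 =", len[i]; exit }
              } }'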


[Figure 1: N50 against fold coverage]

As seen above, once the coverage rises above 15x there is a big jump in the N50, from 181 kb to 317 kb.

To narrow down the point at which the N50 increased, I produced further samples with coverages of 15x, 22.5x, 24x, 25.5x, 27x and 28.5x. Again these were assembled with SPAdes.
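
As the full read pool is ~150x, the fraction passed to seqtk for a given target coverage is simply target/150 (again a sketch with illustrative file names):

    for COV in 15 22.5 24 25.5 27 28.5; do
        FRAC=$(echo "scale=3; $COV / 150" | bc)
        seqtk sample -s100 read1.fq $FRAC > sub1_${COV}x.fq
        seqtk sample -s100 read2.fq $FRAC > sub2_${COV}x.fq
    done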

The results of this are below and show that as coverage increases so does the N50 value, up to 27x coverage. Above this coverage there is no further increase in the N50.

[Figure 2: N50 against fold coverage, 15x to 28.5x]

So, if a “good” assembly is judged purely on N50 (which is not a great idea), coverage of 27x gives the same answer as 150x, and the extra data provides nothing further.

Looking at other parameters, such as the number of contigs longer than 1 kb, reveals the same pattern: above 27x coverage the number of contigs does not decrease.

[Figure 3: number of contigs >1 kb against fold coverage]

Whilst N50 gives an indication of contig size, it does not inform on the quality of the assembly. To assess this, QUAST was used again, this time comparing each assembly back against the original reference genome. A range of metrics can be produced; only a few are detailed below, all of them calculated against the known reference sequence.
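
In practice this just means giving QUAST the reference sequence (and, for the gene-level metrics further down, a gene annotation file); a minimal sketch with illustrative file names:

    quast.py spades_out/contigs.fasta -R reference.fasta -G reference_genes.gff -o quast_vs_reference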

First, looking at the genome fraction of the de novo assembly compared to the reference.

In QUAST, Genome fraction (%) is defined as the percentage of aligned bases in the reference, where a base in the reference is aligned if there is at least one contig with at least one alignment to this base.

[Figure 4: genome fraction (%) against fold coverage]

Again, above 27x coverage the percentage was stable and did not increase with coverage; even at the lowest coverage of 15x, 99.75% of the genome was still present.

Next, looking at the percentage of complete genes found in the de novo assembly compared to the reference genome.

[Figure 5: complete genes (%) against fold coverage]

Above 24x coverage the result again plateaued. The small variation in the percentage of complete genes is due to a single gene that is found in some assemblies and not others.

As can be seen below, there are a number of genes that are only partially complete compared to the reference.

[Figure 6: number of partial genes against fold coverage]

Increasing the fold coverage above 24x does not decrease the number of partial genes that are consistently obtained for an assembly.

For this particular bacterium, with its small genome (~2.36 Mb), there is no advantage in sequencing to a depth greater than 27x: the number of contigs >1 kb does not decrease, and neither the number of complete genes nor the percentage of the genome covered continues to increase with greater coverage.

So, in answer to the original questions: increasing the amount of sequence will produce a better assembly, but only up to a certain point.

How much sequence is needed will vary with the bacterium being sequenced. For this bacterium, 27x gives the same answer as 150x, although this could only be calculated after 150x coverage had been obtained. However, using simple metrics from programs such as QUAST, together with sub-sampling of existing data, gives some indication of whether increasing the amount of data will produce a better assembly.






2 comments

  1. Shilp Purohit

    Dear Andrew,

    Fantastic post! Very informative.

    I would like to know how you down-sampled your Illumina data. Is it that you take the first N reads, such that the sub-sampled group represents 30x, 60x or 90x coverage respectively?

    Thank you.

    Best,
    Shilp

    03 Apr 2014, 11:36

  2. Andrew Millard

    Hi Shilp,

    Glad that you found it useful. I used seqtk sample. So if I had 100 reads:

    For 90%: seqtk sample -s10 read1.fq 90 > sub_90.fq
    For 80%: seqtk sample -s10 read1.fq 80 > sub_80.fq

    Seqtk also takes fractions as input, so you don’t have to calculate the number of reads that you want.

    Andy

    03 Apr 2014, 12:20

