All entries for Friday 10 January 2014
January 10, 2014
I have been asked this question several times recently, along with "how much data is required for genome assembly and will more data give a better assembly ? "
I looked at these questions in a little more detail. To go about answering this I used some data from a bacterium that had been previously sequenced and closed using shotgun cloning and sanger sequencing. It has since been re-sequenced using illumina PE (2 x 250bp). The average coverage was 150x . The genome size of the bacterium is ~ 2.36 Mbp.
I first randomly subsampled this total read pool to produce sub-samples that would produce an average coverage from 15 - 150x. Each sample was assembled de novo with SPAdes and basic assembly properties assessed with QUAST.
Basic assembly properties are shown below
N50 -The length for which all contigs of that length or longer contains at least half of the total of the lengths of the contigs
As seen above once the coverage is above 15x there is a big jump in the N50 from 181 kb to 317 kb
To further narrow down the particular point at which the N50 increased, I produced samples that would have coverage of 15, 22.5, 24, 25.5, 27 and 28.5x coverage. Again they were assembled with SPAdes
The results of this are below and show that as coverage increases so does the N50 value upto 27x coverage. Above this coverage there is no increase in the N50
So if assessing a “good” assembly purely on N50, which is not a great idea, then coverage of 27x is giving the same answer as 150x and not providing any additional data
Looking at other parameters such as number of contigs longer than 1kb, also reveals that above 27x coverage there is no decrease in the number of contigs longer than 1kb.
Whilst N50 will give an indication of the size of contigs, it does not inform on the quality of the assembly. To do this again QUAST was used to compare back against the original reference genome. A range of metrics can be produced, only a few are detailed below. With all data below compared back to the known reference sequence.
First looking at the Genome Fraction of the "de novo assembly" compared to reference.
From QUAST Genome fraction (%) is defined as:the percentage of aligned bases in the reference. A base in the reference is aligned if there is at least one contig with at least one alignment to this base.
Again above 27x coverage the % was stable and did not increase with coverage – even at the lowest 15x coverage 99.75 % of the genome was still present.
Further looking at the % of complete genes that were found in the de novo assembly compared to the reference genome.
Above 24x coverage the result again plateaued. The small variation in % of complete genes, is due to 1 gene that is found in some assemblies and not others.
As can be seen below there are a number of genes that are only partialy complete (compared to the reference)
Increasing the fold coverage above 24x will not decrease the number of partial genes that are consistently obtained for an assembly.
For this particular bacterium with a small genome size ( 2.36 Mb) then there is no advantage is sequencing to a higher depth than 27x coverage. As the number of contigs >1kb in length does not decrease and the number of complete genes and % of genome that is mapped does not continue increasing with greater coverage .
So in answer to the original questions increasing the amount of sequene will produce a better assembly, but only upto a certain point.
How much sequence is needed will vary with the bacterium that is being sequenced. For this bacterium the 27x gives the same answer as 150x. But this was only calculated after 150x coverage was obtained. However, the use of simple metrics from programs such as QUAST and sub sampling of existing data allows some indication wether increasing the amount of data will produce a better assembly