January 10, 2014
I have been asked this question several times recently, along with "how much data is required for genome assembly, and will more data give a better assembly?"
I looked at these questions in a little more detail. To answer them I used data from a bacterium that had previously been sequenced and closed using shotgun cloning and Sanger sequencing. It has since been re-sequenced using Illumina paired-end reads (2 x 250 bp), giving an average coverage of 150x. The genome size of the bacterium is ~2.36 Mbp.
I first randomly subsampled this total read pool to produce sub-samples giving average coverages from 15x to 150x. Each sample was assembled de novo with SPAdes and basic assembly properties were assessed with QUAST.
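The arithmetic behind the subsampling step can be sketched in Python. This is a minimal sketch: the genome size and read length come from above, but the exact read counts and sampling method used for the post are assumptions.

```python
import random

GENOME_SIZE = 2_360_000   # ~2.36 Mbp genome
READ_LEN = 250            # 2 x 250 bp paired-end reads

def pairs_for_coverage(target_cov):
    """Read pairs needed for a target average coverage:
    coverage = total bases / genome size, each pair contributing 2 * READ_LEN bases."""
    return round(target_cov * GENOME_SIZE / (2 * READ_LEN))

def subsample_pairs(pairs, target_cov, seed=1):
    """Randomly subsample a pool of read pairs down to a target coverage."""
    random.seed(seed)
    return random.sample(pairs, pairs_for_coverage(target_cov))

for cov in (15, 27, 150):
    print(f"{cov}x needs {pairs_for_coverage(cov)} read pairs")
# 15x needs 70800 read pairs; 27x needs 127440; 150x needs 708000
```

In practice a dedicated tool (e.g. seqtk) would do the sampling over FASTQ files while keeping read pairs together; the principle is the same.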
Basic assembly properties are shown below
N50 - the length for which all contigs of that length or longer together contain at least half of the total length of all contigs
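That definition can be made concrete with a short function (illustrative only; QUAST computes this for you):

```python
def n50(contig_lengths):
    """N50: the largest contig length L such that contigs of length >= L
    together account for at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

# toy assembly: total length is 155, half is 77.5;
# walking down from the longest contig, 60 + 50 = 110 >= 77.5, so N50 = 50
print(n50([60, 50, 30, 10, 5]))   # 50
```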
As seen above, once the coverage rises above 15x there is a big jump in the N50, from 181 kb to 317 kb.
To further narrow down the point at which the N50 increased, I produced samples with coverages of 15, 22.5, 24, 25.5, 27 and 28.5x. These were again assembled with SPAdes.
The results of this are below and show that the N50 value increases with coverage up to 27x. Above this coverage there is no further increase in the N50.
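The "find the plateau" reasoning can itself be expressed as a tiny helper. The (coverage, N50) pairs below are hypothetical numbers shaped like the figure, not the post's actual values.

```python
def plateau_coverage(results):
    """Smallest coverage at which the N50 first reaches its maximum value.
    results: iterable of (coverage, n50) pairs, in any order."""
    results = sorted(results)
    best = max(n50 for _, n50 in results)
    for cov, n50 in results:
        if n50 >= best:
            return cov

# hypothetical (coverage, N50 in kb) pairs for illustration
print(plateau_coverage([(15, 181), (24, 290), (27, 317), (28.5, 317), (150, 317)]))
# 27
```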
So if a “good” assembly is judged purely on N50, which is not a great idea, then coverage of 27x gives the same answer as 150x, and the extra data provides nothing additional.
Looking at other parameters, such as the number of contigs longer than 1 kb, also reveals that above 27x coverage there is no further decrease in the number of contigs longer than 1 kb.
Whilst N50 gives an indication of the size of contigs, it does not inform on the quality of the assembly. To assess this, QUAST was again used to compare the assemblies back against the original reference genome. A range of metrics can be produced; only a few are detailed below, with all data compared back to the known reference sequence.
First, looking at the genome fraction of the de novo assembly compared to the reference.
From QUAST, Genome fraction (%) is defined as: the percentage of aligned bases in the reference. A base in the reference is aligned if there is at least one contig with at least one alignment to this base.
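A toy implementation of that definition, assuming alignments are already reduced to (start, end) intervals on the reference (QUAST derives these from a whole-genome aligner; the coordinates here are made up for illustration):

```python
def genome_fraction(ref_len, alignments):
    """Percentage of reference bases aligned by at least one contig.
    alignments: (start, end) half-open intervals on the reference."""
    covered = bytearray(ref_len)          # one flag per reference base
    for start, end in alignments:
        covered[start:end] = b"\x01" * (end - start)
    return 100.0 * sum(covered) / ref_len

# toy 1000 bp reference with two contig alignments covering 900 bases
print(genome_fraction(1000, [(0, 500), (550, 950)]))   # 90.0
```

Marking bases in a flag array, rather than summing interval lengths, means overlapping alignments are not double-counted.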
Again, above 27x coverage the percentage was stable and did not increase with coverage – even at the lowest coverage of 15x, 99.75% of the genome was still present.
Next, looking at the percentage of complete genes found in the de novo assembly compared to the reference genome.
Above 24x coverage the result again plateaued. The small variation in the percentage of complete genes is due to one gene that is found in some assemblies and not others.
As can be seen below, there are a number of genes that are only partially complete (compared to the reference).
Increasing the fold coverage above 24x will not decrease the number of partial genes that are consistently obtained for an assembly.
For this particular bacterium, with a small genome size (~2.36 Mb), there is no advantage in sequencing to a depth greater than 27x coverage: the number of contigs >1 kb in length does not decrease, and the number of complete genes and the percentage of the genome that is mapped do not continue to increase with greater coverage.
So, in answer to the original questions: increasing the amount of sequence will produce a better assembly, but only up to a certain point.
How much sequence is needed will vary with the bacterium being sequenced. For this bacterium, 27x gives the same answer as 150x – but this was only calculated after 150x coverage had been obtained. However, the use of simple metrics from programs such as QUAST, together with sub-sampling of existing data, gives some indication of whether increasing the amount of data will produce a better assembly.
July 18, 2013
Press release on our recent publication in the New England Journal of Medicine
Metagenomic Analysis of Tuberculosis in a Mummy
N Engl J Med 2013; 369:289-290. July 18, 2013. DOI: 10.1056/NEJMc1302295
Jacqueline Z.-M. Chan, Martin J. Sergeant, Oona Y.-C. Lee, David E. Minnikin, Gurdyal S. Besra, Ildikó Pap, Mark Spigelman, Helen D. Donoghue, Mark J. Pallen
Researchers at the University of Warwick have recovered tuberculosis (TB) genomes from the lung tissue of a 215-year-old mummy using a technique known as metagenomics.
The team, led by Professor Mark Pallen, Professor of Microbial Genomics at Warwick Medical School, working with Helen Donoghue at University College London and collaborators in Birmingham and Budapest, sought to use the technique to identify TB DNA in a historical specimen.
The term ‘metagenomics’ is used to describe the open-ended sequencing of DNA from samples without the need for culture or target-specific amplification or enrichment. This approach avoids the complex and unreliable workflows associated with culture of bacteria or amplification of DNA and draws on the remarkable throughput and ease of use of modern sequencing approaches.
The sample came from a Hungarian woman, Terézia Hausmann, who died aged 28 on 25 December 1797. Her mummified remains were recovered from a crypt in the town of Vác, Hungary. When the crypt was opened in 1994, it was found to contain the naturally mummified bodies of 242 people. Molecular analyses of the chest sample in a previous study confirmed the diagnosis of tuberculosis and hinted that TB DNA was extremely well preserved in her body.
Professor Pallen explained the importance of the breakthrough,
“Most other attempts to recover DNA sequences from historical or ancient samples have suffered from the risk of contamination, because they rely on amplification of DNA in the laboratory, plus they have required onerous optimisation of target-specific assays. The beauty of metagenomics is that it provides a simple but highly informative, assumption-free, one-size-fits-all approach that works in a wide variety of contexts. A few months ago we showed that metagenomics allowed us to identify an E. coli outbreak strain from faecal samples, and a few weeks ago a similar approach was shown by another group to deliver a leprosy genome from historical material”.
The research, published this week in the New England Journal of Medicine, showed that Terézia Hausmann suffered from a mixed infection with two different strains of the TB bacterium. This information, combined with work on contemporary tuberculosis, highlights the significance of mixed-strain infections, particularly when tuberculosis is highly prevalent.
Professor Pallen added,
“It was fascinating to see the similarities between the TB genome sequences we recovered and the genome of a recent outbreak strain in Germany. It shows once more that using metagenomics can be remarkably effective in tracking the evolution and spread of microbes without the need for culture—in this case, metagenomes revealed that some strain lineages have been circulating in Europe for more than two centuries.”
Notes for editors
For more information, or to arrange interviews with author Mark J. Pallen, M.A., M.D., Ph.D., email firstname.lastname@example.org.
Alternatively, contact Warwick Medical School Press Officer Luke Harrison on +44 (0) 2476 574255/150483 or +44 (0) 7920531221 email@example.com