Skip to content

Metagenome assembly tutorial, JAX long-read sequencing workshop 2022

Notifications You must be signed in to change notification settings

mikolmogorov/jax-meta-tutorial

Repository files navigation

Metagenomic assembly using metaFlye

This tutorial is a part of the Long read sequencing workshop, held at The Jackson Laboratory for Genomic Medicine in May 2022. Presented by Mikhail Kolmogorov.

Prerequsites

The tutorial requires a Linux or Mac machine with at least 16 Gb of RAM and 5 Gb disk space. You can use either a laptop/desctop or remote computational server. If you are using a server, you will need to copy assembly results back to the local machine to visualize them using Bandage and Quast.

If you don't have it already, download and install bioconda as described here.

Instal the required tools in a separate environemnt

conda create -n jax-meta-tutorial flye=2.9 quast bandage checkm-genome biopython wget minimap2 samtools
conda activate jax-meta-tutorial

Assembling mock bacterial community from ONT data

In this part, we will be using Oxford Nanopore sequencing of a mock bacterial community, originally published by Nicholls et al.. We will be using the version with even bacterial distribution. The community also contains two yeast genomes, but we will ignore them for the sake of simplicity.

For the convenience, I am providing a downsampled version of the dataset (total read size ~1Gb) that was also re-basecalled with the recent basecaller (Guppy 5).

To download reads, use:

wget https://zenodo.org/record/6537461/files/Nicholls_ont_guppy5_1gb.fastq.gz

Then, run metaFlye as follows:

flye --nano-hq Nicholls_ont_guppy5_1gb.fastq.gz -o flye_ont -t 30 --meta

This step may take considerable time, depending on the available resorces. Feel free to adjust -t parameter accoding to the number of CPUs on your machine. Using 30 threads, it takes about 30 minutes to run the assembly.

Once assembly is done, time to visualize it! Open flye/assembly_graph.gfa using Bandage. If you were using a remote server for computation, you'll need to download the graph file to local machine first.

This is how your assembly might look like, visualized by Bandage:

Bandageont

Notice multiple circular chromosomes, that likely correspond to complete bacterial genomes. There are also two pairs of bacterial genomes that have shared long repeats that were unresolved. Because of that, these genomes are "glued" on the graph, instead of forming separate circular contigs.

Can we verify that the assemblies indeed represent the correct genomes? Yes, because we are lucky to have reference genomes for this particular community.

Download and extract reference genomes as follows:

wget https://zenodo.org/record/6537461/files/Nicholls_ont_refs.tar.gz
tar -xvf Nicholls_ont_refs.tar.gz

Now, use metaQUAST to analyze the assembly:

metaquast.py flye_ont/assembly.fasta -R Nicholls_ont_refs -o quast_ont

The best way to look at Quast results is though their interactive web browser report. You can open quast_ont/report.html to do that. Alternatively, you can examine the text report as follows. First of all, let's look at genome completion rates:

cat quast_ont/summary/TXT/Genome_fraction.txt 

Which outputs:

Assemblies           assembly
155_P.aeruginosa     99.998  
220_E.coli           100.000 
227_S.enterica       99.930  
445_S.aureus         99.726  
464_E.faecalis       100.000 
516_B.subtilis       99.797  
525_L.monocytogenes  99.963  
528_L.fermentum      100.000 

All bacteria have 99%+ sequence recovered, nice!

Next, let's check base-level accuracy. You can see the number of indels and missmatches for each bacteria here:

cat quast_ont/summary/TXT/num_indels_per_100_kbp.txt
cat quast_ont/summary/TXT/num_mismatches_per_100_kbp.txt

For example, for L.monocytogenes, the total number of mismatches + indels is about 10 per 100kb, which translates into ~QV40. Not bad!

Finally, how complete our assmeblies are? And what to do if we are analysing a real metagenome and don't have references? In this case, CheckM could be helpful.

First, download the reference database, and set it up:

wget https://data.ace.uq.edu.au/public/CheckM_databases/checkm_data_2015_01_16.tar.gz
mkdir checkm_data && tar -xvf checkm_data_2015_01_16.tar.gz -C checkm_data
checkm data setRoot `pwd`/checkm_data

CheckM was designed to work with metagenomic bins (MAGs), however in our situation our contigs are already as good as MAGs (if not better). Therefore, we will create "fake" bins with only a single contig within each bin. To make it convenient, I've prepared a converter script for you. The command below converts all contigs longer than 100kb into bins.

wget https://raw.githubusercontent.com/fenderglass/jax-meta-tutorial/main/create_meta_bins.py
python create_meta_bins.py flye_ont/assembly.fasta 100000 checkm_bins

Now, we are ready to run CheckM:

checkm lineage_wf -x fasta checkm_bins checkm_out -t 10 --pplacer_threads 10 -f checkm_out/qa_table.txt
cat checkm_out/qa_table.txt

You might see the following summary:

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id                Marker lineage           # genomes   # markers   # marker sets    0     1     2   3   4   5+   Completeness   Contamination   Strain heterogeneity  
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  contig_183         g__Bacillus (UID865)            36         1200          269         8    1190   2   0   0   0       99.64            0.25               0.00          
  contig_34      o__Lactobacillales (UID544)        293         475           267         2    472    1   0   0   0       99.53            0.19               0.00          
  contig_51    f__Enterobacteriaceae (UID5139)      119         1169          340         8    1160   1   0   0   0       99.31            0.04               0.00          
  contig_32      o__Pseudomonadales (UID4488)       185         813           308         5    803    5   0   0   0       99.00            0.61               0.00          
  contig_19    f__Enterobacteriaceae (UID5124)      134         1173          336         10   1154   9   0   0   0       98.96            0.20              77.78          
  contig_156     o__Lactobacillales (UID355)        490         334           183         4    329    1   0   0   0       98.63            0.55              100.00         
  contig_72       g__Staphylococcus (UID301)         45         940           178         9    930    1   0   0   0       98.50            0.08               0.00          
  contig_99          c__Bacilli (UID354)            515         329           183        103   226    0   0   0   0       74.32            0.00               0.00          
  contig_97          c__Bacilli (UID354)            515         329           183        259    70    0   0   0   0       15.30            0.00               0.00          
  contig_98              root (UID1)                5656         56            24         55    1     0   0   0   0        4.17            0.00               0.00          
  contig_300             root (UID1)                5656         56            24         56    0     0   0   0   0        0.00            0.00               0.00          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

You can see that 7 contigs represent bacterial genomes with 98%+ completeness. And looks like one bacteria was split into two large contigs (contig_99 and contig_97). Must be that one tangle!

Assembling mock bacterial community from PacBio HiFi data

The previous mock community consisted of different bacterial species and mostly different genera. In a real metagenome, we commonly see many closely-related species or strains. To model this, PacBio recently released HiFi sequencing of mock bacterial community with 5 E.coli strains, among other species. Let's give it a try!

As before, I've prepared a downsampled version here:

wget https://zenodo.org/record/6537461/files/SRR13128014_hifi_1gb.fastq.gz

Assembly using metaFlye's hifi mode:

flye --pacbio-hifi SRR13128014_hifi_1gb.fastq.gz -o flye_hifi -t 30 --meta

If you visualize the assembly with Bandage as before, you'll notice that some tangled components, in addition to nicely assembled circular chromosomes.

Bandagehifi

Similarly to the ONT dataset, you can run also QUAST analysis with the following references:

wget https://zenodo.org/record/6537461/files/SRR13128014_refs.tar.gz

But why the assembly grpah is tangled? Some graph edges actually represent multiple (collapsed) strains, and some edges correspond to one single strain. Let's look into the read alinments. First, let's convert graph nodes from gfa into fasta format:

awk '/^S/{print ">"$2"\n"$3}' flye_hifi/assembly_graph.gfa | fold > assembly_graph.fasta

Then, align reads with minimap2:

minimap2 -ax map-hifi assembly_graph.fasta SRR13128014_1gb.fastq.gz -t 30 | samtools sort > graph_alignment.bam
samtools index graph_alignment.bam

Next, let's open the bam alignment using IGV. A "regular" strain-specific assembly segment should look like this (e.g. no detectable SNPs):

igv_specific

However, an assembly segment that represent multiple collapsed bacterial strains may look like this:

igv_collapsed

Try to play with the Bandage graph visualization and figure out which graph node is collapsed and which is not, and then verify it in IGV!

About

Metagenome assembly tutorial, JAX long-read sequencing workshop 2022

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages