Skip to content

Generating assemblies

Ryan Wick edited this page Oct 14, 2023 · 24 revisions

There is no single prescribed way to prepare your assemblies. The goal is to have multiple assemblies (~12 is nice) which are reasonably independent of each other. To achieve this independence, you can use different assemblers and/or different subsets of your reads.

This page describes how I prepare my assemblies, in case you want to copy me. It's not the only way to do it, but I hope it gives you a good starting point.

Requirements

I'm assuming you have long-read sequencing for your isolate which is of decent depth. In the commands below, I assume these are in a file named input_reads.fastq.gz. Ideally your reads are at least 100×, but more is better. If you have a read set that's 200× depth or more, that's wonderful! If you have only have 50× depth or less, running Trycycler may be more difficult – you've been warned!

I'm also assuming that your reads are sufficiently long. Practically, this means longer than the longest repeat in the genome, as this will allow most assemblies to reach completion. For many bacterial genomes, the longest repeat will be the ribosomal RNA operon which is ~5.5 kbp, in which case a read N50 of 10 kbp is sufficient. However, some genomes have much longer repeats and will require longer reads. Overly-long reads rarely cause a problem, so you should aim for the longest reads you reasonably can.

It will also be handy to know the approximate size of your genome. If you're assembling something novel (i.e. you don't know the genome size), you could do a quick initial assembly at this point to get an estimate.

Finally, you'll need a couple tools installed:

Some light read QC

As I first step, I usually do a bit of light QC of the reads. By "light" I mean that I only want to toss out very bad reads. I would rather keep mediocre reads because high read depth will help with the next step.

Something like this should work:

filtlong --min_length 1000 --keep_percent 95 input_reads.fastq.gz > reads.fastq

This will discard short reads (less than 1 kbp) and very bad reads (the worst 5%).

If you have a very deep read set, you might be able to do more aggressive QC here, but small plasmids can be a problem. Filtlong prefers longer reads to shorter reads, so if you filter aggressively with Filtlong, you might lose your small plasmid reads. I would therefore recommend that any aggressive QC filtering be done on read quality more than read length.

Subsampling reads

Now we want to subsample the QC'd reads to make multiple different read sets. This is handled by the Trycycler subsample command:

trycycler subsample --reads reads.fastq --out_dir read_subsets

Trycycler subsample tries to make maximally-independent read subsets of an appropriate depth for your genome. Read more about it on this page: How read subsampling works.

It has the following settings:

  • --count: the number of subsampled read sets to output. The default is 12, a good number for most cases.
  • --genome_size: an estimate of the isolate's total genome size. If you don't provide this, Trycycler subsample will run miniasm to quickly assemble your read set to get a size. If you know your genome size, you can provide a value (e.g. --genome_size 5.5m) and Trycycler subsample will run faster because it won't run miniasm. This value is used to calculate read depths and doesn't need to be exact, e.g. 10% error is fine.
  • --min_read_depth: this is the minimum allowed read depth and also controls the subsampled depths (see How read subsampling works for more information).
  • --threads: this is how many threads Trycycler will use for the miniasm assembly. It will only affect the speed performance, so you'll probably want to use as many threads as you have available. If you provided a genome size estimate, then miniasm isn't run and this setting has no effect.

When Trycycler subsample is done, you'll find your read subsets here:

read_subsets/sample_*.fastq

Assemblies

There are many different assemblers you can use, but I like Flye, Miniasm+Minipolish and Raven. They all do well with bacterial genomes, are reasonably fast and are easy to run. Check out the Extra-thorough assembly section below for other assemblers.

With 12 read sets (the default for Trycycler subsample) and three assemblers, it works to do four assemblies with each:

threads=16  # change as appropriate for your system
mkdir assemblies

flye --nano-hq read_subsets/sample_01.fastq --threads "$threads" --out-dir assembly_01 && cp assembly_01/assembly.fasta assemblies/assembly_01.fasta && cp assembly_01/assembly_graph.gfa assemblies/assembly_01.gfa && rm -r assembly_01
miniasm_and_minipolish.sh read_subsets/sample_02.fastq "$threads" > assemblies/assembly_02.gfa && any2fasta assemblies/assembly_02.gfa > assemblies/assembly_02.fasta
raven --threads "$threads" --disable-checkpoints --graphical-fragment-assembly assemblies/assembly_03.gfa read_subsets/sample_03.fastq > assemblies/assembly_03.fasta

flye --nano-hq read_subsets/sample_04.fastq --threads "$threads" --out-dir assembly_04 && cp assembly_04/assembly.fasta assemblies/assembly_04.fasta && cp assembly_04/assembly_graph.gfa assemblies/assembly_04.gfa && rm -r assembly_04
miniasm_and_minipolish.sh read_subsets/sample_05.fastq "$threads" > assemblies/assembly_05.gfa && any2fasta assemblies/assembly_05.gfa > assemblies/assembly_05.fasta
raven --threads "$threads" --disable-checkpoints --graphical-fragment-assembly assemblies/assembly_06.gfa read_subsets/sample_06.fastq > assemblies/assembly_06.fasta

flye --nano-hq read_subsets/sample_07.fastq --threads "$threads" --out-dir assembly_07 && cp assembly_07/assembly.fasta assemblies/assembly_07.fasta && cp assembly_07/assembly_graph.gfa assemblies/assembly_07.gfa && rm -r assembly_07
miniasm_and_minipolish.sh read_subsets/sample_08.fastq "$threads" > assemblies/assembly_08.gfa && any2fasta assemblies/assembly_08.gfa > assemblies/assembly_08.fasta
raven --threads "$threads" --disable-checkpoints --graphical-fragment-assembly assemblies/assembly_09.gfa read_subsets/sample_09.fastq > assemblies/assembly_09.fasta

flye --nano-hq read_subsets/sample_10.fastq --threads "$threads" --out-dir assembly_10 && cp assembly_10/assembly.fasta assemblies/assembly_10.fasta && cp assembly_10/assembly_graph.gfa assemblies/assembly_10.gfa && rm -r assembly_10
miniasm_and_minipolish.sh read_subsets/sample_11.fastq "$threads" > assemblies/assembly_11.gfa && any2fasta assemblies/assembly_11.gfa > assemblies/assembly_11.fasta
raven --threads "$threads" --disable-checkpoints --graphical-fragment-assembly assemblies/assembly_12.gfa read_subsets/sample_12.fastq > assemblies/assembly_12.fasta

If all went well, you will now have 12 assemblies in FASTA format (assembly_01.fasta to assembly_12.fasta) in a directory named assemblies. These should all be reasonably similar (they are of the same genome) and you're now ready to run Trycycler to make a consensus assembly! The assemblies are also available in GFA format which can be useful for manual curation (see the next section).

Now that you have your input assemblies, you no longer need the subsampled reads and can clean them up:

rm -r read_subsets

Optional: manual curation

Before continuing to the next step, you might want to manually curate your assemblies, i.e. look at them and judge which look good and which look bad, so you can throw out the bad ones before continuing. 'Bad' in this context means fragmented. My preference is to look at them in Bandage. This is especially useful for assemblers that give you assemblies in GFA graph format, as contig circularity (obvious in Bandage) is a decent indicator of completeness.

This manual curation step can be especially useful if you've got a tough-to-assemble genome and a lot of your input assemblies are fragmented. Giving lots of bad assemblies to Trycycler will result in very messy clusters, so cleaning things up at this point can help. If you had to throw out a lot of your input assemblies (e.g. you made 12 assemblies and threw out 8 of them), then you should consider starting over with a larger number of read subsets (e.g. run Trycycler subsample with --count 24).

Extra-thorough assembly

If you want to be very thorough during this assembly step, you can use more read subsets and more assemblers. In addition to Flye, Miniasm+Minipolish and Raven (used above), I like Canu, NECAT and NextDenovo/NextPolish. These six assemblers were the best performing tools in my Benchmarking of long-read assemblers for prokaryote whole genome sequencing paper.

If we produce 24 subsampled read sets, we can do four assemblies with each of the six assemblers:

# Change as appropriate for your system and genome:
threads=16
nextdenovo_dir="/path/to/NextDenovo"
nextpolish_dir="/path/to/NextPolish"
genome_size="5500000"

trycycler subsample --reads reads.fastq --out_dir read_subsets --count 24 --genome_size "$genome_size"

mkdir assemblies

for i in 01 07 13 19; do
    canu -p canu -d canu_temp -fast genomeSize="$genome_size" useGrid=false maxThreads="$threads" -nanopore read_subsets/sample_"$i".fastq
    canu_trim.py canu_temp/canu.contigs.fasta > assemblies/assembly_"$i".fasta
    rm -rf canu_temp
done
for i in 02 08 14 20; do
    flye --nano-hq read_subsets/sample_"$i".fastq --threads "$threads" --out-dir flye_temp
    cp flye_temp/assembly.fasta assemblies/assembly_"$i".fasta
    cp flye_temp/assembly_graph.gfa assemblies/assembly_"$i".gfa
    rm -r flye_temp
done
for i in 03 09 15 21; do
    miniasm_and_minipolish.sh read_subsets/sample_"$i".fastq "$threads" > assemblies/assembly_"$i".gfa
    any2fasta assemblies/assembly_"$i".gfa > assemblies/assembly_"$i".fasta
done
for i in 04 10 16 22; do
    necat.pl config config.txt
    realpath read_subsets/sample_"$i".fastq > read_list.txt
    sed -i "s/PROJECT=/PROJECT=necat/" config.txt
    sed -i "s/ONT_READ_LIST=/ONT_READ_LIST=read_list.txt/" config.txt
    sed -i "s/GENOME_SIZE=/GENOME_SIZE="$genome_size"/" config.txt
    sed -i "s/THREADS=4/THREADS="$threads"/" config.txt
    necat.pl bridge config.txt
    cp necat/6-bridge_contigs/polished_contigs.fasta assemblies/assembly_"$i".fasta
    rm -r necat config.txt read_list.txt
done
for i in 05 11 17 23; do
    echo read_subsets/sample_"$i".fastq > input.fofn
    cp "$nextdenovo_dir"/doc/run.cfg nextdenovo_run.cfg
    sed -i "s/genome_size = 1g/genome_size = "$genome_size"/" nextdenovo_run.cfg
    sed -i "s/parallel_jobs = 20/parallel_jobs = 1/" nextdenovo_run.cfg
    sed -i "s/read_type = clr/read_type = ont/" nextdenovo_run.cfg
    sed -i "s/pa_correction = 3/pa_correction = 1/" nextdenovo_run.cfg
    sed -i "s/correction_options = -p 15/correction_options = -p "$threads"/" nextdenovo_run.cfg
    sed -i "s/-t 8/-t "$threads"/" nextdenovo_run.cfg
    "$nextdenovo_dir"/nextDenovo nextdenovo_run.cfg
    cp 01_rundir/03.ctg_graph/nd.asm.fasta nextdenovo_temp.fasta
    rm -r 01_rundir nextdenovo_run.cfg input.fofn
    echo read_subsets/sample_"$i".fastq > lgs.fofn
    cat "$nextpolish_dir"/doc/run.cfg | grep -v "sgs" | grep -v "hifi" > nextpolish_run.cfg
    sed -i "s/parallel_jobs = 6/parallel_jobs = 1/" nextpolish_run.cfg
    sed -i "s/multithread_jobs = 5/multithread_jobs = "$threads"/" nextpolish_run.cfg
    sed -i "s|genome = ./raw.genome.fasta|genome = nextdenovo_temp.fasta|" nextpolish_run.cfg
    sed -i "s|-x map-ont|-x map-ont -t "$threads"|" nextpolish_run.cfg
    "$nextpolish_dir"/nextPolish nextpolish_run.cfg
    cp 01_rundir/genome.nextpolish.fasta assemblies/assembly_"$i".fasta
    rm -r 01_rundir pid*.log.info nextpolish_run.cfg lgs.fofn nextdenovo_temp.fasta
done
for i in 06 12 18 24; do
    raven --threads "$threads" --disable-checkpoints --graphical-fragment-assembly assemblies/assembly_"$i".gfa read_subsets/sample_"$i".fastq > assemblies/assembly_"$i".fasta
done

Some notes:

  • Canu, NECAT and NextDenovo require a genome size estimate to run.
  • Canu often produces large amounts of overlap in circular contigs, but its contig headers indicate the necessary trimming to remove this overlap. The canu_trim.py script (in Trycycler's repo) will conduct this trimming and is used in the assembly code above.
  • NextDenovo/NextPolish also produces large amounts of overlap in circular contigs, but unlike Canu, it doesn't indicate how to trim. You will therefore likely have to Manually fix contig overlap during the Reconciling contigs step.
  • Canu does a good job with assembly, but it is usually slower than other assemblers, even with its -fast option. Be prepared to wait a few hours.
  • NECAT and NextDenovo/NextPolish rely on config files, making them cumbersome to run. Sorry for all the lines in the assembly code above!
  • La Jolla Assembler (LJA) might be worth a try as well if you are assembling HiFi reads. I haven't tried it myself, but it was recommended by Thomas Roder.
Clone this wiki locally