Dorado Polish
Should I use correct or polish?
Dorado polish is a high accuracy assembly polishing tool which outperforms
similar tools for most ONT-based assemblies.
It takes as input a draft assembly produced by a tool such as Hifiasm or Flye and aligned reads and outputs an updated version of the assembly.
Quick Start
To produce a polished assembly from a draft and reads aligned to that draft, run
where <aligned_reads> is a BAM of reads aligned to a draft by Dorado aligner
and <draft> is a FASTA or FASTQ file containing the draft assembly. The draft
can be uncompressed or compressed with bgzip.
By default, polish queries the BAM and selects the best model for the basecalled
reads.
Info
Currently, Dorado polish only supports hac@v5.0 and sup@v5.0 basecalls.
Move Table Aware Models
Significantly more accurate assemblies can be produced by giving the polishing model access to additional information about the underlying signal for each read. For more information, see this section from the NCM 2024 secondary analysis update .
Dorado polish includes models which can use the move table to get temporal
information about each read.
These models will be selected automatically if the corresponding mv tag is
in the input BAM. To do this, pass the --emit-moves tag to
dorado basecaller when basecalling. To check if a BAM contains the move
table for reads, use samtools:
The output should be equal to the total number of reads in the bam. This can be found using: samtools view -c <reads_to_draft_bam>.
If move tables are not available in the BAM, then the non-move table-aware model will be automatically selected.
Specifying resources
Dorado polish will automatically select all available compute resources
to perform polishing.
To specify resources manually use:
-x / --deviceto specify specific GPU resources (if available). For more information see here.--threadsto set the maximum number of threads to be used for everything but the inference.--infer-threadsto set the number of CPU threads for inference (when "--device cpu" is used).
Polishing is both compute and memory intensive. It is best run using multiple threads (≥64) on a system with:
- multiple high performance CPU cores ( > 24 cores)
- a modern GPU with a large VRAM ( > 32GB)
FAQ
How is Dorado polish different from Medaka?
Medaka and Dorado polish are both assembly polishing tools. They accept the same input formats and produce the same output formats, and in principle they could run the same polishing model to produce equivalent results. However, Dorado polish is optimised for higher performance, and can support more accurate models with more computationally intensive architectures. For use cases in low-resource settings (small genomes such as bacteria with CPUs only available) Medaka remains the recommended tool. For large genomes or in other instances where speed is important, we suggest trying Dorado polish.
Should I use correct or polish?
Dorado polish is a post-assembly tool and it is intended to improve the accuracy of pre-existing
assemblies. Dorado correct conversely is a pre-assembly tool and is intended to improve the
contiguity of an assembly by improving the fidelity of reads used to create it.
Troubleshooting
GPU Memory Issues
The default inference batch size (16) may be too high for your GPU. If you are experiencing warnings/errors regarding available GPU memory, try reducing the batch size:
Alternatively, consider running model inference on the CPU, although this will take longer:
CLI reference
Below is a slightly re-formatted output from the Dorado polish subcommand for reference.
Info
Please check the --help output of your own installation of Dorado as this page may be
outdated and argument defaults have been omitted as they are platform specific.
> dorado polish --help
Consensus tool for polishing draft assemblies
Positional arguments:
in_aln_bam Aligned reads in BAM format
in_draft_fastx Draft assembly for polishing
Optional arguments:
-h, --help shows help message and exits
-t, --threads Number of threads for processing. Default uses all available threads.
--infer-threads Number of threads per device.
-x, --device Specify CPU or GPU device: 'auto', 'cpu', 'cuda:all' or 'cuda:<device_id>[,<device_id>...]'. Specifying 'auto' will choose either 'cpu' or 'cuda:all' depending on the presence of a GPU device.
-v, --verbose [may be repeated]
Input/output options (detailed usage):
-o, --out-path Output to a file instead of stdout.
-m, --model Path to correction model folder.
-q, --qualities Output with per-base quality scores (FASTQ).
Advanced options (detailed usage):
-b, --batchsize Batch size for inference.
--draft-batchsize Approximate batch size for processing input draft sequences.
--window-len Window size for calling consensus.
--window-overlap Overlap length between windows.
--bam-chunk Size of draft chunks to parse from the input BAM at a time.
--bam-subchunk Size of regions to split the bam_chunk in to for parallel processing
--no-fill-gaps Do not fill gaps in consensus sequence with draft sequence.
--fill-char Use a designated character to fill gaps.
--regions Process only these regions of the input. Can be either a path to a BED file or a list of comma-separated Htslib-formatted regions (start is 1-based, end is inclusive).
--RG Read group to select.
--ignore-read-groups Ignore read groups in bam file.
--tag-name Two-letter BAM tag name for filtering the alignments during feature generation
--tag-value Value of the tag for filtering the alignments during feature generation
--tag-keep-missing Keep alignments when tag is missing. If specified, overrides the same option in the model config.
--min-mapq Minimum mapping quality of the input alignments. If specified, overrides the same option in the model config.
--min-depth Sites with depth lower than this value will not be polished.