Dorado Polish

Dorado polish is a high accuracy assembly polishing tool which outperforms similar tools for most ONT-based assemblies.

It takes as input a draft assembly produced by a tool such as Hifiasm or Flye, together with reads aligned to that draft, and outputs an updated version of the assembly.

Additionally, Dorado polish can output a VCF file containing records for all variants discovered during polishing, or a gVCF file containing records for all locations in the input draft sequences.

Note that Dorado polish is a haploid polishing tool and does not implement any sort of phasing internally. It will take input alignment data as is and run it through the polishing model to produce the consensus sequences. For more information, see the section on polishing diploid/polyploid assemblies.

Quick Start

Consensus

# Align unmapped reads to the draft assembly using dorado aligner, sort and index
dorado aligner <draft.fasta> <unmapped_reads.bam> | samtools sort --threads <num_threads> > aligned_reads.bam
samtools index aligned_reads.bam

# Call consensus
dorado polish <aligned_reads.bam> <draft.fasta> > polished_assembly.fasta

In the above example, <aligned_reads.bam> is a BAM of reads aligned to the draft by Dorado aligner and <draft.fasta> is a FASTA or FASTQ file containing the draft assembly. The draft can be uncompressed or compressed with bgzip.

Consensus from a FASTQ input instead of BAM

This feature supports only FASTQ files with HTS-style tags in the header and will not work with the old MinKNOW-style FASTQ files.

Here is a full example:

# Align reads to the draft assembly using dorado aligner, sort and index
dorado aligner <draft.fasta> <reads.fastq> | samtools sort --threads <num_threads> > aligned_reads.bam
samtools index aligned_reads.bam

# Call consensus
dorado polish <aligned_reads.bam> <draft.fasta> > polished_assembly.fasta

Consensus on bacterial genomes

dorado polish <aligned_reads> <draft> --bacteria > polished_assembly.fasta

This will automatically resolve a suitable bacterial polishing model, if one exists for the input data type.

Variant calling

dorado polish <aligned_reads> <draft> --vcf > polished_assembly.vcf
dorado polish <aligned_reads> <draft> --gvcf > polished_assembly.all.vcf

Specifying the --vcf or --gvcf flag writes a VCF file to stdout instead of the consensus sequences.

Output to a folder

dorado polish <aligned_reads> <draft> -o <output_dir>

Specifying -o will write multiple files to a given output directory (and create the directory if it doesn't exist):

Consensus file: <output_dir>/consensus.fasta by default, or <output_dir>/consensus.fastq if --qualities is specified.
VCF file: <output_dir>/variants.vcf which contains only variant calls by default, or records for all positions if --gvcf is specified.

Resources

Dorado polish will automatically select the compute resources to perform polishing. It can use one or more GPU devices, or the CPU, to call consensus.

To specify resources manually use:

-x / --device - to select the device to run on, such as specific GPUs (if available) or the CPU.
--threads - to set the maximum number of threads to be used for everything but the inference.
--infer-threads - to set the number of CPU threads for inference (when "--device cpu" is used).
--batchsize - batch size for inference, important to control memory usage on the GPUs. Automatically computed by default (--batchsize 0).

Example:

dorado polish reads_to_draft.bam draft.fasta --device cuda:0 --threads 24 > consensus.fasta

Models

Dorado polish auto-resolves the polishing model based on the input BAM file. The BAM file needs to contain the @RG headers with the basecaller model name specified, otherwise the model will not be resolved. If the input BAM records contain move tables, an appropriate move-aware polishing model will be selected.

Once the model is resolved, Dorado polish will either download it or look it up in the models-directory if specified.

For example:

dorado polish reads_to_draft.bam draft.fasta > consensus.fasta

will find the compatible model based on the input BAM file and download it to a temporary folder.

When --models-directory is specified, the resolved polishing model will first be looked up in the models-directory, and only downloaded if the model does not exist. The specified models-directory must exist. Example:

mkdir -p models
dorado polish --models-directory models reads_to_draft.bam draft.fasta > consensus.fasta

More information about the --models-directory can be found in this section

If there are multiple read groups in the input dataset which were generated using different basecaller models, Dorado polish will report an error and stop execution.
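
Before polishing, it can help to see which basecaller models the input BAM actually names. The sketch below is one way to do that: it reads SAM header text on stdin and prints each distinct model. The function name `list_basecaller_models` is hypothetical, and the sketch assumes the model is recorded as `basecall_model=<name>` inside the DS field of each @RG line, which this document does not specify.

```shell
# Print each distinct basecaller model named in the @RG lines of a SAM header.
# Reads header text on stdin.
# ASSUMPTION: the model is stored as "basecall_model=<name>" in the @RG DS field.
list_basecaller_models() {
    awk -F'\t' '
        /^@RG/ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^DS:/) {
                    # The DS value is a space-separated list of key=value pairs.
                    n = split(substr($i, 4), parts, " ")
                    for (j = 1; j <= n; j++)
                        if (parts[j] ~ /^basecall_model=/)
                            print substr(parts[j], 16)
                }
            }
        }' | sort -u
}
```

For example, `samtools view -H reads_to_draft.bam | list_basecaller_models` would be expected to print exactly one line when the model can be auto-resolved. No output, or more than one line, would correspond to the error cases described above.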

Move Table Aware Models

Significantly more accurate assemblies can be produced by giving the polishing model access to additional information about the underlying signal for each read. For more information, see this section from the NCM 2024 secondary analysis update.

Dorado polish includes models which can use the move table to get temporal information about each read. These models will be selected automatically if the corresponding mv tag is in the input BAM. To produce this tag, pass the --emit-moves flag to Dorado basecaller when basecalling. To check if a BAM contains the move table for reads, use samtools:

# Count only the records that carry an mv tag
samtools view -c -d mv <reads_to_draft_bam>

The output should be equal to the total number of reads in the bam (samtools view -c <reads_to_draft_bam>).
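
The same comparison can also be made in a single pass over the records. The sketch below is one possible way to do it, assuming the move table is stored as an `mv:B:` array tag. `count_move_tables` is a hypothetical helper name, not part of Dorado or samtools.

```shell
# Count SAM records that carry an mv tag.
# Reads headerless SAM records on stdin and prints
# "<records_with_mv> <total_records>".
# ASSUMPTION: the move table is an array tag written as "mv:B:...".
count_move_tables() {
    awk -F'\t' '
        {
            total++
            # Optional tags start at column 12 of a SAM record.
            for (i = 12; i <= NF; i++) {
                if ($i ~ /^mv:B:/) { with_mv++; break }
            }
        }
        END { printf "%d %d\n", with_mv, total }'
}
```

Running `samtools view <reads_to_draft_bam> | count_move_tables` prints the two counts side by side. If they are equal, every record has a move table.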

If move tables are not available in the BAM, then the non-move table-aware model will be automatically selected.

FAQ

How is Dorado `polish` different from Medaka?

Medaka and Dorado polish are both assembly polishing tools. They accept the same input formats and produce the same output formats, and in principle they could run the same polishing model to produce equivalent results. However, Dorado polish is optimised for higher performance, and can support more accurate models with more computationally intensive architectures. For use cases in low-resource settings (small genomes such as bacteria, with only CPUs available) Medaka remains the recommended tool. For large genomes or in other instances where speed is important, we suggest trying Dorado polish.

Should I use `correct` or `polish`?

Dorado polish is a post-assembly tool intended to improve the accuracy of pre-existing assemblies. Dorado correct, conversely, is a pre-assembly tool intended to improve the contiguity of an assembly by improving the fidelity of the reads used to create it.

How do I go from raw POD5 data to a polished T2T assembly?

Here is a high-level example workflow:

# Generate basecalled data with dorado basecaller
dorado basecaller <model> pod5s/ --emit-moves > calls.bam
samtools fastq calls.bam > calls.fastq

# Apply dorado correct to a set of reads that can be used as input in an assembly program.
dorado correct calls.fastq > corrected.fasta

# Assemble the genome using those corrected reads
<some_assembler> --input corrected.fasta > draft_assembly.fasta

# Align original calls to the draft assembly, then sort and index
dorado aligner draft_assembly.fasta calls.bam | samtools sort --threads <num_threads> > aligned_calls.bam
samtools index aligned_calls.bam

# Run dorado polish using the raw reads aligned to the draft assembly
dorado polish aligned_calls.bam draft_assembly.fasta > polished_assembly.fasta

Polishing diploid/polyploid assemblies

Dorado polish is a haploid polishing tool and does not implement any sort of phasing internally. It will take input alignment data as is and run it through the polishing model to produce the consensus sequences.

In order to polish diploid/polyploid assemblies, it is up to the user to properly separate haplotypes before giving the data to Dorado polish.

We are currently working on a set of best practices. In the meantime, an unofficial suggestion for polishing diploid genomes is to align the reads using the lr:hqae Minimap2 setting, as this was specifically designed for alignment back to a diploid genome. This setting is available through Dorado aligner using the following option:

dorado aligner --mm2-opts "-x lr:hqae" <ref> <reads>

Troubleshooting

Memory consumption / Torch out-of-memory (OOM) issues

The inference batch size is computed to fit the largest possible batches into the available GPU memory (default --batchsize 0).

There are two cases when an OOM issue can happen:

The auto batch size feature is underestimating the memory consumption. If an Out-Of-Memory (OOM) warning/error is raised with the default auto batch size, try setting the batch size manually to a fixed value instead. For example:
```
dorado polish reads_to_draft.bam draft.fasta --batchsize <number> > consensus.fasta
```
A good rule of thumb would be --batchsize 16 for a large GPU, or try using a smaller value if this is still too high.

Additionally, the number of inference workers can be reduced to lower the memory usage (the default is 2 workers per device):
```
dorado polish reads_to_draft.bam draft.fasta --infer-threads 1 > consensus.fasta
```
Alternatively, consider running inference on the CPU, although this can take longer:
```
dorado polish reads_to_draft.bam draft.fasta --device "cpu" > consensus.fasta
```
Note that using multiple CPU inference threads can cause much higher memory usage.

GPU memory fragmentation during the run. This can happen when many small allocations are followed by a large allocation which cannot be fitted into a single contiguous block of memory. Such errors have a specific Torch error message which looks like this:

> Exception caught: CUDA out of memory. Tried to allocate 15.12 GiB. GPU 1 has a total capacity of 31.73 GiB of which 14.77 GiB is free. Including non-PyTorch memory, this process has 16.95 GiB memory in use. Of the allocated memory 2.10 GiB is allocated by PyTorch, and 14.46 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The key portion here is "2.10 GiB is allocated by PyTorch, and 14.46 GiB is reserved by PyTorch but unallocated", which means that almost all of the non-free memory is actually unused.

In this case, follow the suggestion from the error message, and it should resolve the issue.

Example:
```
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True dorado polish reads_to_draft.bam draft.fasta > consensus.fasta
```
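
To tell the two OOM cases apart when scanning a log, one option is to compare the "allocated" and "reserved but unallocated" figures in the Torch message. The sketch below is a heuristic only: the function name `is_fragmentation_oom` and the rule "reserved exceeds allocated" are assumptions made for illustration, and the parsing relies on the message wording shown above.

```shell
# Succeed (exit 0) when a Torch OOM message on stdin reports more memory
# "reserved by PyTorch but unallocated" than "allocated by PyTorch".
# ASSUMPTION: the message uses the wording shown in the example above and
# reports sizes in GiB or MiB.
is_fragmentation_oom() {
    awk '
        function to_mib(value, unit) {
            return (unit == "GiB") ? value * 1024 : value + 0
        }
        {
            # Look for "<number> <unit> is allocated|reserved by ...".
            for (i = 1; i + 4 <= NF; i++) {
                if ($(i + 2) != "is" || $(i + 4) != "by") continue
                if ($(i + 3) == "allocated") allocated = to_mib($i, $(i + 1))
                if ($(i + 3) == "reserved")  reserved  = to_mib($i, $(i + 1))
            }
        }
        END { exit !(reserved > allocated) }'
}
```

For example, `is_fragmentation_oom < polish.log && echo "try expandable_segments"` would flag a log like the one above, where `polish.log` is a placeholder for wherever the Dorado stderr output was saved.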

"[error] Could not open index for BAM file: 'aln.bam'!"

Example message:

$ dorado polish aln.bam assembly.fasta > polished.fasta
[2024-12-23 07:18:23.978] [info] Running: "polish" "aln.bam" "assembly.fasta"
[E::idx_find_and_load] Could not retrieve index file for 'aln.bam'
[2024-12-23 07:18:23.987] [error] Could not open index for BAM file: 'aln.bam'!

This message means that the input BAM file does not have an accompanying .bai index file. It may also mean that the input BAM file is not sorted, which is a prerequisite for producing the .bai index using samtools.

Dorado polish requires input alignments to be produced using Dorado aligner. When Dorado aligner outputs alignments to stdout, they are not sorted automatically. Instead, samtools needs to be used to sort and index the BAM file. For example:

dorado aligner <draft.fasta> <reads.bam> | samtools sort --threads <num_threads> > aln.bam
samtools index aln.bam

Note that the sorting step is added after the pipe symbol.

When Dorado aligner writes its output to a folder, specified using the --output-dir option, the output is already sorted:

dorado aligner --output-dir <out_dir> <draft.fasta> <reads.bam>
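
A quick pre-flight check can catch both causes of this error before Dorado polish is started. The sketch below shows one way to do it. The helper names are hypothetical, the sort check relies on the standard `SO:coordinate` field of the SAM @HD header line, and the index check only looks for the `<bam>.bai` file that `samtools index aln.bam` writes next to the BAM.

```shell
# Succeed when the SAM header on stdin declares coordinate sorting.
is_coordinate_sorted() {
    awk -F'\t' '
        /^@HD/ { for (i = 2; i <= NF; i++) if ($i == "SO:coordinate") found = 1 }
        END { exit !found }'
}

# Succeed when a .bai index file sits next to the given BAM path.
has_bam_index() {
    [ -f "$1.bai" ]
}
```

Usage could look like `samtools view -H aln.bam | is_coordinate_sorted && has_bam_index aln.bam`. If either check fails, sort and index the BAM as shown above.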

"[error] Input BAM file has no basecaller models listed in the header."

Dorado polish requires that the aligned BAM has one or more @RG lines in the header. Each @RG line needs to contain a basecaller model used for generating the reads in this group. This information is required to determine the compatibility of the selected polishing model, as well as for auto-resolving the model from data.

When using Dorado aligner please provide the input basecalled reads in the BAM format. The basecalled reads BAM file (e.g. calls.bam) contains the @RG header lines, and this will be propagated into the aligned BAM file. Example:

dorado aligner draft.fasta calls.bam | samtools sort --threads <num_threads> > aligned_reads.bam
samtools index aligned_reads.bam

Alternatively, Dorado aligner will automatically sort and index the alignments when an output directory is specified instead of stdout.

dorado aligner --output-dir out draft.fasta calls.bam

If FASTQ reads are given to Dorado aligner instead of a BAM, note that this will only work for HTS-style FASTQ headers, such as:

@74960cfd-0b82-43ed-ae04-05162e3c0a5a qs:f:27.7534 du:f:75.1604 ns:i:375802 ts:i:1858 mx:i:1 ch:i:295 st:Z:2024-08-29T22:06:03.400+00:00 rn:i:585 fn:Z:FBA17175_7da7e070_f8e851a5_5.pod5 sm:f:414.101 sd:f:107.157 sv:Z:pa dx:i:0 RG:Z:f8e851a5d56475e9ecaa43496da18fad316883d8_dna_r10.4.1_e8.2_400bps_sup@v5.0.0
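
To check a FASTQ file before aligning, one option is to look for the RG:Z tag in its first header line. The sketch below uses the presence of that tag as a stand-in for "HTS-style". That is an assumption for illustration, since this document does not list exactly which tags Dorado requires, and `fastq_read_group` is a hypothetical helper name.

```shell
# Print the RG:Z value from the first header line of a FASTQ stream on stdin.
# Fails (non-zero exit) when that header carries no RG:Z tag.
# ASSUMPTION: an RG:Z tag is used here as the marker of an HTS-style header.
fastq_read_group() {
    awk '
        NR == 1 {
            # Field 1 is the read name; tags follow, separated by whitespace.
            for (i = 2; i <= NF; i++)
                if ($i ~ /^RG:Z:/) { print substr($i, 6); found = 1 }
            exit
        }
        END { exit !found }'
}
```

For the header shown above, `fastq_read_group < reads.fastq` would print the read group string that ends in the basecaller model name. An old MinKNOW-style header has no such tag, so the function fails.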

Dorado polish currently supports data generated using only the simplex basecallers.

"[error] Input BAM file was not aligned using Dorado."

Dorado polish accepts only BAMs aligned with Dorado aligner. Aligners other than Dorado aligner are not supported.

Example usage:

dorado aligner <draft.fasta> <reads.bam> | samtools sort --threads <num_threads> > aln.bam
samtools index aln.bam

"[error] The input BAM contains more than one read group. Please specify --RG to select which read group to process."

The input BAM file may contain more than one read group. In this case, Dorado polish requires that a single read group is selected for processing using the --RG <id> command line argument. The <id> should exactly match the ID: field in one of the @RG lines in the input BAM/SAM file.

Specifying the --RG option will filter out any read which does not belong to that read group and will apply the appropriate polishing model for that read group based on the basecaller model specified in the corresponding @RG line in the input BAM file.
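
To find the values that --RG accepts, the read group IDs can be listed from the BAM header. The sketch below is one way to do that. `list_read_group_ids` is a hypothetical helper name, and it reads the header text from stdin.

```shell
# Print the ID of every @RG line in a SAM header read from stdin,
# one ID per line, in header order.
list_read_group_ids() {
    awk -F'\t' '
        /^@RG/ {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^ID:/) print substr($i, 4)
        }'
}
```

For example, `samtools view -H reads_to_draft.bam | list_read_group_ids` prints the candidate IDs, one of which can then be passed as `--RG <id>`.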

Specifying a read group which corresponds to duplex data will not work because Dorado polish currently does not have duplex polishing models available.

For a duplex BAM, note that by default the simplex parents of the duplex reads will also be present in the output BAM file from Dorado. Consider filtering these out first if this could bias your results.

"[error] Duplex basecalling models are not supported."