Dorado Correct
Should I use correct or polish?
Dorado supports single-read error correction through dorado correct, which is essentially a reimplementation of the HERRO algorithm.
HERRO Algorithm
Citation
Telomere-to-telomere phased genome assembly using error-corrected Simplex nanopore reads
Dominik Stanojević, Dehui Lin, Paola Florez de Sessions, Mile Šikić. bioRxiv 2024.05.18.594796
HERRO uses all-vs-all alignment followed by haplotype-aware correction using a deep learning model to achieve higher single-read accuracies. The corrected reads are primarily useful for generating de novo assemblies of diploid organisms.
The original paper containing implementation details can be downloaded from bioRxiv.
Quick start
To run Dorado correct, pass in a FASTQ file or a bgzip-compressed FASTQ.gz file. Note that bgzip must be used for compression instead of plain gzip, because HTSlib does not support FASTA/FASTQ files compressed with plain gzip.
Dorado will perform read correction on this dataset after automatically downloading the
required HERRO model.
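If the reads were compressed with plain gzip, they can be recompressed before correction. The following is a minimal sketch, assuming gzip and bgzip (shipped with HTSlib) are on the PATH; the function name and file names are hypothetical.

```shell
# Hypothetical helper: convert a plain-gzip FASTQ into a bgzip-compressed one.
# Usage: recompress_with_bgzip reads.gzip.fastq.gz reads.fastq.gz
recompress_with_bgzip() {
    # gzip -dc streams the decompressed reads; bgzip -c writes BGZF data to stdout.
    gzip -dc "$1" | bgzip -c > "$2"
}
```

The recompressed file can then be passed to dorado correct in place of the original.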
You may pre-download the HERRO model if required and select it with --model-path, as described under Usage.
Important
Currently there is only one Dorado correct model, herro-v1, for the r10.4 run condition.
Usage
Dorado correct supports FASTQ(.gz) as the input and generates a FASTA file as output.
An index file is generated for the input FASTQ file in the same folder unless one is already present. Please ensure that the folder with the input file is writeable by the Dorado process and has sufficient disk space (no more than 10GB should be necessary for a whole genome dataset).
To correct reads, run:
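A minimal sketch of the basic invocation, using the same form as the other commands on this page. The wrapper function and the file names are hypothetical; only dorado correct and the redirect of its output come from this page.

```shell
# Hypothetical wrapper. Usage: correct_reads reads.fastq corrected_reads.fasta
correct_reads() {
    # Corrected reads are written to stdout as FASTA, so redirect them to a file.
    dorado correct "$1" > "$2"
}
```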
All required model weights are downloaded automatically by Dorado. However, the weights can also be pre-downloaded and passed via command line in case of offline execution. To do so, run:
dorado download --model herro-v1
dorado correct --model-path herro-v1 reads.fq.gz > corrected_reads.fasta
Separate mapping and inference
Dorado correct can run mapping (CPU-only stage) and inference (GPU-intensive stage) individually.
This enables separation of the CPU and GPU heavy stages into individual steps which can
even be run on different nodes with appropriate compute characteristics. For example:
dorado correct reads.fastq --to-paf > overlaps.paf
dorado correct reads.fastq --from-paf overlaps.paf > corrected_reads.fasta
Gzipped PAF is currently not supported for the --from-paf option.
Resume
If a run was stopped or has failed, Dorado correct can resume from where the previous run stopped.
The --resume-from argument takes a list of previously corrected reads, provided via a .fai index of the previous run's output. Reads that have already been processed are skipped when resuming.
To generate the .fai file from the output of a previous Dorado correct run, use:
# corrected_reads.fasta is the output from the previously interrupted run.
mv corrected_reads.fasta corrected_reads.res.fasta
samtools faidx corrected_reads.res.fasta
To continue the run, pass the generated .fai file to Dorado correct via --resume-from.
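A minimal sketch of the resumed run, assuming the index was created with samtools faidx as described above. The wrapper function is hypothetical; the file names follow the ones used on this page.

```shell
# Hypothetical wrapper.
# Usage: resume_correct reads.fastq corrected_reads.res.fasta.fai corrected_reads.fasta
resume_correct() {
    # Reads named in the index ($2) are skipped; the remaining reads are corrected.
    dorado correct "$1" --resume-from "$2" > "$3"
}
```

Writing to a new output file keeps the renamed output of the interrupted run (corrected_reads.res.fasta) untouched.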
The input for --resume-from can be any plain text file in which the first whitespace-delimited column (or the full row) holds the sequence names to skip, one per row.
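If samtools is not available, such a list can be built directly from the FASTA headers. This is a minimal sketch, assuming standard FASTA headers that start with ">"; the function name and the file done_reads.txt are hypothetical.

```shell
# Hypothetical helper: print one sequence name per line from a FASTA file.
# Usage: fasta_names corrected_reads.res.fasta > done_reads.txt
fasta_names() {
    # $1 inside awk is the first whitespace-delimited field of the header;
    # substr(..., 2) drops the leading ">".
    awk '/^>/ { print substr($1, 2) }' "$1"
}
```

The resulting file holds the same names as the first column of a .fai index and can be passed to --resume-from in the same way.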
Specifying resources
Dorado correct will automatically select all available compute resources to perform error correction.
To specify resources manually use:
- -x / --device to specify specific GPU resources (if available).
- --threads to set the maximum number of threads to be used during correction.
- --infer-threads to set the number of threads used per-device for inference.
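A minimal sketch that combines the three options. The wrapper function is hypothetical, and the values cuda:0, 32 and 2 are placeholders rather than recommendations; the device string format is an assumption, so check dorado correct --help on your installation.

```shell
# Hypothetical wrapper. Usage: correct_with_resources reads.fastq corrected_reads.fasta
correct_with_resources() {
    # --device: which device to use (assumed format, e.g. cuda:0)
    # --threads: upper limit on threads used during correction
    # --infer-threads: inference threads per device
    dorado correct "$1" --device cuda:0 --threads 32 --infer-threads 2 > "$2"
}
```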
The error correction tool is both compute and memory intensive.
As a result, it is best run on a system with:
- multiple high performance CPU cores ( > 64 cores)
- large system memory ( > 256GB)
- a modern GPU with a large VRAM ( > 32GB)
HPC support
Dorado correct now also provides a feature to enable simpler distributed computation.
It is now possible to run a single block of the input target reads file, specified by the block ID. This enables granularization of the correction process, making it possible to easily utilise distributed HPC architectures.
For example, this is now possible:
# Determine the number of input target blocks.
num_blocks=$(dorado correct in.fastq --compute-num-blocks)
# For every block, run correction of those target reads.
for ((i=0; i<${num_blocks}; i++)); do
    dorado correct in.fastq --run-block-id ${i} > out.block_${i}.fasta
done
# Optionally, concatenate the corrected reads.
cat out.block_*.fasta > out.all.fasta
On an HPC system, individual blocks can simply be submitted to the cluster management system. For example:
# Determine the number of input target blocks.
num_blocks=$(dorado correct in.fastq --compute-num-blocks)
# For every block, run correction of those target reads.
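for ((i=0; i<${num_blocks}; i++)); do
    # Note: as written, the redirect applies to qsub itself on the submitting host.
    # Adapt it to your scheduler so that the job writes out.block_${i}.fasta.
    qsub <options> dorado correct in.fastq --run-block-id ${i} > out.block_${i}.fasta
done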
If the available HPC nodes do not have GPUs, their CPU power can still be leveraged for overlap computation: a blocked run can be combined with the --to-paf option. The inference stage can then be run afterwards on another node with GPU devices, using the generated PAF and the --from-paf option.
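The two stages could be split along the following lines. This is a sketch under assumptions: the function names and output file names are hypothetical, and it assumes that the PAF file of a single block can be passed to --from-paf on its own, which this page does not state explicitly.

```shell
# Stage 1, CPU-only nodes: compute overlaps for every block.
# Usage: map_all_blocks in.fastq
map_all_blocks() {
    num_blocks=$(dorado correct "$1" --compute-num-blocks) || return 1
    i=0
    while [ "$i" -lt "$num_blocks" ]; do
        # Combine a blocked run with --to-paf: overlaps only, no inference.
        dorado correct "$1" --run-block-id "$i" --to-paf > "overlaps.block_${i}.paf" || return 1
        i=$((i + 1))
    done
}

# Stage 2, GPU node: run inference for one block from its PAF file.
# Usage: infer_block in.fastq 0
infer_block() {
    dorado correct "$1" --from-paf "overlaps.block_$2.paf" > "out.block_$2.fasta"
}
```

The PAF files are written uncompressed, since gzipped PAF is not supported by --from-paf.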
Frequently asked questions / Troubleshooting
High memory consumption
In case the process is consuming too much memory (RAM) for your system, try running it with a smaller index size. For example:
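A minimal sketch using the --index-size option from the CLI reference. The wrapper function is hypothetical and a value such as 4G is only illustrative; check dorado correct --help for the accepted format and the default on your platform.

```shell
# Hypothetical wrapper. Usage: correct_small_index reads.fastq 4G corrected_reads.fasta
correct_small_index() {
    # A smaller index lowers the memory footprint of mapping and alignment.
    dorado correct "$1" --index-size "$2" > "$3"
}
```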
The auto-computed inference batch size may still be too high for your system. If you are experiencing warnings/errors regarding available GPU memory, try reducing the batch size by selecting it manually. For example:
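A minimal sketch using the --batch-size option from the CLI reference. The wrapper function is hypothetical, and a suitable batch size depends on the GPU, so any value passed in (for example 16) is a placeholder to tune.

```shell
# Hypothetical wrapper. Usage: correct_small_batch reads.fastq 16 corrected_reads.fasta
correct_small_batch() {
    # Override the auto-computed inference batch size with an explicit value.
    dorado correct "$1" --batch-size "$2" > "$3"
}
```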
Missing reads
If your output FASTA file contains far fewer corrected reads than the input, please check the following:
- The input dataset has average read length >=10kbp.
- Dorado Correct is designed for long reads, and it will not work on short libraries.
- Input coverage is reasonable, preferably >=30x.
- Check the average base qualities of the input dataset. Dorado Correct expects accurate inputs for both mapping and inference.
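The read length and coverage checks above can be approximated with a short awk pass over the input. This is a minimal sketch, assuming an uncompressed FASTQ with exactly four lines per record and a genome size that you supply yourself; the function name is hypothetical, and base qualities are not inspected.

```shell
# Hypothetical helper: report read count, mean read length and approximate coverage.
# Usage: fastq_stats reads.fastq 3100000000
fastq_stats() {
    awk -v genome_size="$2" '
        NR % 4 == 2 { n++; bases += length($0) }   # sequence line of each record
        END {
            if (n == 0) { print "no reads found"; exit 1 }
            printf "reads=%d mean_length=%.1f coverage=%.2fx\n", n, bases / n, bases / genome_size
        }' "$1"
}
```

For a compressed input, decompress it to a temporary file first and run the helper on that file.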
Some corrected reads have a suffix of type :0, :1, etc.
When a region of an input read has low/zero coverage, Dorado correct (and HERRO) will split it in this region and produce one or more chunks for that read.
If this occurs, the corrected chunks will have a suffix of type :<number> added to the header, where <number> is the ordinal ID of this chunk along the input read.
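To see how many chunks each input read produced, the suffix can be stripped from the FASTA headers and the names counted. This is a minimal sketch, assuming the original read names do not themselves end in a colon followed by digits; the function name is hypothetical.

```shell
# Hypothetical helper: print "<read name><TAB><number of chunks>" for each read.
# Usage: count_chunks_per_read corrected_reads.fasta
count_chunks_per_read() {
    awk '/^>/ {
        name = substr($1, 2)          # header without ">" and without any description
        sub(/:[0-9]+$/, "", name)     # drop a trailing :<number> chunk suffix, if present
        count[name]++
    }
    END { for (name in count) print name "\t" count[name] }' "$1" | sort
}
```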
CLI reference
Here's a slightly re-formatted output from the Dorado correct subcommand for reference.
Info
Please check the --help output of your own installation of dorado, as this page may be outdated. Argument defaults have been omitted because they are platform specific.
> dorado correct --help
Positional arguments:
  reads               Path to a file with reads to correct in FASTQ format.
Optional arguments:
  -h, --help          shows help message and exits
  -v, --verbose       [may be repeated]
Resources arguments:
  -x, --device        Specify CPU or GPU device
  -t, --threads       Number of threads for processing. Default uses all available threads.
  --infer-threads     Number of threads per device.
Input/output arguments:
  -m, --model-path    Path to correction model folder.
  -p, --from-paf      Path to a PAF file with alignments. Skips alignment computation.
  --to-paf            Generate PAF alignments and skip consensus.
  --resume-from       Resume a previously interrupted run.
                      Requires a path to a file where sequence headers are stored in the first column
                      (whitespace delimited), one per row.
Advanced arguments:
  -b, --batch-size    Batch size for inference.
  -i, --index-size    Size of index for mapping and alignment. Decrease index size to lower memory footprint.