Dorado SmallVar

Should I use `smallvar` or `polish`?

Dorado smallvar is a small variant caller for diploid samples aligned to a haploid species reference (e.g. GRCh38), whereas Dorado polish is intended for workflows in which reads are aligned to a haplotype-resolved (or haploid) draft assembly.

Although Dorado polish can also generate a VCF file of variants, there are some substantial distinctions between the two tools.

| `dorado polish` | `dorado smallvar` |
| -------- | ------- |
| Polishing of draft assemblies | Diploid variant calling |
| Input is a haplotype-resolved draft assembly | Input is a reference genome |
| Output is a polished sequence | Output is a VCF/gVCF of called diploid variants |
| Optionally, a VCF/gVCF of diffs is output | |
| Uses specialised polishing models | Uses specialised variant calling models |

Quick Start

# Align the reads using dorado aligner, sort and index
dorado aligner <ref.fasta> <reads.bam> | samtools sort --threads <num_threads> > aligned_reads.bam
samtools index aligned_reads.bam

# Call variants
dorado smallvar <aligned_reads.bam> <ref.fasta> > variants.vcf
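
Before calling variants it can help to confirm that the inputs from the steps above are in place. The following is a minimal sketch, not part of Dorado: `check_smallvar_inputs` is a hypothetical helper, and it assumes the index sits next to the BAM file under the default names written by `samtools index` (`.bai` or `.csi`).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not a Dorado command).
# Verifies that the BAM, its index and the reference FASTA exist.
check_smallvar_inputs() {
    local bam="$1" ref="$2"
    [ -f "$bam" ] || { echo "BAM not found: $bam" >&2; return 1; }
    [ -f "$ref" ] || { echo "reference not found: $ref" >&2; return 1; }
    # Assumption: the index uses the default samtools naming next to the BAM.
    if [ ! -f "${bam}.bai" ] && [ ! -f "${bam}.csi" ]; then
        echo "no index for $bam; run: samtools index $bam" >&2
        return 1
    fi
}
```

For example, `check_smallvar_inputs aligned_reads.bam ref.fasta && dorado smallvar aligned_reads.bam ref.fasta > variants.vcf` only starts variant calling when all three files are found.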

Output to a folder

dorado smallvar <aligned_reads> <reference> -o <output_dir>

Specifying -o will write the output to one or more files stored in the given output directory (and create the directory if it doesn't exist). Concretely:

- `<output_dir>/variants.vcf` - contains only variant calls by default, or records for all positions if `--gvcf` is specified.
- `<output_dir>/processed_regions.bed` - BED file of all regions processed by the inference stage.
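
After a run with `-o`, the two files listed above can be checked with a short script. This is only a sketch: `check_smallvar_output` and `count_vcf_records` are hypothetical helper names, and the record count simply counts non-header lines in the VCF.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Dorado) for inspecting an output directory.

# Returns 0 when both expected files exist in the given directory.
check_smallvar_output() {
    local out_dir="$1" status=0 f
    for f in variants.vcf processed_regions.bed; do
        if [ ! -f "${out_dir}/${f}" ]; then
            echo "missing: ${out_dir}/${f}" >&2
            status=1
        fi
    done
    return "$status"
}

# Prints the number of VCF records (lines that do not start with '#').
count_vcf_records() {
    grep -vc '^#' "$1"
}
```

Typical use would be `check_smallvar_output <output_dir> && count_vcf_records <output_dir>/variants.vcf`. With `--gvcf` the count includes records for all positions, not only variant calls.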

Resources

Dorado smallvar automatically selects the compute resources used for variant calling. It can use one or more GPU devices. Variant calling can also run on CPU only, but we strongly recommend running on GPU for acceptable performance. High-memory GPUs are recommended.

To specify resources manually use:

- `-x` / `--device` - to specify specific GPU resources (if available).
- `--threads` - to set the maximum number of threads to be used for everything except inference.
- `--infer-threads` - number of inference workers to use (per device). For CPU-only runs, this specifies the number of CPU inference threads.
- `--batchsize` - batch size for inference, important for controlling memory usage on the GPUs. Automatically computed by default (`--batchsize 0`).

Example:

dorado smallvar aligned_reads.bam reference.fasta --device cuda:0 --threads 24 > variants.vcf

Models

Dorado smallvar auto-resolves the model based on the input BAM file. The BAM file needs to contain the @RG headers with the basecaller model name specified. The most recent version of a compatible model will be selected for variant calling.

Once the model is resolved, Dorado smallvar will either download it or look it up in the models-directory if specified.

For example:

dorado smallvar aligned_reads.bam reference.fasta > variants.vcf

will find the compatible model based on the input BAM file and download it to a temporary folder.

When --models-directory is specified, the resolved model is first looked up in that directory and downloaded only if it is not already there. The specified directory must already exist. Example:

mkdir -p models
dorado smallvar --models-directory models aligned_reads.bam reference.fasta > variants.vcf

More information about the --models-directory can be found in this section

If there are multiple read groups in the input dataset which were generated using different basecaller models, Dorado smallvar will report an error and stop execution. However, it is possible to select a single read group from such a BAM file for processing using the --RG option.
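
To see which basecaller models a BAM file declares before running the caller, the @RG header lines can be inspected. The sketch below reads a SAM header on standard input. It assumes the model name is stored as `basecall_model=<name>` inside the `DS` tag of each @RG line, so check this against your own headers. `list_rg_models` and `count_rg_models` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Dorado). Input: SAM header text on stdin.

# Prints "<read group ID><TAB><basecaller model>" for every @RG line.
# Assumption: the model is recorded as "basecall_model=<name>" in the DS tag.
list_rg_models() {
    awk -F'\t' '
        $1 == "@RG" {
            id = ""; model = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^ID:/) id = substr($i, 4)
                # 15 is the length of the "basecall_model=" prefix.
                if ($i ~ /^DS:/ && match($i, /basecall_model=[^ ]+/))
                    model = substr($i, RSTART + 15, RLENGTH - 15)
            }
            print id "\t" model
        }'
}

# Prints the number of distinct basecaller models found in the header.
count_rg_models() {
    list_rg_models | cut -f2 | sort -u | grep -c .
}
```

For example, `samtools view -H aligned_reads.bam | list_rg_models` lists the read groups. If more than one model appears, pass one of the printed read group IDs to `--RG`.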

As an advanced option, Dorado smallvar offers a parameter to override the auto model selection. A specific model can be given using the --model-override parameter. Note that the use of this parameter is at your own risk - the accuracy of results is not guaranteed.

Valid --model-override values:

| Value | Description |
| -------- | ------- |
| `<basecaller_model>` | Simplex basecaller model name (e.g. `dna_r10.4.1_e8.2_400bps_hac@v6.0.0`) |
| `<variant_model>` | SmallVar variant calling model name (e.g. `dna_r10.4.1_e8.2_400bps_hac@v6.0.0_smallvar@v1.0`) |
| `<path>` | Local path on disk where the model can be loaded from. |

When the <basecaller_model> syntax is used, the most recent version of a compatible model will be selected for variant calling.

Supported basecaller models

dna_r10.4.1_e8.2_400bps_hac@v5.2.0
dna_r10.4.1_e8.2_400bps_hac@v6.0.0
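
A pipeline can test a basecaller model name against this list before starting a long run. The sketch below hard-codes the two names listed above, so it must be updated whenever the supported list changes. `is_supported_basecaller_model` is a hypothetical helper, not a Dorado command.

```shell
#!/usr/bin/env bash
# Basecaller models listed as supported in this section.
# Assumption: this list is copied by hand and may lag behind new releases.
SUPPORTED_BASECALLER_MODELS="dna_r10.4.1_e8.2_400bps_hac@v5.2.0
dna_r10.4.1_e8.2_400bps_hac@v6.0.0"

# Returns 0 when the given name exactly matches a listed model.
is_supported_basecaller_model() {
    printf '%s\n' "$SUPPORTED_BASECALLER_MODELS" | grep -Fxq -- "$1"
}
```

For example, `is_supported_basecaller_model dna_r10.4.1_e8.2_400bps_hac@v6.0.0 || echo "unsupported model" >&2`.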

Common questions and Troubleshooting

I created a merged BAM file composed of multiple different data types. Why can't I call variants on this dataset? Using `--ignore-read-groups` does not help either.

Please see the following section in Dorado polish: I created a merged BAM file composed of multiple different data types

Memory consumption / Torch out-of-memory (OOM) issues

The inference batch size is computed to fit the largest possible batches into the available GPU memory (default --batchsize 0).

There are two cases when an OOM issue can happen:

The auto batch size feature is underestimating the memory consumption. If an Out-Of-Memory (OOM) warning/error is raised with the default auto batch size, try setting the batch size manually to a fixed value instead. For example:
```
dorado smallvar aligned_reads.bam reference.fasta --batchsize <number> > variants.vcf
```
A good rule of thumb would be --batchsize 10 for a large GPU, or try using a smaller value if this is still too high.

Additionally, the number of inference workers can be reduced to lower the memory usage (the default is 2 workers per device):
```
dorado smallvar aligned_reads.bam reference.fasta --infer-threads 1 > variants.vcf
```
GPU memory fragmentation during the run. This can happen when there were many small allocations followed by a large memory allocation which then cannot be fitted into a single contiguous block of memory. Such errors will have a specific Torch error message which looks like this:

> Exception caught: CUDA out of memory. Tried to allocate 15.12 GiB. GPU 1 has a total capacity of 31.73 GiB of which 14.77 GiB is free. Including non-PyTorch memory, this process has 16.95 GiB memory in use. Of the allocated memory 2.10 GiB is allocated by PyTorch, and 14.46 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The key portion here is "2.10 GiB is allocated by PyTorch, and 14.46 GiB is reserved by PyTorch but unallocated", which means that almost all non-free memory is actually unused.

In this case, setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, as the error message suggests, should resolve the issue.

Example:
```
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True dorado smallvar aligned_reads.bam reference.fasta > variants.vcf
```
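
For the first case above (auto batch size too large), trying fixed batch sizes by hand can be automated. The sketch below retries with progressively smaller values, starting from the rule-of-thumb value of 10. It is not a Dorado feature: `smallvar_with_fallback` and the `DORADO_BIN` variable are hypothetical, and the loop retries on any non-zero exit status, not only on OOM errors.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of Dorado).
# DORADO_BIN is an assumed override so the loop can be dry-run with a stub.
DORADO_BIN="${DORADO_BIN:-dorado}"

# Usage: smallvar_with_fallback <aligned_reads.bam> <reference.fasta> <out.vcf>
# Prints the batch size that succeeded; returns 1 if every attempt failed.
smallvar_with_fallback() {
    local bam="$1" ref="$2" out="$3" bs
    for bs in 10 5 2 1; do
        echo "trying --batchsize ${bs}" >&2
        if "$DORADO_BIN" smallvar "$bam" "$ref" --batchsize "$bs" > "$out"; then
            echo "$bs"
            return 0
        fi
        # Discard the partial output of the failed attempt.
        rm -f "$out"
    done
    return 1
}
```

For example, `smallvar_with_fallback aligned_reads.bam reference.fasta variants.vcf`. Because every failure triggers a retry, check the log of the first failed attempt to confirm that memory was actually the cause.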

"[error] Input BAM file was not aligned using Dorado."

Dorado smallvar only accepts BAM files aligned with Dorado aligner. Other aligners are not supported.

Example usage:

dorado aligner <ref.fasta> <reads.bam> | samtools sort --threads <num_threads> > aln.bam
samtools index aln.bam

"[error] Input BAM file has no basecaller models listed in the header."

Please refer to this section.

"[error] Duplex basecalling models are not supported."

Dorado smallvar currently supports only data generated with simplex basecalling models.

Does Dorado SmallVar phase variants?

At this early stage, Dorado smallvar does not yet produce phased variants in the VCF. Phasing support is a work in progress.