Model Selection

The model argument of the dorado basecaller and dorado duplex commands is used to select basecalling models. Several model selection methods are available to support most use cases. The value of this argument is known as the model complex.

The model complex is interpreted in one of 3 ways depending on its format:

  1. Path: Select a model using a directory path.
  2. Name: Select a model using a full model name.
  3. Variant: Select models based on some properties.

The Name and Variant methods will automatically download models using the model discovery rules.

Model selection via Path

Existing models can be selected using their directory path.

Use dorado download to download models into a directory and then specify the model using the model directory path. For example:

# Download a model into a --models-directory
dorado download --model dna_r10.4.1_e8.2_400bps_hac@v5.2.0 --models-directory ~/models/

# Use the downloaded model
dorado basecaller ~/models/dna_r10.4.1_e8.2_400bps_hac@v5.2.0 reads/ ... > calls.bam

Model selection via Name

Dorado supports selecting basecaller models by their full name. If the model name is in the list of available models it will be found using the model discovery rules.

dorado basecaller dna_r10.4.1_e8.2_400bps_hac@v5.2.0 reads/ ... > calls.bam

Model selection via Variant

Using a variant model complex instructs Dorado to select basecalling models based on the type of data to be basecalled. The example below will download the latest hac model for the type of data in reads/, which could be either DNA or RNA.

dorado basecaller hac reads/ ... > calls.bam

Model variant syntax

A model variant must start with the simplex model speed and follows this syntax:

speed[version][,mod[version]]*
  • [] - Enclose an optional field.
  • * - The field may be repeated zero or more times.
  • , - All items must be comma-separated.

speed

The model speed can be any of fast, hac or sup.

version

The version takes the form of @vX.Y.Z or @latest.

If @latest is used, the latest available model version is used. This is the default i.e. hac -> hac@latest.

X, Y and Z are major, minor, and patch version numbers (e.g. @v1.2.3).

Missing trailing values are assumed to be zero e.g. @v1.2 -> @v1.2.0.

Missing internal values (e.g. @v0..1) and trailing periods (e.g. @v1.) are not permitted.
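
The version rules above can be sketched as a small shell helper. This is an illustration only, not part of Dorado: the function name normalize_version is hypothetical, and the sketch assumes that the only accepted forms are @latest, @vX, @vX.Y and @vX.Y.Z.

```shell
# Hypothetical helper (not part of Dorado): normalise a version string
# following the rules described above.
normalize_version() {
    local v="${1:-@latest}"        # an omitted version defaults to @latest
    if [ "$v" = "@latest" ]; then
        printf '%s\n' "$v"
        return 0
    fi
    # Accept @vX, @vX.Y or @vX.Y.Z; reject empty fields and trailing periods.
    if ! printf '%s' "$v" | grep -Eq '^@v[0-9]+(\.[0-9]+){0,2}$'; then
        printf 'invalid version: %s\n' "$v" >&2
        return 1
    fi
    # Split the numeric part on periods and pad missing trailing values with 0.
    local IFS=.
    set -- ${v#@v}
    printf '@v%s.%s.%s\n' "$1" "${2:-0}" "${3:-0}"
}
```

With this sketch, normalize_version @v1.2 prints @v1.2.0, while normalize_version @v0..1 and normalize_version @v1. both fail with a non-zero exit status.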

mod

Multiple Modification Models

More than one modification model may be selected at once and each must be separated by a comma.

For example: sup,6mA,5mC@latest

The mod field can be any modification name which is available for the simplex model and can be optionally followed by a version.

Examples: 6mA, m6A, pseU, 5mC@v2 and 5mCG_5hmCG@v1.0.0.

Automatically selected modification models will always match the base simplex model version and will be the latest compatible version unless a specific version is set by the user.

Multiple modification models must use different canonical bases

When selecting multiple modification models, only one modification per canonical base may be active at once.

For example, sup,4mC,5mC is invalid as both modification models operate on the C canonical base context.

This is because the reported modification probabilities could be nonsensical: each model could report a different modification with high confidence at the same position.
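
The one-modification-per-canonical-base rule can be checked before launching a run with a sketch like the one below. It is not Dorado's implementation: the function names are hypothetical, and the lookup table is an assumption that covers only the modification names mentioned on this page.

```shell
# Illustrative lookup (an assumption, not Dorado's implementation): map a
# modification name to the canonical base it modifies.
canonical_base() {
    case "${1%%@*}" in                 # strip any @version suffix first
        6mA|m6A)            echo A ;;
        4mC|5mC|5mCG_5hmCG) echo C ;;
        pseU)               echo U ;;
        *)                  return 1 ;;
    esac
}

# Return 0 if no two modifications in a model variant share a canonical base,
# 1 on a conflict, and 2 if a modification is not in the lookup above.
check_variant_mods() {
    local variant="$1" seen="" mod base
    local mods="${variant#*,}"         # everything after the speed field
    if [ "$mods" = "$variant" ]; then
        return 0                       # no comma, so no modifications
    fi
    local IFS=,
    for mod in $mods; do
        base="$(canonical_base "$mod")" || {
            echo "unknown modification: $mod" >&2
            return 2
        }
        case "$seen" in
            *"$base"*)
                echo "conflict on canonical base $base: $mod" >&2
                return 1
                ;;
        esac
        seen="$seen$base"
    done
}
```

For example, check_variant_mods sup,6mA,5mC@latest succeeds, whereas check_variant_mods sup,4mC,5mC fails because both modifications act on C.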

See the Model List for a list of all available models.

Examples of model variants

Model Complex              Description
fast                       Latest compatible fast model
hac                        Latest compatible hac model
sup                        Latest compatible sup model
hac@latest                 Latest compatible hac simplex basecalling model
hac@v4.2.0                 Simplex basecalling hac model with version v4.2.0
hac@v3.5                   Simplex basecalling hac model with version v3.5.0
hac,5mCG_5hmCG             Latest compatible hac simplex model and latest 5mCG_5hmCG modified bases model matching the chosen simplex model
hac,5mCG_5hmCG@v3          Latest compatible hac simplex model and compatible 5mCG_5hmCG modified bases model with version v3.0.0
sup@v5.2,5mCG_5hmCG,6mA    Simplex basecalling sup model with version v5.2.0 and latest compatible 5mCG_5hmCG and 6mA modification models

Here are some examples of model complexes in use:

# Simplex basecalling
dorado basecaller hac                   reads/ > calls.bam # HAC simplex basecalling
dorado basecaller hac@v4.1.0            reads/ > calls.bam # HAC simplex with specific version

# Simplex modification basecalling
dorado basecaller sup,6mA               reads/ > calls.bam # SUP with modifications
dorado basecaller sup,6mA,5mCG_5hmCG    reads/ > calls.bam # Multiple modification models
dorado basecaller sup@v4.2.0,6mA@v1     reads/ > calls.bam # Setting versions

# Duplex basecalling
dorado duplex  sup@v4.1.0  reads/ > calls.bam # SUP duplex basecalling with specific version
dorado duplex  sup,5mC     reads/ > calls.bam # SUP duplex basecalling with modification model

Selecting modified base models

Via model Variant

Recommended

Please refer to the model variant section which contains examples of both simplex and modified base model selection using a model variant.

This is the recommended method of selecting both simplex and modified bases models.

The --modified-bases and --modified-bases-models arguments are not permitted when using the model variant syntax.

Via modified bases model Name

As an extension to the model selection via Name, a single modified base model can be selected using its name as shown below. The simplex model it requires is also located using the model discovery rules.

dorado basecaller dna_r10.4.1_e8.2_400bps_hac@v5.2.0_5mC_5hmC@v1 reads/ ... > calls.bam

The --modified-bases and --modified-bases-models arguments can be used to select additional modified bases models.

Via --modified-bases CLI argument

Space Separated

Multiple modified base model codes must be space separated.

Much like the model complex, the --modified-bases argument takes a space-separated list of modification codes and automatically resolves which modified base models to use based on your simplex basecalling model selection. The latest available modified base model is always selected, because this argument cannot specify a version (unlike the model complex).

Examples:

dorado basecaller hac         reads/  --modified-bases pseU      > calls.bam
dorado basecaller hac         reads/  --modified-bases 6mA 5mC   > calls.bam
dorado duplex /simplex/model/ reads/  --modified-bases 5mC       > calls.bam

Via --modified-bases-models CLI argument

Comma separated

Multiple modified base model paths must be comma separated.

Similarly to how simplex basecall models can be specified using a path to an existing simplex model, modified base models can be specified via a path using the --modified-bases-models argument.

See also documentation for the model downloader.

# Download the models into a models directory
dorado download --model rna004_130bps_hac@v5.2.0          --models-directory ~/models
dorado download --model rna004_130bps_hac@v5.2.0_m6A@v1   --models-directory ~/models
dorado download --model rna004_130bps_hac@v5.2.0_pseU@v1  --models-directory ~/models

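# Run the basecaller
# $HOME is used here because the shell only expands ~ at the start of a word,
# so a ~ that follows the comma would be passed on unexpanded
dorado basecaller ~/models/rna004_130bps_hac@v5.2.0/ reads/ \
    --modified-bases-models "$HOME/models/rna004_130bps_hac@v5.2.0_m6A@v1,$HOME/models/rna004_130bps_hac@v5.2.0_pseU@v1" \
    > calls.bam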
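
When several modified base model paths are involved, the comma-separated value can be assembled from individual paths. The sketch below is a convenience only: join_by_comma is a hypothetical helper, not a Dorado feature. It uses $HOME rather than ~ because the shell expands ~ only at the start of a word, not after a comma.

```shell
# Hypothetical helper: join all arguments with commas.
join_by_comma() {
    local IFS=,
    printf '%s\n' "$*"    # "$*" joins arguments using the first character of IFS
}

# Build the value for --modified-bases-models from the paths used above.
mods="$(join_by_comma \
    "$HOME/models/rna004_130bps_hac@v5.2.0_m6A@v1" \
    "$HOME/models/rna004_130bps_hac@v5.2.0_pseU@v1")"
echo "$mods"

# The result can then be passed to Dorado, for example:
# dorado basecaller "$HOME/models/rna004_130bps_hac@v5.2.0/" reads/ \
#     --modified-bases-models "$mods" > calls.bam
```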

Model discovery

When models are not selected by path, Dorado searches for them in the following locations, listed in priority order:

  1. The --models-directory path set via CLI argument.
  2. The DORADO_MODELS_DIRECTORY path set via environment variable.
  3. The current working directory.
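
The priority order above can be sketched in shell as follows. This is only an approximation of the behaviour described here, not Dorado's code: the function names are hypothetical, the first argument of model_search_path stands in for the value of --models-directory, and the sketch assumes that every configured location is searched in turn until the model is found.

```shell
# Print the search locations in priority order (hypothetical helper).
# $1 stands in for the --models-directory value; empty if it was not given.
model_search_path() {
    local cli_dir="${1:-}"
    if [ -n "$cli_dir" ]; then
        printf '%s\n' "$cli_dir"
    fi
    if [ -n "${DORADO_MODELS_DIRECTORY:-}" ]; then
        printf '%s\n' "$DORADO_MODELS_DIRECTORY"
    fi
    pwd
}

# Print the first directory that contains the named model: find_model NAME [DIR]
find_model() {
    local name="$1" dir
    while IFS= read -r dir; do
        if [ -d "${dir%/}/$name" ]; then
            printf '%s\n' "${dir%/}/$name"
            return 0
        fi
    done <<EOF
$(model_search_path "${2:-}")
EOF
    return 1               # not found in any location
}
```

For example, find_model dna_r10.4.1_e8.2_400bps_hac@v5.2.0 ~/models prints the path of the first matching model directory, or returns a non-zero status if none of the locations contains it.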

Automatic model download

If --models-directory or DORADO_MODELS_DIRECTORY environment variable is set, models will be downloaded into the nominated directory. Otherwise models will be downloaded into the current working directory and deleted after Dorado has finished.

To avoid repeatedly downloading models it is recommended that the --models-directory argument or DORADO_MODELS_DIRECTORY environment variable is set.

The example below shows that, without --models-directory, automatic model selection will download and clean up models on every use of Dorado.

# Model is downloaded into temporary directory and cleaned when Dorado is finished
dorado basecaller hac pod5s/ > calls.bam

ls *hac*
# No results

# Model is re-downloaded and cleaned up again
dorado basecaller hac pod5s/ > calls.bam

The example below shows that when using --models-directory, automatic model selection will download models that are missing and reuse previously existing models.

# Model is downloaded into models/
dorado basecaller hac pod5s/ --models-directory models/ > calls.bam

ls models/
    dna_r10.4.1_e8.2_400bps_hac@v5.2.0

# Model is re-used
dorado basecaller hac pod5s/ --models-directory models/ > calls.bam

Models can also be re-used in the same way by setting the DORADO_MODELS_DIRECTORY environment variable. It can be set once in a user configuration file and then re-used without needing to pass the --models-directory argument on the command line.

export DORADO_MODELS_DIRECTORY="/path/to/models/"

The environment variable can also be set inline on the dorado command line, although this is shown only for completeness.

DORADO_MODELS_DIRECTORY=/path/to/models/ dorado basecaller hac pod5s/ > calls.bam