Model Selection
The Dorado model argument, used by the basecaller and duplex commands, selects the basecalling models.
Several model selection methods are available to support most use cases. The value of this argument is known
as the model complex.
The model complex is interpreted in one of three ways depending on its format:
- Path: Select a model using a directory path.
- Name: Select a model using a full model name.
- Variant: Select models based on some properties.
The Name and Variant methods will automatically download models using the model discovery rules.
Model selection via Path
Existing models can be selected using their directory path.
Use dorado download to download models into a directory and then specify the model using
the model directory path. For example:
# Download a model into a --models-directory
dorado download --model dna_r10.4.1_e8.2_400bps_hac@v5.2.0 --models-directory ~/models/
# Use the downloaded model
dorado basecaller ~/models/dna_r10.4.1_e8.2_400bps_hac@v5.2.0 reads/ ... > calls.bam
Model selection via Name
Dorado supports selecting basecaller models by their full name. If the model name is in the list of available models it will be found using the model discovery rules.
Model selection via Variant
Using a variant model complex instructs Dorado to select basecalling models based on the type of data to
be basecalled. For example, dorado basecaller hac reads/ > calls.bam will download the latest hac model for
the type of data in reads/, which could be either DNA or RNA.
Model variant syntax
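A model variant must start with the simplex model speed, optionally followed by a version and then by zero or
more modifications, each with its own optional version. The syntax notation uses the following symbols:
- []: Encloses an optional field.
- *: The field may be repeated zero or more times.
- ,: All items must be comma-separated.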
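As a rough illustration of the syntax, the sketch below checks a string against an assumed grammar, speed[@version][,mod[@version]]*. That grammar line is reconstructed from the field descriptions on this page, and both the helper name is_model_variant and the characters allowed in a modification name are assumptions. It is not Dorado's own parser.

```shell
# Sketch only. Assumed grammar: speed[@version][,mod[@version]]*
# is_model_variant is a hypothetical helper, not part of Dorado.
is_model_variant() {
    # @latest, or @vX / @vX.Y / @vX.Y.Z
    version='@(latest|v[0-9]+(\.[0-9]+){0,2})'
    # Assumption: modification names use letters, digits and underscores
    mod='[A-Za-z0-9_]+'
    pattern="^(fast|hac|sup)(${version})?(,${mod}(${version})?)*\$"
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

is_model_variant 'sup@v5.2,5mCG_5hmCG,6mA' && echo "valid"
is_model_variant 'hac@v1.' || echo "invalid: trailing period"
```

A string that passes this check is only well-formed. Whether a matching model exists still depends on the Model List.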
speed
The model speed can be any of fast, hac or sup.
version
The version takes the form of @vX.Y.Z or @latest.
If @latest is used, the latest available model version is used. This is the default i.e. hac -> hac@latest.
X, Y and Z are major, minor, and patch version numbers (e.g. @v1.2.3).
Missing trailing values are assumed to be zero e.g. @v1.2 -> @v1.2.0.
Missing internal values @v0..1 and trailing periods @v1. are not permitted.
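The version rules above can be sketched as a small shell function. The name normalize_version is hypothetical and the function only mirrors the rules stated on this page. Dorado's actual handling may differ.

```shell
# Sketch of the version rules described above; not Dorado's own parser.
normalize_version() {
    case "$1" in
        # No version given, or @latest: the default is @latest
        ''|@latest) printf '@latest\n'; return 0 ;;
    esac
    # Accept only @vX, @vX.Y or @vX.Y.Z; this rejects @v0..1 and @v1.
    printf '%s\n' "$1" | grep -Eq '^@v[0-9]+(\.[0-9]+){0,2}$' || return 1
    # Pad missing trailing values with zeros
    case "$1" in
        @v*.*.*) printf '%s\n' "$1" ;;
        @v*.*)   printf '%s.0\n' "$1" ;;
        *)       printf '%s.0.0\n' "$1" ;;
    esac
}

normalize_version '@v1.2'                    # prints @v1.2.0
normalize_version '@v1.' || echo "rejected"  # trailing period is not permitted
```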
mod
Multiple Modification Models
More than one modification model may be selected at once and each must be separated by a comma.
For example: sup,6mA,5mC@latest
The mod field can be any modification name which is available for the simplex model
and can be optionally followed by a version.
Examples: 6mA, m6A, pseU, 5mC@v2 and 5mCG_5hmCG@v1.0.0.
Automatically selected modification models will always match the base simplex model version and will be the latest compatible version unless a specific version is set by the user.
Multiple modification models must use different canonical bases
When selecting multiple modification models, only one modification per canonical base may be active at once.
For example, sup,4mC,5mC is invalid as both modification models operate on
the C canonical base context.
This is because the reported modification probabilities could be nonsensical: the two models could each report a different modification with high confidence at the same position.
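The one-modification-per-canonical-base rule can be illustrated with the sketch below. The lookup in canonical_base is a hypothetical table covering only the modification names mentioned on this page, and mods_are_compatible is an illustrative helper that takes the comma-separated modification list without the leading speed. Neither is part of Dorado.

```shell
# Hypothetical lookup: canonical base for the modification names used on this page.
canonical_base() {
    case "$1" in
        4mC|5mC|5mCG_5hmCG) echo C ;;
        6mA|m6A)            echo A ;;
        pseU)               echo U ;;
        *)                  return 1 ;;
    esac
}

# Returns 0 if no two modifications share a canonical base,
# 1 on a conflict, 2 if a name is not in the lookup above.
mods_are_compatible() {
    seen=""
    for mod in $(printf '%s' "$1" | tr ',' ' '); do
        # Strip any @version suffix before the lookup
        base=$(canonical_base "${mod%%@*}") || return 2
        case "$seen" in
            *"$base"*) return 1 ;;  # base already claimed by an earlier model
        esac
        seen="$seen$base"
    done
    return 0
}

mods_are_compatible '6mA,5mCG_5hmCG' && echo "ok"
mods_are_compatible '4mC,5mC' || echo "conflict: both operate on C"
```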
See the Model List for a list of all available models.
Examples of model variants
| Model Complex | Description |
|---|---|
| fast | Latest compatible fast model |
| hac | Latest compatible hac model |
| sup | Latest compatible sup model |
| hac@latest | Latest compatible hac simplex basecalling model |
| hac@v4.2.0 | Simplex basecalling hac model with version v4.2.0 |
| hac@v3.5 | Simplex basecalling hac model with version v3.5.0 |
| hac,5mCG_5hmCG | Latest compatible hac simplex model and latest 5mCG_5hmCG modified bases model matching the chosen simplex model |
| hac,5mCG_5hmCG@v3 | Latest compatible hac simplex model and compatible 5mCG_5hmCG modified bases model with version v3.0.0 |
| sup@v5.2,5mCG_5hmCG,6mA | Simplex basecalling sup model with version v5.2.0 and latest compatible 5mCG_5hmCG and 6mA modification models |
Here are some examples of model complexes in use:
# Simplex basecalling
dorado basecaller hac reads/ > calls.bam # HAC simplex basecalling
dorado basecaller hac@v4.1.0 reads/ > calls.bam # HAC simplex with specific version
# Simplex modification basecalling
dorado basecaller sup,6mA reads/ > calls.bam # SUP with modifications
dorado basecaller sup,6mA,5mCG_5hmCG reads/ > calls.bam # Multiple modification models
dorado basecaller sup@v4.2.0,6mA@v1 reads/ > calls.bam # Setting versions
# Duplex basecalling
dorado duplex sup@v4.1.0 reads/ > calls.bam # SUP duplex basecalling with specific version
dorado duplex sup,5mC reads/ > calls.bam # SUP duplex basecalling with modification model
Selecting modified base models
Via model Variant
Recommended
Please refer to the model variant section which contains examples of both simplex and modified base model selection using a model variant.
This is the recommended method of selecting both simplex and modified bases models.
The --modified-bases and --modified-bases-models arguments are not permitted when using the model variant syntax.
Via modified bases model Name
As an extension to model selection via Name, a single modified base model can be selected using its full name, for example rna004_130bps_hac@v5.2.0_m6A@v1. If the required simplex model is not available locally, it is also resolved using the model discovery rules.
The --modified-bases and --modified-bases-models arguments can be used to select additional modified bases models.
Via --modified-bases CLI argument
Space Separated
Multiple modified base model codes must be space separated.
Similar to the model complex, the --modified-bases argument takes a space-separated
list of modification codes and automatically resolves which modified base model to use based on your simplex
basecalling model selection. The latest available modified base model is always selected,
as this argument has no way to specify a version (unlike a model complex).
Examples:
dorado basecaller hac reads/ --modified-bases pseU > calls.bam
dorado basecaller hac reads/ --modified-bases 6mA 5mC > calls.bam
dorado duplex /simplex/model/ reads/ --modified-bases 5mC > calls.bam
Via --modified-bases-models CLI argument
Comma separated
Multiple modified base model paths must be comma separated.
Similarly to how simplex basecall models can be specified using a path to an
existing simplex model, modified base models can be specified via a path using the
--modified-bases-models argument.
See also documentation for the model downloader.
# Download the models into a models directory
dorado download --model rna004_130bps_hac@v5.2.0 --models-directory ~/models
dorado download --model rna004_130bps_hac@v5.2.0_m6A@v1 --models-directory ~/models
dorado download --model rna004_130bps_hac@v5.2.0_pseU@v1 --models-directory ~/models
# Run the basecaller
dorado basecaller ~/models/rna004_130bps_hac@v5.2.0/ reads/ \
--modified-bases-models ~/models/rna004_130bps_hac@v5.2.0_m6A@v1,~/models/rna004_130bps_hac@v5.2.0_pseU@v1 \
> calls.bam
Model discovery
When not using paths to select models Dorado will search for models in the following locations listed in priority order:
- The --models-directory path set via CLI argument.
- The DORADO_MODELS_DIRECTORY path set via environment variable.
- The current working directory.
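The priority order above can be sketched as a lookup that checks each location in turn and stops at the first match. The function find_model and its arguments are hypothetical and only illustrate the ordering. Dorado performs this search internally.

```shell
# Sketch of the search order; find_model is a hypothetical helper.
#   $1 = model name
#   $2 = value passed to --models-directory (may be empty)
find_model() {
    model="$1"
    cli_dir="$2"
    for dir in "$cli_dir" "${DORADO_MODELS_DIRECTORY:-}" "$PWD"; do
        # Skip locations that are not set
        [ -n "$dir" ] || continue
        if [ -d "$dir/$model" ]; then
            printf '%s\n' "$dir/$model"
            return 0
        fi
    done
    return 1
}

find_model 'dna_r10.4.1_e8.2_400bps_hac@v5.2.0' "$HOME/models" || echo "not found locally"
```

When no location contains the model, the automatic download described in the next section applies.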
Automatic model download
If --models-directory or DORADO_MODELS_DIRECTORY environment variable is set, models will be downloaded into the nominated directory.
Otherwise models will be downloaded into the current working directory and deleted after Dorado has finished.
To avoid repeatedly downloading models it is recommended that the --models-directory argument or
DORADO_MODELS_DIRECTORY environment variable is set.
The example below shows that without using --models-directory automatic model selection will download and
clean up models on every use of Dorado.
# Model is downloaded into temporary directory and cleaned when Dorado is finished
dorado basecaller hac pod5s/ > calls.bam
ls *hac*
# No results
# Model is re-downloaded and cleaned up again
dorado basecaller hac pod5s/ > calls.bam
The example below shows that when using --models-directory, automatic model selection will
download models that are missing and reuse previously existing models.
# Model is downloaded into models/
dorado basecaller hac pod5s/ --models-directory models/ > calls.bam
ls models/
dna_r10.4.1_e8.2_400bps_hac@v5.2.0
# Model is re-used
dorado basecaller hac pod5s/ --models-directory models/ > calls.bam
Models can be re-used in the same way by setting the DORADO_MODELS_DIRECTORY
environment variable. It can be set once in a user configuration file and then re-used without
needing to pass the --models-directory argument on the command line.
The environment variable can also be set inline on the dorado command line, although this is mentioned only for completeness.
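As a sketch of both approaches, assuming a models directory at ~/models and the pod5s/ input used in the earlier examples (both paths are placeholders, and the choice of shell configuration file is an assumption):

```shell
# Set once, for example in a shell configuration file such as ~/.bashrc
export DORADO_MODELS_DIRECTORY="$HOME/models"

# Later runs reuse models in that directory without --models-directory
dorado basecaller hac pod5s/ > calls.bam

# Inline form: the variable applies to this single command only
DORADO_MODELS_DIRECTORY="$HOME/models" dorado basecaller hac pod5s/ > calls.bam
```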