Simplex Basecalling

Quick start

To basecall a directory of POD5 files or a single POD5 file with the automatically downloaded hac model, run:

dorado basecaller hac pod5s/ > calls.bam

To basecall a single file, replace the directory pod5s/ with the path to that file:

dorado basecaller hac /path/to/reads.pod5 > calls.bam

To automatically download and use the fast or sup models try the following:

dorado basecaller fast pod5s/ > calls.bam
dorado basecaller sup  pod5s/ > calls.bam

If a model has already been downloaded, you can select that simplex model by its path. For more information on how models are downloaded and how they can be reused, see the downloader documentation.

dorado basecaller /path/to/simplex_model/ pod5s/ > calls.bam

Adding modified bases

To add modified basecalling, append the modification names to the model complex, separated by commas, or refer to modified basecalling model selection for details on the other options available.

dorado basecaller hac,5mC     pod5s/ > calls.bam
dorado basecaller sup,6mA,5mC pod5s/ > calls.bam
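The comma-separated model argument used above can also be assembled in a script. The following is a minimal sketch, not part of Dorado: `build_model_complex` is a hypothetical helper name, and the only check it makes is that the first component starts with one of the speeds named on this page (fast, hac, sup), optionally followed by an `@` version suffix. It does not verify that a given modification is available for a given model.

```shell
# Hypothetical helper (not provided by Dorado): join a simplex speed and any
# modification names into a single comma-separated model argument.
build_model_complex() {
    speed=$1
    shift
    # Accept only the speeds named in this page, with an optional @version suffix.
    case "$speed" in
        fast|hac|sup|fast@*|hac@*|sup@*) ;;
        *) echo "unknown simplex speed: $speed" >&2; return 1 ;;
    esac
    complex=$speed
    # Append each remaining argument as one modification name.
    for mod in "$@"; do
        complex="$complex,$mod"
    done
    printf '%s\n' "$complex"
}
```

With the helper defined, the second command above could be written as:

dorado basecaller "$(build_model_complex sup 6mA 5mC)" pod5s/ > calls.bam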

Selecting data

To basecall all reads in a nested directory structure recursively use -r / --recursive:

dorado basecaller hac data/ --recursive  > calls.bam

To basecall only a limited number of reads use the -n / --max-reads argument:

dorado basecaller hac data/ --max-reads 100  > calls.bam

Tip

You can generate a list of read ids using the pod5 view tool.

To basecall a specific selection of reads, use the -l / --read-ids argument with the path to a file containing a newline-delimited list of read ids. Only these read ids will be basecalled.

dorado basecaller hac data/ --read-ids read_ids.txt > calls.bam
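Read-id lists assembled by hand or exported from other tools often contain blank lines, duplicates, or Windows line endings. The sketch below tidies such a file into one id per line. `clean_read_ids` is a hypothetical helper, not a Dorado or pod5 command, and it makes no attempt to check that the ids exist in your data.

```shell
# Hypothetical helper: print a tidied newline-delimited read-id list.
# - strips carriage returns left by Windows line endings
# - drops blank lines and surrounding whitespace
# - keeps only the first occurrence of each id, preserving order
clean_read_ids() {
    tr -d '\r' < "$1" | awk 'NF && !seen[$1]++ { print $1 }'
}
```

For example, assuming a raw list in raw_ids.txt:

clean_read_ids raw_ids.txt > read_ids.txt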

Resume basecalling

If basecalling is interrupted, it is possible to resume basecalling from a BAM file. To do so, use the --resume-from flag to specify the path to the incomplete BAM file.

dorado basecaller hac pod5s/ --resume-from incomplete.bam > calls.bam

Warning

Do not reuse the filenames for --resume-from and the new output.

If they are the same then the interrupted file will be deleted when Dorado is launched and the previous work will be lost.

# WARNING: This will overwrite the existing `resume.bam` file before it is used.
dorado basecaller hac pod5s/ --resume-from resume.bam > resume.bam
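One way to avoid this mistake in scripts is to compare the two paths before launching Dorado. The following is a sketch of such a guard, not a Dorado feature: `check_resume_paths` is a hypothetical function name. It fails if the resume file is missing, if the two paths are written identically, or if they resolve to the same file (for example through a symlink).

```shell
# Hypothetical guard (not part of Dorado): return non-zero unless the resume
# file exists and the new output path refers to a different file.
check_resume_paths() {
    resume_file=$1
    output_file=$2
    if [ ! -f "$resume_file" ]; then
        echo "error: resume file not found: $resume_file" >&2
        return 1
    fi
    # Identical strings, or different names for the same file (-ef compares
    # the underlying device and inode, following symlinks).
    if [ "$resume_file" = "$output_file" ] || [ "$resume_file" -ef "$output_file" ]; then
        echo "error: --resume-from and the new output must be different files" >&2
        return 1
    fi
}
```

Because the shell only opens the redirection target when the second command starts, chaining with && keeps the incomplete file intact if the check fails:

check_resume_paths incomplete.bam calls.bam && dorado basecaller hac pod5s/ --resume-from incomplete.bam > calls.bam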

Read trimming

See read trimming.

Output Folder Structure

If the --output-dir <DIR> argument is set, Dorado basecaller will write output files into a nested folder structure following the MinKNOW output structure specifications.

The chosen directory <DIR> becomes the root of the nested folder structure and replaces /data/ in the specification examples.

Reads with mean Q-score below the --min-qscore threshold are written to the files marked fail. If --min-qscore is not set, a default threshold of 0 is used and all reads are written to files marked pass.
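After a run with --output-dir, it can be useful to check how the output was split. The sketch below counts BAM files under the output root by whether their relative path contains "pass" or "fail". This is an assumption about how the files are marked; check the MinKNOW output structure specification for the exact folder and file names. `count_pass_fail` is a hypothetical helper, and it counts files, not reads.

```shell
# Hypothetical helper: tally BAM files under an output root.
# Assumption: "pass" or "fail" appears somewhere in each file's path
# relative to the root; adjust the patterns to match your layout.
count_pass_fail() {
    root=$1
    # cd into the root first so the root's own name cannot match a pattern.
    pass=$(cd "$root" && find . -type f -name '*.bam' -path '*pass*' | wc -l | tr -d ' ')
    fail=$(cd "$root" && find . -type f -name '*.bam' -path '*fail*' | wc -l | tr -d ' ')
    printf 'pass=%s fail=%s\n' "$pass" "$fail"
}
```

For example, after running `dorado basecaller hac pod5s/ --output-dir out/ --min-qscore 10`, calling `count_pass_fail out/` prints one line with the two file counts.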


CLI reference

For reference, here is a lightly reformatted copy of the help output from the Dorado basecaller subcommand.

Info

Please check the --help output of your own installation of Dorado, as this page may be outdated. Argument defaults have been omitted because they are platform specific.

> dorado basecaller --help

Positional arguments:
  model                             Model selection {fast,hac,sup}@v{version} for automatic model selection including modbases, or path to existing model directory.
  data                              The data directory or POD5 file path.

Optional arguments:
  -h, --help                        shows help message and exits
  -v, --verbose                     [may be repeated]
  -x, --device                      Specify CPU or GPU device: 'auto', 'cpu', 'cuda:all' or 'cuda:<device_id>[,<device_id>...]'.
                                      Specifying 'auto' will choose either 'cpu', 'metal' or 'cuda:all' depending
                                      on the presence of a GPU device.
  --models-directory                Optional directory to search for existing models or download new models into.

Input data arguments:
  -r, --recursive                   Recursively scan through directories to load POD5 files.
  -l, --read-ids                    A file with a newline-delimited list of reads to basecall. If not provided,
                                      all reads will be basecalled.
  -n, --max-reads                   Limit the number of reads to be basecalled.
  --resume-from                     Resume basecalling from the given HTS file. Fully written read records are
                                      not processed again.
  --disable-read-splitting          Disable read splitting

Output arguments:
  --min-qscore                      Discard reads with mean Q-score below this threshold or write them to output files marked `fail` if `--output-dir` is set.
  --emit-moves                      Write the move table to the 'mv' tag.
  --emit-fastq                      Output in fastq format.
  --emit-sam                        Output in SAM format.
  --emit-summary                    If specified, a summary file containing the details of the primary alignments for each read will be emitted to the root of the --output-dir folder. If --output-dir is not set, the summary file is placed in the current working directory. 
  -o, --output-dir                  Output folder which becomes the root of the nested output folder structure.

Alignment arguments:
  --reference                       Path to reference for alignment.
  --bed-file                        Optional bed-file. If specified, overlaps between the alignments and
                                      bed-file entries will be counted, and recorded in BAM output using
                                      the 'bh' read tag.
  --mm2-opts                        Optional minimap2 options string. For multiple arguments surround with
                                      double quotes. 

Modified model arguments:
  --modified-bases                  A space separated list of modified base codes. Choose from:
                                      pseU_2OmeU, pseU, 2OmeG, m6A_DRACH, 4mC_5mC, 5mC_5hmC, 5mCG, m6A, 5mCG_5hmCG, 5mC, inosine_m6A, m5C, 6mA, m5C_2OmeC, inosine_m6A_2OmeA.

  --modified-bases-models           A comma separated list of modified base model names or paths.
  --modified-bases-threshold        The minimum predicted methylation probability for a modified base
                                      to be emitted in an all-context model, [0, 1].
  --modified-bases-batchsize        The modified base models batch size.

Barcoding arguments:
  --kit-name                        Enable barcoding with the provided kit name. Choose from:
                                      EXP-NBD103 EXP-NBD104 EXP-NBD114 EXP-NBD114-24 EXP-NBD196 EXP-PBC001
                                      EXP-PBC096 SQK-16S024 SQK-16S114-24 SQK-DRB004-24 SQK-HTB114-96 SQK-LWB001
                                      SQK-MAB114-24 SQK-MLK111-96-XL SQK-MLK114-96-XL SQK-NBD111-24 SQK-NBD111-96
                                      SQK-NBD114-24 SQK-NBD114-96 SQK-PBK004 SQK-PCB109 SQK-PCB110 SQK-PCB111-24
                                      SQK-PCB114-24 SQK-RAB201 SQK-RAB204 SQK-RBK001 SQK-RBK004 SQK-RBK110-96
                                      SQK-RBK111-24 SQK-RBK111-96 SQK-RBK114-24 SQK-RBK114-96 SQK-RLB001 SQK-RPB004
                                      SQK-RPB114-24 TWIST-16-UDI TWIST-96A-UDI VSK-PTC001 VSK-VMK001 VSK-VMK004 VSK-VPS001.
  --sample-sheet                    Path to the sample sheet to use.
  --barcode-both-ends               Require both ends of a read to be barcoded for a double ended barcode.
  --barcode-arrangement             Path to file with custom barcode arrangement. Requires --kit-name.
  --barcode-sequences               Path to file with custom barcode sequences. Requires --kit-name and --barcode-arrangement. 
  --primer-sequences                Path to fasta file with custom primer sequences, or the name of a supported 3rd-party primer set. If specifying a supported primer set, choose from: 10X_Genomics. 

Trimming arguments:
  --no-trim                         Skip trimming of barcodes, adapters, and primers.
                                      If option is not chosen, trimming of all three is enabled.
  --trim                            Specify what to trim. Options are 'none', 'all', and 'adapters'.
                                      The default behaviour is to trim all detected adapters, primers, and barcodes.
                                      Choose 'adapters' to just trim adapters. The 'none' choice is equivalent to using --no-trim.
                                      Note that this only applies to DNA. RNA adapters are always trimmed.

Poly(A) arguments:
  --estimate-poly-a                 Estimate poly(A)/poly(T) tail lengths (beta feature).
                                      Primarily meant for cDNA and dRNA use cases.
  --poly-a-config                   Configuration file for poly(A) estimation to change default behaviours

Advanced arguments:
  -b, --batchsize                   The number of chunks in a batch. If 0 an optimal batchsize will be selected.