SAM specification

@HD  VN:1.6  SO:unknown
@PG  ID:basecaller PN:dorado VN:0.2.4+3fc2b0f CL:dorado basecaller hac pod5/ DS:gpu:Quadro GV100

Read Group Header

Tag Description
ID <runid>_<basecalling_model>_<barcode_arrangement>
PU <flow_cell_id>
PM <device_id>
DT <exp_start_time>
PL ONT
DS runid=<run_id> basecall_model=<basecall_model_name> modbase_models=<modbase_model_names> experiment_id=<experiment_id> acquisition_start_time=<acquisition_start_time> model_stride=<model_stride>
LB <sample_id>
SM <barcode_name> (only if barcoding, and barcode is not "unclassified")
al <barcode_alias> (only if barcoding, same as SM tag if no alias)
bk <barcode_kit> (only if barcoding, and barcode is not "unclassified")
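
The DS value above is a space-separated list of key=value pairs, so it can be split into a dictionary. The following is a minimal sketch, assuming that no value contains whitespace; the helper name `parse_rg_description` and the example values are hypothetical and not part of dorado.

```python
def parse_rg_description(ds: str) -> dict:
    """Split a read group DS value into a key -> value dictionary.

    Assumption: pairs are separated by whitespace and values contain none.
    """
    fields = {}
    for token in ds.split():
        key, sep, value = token.partition("=")
        if sep:  # skip any token that has no '='
            fields[key] = value
    return fields


# Hypothetical DS value, made up for illustration only.
example_ds = (
    "runid=abc123 basecall_model=model_x modbase_models= "
    "experiment_id=exp1 acquisition_start_time=2023-01-01T00:00:00Z "
    "model_stride=5"
)
rg = parse_rg_description(example_ds)
print(rg["basecall_model"], rg["model_stride"])
```

All values are kept as strings; converting `model_stride` to an integer is left to the caller.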

Read Tags

Tag Description
RG:Z: <runid>_<basecalling_model>_<barcode_arrangement>
qs:f: mean basecall q-score
ts:i: the number of samples trimmed from the start of the signal
ns:i: the basecalled sequence corresponds to the interval signal[ts : ns]; the move table maps to the same interval. Note that ns reflects trimming (if any) from the rear of the signal.
mx:i: read mux
ch:i: read channel
rn:i: read number
st:Z: read start time (in UTC)
du:f: duration of the read (in seconds)
fn:Z: file name
sm:f: scaling midpoint/mean/median (pA to ~0-mean/1-sd)
sd:f: scaling dispersion (pA to ~0-mean/1-sd)
sv:Z: scaling version
mv:B:c sequence to signal move table (optional)
dx:i: bool to signify duplex read (only in duplex mode)
pi:Z: parent read id for a split read
sp:i: start coordinate of split read in parent read signal
bh:i: number of detected bedfile hits (only if alignment was performed with a specified bed-file)
me:I: number of minknow_events identified during sequencing
po:Z: pore type
er:Z: the reason the read ended
bv:Z: the variant of the detected barcode arrangement
bi:B:f an array of barcode info arranged as [score, front_begin_index, front_seq_length, front_score, rear_end_index, rear_seq_length, rear_score]
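
The read tags above follow the standard SAM optional-field layout TAG:TYPE:VALUE. The sketch below shows one way to parse them from the text of a SAM record, slice a signal to the interval signal[ts : ns], and label the bi array. The function names, the example tag values, and the stand-in signal are all hypothetical; in practice a library such as pysam would usually do the parsing.

```python
def parse_tag(field: str):
    """Parse one SAM optional field of the form TAG:TYPE:VALUE."""
    # Split at most twice so values containing ':' (e.g. timestamps) survive.
    tag, typ, value = field.split(":", 2)
    if typ == "i":
        return tag, int(value)
    if typ == "f":
        return tag, float(value)
    if typ == "B":
        # Array: the first item is the element subtype, the rest are values.
        subtype, *items = value.split(",")
        convert = float if subtype == "f" else int
        return tag, [convert(item) for item in items]
    return tag, value  # other types (e.g. Z) are kept as strings


def parse_tags(fields):
    """Parse a list of optional fields into a tag -> value dictionary."""
    return dict(parse_tag(field) for field in fields)


# Order of the bi array as listed in the table above.
BARCODE_INFO_FIELDS = (
    "score", "front_begin_index", "front_seq_length", "front_score",
    "rear_end_index", "rear_seq_length", "rear_score",
)

# Hypothetical optional fields, made up for illustration only.
example_fields = [
    "qs:f:12.5",
    "ts:i:10",
    "ns:i:110",
    "st:Z:2023-01-01T00:00:00.000+00:00",
    "bi:B:f,0.9,3,24,0.95,5,24,0.85",
]
tags = parse_tags(example_fields)

signal = list(range(200))  # stand-in for the raw signal samples
basecalled_signal = signal[tags["ts"]:tags["ns"]]
barcode_info = dict(zip(BARCODE_INFO_FIELDS, tags["bi"]))
```

Because bi is a float array, the index and length entries come back as floats and may need casting to int before use.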

Poly(A/T) Tags

When dorado is run with poly(A/T) estimation enabled, additional tags are added to each SAM record as follows:

  • pt:i is the estimated poly(A/T) tail length in cDNA and dRNA reads
  • pa:B:i is an array of signal positions related to the poly(A/T) estimation, in order:
    • The position in the signal used as the anchor for the poly(A/T) search
    • The start of the poly(A/T) region
    • The end of the poly(A/T) region
    • The start of a secondary poly(A/T) region in the case of plasmids (-1 otherwise, or if not found)
    • The end of a secondary poly(A/T) region in the case of plasmids (-1 otherwise, or if not found)
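
The five entries of the pa array can be given names so that downstream code does not rely on positional indexing. This is a minimal sketch, assuming the array has already been read as a list of integers; the `PolyTailPositions` type, the `decode_pa` helper, and the example values are hypothetical.

```python
from typing import NamedTuple, Optional


class PolyTailPositions(NamedTuple):
    anchor: int                      # signal position used as the search anchor
    start: int                       # start of the poly(A/T) region
    end: int                         # end of the poly(A/T) region
    secondary_start: Optional[int]   # None when the array holds -1
    secondary_end: Optional[int]     # None when the array holds -1


def decode_pa(values) -> PolyTailPositions:
    """Label the five pa entries, mapping the -1 sentinel to None."""
    if len(values) != 5:
        raise ValueError(f"expected 5 pa values, got {len(values)}")
    anchor, start, end, sec_start, sec_end = values
    return PolyTailPositions(
        anchor,
        start,
        end,
        None if sec_start == -1 else sec_start,
        None if sec_end == -1 else sec_end,
    )


# Hypothetical pa values for a read with no secondary region.
positions = decode_pa([1200, 1250, 1900, -1, -1])
primary_span = positions.end - positions.start  # region length in signal samples
```

The span computed here is measured in signal samples, not bases; the estimated tail length in bases is reported separately in pt:i.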