Specification

The Oxford Nanopore Technologies output specifications have more details on supported file formats created by ONT devices and software.

Overview

The file format is, at its core, a collection of Apache Arrow tables, stored in the Apache Feather 2 (also known as Apache Arrow IPC File) format, and bundled into a container format. The container file has the extension .pod5.

Table Schemas

A POD5 file is a custom wrapper format around Apache Arrow that contains several Arrow tables.

All the tables should have the following custom_metadata fields set on them:

| Name | Example Value | Notes |
| --- | --- | --- |
| MINKNOW:pod5_version | 1.0.0 | The version of this specification that the schema was based on. |
| MINKNOW:software | MinKNOW Core 5.2.3 | A free-form description of the software that wrote the file, intended to help pin down the source of files that violate the specification. |
| MINKNOW:file_identifier | cbf91180-0684-4a39-bf56-41eaf437de9e | Must be identical across all tables. Allows checking that the files correspond to each other. |
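
As an illustrative sketch only (not part of the specification), the snippet below assumes each embedded Feather file has been extracted to a standalone Arrow IPC file, and uses pyarrow to check that the MINKNOW:file_identifier entries agree:

import pyarrow.ipc as ipc

def file_identifier(path):
    # custom_metadata entries appear as byte-keyed schema metadata
    return ipc.open_file(path).schema.metadata[b"MINKNOW:file_identifier"]

# Hypothetical paths to the extracted tables; all must report the same identifier.
paths = ["reads.arrow", "signal.arrow", "run_info.arrow"]
assert len({file_identifier(p) for p in paths}) == 1, "tables come from different POD5 files"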

Extension Types

Several fields in the table schemas use custom Arrow extension types.

minknow.uuid

The schemas make extensive use of UUIDs to identify reads. This is stored using an extension type, with the following properties:

Name: "minknow.uuid"
Physical storage: FixedBinary(16)
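
For illustration, a value of this type can be converted to the usual string form with the Python standard library, assuming the 16 bytes are stored in standard RFC 4122 order:

import uuid

# Hypothetical raw 16-byte value as stored in a minknow.uuid column.
raw = bytes.fromhex("cbf9118006844a39bf5641eaf437de9e")
print(uuid.UUID(bytes=raw))  # cbf91180-0684-4a39-bf56-41eaf437de9e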

minknow.vbz

Storage for VBZ-encoded data:

Name: "minknow.vbz"
Physical storage: LargeBinary

Tables

The Reads, Signal and Run Info tables must all be present in a POD5 file. Note that some very early POD5 files produced by pre-0.1 versions of the pod5 library did not include a Run Info table, instead including that information in the Reads table.

Reads Table

The Reads table contains a single row per read, and describes the metadata for each read. The signal column of each read links to the Signal table, and allows a read's signal to be retrieved. The run_info column links to the Run Info table, providing more context for the read and avoiding duplicating data that is common to many or all reads in the file.

Some fields of the Reads table are dictionaries: the distinct values are stored in a lookup written prior to each batch of read rows, and each read row then contains an integer index into that lookup. This allows space savings on fields that would otherwise be repeated. Only simple types are stored in dictionaries, as third-party tools have limited support for dictionaries of structs.
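
As a small illustration of dictionary encoding in general (using pyarrow directly, with made-up pore type names), each distinct value is stored once in the lookup and each row stores only a small integer:

import pyarrow as pa

pore_type = pa.array(["not_set", "not_set", "not_set", "example_pore"])
encoded = pore_type.dictionary_encode()
print(encoded.dictionary)  # the lookup of distinct values
print(encoded.indices)     # one integer index per row: [0, 0, 0, 1]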

[tables/reads.toml] contains specific information about fields in the reads table.

Signal Table

The signal table contains the (optionally compressed) signal data, where each row contains a sequence of sample data and some information about the origin of that data.

[tables/signal.toml] contains specific information about fields in the signal table.

Run Info Table

The run info table contains a single row per MinKNOW run that any read in the file came from.

[tables/run_info.toml] contains specific information about fields in the run info table.

Combined File Layout

Layout

<signature "\213POD\r\n\032\n">
<section marker: 16 bytes>
<embedded file 1 (padded to 8-byte boundary)><section marker: 16 bytes>
...
<embedded file N (padded to 8-byte boundary)><section marker: 16 bytes>
<footer magic: "FOOTER\000\000">
<footer (padded to 8-byte boundary)>
<footer length: 8 bytes little-endian signed integer>
<section marker: 16 bytes>
<signature "\213POD\r\n\032\n">

All padding bytes should be zero. They ensure memory mapped files have the alignment that Arrow expects.
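
The layout can be navigated from the end of the file. The sketch below is illustrative only, not a normative implementation; the offsets follow directly from the layout above:

import struct

SIGNATURE = b"\x8bPOD\r\n\x1a\n"  # \213 P O D \r \n \032 \n

def read_footer_bytes(path):
    with open(path, "rb") as f:
        data = f.read()
    # Both signatures must be present and identical.
    assert data[:8] == SIGNATURE and data[-8:] == SIGNATURE, "not a POD5 file"
    # Working backwards: signature (8 bytes), section marker (16 bytes),
    # footer length (8 bytes), then the padded footer itself.
    (footer_length,) = struct.unpack("<q", data[-32:-24])
    return data[-32 - footer_length:-32]  # a FlatBuffer using the Footer schema below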

Signature

The first and last eight bytes of the file are both a fixed set of values:

| Decimal          | 139  | 80   | 79   | 68   | 13   | 10   | 26   | 10   |
| Hexadecimal      | 0x8B | 0x50 | 0x4F | 0x44 | 0x0D | 0x0A | 0x1A | 0x0A |
| ASCII C Notation | \213 | P    | O    | D    | \r   | \n   | \032 | \n   |

The format of the signature is based on the PNG file signature, and inherits several useful features from it for detecting file corruption:

  • The first byte is non-ASCII to reduce the probability it is interpreted as a text file.
  • The first byte has the high bit set to catch file transfers that clear the top bit.
  • The \r\n (CRLF) sequence and the final \n (LF) byte check that nothing has attempted to standardise line endings in the file.
  • The second-last byte (\032) is the CTRL-Z sequence, which stops file display under MS-DOS.

Rationale

A unique, fixed signature for the file type allows quickly identifying that the file is in the expected format, and provides an easy way for tools like the UNIX file command to determine the file type.

Placing it at the end allows quickly checking whether the file is complete.

Section marker

The section marker is a 16-byte UUID, generated randomly for each file. All the section markers in a given file must be identical.

Rationale

This aids in the recovery of partially-written files that are missing a footer: while most of the embedded Arrow IPC files can be scanned easily, it may not be obvious where the footer ends. A given randomly-generated 16-byte value is highly unlikely to occur in actual data, so it can be scanned for to find the end of an embedded file with certainty. The first section marker exists only so that recovery tools know which value to look for.
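
An illustrative sketch (not a normative recovery procedure): the first marker sits immediately after the leading 8-byte signature, and further occurrences can then be searched for to delimit the embedded files:

def section_marker_offsets(data: bytes):
    marker = data[8:24]  # the first marker follows the 8-byte signature
    offsets, pos = [], 8
    while (pos := data.find(marker, pos)) != -1:
        offsets.append(pos)
        pos += len(marker)
    return marker, offsets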

Footer magic

This is the ASCII string "FOOTER" padded to 8 bytes with zeroes. It helps find a partially-written footer when recovering files.

Footer

The footer is an encoded FlatBuffer table, using the schema below.

namespace Minknow.ReadsFormat;

enum ContentType:short {
    // The Reads table (an Arrow table)
    ReadsTable,
    // The Signal table (an Arrow table)
    SignalTable,
    // An index for looking up data in the ReadsTable by read_id
    ReadIdIndex,
    // An index based on other columns and/or tables (it will need to be opened to find out what it indexes)
    OtherIndex,
}

enum Format:short {
    // The Apache Feather V2 format, also known as the Apache Arrow IPC File format.
    FeatherV2,
}

// Describes an embedded file.
table EmbeddedFile {
    // The start of the embedded file
    offset: int64;
    // The length of the embedded file (excluding any padding)
    length: int64;
    // The format of the file
    format: Format;
    // What contents should be expected in the file
    content_type: ContentType;
}

table Footer {
    // Must match the "MINKNOW:file_identifier" custom metadata entry in the schemas of the bundled tables.
    file_identifier: string;
    // A free-form description of the software that wrote the file, intended to help pin down the source of files that violate the specification.
    software: string;
    // The version of this specification that the table schemas are based on (1.0.0).
    pod5_version: string;
    // The Apache Arrow tables stored in the file.
    contents: [ EmbeddedFile ];
}

Rationale

FlatBuffers are used because the Arrow IPC file format already uses them for metadata, and they can be read from a memory mapped file or read buffer without further copying. They are also easily (and compatibly) extensible with more fields.

A footer is used instead of a header so the file can be written incrementally: the first table can be written directly to the file before it is known how long it will be or even how many tables there will be.
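
A sketch of reading the footer, under the assumption that the schema above has been compiled with flatc --python (which would generate a Minknow.ReadsFormat package); footer_bytes is the padded footer buffer located via the footer length described below:

from Minknow.ReadsFormat.Footer import Footer  # generated by flatc (assumed)

def describe_footer(footer_bytes: bytes):
    footer = Footer.GetRootAsFooter(footer_bytes, 0)
    print(footer.FileIdentifier(), footer.Pod5Version())
    for i in range(footer.ContentsLength()):
        embedded = footer.Contents(i)
        # offset/length locate an embedded Arrow IPC file within the .pod5 container
        print(embedded.ContentType(), embedded.Offset(), embedded.Length())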

Footer length

This is a little-endian 8-byte signed integer giving the length of the footer buffer, including padding.

Rationale

This allows readers to find the start of the footer by starting at the end of the file and reading backwards.

Reads Table

| Name | Type | Description |
| --- | --- | --- |
| read_id | minknow.uuid | Globally-unique identifier for the read. Can be converted to a string form (using standard routines in other libraries) which matches how reads are identified elsewhere. |
| signal | list(uint64) | A list of zero-indexed row numbers in the Signal table. This must be all the rows in the Signal table that have a matching read_id, in order. It functions as an index for the Signal table. |
| channel | uint16 | 1-indexed channel. |
| well | uint8 | 1-indexed well (typically 1, 2, 3 or 4). |
| pore_type | dictionary(string) | Name of the pore type present in the well. |
| calibration_offset | float | Calibration offset used to scale raw ADC data into pA readings. |
| calibration_scale | float | Calibration scale factor used to scale raw ADC data into pA readings. |
| read_number | uint32 | The read number on the channel. This is increasing, but not necessarily consecutive. |
| start | uint64 | How many samples were taken on this channel before the read started (since the data acquisition period began). This can be combined with the sample rate to get a time in seconds for the start of the read relative to the start of data acquisition. |
| median_before | float | The level of current in the well before this read (typically the open pore level of the well). If the level is not known (e.g. due to a mux change), this should be nulled out. |
| tracked_scaling_scale | float | Scale for tracked read scaling values (based on previous reads' shift). |
| tracked_scaling_shift | float | Shift for tracked read scaling values (based on previous reads' shift). |
| predicted_scaling_scale | float | Scale for predicted read scaling values (based on this read's raw signal). |
| predicted_scaling_shift | float | Shift for predicted read scaling values (based on this read's raw signal). |
| num_reads_since_mux_change | uint32 | Number of selected reads since the last mux change on this read's channel. |
| time_since_mux_change | float | Time in seconds since the last mux change on this read's channel. |
| num_minknow_events | uint64 | Number of MinKNOW events that the read contains. |
| end_reason | dictionary(string) | The end reason, currently one of: unknown, mux_change, unblock_mux_change, data_service_unblock_mux_change, signal_positive, signal_negative, api_request, device_data_error, analysis_config_change or paused. |
| end_reason_forced | bool | True if this read was ended 'forcibly' (e.g. mux_change, unblock), false if it was a data-driven read break (signal_positive, signal_negative). This allows simple categorisation even in the presence of new reasons that reading code is unaware of. |
| run_info | dictionary(utf8) | The run (acquisition) this read came from. Must match the acquisition_id field of exactly one entry in the Run Info table. |
| num_samples | uint64 | The full length of the signal for this read in samples (equal to the sum of all 'samples' fields of its signal chunks). |
| open_pore_level | float | The open pore level for the read: a value in pA showing the open pore level of the well prior to the read starting. If the information is not available (feature not enabled in MinKNOW, or a sequencing run on an old version), this value will be NaN. |
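
As an illustration of how the calibration fields are used (a sketch assuming the (raw + offset) * scale convention used by ONT tooling), raw ADC samples can be converted to picoamps and a read's start converted to seconds:

import numpy as np

def to_picoamps(raw_signal, calibration_offset, calibration_scale):
    # pA = (ADC value + calibration_offset) * calibration_scale (assumed convention)
    return (np.asarray(raw_signal, dtype=np.float32) + calibration_offset) * calibration_scale

def read_start_seconds(start, sample_rate):
    # 'start' counts samples since acquisition began; sample_rate is in Hz
    return start / sample_rate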

Run Info Table

| Name | Type | Description |
| --- | --- | --- |
| acquisition_id | utf8 | A unique identifier for the run (acquisition). This is the same identifier that MinKNOW uses to identify an acquisition within a protocol. |
| acquisition_start_time | timestamp(milliseconds) | The clock time for sample 0. It can be used together with sample_rate and the start field of the Reads table to calculate a clock time for when a given read was acquired. The timezone should be set; MinKNOW will set this to the local timezone on file creation. When merging files that have different timezones, merging code will have to pick a timezone (possibly defaulting to 'UTC'). |
| adc_max | int16 | The maximum ADC value that might be encountered. This is a hardware constraint. |
| adc_min | int16 | The minimum ADC value that might be encountered. This is a hardware constraint. adc_max - adc_min + 1 is the digitisation. |
| context_tags | map(utf8, utf8) | The context tags for the run, for compatibility with fast5. Readers must not make any assumptions about the contents of this field. |
| experiment_name | utf8 | A user-supplied name for the experiment being run. |
| flow_cell_id | utf8 | Uniquely identifies the flow cell the data was captured on. This is written on the flow cell case. |
| flow_cell_product_code | utf8 | Identifies the type of flow cell the data was captured on. |
| protocol_name | utf8 | The name of the protocol that was run. |
| protocol_run_id | utf8 | A unique identifier for the protocol run that produced this data. |
| protocol_start_time | timestamp(milliseconds) | When the protocol that the acquisition was part of started. The same considerations apply as for acquisition_start_time. |
| sample_id | utf8 | A user-supplied name for the sample being analysed. |
| sample_rate | uint16 | The number of samples acquired each second on each channel. This can be used to convert numbers of samples into time durations. |
| sequencing_kit | utf8 | The type of sequencing kit used to prepare the sample. |
| sequencer_position | utf8 | The sequencer position the data was collected on. For removable positions, like MinION Mk1Bs, this is unique (e.g. 'MN12345'), while for integrated positions it is not (e.g. 'X1' on a GridION). |
| sequencer_position_type | utf8 | The type of sequencing hardware the data was collected on. For example: 'MinION Mk1B', 'GridION' or 'PromethION'. |
| software | utf8 | A description of the software that acquired the data. For example: 'MinKNOW 21.05.12 (Bream 5.1.6, Configurations 16.2.1, Core 5.1.9, Guppy 4.2.3)'. |
| system_name | utf8 | The name of the system the data was collected on. This might be a sequencer serial (e.g. 'GXB1234') or a host name (e.g. 'Lab PC'). |
| system_type | utf8 | The type of system the data was collected on. For example, 'GridION Mk1' or 'PromethION P48'. If the system is not a Nanopore sequencer with built-in compute, this will be a description of the operating system (e.g. 'Ubuntu 20.04'). |
| tracking_id | map(utf8, utf8) | The tracking id for the run, for compatibility with fast5. Readers must not make any assumptions about the contents of this field. |
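
Purely as an illustration of how these fields combine, acquisition_start_time, sample_rate and a read's start field give the wall-clock time at which the read began:

from datetime import datetime, timedelta, timezone

def read_start_clock_time(acquisition_start_time, start, sample_rate):
    # acquisition_start_time corresponds to sample 0; 'start' is a sample count
    return acquisition_start_time + timedelta(seconds=start / sample_rate)

# Hypothetical values for illustration.
t0 = datetime(2022, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(read_start_clock_time(t0, start=4_000_000, sample_rate=4000))  # 2022-01-01 12:16:40+00:00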

Signal Table

| Name | Type | Description |
| --- | --- | --- |
| read_id | minknow.uuid | Globally-unique identifier for the read the data came from. This aids recovery and consistency checking. |
| signal | large_list(int16) or minknow.vbz | The actual signal. The encoding of the data must be the same for all reads in the file, and is determined by the choice of logical type. large_list(int16) is the uncompressed storage option. Readers that do not recognise the logical type of this column will be unable to decode the signal data. |
| samples | uint32 | The number of samples stored in this row. This allows compressed chunks to be skipped over easily, and is also necessary for decoding StreamVByte-encoded data. |
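
In practice most consumers will not decode these tables by hand. As a sketch (assuming the pod5 Python package, which implements this specification), read metadata and decoded signal can be retrieved like so; the path is a placeholder:

import pod5

with pod5.Reader("example.pod5") as reader:
    for read in reader.reads():
        # read.signal holds the decoded int16 samples; read.read_id is the UUID
        print(read.read_id, len(read.signal))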