Specification
The Oxford Nanopore Technologies output specifications have more details on supported file formats created by ONT devices and software.
Overview
The file format is, at its core, a collection of Apache Arrow tables, stored in the Apache Feather 2
(also known as Apache Arrow IPC File) format, and bundled into a container format. The container file
has the extension .pod5.
Table Schemas
A POD5 file is a custom wrapper format around Arrow that contains several Arrow tables.
All the tables should have the following custom_metadata fields set on them:
| Name | Example Value | Notes |
|---|---|---|
| MINKNOW:pod5_version | 1.0.0 | The version of this specification that the schema was based on. |
| MINKNOW:software | MinKNOW Core 5.2.3 | A free-form description of the software that wrote the file, intended to help pin down the source of files that violate the specification. |
| MINKNOW:file_identifier | cbf91180-0684-4a39-bf56-41eaf437de9e | Must be identical across all tables. Allows checking that the files correspond to each other. |
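As a rough illustration of the rules in the table above, the following sketch checks a set of already-extracted metadata dictionaries. The function name `check_table_metadata` and the idea of passing plain dictionaries (rather than Arrow schema objects) are assumptions made for this example, not part of the specification.

```python
# Hypothetical helper: validates custom_metadata that has already been
# read from each table's schema into a plain dict of str -> str.
REQUIRED_KEYS = (
    "MINKNOW:pod5_version",
    "MINKNOW:software",
    "MINKNOW:file_identifier",
)


def check_table_metadata(tables):
    """Return a list of problems found in {table_name: metadata_dict}."""
    problems = []
    for name, metadata in tables.items():
        for key in REQUIRED_KEYS:
            if key not in metadata:
                problems.append(f"{name}: missing {key}")
    # The file identifier must be identical across all tables.
    identifiers = {m.get("MINKNOW:file_identifier") for m in tables.values()}
    identifiers.discard(None)
    if len(identifiers) > 1:
        problems.append("file_identifier differs between tables")
    return problems
```

An empty list means the metadata passed these two checks; it says nothing about the rest of the schema.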
Extension Types
Several fields in the table schemas use custom Arrow extension types.
minknow.uuid
The schemas make extensive use of UUIDs to identify reads. These are stored using an extension type with the following properties:
Name: "minknow.uuid"
Physical storage: FixedBinary(16)
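A minimal sketch of converting between the 16-byte physical storage and the usual textual UUID form, using Python's standard `uuid` module. It assumes the 16 bytes are in the big-endian order that `uuid.UUID(bytes=...)` expects; the specification text here does not state the byte order, so treat that as an assumption. The function names are illustrative only.

```python
import uuid


def uuid_from_storage(raw: bytes) -> str:
    """Convert a 16-byte FixedBinary(16) value to canonical string form."""
    if len(raw) != 16:
        raise ValueError("minknow.uuid storage must be exactly 16 bytes")
    # Assumption: bytes are stored in RFC 4122 (big-endian) order.
    return str(uuid.UUID(bytes=raw))


def uuid_to_storage(text: str) -> bytes:
    """Convert a textual UUID to its 16-byte storage form."""
    return uuid.UUID(text).bytes
```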
minknow.vbz
Storage for VBZ-encoded data:
Name: "minknow.vbz"
Physical storage: LargeBinary
Tables
The Reads, Signal and Run Info tables must all be present in a POD5 file. Note that some very early POD5 files produced by pre-0.1 versions of the pod5 library did not include a Run Info table, instead including that information in the Reads table.
Reads Table
The Reads table contains a single row per read, and describes the metadata for each read. The
signal column of the read links to the Signal table, and allows a read's signal to be retrieved.
The run_info column links to the Run Info table, providing more context for the read and
avoiding duplicating data that is common to many or all reads in the file.
Some fields of the Reads table are dictionaries: the contents of the table are stored in a lookup written prior to each batch of read rows and the read row itself then contains an integer index. This allows space savings on fields that would otherwise be repeated. Only simple types are stored in dictionaries as third party tools have limited support for dictionaries of structs.
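The dictionary idea described above can be sketched in plain Python: repeated values are replaced by integer indices into a lookup of distinct values. This only illustrates the concept; Arrow's own dictionary encoding handles this internally, and the function name is hypothetical.

```python
def dictionary_encode(values):
    """Split a list of values into (lookup, indices).

    `lookup` holds each distinct value once, in first-seen order, and
    `indices` holds one integer per input value pointing into `lookup`.
    """
    lookup = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(lookup)
            lookup.append(value)
        indices.append(positions[value])
    return lookup, indices


lookup, indices = dictionary_encode(
    ["signal_positive", "mux_change", "signal_positive"]
)
# The original column can be rebuilt from the two parts.
decoded = [lookup[i] for i in indices]
```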
[tables/reads.toml] contains specific information about fields in the reads table.
Signal Table
The signal table contains the (optionally compressed) signal data. Each row contains a sequence of samples, along with some information about where those samples came from.
[tables/signal.toml] contains specific information about fields in the signal table.
Run Info Table
The run info table contains a single row per MinKNOW run that any read in the file came from.
Several fields of the Reads table are dictionaries: the contents of the field are stored in a lookup written prior to each batch of read rows, and the read row itself then contains an integer index. This allows space savings on fields that would otherwise be repeated.
[tables/run_info.toml] contains specific information about fields in the run info table.
Combined file Layout
Layout
<signature "\213POD\r\n\032\n">
<section marker: 16 bytes>
<embedded file 1 (padded to 8-byte boundary)><section marker: 16 bytes>
...
<embedded file N (padded to 8-byte boundary)><section marker: 16 bytes>
<footer magic: "FOOTER\000\000">
<footer (padded to 8-byte boundary)>
<footer length: 8 bytes little-endian signed integer>
<section marker: 16 bytes>
<signature "\213POD\r\n\032\n">
All padding bytes should be zero. They ensure memory mapped files have the alignment that Arrow expects.
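To make the layout concrete, here is a small sketch that assembles a container from opaque byte strings. The embedded files and the footer are treated as already-encoded blobs (a real footer is a FlatBuffer, which this example does not produce), and `build_container` and `pad8` are hypothetical names.

```python
import struct
import uuid

SIGNATURE = b"\x8bPOD\r\n\x1a\n"
FOOTER_MAGIC = b"FOOTER\x00\x00"


def pad8(data: bytes) -> bytes:
    """Append zero bytes so the length is a multiple of 8."""
    return data + b"\x00" * (-len(data) % 8)


def build_container(embedded_files, footer: bytes, marker=None) -> bytes:
    """Lay out pre-encoded blobs in the order described above."""
    if marker is None:
        marker = uuid.uuid4().bytes  # random 16-byte section marker
    out = bytearray(SIGNATURE)
    out += marker
    for blob in embedded_files:
        out += pad8(blob)
        out += marker
    out += FOOTER_MAGIC
    padded_footer = pad8(footer)
    out += padded_footer
    # Footer length: 8-byte little-endian signed integer, padding included.
    out += struct.pack("<q", len(padded_footer))
    out += marker
    out += SIGNATURE
    return bytes(out)
```

Because the signature (8 bytes) and each section marker (16 bytes) are themselves multiples of 8 bytes long, padding each embedded file keeps every subsequent embedded file 8-byte aligned.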
Signature
The first and last eight bytes of the file are both a fixed set of values:
| Decimal | 139 | 80 | 79 | 68 | 13 | 10 | 26 | 10 |
| Hexadecimal | 0x8B | 0x50 | 0x4F | 0x44 | 0x0D | 0x0A | 0x1A | 0x0A |
| ASCII C Notation | \213 | P | O | D | \r | \n | \032 | \n |
The format of the signature is based on the PNG file signature, and inherits several useful features from it for detecting file corruption:
- The first byte is non-ASCII to reduce the probability it is interpreted as a text file.
- The first byte has the high bit set to catch file transfers that clear the top bit.
- The \r\n (CRLF) sequence and the final \n (LF) byte check that nothing has attempted to standardise line endings in the file.
- The second-last byte (\032) is the CTRL-Z sequence, which stops file display under MS-DOS.
Rationale
A unique, fixed signature for the file type allows quickly identifying that the file is in the
expected format, and provides an easy way for tools like the UNIX file command to determine the
file type.
Placing it at the end allows quickly checking whether the file is complete.
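A quick check along these lines might look like the sketch below, which tests the first and last eight bytes of an in-memory buffer. The function name is made up for this example; a real tool would typically read only the two ends of the file rather than the whole thing.

```python
# The eight signature bytes, taken from the decimal row of the table above.
SIGNATURE = bytes([139, 80, 79, 68, 13, 10, 26, 10])


def check_signatures(data: bytes):
    """Return (starts_with_signature, ends_with_signature)."""
    starts = data[:8] == SIGNATURE
    # Require at least 16 bytes so one signature cannot count as both.
    ends = len(data) >= 16 and data[-8:] == SIGNATURE
    return starts, ends
```

A result of `(True, False)` suggests a POD5 file that was not completely written, while `(False, ...)` suggests either a different file type or corruption such as line-ending conversion.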
Section marker
The section marker is a 16-byte UUID, generated randomly for each file. All the section markers in a given file must be identical.
Rationale
This aids in recovery of partially-written files (that are missing a footer) - while most of the embedded Arrow IPC files can be scanned easily, it may not be obvious where the footer ends. A given randomly-generated 16-byte value is highly unlikely to occur in actual data, and can be scanned for to find the end of the embedded file for certain. The first section marker is just so that recovery tools know what to look for.
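As a sketch of how a recovery tool could use this, the code below reads the first section marker (the 16 bytes directly after the signature) and scans for later occurrences. It works on an in-memory buffer for brevity, and `find_section_markers` is a hypothetical name.

```python
def find_section_markers(data: bytes):
    """Return (marker, offsets) for every occurrence of the section marker."""
    if len(data) < 24:
        raise ValueError("too short to contain a signature and section marker")
    marker = data[8:24]  # first marker follows the 8-byte signature
    offsets = []
    position = data.find(marker, 8)
    while position != -1:
        offsets.append(position)
        position = data.find(marker, position + 16)
    return marker, offsets
```

Each gap between consecutive offsets (after skipping the 16 marker bytes) is a candidate embedded file, possibly followed by padding, or the footer region.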
Footer magic
This is the ASCII string "FOOTER" padded to 8 bytes with zeroes. It helps find a partially-written footer when recovering files.
Footer
The footer is an encoded FlatBuffer table, using the schema below.
namespace Minknow.ReadsFormat;
enum ContentType:short {
// The Reads table (an Arrow table)
ReadsTable,
// The Signal table (an Arrow table)
SignalTable,
// An index for looking up data in the ReadsTable by read_id
ReadIdIndex,
// An index based on other columns and/or tables (it will need to be opened to find out what it indexes)
OtherIndex,
}
enum Format:short {
// The Apache Feather V2 format, also known as the Apache Arrow IPC File format.
FeatherV2,
}
// Describes an embedded file.
table EmbeddedFile {
// The start of the embedded file
offset: int64;
// The length of the embedded file (excluding any padding)
length: int64;
// The format of the file
format: Format;
// What contents should be expected in the file
content_type: ContentType;
}
table Footer {
// Must match the "MINKNOW:file_identifier" custom metadata entry in the schemas of the bundled tables.
file_identifier: string;
// A free-form description of the software that wrote the file, intended to help pin down the source of files that violate the specification.
software: string;
// The version of this specification that the table schemas are based on (1.0.0).
pod5_version: string;
// The Apache Arrow tables stored in the file.
contents: [ EmbeddedFile ];
}
Rationale
FlatBuffers are used because the Arrow IPC file format already uses them for metadata, and they can be read from a memory mapped file or read buffer without further copying. They are also easily (and compatibly) extensible with more fields.
A footer is used instead of a header so the file can be written incrementally: the first table can be written directly to the file before it is known how long it will be or even how many tables there will be.
Footer length
This is a little-endian 8-byte signed integer giving the length of the footer buffer, including padding.
Rationale
This allows readers to find the start of the footer by starting at the end of the file and reading backwards.
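Reading backwards from the end of the file can be sketched as follows. The offsets come directly from the layout described earlier (signature, section marker, then footer length at the tail); the function name and the choice of exceptions are assumptions for this example, and decoding the FlatBuffer itself is out of scope.

```python
import struct

FOOTER_MAGIC = b"FOOTER\x00\x00"


def locate_footer(data: bytes):
    """Return (footer_start, footer_length) for a complete container."""
    tail = 8 + 16  # trailing signature plus trailing section marker
    if len(data) < tail + 8:
        raise ValueError("file too short to contain a footer length")
    length_end = len(data) - tail
    (footer_length,) = struct.unpack("<q", data[length_end - 8:length_end])
    footer_start = length_end - 8 - footer_length
    if footer_length < 0 or footer_start < 8:
        raise ValueError("implausible footer length")
    # The footer magic should sit immediately before the footer.
    if data[footer_start - 8:footer_start] != FOOTER_MAGIC:
        raise ValueError("footer magic not found where expected")
    return footer_start, footer_length
```

The returned length includes any padding, so the FlatBuffer decoder should be handed `data[footer_start:footer_start + footer_length]`.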
Reads Table
| Name | Type | Description |
|---|---|---|
| read_id | minknow.uuid | Globally-unique identifier for the read. It can be converted to a string form (using standard routines in other libraries) that matches how reads are identified elsewhere. |
| signal | list(uint64) | A list of zero-indexed row numbers in the Signal table. This must be all the rows in the Signal table that have a matching read_id, in order. It functions as an index for the Signal table. |
| channel | uint16 | 1-indexed channel |
| well | uint8 | 1-indexed well (typically 1, 2, 3 or 4) |
| pore_type | dictionary(string) | Name of the pore type present in the well |
| calibration_offset | float | Calibration offset used to scale raw ADC data into pA readings. |
| calibration_scale | float | Calibration scale factor used to scale raw ADC data into pA readings. |
| read_number | uint32 | The read number on the channel. This is increasing, but not necessarily consecutive. |
| start | uint64 | How many samples were taken on this channel before the read started (since the data acquisition period began). This can be combined with the sample rate to get a time in seconds for the start of the read relative to the start of data acquisition. |
| median_before | float | The level of current in the well before this read (typically the open pore level of the well). If the level is not known (e.g. due to a mux change), this should be nulled out. |
| tracked_scaling_scale | float | Scale for tracked read scaling values (based on previous reads shift) |
| tracked_scaling_shift | float | Shift for tracked read scaling values (based on previous reads shift) |
| predicted_scaling_scale | float | Scale for predicted read scaling values (based on this read's raw signal) |
| predicted_scaling_shift | float | Shift for predicted read scaling values (based on this read's raw signal) |
| num_reads_since_mux_change | uint32 | Number of selected reads since the last mux change on this read's channel |
| time_since_mux_change | float | Time in seconds since the last mux change on this read's channel |
| num_minknow_events | uint64 | Number of MinKNOW events that the read contains |
| end_reason | dictionary(string) | The end reason, currently one of: unknown, mux_change, unblock_mux_change, data_service_unblock_mux_change, signal_positive, signal_negative, api_request, device_data_error, analysis_config_change or paused. |
| end_reason_forced | bool | True if this read was ended 'forcibly' (e.g. mux_change, unblock), false if it was a data-driven read break (signal_positive, signal_negative). This allows simple categorisation even in the presence of new reasons that reading code is unaware of. |
| run_info | dictionary(utf8) | The run (acquisition) this read came from. Must match the acquisition_id field of exactly one entry in the run_info table. |
| num_samples | uint64 | The full length of the signal for this read in samples (equal to the sum of all 'samples' fields of signal chunks) |
| open_pore_level | float | The open pore level for the read: a value in pA showing the open pore level of the well prior to the read starting. If the information is not available (feature not enabled in MinKNOW, or sequencing run on an old version) this value will be NaN. |
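The table says calibration_offset and calibration_scale convert raw ADC data to pA but does not spell out the formula. The sketch below assumes the common convention `pA = (adc + offset) * scale`; that formula is an assumption of this example and should be checked against the reference implementation before relying on it. The function name is also hypothetical.

```python
def adc_to_picoamps(adc_values, calibration_offset, calibration_scale):
    """Convert raw ADC samples to pA.

    Assumption: pA = (adc + calibration_offset) * calibration_scale.
    """
    return [
        (value + calibration_offset) * calibration_scale
        for value in adc_values
    ]


picoamps = adc_to_picoamps([0, 10], calibration_offset=5.0, calibration_scale=0.5)
```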
Run Info Table
| Name | Type | Description |
|---|---|---|
| acquisition_id | utf8 | A unique identifier for the run (acquisition). This is the same identifier that MinKNOW uses to identify an acquisition within a protocol. |
| acquisition_start_time | timestamp(milliseconds) | This is the clock time for sample 0, and can be used together with sample_rate and the start read field to calculate a clock time for when a given read was acquired. The timezone should be set. MinKNOW will set this to the local timezone on file creation. When merging files that have different timezones, merging code will have to pick a timezone (possibly defaulting to 'UTC'). |
| adc_max | int16 | The maximum ADC value that might be encountered. This is a hardware constraint. |
| adc_min | int16 | The minimum ADC value that might be encountered. This is a hardware constraint. adc_max - adc_min + 1 is the digitisation. |
| context_tags | map(utf8, utf8) | The context tags for the run. For compatibility with fast5. Readers must not make any assumptions about the contents of this field. |
| experiment_name | utf8 | A user-supplied name for the experiment being run. |
| flow_cell_id | utf8 | Uniquely identifies the flow cell the data was captured on. This is written on the flow cell case. |
| flow_cell_product_code | utf8 | Identifies the type of flow cell the data was captured on. |
| protocol_name | utf8 | The name of the protocol that was run. |
| protocol_run_id | utf8 | A unique identifier for the protocol run that produced this data. |
| protocol_start_time | timestamp(milliseconds) | When the protocol that the acquisition was part of started. The same considerations apply as for acquisition_start_time. |
| sample_id | utf8 | A user-supplied name for the sample being analysed. |
| sample_rate | uint16 | The number of samples acquired each second on each channel. This can be used to convert numbers of samples into time durations. |
| sequencing_kit | utf8 | The type of sequencing kit used to prepare the sample. |
| sequencer_position | utf8 | The sequencer position the data was collected on. For removable positions, like MinION Mk1Bs, this is unique (e.g. 'MN12345'), while for integrated positions it is not (e.g. 'X1' on a GridION). |
| sequencer_position_type | utf8 | The type of sequencing hardware the data was collected on. For example: 'MinION Mk1B' or 'GridION' or 'PromethION'. |
| software | utf8 | A description of the software that acquired the data. For example: 'MinKNOW 21.05.12 (Bream 5.1.6, Configurations 16.2.1, Core 5.1.9, Guppy 4.2.3)'. |
| system_name | utf8 | The name of the system the data was collected on. This might be a sequencer serial (e.g. 'GXB1234') or a host name (e.g. 'Lab PC'). |
| system_type | utf8 | The type of system the data was collected on. For example, 'GridION Mk1' or 'PromethION P48'. If the system is not a Nanopore sequencer with built-in compute, this will be a description of the operating system (e.g. 'Ubuntu 20.04'). |
| tracking_id | map(utf8, utf8) | The tracking id for the run. For compatibility with fast5. Readers must not make any assumptions about the contents of this field. |
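Two of the derived quantities mentioned in the table (the clock time at which a read started, and the digitisation) can be sketched with the standard library as below. The function names are illustrative, and the example values are made up rather than taken from a real run.

```python
from datetime import datetime, timedelta, timezone


def read_start_time(acquisition_start_time, start, sample_rate):
    """Clock time of a read's first sample.

    `start` is the read's sample offset from the Reads table and
    `sample_rate` is in samples per second.
    """
    return acquisition_start_time + timedelta(seconds=start / sample_rate)


def digitisation(adc_min, adc_max):
    """Number of distinct ADC levels, as defined for adc_min above."""
    return adc_max - adc_min + 1


# Made-up example values.
t0 = datetime(2023, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
started = read_start_time(t0, start=8000, sample_rate=4000)
levels = digitisation(adc_min=-4096, adc_max=4095)
```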
Signal Table
| Name | Type | Description |
|---|---|---|
| read_id | minknow.uuid | Globally-unique identifier for the read the data came from. This aids recovery and consistency checking. |
| signal | large_list(int16) or minknow.vbz | The actual signal. The encoding of the data must be the same for all reads in the file, and is determined by the choice of logical type. LargeList(Int16) is the uncompressed storage option. Readers that do not recognise the logical type of this column will be unable to decode the signal data. |
| samples | uint32 | The number of samples stored in this row. Allows skipping over compressed chunks easily, and is also necessary for decoding StreamVByte-encoded data. |
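The relationship between the Reads table's signal index and the Signal table rows can be sketched as follows. Rows are modelled as plain dictionaries and only the uncompressed (large_list(int16)) case is handled; the data model and the name `gather_signal` are assumptions made for this illustration.

```python
def gather_signal(read_row, signal_rows):
    """Concatenate a read's signal chunks in the order the read lists them.

    `read_row["signal"]` holds zero-indexed row numbers into `signal_rows`.
    """
    chunks = [signal_rows[index] for index in read_row["signal"]]
    for chunk in chunks:
        # Consistency check enabled by the read_id column of the Signal table.
        if chunk["read_id"] != read_row["read_id"]:
            raise ValueError("signal row belongs to a different read")
    total = sum(chunk["samples"] for chunk in chunks)
    if total != read_row["num_samples"]:
        raise ValueError("num_samples does not match the sum of chunk samples")
    return [sample for chunk in chunks for sample in chunk["signal"]]
```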