Reading

Open a Dataset

To open a collection of POD5 files as a dataset, use the DatasetReader class. The DatasetReader takes one or more paths to files and/or directories. Set the recursive parameter to search directories for POD5 files recursively.

It is strongly recommended to use Python's with statement so that any opened resources (e.g. file handles) are safely closed when they are no longer needed.

import pod5

paths = ["/foo/", "./bar/", "/baz/file.pod5"]
with pod5.DatasetReader(paths, recursive=True) as dataset:
    # Use DatasetReader within this context manager to free resources when done
    ...
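
To preview which files a recursive search is likely to pick up before opening the dataset, a minimal standard-library sketch follows. It assumes files are matched on the .pod5 suffix. find_pod5_files is a hypothetical helper, not part of the pod5 API, and the DatasetReader's own search may differ in detail.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def find_pod5_files(paths: Iterable[str], recursive: bool = False) -> List[Path]:
    """Hypothetical helper: list *.pod5 files under the given files and directories."""
    found = set()
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # rglob descends into sub-directories, glob stays at the top level
            matches = path.rglob("*.pod5") if recursive else path.glob("*.pod5")
            found.update(p for p in matches if p.is_file())
        elif path.is_file() and path.suffix == ".pod5":
            found.add(path)
    return sorted(found)


# Demonstration on a throwaway directory tree (empty stand-in files only)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.pod5").touch()
    (root / "nested" / "b.pod5").touch()
    (root / "notes.txt").touch()

    top_level = [p.name for p in find_pod5_files([tmp])]
    everything = [p.name for p in find_pod5_files([tmp], recursive=True)]
```

Without recursive=True only the top-level file is listed; with it, the nested file is included as well.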

Open a Single POD5

A single POD5 file can be opened with the Reader class.

The Reader is a lower-level interface and has many more functions and properties to access the inner workings of POD5 files, such as the underlying Arrow tables.

import pod5

with pod5.Reader("example.pod5") as reader:
    # Use Reader within this context manager to free resources when done
    ...

Inspecting Reads

With an open Reader or DatasetReader, call the reads method (or iterate over the DatasetReader directly) to generate a ReadRecord instance for each read in the file or dataset:

# Iterate over every record in the file using reads()
with pod5.Reader("example.pod5") as reader:
    for read_record in reader.reads():
        print(read_record.read_id)

# Iterate over every record in the dataset using __iter__
with pod5.DatasetReader("./dataset/", recursive=True) as dataset:
    for read_record in dataset:
        print(read_record.read_id, read_record.path)
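
Because each record from a DatasetReader carries its source path, the iteration above can be turned into a quick per-file summary. The following is a sketch only: count_reads_per_file is a hypothetical helper, and the demonstration uses stand-in objects so that it runs without any POD5 files.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Dict, Iterable


def count_reads_per_file(records: Iterable) -> Dict[str, int]:
    """Count records per source file using each record's path attribute."""
    return dict(Counter(str(record.path) for record in records))


# Stand-in records for demonstration; with pod5 installed, an open
# DatasetReader could be passed instead: count_reads_per_file(dataset)
demo_records = [
    SimpleNamespace(read_id="r1", path="a.pod5"),
    SimpleNamespace(read_id="r2", path="b.pod5"),
    SimpleNamespace(read_id="r3", path="a.pod5"),
]
counts = count_reads_per_file(demo_records)
```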

Selecting Reads

To iterate over a selection of read_ids, pass a collection of read_ids to the DatasetReader.reads or Reader.reads methods.

Note

The order of records returned by Reader iterators is always the order on-disk even when specifying a selection of read_ids.

# Create a collection of read_id UUIDs as strings
read_ids = {
    "00445e58-3c58-4050-bacf-3411bb716cc3",
    "00520473-4d3d-486b-86b5-f031c59f6591",
}

# Example using single file Reader
with pod5.Reader("example.pod5") as reader:
    for read_record in reader.reads(selection=read_ids):
        assert str(read_record.read_id) in read_ids

# Example using DatasetReader
with pod5.DatasetReader("/path/to/dataset/") as dataset:
    for read_record in dataset.reads(selection=read_ids):
        assert str(read_record.read_id) in read_ids
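
Since records always come back in on-disk order, code that needs them in the order the read_ids were requested has to reorder them itself. A minimal sketch, assuming the selection is small enough to hold in memory; in_requested_order is a hypothetical helper, and the demonstration uses stand-in records rather than a real file.

```python
from types import SimpleNamespace
from typing import Iterable, List, Sequence


def in_requested_order(records: Iterable, wanted_ids: Sequence[str]) -> List:
    """Return records ordered as in wanted_ids, skipping ids that were not found."""
    by_id = {str(record.read_id): record for record in records}
    return [by_id[read_id] for read_id in wanted_ids if read_id in by_id]


wanted = [
    "00520473-4d3d-486b-86b5-f031c59f6591",
    "00445e58-3c58-4050-bacf-3411bb716cc3",
]

# Stand-in records arriving in a different (on-disk) order
on_disk = [
    SimpleNamespace(read_id="00445e58-3c58-4050-bacf-3411bb716cc3"),
    SimpleNamespace(read_id="00520473-4d3d-486b-86b5-f031c59f6591"),
]
ordered_ids = [str(r.read_id) for r in in_requested_order(on_disk, wanted)]
```

With a real file the first argument would be reader.reads(selection=wanted). If the same read_id occurs more than once, this sketch keeps only the last copy seen.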

Random Access to Records

The get_read method allows users to select any record in the dataset by read_id.

To efficiently access records from a POD5 dataset, DatasetReader will index the read_ids of all records and cache POD5 Reader instances.

import pod5

with pod5.DatasetReader("./dataset/") as dataset:
    read_id = "00445e58-3c58-4050-bacf-3411bb716cc3"

    # Dataset indexing will take place here
    read_record = dataset.get_read(read_id)

    # The returned object is None if read_id is not found
    if read_record is None:
        print(f"dataset does not contain read_id: {read_id}")
    else:
        assert str(read_record.read_id) == read_id
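
Where a missing read should be treated as an error rather than checked for at every call site, the None result can be wrapped once. This is a sketch under that assumption; require_read is a hypothetical helper, and the stand-in dataset below only mimics get_read.

```python
from types import SimpleNamespace

READ_ID = "00445e58-3c58-4050-bacf-3411bb716cc3"


def require_read(dataset, read_id: str):
    """Hypothetical helper: like get_read, but raise KeyError instead of returning None."""
    read_record = dataset.get_read(read_id)
    if read_record is None:
        raise KeyError(f"dataset does not contain read_id: {read_id}")
    return read_record


# Stand-in for an open DatasetReader, backed by a plain dict
known = {READ_ID: SimpleNamespace(read_id=READ_ID)}
fake_dataset = SimpleNamespace(get_read=known.get)

found = require_read(fake_dataset, READ_ID)
try:
    require_read(fake_dataset, "not-a-real-read-id")
    missing_raised = False
except KeyError:
    missing_raised = True
```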

Indexing Records

If necessary, the DatasetReader will index every record in the dataset. This may consume a significant amount of memory depending on the size of the dataset. Indexing is lazy: it is only done on a call to a function which requires the index.

The functions which require the index are get_read, get_path, and has_duplicate.

To clear the index, freeing the memory, call clear_index.
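
One way to keep the index's memory use bounded is to do a batch of lookups and then clear the index straight away. The sketch below assumes only the get_read and clear_index methods named above; lookup_and_release and FakeDataset are hypothetical names used for illustration.

```python
from typing import Dict, Iterable, List


def lookup_and_release(dataset, read_ids: Iterable[str]) -> List[str]:
    """Return the read_ids that were found, then drop the index to free memory."""
    try:
        # The first get_read call is where indexing would take place
        records = [dataset.get_read(read_id) for read_id in read_ids]
        return [str(r.read_id) for r in records if r is not None]
    finally:
        # Runs even if a lookup raises, so the index is never left behind
        dataset.clear_index()


class FakeRecord:
    def __init__(self, read_id: str) -> None:
        self.read_id = read_id


class FakeDataset:
    """Stand-in for an open DatasetReader (demonstration only)."""

    def __init__(self, records: Dict[str, FakeRecord]) -> None:
        self._records = records
        self.index_cleared = False

    def get_read(self, read_id: str):
        return self._records.get(read_id)

    def clear_index(self) -> None:
        self.index_cleared = True


fake = FakeDataset({"a": FakeRecord("a")})
found_ids = lookup_and_release(fake, ["a", "b"])
```

Note that the next call needing the index would rebuild it, so clearing pays off only when lookups come in distinct batches.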

Cached Readers

The DatasetReader class uses an LRU cache of Reader instances to reduce the overhead of repeatedly re-opening POD5 files. The size of this cache can be controlled by setting max_cached_readers during initialisation.
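
To see why the access pattern matters for an LRU cache, a toy model can count how many times files would need to be opened. This is not the pod5 implementation, only an illustration of least-recently-used eviction; count_file_opens and the file names are made up.

```python
from collections import OrderedDict
from typing import Iterable


def count_file_opens(accessed_paths: Iterable[str], max_cached: int) -> int:
    """Toy LRU model: count cache misses, i.e. how often a file must be opened."""
    cache: "OrderedDict[str, bool]" = OrderedDict()
    opens = 0
    for path in accessed_paths:
        if path in cache:
            # Cache hit: mark this file as most recently used
            cache.move_to_end(path)
        else:
            # Cache miss: open the file and evict the least recently used one
            opens += 1
            cache[path] = True
            if len(cache) > max_cached:
                cache.popitem(last=False)
    return opens


# Six lookups over three files with room for two cached readers
interleaved_opens = count_file_opens(["a", "b", "c", "a", "b", "c"], max_cached=2)
grouped_opens = count_file_opens(["a", "a", "b", "b", "c", "c"], max_cached=2)
```

In this model the interleaved sequence misses on every lookup, while the grouped sequence opens each file only once.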

Where possible, users should maximise the likelihood of a Reader cache hit by ensuring that successive calls to get_read access records in no more POD5s than the size of the LRU cache. Randomly indexing read_ids into many files will result in repeatedly opening the underlying files which will severely affect performance.
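
One way to follow this advice is to group the wanted read_ids by file before fetching them, so that successive get_read calls stay within one file at a time. The sketch assumes get_path returns None for an unknown read_id; group_read_ids_by_path and get_reads_grouped are hypothetical helpers, shown here against a stand-in dataset.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Dict, Iterable, Iterator, List


def group_read_ids_by_path(dataset, read_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Group read_ids by the file that holds them, using dataset.get_path."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for read_id in read_ids:
        path = dataset.get_path(read_id)
        if path is not None:  # assumed: unknown read_ids yield None
            groups[str(path)].append(read_id)
    return dict(groups)


def get_reads_grouped(dataset, read_ids: Iterable[str]) -> Iterator:
    """Yield records file by file so successive lookups reuse the same Reader."""
    for ids_in_file in group_read_ids_by_path(dataset, read_ids).values():
        for read_id in ids_in_file:
            record = dataset.get_read(read_id)
            if record is not None:
                yield record


# Stand-in for an open DatasetReader (demonstration only)
locations = {"r1": "a.pod5", "r2": "b.pod5", "r3": "a.pod5"}
fake_dataset = SimpleNamespace(
    get_path=locations.get,
    get_read=lambda rid: SimpleNamespace(read_id=rid) if rid in locations else None,
)
visit_order = [r.read_id for r in get_reads_grouped(fake_dataset, ["r1", "r2", "r3", "r4"])]
```

The records come back grouped by file rather than in the requested order, which is the trade-off for fewer file opens.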

Reads and ReadRecords

Nanopore sequencing data comprises Reads which are formed from signal data and other metadata about how and when the sample was sequenced. This data is accessible via the Read or ReadRecord classes.

Although these two classes have very similar interfaces, the ReadRecord is an immutable Read formed from a POD5 file record, and it uses caching to improve read performance.

Duplicate Records

For a typical dataset sourced from a sequencer, it is vanishingly unlikely that there will be even a single duplicate UUID read_id. It is much more likely that some POD5 files in a dataset have been copied, merged, or subset, and that loading all files in a dataset, especially in recursive mode, will find duplicate read_ids.

The DatasetReader handles duplicate records in iterators by returning all copies as they appear on disk. Similarly, having duplicate read_ids in the selection will repeatedly return duplicates.

The DatasetReader handles duplicate records when indexing by returning a random valid ReadRecord if it exists or None. The ReadRecord instance returned is random because the indexing process is multi-threaded.

If a duplicate read_id is detected, the API will issue a warning. This can be disabled with warn_duplicate_indexing=False in the initialiser.
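
Because iterators return every copy of a duplicated record, duplicates can also be found by counting read_ids during a single pass, which reports which read_ids are affected. A minimal sketch follows; find_duplicate_read_ids is a hypothetical helper, and the demonstration uses stand-in records.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Dict, Iterable


def find_duplicate_read_ids(records: Iterable) -> Dict[str, int]:
    """Map each read_id seen more than once to the number of copies found."""
    counts = Counter(str(record.read_id) for record in records)
    return {read_id: n for read_id, n in counts.items() if n > 1}


# Stand-in records; with pod5 installed, an open DatasetReader could be
# passed instead: find_duplicate_read_ids(dataset)
demo_records = [
    SimpleNamespace(read_id="a", path="run/file_1.pod5"),
    SimpleNamespace(read_id="b", path="run/file_1.pod5"),
    SimpleNamespace(read_id="a", path="run/copy_of_file_1.pod5"),
]
duplicates = find_duplicate_read_ids(demo_records)
```

This holds one counter entry per read_id in memory, so it has a cost comparable to indexing for very large datasets.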

Plotting Example

Here is an example of how a user may plot a read's signal data against time.

"""
Example use of pod5 to plot the signal data from a selected read.
"""

import matplotlib.pyplot as plt
import numpy as np

import pod5

# Using the example pod5 file provided
example_pod5 = "test_data/multi_fast5_zip.pod5"
selected_read_id = '0000173c-bf67-44e7-9a9c-1ad0bc728e74'

with pod5.Reader(example_pod5) as reader:

    # Read the selected read from the pod5 file
    # next() is required here as Reader.reads() returns a Generator
    read = next(reader.reads([selected_read_id]))

    # Get the signal data and sample rate
    sample_rate = read.run_info.sample_rate
    signal = read.signal

    # Compute the time steps over the sampling period
    time = np.arange(len(signal)) / sample_rate

    # Plot signal against time using matplotlib and display the figure
    plt.plot(time, signal)
    plt.xlabel("Time (s)")
    plt.ylabel("Signal")
    plt.show()

Dataset Reader Reference

Click here to view the DatasetReader API Reference.

Reader Reference

Click here to view the Reader API Reference.