Reading
Open a Dataset
To open a collection of POD5 files as a dataset, use the DatasetReader class.
The DatasetReader takes one or more paths to files and/or directories.
Directories can be searched for POD5 files recursively with the recursive parameter.
It is strongly recommended to use Python's with statement so that any opened resources (e.g. file handles) are safely closed when they are no longer needed.
import pod5
paths = ["/foo/", "./bar/", "/baz/file.pod5"]
with pod5.DatasetReader(paths, recursive=True) as dataset:
    # Use DatasetReader within this context manager to free resources when done
    ...
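To preview which files a recursive search might pick up before opening a dataset, something like the following sketch can help. `preview_pod5_files` is a hypothetical helper, not part of the pod5 API, and it assumes that POD5 files are identified by the `.pod5` extension; the DatasetReader's own search may differ in detail.

```python
import tempfile
from pathlib import Path


def preview_pod5_files(directory, recursive=False):
    """List *.pod5 files under directory (hypothetical helper, not pod5 API)."""
    root = Path(directory)
    # rglob descends into sub-directories, glob only looks at the top level
    matches = root.rglob("*.pod5") if recursive else root.glob("*.pod5")
    return sorted(path.relative_to(root).as_posix() for path in matches)


# Build a throwaway directory tree to show the difference.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "top.pod5").touch()
    (root / "nested" / "deep.pod5").touch()
    (root / "notes.txt").touch()

    flat = preview_pod5_files(root)
    deep = preview_pod5_files(root, recursive=True)

print(flat)  # ['top.pod5']
print(deep)  # ['nested/deep.pod5', 'top.pod5']
```

Checking the file list first is a cheap way to spot unexpected copies in nested directories before they are loaded into one dataset.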
Open a Single POD5
A single POD5 file can be opened with the Reader.
The Reader is a lower-level interface with many more functions and
properties for accessing the inner workings of POD5 files, such as the underlying Arrow tables.
import pod5
with pod5.Reader("example.pod5") as reader:
    # Use Reader within this context manager to free resources when done
    ...
Inspecting Reads
With an open Reader or DatasetReader,
call the reads method to generate a ReadRecord
instance for each read in the file or dataset:
import pod5

# Iterate over every record in the file using reads()
with pod5.Reader("example.pod5") as reader:
    for read_record in reader.reads():
        print(read_record.read_id)
# Iterate over every record in the dataset using __iter__
with pod5.DatasetReader("./dataset/", recursive=True) as dataset:
    for read_record in dataset:
        print(read_record.read_id, read_record.path)
Selecting Reads
To iterate over a selection of read_ids, pass a collection of read_ids to the
DatasetReader.reads or Reader.reads method.
Note
The order of records returned by Reader iterators is always the order on-disk
even when specifying a selection of read_ids.
import pod5

# Create a collection of read_id UUIDs as strings
read_ids = {
    "00445e58-3c58-4050-bacf-3411bb716cc3",
    "00520473-4d3d-486b-86b5-f031c59f6591",
}
# Example using single file Reader
with pod5.Reader("example.pod5") as reader:
    for read_record in reader.reads(selection=read_ids):
        assert str(read_record.read_id) in read_ids
# Example using DatasetReader
with pod5.DatasetReader("/path/to/dataset/") as dataset:
    for read_record in dataset.reads(selection=read_ids):
        assert str(read_record.read_id) in read_ids
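Because records come back in on-disk order, code that needs them in the order the read_ids were requested has to reorder them itself. The following is a minimal sketch of one way to do that. `in_requested_order` is a hypothetical helper, not part of the pod5 API; it assumes only that each record has a `read_id` attribute whose string form matches the requested IDs. Stub records are used so the sketch runs without a POD5 file.

```python
from collections import namedtuple


def in_requested_order(records, requested_ids):
    """Return records arranged to match requested_ids (hypothetical helper).

    Requested IDs with no matching record are skipped. If a read_id
    appears more than once in records, the last copy seen is kept.
    """
    by_id = {str(record.read_id): record for record in records}
    return [by_id[read_id] for read_id in requested_ids if read_id in by_id]


# Stand-in for ReadRecord so the sketch runs without a POD5 file.
StubRecord = namedtuple("StubRecord", ["read_id"])

on_disk = [StubRecord("b"), StubRecord("c"), StubRecord("a")]
ordered = in_requested_order(on_disk, ["a", "b", "missing"])
print([record.read_id for record in ordered])  # ['a', 'b']
```

This holds every selected record in memory at once, so it is probably best suited to small selections.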
Random Access to Records
The get_read method allows users to select any record in
the dataset by read_id.
To efficiently access records from a POD5 dataset, DatasetReader
will index the read_ids of all records and cache POD5 Reader instances.
import pod5

with pod5.DatasetReader("./dataset/") as dataset:
    read_id = "00445e58-3c58-4050-bacf-3411bb716cc3"
    # Dataset indexing will take place here
    read_record = dataset.get_read(read_id)
    # Returned object might be None if read_id is not found
    if read_record is None:
        print(f"dataset does not contain read_id: {read_id}")
    else:
        assert str(read_record.read_id) == read_id
Indexing Records
If necessary, the DatasetReader will index every record in the dataset.
This may consume a significant amount of memory
depending on the size of the dataset. Indexing is performed lazily: it only happens
when a function that requires the index is called.
The functions which require the index are get_read,
get_path, and has_duplicate.
To clear the index, freeing the memory, call clear_index.
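The index lifecycle described above can be sketched as a small helper. `lookup_paths` is a hypothetical function, not part of pod5; it relies only on the get_path and clear_index methods named above, and it assumes that get_path returns None for an unknown read_id. A stub dataset stands in for a real DatasetReader so the sketch runs on its own.

```python
def lookup_paths(dataset, read_ids):
    """Map each read_id to its file path, then release the index memory.

    Hypothetical helper: `dataset` is any object with get_path(read_id)
    and clear_index() methods, such as an open DatasetReader.
    """
    try:
        # get_path requires the index, so indexing happens here if needed.
        return {read_id: dataset.get_path(read_id) for read_id in read_ids}
    finally:
        # Free the index once the lookups are done.
        dataset.clear_index()


class StubDataset:
    """Stand-in for DatasetReader that builds its index lazily."""

    def __init__(self, paths_by_id):
        self._paths_by_id = paths_by_id
        self.index = None  # not built until first needed

    def get_path(self, read_id):
        if self.index is None:
            self.index = dict(self._paths_by_id)  # "indexing" happens here
        return self.index.get(read_id)

    def clear_index(self):
        self.index = None


stub = StubDataset({"read-1": "a.pod5", "read-2": "b.pod5"})
paths = lookup_paths(stub, ["read-1", "read-3"])
print(paths)  # {'read-1': 'a.pod5', 'read-3': None}
```

Clearing the index straight after a batch of lookups trades memory for time: the next call that needs the index has to rebuild it.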
Cached Readers
The DatasetReader class uses an LRU cache of Reader
instances to reduce the overhead of repeatedly re-opening POD5 files. The size of this
cache can be controlled by setting max_cached_readers during initialisation.
Where possible, users should maximise the likelihood of a Reader
cache hit by ensuring that successive calls to get_read
access records in no more POD5s than the size of the LRU cache.
Looking up read_ids scattered at random across many files causes the underlying
files to be opened repeatedly, which severely degrades performance.
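The effect of access order on an LRU cache can be illustrated with a toy model. This is a simulation under simplifying assumptions, not pod5's implementation: `count_file_opens` is a hypothetical function that counts a file open for every cache miss, and the file names are made up.

```python
from collections import OrderedDict


def count_file_opens(file_sequence, max_cached_readers):
    """Count how often a file must be opened under an LRU cache (toy model)."""
    cache = OrderedDict()  # keys are file names, most recently used last
    opens = 0
    for name in file_sequence:
        if name in cache:
            cache.move_to_end(name)  # cache hit: mark as most recently used
            continue
        opens += 1  # cache miss: the file has to be opened
        cache[name] = True
        if len(cache) > max_cached_readers:
            cache.popitem(last=False)  # evict the least recently used file
    return opens


# Twelve lookups spread over three files, with room for two cached readers.
interleaved = ["a.pod5", "b.pod5", "c.pod5"] * 4
grouped = sorted(interleaved)

print(count_file_opens(interleaved, max_cached_readers=2))  # 12
print(count_file_opens(grouped, max_cached_readers=2))  # 3
```

In this model, cycling through more files than the cache can hold misses on every lookup, while grouping lookups by file opens each file once. With a real dataset, one option may be to group read_ids by the result of get_path before calling get_read.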
Reads and ReadRecords
Nanopore sequencing data comprises Reads which are formed from signal data and other metadata about how and when the sample was sequenced. This data is accessible via the Read or ReadRecord classes.
Although these two classes have very similar interfaces, note that a
ReadRecord is an immutable Read
formed from a POD5 file record, and that it uses caching to improve read performance.
Duplicate Records
For a typical dataset sourced from a sequencer, it is vanishingly unlikely that
there will be even a single duplicate UUID read_id. It is much more likely that some
POD5 files in a dataset have been copied, merged, subset, etc., and that loading all files
in a dataset, especially in recursive mode, will find duplicate read_ids.
The DatasetReader handles duplicate records in iterators by
returning all copies as they appear on disk. Similarly, having duplicate read_ids in
the selection will repeatedly return duplicates.
The DatasetReader handles duplicate records when indexing by
returning a random valid ReadRecord if it exists or None.
The ReadRecord instance returned is random because the indexing
process is multi-threaded.
If a duplicate read_id is detected, the API will issue a warning. This can be
disabled with warn_duplicate_indexing=False in the initialiser.
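To find out which read_ids are duplicated and where the copies live, the (read_id, path) pairs from a dataset iterator can be grouped. The sketch below is one hedged way to do this; `find_duplicate_read_ids` is a hypothetical helper, not part of the pod5 API, and the pairs shown are made-up data.

```python
from collections import defaultdict


def find_duplicate_read_ids(id_path_pairs):
    """Return {read_id: [paths]} for read_ids seen more than once."""
    paths_by_id = defaultdict(list)
    for read_id, path in id_path_pairs:
        paths_by_id[str(read_id)].append(str(path))
    return {rid: paths for rid, paths in paths_by_id.items() if len(paths) > 1}


# Hypothetical (read_id, path) pairs; with pod5 these could come from
# ((record.read_id, record.path) for record in dataset).
pairs = [
    ("read-1", "run/a.pod5"),
    ("read-2", "run/a.pod5"),
    ("read-1", "run/copy/a.pod5"),
]
duplicates = find_duplicate_read_ids(pairs)
print(duplicates)  # {'read-1': ['run/a.pod5', 'run/copy/a.pod5']}
```

Because iterators return every copy as it appears on disk, a single pass over the dataset is enough to collect all duplicates this way.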
Plotting Example
Here is an example of how a user may plot a read's signal data against time.
"""
Example use of pod5 to plot the signal data from a selected read.
"""
import matplotlib.pyplot as plt
import numpy as np
import pod5
# Using the example pod5 file provided
example_pod5 = "test_data/multi_fast5_zip.pod5"
selected_read_id = '0000173c-bf67-44e7-9a9c-1ad0bc728e74'
with pod5.Reader(example_pod5) as reader:
    # Read the selected read from the pod5 file
    # next() is required here as Reader.reads() returns a Generator
    read = next(reader.reads([selected_read_id]))
    # Get the signal data and sample rate
    sample_rate = read.run_info.sample_rate
    signal = read.signal
    # Compute the time steps over the sampling period
    time = np.arange(len(signal)) / sample_rate
    # Plot using matplotlib
    plt.plot(time, signal)
Dataset Reader Reference
See the DatasetReader API Reference.