POD5 API Dataset Reference

DEFAULT_CPUS module-attribute

DEFAULT_CPUS = min(cpu_count() or 1, 4)
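As a rough illustration of how this default evaluates, the sketch below reproduces the expression with the standard library. Which `cpu_count` the module imports is not shown on this page, so using `os.cpu_count` is an assumption, and `default_cpus` is a hypothetical helper name.

```python
import os


# Assumption: cpu_count is os.cpu_count, which may return None when the
# CPU count cannot be determined; "or 1" covers that case.
def default_cpus(cap: int = 4) -> int:
    """Return the detected CPU count, falling back to 1 and capped at `cap`."""
    return min(os.cpu_count() or 1, cap)


DEFAULT_CPUS = default_cpus()
print(f"default threads: {DEFAULT_CPUS}")
```

The cap means a machine with many cores still defaults to 4 threads unless `threads` is passed explicitly.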

DatasetReader

DatasetReader(
    paths: Union[PathOrStr, Collection[PathOrStr]],
    recursive: bool = False,
    pattern: str = "*.pod5",
    index: bool = False,
    threads: int = DEFAULT_CPUS,
    max_cached_readers: Optional[int] = 2**4,
    warn_duplicate_indexing: bool = True,
)

Reads pod5 files and/or directories of pod5 files as a dataset.

Parameters:

Name Type Description Default
paths PathOrStr | Collection[PathOrStr]

One or more files or directories to load

required
recursive bool

Search directories in paths recursively

False
pattern str

A glob expression to match against file names

'*.pod5'
index bool

Promptly index the dataset instead of deferring until required

False
threads int

The number of threads to use

DEFAULT_CPUS
max_cached_readers Optional[int]

The maximum size of the Reader LRU cache. Set to None for an unlimited cache size.

2 ** 4
warn_duplicate_indexing bool

Issue warnings when duplicate read_ids are detected and indexing by read_id is attempted

True
Note

Random record access is implemented by creating an index of read_id to file path. This can consume a large amount of memory. Methods that generate an index have this noted in their docstring.

Warnings

If duplicate read_ids are present in the dataset, iterator methods such as reads() will yield all copies. Indexing methods such as get_read return one copy chosen at random and issue a warning, which can be suppressed by setting warn_duplicate_indexing=False.
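A minimal sketch of opening a directory tree as a dataset with the constructor arguments documented above. It assumes `DatasetReader` is importable as `pod5.DatasetReader`. The helpers `open_dataset` and `count_reads`, and the `reader_cls` parameter (which exists only so the sketch can be exercised without pod5 installed), are hypothetical.

```python
from pathlib import Path
from typing import Any, Callable, Optional


def open_dataset(root, reader_cls: Optional[Callable[..., Any]] = None):
    """Open every *.pod5 file found under `root` as a single dataset."""
    if reader_cls is None:
        # Imported lazily so this module loads even without pod5 installed.
        import pod5  # assumption: DatasetReader is exported at package level

        reader_cls = pod5.DatasetReader
    return reader_cls(
        Path(root),
        recursive=True,      # descend into sub-directories
        pattern="*.pod5",    # glob matched against file names
        index=False,         # defer read_id indexing until it is needed
        max_cached_readers=8,
        warn_duplicate_indexing=True,
    )


def count_reads(root, reader_cls=None) -> int:
    """Count reads, using the dataset as a context manager."""
    with open_dataset(root, reader_cls) as dataset:
        return len(dataset)
```

With pod5 installed, `count_reads("./runs")` would open the dataset, count its reads, and leave the `with` block on return.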

num_reads property

num_reads: int

Return the number of ReadRecords in this dataset.

paths property

paths: List[Path]

Return the list of pod5 file paths in this dataset

read_ids property

read_ids: Generator[str, None, None]

Yield all read_ids in this dataset
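Because `read_ids` is a generator, it can be sampled lazily rather than materialised in full. The sketch below summarises a dataset using only the three properties above. `describe_dataset` is a hypothetical helper, and `dataset` is assumed to be an open DatasetReader.

```python
from itertools import islice
from typing import Any, Dict


def describe_dataset(dataset, preview: int = 5) -> Dict[str, Any]:
    """Summarise a dataset without consuming the whole read_ids generator."""
    return {
        "num_reads": dataset.num_reads,
        "num_files": len(dataset.paths),
        # islice stops after `preview` items, so the rest is never produced.
        "first_read_ids": [str(rid) for rid in islice(dataset.read_ids, preview)],
    }
```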

threads instance-attribute

threads = threads

warn_duplicate_indexing instance-attribute

warn_duplicate_indexing = warn_duplicate_indexing

__enter__

__enter__() -> DatasetReader

__exit__

__exit__(*exc_details) -> None

__iter__

__iter__() -> Generator[ReadRecord, None, None]

__len__

__len__() -> int

Returns the number of reads in this dataset

clear_index

clear_index() -> None

Clears the read_id to file path index

clear_readers

clear_readers() -> None

Clears the readers LRU cache
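Since the read_id index can consume a large amount of memory, one possible pattern is to release it once a batch of lookups is done. This is only a sketch. `lookup_paths_then_release` is a hypothetical helper, and `dataset` is assumed to be an open DatasetReader.

```python
from typing import Dict, Iterable


def lookup_paths_then_release(dataset, read_ids: Iterable[str]) -> Dict[str, object]:
    """Resolve read_ids to file paths, then drop the index and cached readers."""
    try:
        # get_path indexes the dataset on first use.
        return {rid: dataset.get_path(rid) for rid in read_ids}
    finally:
        # Runs even if a lookup raises, so memory is always released.
        dataset.clear_index()
        dataset.clear_readers()
```

The trade-off is that the next indexed lookup has to rebuild the index, so this suits occasional batches rather than frequent random access.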

get_path

get_path(read_id: str) -> Optional[Path]

Get the pod5 Path for a given read_id or None if it was not found

Parameters:

Name Type Description Default
read_id str

The read_id (UUID) string in this dataset

required
Note

This method will index the dataset

Warnings

Issues a warning if duplicate read_ids are detected in this dataset. The returned path is always a valid file that contains this read_id, but which file is returned may vary between instances.

Returns:

Type Description
Optional[Path]
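One use of `get_path` is to group a list of read_ids by the file that holds them, handling the `None` result for unknown ids. A minimal sketch, in which `group_by_file` is a hypothetical helper and `dataset` is assumed to be an open DatasetReader:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_file(dataset, read_ids: Iterable[str]) -> Tuple[Dict[object, List[str]], List[str]]:
    """Return ({path: [read_id, ...]}, [read_ids that were not found])."""
    groups = defaultdict(list)
    missing = []
    for rid in read_ids:
        path = dataset.get_path(rid)  # None when the read_id is not in the dataset
        if path is None:
            missing.append(rid)
        else:
            groups[path].append(rid)
    return dict(groups), missing
```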

get_read

get_read(read_id: str) -> Optional[ReadRecord]

Get a ReadRecord by read_id or return None if it is missing

Parameters:

Name Type Description Default
read_id str

The read_id (UUID) string in this dataset to find

required
Note

This method will index the dataset

Warnings

Issues a warning if duplicate read_ids are detected in this dataset. The returned ReadRecord is always valid, but its source file may vary between instances of a DatasetReader.

Returns:

Type Description
Optional[ReadRecord]
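Because `get_read` returns `None` rather than raising, callers that require a record may want to convert the missing case into an exception. A sketch under that assumption, in which `fetch_read` is a hypothetical wrapper and `dataset` is assumed to be an open DatasetReader:

```python
def fetch_read(dataset, read_id):
    """Return the ReadRecord for `read_id`, raising KeyError if it is absent."""
    # str() lets callers pass either a uuid.UUID or its string form.
    record = dataset.get_read(str(read_id))
    if record is None:
        raise KeyError(f"read_id not found in dataset: {read_id}")
    return record
```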

get_reader

get_reader(path: PathOrStr) -> Reader

Get a pod5 file Reader in this dataset by path

Parameters:

Name Type Description Default
path PathOrStr

Path to a pod5 file

required

Returns:

Type Description
Reader
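As an illustration, `get_reader` can be combined with `paths` to inspect each file in turn. The sketch assumes the returned `Reader` exposes a `num_reads` property, which this page does not confirm. `reads_per_file` is a hypothetical helper.

```python
from typing import Dict


def reads_per_file(dataset) -> Dict[object, int]:
    """Map each pod5 file path in the dataset to its number of reads."""
    # Assumption: Reader has a num_reads property.
    return {path: dataset.get_reader(path).num_reads for path in dataset.paths}
```

Readers come from the LRU cache where possible, so a dataset with more files than `max_cached_readers` will reopen files as the loop proceeds.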

has_duplicate

has_duplicate() -> bool

Returns True if there are duplicate read_ids in this dataset

Note

This method will index the dataset

index_read_ids

index_read_ids() -> None

Performs read_id indexing if not already done.
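Indexing can be triggered up front, for example to pay the cost at a predictable point and to check for duplicates before any random access. A minimal sketch, in which `build_index` is a hypothetical helper and `dataset` is assumed to be an open DatasetReader:

```python
import time
from typing import Tuple


def build_index(dataset) -> Tuple[bool, float]:
    """Index the dataset eagerly; return (has duplicates, seconds taken)."""
    start = time.perf_counter()
    dataset.index_read_ids()  # does nothing further if already indexed
    elapsed = time.perf_counter() - start
    return dataset.has_duplicate(), elapsed
```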

reads

reads(
    selection: Optional[Iterable[str]] = None, preload: Optional[Set[str]] = None
) -> Generator[ReadRecord, None, None]

Iterate over ReadRecords in the dataset.

Parameters:

Name Type Description Default
selection iterable[str]

The read_ids to iterate over in the dataset.

None
preload set[str]

Columns to preload - "samples" and "sample_count" are valid values

None
Note

ReadRecords are yielded in on-disk record order for each file in self.paths.

Missing records are not detected, and multiple records will be yielded if there are duplicates in either the dataset or the selection.

Yields:

Type Description
ReadRecord
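Since `reads` neither reports missing records nor collapses duplicates, a caller can track both while iterating. The sketch below is one way to do that. `audit_selection` is a hypothetical helper, `dataset` is assumed to be an open DatasetReader, and each yielded record is assumed to expose `read_id`.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def audit_selection(dataset, wanted: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (read_ids never yielded, read_ids yielded more than once)."""
    wanted = set(wanted)
    counts = Counter()
    # "sample_count" is one of the documented preload values.
    for record in dataset.reads(selection=wanted, preload={"sample_count"}):
        counts[str(record.read_id)] += 1
    missing = wanted - set(counts)
    duplicated = {rid for rid, n in counts.items() if n > 1}
    return missing, duplicated
```

Only read_ids and counts are held in memory, so the check stays cheap even when the signal data is large.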