POD5 API Dataset Reference

DEFAULT_CPUS `module-attribute`

DEFAULT_CPUS = min(cpu_count() or 1, 4)
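The expression caps the default thread count at four and falls back to one when the CPU count cannot be determined. A minimal sketch, assuming `cpu_count` here is `os.cpu_count` (the import is not shown in this reference):

```python
# Assumption: cpu_count comes from the os module; this reference does not show the import.
from os import cpu_count

# os.cpu_count() may return None when the count is undetermined,
# so `or 1` guarantees at least one thread before the cap of 4 is applied.
default_cpus = min(cpu_count() or 1, 4)
print(default_cpus)  # a value between 1 and 4 inclusive
```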

DatasetReader

DatasetReader(
    paths: Union[PathOrStr, Collection[PathOrStr]],
    recursive: bool = False,
    pattern: str = "*.pod5",
    index: bool = False,
    threads: int = DEFAULT_CPUS,
    max_cached_readers: Optional[int] = 2**4,
    warn_duplicate_indexing: bool = True,
)

Reads pod5 files and/or directories of pod5 files as a dataset.

Parameters:

Name	Type	Description	Default
`paths`	`PathOrStr \| Collection[PathOrStr]`	One or more files or directories to load	required
`recursive`	`bool`	Search directories in `paths` recursively	`False`
`pattern`	`str`	A glob expression to match against file names	`'*.pod5'`
`index`	`bool`	Promptly index the dataset instead of deferring until required	`False`
`threads`	`int`	The number of threads to use	`DEFAULT_CPUS`
`max_cached_readers`	`Optional[int]`	The maximum size of the `Reader` LRU cache. Set to `None` for an unlimited cache size.	`2 ** 4`
`warn_duplicate_indexing`	`bool`	Issue warnings when duplicate read_ids are detected and indexing by read_id is attempted	`True`

Note

Random record access is implemented by building an index that maps each read_id to its file path. This index can consume a large amount of memory. Methods that generate the index note this in their docstrings.

Warnings

If duplicate read_ids are present in the dataset, iterator methods such as reads() will yield all copies. Indexing methods such as get_read() return one copy chosen randomly and issue a warning, which can be suppressed by setting warn_duplicate_indexing=False.
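The constructor arguments and the context-manager methods can be combined as in the sketch below. It assumes that `DatasetReader` is importable from `pod5.dataset`; the helper names `summarise` and `open_and_summarise` are hypothetical and not part of the library.

```python
from pathlib import Path
from typing import Any, Dict


def summarise(dataset: Any) -> Dict[str, int]:
    """Collect basic facts from an open DatasetReader (hypothetical helper)."""
    return {
        "num_reads": dataset.num_reads,   # number of ReadRecords in the dataset
        "num_files": len(dataset.paths),  # pod5 files matched by `pattern`
    }


def open_and_summarise(directory: str) -> Dict[str, int]:
    """Open every *.pod5 file under `directory` and summarise it (hypothetical helper)."""
    # Imported lazily so this sketch can be loaded without pod5 installed.
    # Assumption: DatasetReader lives in the pod5.dataset module.
    from pod5.dataset import DatasetReader

    # index=False (the default) defers building the read_id index until required.
    with DatasetReader(Path(directory), recursive=True, threads=2) as dataset:
        return summarise(dataset)
```

Calling `open_and_summarise("./pod5_dir")` would then return the two counts for a hypothetical directory `./pod5_dir`.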

num_reads `property`

num_reads: int

Return the number of ReadRecords in this dataset.

paths `property`

paths: List[Path]

Return the list of pod5 file paths in this dataset

read_ids `property`

read_ids: Generator[str, None, None]

Yield all read_ids in this dataset

threads `instance-attribute`

threads = threads

warn_duplicate_indexing `instance-attribute`

warn_duplicate_indexing = warn_duplicate_indexing

__enter__

__enter__() -> DatasetReader

__exit__

__exit__(*exc_details) -> None

__iter__

__iter__() -> Generator[ReadRecord, None, None]

__len__

__len__() -> int

Returns the number of reads in this dataset

clear_index

clear_index() -> None

Clears the read_id to file path index

clear_readers

clear_readers() -> None

Clears the readers LRU cache

get_path

get_path(read_id: str) -> Optional[Path]

Get the pod5 Path for a given read_id or None if it was not found

Parameters:

Name	Type	Description	Default
`read_id`	`str`	The read_id (UUID) string in this dataset	required

Note

This method will index the dataset

Warnings

Issues a warning if duplicate read_ids are detected in this dataset. The returned path is always a valid file that contains this read_id, but the choice of file may be random between instances of a DatasetReader.

Returns:

Type	Description
`Optional[Path]`

get_read

get_read(read_id: str) -> Optional[ReadRecord]

Get a ReadRecord by read_id or return None if it is missing

Parameters:

Name	Type	Description	Default
`read_id`	`str`	The read_id (UUID) string in this dataset to find	required

Note

This method will index the dataset

Warnings

Issues a warning if duplicate read_ids are detected in this dataset. The returned ReadRecord is always valid, but its source file may be random between instances of a DatasetReader.

Returns:

Type	Description
`Optional[ReadRecord]`
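Because both `get_path` and `get_read` return `None` for an unknown read_id, callers need to handle that case explicitly. A minimal sketch, where `locate_read` is a hypothetical helper and `dataset` is assumed to be an open `DatasetReader`:

```python
from typing import Any, Optional, Tuple


def locate_read(dataset: Any, read_id: str) -> Optional[Tuple[str, Any]]:
    """Return (file path, ReadRecord) for `read_id`, or None if it is absent.

    Hypothetical helper; `dataset` is expected to behave like a DatasetReader.
    """
    # The first lookup triggers read_id indexing of the whole dataset.
    path = dataset.get_path(read_id)
    if path is None:
        return None

    record = dataset.get_read(read_id)
    if record is None:
        return None

    return str(path), record
```

Both calls share the same index, so the indexing cost is paid once per dataset rather than once per lookup.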

get_reader

get_reader(path: PathOrStr) -> Reader

Get a pod5 file Reader in this dataset by path

Parameters:

Name	Type	Description	Default
`path`	`PathOrStr`	Path to a pod5 file	required

Returns:

Type	Description
`Reader`

has_duplicate

has_duplicate() -> bool

Returns True if there are duplicate read_ids in this dataset

Note

This method will index the dataset
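`has_duplicate` only reports whether duplicates exist. To find out which read_ids are affected, one option is to count the values yielded by the `read_ids` property. The sketch below assumes `read_ids` yields one entry per stored record, duplicates included; `duplicated_read_ids` is a hypothetical helper.

```python
from collections import Counter
from typing import Any, List


def duplicated_read_ids(dataset: Any) -> List[str]:
    """List the read_ids that occur more than once (hypothetical helper)."""
    # Cheap exit when the dataset reports no duplicates (this indexes the dataset).
    if not dataset.has_duplicate():
        return []

    # Assumption: read_ids yields every stored read_id, including repeated ones.
    counts = Counter(dataset.read_ids)
    return sorted(read_id for read_id, n in counts.items() if n > 1)
```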

index_read_ids

index_read_ids() -> None

Performs read_id indexing if not already done.

reads

reads(
    selection: Optional[Iterable[str]] = None, preload: Optional[Set[str]] = None
) -> Generator[ReadRecord, None, None]

Iterate over ReadRecords in the dataset.

Parameters:

Name	Type	Description	Default
`selection`	`Optional[Iterable[str]]`	The read_ids to select from the dataset. If `None`, all reads are yielded.	`None`
`preload`	`Optional[Set[str]]`	Columns to preload. Valid values are "samples" and "sample_count".	`None`

Note

ReadRecords are yielded in on-disk record order for each file in self.paths.

Missing records are not detected and multiple records will be yielded if there are duplicates in either of the dataset or selection.

Yields:

Type	Description
`ReadRecord`
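Since `reads` does not report missing records, a caller that needs to know which requested read_ids were absent has to track that itself. A minimal sketch, assuming each `ReadRecord` exposes `read_id` and `sample_count` attributes; `sample_counts` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, Set, Tuple


def sample_counts(
    dataset: Any, read_ids: Iterable[str]
) -> Tuple[Dict[str, int], Set[str]]:
    """Return (sample count per found read_id, read_ids that were not found).

    Hypothetical helper; `dataset` is expected to behave like a DatasetReader.
    """
    # De-duplicate the selection so it cannot cause records to be yielded twice.
    wanted = set(read_ids)

    counts: Dict[str, int] = {}
    # Preloading "sample_count" asks the reader to fetch that column up front.
    for record in dataset.reads(selection=wanted, preload={"sample_count"}):
        # If the dataset itself holds duplicates, the last copy seen wins here.
        counts[str(record.read_id)] = record.sample_count

    missing = wanted - counts.keys()
    return counts, missing
```

Iterating with `reads` does not require the read_id index, so this approach avoids the memory cost noted for `get_read` when many records are needed at once.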
