POD5 API Dataset Reference
DatasetReader
DatasetReader(
paths: Union[PathOrStr, Collection[PathOrStr]],
recursive: bool = False,
pattern: str = "*.pod5",
index: bool = False,
threads: int = DEFAULT_CPUS,
max_cached_readers: Optional[int] = 2**4,
warn_duplicate_indexing: bool = True,
)
Reads pod5 files and/or directories of pod5 files as a dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| paths | PathOrStr \| Collection[PathOrStr] | One or more files or directories to load | required |
| recursive | bool | Search directories in | False |
| pattern | str | A glob expression to match against file names | '*.pod5' |
| index | bool | Promptly index the dataset instead of deferring until required | False |
| threads | int | The number of threads to use | DEFAULT_CPUS |
| max_cached_readers | Optional[int] | The maximum size of the | 2 ** 4 |
| warn_duplicate_indexing | bool | Issue warnings when duplicate read_ids are detected and indexing by read_id is attempted | True |
Note
Random record access is implemented by creating an index of read_id to file path. This can consume a large amount of memory. Methods that generate an index have this noted in their docstring.
Warnings
If duplicate read_ids are present in the dataset, iterator methods such as reads() will yield all copies. Indexing methods such as get_read return one copy chosen at random and issue a warning, which can be suppressed by setting warn_duplicate_indexing=False.
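To make the note and warning above concrete, here is a minimal conceptual sketch of a read_id-to-path index built from plain Python data. It is not pod5's implementation: `build_index` and the sample file names are hypothetical, and the sketch simply keeps the first path it sees for each read_id. It illustrates why the index grows with the number of reads and why a duplicated read_id forces an indexed lookup to settle on one file.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Mapping, Set, Tuple


def build_index(
    files: Mapping[Path, Iterable[str]],
) -> Tuple[Dict[str, Path], Set[str]]:
    """Map each read_id to one path and collect read_ids seen more than once.

    Conceptual model only: `files` stands in for the read_ids stored in each
    pod5 file, and this is not how pod5 builds its index internally.
    """
    locations: Dict[str, List[Path]] = defaultdict(list)
    for path, read_ids in files.items():
        for read_id in read_ids:
            locations[read_id].append(path)

    # One entry per unique read_id, so memory use scales with dataset size.
    index = {read_id: paths[0] for read_id, paths in locations.items()}
    duplicates = {read_id for read_id, paths in locations.items() if len(paths) > 1}
    return index, duplicates


# Hypothetical dataset: "read-2" is stored in both files.
files = {
    Path("a.pod5"): ["read-1", "read-2"],
    Path("b.pod5"): ["read-2", "read-3"],
}
index, duplicates = build_index(files)
```

In this toy dataset, `duplicates` contains only `"read-2"`, and `index` can point to just one of the two files that hold it.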
get_path
Get the pod5 Path for a given read_id or None if it was not found
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| read_id | str | The read_id (UUID) string in this dataset | required |
Note
This method will index the dataset
Warnings
Issues a warning if duplicate read_ids are detected in this dataset. The returned path is always a valid file that contains this read_id, but which file is returned may vary between instances.
Returns:
| Type | Description |
|---|---|
| Optional[Path] | |
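Because get_path returns None for an unknown read_id, callers usually need to handle both outcomes. The sketch below groups read_ids by the file reported for them; `group_read_ids_by_path` is a hypothetical helper, and it relies only on the `get_path(read_id)` behaviour described above, so any object with that method will work.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Optional


def group_read_ids_by_path(
    dataset, read_ids: Iterable[str]
) -> Dict[Optional[Path], List[str]]:
    """Group read_ids by the file that get_path reports for them.

    `dataset` is any object with a get_path(read_id) method, such as a
    DatasetReader. read_ids that are not found are grouped under None.
    """
    grouped: Dict[Optional[Path], List[str]] = defaultdict(list)
    for read_id in read_ids:
        # The first call triggers indexing on a real DatasetReader.
        grouped[dataset.get_path(read_id)].append(read_id)
    return dict(grouped)
```

Grouping by file first can be handy when later work is done one file at a time, since the read_ids under the None key are exactly the ones that were not found.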
get_read
get_read(read_id: str) -> Optional[ReadRecord]
Get a ReadRecord by read_id or return None if it is missing
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| read_id | str | The read_id (UUID) string in this dataset to find | required |
Note
This method will index the dataset
Warnings
Issues a warning if duplicate read_ids are detected in this dataset. The returned ReadRecord is always valid, but its source file may vary between instances of a DatasetReader.
Returns:
| Type | Description |
|---|---|
| Optional[ReadRecord] | |
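Since get_read signals a missing record with None rather than an exception, a lookup over many read_ids has to separate hits from misses itself. The following is a minimal sketch of that pattern; `fetch_reads` is a hypothetical helper that uses only the `get_read(read_id)` call documented above.

```python
from typing import Any, Dict, Iterable, List, Tuple


def fetch_reads(
    dataset, read_ids: Iterable[str]
) -> Tuple[Dict[str, Any], List[str]]:
    """Look up each read_id and split the results into found and missing.

    `dataset` is any object with a get_read(read_id) method, such as a
    DatasetReader. Returns (found, missing), where `found` maps read_id to
    the returned record and `missing` lists read_ids that returned None.
    """
    found: Dict[str, Any] = {}
    missing: List[str] = []
    for read_id in read_ids:
        record = dataset.get_read(read_id)
        if record is None:
            missing.append(read_id)
        else:
            found[read_id] = record
    return found, missing
```

Random access like this builds the index described in the note above, so for a full pass over a dataset the reads() iterator is likely the lighter option.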
get_reader
get_reader(path: PathOrStr) -> Reader
Get a pod5 file Reader in this dataset by path
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | PathOrStr | Path to a pod5 file | required |
Returns:
| Type | Description |
|---|---|
| Reader | |
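As an illustration of per-file access, the sketch below counts the records in each file of a dataset. It makes two assumptions that this section does not spell out: that the dataset exposes its files as `dataset.paths`, and that the returned Reader has a `reads()` method that iterates over the records of that one file. `count_reads_per_file` itself is a hypothetical helper.

```python
from typing import Dict


def count_reads_per_file(dataset) -> Dict[str, int]:
    """Count the records in each file of a dataset.

    Assumes `dataset.paths` lists the dataset's files and that the object
    returned by get_reader(path) has a reads() iterator over that file.
    """
    counts: Dict[str, int] = {}
    for path in dataset.paths:
        reader = dataset.get_reader(path)
        # Exhaust the iterator to count records without storing them.
        counts[str(path)] = sum(1 for _ in reader.reads())
    return counts
```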
has_duplicate
Returns True if there are duplicate read_ids in this dataset
Note
This method will index the dataset
reads
reads(
selection: Optional[Iterable[str]] = None, preload: Optional[Set[str]] = None
) -> Generator[ReadRecord, None, None]
Iterate over ReadRecords in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| selection | iterable[str] | The read ids to walk in the file. | None |
| preload | set[str] | Columns to preload - "samples" and "sample_count" are valid values | None |
Note
ReadRecords are yielded in on-disk record order for each file in self.paths.
Missing records are not detected, and multiple records will be yielded if there are duplicates in either the dataset or the selection.
Yields:
| Type | Description |
|---|---|
| ReadRecord | |
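Because reads() neither reports missing records nor collapses duplicates, a caller who needs those guarantees has to check them. The sketch below does that in a single pass; `check_selection` is a hypothetical helper, and it assumes each yielded record has a `read_id` attribute whose string form matches the strings passed in the selection.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def check_selection(
    dataset, selection: Iterable[str]
) -> Tuple[Set[str], List[str]]:
    """Return (missing, repeated) read_ids for a selection.

    `dataset` is any object with a reads(selection=...) iterator, such as a
    DatasetReader. Assumes str(record.read_id) matches the selection strings.
    """
    # A set removes duplicates from the selection itself.
    wanted = set(selection)
    seen = Counter(str(record.read_id) for record in dataset.reads(selection=wanted))
    missing = wanted - set(seen)
    repeated = sorted(read_id for read_id, count in seen.items() if count > 1)
    return missing, repeated
```

Deduplicating the selection up front means any read_id listed in `repeated` is duplicated in the dataset rather than in the caller's list.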