subset
Tool for subsetting pod5 files into one or more outputs
WorkQueue
assert_filename_template
assert_filename_template(
template: str, subset_columns: List[str], ignore_incomplete_template: bool
) -> None
Get the keys named in the template to assert that they exist in subset_columns
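The check described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: `check_template` is a hypothetical name, the use of `ValueError` is an assumption, and so is the reading of `ignore_incomplete_template` as "allow columns that the template does not use".

```python
from string import Formatter
from typing import List


def check_template(
    template: str,
    subset_columns: List[str],
    ignore_incomplete_template: bool = False,
) -> None:
    """Hypothetical stand-in for a template/column consistency check."""
    # Formatter().parse yields (literal, field_name, format_spec, conversion);
    # field_name is None for trailing literals and "" for a bare "{}".
    keys = {field for _, field, _, _ in Formatter().parse(template) if field}

    unknown = keys - set(subset_columns)
    if unknown:
        raise ValueError(f"template keys not in subset_columns: {sorted(unknown)}")

    # Assumption: an "incomplete" template is one that omits some columns.
    unused = set(subset_columns) - keys
    if unused and not ignore_incomplete_template:
        raise ValueError(f"columns missing from template: {sorted(unused)}")
```

For example, `check_template("{barcode}.pod5", ["barcode", "mux"])` raises under these assumptions, while passing `ignore_incomplete_template=True` lets it through.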
assert_overwrite_ok
Given the target filenames, assert that no unforced overwrite will occur,
raising a FileExistsError if one would. If force_overwrite is set, any
existing target files are unlinked instead
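The overwrite guard described above might look roughly like the following. `check_overwrite` is a hypothetical name and the error message is illustrative. Only the behaviour stated in the description is modelled: raise unless forced, unlink when forced.

```python
from pathlib import Path
from typing import Iterable


def check_overwrite(targets: Iterable[Path], force_overwrite: bool = False) -> None:
    """Hypothetical sketch of an overwrite guard for output files."""
    existing = [path for path in targets if path.exists()]
    if not existing:
        return
    if not force_overwrite:
        names = ", ".join(str(path) for path in existing)
        raise FileExistsError(f"refusing to overwrite existing files: {names}")
    # Forced: remove the stale outputs so they can be written afresh.
    for path in existing:
        path.unlink()
```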
calculate_transfers
Produce the transfers dataframe, which maps each read_id to its source and destination
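Conceptually, the transfers table is a join on read_id between what the sources contain and what the targets request. The sketch below shows that idea with plain dictionaries rather than dataframes. `build_transfers` and the example identifiers are hypothetical, and silently skipping reads that are absent from every source is an assumption.

```python
from typing import Dict, List, Tuple


def build_transfers(
    sources: Dict[str, str],
    targets: Dict[str, List[str]],
) -> List[Tuple[str, str, str]]:
    """Join read_id -> source with destination -> read_ids.

    Returns (read_id, source, destination) rows.
    """
    transfers: List[Tuple[str, str, str]] = []
    for destination, read_ids in targets.items():
        for read_id in read_ids:
            source = sources.get(read_id)
            if source is None:
                continue  # assumption: requested read not present in any input
            transfers.append((read_id, source, destination))
    return transfers


rows = build_transfers(
    sources={"read-1": "in_a.pod5", "read-2": "in_b.pod5"},
    targets={"out_x.pod5": ["read-1", "read-2"], "out_y.pod5": ["read-3"]},
)
```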
column_keys_from_template
Get a list of placeholder keys in the template
create_default_filename_template
Create the default filename template from the subset_columns selected
default_filename_template
Create the default filename template from the subset_columns selected
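The layout of the default template is not stated here. As an illustration only, one plausible scheme gives each selected column a `name-{name}` segment and joins the segments with underscores. `make_default_template`, the segment layout, and the `.pod5` suffix are all assumptions.

```python
from typing import List


def make_default_template(subset_columns: List[str]) -> str:
    """Hypothetical default: "<column>-{<column>}" segments joined by "_"."""
    # Doubled braces emit a literal brace, so each segment keeps a placeholder.
    body = "_".join(f"{column}-{{{column}}}" for column in subset_columns)
    return f"{body}.pod5"


template = make_default_template(["barcode", "mux"])
filename = template.format(barcode="b01", mux=3)
```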
fstring_to_polars
Replace f-string keyed placeholders with positional ones and return the keys in their respective position
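The conversion described above can be sketched with `string.Formatter`. `to_positional` is a hypothetical name. The positional form is presumably what a positional formatter such as polars' `pl.format` expects, though that link is an assumption here.

```python
from string import Formatter
from typing import List, Tuple


def to_positional(template: str) -> Tuple[str, List[str]]:
    """Turn "{a}_{b}.pod5" into ("{}_{}.pod5", ["a", "b"])."""
    parts: List[str] = []
    keys: List[str] = []
    for literal, field, _spec, _conversion in Formatter().parse(template):
        # parse() unescapes "{{" and "}}", so re-escape literal braces.
        parts.append(literal.replace("{", "{{").replace("}", "}}"))
        if field is not None:
            parts.append("{}")
            keys.append(field)
    return "".join(parts), keys


positional, keys = to_positional("{barcode}_{mux}.pod5")
```

The returned keys are in placeholder order, so a caller can pass the matching column values positionally.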
get_separator
Inspect the first line of the file at path and attempt to determine the field
separator as either tab or comma, depending on the number of occurrences of each
Returns "," or "\t"
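A minimal sketch of this heuristic follows. `guess_separator` is a hypothetical name, and falling back to a comma when the counts tie is an assumption, since the description does not say how ties are handled.

```python
from pathlib import Path


def guess_separator(path: Path) -> str:
    """Guess the field separator from the first line of a text table."""
    with path.open("r") as handle:
        first_line = handle.readline()
    # Whichever candidate occurs more often in the header wins;
    # ties (including an empty line) fall back to a comma.
    if first_line.count("\t") > first_line.count(","):
        return "\t"
    return ","
```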
launch_subsetting
launch_subsetting(
transfers: LazyFrame, duplicate_ok: bool, threads: int = DEFAULT_THREADS
) -> None
Iterate over the transfers dataframe subsetting reads from sources to destinations
parse_csv_mapping
Parse the csv direct mapping of output target to read_ids to a targets dataframe
parse_source
Read the read ids available in a given pod5 file, returning a dataframe with the formatted read_ids and the source filename
parse_source_process
Parse sources until paths queue is consumed
parse_sources
Read all inputs and return a formatted lazy dataframe
parse_table_mapping
parse_table_mapping(
summary_path: Path,
filename_template: Optional[str],
subset_columns: List[str],
read_id_column: str = DEFAULT_READ_ID_COLUMN,
ignore_incomplete_template: bool = False,
) -> LazyFrame
Parse a table using polars to create a mapping of output targets to read ids
process_subset_tasks
process_subset_tasks(queue: WorkQueue, process: int, duplicate_ok: bool)
Consumes work from the queue and launches subsetting tasks
resolve_output_targets
Prepend the output path to the target filename and resolve the complete string
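For a single target, the step described above amounts to a path join followed by normalisation. The sketch below uses `pathlib`. `resolve_target` is a hypothetical name, and the actual function may operate on a whole dataframe column rather than on one path.

```python
from pathlib import Path


def resolve_target(output: Path, target: str) -> Path:
    """Join the output directory and target filename into an absolute path."""
    # resolve() makes the path absolute and collapses ".." segments.
    return (output / target).resolve()


resolved = resolve_target(Path("out"), "subdir/../barcode-b01.pod5")
```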
subset_pod5
subset_pod5(
inputs: List[Path],
output: Path,
columns: List[str],
csv: Optional[Path] = None,
table: Optional[Path] = None,
threads: int = DEFAULT_THREADS,
template: str = "",
read_id_column: str = DEFAULT_READ_ID_COLUMN,
missing_ok: bool = False,
duplicate_ok: bool = False,
ignore_incomplete_template: bool = False,
force_overwrite: bool = False,
recursive: bool = False,
) -> Any
Prepare the subsetting mapping and run the repacker
subset_pod5s_with_mapping
subset_pod5s_with_mapping(
inputs: Set[Path],
output: Path,
targets: LazyFrame,
threads: int = DEFAULT_THREADS,
missing_ok: bool = False,
duplicate_ok: bool = False,
force_overwrite: bool = False,
) -> None
Given an iterable of input pod5 paths and an output directory, create output pod5 files containing the read_ids specified in the given mapping of output filename to set of read_id.