subset

Tool for subsetting pod5 files into one or more outputs

WorkQueue

WorkQueue(context: SpawnContext, transfers: LazyFrame)

join

join() -> None

Call join on the work queue, waiting for all tasks to be done

shutdown

shutdown() -> int

Shut down all queues, returning the counts of all remaining items

assert_filename_template

assert_filename_template(
    template: str, subset_columns: List[str], ignore_incomplete_template: bool
) -> None

Extract the keys named in the template and assert that they exist in subset_columns
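
A minimal sketch validating a template against the selected subset columns, assuming these helpers are importable from pod5.tools.pod5_subset (the module path and the example template are assumptions):

# Sketch: check that every placeholder in the filename template names a subset column.
# The module path pod5.tools.pod5_subset is an assumption.
from pod5.tools.pod5_subset import assert_filename_template

# Passes silently because both "barcode" and "channel" are selected subset columns;
# an unknown placeholder key would fail the assertion instead.
assert_filename_template(
    template="barcode-{barcode}_channel-{channel}.pod5",
    subset_columns=["barcode", "channel"],
    ignore_incomplete_template=False,
)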

assert_overwrite_ok

assert_overwrite_ok(targets: LazyFrame, force_overwrite: bool) -> None

Given the target filenames, assert that no existing file would be overwritten, raising a FileExistsError otherwise. If force_overwrite is set, unlink any existing target files instead.

calculate_transfers

calculate_transfers(
    sources: LazyFrame, targets: LazyFrame, missing_ok: bool
) -> LazyFrame

Produce the transfers dataframe, which maps each read_id to its source and destination
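
A sketch combining the parsing helpers with calculate_transfers; the module path pod5.tools.pod5_subset and the file names are assumptions for illustration:

# Sketch: build the transfers mapping from parsed sources and a CSV target mapping.
from pathlib import Path

from pod5.tools.pod5_subset import (
    calculate_transfers,
    parse_csv_mapping,
    parse_sources,
)

sources = parse_sources({Path("input_1.pod5"), Path("input_2.pod5")})
targets = parse_csv_mapping(Path("mapping.csv"))

# missing_ok=True tolerates read_ids in the mapping that are absent from the sources.
transfers = calculate_transfers(sources, targets, missing_ok=True)
print(transfers.collect())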

column_keys_from_template

column_keys_from_template(template: str) -> List[str]

Get a list of placeholder keys in the template
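
A minimal sketch (module path assumed to be pod5.tools.pod5_subset):

# Sketch: extract the placeholder keys from an f-string style filename template.
from pod5.tools.pod5_subset import column_keys_from_template

keys = column_keys_from_template("barcode-{barcode}_channel-{channel}.pod5")
print(keys)  # expected to contain "barcode" and "channel"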

create_default_filename_template

create_default_filename_template(subset_columns: List[str]) -> str

Create the default filename template from the subset_columns selected

default_filename_template

default_filename_template(subset_columns: List[str]) -> str

Create the default filename template from the subset_columns selected

fstring_to_polars

fstring_to_polars(template: str) -> Tuple[str, List[str]]

Replace f-string keyed placeholders with positional ones and return the keys in their respective positions
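
A sketch of the intended usage (module path assumed; the exact positional placeholder syntax is defined by the implementation):

# Sketch: convert named placeholders to positional ones for use with polars,
# keeping the original keys in placeholder order.
from pod5.tools.pod5_subset import fstring_to_polars

positional_template, keys = fstring_to_polars("barcode-{barcode}_channel-{channel}.pod5")
print(positional_template)  # named keys replaced by positional placeholders
print(keys)                 # e.g. ["barcode", "channel"]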

get_separator

get_separator(path: Path) -> str

Inspect the first line of the file at path and attempt to determine whether the field separator is a tab or a comma, based on the number of occurrences of each. Returns "," or "\t"
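
A sketch (module path and file name are assumptions):

# Sketch: detect whether a mapping/summary file is comma- or tab-separated.
from pathlib import Path

from pod5.tools.pod5_subset import get_separator

separator = get_separator(Path("sequencing_summary.txt"))
print("comma separated" if separator == "," else "tab separated")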

launch_subsetting

launch_subsetting(
    transfers: LazyFrame, duplicate_ok: bool, threads: int = DEFAULT_THREADS
) -> None

Iterate over the transfers dataframe, subsetting reads from sources to destinations

main

main()

pod5 subset main

parse_csv_mapping

parse_csv_mapping(csv_path: Path) -> LazyFrame

Parse the CSV direct mapping of output targets to read_ids into a targets dataframe
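
A sketch of loading a direct CSV mapping into a targets dataframe (module path and file name are assumptions; see the pod5 documentation for the exact CSV layout):

# Sketch: parse a CSV mapping of output filenames to read_ids into a targets LazyFrame.
from pathlib import Path

from pod5.tools.pod5_subset import parse_csv_mapping

targets = parse_csv_mapping(Path("mapping.csv"))
print(targets.collect())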

parse_source

parse_source(path: Path) -> LazyFrame

Read the read_ids available in a given pod5 file, returning a dataframe with the formatted read_ids and the source filename

parse_source_process

parse_source_process(paths: JoinableQueue, parsed_sources: Queue)

Parse sources until the paths queue is consumed

parse_sources

parse_sources(paths: Set[Path], threads: int = DEFAULT_THREADS) -> LazyFrame

Read all inputs and return a formatted lazy dataframe
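
A sketch (module path and input file names are assumptions):

# Sketch: collect the read_ids from a set of input pod5 files using worker processes.
from pathlib import Path

from pod5.tools.pod5_subset import parse_sources

sources = parse_sources({Path("input_1.pod5"), Path("input_2.pod5")}, threads=4)
print(sources.collect())  # read_ids with their source filenames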

parse_table_mapping

parse_table_mapping(
    summary_path: Path,
    filename_template: Optional[str],
    subset_columns: List[str],
    read_id_column: str = DEFAULT_READ_ID_COLUMN,
    ignore_incomplete_template: bool = False,
) -> LazyFrame

Parse a table using polars to create a mapping of output targets to read ids
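
A sketch grouping reads by a summary-table column (file name, column name and module path are assumptions):

# Sketch: build a targets mapping from a sequencing summary table,
# grouping reads by the selected columns and naming outputs with a template.
from pathlib import Path

from pod5.tools.pod5_subset import parse_table_mapping

targets = parse_table_mapping(
    summary_path=Path("sequencing_summary.txt"),
    filename_template="barcode-{barcode}.pod5",
    subset_columns=["barcode"],
)
print(targets.collect())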

process_subset_tasks

process_subset_tasks(queue: WorkQueue, process: int, duplicate_ok: bool)

Consumes work from the queue and launches subsetting tasks

resolve_output_targets

resolve_output_targets(targets: LazyFrame, output: Path) -> LazyFrame

Prepend the output path to the target filename and resolve the complete string
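
A sketch (module path, mapping file and output directory are assumptions):

# Sketch: prepend the output directory to each target filename and resolve it.
from pathlib import Path

from pod5.tools.pod5_subset import parse_csv_mapping, resolve_output_targets

targets = parse_csv_mapping(Path("mapping.csv"))
resolved_targets = resolve_output_targets(targets, Path("subset_output"))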

subset_pod5

subset_pod5(
    inputs: List[Path],
    output: Path,
    columns: List[str],
    csv: Optional[Path] = None,
    table: Optional[Path] = None,
    threads: int = DEFAULT_THREADS,
    template: str = "",
    read_id_column: str = DEFAULT_READ_ID_COLUMN,
    missing_ok: bool = False,
    duplicate_ok: bool = False,
    ignore_incomplete_template: bool = False,
    force_overwrite: bool = False,
    recursive: bool = False,
) -> Any

Prepare the subsetting mapping and run the repacker
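
A sketch of a typical high-level call, subsetting by a summary-table column (paths, the column name and the module path are assumptions):

# Sketch: subset two pod5 files by barcode using a sequencing summary table.
from pathlib import Path

from pod5.tools.pod5_subset import subset_pod5

subset_pod5(
    inputs=[Path("input_1.pod5"), Path("input_2.pod5")],
    output=Path("subset_output"),
    columns=["barcode"],
    table=Path("sequencing_summary.txt"),
    template="barcode-{barcode}.pod5",
    threads=4,
    missing_ok=True,
    force_overwrite=True,
)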

subset_pod5s_with_mapping

subset_pod5s_with_mapping(
    inputs: Set[Path],
    output: Path,
    targets: LazyFrame,
    threads: int = DEFAULT_THREADS,
    missing_ok: bool = False,
    duplicate_ok: bool = False,
    force_overwrite: bool = False,
) -> None

Given an iterable of input pod5 paths and an output directory, create output pod5 files containing the read_ids specified in the given mapping of output filename to set of read_id.
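
A sketch using an already-prepared targets mapping, e.g. from parse_csv_mapping (module path and file names are assumptions):

# Sketch: subset with a pre-built mapping of output filenames to read_ids.
from pathlib import Path

from pod5.tools.pod5_subset import parse_csv_mapping, subset_pod5s_with_mapping

targets = parse_csv_mapping(Path("mapping.csv"))

subset_pod5s_with_mapping(
    inputs={Path("input_1.pod5"), Path("input_2.pod5")},
    output=Path("subset_output"),
    targets=targets,
    threads=4,
    duplicate_ok=False,
    force_overwrite=True,
)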

subset_reads

subset_reads(dest: Path, sources: DataFrame, process: int, duplicate_ok: bool) -> None

Copy the reads in sources into a new pod5 file at dest