Pod5Dataset

The Pod5Dataset handles access to multiple files at a time. It allows for random access to reads from any contained file and file-wise iteration. It implements the following functions:

Function	Description
new	Initializes a new Pod5Dataset from multiple pod5 paths
files	Returns references to all contained pod5 files (&Pod5File)
get_file	Returns a reference to a specific Pod5File by its path used during initialization
get_file_mut	Returns a mutable reference to a specific Pod5File by its path used during initialization
get_file_by_index	Returns a reference to a specific Pod5File by its index in the path vector during initialization
get_file_by_index_mut	Returns a mutable reference to a specific Pod5File by its index in the path vector during initialization
get_read	Returns a read from any file in the dataset by its id
iter_files	Returns an iterator over references to all Pod5Files in the dataset
iter_files_mut	Returns an iterator over mutable references to all Pod5Files in the dataset
n_files	Returns the number of files contained in the dataset
n_reads	Returns the number of reads over all files in the dataset

The following example shows how to iterate over all reads of a dataset:

use std::path::PathBuf;
use pod5_reader_api::dataset::Pod5Dataset;

fn main() {
    let paths = vec![
        PathBuf::from("example_data/remora_example/can_reads.pod5"),
        // ...
    ];

    let mut pod5_dataset = Pod5Dataset::new(&paths).unwrap();

    for file in pod5_dataset.iter_files_mut() {
        for read_res in file.iter_reads().unwrap() {
            let read = read_res.unwrap();
            println!("{}", read.read_id());
        }
    }
}

Contained Pod5Files are accessible via the get_file, get_file_mut, get_file_by_index and get_file_by_index_mut functions. Alternatively, read information is directly accessible via the get_read and get_read_mut functions. The following example shows direct read access with get_read, followed by the equivalent lookup through a contained file:

use std::{path::PathBuf, str::FromStr};
use pod5_reader_api::dataset::Pod5Dataset;
use uuid::Uuid;

fn main() {
    let paths = vec![
        PathBuf::from("example_data/remora_example/can_reads.pod5"),
        // ...
    ];

    let mut pod5_dataset = Pod5Dataset::new(&paths).unwrap();
    let read_id = Uuid::from_str("fbf9c81c-fdb2-4b41-85e1-0a2bd8b5a138").unwrap();

    let pod5_read = pod5_dataset.get_read(&read_id).unwrap();
    println!("{}", pod5_read.read_id());

    // Equivalent, but more verbose: fetch the containing file first, then the read.
    let pod5_file = pod5_dataset.get_file_by_index_mut(0).unwrap();
    let pod5_read = pod5_file.get(&read_id).unwrap();
    println!("{}", pod5_read.read_id());
}

Just like with the Pod5File, retrieving read information requires mutable access, and is not thread-safe. Again, thread-safe access is provided by Pod5DatasetThreadSafe.

Pod5DatasetThreadSafe

Pod5DatasetThreadSafe works like Pod5Dataset, except that it allows random access to contained reads from multiple threads in parallel. The functions that return mutable references to contained files are not available here. The following functions are exclusive to Pod5DatasetThreadSafe:

Function	Description
get_file_thread_safe	Returns a Pod5FileThreadSafe by its path used during initialization
get_file_thread_safe_by_index	Returns a Pod5FileThreadSafe by its index in the path vector during initialization

Note that, in the current implementation, all file getter functions (get_file, get_file_by_index, get_file_thread_safe, get_file_thread_safe_by_index) construct the file from scratch. As such, they are fairly inefficient.

The main use case for Pod5DatasetThreadSafe is direct access to contained reads from multiple threads in parallel. The following example shows one way to do that:

use std::path::PathBuf;
use std::sync::Arc;
use pod5_reader_api::dataset::Pod5DatasetThreadSafe;
use rayon::current_thread_index;
use rayon::iter::{IntoParallelRefIterator, ParallelIterator};
use uuid::Uuid;

fn main() {
    let paths = vec![
        PathBuf::from("example_data/remora_example/can_reads.pod5"),
        // ...
    ];
    let n_workers = 4;

    let pod5_dataset = Arc::new(
        Pod5DatasetThreadSafe::new(&paths, n_workers).unwrap()
    );
    let read_ids: Vec<Uuid> = pod5_dataset.read_ids().clone();

    read_ids.par_iter().for_each(|read_id| {
        let pod5_dataset = Arc::clone(&pod5_dataset);
        let tid = current_thread_index().unwrap();

        let read = pod5_dataset.get_read(read_id).unwrap();
        println!(
            "Thread {} processed read {} with {} samples",
            tid,
            read.read_id(),
            read.require_num_samples().unwrap()
        );
    });
}

Pod5Dataset vs Pod5DatasetThreadSafe

The ThreadSafe implementations of Pod5File and Pod5Dataset should only be used when processing data in parallel. All linear operations are more efficient with the non-thread-safe implementations, due to less overhead and a much simpler implementation.

To showcase the differences in processing speed, I set up a quick-and-dirty benchmark on 25 GB of pod5 data.

The following approaches were tested:

- Random access with Pod5DatasetThreadSafe - 20 threads
- Random access with Pod5DatasetThreadSafe - 8 threads
- Random access with Pod5DatasetThreadSafe - 1 thread
- Random access with Pod5Dataset
- Read-wise iterator with Pod5Dataset

The data was split into different numbers of files to test whether fewer, larger files or more, smaller files are more efficient to read:

- 25 GB split into 3 files
- 25 GB split into 28 files
- 25 GB split into 2746 files

In all runs, each read was accessed once. Due to the internal caching of readers for different files, access in a truly random order is slower. To test how much slower, reads were accessed in both random and non-random order.
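
To build some intuition for why access order interacts with reader caching, here is a toy model. It is not the crate's implementation: the ReaderCache type, its least-recently-used eviction policy and the capacity of 4 are all assumptions made purely for illustration. The sketch only counts how often a reader would have to be (re)opened for a file-wise versus an interleaved access order.

```rust
use std::collections::VecDeque;

/// Hypothetical LRU cache of open file readers, identified by file index.
/// It only counts how often a reader has to be (re)opened.
struct ReaderCache {
    capacity: usize,
    open: VecDeque<usize>, // front = most recently used
    opens: usize,
}

impl ReaderCache {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "the cache must hold at least one reader");
        Self { capacity, open: VecDeque::new(), opens: 0 }
    }

    /// Simulates accessing one read that lives in `file`.
    fn access(&mut self, file: usize) {
        if let Some(pos) = self.open.iter().position(|&f| f == file) {
            // Cache hit: take the reader out so it can move to the front.
            let _ = self.open.remove(pos);
        } else {
            // Cache miss: open a reader, evicting the least recently used one if full.
            self.opens += 1;
            if self.open.len() == self.capacity {
                self.open.pop_back();
            }
        }
        self.open.push_front(file);
    }
}

/// Counts reader opens for a sequence of file indices (one entry per read access).
fn count_opens(order: &[usize], capacity: usize) -> usize {
    let mut cache = ReaderCache::new(capacity);
    for &file in order {
        cache.access(file);
    }
    cache.opens
}

fn main() {
    let n_files = 28;
    let reads_per_file = 100;

    // File-wise order: all reads of file 0, then all reads of file 1, and so on.
    let file_wise: Vec<usize> = (0..n_files)
        .flat_map(|f| std::iter::repeat(f).take(reads_per_file))
        .collect();

    // Interleaved order: consecutive reads always come from different files.
    let interleaved: Vec<usize> = (0..reads_per_file)
        .flat_map(|_| 0..n_files)
        .collect();

    let capacity = 4; // hypothetical number of cached readers
    println!("file-wise:   {} opens", count_opens(&file_wise, capacity));
    println!("interleaved: {} opens", count_opens(&interleaved, capacity));
}
```

In this model the file-wise order opens each file exactly once, while the interleaved order misses the cache on every access. How closely this mirrors the real readers depends on details of the crate that are not described here.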

Here are the times (minutes:seconds, with a decimal comma for fractions of a second) that were measured using the time command in bash:

Approach	3 files non-random	28 files non-random	2746 files non-random	3 files random	28 files random	2746 files random
thread-safe, 20 threads	00:31,9	00:17,6	00:18,3	01:06,5	01:10,2	01:08,2
thread-safe, 8 threads	00:34,0	00:33,0	00:29,8	01:07,8	01:10,3	01:06,8
thread-safe, 1 thread	03:31,6	03:19,7	03:18,2	07:25,1	05:58,1	05:23,3
Non thread-safe, random access	03:14,1	03:11,7	03:11,3	NA	NA	NA
Non thread-safe, iterative	01:28,6	01:27,1	01:28,8	NA	NA	NA
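
To compare the rows above more directly, the measured times can be converted to seconds and expressed as speedup factors. The following is a minimal sketch that assumes the times are written as minutes:seconds with a decimal comma. parse_time and speedup are hypothetical helper names, not part of the crate, and the values are copied from the "28 files non-random" column.

```rust
/// Parses a duration written as `mm:ss,d` (decimal comma), e.g. "03:31,6",
/// into seconds. Returns `None` if the string does not have that shape (e.g. "NA").
fn parse_time(s: &str) -> Option<f64> {
    let (minutes, seconds) = s.split_once(':')?;
    let minutes: f64 = minutes.parse().ok()?;
    let seconds: f64 = seconds.replace(',', ".").parse().ok()?;
    Some(minutes * 60.0 + seconds)
}

/// Returns how many times faster `candidate` is than `baseline`.
fn speedup(baseline: &str, candidate: &str) -> Option<f64> {
    Some(parse_time(baseline)? / parse_time(candidate)?)
}

fn main() {
    // Values copied from the "28 files non-random" column of the table.
    let rows = [
        ("thread-safe, 20 threads", "00:17,6"),
        ("thread-safe, 8 threads", "00:33,0"),
        ("thread-safe, 1 thread", "03:19,7"),
        ("non-thread-safe, random access", "03:11,7"),
        ("non-thread-safe, iterative", "01:27,1"),
    ];

    // Baseline: random access with the non-thread-safe Pod5Dataset.
    let baseline = "03:11,7";

    for (name, time) in rows {
        let factor = speedup(baseline, time).unwrap();
        println!("{name}: {factor:.2}x relative to non-thread-safe random access");
    }
}
```

For this column, 20 threads come out at roughly 10.9x and the read-wise iterator at roughly 2.2x relative to single-threaded random access, while the thread-safe variant on one thread is slightly slower than the baseline.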