Output formats and shapes

Different analyses benefit from different output shapes:

Format	Best for	Structure
Melted (long)	Visualization (R ggplot2, Python seaborn)	One row per base
Exploded (wide)	Machine learning (clustering, dimensionality reduction, etc.)	One row per region, columns expanded
Nested	Hierarchical data storage (Parquet) for other downstream processes	One row per region, arrays per field

1. Melted (Long)

Each base of interest becomes one row.

With N statistics computed per base:

Column	Description
`read_id`	Unique read ID
`start_index_on_read`	Index of first base on the read (0-based)
`region_of_interest`	Region name
`base_index`	Position within region
`base`	Base character
`<STAT-1>` ... `<STAT-N>`	Computed statistics for this base
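As an illustration of why the long shape suits plotting, the sketch below averages one statistic per position across reads, which is the kind of grouping a plotting library applies when drawing a per-position summary. The rows, read IDs, and the statistic name `mean_current` are invented for the example; real statistic columns depend on which statistics were requested.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical melted rows, one dict per base. "mean_current" stands in for
# one of the <STAT-1> ... <STAT-N> columns; all values are invented.
melted_rows = [
    {"read_id": "read-a", "start_index_on_read": 10, "region_of_interest": "roi1",
     "base_index": 0, "base": "A", "mean_current": 80.0},
    {"read_id": "read-a", "start_index_on_read": 10, "region_of_interest": "roi1",
     "base_index": 1, "base": "C", "mean_current": 95.0},
    {"read_id": "read-b", "start_index_on_read": 42, "region_of_interest": "roi1",
     "base_index": 0, "base": "A", "mean_current": 84.0},
    {"read_id": "read-b", "start_index_on_read": 42, "region_of_interest": "roi1",
     "base_index": 1, "base": "C", "mean_current": 91.0},
]


def mean_by_position(rows, stat):
    """Average `stat` over all reads for each (region, base_index) pair."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["region_of_interest"], row["base_index"])
        groups[key].append(row[stat])
    return {key: mean(values) for key, values in groups.items()}


per_position = mean_by_position(melted_rows, "mean_current")
```

Because every base is its own row, the grouping needs no knowledge of the region length.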

With an interpolation target size of T:

Column	Description
`read_id`	Unique read ID
`start_index_on_read`	Index of first base on the read (0-based)
`region_of_interest`	Region name
`base_index`	Position within region
`base`	Base character
`signal_0` ... `signal_(T-1)`	Interpolated signal values
`dwell`	Dwell value for the base
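The interpolated layout can be sketched as a fixed set of leading columns followed by T signal columns and `dwell`. The helper names and sample values below are hypothetical; only the column names are taken from the table above, and the column order is assumed to follow it.

```python
def melted_signal_header(target_size):
    """Column names of the melted layout for an interpolation target size T."""
    fixed = ["read_id", "start_index_on_read", "region_of_interest",
             "base_index", "base"]
    signal = [f"signal_{i}" for i in range(target_size)]
    return fixed + signal + ["dwell"]


def melted_signal_row(read_id, start, region, base_index, base, signal, dwell):
    """Build one melted row (a dict) from a single base's interpolated signal."""
    row = {
        "read_id": read_id,
        "start_index_on_read": start,
        "region_of_interest": region,
        "base_index": base_index,
        "base": base,
    }
    # One column per interpolated sample: signal_0 ... signal_(T-1).
    row.update({f"signal_{i}": value for i, value in enumerate(signal)})
    row["dwell"] = dwell
    return row


header = melted_signal_header(3)
row = melted_signal_row("read-a", 10, "roi1", 0, "A", [0.1, 0.2, 0.3], 12)
```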

2. Exploded (Wide)

Each read–region pair becomes one row, and every value for every base appears as its own column. This requires all regions to have the same length.

For regions of length M and N statistics, each row carries M × N statistic columns, one per base per statistic.
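A minimal sketch of how per-base rows could be pivoted into this wide shape, and why equal region lengths are required. The column naming used here (`base_<i>` and `<stat>_<i>`) is an assumption made for illustration, not necessarily the tool's actual naming, and the statistic names and values are invented.

```python
def explode(rows, stats):
    """Pivot melted rows into one wide row per read-region pair.

    Column names such as "mean_current_0" are an assumed convention.
    """
    wide = {}
    for row in rows:
        key = (row["read_id"], row["start_index_on_read"], row["region_of_interest"])
        out = wide.setdefault(key, {
            "read_id": key[0],
            "start_index_on_read": key[1],
            "region_of_interest": key[2],
        })
        i = row["base_index"]
        out[f"base_{i}"] = row["base"]
        for stat in stats:
            out[f"{stat}_{i}"] = row[stat]
    result = list(wide.values())
    # A wide table needs identical columns in every row, which only holds
    # when every region has the same length M.
    if len({frozenset(r) for r in result}) > 1:
        raise ValueError("regions differ in length; cannot build a wide table")
    return result


def _row(read_id, start, index, base, mean_current, std_current):
    """Helper that builds one invented melted row for the example."""
    return {"read_id": read_id, "start_index_on_read": start,
            "region_of_interest": "roi1", "base_index": index, "base": base,
            "mean_current": mean_current, "std_current": std_current}


sample_rows = [
    _row("read-a", 10, 0, "A", 80.0, 1.5),
    _row("read-a", 10, 1, "C", 95.0, 2.0),
    _row("read-b", 42, 0, "A", 84.0, 1.0),
    _row("read-b", 42, 1, "C", 91.0, 2.5),
]
wide_rows = explode(sample_rows, ["mean_current", "std_current"])
```

With M = 2 and N = 2, each wide row holds the three identifying columns, two base columns, and 2 × 2 statistic columns.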

3. Nested

Each row represents one read–region pair. Fields store lists or 2D arrays rather than scalar values.

With N statistics computed per base:

Column	Description
`read_id`	Unique read ID
`start_index_on_read`	Index of first base on the read (0-based)
`region_of_interest`	Region name
`bases`	Base sequence (string; length = current region length)
`<STAT-1>` ... `<STAT-N>`	Lists of per-base statistic values (length = current region length)
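The nested and melted shapes carry the same information, so one can be derived from the other. The sketch below expands a nested record into melted rows; the function name, the statistic name `mean_current`, and the sample values are hypothetical, and `base_index` is assumed to count from 0.

```python
def melt_nested(record, stats):
    """Expand one nested record into melted rows, one per base."""
    bases = record["bases"]
    for stat in stats:
        # Every per-base list must match the region length.
        if len(record[stat]) != len(bases):
            raise ValueError(
                f"{stat!r} has {len(record[stat])} values for {len(bases)} bases"
            )
    return [
        {
            "read_id": record["read_id"],
            "start_index_on_read": record["start_index_on_read"],
            "region_of_interest": record["region_of_interest"],
            "base_index": i,  # assumed 0-based position within the region
            "base": base,
            **{stat: record[stat][i] for stat in stats},
        }
        for i, base in enumerate(bases)
    ]


nested_record = {
    "read_id": "read-a",
    "start_index_on_read": 10,
    "region_of_interest": "roi1",
    "bases": "ACG",
    "mean_current": [80.0, 95.0, 70.5],  # invented values
}
melted = melt_nested(nested_record, ["mean_current"])
```

Unlike the exploded shape, this works for regions of any length, since each record is expanded independently.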

With all regions of interest of length M and an interpolation target size of T:

Column	Description
`read_id`	Unique read ID
`start_index_on_read`	Index of first base on the read (0-based)
`region_of_interest`	Region name
`bases`	Base sequence (string of length M)
`signal`	2D array of shape (M × T): interpolated signal for each base
`dwell`	List of M dwell values (one per base)
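A consumer of nested records may want to confirm these shapes before further processing. The following sketch checks one record against the table above; the function name and sample values are hypothetical, and the 2D array is represented as a list of lists for simplicity.

```python
def check_nested_signal(record, target_size):
    """Validate the shapes of one nested record and return (M, T)."""
    m = len(record["bases"])
    signal = record["signal"]
    if len(signal) != m:
        raise ValueError(f"signal has {len(signal)} rows, expected {m}")
    if any(len(samples) != target_size for samples in signal):
        raise ValueError(f"every signal row must have {target_size} samples")
    if len(record["dwell"]) != m:
        raise ValueError(f"dwell has {len(record['dwell'])} values, expected {m}")
    return (m, target_size)


nested_signal_record = {
    "read_id": "read-a",
    "start_index_on_read": 10,
    "region_of_interest": "roi1",
    "bases": "ACG",
    "signal": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # M = 3 rows, T = 2 samples
    "dwell": [12, 7, 9],
}
shape = check_nested_signal(nested_signal_record, target_size=2)
```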