Output formats and shapes
Different analyses benefit from different output shapes:
| Format | Best for | Structure |
|---|---|---|
| Melted (long) | Visualization (R ggplot2, Python seaborn) | One row per base |
| Exploded (wide) | Machine learning (clustering, dimensionality reduction, etc.) | One row per region, columns expanded |
| Nested | Hierarchical data storage (Parquet) for other downstream processes | One row per region, arrays per field |
1. Melted (Long)
Each base of interest becomes one row.
Base-wise statistics
For N statistics:
| Column | Description |
|---|---|
| `read_id` | Unique read ID |
| `start_index_on_read` | Index of first base on the read (0-based) |
| `region_of_interest` | Region name |
| `base_index` | Position within region |
| `base` | Base character |
| `<STAT-1>` ... `<STAT-N>` | Computed statistics for this base |
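
As a rough illustration of the melted layout, the sketch below turns one read–region pair into one row per base. The input record, the statistic names `mean` and `std`, the helper `melt`, and the 0-based `base_index` are all assumptions made for this example; they are not part of the tool's API.

```python
# Hypothetical input: one read-region pair with per-base statistic lists.
record = {
    "read_id": "read-001",
    "start_index_on_read": 120,
    "region_of_interest": "region_A",
    "bases": "ACG",
    "mean": [0.51, 0.47, 0.62],  # hypothetical statistic
    "std": [0.05, 0.04, 0.07],   # hypothetical statistic
}


def melt(record, stat_names):
    """Return one row (dict) per base in the region."""
    rows = []
    for i, base in enumerate(record["bases"]):
        row = {
            "read_id": record["read_id"],
            "start_index_on_read": record["start_index_on_read"],
            "region_of_interest": record["region_of_interest"],
            "base_index": i,  # assumed 0-based position within the region
            "base": base,
        }
        # One column per statistic, holding the value for this base only.
        for name in stat_names:
            row[name] = record[name][i]
        rows.append(row)
    return rows


melted_rows = melt(record, ["mean", "std"])
for row in melted_rows:
    print(row)
```

Each resulting row repeats the read-level fields, which is what plotting libraries that expect long-form data typically need.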
Interpolation
For target size T:
| Column | Description |
|---|---|
| `read_id` | Unique read ID |
| `start_index_on_read` | Index of first base on the read (0-based) |
| `region_of_interest` | Region name |
| `base_index` | Position within region |
| `base` | Base character |
| `signal_0` ... `signal_(T-1)` | Interpolated signal values |
| `dwell` | Dwell value for the base |
2. Exploded (Wide)
Each region–read pair becomes one row. All values for all bases appear as separate columns. (Requires all regions to have the same length.)
Base-wise statistics
For regions of length M and N statistics:
| Column | Description |
|---|---|
| `read_id` | Unique read ID |
| `start_index_on_read` | Index of first base (0-based) |
| `region_of_interest` | Region name |
| `base_0` ... `base_(M-1)` | Bases in region |
| `<STAT-1>_0` ... `<STAT-N>_(M-1)` | Per-base statistics |
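
To make the wide naming scheme concrete, here is a minimal sketch that lists the column names for a given region length and set of statistics. The helper `exploded_columns`, the statistic names, and the column order (grouped by statistic) are assumptions for illustration; the actual output order may differ.

```python
def exploded_columns(region_length, stat_names):
    """List exploded (wide) column names for base-wise statistics."""
    cols = ["read_id", "start_index_on_read", "region_of_interest"]
    # One column per base position in the region.
    cols += [f"base_{i}" for i in range(region_length)]
    # One column per (statistic, base position) pair; order is assumed.
    for name in stat_names:
        cols += [f"{name}_{i}" for i in range(region_length)]
    return cols


columns = exploded_columns(3, ["mean", "std"])  # hypothetical statistics
print(len(columns))  # 3 fixed + M base columns + N * M statistic columns
print(columns)
```

With M = 3 and N = 2 this gives 3 + 3 + 6 = 12 columns, which shows why the wide format needs every region to have the same length.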
Interpolation
For regions of length M and target size T:
| Column | Description |
|---|---|
| `read_id` | Unique read ID |
| `start_index_on_read` | Index of first base (0-based) |
| `region_of_interest` | Region name |
| `base_0` ... `base_(M-1)` | Bases in region |
| `signal_base0_0` ... `signal_base(M-1)_(T-1)` | Interpolated signals |
| `dwell_0` ... `dwell_(M-1)` | Per-base dwell times |
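
The following sketch shows how per-base interpolated signals could map onto the wide column names above. The helper `explode_signal` and the sample values are hypothetical; only the column naming pattern is taken from the table.

```python
def explode_signal(signal, dwell):
    """Flatten an M x T signal (list of lists) and M dwell values into one wide row."""
    row = {}
    for base_idx, per_base in enumerate(signal):
        for sample_idx, value in enumerate(per_base):
            row[f"signal_base{base_idx}_{sample_idx}"] = value
    for base_idx, value in enumerate(dwell):
        row[f"dwell_{base_idx}"] = value
    return row


# Hypothetical region with M = 2 bases and target size T = 3.
signal = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
dwell = [12, 9]
wide_row = explode_signal(signal, dwell)
print(sorted(wide_row))
```

The row holds M × T signal columns plus M dwell columns, so the width grows quickly with the target size.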
3. Nested (Parquet only)
Each row represents one read–region pair. Fields store lists or 2D arrays.
Base-wise statistics
| Column | Description |
|---|---|
| `read_id` | Unique read ID |
| `start_index_on_read` | Index of first base (0-based) |
| `region_of_interest` | Region name |
| `bases` | Base sequence (string; length = current region length) |
| `<STAT-1>` ... `<STAT-N>` | Lists of per-base statistic values (length = current region length) |
Interpolation
With all regions of interest of length M and an interpolation target size of T:
| Column | Description |
|---|---|
| `read_id` | Unique read ID |
| `start_index_on_read` | Index of first base (0-based) |
| `region_of_interest` | Region name |
| `bases` | Base sequence (string) |
| `signal` | 2D array of shape (M × T): interpolated signal for each base |
| `dwell` | List of M dwell values (one per base) |
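
When consuming nested rows downstream, it can help to confirm that the shapes agree with M and T before further processing. The sketch below is one possible check on a row represented as a plain dict of Python lists; the helper `check_nested_row` and the sample row are assumptions for illustration, not part of the tool.

```python
def check_nested_row(row, region_length, target_size):
    """Raise ValueError if a nested interpolation row has inconsistent shapes."""
    if len(row["bases"]) != region_length:
        raise ValueError("bases length does not match region length M")
    if len(row["dwell"]) != region_length:
        raise ValueError("dwell length does not match region length M")
    if len(row["signal"]) != region_length:
        raise ValueError("signal must have one entry per base (M rows)")
    for per_base in row["signal"]:
        if len(per_base) != target_size:
            raise ValueError("each signal row must have T interpolated values")
    return True


# Hypothetical nested row with M = 2 and T = 3.
nested_row = {
    "read_id": "read-001",
    "start_index_on_read": 120,
    "region_of_interest": "region_A",
    "bases": "AC",
    "signal": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "dwell": [12, 9],
}
is_valid = check_nested_row(nested_row, region_length=2, target_size=3)
print(is_valid)
```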