Examples

General examples

Example 1: Reference-to-signal with positions of interest (no signal in input table)

In this example, the alignment file does not contain raw signal data, so the corresponding POD5 input must be provided. We extract the mean, standard deviation, and dwell time of the signal around given reference positions and output a melted TSV.

fishnet reformat \
  --alignment alignments_ref.parquet \
  --pod5 /data/pod5_runs/run1 /data/pod5_runs/run2 \
  --positions-of-interest chr1:100000-10 chr2:250000-15 \
  --strategy stats \
  --stats mean std dwell \
  --out ref_positions_stats.tsv \
  --output-shape melted \
  --threads 8 \
  --force-overwrite

Explanation:
- --alignment provides reference-to-signal mappings.
- --pod5 supplies raw signal data (since it’s missing in the alignment file).
- --positions-of-interest defines windows around base positions (±10 and ±15 bases).
- The stats strategy calculates per-base signal statistics.
- Output is written as a melted TSV table, one row per base.
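To make the per-base statistics concrete, here is a minimal Python sketch of what mean, std, and dwell could look like for a single base's signal chunk. The function name per_base_stats is hypothetical and not part of fishnet. Two assumptions are made because the text above does not specify them: std is the population standard deviation, and dwell is the number of raw samples assigned to the base.

```python
import math


def per_base_stats(chunk):
    """Return mean, std, and dwell for one base's raw signal chunk."""
    n = len(chunk)
    if n == 0:
        raise ValueError("empty signal chunk")
    mean = sum(chunk) / n
    # Assumption: population standard deviation (divide by n, not n - 1).
    std = math.sqrt(sum((x - mean) ** 2 for x in chunk) / n)
    # Assumption: dwell is reported as the number of samples for this base.
    return {"mean": mean, "std": std, "dwell": n}


stats = per_base_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(stats)
```

In the melted output, each such dictionary would correspond to the statistic columns of one row.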

Example 2: Query-to-signal with motif filtering and interpolation

Here, the alignment file already contains raw signal and includes both reference and query alignments. We select the query alignment, filter by motifs from a FASTA file, and interpolate the signal to a uniform length of 50. The result is stored as a nested Parquet file.

fishnet reformat \
  --alignment alignments_query_signal.parquet \
  --alignment-type query \
  --motifs-file motifs.fasta \
  --strategy interpolate \
  --target-size 50 \
  --out interpolated_query_signal.parquet \
  --output-shape nested \
  --threads 8 \
  --force-overwrite

Explanation:
- --alignment-type query selects the query-to-signal mappings.
- --motifs-file loads motifs (e.g., ATGCGT, TTTAAA) from a FASTA file.
- --strategy interpolate with --target-size 50 creates uniformly sized signal vectors (50 samples per base).
- nested output preserves per-base signal arrays in Parquet, which is ideal for machine learning input.

Detailed (minimal) processing example

The following example shows what gets calculated and how it is written to file with different output settings. We'll use the following setup:
- Reference-to-signal alignment of two reads:
  1. readA maps to chr1:3-8
  2. readB maps to chr1:4-14
- Reference regions of interest:
  - chr1:5-6
  - chr1:12-13
- For base-wise stats, mean and dwell are used
- For interpolation, a target size of 3 is used

0-based index:          0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
1-based index:          1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
Ref sequence:           A C G T|A G|C T A A A|G T|C T
                               |   |         |   |
readA:                      G T|A G|C T      |   |
readB:                        T|A G|C T A A A|G T|C
                               |   |         |   |
                               |   |         |   |
regions of interest:          chr1:5-6     chr1:12-13

Base-wise stats

fishnet reformat \
  --alignment [...] \
  --ref-regions "chr1:5-6" "chr1:12-13" \
  --strategy "stats" \
  --stats "mean" "dwell" \
  --out [...] \
  --output-shape [...]

Here, mean and dwell are the chosen stats. Both statistics are therefore calculated for readA at the 5th and 6th reference base, and for readB at the 5th, 6th, 12th and 13th reference base. In the tables below, placeholders such as mA5 and dA5 stand for the mean and dwell of readA at reference base 5.
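The selection of reads per region can be sketched in a few lines of Python. This is an illustration under stated assumptions, not fishnet's implementation: coordinates are treated as 1-based and inclusive (as in the diagram), the alignment is assumed to be gap-free and on the forward strand, and a read is only reported if it covers the whole region. The helper names parse_region and start_index_on_read are hypothetical.

```python
def parse_region(region):
    """Split 'chr1:5-6' into ('chr1', 5, 6); coordinates are 1-based, inclusive."""
    contig, span = region.rsplit(":", 1)
    start, end = span.split("-")
    return contig, int(start), int(end)


def start_index_on_read(read_span, region):
    """0-based offset of the region's first base within the read.

    Returns None if the read does not cover the whole region.
    Assumption: gap-free, forward-strand alignment.
    """
    r_contig, r_start, r_end = parse_region(read_span)
    q_contig, q_start, q_end = parse_region(region)
    if r_contig != q_contig or q_start < r_start or q_end > r_end:
        return None
    return q_start - r_start


reads = {"readA": "chr1:3-8", "readB": "chr1:4-14"}
regions = ["chr1:5-6", "chr1:12-13"]

hits = []
for read_id, span in reads.items():
    for region in regions:
        offset = start_index_on_read(span, region)
        if offset is not None:
            hits.append((read_id, region, offset))

for hit in hits:
    print(hit)
```

readA yields no entry for chr1:12-13 because it ends at position 8, which is why only readB appears for that region in the outputs.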

The melted output would look like this:

read_id  start_index_on_read  region_of_interest  base_index  base  mean  dwell
readA    2                    chr1:5-6            0           A     mA5   dA5
readA    2                    chr1:5-6            1           G     mA6   dA6
readB    1                    chr1:5-6            0           A     mB5   dB5
readB    1                    chr1:5-6            1           G     mB6   dB6
readB    8                    chr1:12-13          0           G     mB12  dB12
readB    8                    chr1:12-13          1           T     mB13  dB13

The exploded format would look like this:

read_id  start_index_on_read  region_of_interest  base_0  base_1  mean_0  mean_1  dwell_0  dwell_1
readA    2                    chr1:5-6            A       G       mA5     mA6     dA5      dA6
readB    1                    chr1:5-6            A       G       mB5     mB6     dB5      dB6
readB    8                    chr1:12-13          G       T       mB12    mB13    dB12     dB13

The nested format would look like this:

read_id  start_index_on_read  region_of_interest  bases  mean          dwell
readA    2                    chr1:5-6            AG     [mA5, mA6]    [dA5, dA6]
readB    1                    chr1:5-6            AG     [mB5, mB6]    [dB5, dB6]
readB    8                    chr1:12-13          GT     [mB12, mB13]  [dB12, dB13]
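The three shapes carry the same information, which a short Python sketch can show by converting melted rows into nested and exploded rows. The functions to_nested and to_exploded are hypothetical helpers written for this illustration; they only mirror the column layout of the tables above and make no claim about how fishnet builds its output. The placeholder strings (mA5, dA5, ...) stand in for real numeric values.

```python
from collections import defaultdict

# Melted rows: (read_id, start_index_on_read, region_of_interest,
#               base_index, base, mean, dwell)
melted = [
    ("readA", 2, "chr1:5-6", 0, "A", "mA5", "dA5"),
    ("readA", 2, "chr1:5-6", 1, "G", "mA6", "dA6"),
    ("readB", 1, "chr1:5-6", 0, "A", "mB5", "dB5"),
    ("readB", 1, "chr1:5-6", 1, "G", "mB6", "dB6"),
    ("readB", 8, "chr1:12-13", 0, "G", "mB12", "dB12"),
    ("readB", 8, "chr1:12-13", 1, "T", "mB13", "dB13"),
]

KEYS = ("read_id", "start_index_on_read", "region_of_interest")


def to_nested(rows):
    """Group melted rows into one row per (read, region) with list columns."""
    groups = defaultdict(list)
    for read_id, start, region, base_index, base, mean, dwell in rows:
        groups[(read_id, start, region)].append((base_index, base, mean, dwell))
    nested = []
    for key, bases in groups.items():
        bases.sort()  # order by base_index
        row = dict(zip(KEYS, key))
        row["bases"] = "".join(b for _, b, _, _ in bases)
        row["mean"] = [m for _, _, m, _ in bases]
        row["dwell"] = [d for _, _, _, d in bases]
        nested.append(row)
    return nested


def to_exploded(nested_rows):
    """Spread each list column into numbered scalar columns."""
    exploded = []
    for row in nested_rows:
        flat = {k: row[k] for k in KEYS}
        for i, base in enumerate(row["bases"]):
            flat[f"base_{i}"] = base
        for name in ("mean", "dwell"):
            for i, value in enumerate(row[name]):
                flat[f"{name}_{i}"] = value
        exploded.append(flat)
    return exploded


nested = to_nested(melted)
exploded = to_exploded(nested)
print(nested[2])
print(exploded[2])
```

Melted is the easiest shape to filter and plot per base, while nested and exploded keep one row per read and region.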

Interpolation

fishnet reformat \
  --alignment [...] \
  --ref-regions "chr1:5-6" "chr1:12-13" \
  --strategy "interpolate" \
  --target-size 3 \
  --out [...] \
  --output-shape [...]

Here, interpolation uses a target size of 3. This yields an interpolated signal for readA at the 5th and 6th reference base, and for readB at the 5th, 6th, 12th and 13th reference base.

Here is a diagram to show what the data would look like:

Raw per-base signal chunks (variable lengths):

  readA
    base 5 →  [ . . . . . ]                   (5 measurements)
    base 6 →  [ . . . . . . . . . . . ]       (11 measurements)

  readB
    base 5  → [ . . . . ]                     (4 measurements)
    base 6  → [ . . . . . . . . . . . . . ]   (13 measurements)
    base 12 → [ . . . . . . . . . . ]         (10 measurements)
    base 13 → [ . . . . . . . ]               (7 measurements)


After interpolation to target size = 3:

  readA
    base 5  → [ sA5_0  sA5_1  sA5_2 ]         (3 measurements)
    base 6  → [ sA6_0  sA6_1  sA6_2 ]         (3 measurements)

  readB
    base 5  → [ sB5_0  sB5_1  sB5_2 ]         (3 measurements)
    base 6  → [ sB6_0  sB6_1  sB6_2 ]         (3 measurements)
    base 12 → [ sB12_0 sB12_1 sB12_2 ]        (3 measurements)
    base 13 → [ sB13_0 sB13_1 sB13_2 ]        (3 measurements)
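The resampling step in the diagram can be illustrated with a small Python sketch. The interpolation method used by fishnet is not stated here, so this example assumes plain linear interpolation at evenly spaced positions that include both endpoints; the function name interpolate is hypothetical.

```python
def interpolate(chunk, target_size):
    """Linearly resample a signal chunk to target_size evenly spaced points.

    Assumption: first and last samples are kept, points in between are
    linearly interpolated from their two nearest neighbours.
    """
    n = len(chunk)
    if n == 0:
        raise ValueError("empty signal chunk")
    if target_size < 2:
        raise ValueError("target_size must be at least 2 in this sketch")
    out = []
    for i in range(target_size):
        pos = i * (n - 1) / (target_size - 1)  # fractional index into chunk
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(chunk[lo] * (1 - frac) + chunk[hi] * frac)
    return out


# A chunk of 5 measurements (like readA, base 5) reduced to 3 values.
print(interpolate([10.0, 20.0, 30.0, 40.0, 50.0], 3))
# A chunk of 4 measurements (like readB, base 5) reduced to 3 values.
print(interpolate([0.0, 3.0, 6.0, 9.0], 3))
```

Whatever the original chunk length, every base ends up with exactly target_size values, which is what makes the rows of the outputs uniform.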

The melted output would look like this:

read_id  start_index_on_read  region_of_interest  base_index  base  signal_0  signal_1  signal_2  dwell
readA    2                    chr1:5-6            0           A     sA5_0     sA5_1     sA5_2     dA5
readA    2                    chr1:5-6            1           G     sA6_0     sA6_1     sA6_2     dA6
readB    1                    chr1:5-6            0           A     sB5_0     sB5_1     sB5_2     dB5
readB    1                    chr1:5-6            1           G     sB6_0     sB6_1     sB6_2     dB6
readB    8                    chr1:12-13          0           G     sB12_0    sB12_1    sB12_2    dB12
readB    8                    chr1:12-13          1           T     sB13_0    sB13_1    sB13_2    dB13

The exploded format would look like this:

read_id  start_index_on_read  region_of_interest  base_0  base_1  signal_base0_0  signal_base0_1  signal_base0_2  signal_base1_0  signal_base1_1  signal_base1_2  dwell_0  dwell_1
readA    2                    chr1:5-6            A       G       sA5_0           sA5_1           sA5_2           sA6_0           sA6_1           sA6_2           dA5      dA6
readB    1                    chr1:5-6            A       G       sB5_0           sB5_1           sB5_2           sB6_0           sB6_1           sB6_2           dB5      dB6
readB    8                    chr1:12-13          G       T       sB12_0          sB12_1          sB12_2          sB13_0          sB13_1          sB13_2          dB12     dB13

The nested format would look like this:

read_id  start_index_on_read  region_of_interest  bases  signal                                                dwell
readA    2                    chr1:5-6            AG     [[sA5_0, sA5_1, sA5_2], [sA6_0, sA6_1, sA6_2]]        [dA5, dA6]
readB    1                    chr1:5-6            AG     [[sB5_0, sB5_1, sB5_2], [sB6_0, sB6_1, sB6_2]]        [dB5, dB6]
readB    8                    chr1:12-13          GT     [[sB12_0, sB12_1, sB12_2], [sB13_0, sB13_1, sB13_2]]  [dB12, dB13]
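For the interpolated signal, the nested and exploded shapes differ only in how the two-level signal array is laid out. The sketch below flattens one nested row into the signal_base<i>_<j> columns shown in the exploded table; explode_signal is a hypothetical helper for illustration, and the placeholder strings stand in for real values.

```python
def explode_signal(row):
    """Flatten a nested per-base signal array into numbered scalar columns."""
    flat = {}
    for i, base_signal in enumerate(row["signal"]):
        for j, value in enumerate(base_signal):
            # Column name pattern taken from the exploded table above.
            flat[f"signal_base{i}_{j}"] = value
    return flat


nested_row = {
    "read_id": "readA",
    "bases": "AG",
    "signal": [["sA5_0", "sA5_1", "sA5_2"], ["sA6_0", "sA6_1", "sA6_2"]],
}

columns = explode_signal(nested_row)
print(columns)
```

Because every base has the same number of interpolated values, each row produces the same set of columns (number of bases times target size).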