Configurations

When doing large-scale analysis, you might want to collect all your processing settings (e.g. those for event-finding, plotting, step detection) into one file.

PoreFlow supports a TOML file used to store and maintain parameters. This document explains when and how to use configurations, provides usage examples, and details all available settings.

When to use configurations

It is recommended to use configurations when your use case has or requires:

  • Reproducible analyses: You need to document and reproduce exact parameter values across multiple runs or share analyses with colleagues
  • Batch processing: Running the same analysis on multiple files with consistent parameters
  • Parameter tuning: Experimenting with different parameter combinations while tracking which settings produced which results
  • Team collaboration: Multiple people need to use the same analysis parameters

When not to use configurations

Consider not using configurations when your use case has or requires:

  • One-off analyses: Simple, exploratory (initial) analyses where you're testing a single file with default parameters
  • Quick scripts: Short scripts with few parameters
  • Highly dynamic parameters: When parameters need to be computed at runtime based on input data characteristics

Configurations in scripts versus poreFlow dashboard

The poreFlow Python module and dashboard have different use-cases, and thus treat configurations a little differently. Their approach to configurations is summarized below:

  • Python Module


    The Python module is intended as building blocks for your own analysis. This means:

    • Your code reads the configuration file.
    • Your code decides which functions use which parts of the configuration file.

    While this is a lot more flexible, it requires some more set-up on your end. Note that poreFlow has been designed around this use case, which makes things easier. For usage examples, check out:

    Code examples

    Note

    As mentioned in the previous section, consider forgoing configurations for one-offs and quick analysis.

  • Dashboard


    The dashboard uses configurations throughout:

    • Settings inside the dashboard can be stored to a configuration file.
    • A configuration file can be loaded to use saved settings in a new session.

    To learn more, check out the Dashboard features pages:

    Dashboard features

TOML Language

Configuration files use TOML (Tom's Obvious, Minimal Language), a configuration file format that is designed to be easy to read. Advantages of TOML are:

  • Human-readable: Clear syntax with sections (tables) and key-value pairs
  • Types: Supports strings, integers, floats, booleans, arrays, and nested structures
  • Widely supported: In Python via the tomllib module (standard library from Python 3.11 onward, or tomli for older versions)
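
As a minimal sketch of how these types arrive in Python, the snippet below parses a small TOML string. The version check reflects that tomllib is part of the standard library from Python 3.11 onward, while the third-party tomli package offers the same interface on older versions. The TOML text is an illustrative fragment built from keys used elsewhere in this document, not a complete poreFlow configuration.

```python
import sys

if sys.version_info >= (3, 11):
    import tomllib
else:
    import tomli as tomllib  # third-party backport with the same API

# Illustrative fragment; not a complete configuration file.
TOML_TEXT = """
[event_finding]
open_state_range = [200, 300]
min_frac_os = 0.01
closing_iterations = 10

[channels_voltage_page]
show_voltage = true
"""

config = tomllib.loads(TOML_TEXT)

# Each TOML table becomes a dict and each value a native Python type.
for section, settings in config.items():
    for key, value in settings.items():
        print(f"{section}.{key} = {value!r} ({type(value).__name__})")
```

Because the values are already typed, they can be passed on to analysis functions without manual conversion from strings.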

A snippet of a configuration in the TOML language containing event finding parameters:

poreflow.toml
[event_finding]
open_state_range = [
  200,
  300,
]
voltage_range = [
  175,
  185,
]
closing_iterations = 10
boundary_trim = [
  1,
  1,
]
n_components = 3
degree = 2
min_frac_os = 0.01
min_duration = 1

[step_finding]
sensitivity = 3.0
min_level_length = 10

[event_filtering]
min_duration = 1

Spacing and lists in TOML files

Note that some parameters, such as open_state_range, are lists. TOML formatting is quite flexible. In this example, the lists are written over multiple lines:

open_state_range = [
  200,
  300,
]

A more compact notation is also possible:

open_state_range = [200, 300]

Use whatever you prefer.
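
To confirm that the two notations are interchangeable, the short check below parses both and compares the results. It is only a sketch using the standard tomllib module (with tomli as a fallback on Python versions before 3.11).

```python
try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport for older Python versions

multi_line = tomllib.loads("""
open_state_range = [
  200,
  300,
]
""")
single_line = tomllib.loads("open_state_range = [200, 300]")

# Both notations parse to the same Python dict.
print(multi_line == single_line)
```

The trailing comma after the last element is allowed in TOML arrays, so the multi-line form can be extended without touching the previous line.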

Full example

Shown below is an example of a TOML file with all possible parameters. The next section describes the meaning and usage of each parameter.

poreflow.toml
[input]
name = "My File"
folder_path = "my_project/"
file_name = "my_measurement.fast5"
resample_to_freq = 5000
filter_freq = 1000

[output]
path = "./"

[env]
processes = 2
verbose = 0

[plot]
out = "./Figures"
events_current_range = [
  0,
  150,
]

[event_finding]
open_state_range = [
  200,
  300,
]
voltage_range = [
  175,
  185,
]
closing_iterations = 10
boundary_trim = [
  1,
  1,
]
n_components = 3
degree = 2
min_frac_os = 0.01
min_duration = 1

[step_finding]
sensitivity = 3.0
min_level_length = 10

[event_filtering]
min_duration = 1
min_n_steps = 45
max_n_steps = 800
min_binned_entropy_of_means = 2.5
min_step_rate = 12.0
min_ios = 0
max_ios = 300.0

[boundary]
template_DNA_5_3 = "TCCTTTTATCGTCATCATCTTTGTAATCGCCGCT"
trim_right = 8
ref_length = 54
segment_step_to_end = 4
window_length = 40

[channels_page]
ylim = [
  0,
  300,
]
autoscale_y = "fixed"
xlim = [
  0,
  1000,
]
autoscale_x = "auto"

[channels_voltage_page]
show_voltage = true
ylim = [
  -200,
  200,
]
autoscale_y = "fixed"

[events_page]
ylim = [
  0,
  300,
]
autoscale_y = "fixed"
xlim = [
  0,
  1000,
]
autoscale_x = "auto"
x_start = "abs"
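
With this many settings in one file, a small sanity check before starting a long analysis can catch simple mistakes. The sketch below looks for range-like settings whose two values are missing or not in ascending order. The helper find_invalid_ranges and the RANGE_KEYS tuple are hypothetical and written for this illustration; they are not part of poreFlow.

```python
try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport for older Python versions

# Hypothetical list of settings that are expected to hold [low, high] pairs.
RANGE_KEYS = ("open_state_range", "voltage_range", "events_current_range", "xlim", "ylim")


def find_invalid_ranges(config):
    """Return 'section.key' names of range settings that are not [low, high] with low < high."""
    problems = []
    for section, settings in config.items():
        if not isinstance(settings, dict):
            continue  # skip top-level values that are not tables
        for key, value in settings.items():
            if key not in RANGE_KEYS:
                continue
            if len(value) != 2 or value[0] >= value[1]:
                problems.append(f"{section}.{key}")
    return problems


# Illustrative config with one deliberately reversed range.
CONFIG_TEXT = """
[event_finding]
open_state_range = [300, 200]
voltage_range = [175, 185]
boundary_trim = [1, 1]

[channels_page]
ylim = [0, 300]
"""

problems = find_invalid_ranges(tomllib.loads(CONFIG_TEXT))
print(problems)
```

Settings such as boundary_trim are deliberately left out of RANGE_KEYS, because their two values are independent amounts rather than a lower and an upper bound.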

Configuration Reference

This section documents all available configuration settings, organized by section. Each setting corresponds to parameters used in PoreFlow's analysis functions.

Input settings

Section name: [input]

Parameters controlling file input and initial signal processing.

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | "My File" | Display name of the file for identification in outputs and logs |
| folder_path | str | required | Path to the folder containing the data file |
| file_name | str | required | Name of the data file to process (e.g., 260309_XC_H3-C2_pore1_perf1_f2.dat) |
| resample_to_freq | int | 5000 | Target sample rate (Hz) for downsampling the signal before processing. Lower values reduce data size and processing time but may lose fine details |
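
Since your own code reads the [input] section, it can be useful to check the required settings and fill in the documented defaults yourself. The sketch below is one way to do that. The helper resolve_input_settings is hypothetical, and whether poreFlow applies these defaults on its own is not assumed here.

```python
# Defaults and required keys as listed for the [input] section.
INPUT_DEFAULTS = {"name": "My File", "resample_to_freq": 5000}
REQUIRED_INPUT_KEYS = ("folder_path", "file_name")


def resolve_input_settings(input_section):
    """Check required keys and fill in defaults for the [input] section (hypothetical helper)."""
    missing = [key for key in REQUIRED_INPUT_KEYS if key not in input_section]
    if missing:
        raise KeyError(f"Missing required [input] settings: {', '.join(missing)}")
    # Values from the file take precedence over the defaults.
    return {**INPUT_DEFAULTS, **input_section}


resolved = resolve_input_settings(
    {"folder_path": "my_project/", "file_name": "my_measurement.fast5"}
)
print(resolved)
```

Failing early with a clear message is usually easier to debug than a missing-key error deep inside an analysis script.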

Output settings

Section name: [output]

Parameters controlling output locations.

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| path | str | "./" | Path to the directory where output files (annotations, figures) will be saved |

Environment settings

Section name: [env]

Environment and runtime parameters.

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| processes | int | 2 | Number of parallel processes to use for multi-channel processing. Higher values use more CPU cores but may have diminishing returns [1] |
| verbose | int | 0 | Verbosity level. 0 = silent, higher values produce more detailed output |
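
Because processes should not exceed what the machine offers, a script can cap the configured value at the number of CPU cores reported by the operating system. The sketch below uses os.cpu_count() from the standard library. The helper clamp_processes is hypothetical and not part of poreFlow.

```python
import os


def clamp_processes(requested):
    """Limit the requested process count to the range [1, available CPU cores]."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(requested, available))


env_config = {"processes": 2, "verbose": 0}
env_config["processes"] = clamp_processes(env_config["processes"])
print(env_config)
```

This keeps the same configuration file usable on both a laptop and a workstation without editing it by hand.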

Plot settings

Section name: [plot]

Parameters controlling figure output.

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| out | str | "./Figures" | Path to the directory where figures will be saved |

Event detection settings

Section name: [event_finding]

Parameters for detecting events in the recording.

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| open_state_range | list[int] | [200, 300] | Current range in pA that defines the open state (open pore). Values outside this range are considered potential events |
| voltage_range | list[int] | [175, 185] | Acceptable voltage range in mV. Data outside this range is excluded from analysis |
| closing_iterations | int | 10 | Number of morphological closing operations applied to the event mask. Higher numbers allow for longer gaps in an event that have an incorrect voltage or an open state current |
| boundary_trim | list[int] | [1, 1] | Time in milliseconds to trim from the start and end of each detected event. Negative values can be used to expand an event to include samples before and after the detected event. Format: [start_trim, end_trim] |
| n_components | int | 3 | Number of components for the Gaussian Mixture Model (GMM) used to fit the open state current distribution |
| degree | int | 2 | Degree of the polynomial fit applied to the open state current |
| min_frac_os | float | 0.01 | Minimum fraction of the recording that must be in the open state. Defaults to 1%. If less than this fraction is open state, the channel is rejected [2] |
| min_duration | float | 1.0 | Minimum duration in seconds for a detected event. Events shorter than this are filtered out immediately after detection |
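
To make the interplay between open_state_range and min_frac_os concrete, the sketch below computes the fraction of samples that fall inside the open state range and compares it with the threshold. It is an illustration only and not poreFlow's implementation: the helper open_state_fraction is hypothetical, the current values are made up, and treating the range bounds as inclusive is an assumption.

```python
def open_state_fraction(current_pa, open_state_range):
    """Fraction of samples whose current lies inside the open state range (bounds inclusive)."""
    low, high = open_state_range
    in_open_state = [low <= sample <= high for sample in current_pa]
    return sum(in_open_state) / len(in_open_state)


# Made-up current trace in pA: open pore around 250 pA, one blockade around 120 pA.
current_pa = [250.0, 252.0, 120.0, 118.0, 121.0, 249.0, 251.0, 250.0]

fraction = open_state_fraction(current_pa, open_state_range=[200, 300])
min_frac_os = 0.01

print(f"{fraction:.1%} of samples are in the open state")
print("channel accepted" if fraction >= min_frac_os else "channel rejected")
```

In this toy trace the three samples outside the range are the candidates for an event, while the channel as a whole easily passes the min_frac_os check.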

Step finding settings

Section name: [step_finding]

Parameters for detecting steps within events.

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| sensitivity | float | 3.0 | Sensitivity of the step finder algorithm. Higher values result in fewer, more significant steps being detected |
| min_level_length | int | 10 | Minimum number of samples for a step to be valid. Prevents detection of very short, noise-like steps |
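
Since min_level_length is given in samples while other settings use seconds or milliseconds, it can help to convert it to a duration. The sketch below does this arithmetic, assuming that step finding runs on a signal sampled at the resample_to_freq rate from the [input] section. That assumption, and the helper samples_to_ms, are for illustration only.

```python
def samples_to_ms(n_samples, sample_rate_hz):
    """Convert a number of samples to a duration in milliseconds."""
    return 1000.0 * n_samples / sample_rate_hz


min_level_length = 10    # samples, default from [step_finding]
resample_to_freq = 5000  # Hz, default from [input]

duration_ms = samples_to_ms(min_level_length, resample_to_freq)
print(f"{min_level_length} samples at {resample_to_freq} Hz last {duration_ms} ms")
```

Under that assumption, the shortest level that can be detected with the default settings lasts 2 ms, and changing the sample rate changes this duration unless min_level_length is adjusted as well.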

Event selection settings

Section name: [event_filtering]

Parameters for filtering events based on their characteristics after step finding.

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| min_duration | float | 1.0 | Minimum duration in seconds for an event to pass filtering |
| min_n_steps | int | 45 | Minimum number of steps required in an event |
| max_n_steps | int | 800 | Maximum number of steps allowed in an event |
| min_binned_entropy_of_means | float | 2.5 | Minimum binned entropy of step means. Typical value of 2.5 for M2 MspA DNA sequencing. Measures the variability/disorder of current levels |
| min_step_rate | float | 12.0 | Minimum step rate in Hz. Typical value of 12 Hz for Hel308 at 37°C with 1 mM ATP |
| min_ios | float | 0.0 | Minimum local open state current in pA. Events with open state below this are rejected |
| max_ios | float | 300.0 | Maximum local open state current in pA. Events with open state above this are rejected |
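
The filtering settings can be read as a set of independent conditions that an event must all satisfy. The sketch below evaluates them for a single event and combines them into an overall verdict. It is a simplified illustration, not poreFlow's filtering code: the helper evaluate_event, the feature names duration and ios, the use of strict inequalities, and the example values are all assumptions.

```python
def evaluate_event(features, cfg):
    """Return one boolean per condition plus an overall 'all' verdict (hypothetical helper)."""
    checks = {
        f"duration>{cfg['min_duration']}": features["duration"] > cfg["min_duration"],
        f"n_steps>{cfg['min_n_steps']}": features["n_steps"] > cfg["min_n_steps"],
        f"n_steps<{cfg['max_n_steps']}": features["n_steps"] < cfg["max_n_steps"],
        f"entropy>{cfg['min_binned_entropy_of_means']}": (
            features["binned_entropy_of_means"] > cfg["min_binned_entropy_of_means"]
        ),
        f"step_rate>{cfg['min_step_rate']}": features["step_rate"] > cfg["min_step_rate"],
        f"ios>{cfg['min_ios']}": features["ios"] > cfg["min_ios"],
        f"ios<{cfg['max_ios']}": features["ios"] < cfg["max_ios"],
    }
    checks["all"] = all(checks.values())
    return checks


# Defaults from the [event_filtering] section.
filtering_cfg = {
    "min_duration": 1.0,
    "min_n_steps": 45,
    "max_n_steps": 800,
    "min_binned_entropy_of_means": 2.5,
    "min_step_rate": 12.0,
    "min_ios": 0.0,
    "max_ios": 300.0,
}

# Made-up feature values for two events.
good_event = {"duration": 11.8, "n_steps": 258, "binned_entropy_of_means": 3.1,
              "step_rate": 21.8, "ios": 250.0}
short_event = dict(good_event, n_steps=30)

good_result = evaluate_event(good_event, filtering_cfg)
short_result = evaluate_event(short_event, filtering_cfg)
print(good_result["all"], short_result["all"])
```

Keeping the individual conditions next to the combined verdict makes it easy to see which setting is responsible when an event is rejected.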

DNA-peptide boundary detection settings

Section name: [boundary]

Work in progress

These settings are currently WIP in poreFlow. Using these is not recommended yet. Note that their names might also change in future poreflow versions.

Parameters for DNA-peptide boundary detection and alignment.

| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| template_DNA_5_3 | str | TCCT...CGCT [3] | Template DNA sequence (5' to 3') used for boundary detection and alignment |
| trim_right | int | 8 | Number of bases to trim from the right end of events during alignment |
| ref_length | int | 54 | Reference length in bases for the expected DNA sequence |
| segment_step_to_end | int | 4 | Number of steps from the end of the event to consider for boundary detection |
| window_length | int | 40 | Window length in bases for boundary analysis |

Code examples

The examples below use the default config file, with one alteration: the environment setting verbose is set to 1.

Loading a raw recording

This example shows how the file name and folder stored in the configuration determine which file is loaded.

import pathlib
import tomllib

import poreflow as pf

# Read the TOML configuration into a nested dict
with open("poreflow.toml", "rb") as f:
    config = tomllib.load(f) # (1)!

input_config = config["input"] # (2)!

file_name = input_config["file_name"]
folder_path = pathlib.Path(input_config["folder_path"]) # (3)!
print(f"Loading {file_name} from folder {folder_path}")

# Open the recording and downsample one channel to the configured rate
with pf.File(folder_path / file_name) as f:
    raw = f.get_raw(channel=18)
    raw = raw.downsample(input_config["resample_to_freq"])

# Further processing with raw

  1. Load a config file using tomllib.
  2. Read the input section of the file.
  3. Wrap the folder path in pathlib.Path so that it can be joined with the file name using the / operator.

Loading ont_measurement.fast5 from folder .

Event detection

This example shows how a configuration file can be used for setting arguments for event finding. Note that it can be cleanly and easily implemented using argument unpacking. See the inline comments (the icons) in the code below for more information.

event_config = config["event_finding"]
env_config = config["env"]

with pf.File(folder_path / file_name) as f:
    f.find_events(
        **event_config, # (1)!
        **env_config, # (2)!
    )
  1. poreflow.File.find_events is an interface for poreflow.events.detection.find_events.

    If the TOML file has an event_finding section with the same keys as poreflow.events.detection.find_events (open_state_range, min_duration, etc.), it is easiest and cleanest to simply unpack the arguments into the method using the ** syntax.
  2. We can also unpack the environment section into this method, as poreflow.File.find_events takes both verbose and processes keyword arguments.

Starting 2 workers for 2 items
------- SUMMARY -------
Searched 2 channels and found 155 events. 
Good channels had 77.50 events on average.
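
The unpacking pattern can be tried out without any data by using a stand-in function. In the sketch below, find_events_stub is a hypothetical placeholder that accepts a few of the keyword arguments named in this document; it is not the poreFlow function and does no analysis. The second call shows what happens when a configuration key is misspelled and the target function does not accept arbitrary keyword arguments.

```python
def find_events_stub(open_state_range, voltage_range, min_duration=1.0,
                     processes=2, verbose=0):
    """Hypothetical stand-in that simply returns the arguments it received."""
    return {
        "open_state_range": open_state_range,
        "voltage_range": voltage_range,
        "min_duration": min_duration,
        "processes": processes,
        "verbose": verbose,
    }


event_config = {"open_state_range": [200, 300], "voltage_range": [175, 185], "min_duration": 1}
env_config = {"processes": 2, "verbose": 1}

# Each key in the dicts is passed as a keyword argument of the same name.
result = find_events_stub(**event_config, **env_config)
print(result)

# A misspelled key is rejected by Python before the function body runs.
error_message = ""
try:
    find_events_stub(**{**event_config, "min_duraton": 1}, **env_config)
except TypeError as error:
    error_message = str(error)
    print(f"Rejected: {error_message}")
```

A side effect of this pattern is that typos in section keys tend to surface as a TypeError rather than being silently ignored, provided the called function has an explicit list of keyword arguments.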

Step finding and event selection

This example shows how the step_finding and event_filtering sections of a configuration file can be used to find steps in events and then select events based on their features.

from poreflow.events import selection

print("(1) Step finding:")
with pf.File(folder_path / file_name) as f:
    f.find_steps(
        **config["step_finding"], # (1)!
        **env_config # (2)!
    )

    stats = selection.get_step_finding_stats(f) # (3)!

print("\n(2) Event selection features:")
print(stats.head())

truth_table = selection.filter_from_config(stats, config["event_filtering"])

print("\n(3) Event filtering results:")
print(truth_table.head())

mask = truth_table["all"]
print(
    f"Selecting {sum(mask)} out of {len(mask)} "
    f"events ({sum(mask) / len(mask):.0%})"
)
with pf.File(folder_path / file_name) as f:
    f.filter_events(mask)
  1. As in the previous example, poreflow.File.find_steps is an interface for poreflow.EventDataFrame.find_steps.

    If the TOML file has a step_finding section with the same keys as poreflow.EventDataFrame.find_steps (sensitivity and min_level_length), it is easiest and cleanest to simply unpack the arguments into the method using the ** syntax.
  2. We can also unpack the environment section into this method, as poreflow.File.find_steps takes both verbose and processes keyword arguments.
  3. Collect the per-event features used for event selection (printed under "(2) Event selection features" below).

(1) Step finding:
Starting 2 workers for 155 items
------- SUMMARY -------
Found steps in 155 events. 
Unable to find steps in 0 (0%) events. 
Found a total of 27592 steps. 
Good channels had 178.01 steps on average.

(2) Event selection features:
   start_idx  end_idx  n_pts  ...  n_steps  step_rate  binned_entropy_of_means
0      59602   118670  59068  ...      258  21.839236                 3.105303
1     153648   205506  51858  ...      336  32.396159                 3.059400
2     259337   298494  39157  ...      186  23.750543                 3.368490
3     303058   323739  20681  ...      109  26.352691                 3.298638
4     346801   359030  12229  ...       61  24.940715                 3.409057

[5 rows x 14 columns]

(3) Event filtering results:
condition  duration>1  n_steps>45  n_steps<800  ...  ios>0  ios<300.0   all
event                                           ...                        
0                True        True         True  ...   True       True  True
1                True        True         True  ...   True       True  True
2                True        True         True  ...   True       True  True
3                True        True         True  ...   True       True  True
4                True        True         True  ...   True       True  True

[5 rows x 8 columns]
Selecting 112 out of 155 events (72%)

  1. Note that processes must be less than or equal to the number of CPU cores available on your device. To find out the number of cores available on your PC, consult these instructions. On a Mac, open a terminal and run sysctl -n hw.ncpu

  2. Mainly important for ONT devices. ONT devices contain many channels, each of which is analysed for events. Some channels are blocked for the full duration of the recording and therefore contain few samples within open_state_range. Channels with a low fraction of samples at open state current are considered bad, and event detection for them is stopped early. Note that this is also useful for UTube measurements, as it will throw an error if the user selects a wrong open state range. If there is a very large number of reads (i.e. very little time at open state), you can consider setting this to 0.1% or lower.

  3. Middle nucleotides have been removed here for brevity. Full sequence: TCCTTTTATCGTCATCATCTTTGTAATCGCCGCT