Configurations¶

When doing large-scale analysis, you might want to collect all your processing settings (e.g. those for event-finding, plotting, step detection) into one file.

PoreFlow supports a TOML file used to store and maintain parameters. This document explains when and how to use configurations, provides usage examples, and details all available settings.

When to use configurations¶

It is recommended to use configurations when your use case has or requires:

Reproducible analyses: You need to document and reproduce exact parameter values across multiple runs or share analyses with colleagues
Batch processing: Running the same analysis on multiple files with consistent parameters
Parameter tuning: You experiment with different parameter combinations and need to track which settings produced which results
Team collaboration: Multiple people need to use the same analysis parameters

When not to use configurations

Consider not using configurations when your use case has or requires:

One-off analyses: Simple, exploratory (initial) analyses where you're testing a single file with default parameters
Quick scripts: Short scripts with few parameters
Highly dynamic parameters: When parameters need to be computed at runtime based on input data characteristics

Configurations in scripts versus poreFlow dashboard¶

The poreFlow Python module and dashboard have different use-cases, and thus treat configurations a little differently. Their approach to configurations is summarized below:

Python Module

The Python module is intended as building blocks for your own analysis. This means:
- Your code reads the configuration file.
- Your code decides which functions use which parts of the configuration file.
While this is a lot more flexible, it requires some more set-up on your end. Note that poreFlow has been designed around this use case, which makes things easier. For usage examples, check out:

Code examples

Note

As mentioned in the previous section, consider forgoing configurations for one-offs and quick analysis.

Dashboard

The dashboard uses configurations throughout:
- Settings inside the dashboard can be stored to a configuration file.
- A configuration file can be loaded to use saved settings in a new session.

To learn more, check out the Dashboard features pages:

Dashboard features

TOML Language¶

Configuration files use TOML (Tom's Obvious, Minimal Language), a minimal configuration file format designed to be easy to read. Advantages of TOML are:

Human-readable: Clear syntax with sections (tables) and key-value pairs
Types: Supports strings, integers, floats, booleans, arrays, and nested structures
Widely supported: In Python via the tomllib module (part of the standard library since Python 3.11; use tomli for older versions)
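
As a minimal sketch of how TOML types map onto Python objects, the snippet below parses a small TOML string with tomllib. The fallback import of tomli is only needed on Python versions before 3.11, and the example values are taken from the configuration shown in this document.

```python
try:
    import tomllib  # standard library since Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same API

toml_text = """
[env]
processes = 2
verbose = 0

[step_finding]
sensitivity = 3.0

[channels_voltage_page]
show_voltage = true
ylim = [-200, 200]
autoscale_y = "fixed"
"""

config = tomllib.loads(toml_text)

# Each TOML table becomes a dict, and each value keeps its TOML type.
print(type(config["env"]["processes"]).__name__)
print(type(config["step_finding"]["sensitivity"]).__name__)
print(type(config["channels_voltage_page"]["show_voltage"]).__name__)
print(config["channels_voltage_page"]["ylim"])
```

Integers, floats, booleans, strings and arrays arrive as int, float, bool, str and list, so no manual type conversion is needed after loading.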

A snippet of a configuration in the TOML language containing event finding parameters:

poreflow.toml

[event_finding]
open_state_range = [
  200,
  300,
]
voltage_range = [
  175,
  185,
]
closing_iterations = 10
boundary_trim = [
  1,
  1,
]
n_components = 3
degree = 2
min_frac_os = 0.01
min_duration = 1

[step_finding]
sensitivity = 3.0
min_level_length = 10

[event_filtering]
min_duration = 1

Spacing and lists in TOML files

Note that some parameters, like open_state_range, are lists. TOML files have quite flexible formatting. In this example, the lists have been written over multiple lines:

open_state_range = [
  200,
  300,
]

A more compact notation is also possible:

open_state_range = [200, 300]

Use whatever you prefer.
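
To confirm that the two notations are equivalent, the quick check below parses each form and compares the results. It is a sketch that uses only tomllib (or the tomli backport on Python versions before 3.11).

```python
try:
    import tomllib  # standard library since Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same API

multi_line = """
open_state_range = [
  200,
  300,
]
"""
compact = "open_state_range = [200, 300]"

parsed_multi = tomllib.loads(multi_line)
parsed_compact = tomllib.loads(compact)

# Both notations parse to the same Python list.
same = parsed_multi == parsed_compact
print(parsed_multi["open_state_range"], same)
```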

Full example¶

Shown below is an example of a TOML file with all possible parameters. The next section describes the meaning and usage of each parameter.

poreflow.toml

[input]
name = "My File"
folder_path = "my_project/"
file_name = "my_measurement.fast5"
resample_to_freq = 5000
filter_freq = 1000

[output]
path = "./"

[env]
processes = 2
verbose = 0

[plot]
out = "./Figures"
events_current_range = [
  0,
  150,
]

[event_finding]
open_state_range = [
  200,
  300,
]
voltage_range = [
  175,
  185,
]
closing_iterations = 10
boundary_trim = [
  1,
  1,
]
n_components = 3
degree = 2
min_frac_os = 0.01
min_duration = 1

[step_finding]
sensitivity = 3.0
min_level_length = 10

[event_filtering]
min_duration = 1
min_n_steps = 45
max_n_steps = 800
min_binned_entropy_of_means = 2.5
min_step_rate = 12.0
min_ios = 0
max_ios = 300.0

[boundary]
template_DNA_5_3 = "TCCTTTTATCGTCATCATCTTTGTAATCGCCGCT"
trim_right = 8
ref_length = 54
segment_step_to_end = 4
window_length = 40

[channels_page]
ylim = [
  0,
  300,
]
autoscale_y = "fixed"
xlim = [
  0,
  1000,
]
autoscale_x = "auto"

[channels_voltage_page]
show_voltage = true
ylim = [
  -200,
  200,
]
autoscale_y = "fixed"

[events_page]
ylim = [
  0,
  300,
]
autoscale_y = "fixed"
xlim = [
  0,
  1000,
]
autoscale_x = "auto"
x_start = "abs"

Configuration Reference¶

This section documents all available configuration settings, organized by section. Each setting corresponds to parameters used in PoreFlow's analysis functions.

Input settings¶

Section name: [input]

Parameters controlling file input and initial signal processing.

Setting	Type	Default	Description
`name`	str	"My File"	Display name of the file for identification in outputs and logs
`folder_path`	str	required	Path to the folder containing the data file
`file_name`	str	required	Name of the data file to process (e.g., `260309_XC_H3-C2_pore1_perf1_f2.dat`)
`resample_to_freq`	int	5000	Target sample rate (Hz) for downsampling the signal before processing. Lower values reduce data size and processing time but may lose fine details
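
The table above distinguishes required settings from those with defaults. One way to enforce this in your own script is sketched below. The helper `resolve_input_settings` and the constants `INPUT_DEFAULTS` and `INPUT_REQUIRED` are hypothetical names for illustration and are not part of poreflow; the default values are copied from the table.

```python
# Hypothetical helper, not part of poreflow. Defaults copied from the table.
INPUT_DEFAULTS = {"name": "My File", "resample_to_freq": 5000}
INPUT_REQUIRED = ("folder_path", "file_name")


def resolve_input_settings(input_section: dict) -> dict:
    """Return the [input] settings with defaults filled in.

    Raises KeyError if a required setting is missing.
    """
    missing = [key for key in INPUT_REQUIRED if key not in input_section]
    if missing:
        raise KeyError(f"Missing required [input] settings: {', '.join(missing)}")
    # Values from the file take precedence over the defaults.
    return {**INPUT_DEFAULTS, **input_section}


settings = resolve_input_settings(
    {"folder_path": "my_project/", "file_name": "my_measurement.fast5"}
)
print(settings["name"], settings["resample_to_freq"])
```

Failing early on a missing `folder_path` or `file_name` gives a clearer error message than a KeyError raised later in the middle of an analysis.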

Output settings¶

Section name: [output]

Parameters controlling output locations.

Setting	Type	Default	Description
`path`	str	"./"	Path to the directory where output files (annotations, figures) will be saved

Environment settings¶

Section name: [env]

Environment and runtime parameters.

Setting	Type	Default	Description
`processes`	int	2	Number of parallel processes to use for multi-channel processing. Higher values use more CPU cores but may have diminishing returns¹
`verbose`	int	0	Verbosity level. 0 = silent, higher values produce more detailed output
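
Because processes should not exceed the number of available cores, a script can cap the configured value before passing it on. The sketch below uses os.cpu_count() from the standard library. The helper name `cap_processes` is hypothetical, and whether poreflow performs a similar check internally is not stated here.

```python
import os


def cap_processes(requested: int) -> int:
    """Limit the requested process count to the number of logical CPUs."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(requested, available))


env_config = {"processes": 2, "verbose": 0}
env_config["processes"] = cap_processes(env_config["processes"])
print(env_config["processes"])
```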

Plot settings¶

Section name: [plot]

Parameters controlling figure output.

Setting	Type	Default	Description
`out`	str	"./Figures"	Path to the directory where figures will be saved

Event detection settings¶

Section name: [event_finding]

Parameters for detecting events in the recording.

Setting	Type	Default	Description
`open_state_range`	list[int]	[200, 300]	Current range in pA that defines the open state (open pore). Values outside this range are considered potential events
`voltage_range`	list[int]	[175, 185]	Acceptable voltage range in mV. Data outside this range is excluded from analysis
`closing_iterations`	int	10	Number of morphological closing operations applied to the event mask. Higher numbers allow for longer gaps in an event that have an incorrect voltage or an open state current.
`boundary_trim`	list[int]	[1, 1]	Time in milliseconds to trim from the start and end of each detected event. Negative values can be used to expand an event to include samples before and after the detected event. Format: [start_trim, end_trim]
`n_components`	int	3	Number of components for the Gaussian Mixture Model (GMM) used to fit the open state current distribution
`degree`	int	2	Degree of the polynomial fit applied to the open state current
`min_frac_os`	float	0.01	Minimum fraction of the recording that must be in the open state. Defaults to 1%. If less than this fraction is open state, the channel is rejected.²
`min_duration`	float	1.0	Minimum duration in seconds for a detected event. Events shorter than this are filtered out immediately after detection
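
To illustrate what min_frac_os checks, the sketch below computes the fraction of samples that fall inside open_state_range for a short, made-up current trace and compares it to the threshold. This is a simplified stand-in for illustration only; the function name `open_state_fraction` is hypothetical, and the actual check in poreflow may differ in detail (for example in how range boundaries or the voltage are handled).

```python
def open_state_fraction(current_pA, open_state_range):
    """Fraction of samples whose current lies inside the open state range."""
    low, high = open_state_range
    in_range = sum(1 for value in current_pA if low <= value <= high)
    return in_range / len(current_pA)


# Made-up trace: 2 of 8 samples are at open state level (200 to 300 pA).
trace = [250.0, 245.0, 90.0, 85.0, 110.0, 95.0, 100.0, 80.0]
fraction = open_state_fraction(trace, [200, 300])

min_frac_os = 0.01
channel_accepted = fraction >= min_frac_os
print(f"{fraction:.2f}", channel_accepted)
```

A channel that is blocked for the whole recording would yield a fraction close to zero and fall below the threshold.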

Step finding settings¶

Section name: [step_finding]

Parameters for detecting steps within events.

Setting	Type	Default	Description
`sensitivity`	float	3.0	Sensitivity of the step finder algorithm. Higher values result in fewer, more significant steps being detected
`min_level_length`	int	10	Minimum number of samples for a step to be valid. Prevents detection of very short, noise-like steps

Event selection settings¶

Section name: [event_filtering]

Parameters for filtering events based on their characteristics after step finding.

Setting	Type	Default	Description
`min_duration`	float	1.0	Minimum duration in seconds for an event to pass filtering
`min_n_steps`	int	45	Minimum number of steps required in an event
`max_n_steps`	int	800	Maximum number of steps allowed in an event
`min_binned_entropy_of_means`	float	2.5	Minimum binned entropy of step means. Typical value of 2.5 for M2 MspA DNA sequencing. Measures the variability/disorder of current levels
`min_step_rate`	float	12.0	Minimum step rate in Hz. Typical value of 12 Hz for Hel308 at 37°C with 1 mM ATP
`min_ios`	float	0.0	Minimum local open state current in pA. Events with open state below this are rejected
`max_ios`	float	300.0	Maximum local open state current in pA. Events with open state above this are rejected
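
The thresholds above can be read as a set of independent conditions that an event must all satisfy. The following sketch applies them to two made-up events stored as plain dicts. It is not poreflow's implementation: the function name `passes_filters`, the condition labels, and the event keys `duration` and `ios` are assumptions for illustration, while the threshold names and values come from the table. Strict comparisons are assumed here.

```python
filter_config = {
    "min_duration": 1.0,
    "min_n_steps": 45,
    "max_n_steps": 800,
    "min_binned_entropy_of_means": 2.5,
    "min_step_rate": 12.0,
    "min_ios": 0.0,
    "max_ios": 300.0,
}


def passes_filters(event: dict, cfg: dict) -> dict:
    """Evaluate each condition separately and combine them under 'all'."""
    checks = {
        "duration": event["duration"] > cfg["min_duration"],
        "min_n_steps": event["n_steps"] > cfg["min_n_steps"],
        "max_n_steps": event["n_steps"] < cfg["max_n_steps"],
        "entropy": event["binned_entropy_of_means"]
        > cfg["min_binned_entropy_of_means"],
        "step_rate": event["step_rate"] > cfg["min_step_rate"],
        "min_ios": event["ios"] > cfg["min_ios"],
        "max_ios": event["ios"] < cfg["max_ios"],
    }
    checks["all"] = all(checks.values())
    return checks


# Made-up events for illustration only.
good = {"duration": 11.8, "n_steps": 258, "binned_entropy_of_means": 3.1,
        "step_rate": 21.8, "ios": 250.0}
short = {"duration": 0.4, "n_steps": 12, "binned_entropy_of_means": 3.0,
         "step_rate": 30.0, "ios": 250.0}

good_result = passes_filters(good, filter_config)
short_result = passes_filters(short, filter_config)
print(good_result["all"], short_result["all"])
```

Keeping the individual conditions next to the combined result makes it easy to see which threshold rejected an event.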

DNA-peptide boundary detection settings¶

Section name: [boundary]

Work in progress

These settings are currently WIP in poreFlow. Using these is not recommended yet. Note that their names might also change in future poreflow versions.

Parameters for DNA-peptide boundary detection and alignment.

Setting	Type	Default	Description
`template_DNA_5_3`	str	TCCT...CGCT³	Template DNA sequence (5' to 3') used for boundary detection and alignment
`trim_right`	int	8	Number of bases to trim from the right end of events during alignment
`ref_length`	int	54	Reference length in bases for the expected DNA sequence
`segment_step_to_end`	int	4	Number of steps from the end of the event to consider for boundary detection
`window_length`	int	40	Window length in bases for boundary analysis

Code examples¶

The examples below use the default config file, with one alteration: the environment setting verbose is set to 1.

Loading a raw recording¶

This example shows how the file name and folder stored in the configuration can be used to specify which file to load.

import pathlib
import tomllib

import poreflow as pf

# tomllib requires the file to be opened in binary mode
with open("poreflow.toml", "rb") as f:
    config = tomllib.load(f)  # (1)!

input_config = config["input"]  # (2)!

file_name = input_config["file_name"]
# A Path allows joining the folder and file name with the / operator
folder_path = pathlib.Path(input_config["folder_path"])  # (3)!
print(f"Loading {file_name} from folder {folder_path}")

with pf.File(folder_path / file_name) as f:
    raw = f.get_raw(channel=18)
    # Downsample to the sample rate stored in the configuration
    raw = raw.downsample(input_config["resample_to_freq"])

# Further processing with raw

(1) Load a config file using tomllib.
(2) Read the input section of the file.

Loading ont_measurement.fast5 from folder .

Event detection¶

This example shows how a configuration file can be used to set the arguments for event finding. This can be implemented cleanly using argument unpacking. See the numbered annotations in the code below for more information.

event_config = config["event_finding"]
env_config = config["env"]

with pf.File(folder_path / file_name) as f:
    f.find_events(
        **event_config, # (1)!
        **env_config, # (2)!
    )

(1) poreflow.File.find_events is an interface for poreflow.events.detection.find_events.

If the TOML file has an event_finding section with the same keys as poreflow.events.detection.find_events (open_state_range, min_duration, etc.), it is easiest and cleanest to simply unpack the arguments into the method using the ** syntax.
(2) We can also unpack the environment section into this method, as poreflow.File.find_events takes both a verbose and a processes keyword argument.

Starting 2 workers for 2 items
------- SUMMARY -------
Searched 2 channels and found 155 events. 
Good channels had 77.50 events on average.
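
The unpacking pattern relies on the keys of a configuration section matching the keyword arguments of the function being called. The sketch below demonstrates this with a stand-in function: `find_events_stub` is hypothetical and only mimics a few of the parameter names used in this document. It also shows how to override a single value at call time and what happens when a section contains a key the function does not accept.

```python
def find_events_stub(open_state_range, min_duration=1.0, processes=1, verbose=0):
    """Stand-in for an analysis function; returns the arguments it received."""
    return {
        "open_state_range": open_state_range,
        "min_duration": min_duration,
        "processes": processes,
        "verbose": verbose,
    }


event_config = {"open_state_range": [200, 300], "min_duration": 1}
env_config = {"processes": 2, "verbose": 1}

# The keys of both dicts are passed as keyword arguments.
received = find_events_stub(**event_config, **env_config)

# Override one setting without editing the file: later keys win in a dict merge.
overridden = find_events_stub(**{**event_config, "min_duration": 0.5}, **env_config)

# A key that the function does not accept raises a TypeError.
try:
    find_events_stub(**event_config, typo_key=3)
    unknown_key_rejected = False
except TypeError:
    unknown_key_rejected = True

print(received["processes"], overridden["min_duration"], unknown_key_rejected)
```

The TypeError on unknown keys is useful in practice: a misspelled setting in the TOML file fails loudly instead of being silently ignored.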

Step finding and event selection¶

This example shows how the step_finding and event_filtering sections of a configuration file can be used to find steps within events and then select events based on the results.

from poreflow.events import selection

print("(1) Step finding:")
with pf.File(folder_path / file_name) as f:
    f.find_steps(
        **config["step_finding"], # (1)!
        **env_config # (2)!
    )

    stats = selection.get_step_finding_stats(f) # (3)!

print("\n(2) Event selection features:")
print(stats.head())

truth_table = selection.filter_from_config(stats, config["event_filtering"])

print("\n(3) Event filtering results:")
print(truth_table.head())

mask = truth_table["all"]
print(
    f"Selecting {sum(mask)} out of {len(mask)} "
    f"events ({sum(mask) / len(mask):.0%})"
)
with pf.File(folder_path / file_name) as f:
    f.filter_events(mask)

(1) As in the previous example, poreflow.File.find_steps is an interface for poreflow.EventDataFrame.find_steps.

If the TOML file has a step_finding section with the same keys as poreflow.EventDataFrame.find_steps (sensitivity and min_level_length), it is easiest and cleanest to simply unpack the arguments into the method using the ** syntax.
(2) We can also unpack the environment section into this method, as poreflow.File.find_steps takes both a verbose and a processes keyword argument.

(1) Step finding:
Starting 2 workers for 155 items
------- SUMMARY -------
Found steps in 155 events. 
Unable to find steps in 0 (0%) events. 
Found a total of 27592 steps. 
Good channels had 178.01 steps on average.

(2) Event selection features:
   start_idx  end_idx  n_pts  ...  n_steps  step_rate  binned_entropy_of_means
0      59602   118670  59068  ...      258  21.839236                 3.105303
1     153648   205506  51858  ...      336  32.396159                 3.059400
2     259337   298494  39157  ...      186  23.750543                 3.368490
3     303058   323739  20681  ...      109  26.352691                 3.298638
4     346801   359030  12229  ...       61  24.940715                 3.409057

[5 rows x 14 columns]

(3) Event filtering results:
condition  duration>1  n_steps>45  n_steps<800  ...  ios>0  ios<300.0   all
event                                           ...                        
0                True        True         True  ...   True       True  True
1                True        True         True  ...   True       True  True
2                True        True         True  ...   True       True  True
3                True        True         True  ...   True       True  True
4                True        True         True  ...   True       True  True

[5 rows x 8 columns]
Selecting 112 out of 155 events (72%)

¹ Note that processes must be equal to or smaller than the number of CPU cores available on your device. To find out the number of cores available on your PC, consult these instructions. For Mac, open a terminal and run sysctl -n hw.ncpu.
² Mainly important for ONT devices. ONT devices contain many channels, each of which is analysed for events. Some channels are blocked for the full duration of the recording and thus contain few samples within open_state_range. Channels with a low fraction of samples at open state current are considered bad, and event detection for them is stopped early. Note that this is also useful for UTube measurements, as this will throw an error if the user selects a wrong open state range. If there is a very large number of reads (i.e. very little time at open state), consider setting this to 0.1% (0.001) or lower.
³ Middle nucleotides have been removed here for brevity. Full sequence: TCCTTTTATCGTCATCATCTTTGTAATCGCCGCT