Configurations¶
When doing large-scale analysis, you might want to collect all your processing settings (e.g. those for event-finding, plotting, step detection) into one file.
PoreFlow supports a TOML file used to store and maintain parameters. This document explains when and how to use configurations, provides usage examples, and details all available settings.
When to use configurations¶
It is recommended to use configurations when your use case has or requires:
- Reproducible analyses: You need to document and reproduce exact parameter values across multiple runs or share analyses with colleagues
- Batch processing: Running the same analysis on multiple files with consistent parameters
- Parameter tuning: You are experimenting with different parameter combinations and need to track which settings produced which results
- Team collaboration: Multiple people need to use the same analysis parameters
When not to use configurations
Consider not using configurations when your use case has or requires:
- One-off analyses: Simple, exploratory (initial) analyses where you're testing a single file with default parameters
- Quick scripts: Short scripts with few parameters
- Highly dynamic parameters: When parameters need to be computed at runtime based on input data characteristics
Configurations in scripts versus poreFlow dashboard¶
The poreFlow Python module and dashboard have different use-cases, and thus treat configurations a little differently. Their approach to configurations is summarized below:
Python Module
The Python module is intended as building blocks for your own analysis. This means:
- Your code reads the configuration file.
- Your code decides which functions use which parts of the configuration file.
While this is a lot more flexible, it requires some more set-up on your end. Note that poreFlow has been designed around this use case, which makes things easier. For usage examples, see the Code examples section further down this page.
Note
As mentioned in the previous section, consider forgoing configurations for one-offs and quick analysis.
Dashboard
The dashboard uses configurations throughout:
- Settings inside the dashboard can be stored to a configuration file.
- A configuration file can be loaded to use saved settings in a new session.
To learn more, check out the Dashboard features pages.
TOML Language¶
Configuration files use TOML (Tom's Obvious, Minimal Language), a minimal configuration file format. Advantages of TOML are:
- Human-readable: Clear syntax with sections (tables) and key-value pairs
- Types: Supports strings, integers, floats, booleans, arrays, and nested structures
- Widely supported: In Python via the `tomllib` module (or `tomli` for older versions)
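As a minimal sketch of these points, the snippet below parses a small TOML string and prints the Python type of each value. The keys are taken from the full example on this page; the `try`/`except` import is only one possible way to fall back to `tomli` on older Python versions.

```python
try:
    import tomllib  # standard library from Python 3.11 onwards
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport for older versions

TOML_TEXT = """
[input]
name = "My File"
resample_to_freq = 5000

[event_finding]
open_state_range = [200, 300]
min_frac_os = 0.01

[channels_voltage_page]
show_voltage = true
"""

# Each [table] becomes a nested dict; values keep their TOML types.
config = tomllib.loads(TOML_TEXT)

for section, settings in config.items():
    for key, value in settings.items():
        print(f"{section}.{key}: {type(value).__name__}")
```

Strings, integers, floats, booleans and arrays map to `str`, `int`, `float`, `bool` and `list` respectively.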
A snippet of a configuration in the TOML language containing event finding parameters:
[event_finding]
open_state_range = [
200,
300,
]
voltage_range = [
175,
185,
]
closing_iterations = 10
boundary_trim = [
1,
1,
]
n_components = 3
degree = 2
min_frac_os = 0.01
min_duration = 1
[step_finding]
sensitivity = 3.0
min_level_length = 10
[event_filtering]
min_duration = 1
Spacing and lists in TOML files
Note that some parameters, like open_state_range, are lists. TOML formatting is quite flexible: in the example above, these lists are written over multiple lines.
A more compact notation is also possible, with the whole list on one line:
open_state_range = [200, 300]
Use whichever you prefer.
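To check that the two notations really are interchangeable, a short sketch like the following parses both forms and compares the results. It uses only `tomllib` (or the `tomli` backport) and keys from the example above.

```python
try:
    import tomllib  # standard library from Python 3.11 onwards
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport for older versions

MULTI_LINE = """
[event_finding]
open_state_range = [
    200,
    300,
]
"""

SINGLE_LINE = """
[event_finding]
open_state_range = [200, 300]
"""

multi = tomllib.loads(MULTI_LINE)
single = tomllib.loads(SINGLE_LINE)

# Whitespace, line breaks and the trailing comma do not change the parsed value.
print(multi == single)
```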
Full example¶
Shown below is an example of a TOML file with all possible parameters. The next section describes the meaning and usage of each parameter.
[input]
name = "My File"
folder_path = "my_project/"
file_name = "my_measurement.fast5"
resample_to_freq = 5000
filter_freq = 1000
[output]
path = "./"
[env]
processes = 2
verbose = 0
[plot]
out = "./Figures"
events_current_range = [
0,
150,
]
[event_finding]
open_state_range = [
200,
300,
]
voltage_range = [
175,
185,
]
closing_iterations = 10
boundary_trim = [
1,
1,
]
n_components = 3
degree = 2
min_frac_os = 0.01
min_duration = 1
[step_finding]
sensitivity = 3.0
min_level_length = 10
[event_filtering]
min_duration = 1
min_n_steps = 45
max_n_steps = 800
min_binned_entropy_of_means = 2.5
min_step_rate = 12.0
min_ios = 0
max_ios = 300.0
[boundary]
template_DNA_5_3 = "TCCTTTTATCGTCATCATCTTTGTAATCGCCGCT"
trim_right = 8
ref_length = 54
segment_step_to_end = 4
window_length = 40
[channels_page]
ylim = [
0,
300,
]
autoscale_y = "fixed"
xlim = [
0,
1000,
]
autoscale_x = "auto"
[channels_voltage_page]
show_voltage = true
ylim = [
-200,
200,
]
autoscale_y = "fixed"
[events_page]
ylim = [
0,
300,
]
autoscale_y = "fixed"
xlim = [
0,
1000,
]
autoscale_x = "auto"
x_start = "abs"
Configuration Reference¶
This section documents all available configuration settings, organized by section. Each setting corresponds to parameters used in PoreFlow's analysis functions.
Input settings¶
Section name: [input]
Parameters controlling file input and initial signal processing.
| Setting | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | `"My File"` | Display name of the file for identification in outputs and logs |
| `folder_path` | `str` | required | Path to the folder containing the data file |
| `file_name` | `str` | required | Name of the data file to process (e.g., 260309_XC_H3-C2_pore1_perf1_f2.dat) |
| `resample_to_freq` | `int` | `5000` | Target sample rate (Hz) for downsampling the signal before processing. Lower values reduce data size and processing time but may lose fine details |
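Because `folder_path` and `file_name` are required while the other input settings have defaults, your own code may want to check the section before using it. The sketch below is one possible way to do that; `read_input_settings` is a hypothetical helper written for this illustration and is not part of poreFlow.

```python
# Defaults and required keys as listed in the table above.
INPUT_DEFAULTS = {"name": "My File", "resample_to_freq": 5000}
REQUIRED_INPUT_KEYS = ("folder_path", "file_name")


def read_input_settings(config: dict) -> dict:
    """Hypothetical helper: return the [input] section with defaults filled in."""
    section = config.get("input", {})
    missing = [key for key in REQUIRED_INPUT_KEYS if key not in section]
    if missing:
        raise KeyError(f"Missing required [input] settings: {', '.join(missing)}")
    # Values from the file take precedence over the defaults.
    return {**INPUT_DEFAULTS, **section}


settings = read_input_settings(
    {"input": {"folder_path": "my_project/", "file_name": "my_measurement.fast5"}}
)
print(settings)
```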
Output settings¶
Section name: [output]
Parameters controlling output locations.
| Setting | Type | Default | Description |
|---|---|---|---|
| `path` | `str` | `"./"` | Path to the directory where output files (annotations, figures) will be saved |
Environment settings¶
Section name: [env]
Environment and runtime parameters.
| Setting | Type | Default | Description |
|---|---|---|---|
| `processes` | `int` | `2` | Number of parallel processes to use for multi-channel processing. Higher values use more CPU cores but may have diminishing returns[^1] |
| `verbose` | `int` | `0` | Verbosity level. 0 = silent, higher values produce more detailed output |
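Since `processes` should not exceed what your device offers, you could clamp the configured value before passing it on. The following is a small sketch using the standard library; `usable_processes` is a hypothetical helper, not a poreFlow function.

```python
import os


def usable_processes(requested: int) -> int:
    """Hypothetical helper: limit the requested process count to the CPU count."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(requested, available))


env = {"processes": 2, "verbose": 0}  # values from the [env] section
env["processes"] = usable_processes(env["processes"])
print(env["processes"])
```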
Plot settings¶
Section name: [plot]
Parameters controlling figure output.
| Setting | Type | Default | Description |
|---|---|---|---|
| `out` | `str` | `"./Figures"` | Path to the directory where figures will be saved |
Event detection settings¶
Section name: [event_finding]
Parameters for detecting events in the recording.
| Setting | Type | Default | Description |
|---|---|---|---|
| `open_state_range` | `list[int]` | `[200, 300]` | Current range in pA that defines the open state (open pore). Values outside this range are considered potential events |
| `voltage_range` | `list[int]` | `[175, 185]` | Acceptable voltage range in mV. Data outside this range is excluded from analysis |
| `closing_iterations` | `int` | `10` | Number of morphological closing operations applied to the event mask. Higher numbers allow for longer gaps in an event that have an incorrect voltage or an open state current |
| `boundary_trim` | `list[int]` | `[1, 1]` | Time in milliseconds to trim from the start and end of each detected event. Negative values can be used to expand an event to include samples before and after the detected event. Format: [start_trim, end_trim] |
| `n_components` | `int` | `3` | Number of components for the Gaussian Mixture Model (GMM) used to fit the open state current distribution |
| `degree` | `int` | `2` | Degree of the polynomial fit applied to the open state current |
| `min_frac_os` | `float` | `0.01` | Minimum fraction of the recording that must be in the open state. Defaults to 1%. If less than this fraction is open state, the channel is rejected[^2] |
| `min_duration` | `float` | `1.0` | Minimum duration in seconds for a detected event. Events shorter than this are filtered out immediately after detection |
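To make the interplay between `open_state_range` and `min_frac_os` concrete, the sketch below computes the fraction of samples inside the open state range for two made-up current traces. It only illustrates the idea described in the table: the function name, the inclusive bounds and the sample values are assumptions, and poreFlow's actual implementation may differ.

```python
def open_state_fraction(current_pa, open_state_range):
    """Fraction of samples whose current lies within the open state range."""
    low, high = open_state_range
    in_range = sum(1 for value in current_pa if low <= value <= high)
    return in_range / len(current_pa)


OPEN_STATE_RANGE = [200, 300]  # pA, default from the table above
MIN_FRAC_OS = 0.01             # default: 1% of the recording

# Illustrative traces (not real data): one mostly open, one mostly blocked.
open_channel = [250.0, 248.5, 90.2, 85.7, 251.3, 40.0, 249.9, 250.4]
blocked_channel = [50.0] * 999 + [250.0]

for label, trace in [("open", open_channel), ("blocked", blocked_channel)]:
    fraction = open_state_fraction(trace, OPEN_STATE_RANGE)
    verdict = "accepted" if fraction >= MIN_FRAC_OS else "rejected"
    print(f"{label}: {fraction:.3f} -> {verdict}")
```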
Step finding settings¶
Section name: [step_finding]
Parameters for detecting steps within events.
| Setting | Type | Default | Description |
|---|---|---|---|
| `sensitivity` | `float` | `3.0` | Sensitivity of the step finder algorithm. Higher values result in fewer, more significant steps being detected |
| `min_level_length` | `int` | `10` | Minimum number of samples for a step to be valid. Prevents detection of very short, noise-like steps |
Event selection settings¶
Section name: [event_filtering]
Parameters for filtering events based on their characteristics after step finding.
| Setting | Type | Default | Description |
|---|---|---|---|
| `min_duration` | `float` | `1.0` | Minimum duration in seconds for an event to pass filtering |
| `min_n_steps` | `int` | `45` | Minimum number of steps required in an event |
| `max_n_steps` | `int` | `800` | Maximum number of steps allowed in an event |
| `min_binned_entropy_of_means` | `float` | `2.5` | Minimum binned entropy of step means. Typical value of 2.5 for M2 MspA DNA sequencing. Measures the variability/disorder of current levels |
| `min_step_rate` | `float` | `12.0` | Minimum step rate in Hz. Typical value of 12 Hz for Hel308 at 37°C with 1 mM ATP |
| `min_ios` | `float` | `0.0` | Minimum local open state current in pA. Events with open state below this are rejected |
| `max_ios` | `float` | `300.0` | Maximum local open state current in pA. Events with open state above this are rejected |
DNA-peptide boundary detection settings¶
Section name: [boundary]
Work in progress
These settings are currently WIP in poreFlow. Using these is not recommended yet. Note that their names might also change in future poreflow versions.
Parameters for DNA-peptide boundary detection and alignment.
| Setting | Type | Default | Description |
|---|---|---|---|
| `template_DNA_5_3` | `str` | `TCCT...CGCT`[^3] | Template DNA sequence (5' to 3') used for boundary detection and alignment |
| `trim_right` | `int` | `8` | Number of bases to trim from the right end of events during alignment |
| `ref_length` | `int` | `54` | Reference length in bases for the expected DNA sequence |
| `segment_step_to_end` | `int` | `4` | Number of steps from the end of the event to consider for boundary detection |
| `window_length` | `int` | `40` | Window length in bases for boundary analysis |
Code examples¶
The examples below use the default config file, with one alteration: the environment setting verbose is set to 1.
Loading a raw recording¶
This example shows how the file name and folder stored in a configuration can be used to specify which file to load.
- Load a config file using `tomllib`.
- Read the input section of the file.
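The two steps above can be sketched as follows. To keep the snippet self-contained, the configuration is parsed from a string; for a file on disk you would open it in binary mode and call `tomllib.load` instead. Opening the recording itself with poreFlow is left out here, since that call is not shown on this page.

```python
from pathlib import Path

try:
    import tomllib  # standard library from Python 3.11 onwards
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport for older versions

CONFIG_TEXT = """
[input]
name = "My File"
folder_path = "my_project/"
file_name = "my_measurement.fast5"
resample_to_freq = 5000
"""

# 1. Load the configuration. For a file on disk (hypothetical name):
#    with open("config.toml", "rb") as f:
#        config = tomllib.load(f)
config = tomllib.loads(CONFIG_TEXT)

# 2. Read the input section and build the path to the recording.
input_cfg = config["input"]
file_path = Path(input_cfg["folder_path"]) / input_cfg["file_name"]
print(f"{input_cfg['name']}: {file_path}")
```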
Event detection¶
This example shows how a configuration file can be used to set the arguments for event finding. This can be implemented cleanly using argument unpacking; the notes below give more information.
- `poreflow.File.find_events` is an interface for `poreflow.events.detection.find_events`. If the TOML file has an `event_finding` section with the same keys as `poreflow.events.detection.find_events` (`open_state_range`, `min_duration`, etc.), it is easiest and cleanest to simply unpack the arguments into the method using the `**` syntax.
- We can also unpack the environment section into this method, as `poreflow.File.find_events` takes both a `verbose` and a `processes` keyword argument.
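The unpacking pattern described above can be illustrated without poreFlow installed. In the sketch below, `find_events` is a hypothetical stand-in that only echoes its arguments; it borrows a few keyword names from the configuration but is not the real poreFlow function or its full signature.

```python
try:
    import tomllib  # standard library from Python 3.11 onwards
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport for older versions

CONFIG_TEXT = """
[env]
processes = 2
verbose = 1

[event_finding]
open_state_range = [200, 300]
voltage_range = [175, 185]
min_duration = 1
"""


def find_events(open_state_range, voltage_range, min_duration=1.0,
                verbose=0, processes=1):
    """Hypothetical stand-in for an event finder; returns what it received."""
    return {
        "open_state_range": open_state_range,
        "voltage_range": voltage_range,
        "min_duration": min_duration,
        "verbose": verbose,
        "processes": processes,
    }


config = tomllib.loads(CONFIG_TEXT)

# Each section is a dict, so ** turns its keys into keyword arguments.
call_args = find_events(**config["event_finding"], **config["env"])
print(call_args)
```

This only works when the keys in each section match the keyword names of the function and the two sections do not share a key; otherwise Python raises a `TypeError`.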
Step finding and event selection¶
This example shows how a configuration file can be used to set the arguments for step finding and event selection.
- As in the previous example, `poreflow.File.find_steps` is an interface for `poreflow.EventDataFrame.find_steps`. If the TOML file has a `step_finding` section with the same keys as `poreflow.EventDataFrame.find_steps` (`sensitivity` and `min_level_length`), it is easiest and cleanest to simply unpack the arguments into the method using the `**` syntax.
- We can also unpack the environment section into this method, as `poreflow.File.find_steps` takes both a `verbose` and a `processes` keyword argument.
(1) Step finding:
Starting 2 workers for 155 items
------- SUMMARY -------
Found steps in 155 events.
Unable to find steps in 0 (0%) events.
Found a total of 27592 steps.
Good channels had 178.01 steps on average.
(2) Event selection features:
start_idx end_idx n_pts ... n_steps step_rate binned_entropy_of_means
0 59602 118670 59068 ... 258 21.839236 3.105303
1 153648 205506 51858 ... 336 32.396159 3.059400
2 259337 298494 39157 ... 186 23.750543 3.368490
3 303058 323739 20681 ... 109 26.352691 3.298638
4 346801 359030 12229 ... 61 24.940715 3.409057
[5 rows x 14 columns]
(3) Event filtering results:
condition duration>1 n_steps>45 n_steps<800 ... ios>0 ios<300.0 all
event ...
0 True True True ... True True True
1 True True True ... True True True
2 True True True ... True True True
3 True True True ... True True True
4 True True True ... True True True
[5 rows x 8 columns]
Selecting 112 out of 155 events (72%)
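The filtering conditions in the output above can be reproduced by hand for the five printed events. The sketch below is an illustration only, not poreFlow's implementation: it assumes a sample rate of 5000 Hz (the default `resample_to_freq`), derives duration and step rate from `n_pts` and `n_steps` (which matches the printed `step_rate` values), and leaves out the `ios` checks because those columns are not shown.

```python
SAMPLE_RATE_HZ = 5000  # assumption: recording resampled to resample_to_freq

# Thresholds from the [event_filtering] section.
event_filtering = {
    "min_duration": 1,
    "min_n_steps": 45,
    "max_n_steps": 800,
    "min_binned_entropy_of_means": 2.5,
    "min_step_rate": 12.0,
}

# (n_pts, n_steps, binned_entropy_of_means) for the five events printed above.
events = [
    (59068, 258, 3.105303),
    (51858, 336, 3.059400),
    (39157, 186, 3.368490),
    (20681, 109, 3.298638),
    (12229, 61, 3.409057),
]


def check_event(n_pts, n_steps, entropy, cfg):
    """Evaluate each filter condition for one event (illustrative only)."""
    duration = n_pts / SAMPLE_RATE_HZ  # seconds
    step_rate = n_steps / duration     # steps per second (Hz)
    return {
        "duration": duration > cfg["min_duration"],
        "n_steps_min": n_steps > cfg["min_n_steps"],
        "n_steps_max": n_steps < cfg["max_n_steps"],
        "entropy": entropy > cfg["min_binned_entropy_of_means"],
        "step_rate": step_rate > cfg["min_step_rate"],
    }


results = [check_event(*event, event_filtering) for event in events]
selected = [all(conditions.values()) for conditions in results]
print(selected)
```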
[^1]: Note that `processes` must be equal to or smaller than the number of processes available on your device. To find out the number of cores available on your PC, consult the instructions for your operating system. For Mac, open a terminal and run `sysctl -n hw.ncpu`.
[^2]: Mainly important for ONT devices. ONT devices contain many channels, each of which is analysed for events. Some channels are blocked for the full duration of the recording, and thus contain few samples within `open_state_range`. Channels with a low fraction of samples at open state current are considered bad, and event detection is stopped early. Note that this is also useful for UTube measurements, as it will throw an error if the user selects a wrong open state range. If there is a very large number of reads (i.e. very little time at open state), you can consider setting this to 0.1% or lower.
[^3]: Middle nucleotides have been removed here for brevity. Full sequence: `TCCTTTTATCGTCATCATCTTTGTAATCGCCGCT`