Annotations¶
Overview¶
PoreFlow separates original measurement data from analysis results using an annotation system. This has two main advantages:
- The integrity of the original data is preserved
- This enables more flexible and efficient analysis
A schematic of the system is shown below. The data file is the original file created on the measurement device (ONT/UTube) and contains the raw current and voltage data. poreFlow only ever reads from this file, which preserves its integrity.
On the other hand, the annotation file is used by poreFlow to both store and retrieve data. This file contains information like events, steps, open state current fits, and more.
graph TD;
A@{ shape: cyl, label: Data File}-->|Read only|D[poreFlow]
E@{ shape: cyl, label: Annotation File}<-->|Read/write|D
Usage¶
Interfacing between the data and annotation files is done automatically by poreflow.File. In the section below, you can find usage examples outlining how this works.
Opening a data file for the first time¶
When a data file is opened for the first time, poreFlow will automatically create an annotation file in which to store analysis results. Consider this simple project structure:
Then this .dat data file can be opened using poreflow.File:
- A .fast5 annotation file is created automatically here.
- Analysis results, such as events, are automatically stored in the annotation file.
Then this .fast5 data file can be opened using poreflow.File:
- An annotation file is created automatically here.
- Analysis results, such as events, are automatically stored in the annotation file.
In the background, poreFlow automatically creates an annotation file. By default, this file is placed in the same
directory as the data file and has the same name as the data file, but with the extension
.annot.fast5, in this case: measurement.annot.fast5.
All analysis results are automatically saved to this file.
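The default naming rule described above can be sketched with the standard library alone. The helper name `default_annotation_path` is hypothetical and is not part of poreFlow; it only mirrors the stated rule (same directory, same base name, `.annot.fast5` extension). That a `.dat` file maps to the same annotation name is an assumption based on the text above.

```python
from pathlib import PurePosixPath

ANNOTATION_SUFFIX = ".annot.fast5"  # extension named in the text above


def default_annotation_path(data_path: str) -> PurePosixPath:
    """Hypothetical helper (not poreFlow API): apply the default naming rule."""
    path = PurePosixPath(data_path)
    # path.stem drops only the final extension: "measurement.fast5" -> "measurement"
    return path.with_name(path.stem + ANNOTATION_SUFFIX)


print(default_annotation_path("my-project/measurement.fast5"))
# my-project/measurement.annot.fast5
print(default_annotation_path("my-project/measurement.dat"))
# my-project/measurement.annot.fast5
```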
Opening an annotated file¶
Info
For simplicity, further examples on this page use an ONT .fast5 file, but they work just as well with UTube .dat files.
If an annotation file already exists next to the data file, it's automatically loaded. Consider this example, where we continue analyzing the file from the previous section, in which we already ran event detection.
- Event data is loaded from the annotation file. Note that in this example, f.get_events would also work.
Storing annotations in a different directory¶
Annotations can also be created in, and read from, a separate directory automatically.
To do so, specify the annotation_path argument in poreflow.File.
my-project/
├── measurement.fast5
└── analysis/
└── annotations/
└── measurement.annot.fast5
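As a rough illustration of the layout above, the following sketch computes where the annotation for a data file would sit inside a separate directory. The function `annotation_path_in` is a hypothetical helper written for this page, not a poreFlow call, and it assumes the annotation keeps the data file's base name.

```python
from pathlib import PurePosixPath


def annotation_path_in(data_path: str, annotation_dir: str) -> PurePosixPath:
    """Hypothetical helper (not poreFlow API): annotation location in a separate directory."""
    data = PurePosixPath(data_path)
    # Keep the data file's base name, swap the extension, and move it to annotation_dir
    return PurePosixPath(annotation_dir) / (data.stem + ".annot.fast5")


print(annotation_path_in("my-project/measurement.fast5", "my-project/analysis/annotations"))
# my-project/analysis/annotations/measurement.annot.fast5
```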
Opening via an annotation file¶
You can also open a data file by passing its annotation file to poreflow.File. The annotation stores a link to its parent data file, allowing the file loader to automatically find and load the data file.
import poreflow as pf
# Open using the annotation file
with pf.File("measurement.annot.fast5") as f: # (1)!
    events = f.get_events()
- The linked measurement.fast5 is automatically loaded here
Tip
Both the fast5 and annot.fast5 files must be in the same directory.
Multiple annotations¶
A powerful feature of the annotation system is the ability to analyze the same raw data in different ways, each with its own annotation file. Consider a simple project:
Imagine we want to analyze this measurement using two different approaches. The annotation system makes this easy to manage: create a separate annotation file for each analysis:
In this example we process the measurement twice, storing the annotation in a different folder for each.
This approach results in the following project structure:
project/
├── measurement.fast5
├── default/
│ └── measurement.annot.fast5 # Default process
└── long/
└── measurement.annot.fast5 # Alternative process
Why search_path?
The search_path parameter in poreflow.File specifies the folder in which to look for an annotation. If no
annotation is found there, a new one is created in that folder. That is why this parameter is used for both
saving and loading multiple annotations stored in different folders.
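The "search, then create if missing" behaviour described for search_path can be sketched as follows. This is a minimal illustration under stated assumptions and not poreFlow's implementation: `find_or_create_annotation` is a hypothetical helper, and an empty placeholder file stands in for a real annotation file.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def find_or_create_annotation(data_path: Path, search_path: Path) -> Tuple[Path, bool]:
    """Hypothetical helper (not poreFlow API). Returns (annotation_path, created)."""
    annotation = search_path / (data_path.stem + ".annot.fast5")
    if annotation.exists():
        # An annotation already lives in this folder: reuse it
        return annotation, False
    # None found: create the folder if needed and an empty placeholder file
    search_path.mkdir(parents=True, exist_ok=True)
    annotation.touch()
    return annotation, True


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    data = project / "measurement.fast5"
    data.touch()  # placeholder for the raw data file

    # One annotation per analysis, each in its own folder
    for folder in ("default", "long"):
        path, created = find_or_create_annotation(data, project / folder)
        print(folder, path.name, created)

    # A second call on the same folder finds the existing annotation
    _, created_again = find_or_create_annotation(data, project / "default")
    print(created_again)
```

Because the lookup is keyed on the folder, the same raw file can carry any number of independent annotations without their names colliding.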
Storing configurations alongside annotations
Storing annotations in different folders could be particularly useful when also storing the configurations (such as poreflow.toml) used for processing alongside the annotations. For example:
In this example we process the measurement twice, storing the annotations with different filenames.
This approach results in the following project structure:
What's Stored Where¶
Which objects are stored in which file is summarized in the table below.
| Data Type | Location | File Extension | Access Mode |
|---|---|---|---|
| Raw current/voltage | Original file | .fast5 or .dat | Read-only |
| Sampling rate | Original file | .fast5 or .dat | Read-only |
| Events | Annotation file | .annot.fast5 | Read-write |
| Steps | Annotation file | .annot.fast5 | Read-write |
| Open state fits | Annotation file | .annot.fast5 | Read-write |
Best Practices¶
Some general tips on how to use annotations:
- Version Control: While raw .fast5 files are generally too large to store in version control (Git), .annot.fast5 files are much smaller and could be stored in Git.
- Backup Strategy: While raw data files are precious and should be backed up regularly, annotation files can be regenerated from the raw data if the configuration of the analysis steps is known. This means that annotation files can be more readily deleted.
- Collaboration: A corollary of the above: while raw data files are shared on the drive, it makes sense to keep annotation files on your own device only.