Annotations¶

Overview¶

poreFlow separates original measurement data from analysis results using an annotation system. This has two main advantages:

The integrity of the original data is preserved.
Analysis becomes more flexible and efficient.

A schematic of the system is shown below. The data file is the original file created on the measurement device (ONT/UTube) and contains the raw current and voltage data. poreFlow only ever reads from this file, which preserves its integrity.

On the other hand, the annotation file is used by poreFlow to both store and retrieve data. This file contains information like events, steps, open state current fits, and more.

graph TD;
    A@{ shape: cyl, label: Data File}-->|Read only|D[poreFlow]
    E@{ shape: cyl, label: Annotation File}<-->|Read/write|D
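
The read-only/read-write split in the diagram can be illustrated with plain Python file handles. This is only a sketch of the access pattern on placeholder files, not poreFlow's implementation; the helper `open_pair` and the file name `measurement.annot.bin` are made up for the illustration.

```python
import io
import tempfile
from pathlib import Path


def open_pair(data_path, annotation_path):
    """Open the data file read-only and the annotation file read-write."""
    data = open(data_path, "rb")  # read-only: any write attempt raises an error
    # "a+b": created if missing, existing content kept, readable and writable
    annotation = open(annotation_path, "a+b")
    return data, annotation


with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "measurement.dat"
    data_path.write_bytes(b"raw samples")  # stand-in for raw measurement data
    annotation_path = Path(tmp) / "measurement.annot.bin"

    data, annotation = open_pair(data_path, annotation_path)
    with data, annotation:
        raw = data.read()               # reading the data file is allowed
        annotation.write(b"events")     # results go to the annotation file
        try:
            data.write(b"oops")         # writing to the data file is rejected
            write_blocked = False
        except io.UnsupportedOperation:
            write_blocked = True

print(raw, write_blocked)
```

Because the data handle is opened without write access, an accidental write fails immediately instead of altering the measurement.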

Usage¶

Interfacing between the data and annotation files is handled automatically by poreflow.File. The sections below give usage examples that show how this works.

Opening a data file for the first time¶

When a data file is opened for the first time, poreFlow will automatically create an annotation file in which to store analysis results. Consider this simple project structure:

UTube

Project structure

my-project/
 └── measurement.dat

This .dat data file can then be opened using poreflow.File:

import poreflow as pf

with pf.File("measurement.dat") as f: # (1)!
    f.find_events() # (2)!

A .fast5 file and an annotation file are created automatically here.
Analysis results, such as events, are automatically stored in the annotation file.

ONT

Project structure

my-project/
 └── measurement.fast5

This .fast5 data file can then be opened using poreflow.File:

import poreflow as pf

with pf.File("measurement.fast5") as f: # (1)!
    f.find_events() # (2)!

An annotation file is created automatically here.
Analysis results, such as events, are automatically stored in the annotation file.

In the background, poreFlow automatically creates an annotation file. By default, this file is placed in the same directory as the data file and has the same base name as the data file, but with the extension .annot.fast5. In this case: measurement.annot.fast5.

UTube

Project structure (after opening)

my-project/
 ├── measurement.dat    
 ├── measurement.fast5    
 └── measurement.annot.fast5  <-- Analysis results (read-write)

ONT

Project structure (after opening)

my-project/
 ├── measurement.fast5    
 └── measurement.annot.fast5  <-- Analysis results (read-write)

All analysis results are automatically saved to this file.
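
The default naming rule described in this section can be sketched with `pathlib`. The helper `default_annotation_path` is hypothetical and only mirrors the rule stated in the text (same directory, same base name, extension .annot.fast5); it is not part of poreFlow's API.

```python
from pathlib import Path

ANNOTATION_SUFFIX = ".annot.fast5"  # extension named in the text


def default_annotation_path(data_path):
    """Return the annotation path expected next to a data file (hypothetical helper)."""
    data_path = Path(data_path)
    # Path.stem drops the final suffix: "measurement.dat" -> "measurement"
    return data_path.with_name(data_path.stem + ANNOTATION_SUFFIX)


print(default_annotation_path("my-project/measurement.fast5"))
print(default_annotation_path("my-project/measurement.dat"))
```

Both an ONT .fast5 file and a UTube .dat file map to the same annotation file name, which matches the project structures shown in this section.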

Opening an annotated file¶

Info

For simplicity, the remaining examples on this page use an ONT .fast5 file, but they work just as well with UTube .dat files.

If an annotation file already exists next to the data file, it's automatically loaded. Consider this example, where we continue analyzing the file from the previous section, in which we already ran event detection.

Project structure

my-project/
 ├── measurement.fast5    
 └── measurement.annot.fast5

import poreflow as pf

with pf.File("measurement.fast5") as f:
    print(f.has_events)

    df = f.events  # (1)!

Event data is loaded from the annotation file. Note that in this example, f.get_events() would work as well.

Storing annotations in a different directory¶

Annotations can also be created in, and read from, a separate directory. To do so, specify the annotation_path argument of poreflow.File.

Project structure

my-project/
 ├── measurement.fast5
 └── analysis/
     └── annotations/
         └── measurement.annot.fast5

import poreflow as pf

annotations_dir = "analysis/annotations"

with pf.File("measurement.fast5", annotation_path=annotations_dir) as f:
    df = f.events
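
When annotations live in a separate directory, it can help to check which measurements already have one before starting a long analysis. The sketch below does this with the standard library only. The function `annotation_status` is hypothetical, and it assumes annotations keep the default `<base name>.annot.fast5` file name inside the chosen directory.

```python
import tempfile
from pathlib import Path


def annotation_status(project_dir, annotation_dir):
    """Map each data file name to True if its annotation exists in annotation_dir."""
    status = {}
    for data_file in sorted(Path(project_dir).glob("*.fast5")):
        if data_file.name.endswith(".annot.fast5"):
            continue  # skip annotation files stored next to the data
        expected = Path(annotation_dir) / (data_file.stem + ".annot.fast5")
        status[data_file.name] = expected.exists()
    return status


# Demonstration on empty placeholder files in a temporary project folder
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    annotations = project / "analysis" / "annotations"
    annotations.mkdir(parents=True)
    (project / "measurement.fast5").touch()
    (project / "other.fast5").touch()
    (annotations / "measurement.annot.fast5").touch()

    status = annotation_status(project, annotations)

print(status)
```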

Opening via an annotation file¶

You can also open a data file by passing its annotation file to poreflow.File. The annotation file stores a reference to its parent data file, allowing the file loader to find and load the data file automatically.

Project structure

my-project/
 ├── measurement.fast5    
 └── measurement.annot.fast5

import poreflow as pf

# Open using the annotation file
with pf.File("measurement.annot.fast5") as f: # (1)!
    events = f.get_events()

The linked measurement.fast5 is automatically loaded here

Tip

Both the .fast5 and .annot.fast5 files must be in the same directory.
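
Since both files must sit in the same directory, a quick pre-flight check can catch a missing data file before opening. The sketch below is an assumption-laden illustration: poreFlow locates the data file through the reference stored inside the annotation, whereas the hypothetical helper `sibling_data_file` simply guesses the name from the default naming scheme, so it does not apply to annotations with custom names.

```python
import tempfile
from pathlib import Path


def sibling_data_file(annotation_path):
    """Return the .fast5 file expected next to an annotation, or None if absent."""
    annotation_path = Path(annotation_path)
    suffix = ".annot.fast5"
    if not annotation_path.name.endswith(suffix):
        raise ValueError(f"not an annotation file: {annotation_path.name}")
    base = annotation_path.name[: -len(suffix)]
    candidate = annotation_path.with_name(base + ".fast5")
    return candidate if candidate.exists() else None


# Demonstration on empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    annotation = Path(tmp) / "measurement.annot.fast5"
    annotation.touch()
    before = sibling_data_file(annotation)   # data file not present yet
    (Path(tmp) / "measurement.fast5").touch()
    after = sibling_data_file(annotation)    # data file now sits alongside

print(before, after.name)
```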

Multiple annotations¶

A powerful feature of the annotation system is the ability to analyze the same raw data in different ways, each with its own annotation file. Consider a simple project:

Project structure

my-project/
 └── measurement.fast5

Imagine we want to analyze this measurement using two different approaches. The annotation system makes this easy to manage: create a separate annotation file for each analysis.

Storing annotations in different directories

In this example we process the measurement twice, storing the annotation in a different folder for each.

import poreflow as pf

# Analysis 1: Default parameters
with pf.File("measurement.fast5", search_path="default") as f:
    f.find_events(min_duration=0.1)
    print(f.n_events)

# Analysis 2: Long events only 
with pf.File("measurement.fast5", search_path="long") as f:
    f.find_events(min_duration=1)
    print(f.n_events)

This approach results in the following project structure:

Project structure

my-project/
 ├── measurement.fast5                 
 ├── default/
 │   └── measurement.annot.fast5      # Default process
 └── long/
     └── measurement.annot.fast5      # Alternative process

Why search_path?

The search_path parameter of poreflow.File specifies the folder in which to search for an annotation file. If none is found there, a new annotation file is created in that folder. This makes the parameter suitable for both saving and loading multiple annotations stored in different folders.

Storing configurations alongside annotations

Storing annotations in different folders is particularly useful when the configuration used for processing (such as poreflow.toml) is stored alongside the annotations. For example:

Example project structure

my-project/
 ├── measurement.fast5                 
 ├── default/
 │   ├── poreflow.toml                # Default configuration
 │   └── measurement.annot.fast5      # Default process
 └── long/
     ├── poreflow.toml                # Alternative configuration
     └── measurement.annot.fast5      # Alternative process
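
With one annotation per folder, as in the layouts above, the available analysis variants for a measurement can be listed straight from the file system. The following is a minimal sketch using only the standard library; `find_annotation_variants` is a hypothetical helper, and it assumes each variant sits exactly one folder below the project root under the default annotation file name.

```python
import tempfile
from pathlib import Path


def find_annotation_variants(project_dir, data_name):
    """Map sub-folder name -> annotation path for one measurement."""
    annotation_name = Path(data_name).stem + ".annot.fast5"
    return {
        path.parent.name: path
        for path in sorted(Path(project_dir).glob(f"*/{annotation_name}"))
    }


# Demonstration on empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    for folder in ("default", "long"):
        (project / folder).mkdir()
        (project / folder / "measurement.annot.fast5").touch()

    variants = find_annotation_variants(project, "measurement.fast5")

print(sorted(variants))
```

The folder names returned here are the same values that were passed as search_path in the example above, so each variant can be reopened by name.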

Storing annotations with different filenames

In this example we process the measurement twice, storing the annotations with different filenames.

import poreflow as pf

# Analysis 1: Default parameters
with pf.File("measurement.fast5", annotation_name="default") as f:
    f.find_events(min_duration=0.1)
    print(f.n_events)

# Analysis 2: Long events only 
with pf.File("measurement.fast5", annotation_name="long") as f:
    f.find_events(min_duration=1)
    print(f.n_events)

This approach results in the following project structure:

Project structure

my-project/
 ├── measurement.fast5                 
 ├── default.annot.fast5    # Default parameters
 └── long.annot.fast5       # Alternative parameters

What's Stored Where¶

Which objects are stored in which file is summarized in the table below.

Data Type	Location	File Extension	Access Mode
Raw current/voltage	Original file	`.fast5` or `.dat`	Read-only
Sampling rate	Original file	`.fast5` or `.dat`	Read-only
Events	Annotation file	`.annot.fast5`	Read-write
Steps	Annotation file	`.annot.fast5`	Read-write
Open state fits	Annotation file	`.annot.fast5`	Read-write

Best Practices¶

Some general tips on how to use annotations:

Version Control: Raw .fast5 files are generally too large to store in version control (Git), but .annot.fast5 files are much smaller and could be stored in Git.
Backup Strategy: Raw data files are precious and should be backed up regularly. Annotation files can be regenerated from the raw data if the configuration of the analysis steps is known, so they can be deleted more readily.
Collaboration: A corollary of the above: while raw data files are shared on the drive, it makes sense to keep annotation files on your own device only.