Segmenter

Background

Nanopore signal can have larger structurally identifiable regions. These regions can be separated into 3 categories: Stall, homopolymer, and regular DNA/cDNA sequence. Identifying the stall at the start of a nanopore read allows for extraction, visualisation, and comparison of the regular sequence. Finding homopolymer sequences can aid in identifying full lenght cDNA reads, breaking up contatonated reads, or identifying adjacent regions. Using segmenter in conjunction with MotifSeq allows for simple targeting within the raw signal. Selected regions can then be, for example, used as input to various classical comparison methods, such as various types dynamic programming, or used as training sets for Machine/Deep Learning techniques.

The core algorithm calculates the median of the full read, sets thresholds about the median using a fraction of the standard deviation, and uses a sliding window to find regions of signal which stay inside the threshold region. Error correction and segment merging helps handle the noisy nature of the raw signal data.

This is a demonstration for analysing raw signal data. The algorithm was developed out frustration with the overcomplicated or incompatible methods employed in other time analysis fields, such as sound wave or ECG detection. It is has undergone minimal optimisation, though still gives surprisingly good results.

Image of full length cDNA showing stall and PolyA tail homopolymer

Getting Started

Segmenter takes a flat file list of fast5 paths, a top directory of file paths, or a signal file from SquigglePull. It can be set to only detect stalls, or it can detect homopolymers within a distance of the start of a read depending on read structure. Many of the core arguments in the algorithm can be modified with arguments for tuning output.

Instructions for use

Segmenter has a built visualisation using -v. This is very helpful in tuning the parameters. Default values are a rough estimate for getting some results from most reads.

Quick start

Identify any segments in folder and visualise each one

Use f to full screen a plot, and ctrl+w to close a plot and move to the next one.

python segmenter.py -p ./test/ -v

Stall identification

python segmenter.py -s signals.tsv.gz -ku -j 100 > signals_stall_segments.tsv

Full usage

usage: segmenter.py [-h] [-f F5F | -p F5_PATH | -s SIGNAL] [-n NUM] [-e ERROR]
                    [-c CORRECTOR] [-w WINDOW] [-d SEG_DIST] [-t STD_SCALE]
                    [-v] [-g] [-b GAP_DIST] [-k] [-u] [-l STALL_LEN]
                    [-j STALL_START] [-scale_hi SCALE_HI]
                    [-scale_low SCALE_LOW]

segmenter - script to find obvious regions in squiggle data

optional arguments:
  -h, --help            show this help message and exit
  -f F5F, --f5f F5F     File list of fast5 paths
  -p F5_PATH, --f5_path F5_PATH
                        Fast5 top dir
  -s SIGNAL, --signal SIGNAL
                        Extracted signal file from squigglePull
  -n NUM, --Num NUM     Section of signal to look at - default 0=all
  -e ERROR, --error ERROR
                        Allowable error in segment algorithm
  -c CORRECTOR, --corrector CORRECTOR
                        Window size for increasing total error correction -
                        better long segment detection
  -w WINDOW, --window WINDOW
                        Minimum segment window size to be detected
  -d SEG_DIST, --seg_dist SEG_DIST
                        Maximum distance between 2 segments to be merged into
                        1
  -t STD_SCALE, --std_scale STD_SCALE
                        Scale factor of STDev about median
  -v, --view            view each output
  -g, --gap             Turn on gap distance for stall to polyTAil
  -b GAP_DIST, --gap_dist GAP_DIST
                        Maximum distance between stall and polyTAil segment -
                        for 10X/dRNA
  -k, --stall           Turn on stall detection - must be present
  -u, --test            Run Tests
  -l STALL_LEN, --stall_len STALL_LEN
                        Minimum percentage of minimum window segment for
                        initial stall segment
  -j STALL_START, --stall_start STALL_START
                        Maximum distance for start of stall segment to be
                        detected
  -scale_hi SCALE_HI, --scale_hi SCALE_HI
                        Upper limit for signal outlier scaling
  -scale_low SCALE_LOW, --scale_low SCALE_LOW
                        Lower limit for signal outlier scaling

Detailed Parameter descriptions

Segmenter Parameters Default Description
-n --Num 0 This allows for the signal to be trimmed a static value. Default is 0 which means get all signal. This could be to limit the search space for segments, or to bring the start of a read into focus using -v, --view. For example, -n 2000 would cut the raw signal after 2000 datapoints. This may impact where segments are found, so it's best to use this only for testing purpose.
-w --window 200 The minimum size of a segment that will be detected. Larger structural sections like stalls and homopolymers are significantly bigger than anything else in the signal. However this can vary for a number of reasons, such as the average length of a polyA tail. Stalls can be almost any size. See -l, --stall_len to see how this is handled. Decreasing this value will increase sensitivity, but decrease accuracy. Increasing it will increase accuracy, but reduce sensitivity.
-t --std_scale 0.75 To detect a segment, a simple yet effective filter is run across the signal. First it calculates the median of the read, then sets two thresholds above and below the median by taking the standard deviation of the read and scaling it by this scale factor. A segment is defined as a region that fluctuates between these two thresholds for at least the minimum window size, -w --window.
-l --stall_len 0.25 The stall could be a large range of lengths, including quite a fraction smaller than the minimum window for a homopolymer region. Currently, this value just scales the window length, however this may not be the best approach as the window gets very large or very small.
-e --error 5 Error is defined as the total number of times the signal can rise above or below the thresholds set above and not break the segment detection. This can be modified by other settings.
-c --corrector 50 The corrector is for helping with long segment detection. Once the segment currently being detected passes the -w --window length, this kicks in and starts increasing the total allowable error for every X values it extends.
-d --seg_dist 50 If the start of a new segment is within this distance of the last detected segment, it will merge the two segments toghers. This especially helps with long segments that have a significant disruption in them that the error correction can't account for without impacting accuracy.
-scale_low -scale_hi 0, 900 Throughout nanopore raw signal are large spikes which can impact calculated values such as median and standard deviation. This scale factor helps limit the region in which signal will be included. Note: it removes the signals that go out of bounds. The defaults are good for DNA and cDNA. For RNA, -scale_hi 1200 is more appropriate. To stop this from working, set both values to -50000 and 50000 respectively.
User based tests users can modify the test_segs function to automate their own tests
-u --test False Turn on user tests. This will allow for various tests to be done on the segments. A few have been added for some ideas and how to integrate them into the script.
-k --stall False Use to turn on Stall detection. Ie, the start position of the first detected segment must be within a given distance.
-j --stall_start 300 The maximum distance from the start of a read the start position of the first detected segment must be within a given distance.
-g --gap False Turns on gap distance for detecting a polyTAil within a certain distance of a detected stall
-b -gap_dist 3000 The maximum distance the end of a stall segment can be from the start of the next segment. useful for finding polyT/A tails within an expected search space.