Segmenter
Background
Nanopore signal can have larger structurally identifiable regions. These regions can be separated into 3 categories: Stall, homopolymer, and regular DNA/cDNA sequence. Identifying the stall at the start of a nanopore read allows for extraction, visualisation, and comparison of the regular sequence. Finding homopolymer sequences can aid in identifying full lenght cDNA reads, breaking up contatonated reads, or identifying adjacent regions. Using segmenter in conjunction with MotifSeq allows for simple targeting within the raw signal. Selected regions can then be, for example, used as input to various classical comparison methods, such as various types dynamic programming, or used as training sets for Machine/Deep Learning techniques.
The core algorithm calculates the median of the full read, sets thresholds about the median using a fraction of the standard deviation, and uses a sliding window to find regions of signal which stay inside the threshold region. Error correction and segment merging helps handle the noisy nature of the raw signal data.
This is a demonstration for analysing raw signal data. The algorithm was developed out frustration with the overcomplicated or incompatible methods employed in other time analysis fields, such as sound wave or ECG detection. It is has undergone minimal optimisation, though still gives surprisingly good results.
Getting Started
Segmenter takes a flat file list of fast5 paths, a top directory of file paths, or a signal file from SquigglePull. It can be set to only detect stalls, or it can detect homopolymers within a distance of the start of a read depending on read structure. Many of the core arguments in the algorithm can be modified with arguments for tuning output.
Instructions for use
Segmenter has a built visualisation using -v
. This is very helpful in tuning the parameters. Default values are a rough estimate for getting some results from most reads.
Quick start
Identify any segments in folder and visualise each one
Use f
to full screen a plot, and ctrl+w
to close a plot and move to the next one.
python segmenter.py -p ./test/ -v
Stall identification
python segmenter.py -s signals.tsv.gz -ku -j 100 > signals_stall_segments.tsv
Full usage
usage: segmenter.py [-h] [-f F5F | -p F5_PATH | -s SIGNAL] [-n NUM] [-e ERROR]
[-c CORRECTOR] [-w WINDOW] [-d SEG_DIST] [-t STD_SCALE]
[-v] [-g] [-b GAP_DIST] [-k] [-u] [-l STALL_LEN]
[-j STALL_START] [-scale_hi SCALE_HI]
[-scale_low SCALE_LOW]
segmenter - script to find obvious regions in squiggle data
optional arguments:
-h, --help show this help message and exit
-f F5F, --f5f F5F File list of fast5 paths
-p F5_PATH, --f5_path F5_PATH
Fast5 top dir
-s SIGNAL, --signal SIGNAL
Extracted signal file from squigglePull
-n NUM, --Num NUM Section of signal to look at - default 0=all
-e ERROR, --error ERROR
Allowable error in segment algorithm
-c CORRECTOR, --corrector CORRECTOR
Window size for increasing total error correction -
better long segment detection
-w WINDOW, --window WINDOW
Minimum segment window size to be detected
-d SEG_DIST, --seg_dist SEG_DIST
Maximum distance between 2 segments to be merged into
1
-t STD_SCALE, --std_scale STD_SCALE
Scale factor of STDev about median
-v, --view view each output
-g, --gap Turn on gap distance for stall to polyTAil
-b GAP_DIST, --gap_dist GAP_DIST
Maximum distance between stall and polyTAil segment -
for 10X/dRNA
-k, --stall Turn on stall detection - must be present
-u, --test Run Tests
-l STALL_LEN, --stall_len STALL_LEN
Minimum percentage of minimum window segment for
initial stall segment
-j STALL_START, --stall_start STALL_START
Maximum distance for start of stall segment to be
detected
-scale_hi SCALE_HI, --scale_hi SCALE_HI
Upper limit for signal outlier scaling
-scale_low SCALE_LOW, --scale_low SCALE_LOW
Lower limit for signal outlier scaling
Detailed Parameter descriptions
Segmenter Parameters | Default | Description |
---|---|---|
-n --Num |
0 | This allows for the signal to be trimmed a static value. Default is 0 which means get all signal. This could be to limit the search space for segments, or to bring the start of a read into focus using -v, --view . For example, -n 2000 would cut the raw signal after 2000 datapoints. This may impact where segments are found, so it's best to use this only for testing purpose. |
-w --window |
200 | The minimum size of a segment that will be detected. Larger structural sections like stalls and homopolymers are significantly bigger than anything else in the signal. However this can vary for a number of reasons, such as the average length of a polyA tail. Stalls can be almost any size. See -l, --stall_len to see how this is handled. Decreasing this value will increase sensitivity, but decrease accuracy. Increasing it will increase accuracy, but reduce sensitivity. |
-t --std_scale |
0.75 | To detect a segment, a simple yet effective filter is run across the signal. First it calculates the median of the read, then sets two thresholds above and below the median by taking the standard deviation of the read and scaling it by this scale factor. A segment is defined as a region that fluctuates between these two thresholds for at least the minimum window size, -w --window . |
-l --stall_len |
0.25 | The stall could be a large range of lengths, including quite a fraction smaller than the minimum window for a homopolymer region. Currently, this value just scales the window length, however this may not be the best approach as the window gets very large or very small. |
-e --error |
5 | Error is defined as the total number of times the signal can rise above or below the thresholds set above and not break the segment detection. This can be modified by other settings. |
-c --corrector |
50 | The corrector is for helping with long segment detection. Once the segment currently being detected passes the -w --window length, this kicks in and starts increasing the total allowable error for every X values it extends. |
-d --seg_dist |
50 | If the start of a new segment is within this distance of the last detected segment, it will merge the two segments toghers. This especially helps with long segments that have a significant disruption in them that the error correction can't account for without impacting accuracy. |
-scale_low -scale_hi |
0, 900 | Throughout nanopore raw signal are large spikes which can impact calculated values such as median and standard deviation. This scale factor helps limit the region in which signal will be included. Note: it removes the signals that go out of bounds. The defaults are good for DNA and cDNA. For RNA, -scale_hi 1200 is more appropriate. To stop this from working, set both values to -50000 and 50000 respectively. |
User based tests | users can modify the test_segs function to automate their own tests |
|
-u --test |
False |
Turn on user tests. This will allow for various tests to be done on the segments. A few have been added for some ideas and how to integrate them into the script. |
-k --stall |
False |
Use to turn on Stall detection. Ie, the start position of the first detected segment must be within a given distance. |
-j --stall_start |
300 | The maximum distance from the start of a read the start position of the first detected segment must be within a given distance. |
-g --gap |
False |
Turns on gap distance for detecting a polyTAil within a certain distance of a detected stall |
-b -gap_dist |
3000 | The maximum distance the end of a stall segment can be from the start of the next segment. useful for finding polyT/A tails within an expected search space. |