Examples

The tools can be used in a variety of ways and wrapped into different workflows; I have provided a few examples here.

Barcode experiment partitioning with fast5_fetcher

fast5_fetcher was originally built to work with Sun Grid Engine (SGE), exploiting the heck out of array jobs. Although it also works locally and on untarred file structures, it really starts to make a difference when operating on multiple sequencing experiments whose file structures are scattered across a file system.

SGE examples

After creating the fastq/paf/flat, sequencing_summary, and index files, create an SGE file.

Note the use of ${SGE_TASK_ID}, which uses the array job's task ID as the pointer to a particular file.

After barcode demultiplexing

Given a similar structure and naming convention, it is possible to group the fast5 files by barcode in the following manner.

├── BC_1.fastq.gz                # Barcode 1
├── BC_2.fastq.gz                # Barcode 2
├── BC_3.fastq.gz                # ...
├── BC_4.fastq.gz          
├── BC_5.fastq.gz            
├── BC_6.fastq.gz           
├── BC_7.fastq.gz           
├── BC_8.fastq.gz           
├── BC_9.fastq.gz           
├── BC_10.fastq.gz           
├── BC_11.fastq.gz            
├── BC_12.fastq.gz            
├── unclassified.fastq.gz        # unclassified reads (skipped by fast5_fetcher in this example; rename to BC_13.fastq.gz to simply fold it into the example)
├── sequencing_summary.txt.gz    # gzipped
├── barcoded.index.gz            # index file containing fast5 file paths
└── fast5/                       # fast5 folder, unsorted
     ├── 1.tar                   # each tar contains ~4000 fast5 files
     ├── 2.tar
     ├── 3.tar
     └── ...
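
The barcoded.index.gz above is just a record of where every fast5 file lives. As a minimal sketch, assuming the index is simply a gzipped list of each tarball's path followed by the fast5 file names it contains, it could be generated with standard tools:

# Sketch only: list each tarball and its contents into a flat index
# (assumes the index format is just one path/filename per line, gzipped)
for f in $(pwd)/fast5/*.tar; do
    echo ${f}
    tar -tf ${f}
done > barcoded.index
gzip barcoded.index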

fetch.sge

# activate virtual python environment
# most HPC systems will use something like "module load"
source ~/work/venv2714/bin/activate

# Create an output directory to take advantage of NVMe drives on node-local storage
mkdir ${TMPDIR}/fast5

# Run fast5_fetcher on each barcode after demultiplexing
time python fast5_fetcher.py -r ./BC_${SGE_TASK_ID}.fastq.gz -s sequencing_summary.txt.gz -i barcoded.index.gz -o ${TMPDIR}/fast5/

# tarball the extracted reads into a single tar file
# Can also split the reads into groups of ~4000 if needed (see the sketch after this script)
tar -cf ${TMPDIR}/BC_${SGE_TASK_ID}_fast5.tar --transform='s/.*\///' ${TMPDIR}/fast5/*.fast5
# Copy from node-local scratch back to the working dir.
cp ${TMPDIR}/BC_${SGE_TASK_ID}_fast5.tar ./
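
The comment above about splitting into groups of ~4000 reads is not handled by fetch.sge itself; one way to do it with coreutils and GNU tar (the batch_ names are only for illustration) would be:

# Sketch: split the extracted fast5 files into tarballs of ~4000 reads each
cd ${TMPDIR}/fast5
ls *.fast5 | split -l 4000 - batch_
for b in batch_*; do
    tar -cf ${TMPDIR}/BC_${SGE_TASK_ID}_${b}.tar -T ${b}
done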

Create CMD and launch

# current working dir, with 1 CPU, array jobs 1 to 12
# Modify memory settings as required
CMD="qsub -cwd -V -pe smp 1 -N F5F -S /bin/bash -t 1-12 -l mem_requested=20G,h_vmem=20G,tmp_requested=500G ./fetch.sge"

echo $CMD && $CMD
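
Once the array job has finished and the tarballs have been copied back, a quick sanity check with plain tar (nothing SquiggleKit-specific) shows how many reads landed in each barcode:

# Count the fast5 files captured per barcode tarball
for i in $(seq 1 12); do
    echo -n "BC_${i}: "
    tar -tf BC_${i}_fast5.tar | grep -c '\.fast5$'
done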

Raw signal extraction for analysis

python SquigglePull.py -r -p test/R9_raw_data/ > data.tsv
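
The output is tab-delimited with one read per row; the exact columns depend on the flags used, so treat the field positions as an assumption, but a quick check with standard tools confirms the extraction worked:

# Rough check of the extracted signal table (assumes one read per row, tab-delimited)
wc -l data.tsv
head -n 1 data.tsv | awk -F'\t' '{print NF" fields in the first row"}'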

Stall detection in raw data

python segmenter.py -s signals.tsv.gz -ku -j 100 > signals_stall_segments.tsv

Nanopore adapter detection in raw data

scrappie squiggle adapter.fa > adapter.model
python MotifSeq.py -s signals.tsv.gz --segs signals_stall_segments.tsv -a adapter.model > signals_adapters.tsv

Find kmer motif:

Building an adapter model:

fasta format for scrappie:

>my_kmer_name
ATCGATCGCTATGCTAGCATTACG

Make the model from scrappie:

scrappie squiggle my_kmer.fa > scrappie_kmer.model

find the best match to that kmer in the signal:

python MotifSeq.py -s signals.tsv -m scrappie_kmer.model > signals_kmer.tsv
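
To search the same signal file for several motifs, the two commands above can be wrapped in a small loop. This sketch assumes one fasta file per motif in a motifs/ directory (a hypothetical layout) and reuses the exact commands shown above:

# Sketch: build a scrappie model and run MotifSeq for each motif fasta in motifs/
for fa in motifs/*.fa; do
    name=$(basename ${fa} .fa)
    scrappie squiggle ${fa} > ${name}.model
    python MotifSeq.py -s signals.tsv -m ${name}.model > signals_${name}.tsv
done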

Pipelines

# activate virtual python environment
# most HPC systems will use something like "module load"
source ~/venv2714/bin/activate

F5F=SquiggleKit/fast5_fetcher.py
SPULL=SquiggleKit/SquigglePull.py
# Filtered paf file
PAF=full_length_60_R2_154_2.paf
SEQ_SUM=LLAAB035083_sequencing_summary.txt
INDEX=LLAAB035083_fast5.index
# Fastq for filtering
FQ=LLAAB035083.fastq.gz
TMP_STORE=${TMPDIR}/fast5

# Create an output directory to take advantage of NVMe drives on node-local storage
mkdir ${TMP_STORE}

echo "fetching fast5 files..." >&2
# Run fast5_fetcher on the sample after filtering
time python ${F5F} -p ${PAF} -s ${SEQ_SUM} -i ${INDEX} -o ${TMP_STORE} -t -x R2_154_2 -l ${FQ}

echo "extracting squiggles..." >&2
# now extract the signals from them with SquigglePull
time python ${SPULL} -rv -p ${TMP_STORE} -f all > ${TMPDIR}/squigs_R2_154_2.tsv

echo "copying data..." >&2
# Copy from node-local scratch back to the working dir.
cp ${TMPDIR}/*R2_154_2* ./

echo "done!" >&2