Examples
The tools can be wrapped and combined in a variety of ways; I have provided a few examples here.
Barcode experiment partitioning with fast5_fetcher
fast5_fetcher was originally built to work with Sun Grid Engine (SGE), taking full advantage of array jobs. Although it can also run locally and on untarred file structures, it makes the biggest difference when operating on multiple sequencing experiments with file structures scattered across a file system.
SGE examples
After creating the fastq/paf/flat, sequencing_summary, and index files, create an SGE script. Note the use of ${SGE_TASK_ID} to use the array job ID as the pointer to a particular barcode file.
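As a minimal local sketch of that pattern, each array task resolves its own input file from ${SGE_TASK_ID} (BC_1 to BC_12 in this example; on the cluster, SGE sets the variable for you):

```shell
# Simulate one SGE array task locally. On the cluster, SGE sets
# SGE_TASK_ID automatically for each task in the 1-12 range.
SGE_TASK_ID=3
FQ="BC_${SGE_TASK_ID}.fastq.gz"
echo "task ${SGE_TASK_ID} will process ${FQ}"
```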
After barcode demultiplexing
Given a similar structure and naming convention, it is possible to group the fast5 files by barcode in the following manner.
├── BC_1.fastq.gz # Barcode 1
├── BC_2.fastq.gz # Barcode 2
├── BC_3.fastq.gz # ...
├── BC_4.fastq.gz
├── BC_5.fastq.gz
├── BC_6.fastq.gz
├── BC_7.fastq.gz
├── BC_8.fastq.gz
├── BC_9.fastq.gz
├── BC_10.fastq.gz
├── BC_11.fastq.gz
├── BC_12.fastq.gz
├── unclassified.fastq.gz # unclassified reads (skipped by fast5_fetcher in this example; rename to BC_13 to simply fold it into the example)
├── sequencing_summary.txt.gz # gzipped
├── barcoded.index.gz # index file containing fast5 file paths
├── fast5/ # fast5 folder, unsorted
| ├── 1.tar # each tar contains ~4000 fast5 files
| └── 2.tar
| └── 3.tar
| └── ...
fetch.sge
# activate virtual python environment
# most HPC will use something like "module load"
source ~/work/venv2714/bin/activate
# Create output directory to take advantage of NVMe drives on cluster-local storage
mkdir ${TMPDIR}/fast5
# Run fast5_fetcher on each barcode after demultiplexing
time python fast5_fetcher.py -r ./BC_${SGE_TASK_ID}.fastq.gz -s sequencing_summary.txt.gz -i barcoded.index.gz -o ${TMPDIR}/fast5/
# tarball the extracted reads into a single tar file
# Can also split the reads into groups of ~4000 if needed
tar -cf ${TMPDIR}/BC_${SGE_TASK_ID}_fast5.tar --transform='s/.*\///' ${TMPDIR}/fast5/*.fast5
# Copy from HPC drives to working dir.
cp ${TMPDIR}/BC_${SGE_TASK_ID}_fast5.tar ./
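The --transform='s/.*\///' in the tar step strips the directory components so the archive contains flat fast5 names. A quick standalone demonstration with hypothetical file names (requires GNU tar):

```shell
# Build a tiny directory tree, tar it with --transform, and confirm
# the archived member names carry no leading path components.
mkdir -p /tmp/f5demo/fast5
touch /tmp/f5demo/fast5/read1.fast5 /tmp/f5demo/fast5/read2.fast5
tar -cf /tmp/f5demo/demo.tar --transform='s/.*\///' /tmp/f5demo/fast5/*.fast5
tar -tf /tmp/f5demo/demo.tar
```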
Create CMD and launch
# current working dir, with 1 CPU, array jobs 1 to 12
# Modify memory settings as required
CMD="qsub -cwd -V -pe smp 1 -N F5F -S /bin/bash -t 1-12 -l mem_requested=20G,h_vmem=20G,tmp_requested=500G ./fetch.sge"
echo $CMD && $CMD
Raw signal extraction for analysis
python SquigglePull.py -r -p test/R9_raw_data/ > data.tsv
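A short parser sketch for the resulting TSV, assuming each row holds the fast5 filename, the read ID, and then one raw signal value per column (the exact column layout may differ between SquigglePull versions, so check your output first):

```python
# Sketch: parse a SquigglePull-style TSV where each row is
# fast5_name, read_id, then one raw signal value per column.
# The column layout is an assumption; verify against your output.
def parse_squiggle_tsv(lines):
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        fast5, read_id = fields[0], fields[1]
        signal = [int(x) for x in fields[2:]]
        yield fast5, read_id, signal

# Hypothetical single-row example.
demo = ["r1.fast5\tread_001\t512\t498\t530\t505"]
for fast5, read_id, sig in parse_squiggle_tsv(demo):
    print(fast5, read_id, len(sig), sum(sig) / len(sig))
```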
Stall detection in raw data
python segmenter.py -s signals.tsv.gz -ku -j 100 > signals_stall_segments.tsv
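Conceptually, a stall is a long flat stretch in the raw signal. As a toy illustration of that idea (not segmenter.py's actual algorithm, and the window size and threshold below are made up), flag low-variance windows with a rolling standard deviation and merge them:

```python
from statistics import pstdev

# Toy stall finder: flag windows whose standard deviation falls below
# a threshold, then merge adjacent flagged windows into regions.
# window and sd_thresh are illustrative, not segmenter.py defaults.
def flat_regions(signal, window=5, sd_thresh=2.0):
    flat = [pstdev(signal[i:i + window]) < sd_thresh
            for i in range(len(signal) - window + 1)]
    regions, start = [], None
    for i, f in enumerate(flat):
        if f and start is None:
            start = i
        elif not f and start is not None:
            # Last flat window began at i-1 and covers up to i+window-2.
            regions.append((start, i + window - 2))
            start = None
    if start is not None:
        regions.append((start, len(signal) - 1))
    return regions

# Noisy signal with a flat "stall" at samples 5-24.
sig = [500, 540, 480, 520, 460] + [505] * 20 + [530, 470, 550, 490, 510]
print(flat_regions(sig))  # → [(5, 24)]
```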
Nanopore Adapter detection in raw data
scrappie squiggle adapter.fa > adapter.model
python MotifSeq.py -s signals.tsv.gz --segs signals_stall_segments.tsv -a adapter.model > signals_adapters.tsv
Find a kmer motif:
Building an adapter model:
FASTA format for scrappie:
>my_kmer_name
ATCGATCGCTATGCTAGCATTACG
Make the model from scrappie:
scrappie squiggle my_kmer.fa > scrappie_kmer.model
Find the best match to that kmer in the signal:
python MotifSeq.py -s signals.tsv -m scrappie_kmer.model > signals_kmer.tsv
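To give intuition for what "best match" means here, the sketch below slides a z-normalised model across a z-normalised signal and keeps the window with the smallest Euclidean distance. This is a hypothetical illustration of the search, not MotifSeq's actual scoring, and all names and values are made up:

```python
# Illustrative sliding-window motif search: z-normalise both model and
# each signal window, then keep the window with the smallest distance.
def best_match(signal, model):
    def znorm(xs):
        m = sum(xs) / len(xs)
        sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5 or 1.0
        return [(x - m) / sd for x in xs]

    model = znorm(model)
    k = len(model)
    best = (float("inf"), -1)
    for i in range(len(signal) - k + 1):
        w = znorm(signal[i:i + k])
        d = sum((a - b) ** 2 for a, b in zip(w, model)) ** 0.5
        best = min(best, (d, i))
    return best  # (distance, start index of best window)

# The model's zigzag shape is embedded in the signal at index 4.
sig = [10, 12, 11, 13, 2, 8, 2, 8, 2, 11, 12]
mod = [1, 4, 1, 4, 1]
print(best_match(sig, mod))
```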
Pipelines
# activate virtual python environment
# most HPC will use something like "module load"
F5F=SquiggleKit/fast5_fetcher.py
SPULL=SquiggleKit/SquigglePull.py
# Filtered paf file
PAF=full_length_60_R2_154_2.paf
SEQ_SUM=LLAAB035083_sequencing_summary.txt
INDEX=LLAAB035083_fast5.index
# Fastq for filtering
FQ=LLAAB035083.fastq.gz
TMP_STORE=${TMPDIR}/fast5
source ~/venv2714/bin/activate
# Create output directory to take advantage of NVME drives on cluster local
mkdir ${TMP_STORE}
echo "fetching fast5 files..." >&2
# Run fast5_fetcher on the sample after filtering
time python ${F5F} -p ${PAF} -s ${SEQ_SUM} -i ${INDEX} -o ${TMP_STORE} -t -x R2_154_2 -l ${FQ}
echo "extracting squiggles..." >&2
# now extract the signals from them with SquigglePull
time python ${SPULL} -rv -p ${TMP_STORE} -f all > ${TMPDIR}/squigs_R2_154_2.tsv
echo "copying data..." >&2
# Copy from HPC drives to working dir.
cp ${TMPDIR}/*R2_154_2* ./
echo "done!" >&2