fast5_fetcher

Doing the heavy lifting for you.

fast5_fetcher is a tool for fetching nanopore fast5 files to save time and simplify downstream analysis.

Background

Reducing the number of fast5 files per folder in a single experiment was a welcome addition to MinKNOW. It also made manual basecalling on a cluster practical, using array jobs where each folder is basecalled individually, producing its own sequencing_summary.txt, reads.fastq, and reads folder containing the newly basecalled fast5s. Tarring those fast5 files into a single archive was needed to keep the sys admins at bay, as they were not fond of our millions of individual files on their drives. This meant that whenever the fast5 files from an experiment, or many experiments, were needed, unpacking them was a significant hurdle in both time and disk space.

fast5_fetcher was built to address this bottleneck. By building an index file of the tarballs, and using the sequencing_summary.txt file to match readIDs with fast5 filenames, only the fast5 files you need can be extracted, either temporarily in a pipeline, or permanently, reducing space and simplifying downstream work flows.

Image demonstrating fast5_fetcher

Getting Started

How you build the index of fast5 files and their paths, and the simple bash script that controls the workflow, whether on a local machine or an HPC, will depend on your starting file structure.

File structures

See the Files and folders section.

Inputs

It takes 3 files as input:

  1. fastq, paf, or flat (.gz)
  2. sequencing_summary.txt(.gz)
  3. name.index(.gz)

1. fastq, paf, or flat

This is where the readIDs are collected, to be matched with their respective fast5 files for fetching. The idea is that some form of selection has already occurred to generate this file.

In the case of a fastq, it may have been filtered to keep reads above a certain quality, or reads from a particular barcode after running barcode detection.

A paf file is the alignment output of minimap2. It can be used to fetch only the fast5 files whose reads align to some reference, or it may have been filtered to contain only the reads that align to a particular region of interest.

A flat file is simply a list of readIDs, one per line. This lets you fetch any list of reads generated by any other method.

Each of these files can be gzipped or not.

See the examples below for test cases.
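As a hypothetical sketch (demo file names are made up), a flat file can be built from a fastq, where readIDs are the first token of every header line:

```shell
# Build a flat read-ID list (one per line) from a fastq.
printf '@read-001 ch=1\nACGT\n+\nIIII\n@read-002 ch=2\nTTTT\n+\nIIII\n' > demo.fastq
awk 'NR % 4 == 1 {sub(/^@/, ""); print $1}' demo.fastq > demo_flat.txt
cat demo_flat.txt
# read-001
# read-002
# For a paf, column 1 is the readID:  cut -f1 my.paf | sort -u
```

Gzip the result or use it as-is; fast5_fetcher accepts both.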

2. Sequencing summary

The sequencing_summary.txt file is created by the basecalling software (Albacore, Guppy) and contains information about each read, including the readID and fast5 file name, along with length, quality scores, and potentially barcode information.

There is a shortcut method in which you can use the sequencing_summary.txt file alone, without a fastq, paf, or flat file. In this case, leave the -q, -p, and -f options unset.

This file can be gzipped or not.
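To make the readID-to-filename matching concrete, here is a hypothetical two-read summary (real summaries have many more columns, and column order varies between basecaller versions, so check the header of yours):

```shell
# A minimal made-up sequencing summary: tab-separated, header first.
printf 'filename\tread_id\tsequence_length_template\n' > demo_seq_sum.txt
printf 'batch0.fast5\tread-001\t4521\nbatch0.fast5\tread-002\t980\n' >> demo_seq_sum.txt
# The mapping fast5_fetcher relies on, readID -> fast5 filename:
awk -F'\t' 'NR > 1 {print $2 "\t" $1}' demo_seq_sum.txt
# read-001	batch0.fast5
# read-002	batch0.fast5
```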

3. Building the index

How the index is built depends on which file structure you are using. It will work with both tarred and un-tarred file structures. Tarred is preferred.

- Raw structure (not preferred)

for file in $(pwd)/reads/*/*; do echo $file; done >> name.index
gzip name.index

- Local basecalled structure

for file in $(pwd)/reads.tar; do echo $file; tar -tf $file; done >> name.index
gzip name.index

- Parallel basecalled structure

for file in $(pwd)/fast5/*fast5.tar; do echo $file; tar -tf $file; done >> name.index

If you have multiple experiments, cat the indexes together and gzip:

for file in ./*.index; do cat $file; done >> ../all.name.index
gzip ../all.name.index
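A worked sketch of what the index contains (hypothetical file names): each tarball's path on one line, followed by the member paths that tar -tf reports for it.

```shell
# Build a tiny parallel-style tarball and index it.
mkdir -p demo/fast5
touch demo/read_a.fast5 demo/read_b.fast5
tar -cf demo/fast5/0.fast5.tar -C demo read_a.fast5 read_b.fast5
for file in $(pwd)/demo/fast5/*fast5.tar; do echo $file; tar -tf $file; done >> demo.name.index
gzip demo.name.index
gunzip -c demo.name.index.gz
# prints the tarball's full path, then read_a.fast5 and read_b.fast5
```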

Instructions for use

If using macOS and not already using Homebrew, install it from:

https://brew.sh/

then install gnu-tar with:

brew install gnu-tar

Quick start

Basic use on a local computer

fastq

python fast5_fetcher.py -q my.fastq.gz -s sequencing_summary.txt.gz -i name.index.gz -o ./fast5

paf

python fast5_fetcher.py -p my.paf -s sequencing_summary.txt.gz -i name.index.gz -o ./fast5

flat

python fast5_fetcher.py -f my_flat.txt.gz -s sequencing_summary.txt.gz -i name.index.gz -o ./fast5

sequencing_summary.txt only

python fast5_fetcher.py -s sequencing_summary.txt.gz -i name.index.gz -o ./fast5

See the examples below for use on an HPC with SGE.

fast5_fetcher.py

Full usage

usage: fast5_fetcher.py [-h] [-q FASTQ | -p PAF | -f FLAT] [--OSystem OSYSTEM]
                        [-s SEQ_SUM] [-i INDEX] [-o OUTPUT] [-t]
                        [-l TRIM_LIST] [-x PREFIX] [-z]

fast_fetcher - extraction of specific nanopore fast5 files

optional arguments:
  -h, --help            show this help message and exit
  -q FASTQ, --fastq FASTQ
                        fastq.gz for read ids
  -p PAF, --paf PAF     paf alignment file for read ids
  -f FLAT, --flat FLAT  flat file of read ids
  --OSystem OSYSTEM     running operating system - leave default unless doing
                        odd stuff
  -s SEQ_SUM, --seq_sum SEQ_SUM
                        sequencing_summary.txt.gz file
  -i INDEX, --index INDEX
                        index.gz file mapping fast5 files in tar archives
  -o OUTPUT, --output OUTPUT
                        output directory for extracted fast5s
  -t, --trim            trim files as if standalone experiment, (fq, SS)
  -l TRIM_LIST, --trim_list TRIM_LIST
                        list of file names to trim, comma separated. fastq
                        only needed for -p and -f modes
  -x PREFIX, --prefix PREFIX
                        trim file prefix, eg: barcode_01, output:
                        barcode_01.fastq, barcode_01_seq_sum.txt
  -z, --pppp            Print out tar commands in batches for further
                        processing

Trimming fastq and sequencing_summary files

By using the -t, --trim option, each barcode will also have its own sequencing_summary file for downstream analysis. This is particularly useful when each barcode is a different sample or experiment, as the output is as if it were its own individual flowcell.

This method can also trim fastq and sequencing_summary files when using the paf or flat methods. By using the prefix option, you can label the output names; otherwise generic defaults are used.

batch_tater.py

Potato scripting engaged

This is designed to run on the output files from fast5_fetcher.py when using option -z, which writes out a file list for each tarball containing reads you want to process. batch_tater.py then reads those lists, opens the individual tar files, and extracts the reads, meaning each tarball is only opened once.
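The idea in miniature (hypothetical file names; batch_tater.py wraps this logic in Python): given a per-tarball list of wanted members, a single tar invocation extracts them all, so the archive is opened exactly once.

```shell
# Set up a toy tarball of three fast5s.
mkdir -p src out
touch src/a.fast5 src/b.fast5 src/c.fast5
tar -cf batch.tar -C src a.fast5 b.fast5 c.fast5
# A list of wanted members, one per line, as -z mode writes out.
printf 'a.fast5\nb.fast5\n' > wanted.txt
# One pass over the archive extracts everything on the list.
tar -xf batch.tar -C out -T wanted.txt
# out/ now contains a.fast5 and b.fast5, but not c.fast5
```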

A recent test using the -z option on ~2.2 TB of data, across ~11/27 million files, took about 10 min (1 CPU) to write and organise the file lists with fast5_fetcher.py, and about 20 s per array job to extract and repackage with batch_tater.py.

This is best used when you want to do something all at once and filter your reads. Other approaches may be better when you are demultiplexing.

Usage:

Run on SGE using array jobs as a simple way of doing multiprocessing. This also helps when checking what went wrong, and makes it easy to relaunch failed jobs.

batch.sge

source ~/work/venv2714/bin/activate

# pick this array task's file list from the -z output
FILE=$(ls ./fast5/ | sed -n ${SGE_TASK_ID}p)
BLAH=fast5/${FILE}

# extract into node-local scratch space
mkdir ${TMPDIR}/fast5

time python batch_tater.py tater_master.txt ${BLAH} ${TMPDIR}/fast5/

echo "size of files:" >&2
du -shc ${TMPDIR}/fast5/ >&2
echo "extraction complete!" >&2
echo "Number of files:" >&2
ls ${TMPDIR}/fast5/ | wc -l >&2

echo "copying data..." >&2

# re-tar the extracted fast5s (flattening their paths) and copy them back
tar -cf ${TMPDIR}/batch.${SGE_TASK_ID}.tar --transform='s/.*\///' ${TMPDIR}/fast5/*.fast5
cp ${TMPDIR}/batch.${SGE_TASK_ID}.tar ./batched_fast5/

Create CMD and launch

CMD="qsub -cwd -V -pe smp 1 -N batch -S /bin/bash -t 1-10433 -tc 80 -l mem_requested=20G,h_vmem=20G,tmp_requested=200G ../batch.sge"

echo $CMD && $CMD