Overview

Most files and formats are straightforward. The only tricky part is getting started with fast5_fetcher.py. Getting this right becomes even more important if you are using something like batch_tater.py with SGE to significantly reduce the time taken to run. Once you are processing more than 50k files, the batch_tater.py method becomes very attractive.

Getting Started

How you build an index of fast5 files and their paths, and the simple bash script that controls the workflow (whether on a local machine or on HPC), will depend on the starting file structure.

File structures

I have endeavoured to include a few diverse uses, starting from different file states, but of course I can't think of everything. If there is something you wish to accomplish with fast5_fetcher.py but can't quite get it to work, let me know and we can try to work it out.

1. Raw structure

This is the most basic structure, where all files are present in an accessible state.

├── huntsman.fastq
├── sequencing_summary.txt
└── huntsman_reads/              # Read folder
    ├── 0/                       # individual folders containing ~4000 fast5s
    │   ├── huntsman_read1.fast5
    │   ├── huntsman_read2.fast5
    │   └── ...
    ├── 1/
    │   ├── huntsman_read#.fast5
    │   └── ...
    └── ...

2. Local basecalled structure

This is the typical structure after local basecalling: the fastq and sequencing_summary files have been gzipped, and the individual folders in the reads folder have been tarballed into one large file.

├── huntsman.fastq.gz            # gzipped
├── sequencing_summary.txt.gz    # gzipped
└── huntsman_reads.tar           # Tarballed read folder
    │                            # Tarball contents expanded below
    ├── 0/                       # individual folders inside the tarball
    │   ├── huntsman_read1.fast5
    │   ├── huntsman_read2.fast5
    │   └── ...
    ├── 1/
    │   ├── huntsman_read#.fast5
    │   └── ...
    └── ...
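
If you are starting from the raw structure above, this layout can be produced with something like the following sketch (file and folder names follow the example above; adjust to your own run):

# gzip the fastq and sequencing summary
gzip huntsman.fastq sequencing_summary.txt

# tarball the read folder into one large file
tar -cf huntsman_reads.tar huntsman_reads/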

3. Parallel basecalled structure

This is the structure after massively parallel basecalling; it looks like multiple copies of the structure above.

├── fastq/
│   ├── huntsman.1.fastq.gz
│   ├── huntsman.2.fastq.gz
│   ├── huntsman.3.fastq.gz
│   └── ...
├── logs/
│   ├── sequencing_summary.1.txt.gz
│   ├── sequencing_summary.2.txt.gz
│   ├── sequencing_summary.3.txt.gz
│   └── ...
└── fast5/
    ├── 1.tar
    ├── 2.tar
    ├── 3.tar
    └── ...

With this structure, the fastq.gz and sequencing_summary.txt.gz files need to be combined.

Combine fastq.gz files
for file in fastq/*.fastq.gz; do cat "$file"; done >> huntsman.fastq.gz

Combine sequencing_summary.txt.gz files
# create header
zcat $(ls logs/sequencing_summary*.txt.gz | head -1) | head -1 > sequencing_summary.txt

# combine all files, skipping first line header
for file in logs/sequencing_summary*.txt.gz; do zcat "$file" | tail -n +2; done >> sequencing_summary.txt

gzip sequencing_summary.txt

You should then have something like this:

├── huntsman.fastq.gz            # gzipped
├── sequencing_summary.txt.gz    # gzipped
└── fast5/                       # fast5 folder
    ├── 1.tar                    # each tar contains ~4000 fast5 files
    ├── 2.tar
    ├── 3.tar
    └── ...

Inputs

fast5_fetcher.py takes 3 files as input:

  1. fastq, paf, or flat (.gz)
  2. sequencing_summary.txt(.gz)
  3. name.index(.gz)

1. fastq, paf, or flat

This is where the readIDs are collected, to be matched with their respective fast5 files for fetching. The idea is that some form of selection has already occurred to generate these files.

In the case of a fastq, it may have been filtered to keep only reads above a certain quality, or only reads from a particular barcode after running barcode detection.

A paf file is the alignment output of minimap2. It can be used to fetch only the fast5 files for reads that align to a reference, or it can be filtered down to the reads that align to a particular region of interest.
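
For example, something like the following will produce and filter a paf (reference.fa, the region name, and the coordinates are placeholders for your own data):

# map reads to a reference with the ONT preset; minimap2 writes paf by default
minimap2 -x map-ont reference.fa huntsman.fastq.gz > huntsman.paf

# keep only alignments within a region of interest
# (paf column 6 is the target name, columns 8/9 are target start/end)
awk '$6 == "chr20" && $8 >= 1000000 && $9 <= 2000000' huntsman.paf > region.paf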

A flat file in this case is simply a text file containing a list of readIDs, one per line. This allows the user to generate a list of reads to fetch by any other desired method.
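
For example, a readID list can be pulled from the paf above or straight from the fastq headers (file names follow the examples above):

# readIDs from a paf (column 1 is the query/read name)
cut -f1 region.paf | sort -u > huntsman_readIDs.txt

# readIDs from a gzipped fastq (header line of every 4-line record, drop the leading @)
zcat huntsman.fastq.gz | awk 'NR % 4 == 1 {sub(/^@/, ""); print $1}' > huntsman_readIDs.txt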

Each of these files can be gzipped or not.

2. Sequencing summary

The sequencing_summary.txt file is created by the basecalling software (Albacore, Guppy) and contains information about each read, including the readID and fast5 file name, along with length, quality scores, and potentially barcode information.
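
If you want to check which columns your summary contains (column names and order vary between basecaller versions), something like this will list them:

# print the column names of the sequencing summary, one per line, numbered
zcat sequencing_summary.txt.gz | head -1 | tr '\t' '\n' | nl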

There is a shortcut method in which you can use the sequencing_summary.txt file on its own, without the need for a fastq, paf, or flat file. In this case, leave the -q, -f, and -r fields empty.

This file can be gzipped or not.

3. Building the index

How the index is built depends on which file structure you are using. It will work with both tarred and un-tarred file structures.

- Raw structure (not preferred)
for file in $(pwd)/huntsman_reads/*/*; do echo "$file"; done >> name.index

gzip name.index

- Local basecalled structure
for file in $(pwd)/huntsman_reads.tar; do echo "$file"; tar -tf "$file"; done >> name.index

gzip name.index

- Parallel basecalled structure
for file in $(pwd)/fast5/*.tar; do echo "$file"; tar -tf "$file"; done >> name.index

If you have multiple experiments, then cat all the indexes together and gzip.

for file in ./*.index; do cat "$file"; done >> ../all.name.index

gzip ../all.name.index
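
As a quick sanity check (assuming the combined index was gzipped as above), you can count how many fast5 entries made it into the index and peek at the first few lines:

# count fast5 entries in the combined index
zcat ../all.name.index.gz | grep -c '\.fast5$'

# peek at the first few entries
zcat ../all.name.index.gz | head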