Performance

Benchmarks were measured on GitHub Actions ubuntu-latest (Azure x86-64, AVX2) and ubuntu-24.04-arm (AArch64, NEON) using cargo bench --bench decode with --sample-size 20. Unless a table is labelled in Melem/s or Gelem/s, throughput is in GB/s of input integers (Melem/s × bytes-per-element ÷ 1000).
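As a quick check of the unit conversion above, the following sketch turns an element rate into GB/s. The function name gb_per_s is hypothetical and not part of the crate.

```rust
/// Convert an element rate in Melem/s into GB/s of input integers.
/// `gb_per_s` is an illustrative helper, not a crate API.
pub fn gb_per_s(melem_per_s: f64, bytes_per_element: usize) -> f64 {
    melem_per_s * bytes_per_element as f64 / 1000.0
}
```

For example, 1,840 Melem/s of i16 samples (2 bytes each) corresponds to 3.68 GB/s.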

VBZ pipeline breakdown

At 8192 i16 elements with simd-avx2, each stage measured in isolation:

| Stage | encode | decode |
|---|---|---|
| delta | 29.2 GB/s | 3.70 GB/s |
| zigzag | 34.2 GB/s | 28.0 GB/s |
| SVB16 (mixed) | 9.24 GB/s | 9.42 GB/s |
| VBZ (combined, 3-pass) | 5.70 GB/s | 2.42 GB/s |
| VBZ fused decode | N/A | 3.68 GB/s |
| VBZ2 fused 2-chain decode | N/A | 5.62 GB/s |

Zigzag is essentially free (pure bitwise ops, LLVM auto-vectorizes). Delta encode expresses adjacent differences as two overlapping slice views, which LLVM auto-vectorizes to around 29 GB/s with no unsafe code. Delta decode uses an explicit SIMD prefix-sum (SSE2/NEON); the serial carry chain between 8-element blocks limits single-stream throughput to around 3.70 GB/s, essentially the theoretical ceiling for this algorithm.
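The delta and zigzag stages described above can be sketched in safe scalar Rust as follows. This is a minimal illustration, not the crate's implementation: the function names are hypothetical, and it assumes the first delta is taken against an implicit zero and that arithmetic wraps.

```rust
/// Delta encode using two overlapping views of the same slice:
/// `prev` = samples[..n-1] and `next` = samples[1..].
/// Assumption: the first output is the first sample itself (delta from 0).
pub fn delta_encode(samples: &[i16]) -> Vec<i16> {
    let mut out = Vec::with_capacity(samples.len());
    if let Some(&first) = samples.first() {
        out.push(first);
        let prev = &samples[..samples.len() - 1];
        let next = &samples[1..];
        out.extend(next.iter().zip(prev).map(|(&b, &a)| b.wrapping_sub(a)));
    }
    out
}

/// Delta decode is a running (prefix) sum; each output depends on the previous one.
pub fn delta_decode(deltas: &[i16]) -> Vec<i16> {
    let mut acc = 0i16;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}

/// Zigzag maps small signed values to small unsigned values with bitwise ops only.
pub fn zigzag_encode(v: i16) -> u16 {
    ((v << 1) ^ (v >> 15)) as u16
}

/// Inverse of `zigzag_encode`.
pub fn zigzag_decode(u: u16) -> i16 {
    ((u >> 1) as i16) ^ -((u & 1) as i16)
}
```

The encode loop has no loop-carried dependency, which is what lets a compiler vectorize it; the decode loop carries `acc` from one element to the next, which is the serial chain discussed above.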

Fused VBZ decode

decode_vbz_fused collapses all three decode stages into a single SIMD loop. The SVB16 shuffle and zigzag bitwise ops (~5–6 cycles per 8-element block) execute during the delta carry-chain stall (~8 cycles), hiding nearly all of their cost.

| Function | decode throughput |
|---|---|
| decode_vbz (3 separate passes) | 2.42 GB/s |
| decode_vbz_fused (single SIMD pass) | 3.68 GB/s |

The fused decoder is 1.52× faster than the 3-pass pipeline. It reaches 99% of the delta-alone ceiling (3.70 GB/s): SVB16 and zigzag are effectively free, and the delta carry chain is the only remaining bottleneck.
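To make the idea of fusing the three stages concrete, here is a scalar sketch that unpacks SVB16, unzigzags, and accumulates the prefix sum in one loop. It is not decode_vbz_fused itself. The names are hypothetical, and the tag convention is an assumption: bit i % 8 of control byte i / 8, least significant bit first, is 0 for a 1-byte value and 1 for a 2-byte little-endian value, with control and data bytes passed as separate slices.

```rust
/// Zigzag-decode a u16 back into a signed delta.
fn unzigzag(u: u16) -> i16 {
    ((u >> 1) as i16) ^ -((u & 1) as i16)
}

/// Fused scalar decode: SVB16 unpack + unzigzag + prefix sum in a single pass.
/// Assumed tag convention: bit (i % 8) of ctrl[i / 8] is 1 for a 2-byte value.
pub fn decode_fused_scalar(ctrl: &[u8], data: &[u8], n: usize, initial_carry: i16) -> Vec<i16> {
    let mut out = Vec::with_capacity(n);
    let mut pos = 0usize;
    let mut carry = initial_carry;
    for i in 0..n {
        let wide = (ctrl[i / 8] >> (i % 8)) & 1 == 1;
        let raw = if wide {
            let v = u16::from_le_bytes([data[pos], data[pos + 1]]);
            pos += 2;
            v
        } else {
            let v = data[pos] as u16;
            pos += 1;
            v
        };
        // The only loop-carried dependency is `carry`.
        carry = carry.wrapping_add(unzigzag(raw));
        out.push(carry);
    }
    out
}

/// Matching scalar encoder, included only so the sketch can be round-tripped.
pub fn encode_scalar(samples: &[i16]) -> (Vec<u8>, Vec<u8>) {
    let mut ctrl = vec![0u8; (samples.len() + 7) / 8];
    let mut data = Vec::new();
    let mut prev = 0i16;
    for (i, &s) in samples.iter().enumerate() {
        let d = s.wrapping_sub(prev);
        prev = s;
        let z = ((d << 1) ^ (d >> 15)) as u16;
        if z < 256 {
            data.push(z as u8);
        } else {
            ctrl[i / 8] |= 1 << (i % 8);
            data.extend_from_slice(&z.to_le_bytes());
        }
    }
    (ctrl, data)
}
```

In this form the unpack and unzigzag work for element i does not depend on element i − 1, so only the carry addition is serial, which mirrors why the SIMD version can overlap those stages with the carry-chain stall.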

VBZ2: format-extension 2-chain decode

encode_vbz2 / decode_vbz2 extend the VBZ format with a 6-byte header that enables a two-chain fused decode with no pre-scan required:

[mid_carry: i16 LE][mid_data_offset: u32 LE][standard VBZ payload]

mid_carry is samples[n_half - 1], the decoded sample at the chunk midpoint, i.e., the prefix sum of all deltas before the midpoint. mid_data_offset is the count of data bytes consumed by the first n_half elements (sum of 8 + popcnt(ctrl_byte) over the first half of control bytes). Both are computed in O(n) during encode with negligible cost.
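The header fields can be illustrated with a short sketch. The helper names below are hypothetical and the code only mirrors the layout and the 8 + popcnt rule stated above; it is not the crate's encoder.

```rust
/// Data bytes consumed by the elements covered by `ctrl`
/// (8 elements per control byte, one extra byte per set tag bit).
pub fn data_bytes_for(ctrl: &[u8]) -> u32 {
    ctrl.iter().map(|c| 8 + c.count_ones()).sum()
}

/// Build the 6-byte header: [mid_carry: i16 LE][mid_data_offset: u32 LE].
pub fn vbz2_header(mid_carry: i16, mid_data_offset: u32) -> [u8; 6] {
    let mut h = [0u8; 6];
    h[..2].copy_from_slice(&mid_carry.to_le_bytes());
    h[2..].copy_from_slice(&mid_data_offset.to_le_bytes());
    h
}

/// Split a buffer into (mid_carry, mid_data_offset, remaining payload).
/// Returns None if the buffer is shorter than the header.
pub fn parse_vbz2_header(bytes: &[u8]) -> Option<(i16, u32, &[u8])> {
    if bytes.len() < 6 {
        return None;
    }
    let mid_carry = i16::from_le_bytes([bytes[0], bytes[1]]);
    let mid_data_offset = u32::from_le_bytes([bytes[2], bytes[3], bytes[4], bytes[5]]);
    Some((mid_carry, mid_data_offset, &bytes[6..]))
}
```

For example, the three control bytes 0x00, 0xFF, 0x38 account for 8 + 16 + 11 = 35 data bytes.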

At decode time the payload is split into two independent half-streams. Two carry chains run interleaved in one SIMD loop: the CPU's out-of-order engine overlaps chain A's carry-extract latency with chain B's prefix-sum arithmetic. Port-5 usage is unchanged from single-chain (10 ops per 16 elements), so there is no throughput regression at any size; the gain accumulates only where the carry latency was the limiting factor.

| Function | decode throughput |
|---|---|
| decode_vbz (3 separate passes) | 2.42 GB/s |
| decode_vbz_fused (single SIMD pass) | 3.68 GB/s |
| decode_vbz2 (format-extension 2-chain) | 5.62 GB/s |

decode_vbz2 is 1.53× faster than the single-chain fused decoder at 8192 elements, because interleaving the two carry chains hides most of the serial dependency cost.

The real payoff is multi-threaded decoding: with mid_data_offset known up-front, both half-streams are independent and can run on separate cores. The format overhead is 6 bytes per chunk regardless of chunk size, negligible for any practical payload.

Caller-side parallel decode

decode_vbz_fused_from_into(data, n, initial_carry, out) exposes the single-chain fused decoder with a caller-supplied initial carry, making it possible to decode any half-stream independently. A caller that manages its own thread pool simply splits the VBZ2 payload and dispatches both halves concurrently:

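// stream_a and stream_b are the two half-streams of the VBZ2 payload.
// Assumption: decode_vbz_fused_from is the Vec-returning counterpart of
// decode_vbz_fused_from_into.
let (out_a, out_b) = std::thread::scope(|s| {
    // Half A starts from a zero carry; half B starts from the stored mid_carry.
    let ha = s.spawn(|| decode_vbz_fused_from(&stream_a, n_half, 0));
    let hb = s.spawn(|| decode_vbz_fused_from(&stream_b, n - n_half, mid_carry));
    (ha.join().unwrap(), hb.join().unwrap())
});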
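For readers who want to try the pattern without the crate, here is a self-contained sketch of the same two-thread split. It substitutes a plain prefix sum over raw deltas for the real fused decoder; decode_from and decode_two_halves are hypothetical names used only for illustration.

```rust
use std::thread;

/// Stand-in for a half-stream decoder: a prefix sum seeded with `initial_carry`.
/// The real decoder would also unpack SVB16 and unzigzag.
fn decode_from(deltas: &[i16], initial_carry: i16) -> Vec<i16> {
    let mut acc = initial_carry;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}

/// Decode both halves on separate scoped threads, then concatenate the results.
pub fn decode_two_halves(deltas: &[i16], n_half: usize, mid_carry: i16) -> Vec<i16> {
    let (a, b) = deltas.split_at(n_half);
    let (mut out_a, out_b) = thread::scope(|s| {
        let ha = s.spawn(|| decode_from(a, 0));
        let hb = s.spawn(|| decode_from(b, mid_carry));
        (ha.join().unwrap(), hb.join().unwrap())
    });
    out_a.extend(out_b);
    out_a
}
```

Because half B is seeded with the sample value at the midpoint, neither thread needs to wait for the other.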

Decoding 64 × 8192-element chunks in parallel (64 half-A streams on thread 1, 64 half-B streams on thread 2); run locally for hardware-specific numbers:

| Function | decode throughput |
|---|---|
| decode_vbz_fused (single chain, 1 thread) | 1.84 Gelem/s |
| decode_vbz2 (2-chain interleaved, 1 thread) | 2.81 Gelem/s |
| decode_vbz_fused_from_into (2 threads, batch of 64) | hardware-dependent |

Multi-threaded throughput is highly sensitive to CPU core count, L2/L3 topology, and scheduler behaviour. The single-thread numbers above are from GitHub Actions CI (Azure x86-64). For two-thread measurements run the vbz2_parallel criterion benchmark locally: cargo bench --features simd-avx2 --bench decode -- vbz2_parallel. With distinct chunks from independent nanopore reads (the realistic production case) the two streams share no cache lines and the speedup approaches 2×.

VBZ-K: generalised K-stream parallel decode

encode_vbzk(samples, k) / decode_vbzk_parallel_into(data, n, out) generalise VBZ2 to K independent sub-streams. The header stores K−1 split points:

[k: u8][(carry_i: i16 LE, data_offset_i: u32 LE) for i in 1..k][VBZ payload]

Header overhead: 1 + (K−1) × 6 bytes. Each sub-chunk has n_sub = (n/K) & !7 elements; the last sub-chunk takes the remainder. Split-point carries and data offsets are computed in O(n) at encode time with negligible overhead.
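The header size and sub-chunk lengths follow directly from the formulas above. A minimal sketch, with hypothetical helper names:

```rust
/// Header size in bytes: one byte for k plus 6 bytes per split point.
pub fn vbzk_header_len(k: usize) -> usize {
    assert!(k >= 1, "k must be at least 1");
    1 + (k - 1) * 6
}

/// Element count of each sub-chunk: n_sub = (n / k) rounded down to a
/// multiple of 8, with the last sub-chunk taking the remainder.
pub fn sub_chunk_lens(n: usize, k: usize) -> Vec<usize> {
    assert!(k >= 1, "k must be at least 1");
    let n_sub = (n / k) & !7;
    let mut lens = vec![n_sub; k];
    lens[k - 1] = n - n_sub * (k - 1);
    lens
}
```

With n = 10,000 and k = 4 this gives three sub-chunks of 2,496 elements and a final one of 2,512, behind a 19-byte header.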

Benchmarked at N=8192 with a batch of 64 chunks per thread (amortising thread scope overhead). Multi-threaded results are hardware-dependent; run locally for specific numbers:

| Configuration | throughput | vs single-chain |
|---|---|---|
| single-chain fused (k=1) | 1.84 Gelem/s | 1.00× |
| VBZ-K k=2 (2 threads) | hardware-dependent | |
| VBZ-K k=4 (4 threads) | hardware-dependent | |
| VBZ-K k=8 (8 threads) | hardware-dependent | |

Multi-threaded throughput is not reliably measurable in shared CI environments. Run cargo bench --features simd-avx2 --bench decode -- vbzk_parallel locally for hardware-specific numbers.

Qualitatively, k=4 matches k=2 at this chunk size, and k=8 regresses because its 8 threads each decode only a 1024-element sub-stream (8192 ÷ 8), where thread-scope overhead and scheduler jitter dominate. With distinct real-world POD5 chunks (6 000–12 000 samples each), larger sub-stream sizes would push k=8 above k=4.

The full POD5 pipeline bottleneck

A POD5 reader decodes: disk → zstd decompress → VBZ decode → i16 samples. On a typical NVMe system (~6.5 GB/s sequential read):

  • Disk: 6.5 GB/s × ~3× zstd ratio = ~19.5 GB/s of decoded signal capacity
  • VBZ single-chain (AVX2): 1.84 Gelem/s × 2 bytes = 3.68 GB/s of decoded signal
  • VBZ-K k=4: scales roughly linearly with cores up to the zstd bottleneck
  • zstd single-core: ~1.5–2 GB/s compressed ≈ 2–3 Gelem/s, the real bottleneck for a single-threaded reader

The disk is rarely the bottleneck. A single-threaded reader is limited by the CPU stages instead: single-core zstd (2–3 Gelem/s) and single-chain VBZ decode (1.84 Gelem/s) are in the same range. Parallelising VBZ decode with VBZ-K removes the VBZ ceiling and leaves zstd as the limit. To saturate NVMe bandwidth you need multi-threaded zstd and VBZ-K together.

Delta decode: the 2-chain approach

Delta decode is a serial prefix sum: each output element depends on all previous elements. On x86_64 the SSE2 path processes 8 elements per iteration with a carry chain of ~8 cycles (extract + broadcast + add). We are already at the theoretical single-stream ceiling.

delta::decode_2chain breaks this by decoding two independent sub-streams simultaneously. The CPU's out-of-order engine hides one chain's carry latency behind the other's prefix-sum arithmetic, delivering 1.78× throughput:

| Function | decode throughput |
|---|---|
| delta::decode_into (single stream) | 3.70 GB/s |
| delta::decode_2chain (two streams) | 6.58 GB/s |

1.78× throughput with two interleaved chains. This requires one extra i16 stored per chunk: the running delta sum at the midpoint (computed by delta::mid_carry during encode, 2 bytes overhead). Each additional sub-chunk adds another 2-byte carry value and enables one more independent decode stream.
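The two-chain idea can be shown in scalar form. This sketch is not delta::decode_2chain or delta::mid_carry; the names decode_2chain_scalar and mid_carry_of are hypothetical, and the code assumes the first delta is taken against zero. Each loop iteration advances both chains, so the two additions are independent of each other.

```rust
/// Sample value at the midpoint, i.e. the running delta sum before `n_half`.
pub fn mid_carry_of(samples: &[i16], n_half: usize) -> i16 {
    if n_half == 0 { 0 } else { samples[n_half - 1] }
}

/// Decode two halves of a delta stream with two interleaved carry chains.
pub fn decode_2chain_scalar(deltas: &[i16], n_half: usize, mid_carry: i16) -> Vec<i16> {
    let mut out = vec![0i16; deltas.len()];
    let (da, db) = deltas.split_at(n_half);
    let (oa, ob) = out.split_at_mut(n_half);
    let (mut ca, mut cb) = (0i16, mid_carry);
    let common = da.len().min(db.len());
    for i in 0..common {
        // Chain A and chain B do not depend on each other.
        ca = ca.wrapping_add(da[i]);
        oa[i] = ca;
        cb = cb.wrapping_add(db[i]);
        ob[i] = cb;
    }
    // Finish whichever half is longer.
    for i in common..da.len() {
        ca = ca.wrapping_add(da[i]);
        oa[i] = ca;
    }
    for i in common..db.len() {
        cb = cb.wrapping_add(db[i]);
        ob[i] = cb;
    }
    out
}
```

A scalar loop like this will not reproduce the SIMD speedup; it only shows the data dependency structure that the SIMD version exploits.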

Path to a parallel-decode VBZ format

With K sub-chunks, all stages of the VBZ pipeline (delta, zigzag, SVB16) can be decoded independently on K cores:

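| Sub-chunks | decode throughput | vs. current |
|---|---|---|
| 1 (current VBZ) | 2.42 GB/s | N/A |
| 2 (single-threaded 2-chain) | 5.62 GB/s | 2.3× |
| 2 cores (projected) | ~11 GB/s | ~4.5× |
| 4 cores (projected) | ~22 GB/s | ~9× |
| 8 cores (projected) | ~44 GB/s | ~18× |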

The format change is: store K−1 carry values ((K−1) × 2 bytes) in the chunk header and split the encoded payload into K equal sub-streams. Compression ratio is unchanged. The svb crate provides decode_2chain and mid_carry as the building blocks.
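A rough sketch of the K-way generalisation on raw deltas follows. The names split_carries and decode_k_independent are hypothetical, the split is by element count (n / k per sub-chunk, remainder to the last), and the sub-chunks are decoded in a plain loop here; since each one only needs its own carry, each iteration could equally be handed to a separate core.

```rust
/// The K−1 carry values: the sample just before each split point.
/// Assumes samples.len() >= k so every sub-chunk is non-empty.
pub fn split_carries(samples: &[i16], k: usize) -> Vec<i16> {
    let n_sub = samples.len() / k;
    (1..k).map(|i| samples[i * n_sub - 1]).collect()
}

/// Decode each sub-chunk from its own starting carry, independently of the others.
pub fn decode_k_independent(deltas: &[i16], k: usize, carries: &[i16]) -> Vec<i16> {
    let n_sub = deltas.len() / k;
    let mut out = Vec::with_capacity(deltas.len());
    for i in 0..k {
        let start = i * n_sub;
        let end = if i + 1 == k { deltas.len() } else { start + n_sub };
        // Sub-chunk 0 starts from zero; sub-chunk i starts from carries[i - 1].
        let mut acc = if i == 0 { 0 } else { carries[i - 1] };
        out.extend(deltas[start..end].iter().map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        }));
    }
    out
}
```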

SVB-ZD pipeline

At 8192 i16 elements, GitHub Actions CI (Azure x86-64 and AArch64):

x86-64

| Path | Scalar | SSSE3 | SSSE3× | AVX2 | AVX2× |
|---|---|---|---|---|---|
| encode_svbzd | 158 Melem/s | 1,140 Melem/s | 7.2× | 1,100 Melem/s | 6.9× |
| decode_svbzd (3-pass) | 105 Melem/s | 696 Melem/s | 6.6× | 722 Melem/s | 6.9× |
| decode_svbzd_fused | 466 Melem/s | 1,510 Melem/s | 3.2× | 1,510 Melem/s | 3.2× |

AArch64

| Path | Scalar | NEON | NEON× |
|---|---|---|---|
| encode_svbzd | 195 Melem/s | 551 Melem/s | 2.8× |
| decode_svbzd (3-pass) | 210 Melem/s | 834 Melem/s | 4.0× |
| decode_svbzd_fused | 564 Melem/s | 1,850 Melem/s | 3.3× |

The SIMD encode path computes zigzag-delta inline without an intermediate Vec<u32> allocation. On AVX2 it processes 8 i16 values per iteration using _mm256_cvtepi16_epi32 + _mm_alignr_epi8; on NEON it uses vmovl_s16 + vextq_s32.

The fused decode collapses U32Classic decode, unzigzag, and undelta into one SIMD loop. The 2-ctrl-byte inner loop processes 8 values per iteration. Note that SSSE3 ≈ AVX2 for the fused path: the bottleneck is the serial delta carry chain, not SIMD width, so wider registers do not help once the carry chain is saturated.
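The zigzag-delta transform at the heart of SVB-ZD can be sketched in scalar form. This is an illustration only: the function names are hypothetical, and it assumes the first delta is taken against zero. Widening to i32 before subtracting is what keeps a delta such as 32767 − (−32768) = 65535 intact.

```rust
/// Zigzag-delta encode i16 samples into u32 codes, computing deltas in i32
/// so that differences outside the i16 range are not truncated.
pub fn zigzag_delta_encode(samples: &[i16]) -> Vec<u32> {
    let mut prev = 0i32;
    samples
        .iter()
        .map(|&s| {
            let d = s as i32 - prev;
            prev = s as i32;
            ((d << 1) ^ (d >> 31)) as u32
        })
        .collect()
}

/// Inverse: unzigzag each code and accumulate the running sum in one pass.
pub fn zigzag_delta_decode(codes: &[u32]) -> Vec<i16> {
    let mut acc = 0i32;
    codes
        .iter()
        .map(|&z| {
            let d = ((z >> 1) as i32) ^ -((z & 1) as i32);
            acc += d;
            acc as i16
        })
        .collect()
}
```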

SVB-ZD vs VBZ

Both pipelines operate on i16 signal data; the choice depends on the file format (BLOW5 vs POD5):

| Metric | VBZ | SVB-ZD |
|---|---|---|
| Codec | SVB16 (1-bit tags) | U32Classic (2-bit tags) |
| Encode (AVX2, 8192 elem) | 2,850 Melem/s | 1,100 Melem/s |
| Fused decode (AVX2, 8192 elem) | 1,840 Melem/s | 1,510 Melem/s |
| Fused decode (NEON, 8192 elem) | 2,280 Melem/s | 1,850 Melem/s |
| Wire format | ONT POD5 / VBZ | hasindu2008/slow5lib BLOW5 |

VBZ is faster because SVB16's 1-bit tags pack more tightly than U32Classic's 2-bit tags. SVB-ZD handles values that overflow i16 after delta without truncation.
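The difference in tag overhead can be estimated with a small size calculation. The helpers below are hypothetical and assume one control byte per 8 values with 1 or 2 data bytes each for SVB16, and one control byte per 4 values with 1 to 4 data bytes each for U32Classic.

```rust
/// Encoded size in bytes for SVB16: 1 tag bit and 1 or 2 data bytes per value.
pub fn svb16_size(values: &[u16]) -> usize {
    let ctrl = (values.len() + 7) / 8;
    let data: usize = values.iter().map(|&v| if v < 256 { 1 } else { 2 }).sum();
    ctrl + data
}

/// Encoded size in bytes for U32Classic: 2 tag bits and 1 to 4 data bytes per value.
pub fn u32classic_size(values: &[u32]) -> usize {
    let ctrl = (values.len() + 3) / 4;
    let data: usize = values
        .iter()
        .map(|&v| match v {
            0..=0xFF => 1,
            0x100..=0xFFFF => 2,
            0x1_0000..=0xFF_FFFF => 3,
            _ => 4,
        })
        .sum();
    ctrl + data
}
```

For eight values that all fit in one byte, this gives 9 bytes for SVB16 and 10 bytes for U32Classic; the data bytes are identical and only the control bytes differ.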

Results vs streamvbyte64 v0.2.0

Measured with simd-avx2 on GitHub Actions ubuntu-latest (Azure x86-64). streamvbyte64 uses its own runtime detection; numbers reflect its best available path.

| Benchmark | svb | sv64 | ratio |
|---|---|---|---|
| U32Classic decode/128 | 8.68 GB/s | 3.71 GB/s | 2.34x |
| U32Classic decode/1024 | 13.6 GB/s | 4.87 GB/s | 2.79x |
| U32Classic decode/8192 | 14.1 GB/s | 4.89 GB/s | 2.88x |
| U32Classic encode/128 | 6.65 GB/s | 2.33 GB/s | 2.85x |
| U32Classic encode/1024 | 8.26 GB/s | 3.08 GB/s | 2.68x |
| U32Classic encode/8192 | 8.93 GB/s | 3.20 GB/s | 2.79x |
| U32Variant0124 decode/128 | 8.98 GB/s | 3.48 GB/s | 2.58x |
| U32Variant0124 decode/1024 | 13.8 GB/s | 4.88 GB/s | 2.83x |
| U32Variant0124 decode/8192 | 14.2 GB/s | 5.00 GB/s | 2.84x |
| U32Variant0124 encode/128 | 6.74 GB/s | 2.37 GB/s | 2.84x |
| U32Variant0124 encode/1024 | 8.32 GB/s | 2.96 GB/s | 2.81x |
| U32Variant0124 encode/8192 | 8.89 GB/s | 3.01 GB/s | 2.95x |
| U64Coder1248 decode/128 | 12.0 GB/s | 5.89 GB/s | 2.04x |
| U64Coder1248 decode/1024 | 15.0 GB/s | 8.68 GB/s | 1.73x |
| U64Coder1248 decode/8192 | 14.8 GB/s | 8.76 GB/s | 1.69x |
| U64Coder1248 encode/128 | 7.37 GB/s | 3.52 GB/s | 2.09x |
| U64Coder1248 encode/1024 | 8.73 GB/s | 4.61 GB/s | 1.89x |
| U64Coder1248 encode/8192 | 8.85 GB/s | 4.80 GB/s | 1.84x |

svb is consistently 1.7x–2.9x faster than streamvbyte64. The u32 codecs see the largest gap (approaching 3×); the u64 codecs are closer because 8-byte elements reduce the SIMD parallelism available per control byte.

Running benchmarks

cargo bench --features simd-auto

Benchmarks cover all five codec variants across encode/decode and three slice sizes (128, 1024, 8192 elements). Criterion produces HTML reports in target/criterion/.

To run a single benchmark by name substring:

cargo bench --features simd-auto -- U32Classic/decode