svb

svb is a pure-Rust StreamVByte library covering all major codec variants for u16, u32, and u64 integers. Delta and zigzag encoding are composable layers on top. SIMD back-ends are available for x86-64 (SSSE3, AVX2) and AArch64 (NEON).

What is StreamVByte?

StreamVByte is a family of integer compression schemes that store values in a variable number of bytes. Rather than interleaving the control information with the data, StreamVByte places all control bytes in a separate stream. This layout makes SIMD-accelerated decode practical: a control byte indexes a precomputed shuffle table, and a single shuffle instruction then unpacks an entire group of values without branching.

encoded buffer layout
┌────────────────────┬─────────────────────────────────────┐
│   control stream   │            data stream               │
│  ceil(n/4) bytes   │         variable length              │
└────────────────────┴─────────────────────────────────────┘

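Each tag in the control stream describes the byte width of the corresponding value. The u32 and u64 codecs use 2-bit tags, so four values share one control byte and the control stream is ceil(n/4) bytes. Svb16 uses 1-bit tags, so eight values share one control byte and the control stream is ceil(n/8) bytes. The byte widths available depend on the codec variant.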

Codec variants at a glance

Variant	Element	Tag width	Byte widths	Best for
`Svb16`	`u16`	1 bit	1/2	ONT VBZ signal data
`U32Classic`	`u32`	2 bits	1/2/3/4	General u32, C-library compatible
`U32Variant0124`	`u32`	2 bits	0/1/2/4	Sparse u32 (many zeros)
`U64Coder1234`	`u64`	2 bits	1/2/3/4	u64 values that fit in u32
`U64Coder1248`	`u64`	2 bits	1/2/4/8	Full u64 range

API docs

Rustdoc API reference is published at docs.rs/svb.

Getting Started

Installation

Add svb to your Cargo.toml. For most users simd-auto is the right choice, because it detects the best available SIMD path at runtime:

[dependencies]
svb = { version = "0.2", features = ["simd-auto"] }

Feature flags

Flag	Effect
`std` (default)	Enables `std`; implies `alloc`
`alloc`	Enables all encode/decode APIs with no other dependencies
`simd-auto`	Runtime CPU detection; selects the best available SIMD path
`simd-avx2`	Compile-time AVX2 (asserts AVX2 is available at runtime)
`simd-ssse3`	Compile-time SSSE3
`simd-neon`	Compile-time NEON (AArch64 only; NEON is always available there)

The compile-time SIMD flags (simd-avx2, simd-ssse3, simd-neon) are intended for environments where the target CPU is known at build time, such as cross-compilation or RUSTFLAGS="-C target-cpu=native". In all other cases, prefer simd-auto.

Basic usage

Every codec is a zero-sized type with encode and decode methods. encode returns a Vec<u8>; decode takes the byte slice and the original element count.

#![allow(unused)]
fn main() {
use svb::u32::U32Classic;

let values: Vec<u32> = vec![1, 500, 70_000, 16_000_000];
let encoded = U32Classic.encode(&values);
let decoded = U32Classic.decode(&encoded, values.len()).unwrap();
assert_eq!(decoded, values);
}

Appending to an existing buffer

Every codec exposes encode_into and decode_into variants that append to a caller-supplied Vec, avoiding extra allocation:

#![allow(unused)]
fn main() {
use svb::u32::U32Classic;

let mut buf = Vec::new();
U32Classic.encode_into(&[1u32, 2, 3], &mut buf);
U32Classic.encode_into(&[4u32, 5, 6], &mut buf);
}

This is useful when building a larger serialised format where multiple compressed sequences are concatenated. The caller is responsible for recording the element counts needed for decode.

Choosing a codec

u16 data (e.g. ONT signal): use Svb16, or the higher-level encode_vbz/decode_vbz pipeline.
u32 general: use U32Classic. Wire-compatible with the Lemire C library.
u32 with many zeros: use U32Variant0124. Zero values consume no data bytes.
u64 values that never exceed u32::MAX: use U64Coder1234.
u64 full range: use U64Coder1248.

For sorted or time-series data, compose any codec with delta encoding to compress differences rather than raw values.

Encoding Guide

This page explains how StreamVByte, delta, and zigzag work, when to use each, and how they compose. For the API itself see Delta and Zigzag, Codec Variants, and the API reference.

Fixed-width integers and why they waste space

Every integer type has a fixed storage width. A u32 always occupies 4 bytes, regardless of its value:

value           memory (big-endian for clarity)   bytes used
────────────────────────────────────────────────────────────
            1   00 00 00 01                       4 (3 wasted)
          300   00 00 01 2C                       4 (2 wasted)
       75 000   00 01 24 F8                       4 (1 wasted)
   16 000 000   00 F4 24 00                       4 (1 wasted)
4 294 967 295   FF FF FF FF                       4 (none wasted)

The high-order bytes are zero whenever the value is small. Those zeros carry no information, yet they occupy the same storage as any other byte. For a u32 array where most values are below 256, three quarters of the storage is zero-padding.

This matters in practice because integer arrays in real applications are rarely uniformly distributed across the full type range. File offsets, timestamps, sensor readings, and index lists all tend to cluster at small magnitudes relative to the maximum the type can hold. An array of one million u32 values representing document word frequencies, for example, might use only the bottom 12 bits of each element, leaving 20 bits per value (2.5 MB per million elements) as wasted zeros.

Variable-byte encoding solves this by storing only the bytes that carry information and recording how many bytes each value used. StreamVByte is a specific variable-byte scheme designed to make that decoding fast with SIMD.
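
As a rough illustration of "only the bytes that carry information", the sketch below counts the significant bytes of a u32 from its leading-zero count. The helper name significant_bytes is hypothetical and is not part of svb. Zero is treated as occupying one byte here, which matches the 1/2/3/4 width scheme but not the 0/1/2/4 one.

```rust
/// Number of bytes needed to hold `v` once its high-order zero bytes are dropped.
/// Hypothetical helper for illustration; zero still counts as one byte.
fn significant_bytes(v: u32) -> usize {
    // Number of bits up to and including the highest set bit (0 for v == 0).
    let bits = 32 - v.leading_zeros() as usize;
    // Round up to whole bytes, with a minimum of one byte.
    bits.div_ceil(8).max(1)
}
```

Applied to the values in the table above, this gives 1, 2, 3, 3, and 4 bytes respectively.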

StreamVByte

Most integers in real data are small. Fixed-width encoding wastes bytes on the high-order zeros; StreamVByte stores only the bytes that carry information.

The key design decision is where to put the width metadata. Classic variable-byte schemes (such as SQLite's varint or Protocol Buffers' varint) interleave the length information with the value bytes, typically as a continuation bit in every byte, so the decoder must branch on every byte and SIMD is hard to apply. StreamVByte separates the metadata into a control stream and the integer bytes into a data stream:

┌──────────────────────────────────┐
│ control stream                   │
│  tag tag tag tag tag tag tag tag  │  ← 2 bits per value (u32/u64), 1 bit (u16)
└──────────────────────────────────┘
┌──────────────────────────────────┐
│ data stream                      │
│  [value 0 bytes][value 1 bytes]… │  ← tightly packed, no separators
└──────────────────────────────────┘

Because all the widths for a block of values live in one or two control bytes, a SIMD decoder can read them all at once, look up a pre-built shuffle table, and unpack 4–8 values in a single pshufb instruction. The data stream is a plain byte array with no branch points.

A concrete example

Four u32 values encoded with U32Classic (1/2/3/4-byte widths):

values:   [   1,   300,  75000,   5 ]
widths:   [ 1 B,   2 B,    3 B, 1 B ]  ← determined by value magnitude
tags:     [  00,    01,     10,  00 ]  ← 2-bit tag per value

control byte:  0b_00_10_01_00  (4 tags packed LSB-first)

data bytes:    01 | 2C 01 | F8 24 01 | 05
               └1┘  └─300─┘  └─75000─┘  └5┘

The full encoded output is 8 bytes (1 control + 7 data) for four 32-bit values that would require 16 bytes in fixed-width form.
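
The example can be reproduced with a short scalar encoder. This is a minimal sketch of the 1/2/3/4-byte layout described in this guide, not svb's implementation (which uses SIMD); the function name encode_1234 is hypothetical.

```rust
/// Scalar sketch of the 1/2/3/4-byte layout: control bytes first, then data.
/// Hypothetical helper for illustration; not an svb API.
fn encode_1234(values: &[u32]) -> Vec<u8> {
    // One control byte per group of four values, initially all-zero tags.
    let ctrl_len = values.len().div_ceil(4);
    let mut out = vec![0u8; ctrl_len];
    for (i, &v) in values.iter().enumerate() {
        // Byte width 1..=4, derived from the highest set bit.
        let width = (32 - v.leading_zeros() as usize).div_ceil(8).max(1);
        // The 2-bit tag is width - 1, packed LSB-first within the control byte.
        let tag = (width - 1) as u8;
        out[i / 4] |= tag << ((i % 4) * 2);
        // Append only the low `width` bytes, little-endian.
        out.extend_from_slice(&v.to_le_bytes()[..width]);
    }
    out
}
```

Calling encode_1234(&[1, 300, 75000, 5]) yields the control byte 0x24 followed by the data bytes shown above.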

Codec variant selection

The five variants differ in which byte widths they support and what element type they encode:

Variant	Element	Byte widths	Best for
`Svb16`	`u16`	1 / 2	16-bit data; values mostly ≤ 255
`U32Classic`	`u32`	1 / 2 / 3 / 4	General-purpose u32; compatible with Lemire C library
`U32Variant0124`	`u32`	0 / 1 / 2 / 4	Sparse data with many exact zeros (0 bytes stored)
`U64Coder1234`	`u64`	1 / 2 / 3 / 4	u64 values known to fit in u32
`U64Coder1248`	`u64`	1 / 2 / 4 / 8	Full u64 range

U32Variant0124 skips the 3-byte width and adds a 0-byte width: a zero value stores no data bytes at all, only its tag. This is a significant win for sparse arrays where many values are exactly zero.

Delta encoding

Delta encoding replaces each value with its difference from the previous one:

original:  [ 1000,  1003,  1007,  1004,  1010 ]
                ↘      ↘      ↘      ↘
deltas:    [ 1000,    +3,    +4,    -3,    +6  ]

The first delta is the first value itself (difference from an implicit zero, or from a caller-supplied carry). Every subsequent delta is values[i] - values[i-1].

For sequences where adjacent values are close together (sorted integers, time-series measurements, sensor readings), the deltas are much smaller than the raw values. Smaller values encode to fewer bytes in any variable-byte scheme.

When delta helps

Data pattern	Example	Delta effect
Sorted / monotone	File offsets, timestamps	Deltas are small positive integers
Slowly drifting	Temperature readings	Deltas cluster near zero
Periodic / oscillating	ADC signal samples	Deltas small if bandwidth is limited
Uniformly random	Hash values	No benefit; deltas are as large as the values

Delta encoding is a lossless, reversible transform. Decoding is a prefix sum: values[i] = deltas[0] + deltas[1] + … + deltas[i]. The serial dependency between elements is the main cost; see Performance for how the SIMD prefix-sum implementation handles it.
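
The prefix-sum view of decoding can be sketched in a few lines of scalar Rust. The function below is a hypothetical illustration, not svb's delta::decode; the use of wrapping arithmetic and the explicit initial parameter are assumptions made for the sketch.

```rust
/// Delta decode as a running (prefix) sum, starting from `initial`.
/// Hypothetical helper; wrapping arithmetic is an assumption of this sketch.
fn delta_decode_sketch(initial: i32, deltas: &[i32]) -> Vec<i32> {
    let mut acc = initial;
    deltas
        .iter()
        .map(|&d| {
            // Each output depends on the previous one: this is the serial chain.
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}
```

Every output element depends on the one before it, which is why the loop cannot be split into independent lanes without extra work.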

Signed vs unsigned

delta in svb is implemented for i16, i32, i64, u32, and u64. For non-monotone data (where values can decrease), use a signed type, because some deltas will be negative and a signed representation preserves that. For guaranteed non-decreasing sequences (file offsets, sorted timestamps), an unsigned type is fine and avoids the overhead of zigzag.

Zigzag encoding

Variable-byte codecs assign shorter encodings to smaller non-negative integers. A signed delta of −1 would be stored as 0xFFFFFFFF (4 bytes) in a u32 codec, with no compression at all.

Zigzag solves this by remapping signed integers to unsigned so that small absolute values map to small codes:

signed →  unsigned
     0 →  0
    -1 →  1
    +1 →  2
    -2 →  3
    +2 →  4
    -3 →  5
    +3 →  6
   ...

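The formula is (n << 1) ^ (n >> (bits - 1)), where >> is an arithmetic (sign-extending) shift: two shifts and an XOR, with no branches. Decoding is (n >> 1) ^ -(n & 1), using a logical shift on the unsigned code. Both directions are branchless and LLVM auto-vectorizes them.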

After zigzag, a signed delta of −1 becomes the unsigned value 1, which encodes in a single byte. A delta of +127 becomes 254 and a delta of −128 becomes 255, both still a single byte. Only deltas outside −128..=127 spill into a second byte.
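
For concreteness, here is a scalar sketch of the mapping for 32-bit values. The function names are hypothetical stand-ins rather than the svb::zigzag API, which operates on slices.

```rust
/// Zigzag-encode one i32 (hypothetical scalar helper, not the svb API).
fn zigzag_encode_i32(n: i32) -> u32 {
    // `n >> 31` is an arithmetic shift: 0 for non-negative n, -1 (all ones) otherwise.
    ((n << 1) ^ (n >> 31)) as u32
}

/// Inverse of `zigzag_encode_i32`.
fn zigzag_decode_i32(z: u32) -> i32 {
    // `z >> 1` is a logical shift; the mask is all ones when the low bit is set.
    ((z >> 1) as i32) ^ -((z & 1) as i32)
}
```

The mapping is a bijection over the full i32 range, so i32::MIN and i32::MAX round-trip as well.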

Composing the three

Delta → zigzag → StreamVByte is a standard pipeline for compressing integer sequences that are slowly varying or oscillating. Each stage does one job:

raw values
  │
  ▼  delta encode
differences (signed, small magnitude)
  │
  ▼  zigzag encode
differences (unsigned, small magnitude)
  │
  ▼  StreamVByte encode
compact byte stream

Worked example

Five i16 ADC-style samples:

stage         values
───────────────────────────────────────────────
raw           [ 1000,  1003,  1007,  1004,  1010 ]
after delta   [ 1000,     3,     4,    -3,     6 ]
after zigzag  [ 2000,     6,     8,     5,    12 ]
after SVB16   1 ctrl byte + 6 data bytes  (7 bytes vs 10 raw bytes)

The first value (1000) stays large because it is the absolute anchor. The subsequent values (the deltas) all fit in a single byte after zigzag. In practice, for signals with small bandwidth relative to their absolute level, the per-value cost quickly drops to 1 byte once the anchor is amortised over the chunk.
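
The three stages of the worked example can be traced with a scalar sketch. This is an illustration under stated assumptions, not the crate's encode_vbz: the function name is hypothetical, the first delta is taken against an implicit zero, and 2-byte values are assumed to be stored little-endian.

```rust
/// Delta → zigzag → 1-bit-tag packing for i16 samples (scalar sketch).
/// Hypothetical helper; byte order of 2-byte values is assumed little-endian.
fn vbz_like_encode(samples: &[i16]) -> Vec<u8> {
    // One control byte per group of eight values, initially all-zero tags.
    let ctrl_len = samples.len().div_ceil(8);
    let mut out = vec![0u8; ctrl_len];
    let mut prev = 0i16;
    for (i, &s) in samples.iter().enumerate() {
        // Stage 1: difference from the previous sample (implicit zero at the start).
        let d = s.wrapping_sub(prev);
        prev = s;
        // Stage 2: zigzag16 maps the signed delta to an unsigned code.
        let z = ((d << 1) ^ (d >> 15)) as u16;
        // Stage 3: tag 0 means one data byte, tag 1 means two.
        if z > 255 {
            out[i / 8] |= 1 << (i % 8);
            out.extend_from_slice(&z.to_le_bytes());
        } else {
            out.push(z as u8);
        }
    }
    out
}
```

For the five samples above, only the anchor value takes two data bytes; the four deltas take one byte each.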

Choosing what to compose

Data	Recipe
Sorted unsigned integers	delta → `U32Classic` or `U64Coder1248`
Non-monotone integers	delta → zigzag → `U32Classic`
16-bit oscillating signal	delta → zigzag → `Svb16` (= the VBZ pipeline)
Sparse data with many zeros	`U32Variant0124` alone, or delta first if it helps
Already-small unsigned values	`U32Classic` or `Svb16` directly

Codec Variants

svb provides five codec variants spanning three element widths. Each is a zero-sized type implementing the same encode/decode surface.

Variant	Element	Tag bits	Byte widths	Wire-compatible with
`Svb16`	`u16`	1	1/2	ONT `vbz_hdf_plugin`
`U32Classic`	`u32`	2	1/2/3/4	Lemire C library, `stream-vbyte` crate
`U32Variant0124`	`u32`	2	0/1/2/4	Lemire "0124" variant
`U64Coder1234`	`u64`	2	1/2/3/4	`streamvbyte64::Coder1234` (u32 values)
`U64Coder1248`	`u64`	2	1/2/4/8	`streamvbyte64::Coder1248`

Tag encoding

All u32 and u64 codecs pack four 2-bit tags into each control byte, LSB-first:

control byte n
bits 1:0  → tag for value 4n+0
bits 3:2  → tag for value 4n+1
bits 5:4  → tag for value 4n+2
bits 7:6  → tag for value 4n+3

Svb16 uses 1-bit tags and packs eight tags per control byte.

Buffer layout

All codecs use the same flat layout: control bytes first, data bytes immediately after.

[ ctrl[0] ctrl[1] ... ctrl[ceil(n/4)-1] | data bytes ... ]

The control stream length is always ceil(n / 4) bytes for 2-bit codecs, ceil(n / 8) for Svb16. No length prefix is stored; the caller supplies the element count to decode.
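
A minimal sketch of that arithmetic follows. The helper name control_len is hypothetical and not part of svb; the value it returns is also the offset at which the data stream starts.

```rust
/// Control-stream length in bytes for `n` values with `tag_bits` bits per tag.
/// Hypothetical helper for illustration; not an svb API.
fn control_len(n: usize, tag_bits: usize) -> usize {
    // Total tag bits, rounded up to whole bytes.
    (n * tag_bits).div_ceil(8)
}
```

With tag_bits = 2 this reduces to ceil(n / 4), and with tag_bits = 1 to ceil(n / 8).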

SVB16

Svb16 compresses u16 values using 1-bit tags. Each value is stored in either 1 byte (values 0–255) or 2 bytes (values 256–65535). Eight tags share one control byte.

This is the codec used in the VBZ pipeline for Oxford Nanopore POD5 signal data.

Tag table

Tag	Byte width	Value range
0	1	0–255
1	2	256–65535

Example

#![allow(unused)]
fn main() {
use svb::u16::Svb16;

let values: Vec<u16> = vec![1, 300, 0, 65000];
let encoded = Svb16.encode(&values);
let decoded = Svb16.decode(&encoded, values.len()).unwrap();
assert_eq!(decoded, values);
}

Control stream layout

Tags are packed 8 per byte, LSB-first. For n values the control stream is ceil(n / 8) bytes.

control byte k  →  tags for values 8k+0 through 8k+7
bit 0  =  tag for value 8k+0
bit 1  =  tag for value 8k+1
...
bit 7  =  tag for value 8k+7

U32Classic

U32Classic is the original Lemire StreamVByte variant for u32 values. Each value is stored in 1–4 bytes depending on its magnitude.

Wire-compatible with the Lemire C library and the stream-vbyte crate.

Tag table

Tag	Byte width	Value range
0	1	0–255
1	2	256–65535
2	3	65536–16777215
3	4	16777216–4294967295

Example

#![allow(unused)]
fn main() {
use svb::u32::U32Classic;

let values: Vec<u32> = vec![1, 500, 70_000, 16_000_000];
let encoded = U32Classic.encode(&values);
let decoded = U32Classic.decode(&encoded, values.len()).unwrap();
assert_eq!(decoded, values);
}

Wire format example

Encoding [1, 256, 65536, 0xFFFFFFFF] produces:

byte 0:     0xE4        control byte (tags: 0, 1, 2, 3 packed LSB-first → 0b11_10_01_00)
byte 1:     0x01        value 0: 1 (1 byte)
bytes 2-3:  0x00 0x01   value 1: 256 (2 bytes, little-endian)
bytes 4-6:  0x00 0x00 0x01   value 2: 65536 (3 bytes)
bytes 7-10: 0xFF 0xFF 0xFF 0xFF  value 3: 4294967295 (4 bytes)
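
To make the byte listing concrete, the following scalar sketch walks the same layout in the decode direction. It is an illustration only (the real decoder is SIMD-accelerated), and the function name decode_1234 is hypothetical.

```rust
/// Scalar decode of the 1/2/3/4-byte layout. Returns None on truncated input.
/// Hypothetical helper for illustration; not an svb API.
fn decode_1234(buf: &[u8], n: usize) -> Option<Vec<u32>> {
    // Control bytes come first; the data stream starts right after them.
    let ctrl_len = n.div_ceil(4);
    let (ctrl, mut data) = (buf.get(..ctrl_len)?, buf.get(ctrl_len..)?);
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        // Extract the 2-bit tag for value i (LSB-first within the control byte).
        let tag = (ctrl[i / 4] >> ((i % 4) * 2)) & 0b11;
        let width = tag as usize + 1;
        let bytes = data.get(..width)?;
        // Zero-extend the little-endian bytes to a full u32.
        let mut le = [0u8; 4];
        le[..width].copy_from_slice(bytes);
        out.push(u32::from_le_bytes(le));
        data = &data[width..];
    }
    Some(out)
}
```

Feeding it the 11 bytes listed above with n = 4 recovers [1, 256, 65536, 0xFFFFFFFF].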

When to use

U32Classic is the right default for general u32 compression and any context where wire compatibility with the C library matters. For data with many exact zeros, U32Variant0124 compresses better.

U32Variant0124

U32Variant0124 is an alternative u32 codec where zero values consume no data bytes at all. The byte-width options are 0, 1, 2, or 4. There is no 3-byte option.

Wire-compatible with the Lemire "0124" variant and streamvbyte64::Coder0124.

Tag table

Tag	Byte width	Value range
0	0	0 (exactly)
1	1	1–255
2	2	256–65535
3	4	65536–4294967295

Note that values in the range 65536–16777215 require 4 bytes (not 3), which is worse than U32Classic for that range. The benefit comes from sparse data where many values are zero.
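
The trade-off can be checked by summing data-stream bytes under each width table. The helpers below are hypothetical and only model the tag tables in this document; they do not call into svb, and they ignore the control stream, which is the same size for both codecs.

```rust
/// Data bytes per value under the 1/2/3/4 table (hypothetical helper).
fn width_1234(v: u32) -> usize {
    match v {
        0..=0xFF => 1,
        0x100..=0xFFFF => 2,
        0x1_0000..=0xFF_FFFF => 3,
        _ => 4,
    }
}

/// Data bytes per value under the 0/1/2/4 table (hypothetical helper).
fn width_0124(v: u32) -> usize {
    match v {
        0 => 0,
        1..=0xFF => 1,
        0x100..=0xFFFF => 2,
        _ => 4,
    }
}

/// Total data-stream size for `values` under the given width table.
fn data_bytes(values: &[u32], width: fn(u32) -> usize) -> usize {
    values.iter().map(|&v| width(v)).sum()
}
```

For the sparse input [0, 0, 42, 0, 0, 255, 0] the 0/1/2/4 table needs 2 data bytes against 7, while for four copies of 70000 it needs 16 against 12.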

Example

#![allow(unused)]
fn main() {
use svb::u32::U32Variant0124;

// Zero-valued elements cost 0 bytes in the data stream.
let values: Vec<u32> = vec![0, 0, 42, 0, 0, 255, 0];
let encoded = U32Variant0124.encode(&values);
let decoded = U32Variant0124.decode(&encoded, values.len()).unwrap();
assert_eq!(decoded, values);
}

When to use

Use U32Variant0124 when a significant fraction of values are exactly zero, such as sparse histograms, run-length-style data, or delta-encoded sorted lists where many differences are zero. For general data with few zeros, U32Classic is typically better because it can use 3 bytes for values in the 65536–16777215 range.

U64Coder1234

U64Coder1234 stores u64 values using 1–4 bytes, the same byte-width table as U32Classic. Values must fit within u32::MAX (4294967295); values above that are silently truncated on encode.

The wire format is identical to U32Classic; only the element type differs. This means a U64Coder1234-encoded buffer can be decoded by U32Classic, which yields the same values as u32. When U64Coder1234 itself decodes, each value is zero-extended to u64.

Tag table

Tag	Byte width	Value range
0	1	0–255
1	2	256–65535
2	3	65536–16777215
3	4	16777216–4294967295

Example

#![allow(unused)]
fn main() {
use svb::u64::U64Coder1234;

let values: Vec<u64> = vec![1, 500, 70_000, u32::MAX as u64];

// check_range returns the index of the first out-of-range value, if any.
assert_eq!(U64Coder1234.check_range(&values), None);

let encoded = U64Coder1234.encode(&values);
let decoded = U64Coder1234.decode(&encoded, values.len()).unwrap();
assert_eq!(decoded, values);
}

Range checking

Call check_range before encoding if the input may contain values above u32::MAX:

#![allow(unused)]
fn main() {
use svb::u64::U64Coder1234;

let values: Vec<u64> = vec![1, u64::MAX, 3];
if let Some(idx) = U64Coder1234.check_range(&values) {
    eprintln!("value at index {idx} exceeds u32::MAX");
}
}

When to use

Use U64Coder1234 when your data is logically u64 but all values fit within 32 bits. It produces a more compact encoding than U64Coder1248 for such data because values in the 65536–16777215 range take 3 bytes rather than 4.

U64Coder1248

U64Coder1248 stores u64 values using 1, 2, 4, or 8 bytes. It covers the full u64 range without truncation.

Wire-compatible with streamvbyte64::Coder1248.

Tag table

Tag	Byte width	Value range
0	1	0–255
1	2	256–65535
2	4	65536–4294967295
3	8	4294967296–18446744073709551615

Note there is no 3-byte option. Values in the range 65536–16777215 require 4 bytes.

Example

#![allow(unused)]
fn main() {
use svb::u64::U64Coder1248;

let values: Vec<u64> = vec![1, 500, 1 << 32, u64::MAX];
let encoded = U64Coder1248.encode(&values);
let decoded = U64Coder1248.decode(&encoded, values.len()).unwrap();
assert_eq!(decoded, values);
}

When to use

Use U64Coder1248 when values may exceed u32::MAX. If all values fit within 32 bits, U64Coder1234 gives better compression because it can use 3 bytes for values in the 65536–16777215 range.

Delta and Zigzag

Delta and zigzag are independent slice transforms. They are not tied to any specific codec; you compose them with whichever codec suits your data.

Delta encoding

Delta encoding replaces each value with the difference from the previous value. This is effective for sorted or slowly-varying data: the differences are small even when the raw values are large.

#![allow(unused)]
fn main() {
use svb::{delta, u64::U64Coder1248};

// Sorted u64 timestamps; differences are small positive numbers.
let timestamps: Vec<u64> = vec![1_000_000, 1_001_500, 1_003_000, 1_010_000];

let deltas = delta::encode(&timestamps);
let encoded = U64Coder1248.encode(&deltas);

let decoded_deltas = U64Coder1248.decode(&encoded, deltas.len()).unwrap();
let recovered = delta::decode(&decoded_deltas);
assert_eq!(recovered, timestamps);
}

delta is implemented for i16, i32, i64, u32, and u64. Use a signed type when the sequence is non-monotone and you intend to follow with zigzag; use an unsigned type for sorted data where all differences are non-negative.

Streaming / chunked delta

For streaming use-cases where data arrives in chunks, use encode_with_initial and decode_with_initial to carry the boundary value across chunks:

#![allow(unused)]
fn main() {
use svb::delta;

let chunk_a: Vec<u32> = vec![100, 105, 110];
let chunk_b: Vec<u32> = vec![115, 120, 125];

// Encode chunk A; initial value is 0.
let (deltas_a, last_a) = delta::encode_with_initial(0, &chunk_a);

// Encode chunk B using the last value from chunk A as the initial.
let (deltas_b, _last_b) = delta::encode_with_initial(last_a, &chunk_b);

// Decode chunk A; initial value is 0.
let (recovered_a, last_a) = delta::decode_with_initial(0, &deltas_a);

// Decode chunk B using the boundary value.
let (recovered_b, _) = delta::decode_with_initial(last_a, &deltas_b);

assert_eq!(recovered_a, chunk_a);
assert_eq!(recovered_b, chunk_b);
}

Zigzag encoding

Zigzag maps signed integers to unsigned integers so that small absolute values map to small codes. This allows signed differences from delta encoding to be compressed efficiently by any unsigned codec.

0  →  0
-1 →  1
 1 →  2
-2 →  3
 2 →  4
...

#![allow(unused)]
fn main() {
use svb::{delta, zigzag, u32::U32Classic};

// Arbitrary i32 data: delta, then zigzag, then U32Classic.
let samples: Vec<i32> = vec![-500, 200, -100, 900];

let deltas: Vec<i32> = delta::encode(&samples);
let codes: Vec<u32> = zigzag::encode(&deltas);
let encoded = U32Classic.encode(&codes);

let decoded_codes = U32Classic.decode(&encoded, codes.len()).unwrap();
let decoded_deltas: Vec<i32> = zigzag::decode(&decoded_codes);
let recovered: Vec<i32> = delta::decode(&decoded_deltas);
assert_eq!(recovered, samples);
}

zigzag is implemented for i16 (producing u16), i32 (producing u32), and i64 (producing u64).

VBZ Pipeline

The VBZ pipeline compresses 16-bit ADC signal data as used in Oxford Nanopore POD5 files. It chains four stages:

raw i16 samples
  →  delta encode    (1st-order differences; small values dominate)
  →  zigzag encode   (signed i16 → unsigned u16; small |values| → small codes)
  →  SVB16 encode    (StreamVByte-16: 1-bit control stream)
  →  zstd            (outer entropy coding; NOT part of this crate)

The svb crate handles stages 1–3. The outer zstd layer is left to the caller.

High-level API

#![allow(unused)]
fn main() {
use svb::{encode_vbz, decode_vbz};

let samples: Vec<i16> = vec![100, 101, 103, 102, 98];

// Encode: i16 → delta → zigzag → SVB16 bytes
let encoded = encode_vbz(&samples);

// Decode: SVB16 bytes → unzigzag → undelta → i16
let decoded = decode_vbz(&encoded, samples.len()).unwrap();
assert_eq!(decoded, samples);
}

Low-level / into variants

To reuse an existing allocation or build a larger buffer:

#![allow(unused)]
fn main() {
use svb::{encode_vbz_into, decode_vbz_into};

let samples: Vec<i16> = vec![100, 101, 103, 102, 98];
let mut buf: Vec<u8> = Vec::new();
encode_vbz_into(&samples, &mut buf);

let mut out: Vec<i16> = Vec::new();
decode_vbz_into(&buf, samples.len(), &mut out).unwrap();
assert_eq!(out, samples);
}

SVB-ZD Pipeline

The SVB-ZD pipeline compresses 16-bit signal data into a compact byte stream. It is wire-compatible with the slow5lib SLOW5_COMPRESS_SVB_ZD format used in BLOW5 files. The pipeline chains four stages:

raw i16 samples
  →  widen to i32
  →  fused zigzag-delta (first-order differences → zigzag32 to make values unsigned)
  →  U32Classic encode  (2-bit control stream, 1–4 bytes per value)
  →  zstd               (outer entropy coding; NOT part of this crate)

The key difference from VBZ is the element width: SVB-ZD widens i16 → i32 before the zigzag-delta step, so it uses the U32Classic codec rather than SVB16. This costs one extra bit of tag width but removes SVB16's 2-byte cap: values that overflow i16 after delta (e.g. baseline resets) are encoded correctly without truncation.

High-level API

#![allow(unused)]
fn main() {
use svb::{encode_svbzd, decode_svbzd};

let samples: Vec<i16> = vec![100, 101, 103, 102, 98];

// Encode: i16 → widen → zigzag-delta → U32Classic bytes
let encoded = encode_svbzd(&samples);

// Decode: U32Classic bytes → unzigzag-undelta → i16
let decoded = decode_svbzd(&encoded, samples.len()).unwrap();
assert_eq!(decoded, samples);
}

Low-level / into variants

To reuse an existing allocation or append to an existing buffer:

#![allow(unused)]
fn main() {
use svb::{encode_svbzd_into, decode_svbzd_into};

let samples: Vec<i16> = vec![100, 101, 103, 102, 98];
let mut buf: Vec<u8> = Vec::new();
encode_svbzd_into(&samples, &mut buf);

let mut out: Vec<i16> = Vec::new();
decode_svbzd_into(&buf, samples.len(), &mut out).unwrap();
assert_eq!(out, samples);
}

Fused decode

decode_svbzd_fused collapses all three decode stages (U32Classic, unzigzag, undelta) into a single SIMD loop. This avoids intermediate buffers and is the preferred path for high-throughput BLOW5 reads:

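#![allow(unused)]
fn main() {
use svb::{encode_svbzd, decode_svbzd_fused};

let samples: Vec<i16> = vec![100, 101, 103, 102, 98];
let encoded = encode_svbzd(&samples);

// Single-pass decode: U32Classic → unzigzag → undelta.
let decoded = decode_svbzd_fused(&encoded, samples.len()).unwrap();
assert_eq!(decoded, samples);
}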

decode_svbzd_fused_into appends into an existing Vec<i16>.

Parallel decode with fused_from

decode_svbzd_fused_from accepts a caller-supplied initial carry value, enabling independent decoding of any sub-stream that starts at a known split point. This is the building block for parallel decoding:

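#![allow(unused)]
fn main() {
use svb::decode_svbzd_fused_from;

// Assumed to be in scope: `stream_b` (encoded bytes for the second half),
// `n` and `n_half` (total and first-half sample counts), and `mid_carry`
// (the initial carry value at the split point).
// Decode the second half independently, with the known carry from the midpoint.
let half_b = decode_svbzd_fused_from(&stream_b, n - n_half, mid_carry).unwrap();
}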

The _into variant (decode_svbzd_fused_from_into) appends into an existing Vec<i16>.

Wire format

The encoded byte layout is identical to a U32Classic-encoded Vec<u32> where each u32 is zigzag32(samples[i].widened() - samples[i-1].widened()). There is no additional header; the caller is responsible for tracking n (the number of original i16 samples).

The zigzag32 mapping is (delta << 1) ^ (delta >> 31), the same convention used by slow5lib.
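
The per-sample transform can be sketched as follows. This is a scalar illustration of the formula above, not the crate's SIMD encoder; the function name is hypothetical, and treating the predecessor of the first sample as zero is an assumption of the sketch.

```rust
/// Widen i16 → i32, take first-order differences, then apply zigzag32.
/// Hypothetical helper; the implicit zero predecessor is an assumption.
fn zigzag_delta_i16(samples: &[i16]) -> Vec<u32> {
    let mut prev = 0i32;
    samples
        .iter()
        .map(|&s| {
            let cur = i32::from(s);
            // After widening, the difference always fits: |delta| <= 65535.
            let delta = cur - prev;
            prev = cur;
            ((delta << 1) ^ (delta >> 31)) as u32
        })
        .collect()
}
```

A jump from -32768 to 32767 produces a delta of 65535, which does not fit in i16 but is represented exactly here. The resulting u32 codes are what the U32Classic stage would then compress.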

ex-zd Pipeline

The ex-zd pipeline improves on SVB-ZD for 16-bit signal data. It is wire-compatible with the slow5lib SLOW5_COMPRESS_EX_ZD format used in BLOW5 files. The pipeline chains four stages:

raw i16 samples
  →  qts               (find largest right-shift q ≤ 5 that loses no low bits, apply it)
  →  zigzag-delta       (first-order differences → zigzag16 to make values unsigned, u16 domain)
  →  patched exceptions (values ≤ 255 → literal byte; values > 255 → position + residual,
                          both StreamVByte-encoded with U32Classic)
  →  zstd               (outer entropy coding; NOT part of this crate)

Two differences from SVB-ZD:

qts pre-pass. ADC samples are frequently multiples of a power of two (the low bits carry no information). Shifting them out before delta/zigzag makes the resulting deltas smaller and more compressible. The shift is lossless and reversed on decode.
Patched/exception encoding instead of a per-value StreamVByte tag. Rather than SVB-ZD's 2-bit-tag-per-value scheme, ex-zd stores most zigzag-delta values as a single literal byte and pulls the rare large values ("exceptions") out into a separate, StreamVByte-encoded side channel. This tends to compress better when most deltas are small and only occasional spikes need the full range.
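
One way to picture the qts pre-pass is the scalar sketch below. It is an assumption-laden illustration rather than the code behind svb::quantize: the function names are hypothetical, and the shift is found by OR-ing all samples and counting trailing zero bits, capped at 5.

```rust
/// Largest right-shift (at most 5) that drops only zero bits from every sample.
/// Hypothetical helper; not the svb::quantize API.
fn find_shift_sketch(samples: &[i16]) -> u32 {
    // A low bit survives the OR if any sample has it set.
    let combined = samples.iter().fold(0i16, |acc, &s| acc | s);
    combined.trailing_zeros().min(5)
}

/// Apply the shift (arithmetic, so the sign is preserved).
fn shift_down(samples: &[i16], q: u32) -> Vec<i16> {
    samples.iter().map(|&s| s >> q).collect()
}

/// Reverse the shift; exact because only zero bits were dropped.
fn shift_up(shifted: &[i16], q: u32) -> Vec<i16> {
    shifted.iter().map(|&s| s << q).collect()
}
```

For samples such as [8, 12, -4] the shift is 2, the shifted values are [2, 3, -1], and shifting back up restores the input exactly.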

Unlike encode_vbz/encode_svbzd, the ex-zd frame is self-describing: it embeds a version byte and the sample count, so decode_exzd takes no n parameter.

High-level API

#![allow(unused)]
fn main() {
use svb::{encode_exzd, decode_exzd};

let samples: Vec<i16> = vec![100, 101, 103, 102, 98];

// Encode: i16 → qts → zigzag-delta → patched(U32Classic) bytes
let encoded = encode_exzd(&samples);

// Decode: bytes are self-describing (version + sample count embedded)
let decoded = decode_exzd(&encoded).unwrap();
assert_eq!(decoded, samples);
}

Low-level / into variants

To reuse an existing allocation or append to an existing buffer:

#![allow(unused)]
fn main() {
use svb::{encode_exzd_into, decode_exzd_into};

let samples: Vec<i16> = vec![100, 101, 103, 102, 98];
let mut buf: Vec<u8> = Vec::new();
encode_exzd_into(&samples, &mut buf);

let mut out: Vec<i16> = Vec::new();
decode_exzd_into(&buf, &mut out).unwrap();
assert_eq!(out, samples);
}

Composable primitives

The qts and patched/exception stages are exposed independently, matching delta and zigzag:

svb::quantize: find_qts, apply_shift, unshift_inplace, fixed to i16.
svb::patched: encode_into/decode_into over &[u16], generically useful for any patched/exception encoding scenario beyond ex-zd.

Wire format

u8   version        (0)
u64  nin             (sample count, little-endian)
u8   q               (qts shift, 0..=5)
u16  zd[0]           (first zigzag-delta value, stored raw — no predecessor to patch against)
u32  nex             (exception count over the remaining nin-1 values)
  if nex > 1:
    u32  nex_pos_press_bytes
    ..   nex_pos_press_bytes    (U32Classic-encoded, off-by-one delta-encoded exception positions)
    u32  nex_press_bytes
    ..   nex_press_bytes        (U32Classic-encoded exception residuals, value - 256)
  elif nex == 1:
    u32  position
    u32  residual               (value - 256)
  ..   (nin - 1 - nex) literal bytes, one per non-exception value, in stream order

The off-by-one position delta trick (pos[0] raw, pos[i] - pos[i-1] - 1 thereafter) is specific to this exception-position encoding and is not part of the general-purpose svb::delta module — it relies on positions being strictly increasing, which only holds for this use case.
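
The position transform itself is small enough to sketch directly. The helpers below are hypothetical illustrations of the pos[0] raw, pos[i] - pos[i-1] - 1 rule; they assume strictly increasing positions and do not reflect how svb organises this code internally.

```rust
/// Off-by-one delta over strictly increasing positions (hypothetical helper).
fn encode_positions(pos: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(pos.len());
    for (i, &p) in pos.iter().enumerate() {
        if i == 0 {
            out.push(p); // first position is stored raw
        } else {
            // Strictly increasing input guarantees p > pos[i - 1], so no underflow.
            out.push(p - pos[i - 1] - 1);
        }
    }
    out
}

/// Inverse of `encode_positions`.
fn decode_positions(enc: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(enc.len());
    for (i, &d) in enc.iter().enumerate() {
        if i == 0 {
            out.push(d);
        } else {
            let prev = out[i - 1];
            out.push(prev + d + 1);
        }
    }
    out
}
```

Adjacent exceptions (positions that differ by exactly 1) encode as 0, the cheapest possible value for the downstream codec.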

SIMD Backends

svb provides SIMD-accelerated encode and decode for all codec variants. The scalar path is always compiled and serves as the correctness reference.

Available backends

Backend	Feature flag	Architecture	ISA
SSSE3	`simd-ssse3`	x86-64	SSE2 + SSSE3
AVX2	`simd-avx2`	x86-64	AVX2
NEON	`simd-neon`	AArch64	NEON
Auto	`simd-auto`	both	runtime detection

simd-auto

simd-auto detects the best available path at runtime using is_x86_feature_detected! on x86-64 and unconditional NEON on AArch64. This is the recommended flag for most users.

On x86-64, simd-auto selects AVX2 if available, then SSSE3, then scalar. On AArch64, NEON is always selected (NEON is mandatory on AArch64).
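
The selection order can be sketched with the standard library's detection macro. This is an illustration of the dispatch pattern only, not svb's internal code; the function name pick_backend and the returned labels are hypothetical.

```rust
/// Name of the path that would be chosen, following the order described above.
/// Hypothetical helper for illustration; not an svb API.
fn pick_backend() -> &'static str {
    // On x86-64, probe the CPU at runtime, widest instruction set first.
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if std::arch::is_x86_feature_detected!("ssse3") {
            return "ssse3";
        }
    }
    // On AArch64, NEON is mandatory, so no runtime probe is needed.
    if cfg!(target_arch = "aarch64") {
        return "neon";
    }
    // Anything else falls back to the scalar path.
    "scalar"
}
```

A real dispatcher would typically cache the result or store a function pointer so that detection is not repeated on every call.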

simd-auto requires std for runtime CPU detection. In no_std contexts, use a compile-time flag instead.

Compile-time flags

simd-avx2, simd-ssse3, and simd-neon compile in the SIMD path and assume it is available at runtime. These are appropriate when the target CPU is known:

# Cross-compile to a known AVX2 target
svb = { version = "0.2", features = ["simd-avx2"] }

or with RUSTFLAGS="-C target-cpu=native" where the build host and run host are the same.

Pipeline coverage

SIMD paths are provided for individual codec variants and for both high-level pipelines:

VBZ pipeline (encode_vbz / decode_vbz_fused): fused SVB16 + zigzag + delta in a single SIMD loop on x86-64 (SSSE3/AVX2) and AArch64 (NEON).
SVB-ZD pipeline (encode_svbzd / decode_svbzd_fused): fused U32Classic + unzigzag + undelta. Encode computes zigzag-delta inline via SIMD (eliminates the intermediate Vec<u32> allocation), decode collapses all three stages into one SIMD loop.

Decode throughput

With simd-auto on a modern x86-64 machine, decode throughput for all codec variants is in the range of 1.3–4 GB/s depending on variant and input size. See Performance for detailed numbers.

no_std Support

svb supports no_std environments with a global allocator. This covers microcontrollers and embedded targets, WebAssembly modules, and OS-level code such as bootloaders or kernel modules.

Setup

Disable the default std feature and enable alloc:

svb = { version = "0.2", default-features = false, features = ["alloc"] }

All encode and decode APIs are available. The delta and zigzag transforms are also fully available.

SIMD in no_std

Runtime SIMD detection (simd-auto) requires std for is_x86_feature_detected!. In no_std contexts, use a compile-time SIMD flag instead:

# no_std with compile-time NEON (AArch64 embedded target)
svb = { version = "0.2", default-features = false, features = ["alloc", "simd-neon"] }

# no_std with compile-time AVX2
svb = { version = "0.2", default-features = false, features = ["alloc", "simd-avx2"] }

Wire Compatibility

svb is wire-compatible with the reference C implementations and the streamvbyte64 Rust crate. This means a buffer encoded by svb can be decoded by the C library and vice versa.

Compatibility table

svb variant	Compatible with
`U32Classic`	Lemire C `streamvbyte` library, `streamvbyte64::Coder1234`
`U32Variant0124`	Lemire C "0124" variant, `streamvbyte64::Coder0124`
`U64Coder1234`	`streamvbyte64::Coder1234` (u32 values only)
`U64Coder1248`	`streamvbyte64::Coder1248`
`Svb16`	ONT `vbz_hdf_plugin` SVB16 layer
SVB-ZD pipeline (`encode_svbzd` / `decode_svbzd_fused`)	hasindu2008/slow5lib `SLOW5_COMPRESS_SVB_ZD` (BLOW5 files)

Buffer layout difference

streamvbyte64 keeps tags and data in separate buffers. svb concatenates them (tags first). When exchanging data with streamvbyte64, split or join buffers at the control stream boundary:

// svb flat → streamvbyte64 separate buffers.
// `n` is the number of encoded integers. With 2-bit tags, four values share
// one control byte, so the control stream is the first ceil(n / 4) bytes.
fn split_flat(encoded: &[u8], n: usize) -> (&[u8], &[u8]) {
    let ctrl_len = n.div_ceil(4);
    (&encoded[..ctrl_len], &encoded[ctrl_len..])
}

// streamvbyte64 separate buffers → svb flat (tags first, then data).
fn join_flat(tags: &[u8], data: &[u8]) -> Vec<u8> {
    let mut flat = tags.to_vec();
    flat.extend_from_slice(data);
    flat
}

Verification

Wire compatibility is verified in tests/compat.rs by round-tripping data in both directions: svb encodes and streamvbyte64 decodes, then streamvbyte64 encodes and svb decodes. These tests run in CI for all four compatible codec pairs.

Performance

Benchmarks were measured on GitHub Actions ubuntu-latest (Azure x86-64, AVX2) and ubuntu-24.04-arm (AArch64, NEON) using cargo bench --bench decode with --sample-size 20. All throughput numbers are in GB/s of input integers (Melem/s × bytes-per-element ÷ 1000).
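The unit conversion used throughout this section can be written down directly. The helper below is a small sketch with a hypothetical name (`melem_to_gb_per_s`); it only restates the formula above.

```rust
/// Convert an element rate in Melem/s into GB/s of integer data.
/// `bytes_per_elem` is 2 for i16/u16, 4 for u32, 8 for u64.
fn melem_to_gb_per_s(melem_per_s: f64, bytes_per_elem: u32) -> f64 {
    melem_per_s * f64::from(bytes_per_elem) / 1000.0
}
```

For example, 1,840 Melem/s of i16 samples corresponds to 3.68 GB/s.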

VBZ pipeline breakdown

At 8192 i16 elements with simd-avx2, each stage measured in isolation:

Stage	encode	decode
delta	29.2 GB/s	3.70 GB/s
zigzag	34.2 GB/s	28.0 GB/s
SVB16 (mixed)	9.24 GB/s	9.42 GB/s
VBZ (combined, 3-pass)	5.70 GB/s	2.42 GB/s
VBZ fused decode	N/A	3.68 GB/s
VBZ2 fused 2-chain decode	N/A	5.62 GB/s

Zigzag is essentially free (pure bitwise ops, LLVM auto-vectorizes). Delta encode expresses adjacent differences as two overlapping slice views, which LLVM auto-vectorizes to around 29 GB/s with no unsafe code. Delta decode uses an explicit SIMD prefix-sum (SSE2/NEON); the serial carry chain between 8-element blocks limits single-stream throughput to around 3.70 GB/s, essentially the theoretical ceiling for this algorithm.
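The stages discussed above can be sketched in scalar form. This is a minimal illustration, assuming wrapping i16 arithmetic and an initial carry of 0; the function names are hypothetical and are not svb's API. The encode side uses the two overlapping slice views mentioned above, and the decode side shows the serial dependency that limits the prefix sum.

```rust
/// Delta-encode: the first value is kept (carry starts at 0), the rest are
/// adjacent differences computed from two overlapping views of the input.
fn delta_encode(samples: &[i16]) -> Vec<i16> {
    let mut out = Vec::with_capacity(samples.len());
    if let Some(&first) = samples.first() {
        out.push(first);
        out.extend(
            samples[1..]
                .iter()
                .zip(&samples[..samples.len() - 1])
                .map(|(cur, prev)| cur.wrapping_sub(*prev)),
        );
    }
    out
}

/// Delta-decode: a running (prefix) sum. Each output depends on the previous
/// one, which is the serial carry chain described in the text.
fn delta_decode(deltas: &[i16]) -> Vec<i16> {
    let mut acc = 0i16;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}

/// Zigzag maps small signed values to small unsigned ones: 0, -1, 1, -2 -> 0, 1, 2, 3.
fn zigzag(v: i16) -> u16 {
    ((v << 1) ^ (v >> 15)) as u16
}

/// Inverse of `zigzag`.
fn unzigzag(z: u16) -> i16 {
    ((z >> 1) as i16) ^ -((z & 1) as i16)
}
```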

Fused VBZ decode

decode_vbz_fused collapses all three decode stages into a single SIMD loop. The SVB16 shuffle and zigzag bitwise ops (~5–6 cycles per 8-element block) execute during the delta carry-chain stall (~8 cycles), hiding nearly all of their cost.

	decode throughput
`decode_vbz` (3 separate passes)	2.42 GB/s
`decode_vbz_fused` (single SIMD pass)	3.68 GB/s

1.52× faster than the 3-pass pipeline. The fused path reaches 99% of the delta-alone ceiling (3.70 GB/s): SVB16 and zigzag are effectively free, and the delta carry chain is the only remaining bottleneck.

VBZ2: format-extension 2-chain decode

encode_vbz2 / decode_vbz2 extend the VBZ format with a 6-byte header that enables a two-chain fused decode with no pre-scan required:

[mid_carry: i16 LE][mid_data_offset: u32 LE][standard VBZ payload]

mid_carry is samples[n_half - 1], the decoded sample at the chunk midpoint, i.e., the prefix sum of all deltas before the midpoint. mid_data_offset is the count of data bytes consumed by the first n_half elements (sum of 8 + popcnt(ctrl_byte) over the first half of control bytes). Both are computed in O(n) during encode with negligible cost.
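As a sketch of the header layout and the offset computation described above: the struct and function names below are hypothetical, and the code assumes n_half is a multiple of 8 so that the first half is covered by whole SVB16 control bytes.

```rust
/// Hypothetical view of the 6-byte VBZ2 header: [mid_carry: i16 LE][mid_data_offset: u32 LE].
struct Vbz2Header {
    mid_carry: i16,
    mid_data_offset: u32,
}

/// Split a buffer into the parsed header and the remaining VBZ payload.
fn parse_vbz2_header(buf: &[u8]) -> Option<(Vbz2Header, &[u8])> {
    if buf.len() < 6 {
        return None;
    }
    let mid_carry = i16::from_le_bytes([buf[0], buf[1]]);
    let mid_data_offset = u32::from_le_bytes([buf[2], buf[3], buf[4], buf[5]]);
    Some((Vbz2Header { mid_carry, mid_data_offset }, &buf[6..]))
}

/// Data bytes consumed by the values that `ctrl` describes. Each SVB16 control
/// byte covers 8 values: one byte each, plus one extra byte per set tag bit.
fn data_len_for_ctrl(ctrl: &[u8]) -> u32 {
    ctrl.iter().map(|c| 8 + c.count_ones()).sum()
}
```

An encoder would call something like `data_len_for_ctrl` on the first n_half / 8 control bytes to obtain mid_data_offset.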

At decode time the payload is split into two independent half-streams. Two carry chains run interleaved in one SIMD loop: the CPU's out-of-order engine overlaps chain A's carry-extract latency with chain B's prefix-sum arithmetic. Port-5 usage is unchanged from single-chain (10 ops per 16 elements), so there is no throughput regression at any size; the gain accumulates only where the carry latency was the limiting factor.

	decode throughput
`decode_vbz` (3 separate passes)	2.42 GB/s
`decode_vbz_fused` (single SIMD pass)	3.68 GB/s
`decode_vbz2` (format-extension 2-chain)	5.62 GB/s

1.53× over single-chain fused at 8192 elements, with most of the serial dependency cost hidden by interleaving the two carry chains as described above.

The real payoff is multi-threaded decoding: with mid_data_offset known up-front, both half-streams are independent and can run on separate cores. The format overhead is 6 bytes per chunk regardless of chunk size, negligible for any practical payload.

Caller-side parallel decode

decode_vbz_fused_from_into(data, n, initial_carry, out) exposes the single-chain fused decoder with a caller-supplied initial carry, making it possible to decode any half-stream independently. A caller that manages its own thread pool simply splits the VBZ2 payload and dispatches both halves concurrently:

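// Assumed to be in scope: `stream_a` / `stream_b` are the two half-streams
// split out of the VBZ2 payload, `n` is the chunk's element count, `n_half`
// is the first half's element count, and `mid_carry` comes from the header.
// `decode_vbz_fused_from` is taken here to be the allocating counterpart of
// `decode_vbz_fused_from_into`.
let (out_a, out_b) = std::thread::scope(|s| {
    // Chain A starts from a zero carry; chain B starts from the stored midpoint carry.
    let ha = s.spawn(|| decode_vbz_fused_from(&stream_a, n_half, 0));
    let hb = s.spawn(|| decode_vbz_fused_from(&stream_b, n - n_half, mid_carry));
    (ha.join().unwrap(), hb.join().unwrap())
});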

Decoding 64 × 8192-element chunks in parallel (64 half-A streams on thread 1, 64 half-B streams on thread 2); run locally for hardware-specific numbers:

	decode throughput
`decode_vbz_fused` (single chain, 1 thread)	1.84 Gelem/s
`decode_vbz2` (2-chain interleaved, 1 thread)	2.81 Gelem/s
`decode_vbz_fused_from_into` (2 threads, batch of 64)	hardware-dependent

Multi-threaded throughput is highly sensitive to CPU core count, L2/L3 topology, and scheduler behaviour. The single-thread numbers above are from GitHub Actions CI (Azure x86-64). For two-thread measurements run the vbz2_parallel criterion benchmark locally: cargo bench --features simd-avx2 --bench decode -- vbz2_parallel. With distinct chunks from independent nanopore reads (the realistic production case) the two streams share no cache lines and the speedup approaches 2×.

VBZ-K: generalised K-stream parallel decode

encode_vbzk(samples, k) / decode_vbzk_parallel_into(data, n, out) generalise VBZ2 to K independent sub-streams. The header stores K−1 split points:

[k: u8][(carry_i: i16 LE, data_offset_i: u32 LE) for i in 1..k][VBZ payload]

Header overhead: 1 + (K−1) × 6 bytes. Each sub-chunk has n_sub = (n/K) & !7 elements; the last sub-chunk takes the remainder. Split-point carries and data offsets are computed in O(n) at encode time with negligible overhead.
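The header-size and sub-chunk formulas above can be checked with a short sketch. The function names are hypothetical; only the arithmetic comes from the description.

```rust
/// Header bytes for K sub-streams: 1 byte for k, then 6 bytes per split point.
fn vbzk_header_len(k: usize) -> usize {
    assert!(k >= 1);
    1 + (k - 1) * 6
}

/// Element count of each sub-chunk: the first k-1 get (n / k) rounded down to
/// a multiple of 8, and the last sub-chunk takes the remainder.
fn sub_chunk_lens(n: usize, k: usize) -> Vec<usize> {
    assert!(k >= 1);
    let n_sub = (n / k) & !7;
    let mut lens = vec![n_sub; k - 1];
    lens.push(n - n_sub * (k - 1));
    lens
}
```

With n = 8190 and k = 4, the first three sub-chunks hold 2040 elements each and the last holds 2070.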

Benchmarked at N=8192 with a batch of 64 chunks per thread (amortising thread scope overhead); multi-threaded results are hardware-dependent — run locally for specific numbers:

	throughput	vs single-chain
single-chain fused (k=1)	1.84 Gelem/s	1.00×
VBZ-K k=2 (2 threads)	hardware-dependent	—
VBZ-K k=4 (4 threads)	hardware-dependent	—
VBZ-K k=8 (8 threads)	hardware-dependent	—

Multi-threaded throughput is not reliably measurable in shared CI environments. Run cargo bench --features simd-avx2 --bench decode -- vbzk_parallel locally for hardware-specific numbers.

At this chunk size, k=4 tends to match k=2, and k=8 regresses because 8 threads decoding 1024-element sub-streams run into thread-scope overhead and scheduler jitter. With distinct real-world POD5 chunks (6,000–12,000 samples each), the larger sub-streams should push k=8 above k=4.

The full POD5 pipeline bottleneck

A POD5 reader decodes: disk → zstd decompress → VBZ decode → i16 samples. On a typical NVMe system (~6.5 GB/s sequential read):

Disk: 6.5 GB/s × ~3× zstd ratio = ~19.5 GB/s of decoded signal capacity
VBZ single-chain (AVX2): 1.84 Gelem/s × 2 bytes = 3.68 GB/s of decoded signal
VBZ-K k=4: scales roughly linearly with cores up to the zstd bottleneck
zstd single-core: ~1.5–2 GB/s compressed ≈ 2–3 Gelem/s, the real bottleneck for a single-threaded reader

The disk is rarely the bottleneck. A single-threaded reader is zstd-limited. Parallelising VBZ decode with VBZ-K removes the VBZ ceiling and shifts the bottleneck back to zstd. To saturate NVMe bandwidth you need multi-threaded zstd AND VBZ-K simultaneously.

Delta decode: the 2-chain approach

Delta decode is a serial prefix sum: each output element depends on all previous elements. On x86_64 the SSE2 path processes 8 elements per iteration with a carry chain of ~8 cycles (extract + broadcast + add). We are already at the theoretical single-stream ceiling.

delta::decode_2chain breaks this by decoding two independent sub-streams simultaneously. The CPU's out-of-order engine hides one chain's carry latency behind the other's prefix-sum arithmetic, delivering 1.78× throughput:

	decode throughput
`delta::decode_into` (single stream)	3.70 GB/s
`delta::decode_2chain` (two streams)	6.58 GB/s

1.78× throughput with two interleaved chains. This requires one extra i16 stored per chunk: the running delta sum at the midpoint (computed by delta::mid_carry during encode, 2 bytes overhead). Each additional sub-chunk adds another 2-byte carry value and enables one more independent decode stream.
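The idea can be illustrated in scalar form. The functions below are local stand-ins with hypothetical names and signatures, not the crate's delta::decode_2chain or delta::mid_carry. They show only the data dependency structure: two running sums that never read each other's state, so a CPU can overlap them.

```rust
/// Running sum of all deltas before the midpoint (the extra i16 stored per chunk).
fn mid_carry_scalar(deltas: &[i16], n_half: usize) -> i16 {
    deltas[..n_half].iter().fold(0i16, |acc, &d| acc.wrapping_add(d))
}

/// Decode both halves in one loop with two independent carries.
fn decode_2chain_scalar(deltas: &[i16], n_half: usize, mid: i16) -> Vec<i16> {
    let mut out = vec![0i16; deltas.len()];
    let (da, db) = deltas.split_at(n_half);
    let (oa, ob) = out.split_at_mut(n_half);
    // Chain A starts at 0; chain B starts at the stored midpoint carry.
    let (mut ca, mut cb) = (0i16, mid);
    let common = da.len().min(db.len());
    for i in 0..common {
        ca = ca.wrapping_add(da[i]);
        oa[i] = ca;
        cb = cb.wrapping_add(db[i]);
        ob[i] = cb;
    }
    // Finish whichever half is longer.
    for i in common..da.len() {
        ca = ca.wrapping_add(da[i]);
        oa[i] = ca;
    }
    for i in common..db.len() {
        cb = cb.wrapping_add(db[i]);
        ob[i] = cb;
    }
    out
}
```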

Path to a parallel-decode VBZ format

With K sub-chunks, all stages of the VBZ pipeline (delta, zigzag, SVB16) can be decoded independently on K cores:

Sub-chunks	decode throughput	vs. current
1 (current VBZ)	2.42 GB/s	N/A
2 (single-threaded 2-chain)	5.62 GB/s	2.3×
2 cores	~11 GB/s	~4.5×
4 cores	~22 GB/s	~9×
8 cores	~44 GB/s	~18×

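The format change is: store K−1 carry values ((K−1) × 2 bytes) in the chunk header and split the encoded payload into K sub-streams. The VBZ-K header described above additionally stores a 4-byte data offset per split point so that each sub-stream can be located without a pre-scan. Apart from this small header, compression ratio is unchanged. The svb crate provides decode_2chain and mid_carry as the building blocks.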

SVB-ZD pipeline

At 8192 i16 elements, GitHub Actions CI (Azure x86-64 and AArch64):

x86-64

Path	Scalar	SSSE3	SSSE3×	AVX2	AVX2×
`encode_svbzd`	158 Melem/s	1,140 Melem/s	7.2×	1,100 Melem/s	6.9×
`decode_svbzd` (3-pass)	105 Melem/s	696 Melem/s	6.6×	722 Melem/s	6.9×
`decode_svbzd_fused`	466 Melem/s	1,510 Melem/s	3.2×	1,510 Melem/s	3.2×

AArch64

Path	Scalar	NEON	NEON×
`encode_svbzd`	195 Melem/s	551 Melem/s	2.8×
`decode_svbzd` (3-pass)	210 Melem/s	834 Melem/s	4.0×
`decode_svbzd_fused`	564 Melem/s	1,850 Melem/s	3.3×

The SIMD encode path computes zigzag-delta inline without an intermediate Vec<u32> allocation. On AVX2 it processes 8 i16 values per iteration using _mm256_cvtepi16_epi32 + _mm_alignr_epi8; on NEON it uses vmovl_s16 + vextq_s32.

The fused decode collapses U32Classic decode, unzigzag, and undelta into one SIMD loop. The 2-ctrl-byte inner loop processes 8 values per iteration. Note that SSSE3 ≈ AVX2 for the fused path: the bottleneck is the serial delta carry chain, not SIMD width — wider registers do not help once the carry chain is saturated.

SVB-ZD vs VBZ

Both pipelines operate on i16 signal data; the choice depends on the file format (BLOW5 vs POD5):

Metric	VBZ	SVB-ZD
Codec	SVB16 (1-bit tags)	U32Classic (2-bit tags)
Encode (AVX2, 8192 elem)	2,850 Melem/s	1,100 Melem/s
Fused decode (AVX2, 8192 elem)	1,840 Melem/s	1,510 Melem/s
Fused decode (NEON, 8192 elem)	2,280 Melem/s	1,850 Melem/s
Wire format	ONT POD5 / VBZ	hasindu2008/slow5lib BLOW5

VBZ is faster because SVB16's 1-bit tags pack more tightly than U32Classic's 2-bit tags. SVB-ZD handles values that overflow i16 after delta without truncation.
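To illustrate the overflow point: the difference between two i16 samples can be as large as ±65535, which does not fit in i16 but fits easily once widened. The sketch below is a scalar illustration with hypothetical function names, assuming an initial previous value of 0; it is not the crate's SIMD path.

```rust
/// Widen i16 samples to i32, take adjacent differences, and zigzag to u32.
fn zigzag_delta_u32(samples: &[i16]) -> Vec<u32> {
    let mut prev = 0i32;
    samples
        .iter()
        .map(|&s| {
            let cur = i32::from(s);
            let d = cur - prev; // always within [-65535, 65535]
            prev = cur;
            ((d << 1) ^ (d >> 31)) as u32
        })
        .collect()
}

/// Inverse: unzigzag, then prefix-sum back to i16 samples.
fn undo_zigzag_delta_u32(zd: &[u32]) -> Vec<i16> {
    let mut acc = 0i32;
    zd.iter()
        .map(|&z| {
            let d = ((z >> 1) as i32) ^ -((z & 1) as i32);
            acc += d;
            acc as i16
        })
        .collect()
}
```

The jump from i16::MIN to i16::MAX is a delta of 65535, which zigzags to 131070 and needs three bytes in U32Classic.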

ex-zd pipeline

ex-zd (encode_exzd/decode_exzd*) is wire-compatible with slow5lib's SLOW5_COMPRESS_EX_ZD: qts (quantize-trailing-shift) → zigzag-delta (u16 domain) → PFOR-style patched/exception encoding (patched.rs), where values > u8::MAX are pulled out as (position, residual) exceptions and everything else is stored as a literal byte. Unlike SVB-ZD's u32 widening, ex-zd keeps the zigzag-delta stream in u16, so patched/exception handling exists specifically to cover the rare values that overflow a byte after delta.

Numbers below are measured locally (AVX2, not GitHub Actions CI like the sections above), on two synthetic profiles: "ramp" (slow drift + small noise, mostly literal bytes — the common case for well-compressing signal) and "spiky" (~20% exception rate, the pathological case).

Decode strategies

n	`decode_exzd` (3-pass)	`decode_exzd_fused`
128	1,483 Melem/s	2,118 Melem/s
1024	1,587 Melem/s	2,319 Melem/s
8192	2,339 Melem/s	3,457 Melem/s

`ExzdDecoder::decode_into` is the fastest option for repeated small-frame decode; see the reusable-context section below.

decode_exzd_fused collapses inverse-zigzag, the delta prefix sum, and the qts left-shift into one SIMD pass over the reconstructed zigzag-delta array (exzd_fused.rs) — the same fusion technique as decode_vbz_fused/decode_svbzd_fused. It's the preferred decode path for one-shot calls.
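A scalar sketch of the fused transform, assuming the zigzag-delta stream is u16, the carry starts at 0, and q is the qts shift count. The function name is hypothetical; the crate's version is a SIMD loop in exzd_fused.rs and may differ in detail.

```rust
/// One pass over the zigzag-delta array: inverse zigzag, running sum, left shift.
fn decode_fused_scalar(zd: &[u16], q: u32) -> Vec<i16> {
    let mut acc = 0i16;
    zd.iter()
        .map(|&z| {
            // Inverse zigzag recovers the signed delta.
            let delta = ((z >> 1) as i16) ^ -((z & 1) as i16);
            // Prefix sum recovers the quantized sample.
            acc = acc.wrapping_add(delta);
            // Undo the qts right-shift applied at encode time.
            acc.wrapping_shl(q)
        })
        .collect()
}
```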

Reusable decode context (`ExzdDecoder`)

BLOW5 files hold many thousands of individually-encoded reads, each decoded via its own call — a per-call-allocating API pays a fresh Vec allocation for zd/literal/exception scratch buffers on every read. ExzdDecoder reuses those buffers across calls instead:

read length	per-call alloc	`ExzdDecoder` (reused buffers)	speedup
64	baseline	reused	1.24-1.79× (varies with exception rate)
512	baseline	reused	1.24-1.79× (varies with exception rate)

Run cargo bench --features simd-avx2 --bench decode -- exzd_many_small_reads locally for exact numbers on your hardware.
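The buffer-reuse pattern itself is simple to sketch. The type below is a hypothetical stand-in, not ExzdDecoder: its "decode" is a placeholder (widen bytes, then prefix-sum) chosen only to show that clearing a Vec keeps its capacity, so repeated calls stop allocating once the scratch buffer has grown to the largest read seen.

```rust
/// Hypothetical reusable context; names and the decode body are illustrative only.
struct ScratchDecoder {
    scratch: Vec<u16>,
}

impl ScratchDecoder {
    fn new() -> Self {
        Self { scratch: Vec::new() }
    }

    /// Placeholder decode: widen bytes into the scratch buffer, then write a
    /// running sum into `out`. Both buffers are cleared, not reallocated.
    fn decode_into(&mut self, encoded: &[u8], out: &mut Vec<u16>) {
        self.scratch.clear(); // length goes to 0, capacity is retained
        self.scratch.extend(encoded.iter().map(|&b| u16::from(b)));
        out.clear();
        let mut acc = 0u16;
        out.extend(self.scratch.iter().map(|&v| {
            acc = acc.wrapping_add(v);
            acc
        }));
    }

    fn scratch_capacity(&self) -> usize {
        self.scratch.capacity()
    }
}
```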

Adaptive exception-merge strategy

Reconstructing the full sample stream means merging the literal-byte run and the exception list back into one array. Two strategies exist, picked by exception density:

merge_runs: copies each stretch of non-exception values in one extend_from_slice (a vectorized memcpy), interrupted only by a single push per exception. Wins when exceptions are rare — long runs, few interruptions.
merge_walk: writes every output element individually through a raw pointer, no bounds/capacity checks per element — the same mechanism slow5lib's C reference (ex_depress) uses. Loses to merge_runs on long runs (a scalar per-element loop can't beat a vectorized memcpy) but wins once runs get short enough that merge_runs's per-run call overhead dominates.
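Both strategies can be sketched in safe Rust under an assumed data model: `literals` holds the non-exception values in output order (already widened to u16 here so that extend_from_slice applies), `ex_pos` holds sorted output positions, and `ex_val` the corresponding full values. These signatures are illustrative and may not match patched.rs; in particular, the crate's merge_walk is described above as using raw pointers, whereas this version stays bounds-checked.

```rust
/// Copy each stretch of literals in one call, with one push per exception.
fn merge_runs(literals: &[u16], ex_pos: &[usize], ex_val: &[u16]) -> Vec<u16> {
    assert_eq!(ex_pos.len(), ex_val.len());
    let mut out = Vec::with_capacity(literals.len() + ex_pos.len());
    let mut lit = 0; // index of the next unread literal
    for (&pos, &val) in ex_pos.iter().zip(ex_val) {
        let run = pos - out.len(); // literals that precede this exception
        out.extend_from_slice(&literals[lit..lit + run]);
        lit += run;
        out.push(val);
    }
    out.extend_from_slice(&literals[lit..]); // trailing run
    out
}

/// Visit every output position once, choosing exception or literal per element.
fn merge_walk(literals: &[u16], ex_pos: &[usize], ex_val: &[u16]) -> Vec<u16> {
    assert_eq!(ex_pos.len(), ex_val.len());
    let mut out = vec![0u16; literals.len() + ex_pos.len()];
    let (mut lit, mut j) = (0, 0);
    for (i, slot) in out.iter_mut().enumerate() {
        if j < ex_pos.len() && ex_pos[j] == i {
            *slot = ex_val[j];
            j += 1;
        } else {
            *slot = literals[lit];
            lit += 1;
        }
    }
    out
}
```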

The crossover was found empirically (patched::merge_density_sweep, an #[ignore]d test — run with cargo test --release --features simd-avx2 -- --ignored --nocapture merge_density_sweep), timing both strategies directly against each other across a density sweep. It lands consistently around 14-15% exception density, reproducible across n=128/1024/8192 — the format's own encoder-side warning threshold (~20%, "compression may not be ideal") turned out to be a reasonable starting guess but more conservative than the actual crossover. Decode dispatches on nex * 7 >= n (~14.3%) accordingly.
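The dispatch rule is an integer comparison that avoids a division. A sketch with a hypothetical function name:

```rust
/// True when exception density nex / n is at or above 1/7 (about 14.3%),
/// the point at which the per-element walk is preferred over run copies.
fn use_merge_walk(nex: usize, n: usize) -> bool {
    nex * 7 >= n
}
```

For n = 8192 the switch happens at 1,171 exceptions; the real reads discussed below sit far under it.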

vs. slow5lib (C reference)

Measured locally against slow5lib's compiled ptr_compress_ex_zd/ex_zd_depress on the same data profiles and machine (a small standalone C harness, not part of this repo's build — the C reference is MIT-licensed and the wire format was independently verified byte-exact via real C-encoded fixtures in tests/parity.rs first).

Ramp data:

n	svb encode	C encode	ratio	svb fused decode	C decode	ratio
128	338 Melem/s	21.7 Melem/s	15.6×	2,118 Melem/s	787 Melem/s	2.7×
1024	470 Melem/s	141 Melem/s	3.3×	2,319 Melem/s	713 Melem/s	3.3×
8192	616 Melem/s	440 Melem/s	1.4×	3,457 Melem/s	714 Melem/s	4.8×

Spiky data (~20% exceptions):

n	svb fused decode	C decode	ratio
128	362 Melem/s	321 Melem/s	1.1×
1024	445 Melem/s	501 Melem/s	0.9×
8192	526 Melem/s	575 Melem/s	0.9×

svb wins comfortably on the common (mostly-literal) case at every size, and on small pathological inputs. At high exception density and larger n, C's ex_depress remains modestly faster (~9-11%).

Encode-side comparisons for spiky data aren't included above — bench_exzd_encode only exercises ramp data; C's spiky encode numbers exist (from the standalone harness) but without a corresponding svb measurement they'd be an apples-to-oranges comparison.

Sanity check against real ONT signal

All of the above (and the merge-strategy threshold tuning) used synthetic "ramp" and "spiky" generators. Since a synthetic profile can diverge from real signal in ways that make a micro-benchmark win irrelevant in practice, the same comparison was repeated against real nanopore reads — the POD5-sourced i16 arrays already used for VBZ parity/benchmark testing (tests/vectors/parity_*.i16, bench_00_101988.i16; bench_exzd_real_reads in benches/decode.rs).

Measured exception density on real reads: 0.9-2.3% — well below both the ~14.3% adaptive-merge threshold and the 20% "spiky" stress profile. Real nanopore signal looks far more like vbz_i16_samples (mostly literal bytes) than the pathological case; merge_walk essentially never activates on data like this. q (the qts shift) was 0 on every real read tested — real ADC noise fills the low bits, leaving no quantization headroom, unlike the synthetic generators.

Read	n	exceptions	density	svb fused decode	svb `ExzdDecoder`	C decode	best ratio
parity_00	2,885	44	1.53%	2,378 Melem/s	2,591 Melem/s	824 Melem/s	3.1×
parity_01	2,915	68	2.33%	2,010 Melem/s	2,450 Melem/s	783 Melem/s	3.1×
parity_02	2,949	40	1.36%	2,426 Melem/s	2,806 Melem/s	782 Melem/s	3.6×
bench_00 (largest available)	101,988	944	0.93%	2,807 Melem/s	2,768 Melem/s	552 Melem/s	5.1×

Encoded byte length matched the C reference exactly on every read (e.g. 2,981 / 3,053 / 3,040 / 103,643 bytes) — the wire-compat fixtures used for this check are now permanent tests/parity.rs tests (exzd_c_reference_pod5_*), not just a one-off local measurement.

Conclusion: real data lands solidly in svb's strongest regime, not its weakest. The 2.6-5.1× win here is consistent with (and at the largest read, better than) the synthetic ramp-profile numbers above, confirming ramp was a reasonable stand-in for real signal — if anything it understates how favorable real conditions are, since real density is even lower than what ramp implicitly produces. The ~9-11% gap chased in the section below is a genuine finding about a real (if rare) data shape, but it is not representative of what a BLOW5 reader will see on actual nanopore output.

Why the remaining ~9-11% gap (on pathological, not representative, data), and why it resisted closing

Two structural fixes were tried and directly disassembled to confirm they changed what they intended to change, then benchmarked:

Eliminating a leftover bounds check. merge_walk's exception branch read ex_pos[j] (guarded by j < ex_pos.len(), letting LLVM elide its check) and ex_val[j] (not guarded by anything LLVM could see, since it's a different slice with no visible length relationship) — the disassembly showed a real bounds check + conditional panic call on the ex_val[j] read that provably could never trigger. Switching both to get_unchecked (safe here: ex_pos.len() == ex_val.len() always, since both are decoded from nex items) removed the check entirely, confirmed by re-disassembling. Measured impact: ~0%. A branch that's essentially always-not-taken is nearly free once the branch predictor learns it — the instructions were real, but they weren't costing anything at runtime.
Two-phase scatter-then-walk, matching C's ex_depress structure exactly at the source level: scatter exceptions into their final output positions in a dedicated loop first (no branching), then a separate walk over all n positions that only ever writes literals, skipping already-filled exception slots. Measured impact: ~0%, identical to the single combined loop.

Both were real, verifiable code shapes — matching C's structure line-for-line didn't change wall-clock time at all. That pointed at the comparison itself: recompiling the same C source with clang instead of gcc (isolating language from compiler backend, since rustc and clang share LLVM) made the C reference itself 15-18% faster. So "C beats Rust here" was the wrong framing — LLVM can extract more from this exact algorithm than gcc did, which means svb's own LLVM-compiled code should be able to reach it too, in principle.

Disassembling LLVM's output for the (already C-structurally-identical) two-phase merge_walk explained why it doesn't: LLVM 4×-unrolled the scatter loop and split the walk loop into multiple specialized variants for the Rust version, versus the single compact loop it emits for the same-shaped C. The likely driver is that Rust's &mut references let rustc emit stronger noalias guarantees than C's raw pointers ever give a compiler — which here pushes LLVM's unrolling heuristics toward a larger, more branch-heavy code shape instead of a smaller, tighter one. Chasing this further would mean fighting LLVM's cost model (e.g. via unroll-suppression hints) rather than fixing an identifiable inefficiency in the code, so the two-phase change was reverted back to the simpler single-loop form (same performance, less code) — the get_unchecked fix was kept since it's a correct, justified use of a real invariant regardless of its lack of measured effect.

Approaches tried and rejected

These are documented in code comments (exzd_fused.rs, patched.rs) rather than repeated here in full, so that future work doesn't re-attempt them without the context:

Chunked fusion (folding patched's reconstruction into the SIMD transform loop, processing fixed-size chunks instead of a full merged array) — tried at both 8-wide (SSE2) and 16-wide (AVX2); both measured slower than the simpler two-stage merge-then-transform design, at every density tested.
Naive raw-pointer scatter (replacing the run-based merge outright with unconditional raw-pointer writes) — regressed 2-3× on low-exception data, because a full-array scalar walk throws away the free vectorized memcpy that merge_runs already gets for long literal runs. This is why the final design is adaptive rather than a straight replacement.
Position-walk via Vec::push (translating C's branch structure into safe Rust without raw pointers) — regressed everywhere, including at high exception density where the raw-pointer version wins. push's per-call capacity check, even with reserve called up front, was the dominant cost — the win was never really about the branch structure, it was about removing checks entirely.

Results vs streamvbyte64 v0.2.0

Measured with simd-avx2 on GitHub Actions ubuntu-latest (Azure x86-64). streamvbyte64 uses its own runtime detection; numbers reflect its best available path.

Benchmark	svb	sv64	ratio
U32Classic decode/128	8.68 GB/s	3.71 GB/s	2.34x
U32Classic decode/1024	13.6 GB/s	4.87 GB/s	2.79x
U32Classic decode/8192	14.1 GB/s	4.89 GB/s	2.88x
U32Classic encode/128	6.65 GB/s	2.33 GB/s	2.85x
U32Classic encode/1024	8.26 GB/s	3.08 GB/s	2.68x
U32Classic encode/8192	8.93 GB/s	3.20 GB/s	2.79x
U32Variant0124 decode/128	8.98 GB/s	3.48 GB/s	2.58x
U32Variant0124 decode/1024	13.8 GB/s	4.88 GB/s	2.83x
U32Variant0124 decode/8192	14.2 GB/s	5.00 GB/s	2.84x
U32Variant0124 encode/128	6.74 GB/s	2.37 GB/s	2.84x
U32Variant0124 encode/1024	8.32 GB/s	2.96 GB/s	2.81x
U32Variant0124 encode/8192	8.89 GB/s	3.01 GB/s	2.95x
U64Coder1248 decode/128	12.0 GB/s	5.89 GB/s	2.04x
U64Coder1248 decode/1024	15.0 GB/s	8.68 GB/s	1.73x
U64Coder1248 decode/8192	14.8 GB/s	8.76 GB/s	1.69x
U64Coder1248 encode/128	7.37 GB/s	3.52 GB/s	2.09x
U64Coder1248 encode/1024	8.73 GB/s	4.61 GB/s	1.89x
U64Coder1248 encode/8192	8.85 GB/s	4.80 GB/s	1.84x

svb is consistently about 1.7×–3.0× faster than streamvbyte64 in these measurements. The u32 codecs see the largest gap (approaching 3×); the u64 codecs are closer because 8-byte elements reduce the SIMD parallelism available per control byte.

Running benchmarks

cargo bench --features simd-auto

Benchmarks cover all five codec variants across encode/decode and three slice sizes (128, 1024, 8192 elements). Criterion produces HTML reports in target/criterion/.

To run a single benchmark by name substring:

cargo bench --features simd-auto -- U32Classic/decode