svb

svb is a pure-Rust StreamVByte library covering all major codec variants for u16, u32, and u64 integers. Delta and zigzag encoding are composable layers on top. SIMD back-ends are available for x86-64 (SSSE3, AVX2) and AArch64 (NEON).

What is StreamVByte?

StreamVByte is a family of integer compression schemes that store each value in a variable number of bytes. Rather than interleaving the control information with the data, StreamVByte places all control bytes in a separate stream. This layout makes SIMD-accelerated decode practical: one control byte describes the widths of a whole group of values, so the decoder can use it to select a precomputed byte shuffle and unpack the group without branching.

encoded buffer layout
┌────────────────────┬─────────────────────────────────────┐
│   control stream   │            data stream               │
│  ceil(n/4) bytes   │         variable length              │
└────────────────────┴─────────────────────────────────────┘

Each tag in the control stream describes the byte width of the corresponding value. The u32 and u64 codecs use 2-bit tags, so four values share one control byte; Svb16 uses 1-bit tags, so eight values share one. The byte widths available depend on the codec variant.

Codec variants at a glance

| Variant | Element | Tag width | Byte widths | Best for |
| --- | --- | --- | --- | --- |
| Svb16 | u16 | 1 bit | 1/2 | ONT VBZ signal data |
| U32Classic | u32 | 2 bits | 1/2/3/4 | General u32, C-library compatible |
| U32Variant0124 | u32 | 2 bits | 0/1/2/4 | Sparse u32 (many zeros) |
| U64Coder1234 | u64 | 2 bits | 1/2/3/4 | u64 values that fit in u32 |
| U64Coder1248 | u64 | 2 bits | 1/2/4/8 | Full u64 range |

API docs

Rustdoc API reference is published at docs.rs/svb.

Getting Started

Installation

Add svb to your Cargo.toml. For most users simd-auto is the right choice, since it detects the best available SIMD path at runtime:

[dependencies]
svb = { version = "0.2", features = ["simd-auto"] }

Feature flags

| Flag | Effect |
| --- | --- |
| std (default) | Enables std; implies alloc |
| alloc | Enables all encode/decode APIs with no other dependencies |
| simd-auto | Runtime CPU detection; selects the best available SIMD path |
| simd-avx2 | Compile-time AVX2 (asserts AVX2 is available at runtime) |
| simd-ssse3 | Compile-time SSSE3 |
| simd-neon | Compile-time NEON (AArch64 only; NEON is always available there) |

The compile-time SIMD flags (simd-avx2, simd-ssse3, simd-neon) are intended for environments where the target CPU is known at build time, such as cross-compilation or RUSTFLAGS="-C target-cpu=native". In all other cases, prefer simd-auto.

Basic usage

Every codec is a zero-sized type with encode and decode methods. encode returns a Vec<u8>; decode takes the byte slice and the original element count.

#![allow(unused)]
fn main() {
use svb::u32::U32Classic;

let values: Vec<u32> = vec![1, 500, 70_000, 16_000_000];
let encoded = U32Classic.encode(&values);
let decoded = U32Classic.decode(&encoded, values.len()).unwrap();
assert_eq!(decoded, values);
}

Appending to an existing buffer

Every codec exposes encode_into and decode_into variants that append to a caller-supplied Vec, avoiding extra allocation:

#![allow(unused)]
fn main() {
use svb::u32::U32Classic;

let mut buf = Vec::new();
U32Classic.encode_into(&[1u32, 2, 3], &mut buf);
U32Classic.encode_into(&[4u32, 5, 6], &mut buf);
}

This is useful when building a larger serialised format where multiple compressed sequences are concatenated. The caller is responsible for recording the element counts needed for decode.

Choosing a codec

For sorted or time-series data, compose any codec with delta encoding to compress differences rather than raw values.

Encoding Guide

This page explains how StreamVByte, delta, and zigzag work, when to use each, and how they compose. For the API itself see Delta and Zigzag, Codec Variants, and the API reference.


Fixed-width integers and why they waste space

Every integer type has a fixed storage width. A u32 always occupies 4 bytes, regardless of its value:

        value    memory (big-endian for clarity)   bytes used
─────────────────────────────────────────────────────────────
            1    00 00 00 01                        4 (3 wasted)
          300    00 00 01 2C                        4 (2 wasted)
       75 000    00 01 24 F8                        4 (1 wasted)
   16 000 000    00 F4 24 00                        4 (1 wasted)
4 294 967 295    FF FF FF FF                        4 (none wasted)

The high-order bytes are zero whenever the value is small. Those zeros carry no information, yet they occupy the same storage as any other byte. For a u32 array where most values are below 256, three quarters of the storage is zero-padding.

This matters in practice because integer arrays in real applications are rarely uniformly distributed across the full type range. File offsets, timestamps, sensor readings, and index lists all tend to cluster at small magnitudes relative to the maximum the type can hold. An array of one million u32 values representing document word frequencies, for example, might use only the bottom 12 bits of each element, leaving 20 bits per value (over 2 MB per million elements) as wasted zeros.

Variable-byte encoding solves this by storing only the bytes that carry information and recording how many bytes each value used. StreamVByte is a specific variable-byte scheme designed to make that decoding fast with SIMD.


StreamVByte

Most integers in real data are small. Fixed-width encoding wastes bytes on the high-order zeros; StreamVByte stores only the bytes that carry information.

The key design decision is where to put the width metadata. Classic variable-byte schemes (such as LEB128, used for Protocol Buffers varints, or SQLite's varint) interleave the length information with each value's data bytes, so the decoder must inspect every element before it knows where the next one starts, which makes SIMD decoding difficult. StreamVByte separates the metadata into a control stream and the integer bytes into a data stream:

┌──────────────────────────────────┐
│ control stream                   │
│  tag tag tag tag tag tag tag tag │  ← 2 bits per value (u32/u64), 1 bit (u16)
└──────────────────────────────────┘
┌──────────────────────────────────┐
│ data stream                      │
│  [value 0 bytes][value 1 bytes]… │  ← tightly packed, no separators
└──────────────────────────────────┘

Because all the widths for a block of values live in one or two control bytes, a SIMD decoder can read them all at once, look up a pre-built shuffle table, and unpack 4–8 values in a single pshufb instruction. The data stream is a plain byte array with no branch points.
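
The shuffle-table lookup described above can be emulated in scalar code. The following is a minimal sketch, not svb's implementation: `shuffle_mask` and `shuffle` are hypothetical helper names, widths follow the 1/2/3/4-byte scheme, and the 0xFF convention mirrors how pshufb writes a zero byte when a mask byte has its high bit set.

```rust
/// Build the 16-byte shuffle mask for one control byte (four 2-bit tags,
/// widths 1..=4). 0xFF means "write a zero byte" in this sketch.
fn shuffle_mask(ctrl: u8) -> [u8; 16] {
    let mut mask = [0xFFu8; 16];
    let mut src = 0u8; // index of the next unread data byte
    for i in 0..4usize {
        let width = ((ctrl >> (2 * i)) & 0b11) + 1;
        for j in 0..width as usize {
            // Output lane i occupies bytes 4*i .. 4*i+4 (little-endian u32).
            mask[4 * i + j] = src;
            src += 1;
        }
    }
    mask
}

/// Scalar emulation of a 16-byte shuffle.
fn shuffle(input: [u8; 16], mask: [u8; 16]) -> [u8; 16] {
    mask.map(|m| if (m & 0x80) != 0 { 0 } else { input[(m & 0x0F) as usize] })
}

fn main() {
    // Data bytes for [1, 300, 75000, 5], padded to one 16-byte register.
    let mut input = [0u8; 16];
    input[..7].copy_from_slice(&[0x01, 0x2C, 0x01, 0xF8, 0x24, 0x01, 0x05]);

    // Control byte 0x24 encodes widths 1, 2, 3, 1.
    let out = shuffle(input, shuffle_mask(0x24));
    let values: Vec<u32> = out
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    assert_eq!(values, [1, 300, 75_000, 5]);
}
```

A real SIMD decoder would precompute all 256 masks once and index the table with the control byte, so the inner loops above disappear from the hot path.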

A concrete example

Four u32 values encoded with U32Classic (1/2/3/4-byte widths):

values:   [   1,   300,  75000,   5 ]
widths:   [ 1 B,   2 B,    3 B, 1 B ]  ← determined by value magnitude
tags:     [  00,    01,     10,  00 ]  ← 2-bit tag per value

control byte:  0b_00_10_01_00  (4 tags packed LSB-first)

data bytes:    01 | 2C 01 | F8 24 01 | 05
               └1┘  └─300─┘  └─75000─┘  └5┘

The full encoded output is 8 bytes (1 control + 7 data) for four 32-bit values that would require 16 bytes in fixed-width form.
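
The example can be reproduced with a few lines of scalar code. This is a sketch of the layout only, assuming the LSB-first tag packing and little-endian data bytes shown above; `byte_width` and `encode_classic` are illustrative names, not svb functions.

```rust
/// Number of bytes needed for `v` in the 1/2/3/4-byte scheme.
fn byte_width(v: u32) -> usize {
    match v {
        0..=0xFF => 1,
        0x100..=0xFFFF => 2,
        0x1_0000..=0xFF_FFFF => 3,
        _ => 4,
    }
}

/// Scalar sketch of the flat layout: control bytes first, then data bytes.
fn encode_classic(values: &[u32]) -> Vec<u8> {
    let ctrl_len = values.len().div_ceil(4);
    let mut out = vec![0u8; ctrl_len];
    for (i, &v) in values.iter().enumerate() {
        let width = byte_width(v);
        // Tag = width - 1, stored in bits 2*(i % 4) .. 2*(i % 4) + 1.
        out[i / 4] |= ((width - 1) as u8) << (2 * (i % 4));
        // Little-endian: keep the low-order bytes, drop the high-order zeros.
        out.extend_from_slice(&v.to_le_bytes()[..width]);
    }
    out
}

fn main() {
    let encoded = encode_classic(&[1, 300, 75_000, 5]);
    // One control byte (0x24) followed by the seven data bytes.
    assert_eq!(encoded, [0x24, 0x01, 0x2C, 0x01, 0xF8, 0x24, 0x01, 0x05]);
    println!("{encoded:02X?}");
}
```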

Codec variant selection

The five variants differ in which byte widths they support and what element type they encode:

| Variant | Element | Byte widths | Best for |
| --- | --- | --- | --- |
| Svb16 | u16 | 1 / 2 | 16-bit data; values mostly ≤ 255 |
| U32Classic | u32 | 1 / 2 / 3 / 4 | General-purpose u32; compatible with Lemire C library |
| U32Variant0124 | u32 | 0 / 1 / 2 / 4 | Sparse data with many exact zeros (0 bytes stored) |
| U64Coder1234 | u64 | 1 / 2 / 3 / 4 | u64 values known to fit in u32 |
| U64Coder1248 | u64 | 1 / 2 / 4 / 8 | Full u64 range |

U32Variant0124 skips the 3-byte width and adds a 0-byte width: a zero value stores no data bytes at all, only its tag. This is a significant win for sparse arrays where many values are exactly zero.
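
To make the trade-off concrete, the encoded size under both width tables can be computed directly. This is a rough sketch with hypothetical helpers (`width_classic`, `width_0124`, `encoded_len`); it only counts bytes and does not call svb.

```rust
/// Data bytes per value for the 1/2/3/4 table.
fn width_classic(v: u32) -> usize {
    match v {
        0..=0xFF => 1,
        0x100..=0xFFFF => 2,
        0x1_0000..=0xFF_FFFF => 3,
        _ => 4,
    }
}

/// Data bytes per value for the 0/1/2/4 table.
fn width_0124(v: u32) -> usize {
    match v {
        0 => 0,
        1..=0xFF => 1,
        0x100..=0xFFFF => 2,
        _ => 4,
    }
}

/// Total size: ceil(n / 4) control bytes plus the data bytes.
fn encoded_len(values: &[u32], width: fn(u32) -> usize) -> usize {
    values.len().div_ceil(4) + values.iter().map(|&v| width(v)).sum::<usize>()
}

fn main() {
    // Sparse input: the 0/1/2/4 table stores only the two non-zero values.
    let sparse = [0, 0, 42, 0, 0, 255, 0, 0];
    assert_eq!(encoded_len(&sparse, width_classic), 10);
    assert_eq!(encoded_len(&sparse, width_0124), 4);

    // Mid-range input: the missing 3-byte width costs one byte per value.
    let mid = [70_000u32; 8];
    assert_eq!(encoded_len(&mid, width_classic), 26);
    assert_eq!(encoded_len(&mid, width_0124), 34);
}
```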


Delta encoding

Delta encoding replaces each value with its difference from the previous one:

original:  [ 1000,  1003,  1007,  1004,  1010 ]
                ↘      ↘      ↘      ↘
deltas:    [ 1000,    +3,    +4,    -3,    +6  ]

The first delta is the first value itself (difference from an implicit zero, or from a caller-supplied carry). Every subsequent delta is values[i] - values[i-1].

For sequences where adjacent values are close together (sorted integers, time-series measurements, sensor readings), the deltas are much smaller than the raw values. Smaller values encode to fewer bytes in any variable-byte scheme.

When delta helps

| Data pattern | Example | Delta effect |
| --- | --- | --- |
| Sorted / monotone | File offsets, timestamps | Deltas are small positive integers |
| Slowly drifting | Temperature readings | Deltas cluster near zero |
| Periodic / oscillating | ADC signal samples | Deltas small if bandwidth is limited |
| Uniformly random | Hash values | No benefit; deltas are as large as the values |

Delta encoding is a lossless, reversible transform. Decoding is a prefix sum: values[i] = deltas[0] + deltas[1] + … + deltas[i]. The serial dependency between elements is the main cost; see Performance for how the SIMD prefix-sum implementation handles it.
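
As a plain scalar illustration of the transform and its inverse, the sketch below uses hypothetical `delta_encode` / `delta_decode` functions over i32 with an implicit initial value of zero. Wrapping arithmetic is an assumption made here so that the round trip holds even when a difference overflows; it is not a statement about svb's internals.

```rust
/// Each output is the difference from the previous input; the first
/// element is differenced against an implicit 0.
fn delta_encode(values: &[i32]) -> Vec<i32> {
    let mut prev = 0i32;
    values
        .iter()
        .map(|&v| {
            let d = v.wrapping_sub(prev);
            prev = v;
            d
        })
        .collect()
}

/// The inverse is a running (prefix) sum over the deltas.
fn delta_decode(deltas: &[i32]) -> Vec<i32> {
    let mut acc = 0i32;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}

fn main() {
    let original = [1000, 1003, 1007, 1004, 1010];
    let deltas = delta_encode(&original);
    assert_eq!(deltas, [1000, 3, 4, -3, 6]);
    assert_eq!(delta_decode(&deltas), original);
}
```

Note how `acc` in the decoder depends on the previous iteration: that is the serial dependency mentioned above.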

Signed vs unsigned

delta in svb is implemented for i16, i32, i64, u32, and u64. For non-monotone data (where values can decrease), use a signed type, as the deltas will be negative and a signed representation preserves that. For guaranteed non-decreasing sequences (file offsets, sorted timestamps), an unsigned type is fine and avoids the overhead of zigzag.


Zigzag encoding

Variable-byte codecs assign shorter encodings to smaller non-negative integers. A signed delta of −1 would be stored as 0xFFFFFFFF (4 bytes) in a u32 codec, with no compression at all.

Zigzag solves this by remapping signed integers to unsigned so that small absolute values map to small codes:

signed →  unsigned
     0 →  0
    -1 →  1
    +1 →  2
    -2 →  3
    +2 →  4
    -3 →  5
    +3 →  6
   ...

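The formula is (n << 1) ^ (n >> (bits - 1)), where >> is an arithmetic (sign-extending) shift: two shifts and an XOR, with no branches. Decoding is (n >> 1) ^ -(n & 1), with a logical shift on the unsigned code. Both directions are branchless and LLVM auto-vectorizes them.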

After zigzag, a signed delta of −1 becomes the unsigned value 1, which encodes in a single byte. A delta of +127 becomes 254 and a delta of −128 becomes 255, both still a single byte. Only deltas outside the range −128 to +127 spill into a second byte.
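
The mapping can be checked with a short sketch for i32. The function names are illustrative; the arithmetic follows the formulas given in this section, relying on Rust's `>>` being arithmetic for signed integers and logical for unsigned ones.

```rust
/// Zigzag-encode: small magnitudes of either sign map to small codes.
fn zigzag_encode(n: i32) -> u32 {
    // `n >> 31` is all ones for negative n and zero otherwise.
    ((n << 1) ^ (n >> 31)) as u32
}

/// Inverse mapping back to the signed value.
fn zigzag_decode(z: u32) -> i32 {
    ((z >> 1) as i32) ^ -((z & 1) as i32)
}

fn main() {
    let signed = [0, -1, 1, -2, 2, -3, 3];
    let codes: Vec<u32> = signed.iter().map(|&n| zigzag_encode(n)).collect();
    assert_eq!(codes, [0, 1, 2, 3, 4, 5, 6]);

    // The one-byte boundary: -128..=127 maps onto 0..=255.
    assert_eq!(zigzag_encode(127), 254);
    assert_eq!(zigzag_encode(-128), 255);
    assert_eq!(zigzag_encode(128), 256);

    // Round trip, including the extremes.
    for &n in &[i32::MIN, -1, 0, 1, i32::MAX] {
        assert_eq!(zigzag_decode(zigzag_encode(n)), n);
    }
}
```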


Composing the three

Delta → zigzag → StreamVByte is a standard pipeline for compressing integer sequences that are slowly varying or oscillating. Each stage does one job:

raw values
  │
  ▼  delta encode
differences (signed, small magnitude)
  │
  ▼  zigzag encode
differences (unsigned, small magnitude)
  │
  ▼  StreamVByte encode
compact byte stream

Worked example

Five i16 ADC-style samples:

stage         values
───────────────────────────────────────────────
raw           [ 1000,  1003,  1007,  1004,  1010 ]
after delta   [ 1000,     3,     4,    -3,     6 ]
after zigzag  [ 2000,     6,     8,     5,    12 ]
after SVB16   1 ctrl byte + 6 data bytes = 7 bytes  (vs 10 raw bytes)

The first value (1000) stays large because it is the absolute anchor. The subsequent values (the deltas) all fit in a single byte after zigzag. In practice, for signals with small bandwidth relative to their absolute level, the per-value cost quickly drops to 1 byte once the anchor is amortised over the chunk.
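
The three stages can be traced end to end in scalar code. The sketch below is illustrative only: `zigzag16` and `encode_pipeline` are hypothetical names, and the little-endian order of 2-byte values is an assumption carried over from the U32Classic wire format. For five values it emits ceil(5/8) = 1 control byte followed by 6 data bytes.

```rust
/// Zigzag for i16 -> u16.
fn zigzag16(n: i16) -> u16 {
    ((n << 1) ^ (n >> 15)) as u16
}

/// Scalar sketch: delta, then zigzag, then 1-bit-tag byte packing.
fn encode_pipeline(samples: &[i16]) -> Vec<u8> {
    let ctrl_len = samples.len().div_ceil(8);
    let mut out = vec![0u8; ctrl_len];
    let mut prev = 0i16;
    for (i, &s) in samples.iter().enumerate() {
        let code = zigzag16(s.wrapping_sub(prev));
        prev = s;
        if code > 0xFF {
            // Tag 1: two data bytes (little-endian, assumed).
            out[i / 8] |= 1 << (i % 8);
            out.extend_from_slice(&code.to_le_bytes());
        } else {
            // Tag 0: one data byte.
            out.push(code as u8);
        }
    }
    out
}

fn main() {
    let encoded = encode_pipeline(&[1000, 1003, 1007, 1004, 1010]);
    // Control byte 0x01: only the anchor (zigzag 2000 = 0x07D0) needs 2 bytes.
    assert_eq!(encoded, [0x01, 0xD0, 0x07, 6, 8, 5, 12]);
    assert_eq!(encoded.len(), 7);
}
```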

Choosing what to compose

| Data | Recipe |
| --- | --- |
| Sorted unsigned integers | delta → U32Classic or U64Coder1248 |
| Non-monotone integers | delta → zigzag → U32Classic |
| 16-bit oscillating signal | delta → zigzag → Svb16 (= the VBZ pipeline) |
| Sparse data with many zeros | U32Variant0124 alone, or delta first if it helps |
| Already-small unsigned values | U32Classic or Svb16 directly |

Codec Variants

svb provides five codec variants spanning three element widths. Each is a zero-sized type implementing the same encode/decode surface.

| Variant | Element | Tag bits | Byte widths | Wire-compatible with |
| --- | --- | --- | --- | --- |
| Svb16 | u16 | 1 | 1/2 | ONT vbz_hdf_plugin |
| U32Classic | u32 | 2 | 1/2/3/4 | Lemire C library, stream-vbyte crate |
| U32Variant0124 | u32 | 2 | 0/1/2/4 | Lemire "0124" variant |
| U64Coder1234 | u64 | 2 | 1/2/3/4 | streamvbyte64::Coder1234 (u32 values) |
| U64Coder1248 | u64 | 2 | 1/2/4/8 | streamvbyte64::Coder1248 |

Tag encoding

All u32 and u64 codecs pack four 2-bit tags into each control byte, LSB-first:

control byte n
bits 1:0  → tag for value 4n+0
bits 3:2  → tag for value 4n+1
bits 5:4  → tag for value 4n+2
bits 7:6  → tag for value 4n+3

Svb16 uses 1-bit tags and packs eight tags per control byte.

Buffer layout

All codecs use the same flat layout: control bytes first, data bytes immediately after.

[ ctrl[0] ctrl[1] ... ctrl[ceil(n/4)-1] | data bytes ... ]

The control stream length is always ceil(n / 4) bytes for 2-bit codecs, ceil(n / 8) for Svb16. No length prefix is stored; the caller supplies the element count to decode.

SVB16

Svb16 compresses u16 values using 1-bit tags. Each value is stored in either 1 byte (values 0–255) or 2 bytes (values 256–65535). Eight tags share one control byte.

This is the codec used in the VBZ pipeline for Oxford Nanopore POD5 signal data.

Tag table

| Tag | Byte width | Value range |
| --- | --- | --- |
| 0 | 1 | 0–255 |
| 1 | 2 | 256–65535 |

Example

#![allow(unused)]
fn main() {
use svb::u16::Svb16;

let values: Vec<u16> = vec![1, 300, 0, 65000];
let encoded = Svb16.encode(&values);
let decoded = Svb16.decode(&encoded, values.len()).unwrap();
assert_eq!(decoded, values);
}

Control stream layout

Tags are packed 8 per byte, LSB-first. For n values the control stream is ceil(n / 8) bytes.

control byte k  →  tags for values 8k+0 through 8k+7
bit 0  =  tag for value 8k+0
bit 1  =  tag for value 8k+1
...
bit 7  =  tag for value 8k+7
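
Reading the layout back is the mirror image. The following scalar decoder is a sketch under two assumptions: 2-byte values are stored little-endian (as in the U32Classic wire format), and a too-short buffer is reported as `None`. `decode_svb16` is a hypothetical name, not the svb API.

```rust
/// Scalar sketch of decoding `n` u16 values from the flat layout.
fn decode_svb16(encoded: &[u8], n: usize) -> Option<Vec<u16>> {
    let ctrl_len = n.div_ceil(8);
    let ctrl = encoded.get(..ctrl_len)?;
    let mut data = encoded.get(ctrl_len..)?;
    let mut out = Vec::new();
    for i in 0..n {
        // Bit (i % 8) of control byte (i / 8) is the tag for value i.
        let wide = ((ctrl[i / 8] >> (i % 8)) & 1) == 1;
        let width = if wide { 2 } else { 1 };
        let bytes = data.get(..width)?;
        let value = if wide {
            u16::from_le_bytes([bytes[0], bytes[1]])
        } else {
            u16::from(bytes[0])
        };
        out.push(value);
        data = &data[width..];
    }
    Some(out)
}

fn main() {
    // Tags for [1, 300, 0, 65000] are 0, 1, 0, 1 -> control byte 0b1010.
    let encoded = [0x0A, 0x01, 0x2C, 0x01, 0x00, 0xE8, 0xFD];
    assert_eq!(decode_svb16(&encoded, 4), Some(vec![1, 300, 0, 65000]));
    // A truncated buffer is rejected rather than read out of bounds.
    assert_eq!(decode_svb16(&encoded[..5], 4), None);
}
```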

U32Classic

U32Classic is the original Lemire StreamVByte variant for u32 values. Each value is stored in 1–4 bytes depending on its magnitude.

Wire-compatible with the Lemire C library and the stream-vbyte crate.

Tag table

| Tag | Byte width | Value range |
| --- | --- | --- |
| 0 | 1 | 0–255 |
| 1 | 2 | 256–65535 |
| 2 | 3 | 65536–16777215 |
| 3 | 4 | 16777216–4294967295 |

Example

#![allow(unused)]
fn main() {
use svb::u32::U32Classic;

let values: Vec<u32> = vec![1, 500, 70_000, 16_000_000];
let encoded = U32Classic.encode(&values);
let decoded = U32Classic.decode(&encoded, values.len()).unwrap();
assert_eq!(decoded, values);
}

Wire format example

Encoding [1, 256, 65536, 0xFFFFFFFF] produces:

byte 0:     0xE4        control byte (tags: 0, 1, 2, 3 packed LSB-first → 0b11_10_01_00)
byte 1:     0x01        value 0: 1 (1 byte)
bytes 2-3:  0x00 0x01   value 1: 256 (2 bytes, little-endian)
bytes 4-6:  0x00 0x00 0x01   value 2: 65536 (3 bytes)
bytes 7-10: 0xFF 0xFF 0xFF 0xFF  value 3: 4294967295 (4 bytes)
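
The byte listing above can be decoded by hand with a small scalar routine. This is a sketch of the format as described on this page, not svb's decoder: `decode_classic` is a hypothetical name, and returning `None` on a short buffer is a choice made for the example.

```rust
/// Scalar sketch: decode `n` u32 values from control bytes + data bytes.
fn decode_classic(encoded: &[u8], n: usize) -> Option<Vec<u32>> {
    let ctrl_len = n.div_ceil(4);
    let ctrl = encoded.get(..ctrl_len)?;
    let mut data = encoded.get(ctrl_len..)?;
    let mut out = Vec::new();
    for i in 0..n {
        // Two tag bits per value, packed LSB-first; width = tag + 1.
        let tag = (ctrl[i / 4] >> (2 * (i % 4))) & 0b11;
        let width = tag as usize + 1;
        let bytes = data.get(..width)?;
        // Zero-extend the little-endian bytes to a full u32.
        let mut le = [0u8; 4];
        le[..width].copy_from_slice(bytes);
        out.push(u32::from_le_bytes(le));
        data = &data[width..];
    }
    Some(out)
}

fn main() {
    let encoded = [
        0xE4, // control byte: tags 0, 1, 2, 3
        0x01, // 1
        0x00, 0x01, // 256
        0x00, 0x00, 0x01, // 65536
        0xFF, 0xFF, 0xFF, 0xFF, // 4294967295
    ];
    assert_eq!(
        decode_classic(&encoded, 4),
        Some(vec![1, 256, 65_536, 0xFFFF_FFFF])
    );
}
```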

When to use

U32Classic is the right default for general u32 compression and any context where wire compatibility with the C library matters. For data with many exactly-zero values, U32Variant0124 compresses better: zeros take no data bytes there, while values from 1 to 255 cost one byte in both codecs.

U32Variant0124

U32Variant0124 is an alternative u32 codec where zero values consume no data bytes at all. The byte-width options are 0, 1, 2, or 4. There is no 3-byte option.

Wire-compatible with the Lemire "0124" variant and streamvbyte64::Coder0124.

Tag table

| Tag | Byte width | Value range |
| --- | --- | --- |
| 0 | 0 | 0 (exactly) |
| 1 | 1 | 1–255 |
| 2 | 2 | 256–65535 |
| 3 | 4 | 65536–4294967295 |

Note that values in the range 65536–16777215 require 4 bytes (not 3), which is worse than U32Classic for that range. The benefit comes from sparse data where many values are zero.

Example

#![allow(unused)]
fn main() {
use svb::u32::U32Variant0124;

// Zero-valued elements cost 0 bytes in the data stream.
let values: Vec<u32> = vec![0, 0, 42, 0, 0, 255, 0];
let encoded = U32Variant0124.encode(&values);
let decoded = U32Variant0124.decode(&encoded, values.len()).unwrap();
assert_eq!(decoded, values);
}

When to use

Use U32Variant0124 when a significant fraction of values are exactly zero, such as sparse histograms, run-length-style data, or delta-encoded sorted lists where many differences are zero. For general data with few zeros, U32Classic is typically better because it can use 3 bytes for values in the 65536–16777215 range.

U64Coder1234

U64Coder1234 stores u64 values using 1–4 bytes, the same byte-width table as U32Classic. Values must fit within u32::MAX (4294967295); values above that are silently truncated on encode.

The wire format is identical to U32Classic; only the element type differs. This means a U64Coder1234-encoded buffer can be decoded by U32Classic, which yields the same values as u32. U64Coder1234 itself zero-extends each decoded value to u64.

Tag table

| Tag | Byte width | Value range |
| --- | --- | --- |
| 0 | 1 | 0–255 |
| 1 | 2 | 256–65535 |
| 2 | 3 | 65536–16777215 |
| 3 | 4 | 16777216–4294967295 |

Example

#![allow(unused)]
fn main() {
use svb::u64::U64Coder1234;

let values: Vec<u64> = vec![1, 500, 70_000, u32::MAX as u64];

// check_range returns the index of the first out-of-range value, if any.
assert_eq!(U64Coder1234.check_range(&values), None);

let encoded = U64Coder1234.encode(&values);
let decoded = U64Coder1234.decode(&encoded, values.len()).unwrap();
assert_eq!(decoded, values);
}

Range checking

Call check_range before encoding if the input may contain values above u32::MAX:

#![allow(unused)]
fn main() {
use svb::u64::U64Coder1234;

let values: Vec<u64> = vec![1, u64::MAX, 3];
if let Some(idx) = U64Coder1234.check_range(&values) {
    eprintln!("value at index {idx} exceeds u32::MAX");
}
}
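
For readers who want to see what such a check amounts to, here is a stand-alone sketch. `first_out_of_range` is a hypothetical stand-in written for this page, not the svb implementation; the `as u32` cast at the end only illustrates what truncation to 32 bits means in Rust, and whether svb truncates in exactly this way is an assumption.

```rust
/// Index of the first value that does not fit in 32 bits, if any.
fn first_out_of_range(values: &[u64]) -> Option<usize> {
    values.iter().position(|&v| v > u64::from(u32::MAX))
}

fn main() {
    assert_eq!(first_out_of_range(&[1, 500, u64::from(u32::MAX)]), None);
    assert_eq!(first_out_of_range(&[1, u64::MAX, 3]), Some(1));

    // A truncating cast keeps only the low 32 bits, so the value is lost.
    let too_big = u64::from(u32::MAX) + 1;
    assert_eq!(too_big as u32, 0);
}
```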

When to use

Use U64Coder1234 when your data is logically u64 but all values fit within 32 bits. It produces a more compact encoding than U64Coder1248 for such data because values in the 65536–16777215 range take 3 bytes instead of 4.

U64Coder1248

U64Coder1248 stores u64 values using 1, 2, 4, or 8 bytes. It covers the full u64 range without truncation.

Wire-compatible with streamvbyte64::Coder1248.

Tag table

| Tag | Byte width | Value range |
| --- | --- | --- |
| 0 | 1 | 0–255 |
| 1 | 2 | 256–65535 |
| 2 | 4 | 65536–4294967295 |
| 3 | 8 | 4294967296–18446744073709551615 |

Note there is no 3-byte option. Values in the range 65536–16777215 require 4 bytes.
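
One consequence of the power-of-two widths is that the byte width is simply 1 << tag. The sketch below spells out the tag table; `tag_1248` and `width_1248` are illustrative names, not part of the svb API.

```rust
/// Tag for a value under the 1/2/4/8 table.
fn tag_1248(v: u64) -> u8 {
    match v {
        0..=0xFF => 0,
        0x100..=0xFFFF => 1,
        0x1_0000..=0xFFFF_FFFF => 2,
        _ => 3,
    }
}

/// Byte width is a power of two: 1, 2, 4, or 8.
fn width_1248(v: u64) -> usize {
    1usize << tag_1248(v)
}

fn main() {
    assert_eq!(width_1248(255), 1);
    assert_eq!(width_1248(500), 2);
    // No 3-byte option: 70 000 needs 4 bytes here.
    assert_eq!(width_1248(70_000), 4);
    assert_eq!(width_1248(1 << 32), 8);
    assert_eq!(width_1248(u64::MAX), 8);
}
```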

Example

#![allow(unused)]
fn main() {
use svb::u64::U64Coder1248;

let values: Vec<u64> = vec![1, 500, 1 << 32, u64::MAX];
let encoded = U64Coder1248.encode(&values);
let decoded = U64Coder1248.decode(&encoded, values.len()).unwrap();
assert_eq!(decoded, values);
}

When to use

Use U64Coder1248 when values may exceed u32::MAX. If all values fit within 32 bits, U64Coder1234 gives better compression because it can use 3 bytes for values in the 65536–16777215 range.

Delta and Zigzag

Delta and zigzag are independent slice transforms. They are not tied to any specific codec; you compose them with whichever codec suits your data.

Delta encoding

Delta encoding replaces each value with the difference from the previous value. This is effective for sorted or slowly-varying data: the differences are small even when the raw values are large.

#![allow(unused)]
fn main() {
use svb::{delta, u64::U64Coder1248};

// Sorted u64 timestamps; differences are small positive numbers.
let timestamps: Vec<u64> = vec![1_000_000, 1_001_500, 1_003_000, 1_010_000];

let deltas = delta::encode(&timestamps);
let encoded = U64Coder1248.encode(&deltas);

let decoded_deltas = U64Coder1248.decode(&encoded, deltas.len()).unwrap();
let recovered = delta::decode(&decoded_deltas);
assert_eq!(recovered, timestamps);
}

delta is implemented for i16, i32, i64, u32, and u64. Use a signed type when the sequence is non-monotone and you intend to follow with zigzag; use an unsigned type for sorted data where all differences are non-negative.

Streaming / chunked delta

For streaming use-cases where data arrives in chunks, use encode_with_initial and decode_with_initial to carry the boundary value across chunks:

#![allow(unused)]
fn main() {
use svb::delta;

let chunk_a: Vec<u32> = vec![100, 105, 110];
let chunk_b: Vec<u32> = vec![115, 120, 125];

// Encode chunk A; initial value is 0.
let (deltas_a, last_a) = delta::encode_with_initial(0, &chunk_a);

// Encode chunk B using the last value from chunk A as the initial.
let (deltas_b, _last_b) = delta::encode_with_initial(last_a, &chunk_b);

// Decode chunk A; initial value is 0.
let (recovered_a, last_a) = delta::decode_with_initial(0, &deltas_a);

// Decode chunk B using the boundary value.
let (recovered_b, _) = delta::decode_with_initial(last_a, &deltas_b);

assert_eq!(recovered_a, chunk_a);
assert_eq!(recovered_b, chunk_b);
}
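
The carry semantics can be illustrated without the crate. In this sketch, `delta_encode_from` is a hypothetical function that mirrors the (deltas, last value) return shape used above; the wrapping subtraction is an assumption made for the example.

```rust
/// Delta-encode one chunk against `initial`, returning the deltas and the
/// last raw value so the next chunk can continue from it.
fn delta_encode_from(initial: u32, values: &[u32]) -> (Vec<u32>, u32) {
    let mut prev = initial;
    let deltas: Vec<u32> = values
        .iter()
        .map(|&v| {
            let d = v.wrapping_sub(prev);
            prev = v;
            d
        })
        .collect();
    (deltas, prev)
}

fn main() {
    let (deltas_a, last_a) = delta_encode_from(0, &[100, 105, 110]);
    assert_eq!(deltas_a, [100, 5, 5]);
    assert_eq!(last_a, 110);

    // With the carry, chunk B starts with a small delta instead of 115.
    let (deltas_b, last_b) = delta_encode_from(last_a, &[115, 120, 125]);
    assert_eq!(deltas_b, [5, 5, 5]);
    assert_eq!(last_b, 125);
}
```

Without the carry, the first delta of every chunk would be a full-size absolute value, which costs extra bytes once per chunk.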

Zigzag encoding

Zigzag maps signed integers to unsigned integers so that small absolute values map to small codes. This allows signed differences from delta encoding to be compressed efficiently by any unsigned codec.

0  →  0
-1 →  1
 1 →  2
-2 →  3
 2 →  4
...

#![allow(unused)]
fn main() {
use svb::{delta, zigzag, u32::U32Classic};

// Arbitrary i32 data: delta, then zigzag, then U32Classic.
let samples: Vec<i32> = vec![-500, 200, -100, 900];

let deltas: Vec<i32> = delta::encode(&samples);
let codes: Vec<u32> = zigzag::encode(&deltas);
let encoded = U32Classic.encode(&codes);

let decoded_codes = U32Classic.decode(&encoded, codes.len()).unwrap();
let decoded_deltas: Vec<i32> = zigzag::decode(&decoded_codes);
let recovered: Vec<i32> = delta::decode(&decoded_deltas);
assert_eq!(recovered, samples);
}

zigzag is implemented for i16 (producing u16), i32 (producing u32), and i64 (producing u64).

VBZ Pipeline

The VBZ pipeline compresses 16-bit ADC signal data as used in Oxford Nanopore POD5 files. It chains four stages:

raw i16 samples
  →  delta encode    (1st-order differences; small values dominate)
  →  zigzag encode   (signed i16 → unsigned u16; small |values| → small codes)
  →  SVB16 encode    (StreamVByte-16: 1-bit control stream)
  →  zstd            (outer entropy coding; NOT part of this crate)

The svb crate handles stages 1–3. The outer zstd layer is left to the caller.

High-level API

#![allow(unused)]
fn main() {
use svb::{encode_vbz, decode_vbz};

let samples: Vec<i16> = vec![100, 101, 103, 102, 98];

// Encode: i16 → delta → zigzag → SVB16 bytes
let encoded = encode_vbz(&samples);

// Decode: SVB16 bytes → zigzag → delta → i16
let decoded = decode_vbz(&encoded, samples.len()).unwrap();
assert_eq!(decoded, samples);
}

Low-level / into variants

For zero-allocation usage or building larger buffers:

#![allow(unused)]
fn main() {
use svb::{encode_vbz_into, decode_vbz_into};

let samples: Vec<i16> = vec![100, 101, 103, 102, 98];
let mut buf: Vec<u8> = Vec::new();
encode_vbz_into(&samples, &mut buf);

let mut out: Vec<i16> = Vec::new();
decode_vbz_into(&buf, samples.len(), &mut out).unwrap();
assert_eq!(out, samples);
}

SVB-ZD Pipeline

The SVB-ZD pipeline compresses 16-bit signal data into a compact byte stream. It is wire-compatible with the slow5lib SLOW5_COMPRESS_SVB_ZD format used in BLOW5 files. The pipeline chains four stages, of which the svb crate handles the first three:

raw i16 samples
  →  widen to i32
  →  fused zigzag-delta (first-order differences → zigzag32 to make values unsigned)
  →  U32Classic encode  (2-bit control stream, 1–4 bytes per value)
  →  zstd               (outer entropy coding; NOT part of this crate)

The key difference from VBZ is the element width: SVB-ZD widens i16 → i32 before the zigzag-delta step, so it uses the U32Classic codec rather than SVB16. This costs one extra bit of tag width but removes SVB16's 2-byte cap: values that overflow i16 after delta (e.g. baseline resets) are encoded correctly without truncation.

High-level API

#![allow(unused)]
fn main() {
use svb::{encode_svbzd, decode_svbzd};

let samples: Vec<i16> = vec![100, 101, 103, 102, 98];

// Encode: i16 → widen → zigzag-delta → U32Classic bytes
let encoded = encode_svbzd(&samples);

// Decode: U32Classic bytes → unzigzag-undelta → i16
let decoded = decode_svbzd(&encoded, samples.len()).unwrap();
assert_eq!(decoded, samples);
}

Low-level / into variants

For zero-allocation usage or appending to an existing buffer:

#![allow(unused)]
fn main() {
use svb::{encode_svbzd_into, decode_svbzd_into};

let samples: Vec<i16> = vec![100, 101, 103, 102, 98];
let mut buf: Vec<u8> = Vec::new();
encode_svbzd_into(&samples, &mut buf);

let mut out: Vec<i16> = Vec::new();
decode_svbzd_into(&buf, samples.len(), &mut out).unwrap();
assert_eq!(out, samples);
}

Fused decode

decode_svbzd_fused collapses all three decode stages (U32Classic, unzigzag, undelta) into a single SIMD loop. This avoids intermediate buffers and is the preferred path for high-throughput BLOW5 reads:

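#![allow(unused)]
fn main() {
use svb::{encode_svbzd, decode_svbzd_fused};

let samples: Vec<i16> = vec![100, 101, 103, 102, 98];
let encoded = encode_svbzd(&samples);

// Single fused pass: U32Classic decode + unzigzag + undelta.
let decoded = decode_svbzd_fused(&encoded, samples.len()).unwrap();
assert_eq!(decoded, samples);
}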

decode_svbzd_fused_into appends into an existing Vec<i16>.

Parallel decode with fused_from

decode_svbzd_fused_from accepts a caller-supplied initial carry value, enabling independent decoding of any sub-stream that starts at a known split point. This is the building block for parallel decoding:

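#![allow(unused)]
fn main() {
use svb::decode_svbzd_fused_from;

// `stream_b` (the bytes of the second sub-stream), `n`, `n_half`, and
// `mid_carry` (the carry value at the split point) are assumed to have been
// recorded by the caller; they are not defined in this fragment.
// Decode second half independently, with known carry from midpoint
let half_b = decode_svbzd_fused_from(&stream_b, n - n_half, mid_carry).unwrap();
}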

The _into variant (decode_svbzd_fused_from_into) appends into an existing Vec<i16>.

Wire format

The encoded byte layout is identical to a U32Classic-encoded Vec<u32> where each u32 is zigzag32(samples[i].widened() - samples[i-1].widened()). There is no additional header; the caller is responsible for tracking n (the number of original i16 samples).

The zigzag32 mapping is (delta << 1) ^ (delta >> 31), the same convention used by slow5lib.
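
The widening step is what makes the extreme case safe. The sketch below computes the u32 codes that would then be handed to the U32Classic stage. It is illustrative only: `zigzag_delta` is a hypothetical name, and treating the predecessor of the first sample as 0 is an assumption.

```rust
/// zigzag32 as given above.
fn zigzag32(delta: i32) -> u32 {
    ((delta << 1) ^ (delta >> 31)) as u32
}

/// Widen each i16 sample to i32, difference it, and zigzag the result.
fn zigzag_delta(samples: &[i16]) -> Vec<u32> {
    let mut prev = 0i32;
    samples
        .iter()
        .map(|&s| {
            let cur = i32::from(s);
            // Differences of widened i16 values always fit in i32.
            let code = zigzag32(cur - prev);
            prev = cur;
            code
        })
        .collect()
}

fn main() {
    // Deltas are [100, 1, 2, -1, -4].
    assert_eq!(zigzag_delta(&[100, 101, 103, 102, 98]), [200, 2, 4, 1, 7]);

    // A full-scale jump has delta 65535, which does not fit in i16 but is
    // represented exactly after widening.
    assert_eq!(zigzag_delta(&[i16::MIN, i16::MAX]), [65_535, 131_070]);
}
```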

SIMD Backends

svb provides SIMD-accelerated encode and decode for all codec variants. The scalar path is always compiled and serves as the correctness reference.

Available backends

| Backend | Feature flag | Architecture | ISA |
| --- | --- | --- | --- |
| SSSE3 | simd-ssse3 | x86-64 | SSE2 + SSSE3 |
| AVX2 | simd-avx2 | x86-64 | AVX2 |
| NEON | simd-neon | AArch64 | NEON |
| Auto | simd-auto | both | runtime detection |

simd-auto

simd-auto detects the best available path at runtime using is_x86_feature_detected! on x86-64 and unconditional NEON on AArch64. This is the recommended flag for most users.

On x86-64, simd-auto selects AVX2 if available, then SSSE3, then scalar. On AArch64, NEON is always selected (NEON is mandatory on AArch64).

simd-auto requires std for runtime CPU detection. In no_std contexts, use a compile-time flag instead.
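
The selection order described above can be sketched with the standard library's detection macro. This is not svb's dispatch code: `detect_backend` is a hypothetical function that only reports which path would be chosen under the stated priority (AVX2, then SSSE3, then scalar on x86-64; NEON on AArch64).

```rust
/// Report the backend that the priority order above would select.
#[allow(unreachable_code)]
fn detect_backend() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime CPUID-based checks; these require std.
        if is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if is_x86_feature_detected!("ssse3") {
            return "ssse3";
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        // NEON is mandatory on AArch64, so no runtime check is needed.
        return "neon";
    }
    "scalar"
}

fn main() {
    println!("selected backend: {}", detect_backend());
}
```

A real dispatcher would typically run the check once and cache a function pointer, so the detection cost is not paid on every call.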

Compile-time flags

simd-avx2, simd-ssse3, and simd-neon compile in the SIMD path and assume it is available at runtime. These are appropriate when the target CPU is known:

# Cross-compile to a known AVX2 target
svb = { version = "0.2", features = ["simd-avx2"] }

or with RUSTFLAGS="-C target-cpu=native" where the build host and run host are the same.

Pipeline coverage

SIMD paths are provided for individual codec variants and for both high-level pipelines:

  • VBZ pipeline (encode_vbz / decode_vbz_fused): fused SVB16 + zigzag + delta in a single SIMD loop on x86-64 (SSSE3/AVX2) and AArch64 (NEON).
  • SVB-ZD pipeline (encode_svbzd / decode_svbzd_fused): fused U32Classic + unzigzag + undelta. Encode computes zigzag-delta inline via SIMD (eliminates the intermediate Vec<u32> allocation), decode collapses all three stages into one SIMD loop.

Decode throughput

With simd-auto on a modern x86-64 machine, decode throughput for all codec variants is in the range of 1.3–4 GB/s depending on variant and input size. See Performance for detailed numbers.

no_std Support

svb supports no_std environments with a global allocator. This covers microcontrollers and embedded targets, WebAssembly modules, and OS-level code such as bootloaders or kernel modules.

Setup

Disable the default std feature and enable alloc:

svb = { version = "0.2", default-features = false, features = ["alloc"] }

All encode and decode APIs are available. The delta and zigzag transforms are also fully available.

SIMD in no_std

Runtime SIMD detection (simd-auto) requires std for is_x86_feature_detected!. In no_std contexts, use a compile-time SIMD flag instead:

# no_std with compile-time NEON (AArch64 embedded target)
svb = { version = "0.2", default-features = false, features = ["alloc", "simd-neon"] }
# no_std with compile-time AVX2
svb = { version = "0.2", default-features = false, features = ["alloc", "simd-avx2"] }

Wire Compatibility

svb is wire-compatible with the reference C implementations and the streamvbyte64 Rust crate. This means a buffer encoded by svb can be decoded by the C library and vice versa.

Compatibility table

| svb variant | Compatible with |
| --- | --- |
| U32Classic | Lemire C streamvbyte library, streamvbyte64::Coder1234 |
| U32Variant0124 | Lemire C "0124" variant, streamvbyte64::Coder0124 |
| U64Coder1234 | streamvbyte64::Coder1234 (u32 values only) |
| U64Coder1248 | streamvbyte64::Coder1248 |
| Svb16 | ONT vbz_hdf_plugin SVB16 layer |
| SVB-ZD pipeline (encode_svbzd / decode_svbzd_fused) | hasindu2008/slow5lib SLOW5_COMPRESS_SVB_ZD (BLOW5 files) |

Buffer layout difference

streamvbyte64 keeps tags and data in separate buffers. svb concatenates them (tags first). When exchanging data with streamvbyte64, split or join buffers at the control stream boundary:

#![allow(unused)]
fn main() {
// svb flat → streamvbyte64 separate buffers
fn split_flat(encoded: &[u8], n: usize) -> (&[u8], &[u8]) {
    let ctrl_len = n.div_ceil(4);
    (&encoded[..ctrl_len], &encoded[ctrl_len..])
}

// streamvbyte64 separate buffers → svb flat
fn join_flat(tags: &[u8], data: &[u8]) -> Vec<u8> {
    let mut flat = tags.to_vec();
    flat.extend_from_slice(data);
    flat
}
}

Verification

Wire compatibility is verified in tests/compat.rs by round-tripping data in both directions: svb encodes and streamvbyte64 decodes, then streamvbyte64 encodes and svb decodes. These tests run in CI for all four compatible codec pairs.

Performance

Benchmarks were measured on GitHub Actions ubuntu-latest (Azure x86-64, AVX2) and ubuntu-24.04-arm (AArch64, NEON) using cargo bench --bench decode with --sample-size 20. All throughput numbers are in GB/s of input integers (Melem/s × bytes-per-element ÷ 1000).

VBZ pipeline breakdown

At 8192 i16 elements with simd-avx2, each stage measured in isolation:

| Stage | encode | decode |
| --- | --- | --- |
| delta | 29.2 GB/s | 3.70 GB/s |
| zigzag | 34.2 GB/s | 28.0 GB/s |
| SVB16 (mixed) | 9.24 GB/s | 9.42 GB/s |
| VBZ (combined, 3-pass) | 5.70 GB/s | 2.42 GB/s |
| VBZ fused decode | N/A | 3.68 GB/s |
| VBZ2 fused 2-chain decode | N/A | 5.62 GB/s |

Zigzag is essentially free (pure bitwise ops, LLVM auto-vectorizes). Delta encode expresses adjacent differences as two overlapping slice views, which LLVM auto-vectorizes to around 29 GB/s with no unsafe code. Delta decode uses an explicit SIMD prefix-sum (SSE2/NEON); the serial carry chain between 8-element blocks limits single-stream throughput to around 3.70 GB/s, essentially the theoretical ceiling for this algorithm.

Fused VBZ decode

decode_vbz_fused collapses all three decode stages into a single SIMD loop. The SVB16 shuffle and zigzag bitwise ops (~5–6 cycles per 8-element block) execute during the delta carry-chain stall (~8 cycles), hiding nearly all of their cost.

| Function | Decode throughput |
| --- | --- |
| decode_vbz (3 separate passes) | 2.42 GB/s |
| decode_vbz_fused (single SIMD pass) | 3.68 GB/s |

The fused path is 1.52× faster than the three-pass pipeline. It reaches 99% of the delta-alone ceiling (3.70 GB/s): SVB16 and zigzag are effectively free, and the delta carry chain is the only remaining bottleneck.

VBZ2: format-extension 2-chain decode

encode_vbz2 / decode_vbz2 extend the VBZ format with a 6-byte header that enables a two-chain fused decode with no pre-scan required:

[mid_carry: i16 LE][mid_data_offset: u32 LE][standard VBZ payload]

mid_carry is samples[n_half - 1], the decoded sample at the chunk midpoint, i.e., the prefix sum of all deltas before the midpoint. mid_data_offset is the count of data bytes consumed by the first n_half elements (sum of 8 + popcnt(ctrl_byte) over the first half of control bytes). Both are computed in O(n) during encode with negligible cost.
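
Both header fields are cheap to work with. The sketch below parses the 6-byte header and recomputes the data offset from control bytes using the 8 + popcnt rule; it assumes every counted control byte covers a full group of eight values. `parse_header` and `data_bytes_for` are hypothetical helpers, not svb functions.

```rust
/// Data bytes covered by a run of full SVB16 control bytes: one byte per
/// value (8 per control byte) plus one extra byte per set tag bit.
fn data_bytes_for(ctrl: &[u8]) -> u32 {
    ctrl.iter().map(|c| 8 + c.count_ones()).sum()
}

/// Split the header into (mid_carry, mid_data_offset).
fn parse_header(buf: &[u8]) -> Option<(i16, u32)> {
    let h = buf.get(..6)?;
    let mid_carry = i16::from_le_bytes([h[0], h[1]]);
    let mid_data_offset = u32::from_le_bytes([h[2], h[3], h[4], h[5]]);
    Some((mid_carry, mid_data_offset))
}

fn main() {
    // Three control bytes with 0, 8, and 1 two-byte values respectively.
    let first_half_ctrl = [0x00, 0xFF, 0x01];
    assert_eq!(data_bytes_for(&first_half_ctrl), 8 + 16 + 9);

    // Header carrying mid_carry = 1000 and mid_data_offset = 33.
    let header = [0xE8, 0x03, 0x21, 0x00, 0x00, 0x00];
    assert_eq!(parse_header(&header), Some((1000, 33)));
}
```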

At decode time the payload is split into two independent half-streams. Two carry chains run interleaved in one SIMD loop: the CPU's out-of-order engine overlaps chain A's carry-extract latency with chain B's prefix-sum arithmetic. Port-5 usage is unchanged from single-chain (10 ops per 16 elements), so there is no throughput regression at any size; the gain accumulates only where the carry latency was the limiting factor.

| Function | Decode throughput |
| --- | --- |
| decode_vbz (3 separate passes) | 2.42 GB/s |
| decode_vbz_fused (single SIMD pass) | 3.68 GB/s |
| decode_vbz2 (format-extension 2-chain) | 5.62 GB/s |

That is 1.53× over single-chain fused at 8192 elements: interleaving the two carry chains hides most of the serial dependency cost.

The real payoff is multi-threaded decoding: with mid_data_offset known up-front, both half-streams are independent and can run on separate cores. The format overhead is 6 bytes per chunk regardless of chunk size, negligible for any practical payload.

Caller-side parallel decode

decode_vbz_fused_from_into(data, n, initial_carry, out) exposes the single-chain fused decoder with a caller-supplied initial carry, making it possible to decode any half-stream independently. A caller that manages its own thread pool simply splits the VBZ2 payload and dispatches both halves concurrently:

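// Decode both VBZ2 half-streams concurrently on scoped threads.
// stream_a / stream_b are the two halves of the payload; mid_carry comes
// from the VBZ2 header.
let (out_a, out_b) = std::thread::scope(|s| {
    // Half A starts from a zero carry.
    let ha = s.spawn(|| decode_vbz_fused_from(&stream_a, n_half, 0));
    // Half B resumes from the sample value at the midpoint.
    let hb = s.spawn(|| decode_vbz_fused_from(&stream_b, n - n_half, mid_carry));
    (ha.join().unwrap(), hb.join().unwrap())
});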

Decoding 64 × 8192-element chunks in parallel (64 half-A streams on thread 1, 64 half-B streams on thread 2); run locally for hardware-specific numbers:

| Path | Decode throughput |
|---|---|
| decode_vbz_fused (single chain, 1 thread) | 1.84 Gelem/s |
| decode_vbz2 (2-chain interleaved, 1 thread) | 2.81 Gelem/s |
| decode_vbz_fused_from_into (2 threads, batch of 64) | hardware-dependent |

Multi-threaded throughput is highly sensitive to CPU core count, L2/L3 topology, and scheduler behaviour. The single-thread numbers above are from GitHub Actions CI (Azure x86-64). For two-thread measurements run the vbz2_parallel criterion benchmark locally: cargo bench --features simd-avx2 --bench decode -- vbz2_parallel. With distinct chunks from independent nanopore reads (the realistic production case) the two streams share no cache lines and the speedup approaches 2×.

VBZ-K: generalised K-stream parallel decode

encode_vbzk(samples, k) / decode_vbzk_parallel_into(data, n, out) generalise VBZ2 to K independent sub-streams. The header stores K−1 split points:

[k: u8][(carry_i: i16 LE, data_offset_i: u32 LE) for i in 1..k][VBZ payload]

Header overhead: 1 + (K−1) × 6 bytes. Each sub-chunk has n_sub = (n/K) & !7 elements; the last sub-chunk takes the remainder. Split-point carries and data offsets are computed in O(n) at encode time with negligible overhead.
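The layout arithmetic can be sketched as follows. The helper names are hypothetical and not part of the crate; the formulas are the ones stated above, with the added assumption that k is at least 1.

```rust
/// Header size in bytes: one byte for k plus 6 bytes per split point.
fn vbzk_header_len(k: usize) -> usize {
    assert!(k >= 1);
    1 + (k - 1) * 6
}

/// Element count of each sub-chunk: the first k-1 sub-chunks are rounded
/// down to a multiple of 8, and the last one takes whatever remains.
fn vbzk_sub_lens(n: usize, k: usize) -> Vec<usize> {
    assert!(k >= 1);
    let n_sub = (n / k) & !7;
    let mut lens = vec![n_sub; k - 1];
    lens.push(n - n_sub * (k - 1));
    lens
}
```

With n = 10 000 and k = 4, the first three sub-chunks hold 2496 elements each and the last holds 2512, behind a 19-byte header. Rounding to a multiple of 8 keeps every split on a control-byte boundary, since one SVB16 control byte covers 8 values.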

Benchmarked at N=8192 with a batch of 64 chunks per thread (amortising thread scope overhead). Multi-threaded results are hardware-dependent, so run locally for specific numbers:

| Configuration | Throughput | vs single-chain |
|---|---|---|
| single-chain fused (k=1) | 1.84 Gelem/s | 1.00× |
| VBZ-K k=2 (2 threads) | hardware-dependent | |
| VBZ-K k=4 (4 threads) | hardware-dependent | |
| VBZ-K k=8 (8 threads) | hardware-dependent | |

Multi-threaded throughput is not reliably measurable in shared CI environments. Run cargo bench --features simd-avx2 --bench decode -- vbzk_parallel locally for hardware-specific numbers.

k=4 matches k=2 at this chunk size; k=8 regresses because 8 threads decoding 1024-element sub-streams run into thread-scope overhead and scheduler jitter. With distinct real-world POD5 chunks (6 000–12 000 samples each), larger sub-stream sizes would push k=8 above k=4.

The full POD5 pipeline bottleneck

A POD5 reader decodes: disk → zstd decompress → VBZ decode → i16 samples. On a typical NVMe system (~6.5 GB/s sequential read):

  • Disk: 6.5 GB/s × ~3× zstd ratio = ~19.5 GB/s of decoded signal capacity
  • VBZ single-chain (AVX2): 1.84 Gelem/s × 2 bytes = 3.68 GB/s of decoded signal
  • VBZ-K k=4: scales roughly linearly with cores up to the zstd bottleneck
  • zstd single-core: ~1.5–2 GB/s compressed ≈ 2–3 Gelem/s, the real bottleneck for a single-threaded reader

The disk is rarely the bottleneck. A single-threaded reader is zstd-limited. Parallelising VBZ decode with VBZ-K removes the VBZ ceiling and shifts the bottleneck back to zstd. To saturate NVMe bandwidth you need multi-threaded zstd AND VBZ-K simultaneously.

Delta decode: the 2-chain approach

Delta decode is a serial prefix sum: each output element depends on all previous elements. On x86_64 the SSE2 path processes 8 elements per iteration with a carry chain of ~8 cycles (extract + broadcast + add). We are already at the theoretical single-stream ceiling.

delta::decode_2chain breaks this by decoding two independent sub-streams simultaneously. The CPU's out-of-order engine hides one chain's carry latency behind the other's prefix-sum arithmetic, delivering 1.78× throughput:

| Path | Decode throughput |
|---|---|
| delta::decode_into (single stream) | 3.70 GB/s |
| delta::decode_2chain (two streams) | 6.58 GB/s |

1.78× throughput with two interleaved chains. This requires one extra i16 stored per chunk: the running delta sum at the midpoint (computed by delta::mid_carry during encode, 2 bytes overhead). Each additional sub-chunk adds another 2-byte carry value and enables one more independent decode stream.
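A scalar sketch of the two-chain idea follows. The names `mid_carry_sketch` and `decode_2chain_sketch` are hypothetical stand-ins and do not reflect the signatures of delta::mid_carry or delta::decode_2chain; wrapping i16 arithmetic is also an assumption.

```rust
/// Running delta sum at the midpoint: the value chain B must start from.
fn mid_carry_sketch(deltas: &[i16], n_half: usize) -> i16 {
    deltas[..n_half]
        .iter()
        .fold(0i16, |acc, &d| acc.wrapping_add(d))
}

/// Prefix-sum both halves in one loop. The two accumulators never read each
/// other, so the CPU is free to overlap the two dependency chains.
fn decode_2chain_sketch(deltas: &[i16], n_half: usize, mid_carry: i16) -> Vec<i16> {
    let mut out = vec![0i16; deltas.len()];
    let (da, db) = deltas.split_at(n_half);
    let (oa, ob) = out.split_at_mut(n_half);
    let (mut ca, mut cb) = (0i16, mid_carry);
    let common = da.len().min(db.len());
    for i in 0..common {
        ca = ca.wrapping_add(da[i]);
        oa[i] = ca;
        cb = cb.wrapping_add(db[i]);
        ob[i] = cb;
    }
    // Finish whichever half is longer.
    for i in common..da.len() {
        ca = ca.wrapping_add(da[i]);
        oa[i] = ca;
    }
    for i in common..db.len() {
        cb = cb.wrapping_add(db[i]);
        ob[i] = cb;
    }
    out
}
```

The output is identical to a single-chain prefix sum; the stored midpoint carry is what lets the second half start without waiting for the first.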

Path to a parallel-decode VBZ format

With K sub-chunks, all stages of the VBZ pipeline (delta, zigzag, SVB16) can be decoded independently on K cores:

| Sub-chunks | Decode throughput | vs. current |
|---|---|---|
| 1 (current VBZ) | 2.42 GB/s | N/A |
| 2 (single-threaded 2-chain) | 5.62 GB/s | 2.3× |
| 2 cores | ~11 GB/s | ~4.5× |
| 4 cores | ~22 GB/s | ~9× |
| 8 cores | ~44 GB/s | ~18× |

The format change is: store K−1 carry values ((K−1) × 2 bytes) in the chunk header and split the encoded payload into K equal sub-streams. Compression ratio is unchanged. The svb crate provides decode_2chain and mid_carry as the building blocks.

SVB-ZD pipeline

Measured at 8192 i16 elements on GitHub Actions CI (Azure x86-64 and AArch64):

x86-64

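| Path | Scalar | SSSE3 | SSSE3 speedup | AVX2 | AVX2 speedup |
|---|---|---|---|---|---|
| encode_svbzd | 158 Melem/s | 1,140 Melem/s | 7.2× | 1,100 Melem/s | 6.9× |
| decode_svbzd (3-pass) | 105 Melem/s | 696 Melem/s | 6.6× | 722 Melem/s | 6.9× |
| decode_svbzd_fused | 466 Melem/s | 1,510 Melem/s | 3.2× | 1,510 Melem/s | 3.2× |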

AArch64

| Path | Scalar | NEON | NEON speedup |
|---|---|---|---|
| encode_svbzd | 195 Melem/s | 551 Melem/s | 2.8× |
| decode_svbzd (3-pass) | 210 Melem/s | 834 Melem/s | 4.0× |
| decode_svbzd_fused | 564 Melem/s | 1,850 Melem/s | 3.3× |

The SIMD encode path computes zigzag-delta inline without an intermediate Vec<u32> allocation. On AVX2 it processes 8 i16 values per iteration using _mm256_cvtepi16_epi32 + _mm_alignr_epi8; on NEON it uses vmovl_s16 + vextq_s32.

The fused decode collapses U32Classic decode, unzigzag, and undelta into one SIMD loop. The 2-ctrl-byte inner loop processes 8 values per iteration. Note that SSSE3 ≈ AVX2 for the fused path: the bottleneck is the serial delta carry chain, not SIMD width, so wider registers do not help once the carry chain is saturated.
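The zigzag-delta transform at the centre of this pipeline can be sketched in scalar form. The function names below are hypothetical, and starting the delta from 0 is an assumption; the point illustrated is that the delta is taken in 32 bits, so a difference that exceeds the i16 range survives intact and is then handed to a u32 codec such as U32Classic.

```rust
/// i16 samples -> 32-bit deltas -> zigzag-mapped u32 values.
fn zigzag_delta_encode(samples: &[i16]) -> Vec<u32> {
    let mut prev = 0i32;
    samples
        .iter()
        .map(|&s| {
            let d = s as i32 - prev; // cannot overflow: |d| <= 65535
            prev = s as i32;
            ((d << 1) ^ (d >> 31)) as u32
        })
        .collect()
}

/// Inverse: unzigzag each value and accumulate the running sum.
fn zigzag_delta_decode(values: &[u32]) -> Vec<i16> {
    let mut acc = 0i32;
    values
        .iter()
        .map(|&v| {
            let d = ((v >> 1) as i32) ^ -((v & 1) as i32);
            acc += d;
            acc as i16
        })
        .collect()
}
```

Encoding `[-32768, 32767, 0]` yields `[65535, 131070, 65533]`; the middle value does not fit in 16 bits, which is the case a u32 element type accommodates.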

SVB-ZD vs VBZ

Both pipelines operate on i16 signal data; the choice depends on the file format (BLOW5 vs POD5):

| Metric | VBZ | SVB-ZD |
|---|---|---|
| Codec | SVB16 (1-bit tags) | U32Classic (2-bit tags) |
| Encode (AVX2, 8192 elem) | 2,850 Melem/s | 1,100 Melem/s |
| Fused decode (AVX2, 8192 elem) | 1,840 Melem/s | 1,510 Melem/s |
| Fused decode (NEON, 8192 elem) | 2,280 Melem/s | 1,850 Melem/s |
| Wire format | ONT POD5 / VBZ | hasindu2008/slow5lib BLOW5 |

VBZ is faster because SVB16's 1-bit tags pack more tightly than U32Classic's 2-bit tags. SVB-ZD handles values that overflow i16 after delta without truncation.

Results vs streamvbyte64 v0.2.0

Measured with simd-avx2 on GitHub Actions ubuntu-latest (Azure x86-64). streamvbyte64 uses its own runtime detection; numbers reflect its best available path.

| Benchmark | svb | streamvbyte64 | ratio |
|---|---|---|---|
| U32Classic decode/128 | 8.68 GB/s | 3.71 GB/s | 2.34x |
| U32Classic decode/1024 | 13.6 GB/s | 4.87 GB/s | 2.79x |
| U32Classic decode/8192 | 14.1 GB/s | 4.89 GB/s | 2.88x |
| U32Classic encode/128 | 6.65 GB/s | 2.33 GB/s | 2.85x |
| U32Classic encode/1024 | 8.26 GB/s | 3.08 GB/s | 2.68x |
| U32Classic encode/8192 | 8.93 GB/s | 3.20 GB/s | 2.79x |
| U32Variant0124 decode/128 | 8.98 GB/s | 3.48 GB/s | 2.58x |
| U32Variant0124 decode/1024 | 13.8 GB/s | 4.88 GB/s | 2.83x |
| U32Variant0124 decode/8192 | 14.2 GB/s | 5.00 GB/s | 2.84x |
| U32Variant0124 encode/128 | 6.74 GB/s | 2.37 GB/s | 2.84x |
| U32Variant0124 encode/1024 | 8.32 GB/s | 2.96 GB/s | 2.81x |
| U32Variant0124 encode/8192 | 8.89 GB/s | 3.01 GB/s | 2.95x |
| U64Coder1248 decode/128 | 12.0 GB/s | 5.89 GB/s | 2.04x |
| U64Coder1248 decode/1024 | 15.0 GB/s | 8.68 GB/s | 1.73x |
| U64Coder1248 decode/8192 | 14.8 GB/s | 8.76 GB/s | 1.69x |
| U64Coder1248 encode/128 | 7.37 GB/s | 3.52 GB/s | 2.09x |
| U64Coder1248 encode/1024 | 8.73 GB/s | 4.61 GB/s | 1.89x |
| U64Coder1248 encode/8192 | 8.85 GB/s | 4.80 GB/s | 1.84x |

svb is consistently faster than streamvbyte64 in these runs, by roughly 1.7× to 3.0×. The u32 codecs see the largest gap (approaching 3×); the u64 codecs are closer because 8-byte elements reduce the SIMD parallelism available per control byte.

Running benchmarks

cargo bench --features simd-auto

Benchmarks cover all five codec variants across encode/decode and three slice sizes (128, 1024, 8192 elements). Criterion produces HTML reports in target/criterion/.

To run a single benchmark by name substring:

cargo bench --features simd-auto -- U32Classic/decode