svb
svb is a pure-Rust StreamVByte library covering all major codec variants for u16, u32, and u64 integers. Delta and zigzag encoding are composable layers on top. SIMD back-ends are available for x86-64 (SSSE3, AVX2) and AArch64 (NEON).
What is StreamVByte?
StreamVByte is a family of integer compression schemes that store values in a variable number of bytes. Rather than interleaving the control information with the data, StreamVByte places all control bytes in a separate stream. This layout makes SIMD-accelerated decode practical: a batch of control bytes can be loaded and shuffled in a single instruction, determining widths for an entire group of values without branching.
encoded buffer layout
┌────────────────────┬─────────────────────────────────────┐
│ control stream │ data stream │
│ ceil(n/4) bytes │ variable length │
└────────────────────┴─────────────────────────────────────┘
Each tag in the control stream describes the byte width of the corresponding value. The u32 and u64 codecs use 2-bit tags, so four values share one control byte; Svb16 uses 1-bit tags, so eight values share one. The byte widths available depend on the codec variant.
Codec variants at a glance
| Variant | Element | Tag width | Byte widths | Best for |
|---|---|---|---|---|
| Svb16 | u16 | 1 bit | 1/2 | ONT VBZ signal data |
| U32Classic | u32 | 2 bits | 1/2/3/4 | General u32, C-library compatible |
| U32Variant0124 | u32 | 2 bits | 0/1/2/4 | Sparse u32 (many zeros) |
| U64Coder1234 | u64 | 2 bits | 1/2/3/4 | u64 values that fit in u32 |
| U64Coder1248 | u64 | 2 bits | 1/2/4/8 | Full u64 range |
API docs
Rustdoc API reference is published at docs.rs/svb.
Getting Started
Installation
Add svb to your Cargo.toml. For most users simd-auto is the right choice, because it detects the best available SIMD path at runtime:
[dependencies]
svb = { version = "0.2", features = ["simd-auto"] }
Feature flags
| Flag | Effect |
|---|---|
| std (default) | Enables std; implies alloc |
| alloc | Enables all encode/decode APIs with no other dependencies |
| simd-auto | Runtime CPU detection; selects the best available SIMD path |
| simd-avx2 | Compile-time AVX2 (asserts AVX2 is available at runtime) |
| simd-ssse3 | Compile-time SSSE3 |
| simd-neon | Compile-time NEON (AArch64 only; NEON is always available there) |
The compile-time SIMD flags (simd-avx2, simd-ssse3, simd-neon) are intended for environments where the target CPU is known at build time, such as cross-compilation or RUSTFLAGS="-C target-cpu=native". In all other cases, prefer simd-auto.
Basic usage
Every codec is a zero-sized type with encode and decode methods. encode returns a Vec<u8>; decode takes the byte slice and the original element count.
fn main() {
    use svb::u32::U32Classic;

    let values: Vec<u32> = vec![1, 500, 70_000, 16_000_000];
    let encoded = U32Classic.encode(&values);
    // decode needs the original element count; it is not stored in the buffer.
    let decoded = U32Classic.decode(&encoded, values.len()).unwrap();
    assert_eq!(decoded, values);
}
Appending to an existing buffer
Every codec exposes encode_into and decode_into variants that append to a caller-supplied Vec, avoiding extra allocation:
fn main() {
    use svb::u32::U32Classic;

    let mut buf = Vec::new();
    // Each call appends one encoded sequence to the end of buf.
    U32Classic.encode_into(&[1u32, 2, 3], &mut buf);
    U32Classic.encode_into(&[4u32, 5, 6], &mut buf);
}
This is useful when building a larger serialised format where multiple compressed sequences are concatenated. The caller is responsible for recording the element counts needed for decode.
Choosing a codec
- u16 data (e.g. ONT signal): use Svb16, or the higher-level encode_vbz / decode_vbz pipeline.
- u32, general: use U32Classic. Wire-compatible with the Lemire C library.
- u32 with many zeros: use U32Variant0124. Zero values consume no data bytes.
- u64 values that fit in u32::MAX: use U64Coder1234.
- u64, full range: use U64Coder1248.
For sorted or time-series data, compose any codec with delta encoding to compress differences rather than raw values.
Encoding Guide
This page explains how StreamVByte, delta, and zigzag work, when to use each, and how they compose. For the API itself see Delta and Zigzag, Codec Variants, and the API reference.
Fixed-width integers and why they waste space
Every integer type has a fixed storage width. A u32 always occupies 4 bytes, regardless of its value:
value memory (big-endian for clarity) bytes used
──────────────────────────────────────────────────────────
1 00 00 00 01 4 (3 wasted)
300 00 00 01 2C 4 (2 wasted)
75 000 00 01 24 F8 4 (1 wasted)
16 000 000 00 F4 24 00 4 (1 wasted)
4 294 967 295 FF FF FF FF 4 (none wasted)
The high-order bytes are zero whenever the value is small. Those zeros carry no information, yet they occupy the same storage as any other byte. For a u32 array where most values are below 256, three quarters of the storage is zero-padding.
This matters in practice because integer arrays in real applications are rarely uniformly distributed across the full type range. File offsets, timestamps, sensor readings, and index lists all tend to cluster at small magnitudes relative to the maximum the type can hold. An array of one million u32 values representing document word frequencies, for example, might use only the bottom 12 bits of each element, leaving 20 bits per value (over 2 MB per million elements) as wasted zeros.
Variable-byte encoding solves this by storing only the bytes that carry information and recording how many bytes each value used. StreamVByte is a specific variable-byte scheme designed to make that decoding fast with SIMD.
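The waste described above can be measured directly. The following is a minimal sketch, not part of the svb API: `significant_bytes` and `storage_cost` are hypothetical helper names that compare fixed-width storage with the number of bytes that actually carry information.

```rust
/// Number of bytes needed to hold `v` once high-order zero bytes are dropped.
/// Zero still occupies one byte in this sketch.
pub fn significant_bytes(v: u32) -> usize {
    let bits = 32 - v.leading_zeros() as usize; // 0 when v == 0
    ((bits + 7) / 8).max(1)
}

/// Returns (fixed-width bytes, bytes that carry information).
/// Hypothetical helper for illustration only.
pub fn storage_cost(values: &[u32]) -> (usize, usize) {
    let fixed = values.len() * std::mem::size_of::<u32>();
    let packed: usize = values.iter().map(|&v| significant_bytes(v)).sum();
    (fixed, packed)
}
```

The second number excludes any width metadata; a variable-byte scheme adds a small amount on top to record how many bytes each value used.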
StreamVByte
Most integers in real data are small. Fixed-width encoding wastes bytes on the high-order zeros; StreamVByte stores only the bytes that carry information.
The key design decision is where to put the width metadata. Conventional variable-byte schemes (such as the varints used by SQLite and Protocol Buffers) interleave the width information with each value's data bytes, so the decoder must branch on every element and SIMD is of little help. StreamVByte separates the metadata into a control stream and the integer bytes into a data stream:
┌──────────────────────────────────┐
│ control stream │
│ tag tag tag tag tag tag tag tag │ ← 2 bits per value (u32), 1 bit (u16)
└──────────────────────────────────┘
┌──────────────────────────────────┐
│ data stream │
│ [value 0 bytes][value 1 bytes]… │ ← tightly packed, no separators
└──────────────────────────────────┘
Because all the widths for a block of values live in one or two control bytes, a SIMD decoder can read them all at once, look up a pre-built shuffle table, and unpack 4–8 values with a single byte-shuffle instruction (pshufb on x86-64, tbl on AArch64). The data stream is a plain byte array with no branch points.
A concrete example
Four u32 values encoded with U32Classic (1/2/3/4-byte widths):
values: [ 1, 300, 75000, 5 ]
widths: [ 1 B, 2 B, 3 B, 1 B ] ← determined by value magnitude
tags: [ 00, 01, 10, 00 ] ← 2-bit tag per value
control byte: 0b_00_10_01_00 (4 tags packed LSB-first)
data bytes: 01 | 2C 01 | F8 24 01 | 05
└1┘ └─300─┘ └─75000─┘ └5┘
The full encoded output is 8 bytes (1 control + 7 data) for four 32-bit values that would require 16 bytes in fixed-width form.
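The example above can be reproduced with a few lines of scalar code. This is a sketch of the 1/2/3/4-byte layout under the flat "control bytes first" convention described on this page; `encode_classic` is a hypothetical name, not the crate's implementation.

```rust
/// Scalar sketch of the 1/2/3/4-byte layout: control bytes first, then data.
/// Hypothetical helper; the real codec uses SIMD where available.
pub fn encode_classic(values: &[u32]) -> Vec<u8> {
    let ctrl_len = (values.len() + 3) / 4;
    let mut out = vec![0u8; ctrl_len];
    for (i, &v) in values.iter().enumerate() {
        // Width in bytes, chosen by magnitude.
        let width = match v {
            0..=0xFF => 1,
            0x100..=0xFFFF => 2,
            0x1_0000..=0xFF_FFFF => 3,
            _ => 4,
        };
        // Tag = width - 1, packed LSB-first, four tags per control byte.
        out[i / 4] |= ((width - 1) as u8) << ((i % 4) * 2);
        // Append the low `width` bytes in little-endian order.
        out.extend_from_slice(&v.to_le_bytes()[..width]);
    }
    out
}
```

Calling `encode_classic(&[1, 300, 75_000, 5])` yields the control byte followed by the data bytes listed in the example.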
Codec variant selection
The five variants differ in which byte widths they support and what element type they encode:
| Variant | Element | Byte widths | Best for |
|---|---|---|---|
| Svb16 | u16 | 1 / 2 | 16-bit data; values mostly ≤ 255 |
| U32Classic | u32 | 1 / 2 / 3 / 4 | General-purpose u32; compatible with Lemire C library |
| U32Variant0124 | u32 | 0 / 1 / 2 / 4 | Sparse data with many exact zeros (0 bytes stored) |
| U64Coder1234 | u64 | 1 / 2 / 3 / 4 | u64 values known to fit in u32 |
| U64Coder1248 | u64 | 1 / 2 / 4 / 8 | Full u64 range |
U32Variant0124 skips the 3-byte width and adds a 0-byte width: a zero value stores no data bytes at all, only its tag. This is a significant win for sparse arrays where many values are exactly zero.
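To see the trade-off between the two u32 width tables, the data-stream size can be computed for both. This is a sketch with hypothetical helper names (`data_len_1234`, `data_len_0124`); it counts data bytes only, since the control stream is the same size for both variants.

```rust
/// Data bytes needed under the 1/2/3/4 width table (sketch).
pub fn data_len_1234(values: &[u32]) -> usize {
    values
        .iter()
        .map(|&v| match v {
            0..=0xFF => 1,
            0x100..=0xFFFF => 2,
            0x1_0000..=0xFF_FFFF => 3,
            _ => 4,
        })
        .sum()
}

/// Data bytes needed under the 0/1/2/4 width table (sketch).
pub fn data_len_0124(values: &[u32]) -> usize {
    values
        .iter()
        .map(|&v| match v {
            0 => 0,
            1..=0xFF => 1,
            0x100..=0xFFFF => 2,
            _ => 4,
        })
        .sum()
}
```

For a sparse input such as `[0, 0, 42, 0, 0, 255, 0]` the 0/1/2/4 table needs far fewer data bytes, while a value like 70 000 costs one byte more than it would under 1/2/3/4.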
Delta encoding
Delta encoding replaces each value with its difference from the previous one:
original: [ 1000, 1003, 1007, 1004, 1010 ]
↘ ↘ ↘ ↘
deltas: [ 1000, +3, +4, -3, +6 ]
The first delta is the first value itself (difference from an implicit zero, or from a caller-supplied carry). Every subsequent delta is values[i] - values[i-1].
For sequences where adjacent values are close together (sorted integers, time-series measurements, sensor readings), the deltas are much smaller than the raw values. Smaller values encode to fewer bytes in any variable-byte scheme.
When delta helps
| Data pattern | Example | Delta effect |
|---|---|---|
| Sorted / monotone | File offsets, timestamps | Deltas are small positive integers |
| Slowly drifting | Temperature readings | Deltas cluster near zero |
| Periodic / oscillating | ADC signal samples | Deltas small if bandwidth is limited |
| Uniformly random | Hash values | No benefit; deltas are as large as the values |
Delta encoding is a lossless, reversible transform. Decoding is a prefix sum: values[i] = deltas[0] + deltas[1] + … + deltas[i]. The serial dependency between elements is the main cost; see Performance for how the SIMD prefix-sum implementation handles it.
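The transform and its prefix-sum inverse can be sketched in scalar form as follows. The function names are hypothetical and the implicit initial value of zero is assumed; wrapping arithmetic keeps the round trip lossless even when a difference overflows the element type.

```rust
/// Replace each value with its difference from the previous one.
/// The first value is differenced against an implicit zero.
pub fn delta_encode(values: &[i32]) -> Vec<i32> {
    let mut prev = 0i32;
    values
        .iter()
        .map(|&v| {
            let d = v.wrapping_sub(prev);
            prev = v;
            d
        })
        .collect()
}

/// Inverse transform: a running (prefix) sum over the deltas.
pub fn delta_decode(deltas: &[i32]) -> Vec<i32> {
    let mut acc = 0i32;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}
```

The loop in `delta_decode` makes the serial dependency visible: each output needs the previous output before it can be computed.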
Signed vs unsigned
delta in svb is implemented for i16, i32, i64, u32, and u64. For non-monotone data (where values can decrease), use a signed type: some deltas will be negative, and a signed representation preserves that. For guaranteed non-decreasing sequences (file offsets, sorted timestamps), an unsigned type is fine and avoids the overhead of zigzag.
Zigzag encoding
Variable-byte codecs assign shorter encodings to smaller non-negative integers. A signed delta of −1 would be stored as 0xFFFFFFFF (4 bytes) in a u32 codec, with no compression at all.
Zigzag solves this by remapping signed integers to unsigned so that small absolute values map to small codes:
signed → unsigned
0 → 0
-1 → 1
+1 → 2
-2 → 3
+2 → 4
-3 → 5
+3 → 6
...
The formula is (n << 1) ^ (n >> (bits - 1)), where >> is an arithmetic (sign-extending) shift: two shifts and an XOR, with no branches. Decoding is (n >> 1) ^ -(n & 1), using a logical shift on the unsigned code. Both directions are branchless and LLVM auto-vectorizes them.
After zigzag, a signed delta of −1 becomes the unsigned value 1, which encodes in a single byte. A delta of +127 becomes 254, and −128 becomes 255, both still a single byte. Only deltas outside the range −128 to +127 spill into a second byte.
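For 32-bit values the two formulas translate to Rust as shown below. This is a standalone sketch with hypothetical function names; it relies on `>>` being an arithmetic shift for `i32` and a logical shift for `u32`, which is how Rust defines them.

```rust
/// Map a signed value to an unsigned code with small magnitudes first.
pub fn zigzag_encode(n: i32) -> u32 {
    // `n >> 31` is all ones for negative n and zero otherwise.
    ((n << 1) ^ (n >> 31)) as u32
}

/// Inverse mapping: the low bit carries the sign, the rest the magnitude.
pub fn zigzag_decode(z: u32) -> i32 {
    ((z >> 1) as i32) ^ -((z & 1) as i32)
}
```

Because neither function branches, a loop applying them over a slice is a good candidate for auto-vectorization.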
Composing the three
Delta → zigzag → StreamVByte is a standard pipeline for compressing integer sequences that are slowly varying or oscillating. Each stage does one job:
raw values
│
▼ delta encode
differences (signed, small magnitude)
│
▼ zigzag encode
differences (unsigned, small magnitude)
│
▼ StreamVByte encode
compact byte stream
Worked example
Five i16 ADC-style samples:
stage values
───────────────────────────────────────────────
raw [ 1000, 1003, 1007, 1004, 1010 ]
after delta [ 1000, 3, 4, -3, 6 ]
after zigzag [ 2000, 6, 8, 5, 12 ]
after SVB16 1 ctrl byte + 6 data bytes = 7 bytes (vs 10 raw bytes)
The first value (1000) stays large because it is the absolute anchor. The subsequent values (the deltas) all fit in a single byte after zigzag. In practice, for signals with small bandwidth relative to their absolute level, the per-value cost quickly drops to 1 byte once the anchor is amortised over the chunk.
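The size arithmetic of the composed pipeline can be checked with a short scalar sketch. `vbz_sketch_size` is a hypothetical helper, not the crate's `encode_vbz`; it assumes an implicit initial value of zero and 1-bit tags packed eight per control byte, and it only counts bytes rather than producing them.

```rust
/// Returns (control bytes, data bytes) for delta -> zigzag -> 1-bit-tag packing.
/// Hypothetical helper for illustration; it does not build the byte stream.
pub fn vbz_sketch_size(samples: &[i16]) -> (usize, usize) {
    let mut prev = 0i16;
    let mut data = 0usize;
    for &s in samples {
        // Delta against the previous sample (wrapping keeps it lossless).
        let d = s.wrapping_sub(prev);
        prev = s;
        // Zigzag: arithmetic shift by 15 spreads the sign bit.
        let z = ((d << 1) ^ (d >> 15)) as u16;
        // One data byte for codes up to 255, two otherwise.
        data += if z <= 0xFF { 1 } else { 2 };
    }
    ((samples.len() + 7) / 8, data)
}
```

Only the anchor value needs two data bytes in the worked example; every later delta fits in one.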
Choosing what to compose
| Data | Recipe |
|---|---|
| Sorted unsigned integers | delta → U32Classic or U64Coder1248 |
| Non-monotone integers | delta → zigzag → U32Classic |
| 16-bit oscillating signal | delta → zigzag → Svb16 (= the VBZ pipeline) |
| Sparse data with many zeros | U32Variant0124 alone, or delta first if it helps |
| Already-small unsigned values | U32Classic or Svb16 directly |
Further reading
- StreamVByte paper: Lemire, Kurz, Rupp. Stream VByte: Faster Byte-Oriented Integer Compression (2017). arxiv.org/abs/1709.08990
- Lemire's blog post introducing StreamVByte with benchmarks: lemire.me/blog/2017/09/27/stream-vbyte-breaking-new-speed-records-for-integer-compression/
- Zigzag encoding as used in Protocol Buffers (good concise reference): protobuf.dev/programming-guides/encoding/#signed-ints
Codec Variants
svb provides five codec variants spanning three element widths. Each is a zero-sized type implementing the same encode/decode surface.
| Variant | Element | Tag bits | Byte widths | Wire-compatible with |
|---|---|---|---|---|
| Svb16 | u16 | 1 | 1/2 | ONT vbz_hdf_plugin |
| U32Classic | u32 | 2 | 1/2/3/4 | Lemire C library, stream-vbyte crate |
| U32Variant0124 | u32 | 2 | 0/1/2/4 | Lemire "0124" variant |
| U64Coder1234 | u64 | 2 | 1/2/3/4 | streamvbyte64::Coder1234 (u32 values) |
| U64Coder1248 | u64 | 2 | 1/2/4/8 | streamvbyte64::Coder1248 |
Tag encoding
All u32 and u64 codecs pack four 2-bit tags into each control byte, LSB-first:
control byte n
bits 1:0 → tag for value 4n+0
bits 3:2 → tag for value 4n+1
bits 5:4 → tag for value 4n+2
bits 7:6 → tag for value 4n+3
Svb16 uses 1-bit tags and packs eight tags per control byte.
Buffer layout
All codecs use the same flat layout: control bytes first, data bytes immediately after.
[ ctrl[0] ctrl[1] ... ctrl[ceil(n/4)-1] | data bytes ... ]
The control stream length is always ceil(n / 4) bytes for 2-bit codecs, ceil(n / 8) for Svb16. No length prefix is stored; the caller supplies the element count to decode.
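The control-stream arithmetic is simple enough to sketch generically over the tag width. Both helpers below are hypothetical (they are not svb functions) and assume `tag_bits` is 1 or 2 with tags packed LSB-first.

```rust
/// Control-stream length in bytes for `n` values with `tag_bits`-wide tags.
pub fn control_len(n: usize, tag_bits: usize) -> usize {
    let tags_per_byte = 8 / tag_bits;
    (n + tags_per_byte - 1) / tags_per_byte
}

/// Extract the tag for value `i` from a control stream (LSB-first packing).
pub fn tag_at(ctrl: &[u8], i: usize, tag_bits: usize) -> u8 {
    let tags_per_byte = 8 / tag_bits;
    let shift = (i % tags_per_byte) * tag_bits;
    let mask = (1u8 << tag_bits) - 1;
    (ctrl[i / tags_per_byte] >> shift) & mask
}
```

A flat buffer can then be split at `control_len(n, tag_bits)` to obtain the control and data streams.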
SVB16
Svb16 compresses u16 values using 1-bit tags. Each value is stored in either 1 byte (values 0–255) or 2 bytes (values 256–65535). Eight tags share one control byte.
This is the codec used in the VBZ pipeline for Oxford Nanopore POD5 signal data.
Tag table
| Tag | Byte width | Value range |
|---|---|---|
| 0 | 1 | 0–255 |
| 1 | 2 | 256–65535 |
Example
fn main() {
    use svb::u16::Svb16;

    let values: Vec<u16> = vec![1, 300, 0, 65000];
    let encoded = Svb16.encode(&values);
    let decoded = Svb16.decode(&encoded, values.len()).unwrap();
    assert_eq!(decoded, values);
}
Control stream layout
Tags are packed 8 per byte, LSB-first. For n values the control stream is ceil(n / 8) bytes.
control byte k → tags for values 8k+0 through 8k+7
bit 0 = tag for value 8k+0
bit 1 = tag for value 8k+1
...
bit 7 = tag for value 8k+7
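A scalar round trip for this layout might look like the sketch below. The function names are hypothetical and the code is not the crate's implementation; in particular, little-endian order for the 2-byte case is an assumption made here by analogy with the u32 wire format.

```rust
/// Scalar sketch of 1-bit-tag encoding: control bytes first, then data.
pub fn svb16_encode(values: &[u16]) -> Vec<u8> {
    let ctrl_len = (values.len() + 7) / 8;
    let mut out = vec![0u8; ctrl_len];
    for (i, &v) in values.iter().enumerate() {
        if v <= 0xFF {
            out.push(v as u8); // tag bit stays 0
        } else {
            out[i / 8] |= 1 << (i % 8); // tag bit 1 marks a 2-byte value
            out.extend_from_slice(&v.to_le_bytes()); // assumed little-endian
        }
    }
    out
}

/// Scalar sketch of the matching decoder; returns None on truncated input.
pub fn svb16_decode(encoded: &[u8], n: usize) -> Option<Vec<u16>> {
    let ctrl_len = (n + 7) / 8;
    if encoded.len() < ctrl_len {
        return None;
    }
    let (ctrl, data) = encoded.split_at(ctrl_len);
    let mut pos = 0;
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        let width = 1 + ((ctrl[i / 8] >> (i % 8)) & 1) as usize;
        let bytes = data.get(pos..pos + width)?;
        let hi = if width == 2 { bytes[1] } else { 0 };
        out.push(u16::from_le_bytes([bytes[0], hi]));
        pos += width;
    }
    Some(out)
}
```

The decoder needs the element count `n` from the caller because the buffer carries no length prefix.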
U32Classic
U32Classic is the original Lemire StreamVByte variant for u32 values. Each value is stored in 1–4 bytes depending on its magnitude.
Wire-compatible with the Lemire C library and the stream-vbyte crate.
Tag table
| Tag | Byte width | Value range |
|---|---|---|
| 0 | 1 | 0–255 |
| 1 | 2 | 256–65535 |
| 2 | 3 | 65536–16777215 |
| 3 | 4 | 16777216–4294967295 |
Example
fn main() {
    use svb::u32::U32Classic;

    let values: Vec<u32> = vec![1, 500, 70_000, 16_000_000];
    let encoded = U32Classic.encode(&values);
    let decoded = U32Classic.decode(&encoded, values.len()).unwrap();
    assert_eq!(decoded, values);
}
Wire format example
Encoding [1, 256, 65536, 0xFFFFFFFF] produces:
byte 0: 0xE4 control byte (tags: 0, 1, 2, 3 packed LSB-first → 0b11_10_01_00)
byte 1: 0x01 value 0: 1 (1 byte)
bytes 2-3: 0x00 0x01 value 1: 256 (2 bytes, little-endian)
bytes 4-6: 0x00 0x00 0x01 value 2: 65536 (3 bytes)
bytes 7-10: 0xFF 0xFF 0xFF 0xFF value 3: 4294967295 (4 bytes)
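Reading the bytes back is the mirror image of the layout above. The following scalar decoder is a sketch under the flat "control bytes first" convention; `decode_classic` is a hypothetical name and makes no claim about how the crate's SIMD paths are written.

```rust
/// Scalar sketch of a 1/2/3/4-byte decoder; returns None on truncated input.
pub fn decode_classic(encoded: &[u8], n: usize) -> Option<Vec<u32>> {
    let ctrl_len = (n + 3) / 4;
    if encoded.len() < ctrl_len {
        return None;
    }
    let (ctrl, data) = encoded.split_at(ctrl_len);
    let mut pos = 0;
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        // Two tag bits per value, LSB-first; width = tag + 1.
        let tag = (ctrl[i / 4] >> ((i % 4) * 2)) & 0b11;
        let width = tag as usize + 1;
        let bytes = data.get(pos..pos + width)?;
        // Zero-extend the little-endian bytes to a full u32.
        let mut le = [0u8; 4];
        le[..width].copy_from_slice(bytes);
        out.push(u32::from_le_bytes(le));
        pos += width;
    }
    Some(out)
}
```

Feeding it the eleven bytes from the wire format example, with `n = 4`, recovers the four input values.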
When to use
U32Classic is the right default for general u32 compression and any context where wire compatibility with the C library matters. For data with many exact zeros, U32Variant0124 compresses better.
U32Variant0124
U32Variant0124 is an alternative u32 codec where zero values consume no data bytes at all. The byte-width options are 0, 1, 2, or 4. There is no 3-byte option.
Wire-compatible with the Lemire "0124" variant and streamvbyte64::Coder0124.
Tag table
| Tag | Byte width | Value range |
|---|---|---|
| 0 | 0 | 0 (exactly) |
| 1 | 1 | 1–255 |
| 2 | 2 | 256–65535 |
| 3 | 4 | 65536–4294967295 |
Note that values in the range 65536–16777215 require 4 bytes (not 3), which is worse than U32Classic for that range. The benefit comes from sparse data where many values are zero.
Example
fn main() {
    use svb::u32::U32Variant0124;

    // Zero-valued elements cost 0 bytes in the data stream.
    let values: Vec<u32> = vec![0, 0, 42, 0, 0, 255, 0];
    let encoded = U32Variant0124.encode(&values);
    let decoded = U32Variant0124.decode(&encoded, values.len()).unwrap();
    assert_eq!(decoded, values);
}
When to use
Use U32Variant0124 when a significant fraction of values are exactly zero, such as sparse histograms, run-length-style data, or delta-encoded sorted lists where many differences are zero. For general data with few zeros, U32Classic is typically better because it can use 3 bytes for values in the 65536–16777215 range.
U64Coder1234
U64Coder1234 stores u64 values using 1–4 bytes, the same byte-width table as U32Classic. Values must fit within u32::MAX (4294967295); values above that are silently truncated on encode.
The wire format is identical to U32Classic; only the element type differs. This means a U64Coder1234-encoded buffer can be decoded by U32Classic. When decoding with U64Coder1234 itself, each value is zero-extended to u64.
Tag table
| Tag | Byte width | Value range |
|---|---|---|
| 0 | 1 | 0–255 |
| 1 | 2 | 256–65535 |
| 2 | 3 | 65536–16777215 |
| 3 | 4 | 16777216–4294967295 |
Example
fn main() {
    use svb::u64::U64Coder1234;

    let values: Vec<u64> = vec![1, 500, 70_000, u32::MAX as u64];
    // check_range returns the index of the first out-of-range value, if any.
    assert_eq!(U64Coder1234.check_range(&values), None);
    let encoded = U64Coder1234.encode(&values);
    let decoded = U64Coder1234.decode(&encoded, values.len()).unwrap();
    assert_eq!(decoded, values);
}
Range checking
Call check_range before encoding if the input may contain values above u32::MAX:
fn main() {
    use svb::u64::U64Coder1234;

    let values: Vec<u64> = vec![1, u64::MAX, 3];
    if let Some(idx) = U64Coder1234.check_range(&values) {
        eprintln!("value at index {idx} exceeds u32::MAX");
    }
}
When to use
Use U64Coder1234 when your data is logically u64 but all values fit within 32 bits. It produces a more compact encoding than U64Coder1248 for such data because values never consume 8 bytes.
U64Coder1248
U64Coder1248 stores u64 values using 1, 2, 4, or 8 bytes. It covers the full u64 range without truncation.
Wire-compatible with streamvbyte64::Coder1248.
Tag table
| Tag | Byte width | Value range |
|---|---|---|
| 0 | 1 | 0–255 |
| 1 | 2 | 256–65535 |
| 2 | 4 | 65536–4294967295 |
| 3 | 8 | 4294967296–18446744073709551615 |
Note there is no 3-byte option. Values in the range 65536–16777215 require 4 bytes.
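The tag table maps directly onto a range match. The helper below is a hypothetical sketch (not an svb function) that returns the tag and byte width chosen for a given u64 under the 1/2/4/8 scheme.

```rust
/// Returns (tag, byte width) for a value under the 1/2/4/8 width table.
/// Hypothetical helper for illustration.
pub fn tag_and_width_1248(v: u64) -> (u8, usize) {
    match v {
        0..=0xFF => (0, 1),
        0x100..=0xFFFF => (1, 2),
        0x1_0000..=0xFFFF_FFFF => (2, 4),
        _ => (3, 8),
    }
}
```

Each width is double the previous one, which is why values just above 65535 jump straight to 4 bytes and values just above u32::MAX jump to 8.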
Example
fn main() {
    use svb::u64::U64Coder1248;

    let values: Vec<u64> = vec![1, 500, 1 << 32, u64::MAX];
    let encoded = U64Coder1248.encode(&values);
    let decoded = U64Coder1248.decode(&encoded, values.len()).unwrap();
    assert_eq!(decoded, values);
}
When to use
Use U64Coder1248 when values may exceed u32::MAX. If all values fit within 32 bits, U64Coder1234 gives better compression because it can use 3 bytes for values in the 65536–16777215 range.
Delta and Zigzag
Delta and zigzag are independent slice transforms. They are not tied to any specific codec; you compose them with whichever codec suits your data.
Delta encoding
Delta encoding replaces each value with the difference from the previous value. This is effective for sorted or slowly-varying data: the differences are small even when the raw values are large.
fn main() {
    use svb::{delta, u64::U64Coder1248};

    // Sorted u64 timestamps; differences are small positive numbers.
    let timestamps: Vec<u64> = vec![1_000_000, 1_001_500, 1_003_000, 1_010_000];
    let deltas = delta::encode(&timestamps);
    let encoded = U64Coder1248.encode(&deltas);

    let decoded_deltas = U64Coder1248.decode(&encoded, deltas.len()).unwrap();
    let recovered = delta::decode(&decoded_deltas);
    assert_eq!(recovered, timestamps);
}
delta is implemented for i16, i32, i64, u32, and u64. Use a signed type when the sequence is non-monotone and you intend to follow with zigzag; use an unsigned type for sorted data where all differences are non-negative.
Streaming / chunked delta
For streaming use-cases where data arrives in chunks, use encode_with_initial and decode_with_initial to carry the boundary value across chunks:
fn main() {
    use svb::delta;

    let chunk_a: Vec<u32> = vec![100, 105, 110];
    let chunk_b: Vec<u32> = vec![115, 120, 125];

    // Encode chunk A; initial value is 0.
    let (deltas_a, last_a) = delta::encode_with_initial(0, &chunk_a);
    // Encode chunk B using the last value from chunk A as the initial.
    let (deltas_b, _last_b) = delta::encode_with_initial(last_a, &chunk_b);

    // Decode chunk A; initial value is 0.
    let (recovered_a, last_a) = delta::decode_with_initial(0, &deltas_a);
    // Decode chunk B using the boundary value.
    let (recovered_b, _) = delta::decode_with_initial(last_a, &deltas_b);

    assert_eq!(recovered_a, chunk_a);
    assert_eq!(recovered_b, chunk_b);
}
Zigzag encoding
Zigzag maps signed integers to unsigned integers so that small absolute values map to small codes. This allows signed differences from delta encoding to be compressed efficiently by any unsigned codec.
0 → 0
-1 → 1
1 → 2
-2 → 3
2 → 4
...
fn main() {
    use svb::{delta, zigzag, u32::U32Classic};

    // Arbitrary i32 data: delta, then zigzag, then U32Classic.
    let samples: Vec<i32> = vec![-500, 200, -100, 900];
    let deltas: Vec<i32> = delta::encode(&samples);
    let codes: Vec<u32> = zigzag::encode(&deltas);
    let encoded = U32Classic.encode(&codes);

    let decoded_codes = U32Classic.decode(&encoded, codes.len()).unwrap();
    let decoded_deltas: Vec<i32> = zigzag::decode(&decoded_codes);
    let recovered: Vec<i32> = delta::decode(&decoded_deltas);
    assert_eq!(recovered, samples);
}
zigzag is implemented for i16 (producing u16), i32 (producing u32), and i64 (producing u64).
VBZ Pipeline
The VBZ pipeline compresses 16-bit ADC signal data as used in Oxford Nanopore POD5 files. It chains four stages:
raw i16 samples
→ delta encode (1st-order differences; small values dominate)
→ zigzag encode (signed i16 → unsigned u16; small |values| → small codes)
→ SVB16 encode (StreamVByte-16: 1-bit control stream)
→ zstd (outer entropy coding; NOT part of this crate)
The svb crate handles stages 1–3. The outer zstd layer is left to the caller.
High-level API
fn main() {
    use svb::{encode_vbz, decode_vbz};

    let samples: Vec<i16> = vec![100, 101, 103, 102, 98];

    // Encode: i16 → delta → zigzag → SVB16 bytes
    let encoded = encode_vbz(&samples);

    // Decode: SVB16 bytes → zigzag → delta → i16
    let decoded = decode_vbz(&encoded, samples.len()).unwrap();
    assert_eq!(decoded, samples);
}
Low-level / into variants
For zero-allocation usage or building larger buffers:
fn main() {
    use svb::{encode_vbz_into, decode_vbz_into};

    let samples: Vec<i16> = vec![100, 101, 103, 102, 98];

    let mut buf: Vec<u8> = Vec::new();
    encode_vbz_into(&samples, &mut buf);

    let mut out: Vec<i16> = Vec::new();
    decode_vbz_into(&buf, samples.len(), &mut out).unwrap();
    assert_eq!(out, samples);
}
SVB-ZD Pipeline
The SVB-ZD pipeline compresses 16-bit signal data into a compact byte stream. It is wire-compatible with the slow5lib SLOW5_COMPRESS_SVB_ZD format used in BLOW5 files. Within this crate the pipeline chains three stages; as with VBZ, an outer zstd pass is left to the caller:
raw i16 samples
→ widen to i32
→ fused zigzag-delta (1st-order differences → zigzag32 to make values unsigned)
→ U32Classic encode (2-bit control stream, 1–4 bytes per value)
→ zstd (outer entropy coding; NOT part of this crate)
The key difference from VBZ is the element width: SVB-ZD widens i16 → i32 before the zigzag-delta step, so it uses the U32Classic codec rather than SVB16. This costs one extra bit of tag width but removes SVB16's 2-byte cap: values that overflow i16 after delta (e.g. baseline resets) are encoded correctly without truncation.
High-level API
fn main() {
    use svb::{encode_svbzd, decode_svbzd};

    let samples: Vec<i16> = vec![100, 101, 103, 102, 98];

    // Encode: i16 → widen → zigzag-delta → U32Classic bytes
    let encoded = encode_svbzd(&samples);

    // Decode: U32Classic bytes → unzigzag-undelta → i16
    let decoded = decode_svbzd(&encoded, samples.len()).unwrap();
    assert_eq!(decoded, samples);
}
Low-level / into variants
For zero-allocation usage or appending to an existing buffer:
fn main() {
    use svb::{encode_svbzd_into, decode_svbzd_into};

    let samples: Vec<i16> = vec![100, 101, 103, 102, 98];

    let mut buf: Vec<u8> = Vec::new();
    encode_svbzd_into(&samples, &mut buf);

    let mut out: Vec<i16> = Vec::new();
    decode_svbzd_into(&buf, samples.len(), &mut out).unwrap();
    assert_eq!(out, samples);
}
Fused decode
decode_svbzd_fused collapses all three decode stages (U32Classic, unzigzag, undelta) into a single SIMD loop. This avoids intermediate buffers and is the preferred path for high-throughput BLOW5 reads:
fn main() {
    use svb::{decode_svbzd_fused, encode_svbzd};

    let samples: Vec<i16> = vec![100, 101, 103, 102, 98];
    let encoded = encode_svbzd(&samples);
    let decoded = decode_svbzd_fused(&encoded, samples.len()).unwrap();
}
decode_svbzd_fused_into appends into an existing Vec<i16>.
Parallel decode with fused_from
decode_svbzd_fused_from accepts a caller-supplied initial carry value, enabling independent decoding of any sub-stream that starts at a known split point. This is the building block for parallel decoding:
use svb::decode_svbzd_fused_from;

// Fragment: stream_b, n, n_half, and mid_carry are supplied by the caller.
// Decode second half independently, with known carry from midpoint
let half_b = decode_svbzd_fused_from(&stream_b, n - n_half, mid_carry).unwrap();
The _into variant (decode_svbzd_fused_from_into) appends into an existing Vec<i16>.
Wire format
The encoded byte layout is identical to a U32Classic-encoded Vec<u32> where each u32 is zigzag32(samples[i].widened() - samples[i-1].widened()). There is no additional header; the caller is responsible for tracking n (the number of original i16 samples).
The zigzag32 mapping is (delta << 1) ^ (delta >> 31), the same convention used by slow5lib.
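The per-element transform described above can be sketched in scalar form. The function names are hypothetical, and the sketch assumes the first sample is differenced against an implicit zero; the crate and slow5lib may treat the first element differently, so take this as an illustration of the formula rather than a byte-exact reference.

```rust
/// Widen to i32, take first-order differences, then apply zigzag32.
/// Assumes an implicit initial value of zero (an assumption of this sketch).
pub fn zigzag_delta_encode(samples: &[i16]) -> Vec<u32> {
    let mut prev = 0i32;
    samples
        .iter()
        .map(|&s| {
            let cur = i32::from(s);
            // Cannot overflow: both operands fit in i16.
            let delta = cur - prev;
            prev = cur;
            ((delta << 1) ^ (delta >> 31)) as u32
        })
        .collect()
}

/// Inverse: unzigzag each code, then accumulate a running sum.
pub fn zigzag_delta_decode(codes: &[u32]) -> Vec<i16> {
    let mut acc = 0i32;
    codes
        .iter()
        .map(|&z| {
            let delta = ((z >> 1) as i32) ^ -((z & 1) as i32);
            acc = acc.wrapping_add(delta);
            acc as i16
        })
        .collect()
}
```

Because the differences are computed in i32, a jump from i16::MIN to i16::MAX is represented exactly instead of wrapping.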
SIMD Backends
svb provides SIMD-accelerated encode and decode for all codec variants. The scalar path is always compiled and serves as the correctness reference.
Available backends
| Backend | Feature flag | Architecture | ISA |
|---|---|---|---|
| SSSE3 | simd-ssse3 | x86-64 | SSE2 + SSSE3 |
| AVX2 | simd-avx2 | x86-64 | AVX2 |
| NEON | simd-neon | AArch64 | NEON |
| Auto | simd-auto | both | runtime detection |
simd-auto
simd-auto detects the best available path at runtime using is_x86_feature_detected! on x86-64 and unconditional NEON on AArch64. This is the recommended flag for most users.
On x86-64, simd-auto selects AVX2 if available, then SSSE3, then scalar. On AArch64, NEON is always selected (NEON is mandatory on AArch64).
simd-auto requires std for runtime CPU detection. In no_std contexts, use a compile-time flag instead.
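The selection order described above can be sketched with the standard library's detection macro. The `Backend` enum and `best_backend` function are hypothetical names for illustration and do not reflect how svb structures its dispatch internally.

```rust
/// Hypothetical enum naming the code paths discussed on this page.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Avx2,
    Ssse3,
    Neon,
    Scalar,
}

/// x86-64: prefer AVX2, then SSSE3, then the scalar fallback.
#[cfg(target_arch = "x86_64")]
pub fn best_backend() -> Backend {
    if std::arch::is_x86_feature_detected!("avx2") {
        Backend::Avx2
    } else if std::arch::is_x86_feature_detected!("ssse3") {
        Backend::Ssse3
    } else {
        Backend::Scalar
    }
}

/// AArch64: NEON is mandatory, so no runtime check is needed.
#[cfg(target_arch = "aarch64")]
pub fn best_backend() -> Backend {
    Backend::Neon
}

/// Any other architecture falls back to scalar code.
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
pub fn best_backend() -> Backend {
    Backend::Scalar
}
```

A dispatcher would typically call such a function once and cache the result, since the answer cannot change while the process is running.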
Compile-time flags
simd-avx2, simd-ssse3, and simd-neon compile in the SIMD path and assume it is available at runtime. These are appropriate when the target CPU is known:
# Cross-compile to a known AVX2 target
svb = { version = "0.2", features = ["simd-avx2"] }
or with RUSTFLAGS="-C target-cpu=native" where the build host and run host are the same.
Pipeline coverage
SIMD paths are provided for individual codec variants and for both high-level pipelines:
- VBZ pipeline (encode_vbz / decode_vbz_fused): fused SVB16 + zigzag + delta in a single SIMD loop on x86-64 (SSSE3/AVX2) and AArch64 (NEON).
- SVB-ZD pipeline (encode_svbzd / decode_svbzd_fused): fused U32Classic + unzigzag + undelta. Encode computes zigzag-delta inline via SIMD (eliminating the intermediate Vec<u32> allocation); decode collapses all three stages into one SIMD loop.
Decode throughput
With simd-auto on a modern x86-64 machine, decode throughput for all codec variants is in the range of 1.3–4 GB/s depending on variant and input size. See Performance for detailed numbers.
no_std Support
svb supports no_std environments with a global allocator. This covers microcontrollers and embedded targets, WebAssembly modules, and OS-level code such as bootloaders or kernel modules.
Setup
Disable the default std feature and enable alloc:
svb = { version = "0.2", default-features = false, features = ["alloc"] }
All encode and decode APIs are available. The delta and zigzag transforms are also fully available.
SIMD in no_std
Runtime SIMD detection (simd-auto) requires std for is_x86_feature_detected!. In no_std contexts, use a compile-time SIMD flag instead:
# no_std with compile-time NEON (AArch64 embedded target)
svb = { version = "0.2", default-features = false, features = ["alloc", "simd-neon"] }
# no_std with compile-time AVX2
svb = { version = "0.2", default-features = false, features = ["alloc", "simd-avx2"] }
Wire Compatibility
svb is wire-compatible with the reference C implementations and the streamvbyte64 Rust crate. This means a buffer encoded by svb can be decoded by the C library and vice versa.
Compatibility table
| svb variant | Compatible with |
|---|---|
| U32Classic | Lemire C streamvbyte library, streamvbyte64::Coder1234 |
| U32Variant0124 | Lemire C "0124" variant, streamvbyte64::Coder0124 |
| U64Coder1234 | streamvbyte64::Coder1234 (u32 values only) |
| U64Coder1248 | streamvbyte64::Coder1248 |
| Svb16 | ONT vbz_hdf_plugin SVB16 layer |
| SVB-ZD pipeline (encode_svbzd / decode_svbzd_fused) | hasindu2008/slow5lib SLOW5_COMPRESS_SVB_ZD (BLOW5 files) |
Buffer layout difference
streamvbyte64 keeps tags and data in separate buffers. svb concatenates them (tags first). When exchanging data with streamvbyte64, split or join buffers at the control stream boundary:
// svb flat → streamvbyte64 separate buffers
// (2-bit-tag codecs: four tags per control byte)
fn split_flat(encoded: &[u8], n: usize) -> (&[u8], &[u8]) {
    let ctrl_len = n.div_ceil(4);
    (&encoded[..ctrl_len], &encoded[ctrl_len..])
}

// streamvbyte64 separate buffers → svb flat
fn join_flat(tags: &[u8], data: &[u8]) -> Vec<u8> {
    let mut flat = tags.to_vec();
    flat.extend_from_slice(data);
    flat
}
Verification
Wire compatibility is verified in tests/compat.rs by round-tripping data in both directions: svb encodes and streamvbyte64 decodes, then streamvbyte64 encodes and svb decodes. These tests run in CI for all four compatible codec pairs.
Performance
Benchmarks were measured on GitHub Actions ubuntu-latest (Azure x86-64, AVX2) and ubuntu-24.04-arm (AArch64, NEON) using cargo bench --bench decode with --sample-size 20. All throughput numbers are in GB/s of input integers (Melem/s × bytes-per-element ÷ 1000).
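The unit conversion in parentheses is easy to apply to your own criterion output. A minimal sketch, with `gb_per_s` as a hypothetical helper name:

```rust
/// Convert an element rate in Melem/s to GB/s of decoded integers.
/// `bytes_per_elem` is 2 for i16/u16, 4 for u32, 8 for u64.
pub fn gb_per_s(melem_per_s: f64, bytes_per_elem: usize) -> f64 {
    melem_per_s * bytes_per_elem as f64 / 1000.0
}
```

For 16-bit data the GB/s figure is therefore twice the Gelem/s figure.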
VBZ pipeline breakdown
At 8192 i16 elements with simd-avx2, each stage measured in isolation:
| Stage | encode | decode |
|---|---|---|
| delta | 29.2 GB/s | 3.70 GB/s |
| zigzag | 34.2 GB/s | 28.0 GB/s |
| SVB16 (mixed) | 9.24 GB/s | 9.42 GB/s |
| VBZ (combined, 3-pass) | 5.70 GB/s | 2.42 GB/s |
| VBZ fused decode | N/A | 3.68 GB/s |
| VBZ2 fused 2-chain decode | N/A | 5.62 GB/s |
Zigzag is essentially free (pure bitwise ops, LLVM auto-vectorizes). Delta encode expresses adjacent differences as two overlapping slice views, which LLVM auto-vectorizes to around 29 GB/s with no unsafe code. Delta decode uses an explicit SIMD prefix-sum (SSE2/NEON); the serial carry chain between 8-element blocks limits single-stream throughput to around 3.70 GB/s, essentially the theoretical ceiling for this algorithm.
Fused VBZ decode
decode_vbz_fused collapses all three decode stages into a single SIMD loop. The
SVB16 shuffle and zigzag bitwise ops (~5–6 cycles per 8-element block) execute
during the delta carry-chain stall (~8 cycles), hiding nearly all of their cost.
| Decoder | Decode throughput |
|---|---|
| decode_vbz (3 separate passes) | 2.42 GB/s |
| decode_vbz_fused (single SIMD pass) | 3.68 GB/s |
decode_vbz_fused is 1.52× faster than the three-pass pipeline. The fused path reaches 99% of the delta-alone ceiling (3.70 GB/s): SVB16 and zigzag are effectively free, and the delta carry chain is the only remaining bottleneck.
VBZ2: format-extension 2-chain decode
encode_vbz2 / decode_vbz2 extend the VBZ format with a 6-byte header that
enables a two-chain fused decode with no pre-scan required:
[mid_carry: i16 LE][mid_data_offset: u32 LE][standard VBZ payload]
mid_carry is samples[n_half - 1], the decoded sample at the chunk midpoint,
i.e., the prefix sum of all deltas before the midpoint. mid_data_offset is the
count of data bytes consumed by the first n_half elements (sum of 8 + popcnt(ctrl_byte) over the first half of control bytes). Both are computed in
O(n) during encode with negligible cost.
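The header fields can be computed and parsed with a few lines of scalar code. This is a sketch based only on the layout given above; the function names are hypothetical, and it assumes the midpoint falls on a control-byte boundary (n_half is a multiple of 8), which the text does not state explicitly.

```rust
/// Data bytes consumed by the first `n_half` elements: each control byte
/// covers 8 values, each costing 1 byte plus 1 more when its tag bit is set.
/// Assumes `n_half` is a multiple of 8 (an assumption of this sketch).
pub fn mid_data_offset(ctrl: &[u8], n_half: usize) -> u32 {
    assert!(n_half % 8 == 0, "sketch assumes a control-byte-aligned midpoint");
    ctrl[..n_half / 8].iter().map(|c| 8 + c.count_ones()).sum()
}

/// Serialise the 6-byte header: i16 LE carry, then u32 LE offset.
pub fn write_header(mid_carry: i16, mid_data_offset: u32) -> [u8; 6] {
    let mut h = [0u8; 6];
    h[..2].copy_from_slice(&mid_carry.to_le_bytes());
    h[2..].copy_from_slice(&mid_data_offset.to_le_bytes());
    h
}

/// Parse the header and return (mid_carry, mid_data_offset, payload).
pub fn read_header(buf: &[u8]) -> Option<(i16, u32, &[u8])> {
    if buf.len() < 6 {
        return None;
    }
    let mid_carry = i16::from_le_bytes([buf[0], buf[1]]);
    let offset = u32::from_le_bytes([buf[2], buf[3], buf[4], buf[5]]);
    Some((mid_carry, offset, &buf[6..]))
}
```

With the offset in hand, a decoder can locate the second half of the data stream without scanning the first half of the control bytes.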
At decode time the payload is split into two independent half-streams. Two carry chains run interleaved in one SIMD loop: the CPU's out-of-order engine overlaps chain A's carry-extract latency with chain B's prefix-sum arithmetic. Port-5 usage is unchanged from single-chain (10 ops per 16 elements), so there is no throughput regression at any size; the gain accumulates only where the carry latency was the limiting factor.
| | decode throughput |
|---|---|
| decode_vbz (3 separate passes) | 2.42 GB/s |
| decode_vbz_fused (single SIMD pass) | 3.68 GB/s |
| decode_vbz2 (format-extension 2-chain) | 5.62 GB/s |
1.53× over single-chain fused at 8192 elements: interleaving the two carry chains hides most of the serial dependency cost.
The real payoff is multi-threaded decoding: with mid_data_offset known
up-front, both half-streams are independent and can run on separate cores. The
format overhead is 6 bytes per chunk regardless of chunk size, negligible for
any practical payload.
Caller-side parallel decode
decode_vbz_fused_from_into(data, n, initial_carry, out) exposes the single-chain
fused decoder with a caller-supplied initial carry, making it possible to decode
any half-stream independently. A caller that manages its own thread pool simply
splits the VBZ2 payload and dispatches both halves concurrently:
let (out_a, out_b) = std::thread::scope(|s| {
    let ha = s.spawn(|| decode_vbz_fused_from(&stream_a, n_half, 0));
    let hb = s.spawn(|| decode_vbz_fused_from(&stream_b, n - n_half, mid_carry));
    (ha.join().unwrap(), hb.join().unwrap())
});
Decoding 64 × 8192-element chunks in parallel (64 half-A streams on thread 1, 64 half-B streams on thread 2); run locally for hardware-specific numbers:
| | decode throughput |
|---|---|
| decode_vbz_fused (single chain, 1 thread) | 1.84 Gelem/s |
| decode_vbz2 (2-chain interleaved, 1 thread) | 2.81 Gelem/s |
| decode_vbz_fused_from_into (2 threads, batch of 64) | hardware-dependent |
Multi-threaded throughput is highly sensitive to CPU core count, L2/L3 topology,
and scheduler behaviour. The single-thread numbers above are from GitHub Actions
CI (Azure x86-64). For two-thread measurements run the vbz2_parallel criterion
benchmark locally: cargo bench --features simd-avx2 --bench decode -- vbz2_parallel.
With distinct chunks from independent nanopore reads (the realistic production case)
the two streams share no cache lines and the speedup approaches 2×.
VBZ-K: generalised K-stream parallel decode
encode_vbzk(samples, k) / decode_vbzk_parallel_into(data, n, out) generalise
VBZ2 to K independent sub-streams. The header stores K−1 split points:
[k: u8][(carry_i: i16 LE, data_offset_i: u32 LE) for i in 1..k][VBZ payload]
Header overhead: 1 + (K−1) × 6 bytes. Each sub-chunk has n_sub = (n/K) & !7
elements; the last sub-chunk takes the remainder. Split-point carries and data
offsets are computed in O(n) at encode time with negligible overhead.
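The size arithmetic above is small enough to write out. The helpers below are hypothetical names for illustration and only restate the two formulas from the text; they assume `k >= 1`.

```rust
/// Header size in bytes: one byte for K, then K-1 entries of
/// (carry: i16, data_offset: u32), i.e. 6 bytes each.
pub fn vbzk_header_len(k: usize) -> usize {
    assert!(k >= 1, "k must be at least 1");
    1 + (k - 1) * 6
}

/// Element count of each sub-chunk: the first K-1 sub-chunks get
/// `(n / k) & !7` elements (rounded down to a multiple of 8), and the
/// last sub-chunk takes whatever remains.
pub fn sub_chunk_lens(n: usize, k: usize) -> Vec<usize> {
    assert!(k >= 1, "k must be at least 1");
    let n_sub = (n / k) & !7;
    let mut lens = vec![n_sub; k - 1];
    lens.push(n - n_sub * (k - 1));
    lens
}
```

For example, `n = 10_000` with `k = 4` gives three sub-chunks of 2496 elements and a final one of 2512, with a 19-byte header.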
Benchmarked at N=8192 with a batch of 64 chunks per thread (amortising thread scope overhead). Multi-threaded results are hardware-dependent, so run locally for specific numbers:
| | throughput | vs single-chain |
|---|---|---|
| single-chain fused (k=1) | 1.84 Gelem/s | 1.00× |
| VBZ-K k=2 (2 threads) | hardware-dependent | N/A |
| VBZ-K k=4 (4 threads) | hardware-dependent | N/A |
| VBZ-K k=8 (8 threads) | hardware-dependent | N/A |
Multi-threaded throughput is not reliably measurable in shared CI environments.
Run cargo bench --features simd-avx2 --bench decode -- vbzk_parallel locally
for hardware-specific numbers.
At this chunk size, expect k=4 to roughly match k=2 and k=8 to regress, because 8 threads decoding 1024-element sub-streams run into thread-scope overhead and scheduler jitter. With distinct real-world POD5 chunks (6 000–12 000 samples each), larger sub-stream sizes would push k=8 above k=4.
The full POD5 pipeline bottleneck
A POD5 reader decodes: disk → zstd decompress → VBZ decode → i16 samples. On a typical NVMe system (~6.5 GB/s sequential read):
- Disk: 6.5 GB/s × ~3× zstd ratio = ~19.5 GB/s of decoded signal capacity
- VBZ single-chain (AVX2): 1.84 Gelem/s × 2 bytes = 3.68 GB/s of decoded signal
- VBZ-K k=4: scales roughly linearly with cores up to the zstd bottleneck
- zstd single-core: ~1.5–2 GB/s compressed ≈ 2–3 Gelem/s, the real bottleneck for a single-threaded reader
The disk is rarely the bottleneck. A single-threaded reader is zstd-limited. Parallelising VBZ decode with VBZ-K removes the VBZ ceiling and shifts the bottleneck back to zstd. To saturate NVMe bandwidth you need multi-threaded zstd AND VBZ-K simultaneously.
Delta decode: the 2-chain approach
Delta decode is a serial prefix sum: each output element depends on all previous elements. On x86_64 the SSE2 path processes 8 elements per iteration with a carry chain of ~8 cycles (extract + broadcast + add). We are already at the theoretical single-stream ceiling.
delta::decode_2chain breaks this by decoding two independent sub-streams simultaneously. The CPU's out-of-order engine hides one chain's carry latency behind the other's prefix-sum arithmetic, delivering 1.78× throughput:
| | decode throughput |
|---|---|
| delta::decode_into (single stream) | 3.70 GB/s |
| delta::decode_2chain (two streams) | 6.58 GB/s |
1.78× throughput with two interleaved chains. This requires one extra i16
stored per chunk: the running delta sum at the midpoint (computed by
delta::mid_carry during encode, 2 bytes overhead). Each additional sub-chunk
adds another 2-byte carry value and enables one more independent decode stream.
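The two-chain idea can be shown without SIMD. The sketch below is a scalar illustration under stated assumptions, not the crate's `delta::decode_2chain` or `delta::mid_carry`; the `_sketch` names and signatures are hypothetical. Two running sums advance in the same loop, and because neither depends on the other, the CPU can overlap their latencies.

```rust
/// Running delta sum at the split point; this is the extra i16 stored at encode time.
pub fn mid_carry_sketch(deltas: &[i16], n_half: usize) -> i16 {
    deltas[..n_half]
        .iter()
        .fold(0i16, |acc, &d| acc.wrapping_add(d))
}

/// Decode `deltas` as two independent chains split at `n_half`.
/// Chain A starts from 0, chain B starts from `mid_carry`.
pub fn decode_2chain_sketch(deltas: &[i16], n_half: usize, mid_carry: i16) -> Vec<i16> {
    let mut out = vec![0i16; deltas.len()];
    let (da, db) = deltas.split_at(n_half);
    let (oa, ob) = out.split_at_mut(n_half);
    let (mut a, mut b) = (0i16, mid_carry);

    // Interleaved part: the two accumulators have no data dependency on each other.
    let common = da.len().min(db.len());
    for i in 0..common {
        a = a.wrapping_add(da[i]);
        oa[i] = a;
        b = b.wrapping_add(db[i]);
        ob[i] = b;
    }
    // Tails, for when the halves differ in length.
    for i in common..da.len() {
        a = a.wrapping_add(da[i]);
        oa[i] = a;
    }
    for i in common..db.len() {
        b = b.wrapping_add(db[i]);
        ob[i] = b;
    }
    out
}
```

The output is identical to a single-chain prefix sum; only the dependency structure changes.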
Path to a parallel-decode VBZ format
With K sub-chunks, all stages of the VBZ pipeline (delta, zigzag, SVB16) can be decoded independently on K cores:
| Sub-chunks | decode throughput | vs. current |
|---|---|---|
| 1 (current VBZ) | 2.42 GB/s | N/A |
| 2 (single-threaded 2-chain) | 5.62 GB/s | 2.3× |
| 2 cores | ~11 GB/s | ~4.5× |
| 4 cores | ~22 GB/s | ~9× |
| 8 cores | ~44 GB/s | ~18× |
The format change is: store K−1 carry values ((K−1) × 2 bytes) in the chunk header and split the encoded payload into K equal sub-streams. Compression ratio is unchanged. The svb crate provides decode_2chain and mid_carry as the building blocks.
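A minimal sketch of the K-way scheme for the delta stage alone, using only the standard library. The function names are hypothetical and this is not the crate's API; it assumes equal-sized chunks (the last may be shorter), a non-zero `chunk`, and one stored carry per chunk with the first carry equal to 0.

```rust
use std::thread;

/// Encode side: the running delta sum at the start of each chunk.
/// `chunk` must be non-zero.
pub fn chunk_carries(deltas: &[i16], chunk: usize) -> Vec<i16> {
    let mut carries = Vec::new();
    let mut acc = 0i16;
    for c in deltas.chunks(chunk) {
        carries.push(acc);
        acc = c.iter().fold(acc, |a, &d| a.wrapping_add(d));
    }
    carries
}

/// Decode side: each chunk is an independent prefix sum seeded with its
/// stored carry, so every chunk can run on its own thread.
pub fn decode_parallel(deltas: &[i16], chunk: usize, carries: &[i16]) -> Vec<i16> {
    let mut out = vec![0i16; deltas.len()];
    thread::scope(|s| {
        let work = deltas
            .chunks(chunk)
            .zip(out.chunks_mut(chunk))
            .zip(carries);
        for ((src, dst), &carry) in work {
            s.spawn(move || {
                let mut acc = carry;
                for (o, &d) in dst.iter_mut().zip(src) {
                    acc = acc.wrapping_add(d);
                    *o = acc;
                }
            });
        }
    });
    out
}
```

Spawning one thread per chunk is only reasonable for large chunks or batched work; for 8192-element chunks the thread start-up cost would dominate, which is why the benchmarks in this document batch 64 chunks per thread.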
SVB-ZD pipeline
At 8192 i16 elements, GitHub Actions CI (Azure x86-64 and AArch64):
x86-64
| Path | Scalar | SSSE3 | SSSE3 vs scalar | AVX2 | AVX2 vs scalar |
|---|---|---|---|---|---|
| encode_svbzd | 158 Melem/s | 1,140 Melem/s | 7.2× | 1,100 Melem/s | 6.9× |
| decode_svbzd (3-pass) | 105 Melem/s | 696 Melem/s | 6.6× | 722 Melem/s | 6.9× |
| decode_svbzd_fused | 466 Melem/s | 1,510 Melem/s | 3.2× | 1,510 Melem/s | 3.2× |
AArch64
| Path | Scalar | NEON | NEON vs scalar |
|---|---|---|---|
| encode_svbzd | 195 Melem/s | 551 Melem/s | 2.8× |
| decode_svbzd (3-pass) | 210 Melem/s | 834 Melem/s | 4.0× |
| decode_svbzd_fused | 564 Melem/s | 1,850 Melem/s | 3.3× |
The SIMD encode path computes zigzag-delta inline without an intermediate Vec<u32>
allocation. On AVX2 it processes 8 i16 values per iteration using
_mm256_cvtepi16_epi32 + _mm_alignr_epi8; on NEON it uses vmovl_s16 +
vextq_s32.
The fused decode collapses U32Classic decode, unzigzag, and undelta into one SIMD loop. The 2-ctrl-byte inner loop processes 8 values per iteration. SSSE3 and AVX2 perform about the same on the fused path because the bottleneck is the serial delta carry chain, not SIMD width; wider registers do not help once the carry chain is saturated.
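For readers who want to see the data flow, here is a scalar round-trip sketch of a zigzag-delta pipeline over 2-bit tags. It is illustrative only and not the crate's `encode_svbzd` / `decode_svbzd_fused`: the function names are hypothetical, and it assumes the usual StreamVByte layout (tag = byte length − 1, four tags per control byte with the first value in the low bits, little-endian data) and an initial delta carry of 0.

```rust
/// Encode: delta in i32 (so i16 differences cannot overflow), zigzag to u32,
/// then store 1 to 4 bytes per value. Returns (control stream, data stream).
pub fn encode_svbzd_sketch(samples: &[i16]) -> (Vec<u8>, Vec<u8>) {
    let mut ctrl = vec![0u8; (samples.len() + 3) / 4];
    let mut data = Vec::new();
    let mut prev = 0i32;
    for (i, &s) in samples.iter().enumerate() {
        let delta = s as i32 - prev;
        prev = s as i32;
        let z = ((delta << 1) ^ (delta >> 31)) as u32;
        let len: usize = match z {
            0..=0xFF => 1,
            0x100..=0xFFFF => 2,
            0x1_0000..=0xFF_FFFF => 3,
            _ => 4,
        };
        ctrl[i / 4] |= ((len - 1) as u8) << ((i % 4) * 2);
        data.extend_from_slice(&z.to_le_bytes()[..len]);
    }
    (ctrl, data)
}

/// Fused decode: tag lookup, byte load, unzigzag, and undelta per value
/// in one loop. Returns `None` if either stream is too short.
pub fn decode_svbzd_fused_sketch(ctrl: &[u8], data: &[u8], n: usize) -> Option<Vec<i16>> {
    let mut out = Vec::with_capacity(n);
    let mut pos = 0usize;
    let mut acc = 0i32;
    for i in 0..n {
        let tag = (*ctrl.get(i / 4)? >> ((i % 4) * 2)) & 0b11;
        let len = tag as usize + 1;
        let bytes = data.get(pos..pos + len)?;
        pos += len;
        let mut buf = [0u8; 4];
        buf[..len].copy_from_slice(bytes);
        let z = u32::from_le_bytes(buf);
        let delta = ((z >> 1) as i32) ^ -((z & 1) as i32);
        acc = acc.wrapping_add(delta);
        out.push(acc as i16);
    }
    Some(out)
}
```

A jump from 32767 to −32768 produces a delta of −65535, which does not fit in i16 but encodes here as a 3-byte value.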
SVB-ZD vs VBZ
Both pipelines operate on i16 signal data; the choice depends on the file format (VBZ for POD5, SVB-ZD for BLOW5):
| Metric | VBZ | SVB-ZD |
|---|---|---|
| Codec | SVB16 (1-bit tags) | U32Classic (2-bit tags) |
| Encode (AVX2, 8192 elem) | 2,850 Melem/s | 1,100 Melem/s |
| Fused decode (AVX2, 8192 elem) | 1,840 Melem/s | 1,510 Melem/s |
| Fused decode (NEON, 8192 elem) | 2,280 Melem/s | 1,850 Melem/s |
| Wire format | ONT POD5 / VBZ | hasindu2008/slow5lib BLOW5 |
VBZ is faster because SVB16's 1-bit tags pack more tightly than U32Classic's 2-bit tags. SVB-ZD handles values that overflow i16 after delta without truncation.
Results vs streamvbyte64 v0.2.0
Measured with simd-avx2 on GitHub Actions ubuntu-latest (Azure x86-64).
streamvbyte64 uses its own runtime detection; numbers reflect its best available path.
| Benchmark | svb | sv64 | ratio |
|---|---|---|---|
| U32Classic decode/128 | 8.68 GB/s | 3.71 GB/s | 2.34x |
| U32Classic decode/1024 | 13.6 GB/s | 4.87 GB/s | 2.79x |
| U32Classic decode/8192 | 14.1 GB/s | 4.89 GB/s | 2.88x |
| U32Classic encode/128 | 6.65 GB/s | 2.33 GB/s | 2.85x |
| U32Classic encode/1024 | 8.26 GB/s | 3.08 GB/s | 2.68x |
| U32Classic encode/8192 | 8.93 GB/s | 3.20 GB/s | 2.79x |
| U32Variant0124 decode/128 | 8.98 GB/s | 3.48 GB/s | 2.58x |
| U32Variant0124 decode/1024 | 13.8 GB/s | 4.88 GB/s | 2.83x |
| U32Variant0124 decode/8192 | 14.2 GB/s | 5.00 GB/s | 2.84x |
| U32Variant0124 encode/128 | 6.74 GB/s | 2.37 GB/s | 2.84x |
| U32Variant0124 encode/1024 | 8.32 GB/s | 2.96 GB/s | 2.81x |
| U32Variant0124 encode/8192 | 8.89 GB/s | 3.01 GB/s | 2.95x |
| U64Coder1248 decode/128 | 12.0 GB/s | 5.89 GB/s | 2.04x |
| U64Coder1248 decode/1024 | 15.0 GB/s | 8.68 GB/s | 1.73x |
| U64Coder1248 decode/8192 | 14.8 GB/s | 8.76 GB/s | 1.69x |
| U64Coder1248 encode/128 | 7.37 GB/s | 3.52 GB/s | 2.09x |
| U64Coder1248 encode/1024 | 8.73 GB/s | 4.61 GB/s | 1.89x |
| U64Coder1248 encode/8192 | 8.85 GB/s | 4.80 GB/s | 1.84x |
svb is consistently 1.7x–2.9x faster than streamvbyte64. The u32 codecs see the
largest gap (approaching 3×); the u64 codecs are closer because 8-byte elements
reduce the SIMD parallelism available per control byte.
Running benchmarks
cargo bench --features simd-auto
Benchmarks cover all five codec variants across encode/decode and three slice sizes (128, 1024, 8192 elements). Criterion produces HTML reports in target/criterion/.
To run a single benchmark by name substring:
cargo bench --features simd-auto -- U32Classic/decode