File format specification

This page describes the binary layout of .bindu files at an abstract level. Byte-exact offsets are defined in RFC-BINDU-0001; this page is the human-readable overview.

A .bindu file is a sequence of frames. Every frame is self-describing: it declares its own type and payload length, so a tool can skip a frame it doesn’t understand (provided the frame is flagged as skippable) and still process the rest.

Self-contained decompression

A key property: the file’s header carries everything required to decompress it. A Bindu compression session derives the symbols it needs and packs the resulting symbol table and pipeline configuration into the header. The decoder therefore has no external dependencies: no separately shipped dictionary, no registry lookup, no shared state.

This is what makes a .bindu file safe to drop into a generic archive or hand to a downstream consumer who doesn’t share your infrastructure. The cost is a small per-file header overhead (a few hundred bytes to a few KB depending on configuration); the benefit is operational independence.

Long-running pipelines that want to amortize the symbol table across many files can opt into a stateful mode where the header references shared external state, but that’s an explicit configuration — the default is fully self-contained.

High-level layout

┌─────────────────────────────┐
│ Magic + version (10 bytes)  │   "BINDU\0" + u32 version
├─────────────────────────────┤
│ File header                 │   schema fingerprint, flags
├─────────────────────────────┤
│ Schema frame (optional)     │   present if --embed-schema
├─────────────────────────────┤
│ Dictionary frames (0..N)    │   per-file and inline shared dicts
├─────────────────────────────┤
│ Column frames (1..N)        │   one per logical column
├─────────────────────────────┤
│ Index frames (optional)     │   Bloom filters, sorted indexes
├─────────────────────────────┤
│ Merkle frame (optional)     │   present if --seal
├─────────────────────────────┤
│ Footer                      │   offsets + file-level checksum
└─────────────────────────────┘

The footer always occupies the last 512 bytes of the file. Tools read it first to discover the offsets of the other frames, which is what enables random access.
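A minimal sketch of the footer-first read, relying only on the 512-byte size stated here. The function name read_footer_bytes is hypothetical, and the footer's internal layout (offset table, checksum) is left to RFC-BINDU-0001, so the sketch returns the raw bytes rather than parsing them.

```python
import io

FOOTER_SIZE = 512  # stated on this page: the footer is the last 512 bytes


def read_footer_bytes(f):
    """Return the raw 512-byte footer of a seekable binary stream.

    Parsing the offset table inside the footer is out of scope here;
    its layout is defined in RFC-BINDU-0001.
    """
    size = f.seek(0, io.SEEK_END)  # seek() returns the new absolute position
    if size < FOOTER_SIZE:
        raise ValueError("file too small to contain a footer")
    f.seek(size - FOOTER_SIZE)
    footer = f.read(FOOTER_SIZE)
    if len(footer) != FOOTER_SIZE:
        raise ValueError("short read while loading footer")
    return footer
```

Because the footer sits at a fixed distance from the end of the file, a reader needs one seek and one read before it can jump to any other frame.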

Frame format

Every frame starts with:

u8    frame type code
u8    version
u16   flags
u64   payload length

The header is followed by the payload and a trailing CRC-32C checksum. An unknown frame type with the skippable flag set is ignored; an unknown type without it causes a parse error.
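The skip rule can be sketched as a small frame walker. Several details below are assumptions rather than part of this page: little-endian fields with no padding, bit 0 of the flags word as the skippable flag, a payload length that excludes the header and the checksum, and the names Frame and iter_frames. The sketch reads the stored CRC-32C but does not verify it, and it expects a buffer that contains only frames (no magic bytes or footer).

```python
import struct
from typing import NamedTuple

# Assumptions (not stated on this page): little-endian fields, no padding,
# and bit 0 of `flags` as the skippable flag. RFC-BINDU-0001 is authoritative.
FRAME_HEADER = struct.Struct("<BBHQ")  # u8 type, u8 version, u16 flags, u64 length
CRC_SIZE = 4                           # CRC-32C is a 32-bit checksum
FLAG_SKIPPABLE = 0x0001                # hypothetical bit position
KNOWN_TYPES = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x80}


class Frame(NamedTuple):
    type_code: int
    version: int
    flags: int
    payload: bytes
    crc: int  # stored checksum, not verified in this sketch


def iter_frames(buf: bytes, offset: int = 0):
    """Yield known frames from `buf`, skipping unknown skippable ones."""
    while offset < len(buf):
        if offset + FRAME_HEADER.size > len(buf):
            raise ValueError("truncated frame header")
        type_code, version, flags, length = FRAME_HEADER.unpack_from(buf, offset)
        payload_start = offset + FRAME_HEADER.size
        payload_end = payload_start + length
        frame_end = payload_end + CRC_SIZE
        if frame_end > len(buf):
            raise ValueError("truncated frame payload")
        if type_code not in KNOWN_TYPES:
            if not flags & FLAG_SKIPPABLE:
                raise ValueError(
                    f"unknown non-skippable frame type 0x{type_code:02x}"
                )
            offset = frame_end  # skip using the self-described length
            continue
        (crc,) = struct.unpack_from("<I", buf, payload_end)
        yield Frame(type_code, version, flags, buf[payload_start:payload_end], crc)
        offset = frame_end
```

The payload length in the header is what makes skipping possible: the walker can step over a frame without understanding anything inside it.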

Frame types

Code	Type	Description
0x01	Header	Schema fingerprint, creation timestamp, flags
0x02	Schema	Embedded schema (if requested)
0x03	Dictionary	Shared dictionary section
0x04	Column	A single logical column’s encoded data
0x05	BloomIndex	Bloom filter over one column
0x06	SortedIndex	Sparse sorted index over one column
0x07	MerkleTree	Merkle tree over the file’s records
0x08	Footer	Offset table + file checksum
0x80	UserMeta	User-supplied metadata (skippable)

Codes 0x80 and above are reserved for extensions. Implementations must tolerate unknown skippable frames.
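For illustration, the table above can be mirrored as an enumeration. The codes come from the table; the names FrameType, EXTENSION_BASE and classify, and the two "UNKNOWN_" labels, are hypothetical.

```python
from enum import IntEnum


class FrameType(IntEnum):
    HEADER = 0x01
    SCHEMA = 0x02
    DICTIONARY = 0x03
    COLUMN = 0x04
    BLOOM_INDEX = 0x05
    SORTED_INDEX = 0x06
    MERKLE_TREE = 0x07
    FOOTER = 0x08
    USER_META = 0x80


EXTENSION_BASE = 0x80  # codes at or above this value are reserved for extensions


def classify(code: int) -> str:
    """Return the known type name, or a label for an unrecognized code."""
    if not 0 <= code <= 0xFF:
        raise ValueError("frame type code must fit in a u8")
    try:
        return FrameType(code).name
    except ValueError:
        # Not in the table: distinguish extension codes from core codes.
        return "UNKNOWN_EXTENSION" if code >= EXTENSION_BASE else "UNKNOWN_CORE"
```

Whether an unrecognized frame may be skipped is decided by its skippable flag, not by the range its code falls in.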

Column encoding

Each column frame carries:

Column ID (stable across schema versions)
Type descriptor (int, float, enum, ts, string, bytes, nested)
Codec chain (e.g. frame-of-reference → zigzag → zstd)
Chunk index (offsets for in-file random access)
Encoded payload

Readers execute the codec chain in reverse to decode a chunk.
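A minimal sketch of running a codec chain forward to encode and in reverse to decode, using the example chain from the list above. This is not Bindu's implementation: zlib stands in for zstd so the sketch needs only the standard library, integers are serialized as little-endian u64 values before compression (an assumption), and all function names are hypothetical.

```python
import struct
import zlib


def for_encode(values):
    """Frame-of-reference: store the minimum, then offsets from it."""
    base = min(values, default=0)
    return [base] + [v - base for v in values]


def for_decode(stored):
    base, offsets = stored[0], stored[1:]
    return [base + d for d in offsets]


def zigzag_encode(values):
    """Map signed 64-bit integers to unsigned: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return [(v << 1) ^ (v >> 63) for v in values]


def zigzag_decode(values):
    return [(u >> 1) ^ -(u & 1) for u in values]


def pack_compress(values):
    """Serialize as little-endian u64 and compress (zlib stands in for zstd)."""
    raw = struct.pack(f"<{len(values)}Q", *values)
    return zlib.compress(raw)


def decompress_unpack(blob):
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 8}Q", raw))


# Each stage is an (encode, decode) pair, listed in encode order.
CHAIN = [
    (for_encode, for_decode),
    (zigzag_encode, zigzag_decode),
    (pack_compress, decompress_unpack),
]


def encode_chunk(values, chain=CHAIN):
    data = values
    for encode, _ in chain:
        data = encode(data)
    return data


def decode_chunk(blob, chain=CHAIN):
    data = blob
    for _, decode in reversed(chain):  # undo the last stage first
        data = decode(data)
    return data
```

Keeping each stage as an encode/decode pair means the reader only needs the chain description stored in the column frame: decoding is the same list walked backwards.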

Schema fingerprint

The schema fingerprint is SHA-256(canonical-schema-bytes) truncated to 128 bits. Two files with the same fingerprint use byte-identical canonical schemas (barring a hash collision). The fingerprint is the key used to look up a schema in a registry.
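The computation can be sketched in a few lines. Two things are assumptions here: the truncation keeps the leading 16 bytes of the digest, and the input has already been canonicalized (the canonical form itself is not described on this page). The function name schema_fingerprint is hypothetical.

```python
import hashlib

FINGERPRINT_BYTES = 16  # 128 bits


def schema_fingerprint(canonical_schema_bytes: bytes) -> bytes:
    """SHA-256 of the canonical schema bytes, truncated to 128 bits.

    Assumption: truncation keeps the leading bytes of the digest.
    """
    digest = hashlib.sha256(canonical_schema_bytes).digest()
    return digest[:FINGERPRINT_BYTES]
```

Because the hash is taken over canonical bytes, two schemas that differ only in non-canonical details (for example whitespace, if the canonical form normalizes it) would map to the same fingerprint.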

Compatibility guarantees

Files written by Bindu version X.Y.Z are readable by any version ≥ X.0.0 in the same major line, including readers older than X.Y.Z (minor-version forward compatibility).
Major versions may introduce new frame types; old readers must treat them as skippable unless marked otherwise.
The magic bytes (BINDU\0) and footer position are stable across all versions.

Overview — the symbolic model above this on-disk layout.
RFC-BINDU-0001 — the byte-exact specification.