File format specification
This page specifies the binary layout of .bindu files at the abstract level. Byte-exact offsets are defined in RFC-BINDU-0001 — this page is the human-readable overview.
A .bindu file is a sequence of frames. Every frame is self-describing, so tools can skip frames they don’t understand and still process the rest.
Self-contained decompression
A key property: the file’s header carries everything required to decompress it. A Bindu compression session derives the symbols it needs and packs the resulting symbol table and pipeline configuration into the header, so the decoder has no external dependencies: no separately shipped dictionary, no registry lookup, no shared state.
This is what makes a .bindu file safe to drop into a generic archive or hand to a downstream consumer who doesn’t share your infrastructure. The cost is a small per-file header overhead (a few hundred bytes to a few KB depending on configuration); the benefit is operational independence.
Long-running pipelines that want to amortize the symbol table across many files can opt into a stateful mode where the header references shared external state, but that’s an explicit configuration — the default is fully self-contained.
High-level layout
┌─────────────────────────────┐
│ Magic + version (8 bytes)   │ "BINDU\0" + u32 version
├─────────────────────────────┤
│ File header                 │ schema fingerprint, flags
├─────────────────────────────┤
│ Schema frame (optional)     │ present if --embed-schema
├─────────────────────────────┤
│ Dictionary frames (0..N)    │ per-file and inline shared dicts
├─────────────────────────────┤
│ Column frames (1..N)        │ one per logical column
├─────────────────────────────┤
│ Index frames (optional)     │ Bloom filters, sorted indexes
├─────────────────────────────┤
│ Merkle frame (optional)     │ present if --seal
├─────────────────────────────┤
│ Footer                      │ offsets + file-level checksum
└─────────────────────────────┘
The footer is always the last 512 bytes. Tools read it first to discover the offsets of the other frames, which is what enables random access.
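
The footer-first read described above can be sketched as follows. This is a minimal illustration, not the reference reader: it relies only on the stated fact that the footer occupies the last 512 bytes, and it returns the raw bytes because the internal footer layout is defined in RFC-BINDU-0001, not here. The function name `read_footer_bytes` is hypothetical.

```python
import io
import os

FOOTER_SIZE = 512  # from the text: the footer is always the last 512 bytes


def read_footer_bytes(stream):
    """Return the raw footer from a seekable binary stream.

    Hypothetical helper: parsing the offset table inside the footer
    is out of scope, since its layout is not given on this page.
    """
    size = stream.seek(0, os.SEEK_END)  # seek returns the new absolute position
    if size < FOOTER_SIZE:
        raise ValueError(f"too short to contain a footer: {size} bytes")
    stream.seek(size - FOOTER_SIZE)
    return stream.read(FOOTER_SIZE)
```

A reader would call this once on open, decode the offset table, and then seek directly to the frames it needs instead of scanning the file from the start.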
Frame format
Every frame starts with:
- u8 frame type code
- u16 flags is preceded by a u8 version
- u64 payload length
The header is followed by a payload and a trailing CRC-32C. Unknown frame types with a skippable flag set are ignored; unknown types without it cause a parse error.
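
To make the frame envelope concrete, here is a hedged Python sketch that writes and reads one frame. Several details are assumptions, because this page does not fix them: little-endian byte order, no padding between header fields, a 4-byte little-endian CRC trailer, and a CRC that covers the header plus the payload. The helper names `write_frame` and `read_frame` are hypothetical. CRC-32C is implemented bitwise because the Python standard library does not ship it.

```python
import io
import struct

# Assumed layout: u8 type, u8 version, u16 flags, u64 payload length,
# little-endian and unpadded (12 bytes in total).
FRAME_HEADER = struct.Struct("<BBHQ")


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def write_frame(type_code: int, version: int, flags: int, payload: bytes) -> bytes:
    """Serialize one frame: header, payload, CRC trailer."""
    head = FRAME_HEADER.pack(type_code, version, flags, len(payload))
    # Assumption: the checksum covers header and payload together.
    return head + payload + struct.pack("<I", crc32c(head + payload))


def read_frame(stream):
    """Read one frame and verify its checksum."""
    head = stream.read(FRAME_HEADER.size)
    if len(head) != FRAME_HEADER.size:
        raise EOFError("truncated frame header")
    type_code, version, flags, length = FRAME_HEADER.unpack(head)
    payload = stream.read(length)
    trailer = stream.read(4)
    if len(payload) != length or len(trailer) != 4:
        raise EOFError("truncated frame body")
    (stored,) = struct.unpack("<I", trailer)
    if crc32c(head + payload) != stored:
        raise ValueError("CRC-32C mismatch")
    return type_code, version, flags, payload
```

Because the payload length sits in the fixed-size header, a reader that does not recognise a frame can seek past `length + 4` bytes without decoding anything, which is what makes frames skippable.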
Frame types
| Code | Type | Description |
|---|---|---|
| 0x01 | Header | Schema fingerprint, creation timestamp, flags |
| 0x02 | Schema | Embedded schema (if requested) |
| 0x03 | Dictionary | Shared dictionary section |
| 0x04 | Column | A single logical column’s encoded data |
| 0x05 | BloomIndex | Bloom filter over one column |
| 0x06 | SortedIndex | Sparse sorted index over one column |
| 0x07 | MerkleTree | Merkle tree over the file’s records |
| 0x08 | Footer | Offset table + file checksum |
| 0x80 | UserMeta | User-supplied metadata (skippable) |
The 0x80 range and above is reserved for extensions. Implementations must tolerate unknown skippable frames.
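
The table and the skippable rule can be sketched as a small dispatch function. The enum values come from the table above; everything else is illustrative. In particular, `SKIPPABLE_FLAG` is a hypothetical bit position, since this page does not say which bit of the u16 flags field marks a frame as skippable.

```python
from enum import IntEnum


class FrameType(IntEnum):
    """Frame type codes from the table above."""
    HEADER = 0x01
    SCHEMA = 0x02
    DICTIONARY = 0x03
    COLUMN = 0x04
    BLOOM_INDEX = 0x05
    SORTED_INDEX = 0x06
    MERKLE_TREE = 0x07
    FOOTER = 0x08
    USER_META = 0x80


EXTENSION_BASE = 0x80    # codes at or above this value are reserved for extensions
SKIPPABLE_FLAG = 0x0001  # hypothetical: the real bit is defined in RFC-BINDU-0001


def dispatch(type_code: int, flags: int) -> str:
    """Return 'process' for known types and 'skip' for unknown skippable ones.

    Unknown types without the skippable flag raise, matching the
    parse-error rule for frames a reader cannot safely ignore.
    """
    try:
        FrameType(type_code)
        return "process"
    except ValueError:
        pass
    if flags & SKIPPABLE_FLAG:
        return "skip"
    raise ValueError(f"unknown non-skippable frame type 0x{type_code:02x}")
```

For example, `dispatch(0x91, SKIPPABLE_FLAG)` returns `"skip"`, while `dispatch(0x91, 0)` raises a `ValueError`.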
Column encoding
Each column frame carries:
- Column ID (stable across schema versions)
- Type descriptor (int, float, enum, ts, string, bytes, nested)
- Codec chain (e.g. frame-of-reference → zigzag → zstd)
- Chunk index (offsets for in-file random access)
- Encoded payload
Readers execute the codec chain in reverse to decode a chunk.
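
The reverse-order rule is easiest to see with a toy chain. The sketch below models each codec as an (encode, decode) pair and runs the decoders in reverse. It is illustrative only: zlib from the Python standard library stands in for zstd, the frame-of-reference stage uses the first value as its reference, and values are packed as little-endian u64. None of these choices are taken from the Bindu specification.

```python
import struct
import zlib


def for_encode(values):
    """Frame of reference: keep the first value, then offsets from it."""
    if not values:
        return []
    ref = values[0]
    return [ref] + [v - ref for v in values]


def for_decode(data):
    if not data:
        return []
    ref = data[0]
    return [ref + d for d in data[1:]]


def zigzag_encode(values):
    """Map signed 64-bit integers to unsigned ones (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    return [(v << 1) ^ (v >> 63) for v in values]


def zigzag_decode(values):
    return [(z >> 1) ^ -(z & 1) for z in values]


def compress(values):
    """Pack as little-endian u64 and compress (zlib as a stand-in for zstd)."""
    return zlib.compress(struct.pack(f"<{len(values)}Q", *values))


def decompress(blob):
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 8}Q", raw))


CHAIN = [
    (for_encode, for_decode),
    (zigzag_encode, zigzag_decode),
    (compress, decompress),
]


def encode_chunk(values, chain=CHAIN):
    data = values
    for encode, _ in chain:
        data = encode(data)
    return data


def decode_chunk(blob, chain=CHAIN):
    data = blob
    for _, decode in reversed(chain):  # reverse order undoes the last stage first
        data = decode(data)
    return data
```

Storing the chain as data, not hard-coding it, mirrors the idea that the column frame itself names its codecs, so a reader can decode a chunk without out-of-band knowledge.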
Schema fingerprint
The schema fingerprint is SHA-256(canonical-schema-bytes) truncated to 128 bits. Two files with the same fingerprint use byte-identical schemas. Fingerprints are the key used to look up schemas in a registry.
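
As a rough illustration, the fingerprint can be computed with the standard library. Two assumptions are made here: truncation keeps the leading 16 bytes of the digest, and the caller already holds the canonical schema bytes (canonicalization itself is not described on this page). The function name is hypothetical.

```python
import hashlib


def schema_fingerprint(canonical_schema_bytes: bytes) -> bytes:
    """SHA-256 of the canonical schema, truncated to 128 bits.

    Assumption: the first 16 bytes of the digest are kept.
    """
    return hashlib.sha256(canonical_schema_bytes).digest()[:16]
```

Calling `schema_fingerprint(b"...").hex()` yields a 32-character hex string that could serve as a registry lookup key.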
Compatibility guarantees
- Files written by Bindu version X.Y.Z are readable by any version ≥ X.0.0 (minor-version forward compatibility within a major line).
- Major versions may introduce new frame types; old readers must treat them as skippable unless marked otherwise.
- The magic bytes (BINDU\0) and footer position are stable across all versions.
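
The first guarantee reduces to a version comparison. The sketch below is a literal reading of "readable by any version ≥ X.0.0", with versions represented as (major, minor, patch) tuples; the function name and the tuple representation are assumptions for illustration only.

```python
def can_read(reader_version, file_version):
    """True if a reader at reader_version is guaranteed to read the file.

    Literal reading of the rule: a file written by X.Y.Z is readable
    by any reader whose version is at least X.0.0.
    """
    file_major = file_version[0]
    return tuple(reader_version) >= (file_major, 0, 0)
```

Under this reading a 2.0.0 reader can open a file written by 2.5.1, while a 1.9.9 reader has no guarantee for a file written by 2.0.0.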
Related
- Overview: the symbolic model above this on-disk layout.
- RFC-BINDU-0001: the byte-exact specification.