File format specification

This page specifies the binary layout of .bindu files at the abstract level. Byte-exact offsets are defined in RFC-BINDU-0001 — this page is the human-readable overview.

A .bindu file is a sequence of frames. Every frame is self-describing, so tools can skip frames they don’t understand and still process the rest.

A key property: the file’s header carries everything required to decompress it. A Bindu compression session derives the symbols it needs, packs the resulting symbol table and pipeline configuration into the header, and the decoder has no external dependencies — no separately-shipped dictionary, no registry lookup, no shared state.

This is what makes a .bindu file safe to drop into a generic archive or hand to a downstream consumer who doesn’t share your infrastructure. The cost is a small per-file header overhead (a few hundred bytes to a few KB depending on configuration); the benefit is operational independence.

Long-running pipelines that want to amortize the symbol table across many files can opt into a stateful mode where the header references shared external state, but that’s an explicit configuration — the default is fully self-contained.

┌─────────────────────────────┐
│ Magic + version (8 bytes)   │  "BINDU\0" + u32 version
├─────────────────────────────┤
│ File header                 │  schema fingerprint, flags
├─────────────────────────────┤
│ Schema frame (optional)     │  present if --embed-schema
├─────────────────────────────┤
│ Dictionary frames (0..N)    │  per-file and inline shared dicts
├─────────────────────────────┤
│ Column frames (1..N)        │  one per logical column
├─────────────────────────────┤
│ Index frames (optional)     │  Bloom filters, sorted indexes
├─────────────────────────────┤
│ Merkle frame (optional)     │  present if --seal
├─────────────────────────────┤
│ Footer                      │  offsets + file-level checksum
└─────────────────────────────┘

The footer is always the last 512 bytes. Tools read it first to discover the offsets of the other frames, which is what enables random access.
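
The footer-first read can be sketched in Python as follows. This only locates the raw footer bytes; the layout of the offset table inside them is defined in RFC-BINDU-0001 and is not decoded here. The names `read_footer_bytes` and `FOOTER_SIZE` are illustrative and not part of any Bindu API.

```python
import os

FOOTER_SIZE = 512  # per this page: the footer is always the last 512 bytes


def read_footer_bytes(path: str) -> bytes:
    """Return the raw 512-byte footer of a .bindu file (hypothetical helper)."""
    size = os.path.getsize(path)
    if size < FOOTER_SIZE:
        raise ValueError(f"file too small to hold a footer: {size} bytes")
    with open(path, "rb") as f:
        # Seek relative to the end of the file so nothing before the footer is read.
        f.seek(-FOOTER_SIZE, os.SEEK_END)
        return f.read(FOOTER_SIZE)
```

A reader would then decode the offset table from these bytes and seek directly to the frames it needs.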

Every frame starts with:

u8 frame type code
u8 version
u16 flags
u64 payload length

This 12-byte frame header is followed by the payload and a trailing CRC-32C. Unknown frame types with the skippable flag set are ignored; unknown types without it cause a parse error.
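
As a rough illustration, the frame layout above could be parsed like this in Python. Several details are assumptions, because this page does not state them: fields are taken to be little-endian, the CRC is taken to be 4 bytes wide and to cover the payload only. RFC-BINDU-0001 is authoritative on all three. `Frame`, `parse_frame` and `crc32c` are hypothetical names; the CRC-32C routine is a plain table-driven implementation of the Castagnoli polynomial.

```python
import struct
from dataclasses import dataclass

# u8 type code, u8 version, u16 flags, u64 payload length (12 bytes in total).
# ASSUMPTION: little-endian byte order.
FRAME_HEADER = struct.Struct("<BBHQ")
CRC_SIZE = 4  # ASSUMPTION: the trailing CRC-32C is stored as a 4-byte value


def _make_crc32c_table():
    """Build the lookup table for CRC-32C (reflected polynomial 0x82F63B78)."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Compute CRC-32C (Castagnoli) over data."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


@dataclass
class Frame:
    type_code: int
    version: int
    flags: int
    payload: bytes


def parse_frame(buf: bytes, offset: int = 0):
    """Parse one frame at offset; return (Frame, offset of the next frame)."""
    type_code, version, flags, length = FRAME_HEADER.unpack_from(buf, offset)
    start = offset + FRAME_HEADER.size
    end = start + length
    if end + CRC_SIZE > len(buf):
        raise ValueError("truncated frame")
    payload = buf[start:end]
    (stored,) = struct.unpack_from("<I", buf, end)
    # ASSUMPTION: the checksum covers the payload only, not the frame header.
    if stored != crc32c(payload):
        raise ValueError("CRC-32C mismatch")
    return Frame(type_code, version, flags, payload), end + CRC_SIZE
```

Because the payload length is part of the frame header, a reader can step from one frame to the next without understanding the payload, which is what makes skipping unknown frames cheap.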

Code  Type         Description
0x01  Header       Schema fingerprint, creation timestamp, flags
0x02  Schema       Embedded schema (if requested)
0x03  Dictionary   Shared dictionary section
0x04  Column       A single logical column’s encoded data
0x05  BloomIndex   Bloom filter over one column
0x06  SortedIndex  Sparse sorted index over one column
0x07  MerkleTree   Merkle tree over the file’s records
0x08  Footer       Offset table + file checksum
0x80  UserMeta     User-supplied metadata (skippable)

Type codes 0x80 and above are reserved for extensions. Implementations must tolerate unknown skippable frames.
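
One way a reader might apply the skip rule to the type codes listed above is sketched below in Python. The position of the skippable bit within the u16 flags field is not given on this page, so `SKIPPABLE_FLAG = 0x0001` is a placeholder assumption; `FrameType` and `disposition` are illustrative names.

```python
from enum import IntEnum


class FrameType(IntEnum):
    """Frame type codes as listed on this page."""
    HEADER = 0x01
    SCHEMA = 0x02
    DICTIONARY = 0x03
    COLUMN = 0x04
    BLOOM_INDEX = 0x05
    SORTED_INDEX = 0x06
    MERKLE_TREE = 0x07
    FOOTER = 0x08
    USER_META = 0x80


# ASSUMPTION: placeholder bit position; see RFC-BINDU-0001 for the real one.
SKIPPABLE_FLAG = 0x0001


def disposition(type_code: int, flags: int) -> str:
    """Return "process" for known frames and "skip" for unknown skippable ones.

    Raises ValueError for an unknown frame type without the skippable flag.
    """
    try:
        FrameType(type_code)
        return "process"
    except ValueError:
        pass
    if flags & SKIPPABLE_FLAG:
        return "skip"
    raise ValueError(f"unknown non-skippable frame type 0x{type_code:02x}")
```

For example, an unrecognized extension frame with code 0x91 and the skippable bit set yields "skip", while the same code with the bit clear raises a parse error.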

Each column frame carries:

  • Column ID (stable across schema versions)
  • Type descriptor (int, float, enum, ts, string, bytes, nested)
  • Codec chain (e.g. frame-of-reference → zigzag → zstd)
  • Chunk index (offsets for in-file random access)
  • Encoded payload

Readers execute the codec chain in reverse to decode a chunk.
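
The reverse-order rule can be illustrated with a toy chain in Python. This is a sketch under stated assumptions and not Bindu's actual codecs: the frame-of-reference step uses the first value as its reference, integers are assumed to fit in signed 64 bits, and the final stage uses the standard library's `zlib` as a stand-in for zstd. All class and function names are hypothetical.

```python
import struct
import zlib


class FrameOfReference:
    """Emit the first value as a reference, then each value minus it."""

    def encode(self, values):
        if not values:
            return []
        ref = values[0]
        return [ref] + [v - ref for v in values]

    def decode(self, values):
        if not values:
            return []
        ref = values[0]
        return [d + ref for d in values[1:]]


class ZigZag:
    """Map signed 64-bit integers to unsigned ones so small magnitudes stay small."""

    def encode(self, values):
        return [(v << 1) ^ (v >> 63) for v in values]

    def decode(self, values):
        return [(z >> 1) ^ -(z & 1) for z in values]


class Deflate:
    """Stand-in for zstd: pack as little-endian u64, then compress with zlib."""

    def encode(self, values):
        return zlib.compress(struct.pack(f"<{len(values)}Q", *values))

    def decode(self, blob):
        raw = zlib.decompress(blob)
        return list(struct.unpack(f"<{len(raw) // 8}Q", raw))


CHAIN = [FrameOfReference(), ZigZag(), Deflate()]


def encode_chunk(values, chain=CHAIN):
    """Writers apply the chain front to back."""
    data = values
    for codec in chain:
        data = codec.encode(data)
    return data


def decode_chunk(blob, chain=CHAIN):
    """Readers apply the same chain back to front."""
    data = blob
    for codec in reversed(chain):
        data = codec.decode(data)
    return data
```

Each stage's `decode` undoes only its own `encode`, so reversing the list is sufficient to invert the whole chain: `decode_chunk(encode_chunk(values))` returns the original values.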

The schema fingerprint is SHA-256(canonical-schema-bytes) truncated to 128 bits. Two files with the same fingerprint can be treated as using byte-identical canonical schemas, since an accidental collision at 128 bits is negligibly unlikely. The fingerprint is the key used to look up a schema in a registry.
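
A minimal Python sketch of the fingerprint computation follows. It assumes that "truncated to 128 bits" means keeping the first 16 bytes of the digest, and it takes the canonical schema bytes as given, since the canonicalization rules are not described on this page. `schema_fingerprint` is an illustrative name.

```python
import hashlib


def schema_fingerprint(canonical_schema_bytes: bytes) -> bytes:
    """Return a 128-bit fingerprint of already-canonicalized schema bytes."""
    digest = hashlib.sha256(canonical_schema_bytes).digest()
    # ASSUMPTION: truncation keeps the leading 16 bytes of the 32-byte digest.
    return digest[:16]
```

The result is only comparable across files if every writer canonicalizes the schema the same way before hashing.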

  • Files written by Bindu version X.Y.Z are readable by any reader at version X.0.0 or later, so a reader older than the writer still works within the same major line (minor-version forward compatibility).
  • Major versions may introduce new frame types; old readers must treat them as skippable unless marked otherwise.
  • The magic bytes (BINDU\0) and footer position are stable across all versions.
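
The rules above could be checked along these lines in Python. This sketch applies the first rule literally (a reader at X.0.0 or later can read a file written by X.Y.Z) and compares plain `major.minor.patch` strings; how the u32 version field in the file maps to such a string is not covered on this page. `has_bindu_magic`, `parse_version` and `reader_can_read` are hypothetical helper names.

```python
MAGIC = b"BINDU\x00"  # the stable magic bytes, "BINDU\0"


def has_bindu_magic(prefix: bytes) -> bool:
    """Check that a file's leading bytes start with the Bindu magic."""
    return prefix[:len(MAGIC)] == MAGIC


def parse_version(text: str):
    """Split "X.Y.Z" into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def reader_can_read(reader_version: str, file_version: str) -> bool:
    """Literal reading of the rule: the reader must be at least X.0.0."""
    file_major = parse_version(file_version)[0]
    return parse_version(reader_version) >= (file_major, 0, 0)
```

Under this reading, a 2.0.0 reader accepts a file written by 2.5.1, while a 1.9.9 reader rejects it.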