
Architecture

Bindu is a cascade of independent stages. None of them is a neural model and none requires a pretrained dictionary. The wire format is documented in full at Reference → File format for interoperability.

Stage 1 — Byte-signature analysis

A single linear pass over the input collects a small, fixed set of statistics: alphabet cardinality, run profile, periodic stride (if any), fraction of printable bytes, and fraction of zeros. These statistics drive a decision that picks one of several specialized pipelines.
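As a rough sketch, that single pass might look like the following (the function name, the stride probe, and its 0.9 threshold are illustrative assumptions, not Bindu's internals):

```python
from collections import Counter

def signal_stats(data: bytes) -> dict:
    """One linear pass collecting the five statistics named above (illustrative)."""
    counts = Counter(data)
    # run profile: number of maximal runs of identical bytes
    runs = 1 + sum(a != b for a, b in zip(data, data[1:]))
    # crude periodic-stride probe: smallest offset at which >90% of bytes repeat
    stride = next(
        (s for s in range(1, min(64, len(data)))
         if sum(a == b for a, b in zip(data, data[s:])) > 0.9 * (len(data) - s)),
        None,
    )
    return {
        "alphabet": len(counts),
        "runs": runs,
        "stride": stride,
        "printable_frac": sum(v for k, v in counts.items() if 32 <= k < 127) / len(data),
        "zero_frac": counts[0] / len(data),
    }
```

Note that every statistic is computable in one pass and constant memory, which is what keeps the routing decision in the microsecond range.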

Routing is closed-form rather than empirical: no trial-encode-and-pick race, no learned classifier — a deterministic decision in microseconds, then the system commits. This is what lets one binary handle every domain we have benchmarked without per-format flags.
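A closed-form router reduces to a short chain of deterministic threshold checks. The cut-offs and dictionary keys below are hypothetical (Bindu's real decision rules are internal), but the shape is the point: same statistics in, same pipeline out, every time.

```python
def route(stats: dict) -> str:
    """Deterministic routing sketch over Stage 1 statistics (thresholds assumed).
    RLE and PAETH2D branches are omitted for brevity."""
    if stats["alphabet"] == 1:
        return "CONSTANT"            # single repeated value
    if stats["stride"] is not None:
        return "LINDELTA"            # fixed-period data
    if stats["alphabet"] <= 16:
        return "DICT"                # small alphabet of repeating units
    if stats["printable_frac"] > 0.95:
        return "GRAMMAR"             # text-like, phrase-rich input
    return "BWT"                     # general-purpose fallback
```

There is nothing to train and nothing to race: the decision is a handful of comparisons, which is why it is auditable line by line.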

Stage 2 — Specialized pipelines

Each pipeline is a different way of factoring redundancy. Bindu picks one based on the byte signature from Stage 1.

| Pipeline | What it factors | Powers |
| --- | --- | --- |
| Grammar | Repeated phrases extracted into a rule table; the body becomes references | In-place editing |
| Burrows-Wheeler (BWT) | Cyclic rotations sorted to expose long runs of similar bytes, then range-coded | FM-index search |
| Dictionary (DICT) | Short codes assigned to a small alphabet of repeating fixed-width units | Native scan over int16/int32 columnar data |
| Stride / periodic (LINDELTA) | Data that repeats with a fixed offset; encodes the period plus residuals | Sensor logs, columnar CSV, FITS image rows |
| Constant | Trivial case for runs of a single value | O(1) search |
| RLE, PAETH2D | Residual streams against neighbor context | Decompress-and-scan fallback |

The first five admit zero-decompression search and (where applicable) in-place edit. RLE and PAETH2D deliberately fall back to decompress-and-scan because their wire formats encode residuals against a neighbor context that has to be reconstructed before any candidate match can be verified — a correctness-preserving choice, not a missing feature.
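The CONSTANT case makes the zero-decompression idea concrete. A minimal sketch, assuming the artifact stores just a value and a run length (function name invented for illustration):

```python
def constant_contains(value: int, run_length: int, needle: bytes) -> bool:
    # A CONSTANT artifact is (value, run_length). The needle occurs in the
    # decompressed run iff it fits and consists solely of that value --
    # no byte of the run is ever reconstructed.
    return len(needle) <= run_length and all(b == value for b in needle)
```

The cost is proportional to the needle, not the run, so a query against a gigabyte of zeros is effectively O(1). An RLE or PAETH2D residual stream, by contrast, must be reconstructed before the same query can even be checked.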

Stage 3 — Persistent pattern pool

High-value patterns extracted by the grammar pipeline are persisted to a bounded, monotonically growing pool that lives across compression sessions. When compression of a new artifact starts, the pool is consulted as a seed — patterns that paid off on previous artifacts get a head start on this one.

The pool is a measured speedup, not magic: it lowers the encoder’s discovery cost on related artifacts, which is why repeated runs over the same data domain converge faster than the first. Pre-seeding the pool with a domain corpus is one of the levers covered in Tuning Bindu.
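One way to picture the pool is as a score table over patterns. The class, its method names, the capacity, and the eviction policy below are all invented for illustration, not Bindu's actual interface:

```python
class PatternPool:
    """Bounded pool of high-value grammar patterns, kept across sessions (sketch)."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.scores: dict[bytes, int] = {}

    def credit(self, pattern: bytes, saved_bytes: int) -> None:
        """Record how many bytes a pattern saved on the artifact just encoded."""
        self.scores[pattern] = self.scores.get(pattern, 0) + saved_bytes
        if len(self.scores) > self.capacity:
            # bounded growth: evict the lowest-scoring pattern
            self.scores.pop(min(self.scores, key=self.scores.get))

    def seed(self, top_k: int = 64) -> list[bytes]:
        """Best performers get a head start on the next artifact."""
        return sorted(self.scores, key=self.scores.get, reverse=True)[:top_k]
```

Under this picture, "pre-seeding with a domain corpus" simply means crediting the pool before the first real artifact arrives, so discovery cost is paid once.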

Stage 4 — Wire-format envelope validation


Before any output is finalized, the encoder checks that the output size falls within a structural envelope of admissible sizes derived from the wire format itself. If a pipeline produces output outside its envelope (a known bug class), the encoder rejects the result and falls back to a guaranteed-correct path.
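Sketched in miniature (the header size and the per-pipeline bounds below are made up; the real envelope is derived from the wire format itself):

```python
HEADER = 16  # assumed fixed header size

def envelope(pipeline: str, input_len: int) -> tuple[int, int]:
    """Hypothetical (min_bytes, max_bytes) admissible output sizes per pipeline."""
    return {
        "CONSTANT": (HEADER, HEADER + 32),
        "RLE": (HEADER, HEADER + 2 * input_len),
        "BWT": (HEADER, HEADER + input_len + input_len // 8),
    }[pipeline]

def finalize(pipeline: str, input_len: int, output_len: int) -> str:
    lo, hi = envelope(pipeline, input_len)
    # Out-of-envelope output is a known bug class: reject it and fall back
    # to a guaranteed-correct path rather than ship a malformed artifact.
    return pipeline if lo <= output_len <= hi else "FALLBACK"
```

The check is purely structural, so it costs a handful of comparisons per encode and can run unconditionally.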

This check runs on every encode in CI and is part of why the round-trip rate is 100% across the 349 measured runs in the industry benchmark.

Two consequences worth calling out for engineering audiences:

  • Every choice point is closed-form. There is no machine-learned routing model that needs retraining when your data shifts. Routing in Stage 1, pipeline behavior in Stage 2, envelope checks in Stage 4 — all deterministic, all auditable.
  • Every stage’s output is independently validated. That’s what makes both the search/edit operations and the bit-exact round-trip guarantee tractable. The encoder cannot ship a result that violates the wire format’s structural envelope, because Stage 4 will reject and fall back before that result is finalized.

When a sub-Shannon outcome appears (the MMS-flag case at millions to one), it is not a violation of information theory. It reflects that the structural description of the data — a single repeated value with a handful of breakpoints — is much shorter than the byte-level conditional entropy H(X) would suggest. The system is audited against a Kolmogorov-style structural lower bound, not against H(X).

In other words: byte-level codecs are bounded by what byte-level statistics can capture. Bindu is bounded by what structure can capture, which is sometimes much less.
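A toy calculation makes the distinction concrete. Here order-0 byte entropy stands in for H(X), and the 4-byte repeat count is an assumption:

```python
import math
from collections import Counter

def shannon_bound_bytes(data: bytes) -> float:
    """Order-0 byte entropy H(X) * len(data) / 8: what a byte-level codec targets."""
    n = len(data)
    h = -sum(c / n * math.log2(c / n) for c in Counter(data).values())
    return n * h / 8

# A 256-byte block repeated 100 times: byte statistics look incompressible
# (every byte value equally likely, H(X) = 8 bits/byte)...
block = bytes(range(256))
data = block * 100
# ...but the structural description is just the block plus a repeat count.
structural = len(block) + 4  # assumed 4-byte repeat count
print(round(shannon_bound_bytes(data)), "vs", structural)  # 25600 vs 260
```

A byte-level codec audited against H(X) cannot beat ~25600 bytes here; a structural codec that captures "this block, 100 times" needs about 260.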

  • Tuning Bindu — how the pipelines, rule pool, and binary itself get specialized to a workload.
  • Reference → File format — the on-disk wire format that this architecture produces.
  • Benchmarks — measured numbers across all the corpora.