Architecture
Bindu is a cascade of independent stages. None of them is a neural model and none requires a pretrained dictionary. The wire format is documented in full at Reference → File format for interoperability.
Stage 1 — Byte-signature routing
A single linear pass over the input collects a small fixed set of statistics: alphabet cardinality, run profile, periodic stride if any, fraction of printable bytes, fraction of zeros. These statistics drive a decision that picks one of several specialized pipelines.
Routing is closed-form rather than empirical: no trial-encode-and-pick race, no learned classifier — a deterministic decision in microseconds, then the system commits. This is what lets one binary handle every domain we have benchmarked without per-format flags.
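The shape of such a router can be sketched in a few lines. This is a minimal Python illustration, not Bindu's implementation: the statistic names, thresholds, and pipeline labels below are invented here to show what a closed-form, threshold-based decision looks like.

```python
from collections import Counter

def signature(data: bytes) -> dict:
    """Fixed set of statistics from one pass over the input.
    Field names and the crude run profile are illustrative only."""
    n = len(data)
    counts = Counter(data)
    # run profile reduced to a single number: longest run of one byte
    longest_run, run = 1, 1
    for prev, cur in zip(data, data[1:]):
        run = run + 1 if cur == prev else 1
        longest_run = max(longest_run, run)
    return {
        "alphabet": len(counts),
        "zero_frac": counts.get(0, 0) / n,
        "printable_frac": sum(c for b, c in counts.items() if 32 <= b < 127) / n,
        "longest_run": longest_run,
    }

def route(sig: dict) -> str:
    """Closed-form decision: fixed thresholds, no trial encodes.
    Thresholds here are made up for the sketch."""
    if sig["alphabet"] == 1:
        return "CONSTANT"
    if sig["longest_run"] > 64:
        return "RLE"
    if sig["printable_frac"] > 0.9:
        return "GRAMMAR"
    return "BWT"
```

Because every branch is a fixed threshold on a precomputed statistic, the decision costs nothing after the single pass, and the same input always routes the same way.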
Stage 2 — Pipeline-specific transform
Each pipeline is a different way of factoring redundancy. Bindu picks one based on the byte-signature in Stage 1.
| Pipeline | What it factors | Powers |
|---|---|---|
| Grammar | Repeated phrases extracted into a rule table, body becomes references | In-place editing |
| Burrows-Wheeler (BWT) | Cyclic rotations sorted to expose long runs of similar bytes, then range-coded | FM-index search |
| Dictionary (DICT) | Short codes assigned to a small alphabet of repeating fixed-width units | Native scan over int16/int32 columnar data |
| Stride / periodic (LINDELTA) | Recognizes data that repeats with a fixed offset; encodes the period plus residuals | Sensor logs, columnar CSV, FITS image rows |
| Constant | Trivial case for runs of a single value | O(1) search |
| RLE, PAETH2D | Residual streams against neighbor context | Decompress-and-scan fallback |
The first five admit zero-decompression search and (where applicable) in-place edit. RLE and PAETH2D deliberately fall back to decompress-and-scan because their wire formats encode residuals against a neighbor context that has to be reconstructed before any candidate match can be verified — a correctness-preserving choice, not a missing feature.
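To see why sorting cyclic rotations exposes runs, the textbook naive Burrows-Wheeler transform is enough. This is a conceptual sketch, not Bindu's implementation (a production encoder would use linear-time suffix-array construction rather than materializing every rotation):

```python
def bwt(data: str, sentinel: str = "\x00") -> str:
    """Naive Burrows-Wheeler transform, O(n^2 log n), for illustration only.
    Assumes the sentinel character does not occur in the input."""
    s = data + sentinel
    # sort all cyclic rotations; rotations that share a following context
    # end up adjacent, so their preceding bytes cluster in the last column
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)
```

For example, `bwt("banana")` yields `"annb\x00aa"`: the two `n`s and two of the three `a`s land adjacent because they precede identical contexts, and it is exactly these runs that the subsequent range coder exploits.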
Stage 3 — Persistent rule pool
High-value patterns extracted by the grammar pipeline are persisted to a bounded, monotonically growing pool that survives across compression sessions. When a new artifact starts compression, the pool is consulted as a seed — patterns that paid off on previous artifacts get a head start on this one.
The pool is a measured speedup, not magic: it lowers the encoder’s discovery cost on related artifacts, which is why repeated runs over the same data domain converge faster than the first. Pre-seeding the pool with a domain corpus is one of the levers covered in Tuning Bindu.
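A bounded, payoff-ranked pool of this kind can be sketched as follows. Everything here is an assumption for illustration — the class name, the payoff accounting, and the JSON persistence are invented, since Bindu's actual pool format is documented elsewhere:

```python
import json
import os

class RulePool:
    """Illustrative bounded rule pool persisted across sessions.
    The JSON on-disk format here is an assumption, not Bindu's."""

    def __init__(self, path, capacity=4096):
        self.path, self.capacity = path, capacity
        self.rules = {}  # pattern -> accumulated bytes saved
        if os.path.exists(path):
            with open(path) as f:
                self.rules = json.load(f)

    def credit(self, pattern, saved_bytes):
        """Record that `pattern` saved bytes on the current artifact.
        Bounded: new patterns are admitted only while there is room."""
        if pattern in self.rules or len(self.rules) < self.capacity:
            self.rules[pattern] = self.rules.get(pattern, 0) + saved_bytes

    def seed(self, top_k=64):
        """Highest-payoff patterns, used to warm-start the next encode."""
        return sorted(self.rules, key=self.rules.get, reverse=True)[:top_k]

    def persist(self):
        with open(self.path, "w") as f:
            json.dump(self.rules, f)
```

The seeding step is what lowers discovery cost: the encoder tries the top-ranked patterns first instead of rediscovering them from scratch on every artifact.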
Stage 4 — Wire-format envelope validation
Before any output is finalized, the encoder checks that the output size falls within a structural envelope of admissible sizes derived from the wire format itself. If a pipeline produces output outside its envelope (a known bug class), the encoder rejects the result and falls back to a guaranteed-correct path.
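The check-then-fall-back pattern can be made concrete with a toy sketch. The header size, per-pipeline bounds, and the `STORED` fallback name below are all invented for illustration — real envelopes are derived from the actual wire format:

```python
def within_envelope(pipeline: str, in_size: int, out_size: int) -> bool:
    """Structural bounds per pipeline. Numbers are made up for the sketch."""
    header = 16  # assumed fixed header size
    if pipeline == "CONSTANT":
        # a constant run should encode as little more than value + length
        return out_size <= header + 16
    # generic pipelines: output can never legitimately exceed input + header
    return header <= out_size <= in_size + header

def finalize(pipeline: str, in_size: int, out_size: int) -> str:
    """Reject any result outside its envelope before it can be shipped."""
    if within_envelope(pipeline, in_size, out_size):
        return pipeline
    return "STORED"  # guaranteed-correct fallback path
```

The important property is that the bound is structural, not statistical: an encoder bug that inflates or corrupts output sizes trips the envelope regardless of what the data looks like.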
This check runs on every encode and in CI, and it is part of why the round-trip rate is 100% across the 349 measured runs in the industry benchmark.
Why this design
Two consequences worth calling out for engineering audiences:
- Every choice point is closed-form. There is no machine-learned routing model that needs retraining when your data shifts. Routing in Stage 1, pipeline behavior in Stage 2, envelope checks in Stage 4 — all deterministic, all auditable.
- Every stage’s output is independently validated. That’s what makes both the search/edit operations and the bit-exact round-trip guarantee tractable. The encoder cannot ship a result that violates the wire format’s structural envelope, because Stage 4 will reject and fall back before that result is finalized.
On Shannon and Kolmogorov
When a sub-Shannon outcome appears (the MMS-flag case at millions to one), it is not a violation of information theory. It reflects that the structural description of the data — a single repeated value with a handful of breakpoints — is much shorter than the byte-level entropy H(X) would suggest. The system is audited against a Kolmogorov-style structural lower bound, not against H(X).
In other words: byte-level codecs are bounded by what byte-level statistics can capture. Bindu is bounded by what structure can capture, which is sometimes much less.
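A constructed toy case makes the gap concrete. The data below is chosen (by us, for illustration) so that its zeroth-order byte statistics are maximally uniform — the worst case for any byte-level codec — while its structure is a one-line rule:

```python
import math
from collections import Counter

def shannon_bits(data: bytes) -> float:
    """Total bits implied by the zeroth-order byte entropy H(X)."""
    n = len(data)
    return sum(-c * math.log2(c / n) for c in Counter(data).values())

# A counter ramp: every byte value occurs equally often, so H(X) charges
# a full 8 bits per byte even though the data is trivially structured.
data = bytes(range(256)) * 1000  # 256,000 bytes

print(shannon_bits(data))  # 2048000.0 bits, i.e. 8 bits/byte
# A structural description -- "bytes count up mod 256, repeated 1000
# times" -- is a few dozen bits, far below the byte-level bound.
```

Byte-level statistics see a uniform distribution and give up; a structural description captures the whole artifact in one rule plus a repeat count.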
- Tuning Bindu — how the pipelines, rule pool, and binary itself get specialized to a workload.
- Reference → File format — the on-disk wire format that this architecture produces.
- Benchmarks — measured numbers across all the corpora.