Tuning Bindu

Bindu is general-purpose by design — one binary handles every domain we have benchmarked, with no per-domain configuration. The decisive wins on a specific workload come from tuning the system to it. This page explains how that tuning works.

A note on framing: the benchmark numbers in these docs are out-of-the-box — no workload-specific tuning. Out of the box across general-purpose corpora, Bindu is competitive with or ahead of the best modern compressors. Tuning is what extends the gap on workloads you control end-to-end.

What’s different about Bindu’s starting point

Out of the box, Bindu uses no pre-built dictionary or pretrained model. This is a real difference from every other modern compressor:

| Codec | Ships with |
| --- | --- |
| zstd (with `--train`) | Trained dictionary, typically ≤256 KB |
| brotli | 119 KB static dictionary, biased toward 1990s English-language web |
| Neural compressors (cmix, nncp) | A learned model — gigabytes of weights |
| Bindu | Nothing. Pipeline selection runs from a single byte-signature pass over the input. |

So when we say “tunable,” the starting point is genuinely empty. There’s no English-language bias to overcome, no training corpus to fit against, no model to swap. Tuning means choosing which pipelines run, how the persistent rule pool is seeded, and (for deep deployments) what to strip out of the binary itself.

The tuning levers, in roughly increasing order of effort:

  1. Pipeline selection — turn off the pipelines that don’t apply to your data.
  2. Rule pool strategy — pick between empty pool, growing pool, or pre-seeded pool.
  3. Code-level customization — for embedded or specialized deployments (a satellite, a sensor pipeline), modify the binary itself.

A one-shot compression of a structured archive may need none of them. A continuously-running telemetry pipeline benefits from all three.

Pipeline selection

Bindu’s cascade contains several specialized pipelines, each a different way of factoring redundancy. The full list and what each one does is at Architecture → Stage 2; the short version:

| Pipeline | Best for |
| --- | --- |
| Grammar | Repeated phrases — text, structured logs, JSON, XML, source code. Powers in-place edit. |
| Burrows-Wheeler (BWT) | Long runs of similar bytes after sorting — large text corpora. Powers FM-index search. |
| Dictionary (DICT) | Small alphabet of repeating fixed-width units — int16/int32 columnar data. |
| Stride / periodic (LINDELTA) | Data that repeats with a fixed offset — sensor logs, columnar CSV, FITS image rows. |
| Constant | Runs of a single value — sparse telemetry quality flags. |
| RLE, PAETH2D | Residual streams against neighbor context — image and stride-residual paths. |

A general-purpose Bindu run picks one of these per input via Stage 1 byte-signature routing — automatic, no flag required. For a workload-specific deployment, you can also strip pipelines that will never apply (e.g., a telemetry pipeline never needs the BWT path; a text pipeline never needs the stride path). After stripping, the resulting compressor is dramatically smaller in code size and faster per byte processed.
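To make the routing concrete, here is a minimal sketch of what a single-pass byte-signature router could look like. It is illustrative only: the pipeline names follow the table above, but the statistics, thresholds, and function names are assumptions, not Bindu’s actual Stage 1 logic.

```rust
// Illustrative sketch of a single-pass byte-signature router.
// Pipeline names follow the table above; the statistics and
// thresholds are invented, not Bindu's actual Stage 1 logic.

#[derive(Debug)]
enum Pipeline {
    Grammar,  // repeated phrases: text, logs, JSON, source
    Bwt,      // large text corpora
    Dict,     // small alphabet of fixed-width units
    LinDelta, // fixed-stride / periodic data
    Constant, // runs of a single value
}

fn route(input: &[u8]) -> Pipeline {
    // One linear pass over a bounded prefix of the input.
    let sample = &input[..input.len().min(64 * 1024)];
    let len = sample.len().max(1) as f64;

    let mut counts = [0u64; 256];
    for &b in sample {
        counts[b as usize] += 1;
    }

    let distinct = counts.iter().filter(|&&c| c > 0).count();
    let printable: u64 = (0x20usize..0x7f).map(|b| counts[b]).sum();
    let printable_ratio = printable as f64 / len;

    // Cheap stride probe: how often does byte i equal byte i + 4?
    // A fixed-width int32 column scores high here.
    let stride_hits = sample.windows(5).filter(|w| w[0] == w[4]).count() as f64 / len;

    if distinct <= 1 {
        Pipeline::Constant
    } else if stride_hits > 0.5 {
        Pipeline::LinDelta
    } else if distinct <= 16 {
        Pipeline::Dict
    } else if printable_ratio > 0.95 && input.len() > (1 << 20) {
        Pipeline::Bwt
    } else {
        Pipeline::Grammar
    }
}

fn main() {
    println!("{:?}", route(b"timestamp,value\n1,0.50\n2,0.50\n"));
}
```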

Today this is a deployment-time decision — pipelines are removed in code rather than via a config flag. A configuration file is on the roadmap: once we have enough internal tuning data to know which knobs matter and which don’t, we’ll surface them as runtime options.

Rule pool strategy

The grammar pipeline maintains a persistent rule pool across compression sessions: high-value patterns extracted from one artifact are kept and consulted as a seed for the next. There are three strategies, depending on how you use the system; each is sketched in code after the list:

  • Empty pool. The default. Each file’s header carries everything required to decompress it. No external state to ship or maintain. Right for one-off files and for use cases where the rule pool adds little — most pure-telemetry workloads, where the deltas already capture the relevant structure without a separate rule table.
  • Growing pool. A long-running pipeline keeps the pool persistent across artifacts; each new file amortizes against everything previously seen. There’s an initial training period as the pool finds the dominant patterns; once it stabilizes, ratios converge to their tuned-state values. For an organization running Bindu on its own data over time, the growing pool becomes a high-value asset: a structured representation of every pattern shape they’ve processed.
  • Pre-seeded pool. If you have known patterns (a domain corpus, a representative sample of typical inputs), you can seed the pool up front. The first artifact then starts higher up the compression curve rather than spending time bootstrapping.
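A rough sketch of the three strategies as code. The `RulePool` type, its methods, and the fixed-window pattern extraction are all hypothetical stand-ins for Bindu’s actual grammar machinery:

```rust
use std::collections::HashMap;

// Hypothetical rule pool: byte pattern -> usage count. The type,
// methods, and 8-byte-window extraction are illustrative stand-ins,
// not Bindu's actual grammar machinery.
#[derive(Default)]
struct RulePool {
    rules: HashMap<Vec<u8>, u64>,
}

impl RulePool {
    // Strategy 1: empty pool. Every artifact stays self-contained.
    fn empty() -> Self {
        Self::default()
    }

    // Strategy 3: pre-seeded pool, built from a representative
    // corpus so the first artifact starts higher up the curve.
    fn seeded_from(corpus: &[&[u8]]) -> Self {
        let mut pool = Self::default();
        for sample in corpus {
            pool.absorb(sample);
        }
        pool
    }

    // Record patterns seen in one artifact. A real extractor keeps
    // only high-value grammar rules; counting fixed windows is a
    // stand-in for that.
    fn absorb(&mut self, artifact: &[u8]) {
        for window in artifact.windows(8) {
            *self.rules.entry(window.to_vec()).or_insert(0) += 1;
        }
    }
}

fn main() {
    // Strategy 2: growing pool. A long-running pipeline keeps one
    // pool alive and lets each new artifact amortize against it.
    let mut pool = RulePool::empty();
    for artifact in [b"sensor=3 ok\n".as_slice(), b"sensor=4 ok\n".as_slice()] {
        pool.absorb(artifact); // a real run would compress here too
    }

    let corpus: &[&[u8]] = &[b"sensor=1 ok\n", b"sensor=2 ok\n"];
    let seeded = RulePool::seeded_from(corpus);

    println!(
        "growing pool: {} rules, seeded pool: {} rules",
        pool.rules.len(),
        seeded.rules.len()
    );
}
```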

Nothing about the pool mechanism is tied to your domain. A finance company and a satellite operator end up with very different pools because the patterns their data exhibits are different, even though the machinery is identical.

Code-level customization

For deeply-tuned deployments — a satellite that only ever produces one shape of data, an embedded sensor pipeline, a specialized scientific instrument — the binary itself gets modified. That means (a sketch follows the list):

  • Stripping unused pipelines down to bare metal.
  • Adjusting the byte-signature routing thresholds in Stage 1 so they match the specific data shape.
  • Sometimes adding workload-specific paths (e.g., a particular sensor’s calibration encoding).
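As a hedged illustration, assuming a Rust-style build with compile-time feature flags: unused pipelines compile out entirely, routing thresholds become constants tuned to the one data shape the device produces, and the persistent encoder state shrinks to a small fixed struct. Every name and value here is invented for the sketch.

```rust
#![allow(dead_code)] // sketch: some items exist only to be stripped

// Illustrative sketch of a deployment-time build. The feature
// names, thresholds, and state layout are invented; the point is
// that tuning happens at compile time, not via a runtime flag.

// Unused pipelines are stripped at compile time: without the
// hypothetical "bwt" feature, this module (and its tables) never
// reaches the binary at all.
#[cfg(feature = "bwt")]
mod bwt_pipeline { /* ... */ }

// Stage 1 thresholds collapse to constants tuned to the one data
// shape this instrument produces (invented values).
const STRIDE_BYTES: usize = 4; // int32 sensor samples
const STRIDE_HIT_MIN: f64 = 0.30; // lowered: stride is expected here

// The entire persistent encoder state for a satellite-tuned build:
// a previous frame plus a couple of counters.
#[repr(C)]
struct EncoderState {
    last_frame: [u8; STRIDE_BYTES],
    run_length: u16,
    rule_seed_id: u16,
}

fn main() {
    // Tens of bytes (or fewer) of in-software state, per the text.
    println!(
        "persistent encoder state: {} bytes (stride hit floor {})",
        std::mem::size_of::<EncoderState>(),
        STRIDE_HIT_MIN
    );
}
```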

The result for a satellite-tuned compressor: tens of bytes of in-software state, plus the binary. Small enough to put on the spacecraft. The decompression side stays full-fat on the ground, where there’s no resource constraint.

This level of tuning is what we do for design partners with specific workloads. It’s not self-serve today.

A long-standing tradeoff in compression: speed and ratio are opposed. Faster codecs find less redundancy; higher-ratio codecs spend more cycles searching.

Bindu doesn’t have this tradeoff once tuning is applied. Stripping unused pipelines reduces the per-byte work *and* tightens the ratio (because the chosen pipeline is the one that fits the data). Seeding the rule pool reduces the encoder’s discovery cost *and* improves the ratio on related artifacts. Each tuning step compounds in both directions.

(This is one axis where Bindu sits ahead of conventional codecs even out of the box — see Honest limits for the encode/decode throughput tradeoffs that do apply.)

Bindu’s full architecture spans storage, network, and compute compression. In practice we are focusing the deployable surface on storage and network compression for sequential structured data right now, because that’s where the immediate value is and where focused engineering effort produces the cleanest production system.

Compute-side capabilities — operating on compressed memory, running compressed software directly — are real and demonstrated in research, but we are keeping them on the longer-term roadmap so the storage and network surface ships rock-solid first. See Roadmap for what’s coming.