Tuning Bindu
Out of the box, Bindu is competitive with the best modern compressors across general-purpose corpora. On domains that play to its strengths, such as telemetry, sensor streams, and structured logs, it outperforms them without any configuration. But to get the best performance on a given domain, Bindu needs to be tuned. This page walks through how we do that.
Why Bindu needs tuning
Most modern compressors push the user to commit to a corpus or model ahead of time.
- zstd offers a --train mode where you supply representative samples and it bakes the most-rewarding raw byte sequences into a dictionary (typically ≤256 KB) that you then ship alongside every compressed artifact; the dictionary is fixed at training time and tied to the corpus you chose.
- brotli embeds a 119 KB static dictionary baked against 1990s English-language web content; you take what brotli’s authors committed to.
- Neural compressors carry gigabytes of pretrained weights, also fixed at training time.
Once any of these commitments is made, it cannot easily be moved without redoing the work.
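To make the ahead-of-time commitment concrete, here is a minimal sketch that uses the preset-dictionary support in Python's standard zlib module as a stand-in for a trained dictionary. It is neither zstd nor Bindu; the sample record and the choice to use it directly as the dictionary are assumptions made purely for illustration.

```python
import zlib
from typing import Optional

# Hypothetical representative sample, chosen before any real data is seen.
SAMPLE = b'{"sensor_id": 17, "temp_c": 21.5, "status": "ok"}'

# The dictionary is fixed once and must be shipped to every decompressor.
PRESET_DICT = SAMPLE


def compress(data: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Compress with zlib, optionally primed with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress(blob: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Decompress; the same dictionary used for compression must be supplied."""
    if zdict is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


record = b'{"sensor_id": 42, "temp_c": 19.0, "status": "ok"}'
with_dict = compress(record, PRESET_DICT)
without_dict = compress(record)
```

The dictionary helps only because the new record resembles the sample, and the compressed bytes are unreadable without the exact same dictionary on the other side. That coupling is the commitment described above.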
Bindu inverts this paradigm. It ships with no pre-built dictionary, no pretrained weights, and no per-domain configuration. Instead, Bindu applies its architecture to dynamically discover the optimal set of formulas and symbols that can be used to represent the given data.
This is why tuning matters. To “tune” Bindu means to take subjective knowledge of the data set Bindu will compress and use it to refine the approach Bindu uses to discover the optimal formulas and symbols. With no tuning and sufficient time, Bindu will eventually discover an equally optimal set of formulas, but tuning lets us either skip unnecessary steps or “fast forward” to what Bindu would eventually discover, so that we achieve better compression ratios in less time.
Levers
There are three major tuning levers to apply. In roughly increasing order of effort, we have: pipeline selection, rule pool strategy, and code-level customization.
Pipeline selection
Bindu’s cascade contains several specialized pipelines, each a different way of factoring redundancy. The full list and what each one does are at Architecture → Stage 2.
When we tune for a workload, the first thing we do is strip the pipelines that will never apply. A telemetry deployment never needs the BWT path; a text deployment never needs the stride path. The resulting compressor is dramatically smaller in code size and faster per byte processed, and the routing decision in Stage 1 collapses to a near-trivial check.
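The effect of stripping pipelines on routing can be sketched as follows. This is an illustration, not Bindu's code: the registry layout, the `strip` and `route` helpers, and the ASCII check used as a routing signal are all hypothetical. Only the pipeline names (BWT, stride, grammar) come from this page.

```python
from typing import Callable, Dict, Iterable

Pipeline = Callable[[bytes], str]


def _stub(name: str) -> Pipeline:
    # Stand-in for a real pipeline: it only reports which path handled the data.
    return lambda data: name


# Hypothetical full cascade; the real list lives in Architecture -> Stage 2.
FULL_CASCADE: Dict[str, Pipeline] = {
    "bwt": _stub("bwt"),
    "stride": _stub("stride"),
    "grammar": _stub("grammar"),
}


def strip(cascade: Dict[str, Pipeline], keep: Iterable[str]) -> Dict[str, Pipeline]:
    """Return a tuned cascade containing only the pipelines named in `keep`."""
    keep = list(keep)
    unknown = set(keep) - set(cascade)
    if unknown:
        raise KeyError(f"unknown pipelines: {sorted(unknown)}")
    return {name: cascade[name] for name in keep}


def route(data: bytes, cascade: Dict[str, Pipeline]) -> str:
    """Pick a pipeline name for `data` (illustrative routing only)."""
    if len(cascade) == 1:
        # Only one pipeline left: routing collapses to a trivial check.
        return next(iter(cascade))
    # Assumed signal: pure-ASCII input is treated as text, anything else as strided data.
    preferred = "bwt" if data.isascii() else "stride"
    return preferred if preferred in cascade else next(iter(cascade))


# A telemetry deployment keeps only the stride path.
telemetry_cascade = strip(FULL_CASCADE, ["stride"])
```

With a single pipeline left, `route` never inspects the data at all, which is the sense in which the Stage 1 decision becomes near-trivial.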
Rule pool strategy
The grammar pipeline maintains a persistent rule pool across compress sessions: high-value patterns extracted from one artifact are kept and consulted as a seed for the next. When we tune a deployment, we pick one of three strategies for that pool depending on how the system will be used:
- Empty pool. The default. Each file’s header carries everything required to decompress it, and there is no external state to ship or maintain. This works well for one-off files and for use cases where the rule pool adds little value. For example, in pure-telemetry workloads, the deltas already capture the relevant structure without a separate rule table.
- Pre-seeded pool. If a data set or domain has known patterns, we seed the pool up front. The first artifact then starts higher up the compression curve rather than spending time bootstrapping.
- Growing pool. For long-running pipelines, we keep the pool persistent across compress sessions, so each new artifact builds on patterns from every prior one. There is a short training period while the dominant patterns surface; after that, ratios settle at their tuned-state level. Over time, the pool becomes a long-term record of every pattern Bindu has seen on that data, which can serve as a compounding advantage for an organization running Bindu on its own corpus.
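The difference between the three strategies can be shown with a toy model. Everything here is hypothetical: `PoolStrategy`, `RulePool`, and the token-counting `extract_patterns` are illustrative stand-ins, not Bindu's rule discovery, and the count of already-known patterns is only a rough proxy for the head start a pool provides.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Set


class PoolStrategy(Enum):
    EMPTY = "empty"
    PRE_SEEDED = "pre_seeded"
    GROWING = "growing"


def extract_patterns(artifact: bytes, min_count: int = 2) -> Set[bytes]:
    """Toy rule discovery: whitespace-separated tokens that repeat in the artifact."""
    counts = Counter(artifact.split())
    return {token for token, n in counts.items() if n >= min_count}


class RulePool:
    def __init__(self, strategy: PoolStrategy, seed: Iterable[bytes] = ()):
        self.strategy = strategy
        # Only the pre-seeded strategy starts with known patterns in this sketch.
        self.rules: Set[bytes] = set(seed) if strategy is PoolStrategy.PRE_SEEDED else set()

    def session(self, artifact: bytes) -> int:
        """Run one compress session; return how many patterns the pool already knew."""
        found = extract_patterns(artifact)
        known = len(found & self.rules)
        if self.strategy is PoolStrategy.GROWING:
            # Persist this artifact's patterns for every later session.
            self.rules |= found
        return known


first = b"GET /health 200 GET /health 200"
second = b"GET /health 500 GET /health 500"

empty = RulePool(PoolStrategy.EMPTY)
seeded = RulePool(PoolStrategy.PRE_SEEDED, seed=[b"GET"])
growing = RulePool(PoolStrategy.GROWING)
```

In this model the empty pool never recognizes anything, the pre-seeded pool recognizes its seed from the first artifact onward, and the growing pool recognizes nothing on the first artifact but benefits on the second from what the first one taught it.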
Code-level customization
Finally, we have the option of modifying the Bindu code itself. In practice, this means:
- Stripping unused pipelines down to bare metal.
- Adjusting the byte-signature routing thresholds in Stage 1 so they match the specific data shape.
- Adding workload-specific paths where it pays off (for example, a particular sensor’s calibration encoding).
As Bindu matures, these customizations will become available through configuration; for now, they are made in collaboration with Bindu design partners.
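As a rough picture of what adjusting a routing threshold could look like, the sketch below routes on the share of printable bytes. The `RoutingThresholds` class, the `min_printable_ratio` knob, the "text"/"binary" labels, and the example values are all assumptions for illustration; Bindu's actual byte signatures and thresholds are not described on this page.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RoutingThresholds:
    # Hypothetical knob: minimum share of printable bytes to treat input as text.
    min_printable_ratio: float = 0.90


def printable_ratio(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII, tab, newline, or carriage return."""
    if not data:
        return 0.0
    printable = sum(1 for b in data if 32 <= b < 127 or b in (9, 10, 13))
    return printable / len(data)


def route(data: bytes, thresholds: RoutingThresholds) -> str:
    """Return an illustrative route label based on the byte signature."""
    if printable_ratio(data) >= thresholds.min_printable_ratio:
        return "text"
    return "binary"


DEFAULT = RoutingThresholds()
# Tuned for a hypothetical log format that embeds short binary fields.
LOG_TUNED = replace(DEFAULT, min_printable_ratio=0.75)

# 24 printable bytes followed by 6 binary bytes: printable ratio is 0.8.
sample = b"ts=1 level=info payload=" + bytes([0, 1, 2, 3, 4, 5])
```

The same input lands on different paths under the two settings, which is why matching the threshold to the known data shape matters.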
Why speed and ratio improve together
A long-standing tradeoff in compression is that speed and ratio are opposed. Historically, faster codecs find less redundancy, while higher-ratio codecs spend more cycles searching.
Bindu does not have this tradeoff. Stripping unused pipelines reduces the per-byte work and tightens the ratio (because the chosen pipeline is the one that fits the data). Seeding the rule pool reduces the encoder’s discovery cost and improves the ratio on related artifacts. Each tuning step compounds in both directions.