When not to use Bindu

Bindu is a tunable approach to compression, not a universal hammer. Using it where it doesn’t apply means dragging around tuning complexity for no benefit. Use this checklist before adopting it.

Video (H.264/265), audio (MP3/AAC/Opus), images (JPEG/WebP), and archive formats (.zip, .tar.gz) cannot be meaningfully compressed further. Bindu produces output the same size as the input plus a small header. Use the source format as-is.

Well-encrypted data looks random by design, so by definition there’s nothing to compress. The outcome is the same as for already-compressed media. Compress first, then encrypt, not the other way around.
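High-entropy input of the kinds described above can often be spotted before any tuning effort by trial-compressing a sample with a general-purpose codec. The sketch below uses Python's standard `zlib` as a stand-in probe; the function name `looks_incompressible` and the 0.95 cutoff are illustrative assumptions, not part of Bindu.

```python
import os
import zlib


def looks_incompressible(sample: bytes, threshold: float = 0.95) -> bool:
    """Return True if zlib barely shrinks the sample.

    The threshold is an arbitrary illustrative cutoff, not a Bindu setting.
    """
    if not sample:
        return True  # nothing to compress
    ratio = len(zlib.compress(sample, 6)) / len(sample)
    return ratio >= threshold


if __name__ == "__main__":
    # Random bytes stand in for compressed or encrypted input.
    print(looks_incompressible(os.urandom(65536)))
    # Highly repetitive text stands in for structured data.
    print(looks_incompressible(b"sensor=42;" * 6000))
```

A probe like this only says that a byte-level codec finds no redundancy; it is a cheap first filter, not a prediction of what a structure-aware codec would achieve.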

Your data is unstructured narrative prose and you don’t need legibility

Books, essays, long-form web text. Bindu lands competitive with the best classical codecs on prose (it’s the strongest classical result on the Hutter Prize corpora — see the industry benchmark) but doesn’t outrun them by a margin worth the tuning. If you’re not going to query or operate on the compressed form, zstd or xz with a trained dictionary is simpler.

Anywhere the consumer isn’t in your control — HTTP responses to third parties, emailed attachments, files posted to FTP servers used by unknown clients — pick a format every system can decompress. gzip, brotli, and zstd all qualify. Bindu does not yet.

Below a few KB of input, the per-file header and symbol-table bootstrap are a meaningful fraction of the output. Bindu shines on artifacts where there’s enough data for the symbol table to amortize.

Your workload is compress-once, read-once, sequentially

Bindu’s structural advantages include searching the compressed form, random access, and operating without decompressing. If you’ll read the file exactly once, sequentially, from start to end, those advantages don’t pay off. A byte compressor is simpler and sufficient.

Aggregate decode throughput is 4.6 MB/s today — zstd -19 decodes the same files at 1,075 MB/s. CDN hot paths, real-time streaming, and other decode-heavy workloads are not the right fit. (Search and edit on most data classes don’t require decompression — that’s the point of the system — but if your workload genuinely needs raw decoded bytes fast, pick zstd.)
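To put the two throughput figures in wall-clock terms, here is a small back-of-the-envelope calculation. It assumes the quoted MB/s refers to decoded output size and that throughput stays constant across file sizes; both are simplifications.

```python
BINDU_DECODE_MBPS = 4.6       # aggregate decode throughput quoted above
ZSTD19_DECODE_MBPS = 1075.0   # zstd -19 decode throughput quoted above


def decode_seconds(size_mb: float, throughput_mbps: float) -> float:
    """Time to decode size_mb at a constant throughput (assumed)."""
    if throughput_mbps <= 0:
        raise ValueError("throughput must be positive")
    return size_mb / throughput_mbps


if __name__ == "__main__":
    size_mb = 1000.0
    print(f"Bindu:    {decode_seconds(size_mb, BINDU_DECODE_MBPS):.1f} s")
    print(f"zstd -19: {decode_seconds(size_mb, ZSTD19_DECODE_MBPS):.2f} s")
```

For a 1,000 MB artifact this works out to roughly 217 seconds against just under 1 second, a gap of about 234×.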

The fast in-place edit paths (Tier 1 grammar rule patch, Tier 2 dictionary rewrite) require len(old) == len(new). Variable-length edits force a full decompress / recompress (Tier 3), with no advantage over the conventional pipeline. See Honest limits → In-place edit constraint.
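The length constraint can be checked before an edit is attempted. The sketch below is a hypothetical pre-check, not Bindu's API: the function name and return labels are made up for illustration, and it assumes the lengths are compared in bytes. Which of Tier 1 or Tier 2 applies to an equal-length edit is not modelled here.

```python
def edit_path(old: bytes, new: bytes) -> str:
    """Classify an edit by the len(old) == len(new) constraint.

    Returns "in-place" when a fast path (Tier 1 or Tier 2) is possible,
    and "recompress" when the edit falls back to Tier 3.
    """
    if len(old) == len(new):
        return "in-place"
    return "recompress"


if __name__ == "__main__":
    print(edit_path(b"status=OK", b"status=NO"))  # same length
    print(edit_path(b"v1.9", b"v1.10"))           # length changes
```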

Peak RSS during enwik9 compression is 19.5 GB for a 953 MB input — xz -9 peaks at 675 MB. CubeSat or otherwise memory-constrained deployments need to either chunk the input or use a tuned-down pipeline budget.
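The figures above imply a memory-to-input ratio that can guide a first guess at chunk size. The sketch below assumes peak memory scales linearly with input size, which is an unverified simplification, and treats 1 GB as 1,000 MB; measure on your own data before relying on it.

```python
ENWIK9_INPUT_MB = 953.0
BINDU_PEAK_RSS_MB = 19.5 * 1000   # 19.5 GB, decimal units assumed
XZ_PEAK_RSS_MB = 675.0


def rss_per_input_mb(peak_rss_mb: float, input_mb: float) -> float:
    """Peak resident memory per MB of input."""
    return peak_rss_mb / input_mb


def max_chunk_mb(budget_mb: float, ratio: float) -> float:
    """Largest chunk that fits the budget, ASSUMING linear scaling."""
    return budget_mb / ratio


if __name__ == "__main__":
    bindu_ratio = rss_per_input_mb(BINDU_PEAK_RSS_MB, ENWIK9_INPUT_MB)
    xz_ratio = rss_per_input_mb(XZ_PEAK_RSS_MB, ENWIK9_INPUT_MB)
    print(f"Bindu: {bindu_ratio:.1f}x input, xz -9: {xz_ratio:.2f}x input")
    print(f"Chunk for a 512 MB budget: {max_chunk_mb(512, bindu_ratio):.0f} MB")
```

On these numbers Bindu needs roughly 20× the input size in memory, against about 0.7× for xz -9, so under the linear assumption a 512 MB budget would suggest chunks of about 25 MB.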

Run bindu compress --dry-run first. If the predicted ratio isn’t meaningfully better than zstd -19, there’s no reason to pay the tuning cost. Bindu’s wins come from structure: no structure, no wins.

General-purpose mixed corpora out of the box

On a Silesia-style mix (text + binaries + scientific images + structured records) out of the box, Bindu is competitive with the best: it wins more files than any other codec, but loses some by 0.2–0.3× to xz or bzip2. If your data really is “everything mixed together” and you can’t tune for it, there is little to choose between Bindu, xz, and zstd. Tune Bindu for your specific shape and the gap opens up.

xz with executable filters is specifically tuned for x86 instruction encoding. On mozilla in the Silesia corpus, xz wins by ~25% over Bindu. If your archive is mostly x86 binaries, xz is the better default.

Bindu is a good fit in the mirror image of the cases above:

  • Data is sequential and slowly-changing (telemetry, satellite, sensor streams). Flagship use case.
  • Data is structured (JSON, JSONL, CSV, Parquet, source code, logs, scientific arrays).
  • Volumes are large or repeated.
  • You want capabilities beyond storage: search, query, edit, cross-file compare on the compressed form.
  • You control both writer and reader and can amortize a tuning step.

Is the data sequential telemetry / satellite / sensor stream?
├─ Yes → Use Bindu (flagship case).
└─ No → Is it structured (JSON, CSV, code, logs, tabular)?
   ├─ No → Use zstd (or gzip for compatibility).
   └─ Yes → Will you query, search, or operate on the compressed form?
      ├─ Yes → Use Bindu.
      └─ No → Use zstd -19 or xz -9e.
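
The decision tree above can also be written as a small function, which is handy for a pipeline that routes artifacts automatically. This is a direct transcription of the three questions; the function and parameter names are illustrative, not part of any Bindu tooling.

```python
def recommend_codec(
    sequential_stream: bool,
    structured: bool,
    operate_on_compressed: bool,
) -> str:
    """Walk the three questions of the decision tree in order."""
    if sequential_stream:
        return "Bindu (flagship case)"
    if not structured:
        return "zstd (or gzip for compatibility)"
    if operate_on_compressed:
        return "Bindu"
    return "zstd -19 or xz -9e"


if __name__ == "__main__":
    # Structured logs that will be searched in compressed form.
    print(recommend_codec(False, True, True))
    # Unstructured data, archived and forgotten.
    print(recommend_codec(False, False, False))
```

The tree only covers data shape and intended use. The earlier exclusions (already-compressed input, tiny files, third-party consumers, memory and decode-speed limits) still apply before it.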