Skip to content

Bindu vs. other compressors

Modern lossless compression splits into four well-defined categories. Bindu occupies a fifth category: computable compression. Let’s look at how all the categories compare.

For per-codec head-to-heads with measured numbers, see the benchmarks.

Examples: gzip (DEFLATE), bzip2 (BWT), zstd (LZ77+FSE), xz (LZMA2), brotli.

Approach: Find repeated byte sequences in a sliding window and substitute them with shorter codes.

Strengths: These are tools are ubiquitous, format-agnostic, fast, and battle-hardened over decades. zstd is the industry standard for speed and sz for compression ratio.

Limits: No awareness of structure. They re-discover the same redundancy on every block, which is why a satellite repeatedly re-compresses the same dark-sky frame instead of storing only what changed.

Examples: zstd with a trained dictionary, brotli’s static dictionary.

Approach: Ship a small (typically ≤256 KB) pre-computed table of common patterns alongside the codec; files compressed with it substitute references back into the table.

Strengths: These codecs deliver flat ratio improvements on repetitive small files with no runtime training overhead, especially on data that resembles the training corpus.

Limits: The dictionary is fixed at build time and brittle outside its training corpus. brotli’s dictionary, for instance, was tuned against 1990s English-language web content and gradually loses relevance as the web evolves.

Examples: aec / CCSDS 121 (satellite telemetry), FLAC (lossless audio), ZFP (LLNL float arrays), x265 (video), JPEG-LS, fpack (FITS).

Approach: Hand-tune the entire codec for one data shape (e.g. onboard spacecraft, audio waveforms, or scientific float arrays) assuming a great deal about the input to keep code size and runtime small.

Strengths: These codecs are unmatched on the workload they target, and small enough to run on tightly-constrained embedded hardware. aec encodes at 470 MB/s with just 32 MB of RAM; FLAC is the audio reference.

Limits: They only work on their target shape. Move outside it and they don’t apply at all. You end up with one codec per workload, and operational complexity grows accordingly.

Examples: cmix, nncp, paq8 family.

Approach: Train a deep model on the data as it streams, predicting each next symbol from prior context and encoding it in as few bits as the prediction allows.

Strengths: These compressors hit the highest published ratios on text. cmix reaches 7.10× on the Hutter Prize corpus where the best classical codec lands at ~4.7×, near the Shannon limit.

Limits: The model itself ends up larger than the data being compressed — gigabytes of weights for a gigabyte of compressed output. Compute cost is enormous (GPUs, hours of wall time). Unfit for embedded, satellite, or any transmission-constrained setting.

Bindu is a fifth category of compression tool known as computable compression. It is distinct from all four categories as follows:

  • Symbolic, not byte-level. The unit of compression is a symbol with coordinates and a root + delta decomposition, not a byte reference into a sliding window.
  • Computed on the spot, not pre-trained. No fixed dictionary, no out-of-band training step required. The symbol vocabulary is derived from the data in front of it.
  • Optionally growing. A single-shot session is fully self-contained (the file’s header carries everything required to decompress). A long-running pipeline can let the symbol vocabulary grow and amortize across artifacts.
  • Tunable per workload. The default pipeline includes seven sub-pipelines for different classes of structure. For a tightly-scoped deployment, you can strip the unused pipelines. For example,a satellite-tuned compressor is on the order of tens of bytes of in-software state.
  • Operates on the compressed form. You can search, edit, and cross-compare Bindu artifacts without decompressing them.
PropertyClassicalDict-trainedSpecializedNeuralBindu
Awareness of structureNoneLimitedHard-codedLearnedSymbolic
Vocabulary sourceNoneTraining corpusCodec authorModel weightsComputed on the spot
Vocabulary sizeNone≤256 KBNoneGB+Tunable: bytes → unbounded
Same codec across domainsYesYesNoYesYes
Read without decompressingNoNoNoNoYes
Search compressed formNoNoNoNoYes
Edit compressed formNoNoNoNoYes
Suitable for embeddedSomeYesYesNoYes (tuned)