Introduction

Bindu is a next-generation computable compression system.

Traditional compression means making files smaller. Classic tools like gzip, zstd, and brotli achieve this by scanning for repeated byte sequences and emitting shorter references in their place. To accelerate the process of finding repeated byte sequences, traditional compression tools often use dictionaries that are pre-populated with the most common byte sequences.
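
To make the pre-populated dictionary idea concrete, here is a minimal sketch using Python's standard zlib module, which accepts a preset dictionary through its zdict parameter. The sample records below are invented for illustration and are not taken from any Bindu benchmark.

```python
import zlib


def compress_with_dictionary(data: bytes, dictionary: bytes) -> bytes:
    """Deflate `data`, letting the encoder reference bytes in `dictionary`."""
    comp = zlib.compressobj(level=9, zdict=dictionary)
    return comp.compress(data) + comp.flush()


def decompress_with_dictionary(blob: bytes, dictionary: bytes) -> bytes:
    """Inflate `blob`; the same dictionary must be supplied again."""
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()


# Hypothetical telemetry-style records: the dictionary is an earlier record,
# so most of the new record can be emitted as short back-references.
dictionary = b'{"icao":"a1b2c3","alt_ft":35000,"gs_kt":450}'
record = b'{"icao":"a1b2c4","alt_ft":35025,"gs_kt":451}'

plain = zlib.compress(record, 9)                       # no dictionary
primed = compress_with_dictionary(record, dictionary)  # preset dictionary

print(len(record), len(plain), len(primed))
```

Note that the dictionary is not stored in the compressed output, so the reader must already hold the same dictionary to decompress.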

Computable compression is a novel, lossless approach that not only outperforms traditional compression tools at making files smaller (in most cases), but also lets users read, search, and edit data directly, without ever decompressing it. In other words, traditional compression tools lower only storage and data transfer requirements, while computable compression also lowers the compute requirements of working with any data artifact or program.

Bindu works by deriving a shared vocabulary of formulas and symbols that describe a given data set. Bindu is recursive, so these formulas and symbols are themselves reduced to an even more efficient shared vocabulary of formulas and symbols. This process typically runs for 11 full cycles; however, users can invest as much compute as they wish to add further cycles and achieve greater compression gains.
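
Bindu's actual derivation procedure is not specified here. As a loose analogy only, the sketch below uses a different, well-known technique: pair-replacement grammar compression in the style of Re-Pair. Each cycle replaces the most frequent adjacent pair with a new symbol, and later cycles can build on symbols created by earlier ones. The function names and the `cycles` parameter are hypothetical.

```python
from collections import Counter


def derive_rules(tokens, cycles):
    """Run up to `cycles` rounds of most-frequent-pair replacement.

    Illustrative stand-in, NOT Bindu's algorithm.
    Returns the shortened sequence and the rule table needed to expand it.
    """
    seq = list(tokens)
    rules = {}
    for _ in range(cycles):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so a new rule would not pay off
        symbol = ("R", len(rules))  # fresh symbol, distinct from byte values
        rules[symbol] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(seq, rules):
    """Losslessly expand a sequence produced by derive_rules."""
    out = []
    stack = list(reversed(seq))
    while stack:
        item = stack.pop()
        if item in rules:
            left, right = rules[item]
            stack.append(right)
            stack.append(left)
        else:
            out.append(item)
    return out


data = list(b"abababab abababab")
short, rules = derive_rules(data, cycles=11)
print(len(data), len(short), len(rules))
```

In this toy version, extra cycles stop helping once no adjacent pair repeats, which is why the loop exits early.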

As a frontier computable compression tool, Bindu unlocks the following core capabilities:

  • Best-in-class compression ratios on structured and text data. Across the combined Silesia + satellite + Hutter Prize corpus, Bindu is the per-file ratio leader on 19 of 30 files, representing more wins than all other codecs tested combined.
  • Search and edit without decompression. Bindu can accept conventional language as input, convert it to the formulas and symbols system it developed for a given data set, and then search or edit the data without decompressing it, leading to compute time improvements of many orders of magnitude.
  • One system for any data type. Bindu does not require per-domain dictionaries or pre-trained tables to perform well on a given data set. The same Bindu binary handles English text, scientific time-series, satellite imagery, sparse telemetry, and structured logs. While Bindu does benefit from being tuned to a specific use case, the core Bindu binary is unchanged.
  • Compounding efficiency from compute and data. Bindu’s compression ratio improves along two axes. The first is compute per artifact, which means that running additional iterations of formulas and symbols on a single file extracts deeper structure. The second is data through the system: every artifact you compress contributes patterns to a persistent rule pool that subsequent compressions consult, so the more data Bindu has already seen, the more cross-connections it can exploit on the next file. Unlike traditional codecs whose dictionaries are fixed at build time, Bindu’s vocabulary keeps growing with use.

Suppose your goal is to replace a single string in a 4 MB compressed archive of Automatic Dependent Surveillance-Broadcast (ADS-B) data (which includes aircraft GPS location, altitude, ground speed, and identification broadcast from planes).

The conventional way to do this is to decompress the file, use a tool like sed to make the edit, and recompress the file. When written as a single script in our test environment using xzip, this took 1.33 s. With Bindu, the data need not be decompressed to make an edit: there is no decompress step, you edit the data directly, and there is no recompress step. In our test environment, this completed in 3 ms, a speedup of 443x.
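
For reference, the conventional decompress, edit, recompress round trip can be sketched as follows. This is not the script used in the test above: Python's standard lzma module (xz container format) stands in for the command-line compressor and sed, and the function names and sample data are hypothetical. The last lines reproduce the speedup arithmetic from the quoted timings.

```python
import lzma


def replace_in_xz(blob: bytes, old: bytes, new: bytes) -> bytes:
    """Conventional edit: all three steps touch the full uncompressed data."""
    raw = lzma.decompress(blob)            # 1. decompress the whole archive
    edited = raw.replace(old, new)         # 2. edit the plain bytes
    return lzma.compress(edited, preset=9) # 3. recompress everything


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Ratio of baseline time to candidate time."""
    return baseline_s / candidate_s


# Hypothetical ADS-B-like payload, invented for illustration.
original = b"icao=a1b2c3 alt_ft=35000 gs_kt=450\n" * 1000
archive = lzma.compress(original, preset=9)
updated = replace_in_xz(archive, b"a1b2c3", b"ffffff")

# Timings quoted in the text: 1.33 s conventional vs 3 ms with Bindu.
print(round(speedup(1.33, 0.003)))
```

The cost of this path scales with the size of the whole archive, even when the edit touches a single string.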

Traditional compression achieves cheaper storage. Bindu achieves not only cheaper storage but also cheaper compute because every read, query, and edit takes place in compressed space rather than after a decompression step.

Bindu is general-purpose and domain-agnostic by design. The same binary handles every workload we have benchmarked, with no per-domain configuration.

We have found empirically that Bindu works especially well out of the box on structured, sequential, telemetry-style data such as satellite downlinks, sensor streams, structured logs, and scientific time-series. That said, Bindu can be tuned to support any compression use case.

Below we share Bindu’s out-of-the-box performance in various domains. Again, with tuning, we can substantially improve on these compression ratios:

| Domain | Bindu | Best other |
|---|---|---|
| Wikipedia (1 GB text) | 5.42× | xz -9: 4.69× |
| Generic English text | 4.02× | xz -9: 3.83× |
| Software / structured | 12.62× | xz -9: 12.46× |
| Float64 time-series | 1.88× fewer bits/value (geomean) | ALP, Chimp, Patas reference |
| Weather imagery | 21.98× | aec (CCSDS 121): 20.49× |
| Sparse telemetry | 5.5 M× (5.3 MB → 1 byte) | bzip2 -9: 113,000× |

The full 9-domain matrix and head-to-head tables are at Comparisons → Benchmarks.