Compression benchmarks
This page compares Bindu to every lossless compressor we could install on a single test rig, across three industry-standard corpora. The benchmark comprises 349 measured runs, each round-trip verified by SHA-256.
To reproduce these numbers yourself, see the Reproducing benchmarks methodology page.
Test rig
| Component | Spec |
|---|---|
| CPU | AMD Ryzen 7 8745HS — 8 cores, TDP 35 W |
| RAM | 27 GB DDR5 |
| OS | Linux x86_64 |
| Verification | SHA-256 of decompressed output ≡ SHA-256 of original input |
All commodity compressors were invoked with single-thread flags (xz -T1, zstd --single-thread; gzip, bzip2, brotli, and zip are natively single-threaded) so that per-core throughput is comparable across codecs. Bindu uses its --shape/--dtype hints on structured satellite data, which matches its real deployment path.
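The verification rule in the table above (SHA-256 of the decompressed output must equal SHA-256 of the original input) can be sketched in a few lines. This is a minimal illustration, not the benchmark harness: it assumes Python's standard-library zlib, bz2, and lzma modules as stand-ins for the command-line tools, and the names `CODECS` and `measure` are hypothetical.

```python
import bz2
import hashlib
import lzma
import zlib


def sha256_hex(data: bytes) -> str:
    """Hex digest used to compare the original and the round-tripped bytes."""
    return hashlib.sha256(data).hexdigest()


# Stand-in codecs from the standard library (an assumption of this sketch);
# the benchmark itself invokes command-line tools, which are not reproduced here.
CODECS = {
    "zlib": (lambda b: zlib.compress(b, 9), zlib.decompress),
    "bz2": (lambda b: bz2.compress(b, 9), bz2.decompress),
    "lzma": (lambda b: lzma.compress(b, preset=6), lzma.decompress),
}


def measure(name: str, data: bytes) -> dict:
    """Compress, decompress, and report the ratio plus round-trip status."""
    encode, decode = CODECS[name]
    packed = encode(data)
    restored = decode(packed)
    return {
        "codec": name,
        "ratio": len(data) / len(packed),  # uncompressed / compressed
        "verified": sha256_hex(restored) == sha256_hex(data),
    }


if __name__ == "__main__":
    sample = b"telemetry frame 0001\n" * 5000  # synthetic input, not corpus data
    for codec in CODECS:
        print(measure(codec, sample))
```

A run only counts as valid when `verified` is true; a ratio from a codec that fails the round trip would be meaningless for a lossless comparison.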
Codecs measured
| Codec | Class | Files | Notes |
|---|---|---|---|
| Bindu | symbolic / formulaic | 30 | this benchmark |
| xz | LZMA2 | 30 | levels 1, 6, 9, 9e |
| brotli | LZ77 + static dictionary | 14 | levels 6, 9, 11 |
| bzip2 | BWT + Huffman | 30 | levels 1, 9 |
| zstd | FSE + LZ77 | 30 | levels 1, 3, 9, 19, 22 --long |
| gzip | DEFLATE | 30 | levels 1, 6, 9 |
| zip | DEFLATE archiver | 14 | level 6 |
| aec | CCSDS 121.0-B-3 Rice | 10 | satellite-applicable only |
| zfp | LLNL reversible | 2 | float-only |
| flac | Lossless audio | 3 | int16-only |
A note on aec (CCSDS 121 Adaptive Entropy Coding): this is the lossless reference codec used by NOAA, ESA, JAXA, and NASA missions, and it is embedded in HDF5 via SZIP. It is sometimes mistyped as “AES”, which is an encryption standard (Rijndael), not a compressor, so AES is not part of this benchmark.
Corpus
Three industry-standard corpora, 30 files total, ~1.5 GB raw:
| Corpus | Files | Total raw | Description |
|---|---|---|---|
| Silesia | 12 | 202 MB | Industry-standard generic benchmark — text, binaries, scientific images, structured records |
| Satellite / telemetry | 16 | 306 MB | SAR (Umbra), MSI (Sentinel-2), HSI (AVIRIS), weather (GOES-16), astronomy (Chandra), space-weather telemetry (MMS, OMNI, THEMIS), SSA text (AIS, ADS-B) |
| Hutter Prize / LTCB | 2 | 1.05 GB | enwik8 (100 MB) and enwik9 (1 GB) — first 100 MB and first 1 GB of the English Wikipedia XML dump (Mar 2006), Matt Mahoney’s canonical benchmark |
Per-file ratio table
The headline result: ratio (uncompressed / compressed) for each codec on each file. Bold = winner per file.
| Corpus | File | Size | Bindu | xz | brotli | bzip2 | zstd | gzip | zip | aec | zfp | flac |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Silesia | dickens | 9.7 MB | 4.02× | 3.60× | 3.60× | 3.64× | 3.58× | 2.65× | 2.63× | — | — | — |
| Silesia | mozilla | 48.8 MB | 2.88× | 3.83× | 3.69× | 2.86× | 3.42× | 2.70× | 2.69× | — | — | — |
| Silesia | webster | 39.5 MB | 5.75× | 4.95× | 4.92× | 4.80× | 4.90× | 3.44× | 3.40× | — | — | — |
| Silesia | x-ray | 8.1 MB | 2.13× | 1.89× | 1.81× | 2.09× | 1.64× | 1.41× | 1.40× | — | — | — |
| Silesia | xml | 5.1 MB | 12.63× | 12.29× | 12.41× | 12.12× | 11.80× | 8.07× | 7.72× | — | — | — |
| Silesia | samba | 20.6 MB | 5.03× | 5.78× | 5.74× | 4.75× | 5.57× | 4.00× | 3.96× | — | — | — |
| Silesia | osdb | 9.6 MB | 3.98× | 3.55× | 3.58× | 3.60× | 3.26× | 2.71× | 2.70× | — | — | — |
| Silesia | reymont | 6.3 MB | 5.90× | 5.04× | 4.97× | 5.32× | 4.92× | 3.64× | 3.57× | — | — | — |
| Silesia | sao | 6.9 MB | 1.44× | 1.64× | 1.58× | 1.47× | 1.45× | 1.36× | 1.36× | — | — | — |
| Silesia | nci | 32.0 MB | 24.79× | 23.15× | 22.08× | 18.51× | 20.84× | 11.23× | 10.49× | — | — | — |
| Silesia | ooffice | 5.9 MB | 2.14× | 2.54× | 2.48× | 2.15× | 2.37× | 1.99× | 1.99× | — | — | — |
| Silesia | mr | 9.5 MB | 4.22× | 3.63× | 3.53× | 4.08× | 3.21× | 2.71× | 2.70× | — | — | — |
| Satellite | SAR_Umbra | 15.3 MB | 1.46× | 1.43× | — | 1.40× | 1.36× | 1.33× | — | 1.33× | — | — |
| Satellite | MSI_S2_B04 | 47.7 MB | 2.81× | 2.65× | — | 2.82× | 2.41× | 2.20× | — | 2.81× | — | — |
| Satellite | MSI_S2_B11 | 57.5 MB | 3.10× | 2.82× | — | 3.03× | 2.56× | 2.29× | — | 3.22× | — | — |
| Satellite | HSI_Salinas | 47.5 MB | 3.01× | 3.23× | — | 2.38× | 2.38× | 1.46× | — | 1.83× | — | 1.83× |
| Satellite | HSI_IndianPines | 8.0 MB | 1.99× | 1.74× | — | 1.69× | 1.39× | 1.24× | — | 1.73× | — | — |
| Satellite | WX_GOES_Ch01 | 28.6 MB | 21.98× | 17.79× | — | 21.57× | 15.93× | 12.30× | — | 20.25× | — | 9.98× |
| Satellite | WX_GOES_Ch13 | 7.2 MB | 6.74× | 5.34× | — | 5.91× | 4.95× | 4.07× | — | 5.93× | — | 5.73× |
| Satellite | SCI_Chandra | 2.4 MB | 7.41× | 6.94× | — | 7.16× | 5.89× | 4.78× | — | 1.25× | — | — |
| Satellite | TEL_MMS_Epoch | 10.5 MB | 1.34× | 1.97× | — | 1.46× | 1.98× | 1.68× | — | — | — | — |
| Satellite | TEL_OMNI_Epoch | 348.8 KB | 2,349× | 3.70× | — | 3.02× | 3.24× | 2.85× | — | — | 2.17× | — |
| Satellite | TEL_OMNI_IMF | 174.4 KB | 220× | 128× | — | 201× | 157× | 160× | — | 36× | — | — |
| Satellite | TEL_MMS_flag | 5.3 MB | 263,314× | 5,933× | — | 112,849× | 29,103× | 1,025× | — | 1,213× | — | — |
| Satellite | TEL_MMS_B_gse | 21.1 MB | 1.62× | 1.44× | — | 1.10× | 1.13× | 1.13× | — | — | 1.00× | — |
| Satellite | TEL_THEMIS_B | 368.0 KB | 1.11× | 1.26× | — | 1.03× | 1.09× | 1.09× | — | — | — | — |
| Satellite | SSA_AIS | 50.0 MB | 8.75× | 9.30× | — | 4.16× | 8.68× | 2.81× | — | — | — | — |
| Satellite | SSA_ADSB | 3.8 MB | 10.33× | 8.94× | — | 9.76× | 8.79× | 6.55× | — | — | — | — |
| Hutter | enwik8 | 95.4 MB | 4.29× | 4.03× | 3.88× | 3.45× | 3.95× | 2.74× | 2.74× | — | — | — |
| Hutter | enwik9 | 953.7 MB | 5.43× | 4.69× | 3.87× | 3.94× | 4.25× | 3.10× | 3.09× | — | — | — |
Per-file winners
| Codec | Wins | Domains |
|---|---|---|
| Bindu | 19 | Generic text, structured records, scientific imaging, weather, astronomy, space-weather telemetry, SSA records, Hutter Prize |
| xz | 7 | x86 binaries, source code, certain hyperspectral cubes, AIS records |
| bzip2 | 2 | medical imaging, multispectral red band |
| aec (CCSDS 121) | 1 | Sentinel-2 SWIR band |
| zstd | 1 | int64 timestamps |
Bindu wins 63% of files outright — more than every other codec combined.
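Picking a per-file winner amounts to taking the codec with the highest ratio among those that actually ran on the file. The sketch below shows that rule on three rows copied from the per-file ratio table (a subset of codecs and files, chosen for illustration); the names `RATIOS`, `winner`, and `tally` are hypothetical and not part of the benchmark harness.

```python
from collections import Counter

# Three rows copied from the per-file ratio table.
# None marks a codec that was not run on that file.
RATIOS = {
    "dickens": {"Bindu": 4.02, "xz": 3.60, "bzip2": 3.64, "zstd": 3.58, "gzip": 2.65, "aec": None},
    "mozilla": {"Bindu": 2.88, "xz": 3.83, "bzip2": 2.86, "zstd": 3.42, "gzip": 2.70, "aec": None},
    "MSI_S2_B11": {"Bindu": 3.10, "xz": 2.82, "bzip2": 3.03, "zstd": 2.56, "gzip": 2.29, "aec": 3.22},
}


def winner(row: dict) -> str:
    """Return the codec with the highest ratio, ignoring codecs that were not run.

    On an exact tie, the codec listed first in the row wins.
    """
    measured = {codec: ratio for codec, ratio in row.items() if ratio is not None}
    return max(measured, key=measured.get)


def tally(table: dict) -> Counter:
    """Count wins per codec across all files in the table."""
    return Counter(winner(row) for row in table.values())


if __name__ == "__main__":
    for name, row in RATIOS.items():
        print(name, "->", winner(row))
    print(tally(RATIOS))
```

Skipping unmeasured codecs matters here because specialized codecs such as aec only run on the files whose dtype they support.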
Per-codec aggregate
Across files where each codec applies (specialized codecs only run on dtypes they support):
| Codec | Files | Total in | Total out | Comp % | Avg enc MB/s | Avg dec MB/s | Peak RSS enc |
|---|---|---|---|---|---|---|---|
| Bindu | 30 | 1.5 GB | 343 MB | 77.95% | 0.6 | 4.6 | 19.5 GB |
| xz | 30 | 1.5 GB | 371 MB | 76.15% | 1.9 | 105 | 675 MB |
| brotli | 14 | 1.2 GB | 318 MB | 74.54% | 2.0 | 493 | 237 MB |
| zstd | 30 | 1.5 GB | 412 MB | 73.51% | 2.3 | 949 | 747 MB |
| bzip2 | 30 | 1.5 GB | 435 MB | 72.03% | 22 | 39 | 9 MB |
| zip | 14 | 1.2 GB | 409 MB | 67.34% | 32 | 216 | 3 MB |
| gzip | 30 | 1.5 GB | 553 MB | 64.51% | 22 | 234 | 2.1 MB |
| aec (CCSDS 121) | 10 | 220 MB | 81 MB | 62.94% | 470 | 470 | 32 MB |
| flac | 3 | 83 MB | 30 MB | 63.84% | 153 | 265 | 4 MB |
| zfp | 2 | 21 MB | 21 MB | 0.44% | 250 | 173 | 44 MB |
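The Comp % column appears to be the space saving computed from the byte totals, that is 100 × (1 − total out / total in). The sketch below assumes that definition; the function name is hypothetical. Because the table shows rounded totals, recomputing from them reproduces the published percentages only approximately.

```python
def space_saving_percent(total_in: float, total_out: float) -> float:
    """Percentage of input bytes removed: 100 * (1 - out / in)."""
    if total_in <= 0:
        raise ValueError("total_in must be positive")
    return 100.0 * (1.0 - total_out / total_in)


if __name__ == "__main__":
    # flac row from the aggregate table: 83 MB in, 30 MB out (rounded totals).
    print(f"{space_saving_percent(83, 30):.2f}%")
```

An aggregate computed this way is weighted by file size, so the 1 GB enwik9 file influences it far more than a sub-megabyte telemetry file would; a mean of per-file percentages would rank codecs differently.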
Hutter Prize / Large Text Compression Benchmark
Bindu’s results on enwik8 and enwik9, set against published LTCB rankings and historical Hutter Prize entries:
enwik8 — 100 MB English Wikipedia
| Compressor | Ratio | bits/byte | Class | Source |
|---|---|---|---|---|
| cmix v18 | 7.10× | 1.13 | neural context-mix | Hutter Prize 2024 (published) |
| nncp v2 | 6.50–6.80× | 1.18–1.23 | neural arithmetic | published |
| paq8hp12 | 6.32× | 1.27 | tuned PAQ for enwik8 | published 2008 |
| zpaq -m6 | 6.11× | 1.31 | classical extreme | published |
| paq8l | 5.95× | 1.34 | PAQ ensemble | Hutter Prize 2009 |
| brotli -q11 | 4.68× | 1.71 | LZ77 + dict | published (Google) |
| xz -9 / lzma -9e | 4.65× | 1.72 | LZMA2 | published |
| zpaq -m5 | 4.34× | 1.84 | classical context-mix | published |
| zstd -22 | 4.30× | 1.86 | FSE + LZ77 | published (Meta) |
| Bindu | 4.29× | 1.87 | symbolic / formulaic | measured |
| bzip2 -9 | 3.71× | 2.16 | BWT | published |
| gzip -9 | 3.13× | 2.55 | DEFLATE | published |
enwik9 — 1 GB English Wikipedia
| Compressor | Ratio | bits/byte | Class | Source |
|---|---|---|---|---|
| cmix v18+ | 7.0–7.1× | 1.13–1.15 | neural context-mix | published |
| nncp v2 | ~6.5× | 1.23 | neural arithmetic | published |
| paq8 family | ~6.0× | 1.33 | PAQ ensemble | published |
| Bindu | 5.43× | 1.47 | symbolic / formulaic | measured |
| zpaq -m5 | ~4.7–5.0× | 1.60–1.70 | classical | published |
| xz -9 | 4.69× | 1.71 | LZMA2 | measured |
| zstd -19 | 4.25× | 1.88 | FSE + LZ77 | measured |
| bzip2 -9 | 3.94× | 2.03 | BWT | measured |
| brotli -9 | 3.87× | 2.07 | LZ77 + dict | measured |
| gzip -9 | 3.10× | 2.58 | DEFLATE | measured |
Where Bindu sits in the LTCB landscape
| Tier | Compressors | enwik8 ratio range | Notes |
|---|---|---|---|
| Neural (Hutter Prize-class) | cmix, nncp, paq8 | 5.95–7.10× | GPU/long compute; file-specific learned models |
| Bindu | Bindu | 4.29× | symbolic, single CPU, deterministic |
| Classical max-ratio | xz, brotli-11, zstd-22, zpaq-m5 | 3.95–4.68× | mainstream production codecs |
| Classical mainstream | bzip2, gzip | 2.74–3.71× | ubiquitous baselines |
On enwik9 (1 GB), Bindu’s 5.43× / 1.47 bpc is the best result among the non-neural codecs measured at this scale: it beats xz -9 (4.69×) by 16% and sits roughly 10% below the ~6.0× published for the paq8 family. The Hutter Prize award threshold is 7.10× / 1.13 bpc, historically reachable only with neural context mixing.
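The ratio and bits/byte columns are two views of the same measurement: a ratio of r means each 8-bit input byte costs 8 / r output bits. The sketch below works through that conversion and the 16% lead over xz -9 using numbers from the enwik9 table; the function names are hypothetical.

```python
def bits_per_byte(ratio: float) -> float:
    """Output bits spent per input byte for a given compression ratio."""
    return 8.0 / ratio


def relative_gain_percent(ratio_a: float, ratio_b: float) -> float:
    """How much larger ratio_a is than ratio_b, as a percentage."""
    return 100.0 * (ratio_a / ratio_b - 1.0)


if __name__ == "__main__":
    print(f"Bindu enwik9: {bits_per_byte(5.43):.2f} bits/byte")
    print(f"xz -9 enwik9: {bits_per_byte(4.69):.2f} bits/byte")
    print(f"Bindu vs xz -9: {relative_gain_percent(5.43, 4.69):.0f}% higher ratio")
```

Since the published ratios are themselves rounded, a recomputed bits/byte value can differ from a table entry in the last digit.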
ALP corpus — float64 time-series
Separately from the unified Silesia/satellite/Hutter benchmark, Bindu was measured against the ALP reference implementation on the 30-dataset ALP float-time-series corpus.
| Metric | Result |
|---|---|
| Geometric mean ratio (bits/value) vs ALP | 1.88× fewer |
| Per-dataset wins | 27 of 30 vs ALP, Chimp, and Patas |
The result is meaningful because ALP is the published state-of-the-art for lossless float compression; the 27/30 outcome is a clear lead on the workload ALP was designed for. Bindu uses the DICT and LINDELTA pipelines for this domain; see Architecture for the routing logic.
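A geometric mean is the usual way to average per-dataset ratios, because it treats a 2× win and a 2× loss as cancelling out. The sketch below assumes the headline metric is the geometric mean of (ALP bits/value ÷ Bindu bits/value) across datasets; that reading, the function names, and the sample numbers are all illustrative assumptions, not benchmark data.

```python
import math


def geometric_mean(values: list) -> float:
    """Geometric mean via the mean of logarithms; all values must be positive."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def mean_improvement(pairs: list) -> float:
    """pairs holds (baseline_bits_per_value, candidate_bits_per_value) per dataset."""
    return geometric_mean([baseline / candidate for baseline, candidate in pairs])


if __name__ == "__main__":
    # Hypothetical bits/value figures for three datasets (not from the ALP corpus).
    made_up = [(20.0, 10.0), (9.0, 9.0), (24.0, 6.0)]
    print(f"{mean_improvement(made_up):.2f}x fewer bits/value")
```

An arithmetic mean of the same three ratios (2, 1, 4) would report about 2.33, overstating the typical gain because the single large win dominates.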
Headline trade-offs
| Axis | Winner | Value |
|---|---|---|
| Best ratio overall | Bindu | 77.95% aggregate; 19/30 per-file wins |
| Highest encode throughput | aec (CCSDS 121) | 470 MB/s — onboard-spacecraft champion |
| Highest decode throughput | zstd -1 | 2,355 MB/s peak |
| Lowest RAM (compress) | gzip | 2.1 MB peak |
| Lowest energy (J/MB encoded) | aec | 0.009 J/MB |
| Highest single-file ratio | Bindu on TEL_MMS_flag | 263,314× (5.3 MB → 21 bytes) |
Domain recommendations
| Workload | Recommended codec | Why |
|---|---|---|
| Cold archival, mixed generic data | xz -9e | well-established, 77% Silesia aggregate |
| Cold archival, telemetry / structured | Bindu | wins 19/30 files; extreme ratios (200–263k×) on sparse telemetry |
| Hutter Prize / large text | Bindu | leads enwik8 (4.29×) and enwik9 (5.43×) among the classical codecs measured on this rig |
| Onboard spacecraft (CPU/RAM constrained) | aec | 62.94% at 470 MB/s, 32 MB RAM, 0.009 J/MB |
| Streaming / low-latency ingest | zstd -1 | 451 MB/s enc, 2,355 MB/s dec |
| Read-heavy / CDN | zstd -1 | decode leader by ~2× |
| x86 binaries | xz -9e | LZMA2 with executable filters tuned for this |
| Int16 sensor / time-series | flac | 63.8% at 153 MB/s, 4 MB RAM |
When Bindu is not the top performer
Not every file in the corpus is a Bindu win. The places it loses are worth flagging explicitly:
- Silesia mozilla, samba, ooffice: large source-tree archives containing x86 binaries. xz -9e’s LZMA2 with executable filters wins on all three (3.83× vs 2.88× on mozilla, 5.78× vs 5.03× on samba, 2.54× vs 2.14× on ooffice). If your archive is dominated by x86 binaries, xz is the better default.
- Sentinel-2 multispectral B11: aec (CCSDS 121) is marginally smaller (3.22× vs 3.10×). The specialized telemetry codec wins on the specific multispectral band it was tuned for.
- MMS Epoch / TEL_THEMIS_B / certain int64 timestamp encodings: zstd and xz come out ahead on a handful of timestamp formats (1.98× vs 1.34× on TEL_MMS_Epoch, 1.26× vs 1.11× on TEL_THEMIS_B).
The pattern is straightforward. A specialized codec can still win when the data exactly matches the shape it was built for, such as executable code, a specific satellite image band, or a narrow timestamp format. Bindu does best when the data has repeated structure it can describe with its grammar, BWT, dictionary, or stride pipelines. That is why Bindu leads on most files in the corpus, while a few highly specialized cases still favor older tools.
Caveats
- Single host, single run. Production procurement should replicate on n≥3 runs across multiple host SKUs.
- Energy is a CPU-time proxy ((user + sys) × 4.375 W/core); RAPL is root-only on this rig and perf_event_paranoid=4 blocks user-mode power/energy-pkg/. Ranking between codecs is preserved; absolute joule numbers require an out-of-band reader.
- No commercial codecs measured (DAPCOM FAPEC, OptimFROG). The benchmark harness picks them up if they appear on PATH.
- No neural context-mix codecs measured (cmix, nncp, paq8). Their published numbers are cited above for reference.
- Bindu uses multi-threaded SBPN mode on structured satellite data (via --shape/--dtype hints). Commodity codecs were single-threaded for per-core comparability. Parallel variants (pigz, pbzip2, pixz, zstd -T0) would scale encode near-linearly with cores without changing ratio.
- fpack (FITS Rice) is installed but excluded from the timed matrix because it requires a synthetic FITS wrapper around raw arrays.
- Bindu memory cost on enwik9: ~20 GB peak RSS to encode 1 GB. The BWT suffix-array allocation dominates; won’t fit on a 16 GB host. Smaller inputs are unaffected.
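The energy proxy from the caveats is plain arithmetic: CPU seconds (user + sys) multiplied by 4.375 W per core, which matches the rig's 35 W TDP divided across 8 cores. The sketch below assumes that derivation; the function names are hypothetical, and the sample timing is illustrative, not a measured run.

```python
TDP_WATTS = 35.0   # from the test rig table
CORES = 8          # from the test rig table
WATTS_PER_CORE = TDP_WATTS / CORES  # 4.375 W, the constant quoted in the caveats


def energy_proxy_joules(user_s: float, sys_s: float,
                        watts_per_core: float = WATTS_PER_CORE) -> float:
    """CPU-time energy proxy: (user + sys) seconds times watts per core."""
    return (user_s + sys_s) * watts_per_core


def joules_per_mb(user_s: float, sys_s: float, megabytes: float) -> float:
    """Energy proxy normalized by the amount of data encoded."""
    if megabytes <= 0:
        raise ValueError("megabytes must be positive")
    return energy_proxy_joules(user_s, sys_s) / megabytes


if __name__ == "__main__":
    # Illustrative only: a codec that spends 1 CPU-second encoding 470 MB.
    print(f"{joules_per_mb(1.0, 0.0, 470):.4f} J/MB")
```

A codec that spends one CPU-second per 470 MB lands near the 0.009 J/MB reported for aec, assuming its CPU time is close to its wall time. As the caveat notes, this proxy preserves ranking but is not a measured joule figure.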
Reproducibility
The full per-run CSVs (results.csv, satellite_results.csv, enwik_results.csv) ship with this benchmark and contain every measurement: ratio, compressed bytes, wall/user/sys time, peak RSS, page faults, context switches, energy proxy, and SHA-256 round-trip status. To reproduce on your own hardware, see Reproducing benchmarks.