Scientific datasets

Scientific datasets are a secondary but well-supported use case. The flagship case, sensor and time-series scientific data in particular, is covered in satellite & telemetry. This page covers the broader scientific shapes: tabular (CSV, Parquet), tensor (HDF5, NetCDF, Zarr), and time series. Bindu’s symbolic pipeline applies to all three with appropriate tuning.

CSV and Parquet inputs are re-encoded through Bindu’s columnar path. Per-column codecs (see the Overview) let each column use the codec that fits its distribution:

Terminal window
bindu compress measurements.parquet
# → measurements.parquet.bindu

Typical Parquet → Bindu ratios range from 2–5× on top of Parquet’s own compression. The wins come from:

  • Global dictionaries. Parquet’s dictionaries are per row-group; Bindu’s are per-file (or per-archive with --dict).
  • Better numeric codecs. Gorilla + ALP outperforms Parquet’s PLAIN/RLE on measurement data.
  • Cross-row-group value reuse. Parquet can’t share a rare-but-repeated value across row groups; Bindu can.
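
The global-dictionary and cross-row-group points can be illustrated with a toy size model. This is not Bindu’s encoder; `dict_encoded_size` and its cost assumptions (dictionary entries cost their string length, encoded cells cost one byte) are hypothetical, chosen only to show why a value that recurs once per row group is cheaper to store in a single file-wide dictionary:

```python
# Toy size model (NOT Bindu internals): one global dictionary vs. one
# dictionary per row group, when a rare value recurs across groups.

def dict_encoded_size(groups, use_global_dict):
    """Assumed cost model: each dictionary entry costs len(value) bytes,
    each encoded cell costs 1 byte (small-cardinality indices)."""
    if use_global_dict:
        uniques = {v for g in groups for v in g}
        dict_bytes = sum(len(v) for v in uniques)
    else:
        # Per-group dictionaries each re-store their own unique values.
        dict_bytes = sum(len(v) for g in groups for v in set(g))
    cells = sum(len(g) for g in groups)
    return dict_bytes + cells

# A long, rare value that shows up once per row group:
rare = "instrument=spectrometer-cryo-A/calibration-2019"
groups = [["a", "b", rare] * 100 for _ in range(50)]  # 50 row groups

per_group = dict_encoded_size(groups, use_global_dict=False)
global_ = dict_encoded_size(groups, use_global_dict=True)
# The global dictionary stores `rare` once instead of 50 times.
```

Under this model the per-group layout pays for the long string in every group’s dictionary, which is the reuse that a file-wide dictionary recovers.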

Query tools that read Parquet can read Bindu through a drop-in adapter (bindu-parquet-bridge).

HDF5, NetCDF, and Zarr inputs are handled by the built-in tensor schema:

Terminal window
bindu compress --schema tensor-v1 simulation.h5

Bindu chooses a codec per dataset based on dtype and observed distribution:

| Array kind | Codec |
| --- | --- |
| Dense float32 fields | Gorilla XOR + bit-planing |
| Integer counts | Frame-of-reference + zigzag |
| Categorical labels | Dictionary-encoded indices |
| Sparse fields | CSR with per-row varint |
| Identifier arrays | Raw |

Chunk boundaries from the source are preserved, which keeps random access O(chunk size) — important for large simulation outputs where you rarely want the whole array.
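
The integer-counts row of the table combines two standard techniques, which can be sketched independently of Bindu: frame-of-reference subtracts a reference value so residuals are small, zigzag interleaves signed residuals so small magnitudes get small unsigned codes, and a LEB128-style varint then spends bytes proportional to magnitude. The byte counts below model that pipeline, not Bindu’s actual wire format:

```python
# Sketch of frame-of-reference + zigzag + varint for integer counts
# (illustrates the technique; not Bindu's actual encoder or format).

def zigzag(n: int) -> int:
    """Interleave signed ints so small magnitudes get small codes:
    0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (n << 1) ^ (n >> 63)

def varint_len(n: int) -> int:
    """Bytes a LEB128-style varint needs for n (n >= 0): 7 bits per byte."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length

def frame_of_reference_size(counts):
    """Store the reference once, then zigzag(value - reference) per value."""
    ref = counts[0]
    return varint_len(ref) + sum(varint_len(zigzag(v - ref)) for v in counts)

# Counts hovering around 100,000: 8 residuals fit in one byte each.
counts = [100_000 + d for d in (0, 3, -2, 7, 1, -5, 4, 0)]
raw = 8 * len(counts)  # plain 8-byte integers
```

Here the encoded size is 11 bytes (3 for the reference, 1 per residual) against 64 bytes raw, which is the shape of win the table row refers to.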

Time-stamped numeric data gets a dedicated path:

Terminal window
bindu compress --schema ts-v1 \
  --series-key sensor_id \
  --value-column value \
  --ts-column ts \
  sensors/*.parquet

The ts-v1 schema applies:

  • Double-delta timestamp coding (common sampling intervals compress to ~1 bit each)
  • Per-series Gorilla on the value column
  • Series-level dictionaries for sparse tags

Expect 8–15× over the source Parquet for steady-sampled sensor data.
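
The per-series Gorilla step exploits a related property on the value column: XORing each float64’s bit pattern with its predecessor’s yields long runs of leading zeros when the series varies slowly, and those runs are what get compressed away. A minimal sketch of that observation (the XOR trick only, not Gorilla’s full control-bit scheme or Bindu’s implementation):

```python
# The observation behind Gorilla-style float compression (sketch only):
# XORs of consecutive float64 bit patterns are zero for repeats and have
# many leading zero bits when values change slightly.
import struct

def f64_bits(x: float) -> int:
    """Reinterpret a float64 as its 64-bit integer pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def xor_stream(values):
    """XOR each value's bit pattern with its predecessor's."""
    bits = [f64_bits(v) for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]

# A slowly varying temperature reading:
temps = [20.125, 20.125, 20.25, 20.25, 20.125]
xors = xor_stream(temps)
widths = [x.bit_length() for x in xors]  # 0 for repeats, small otherwise
```

Repeated readings XOR to zero (one bit to encode), and the small changes touch only low mantissa bits, leaving 17 leading zero bits here that never need to be stored.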

| Input | Over source | Legible? |
| --- | --- | --- |
| Parquet (tabular) | 2–5× | Yes |
| HDF5 / NetCDF (dense arrays) | 1.5–4× | Yes |
| Zarr (chunked arrays) | 2–4× | Yes |
| Raw CSV | 10–30× | Yes |
| Time-series JSONL | 15–25× | Yes |

The “over source” number is the ratio against the already-compressed source format. Against raw text equivalents the ratios are much larger.

  • Floating-point precision. Gorilla is lossless. If you pre-quantize for size (e.g. float32 → float16), do that upstream — Bindu won’t reverse it.
  • Random access granularity. Compression is more effective with larger chunks, but large chunks make random access slower. Default is 64k rows per chunk; tune with --chunk-rows.
  • Don’t re-compress NetCDF that’s already heavily tuned. If the producer spent effort on bit-shaving, Bindu adds overhead without proportional wins.