Scientific datasets
Scientific datasets are a secondary use case. The flagship, sensor and time-series scientific data in particular, is covered in satellite & telemetry. This page covers the broader scientific shapes: tabular (CSV, Parquet), tensor (HDF5, NetCDF, Zarr), and time series. Bindu’s symbolic pipeline applies to all three with appropriate tuning.
Tabular
CSV and Parquet inputs are re-encoded through Bindu’s columnar path. Per-column codecs (see the Overview) let each column use the codec that fits its distribution:

```sh
bindu compress measurements.parquet
# → measurements.parquet.bindu
```

Typical Parquet → Bindu ratios range from 2–5× on top of Parquet’s own compression. The wins come from:
- Global dictionaries. Parquet’s dictionaries are per row-group; Bindu’s are per-file (or per-archive with --dict). See the sketch after this list.
- Better numeric codecs. Gorilla + ALP outperforms Parquet’s PLAIN/RLE on measurement data.
- Cross-row-group value reuse. Parquet can’t share a rare-but-repeated value across row groups; Bindu can.
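To make the first point concrete, here is a minimal Python sketch of per-row-group versus global dictionaries. It illustrates the idea only; the data layout and names are invented for the example, not Bindu’s on-disk format.

```python
# Illustration only: why one file-global dictionary beats per-row-group
# dictionaries. The layout and names here are invented for the example.
row_groups = [
    ["helium", "argon", "helium", "argon"],   # row group 0
    ["argon", "xenon", "argon", "helium"],    # row group 1
]

# Parquet-style: each row group carries its own dictionary, so shared
# values like "helium" and "argon" are stored once per group.
per_group_dicts = [sorted(set(rg)) for rg in row_groups]
per_group_entries = sum(len(d) for d in per_group_dicts)      # 2 + 3 = 5

# Bindu-style: one dictionary for the whole file; every distinct value
# is stored exactly once, and row groups hold small integer indices.
global_dict = sorted({v for rg in row_groups for v in rg})    # 3 entries
indices = [[global_dict.index(v) for v in rg] for rg in row_groups]

print(per_group_entries, len(global_dict))  # 5 entries vs 3
```

The gap grows with the number of row groups, since the global dictionary is paid for once per file rather than once per group.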
Query tools that read Parquet can read Bindu through a drop-in adapter (bindu-parquet-bridge).
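A usage sketch, assuming the bridge exposes a table-reading entry point; the module and function names below are hypothetical, not the adapter’s documented API:

```python
# Hypothetical API: the module and function names are assumptions for
# illustration; check the bindu-parquet-bridge docs for the real calls.
import bindu_parquet_bridge as bridge

table = bridge.read_table("measurements.parquet.bindu")  # hypothetical
print(table.num_rows)
```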
Tensor / array data
HDF5, NetCDF, and Zarr are handled by the built-in tensor schema:
```sh
bindu compress --schema tensor-v1 simulation.h5
```

Bindu chooses a codec per dataset based on dtype and observed distribution (one of these codecs is sketched after the table):
| Array kind | Codec |
|---|---|
| Dense float32 fields | Gorilla XOR + bit-planing |
| Integer counts | Frame-of-reference + zigzag |
| Categorical labels | Dictionary-encoded indices |
| Sparse fields | CSR with per-row varint |
| Identifier arrays | Raw |
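For intuition on the first row of the table, here is a minimal Python sketch of Gorilla-style XOR encoding (bit-planing and bit packing omitted). It illustrates the idea, not Bindu’s actual codec.

```python
import struct

def bits(x: float) -> int:
    """Reinterpret a float64 as its raw 64-bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def gorilla_xor(values):
    """XOR each value's bit pattern with its predecessor's.

    Slowly varying measurements share sign, exponent, and leading
    mantissa bits, so the XOR is mostly zeros and packs into few bits.
    """
    prev = bits(values[0])
    out = [prev]                      # first value is stored verbatim
    for v in values[1:]:
        b = bits(v)
        out.append(b ^ prev)
        prev = b
    return out

xored = gorilla_xor([20.0, 20.5, 20.5, 21.0])
print([f"{x:016x}" for x in xored])
# ['4034000000000000', '0000800000000000',
#  '0000000000000000', '0001800000000000']
```

The long zero runs in the XORed patterns are what the subsequent bit-level stages exploit.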
Chunk boundaries from the source are preserved, which keeps random access O(chunk size) — important for large simulation outputs where you rarely want the whole array.
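A sketch of why preserved chunk boundaries matter, assuming the reader keeps an index of independently compressed chunks; the layout (and the use of zlib as a stand-in codec) is invented for illustration:

```python
import zlib

CHUNK_ROWS = 4  # toy chunk size; Bindu's default is 64k rows

def read_row(chunks, row):
    """Fetch one row by decompressing only the chunk that holds it.

    `chunks` stands in for an on-disk chunk index: a list of
    independently compressed blocks with fixed row extents.
    """
    chunk_id, offset = divmod(row, CHUNK_ROWS)
    raw = zlib.decompress(chunks[chunk_id])   # O(chunk size) work
    values = raw.decode().split(",")
    return values[offset]

# Build toy chunks: rows 0..11 split into blocks of CHUNK_ROWS.
rows = [str(i * i) for i in range(12)]
chunks = [
    zlib.compress(",".join(rows[i:i + CHUNK_ROWS]).encode())
    for i in range(0, len(rows), CHUNK_ROWS)
]
print(read_row(chunks, 9))  # "81", after touching one chunk only
```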
Time series
Dedicated path for time-stamped numeric data:
```sh
bindu compress --schema ts-v1 \
  --series-key sensor_id \
  --value-column value \
  --ts-column ts \
  sensors/*.parquet
```

The ts-v1 schema applies:
- Double-delta timestamp coding (common sampling intervals compress to ~1 bit each; see the sketch after this list)
- Per-series Gorilla on the value column
- Series-level dictionaries for sparse tags
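A minimal Python sketch of the double-delta idea from the first item (framing and bit packing omitted; this shows the concept, not Bindu’s wire format):

```python
def double_delta(timestamps):
    """Encode timestamps as deltas-of-deltas.

    For a steady sampling interval the delta is constant, so every
    delta-of-delta after the first is 0 and packs into ~1 bit.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], dods

# 10-second sampling with one late sample.
ts = [1000, 1010, 1020, 1030, 1041, 1051]
print(double_delta(ts))  # (1000, [10, 0, 0, 1, -1])
```

Only irregularities cost bits, which is why steady-sampled data compresses so well on this path.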
Expect 8–15× over the source Parquet for steady-sampled sensor data.
Expected ratios
| Input | Over source | Legible? |
|---|---|---|
| Parquet (tabular) | 2–5× | Yes |
| HDF5 / NetCDF (dense arrays) | 1.5–4× | Yes |
| Zarr (chunked arrays) | 2–4× | Yes |
| Raw CSV | 10–30× | Yes |
| Time-series JSONL | 15–25× | Yes |
The “over source” number is the ratio against the already-compressed source format. Against raw text equivalents the ratios are much larger.
Pitfalls
- Floating-point precision. Gorilla is lossless. If you pre-quantize for size (e.g. float32 → float16), do that upstream; Bindu won’t reverse it. See the sketch after this list.
- Random access granularity. Compression is more effective with larger chunks, but large chunks make random access slower. The default is 64k rows per chunk; tune with --chunk-rows.
- Don’t re-compress NetCDF that’s already heavily tuned. If the producer spent effort on bit-shaving, Bindu adds overhead without proportional wins.
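For the first pitfall, a sketch of doing the lossy step upstream with NumPy before handing the file to Bindu; the file names are illustrative:

```python
import numpy as np

# The lossy precision reduction happens here, upstream of compression;
# Bindu then compresses the float16 array losslessly and won't undo it.
field = np.load("field_f32.npy")                    # illustrative name
np.save("field_f16.npy", field.astype(np.float16))  # half the bytes
```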