Scientific datasets
Scientific datasets are a secondary use case. The flagship, sensor and time-series scientific data in particular, is covered in satellite & telemetry. This page covers the broader scientific shapes: tabular (CSV, Parquet), tensor (HDF5, NetCDF, Zarr), and time series. Bindu’s symbolic pipeline applies to all three with appropriate tuning.
Tabular
CSV and Parquet inputs are re-encoded through Bindu’s columnar path. Per-column codecs (see the Overview) let each column use the codec that fits its distribution:

```sh
bindu compress measurements.parquet
# → measurements.parquet.bindu
```

Typical Parquet → Bindu ratios range from 2–5× on top of Parquet’s own compression. The wins come from:
- Global dictionaries. Parquet’s dictionaries are per row-group; Bindu’s are per-file (or per-archive with --dict). See the sketch after this list.
- Better numeric codecs. Gorilla + ALP outperforms Parquet’s PLAIN/RLE on measurement data.
- Cross-row-group value reuse. Parquet can’t share a rare-but-repeated value across row groups; Bindu can.
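To make the first point concrete, here is a minimal Python sketch of per-row-group versus global dictionaries. It illustrates the idea only; the data layout and names are invented for the example, not Bindu’s on-disk format.

```python
# Illustration only: why one file-global dictionary beats per-row-group
# dictionaries. The layout and names here are invented for the example.
row_groups = [
    ["helium", "argon", "helium", "argon"],   # row group 0
    ["argon", "xenon", "argon", "helium"],    # row group 1
]

# Parquet-style: each row group carries its own dictionary, so shared
# values like "helium" and "argon" are stored once per group.
per_group_dicts = [sorted(set(rg)) for rg in row_groups]
per_group_entries = sum(len(d) for d in per_group_dicts)      # 2 + 3 = 5

# Bindu-style: one dictionary for the whole file; every distinct value
# is stored exactly once, and row groups hold small integer indices.
global_dict = sorted({v for rg in row_groups for v in rg})    # 3 entries
indices = [[global_dict.index(v) for v in rg] for rg in row_groups]

print(per_group_entries, len(global_dict))  # 5 entries vs 3
```

The gap grows with the number of row groups, since the global dictionary is paid for once per file rather than once per group.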
Query tools that read Parquet can read Bindu through a drop-in adapter (bindu-parquet-bridge).
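A usage sketch, assuming the bridge exposes a table-reading entry point; the module and function names below are hypothetical, not the adapter’s documented API:

```python
# Hypothetical API: the module and function names are assumptions for
# illustration; check the bindu-parquet-bridge docs for the real calls.
import bindu_parquet_bridge as bridge

table = bridge.read_table("measurements.parquet.bindu")  # hypothetical
print(table.num_rows)
```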
Tensor / array data
HDF5, NetCDF, and Zarr are handled by the built-in tensor schema:
```sh
bindu compress --schema tensor-v1 simulation.h5
```

Bindu chooses a codec per dataset based on dtype and observed distribution (one of these codecs is sketched after the table):
| Array kind | Codec |
|---|---|
| Dense float32 fields | Gorilla XOR + bit-planing |
| Integer counts | Frame-of-reference + zigzag |
| Categorical labels | Dictionary-encoded indices |
| Sparse fields | CSR with per-row varint |
| Identifier arrays | Raw |
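For intuition on the first row of the table, here is a minimal Python sketch of Gorilla-style XOR encoding (bit-planing and bit packing omitted). It illustrates the idea, not Bindu’s actual codec.

```python
import struct

def bits(x: float) -> int:
    """Reinterpret a float64 as its raw 64-bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def gorilla_xor(values):
    """XOR each value's bit pattern with its predecessor's.

    Slowly varying measurements share sign, exponent, and leading
    mantissa bits, so the XOR is mostly zeros and packs into few bits.
    """
    prev = bits(values[0])
    out = [prev]                      # first value is stored verbatim
    for v in values[1:]:
        b = bits(v)
        out.append(b ^ prev)
        prev = b
    return out

xored = gorilla_xor([20.0, 20.5, 20.5, 21.0])
print([f"{x:016x}" for x in xored])
# ['4034000000000000', '0000800000000000',
#  '0000000000000000', '0001800000000000']
```

The long zero runs in the XORed patterns are what the subsequent bit-level stages exploit.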
Chunk boundaries from the source are preserved, which keeps random access O(chunk size) — important for large simulation outputs where you rarely want the whole array.
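A sketch of why preserved chunk boundaries matter, assuming the reader keeps an index of independently compressed chunks; the layout (and the use of zlib as a stand-in codec) is invented for illustration:

```python
import zlib

CHUNK_ROWS = 4  # toy chunk size; Bindu's default is 64k rows

def read_row(chunks, row):
    """Fetch one row by decompressing only the chunk that holds it.

    `chunks` stands in for an on-disk chunk index: a list of
    independently compressed blocks with fixed row extents.
    """
    chunk_id, offset = divmod(row, CHUNK_ROWS)
    raw = zlib.decompress(chunks[chunk_id])   # O(chunk size) work
    values = raw.decode().split(",")
    return values[offset]

# Build toy chunks: rows 0..11 split into blocks of CHUNK_ROWS.
rows = [str(i * i) for i in range(12)]
chunks = [
    zlib.compress(",".join(rows[i:i + CHUNK_ROWS]).encode())
    for i in range(0, len(rows), CHUNK_ROWS)
]
print(read_row(chunks, 9))  # "81", after touching one chunk only
```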
Time series
Dedicated path for time-stamped numeric data:
```sh
bindu compress --schema ts-v1 \
  --series-key sensor_id \
  --value-column value \
  --ts-column ts \
  sensors/*.parquet
```

The ts-v1 schema applies:
- Double-delta timestamp coding (common sampling intervals compress to ~1 bit each; see the sketch after this list)
- Per-series Gorilla on the value column
- Series-level dictionaries for sparse tags
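A minimal Python sketch of the double-delta idea from the first item (framing and bit packing omitted; this shows the concept, not Bindu’s wire format):

```python
def double_delta(timestamps):
    """Encode timestamps as deltas-of-deltas.

    For a steady sampling interval the delta is constant, so every
    delta-of-delta after the first is 0 and packs into ~1 bit.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], dods

# 10-second sampling with one late sample.
ts = [1000, 1010, 1020, 1030, 1041, 1051]
print(double_delta(ts))  # (1000, [10, 0, 0, 1, -1])
```

Only irregularities cost bits, which is why steady-sampled data compresses so well on this path.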
Expect 8–15× over the source Parquet for steady-sampled sensor data.
Expected ratios
| Input | Over source | Legible? |
|---|---|---|
| Parquet (tabular) | 2–5× | Yes |
| HDF5 / NetCDF (dense arrays) | 1.5–4× | Yes |
| Zarr (chunked arrays) | 2–4× | Yes |
| Raw CSV | 10–30× | Yes |
| Time-series JSONL | 15–25× | Yes |
The “over source” number is the ratio against the already-compressed source format. Against raw text equivalents the ratios are much larger.
Pitfalls
- Floating-point precision. Gorilla is lossless. If you pre-quantize for size (e.g. float32 → float16), do that upstream; Bindu won’t reverse it. See the sketch after this list.
- Random access granularity. Compression is more effective with larger chunks, but large chunks make random access slower. The default is 64k rows per chunk; tune with --chunk-rows.
- Don’t re-compress NetCDF that’s already heavily tuned. If the producer spent effort on bit-shaving, Bindu adds overhead without proportional wins.
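For the first pitfall, a sketch of doing the lossy step upstream with NumPy before handing the file to Bindu; the file names are illustrative:

```python
import numpy as np

# The lossy precision reduction happens here, upstream of compression;
# Bindu then compresses the float16 array losslessly and won't undo it.
field = np.load("field_f32.npy")                    # illustrative name
np.save("field_f16.npy", field.astype(np.float16))  # half the bytes
```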