Log archives

Application logs share the structural properties that make Bindu work: they’re sequential, highly repetitive, and queried long after they’re written. Many of the same arguments from the satellite & telemetry flagship — symbolic deltas, transmission cost dominance, operating on the compressed form directly — apply here. This page covers the log-specific framing.

  • Low cardinality of templates, high cardinality of values. service, level, host, region, error codes, and log templates all fit in small enums. Per-event variable fields (IDs, timestamps, numbers) fit in tightly packed columns.
  • You query more than you read. Once an event is written, the common access pattern is searching or aggregating — not reconstructing the original text.
  • You retain for compliance, not readability. 90-day, 1-year, or 7-year retention windows benefit heavily from 10× compression, and you never need the exact original byte layout.
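The template/value split in the first bullet can be sketched as follows. This is a hypothetical illustration, not Bindu's actual parser: it masks variable fields with a simple regex, interns the remaining template into a small ID space, and keeps the per-event values separately.

```python
import re

# Hypothetical sketch: split each log line into a low-cardinality
# template ID plus its per-event variable values.
TEMPLATES: dict[str, int] = {}  # template text -> small integer ID

# Matches decimal numbers and long hex-ish runs (illustrative only).
VAR = re.compile(r"\b\d+\b|\b[0-9a-f]{8,}\b")

def split_line(line: str) -> tuple[int, list[str]]:
    values = VAR.findall(line)                      # high-cardinality part
    template = VAR.sub("<*>", line)                 # low-cardinality part
    tid = TEMPLATES.setdefault(template, len(TEMPLATES))
    return tid, values

tid, vals = split_line("user 4821 checkout failed code 503")
tid2, _ = split_line("user 97 checkout failed code 500")
assert tid == tid2          # same template, different values -> same ID
assert vals == ["4821", "503"]
```

The template dictionary stays tiny even at high volume, which is what makes the enum-style encoding pay off.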
A typical pipeline looks like this:

app → shipper → hot store (last 24h) → nightly rollup → Bindu archive

Rollup:

bindu compress \
--schema app-logs-v3.bindus \
--dict app-logs.bindud \
--partition 'date, service' \
--index 'trace_id, user_id' \
--input 'hot/2026-01-23/*.jsonl' \
--out 'archive/2026-01-23/'

The --partition flag produces one file per (date, service) combination, which makes most queries selective at the file-skip layer.
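File-skip partition pruning can be sketched like this. The directory layout (archive/&lt;date&gt;/&lt;service&gt;.bindu) is an assumption for illustration; the point is that the predicate eliminates whole files before any of them are opened.

```python
from pathlib import PurePosixPath

# Hypothetical sketch: prune candidate files by partition key alone,
# assuming a layout like archive/<date>/<service>.bindu.
def candidate_files(paths, date=None, service=None):
    out = []
    for p in paths:
        parts = PurePosixPath(p).parts          # ('archive', date, file)
        p_date = parts[1]
        p_service = PurePosixPath(p).stem
        if date and p_date != date:
            continue                            # skip whole file by date
        if service and p_service != service:
            continue                            # skip whole file by service
        out.append(p)
    return out

files = [
    "archive/2026-01-22/checkout.bindu",
    "archive/2026-01-23/checkout.bindu",
    "archive/2026-01-23/search.bindu",
]
assert candidate_files(files, date="2026-01-23", service="checkout") == [
    "archive/2026-01-23/checkout.bindu"
]
```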

The --index flag builds a sparse Bloom filter over those columns. Lookups by trace_id touch O(1) files even across years of logs.
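The Bloom-filter idea can be sketched in a few lines. This is a toy, not Bindu's on-disk format: each archive file would carry a small filter like this, a lookup probes every filter in memory, and only files that answer "maybe" get opened.

```python
import hashlib

# Hypothetical sketch: a tiny per-file Bloom filter over trace_id.
class Bloom:
    def __init__(self, m_bits=8192, k=4):
        self.m, self.k, self.bits = m_bits, k, 0

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.blake2b(key.encode(), salt=bytes([i]) * 8).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def maybe_contains(self, key: str) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits >> pos & 1 for pos in self._positions(key))

f = Bloom()
f.add("a3f0c1")
assert f.maybe_contains("a3f0c1")   # present keys always answer "maybe"
```

With a filter per file, a trace_id lookup does cheap in-memory probes across the whole archive and typically opens exactly one file, which is what makes the O(1)-files claim plausible.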

# All errors from the checkout service last week
bindu query archive/ \
--where 'service == "checkout" && level == "error"' \
--since '2026-01-16' --until '2026-01-23'
# Reconstruct a full trace
bindu query archive/ \
--where 'trace_id == "a3f..."'
# Aggregate
bindu query archive/ \
--select 'service, count(*), p99(latency_ms)' \
--group-by service \
--since '2026-01-23'
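The aggregate query above behaves like a group-by over columns. A minimal sketch of the same computation in Python (the nearest-rank p99 is one common definition; Bindu's exact percentile method is not specified here):

```python
from collections import defaultdict

# Hypothetical sketch: group rows by service, compute count and p99 latency.
def p99(xs):
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(0.99 * len(xs)))]

rows = [
    {"service": "checkout", "latency_ms": 12},
    {"service": "checkout", "latency_ms": 480},
    {"service": "search",   "latency_ms": 33},
]
groups = defaultdict(list)
for r in rows:
    groups[r["service"]].append(r["latency_ms"])

result = {s: (len(v), p99(v)) for s, v in groups.items()}
assert result["checkout"] == (2, 480)
```

On the compressed form this runs per column, never touching the message text at all.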

On realistic log volumes (1 TB/day raw, compressed to ~100 GB/day), a targeted trace lookup against a 90-day archive runs in under a second on a single node.

For long retention windows, consider a two-tier scheme:

  1. Full fidelity for 30 days (--level 6, default indexes).
  2. Aggregated rollup for months 2–12 (bindu rollup — drops high-cardinality columns, keeps aggregates).

Storage drops by another 5–10× at the cost of losing per-event detail.
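The rollup tier can be sketched as collapsing per-event rows into one aggregate row per bucket. The (hour, service) bucket and the specific aggregates here are illustrative assumptions, but they show where the 5–10× comes from: the high-cardinality columns simply disappear.

```python
from collections import defaultdict

# Hypothetical sketch: collapse per-event rows into one aggregate row
# per (hour, service) bucket, dropping trace_id and other per-event IDs.
events = [
    {"ts": "2026-01-23T10:04", "service": "checkout",
     "trace_id": "a3f", "level": "error", "latency_ms": 480},
    {"ts": "2026-01-23T10:59", "service": "checkout",
     "trace_id": "b71", "level": "info", "latency_ms": 12},
]

agg = defaultdict(lambda: {"count": 0, "errors": 0, "lat_sum": 0})
for e in events:
    key = (e["ts"][:13], e["service"])      # hour bucket + service
    a = agg[key]
    a["count"] += 1
    a["errors"] += e["level"] == "error"
    a["lat_sum"] += e["latency_ms"]         # trace_id is dropped entirely

assert agg[("2026-01-23T10", "checkout")]["count"] == 2
```

Queries that only need rates and latencies still work against the rollup; reconstructing an individual trace does not.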

  • Don’t mix schemas in one archive. If service A and service B have different log shapes, partition by service. A union schema bloats the dictionary.
  • Watch trace ID encoding. If IDs are emitted as hex strings, explicitly type them as bytes(16) in the schema. Without that hint Bindu falls back to generic string compression and you lose ~30% on that column.
  • Clock skew matters. Double-delta timestamp encoding assumes roughly monotonic input, so sort by ts in the shipper before archiving.
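Double-delta encoding is why that sort matters. A minimal sketch (not Bindu's actual codec): the encoder stores the delta-of-deltas, which stays near zero for a steady event rate, so out-of-order timestamps produce large values that compress poorly.

```python
# Hypothetical sketch of double-delta timestamp encoding.
def double_delta(ts):
    deltas = [b - a for a, b in zip(ts, ts[1:])]
    # First timestamp, first delta, then deltas-of-deltas.
    return [ts[0], deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]

def undo(enc):
    ts = [enc[0]]
    delta = enc[1]
    ts.append(ts[-1] + delta)
    for dd in enc[2:]:
        delta += dd
        ts.append(ts[-1] + delta)
    return ts

ts = [1000, 1010, 1020, 1031, 1041]
enc = double_delta(ts)
assert enc == [1000, 10, 0, 1, -1]   # near-zero tail for steady rates
assert undo(enc) == ts               # lossless round trip
```

With sorted input the tail of the encoding is mostly small values around zero, which pack into very few bits; clock skew breaks that pattern without breaking correctness.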