Document archives

Document archives are a secondary use case for Bindu (the flagship is satellite and telemetry data). Structured document collections, such as support tickets, invoices, contracts, emails, and customer records, combine long retention windows with structured per-document metadata, and that metadata is where Bindu’s gains come from on this workload. Body text compresses at roughly classical-codec parity; the metadata is what moves the ratio.

Why Bindu fits

Repetitive metadata. Headers, IDs, status fields, labels, and timestamps repeat at massive scale.
Templated bodies. Invoices, receipts, and notification emails share templates. Once the documents are parsed, Bindu detects the shared template text as substructural repeats.
Regulatory retention. 7-year, 10-year, and indefinite retention periods amplify the value of a 10× ratio.
Search is the primary access pattern. You retrieve by ID or query much more often than you read sequentially.
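
To make the "repetitive metadata" point concrete, here is a minimal sketch of dictionary encoding applied to a single status column. It illustrates the general technique only and is not a description of Bindu's internal format; the column values and the 2-bit code width are assumptions chosen for the example.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code into a table of distinct values."""
    table = {}  # value -> code, in first-seen order
    codes = []
    for value in values:
        # len(table) is evaluated before the insert, so new values get the next code.
        codes.append(table.setdefault(value, len(table)))
    return list(table), codes


def dictionary_decode(table, codes):
    """Invert dictionary_encode."""
    return [table[code] for code in codes]


# Hypothetical status column from 10,000 tickets.
statuses = ["open", "closed", "pending", "closed"] * 2500
table, codes = dictionary_encode(statuses)

# Raw cost: every string stored in full.
raw_bytes = sum(len(s) for s in statuses)
# Encoded cost: the table once, plus 2 bits per row (enough for 3 distinct values).
encoded_bytes = sum(len(s) for s in table) + (len(codes) * 2 + 7) // 8
```

The fewer distinct values a field has relative to its row count, the more this kind of encoding saves, which is why status fields, labels, and similar metadata benefit far more than free-text bodies.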

Suggested setup

# Ingest a month of tickets
bindu compress \
  --schema tickets-v2.bindus \
  --dict tickets.bindud \
  --partition 'year, month' \
  --index 'ticket_id, customer_id, status' \
  --input 'raw/2026-03/*.json' \
  --out 'archive/2026/03/'

For email/document corpora where bodies dominate:

bindu compress \
  --schema email-v1 \
  --body-column body \
  --body-codec 'prose' \
  --dict email-headers.bindud \
  --recursive emails/

--body-codec prose tells Bindu to apply its prose-tuned path to the body (lighter structural work, heavier entropy pass) while still using the structured path for headers.

Expected ratios

On a sample enterprise ticket archive (JSON, averaging roughly 2 KB of metadata and 4 KB of body per ticket):

Method	Size	Ratio
Raw JSON	1.0 TB	1.0×
gzip -6	290 GB	3.4×
zstd -9	220 GB	4.5×
Bindu	95 GB	10.5×

Metadata-heavy workloads (logs, ticket summaries, records) land higher on the ratio curve; body-heavy workloads (long emails, contract text) land lower.
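
The ratios in the table are simply raw size divided by compressed size. The short check below reproduces them from the listed sizes, assuming decimal units (1 TB = 1000 GB); the variable names are illustrative.

```python
GB_PER_TB = 1000  # assumption: decimal units


def compression_ratio(raw_gb, compressed_gb):
    """Ratio as used in the table: raw size over compressed size."""
    return raw_gb / compressed_gb


raw_gb = 1.0 * GB_PER_TB
sizes_gb = {"gzip -6": 290, "zstd -9": 220, "Bindu": 95}

# Ratios rounded to one decimal place, as in the table.
ratios = {name: round(compression_ratio(raw_gb, size), 1)
          for name, size in sizes_gb.items()}

# Space saved by Bindu relative to zstd -9 on this sample.
saved_vs_zstd_gb = sizes_gb["zstd -9"] - sizes_gb["Bindu"]
```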

Lookups without decompression

bindu query archive/ --where 'ticket_id == "T-42819"'
bindu query archive/ --where 'customer_id == 38291 && status == "open"'

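With per-file Bloom filters on indexed columns, these queries open only the files whose filter reports a possible match, so the number of files read stays roughly constant regardless of archive size. Bloom filters can return false positives but never false negatives, so an occasional extra file may be opened, but no match is missed.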
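
The file-skipping idea can be sketched in a few lines. This is a generic Bloom filter, not Bindu's implementation: the filter size, hash count, and file names below are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def candidate_files(filters, key):
    """Return only the files whose filter says the key may be present."""
    return [name for name, bf in filters.items() if bf.might_contain(key)]


# Hypothetical archive layout: one filter per file over the ticket_id column.
files = {
    "archive/2026/01/part-0.bindu": ["T-10001", "T-10002"],
    "archive/2026/03/part-0.bindu": ["T-42819", "T-42820"],
}
filters = {}
for name, ticket_ids in files.items():
    bf = BloomFilter()
    for ticket_id in ticket_ids:
        bf.add(ticket_id)
    filters[name] = bf

hits = candidate_files(filters, "T-42819")
```

Only the files in hits would need to be opened and searched; every other file is ruled out by its filter alone.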

Compliance considerations

Immutability. Pass --seal to make the output read-only and add a Merkle tree over records. Tampering is then detectable per record.
Legal hold. bindu hold apply archive/ --filter '...' marks matching files as undeletable; bindu hold release removes the mark. This is a metadata-only operation; nothing is re-compressed.
Right to erasure. bindu erase archive/ --where 'customer_id == ...' rewrites only the affected files, preserving everything else. Erasure is logged in a tamper-evident audit trail.
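
As a rough illustration of how a Merkle tree over records makes tampering detectable per record, the sketch below hashes each record into a leaf, folds the leaves into a root, and later compares recomputed leaves against the sealed ones. The hashing scheme (SHA-256, duplicating the last node on odd levels) and all function names are assumptions; what --seal actually writes is not specified here.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Fold leaf hashes pairwise into a single root hash."""
    if not leaves:
        return _h(b"")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [_h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


def seal(records):
    """Return (leaf_hashes, root) for a list of record byte strings."""
    leaves = [_h(b"\x00" + r) for r in records]
    return leaves, merkle_root(leaves)


def tampered_indices(records, sealed_leaves, sealed_root):
    """List the indices of records whose content no longer matches the seal."""
    if len(records) != len(sealed_leaves):
        raise ValueError("record count changed since sealing")
    leaves = [_h(b"\x00" + r) for r in records]
    if merkle_root(leaves) == sealed_root:
        return []
    return [i for i, (new, old) in enumerate(zip(leaves, sealed_leaves))
            if new != old]


# Hypothetical records.
records = [b'{"ticket_id": "T-1"}', b'{"ticket_id": "T-2"}', b'{"ticket_id": "T-3"}']
sealed_leaves, sealed_root = seal(records)

clean = tampered_indices(records, sealed_leaves, sealed_root)
modified = list(records)
modified[1] = b'{"ticket_id": "T-2", "status": "closed"}'
dirty = tampered_indices(modified, sealed_leaves, sealed_root)
```

The root is a single value that can be stored or published separately from the archive; the leaf hashes are what localize a mismatch to a specific record.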

Pitfalls

PDFs are not documents to Bindu. A PDF body goes through as an opaque blob. If you need the text, extract it upstream and store text + PDF as two fields.
Attachments. Inline base64 attachments explode the body. Store them as separate fields with --bytes encoding, or externalize to object storage and keep only the reference.
Don’t over-partition. One file per ticket is dramatically worse than one file per day. Target 50–500 MB per .bindu file.
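
For the attachment pitfall, an upstream preprocessing step might look like the following sketch. It moves inline base64 attachments out of each JSON document and leaves a content-addressed reference behind. The field names (attachments, filename, content_base64) are hypothetical, and the dict passed as store stands in for a real object-storage client.

```python
import base64
import hashlib
import json


def externalize_attachments(doc, store):
    """Return a copy of doc with inline attachments replaced by references."""
    slim = dict(doc)
    refs = []
    for att in doc.get("attachments", []):
        blob = base64.b64decode(att["content_base64"])
        key = hashlib.sha256(blob).hexdigest()
        store[key] = blob  # stand-in for an object-storage upload
        refs.append({"filename": att["filename"], "sha256": key, "size": len(blob)})
    slim["attachments"] = refs
    return slim


# Hypothetical ticket with one 3000-byte inline attachment.
payload = b"x" * 3000
ticket = {
    "ticket_id": "T-42819",
    "body": "See attached log.",
    "attachments": [
        {"filename": "app.log",
         "content_base64": base64.b64encode(payload).decode("ascii")},
    ],
}

object_store = {}
slim_ticket = externalize_attachments(ticket, object_store)
slim_size = len(json.dumps(slim_ticket))
original_size = len(json.dumps(ticket))
```

Keying the stored blob by its SHA-256 digest also deduplicates identical attachments that appear on many tickets.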