
Document archives

Document archives are a secondary use case for Bindu (the flagship is satellite & telemetry). Structured document collections, such as support tickets, invoices, contracts, emails, and customer records, combine long retention windows with structured per-document metadata, and that metadata is where Bindu’s gains come from on this workload. Body text compresses at roughly the same ratio as a classical codec; the metadata is what moves the overall ratio.

  • Repetitive metadata. Headers, IDs, status fields, labels, and timestamps repeat at massive scale.
  • Templated bodies. Invoices, receipts, and notification emails share templates. Bindu finds these as substructural repeats once parsed.
  • Regulatory retention. 7-year, 10-year, and indefinite retention amplifies the value of a 10× ratio.
  • Search is the primary access pattern. You retrieve by ID or query much more often than you read sequentially.
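To make the retention point concrete, here is a minimal arithmetic sketch. The ingest rate of 1 TB of raw data per month is a hypothetical figure chosen for illustration, and the flat 10× ratio is the round number from the bullet above, not a measurement.

```python
def steady_state_tb(monthly_raw_tb: float, retention_years: int, ratio: float) -> float:
    """Stored volume in TB once the retention window is full."""
    months_retained = retention_years * 12
    return monthly_raw_tb * months_retained / ratio


MONTHLY_RAW_TB = 1.0  # hypothetical ingest rate, not a Bindu figure

for years in (7, 10):
    raw = steady_state_tb(MONTHLY_RAW_TB, years, ratio=1.0)
    packed = steady_state_tb(MONTHLY_RAW_TB, years, ratio=10.0)
    print(f"{years}-year window: {raw:.0f} TB raw vs {packed:.1f} TB at 10x")
```

The saving in absolute terabytes grows linearly with the retention window, which is why long-retention archives benefit most from a high ratio.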
# Ingest a month of tickets
bindu compress \
  --schema tickets-v2.bindus \
  --dict tickets.bindud \
  --partition 'year, month' \
  --index 'ticket_id, customer_id, status' \
  --input 'raw/2026-03/*.json' \
  --out 'archive/2026/03/'

For email/document corpora where bodies dominate:

bindu compress \
  --schema email-v1 \
  --body-column body \
  --body-codec 'prose' \
  --dict email-headers.bindud \
  --recursive emails/

--body-codec prose tells Bindu to apply its prose-tuned path to the body (lighter structural work, heavier entropy pass) while still using the structured path for headers.
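The header/body split described above can be pictured with a short sketch that uses only Python's standard `email` module. It is an illustration of the idea, not Bindu's parser: the sample message is invented, and the assumption that each record carries its headers as flat fields plus a `body` field (matching `--body-column body`) is ours.

```python
from email import message_from_string

# Invented sample message; headers and body are separated by a blank line.
raw = (
    "From: billing@example.com\n"
    "To: customer@example.com\n"
    "Subject: Invoice 1042\n"
    "Date: Mon, 02 Mar 2026 09:15:00 +0000\n"
    "\n"
    "Hello,\n"
    "\n"
    "Your invoice is attached.\n"
)

msg = message_from_string(raw)

# Structured part: short, repetitive fields that suit a structured codec.
headers = dict(msg.items())

# Free-text part: the payload of a non-multipart message is a plain string.
body = msg.get_payload()

# Assumed record shape: header fields plus one "body" column.
record = {**headers, "body": body}
print(sorted(record))
```

Keeping the two parts in separate fields is what lets a tool treat them differently; whether your ingest step needs to do this itself depends on the schema you use.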

On a sample enterprise ticket archive (JSON, ~2 KB metadata + 4 KB body avg):

Method      Size      Ratio
Raw JSON    1.0 TB    1.0×
gzip -6     290 GB    3.4×
zstd -9     220 GB    4.5×
Bindu       95 GB     10.5×
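
The ratio column follows directly from the sizes. A small sketch of the arithmetic, assuming 1 TB = 1000 GB:

```python
RAW_GB = 1000  # 1.0 TB of raw JSON, taking 1 TB = 1000 GB

# Compressed sizes in GB, copied from the table above.
sizes_gb = {"gzip -6": 290, "zstd -9": 220, "Bindu": 95}


def ratio(raw_gb: float, compressed_gb: float) -> float:
    """Compression ratio: original size divided by compressed size."""
    return raw_gb / compressed_gb


for method, size in sizes_gb.items():
    print(f"{method}: {ratio(RAW_GB, size):.1f}x")

# Relative gain of Bindu over the best classical result in the table.
print(f"Bindu vs zstd -9: {ratio(sizes_gb['zstd -9'], sizes_gb['Bindu']):.1f}x smaller")
```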

Metadata-heavy workloads (logs, ticket summaries, records) land higher on the ratio curve; body-heavy workloads (long emails, contract text) land lower.
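
One way to reason about where a workload lands is a simple blended-ratio model, sketched below. The per-part ratios (30× for metadata, 4× for body) are placeholders picked for illustration, not Bindu measurements, and real documents will not split this cleanly.

```python
def blended_ratio(meta_kb: float, body_kb: float,
                  meta_ratio: float, body_ratio: float) -> float:
    """Overall ratio when metadata and body compress at different ratios.

    This is a size-weighted harmonic mean of the two per-part ratios.
    """
    compressed_kb = meta_kb / meta_ratio + body_kb / body_ratio
    return (meta_kb + body_kb) / compressed_kb


# Placeholder per-part ratios; substitute figures measured on your own data.
META_RATIO = 30.0
BODY_RATIO = 4.0

metadata_heavy = blended_ratio(4, 1, META_RATIO, BODY_RATIO)   # e.g. ticket summaries
body_heavy = blended_ratio(1, 20, META_RATIO, BODY_RATIO)      # e.g. long contracts

print(f"metadata-heavy: {metadata_heavy:.1f}x, body-heavy: {body_heavy:.1f}x")
```

As the body share grows, the blended figure approaches the body ratio, because the compressed body dominates the output no matter how well the metadata compresses.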

bindu query archive/ --where 'ticket_id == "T-42819"'
bindu query archive/ --where 'customer_id == 38291 && status == "open"'

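With per-file Bloom filters on the indexed columns, these queries skip every file whose filter rules out the value, so the number of data files actually read stays small and roughly constant as the archive grows. Bloom filters can return false positives but never false negatives, so a query may occasionally read an extra file, but it never misses a match.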
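
The pruning idea can be sketched with a generic Bloom filter. This is a textbook construction, not Bindu's on-disk format: the filter size, hash count, and file names below are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Hypothetical archive: file name -> ticket IDs stored in that file.
files = {
    "2026-01.bindu": ["T-10001", "T-10002", "T-10003"],
    "2026-02.bindu": ["T-20001", "T-20002", "T-20003"],
    "2026-03.bindu": ["T-42817", "T-42818", "T-42819"],
}

# One filter per file, built at write time.
filters = {}
for name, ticket_ids in files.items():
    bf = BloomFilter()
    for tid in ticket_ids:
        bf.add(tid)
    filters[name] = bf


def candidate_files(key: str) -> list:
    """Files that must be opened; every other file is skipped unread."""
    return [name for name, bf in filters.items() if bf.might_contain(key)]


print(candidate_files("T-42819"))
```

The file that holds the ticket is always in the result; other files appear only on a false positive. Note that the lookup still consults one small filter per file, so the saving is in data files opened, not in filters checked.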

  • Immutability. Pass --seal to make the output read-only and add a Merkle tree over records. Tampering is detectable per-record.
  • Legal hold. bindu hold apply archive/ --filter '...' marks matching files as undeletable; bindu hold release removes the mark. Metadata-only operation, no re-compression.
  • Right to erasure. bindu erase archive/ --where 'customer_id == ...' rewrites only the affected files, preserving everything else. Erasure is logged in a tamper-evident audit trail.
  • PDFs are not documents to Bindu. A PDF body goes through as an opaque blob. If you need the text, extract it upstream and store text + PDF as two fields.
  • Attachments. Inline base64 attachments explode the body. Store them as separate fields with --bytes encoding, or externalize to object storage and keep only the reference.
  • Don’t over-partition. One file per ticket is dramatically worse than one file per day. Target 50–500 MB per .bindu file.
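
For readers unfamiliar with the Merkle tree mentioned under --seal, the sketch below shows how a hash tree over records makes tampering detectable. It is a generic construction (SHA-256, with leaf and node prefixes in the style of RFC 6962) and should not be read as Bindu's actual sealing format; the sample records are invented.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hashes(records):
    """Hash each record; the 0x00 prefix keeps leaves distinct from inner nodes."""
    return [_h(b"\x00" + r) for r in records]


def merkle_root(leaves):
    """Fold leaf hashes pairwise into a single root hash."""
    if not leaves:
        return _h(b"")
    level = list(leaves)
    while len(level) > 1:
        nxt = [_h(b"\x01" + level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired node is carried up unchanged
        level = nxt
    return level[0]


# Invented records standing in for the contents of a sealed file.
records = [b'{"ticket_id": "T-1"}', b'{"ticket_id": "T-2"}', b'{"ticket_id": "T-3"}']
sealed_leaves = leaf_hashes(records)
sealed_root = merkle_root(sealed_leaves)

# Simulate tampering with the second record.
tampered = list(records)
tampered[1] = b'{"ticket_id": "T-2", "status": "closed"}'
current_leaves = leaf_hashes(tampered)

root_matches = merkle_root(current_leaves) == sealed_root
changed = [i for i, (a, b) in enumerate(zip(sealed_leaves, current_leaves)) if a != b]
print(root_matches, changed)
```

Any change to any record changes the root, and comparing leaf hashes pinpoints which record was altered.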