# Document archives
Document archives are a secondary use case for Bindu (the flagship is satellite & telemetry). Structured document collections — support tickets, invoices, contracts, emails, customer records — combine long retention windows with structured per-document metadata, and that metadata is where Bindu's wins come from on this workload. Body text compresses at roughly classical-codec parity; the metadata is what moves the ratio.
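The metadata-drives-the-ratio claim can be demonstrated with a toy columnar-split experiment. This is a sketch using plain `zlib` as a stand-in codec, not Bindu's actual format: identical metadata headers separated by large bodies fall outside deflate's 32 KB window, so a row layout pays for every repeat, while a column layout compresses the repeats almost for free.

```python
import random
import zlib

random.seed(0)
HEADER = b'{"status":"open","queue":"billing","priority":"p2","region":"eu-west-1"}'

# 50 records: identical structured metadata + a large incompressible body.
records = [(HEADER, random.randbytes(64 * 1024)) for _ in range(50)]

# Row layout: metadata and body interleaved, as in raw JSON lines.
interleaved = zlib.compress(b"".join(h + b for h, b in records))

# Column layout: all metadata together, all bodies together.
split = (zlib.compress(b"".join(h for h, _ in records))
         + zlib.compress(b"".join(b for _, b in records)))

print(len(interleaved), len(split))  # the split layout is smaller
```

The random bodies compress equally badly in both layouts; the entire difference comes from the repeated metadata, which is the effect the structured path exploits.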
## Why Bindu fits

- Repetitive metadata. Headers, IDs, status fields, labels, and timestamps repeat at massive scale.
- Templated bodies. Invoices, receipts, and notification emails share templates. Bindu finds these as substructural repeats once parsed.
- Regulatory retention. 7-year, 10-year, and indefinite retention amplifies the value of a 10× ratio.
- Search is the primary access pattern. You retrieve by ID or query much more often than you read sequentially.
## Suggested setup

```bash
# Ingest a month of tickets
bindu compress \
  --schema tickets-v2.bindus \
  --dict tickets.bindud \
  --partition 'year, month' \
  --index 'ticket_id, customer_id, status' \
  --input 'raw/2026-03/*.json' \
  --out 'archive/2026/03/'
```

For email/document corpora where bodies dominate:
```bash
bindu compress \
  --schema email-v1 \
  --body-column body \
  --body-codec 'prose' \
  --dict email-headers.bindud \
  --recursive emails/
```

`--body-codec prose` tells Bindu to apply its prose-tuned path to the body (lighter structural work, heavier entropy pass) while still using the structured path for headers.
## Expected ratios

On a sample enterprise ticket archive (JSON, ~2 KB metadata + ~4 KB body per record on average):
| Method | Size | Ratio |
|---|---|---|
| Raw JSON | 1.0 TB | 1.0× |
| gzip -6 | 290 GB | 3.4× |
| zstd -9 | 220 GB | 4.5× |
| Bindu | 95 GB | 10.5× |
Metadata-heavy workloads (logs, ticket summaries, records) land higher on the ratio curve; body-heavy workloads (long emails, contract text) land lower.
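The ratios in the table are plain arithmetic over the sizes (assuming decimal units, 1 TB = 1000 GB), which makes it easy to recompute them for your own measurements:

```python
raw_gb = 1000.0  # 1.0 TB of raw JSON

for method, size_gb in [("gzip -6", 290), ("zstd -9", 220), ("Bindu", 95)]:
    ratio = raw_gb / size_gb
    print(f"{method}: {ratio:.1f}x")  # 3.4x, 4.5x, 10.5x
```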
## Lookups without decompression

```bash
bindu query archive/ --where 'ticket_id == "T-42819"'
bindu query archive/ --where 'customer_id == 38291 && status == "open"'
```

With per-file Bloom filters on indexed columns, these lookups touch O(1) files regardless of archive size.
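The pruning idea behind per-file Bloom filters can be sketched in a few lines. This is an illustration, not Bindu's implementation; the file names, ID scheme, and filter parameters are all hypothetical:

```python
import hashlib

class Bloom:
    """Tiny Bloom filter: k salted BLAKE2b hashes over a fixed bit array."""
    def __init__(self, bits=8192, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.bitmap = bytearray(bits // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            d = hashlib.blake2b(key.encode(), salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(d[:8], "little") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.bitmap[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # No false negatives; rare false positives cost one extra file open.
        return all(self.bitmap[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# One filter per archive file, built over the indexed ticket_id column.
files = {f"archive/day-{d:02}.bindu": [f"T-{d * 100 + i}" for i in range(100)]
         for d in range(10)}
filters = {}
for path, ids in files.items():
    bf = Bloom()
    for tid in ids:
        bf.add(tid)
    filters[path] = bf

# A point lookup consults only the in-memory filters; almost always exactly
# one file survives pruning and needs to be opened, however many files exist.
candidates = [p for p, bf in filters.items() if bf.might_contain("T-428")]
print(candidates)
```

Because a Bloom filter never produces false negatives, the file holding the record is always among the candidates; false positives only add an occasional wasted file open.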
## Compliance considerations

- Immutability. Pass `--seal` to make the output read-only and add a Merkle tree over records. Tampering is detectable per-record.
- Legal hold. `bindu hold apply archive/ --filter '...'` marks matching files as undeletable; `bindu hold release` removes the mark. This is a metadata-only operation with no re-compression.
- Right to erasure. `bindu erase archive/ --where 'customer_id == ...'` rewrites only the affected files, preserving everything else. Erasure is logged in a tamper-evident audit trail.
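The tamper detection behind sealing can be sketched with a toy Merkle tree. This uses plain SHA-256 and a minimal tree construction; Bindu's actual tree layout and hash function are not specified here:

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Hash each record, then pairwise-hash levels up to a single root."""
    level = [sha(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd-sized levels
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

records = [b"T-1 open", b"T-2 closed", b"T-3 open", b"T-4 hold"]
sealed_root = merkle_root(records)  # stored at seal time

records[2] = b"T-3 closed"          # tamper with one record
assert merkle_root(records) != sealed_root  # mismatch exposes the tampering
```

Verifying a single record needs only its sibling hashes along the path to the root, which is what makes tampering detectable per-record without rereading the whole archive.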
## Pitfalls

- PDFs are not documents to Bindu. A PDF body goes through as an opaque blob. If you need the text, extract it upstream and store text + PDF as two fields.
- Attachments. Inline base64 attachments explode the body. Store them as separate fields with `--bytes` encoding, or externalize to object storage and keep only the reference.
- Don't over-partition. One file per ticket is dramatically worse than one file per day. Target 50–500 MB per `.bindu` file.
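The 50–500 MB target translates into simple sizing arithmetic. The daily volume and ratio below are hypothetical placeholders; substitute your own measurements:

```python
import math

raw_gb_per_day = 40   # hypothetical raw ingest volume
ratio = 10            # hypothetical overall compression ratio

compressed_mb = raw_gb_per_day * 1024 / ratio   # ~4096 MB/day after compression

files_min = math.ceil(compressed_mb / 500)      # fewest files (500 MB each)
files_max = math.ceil(compressed_mb / 50)       # most files (50 MB each)
print(f"{files_min}-{files_max} files/day")     # 9-82 files/day
```

At this volume, one file per ticket would mean thousands of tiny files per day, while one file per day would overshoot the upper bound; a per-hour or per-queue partition lands in range.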