
Operating on compressed data

This is the section that diverges most sharply from conventional codecs. With gzip or zstd, the compressed file is a dead artifact: you decompress it before doing anything useful. With Bindu, three operations run directly against the compressed wire: search, count/locate, and edit.

This page walks through what’s available today, with model coverage and measured numbers for each operation.

This demo continues from Compressing data, where you compressed Alice’s Adventures in Wonderland into alice.txt.bindu. The same operations apply to any artifact Bindu produces.

Terminal window
bindu search "pattern" file.bindu
bindu count "pattern" file.bindu
bindu find "pattern" file.bindu -n 10

bindu search (and its count / find variants) auto-detects the wire’s compression model and dispatches a model-native search. The capability table:

| Wire model | Body content | Search strategy | Status |
| --- | --- | --- | --- |
| RAW | body is the raw bytes | Boyer-Moore-Horspool over body | native |
| CONSTANT | single repeated byte | O(1) — does the pattern repeat? | native |
| DICT | alphabet + index stream | alphabet-mapped index-stream scan | native |
| BWT + index | Burrows-Wheeler L-column + sidecar | FM-index backward search (Ferragina-Manzini) | native (opt-in index) |
| Grammar | rule table inside legacy wire | rule-table memcmp | native |
| RLE, LINDELTA, PAETH2D | residual streams that need neighbor context | decompress + scan | fallback |

Five model classes admit zero-decompression search today (RAW, CONSTANT, DICT, BWT-with-index, grammar-rule probe). Three (RLE, LINDELTA, PAETH2D) deliberately fall back because their wire formats encode residuals against a neighbor context that has to be reconstructed before any candidate match can be verified — a correctness-preserving choice, not a missing feature.
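
To make the fast paths concrete, here is a minimal, runnable sketch of the CONSTANT case in Python. The function and its arguments (fill byte, body length) are illustrative stand-ins for fields a real wire header would carry, not Bindu’s internal API; the point is that no body bytes are read at all.

```python
def constant_count(fill_byte: int, body_len: int, pattern: bytes) -> int:
    """O(1) search over a CONSTANT wire: the body is never touched."""
    # A constant body contains the pattern iff the pattern is that same
    # byte repeated; in that case every window position is a match.
    if pattern != bytes([fill_byte]) * len(pattern):
        return 0
    return max(body_len - len(pattern) + 1, 0)

# 1 MB of one byte, 8-byte pattern: 1,048,576 - 8 + 1 = 1,048,569 matches,
# which is the CONSTANT row in the measurement table below.
assert constant_count(0x41, 1 << 20, b"A" * 8) == 1_048_569
```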

All verified with byte-exact round-trip:

| Operation | Corpus | Pattern | Result | Time | Speedup vs fallback |
| --- | --- | --- | --- | --- | --- |
| Grammar rule probe | 4 MB ADS-B JSON | rule match | 453 occurrences | 1 ms | 453× |
| CONSTANT scan | 1 MB constant byte | 8-byte run | 1,048,569 matches | 1 ms | (no decode) |
| FM-index count | 10 MiB BWT index | 9 bytes | 1,280 matches | 5 ms | 170× |
| FM-index locate | 10 MiB BWT index | 9 bytes | 1,280 positions | 93 ms | (positions only via FM) |
| DICT native scan | 30 MiB GOES-Ch01 int16 | 4 bytes | 94,204 matches | 1.04 s | ~4× |

bindu count returns just the occurrence count and is the fastest path. bindu find -n K returns the first K positions and pays the additional O(K · log n) for position recovery on the FM-index path. If you only need the count (e.g., for filtering or aggregation), prefer count.
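
For the BWT path, here is a compact, runnable sketch of Ferragina-Manzini backward search: each pattern symbol narrows a suffix range with two rank lookups, so counting never reconstructs the text. This is simplified for clarity; Bindu’s opt-in sidecar will store rank checkpoints so the rank query is O(1) instead of the linear scan used here.

```python
def fm_count(bwt_bytes: bytes, pattern: bytes) -> int:
    # C[c]: number of symbols in the text strictly smaller than c.
    freq = [0] * 256
    for c in bwt_bytes:
        freq[c] += 1
    C, total = [0] * 256, 0
    for c in range(256):
        C[c], total = total, total + freq[c]

    def occ(c: int, i: int) -> int:
        return bwt_bytes[:i].count(c)  # rank of c in bwt[:i]

    lo, hi = 0, len(bwt_bytes)         # current suffix-array range
    for c in reversed(pattern):        # extend the match right-to-left
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0                   # empty range: no occurrences
    return hi - lo                     # range width = occurrence count

def bwt(text: bytes) -> bytes:
    t = text + b"\x00"                 # sentinel below all text bytes
    return bytes(r[-1] for r in sorted(t[i:] + t[:i] for i in range(len(t))))

data = b"abracadabra abracadabra"
assert fm_count(bwt(data), b"abra") == 4
```

bindu find -n K then has to map each entry of the final lo..hi range back to a text position via a sampled suffix array, which is where the extra O(K · log n) comes from; hence count being the cheaper call.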

Terminal window
bindu replace file.bindu "old" "new"

bindu replace rewrites every occurrence of old with new. When the pattern equals an extracted grammar rule and len(old) == len(new), the replacement patches the rule body once and every downstream occurrence updates implicitly — no decompression, no recompression, no file rewrite.

Three tiers in priority order:

  1. Grammar rule patch — equal-length, in-place via memory-mapped write. Measured at 3 ms on 4 MB ADS-B, against 1.33 s for the zstd -d → sed → zstd -19 baseline (a 443× speedup). Zero scratch I/O.
  2. Dictionary entry rewrite — equal-length pattern that matches a DICT alphabet entry; rewrites the entry inline.
  3. Decompress + scan + recompress — the general case for non-equal-length edits or unsupported wire models. Correctness-preserving; same cost as the conventional pipeline.

Length-preserving edits are the common case for fielded data corrections (numeric value updates, coordinate fixes, ID rewrites). Non-length-preserving edits force tier 3, because changing an underlying rule body’s length cascades through every offset that follows it — see Honest limits → In-place edit constraint.
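
The selection logic itself is small. Here is a sketch assuming simplified in-memory stand-ins for the wire’s rule table and DICT alphabet; choose_tier and its arguments are hypothetical names, not Bindu’s API:

```python
def choose_tier(model: str, rules: list[bytes], alphabet: list[bytes],
                old: bytes, new: bytes) -> int:
    same_len = len(old) == len(new)
    if same_len and model == "grammar" and old in rules:
        return 1  # patch the rule body once; reference stream untouched
    if same_len and model == "dict" and old in alphabet:
        return 2  # rewrite the alphabet entry inline
    return 3      # decompress + scan + recompress (general case)

assert choose_tier("grammar", [b"GLOBAL_NAV_OK"], [],
                   b"GLOBAL_NAV_OK", b"LOCAL__NAV_OK") == 1
assert choose_tier("grammar", [b"GLOBAL_NAV_OK"], [],
                   b"GLOBAL_NAV_OK", b"NAV_OK") == 3  # length changed
```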

Both the 1 ms search and the 3 ms in-place edit come from the same underlying structure: Bindu’s grammar pipeline extracts repeated phrases into a rule table and rewrites the body as a stream of references into that table.

So a 4 MB ADS-B archive compressed via the grammar pipeline contains:

  • A rule table (typically a few KB) with each unique repeated phrase stored exactly once.
  • A reference stream — most of the body — made of index numbers that point into the rule table, plus residuals where needed.

Why the search hits 1 ms. A query like bindu count "GLOBAL_NAV_OK" runs in two cheap phases. First, a memcmp probe over the rule table answers “does any rule’s body match the query?” — a KB-scale scan, microseconds. Second, if rule index N matched, count how many times N appears in the reference stream. That’s metadata about the rule, not a scan over reconstructed body content. The decompress-and-scan fallback at 453 ms is doing roughly 450× more work — fully reconstructing the 4 MB body, then running Boyer-Moore over it.
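
Both phases fit in a few lines. This toy model assumes the body is a flat list of rule references, with RESIDUAL standing in for literal runs; the real reference stream is more involved, and patterns that span rule boundaries would not be caught by the probe alone:

```python
RESIDUAL = -1  # stand-in marker for literal (non-rule) content

def grammar_count(rules: list[bytes], refs: list[int], pattern: bytes) -> int:
    # Phase 1: memcmp probe over the rule table (KB-scale, microseconds).
    hits = {i for i, body in enumerate(rules) if body == pattern}
    if not hits:
        return 0
    # Phase 2: count references to the matching rule. This is metadata
    # about the rule, not a scan over reconstructed body content.
    return sum(1 for r in refs if r in hits)

rules = [b"GLOBAL_NAV_OK", b"ALT_HOLD", b"GS_CAPTURE"]
refs = [0, 2, 0, RESIDUAL, 1, 0, 2]
assert grammar_count(rules, refs, b"GLOBAL_NAV_OK") == 3
```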

Why the edit propagates from a single small write. Because every occurrence of a phrase in the body is a reference to a single rule entry, patching that rule entry in place updates every occurrence implicitly — no need to walk the body, no offsets to recalculate. The “single 4-byte write” in the headline number is exactly that: the rule’s stored phrase being overwritten in memory-mapped place, with the reference stream untouched.

The catch: this only works when len(old) == len(new). Different lengths would cascade offset shifts through the rule table and the reference stream, which is why Tier 3 (decompress / recompress) is the correctness-preserving fallback.
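
As a sketch, the Tier 1 patch reduces to one guarded slice assignment through a memory map. The rule_offset argument is hypothetical; in practice it would come from the wire’s rule table:

```python
import mmap

def patch_rule_inplace(path: str, rule_offset: int, old: bytes, new: bytes) -> None:
    assert len(old) == len(new), "Tier 1 requires a length-preserving edit"
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as m:
        end = rule_offset + len(old)
        assert m[rule_offset:end] == old  # verify before writing
        m[rule_offset:end] = new          # the single small write
        m.flush()                         # reference stream never touched
```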

From the prior demo’s alice.txt.bindu:

Terminal window
# Verify the file currently contains "Alice"
bindu count "Alice" alice.txt.bindu
# → N occurrences
# Rename Alice to Grace everywhere — same length (5 bytes), so this hits Tier 1
bindu replace alice.txt.bindu "Alice" "Grace"
# Verify
bindu count "Alice" alice.txt.bindu
# → 0 occurrences
bindu count "George" alice.txt.bindu
# → N occurrences
# Decompress and confirm
bindu decompress alice.txt.bindu --output alice-edited.txt
grep -c "Alice" alice-edited.txt
grep -c "George" alice-edited.txt

The edit ran without ever materializing the decompressed text. The grep at the end is just there to verify the round-trip.

The same property makes cross-file analytics tractable on large compressed corpora. You can ask questions across many files without unpacking any of them. A canonical example from the satellite domain:

“Find every time a left turn was taken at a 30-degree angle.”

against a fleet of compressed telemetry files runs as a coordinate-space scan, not a decompression pipeline. You only pay decompression cost when you want to materialize the matching records.
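
In CLI terms, that kind of fleet-wide question is just bindu count in a loop. The directory layout and pattern below are illustrative:

Terminal window
# Count matches across every compressed telemetry file,
# without decompressing any of them
for f in telemetry/*.bindu; do
  echo "$f: $(bindu count "LEFT_TURN_30DEG" "$f")"
done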

Three independent line items get cheaper at once:

  1. Storage — the bytes you keep are smaller.
  2. Network — the bytes you transmit are smaller.
  3. Compute — the bytes you read at query time are smaller, because the query runs against the compressed form.

For workloads where Bindu shines, compression isn’t a one-shot win at write time — it pays out continuously every time the data is read.