Skip to content

Edit and search in depth

This page continues from Operating on compressed data. The walkthrough showed the core behavior with alice.txt.bindu; this page covers the command details and the cases where Bindu can operate directly on the compressed form.

Terminal window
bindu replace file.bindu "old" "new"

bindu replace rewrites every occurrence of old with new. The Alice-to-Mabel walkthrough hits the fast path because the replacement has the same byte length and the rule body for Alice can be patched in place. In general, bindu replace picks one of three tiers in priority order:

  1. Grammar rule patch — equal-length, in-place via memory-mapped write. This is the fast path shown by the Alice walkthrough and measured by the ADS-B benchmark on the previous page.
  2. Dictionary entry rewrite — equal-length pattern that matches a DICT alphabet entry; rewrites the entry inline.
  3. Decompress + scan + recompress — the general case for non-equal-length edits or unsupported wire models. Correctness-preserving; same cost as the conventional pipeline.

Length-preserving edits are the compressed-space win. Non-length-preserving edits force tier 3: correct, but the same cost class as the conventional pipeline.

Terminal window
bindu search "pattern" file.bindu
bindu count "pattern" file.bindu
bindu find "pattern" file.bindu -n 10

When you run bindu search (or its count / find siblings), Bindu inspects the compressed file’s header to see which pipeline produced it, then picks a search strategy tailored to that pipeline. For most pipelines the search runs directly against the compressed bytes; for the residual-encoded ones it has to decompress first.

Wire modelBody contentSearch strategyCompressed search?
RAWbody is the raw bytesBoyer-Moore-Horspool over bodyYes
CONSTANTsingle repeated byteO(1) — does the pattern repeat?Yes
DICTalphabet + index streamalphabet-mapped index-stream scanYes
BWT + indexBurrows-Wheeler L-column + sidecarFM-index backward search (Ferragina-Manzini)Yes (with FM-index sidecar)
Grammarrule table inside legacy wirerule-table memcmpYes
RLErun-length residuals against prior bytesdecompress + scanNo
LINDELTAstride-period residuals against fixed offsetdecompress + scanNo
PAETH2D2D Paeth residuals against neighbor pixelsdecompress + scanNo

Five model classes admit zero-decompression search today (RAW, CONSTANT, DICT, BWT-with-index, grammar-rule probe). Three (RLE, LINDELTA, PAETH2D) fall back to decompression-first search because their wire formats encode residuals against a neighbor context that has to be reconstructed before any candidate match can be verified. For measured search results across larger corpora, see Operating benchmarks.

bindu count returns just the occurrence count and is the fastest path. bindu find -n K returns the first K positions and pays the additional O(K · log n) for position recovery on the FM-index path. If you only need the count (e.g., for filtering or aggregation), prefer count.

The same property makes cross-file analytics tractable on large compressed corpora. You can ask questions across many files without unpacking any of them. A canonical example from the satellite domain:

“Find every time a left turn was taken at a 30-degree angle.”

against a fleet of compressed telemetry files runs as a coordinate-space scan, not a decompression pipeline. You only pay decompression cost when you want to materialize the matching records.

Three independent line items get cheaper at once:

  1. Storage — the bytes you keep are smaller.
  2. Network — the bytes you transmit are smaller.
  3. Compute — the bytes you read at query time are smaller, because the query runs against the compressed form.

For workloads where Bindu shines, compression isn’t a one-shot win at write time — it pays out continuously every time the data is read.