
Operating on compressed data

This is the section that diverges most sharply from conventional codecs. With gzip or zstd, the compressed file is a dead artifact: you decompress it before doing anything useful. With Bindu, three operations run directly against the compressed wire: search, count/locate, and edit.

This page walks through what’s available today, with model coverage and measured numbers for each operation.

This demo continues from Compressing data, where you compressed Alice’s Adventures in Wonderland into alice.txt.bindu. The same operations apply to any artifact Bindu produces.

Terminal window
bindu search "pattern" file.bindu
bindu count "pattern" file.bindu
bindu find "pattern" file.bindu -n 10

bindu search (and its count / find variants) auto-detects the wire’s compression model and dispatches a model-native search. The capability table:

| Wire model | Body content | Search strategy | Status |
| --- | --- | --- | --- |
| RAW | body is the raw bytes | Boyer-Moore-Horspool over body | native |
| CONSTANT | single repeated byte | O(1) — does the pattern repeat? | native |
| DICT | alphabet + index stream | alphabet-mapped index-stream scan | native |
| BWT + index | Burrows-Wheeler L-column + sidecar | FM-index backward search (Ferragina-Manzini) | native (opt-in index) |
| Grammar | rule table inside legacy wire | rule-table memcmp | native |
| RLE, LINDELTA, PAETH2D | residual streams that need neighbor context | decompress + scan | fallback |

Five model classes admit zero-decompression search today (RAW, CONSTANT, DICT, BWT-with-index, grammar-rule probe). Three (RLE, LINDELTA, PAETH2D) deliberately fall back because their wire formats encode residuals against a neighbor context that has to be reconstructed before any candidate match can be verified — a correctness-preserving choice, not a missing feature.
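
To make the fast paths concrete, here is a minimal, runnable sketch of the CONSTANT case in Python. The function and its arguments (fill byte, body length) are illustrative stand-ins for fields a real wire header would carry, not Bindu’s internal API; the point is that no body bytes are read at all.

```python
def constant_count(fill_byte: int, body_len: int, pattern: bytes) -> int:
    """O(1) search over a CONSTANT wire: the body is never touched."""
    # A constant body contains the pattern iff the pattern is that same
    # byte repeated; in that case every window position is a match.
    if pattern != bytes([fill_byte]) * len(pattern):
        return 0
    return max(body_len - len(pattern) + 1, 0)

# 1 MB of one byte, 8-byte pattern: 1,048,576 - 8 + 1 = 1,048,569 matches,
# which is the CONSTANT row in the measurement table below.
assert constant_count(0x41, 1 << 20, b"A" * 8) == 1_048_569
```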

All verified with byte-exact round-trip:

| Operation | Corpus | Pattern | Result | Time | Speedup vs fallback |
| --- | --- | --- | --- | --- | --- |
| Grammar rule probe | 4 MB ADS-B JSON | rule match | 453 occurrences | 1 ms | 453× |
| CONSTANT scan | 1 MB constant byte | 8-byte run | 1,048,569 matches | 1 ms | (no decode) |
| FM-index count | 10 MiB BWT index | 9 bytes | 1,280 matches | 5 ms | 170× |
| FM-index locate | 10 MiB BWT index | 9 bytes | 1,280 positions | 93 ms | (positions only via FM) |
| DICT native scan | 30 MiB GOES-Ch01 int16 | 4 bytes | 94,204 matches | 1.04 s | ~4× |

bindu count returns just the occurrence count and is the fastest path. bindu find -n K returns the first K positions and pays the additional O(K · log n) for position recovery on the FM-index path. If you only need the count (e.g., for filtering or aggregation), prefer count.
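
For the BWT path, here is a compact, runnable sketch of Ferragina-Manzini backward search: each pattern symbol narrows a suffix range with two rank lookups, so counting never reconstructs the text. This is simplified for clarity; Bindu’s opt-in sidecar will store rank checkpoints so the rank query is O(1) instead of the linear scan used here.

```python
def fm_count(bwt_bytes: bytes, pattern: bytes) -> int:
    # C[c]: number of symbols in the text strictly smaller than c.
    freq = [0] * 256
    for c in bwt_bytes:
        freq[c] += 1
    C, total = [0] * 256, 0
    for c in range(256):
        C[c], total = total, total + freq[c]

    def occ(c: int, i: int) -> int:
        return bwt_bytes[:i].count(c)  # rank of c in bwt[:i]

    lo, hi = 0, len(bwt_bytes)         # current suffix-array range
    for c in reversed(pattern):        # extend the match right-to-left
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0                   # empty range: no occurrences
    return hi - lo                     # range width = occurrence count

def bwt(text: bytes) -> bytes:
    t = text + b"\x00"                 # sentinel below all text bytes
    return bytes(r[-1] for r in sorted(t[i:] + t[:i] for i in range(len(t))))

data = b"abracadabra abracadabra"
assert fm_count(bwt(data), b"abra") == 4
```

bindu find -n K then has to map each entry of the final lo..hi range back to a text position via a sampled suffix array, which is where the extra O(K · log n) comes from; hence count being the cheaper call.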

Terminal window
bindu replace file.bindu "old" "new"

bindu replace rewrites every occurrence of old with new. When the pattern equals an extracted grammar rule and len(old) == len(new), the replacement patches the rule body once and every downstream occurrence updates implicitly — no decompression, no recompression, no file rewrite.

Three tiers in priority order:

  1. Grammar rule patch — equal-length, in-place via memory-mapped write. Measured at 3 ms on 4 MB ADS-B, against 1.33 s for the zstd -d → sed → zstd -19 baseline (a 443× speedup). Zero scratch I/O.
  2. Dictionary entry rewrite — equal-length pattern that matches a DICT alphabet entry; rewrites the entry inline.
  3. Decompress + scan + recompress — the general case for non-equal-length edits or unsupported wire models. Correctness-preserving; same cost as the conventional pipeline.

Length-preserving edits are the common case for fielded data corrections (numeric value updates, coordinate fixes, ID rewrites). Non-length-preserving edits force tier 3, because changing an underlying rule body’s length cascades through every offset that follows it — see Honest limits → In-place edit constraint.
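
The selection logic itself is small. Here is a sketch assuming simplified in-memory stand-ins for the wire’s rule table and DICT alphabet; choose_tier and its arguments are hypothetical names, not Bindu’s API:

```python
def choose_tier(model: str, rules: list[bytes], alphabet: list[bytes],
                old: bytes, new: bytes) -> int:
    same_len = len(old) == len(new)
    if same_len and model == "grammar" and old in rules:
        return 1  # patch the rule body once; reference stream untouched
    if same_len and model == "dict" and old in alphabet:
        return 2  # rewrite the alphabet entry inline
    return 3      # decompress + scan + recompress (general case)

assert choose_tier("grammar", [b"GLOBAL_NAV_OK"], [],
                   b"GLOBAL_NAV_OK", b"LOCAL__NAV_OK") == 1
assert choose_tier("grammar", [b"GLOBAL_NAV_OK"], [],
                   b"GLOBAL_NAV_OK", b"NAV_OK") == 3  # length changed
```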

Both the 1 ms search and the 3 ms in-place edit come from the same underlying structure: Bindu’s grammar pipeline extracts repeated phrases into a rule table and rewrites the body as a stream of references into that table.

So a 4 MB ADS-B archive compressed via the grammar pipeline contains:

  • A rule table (typically a few KB) with each unique repeated phrase stored exactly once.
  • A reference stream — most of the body — made of index numbers that point into the rule table, plus residuals where needed.

Why the search hits 1 ms. A query like bindu count "GLOBAL_NAV_OK" runs in two cheap phases. First, a memcmp probe over the rule table answers “does any rule’s body match the query?” — a KB-scale scan, microseconds. Second, if rule index N matched, count how many times N appears in the reference stream. That’s metadata about the rule, not a scan over reconstructed body content. The decompress-and-scan fallback at 453 ms is doing roughly 450× more work — fully reconstructing the 4 MB body, then running Boyer-Moore over it.
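
Both phases fit in a few lines. This toy model assumes the body is a flat list of rule references, with RESIDUAL standing in for literal runs; the real reference stream is more involved, and patterns that span rule boundaries would not be caught by the probe alone:

```python
RESIDUAL = -1  # stand-in marker for literal (non-rule) content

def grammar_count(rules: list[bytes], refs: list[int], pattern: bytes) -> int:
    # Phase 1: memcmp probe over the rule table (KB-scale, microseconds).
    hits = {i for i, body in enumerate(rules) if body == pattern}
    if not hits:
        return 0
    # Phase 2: count references to the matching rule. This is metadata
    # about the rule, not a scan over reconstructed body content.
    return sum(1 for r in refs if r in hits)

rules = [b"GLOBAL_NAV_OK", b"ALT_HOLD", b"GS_CAPTURE"]
refs = [0, 2, 0, RESIDUAL, 1, 0, 2]
assert grammar_count(rules, refs, b"GLOBAL_NAV_OK") == 3
```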

Why the edit propagates from a single small write. Because every occurrence of a phrase in the body is a reference to a single rule entry, patching that rule entry in place updates every occurrence implicitly — no need to walk the body, no offsets to recalculate. The “single 4-byte write” in the headline number is exactly that: the rule’s stored phrase being overwritten in memory-mapped place, with the reference stream untouched.

The catch: this only works when len(old) == len(new). Different lengths would cascade offset shifts through the rule table and the reference stream, which is why Tier 3 (decompress / recompress) is the correctness-preserving fallback.
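
As a sketch, the Tier 1 patch reduces to one guarded slice assignment through a memory map. The rule_offset argument is hypothetical; in practice it would come from the wire’s rule table:

```python
import mmap

def patch_rule_inplace(path: str, rule_offset: int, old: bytes, new: bytes) -> None:
    assert len(old) == len(new), "Tier 1 requires a length-preserving edit"
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as m:
        end = rule_offset + len(old)
        assert m[rule_offset:end] == old  # verify before writing
        m[rule_offset:end] = new          # the single small write
        m.flush()                         # reference stream never touched
```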

From the prior demo’s alice.txt.bindu:

Terminal window
# Verify the file currently contains "Alice"
bindu count "Alice" alice.txt.bindu
# → N occurrences
# Rename Alice to Grace everywhere — same length (5 bytes), so this hits Tier 1
bindu replace alice.txt.bindu "Alice" "Grace"
# Verify
bindu count "Alice" alice.txt.bindu
# → 0 occurrences
bindu count "George" alice.txt.bindu
# → N occurrences
# Decompress and confirm
bindu decompress alice.txt.bindu --output alice-edited.txt
grep -c "Alice" alice-edited.txt
grep -c "George" alice-edited.txt

The edit ran without ever materializing the decompressed text. The grep at the end is just there to verify the round-trip.

The same property makes cross-file analytics tractable on large compressed corpora. You can ask questions across many files without unpacking any of them. A canonical example from the satellite domain:

“Find every time a left turn was taken at a 30-degree angle.”

against a fleet of compressed telemetry files runs as a coordinate-space scan, not a decompression pipeline. You only pay decompression cost when you want to materialize the matching records.
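
In CLI terms, that kind of fleet-wide question is just bindu count in a loop. The directory layout and pattern below are illustrative:

Terminal window
# Count matches across every compressed telemetry file,
# without decompressing any of them
for f in telemetry/*.bindu; do
  echo "$f: $(bindu count "LEFT_TURN_30DEG" "$f")"
done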

Three independent line items get cheaper at once:

  1. Storage — the bytes you keep are smaller.
  2. Network — the bytes you transmit are smaller.
  3. Compute — the bytes you read at query time are smaller, because the query runs against the compressed form.

For workloads where Bindu shines, compression isn’t a one-shot win at write time — it pays out continuously every time the data is read.