Operating on compressed data
This is the section that diverges most sharply from conventional codecs. With gzip or zstd, the compressed file is a dead artifact: you decompress it before doing anything useful. With Bindu, three operations run directly against the compressed wire: search, count/locate, and edit.
This page walks through what’s available today, with the model coverage and measured numbers.
Continuing from the previous page
This demo continues from Compressing data, where you compressed Alice’s Adventures in Wonderland into `alice.txt.bindu`. The same operations apply to any artifact Bindu produces.
Search
```shell
bindu search "pattern" file.bindu
bindu count "pattern" file.bindu
bindu find "pattern" file.bindu -n 10
```

`bindu search` (and its `count` / `find` variants) auto-detects the wire’s compression model and dispatches a model-native search. The capability table:
| Wire model | Body content | Search strategy | Status |
|---|---|---|---|
| RAW | body is the raw bytes | Boyer-Moore-Horspool over body | native |
| CONSTANT | single repeated byte | O(1) — does the pattern repeat? | native |
| DICT | alphabet + index stream | alphabet-mapped index-stream scan | native |
| BWT + index | Burrows-Wheeler L-column + sidecar | FM-index backward search (Ferragina-Manzini) | native (opt-in index) |
| Grammar | rule table inside legacy wire | rule-table memcmp | native |
| RLE, LINDELTA, PAETH2D | residual streams that need neighbor context | decompress + scan | fallback |
Five model classes admit zero-decompression search today (RAW, CONSTANT, DICT, BWT-with-index, grammar-rule probe). Three (RLE, LINDELTA, PAETH2D) deliberately fall back because their wire formats encode residuals against a neighbor context that has to be reconstructed before any candidate match can be verified — a correctness-preserving choice, not a missing feature.
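The BWT-with-index row is the most involved of the native strategies. As a rough illustration of Ferragina-Manzini backward search, here is a toy-scale sketch that counts pattern occurrences using only the BWT string. The function names and the naive linear rank scan are illustrative, not Bindu's sidecar format; a real index replaces `occ` with a constant-time rank structure.

```python
from collections import Counter

def bwt_via_rotations(text):
    """Naive BWT: sort all rotations of text + sentinel, take the last column."""
    text += "\x00"  # sentinel, lexicographically smallest
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def fm_count(bwt, pattern):
    """Backward search: count pattern occurrences without reconstructing the text."""
    counts = Counter(bwt)
    C, total = {}, 0
    for c in sorted(counts):        # C[c] = number of symbols < c in the text
        C[c] = total
        total += counts[c]
    def occ(c, i):                  # rank query; naive here, O(1) in a real FM-index
        return bwt[:i].count(c)
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):     # extend the match one symbol at a time, right to left
        if c not in C:
            return 0
        lo, hi = C[c] + occ(c, lo), C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo                  # width of the final suffix-array interval
```

Each step of the loop touches only the small `C` table and two rank queries, which is why counting never has to materialize the original bytes.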
Measured results
All verified with byte-exact round-trip:
| Operation | Corpus | Pattern | Result | Time | Speedup vs fallback |
|---|---|---|---|---|---|
| Grammar rule probe | 4 MB ADS-B JSON | rule match | 453 occurrences | 1 ms | 453× |
| CONSTANT scan | 1 MB constant byte | 8-byte run | 1,048,569 matches | 1 ms | (no decode) |
| FM-index count | 10 MiB BWT index | 9 bytes | 1,280 matches | 5 ms | 170× |
| FM-index locate | 10 MiB BWT index | 9 bytes | 1,280 positions | 93 ms | (positions only via FM) |
| DICT native scan | 30 MiB GOES-Ch01 int16 | 4 bytes | 94,204 matches | 1.04 s | ~4× |
Count vs locate
`bindu count` returns just the occurrence count and is the fastest path. `bindu find -n K` returns the first K positions and pays an additional O(K · log n) for position recovery on the FM-index path. If you only need the count (e.g., for filtering or aggregation), prefer `count`.
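The asymmetry between counting and locating can be sketched with a sampled suffix array: backward search alone yields the count, and each reported position pays extra LF-mapping steps back to the nearest sampled row. This is a toy-scale illustration under assumed structures, not Bindu's actual index layout:

```python
from collections import Counter

def fm_locate(text, pattern, sample_rate=4):
    """Recover match positions from a BWT plus a sampled suffix array."""
    t = text + "\x00"
    sa = sorted(range(len(t)), key=lambda i: t[i:])      # suffix array (naive build)
    bwt = "".join(t[(i - 1) % len(t)] for i in sa)       # BWT read off the SA
    counts = Counter(bwt)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    occ = lambda c, i: bwt[:i].count(c)
    samples = {r: sa[r] for r in range(len(sa)) if sa[r] % sample_rate == 0}
    # Phase 1: backward search narrows [lo, hi) -- this alone answers `count`.
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return []
        lo, hi = C[c] + occ(c, lo), C[c] + occ(c, hi)
        if lo >= hi:
            return []
    # Phase 2: each of the K matches pays extra LF-steps to a sampled row.
    positions = []
    for row in range(lo, hi):
        steps = 0
        while row not in samples:
            c = bwt[row]
            row = C[c] + occ(c, row)   # LF-mapping: step one position left in the text
            steps += 1
        positions.append(samples[row] + steps)
    return sorted(positions)
```

Phase 2 is the cost `count` never pays, which matches the measured gap between the 5 ms FM-index count and the 93 ms locate on the same corpus.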
```shell
bindu replace file.bindu "old" "new"
```

`bindu replace` rewrites every occurrence of `old` with `new`. When the pattern equals an extracted grammar rule and `len(old) == len(new)`, the replacement patches the rule body once and every downstream occurrence updates implicitly — no decompression, no recompression, no file rewrite.
Three tiers in priority order:
- Grammar rule patch — equal-length, in-place via memory-mapped write. Measured at 3 ms on 4 MB ADS-B, against 1.33 s for the `zstd -d → sed → zstd -19` baseline (a 443× speedup). Zero scratch I/O.
- Dictionary entry rewrite — equal-length pattern that matches a DICT alphabet entry; rewrites the entry inline.
- Decompress + scan + recompress — the general case for non-equal-length edits or unsupported wire models. Correctness-preserving; same cost as the conventional pipeline.
Length-preserving edits are the common case for fielded data corrections (numeric value updates, coordinate fixes, ID rewrites). Non-length-preserving edits force tier 3, because changing an underlying rule body’s length cascades through every offset that follows it — see Honest limits → In-place edit constraint.
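The tier selection above can be sketched as a small dispatch function. The `Wire` container and its field names are hypothetical, purely to make the priority order concrete:

```python
from collections import namedtuple

# Hypothetical parsed wire: a grammar rule table and a DICT alphabet.
# (Field names are illustrative, not Bindu's real container layout.)
Wire = namedtuple("Wire", ["grammar_rules", "dict_alphabet"])

def plan_replace(old: bytes, new: bytes, wire: Wire) -> str:
    """Pick the cheapest replace tier, in the priority order described above."""
    same_len = len(old) == len(new)
    if same_len and old in wire.grammar_rules:
        return "tier1:grammar-rule-patch"     # one in-place write; all occurrences update
    if same_len and old in wire.dict_alphabet:
        return "tier2:dict-entry-rewrite"     # rewrite the alphabet entry inline
    return "tier3:decompress-recompress"      # general case, conventional-pipeline cost
```

Note that any length change, however small, short-circuits straight to tier 3: the two fast tiers are only reachable when nothing after the patched bytes has to move.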
How the grammar pipeline makes both fast
Both the 1 ms search and the 3 ms in-place edit come from the same underlying structure: Bindu’s grammar pipeline extracts repeated phrases into a rule table and rewrites the body as a stream of references into that table.
So a 4 MB ADS-B archive compressed via the grammar pipeline contains:
- A rule table (typically a few KB) with each unique repeated phrase stored exactly once.
- A reference stream — most of the body — made of index numbers that point into the rule table, plus residuals where needed.
Why the search hits 1 ms. A query like `bindu count "GLOBAL_NAV_OK"` runs in two cheap phases. First, a memcmp probe over the rule table answers “does any rule’s body match the query?” — a KB-scale scan, microseconds. Second, if rule index N matched, count how many times N appears in the reference stream. That’s metadata about the rule, not a scan over reconstructed body content. The decompress-and-scan fallback at 453 ms is doing roughly 450× more work — fully reconstructing the 4 MB body, then running Boyer-Moore over it.
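The two phases can be sketched over a toy grammar wire, represented here as a list of rule bodies plus a stream of rule indices. This is an illustration of the technique, not Bindu's actual wire layout:

```python
def grammar_count(rules, ref_stream, pattern):
    """Two-phase count over a toy grammar wire.

    Phase 1: memcmp-style probe over the (KB-scale) rule table.
    Phase 2: count references to the matching rule in the reference stream,
    never reconstructing the body bytes.
    Returns None when no whole rule matches, signalling the caller to fall
    back to decompress-and-scan.
    """
    for idx, body in enumerate(rules):
        if body == pattern:                                   # phase 1: rule-table probe
            return sum(1 for ref in ref_stream if ref == idx)  # phase 2: reference count
    return None
```

Both phases are linear in structures that are orders of magnitude smaller than the decompressed body, which is where the ~450× gap against the fallback comes from.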
Why the edit propagates from a single small write. Because every occurrence of a phrase in the body is a reference to a single rule entry, patching that rule entry in place updates every occurrence implicitly — no need to walk the body, no offsets to recalculate. The “single 4-byte write” in the headline number is exactly that: the rule’s stored phrase being overwritten in place through the memory map, with the reference stream untouched.
The catch: this only works when len(old) == len(new). Different lengths would cascade offset shifts through the rule table and the reference stream, which is why Tier 3 (decompress / recompress) is the correctness-preserving fallback. Length-preserving covers the common cases — numeric corrections, coordinate fixes, ID rewrites where field widths are fixed.
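A minimal sketch of the Tier 1 write itself, assuming the rule body's byte offset is already known from the wire's header (a hypothetical layout), with Python's `mmap` standing in for the memory-mapped patch:

```python
import mmap

def patch_rule_in_place(path, offset, old, new):
    """Overwrite a rule body at a known byte offset via a memory-mapped write.

    Equal lengths only: nothing after the patched span moves, so the
    reference stream and every later offset stay valid.
    """
    if len(old) != len(new):
        raise ValueError("length-preserving edits only; unequal lengths need the decompress path")
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            if mm[offset:offset + len(old)] != old:
                raise ValueError("rule body mismatch at offset")
            mm[offset:offset + len(new)] = new   # the single small write
            mm.flush()
```

The guard that compares the bytes at `offset` before writing is cheap insurance: a stale offset would otherwise corrupt the rule table silently.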
A worked example: rename Alice to Elise
From the prior demo’s `alice.txt.bindu`. Note the replacement must be the same length as the pattern for the fast path: “Elise” is five letters, like “Alice”.

```shell
# Verify the file currently contains "Alice"
bindu count "Alice" alice.txt.bindu
# → N occurrences

# Rename Alice to Elise everywhere — same length, so this hits Tier 1
bindu replace alice.txt.bindu "Alice" "Elise"

# Verify
bindu count "Alice" alice.txt.bindu
# → 0 occurrences
bindu count "Elise" alice.txt.bindu
# → N occurrences

# Decompress and confirm
bindu decompress alice.txt.bindu --output alice-edited.txt
grep -c "Alice" alice-edited.txt
grep -c "Elise" alice-edited.txt
```

The edit ran without ever materializing the decompressed text. The `grep` at the end is just there to verify the round-trip.
Cross-file operations
The same property makes cross-file analytics tractable on large compressed corpora. You can ask questions across many files without unpacking any of them. A canonical example from the satellite domain:
“Find every time a left turn was taken at a 30-degree angle.”
against a fleet of compressed telemetry files runs as a coordinate-space scan, not a decompression pipeline. You only pay decompression cost when you want to materialize the matching records.
Why this matters
Three independent line items get cheaper at once:
- Storage — the bytes you keep are smaller.
- Network — the bytes you transmit are smaller.
- Compute — the bytes you read at query time are smaller, because the query runs against the compressed form.
For workloads where Bindu shines, compression isn’t a one-shot win at write time — it pays out continuously every time the data is read.
What’s next
- Architecture — the cascade that produces the wire formats this page operates against.
- Reference → Query language — full search/edit syntax and supported predicates.
- Honest limits — the cases that force decompress-and-scan fallback.