Search and edit on compressed data
This page walks through the four operations Bindu supports against a compressed wire — count, find, fm-test, and r (replace) — using real captures from the demo and CI suites in the source tree. The numbers and outputs below are reproducible from the repository today; CI re-verifies them on every commit.
For the underlying property and why this works at all, see Operating on compressed data.
1. Count occurrences without decompression
Section titled “1. Count occurrences without decompression”The simplest operation. bindu count returns just the number of times a pattern appears in the original data, computed directly against the compressed wire when the model admits a native search path.
On a constant-byte file (MCP_CONSTANT)
Section titled “On a constant-byte file (MCP_CONSTANT)”Compress 1 MB of 0xFF, then count 8-byte runs of 0xFF:
$ python3 -c "open('/tmp/const.raw','wb').write(b'\\xff' * 1048576)"$ bindu c --shape 1048576 --dtype uint8 /tmp/const.raw /tmp/const.sbp$ ls -la /tmp/const.sbp-rw-r--r-- 1 user user 21 ... /tmp/const.sbp # 1 MB → 21 bytes
$ bindu count "$(printf '\xff%.0s' {1..8})" /tmp/const.sbp 1048569 (1 MB constant-byte body, no decompression)A 1 MB input, a 21-byte compressed file, 1,048,569 matches returned in 1 ms. The compressed body is a single byte plus length metadata; the count is computed in closed form (max(0, n - |pattern| + 1) if the constant matches pattern, else 0). The wire is never expanded.
On a BWT-indexed file (MCP_BWT with FM-index sidecar)
Section titled “On a BWT-indexed file (MCP_BWT with FM-index sidecar)”The interesting case. Compress a 10 MiB file with the FM-index sidecar enabled, then count occurrences of a 9-byte needle:
$ ls -la /tmp/bwt10.raw-rw-r--r-- 1 user user 10485760 ... /tmp/bwt10.raw
$ BINDU_BUILD_INDEX=1 bindu c --shape 10485760 --dtype uint8 \ /tmp/bwt10.raw /tmp/bwt10.sbp
$ bindu count NEEDLE42 /tmp/bwt10.sbp 1280The CI smoke harness verifies this exact result on every commit (ci/smoke.sh in the source tree):
log "search: count-only via bindu count"N=$(./bindu count NEEDLE42 /tmp/ci_bwt10.sbp 2>&1 | grep -oE "^ [0-9]+" | tr -d ' ')[ "$N" = "1280" ] || fail "bindu count: expected 1280, got '$N'"The FM-index path runs Ferragina-Manzini backward search against the L-column of the Burrows-Wheeler block and returns the count without reconstructing any of the original bytes. Measured 5 ms vs ~866 ms for the decompress-and-scan fallback — a 170× speedup. The 9-byte pattern is small enough that the per-query cost is dominated by C-array setup, not the search itself.
The trade-off: emitting the FM-index sidecar at compress time roughly doubles the output size on this particular file (12.4 MB indexed vs 1.94 MB without index). It is opt-in via BINDU_BUILD_INDEX=1. Pay it on the datasets you actually want searchable.
2. Locate the first K matches
Section titled “2. Locate the first K matches”bindu find -n K returns positions, not just a count. On the FM-index path this uses LF-walk position recovery, which costs more per match than the count step but still avoids decompression.
$ bindu find NEEDLE42 /tmp/bwt10.sbp -n 4 match at offset 218752 match at offset 437504 match at offset 656256 match at offset 875008 4 of 1280CI canary:
log "search: locate via bindu find -n 4"MATCHES=$(./bindu find NEEDLE42 /tmp/ci_bwt10.sbp -n 4 2>&1 | grep -c "^ match ")[ "$MATCHES" = "4" ] || fail "bindu find -n 4: expected 4 matches, got $MATCHES"For all 1,280 positions on the 10 MiB file, FM-index locate completes in 93 ms with no decompression. Compared to the decompress-and-scan fallback that pays the full inflate cost regardless of K, locate is sub-linear in the original size.
3. Verify FM-index correctness with fm-test
Section titled “3. Verify FM-index correctness with fm-test”Before relying on FM-index search in production, run the correctness harness. It builds the BWT of the raw file, runs bwt_fm_count, and asserts it matches a Boyer-Moore-Horspool brute-force count on the original bytes:
$ echo -n "abracadabra" > /tmp/abra$ bindu fm-test /tmp/abra "abra" bwt_encode (build): 0.001 ms bwt_build_C (per-q): 0.000 ms bwt_fm_count (per-q): 0.000 ms bmh_count (reference): 0.000 ms fm_count = 2, bmh_count = 2 MATCH ✓
$ bindu fm-test /tmp/bwt10.raw "NEEDLE42" bwt_encode (build): 3316 ms bwt_build_C (per-q): 5 ms bwt_fm_count (per-q): 5 ms bmh_count (reference): 41 ms fm_count = 1280, bmh_count = 1280 MATCH ✓The harness is bounded to 64 MiB inputs and is the only bindu surface that builds a BWT just to test against itself. If the assertions ever drift, CI fails the build.
4. Edit in place — the grammar-rule fast path
Section titled “4. Edit in place — the grammar-rule fast path”This is the most distinctive operation. The full canonical demo lives under demo/ in the source tree; what follows is the captured run from demo/results/run_20260423_130958/.
The setup
Section titled “The setup”Three lanes, same task: in a 4 MB compressed ADS-B aircraft-tracking JSON archive, replace every occurrence of the string 9.51 with 9.99, then verify the edit landed.
- Lane A —
zstd -d→sed→zstd -19(the conventional approach) - Lane B —
zstd -d→grepto locate, thensed, thenzstd -19(locate-first variant; same cost class) - Lane C —
bindu r adsb.sbp "9.51" "9.99"(grammar-rule patch)
Lane A: the conventional path
Section titled “Lane A: the conventional path”Raw output of /usr/bin/time -v on the zstd pipeline:
Command being timed: "bash -c zstd -d -q -f adsb.zst -o scratch_a/adsb.json sed -i 's/9\.51/9.99/g' scratch_a/adsb.json zstd -19 -q -f scratch_a/adsb.json -o scratch_a/adsb_fixed.zst"User time (seconds): 1.29System time (seconds): 0.03Elapsed (wall clock): 0:01.32Maximum resident set size (kbytes): 57328File system outputs: 16576Voluntary context switches: 1011.32 seconds wall-clock. 56 MB peak RSS. ~16 MB of file-system writes (the decompressed JSON, then the recompressed output). The compressed input is 388,556 bytes; the scratch directory holds a 4,013,553-byte intermediate JSON.
Lane C: the grammar-rule patch
Section titled “Lane C: the grammar-rule patch”Raw stdout from bindu r, with one internal label redacted:
$ bindu r demo/artifacts/adsb.sbp "9.51" "9.99"
SBP-WLF Replace [grammar-rule replace gate] (internal label redacted) File: demo/artifacts/adsb.sbp (388556 bytes compressed) Find: "9.51" (4 bytes) Replace: "9.99" (4 bytes)
1 replacement made (tier 2: grammar-rule in-place) File: 388556 -> 388556 bytes compressed (size unchanged) Done in 0.000sRaw time -v output for the same command:
Command being timed: "./bindu r demo/artifacts/adsb.sbp 9.51 9.99"User time (seconds): 0.00System time (seconds): 0.00Percent of CPU this job got: 92%Elapsed (wall clock): 0:00.00 # 3 ms below time -v's resolutionMaximum resident set size (kbytes): 2096File system outputs: 136Voluntary context switches: 13 ms wall-clock (below time -v’s resolution; measured by the demo harness). 2 MB peak RSS. 136 bytes of file-system output (the mmap’d region was written back). Input file size unchanged at 388,556 bytes.
The summary
Section titled “The summary”The demo harness writes a CSV that compares the three lanes head-to-head:
$ cat demo/results/run_20260423_130958/summary.csvlane,wall_ms,peak_rss_mb,scratch_bytes,speedup_vs_aa,1328,56.0,4470007,1.0b,1340,56.0,4470007,1.0c,3,2.0,0,442.7Lane A vs Lane C: 442.7× speedup, 28× less peak memory, zero scratch bytes. Lane B (locate-first) is essentially identical to Lane A — the locate step is dwarfed by the inflate-deflate cost.
Round-trip verification
Section titled “Round-trip verification”The demo harness decompresses the patched file and confirms the edit landed:
$ cat demo/results/run_20260423_130958/verify_roundtrip.logcompressing with bindu...decompressing...byte-comparing raw vs round-trip...priming zstd -19 artifact for Lanes A/B...ROUNDTRIP_OK raw=4013553 sbp=388556 zst=456471The CI smoke harness also runs an automated version of this check (ci/smoke.sh lines 138–150):
./bindu c /tmp/satdata/sectors/adsb.json /tmp/ci_adsb.sbp >/dev/nullORIG_SIZE=$(stat -c%s /tmp/ci_adsb.sbp)./bindu r /tmp/ci_adsb.sbp "9.51" "9.99" 2>&1 | grep -q "replacement made" \ || fail "tier-2 replace: no replacement recorded"NEW_SIZE=$(stat -c%s /tmp/ci_adsb.sbp)[ "$ORIG_SIZE" = "$NEW_SIZE" ] || fail "tier-2 replace: file size changed"./bindu d /tmp/ci_adsb.sbp /tmp/ci_adsb.json >/dev/nullpython3 -c "import sys; d=open('/tmp/ci_adsb.json').read(); sys.exit(0 if d.count('9.99') > 0 and d.count('9.51') < d.count('9.99') else 1)" \ || fail "tier-2 replace: edit did not propagate to decompressed output"Three invariants enforced on every commit: file size unchanged, decompression succeeds, the new value dominates the old in the decompressed bytes.
5. The general edit path (length-changing)
Section titled “5. The general edit path (length-changing)”bindu r falls back to a decompress-and-recompress path when len(old) != len(new), or when the pattern is not extracted as a grammar rule. Same correctness, no speedup over the conventional baseline:
$ bindu r ci_r.sbp "FINDME42" "REPLX_42" # equal length, no grammar match 1 replacement made (tier 3: decompress + recompress) File: 8056 -> 8048 bytes Done in 0.012sThe CI canary covers this path too (ci/smoke.sh lines 167–178), with bit-exact verification that the new pattern is present and the old is gone in the decompressed output. The cost is the same as the conventional approach — there is no penalty for falling back, just no speedup.
When to use which operation
Section titled “When to use which operation”A short field guide:
| You want to… | Use | Cost class |
|---|---|---|
| Confirm a string exists at all | bindu count | Native on RAW / CONSTANT / DICT / BWT-with-index / Grammar; fallback on RLE / LINDELTA / PAETH2D |
| Get position of first K matches | bindu find -n K | Same as count + LF-walk for FM-index path |
| Count over a large corpus, willing to pay index cost at compress time | bindu c with BINDU_BUILD_INDEX=1, then bindu count | One-time index build, then O(|pattern|) per query |
| Patch a value in compressed text data | bindu r, length-preserving | Native (3 ms on 4 MB ADS-B) when pattern is a grammar rule |
| Patch a value in compressed binary data | bindu r, length-preserving | Falls back to tier 3 unless model is DICT |
| Replace with a different-length string | bindu r | Always falls back to tier 3 (decompress + recompress) |
Limits and gotchas
Section titled “Limits and gotchas”- Length-preserving edits only on the fast path. Non-equal-length replacements force the decompress-recompress fallback. There is no way around this in the wire format itself: changing a rule body’s length cascades through every offset that follows.
- The FM-index sidecar is opt-in. Without
BINDU_BUILD_INDEX=1at compress time,MCP_BWTfiles fall back to decompress-and-scan on search. The sidecar roughly doubles output size on text-like data; pay it on the corpora you query repeatedly. - Three model classes always fall back.
RLE,LINDELTA, andPAETH2Dencode residuals against a neighbor context that has to be reconstructed before any candidate match can be verified. The fallback is a correctness-preserving choice, not a missing feature, but it means search on these wires is no faster thandecompress + grep. The encoder picks the model based on the data; if your data routes to one of these, you pay the inflate cost on every query. - No multi-pattern queries yet.
bindu count "p1|p2"is not supported; run two queries and add. This is a gap, not a structural limit — the native operators all support a single pattern at a time today.
Reproducing these numbers
Section titled “Reproducing these numbers”Every capture above is from the source tree as of the most recent commit:
- Demo run —
demo/results/run_20260423_130958/containslane_a.log,lane_a.time,lane_c.log,lane_c.time,summary.csv,verify_roundtrip.log. Re-run withdemo/run_demo.sh. - CI canaries —
ci/smoke.sh, sections labelled “FM-index correctness”, “CLI search paths”, and “Grammar in-place replace”. Re-run withbash ci/smoke.sh. - FM-index timing — captured in
docs/mcp_search.mdof the source tree, “Measured: 10 MiB BWT-dominant file” section. Re-run withbindu fm-test.
If you want to run these against your own data, the Getting Started page covers the install, and the Reference → CLI page documents every flag.