Search and edit on compressed data

This page walks through the four operations Bindu supports against a compressed wire — count, find, fm-test, and r (replace) — using real captures from the demo and CI suites in the source tree. The numbers and outputs below are reproducible from the repository today; CI re-verifies them on every commit.

For the underlying property and why this works at all, see Operating on compressed data.

1. Count occurrences without decompression

The simplest operation. bindu count returns just the number of times a pattern appears in the original data, computed directly against the compressed wire when the model admits a native search path.

Compress 1 MB of 0xFF, then count 8-byte runs of 0xFF:

Terminal window
$ python3 -c "open('/tmp/const.raw','wb').write(b'\\xff' * 1048576)"
$ bindu c --shape 1048576 --dtype uint8 /tmp/const.raw /tmp/const.sbp
$ ls -la /tmp/const.sbp
-rw-r--r-- 1 user user 21 ... /tmp/const.sbp # 1 MB → 21 bytes
$ bindu count "$(printf '\xff%.0s' {1..8})" /tmp/const.sbp
1048569
(1 MB constant-byte body, no decompression)

A 1 MB input, a 21-byte compressed file, 1,048,569 matches returned in 1 ms. The compressed body is a single byte plus length metadata; the count is computed in closed form: max(0, n - |pattern| + 1) if every pattern byte equals the constant, else 0. The wire is never expanded.
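That closed form is small enough to sketch. A minimal Python version (hypothetical helper name; bindu's actual implementation is in C):

```python
def constant_count(byte: int, n: int, pattern: bytes) -> int:
    """Count matches of `pattern` in n repetitions of `byte`, in O(|pattern|).

    Sliding a window over a constant run: every window matches iff the
    pattern itself is all `byte`, giving max(0, n - |pattern| + 1) hits.
    """
    if any(b != byte for b in pattern):
        return 0
    return max(0, n - len(pattern) + 1)

# 1 MiB of 0xFF, 8-byte needle of 0xFF -> matches the captured output
print(constant_count(0xFF, 1048576, b"\xff" * 8))  # 1048569
```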

On a BWT-indexed file (MCP_BWT with FM-index sidecar)

The interesting case. Compress a 10 MiB file with the FM-index sidecar enabled, then count occurrences of a 9-byte needle:

Terminal window
$ ls -la /tmp/bwt10.raw
-rw-r--r-- 1 user user 10485760 ... /tmp/bwt10.raw
$ BINDU_BUILD_INDEX=1 bindu c --shape 10485760 --dtype uint8 \
/tmp/bwt10.raw /tmp/bwt10.sbp
$ bindu count NEEDLE42 /tmp/bwt10.sbp
1280

The CI smoke harness verifies this exact result on every commit (ci/smoke.sh in the source tree):

log "search: count-only via bindu count"
N=$(./bindu count NEEDLE42 /tmp/ci_bwt10.sbp 2>&1 | grep -oE "^ [0-9]+" | tr -d ' ')
[ "$N" = "1280" ] || fail "bindu count: expected 1280, got '$N'"

The FM-index path runs Ferragina-Manzini backward search against the L-column of the Burrows-Wheeler block and returns the count without reconstructing any of the original bytes. Measured 5 ms vs ~866 ms for the decompress-and-scan fallback — a 170× speedup. The 9-byte pattern is small enough that the per-query cost is dominated by C-array setup, not the search itself.
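Ferragina-Manzini backward search is compact enough to sketch. This toy version builds the BWT by sorting rotations and uses a naive rank; the real index replaces both with precomputed structures, but the search loop has the same shape:

```python
from collections import Counter

def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (toy; O(n^2 log n))."""
    s = text + "\x00"                      # unique smallest sentinel
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rots)

def fm_count(L: str, pattern: str) -> int:
    """Backward search over the L-column; returns the occurrence count."""
    cnt = Counter(L)
    C, total = {}, 0                       # C[c] = #chars in text < c
    for c in sorted(cnt):
        C[c], total = total, total + cnt[c]
    occ = lambda c, i: L[:i].count(c)      # naive rank; real indexes checkpoint
    lo, hi = 0, len(L)
    for c in reversed(pattern):            # refine [lo, hi) right-to-left
        if c not in C:
            return 0
        lo, hi = C[c] + occ(c, lo), C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(fm_count(bwt("abracadabra"), "abra"))  # 2
```

The per-query cost is |pattern| rank operations, which is why the 9-byte needle above is dominated by setup rather than the search itself.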

The trade-off: emitting the FM-index sidecar at compress time inflates the output substantially on this particular file (12.4 MB indexed vs 1.94 MB without, larger than the 10 MiB original). It is opt-in via BINDU_BUILD_INDEX=1. Pay it on the datasets you actually want searchable.

2. Locate positions with bindu find

bindu find -n K returns positions, not just a count. On the FM-index path this uses LF-walk position recovery, which costs more per match than the count step but still avoids decompression.

Terminal window
$ bindu find NEEDLE42 /tmp/bwt10.sbp -n 4
match at offset 218752
match at offset 437504
match at offset 656256
match at offset 875008
4 of 1280

CI canary:

log "search: locate via bindu find -n 4"
MATCHES=$(./bindu find NEEDLE42 /tmp/ci_bwt10.sbp -n 4 2>&1 | grep -c "^ match ")
[ "$MATCHES" = "4" ] || fail "bindu find -n 4: expected 4 matches, got $MATCHES"

For all 1,280 positions on the 10 MiB file, FM-index locate completes in 93 ms with no decompression. Compared to the decompress-and-scan fallback that pays the full inflate cost regardless of K, locate is sub-linear in the original size.
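Position recovery can be sketched on top of the same toy machinery. A production FM-index keeps only sampled suffix-array entries and LF-walks from a match row to the nearest sample; the toy below keeps the full suffix array, which reduces locate to a slice of the backward-search interval:

```python
from collections import Counter

def fm_locate(text: str, pattern: str) -> list[int]:
    """Backward search, then read match positions out of the suffix array."""
    s = text + "\x00"
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # naive SA build
    L = "".join(s[i - 1] for i in sa)                 # BWT from the SA
    cnt = Counter(L)
    C, total = {}, 0
    for c in sorted(cnt):
        C[c], total = total, total + cnt[c]
    lo, hi = 0, len(L)
    for c in reversed(pattern):
        if c not in C:
            return []
        lo, hi = C[c] + L[:lo].count(c), C[c] + L[:hi].count(c)
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))       # one SA read per match

print(fm_locate("abracadabra", "abra"))  # [0, 7]
```

The per-match SA read is what makes locate cost scale with the number of reported positions rather than the original size.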

3. Verify FM-index correctness with fm-test

Before relying on FM-index search in production, run the correctness harness. It builds the BWT of the raw file, runs bwt_fm_count, and asserts it matches a Boyer-Moore-Horspool brute-force count on the original bytes:

Terminal window
$ echo -n "abracadabra" > /tmp/abra
$ bindu fm-test /tmp/abra "abra"
bwt_encode (build): 0.001 ms
bwt_build_C (per-q): 0.000 ms
bwt_fm_count (per-q): 0.000 ms
bmh_count (reference): 0.000 ms
fm_count = 2, bmh_count = 2
MATCH
$ bindu fm-test /tmp/bwt10.raw "NEEDLE42"
bwt_encode (build): 3316 ms
bwt_build_C (per-q): 5 ms
bwt_fm_count (per-q): 5 ms
bmh_count (reference): 41 ms
fm_count = 1280, bmh_count = 1280
MATCH

The harness is bounded to 64 MiB inputs and is the only bindu surface that builds a BWT just to test against itself. If the assertions ever drift, CI fails the build.
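The Boyer-Moore-Horspool reference side of the harness is a standard textbook routine. A sketch (not bindu's actual C implementation):

```python
def bmh_count(text: bytes, pattern: bytes) -> int:
    """Count (possibly overlapping) occurrences with a Horspool skip table."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return 0
    # Shift by m for bytes absent from the pattern, else by the distance
    # from that byte's last occurrence (final position excluded).
    skip = {pattern[i]: m - 1 - i for i in range(m - 1)}
    count, i = 0, 0
    while i <= len(text) - m:
        if text[i:i + m] == pattern:
            count += 1
        i += skip.get(text[i + m - 1], m)
    return count

print(bmh_count(b"abracadabra", b"abra"))  # 2
```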

4. Edit in place — the grammar-rule fast path

This is the most distinctive operation. The full canonical demo lives under demo/ in the source tree; what follows is the captured run from demo/results/run_20260423_130958/.

Three lanes, same task: in a 4 MB ADS-B aircraft-tracking JSON archive stored compressed (388,556 bytes on disk), replace every occurrence of the string 9.51 with 9.99, then verify the edit landed.

  • Lane A: zstd -d, then sed, then zstd -19 (the conventional approach)
  • Lane B: zstd -d, then grep to locate, then sed, then zstd -19 (locate-first variant; same cost class)
  • Lane C: bindu r adsb.sbp "9.51" "9.99" (grammar-rule patch)

Raw output of /usr/bin/time -v on the zstd pipeline:

Command being timed: "bash -c
zstd -d -q -f adsb.zst -o scratch_a/adsb.json
sed -i 's/9\.51/9.99/g' scratch_a/adsb.json
zstd -19 -q -f scratch_a/adsb.json -o scratch_a/adsb_fixed.zst
"
User time (seconds): 1.29
System time (seconds): 0.03
Elapsed (wall clock): 0:01.32
Maximum resident set size (kbytes): 57328
File system outputs: 16576
Voluntary context switches: 101

1.32 seconds wall-clock. 56 MB peak RSS. 16,576 file-system output blocks (roughly 8.5 MB at time -v's 512-byte block granularity: the decompressed JSON, sed's temporary copy, then the recompressed output). The compressed input is 388,556 bytes; the scratch directory holds a 4,013,553-byte intermediate JSON.

Raw stdout from bindu r, with one internal label redacted:

$ bindu r demo/artifacts/adsb.sbp "9.51" "9.99"
SBP-WLF Replace [grammar-rule replace gate] (internal label redacted)
File: demo/artifacts/adsb.sbp (388556 bytes compressed)
Find: "9.51" (4 bytes)
Replace: "9.99" (4 bytes)
1 replacement made (tier 2: grammar-rule in-place)
File: 388556 -> 388556 bytes compressed (size unchanged)
Done in 0.000s

Raw time -v output for the same command:

Command being timed: "./bindu r demo/artifacts/adsb.sbp 9.51 9.99"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 92%
Elapsed (wall clock): 0:00.00 # 3 ms below time -v's resolution
Maximum resident set size (kbytes): 2096
File system outputs: 136
Voluntary context switches: 1

3 ms wall-clock (below time -v’s resolution; measured by the demo harness). 2 MB peak RSS. 136 file-system output blocks (the mmap’d region written back). Input file size unchanged at 388,556 bytes.

The demo harness writes a CSV that compares the three lanes head-to-head:

$ cat demo/results/run_20260423_130958/summary.csv
lane,wall_ms,peak_rss_mb,scratch_bytes,speedup_vs_a
a,1328,56.0,4470007,1.0
b,1340,56.0,4470007,1.0
c,3,2.0,0,442.7

Lane A vs Lane C: 442.7× speedup, 28× less peak memory, zero scratch bytes. Lane B (locate-first) is essentially identical to Lane A — the locate step is dwarfed by the inflate-deflate cost.

The demo harness decompresses the patched file and confirms the edit landed:

$ cat demo/results/run_20260423_130958/verify_roundtrip.log
compressing with bindu...
decompressing...
byte-comparing raw vs round-trip...
priming zstd -19 artifact for Lanes A/B...
ROUNDTRIP_OK raw=4013553 sbp=388556 zst=456471

The CI smoke harness also runs an automated version of this check (ci/smoke.sh lines 138–150):

Terminal window
./bindu c /tmp/satdata/sectors/adsb.json /tmp/ci_adsb.sbp >/dev/null
ORIG_SIZE=$(stat -c%s /tmp/ci_adsb.sbp)
./bindu r /tmp/ci_adsb.sbp "9.51" "9.99" 2>&1 | grep -q "replacement made" \
|| fail "tier-2 replace: no replacement recorded"
NEW_SIZE=$(stat -c%s /tmp/ci_adsb.sbp)
[ "$ORIG_SIZE" = "$NEW_SIZE" ] || fail "tier-2 replace: file size changed"
./bindu d /tmp/ci_adsb.sbp /tmp/ci_adsb.json >/dev/null
python3 -c "import sys; d=open('/tmp/ci_adsb.json').read();
sys.exit(0 if d.count('9.99') > 0 and d.count('9.51') < d.count('9.99') else 1)" \
|| fail "tier-2 replace: edit did not propagate to decompressed output"

Three invariants enforced on every commit: file size unchanged, decompression succeeds, the new value dominates the old in the decompressed bytes.

5. The general edit path (length-changing)

bindu r falls back to a decompress-and-recompress path when len(old) != len(new), or when the pattern is not extracted as a grammar rule. Same correctness, no speedup over the conventional baseline:

Terminal window
$ bindu r ci_r.sbp "FINDME42" "REPLX_42" # equal length, no grammar match
1 replacement made (tier 3: decompress + recompress)
File: 8056 -> 8048 bytes
Done in 0.012s

The CI canary covers this path too (ci/smoke.sh lines 167–178), with bit-exact verification that the new pattern is present and the old is gone in the decompressed output. The cost is the same as the conventional approach — there is no penalty for falling back, just no speedup.
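Conceptually, tier 3 is the same decompress-edit-recompress loop as the zstd baseline, just behind one command. A sketch with zlib standing in for the codec (bindu's actual wire format and model selection are not shown):

```python
import zlib

def tier3_replace(wire: bytes, old: bytes, new: bytes) -> tuple[bytes, int]:
    """Fallback replace: inflate the whole wire, splice, deflate again.
    Cost is linear in the original size no matter how small the edit."""
    raw = zlib.decompress(wire)
    n = raw.count(old)
    return zlib.compress(raw.replace(old, new), 9), n

wire = zlib.compress(b'{"alt": 9.51, "spd": 9.51}' * 100, 9)
patched, n = tier3_replace(wire, b"9.51", b"9.99")
print(n, b"9.51" in zlib.decompress(patched))  # 200 False
```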

A short field guide:

  • Confirm a string exists at all: bindu count. Native on RAW / CONSTANT / DICT / BWT-with-index / Grammar; fallback on RLE / LINDELTA / PAETH2D.
  • Get positions of the first K matches: bindu find -n K. Same cost as count, plus an LF-walk on the FM-index path.
  • Count over a large corpus, willing to pay index cost at compress time: bindu c with BINDU_BUILD_INDEX=1, then bindu count. One-time index build, then O(|pattern|) per query.
  • Patch a value in compressed text data: bindu r, length-preserving. Native (3 ms on 4 MB ADS-B) when the pattern is a grammar rule.
  • Patch a value in compressed binary data: bindu r, length-preserving. Falls back to tier 3 unless the model is DICT.
  • Replace with a different-length string: bindu r. Always falls back to tier 3 (decompress + recompress).
  • Length-preserving edits only on the fast path. Non-equal-length replacements force the decompress-recompress fallback. There is no way around this in the wire format itself: changing a rule body’s length cascades through every offset that follows.
  • The FM-index sidecar is opt-in. Without BINDU_BUILD_INDEX=1 at compress time, MCP_BWT files fall back to decompress-and-scan on search. The sidecar can inflate output size severalfold on text-like data (12.4 MB vs 1.94 MB on the 10 MiB file above); pay it on the corpora you query repeatedly.
  • Three model classes always fall back. RLE, LINDELTA, and PAETH2D encode residuals against a neighbor context that has to be reconstructed before any candidate match can be verified. The fallback is a correctness-preserving choice, not a missing feature, but it means search on these wires is no faster than decompress + grep. The encoder picks the model based on the data; if your data routes to one of these, you pay the inflate cost on every query.
  • No multi-pattern queries yet. bindu count "p1|p2" is not supported; run two queries and add. This is a gap, not a structural limit — the native operators all support a single pattern at a time today.
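The length-preservation constraint is easy to see with a toy layout (not the SBP wire format): rule bodies are addressed by absolute offsets, so an equal-length splice leaves every other offset valid, while a length change would invalidate every offset after the edit point.

```python
# Toy wire: a byte body plus a table of (offset, length) rule slots.
body = bytearray(b"alt=9.51;spd=9.51;hdg=180")
rules = [(4, 4), (13, 4), (22, 3)]          # "9.51", "9.51", "180"

def patch_rule(body: bytearray, rules, idx: int, new: bytes) -> None:
    off, length = rules[idx]
    if len(new) != length:
        # a length change would shift every offset after the edit point
        raise ValueError("length-preserving edits only on this path")
    body[off:off + length] = new            # in-place splice; size unchanged

for i in (0, 1):
    patch_rule(body, rules, i, b"9.99")
print(bytes(body))  # b'alt=9.99;spd=9.99;hdg=180'
```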

Every capture above is from the source tree as of the most recent commit:

  • Demo run: demo/results/run_20260423_130958/ contains lane_a.log, lane_a.time, lane_c.log, lane_c.time, summary.csv, verify_roundtrip.log. Re-run with demo/run_demo.sh.
  • CI canaries: ci/smoke.sh, sections labelled “FM-index correctness”, “CLI search paths”, and “Grammar in-place replace”. Re-run with bash ci/smoke.sh.
  • FM-index timing: captured in docs/mcp_search.md of the source tree, “Measured: 10 MiB BWT-dominant file” section. Re-run with bindu fm-test.

If you want to run these against your own data, the Getting Started page covers the install, and the Reference → CLI page documents every flag.