
Operating on compressed data

This is the section that diverges most sharply from conventional codecs. With gzip or zstd, the compressed file is a dead artifact: you decompress it before doing anything useful. With Bindu, three operations run directly against the compressed wire: search, count/locate, and edit.

On the previous page, we compressed Alice’s Adventures in Wonderland into alice.txt.bindu. This page continues with that file: first count a term, then replace it with another term, then verify that the edited compressed artifact still decompresses correctly.

Terminal window
# Confirm the file currently contains "Alice"
bindu count "Alice" alice.txt.bindu
# → N occurrences
# Patch Alice to Mabel everywhere
bindu replace alice.txt.bindu "Alice" "Mabel"
# Verify
bindu count "Alice" alice.txt.bindu
# → 0 occurrences
bindu count "Mabel" alice.txt.bindu
# → N occurrences
# Decompress and confirm round-trip
bindu decompress alice.txt.bindu --output alice-edited.txt
# grep -c counts matching lines, not occurrences, so the Mabel figure can be lower than N
grep -c "Alice" alice-edited.txt
# → 0
grep -c "Mabel" alice-edited.txt

The replace command edited the compressed file in place so that every occurrence of Alice became Mabel without ever materializing the decompressed text. The grep at the end is only there to verify the round-trip.

The important property is not the word Alice. It is that the compressed file remains an active data structure: Bindu can count, locate, and patch supported patterns without first writing alice.txt back out to disk.

This specific edit hits the fast path because Alice and Mabel have the same byte length. Bindu still supports different-length replacements, but today those fall back to decompress / recompress. That fallback produces a correct result, but it no longer delivers the compressed-space speedup.

The Alice file is intentionally small and readable. To measure the performance impact of editing in compressed space, we benchmark the same edit path on a separate 4 MB ADS-B aircraft-tracking archive.

ADS-B is the broadcast data that aircraft use to report their position, altitude, speed, heading, and identity; in archive form it is usually structured telemetry with many repeated fields and numeric values. In that benchmark, replacing 9.51 with 9.99 completes in 3 ms. The conventional pipeline, zstd -d | sed ... | zstd -19, takes 1.33 s, so the compressed-space edit is about 443x faster. See Operating benchmarks for the full run.

When the grammar pipeline is selected, Bindu structures the compressed output as a rule table plus a reference stream.

A grammar-compressed artifact contains:

  • A rule table (typically a few KB) with each unique repeated phrase stored exactly once.
  • A reference stream (most of the body) made of index numbers that point into the rule table, plus residuals where needed.
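
As a rough mental model, the two parts can be sketched in Python. This is a toy illustration, not Bindu's actual wire format or API: the names `Artifact`, `rules`, `refs`, and `expand` are assumptions, and residuals are modeled simply as literal byte strings mixed into the reference stream.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Artifact:
    """Toy model of a grammar-compressed artifact (hypothetical layout)."""
    rules: List[bytes]              # rule table: each repeated phrase stored once
    refs: List[Union[int, bytes]]   # reference stream: rule indices or literal residuals


def expand(artifact: Artifact) -> bytes:
    """Rebuild the original bytes by resolving each reference in order."""
    out = bytearray()
    for item in artifact.refs:
        if isinstance(item, int):
            out += artifact.rules[item]   # reference into the rule table
        else:
            out += item                   # residual bytes stored inline
    return bytes(out)


artifact = Artifact(
    rules=[b"Alice", b" said "],
    refs=[0, 1, b"hello to ", 0, b"."],
)
text = expand(artifact)
```

In this model the phrase `Alice` is stored once in `rules` even though it appears twice in the expanded text; the body only carries the index `0`.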

Why an edit can propagate from a single small write. Because every occurrence of a phrase in the body is a reference to a single rule entry, patching that rule entry in place updates every occurrence implicitly. There’s no need to walk the body, no offsets to recalculate, and no decompressed copy to rewrite.
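
The propagation effect can be shown with a minimal sketch. It assumes a simplified model in which the rule table is a Python list and the reference stream is a list of indices; the names `rules`, `refs`, and `expand` are hypothetical and do not reflect Bindu's internal layout.

```python
# Toy model only: names and layout are assumptions, not Bindu's real format.
rules = [b"alt=9.51;", b"status=OK;"]   # rule table: each phrase stored once
refs = [0, 1, 0, 0, 1, 0]               # reference stream: indices into the rule table
refs_snapshot = list(refs)


def expand(rules, refs):
    """Reconstruct the original bytes by following every reference."""
    return b"".join(rules[i] for i in refs)


before = expand(rules, refs)

# One small write to the rule table; the reference stream is left untouched.
rules[0] = rules[0].replace(b"9.51", b"9.99")

after = expand(rules, refs)
```

Here `expand` is called only to check the result. The edit itself is the single assignment to `rules[0]`, and all four places that reference rule 0 see the new value.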

The important limit is that this is a fixed-width patch fast path, not a general-purpose compressed text editor. If len(old) != len(new), changing the rule body’s length would cascade through every offset that follows it. The fast path is most useful for operational corrections where field widths are already fixed: numeric updates, coordinate fixes, status-code changes, and ID rewrites.
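
To see why a length change cascades, consider a sketch in which rule bodies are packed back to back and located by byte offset. The packed layout and the helper names `pack` and `patch_in_place` are assumptions made for illustration; the real format may differ.

```python
def pack(rules):
    """Concatenate rule bodies and record the byte offset where each one starts."""
    blob = bytearray()
    offsets = []
    for body in rules:
        offsets.append(len(blob))
        blob += body
    return blob, offsets


def patch_in_place(blob, offsets, index, new_body):
    """Overwrite one rule body without moving any other byte (fixed width only)."""
    start = offsets[index]
    end = offsets[index + 1] if index + 1 < len(offsets) else len(blob)
    if len(new_body) != end - start:
        # A different length would shift every later rule, so refuse here;
        # a caller would fall back to decompress / recompress instead.
        raise ValueError("length change requires the slow path")
    blob[start:end] = new_body


blob, offsets = pack([b"9.51", b"GLOBAL_NAV_OK", b"KSFO"])
patch_in_place(blob, offsets, 0, b"9.99")      # same width: offsets stay valid

# Repacking with a longer first rule shows the cascade: every later offset moves.
_, shifted = pack([b"9.5100", b"GLOBAL_NAV_OK", b"KSFO"])
```

With a same-width patch, `offsets` stays `[0, 4, 17]`. With the six-byte body, the repacked offsets become `[0, 6, 19]`, so anything that stored the old positions would have to be rewritten.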

Why search uses the same structure. A query like bindu count "GLOBAL_NAV_OK" runs in two cheap phases. First, a memcmp probe over the rule table answers “does any rule’s body match the query?”. Second, if rule index N matched, Bindu counts how many times N appears in the reference stream. That’s metadata about the rule, not a scan over reconstructed body content.
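
The two phases can be sketched as follows. This is a simplified illustration, not Bindu's implementation: `count_in_compressed` is a hypothetical name, byte-string equality stands in for the memcmp probe, and the sketch only handles queries that match a whole rule body. It ignores residuals and matches that span rule boundaries.

```python
def count_in_compressed(rules, refs, query):
    """Count occurrences of `query` without expanding the reference stream."""
    # Phase 1: probe the (small) rule table for bodies equal to the query.
    matching = {i for i, body in enumerate(rules) if body == query}
    if not matching:
        return 0
    # Phase 2: count how often the matching rule indices appear in the stream.
    return sum(1 for index in refs if index in matching)


rules = [b"GLOBAL_NAV_OK", b"GLOBAL_NAV_FAIL"]
refs = [0, 0, 1, 0, 1]

ok_count = count_in_compressed(rules, refs, b"GLOBAL_NAV_OK")
fail_count = count_in_compressed(rules, refs, b"GLOBAL_NAV_FAIL")
missing_count = count_in_compressed(rules, refs, b"GLOBAL_NAV_DEGRADED")
```

Neither phase touches reconstructed text: phase 1 compares against rule bodies, and phase 2 compares small integers.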

For the command details, search model coverage, and cross-file examples, continue to Edit and search in depth.