
Source code repositories

Source code is a secondary use case for Bindu (the flagship is satellite & telemetry). Code is structured, highly repetitive, and often stored at scale: three properties that suit Bindu’s symbolic representation. This page describes how the system applies to monorepos, code search, and training corpora.

Byte compressors find textual repeats. That catches keywords and common tokens, but it misses:

  • Identifier canonicalization. getUserById and GetUserByID are structurally identical calls in different casing conventions. Bindu unifies them through a shared identifier table.
  • Import/package trees. import { foo } from '../../lib/foo' repeats across files with small path variations. Bindu stores the resolved module graph once.
  • Whitespace/formatting. Under Bindu, a Prettier-formatted file and an unformatted version of the same code compress identically; to a byte compressor they are different inputs.
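
The identifier-canonicalization point can be illustrated with a small sketch. This is not Bindu's implementation: `canonical_key` and `IdentifierTable` are hypothetical names, and the word-splitting rule is an assumption chosen only to show how two casing conventions can share one table entry.

```python
import re

# Splits camelCase, PascalCase, and snake_case identifiers into word tokens.
# The splitting rule is an illustrative assumption, not Bindu's actual rule.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def canonical_key(identifier):
    """Return a casing-independent key for an identifier (hypothetical helper)."""
    return tuple(word.lower() for word in _WORD.findall(identifier))


class IdentifierTable:
    """Toy shared table: one index per canonical key, however it is spelled."""

    def __init__(self):
        self._ids = {}

    def intern(self, identifier):
        key = canonical_key(identifier)
        # setdefault assigns the next free index only the first time a key is seen.
        return self._ids.setdefault(key, len(self._ids))


table = IdentifierTable()
a = table.intern("getUserById")
b = table.intern("GetUserByID")
c = table.intern("deleteUser")
print(a, b, c)  # → 0 0 1
```

The two spellings of the same call share index 0, so only the second identifier needs a new entry. A lossless encoder would presumably also have to record each occurrence's original spelling; that part is omitted here.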

Use a language-specific schema:

bindu compress --schema source-ts-v1 --recursive ./src/
bindu compress --schema source-py-v1 --recursive ./ml/
bindu compress --schema source-go-v1 --recursive ./infra/

Or mixed:

bindu compress --schema source-auto --recursive ./monorepo/

source-auto dispatches per file extension and maintains a shared identifier dictionary across languages.
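
As a rough illustration of per-extension dispatch, the sketch below groups files by the schema that would handle them. The schema names are the ones used in the commands above; the extension-to-schema mapping and the helper names `pick_schema` and `group_by_schema` are assumptions made for the example, not documented Bindu behavior.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Schema names come from the commands above; the extension mapping is assumed.
SCHEMA_BY_EXTENSION = {
    ".ts": "source-ts-v1",
    ".py": "source-py-v1",
    ".go": "source-go-v1",
}


def pick_schema(path):
    """Return the schema name for a file, or None if no language schema applies."""
    return SCHEMA_BY_EXTENSION.get(PurePosixPath(path).suffix.lower())


def group_by_schema(paths):
    """Group file paths by the schema that would handle them."""
    groups = defaultdict(list)
    for path in paths:
        groups[pick_schema(path)].append(path)
    return dict(groups)


files = ["src/app.ts", "ml/train.py", "infra/main.go", "README.md"]
for schema, members in group_by_schema(files).items():
    print(schema, members)
```

Files that map to `None` here are the ones with no matching language schema; how `source-auto` actually treats them is not specified on this page.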

On a 12 GB TypeScript monorepo:

Method     Size      Ratio
Raw        12.0 GB   1.0×
gzip -6    2.1 GB    5.7×
zstd -9    1.6 GB    7.5×
Bindu      0.6 GB    20.0×

Code is where Bindu’s ratio advantage is largest because the structural redundancy is extreme — the same AST shapes recur thousands of times.
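
Each ratio in the table is the raw size divided by the compressed size. The short check below recomputes them from the reported sizes; the `ratio` helper is only for illustration.

```python
RAW_GB = 12.0

# Compressed sizes as reported in the table above.
sizes_gb = {"gzip -6": 2.1, "zstd -9": 1.6, "Bindu": 0.6}


def ratio(raw, compressed):
    """Compression ratio: how many times smaller the output is than the input."""
    return raw / compressed


for method, size in sizes_gb.items():
    print(f"{method}: {ratio(RAW_GB, size):.1f}x")

# Relative to zstd -9, the Bindu archive is 1.6 / 0.6, about 2.7 times smaller.
print(f"Bindu vs zstd -9: {sizes_gb['zstd -9'] / sizes_gb['Bindu']:.1f}x")
```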

The computable property enables AST-level search without a decompression pass:

# Find all callers of a function across a compressed monorepo
bindu code search monorepo.bindu --pattern 'call: getUserById(_)'
# Find TODO comments attached to exported functions
bindu code search monorepo.bindu --pattern 'comment[contains="TODO"] > export function'

Internally this is a tree-walk over the semantic form — vastly cheaper than grep over the decompressed source tree, and with real AST awareness.
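
To show what AST awareness buys over text matching, here is a minimal sketch that uses Python's standard `ast` module on plain, uncompressed Python source. It does not touch Bindu archives or reproduce the `bindu code search` pattern language; `find_callers` and the sample snippet are hypothetical.

```python
import ast


def find_callers(source, function_name):
    """Return sorted line numbers of calls to function_name in Python source."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        callee = node.func
        # Plain call: f(x) is an ast.Name; method-style call: obj.f(x) is an ast.Attribute.
        if isinstance(callee, ast.Name):
            name = callee.id
        else:
            name = getattr(callee, "attr", None)
        if name == function_name:
            lines.append(node.lineno)
    return sorted(lines)


SAMPLE = '''\
def get_user_by_id(uid):
    return {"id": uid}

# get_user_by_id(1) in a comment is not a call
label = "get_user_by_id(2)"
user = get_user_by_id(3)
admin = repo.get_user_by_id(4)
'''

print(find_callers(SAMPLE, "get_user_by_id"))  # → [6, 7]
```

A text search for `get_user_by_id(` would also match the definition, the comment, and the string literal on lines 1, 4, and 5. The tree walk reports only the two real call sites.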

For code LLM training:

bindu compress \
  --schema source-auto \
  --dict code-v1.bindud \
  --dedupe exact \
  --filter 'license in ("MIT","Apache-2.0","BSD-*")' \
  --recursive crawl/

--dedupe exact is effectively free (see LLM training corpora), and license filtering happens before compression so disallowed files never hit the archive.
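
The ordering described above (filter by license first, then drop exact duplicates) can be sketched as follows. This is a conceptual model, not Bindu's code: it assumes that `BSD-*` in the filter expression behaves like a glob, uses SHA-256 as a stand-in for whatever content identity Bindu uses, and `select_files` is a hypothetical helper.

```python
import hashlib
from fnmatch import fnmatchcase

# Patterns taken from the --filter expression above; glob semantics are assumed.
ALLOWED_LICENSES = ("MIT", "Apache-2.0", "BSD-*")


def license_allowed(license_id):
    """True if the license identifier matches one of the allowed patterns."""
    return any(fnmatchcase(license_id, pattern) for pattern in ALLOWED_LICENSES)


def select_files(files):
    """Yield paths to keep from (path, license_id, content_bytes) tuples."""
    seen = set()
    for path, license_id, content in files:
        if not license_allowed(license_id):
            continue  # rejected before any hashing or compression happens
        digest = hashlib.sha256(content).digest()
        if digest in seen:
            continue  # byte-identical to a file that was already kept
        seen.add(digest)
        yield path


crawl = [
    ("a/util.py", "MIT", b"def f(): pass\n"),
    ("b/util.py", "BSD-3-Clause", b"def f(): pass\n"),  # exact duplicate of a/util.py
    ("c/core.py", "GPL-3.0", b"def g(): pass\n"),       # license not allowed
    ("d/core.py", "Apache-2.0", b"def h(): pass\n"),
]
print(list(select_files(crawl)))  # → ['a/util.py', 'd/core.py']
```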

Caveats:

  • Generated code hurts ratios. Minified bundles, auto-generated protobuf code, and vendored node_modules/vendor directories have repetition patterns that don’t match the language schemas well. Exclude them or compress them separately with the generic text schema.
  • Binary assets in a source tree. Images, fonts, and compiled artifacts inside the repo fall back to zstd (see When not to use Bindu). Use .binduignore to skip them.
  • Schema pinning matters. Language schemas evolve as grammars do. Pin the major version (source-ts-v1) to keep archives decodable across toolchain upgrades.
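
One way to act on the generated-code point is to split a file list before compression. The sketch below is an assumption-laden example: the directory names come from the text above, while the filename patterns (`*.min.js`, `*.pb.go`, `*_pb2.py`) are common conventions for minified and protobuf-generated files rather than anything Bindu defines, and `is_generated_or_vendored` is a hypothetical helper. It deliberately does not model `.binduignore`, whose syntax is not described here.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Directory names from the text above; filename patterns are common conventions.
VENDORED_DIRS = {"node_modules", "vendor"}
GENERATED_PATTERNS = ("*.min.js", "*.pb.go", "*_pb2.py")


def is_generated_or_vendored(path):
    """True if the path sits under a vendored directory or looks machine-generated."""
    p = PurePosixPath(path)
    # Only directory components count, so a file named vendor.ts is not excluded.
    if VENDORED_DIRS.intersection(p.parts[:-1]):
        return True
    return any(fnmatchcase(p.name, pattern) for pattern in GENERATED_PATTERNS)


paths = [
    "src/app.ts",
    "src/vendor.ts",
    "node_modules/react/index.js",
    "dist/app.min.js",
    "api/user.pb.go",
    "vendor/lib/x.go",
]
keep = [p for p in paths if not is_generated_or_vendored(p)]
print(keep)  # → ['src/app.ts', 'src/vendor.ts']
```

The excluded paths can then be skipped entirely or compressed in a separate run with the generic text schema.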