Source code repositories

Source code is a secondary use case for Bindu (the flagship is satellite & telemetry). Code is structured, highly repetitive, and often stored at scale: three properties that suit Bindu’s symbolic representation. This page describes how the system applies to monorepos, code search, and training corpora.

Why Bindu beats byte compressors on code

Byte compressors find textual repeats. That catches keywords and common tokens, but it misses:

Identifier canonicalization. getUserById and GetUserByID are structurally identical calls in different casing conventions. Bindu unifies them through a shared identifier table.
Import/package trees. import { foo } from '../../lib/foo' repeats across files with small path variations. Bindu stores the resolved module graph once.
Whitespace/formatting. A Prettier-formatted version and an unformatted version of the same code compress identically.
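
The identifier canonicalization point can be illustrated with a minimal Python sketch. This is not Bindu's implementation: the word-splitting regex, `canonical_key`, and `IdentifierTable` are hypothetical names chosen for illustration. The idea is only that every spelling maps to one casing-independent key, and that key is stored once in a shared table.

```python
import re

# Splits camelCase, PascalCase, snake_case and acronym runs into words.
# This regex is an illustrative assumption, not Bindu's actual rule.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def canonical_key(identifier: str) -> tuple[str, ...]:
    """Return a casing-independent key, e.g. ('get', 'user', 'by', 'id')."""
    return tuple(word.lower() for word in _WORD.findall(identifier))


class IdentifierTable:
    """Hypothetical shared table: one entry per canonical key."""

    def __init__(self) -> None:
        self._ids: dict[tuple[str, ...], int] = {}

    def intern(self, identifier: str) -> int:
        key = canonical_key(identifier)
        # setdefault assigns the next free id only the first time a key is seen.
        return self._ids.setdefault(key, len(self._ids))

    def __len__(self) -> int:
        return len(self._ids)


table = IdentifierTable()
names = ("getUserById", "GetUserByID", "get_user_by_id", "deleteUser")
ids = [table.intern(name) for name in names]
print(ids, len(table))
```

A byte compressor sees the first three names as three different strings. In this sketch they share one table entry, so only the fourth name adds a new one.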

Typical setup

Use a language-specific schema:

bindu compress --schema source-ts-v1 --recursive ./src/
bindu compress --schema source-py-v1 --recursive ./ml/
bindu compress --schema source-go-v1 --recursive ./infra/

Or, for a mixed-language repository:

bindu compress --schema source-auto --recursive ./monorepo/

source-auto dispatches per file extension and maintains a shared identifier dictionary across languages.
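
As a rough illustration of per-extension dispatch, the Python sketch below maps file paths to the schema names used in the commands above. The extension table, the `pick_schema` helper, and the `None` result for unknown extensions are assumptions made for this example; this page does not list which extensions source-auto recognizes or how it treats unrecognized files.

```python
from pathlib import PurePosixPath
from typing import Optional

# Schema names come from the commands above; the extension lists are assumed.
SCHEMA_BY_EXTENSION = {
    ".ts": "source-ts-v1",
    ".tsx": "source-ts-v1",
    ".py": "source-py-v1",
    ".go": "source-go-v1",
}


def pick_schema(path: str) -> Optional[str]:
    """Return the schema for a file, or None when the extension is unknown."""
    return SCHEMA_BY_EXTENSION.get(PurePosixPath(path).suffix.lower())


for path in ("src/api/user.ts", "ml/train.py", "infra/main.go", "docs/logo.png"):
    print(path, "->", pick_schema(path))
```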

Expected ratios

On a 12 GB TypeScript monorepo:

Method	Size	Ratio
Raw	12.0 GB	1.0×
gzip -6	2.1 GB	5.7×
zstd -9	1.6 GB	7.5×
Bindu	0.6 GB	20.0×

Code is where Bindu’s ratio advantage is largest because the structural redundancy is extreme — the same AST shapes recur thousands of times.
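
The ratios in the table are raw size divided by compressed size. The short Python check below recomputes them from the sizes listed above; the variable names are illustrative only.

```python
RAW_GB = 12.0
SIZES_GB = {"gzip -6": 2.1, "zstd -9": 1.6, "Bindu": 0.6}


def ratio(raw: float, compressed: float) -> float:
    """Compression ratio: how many times smaller the output is than the input."""
    return raw / compressed


ratios = {method: round(ratio(RAW_GB, size), 1) for method, size in SIZES_GB.items()}
print(ratios)

# How many times smaller the Bindu archive is than the zstd -9 output.
bindu_vs_zstd = round(SIZES_GB["zstd -9"] / SIZES_GB["Bindu"], 2)
print(bindu_vs_zstd)
```

By the same arithmetic, the Bindu archive in this table is roughly 2.7× smaller than the zstd -9 output.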

Code search in place

The computable property enables AST-level search without a decompression pass:

# Find all callers of a function across a compressed monorepo
bindu code search monorepo.bindu --pattern 'call: getUserById(_)'

# Find TODO comments attached to exported functions
bindu code search monorepo.bindu --pattern 'comment[contains="TODO"] > export function'

Internally this is a tree walk over the semantic form. It skips the decompression pass that grep over the source tree would need, and it matches on AST structure rather than raw text.
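
To show what AST awareness buys over text matching, here is a small stand-in using Python's standard `ast` module on Python source. It does not use Bindu's semantic form or pattern language, which are not specified on this page; `callee_name`, `find_calls`, and `grep_lines` are hypothetical helpers. The sketch finds one-argument calls to `getUserById` and compares the result with a plain substring search.

```python
import ast

SOURCE = '''
def handler(request):
    # getUserById is mentioned here but not called
    user = getUserById(request.user_id)
    label = "getUserById(x)"
    return repo.getUserById(user.manager_id)
'''


def callee_name(call: ast.Call):
    """Name of the called function for plain calls and attribute calls."""
    func = call.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return None


def find_calls(source: str, name: str, arity: int = 1) -> list[int]:
    """Line numbers of calls to `name` with exactly `arity` positional args."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and callee_name(node) == name:
            if len(node.args) == arity and not node.keywords:
                hits.append(node.lineno)
    return sorted(hits)


def grep_lines(source: str, needle: str) -> list[int]:
    """Line numbers containing `needle` as plain text, like a basic grep."""
    return [n for n, line in enumerate(source.splitlines(), start=1) if needle in line]


print("AST hits: ", find_calls(SOURCE, "getUserById"))
print("grep hits:", grep_lines(SOURCE, "getUserById"))
```

The substring search also reports the comment and the string literal, while the tree walk reports only the two real call sites.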

Use in code-training corpora

For code LLM training:

bindu compress \
  --schema source-auto \
  --dict code-v1.bindud \
  --dedupe exact \
  --filter 'license in ("MIT","Apache-2.0","BSD-*")' \
  --recursive crawl/

--dedupe exact is effectively free (see LLM training corpora), and license filtering happens before compression so disallowed files never hit the archive.
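
The combination of license filtering and exact deduplication can be sketched in Python as follows. This is an illustration under stated assumptions, not Bindu's pipeline: the `(path, license_id, content)` record shape is a hypothetical stand-in for a crawl manifest, exact duplicates are detected here with SHA-256 of the file bytes, and the license patterns are read as shell-style globs.

```python
import hashlib
from fnmatch import fnmatchcase

# Patterns copied from the --filter expression above, treated as globs (assumed).
ALLOWED_LICENSES = ("MIT", "Apache-2.0", "BSD-*")


def license_allowed(license_id: str) -> bool:
    return any(fnmatchcase(license_id, pattern) for pattern in ALLOWED_LICENSES)


def select_files(files):
    """Yield (path, content) for allowed files whose bytes were not seen before.

    `files` is an iterable of (path, license_id, content_bytes) tuples;
    this record shape is hypothetical.
    """
    seen = set()
    for path, license_id, content in files:
        if not license_allowed(license_id):
            continue  # filtered out before hashing or compression
        digest = hashlib.sha256(content).digest()
        if digest in seen:
            continue  # byte-identical to a file already kept
        seen.add(digest)
        yield path, content


crawl = [
    ("a/util.py", "MIT", b"def f(): pass\n"),
    ("b/util.py", "BSD-3-Clause", b"def f(): pass\n"),
    ("c/core.py", "GPL-3.0-only", b"def g(): pass\n"),
    ("d/core.py", "Apache-2.0", b"def g(): pass\n"),
]
kept = [path for path, _ in select_files(crawl)]
print(kept)
```

In this sketch the filter runs before the duplicate check, so a disallowed file never shadows an allowed copy of the same content: `d/core.py` is kept even though `c/core.py` has identical bytes.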

Pitfalls

Generated code hurts ratios. Minified bundles, auto-generated protobuf code, and vendored directories (node_modules, vendor) have repetition patterns that the language schemas model poorly. Exclude them, or compress them separately with the generic text schema.
Binary assets in a source tree. Images, fonts, and compiled artifacts inside the repo fall back to zstd (see When not to use Bindu). Use .binduignore to skip them.
Schema pinning matters. Language schemas evolve as grammars do. Pin the major version (source-ts-v1) to keep archives decodable across toolchain upgrades.