Source code repositories
Source code is a secondary applicable use case (the flagship is satellite & telemetry). Code is structured, highly repetitive, and often stored at scale: three properties that suit Bindu’s symbolic representation. This page describes how Bindu applies to monorepos, code search, and training corpora.
Why Bindu beats byte compressors on code
Byte compressors find textual repeats. That catches keywords and common tokens, but it misses:
- Identifier canonicalization. `getUserById` and `GetUserByID` are structurally identical calls in different casing conventions. Bindu unifies them through a shared identifier table.
- Import/package trees. `import { foo } from '../../lib/foo'` repeats across files with small path variations. Bindu stores the resolved module graph once.
- Whitespace/formatting. A Prettier-formatted and a raw version of the same code compress identically.
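To make the identifier-table idea concrete, here is a minimal sketch of casing-independent interning. It is not Bindu's implementation: `canonical_key`, `IdentifierTable`, and the word-splitting regex are hypothetical names and choices made for illustration only.

```python
import re

# Splits camelCase, PascalCase, snake_case, and acronym runs such as "ID".
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def canonical_key(identifier: str) -> tuple[str, ...]:
    """Return a casing-independent key, e.g. ('get', 'user', 'by', 'id')."""
    return tuple(word.lower() for word in _WORD.findall(identifier))


class IdentifierTable:
    """Hypothetical shared table: one integer ID per canonical identifier."""

    def __init__(self) -> None:
        self._ids: dict[tuple[str, ...], int] = {}

    def intern(self, identifier: str) -> int:
        key = canonical_key(identifier)
        # setdefault assigns the next free ID only the first time a key is seen.
        return self._ids.setdefault(key, len(self._ids))


table = IdentifierTable()
names = ("getUserById", "GetUserByID", "get_user_by_id", "listUsers")
ids = [table.intern(name) for name in names]
print(ids)  # [0, 0, 0, 1]
```

A lossless codec would also have to record each occurrence's original casing next to the shared ID so the source can be restored exactly; the sketch leaves that part out.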
Typical setup
Use a language-specific schema:

    bindu compress --schema source-ts-v1 --recursive ./src/
    bindu compress --schema source-py-v1 --recursive ./ml/
    bindu compress --schema source-go-v1 --recursive ./infra/

Or mixed:

    bindu compress --schema source-auto --recursive ./monorepo/

`source-auto` dispatches per file extension and maintains a shared identifier dictionary across languages.
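Per-extension dispatch can be pictured as a simple lookup. The sketch below reuses the schema names from the commands above, but the extension mapping and the `pick_schema` helper are assumptions, not documented Bindu behavior.

```python
from pathlib import PurePosixPath
from typing import Optional

# Schema names come from the commands above; the extension mapping is assumed.
SCHEMA_BY_EXTENSION = {
    ".ts": "source-ts-v1",
    ".tsx": "source-ts-v1",
    ".py": "source-py-v1",
    ".go": "source-go-v1",
}


def pick_schema(path: str) -> Optional[str]:
    """Return the schema for a file, or None when no language schema applies."""
    return SCHEMA_BY_EXTENSION.get(PurePosixPath(path).suffix.lower())


for path in ("src/api/user.ts", "ml/train.py", "infra/main.go", "README.md"):
    print(path, "->", pick_schema(path))
```

Returning `None` for unknown extensions leaves the fallback decision to the caller, which keeps the mapping itself free of policy.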
Expected ratios
On a 12 GB TypeScript monorepo:
| Method | Size | Ratio |
|---|---|---|
| Raw | 12.0 GB | 1.0× |
| gzip -6 | 2.1 GB | 5.7× |
| zstd -9 | 1.6 GB | 7.5× |
| Bindu | 0.6 GB | 20.0× |
Code is where Bindu’s ratio advantage is largest because the structural redundancy is extreme — the same AST shapes recur thousands of times.
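The recurrence of AST shapes is easy to observe on a small scale. The following sketch uses Python's standard `ast` module as a stand-in for a language schema; the `shape` helper is hypothetical and says nothing about how Bindu encodes trees.

```python
import ast
from collections import Counter


def shape(node: ast.AST) -> str:
    """Describe a node by its type and its children's shapes, ignoring names and literals."""
    children = ", ".join(shape(child) for child in ast.iter_child_nodes(node))
    return f"{type(node).__name__}({children})"


source = """
def get_user(db, user_id):
    return db.users.find(user_id)

def get_order(db, order_id):
    return db.orders.find(order_id)
"""

# One shape string per top-level statement; identical structures collide.
counts = Counter(shape(node) for node in ast.parse(source).body)
print(counts.most_common(1)[0][1])  # 2: both functions share one shape
```

The two functions differ only in identifiers, so a byte compressor sees two different strings while a structural view sees one shape used twice.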
Code search in place
The computable property enables AST-level search without a decompression pass:

    # Find all callers of a function across a compressed monorepo
    bindu code search monorepo.bindu --pattern 'call: getUserById(_)'

    # Find TODO comments attached to exported functions
    bindu code search monorepo.bindu --pattern 'comment[contains="TODO"] > export function'

Internally this is a tree-walk over the semantic form, which is vastly cheaper than grep over the decompressed source tree and comes with real AST awareness.
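What AST awareness buys over a textual match can be shown with a small tree walk. This sketch runs on plain Python source with the standard `ast` module rather than on a Bindu archive, and reading `_` in the pattern as a single-argument wildcard is an assumption; `find_calls` is a hypothetical helper.

```python
import ast


def find_calls(source: str, name: str, arity: int = 1) -> list[int]:
    """Return line numbers of calls to `name` that pass exactly `arity` positional arguments."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match both bare calls, f(x), and method-style calls, obj.f(x).
        called = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if called == name and len(node.args) == arity and not node.keywords:
            hits.append(node.lineno)
    return sorted(hits)


source = '''
user = getUserById(42)
note = "getUserById(1) inside a string is not a call"
admin = repo.getUserById(admin_id)
'''
print(find_calls(source, "getUserById"))  # [2, 4]
```

A plain text search for `getUserById(` would also report line 3, where the name only appears inside a string literal.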
Use in code-training corpora
For code LLM training:

    bindu compress \
      --schema source-auto \
      --dict code-v1.bindud \
      --dedupe exact \
      --filter 'license in ("MIT","Apache-2.0","BSD-*")' \
      --recursive crawl/

`--dedupe exact` is effectively free (see LLM training corpora), and license filtering happens before compression so disallowed files never hit the archive.
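The ordering described here, filter first and then drop exact duplicates, can be sketched as a generator. The record layout, the `select_files` helper, and the use of SHA-256 for exact matching are assumptions made for illustration; only the license patterns are taken from the command above.

```python
import hashlib
from fnmatch import fnmatchcase

ALLOWED_LICENSES = ("MIT", "Apache-2.0", "BSD-*")  # patterns from the --filter above


def license_allowed(license_id: str) -> bool:
    return any(fnmatchcase(license_id, pattern) for pattern in ALLOWED_LICENSES)


def select_files(files):
    """Yield (path, data) for allowed, not-yet-seen contents.

    `files` is an iterable of (path, license_id, data) tuples; this record
    layout is hypothetical.
    """
    seen = set()
    for path, license_id, data in files:
        if not license_allowed(license_id):
            continue  # filtered out before any compression work happens
        digest = hashlib.sha256(data).digest()
        if digest in seen:
            continue  # byte-for-byte duplicate of an earlier file
        seen.add(digest)
        yield path, data


crawl = [
    ("a/util.py", "MIT", b"def f(): pass\n"),
    ("b/util.py", "BSD-3-Clause", b"def f(): pass\n"),
    ("c/core.py", "GPL-3.0", b"def g(): pass\n"),
    ("d/main.go", "Apache-2.0", b"package main\n"),
]
print([path for path, _ in select_files(crawl)])  # ['a/util.py', 'd/main.go']
```

Checking the license before hashing means disallowed files cost neither a digest nor a slot in the `seen` set.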
Pitfalls
- Generated code hurts ratios. Minified bundles, auto-generated protobuf code, and vendored `node_modules`/`vendor` directories have weird repetition patterns that don’t match the language schemas well. Exclude them or compress separately with the generic text schema.
- Binary assets in a source tree. Images, fonts, and compiled artifacts inside the repo fall back to zstd (see When not to use Bindu). Use `.binduignore` to skip them.
- Schema pinning matters. Language schemas evolve as grammars do. Pin the major version (`source-ts-v1`) to keep archives decodable across toolchain upgrades.
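One way to act on the first two pitfalls is to triage paths before compressing. The sketch below is a hypothetical pre-pass: the directory names come from the list above, while the suffix sets and the `classify` helper are illustrative assumptions and do not describe `.binduignore` syntax or Bindu defaults.

```python
from pathlib import PurePosixPath

# All three collections are illustrative assumptions, not Bindu defaults.
VENDORED_DIRS = {"node_modules", "vendor"}
BINARY_SUFFIXES = {".png", ".jpg", ".woff2", ".ttf", ".so", ".o"}
GENERATED_SUFFIXES = (".min.js", ".pb.go", "_pb2.py")


def classify(path: str) -> str:
    """Return 'skip', 'generic', or 'language' for a repository-relative path."""
    p = PurePosixPath(path)
    if p.suffix.lower() in BINARY_SUFFIXES:
        return "skip"  # binary assets: leave out of the archive
    if VENDORED_DIRS & set(p.parts[:-1]) or p.name.endswith(GENERATED_SUFFIXES):
        return "generic"  # vendored or generated: compress separately
    return "language"


for path in ("src/app.ts", "node_modules/react/index.js", "api/user.pb.go", "assets/logo.png"):
    print(path, "->", classify(path))
```

Binary files are checked first so that an image inside a vendored directory is skipped rather than sent to the generic schema.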