# LLM training corpora
LLM training corpora are a natural secondary use case. The flagship workload is satellite and telemetry data, but pretraining datasets share two properties that play to Bindu's strengths: highly structured per-document metadata, and heavy cross-document repetition in boilerplate, templates, and URLs. The text itself compresses at roughly classical-codec parity (see the Hutter Prize results); the wins on this workload come from the metadata layer.
## Why Bindu fits
A typical pretraining dataset row looks like:
{ "url": "https://...", "ts": "2026-01-23T...", "source": "common-crawl-2025-04", "lang": "en", "license": "cc-by-4.0", "text": "<the actual document>", "quality": {"toxicity": 0.01, "perplexity": 24.3, "length": 4129}}Everything outside text is a prime target for stage 3 (reference). URLs, licenses, language codes, source tags, and quality-metric templates repeat billions of times across the corpus.
## Typical setup
```sh
# Generate a schema from a representative sample
bindu schema generate sample/*.jsonl --out corpus-v1.bindus

# Build a shared dictionary across the whole corpus
bindu dict build --schema corpus-v1.bindus \
  --input 'shards/*.jsonl' \
  --out corpus-v1.bindud

# Compress all shards using the shared dictionary
bindu compress --schema corpus-v1.bindus \
  --dict corpus-v1.bindud \
  --jobs 64 \
  --recursive shards/
```

The shared dictionary is built once and reused for every shard. New shards added later can use the same dictionary, and when it drifts too far from the data you version it (`corpus-v2.bindud`).
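How do you know when the dictionary has drifted? One rough signal: compress a fresh shard with and without the shared dictionary and compare output sizes. The sketch below shells out to the CLI with the flags documented above, but it makes three assumptions the docs do not state: that `bindu compress` accepts a single input file, that it writes `<input>.bindu` beside it, and that omitting `--dict` disables the shared dictionary.

```python
import os
import subprocess

SCHEMA = "corpus-v1.bindus"
SHARED_DICT = "corpus-v1.bindud"
SHARD = "shards/new-shard-0001.jsonl"  # illustrative path

def compressed_size(extra_args):
    # Assumed behavior: single-file input, output written as <input>.bindu.
    subprocess.run(["bindu", "compress", "--schema", SCHEMA, *extra_args, SHARD],
                   check=True)
    size = os.path.getsize(SHARD + ".bindu")
    os.remove(SHARD + ".bindu")
    return size

with_dict = compressed_size(["--dict", SHARED_DICT])
without_dict = compressed_size([])  # assumed: no --dict means dictionary off

advantage = 1 - with_dict / without_dict
print(f"shared-dictionary advantage: {advantage:.1%}")
if advantage < 0.05:  # illustrative threshold
    print("dictionary has drifted; rebuild and version as corpus-v2.bindud")
```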
## Expected ratios
On a representative English web corpus (2 TB of JSONL, 40% boilerplate/metadata, 60% text):
| Method | Size | Ratio vs raw |
|---|---|---|
| Raw JSONL | 2.00 TB | 1.0× |
| gzip -6 | 680 GB | 2.9× |
| zstd -9 | 520 GB | 3.8× |
| zstd -19 + long-range | 410 GB | 4.9× |
| Bindu (no shared dict) | 240 GB | 8.3× |
| Bindu (shared dict) | 180 GB | 11.1× |
The text portion compresses at roughly zstd parity; the wins come from the metadata and from shared-dictionary hits on recurring URL prefixes, license strings, and source tags.
## Streaming during training
The computable property matters here: training loaders can read records directly from `.bindu` files without a decompression stage.
```python
from bindu import Reader

def text_stream():
    reader = Reader("corpus-v1.bindu", schema="corpus-v1.bindus")
    for record in reader.stream(columns=["text", "lang"],
                                where="quality.perplexity < 40"):
        yield record["text"]
```

Column projection (`text` and `lang` only) and predicate pushdown (`perplexity < 40`) happen against the compressed form. A quality filter that would touch every byte of a zstd archive reads ~5% of the Bindu archive.
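To wire this into an actual training loop, the stream maps naturally onto an iterable dataset. A sketch assuming PyTorch and the `Reader` API shown above; the class and parameter names are invented for illustration, and multi-worker sharding is not handled:

```python
from torch.utils.data import DataLoader, IterableDataset

from bindu import Reader  # API as shown above

class BinduTextDataset(IterableDataset):
    """Streams quality-filtered text straight from the compressed archive."""

    def __init__(self, archive, schema, max_perplexity=40):
        self.archive = archive
        self.schema = schema
        self.where = f"quality.perplexity < {max_perplexity}"

    def __iter__(self):
        reader = Reader(self.archive, schema=self.schema)
        # Projection and the perplexity predicate run on the compressed form.
        for record in reader.stream(columns=["text"], where=self.where):
            yield record["text"]

loader = DataLoader(BinduTextDataset("corpus-v1.bindu", "corpus-v1.bindus"),
                    batch_size=32)
```

With `num_workers > 0` you would additionally need to partition archives across workers, since every worker would otherwise stream the same records.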
## Deduplication
Bindu’s reference table is effectively a content-addressed store. Two identical documents in different shards share a single physical copy after dictionary build:
```sh
bindu dedupe corpus-v1/ --scope text --min-length 512
```

Exact-match dedup at document granularity is free. Near-duplicate detection (MinHash over paragraph shingles) is a separate pass (`bindu dedupe --method minhash`).
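For intuition about what that second pass computes, here is MinHash over paragraph shingles in miniature. This is an illustration of the technique, not Bindu’s implementation; the shingle width, signature length, and salted-hash permutations are arbitrary choices, and “shingles” is read here as word n-grams within a paragraph:

```python
import hashlib

NUM_HASHES = 64  # signature length; arbitrary
K = 3            # words per shingle; arbitrary

def shingles(paragraph, k=K):
    words = paragraph.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(paragraph):
    # One "permutation" per signature slot, simulated by salting the hash;
    # each slot keeps the minimum hash over all shingles.
    return [
        min(int.from_bytes(hashlib.sha1(f"{salt}:{s}".encode()).digest()[:8], "big")
            for s in shingles(paragraph))
        for salt in range(NUM_HASHES)
    ]

def similarity(sig_a, sig_b):
    # Fraction of agreeing slots estimates Jaccard similarity of shingle sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

a = minhash("the quick brown fox jumps over the lazy dog near the river")
b = minhash("the quick brown fox jumps over the lazy dog by the river")
print(f"estimated similarity: {similarity(a, b):.2f}")
```

Two paragraphs whose signatures agree on most slots almost certainly share most of their shingles, which is how near-duplicates survive small edits that would break exact-match hashing.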
## Pitfalls
- Don’t share dictionaries across languages. A Chinese-optimized dictionary hurts English ratios. Shard by language first (see the sketch after this list).
- Budget for schema drift. Pin `corpus-v1.bindud` to a frozen sample. Rebuild and re-version when the corpus source mix changes meaningfully.
- Text content doesn’t get 10×. The overall 11× ratio is an average; the `text` column alone is closer to 3–4×. Your headline number depends on your metadata-to-text ratio.
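A minimal sketch of the language pre-shard, keyed on the `lang` field from the row format above; the paths and the one-output-shard-per-language simplification are illustrative:

```python
import json
from pathlib import Path

IN_DIR = Path("shards")           # mixed-language input; illustrative path
OUT_DIR = Path("shards-by-lang")  # one subdirectory per language

writers = {}  # lang -> open file handle

# Route every record by its lang field so each language gets its own
# shard set, and later its own schema and dictionary build.
for path in sorted(IN_DIR.glob("*.jsonl")):
    with path.open() as f:
        for line in f:
            lang = json.loads(line).get("lang", "unknown")
            if lang not in writers:
                lang_dir = OUT_DIR / lang
                lang_dir.mkdir(parents=True, exist_ok=True)
                # Simplification: one output shard per language.
                writers[lang] = (lang_dir / "shard-0000.jsonl").open("w")
            writers[lang].write(line)

for w in writers.values():
    w.close()
```

Then run the dictionary build once per `shards-by-lang/<lang>/` directory instead of once for the whole corpus.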