LLM training corpora

LLM training corpora are a strong secondary use case. The flagship workload is satellite & telemetry, but pretraining datasets share two properties that line up well with Bindu: highly structured per-document metadata and heavy cross-document repetition in boilerplate, templates, and URLs. The text itself compresses at roughly classical-codec parity (see the Hutter Prize results); the wins on this workload come from the metadata layer.

A typical pretraining dataset row looks like:

{
  "url": "https://...",
  "ts": "2026-01-23T...",
  "source": "common-crawl-2025-04",
  "lang": "en",
  "license": "cc-by-4.0",
  "text": "<the actual document>",
  "quality": {"toxicity": 0.01, "perplexity": 24.3, "length": 4129}
}

Everything outside text is a prime target for stage 3 (reference). URLs, licenses, language codes, source tags, and quality-metric templates repeat billions of times across the corpus.
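How much of that layer repeats is easy to check before committing to a schema: count distinct values per metadata field over a sample of shards. The sketch below is illustrative, uses only the Python standard library, and assumes the field names from the example row plus a sample/*.jsonl layout like the one fed to bindu schema generate.

import glob
import json
from collections import Counter

# Count how often each metadata value recurs across the sample.
# Heavy repetition here is exactly what stage 3 (reference) and the
# shared dictionary exploit; the text column is deliberately excluded.
fields = ["source", "lang", "license"]
counts = {f: Counter() for f in fields}
records = 0

for path in glob.glob("sample/*.jsonl"):
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            row = json.loads(line)
            records += 1
            for f in fields:
                counts[f][row.get(f)] += 1

for f in fields:
    print(f"{f}: {len(counts[f])} distinct values across {records} records")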

Terminal window
# Generate a schema from a representative sample
bindu schema generate sample/*.jsonl --out corpus-v1.bindus
# Build a shared dictionary across the whole corpus
bindu dict build --schema corpus-v1.bindus \
  --input 'shards/*.jsonl' \
  --out corpus-v1.bindud
# Compress all shards using the shared dictionary
bindu compress --schema corpus-v1.bindus \
  --dict corpus-v1.bindud \
  --jobs 64 \
  --recursive shards/

The shared dictionary is built once and reused for every shard. New shards added later can use the same dictionary, and when it drifts too far from the data you version it (corpus-v2.bindud).
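One way to put a number on that drift, a heuristic sketch rather than a Bindu feature: measure how much of a new shard's metadata vocabulary is already covered by the frozen sample the dictionary was pinned to. Field names follow the example row; frozen_sample.jsonl and new_shard.jsonl are placeholder paths, and the ~90% threshold is an assumption to tune.

import json

def metadata_vocab(path, fields=("source", "lang", "license")):
    # Set of (field, value) pairs seen in one JSONL file.
    vocab = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            row = json.loads(line)
            for f in fields:
                vocab.add((f, row.get(f)))
    return vocab

frozen = metadata_vocab("frozen_sample.jsonl")  # sample corpus-v1.bindud was built from
incoming = metadata_vocab("new_shard.jsonl")    # newly arrived shard

coverage = len(incoming & frozen) / max(len(incoming), 1)
print(f"dictionary coverage of new shard: {coverage:.1%}")
# Below ~90% coverage (assumed threshold), rebuild and cut corpus-v2.bindud.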

On a representative English web corpus (2 TB of JSONL, 40% boilerplate/metadata, 60% text):

Method                    Size      Ratio vs raw
Raw JSONL                 2.00 TB   1.0×
gzip -6                   680 GB    2.9×
zstd -9                   520 GB    3.8×
zstd -19 + long-range     410 GB    4.9×
Bindu (no shared dict)    240 GB    8.3×
Bindu (shared dict)       180 GB    11.1×

The text portion compresses at roughly zstd parity; the wins come from the metadata and from shared-dictionary hits on recurring URL prefixes, license strings, and source tags.

The computable property matters here: training loaders can read records directly from .bindu files without a decompression stage.

from bindu import Reader

def stream_training_texts():
    # Wrapped in a generator function so the yield is valid; plug this
    # straight into a training data loader.
    reader = Reader("corpus-v1.bindu", schema="corpus-v1.bindus")
    for record in reader.stream(columns=["text", "lang"], where="quality.perplexity < 40"):
        yield record["text"]

Column projection (text, lang only) and predicate pushdown (perplexity < 40) happen against the compressed form. A quality filter that would touch every byte of a zstd archive reads ~5% of the Bindu archive.
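For contrast, the same filter over zstd-compressed JSONL has to decompress and parse every record before the predicate can run. A minimal baseline sketch, assuming the shard is plain JSONL compressed with zstd and read via the zstandard package (shard.jsonl.zst is a placeholder path):

import io
import json
import zstandard as zstd

def iter_texts_zstd(path="shard.jsonl.zst"):
    # Baseline: every byte is decompressed and every record parsed
    # before the perplexity filter can be applied.
    with open(path, "rb") as fh:
        stream = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(stream, encoding="utf-8"):
            record = json.loads(line)
            if record["quality"]["perplexity"] < 40:
                yield record["text"]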

Bindu’s reference table is effectively a content-addressed store. Two identical documents in different shards share a single physical copy after dictionary build:

Terminal window
bindu dedupe corpus-v1/ --scope text --min-length 512

Exact-match dedup at document granularity is free. Near-dup (MinHash over paragraph shingles) is a separate pass (bindu dedupe --method minhash).
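If you want a feel for what that separate pass does, the sketch below is the standard MinHash technique over paragraph-level word shingles. It is not Bindu's implementation; the shingle size, hash, and signature length are illustrative choices.

import hashlib

def paragraph_shingles(text, k=5):
    # k-word shingles taken within each paragraph; k and the split on
    # blank lines are illustrative choices.
    shingles = set()
    for para in text.split("\n\n"):
        words = para.split()
        for i in range(max(len(words) - k + 1, 1)):
            shingles.add(" ".join(words[i:i + k]))
    return shingles

def minhash_signature(text, num_hashes=128):
    # One minimum per seeded hash; the fraction of matching minima between
    # two signatures approximates the Jaccard similarity of the shingle sets.
    shingles = paragraph_shingles(text)
    return [
        min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in shingles
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)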

  • Don’t share dictionaries across languages. A Chinese-optimized dictionary hurts English ratios. Shard by language first.
  • Budget for schema drift. Pin corpus-v1.bindud to a frozen sample. Rebuild and re-version when the corpus source mix changes meaningfully.
  • Text content doesn’t get 10×. The overall 11× ratio is an average. The text column alone is closer to 3–4×. Your headline number depends on your metadata-to-text ratio; see the estimate sketched after this list.
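As a back-of-envelope estimate, treat the corpus as two slices and combine per-slice ratios as a weighted harmonic mean. The numbers below are placeholders, not the benchmark above, and the estimate ignores savings from cross-document sharing through the reference table, so measured ratios can land above it.

def overall_ratio(meta_fraction, meta_ratio, text_ratio):
    # Compressed size = meta_bytes / meta_ratio + text_bytes / text_ratio,
    # so the combined ratio is a weighted harmonic mean, dominated by the
    # slice that compresses worst (usually the text column).
    text_fraction = 1.0 - meta_fraction
    return 1.0 / (meta_fraction / meta_ratio + text_fraction / text_ratio)

# Placeholder numbers: half the raw bytes are metadata at 40x, text at 3.5x.
print(f"{overall_ratio(meta_fraction=0.5, meta_ratio=40.0, text_ratio=3.5):.1f}x")  # ~6.4x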