LLM training corpora

LLM training corpora are a strong secondary use case. The flagship workload is satellite & telemetry, but pretraining datasets share two properties that line up well with Bindu: highly structured per-document metadata and heavy cross-document repetition in boilerplate, templates, and URLs. The text itself compresses at roughly classical-codec parity (see the Hutter Prize results); the wins on this workload come from the metadata layer.

A typical pretraining dataset row looks like:

{
  "url": "https://...",
  "ts": "2026-01-23T...",
  "source": "common-crawl-2025-04",
  "lang": "en",
  "license": "cc-by-4.0",
  "text": "<the actual document>",
  "quality": {"toxicity": 0.01, "perplexity": 24.3, "length": 4129}
}

Everything outside text is a prime target for stage 3 (reference). URLs, licenses, language codes, source tags, and quality-metric templates repeat billions of times across the corpus.
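How much of that layer repeats is easy to check before committing to a schema: count distinct values per metadata field over a sample of shards. The sketch below is illustrative, uses only the Python standard library, and assumes the field names from the example row plus a sample/*.jsonl layout like the one fed to bindu schema generate.

import glob
import json
from collections import Counter

# Count how often each metadata value recurs across the sample.
# Heavy repetition here is exactly what stage 3 (reference) and the
# shared dictionary exploit; the text column is deliberately excluded.
fields = ["source", "lang", "license"]
counts = {f: Counter() for f in fields}
records = 0

for path in glob.glob("sample/*.jsonl"):
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            row = json.loads(line)
            records += 1
            for f in fields:
                counts[f][row.get(f)] += 1

for f in fields:
    print(f"{f}: {len(counts[f])} distinct values across {records} records")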

Terminal window
# Generate a schema from a representative sample
bindu schema generate sample/*.jsonl --out corpus-v1.bindus
# Build a shared dictionary across the whole corpus
bindu dict build --schema corpus-v1.bindus \
  --input 'shards/*.jsonl' \
  --out corpus-v1.bindud
# Compress all shards using the shared dictionary
bindu compress --schema corpus-v1.bindus \
  --dict corpus-v1.bindud \
  --jobs 64 \
  --recursive shards/

The shared dictionary is built once and reused for every shard. New shards added later can use the same dictionary, and when it drifts too far from the data you version it (corpus-v2.bindud).
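One way to put a number on that drift, a heuristic sketch rather than a Bindu feature: measure how much of a new shard's metadata vocabulary is already covered by the frozen sample the dictionary was pinned to. Field names follow the example row; frozen_sample.jsonl and new_shard.jsonl are placeholder paths, and the ~90% threshold is an assumption to tune.

import json

def metadata_vocab(path, fields=("source", "lang", "license")):
    # Set of (field, value) pairs seen in one JSONL file.
    vocab = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            row = json.loads(line)
            for f in fields:
                vocab.add((f, row.get(f)))
    return vocab

frozen = metadata_vocab("frozen_sample.jsonl")  # sample corpus-v1.bindud was built from
incoming = metadata_vocab("new_shard.jsonl")    # newly arrived shard

coverage = len(incoming & frozen) / max(len(incoming), 1)
print(f"dictionary coverage of new shard: {coverage:.1%}")
# Below ~90% coverage (assumed threshold), rebuild and cut corpus-v2.bindud.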

On a representative English web corpus (2 TB of JSONL, 40% boilerplate/metadata, 60% text):

Method                    Size      Ratio vs raw
Raw JSONL                 2.00 TB   1.0×
gzip -6                   680 GB    2.9×
zstd -9                   520 GB    3.8×
zstd -19 + long-range     410 GB    4.9×
Bindu (no shared dict)    240 GB    8.3×
Bindu (shared dict)       180 GB    11.1×

The text portion compresses at roughly zstd parity; the wins come from the metadata and from shared-dictionary hits on recurring URL prefixes, license strings, and source tags.

The computable property matters here: training loaders can read records directly from .bindu files without a decompression stage.

from bindu import Reader

def stream_training_texts():
    # Wrapped in a generator function so the yield is valid; plug this
    # straight into a training data loader.
    reader = Reader("corpus-v1.bindu", schema="corpus-v1.bindus")
    for record in reader.stream(columns=["text", "lang"], where="quality.perplexity < 40"):
        yield record["text"]

Column projection (text, lang only) and predicate pushdown (perplexity < 40) happen against the compressed form. A quality filter that would touch every byte of a zstd archive reads ~5% of the Bindu archive.
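For contrast, the same filter over zstd-compressed JSONL has to decompress and parse every record before the predicate can run. A minimal baseline sketch, assuming the shard is plain JSONL compressed with zstd and read via the zstandard package (shard.jsonl.zst is a placeholder path):

import io
import json
import zstandard as zstd

def iter_texts_zstd(path="shard.jsonl.zst"):
    # Baseline: every byte is decompressed and every record parsed
    # before the perplexity filter can be applied.
    with open(path, "rb") as fh:
        stream = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(stream, encoding="utf-8"):
            record = json.loads(line)
            if record["quality"]["perplexity"] < 40:
                yield record["text"]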

Bindu’s reference table is effectively a content-addressed store. Two identical documents in different shards share a single physical copy after dictionary build:

Terminal window
bindu dedupe corpus-v1/ --scope text --min-length 512

Exact-match dedup at document granularity is free. Near-dup (MinHash over paragraph shingles) is a separate pass (bindu dedupe --method minhash).
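If you want a feel for what that separate pass does, the sketch below is the standard MinHash technique over paragraph-level word shingles. It is not Bindu's implementation; the shingle size, hash, and signature length are illustrative choices.

import hashlib

def paragraph_shingles(text, k=5):
    # k-word shingles taken within each paragraph; k and the split on
    # blank lines are illustrative choices.
    shingles = set()
    for para in text.split("\n\n"):
        words = para.split()
        for i in range(max(len(words) - k + 1, 1)):
            shingles.add(" ".join(words[i:i + k]))
    return shingles

def minhash_signature(text, num_hashes=128):
    # One minimum per seeded hash; the fraction of matching minima between
    # two signatures approximates the Jaccard similarity of the shingle sets.
    shingles = paragraph_shingles(text)
    return [
        min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in shingles
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)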

  • Don’t share dictionaries across languages. A Chinese-optimized dictionary hurts English ratios. Shard by language first.
  • Budget for schema drift. Pin corpus-v1.bindud to a frozen sample. Rebuild and re-version when the corpus source mix changes meaningfully.
  • Text content doesn’t get 10×. The overall 11× ratio is an average. The text column alone is closer to 3–4×. Your headline number depends on your metadata-to-text ratio; see the estimate sketched after this list.
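As a back-of-envelope estimate, treat the corpus as two slices and combine per-slice ratios as a weighted harmonic mean. The numbers below are placeholders, not the benchmark above, and the estimate ignores savings from cross-document sharing through the reference table, so measured ratios can land above it.

def overall_ratio(meta_fraction, meta_ratio, text_ratio):
    # Compressed size = meta_bytes / meta_ratio + text_bytes / text_ratio,
    # so the combined ratio is a weighted harmonic mean, dominated by the
    # slice that compresses worst (usually the text column).
    text_fraction = 1.0 - meta_fraction
    return 1.0 / (meta_fraction / meta_ratio + text_fraction / text_ratio)

# Placeholder numbers: half the raw bytes are metadata at 40x, text at 3.5x.
print(f"{overall_ratio(meta_fraction=0.5, meta_ratio=40.0, text_ratio=3.5):.1f}x")  # ~6.4x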