Compressing Earth Embeddings, pt. 2 – TerraBit
Unfinished business
Last time, we compressed earth embeddings 64× with less than 2% loss on patch classification. We found int8 was statistically indistinguishable from float32 and that PCA(64)+int8 was the sweet spot. Binary quantization — reducing each dimension to its sign bit — achieved 16.5× end-to-end compression on disk (32× on the raw embedding payload alone), but we hadn’t yet measured retrieval quality at scale.
We were clear about what we didn’t test. From our limitations section:
We have not tested: semantic segmentation, pixel regression, object detection, change detection, or retrieval — ranking quality over large databases may be more sensitive to distance distortion than top-1 classification.
In other words, patch classification on EuroSAT is a controlled benchmark, not a real workflow. Can you actually do useful things with aggressively compressed embeddings? This time we work with Clay v1.5 — a foundation model trained on multi-sensor satellite imagery — at global scale. LGND made the full global corpus available in float32 on Source Cooperative, which gave us the raw material to test compression at scale.
TerraBit
To test this, we built TerraBit — a global retrieval demo that runs entirely in the browser with no backend or server-side computation. We binary-quantize the full Clay v1.5 corpus into packed bit vectors, store them as spatially-partitioned cloud-native Parquet on public object storage, and let the browser handle shard discovery, data fetching, and in-memory Hamming scoring. The entire “backend” is a static S3 bucket; all compute happens on your machine.
How it works:
- You draw one or more regions of interest (ROI) anywhere — each is loaded independently; regions can be rectangles or freehand polygons
- You click to create exemplar patches on the map (one or many); positive exemplars outside the ROI have their embeddings fetched on the fly; negatives work anywhere on the globe for contrastive scoring (pos_dist − neg_dist); you can also invert a search (bitwise NOT) to find the opposite of a reference
- DuckDB-WASM queries a manifest for intersecting geohash shards; only those shards are fetched via HTTP range requests — no full-corpus scan
- A Web Worker scores all candidates with brute-force Hamming distance and returns ranked results
- The results render via MapLibre GL across several view modes (top-k, heatmap, threshold, outlier, surprise, gradient)
Multiple exemplars can be combined via mean distance, or by applying bitwise AND / OR / XOR directly on the packed binary vectors before scoring — exact, lossless ops that compose semantically because binary embeddings have convenient arithmetic properties.
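A minimal sketch of those compositions, with NumPy standing in for the Web Worker's typed arrays (the exemplar indices here are arbitrary toy values, not the demo's actual data flow):

```python
import numpy as np

def hamming(db: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Hamming distance from one packed query to every packed db row."""
    return np.unpackbits(db ^ q, axis=1).sum(axis=1)

rng = np.random.default_rng(1)
db = rng.integers(0, 256, size=(1000, 128), dtype=np.uint8)  # 1024-bit vectors
pos1, pos2, neg = db[3], db[7], db[42]

mean_scores = (hamming(db, pos1) + hamming(db, pos2)) / 2      # mean distance
and_query = pos1 & pos2       # keep only bits both exemplars set
inverted = ~pos1              # bitwise NOT: search for the "opposite"
contrastive = (hamming(db, pos1).astype(np.int64)
               - hamming(db, neg).astype(np.int64))            # pos − neg
```

Inverting the query flips every bit, so a vector at distance d from `pos1` sits at distance 1024 − d from `inverted` — which is exactly why NOT finds the opposite of a reference.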
The 50M embeddings are partitioned into geohash-aligned Parquet shards and published on Source Cooperative, which serves them cloud-natively out of S3 — public HTTP with byte-range support, no egress fees, no intermediate server. A single manifest file records the path, row count, and spatial extent of every shard.
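The shard-pruning predicate is simple enough to sketch in a few lines. This is a pure-Python stand-in with a hypothetical, simplified manifest schema (path, row count, bbox) — the real manifest is queried by DuckDB-WASM, but the bbox-intersection logic is the same:

```python
# hypothetical manifest rows: (path, row_count, minx, miny, maxx, maxy)
manifest = [
    ("shards/9q5.parquet", 120_000, -119.0, 33.0, -117.0, 35.0),
    ("shards/dr5.parquet",  90_000,  -75.0, 40.0,  -73.0, 41.0),
]

def prune(manifest, roi):
    """Coarse spatial index: keep only shards whose bbox intersects the
    ROI bbox, so shards outside the region are never opened."""
    rminx, rminy, rmaxx, rmaxy = roi
    return [path for path, _, minx, miny, maxx, maxy in manifest
            if maxx >= rminx and minx <= rmaxx
            and maxy >= rminy and miny <= rmaxy]

hits = prune(manifest, (-118.5, 33.5, -117.5, 34.5))  # an ROI near Los Angeles
```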
When you draw an ROI, DuckDB-WASM queries the manifest with a bounding-box predicate — manifest-based shard pruning: the manifest acts as a coarse spatial index so the browser never opens metadata on shards outside the ROI. Once the intersecting shard list is resolved, DuckDB streams those shard files over HTTP (via httpfs range requests) and applies a second filter at the row level — a bbox predicate for rectangles, or ST_Intersects for freehand polygons — to extract only patches within the drawn region.

Ranking over the candidate slice is exact brute-force Hamming: binary embeddings arrive as packed Uint8Array columns (128 bytes per 1024-dim vector) and are scored in a Web Worker via XOR+popcount, which maps directly to hardware-accelerated popcount instructions and completes in milliseconds for a typical ROI partition.
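In Python terms, the scoring step looks like this — a sketch of what the Web Worker does, with a byte-popcount lookup table standing in for hardware popcount instructions:

```python
import numpy as np

# 256-entry lookup: number of set bits in each possible byte value
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def score(db: np.ndarray, query: np.ndarray) -> np.ndarray:
    """XOR the packed query against every row, then popcount each byte.
    db: (n, 128) uint8, query: (128,) uint8 -> (n,) Hamming distances."""
    return POPCOUNT[db ^ query].sum(axis=1, dtype=np.int32)

rng = np.random.default_rng(0)
db = rng.integers(0, 256, size=(10_000, 128), dtype=np.uint8)
ranked = np.argsort(score(db, db[0]))   # brute-force ranked results
```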
The binary embeddings are lossy, though: we measure ~65% recall@10 (the fraction of true float32 nearest neighbors recovered by the binary representation), which means roughly a third of true neighbors are missed (Figure 2). That is good enough for coarse exploration — not a claim about downstream curation or labeling productivity. How coarse is too coarse? In practice, 65% recall goes further than you’d expect — try the demo on your own region!
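The recall@10 metric itself is easy to compute. Here is a self-contained sketch on synthetic data (smaller dimensions for speed; the resulting recall number is a property of this toy data, not the ~65% reported for Clay v1.5):

```python
import numpy as np

POP = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint16)

def topk(dist: np.ndarray, k: int = 10) -> np.ndarray:
    return np.argsort(dist, axis=1)[:, :k]

rng = np.random.default_rng(0)
db = rng.standard_normal((2000, 256)).astype(np.float32)
queries = db[:50] + 0.05 * rng.standard_normal((50, 256)).astype(np.float32)

# float32 ground truth: squared euclidean via the Gram trick
d_f = ((queries**2).sum(1)[:, None] + (db**2).sum(1)[None, :]
       - 2.0 * queries @ db.T)
# binary ranking over the sign bits of the same database
db_b, q_b = np.packbits(db > 0, axis=1), np.packbits(queries > 0, axis=1)
d_h = POP[q_b[:, None, :] ^ db_b[None, :, :]].sum(-1)

true_nn, bin_nn = topk(d_f), topk(d_h)
recall_at_10 = float(np.mean([len(set(t) & set(b)) / 10
                              for t, b in zip(true_nn, bin_nn)]))
```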
A few examples of what this enables: a) click a center-pivot irrigation field in Kansas and separate it from rectangular fields across the state, b) pick a greenhouse cluster in Rotterdam and highlight dense greenhouse and vineyard complexes across the region, or c) select a solar installation in northwest India and find others at similar scale. None of these queries require labeled data, a trained classifier, or even a definition of what you’re looking for beyond a single click. This is useful for data exploration, bootstrapping training datasets for supervised models, and narrowing the search space before running expensive high-resolution models over targeted areas. The demo also supports exporting ranked candidates as GeoParquet!
Binary Earth Embedding Retrieval at Planet Scale
Clay v1.5 produces 1024-dimensional embeddings from Sentinel-2 imagery. The global corpus spans two years of observations — roughly 50 million embeddings covering Earth’s land surface — and is 183 GiB on disk in ZSTD-compressed Parquet (≈190 GiB as raw float32 – float32s don’t compress well even if they come from a GeoFM). Serving float32 vectors at this scale to a browser isn’t viable; the question we ask is how aggressively you can compress without destroying retrieval quality.
Binary quantization reduces each dimension to a single sign bit. 1024 floats (4,096 bytes) become 128 bytes — a 32× reduction on the raw payload. End-to-end on disk (Parquet with ZSTD, geometry and STAC metadata columns, row-group overhead), the full 49.8M-row corpus drops from 182.9 GiB to 11.1 GiB — 16.5× compression. The on-disk number is what you pay for on object storage (32× is the raw payload reduction). The web demo corpus is smaller still (~7 GiB) because several columns were dropped and the compression level was increased — a demo-specific optimization on top of the 16.5× quantization win.
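The packing step itself is essentially one line with NumPy (a sketch — the corpus pipeline adds sharding and metadata on top):

```python
import numpy as np

def binarize(emb: np.ndarray) -> np.ndarray:
    """Sign-bit quantization: (n, 1024) float32 -> (n, 128) packed uint8."""
    return np.packbits(emb > 0, axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1024)).astype(np.float32)
packed = binarize(x)   # 4,096 bytes per vector -> 128 bytes per vector
```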
Why does aggressive quantization work at all on 1024-dimensional vectors? One diagnostic is the intrinsic dimension (ID) — the degrees of freedom the data actually uses, regardless of ambient dimensionality [Facco et al., 2017; Levina & Bickel, 2004]. This framing is directly motivated by Rao et al., 2025, who find that geographic representations — despite operating in 256–512 dimensional spaces — compress to just 2–10 intrinsic dimensions, and that ID correlates with downstream task performance. For Clay v1.5 we estimate ID ≈ 13–17 (MLE: 17.0, TwoNN: 12.6, Local PCA: 17.0, on a 10k sample subset). Three estimators with different assumptions agree on a narrow range. Low ID is why aggressive compression is worth attempting — the data simply isn’t using most of its dimensions.
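To make the diagnostic concrete, here is a simplified MLE variant of the TwoNN estimator on synthetic data (the post’s actual estimates combine MLE, TwoNN, and Local PCA on a 10k sample; this sketch only illustrates the mechanics):

```python
import numpy as np

def twonn_id(x: np.ndarray) -> float:
    """TwoNN intrinsic dimension (simplified MLE variant): the ratio
    mu = r2/r1 of each point's two nearest-neighbor distances follows a
    Pareto(1, d) law, so d_hat = N / sum(log mu)."""
    sq = (x * x).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    d2.sort(axis=1)
    r1, r2 = np.sqrt(d2[:, 1]), np.sqrt(d2[:, 2])  # column 0 is self
    return len(x) / np.log(r2 / r1).sum()

rng = np.random.default_rng(0)
# 3 intrinsic dimensions embedded linearly in a 1024-d ambient space
z = rng.standard_normal((600, 3))
x = z @ rng.standard_normal((3, 1024))
est = twonn_id(x)   # lands near 3 despite the 1024-d ambient space
```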
TurboQuant aka rotate before you quantize
Binary is the extreme end of the compression spectrum, and the retrieval demo uses it — but what if you need more recall than binary while keeping storage well below float32?
Standard affine quantization at low bit-widths (int2–int4) suffers from high variance disparity across embedding dimensions: some dimensions carry far more signal than others, and a uniform quantization grid wastes bits on low-variance dimensions while clipping high-variance ones. TurboQuant fixes this by applying a fixed random orthogonal rotation \(R \in \mathbb{R}^{d \times d}\) (sampled once from a Haar-distributed ensemble via QR decomposition) before symmetric affine quantization: \(\hat{x} = R^\top Q_b(Rx)\). The rotation spreads variance across dimensions so no channel dominates the bit budget. \(R\) is generated once, stored with the quantized embeddings, and reused for all queries — one matrix multiply at encode/decode, no retraining.
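A minimal sketch of the rotate-then-quantize idea — not the full TurboQuant algorithm, just a Haar rotation composed with plain symmetric affine quantization using a per-vector scale:

```python
import numpy as np

def haar_rotation(d: int, seed: int = 0) -> np.ndarray:
    """Random orthogonal matrix: QR of a Gaussian, columns sign-fixed
    so the result is Haar-distributed."""
    q, r = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def quantize(x: np.ndarray, bits: int):
    """Symmetric affine quantization with a per-vector scale."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / levels
    return np.round(x / scale).clip(-levels, levels), scale

d, rng = 256, np.random.default_rng(1)
x = rng.standard_normal((100, d)) * np.linspace(0.1, 3.0, d)  # skewed variance
R = haar_rotation(d)

q, s = quantize(x @ R.T, bits=4)     # x_hat = R^T Q_b(R x)
x_hat = (q * s) @ R
q0, s0 = quantize(x, bits=4)         # same quantizer, no rotation
err_rot = np.linalg.norm(x_hat - x) / np.linalg.norm(x)
err_raw = np.linalg.norm(q0 * s0 - x) / np.linalg.norm(x)
```

On data with uneven per-dimension variance like this, the rotated version reconstructs with lower error at the same bit budget, because no single channel dominates the quantization grid.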
Earth embeddings exhibit exactly this variance disparity: ID ≈ 13–17 in a 1024-d space leaves a lot of variance to redistribute.
We ran TurboQuant across bit-widths on the Clay v1.5 embeddings. The gains are largest at low precision and vanish at high precision.
Practical takeaway: if binary recall is too coarse but you still want aggressive compression, try TurboQuant at int2–int4 first. TurboQuant int4 reaches 95% recall at ~6× on-disk compression (8× on the raw payload). By int8, the affine grid is fine enough on its own and the rotation adds nothing.
Search throughput
We also benchmarked brute-force kNN on a 1M-vector subset (1K queries, k=10) using FAISS on CPU and PyTorch’s torch.cdist on an RTX 3090. While other bit-widths benefit from GPU acceleration, binary search is unreasonably fast on CPU thanks to SIMD popcount.
Why not just build a backend?
Reasonable reaction: “Cool demo, but real systems need a database and an API.” Maybe — but in geospatial ML, the gap between a working prototype and a deployed tool is almost all infrastructure: vector databases, REST APIs, auth, scaling, monitoring. Each layer is individually reasonable, but together they form a barrier large enough to keep someone from shipping and maintaining a useful tool.
Furthermore, existing vector DBs primarily partition by embedding similarity; a small ROI query still touches shards scattered across the index, with geospatial filtering applied only after the expensive approximate nearest neighbor (ANN) step. Getting geo-first partitioning right takes careful co-design, and no existing system targets zero-ops, browser-native serving of a static corpus. Our approach sidesteps that: embeddings partitioned spatially by geohash, a manifest for shard pruning, and a throwaway Hamming scan.
To be clear, backends still have their place. Full-corpus ANN, multi-user serving, auth, and strict SLAs are backend territory. But for exploration and dataset curation, the barrier to useful interaction with embeddings should be as close to zero as possible, and for a lot of real problems, client-side is enough.
Links: TerraBit retrieval demo · binarized embedding corpus · pt. 1: Compressing Earth Embeddings
Acknowledgments. Thanks to Jeff Albrecht for his review and feedback on this post.
Citation
@online{corley2026,
author = {Corley, Isaac and Robinson, Caleb},
title = {Compressing {Earth} {Embeddings,} Pt. 2 -- {TerraBit}},
date = {2026-04-07},
url = {https://geospatialml.com/posts/terrabit/},
langid = {en}
}