<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>GeoSpatial ML</title>
<link>https://geospatialml.com/</link>
<atom:link href="https://geospatialml.com/index.xml" rel="self" type="application/rss+xml"/>
<description>A blog about geospatial machine learning — remote sensing, earth observation, foundation models, and applied ML for understanding our planet.</description>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Tue, 12 May 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Gaussian Splat-based Satellite Image Super Resolution</title>
  <dc:creator>Caleb Robinson</dc:creator>
  <dc:creator>Isaac Corley</dc:creator>
  <link>https://geospatialml.com/posts/sentinel2-superresolution/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/sentinel2-superresolution/fig1_main_comparison.webp" class="figure-full-width img-fluid figure-img" style="width:100.0%"></p>
<figcaption>Multi-temporal super-resolution of a Sentinel-2 scene. From left: a single S2 natural-color observation (10m), a 0.8m aerial basemap for visual reference, a bicubic 10× upsample of the S2 input, our Gaussian-splat reconstruction with a coarse-to-fine LBFGS schedule (labelled “C2F” in the panel titles), and the same reconstruction with an Adam warmup followed by LBFGS. The splat reconstructions are visibly cleaner than bicubic and recover detail down to the limit of what the data physically supports.</figcaption>
</figure>
</div>
<p><strong>Super-resolution</strong> is the task of recovering a higher-resolution image from one or more lower-resolution observations of the same scene. In the commercial satellite world, it’s now a product as companies aim to create sharper, more detailed imagery from their data. Planet’s recently released <a href="https://www.planet.com/pulse/planet-superres-is-now-available-see-the-world-more-clearly/">SuperRes</a> product sharpens ~3m PlanetScope imagery to 2m, DigiFarm’s <a href="https://digifarm.io/products/dr-imagery">DR imagery</a> takes 10m Sentinel-2 to 1m, and Vantor’s <a href="https://vantor.com/blog/how-hd-imagery-improves-aiml-computer-vision-models/">HD imagery</a> takes 30cm to 15cm.</p>
<p>Academic work on the same problem goes back decades. Classical multi-frame super-resolution [1] showed that many shifted low-resolution observations of the same scene can be fused into a sharper image given a known imaging model and the inter-frame shifts. Sentinel-2-specific work has mostly taken a learned route since: Razzak et al.&nbsp;[2] fuse multi-temporal and multi-spectral cues with radiometric-consistency losses, and Aybar et al.&nbsp;[3] introduce SEN2SR, a deep-learning framework that pushes Sentinel-2 to 2.5m, benchmarks CNN, Mamba, and Swin Transformer backbones, and adds a low-frequency hard constraint to preserve spectral consistency. Sirko et al.&nbsp;[4] sidestep image super-resolution as an intermediate product, instead training an impressive Sentinel-2 student model to mimic 50 cm building and road predictions from a high-resolution teacher.</p>
<p>Super-resolution is also useful outside remote sensing. NVIDIA’s <a href="https://www.nvidia.com/en-us/geforce/technologies/dlss/">DLSS</a> renders games at a lower internal resolution and learns to upscale. The trick DLSS exploits is that real scenes have structure, so a representation that captures that structure carries more detail per parameter than a dense pixel grid. In 3D vision, the dominant version is the <strong>Gaussian splat</strong>, a scene represented as anisotropic Gaussians over a continuous domain, fit by optimization rather than trained. Splats are continuous, have a closed-form forward model, and act as a smoothness prior that denoises without hallucinating detail. Pix4D’s <a href="https://www.pix4d.com/blog/pix4d-gaussian-splatting-3D-visualization/">photogrammetry pipeline</a> and <a href="https://www.esri.com/arcgis-blog/products/arcgis-pro/3d-gis/how-to-create-the-best-gaussian-splats-in-arcgis-reality">ArcGIS Reality</a> both use splats to reconstruct sharp 3D scenes from photogrammetric imagery.</p>
<p>In this post we apply these recent ideas from the Gaussian splatting world to super-resolve Sentinel-2 image time series.</p>
<section id="super-resolving-with-sentinel-2" class="level2">
<h2 class="anchored" data-anchor-id="super-resolving-with-sentinel-2">Super resolving with Sentinel-2</h2>
<p>The Sentinel-2 constellation revisits every spot on Earth roughly every 5 days. Over a single cloud-free summer at a mid-latitude site, that’s around 30 looks at the same patch of ground from the same sensor family. Although the delivered L2A products live on a fixed 10m UTM grid, the scene content is not perfectly co-registered across dates. Residual processing and viewing-geometry effects show up as sub-pixel translations between products: typically around 1.5m of standard deviation per axis, and occasionally as much as 6 to 7m. That small difference is what classical multi-frame super-resolution exploits. Given many slightly shifted samples of the same scene, you can solve for the hidden high-resolution image that, once shifted, blurred, and downsampled the way the sensor would, reproduces every observation.</p>
<p>To do this, we represent the ground as a continuous field of 2D <a href="https://en.wikipedia.org/wiki/Gaussian_splatting">Gaussian splats</a> on a fine grid, push it through an analytic forward model, and let a gradient-based optimizer (LBFGS) jointly recover the scene weights and per-observation shifts. The implementation is around <a href="https://gist.github.com/calebrob6/9ce42c29dc956d2b09148e2ff5ca5b17">500 lines of PyTorch</a>, and the optimizer rediscovers the sub-pixel jitter on its own.</p>
<iframe src="animation.html" width="100%" height="650" style="border: none; display: block; margin: 1em 0;" loading="lazy" title="Animated explainer of multi-temporal Sentinel-2 super-resolution"></iframe>
</section>
<section id="background-the-point-spread-function" class="level2">
<h2 class="anchored" data-anchor-id="background-the-point-spread-function">Background: the point-spread function</h2>
<p>Every camera and sensor blurs a single point of light into a small fuzzy spot. That spot is the point-spread function, or PSF. It comes from the optics, the atmosphere, and the sensor’s photodetector response.</p>
<p>For a satellite imager, the PSF is typically wider than a single pixel on the ground. The optics smear features smaller than the PSF across several pixels before the sensor digitizes anything, so post-processing alone cannot put that information back without an external prior.</p>
<p>The cleanest analogy is a long-exposure photo of an out-of-focus star: stacking more exposures averages down the noise, but never recovers what’s inside the blurred disk.</p>
<p>For Sentinel-2’s 10m bands, the residual blur after pixel aperture is approximately a 2D Gaussian with a standard deviation of roughly 3 to 5m.<sup>1</sup> Without an external or learned prior, you should not expect to reliably recover arbitrary features much below roughly the 8–10 m scale from Sentinel-2.</p>
<p>Our forward model includes the PSF as an explicit known-Gaussian blur. The assumed PSF width matters more than any other knob. Too narrow and the model invents detail that isn’t there. Too wide and it can’t resolve anything.</p>
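<p>One way to relate an assumed σ to a ground feature scale is the full width at half maximum of the Gaussian, FWHM = 2√(2 ln 2)·σ ≈ 2.355σ. The snippet below is a minimal sketch of that conversion, not part of the reconstruction code; the helper name <code>gaussian_fwhm</code> is hypothetical, and FWHM is only one of several possible resolution proxies. For σ between 3 and 5m it gives widths of roughly 7 to 12m, the same order as the 8 to 10 m scale quoted above.</p>

```python
import math


def gaussian_fwhm(sigma_m: float) -> float:
    """Full width at half maximum (metres) of a Gaussian with std dev sigma_m."""
    return 2.0 * math.sqrt(2.0 * math.log(2.0)) * sigma_m


# Candidate PSF widths for the Sentinel-2 10 m bands, taken from the text.
for sigma in (3.0, 4.0, 5.0):
    print(f"sigma = {sigma:.0f} m  ->  FWHM = {gaussian_fwhm(sigma):.1f} m")
```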
</section>
<section id="the-setup" class="level2">
<h2 class="anchored" data-anchor-id="the-setup">The setup</h2>
<p>We pull 32 cloud-free L2A scenes from April–October 2025 over a 987 × 1011 pixel area at 10m resolution from the Microsoft Planetary Computer. Most ablations and visual examples below use a centred 256 × 256 pixel crop (2.56 × 2.56 km) of that scene. We co-register four bands (B02, B03, B04, B08, corresponding to Blue, Green, Red, and NIR) into a single time series. Esri World Imagery (~0.8m/px) over the same AOI is used only as a visual reference.</p>
<p>The model treats all observations as views of one latent scene. That only works when temporal variation (phenology, illumination, atmosphere, BRDF) is small relative to the spatial structure we care about. A tighter seasonal window or per-date radiometric correction would help on dynamic landscapes.</p>
<p>The job is an <a href="https://en.wikipedia.org/wiki/Inverse_problem">inverse problem</a>: we never see the high-resolution scene <img src="https://latex.codecogs.com/png.latex?x"> directly; we see 32 low-resolution measurements <img src="https://latex.codecogs.com/png.latex?%5C%7By_t%5C%7D">. The forward model that maps the hidden scene to each measurement is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ay_t%20%5C;=%5C;%20D%20%5Ccdot%20H%20%5Ccdot%20W_t(x)%20%5C;+%5C;%20%5Cvarepsilon_t,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?W_t"> is a per-observation sub-pixel warp parameterized by a 2D shift vector <img src="https://latex.codecogs.com/png.latex?%5Cdelta_t%20%5Cin%20%5Cmathbb%7BR%7D%5E2"> (the small residual misregistration of L2A product <img src="https://latex.codecogs.com/png.latex?t">, in meters), <img src="https://latex.codecogs.com/png.latex?H"> is a Gaussian point-spread function (PSF) capturing the sensor’s residual optical blur, and <img src="https://latex.codecogs.com/png.latex?D"> is average-pooling down to the native 10m grid. We want to find the scene <img src="https://latex.codecogs.com/png.latex?x"> and the per-pass shifts <img src="https://latex.codecogs.com/png.latex?%5C%7B%5Cdelta_t%5C%7D"> that best explain all 32 observations.</p>
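<p>To make the operator chain concrete, here is a toy 1D version of the forward model on a dense 1m grid. It is a sketch under simplifying assumptions that differ from the actual method: the scene is a plain list of samples rather than splats, shifts are whole fine-grid samples rather than continuous metres, and borders are clamped. The function names are hypothetical.</p>

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian kernel sampled at integer offsets (fine-grid units)."""
    taps = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]


def forward_1d(scene, shift, sigma_psf, factor):
    """Toy D * H * W_t(x): integer shift, Gaussian blur, then average-pool by `factor`."""
    n = len(scene)
    # W_t: translate by `shift` fine samples, clamping at the borders.
    warped = [scene[min(max(i - shift, 0), n - 1)] for i in range(n)]
    # H: convolve with the PSF kernel, again clamping at the borders.
    radius = int(math.ceil(3 * sigma_psf))
    kernel = gaussian_kernel(sigma_psf, radius)
    blurred = [
        sum(k * warped[min(max(i + j - radius, 0), n - 1)] for j, k in enumerate(kernel))
        for i in range(n)
    ]
    # D: average non-overlapping blocks of `factor` fine samples into one coarse pixel.
    return [sum(blurred[i:i + factor]) / factor for i in range(0, n, factor)]


# A step edge at 25 m on a 1 m grid, observed twice with different shifts.
scene = [0.0] * 25 + [1.0] * 35
obs_a = forward_1d(scene, shift=0, sigma_psf=3.0, factor=10)
obs_b = forward_1d(scene, shift=3, sigma_psf=3.0, factor=10)
print([round(v, 3) for v in obs_a])
print([round(v, 3) for v in obs_b])
```

<p>The two observations sample the same edge at different phases of the 10m grid, so the coarse pixel that straddles the edge takes a different value in each. That difference is the information the joint fit uses.</p>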
<p>The key design choice is how to represent <img src="https://latex.codecogs.com/png.latex?x">. A learnable dense pixel grid works but tends to alias. Instead, we use <strong>2D Gaussian splats</strong>: a continuous field of <img src="https://latex.codecogs.com/png.latex?K"> Gaussian blobs on a regular grid finer than the 10m pixel size. Each splat has a learnable per-band weight; we keep positions and widths fixed. The splat basis is naturally smooth, and a Gaussian integrated against another Gaussian has a closed form, which makes the forward pass cheap.</p>
</section>
<section id="the-forward-model" class="level2">
<h2 class="anchored" data-anchor-id="the-forward-model">The forward model</h2>
<p>The forward model has a closed form. A single observed pixel at position <img src="https://latex.codecogs.com/png.latex?(p_x,%20p_y)"> on pass <img src="https://latex.codecogs.com/png.latex?t"> is the integral of the sensor-blurred scene over the pixel’s footprint:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7By%7D_t(p_x,%20p_y)%20%5C;=%5C;%20%5Csum_%7Bk=1%7D%5E%7BK%7D%20w_k%20%5Cint_%7Bp_x%20-%205%7D%5E%7Bp_x%20+%205%7D%20%5Cint_%7Bp_y%20-%205%7D%5E%7Bp_y%20+%205%7D%20%5Cmathcal%7BN%7D%5C!%5Cleft(u,%20v%20%5C,%5Cbig%7C%5C,%20%5Cmu_k%20+%20%5Cdelta_t,%5C,%20%5Csigma_%7B%5Ctext%7Beff%7D%7D%5E2%20%5Cmathbf%7BI%7D%5Cright)%20%5C,%20du%20%5C,%20dv,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Csigma_%7B%5Ctext%7Beff%7D%7D%20=%20%5Csqrt%7B%5Csigma_%7B%5Ctext%7Bsplat%7D%7D%5E2%20+%20%5Csigma_%7B%5Ctext%7Bpsf%7D%7D%5E2%7D"> folds the splat width and the sensor PSF into a single effective Gaussian (the convolution of two Gaussians is a wider Gaussian). Because the integrand factors along the two axes, the double integral collapses into a product of <code>erf</code><sup>2</sup> differences:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cint_%7Ba%7D%5E%7Bb%7D%20%5Cmathcal%7BN%7D(u%20%5C,%7C%5C,%20%5Cmu,%20%5Csigma%5E2)%20%5C,%20du%20%5C;=%5C;%20%5Ctfrac%7B1%7D%7B2%7D%5C!%5Cleft%5B%5C,%5Ctext%7Berf%7D%5C!%5Cleft(%5Ctfrac%7Bb%20-%20%5Cmu%7D%7B%5Csigma%20%5Csqrt%7B2%7D%7D%5Cright)%20-%20%5Ctext%7Berf%7D%5C!%5Cleft(%5Ctfrac%7Ba%20-%20%5Cmu%7D%7B%5Csigma%20%5Csqrt%7B2%7D%7D%5Cright)%5Cright%5D.%0A"></p>
<p>That’s the whole forward model: each predicted pixel is a sum over splats of <code>weight × erf-difference-x × erf-difference-y</code>. PyTorch differentiates <code>erf</code> analytically, so we get exact gradients with respect to both the splat weights <img src="https://latex.codecogs.com/png.latex?w_k"> and the per-observation shifts <img src="https://latex.codecogs.com/png.latex?%5Cdelta_t">, with no image discretization.</p>
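<p>The closed form above can be sketched in a few lines of plain Python. This is an illustrative scalar version, not the batched PyTorch renderer from the gist; the function names and the <code>(mu_x, mu_y, weight)</code> splat layout are assumptions made for the example.</p>

```python
import math


def gaussian_interval_mass(a, b, mu, sigma):
    """Integral of N(u | mu, sigma^2) over [a, b], via the erf identity."""
    s = sigma * math.sqrt(2.0)
    return 0.5 * (math.erf((b - mu) / s) - math.erf((a - mu) / s))


def render_pixel(px, py, splats, shift, sigma_splat, sigma_psf, half_width=5.0):
    """Predicted value of the pixel centred at (px, py), all units in metres.

    splats is a list of (mu_x, mu_y, weight); shift is (dx, dy) for this observation.
    """
    # Splat width and PSF width combine in quadrature into one effective Gaussian.
    sigma_eff = math.hypot(sigma_splat, sigma_psf)
    total = 0.0
    for mu_x, mu_y, weight in splats:
        mass_x = gaussian_interval_mass(px - half_width, px + half_width, mu_x + shift[0], sigma_eff)
        mass_y = gaussian_interval_mass(py - half_width, py + half_width, mu_y + shift[1], sigma_eff)
        total += weight * mass_x * mass_y
    return total


# One unit-weight splat at the pixel centre, rendered with and without a 4 m shift.
splats = [(0.0, 0.0, 1.0)]
centred = render_pixel(0.0, 0.0, splats, (0.0, 0.0), 3.0, 3.0)
shifted = render_pixel(0.0, 0.0, splats, (4.0, 0.0), 3.0, 3.0)
print(centred, shifted)
```

<p>Shifting the splat moves some of its mass out of the 10m footprint, so the shifted value is smaller than the centred one. Because the dependence on the shift goes through <code>erf</code>, it is smooth and differentiable.</p>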
<p>The structure is also separable along the row and column axes, so rendering reduces to two batched matrix multiplications per band. Even at the default 3 m splat spacing — roughly 730k splats over a 256 × 256 crop — a single render takes milliseconds.</p>
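<p>The separable structure can be written as <code>image = A_y · W · A_xᵀ</code>, where each <code>A</code> holds per-axis erf differences. The sketch below uses plain Python lists so it runs anywhere; the real code would use batched tensor matmuls, and the row-major <code>weights[i][j]</code> layout and helper names here are assumptions. It also reproduces the splat count for a 256 × 256 crop at 3 m spacing.</p>

```python
import math


def interval_mass(a, b, mu, sigma):
    """Integral of a 1D Gaussian N(mu, sigma^2) over [a, b]."""
    s = sigma * math.sqrt(2.0)
    return 0.5 * (math.erf((b - mu) / s) - math.erf((a - mu) / s))


def axis_matrix(pixel_centres, splat_centres, shift, sigma_eff, half_width=5.0):
    """A[p][k]: mass of shifted splat k that falls inside pixel p along one axis."""
    return [[interval_mass(c - half_width, c + half_width, mu + shift, sigma_eff)
             for mu in splat_centres] for c in pixel_centres]


def matmul(a, b):
    """Dense matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def render_separable(weights, pix_y, pix_x, splat_y, splat_x, shift, sigma_eff):
    """One band: weights[i][j] is the splat at (splat_y[i], splat_x[j]); shift is (dx, dy)."""
    a_y = axis_matrix(pix_y, splat_y, shift[1], sigma_eff)
    a_x = axis_matrix(pix_x, splat_x, shift[0], sigma_eff)
    return matmul(matmul(a_y, weights), transpose(a_x))


# Splat count for a 256 x 256 crop of 10 m pixels at 3 m splat spacing.
splats_per_axis = int(256 * 10 / 3)
print(splats_per_axis, splats_per_axis ** 2)  # 853 per axis, 727609 in total (~730k)
```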
</section>
<section id="the-shifts-come-from-the-data" class="level2">
<h2 class="anchored" data-anchor-id="the-shifts-come-from-the-data">The shifts come from the data</h2>
<p>We don’t have to tell the optimizer where the sub-pixel shifts are. We initialize <img src="https://latex.codecogs.com/png.latex?%5Cdelta_t%20=%200"> for every observation, let LBFGS jointly optimize the scene weights and the shifts against the data fidelity loss, and the per-pass offsets fall out. Each <img src="https://latex.codecogs.com/png.latex?%5Cdelta_t"> is a single global translation per observation, which works for our small, relatively flat AOI; larger or more mountainous areas would likely need local shifts or a full deformation field.</p>
<p>We can sanity-check the recovered shifts against an independent estimator. Running <a href="https://scikit-image.org/docs/stable/api/skimage.registration.html#skimage.registration.phase_cross_correlation"><code>skimage.phase_cross_correlation</code></a> with 100× upsampling on the same observation pairs gives shifts that don’t go through our forward model at all. The two sets agree closely.</p>
<table class="large-header-table caption-top table">
<caption>Sub-pixel shift statistics.</caption>
<thead>
<tr class="header">
<th>Metric</th>
<th>dx (m)</th>
<th>dy (m)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Standard deviation across 32 obs</td>
<td>1.4</td>
<td>1.8</td>
</tr>
<tr class="even">
<td>Maximum magnitude</td>
<td>6.6</td>
<td>7.0</td>
</tr>
<tr class="odd">
<td><strong>Pearson r vs.&nbsp;phase correlation</strong></td>
<td><strong>0.975</strong></td>
<td><strong>0.975</strong></td>
</tr>
<tr class="even">
<td>Mean absolute deviation vs.&nbsp;phase correlation</td>
<td>0.5</td>
<td>0.5</td>
</tr>
</tbody>
</table>
<p>Two methods landing on the same numbers is good evidence that the jitter is a real physical signal in the data, not just the optimizer fitting noise.</p>
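<p>The two agreement metrics in the table are simple to compute once both sets of shifts are in hand. The sketch below shows the calculation on made-up numbers; the values are placeholders for illustration only, not the 32 shifts from this experiment, and the function names are hypothetical.</p>

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def mean_abs_deviation(a, b):
    """Mean absolute difference between paired estimates."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Made-up per-observation x shifts in metres, for illustration only.
dx_optimizer = [0.4, -1.9, 2.3, 0.1, -0.8, 1.5]
dx_phase_corr = [0.7, -1.6, 2.0, -0.3, -1.1, 1.9]

print(pearson_r(dx_optimizer, dx_phase_corr))
print(mean_abs_deviation(dx_optimizer, dx_phase_corr))
```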
</section>
<section id="what-you-can-actually-resolve" class="level2">
<h2 class="anchored" data-anchor-id="what-you-can-actually-resolve">What you can actually resolve</h2>
<p>We swept the assumed sensor PSF width, <img src="https://latex.codecogs.com/png.latex?%5Csigma_%7B%5Ctext%7Bpsf%7D%7D">, across a realistic range for S2’s 10m bands:</p>
<table class="large-header-table caption-top table">
<caption>PSF sweep.</caption>
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th>σ<sub>splat</sub> (m)</th>
<th>σ<sub>psf</sub> (m)</th>
<th>σ<sub>eff</sub> (m)</th>
<th>Mean LBFGS shift from init (m)</th>
<th>Visual</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>3</td>
<td>5</td>
<td>5.8</td>
<td>0.36</td>
<td>clean, smooth</td>
</tr>
<tr class="even">
<td>3</td>
<td>4</td>
<td>5.0</td>
<td>0.40</td>
<td>sharper edges</td>
</tr>
<tr class="odd">
<td><strong>3</strong></td>
<td><strong>3</strong></td>
<td><strong>4.2</strong></td>
<td><strong>0.64</strong></td>
<td>sharpest realistic</td>
</tr>
<tr class="even">
<td>2</td>
<td>2</td>
<td>2.8</td>
<td>1.28</td>
<td>severe artifacts</td>
</tr>
</tbody>
</table>
<p><em>The bolded row was our sweet spot on this scene — the sharpest reconstruction without visible overfitting. At the final row’s settings the optimizer began fitting the noise as scene structure. “Mean LBFGS shift from init” is the mean adjustment away from the phase-correlation seed, not the absolute recovered shift.</em></p>
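<p>The σ<sub>eff</sub> column follows directly from adding the splat and PSF widths in quadrature. A short check, with a hypothetical helper name, reproduces the table values:</p>

```python
import math


def sigma_eff(sigma_splat, sigma_psf):
    """Effective width (metres) of a splat Gaussian convolved with a Gaussian PSF."""
    return math.hypot(sigma_splat, sigma_psf)


# (sigma_splat, sigma_psf) pairs from the PSF sweep table, in metres.
rows = [(3.0, 5.0), (3.0, 4.0), (3.0, 3.0), (2.0, 2.0)]
for s_splat, s_psf in rows:
    print(f"{s_splat:.0f} m, {s_psf:.0f} m -> {sigma_eff(s_splat, s_psf):.1f} m")
# prints 5.8, 5.0, 4.2 and 2.8 m in that order
```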
<p>The published S2 modulation transfer function suggests <img src="https://latex.codecogs.com/png.latex?%5Csigma_%7B%5Ctext%7Bpsf%7D%7D%20%5Capprox%203">–4m for the 10m bands, which lands us in the bolded row of the table: in our reconstructions we saw meaningful sharpening down to roughly 8m features from 10m data, and visibly cleaner edges than bicubic upsampling. When we pushed <img src="https://latex.codecogs.com/png.latex?%5Csigma_%7B%5Ctext%7Bpsf%7D%7D"> below 3m the optimizer began manufacturing structure the data doesn’t support – overfitting that looked plausible until we compared it against the aerial reference.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/sentinel2-superresolution/fig6_tight_zoom.webp" class="figure-full-width img-fluid figure-img" style="width:100.0%"></p>
<figcaption>Tight zoom on a building cluster. From left: a single S2 natural-color observation (10m), a 0.8m aerial basemap for visual reference, a bicubic 10× upsample of the S2 input, and our Gaussian-splat reconstruction with the coarse-to-fine LBFGS schedule. The reconstruction recovers building edges and road geometry that bicubic misses, but it cannot invent the sub-meter façade details visible in the aerial image.</figcaption>
</figure>
</div>
<p>There is no neural network filling in plausible-looking texture. What you see is close to what the 32 observations support under the assumed forward model.</p>
</section>
<section id="more-observations-reduce-noise" class="level2">
<h2 class="anchored" data-anchor-id="more-observations-reduce-noise">More observations reduce noise</h2>
<p>Across our N-sweep, adding observations produced cleaner images but didn’t recover new spatial frequencies.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/sentinel2-superresolution/fig3_temporal_depth.webp" class="figure-full-width img-fluid figure-img" style="width:100.0%"></p>
<figcaption>Reconstruction quality as a function of how many of the 32 observations we use. Each panel is the same scene fit with <img src="https://latex.codecogs.com/png.latex?N%20%5Cin%20%5C%7B1,%202,%204,%208,%2016,%2032%5C%7D"> observations. Noise drops monotonically with more observations; the underlying sharpness plateaus by about <img src="https://latex.codecogs.com/png.latex?N%20=%208">.</figcaption>
</figure>
</div>
<p>With the explicit sensor model, the fit stopped improving at about the same point whether we used 8, 16, or all 32 observations. More looks average down temporal variation from phenology, illumination, and atmosphere, but the 1.5m sub-pixel jitter is small relative to the 3 to 5m PSF — in this regime, multi-temporal fusion bought us SNR more than it bought us resolution.</p>
<p>Use a few well-chosen observations for sharper edges; add more only when you need denoising.</p>
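<p>A rough way to see the diminishing returns: if the per-observation noise were independent and zero-mean (an idealizing assumption, since phenology and atmosphere are correlated over time), averaging N observations would shrink its standard deviation by a factor of √N. The sketch below tabulates that textbook relationship; it is not a measurement from our sweep.</p>

```python
import math


def noise_after_averaging(sigma_single, n_obs):
    """Std dev of the mean of n_obs independent observations with std dev sigma_single."""
    return sigma_single / math.sqrt(n_obs)


# Relative noise level for the observation counts used in the N-sweep.
for n in (1, 2, 4, 8, 16, 32):
    print(f"N = {n:2d}: relative noise = {noise_after_averaging(1.0, n):.3f}")
```

<p>Under this assumption, going from 8 to 32 observations quadruples the data but only halves the noise, and it does nothing to the PSF-limited resolution.</p>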
</section>
<section id="fast-in-practice" class="level2">
<h2 class="anchored" data-anchor-id="fast-in-practice">Fast in practice</h2>
<p>The whole pipeline runs on a single GPU. At the 3 m default splat spacing, a 256 × 256 crop (~730k splats) fits in 30–150 seconds depending on configuration, and the full 987 × 1011 scene (~11M splats) fits in around 5 minutes. The biggest single speedup comes from handing the joint optimization to LBFGS after a brief Adam warmup:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/sentinel2-superresolution/fig4_lbfgs_speedup.webp" class="figure-full-width img-fluid figure-img" style="width:100.0%"></p>
<figcaption>Convergence curves for joint scene + shift optimization on a 256 × 256 crop. Adam reaches a reasonable solution but plateaus; an Adam-to-LBFGS handoff drops the loss further in a fraction of the wall-clock time.</figcaption>
</figure>
</div>
<p>A few hundred Adam steps find a sane neighborhood – LBFGS’s quasi-Newton update can struggle with poor initialization – then LBFGS takes over. We found that with exact gradients through <code>erf</code>, LBFGS converged quickly and stably across the configurations we tried, and that the only knob that meaningfully changed outputs was <img src="https://latex.codecogs.com/png.latex?%5Csigma_%7B%5Ctext%7Bpsf%7D%7D">. Varying the rest of the regularization (TV weight, splat spacing, <img src="https://latex.codecogs.com/png.latex?%5Csigma_%7B%5Ctext%7Bsplat%7D%7D">) had little visual impact in our experiments.</p>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>This formulation is a good fit when you want a per-scene, training-free reconstruction that runs end-to-end in minutes on a single GPU and handles multiple bands without extra cost. It is most useful when downstream users need to trust the pixels — change detection over time, visual inspection, or basemap production for areas without sub-meter coverage.</p>
<p>If you genuinely need sub-meter recovery from 10m data, you need something else: either a learned prior trained on paired low- and high-resolution data (diffusion models, ESRGAN, and similar approaches), or an external structural reference such as an aerial basemap or a sharper sensor (PlanetScope, SkySat). We have experimented with basemap-conditioned versions of this same optimizer and they look sharper, but the sharpness comes from the basemap, not the S2 stack.</p>
<p>The code is <a href="https://gist.github.com/calebrob6/9ce42c29dc956d2b09148e2ff5ca5b17">around 500 lines of PyTorch</a>, with the analytic <code>erf</code>-based renderer on the bottom and a thin Adam-then-LBFGS loop on top.</p>
</section>
<section id="bibliography" class="level2">
<h2 class="anchored" data-anchor-id="bibliography">Bibliography</h2>
<p><a id="ref-irani-peleg"></a> <strong>[1]</strong> Irani, M. and Peleg, S. <a href="https://www.sciencedirect.com/science/article/abs/pii/104996529190045L">“Improving resolution by image registration.”</a> <em>CVGIP: Graphical Models and Image Processing</em>, 1991.</p>
<p><a id="ref-razzak"></a> <strong>[2]</strong> Razzak, M. T., Mateo-García, G., Lecuyer, G., Gómez-Chova, L., Gal, Y., and Kalaitzis, F. <a href="https://www.sciencedirect.com/science/article/pii/S0924271622002878">“Multi-Spectral Multi-Image Super-Resolution of Sentinel-2 with Radiometric Consistency Losses and Its Effect on Building Delineation.”</a> <em>ISPRS Journal of Photogrammetry and Remote Sensing</em>, 2023.</p>
<p><a id="ref-aybar"></a> <strong>[3]</strong> Aybar, C., et al.&nbsp;<a href="https://www.sciencedirect.com/science/article/pii/S0034425725006261">“A radiometrically and spatially consistent super-resolution framework for Sentinel-2.”</a> <em>Remote Sensing of Environment</em>, 2026.</p>
<p><a id="ref-sirko"></a> <strong>[4]</strong> Sirko, W., Asiedu Brempong, E., Marcos, J. T. C., Annkah, A., Korme, A., Hassen, M. A., Sapkota, K., Shekel, T., Diack, A., Nevo, S., Hickey, J., and Quinn, J. <a href="https://arxiv.org/abs/2310.11622">“High-Resolution Building and Road Detection from Sentinel-2.”</a> <em>arXiv preprint arXiv:2310.11622</em>, 2024.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Published Sentinel-2 modulation transfer function (MTF) numbers fold in both the optical blur and the 10m pixel integration. The forward model below handles the pixel integration explicitly (it integrates each splat over a 10m × 10m footprint via <code>erf</code>), so the <img src="https://latex.codecogs.com/png.latex?%5Csigma_%7B%5Ctext%7Bpsf%7D%7D"> in our equations is the residual optical and product blur on top of that, not the full system response.↩︎</p></li>
<li id="fn2"><p><code>erf</code> is the error function: <img src="https://latex.codecogs.com/png.latex?%5Ctext%7Berf%7D(z)%20=%20%5Ctfrac%7B2%7D%7B%5Csqrt%7B%5Cpi%7D%7D%20%5Cint_0%5Ez%20e%5E%7B-t%5E2%7D%5C,dt">. Up to a linear rescaling, it gives the integral of a Gaussian over a half-line, which is exactly what we need to integrate a Gaussian over a finite pixel footprint.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{robinson2026,
  author = {Robinson, Caleb and Corley, Isaac},
  title = {Gaussian {Splat-based} {Satellite} {Image} {Super}
    {Resolution}},
  date = {2026-05-12},
  url = {https://geospatialml.com/posts/sentinel2-superresolution/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-robinson2026" class="csl-entry quarto-appendix-citeas">
Robinson, Caleb, and Isaac Corley. 2026. <span>“Gaussian Splat-Based
Satellite Image Super Resolution.”</span> May 12. <a href="https://geospatialml.com/posts/sentinel2-superresolution/">https://geospatialml.com/posts/sentinel2-superresolution/</a>.
</div></div></section></div> ]]></description>
  <category>super-resolution</category>
  <category>sentinel-2</category>
  <category>gaussian-splats</category>
  <category>optimization</category>
  <category>remote-sensing</category>
  <guid>https://geospatialml.com/posts/sentinel2-superresolution/</guid>
  <pubDate>Tue, 12 May 2026 00:00:00 GMT</pubDate>
  <media:content url="https://geospatialml.com/posts/sentinel2-superresolution/fig1_main_comparison.webp" medium="image" type="image/webp"/>
</item>
<item>
  <title>ThroughputBench: How fast can a deep learning model map the Earth?</title>
  <dc:creator>Caleb Robinson</dc:creator>
  <dc:creator>Isaac Corley</dc:creator>
  <link>https://geospatialml.com/posts/throughput-bench/</link>
  <description><![CDATA[ 





<link rel="stylesheet" href="figures.css">
<script src="https://d3js.org/d3.v7.min.js"></script>
<script src="figures-data.js"></script>
<script src="figures.js" defer=""></script>
<p>The abundance of open satellite imagery and advances in geospatial ML and remote sensing methods have made it possible to monitor a growing list of variables directly from orbit. Burke et al.&nbsp;(<a href="https://www.science.org/doi/10.1126/science.abe8628">2021, <em>Science</em></a>), for example, reviewed how satellite imagery combined with machine learning can measure outcomes directly linked to the UN’s <a href="https://sdgs.un.org/goals">Sustainable Development Goals</a> — population, economic livelihoods, infrastructure quality, land use, informal settlements, agricultural productivity, and others. Operational systems like Hansen’s <a href="https://www.science.org/doi/10.1126/science.1244693">global forest change</a> product, and Google’s <a href="https://www.nature.com/articles/s41597-022-01307-4">Dynamic World</a> near-realtime land cover dataset are examples of planetary-scale monitoring with machine learning.</p>
<p>Modeling these variables well is one challenge. Scaling those models across the entire globe, on every new acquisition is a different challenge — and the choices you make at modeling time decide whether the second one is feasible at all. Take Dynamic World: <a href="https://www.nature.com/articles/s41597-022-01307-4">Brown et al.</a> report that they evaluate ~12,000 Sentinel-2 scenes per day and process about half after cloud filtering, so a new Dynamic World image lands roughly every 14.4 s — ~2 million Sentinel-2 tiles per year, or <strong>~4 billion 224×224 patch-equivalents</strong> at 10 m. The paper itself notes the architecture they use to do this is “almost 100× smaller than U-Net or DeepLab v3+ baselines.”</p>
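<p>The cadence figures above follow from simple arithmetic on the numbers reported by Brown et al. A quick check (variable names are ours, and "about half" is taken as exactly 0.5):</p>

```python
# Figures quoted in the text; the 0.5 fraction is an approximation of "about half".
scenes_evaluated_per_day = 12_000
fraction_processed = 0.5

processed_per_day = scenes_evaluated_per_day * fraction_processed
seconds_between_images = 24 * 3600 / processed_per_day
tiles_per_year = processed_per_day * 365

print(processed_per_day)        # 6000.0
print(seconds_between_images)   # about 14.4 seconds
print(tiles_per_year)           # 2190000.0, i.e. roughly 2 million
```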
<p>An accurate model that is too expensive to run frequently at planetary scale may not be operationally useful. However, we couldn’t find a place where this cost is benchmarked (although <a href="https://github.com/huggingface/pytorch-image-models/blob/main/benchmark.py">timm has a good benchmark script and published results</a>). So if you want to know whether ConvNeXt-B at fp16 on a V100 is cheaper than EfficientNet-B4 on an H100 for a global Sentinel-2 sweep, the answer today is to figure it out yourself.</p>
<blockquote class="blockquote">
<p><strong>Same imagery, same hardware, ~205× difference on the bill.</strong> Mapping the entire planet on every Sentinel-2 acquisition costs ~$30/year with MobileNetV3-S — or ~$6,150/year with ViT-L/8.</p>
</blockquote>
<p>This post is about that gap! We built <a href="https://github.com/calebrob6/throughput-bench"><strong>ThroughputBench</strong></a> — an extensible harness that measures model inference throughput for 33 common vision backbones plus 12 encoders across five geospatial foundation model families (DOFA, CROMA, SenPaMAE, Galileo, OlmoEarth), on whatever GPU you point it at, and serializes the results to a CSV. We’ve run the full matrix on two devices so far — raw results: <a href="https://github.com/calebrob6/throughput-bench/blob/main/results/tesla_v100_sxm2_32gb.csv">V100</a>, <a href="https://github.com/calebrob6/throughput-bench/blob/main/results/nvidia_h100_nvl.csv">H100</a>. You can see them summarized below, or play with our <a href="https://calebrob.com/throughput-bench/">interactive viewer</a> that shows how fast each model can sweep across Earth’s land surface.</p>
<div class="theme-figure">
<p><img src="https://geospatialml.com/posts/throughput-bench/throughput_bench.png" class="img-fluid"></p>
<p><span class="theme-figure-caption"><em>Figure 1</em>: The <a href="https://calebrob.com/throughput-bench/">interactive viewer</a> — pick two models and a GPU, and see them race over Earth’s land surface!</span></p>
</div>
<p>Our results show that on an H100 GPU, the same ~4 B-patches/year Dynamic World workload takes:</p>
<ul>
<li><strong>~10 GPU-hours/year (~$30)</strong><sup>1</sup> with MobileNetV3-S throughput (115K img/s, fp16, compiled) — the cheapest backbone in the matrix.</li>
<li><strong>~21 GPU-hours/year (~$65)</strong> with ResNet-18 throughput (53K img/s, fp16, compiled)</li>
<li><strong>~430 GPU-hours/year (~$1,340)</strong> with ViT-L/16 throughput (2,585 img/s, bf16, compiled)</li>
<li><strong>~970 GPU-hours/year (~$3,020)</strong> with the same ViT-L/16 with fp32 + no compile (1,147 img/s) — the precision and compile axes alone are ~2× on a single backbone.</li>
<li><strong>~1,840 GPU-hours/year (~$5,720)</strong> with OlmoEarth-Large/8 throughput (604 img/s, bf16, compiled).</li>
<li><strong>~1,980 GPU-hours/year (~$6,150)</strong> with ViT-L/8 throughput (562 img/s, bf16, compiled)<sup>2</sup> — the most expensive backbone in the matrix, ~205× the cost of MobileNetV3-S.</li>
</ul>
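<p>Each line in the list is the same calculation: patches per year divided by throughput gives GPU-seconds, and an hourly price turns that into dollars. The sketch below reproduces it. The hourly rate of $3.11 is an assumption back-solved from the listed costs rather than a quoted price, and the throughputs are the rounded values from the list, so results match the bullets only approximately.</p>

```python
PATCHES_PER_YEAR = 4e9     # ~4 billion 224x224 patch-equivalents (from the text)
USD_PER_GPU_HOUR = 3.11    # assumption: back-solved from the listed costs

# Rounded H100 throughputs (images per second) as listed above.
THROUGHPUT_IMG_PER_S = {
    "MobileNetV3-S (fp16, compiled)": 115_000,
    "ResNet-18 (fp16, compiled)": 53_000,
    "ViT-L/16 (bf16, compiled)": 2_585,
    "ViT-L/16 (fp32, no compile)": 1_147,
    "OlmoEarth-Large/8 (bf16, compiled)": 604,
    "ViT-L/8 (bf16, compiled)": 562,
}


def gpu_hours_per_year(images_per_second, patches=PATCHES_PER_YEAR):
    """GPU-hours needed to push `patches` images through at a fixed throughput."""
    return patches / images_per_second / 3600.0


def cost_usd_per_year(images_per_second, usd_per_hour=USD_PER_GPU_HOUR):
    """Yearly cost in dollars at a flat hourly GPU price."""
    return gpu_hours_per_year(images_per_second) * usd_per_hour


for name, rate in THROUGHPUT_IMG_PER_S.items():
    print(f"{name:36s} {gpu_hours_per_year(rate):7.0f} GPU-h  ${cost_usd_per_year(rate):7.0f}")
```

<p>Because cost scales inversely with throughput at a fixed price, the ~205× cost gap is just the throughput ratio 115,000 / 562.</p>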
<p><strong>Note</strong> — our results specifically bypass dataloader overhead, which can be significant, and only measure GPU time; see Dataloader overhead.</p>
<section id="methods" class="level2">
<h2 class="anchored" data-anchor-id="methods">Methods</h2>
<p>We ran every (model × precision × compile-mode × input-shape) combination on each GPU, recorded forward-pass latency, and divided the total number of images processed by the total wall-clock time to get images/sec.&nbsp;We fix batch size at 512 across the matrix.<sup>3</sup> A few details about the protocol:</p>
<ul>
<li><strong>We isolate the model from the dataloader.</strong> A pre-allocated batch sits on the GPU; we time only the model forward pass plus <code>cuda.synchronize()</code>. Real pipelines have host→device transfer and disk I/O too, but we deliberately exclude both so what’s left is just the model and the precision (we revisit the dataloader cost in Dataloader overhead).</li>
<li><strong>We warm up before timing.</strong> Twenty warmup iterations let the <code>cudnn</code> autotuner and any JIT compilation finish; we then reset the peak-memory stats and start timing. Each cell runs for at least 30 seconds. Throughput is total images divided by total wall time, not the mean of per-iteration rates.</li>
<li><strong>The GPU has to be idle.</strong> The benchmark harness checks for other processes on the GPU and aborts if it finds any.</li>
<li><strong>Encoder-only.</strong> Every model runs as the classification backbone alone, with no segmentation decoder attached, so hierarchical CNNs and plain ViTs are measured on the same footing.</li>
<li><strong>What “fp32” means depends on the GPU.</strong> On Ampere-and-newer GPUs (A100, H100), our <code>fp32</code> rows are actually <strong>TF32</strong> — 10-bit-mantissa Tensor Cores; on V100 they’re true IEEE-754 fp32. A <code>tf32_enabled</code> column in the CSV disambiguates. (See the dtype primer below for what TF32 means.)</li>
</ul>
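<p>The warmup-then-time protocol above can be sketched in a framework-agnostic way. This is a minimal illustration, not our benchmark harness: <code>measure_throughput</code>, <code>step</code>, and <code>sync</code> are hypothetical names, and on a CUDA device one would pass <code>torch.cuda.synchronize</code> as <code>sync</code> (an assumption about how you would wire it up).</p>

```python
import time


def measure_throughput(step, batch_size, warmup_iters=20, min_seconds=30.0,
                       sync=lambda: None):
    """Return images/sec for `step`, a zero-argument callable that runs one
    forward pass on a pre-allocated batch.

    `sync` should block until queued device work has finished. On a CUDA
    device one would pass torch.cuda.synchronize (assumption: PyTorch).
    """
    # Warmup: let autotuning and JIT compilation happen outside the timed region.
    for _ in range(warmup_iters):
        step()
    sync()

    images = 0
    start = time.perf_counter()
    while True:
        step()
        sync()  # count the pass only once the device has finished it
        images += batch_size
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            break
    # Total images over total wall time, not a mean of per-iteration rates.
    return images / elapsed


if __name__ == "__main__":
    # Stand-in "model": a sleep of about 1 ms per batch, timed for 0.2 s.
    rate = measure_throughput(lambda: time.sleep(0.001), batch_size=512,
                              warmup_iters=2, min_seconds=0.2)
    print(f"{rate:,.0f} img/s")
```

<p>Dividing total images by total wall time weights slow iterations by their duration, which is why it can differ from averaging per-iteration rates.</p>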
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Quick primer: dtypes and GPU generations
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>If you’re already comfortable with the difference between fp32, fp16, bf16, AMP, and TF32, skip this. Otherwise:</p>
<ul>
<li><strong>fp32 (single-precision float)</strong> — 32 bits, what most PyTorch code uses by default. Slow on modern GPUs because it doesn’t run on Tensor Cores.</li>
<li><strong>TF32</strong> — Ampere-and-newer NVIDIA GPUs run <code>fp32</code> matmuls on Tensor Cores at TF32 precision (10-bit mantissa instead of 23-bit) when you set <code>torch.set_float32_matmul_precision("high")</code>. We enable this, so on H100 our “fp32” rows are TF32; on V100 (Volta) they’re true IEEE-754 (V100 has no fp32 Tensor Cores).</li>
<li><strong>fp16 (half-precision)</strong> — 16 bits, runs on Tensor Cores at ~2× the throughput of TF32 (and ~8× of IEEE fp32 on V100). Numerical range is narrower; some models need loss scaling for training, but inference is usually fine.</li>
<li><strong>bf16 (brain float)</strong> — 16 bits with the same exponent range as fp32 but only 7 mantissa bits. Hardware-supported on Ampere+ (A100, H100); not supported on V100. More numerically stable than fp16 for some workloads.</li>
<li><strong>AMP (automatic mixed precision)</strong> — PyTorch’s <code>torch.autocast</code> keeps the model in fp32 and dynamically casts individual operations to fp16/bf16 per a built-in op list. Lower-overhead path to half: the wrapper code keeps running in fp32, only the heavy matmuls drop to half.</li>
</ul>
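<p>To make the bit counts above concrete, here is a small sketch that derives machine epsilon and the largest finite value from each format's exponent and mantissa widths. The layouts are the standard ones (TF32 is treated here as 8 exponent bits and 10 mantissa bits); the helper names are ours.</p>

```python
# (exponent bits, explicit mantissa bits) for each format in the list above.
FORMATS = {
    "fp32": (8, 23),
    "tf32": (8, 10),
    "fp16": (5, 10),
    "bf16": (8, 7),
}


def machine_epsilon(mantissa_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits


def max_finite(exp_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary format."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias


for name, (e, m) in FORMATS.items():
    print(f"{name}: eps={machine_epsilon(m):.2e}  max={max_finite(e, m):.3e}")
```

<p>The output shows the trade-off in the list: fp16 and TF32 share the same precision but fp16 has a much smaller range, while bf16 keeps the fp32 range at coarser precision.</p>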
<p>GPU-generation cheat sheet:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 17%">
<col style="width: 44%">
<col style="width: 30%">
<col style="width: 7%">
</colgroup>
<thead>
<tr class="header">
<th>Code name</th>
<th>Example cards</th>
<th style="text-align: center;">TF32 (fp32 Tensor Cores)?</th>
<th style="text-align: center;">bf16?</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Volta</td>
<td>V100</td>
<td style="text-align: center;">no</td>
<td style="text-align: center;">no</td>
</tr>
<tr class="even">
<td>Ampere</td>
<td>A100, RTX A6000, RTX 30-series</td>
<td style="text-align: center;">yes</td>
<td style="text-align: center;">yes</td>
</tr>
<tr class="odd">
<td>Hopper</td>
<td>H100, H200</td>
<td style="text-align: center;">yes</td>
<td style="text-align: center;">yes</td>
</tr>
<tr class="even">
<td>Ada Lovelace</td>
<td>RTX 6000 Ada, L40 / L40S, RTX 40-series</td>
<td style="text-align: center;">yes</td>
<td style="text-align: center;">yes</td>
</tr>
<tr class="odd">
<td>Blackwell</td>
<td>B100, B200, GB200, RTX PRO 6000 Blackwell, RTX 50-series</td>
<td style="text-align: center;">yes</td>
<td style="text-align: center;">yes</td>
</tr>
</tbody>
</table>
<p>When this post says “Ampere+” or “modern GPUs” it means Ampere or newer (basically anything ≥ A100).</p>
</div>
</div>
</div>
<p>Models cover 33 architectures spanning four families: classical CNNs (ResNet, EfficientNet, ConvNeXt, MobileNetV3, RegNetY), plain ViTs (ViT, DeiT3, DinoV3), hierarchical ViTs (Swin, BEiT), and CNN-ViT hybrids (CoAtNet). On top of those we benchmark 12 encoders from five recent geospatial foundation model families<sup>4</sup>: <strong>DOFA</strong> (<a href="https://arxiv.org/abs/2403.15356">Xiong et al.&nbsp;2024</a>), <strong>CROMA</strong> (<a href="https://arxiv.org/abs/2311.00566">Fuller et al., NeurIPS 2023</a>), <strong>SenPaMAE</strong> (<a href="https://arxiv.org/abs/2408.11000">Prexl &amp; Schmitt, DAGM 2024</a>), <strong>Galileo</strong> (<a href="https://arxiv.org/abs/2502.09356">Tseng et al., ICML 2025</a>), and <strong>OlmoEarth</strong> (<a href="https://arxiv.org/abs/2511.13655">Herzog et al., CVPR 2026</a>). Precisions are <code>fp32</code>, <code>fp16</code>, <code>bf16</code><sup>5</sup>, and <code>amp</code>. Compile modes are <code>none</code> and <code>default</code> (<code>torch.compile</code> with default Inductor settings).<sup>6</sup></p>
<p>We benchmark each timm model across input sizes that match the geo-FMs we want to compare it against (<code>{64, 120, 128, 144, 224}</code>-pixel side, <code>{2, 3, 10, 12}</code> channels) so we can do shape-for-shape comparisons. We run each geo-FM at its pretraining shape: DOFA-{B,L}/16 at 3×224×224, CROMA-Optical at 12×120×120, CROMA-SAR at 2×120×120, SenPaMAE-B/16 at 3×144×144, Galileo-{Nano,Base,Large}/8 at 10×64×64, OlmoEarth-{Nano,Tiny,Base,Large}/8 at 12×128×128.<sup>7</sup></p>
</section>
<section id="results" class="level2">
<h2 class="anchored" data-anchor-id="results">Results</h2>
<section id="dynamic-world" class="level3">
<h3 class="anchored" data-anchor-id="dynamic-world">Mapping the Earth</h3>
<p>A global-scale Sentinel-2 monitoring job needs to process <strong>~4 billion 224×224 patches/year</strong> (~12,000 tiles/day × half kept after cloud filter × ~1,950 patches/tile × 365 days). On an H100 GPU at each model’s best precision + compile, that costs anywhere from ~10 to ~2,000 GPU-hours, or roughly <strong>$30 to $6,150/year</strong> at $3.11/hr per H100. This ~205× spread is what makes model choice an important lever for planetary-scale jobs: the fastest backbone runs the entire job in about ten GPU-hours, while the heaviest one (ViT-L/8, the heaviest-compute vanilla ViT in our matrix) needs the equivalent of nearly three months of continuous H100 time.</p>
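<p>For readers who want to check the workload estimate, the factors in the paragraph above multiply out as follows. This is a sketch using only the approximate figures quoted in this post; the variable names are ours.</p>

```python
TILES_PER_DAY = 12_000      # Sentinel-2 tiles per day (approximate, from the text)
KEEP_FRACTION = 0.5         # share of tiles kept after the cloud filter
PATCHES_PER_TILE = 1_950    # 224x224 patches per tile (approximate, from the text)
DAYS_PER_YEAR = 365

patches_per_year = TILES_PER_DAY * KEEP_FRACTION * PATCHES_PER_TILE * DAYS_PER_YEAR
print(f"{patches_per_year / 1e9:.2f} billion patches/year")

# Spread between the fastest and slowest H100 configurations quoted in the post.
fastest_img_s = 115_000   # MobileNetV3-S, fp16, compiled
slowest_img_s = 562       # ViT-L/8, bf16, compiled
spread = fastest_img_s / slowest_img_s
print(f"cost spread: ~{spread:.0f}x")

# "Nearly three months": ~1,980 GPU-hours expressed as days of continuous use.
days_continuous = 1980 / 24
print(f"{days_continuous:.1f} days of continuous H100 time")
```

<p>The product lands slightly above 4 billion, which the post rounds down to ~4 billion for the headline numbers.</p>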
<div class="tb-fig" id="tb-fig-dynamic-world"></div>
<div class="tb-fig-caption"><em>Figure 2</em>: GPU-hours per year on H100 NVL to run a global, annual Sentinel-2 monitoring job at ~4 billion 224×224 patches/year, sorted by cost. MobileNetV3-S costs ~10 GPU-h/yr (~$30) at the cheap end; ViT-L/8 costs ~1,980 (~$6,150) at the expensive end (a non-standard <code>/8</code>-patch ViT-L we added as a same-compute comparator for OlmoEarth-Large/8 and Galileo-Large/8), with OlmoEarth-Large/8 sitting just below at ~1,840 (~$5,720). A ~205× spread on a fixed task, fixed imagery, fixed GPU. Hover any bar for the exact precision + compile config.</div>
<div class="callout callout-style-default callout-warning callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Warning</span>Read this first: every number in this post is a model-only lower bound
</div>
</div>
<div class="callout-body-container callout-body">
<p>Everything in this post is <strong>model-forward time only</strong> — a <strong>model-only lower bound</strong> for our specific stack (PyTorch eager + <code>torch.compile</code>, no TensorRT, ONNX export, quantization, CUDA graphs, or model surgery, all of which can move the floor further). Real deployment pipelines stack multipliers on top:</p>
<ul>
<li><strong>Patch overlap</strong> — sliding-window inference typically overlaps adjacent patches to eliminate edge artifacts in the output. For example, a 50% overlap on both axes ⇒ ~4× more patches than the non-overlap count.</li>
<li><strong>Test-time augmentation</strong> — some pipelines ensemble each patch’s predictions across flips and rotations, typically 4×–8× more forward passes per patch.</li>
<li><strong>Dataloader overhead</strong> — we pre-allocate a batch on the GPU and time only the model (see Dataloader overhead below). Real pipelines spend 30–50% of wall time on host→device transfer and I/O.</li>
</ul>
<p>These compose multiplicatively, so a realistic production setup can easily land 10–30× higher than the headlines above.</p>
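<p>As a rough illustration of how these multipliers stack, here is a sketch under two stated assumptions: overlap is the same fraction along both axes (edge effects ignored), and dataloader time is not overlapped with compute, so a fraction <code>f</code> of wall time spent on I/O inflates the total by 1/(1 − f). The function names are hypothetical.</p>

```python
def overlap_factor(overlap):
    """Patch-count multiplier for sliding windows overlapping by `overlap`
    (a fraction in [0, 1)) along both axes: the stride shrinks by
    (1 - overlap) per axis, so the count grows by 1 / (1 - overlap)**2."""
    return 1.0 / (1.0 - overlap) ** 2


def dataloader_factor(io_fraction):
    """Wall-time multiplier when `io_fraction` of wall time is spent outside
    the model (assumption: I/O is not overlapped with compute)."""
    return 1.0 / (1.0 - io_fraction)


def deployment_multiplier(overlap=0.0, tta_views=1, io_fraction=0.0):
    """Combined multiplier on the model-only figure."""
    return overlap_factor(overlap) * tta_views * dataloader_factor(io_fraction)


# One illustrative scenario: 50% overlap, 4 TTA views, 30% of wall time in I/O.
m = deployment_multiplier(overlap=0.5, tta_views=4, io_fraction=0.3)
print(f"~{m:.0f}x the model-only figure")
```

<p>Other combinations of overlap, augmentation, and I/O share move the result up or down; the point is only that the factors multiply rather than add.</p>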
</div>
</div>
</section>
<section id="browse-table" class="level3">
<h3 class="anchored" data-anchor-id="browse-table">Browse the fp32 numbers</h3>
<p>The table below shows our results for every model at <code>fp32</code>, <code>compile_mode=none</code>, on V100 and H100. Click a column header to sort; click a family chip to hide or show that family.</p>
<p>Note that the <code>Input (C×HW)</code> column is what each row is actually running. All timm models in this table run at 3×224 (standard ImageNet); the geo-FMs run at their pretraining shapes (12×120, 12×128, 10×64, 3×144, 2×120, 3×224, so DOFA is the only geo-FM in the same shape as the timm rows). That means the <code>img/s</code> column is <strong>not directly comparable</strong> between rows with different shapes: Galileo-Nano/8 at 4,785 img/s is processing 64×64 patches, not 224×224 ones. The <code>MPx/s</code> column (channels × H × W × img/s, in megapixels per second) is the closest like-for-like throughput across shapes, though it still doesn’t account for the fact that the model is doing different amounts of work per pixel. The <code>MACs (G)</code> column counts <strong>multiply-accumulate operations</strong> in a single forward pass (in billions): one MAC is one multiply plus one accumulate, and FLOPs ≈ 2 × MACs. We use it as a rough compute budget for a model at a given input shape: a larger MAC count means more arithmetic per image, though as we’ll see in the geo-FM section, MACs alone don’t predict throughput.</p>
<p>Sort by <code>H100 img/s</code> to see the throughput ladder; sort by <code>MPx/s</code> to compare across input shapes; click a family chip to focus on one architecture family.</p>
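<p>The two derived quantities described above are easy to reproduce by hand. A minimal sketch, assuming square inputs and using H100 fp32 rows from the table data in this post (the helper names are ours):</p>

```python
def mpx_per_s(channels, side, img_per_s):
    """Throughput in megapixels/sec, counting every channel of a square input."""
    return channels * side * side * img_per_s / 1e6


def flops_from_macs(gmacs):
    """Approximate GFLOPs for a forward pass: one MAC is about two FLOPs."""
    return 2.0 * gmacs


# H100 fp32 rows taken from the table data in this post.
beit_mpx = mpx_per_s(3, 224, 3032.29)       # BEiT-B/16 at 3x224
galileo_mpx = mpx_per_s(10, 64, 4785.14)    # Galileo-Nano/8 at 10x64
print(f"BEiT-B/16:      {beit_mpx:.2f} MPx/s")
print(f"Galileo-Nano/8: {galileo_mpx:.2f} MPx/s")
print(f"BEiT-B/16:      {flops_from_macs(35.13):.2f} GFLOPs per forward pass")
```

<p>Galileo-Nano/8 posts the higher <code>img/s</code> of the two, but far lower <code>MPx/s</code>, because each of its images carries far fewer pixels.</p>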
<div id="tb-root" style="font-family: -apple-system, system-ui, Segoe UI, Roboto, sans-serif; font-size: 13px; margin: 1em 0;">
  <div id="tb-chips" style="display:flex; flex-wrap:wrap; gap:6px; margin-bottom:10px;"></div>
  <div id="tb-table-wrap" style="overflow-x:auto;"></div>
  <div id="tb-status" style="font-size:11px; color:var(--geo-text-dim,#666); margin-top:6px;"></div>
</div>
<script>
(function(){
  const DATA = [
    {"name":"BEiT-B/16","family":"BEiT","type":"vit","params":85.77,"macs":35.13,"shape":"3\u00d7224","v100":{"tput":347.54,"mpxs":52.32,"p50":1472.468,"mem":4074.7,"bs":512},"h100":{"tput":3032.29,"mpxs":456.44,"p50":168.859,"mem":4098.1,"bs":512}},
    {"name":"BEiT-L/16","family":"BEiT","type":"vit","params":303.42,"macs":123.11,"shape":"3\u00d7224","v100":{"tput":102.45,"mpxs":15.42,"p50":4997.311,"mem":6084.7,"bs":512},"h100":{"tput":1066.63,"mpxs":160.56,"p50":480.046,"mem":6107.6,"bs":512}},
    {"name":"CROMA-Optical","family":"CROMA","type":"vit","params":90.34,"macs":40.36,"shape":"12\u00d7120","v100":{"tput":271.95,"mpxs":46.99,"p50":1882.416,"mem":6880.1,"bs":512},"h100":{"tput":2326.56,"mpxs":402.03,"p50":219.987,"mem":6905.2,"bs":512}},
    {"name":"CROMA-SAR","family":"CROMA","type":"vit","params":47.34,"macs":20.1,"shape":"2\u00d7120","v100":{"tput":554.48,"mpxs":15.97,"p50":923.003,"mem":6415.4,"bs":512},"h100":{"tput":4628.64,"mpxs":133.3,"p50":110.329,"mem":6437.7,"bs":512}},
    {"name":"CoAtNet-0","family":"CoAtNet","type":"hybrid","params":24.28,"macs":9.09,"shape":"3\u00d7224","v100":{"tput":724.38,"mpxs":109.04,"p50":352.962,"mem":16190.3,"bs":256},"h100":{"tput":2915.22,"mpxs":438.82,"p50":87.817,"mem":16214.3,"bs":256}},
    {"name":"CoAtNet-2","family":"CoAtNet","type":"hybrid","params":73.67,"macs":32.85,"shape":"3\u00d7224","v100":{"tput":267.02,"mpxs":40.19,"p50":958.573,"mem":22253.4,"bs":256},"h100":{"tput":1239.39,"mpxs":186.56,"p50":206.603,"mem":22272.1,"bs":256}},
    {"name":"ConvNeXt-B","family":"ConvNeXt","type":"cnn","params":87.58,"macs":30.71,"shape":"3\u00d7224","v100":{"tput":351.18,"mpxs":52.86,"p50":1457.825,"mem":9711.1,"bs":512},"h100":{"tput":2417.5,"mpxs":363.9,"p50":211.726,"mem":9735.1,"bs":512}},
    {"name":"ConvNeXt-L","family":"ConvNeXt","type":"cnn","params":196.25,"macs":68.72,"shape":"3\u00d7224","v100":{"tput":160.99,"mpxs":24.23,"p50":3179.908,"mem":14667.9,"bs":512},"h100":{"tput":1449.49,"mpxs":218.19,"p50":352.866,"mem":14691.2,"bs":512}},
    {"name":"ConvNeXt-S","family":"ConvNeXt","type":"cnn","params":49.46,"macs":17.37,"shape":"3\u00d7224","v100":{"tput":548.04,"mpxs":82.5,"p50":933.967,"mem":7297.9,"bs":512},"h100":{"tput":3227.95,"mpxs":485.9,"p50":158.621,"mem":7321.9,"bs":512}},
    {"name":"ConvNeXt-T","family":"ConvNeXt","type":"cnn","params":27.83,"macs":8.91,"shape":"3\u00d7224","v100":{"tput":926.99,"mpxs":139.54,"p50":552.196,"mem":7211.4,"bs":512},"h100":{"tput":5232.89,"mpxs":787.7,"p50":97.845,"mem":7235.4,"bs":512}},
    {"name":"DOFA-B/16","family":"DOFA","type":"vit","params":111.35,"macs":34.0,"shape":"3\u00d7224","v100":{"tput":356.63,"mpxs":53.68,"p50":1435.945,"mem":4175.3,"bs":512},"h100":{"tput":3260.13,"mpxs":490.74,"p50":157.075,"mem":4197.7,"bs":512}},
    {"name":"DOFA-L/16","family":"DOFA","type":"vit","params":337.15,"macs":119.65,"shape":"3\u00d7224","v100":{"tput":105.22,"mpxs":15.84,"p50":4865.775,"mem":6213.1,"bs":512},"h100":{"tput":1125.72,"mpxs":169.45,"p50":455.399,"mem":6237.1,"bs":512}},
    {"name":"DeiT3-B/16","family":"DeiT","type":"vit","params":85.82,"macs":33.7,"shape":"3\u00d7224","v100":{"tput":352.21,"mpxs":53.02,"p50":1453.255,"mem":4382.2,"bs":512},"h100":{"tput":3171.46,"mpxs":477.39,"p50":161.46,"mem":4404.5,"bs":512}},
    {"name":"DeiT3-S/16","family":"DeiT","type":"vit","params":21.68,"macs":8.48,"shape":"3\u00d7224","v100":{"tput":1183.67,"mpxs":178.18,"p50":432.704,"mem":2263.7,"bs":512},"h100":{"tput":7458.19,"mpxs":1122.67,"p50":68.595,"mem":2290.1,"bs":512}},
    {"name":"DinoV3-H+/16","family":"DinoV3","type":"vit","params":840.52,"macs":337.61,"shape":"3\u00d7224","v100":{"tput":36.23,"mpxs":5.45,"p50":14134.06,"mem":13773.1,"bs":512},"h100":{"tput":374.94,"mpxs":56.44,"p50":1365.197,"mem":13805.5,"bs":512}},
    {"name":"EfficientNet-B0","family":"EfficientNet","type":"cnn","params":4.02,"macs":0.77,"shape":"3\u00d7224","v100":{"tput":3169.14,"mpxs":477.04,"p50":161.543,"mem":6499.8,"bs":512},"h100":{"tput":9188.86,"mpxs":1383.18,"p50":55.758,"mem":6523.8,"bs":512}},
    {"name":"EfficientNet-B4","family":"EfficientNet","type":"cnn","params":17.57,"macs":3.0,"shape":"3\u00d7224","v100":{"tput":1104.03,"mpxs":166.19,"p50":463.835,"mem":9638.7,"bs":512},"h100":{"tput":3300.29,"mpxs":496.79,"p50":155.216,"mem":9661.2,"bs":512}},
    {"name":"EfficientNet-B7","family":"EfficientNet","type":"cnn","params":63.81,"macs":10.33,"shape":"3\u00d7224","v100":{"tput":415.64,"mpxs":62.57,"p50":1231.881,"mem":12906.4,"bs":512},"h100":{"tput":1347.04,"mpxs":202.77,"p50":380.014,"mem":12929.8,"bs":512}},
    {"name":"Galileo-Base/8","family":"Galileo","type":"vit","params":86.52,"macs":54.42,"shape":"10\u00d764","v100":{"tput":192.75,"mpxs":7.9,"p50":2655.52,"mem":9489.4,"bs":512},"h100":{"tput":1349.4,"mpxs":55.27,"p50":379.329,"mem":9514.0,"bs":512}},
    {"name":"Galileo-Large/8","family":"Galileo","type":"vit","params":474.7,"macs":302.09,"shape":"10\u00d764","v100":{"tput":40.31,"mpxs":1.65,"p50":12693.348,"mem":15855.5,"bs":512},"h100":{"tput":393.07,"mpxs":16.1,"p50":1300.617,"mem":15838.6,"bs":512}},
    {"name":"Galileo-Nano/8","family":"Galileo","type":"vit","params":1.04,"macs":0.51,"shape":"10\u00d764","v100":{"tput":2150.13,"mpxs":88.07,"p50":237.89,"mem":3456.1,"bs":512},"h100":{"tput":4785.14,"mpxs":196.0,"p50":106.984,"mem":3480.1,"bs":512}},
    {"name":"MobileNetV3-L","family":"MobileNet","type":"cnn","params":4.21,"macs":0.43,"shape":"3\u00d7224","v100":{"tput":5318.22,"mpxs":800.54,"p50":96.269,"mem":4445.3,"bs":512},"h100":{"tput":15803.55,"mpxs":2378.88,"p50":32.395,"mem":4469.3,"bs":512}},
    {"name":"MobileNetV3-S","family":"MobileNet","type":"cnn","params":1.53,"macs":0.11,"shape":"3\u00d7224","v100":{"tput":15535.74,"mpxs":2338.56,"p50":32.915,"mem":1762.7,"bs":512},"h100":{"tput":42240.5,"mpxs":6358.38,"p50":12.127,"mem":1786.7,"bs":512}},
    {"name":"OlmoEarth-Base/8","family":"OlmoEarth","type":"vit","params":88.96,"macs":130.76,"shape":"12\u00d7128","v100":{"tput":77.96,"mpxs":15.33,"p50":6573.894,"mem":18159.9,"bs":512},"h100":{"tput":602.33,"mpxs":118.42,"p50":850.0,"mem":18184.5,"bs":512}},
    {"name":"OlmoEarth-Large/8","family":"OlmoEarth","type":"vit","params":307.77,"macs":464.26,"shape":"12\u00d7128","v100":{"tput":24.86,"mpxs":4.89,"p50":20597.278,"mem":25633.8,"bs":512},"h100":{"tput":205.31,"mpxs":40.36,"p50":2494.012,"mem":25658.3,"bs":512}},
    {"name":"OlmoEarth-Nano/8","family":"OlmoEarth","type":"vit","params":1.36,"macs":1.26,"shape":"12\u00d7128","v100":{"tput":1909.63,"mpxs":375.45,"p50":268.084,"mem":3268.0,"bs":512},"h100":{"tput":5617.15,"mpxs":1104.38,"p50":91.188,"mem":3292.8,"bs":512}},
    {"name":"OlmoEarth-Tiny/8","family":"OlmoEarth","type":"vit","params":6.2,"macs":8.23,"shape":"12\u00d7128","v100":{"tput":626.45,"mpxs":123.16,"p50":817.085,"mem":4718.8,"bs":512},"h100":{"tput":2752.03,"mpxs":541.07,"p50":186.138,"mem":4743.3,"bs":512}},
    {"name":"RegNetY-400MF","family":"RegNet","type":"cnn","params":3.91,"macs":0.8,"shape":"3\u00d7224","v100":{"tput":4361.39,"mpxs":656.51,"p50":117.393,"mem":3622.0,"bs":512},"h100":{"tput":16048.93,"mpxs":2415.81,"p50":31.904,"mem":4262.6,"bs":512}},
    {"name":"RegNetY-4GF","family":"RegNet","type":"cnn","params":19.57,"macs":7.94,"shape":"3\u00d7224","v100":{"tput":256.7,"mpxs":38.64,"p50":1985.634,"mem":9440.1,"bs":512},"h100":{"tput":5012.26,"mpxs":754.48,"p50":102.116,"mem":9463.7,"bs":512}},
    {"name":"ResNet-101","family":"ResNet","type":"cnn","params":42.52,"macs":15.6,"shape":"3\u00d7224","v100":{"tput":779.56,"mpxs":117.35,"p50":657.039,"mem":5832.2,"bs":512},"h100":{"tput":3811.06,"mpxs":573.67,"p50":134.403,"mem":5856.0,"bs":512}},
    {"name":"ResNet-152","family":"ResNet","type":"cnn","params":58.16,"macs":23.02,"shape":"3\u00d7224","v100":{"tput":543.79,"mpxs":81.86,"p50":941.104,"mem":5894.8,"bs":512},"h100":{"tput":2729.84,"mpxs":410.92,"p50":187.552,"mem":5918.7,"bs":512}},
    {"name":"ResNet-18","family":"ResNet","type":"cnn","params":11.18,"macs":3.63,"shape":"3\u00d7224","v100":{"tput":4616.63,"mpxs":694.93,"p50":110.857,"mem":3651.0,"bs":512},"h100":{"tput":17289.9,"mpxs":2602.61,"p50":29.674,"mem":3983.3,"bs":512}},
    {"name":"ResNet-50","family":"ResNet","type":"cnn","params":23.53,"macs":8.17,"shape":"3\u00d7224","v100":{"tput":1277.58,"mpxs":192.31,"p50":400.699,"mem":5755.8,"bs":512},"h100":{"tput":5925.86,"mpxs":892.01,"p50":86.344,"mem":5779.8,"bs":512}},
    {"name":"SenPaMAE-B/16","family":"SenPaMAE","type":"vit","params":95.02,"macs":41.43,"shape":"3\u00d7144","v100":{"tput":295.33,"mpxs":18.37,"p50":1734.486,"mem":5487.0,"bs":512},"h100":{"tput":2888.63,"mpxs":179.7,"p50":177.186,"mem":5509.8,"bs":512}},
    {"name":"Swin-B","family":"Swin","type":"vit","params":86.75,"macs":30.86,"shape":"3\u00d7224","v100":{"tput":313.97,"mpxs":47.26,"p50":1629.013,"mem":10968.2,"bs":512},"h100":{"tput":1900.57,"mpxs":286.09,"p50":269.461,"mem":10993.3,"bs":512}},
    {"name":"Swin-L","family":"Swin","type":"vit","params":195.01,"macs":68.95,"shape":"3\u00d7224","v100":{"tput":158.51,"mpxs":23.86,"p50":3231.683,"mem":16554.9,"bs":512},"h100":{"tput":1165.27,"mpxs":175.41,"p50":439.301,"mem":16577.1,"bs":512}},
    {"name":"Swin-S","family":"Swin","type":"vit","params":48.84,"macs":17.48,"shape":"3\u00d7224","v100":{"tput":481.42,"mpxs":72.47,"p50":1063.138,"mem":8241.1,"bs":512},"h100":{"tput":2561.0,"mpxs":385.5,"p50":199.952,"mem":8266.2,"bs":512}},
    {"name":"Swin-T","family":"Swin","type":"vit","params":27.53,"macs":8.98,"shape":"3\u00d7224","v100":{"tput":813.67,"mpxs":122.48,"p50":629.315,"mem":8156.8,"bs":512},"h100":{"tput":4097.45,"mpxs":616.78,"p50":125.106,"mem":8180.4,"bs":512}},
    {"name":"ViT-B/16","family":"ViT","type":"vit","params":85.81,"macs":33.7,"shape":"3\u00d7224","v100":{"tput":343.64,"mpxs":51.73,"p50":1442.817,"mem":4381.4,"bs":512},"h100":{"tput":3314.78,"mpxs":498.97,"p50":154.45,"mem":4404.4,"bs":512}},
    {"name":"ViT-G/14","family":"ViT","type":"vit","params":1011.22,"macs":519.18,"shape":"3\u00d7224","v100":{"tput":24.89,"mpxs":3.75,"p50":20568.282,"mem":13866.0,"bs":512},"h100":{"tput":274.9,"mpxs":41.38,"p50":1861.818,"mem":13895.2,"bs":512}},
    {"name":"ViT-H/14","family":"ViT","type":"vit","params":630.78,"macs":323.77,"shape":"3\u00d7224","v100":{"tput":38.26,"mpxs":5.76,"p50":13378.971,"mem":10992.6,"bs":512},"h100":{"tput":439.97,"mpxs":66.23,"p50":1164.059,"mem":11016.6,"bs":512}},
    {"name":"ViT-L/16","family":"ViT","type":"vit","params":303.31,"macs":119.29,"shape":"3\u00d7224","v100":{"tput":104.98,"mpxs":15.8,"p50":4873.227,"mem":6489.8,"bs":512},"h100":{"tput":1146.7,"mpxs":172.61,"p50":446.603,"mem":6512.8,"bs":512}},
    {"name":"ViT-L/8","family":"ViT","type":"vit","params":303.32,"macs":474.43,"shape":"3\u00d7224","v100":{"tput":23.84,"mpxs":3.59,"p50":21477.237,"mem":21287.2,"bs":512},"h100":{"tput":185.07,"mpxs":27.86,"p50":2765.978,"mem":21310.3,"bs":512}},
    {"name":"ViT-S/16","family":"ViT","type":"vit","params":21.67,"macs":8.48,"shape":"3\u00d7224","v100":{"tput":1241.22,"mpxs":186.84,"p50":412.707,"mem":2267.5,"bs":512},"h100":{"tput":7866.14,"mpxs":1184.07,"p50":65.022,"mem":2289.0,"bs":512}},
    {"name":"ViT-Ti/16","family":"ViT","type":"vit","params":5.53,"macs":2.15,"shape":"3\u00d7224","v100":{"tput":3312.47,"mpxs":498.62,"p50":154.485,"mem":1269.6,"bs":512},"h100":{"tput":15799.63,"mpxs":2378.29,"p50":32.372,"mem":1293.5,"bs":512}}
  ];

  const FAMILY_COLORS = {
    "ResNet":"#1f77b4","EfficientNet":"#2ca02c","ConvNeXt":"#9467bd",
    "MobileNet":"#8c564b","RegNet":"#e377c2","ViT":"#d62728",
    "DeiT":"#ff7f0e","Swin":"#bcbd22","BEiT":"#17becf","CoAtNet":"#7f7f7f",
    "DOFA":"#e6550d","CROMA":"#31a354","SenPaMAE":"#756bb1",
    "Galileo":"#de2d26","OlmoEarth":"#3182bd"
  };

  // Compute derived field: speedup = h100/v100 throughput
  DATA.forEach(d => { d.speedup = d.h100.tput / d.v100.tput; });

  const COLS = [
    {key:"name",     label:"Model",            fmt: v => v,                      align:"left",  numeric:false},
    {key:"family",   label:"Family",           fmt: v => v,                      align:"left",  numeric:false},
    {key:"shape",    label:"Input (C×HW)",     fmt: v => v,                      align:"left",  numeric:false},
    {key:"params",   label:"Params (M)",       fmt: v => v.toFixed(1),           align:"right", numeric:true},
    {key:"macs",     label:"MACs (G)",         fmt: v => v.toFixed(2),           align:"right", numeric:true},
    {key:"v100.tput",label:"V100 img/s",       fmt: v => v.toFixed(0),           align:"right", numeric:true},
    {key:"h100.tput",label:"H100 img/s",       fmt: v => v.toFixed(0),           align:"right", numeric:true},
    {key:"v100.mpxs",label:"V100 MPx/s",       fmt: v => v.toFixed(1),           align:"right", numeric:true},
    {key:"h100.mpxs",label:"H100 MPx/s",       fmt: v => v.toFixed(1),           align:"right", numeric:true},
    {key:"speedup",  label:"H100/V100",        fmt: v => v.toFixed(2)+"×",       align:"right", numeric:true},
  ];

  function getVal(row, key) {
    return key.split(".").reduce((a,k) => a[k], row);
  }

  // State
  let sortKey = "h100.tput";
  let sortDesc = true;
  const families = Array.from(new Set(DATA.map(d => d.family))).sort();
  const hidden = new Set();

  function renderChips() {
    const root = document.getElementById("tb-chips");
    root.innerHTML = "";
    families.forEach(fam => {
      const c = FAMILY_COLORS[fam] || "#888";
      const on = !hidden.has(fam);
      const el = document.createElement("button");
      el.textContent = fam;
      el.style.cssText = `
        border: 1.5px solid ${c};
        background: ${on ? c : "transparent"};
        color: ${on ? "#fff" : c};
        padding: 3px 9px;
        border-radius: 12px;
        font-size: 11px;
        font-weight: 600;
        cursor: pointer;
        transition: background 80ms;
        font-family: inherit;
      `;
      el.onclick = () => {
        if (hidden.has(fam)) hidden.delete(fam); else hidden.add(fam);
        render();
      };
      root.appendChild(el);
    });
    // All / None buttons
    const all = document.createElement("button");
    all.textContent = "all";
    all.style.cssText = "border: 1px solid var(--geo-border,#aaa); background:var(--geo-bg-card,#fff); color:var(--geo-text-dim,#444); padding: 3px 9px; border-radius:12px; font-size:11px; cursor:pointer; font-family:inherit;";
    all.onclick = () => { hidden.clear(); render(); };
    root.appendChild(all);
    const none = document.createElement("button");
    none.textContent = "none";
    none.style.cssText = all.style.cssText;
    none.onclick = () => { families.forEach(f => hidden.add(f)); render(); };
    root.appendChild(none);
  }

  function renderHead() {
    const head = document.getElementById("tb-head");
    head.innerHTML = "";
    COLS.forEach(c => {
      const th = document.createElement("th");
      const isSorted = c.key === sortKey;
      const arrow = isSorted ? (sortDesc ? " ▼" : " ▲") : "";
      th.textContent = c.label + arrow;
      th.style.cssText = `
        text-align: ${c.align};
        padding: 6px 10px;
        cursor: pointer;
        user-select: none;
        white-space: nowrap;
        font-weight: 600;
        color: ${isSorted ? "var(--geo-text,#000)" : "var(--geo-text-dim,#444)"};
      `;
      th.onclick = () => {
        if (sortKey === c.key) sortDesc = !sortDesc;
        else { sortKey = c.key; sortDesc = c.numeric ? true : false; }
        render();
      };
      head.appendChild(th);
    });
  }

  function renderBody() {
    const body = document.getElementById("tb-body");
    body.innerHTML = "";
    let rows = DATA.filter(d => !hidden.has(d.family));
    rows.sort((a, b) => {
      const va = getVal(a, sortKey);
      const vb = getVal(b, sortKey);
      if (typeof va === "string") {
        return sortDesc ? vb.localeCompare(va) : va.localeCompare(vb);
      }
      return sortDesc ? vb - va : va - vb;
    });
    rows.forEach((d, i) => {
      const tr = document.createElement("tr");
      const fc = FAMILY_COLORS[d.family] || "#888";
      tr.style.cssText = `
        border-bottom: 1px solid var(--geo-border, #eee);
        background: ${i % 2 === 0 ? "transparent" : "var(--geo-bg-card, #fafafa)"};
        color: var(--geo-text, #222);
      `;
      COLS.forEach((c, ci) => {
        const td = document.createElement("td");
        const raw = getVal(d, c.key);
        td.textContent = c.fmt(raw);
        td.style.cssText = `
          text-align: ${c.align};
          padding: 5px 10px;
          white-space: nowrap;
          font-variant-numeric: tabular-nums;
        `;
        if (ci === 0) {
          td.style.borderLeft = `4px solid ${fc}`;
          td.style.fontWeight = "500";
        }
        if (c.key === "family") {
          td.style.color = fc;
          td.style.fontWeight = "600";
        }
        tr.appendChild(td);
      });
      body.appendChild(tr);
    });
    document.getElementById("tb-status").textContent =
      `${rows.length} of ${DATA.length} models shown · sorted by ${COLS.find(c => c.key === sortKey).label} ${sortDesc ? "↓" : "↑"} · all rows are fp32, compile_mode=none`;
  }

  function render() { renderChips(); renderHead(); renderBody(); }

  // Build the table programmatically so Pandoc doesn't normalize the
  // raw-HTML <table> in the .qmd (it was stripping the empty <thead>).
  const wrap = document.getElementById("tb-table-wrap");
  const table = document.createElement("table");
  table.id = "tb-table";
  table.style.cssText = "border-collapse: collapse; width: 100%; min-width: 720px;";
  const thead = document.createElement("thead");
  const headRow = document.createElement("tr");
  headRow.id = "tb-head";
  headRow.style.cssText = "background:var(--geo-bg-card,#f6f6f6); border-bottom: 2px solid var(--geo-text-dim,#888);";
  thead.appendChild(headRow);
  const tbody = document.createElement("tbody");
  tbody.id = "tb-body";
  table.appendChild(thead);
  table.appendChild(tbody);
  wrap.appendChild(table);

  render();
})();
</script>
</section>
<section id="geo-overhead" class="level3">
<h3 class="anchored" data-anchor-id="geo-overhead">Geo-FMs vs.&nbsp;matched-MAC vanilla baselines</h3>
<p>The geo-FMs run at different input shapes with different numbers of input channels (12×128, 10×64, 12×120, 2×120, 3×144), so a head-to-head 3×224 comparison would be unfair to most of them. A fairer test: pair each geo-FM with the timm/DinoV3 model whose MAC count is closest, then compare throughput at fp32 + no compile so that what’s left is just the architecture and the wrapper. For the smaller geo-FMs the closest-MAC comparator is naturally at the same input shape. For the larger ones (Galileo-Base/8, Galileo-Large/8, OlmoEarth-Base/8, OlmoEarth-Large/8) the only timm/DinoV3 models with comparable MAC budgets are the big models at 3×224, so we pair across shapes there, with the comparator’s shape noted in the table.</p>
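<p>The pairing rule can be sketched in a few lines. This is an illustration of closest-MAC matching on a hand-picked subset of H100 fp32, no-compile rows from this post, not the script we used; <code>closest_by_macs</code> is a hypothetical helper.</p>

```python
def closest_by_macs(target_gmacs, candidates):
    """Return the (name, gmacs, img_per_s) candidate nearest in MACs."""
    return min(candidates, key=lambda c: abs(c[1] - target_gmacs))


# A few H100 fp32, no-compile rows at 3x224 (values from the table in this post).
candidates = [
    ("ViT-B/16", 33.70, 3314.78),
    ("BEiT-B/16", 35.13, 3032.29),
    ("ConvNeXt-B", 30.71, 2417.50),
    ("Swin-B", 30.86, 1900.57),
]

geo_name, geo_gmacs, geo_img_s = "DOFA-B/16", 34.0, 3260.13
name, gmacs, img_s = closest_by_macs(geo_gmacs, candidates)
ratio = geo_img_s / img_s
print(f"{geo_name} vs {name} ({gmacs:.1f} G): {ratio:.2f}x")
```

<p>Candidates with identical MAC counts would need a tie-break rule, which this sketch does not define; Python's <code>min</code> simply keeps the first one it sees.</p>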
<table class="caption-top table">
<colgroup>
<col style="width: 7%">
<col style="width: 13%">
<col style="width: 9%">
<col style="width: 10%">
<col style="width: 23%">
<col style="width: 6%">
<col style="width: 17%">
<col style="width: 11%">
</colgroup>
<thead>
<tr class="header">
<th>Geo-FM</th>
<th>Native input</th>
<th style="text-align: right;">MACs (G)</th>
<th style="text-align: right;">Geo img/s</th>
<th>Closest-MAC comparator</th>
<th>Comparator shape</th>
<th style="text-align: right;">Comparator img/s</th>
<th style="text-align: right;">Geo / comparator</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>DOFA-B/16</td>
<td>3×224</td>
<td style="text-align: right;">34.0</td>
<td style="text-align: right;">3,260</td>
<td>ViT-B/16 (33.7 G)</td>
<td>3×224</td>
<td style="text-align: right;">3,315</td>
<td style="text-align: right;"><strong>0.98×</strong></td>
</tr>
<tr class="even">
<td>DOFA-L/16</td>
<td>3×224</td>
<td style="text-align: right;">119.7</td>
<td style="text-align: right;">1,126</td>
<td>ViT-L/16 (119.3 G)</td>
<td>3×224</td>
<td style="text-align: right;">1,147</td>
<td style="text-align: right;"><strong>0.98×</strong></td>
</tr>
<tr class="odd">
<td>CROMA-Optical</td>
<td>12×120</td>
<td style="text-align: right;">40.4</td>
<td style="text-align: right;">2,327</td>
<td>Swin-L (35.2 G)</td>
<td>12×120</td>
<td style="text-align: right;">2,294</td>
<td style="text-align: right;"><strong>1.01×</strong></td>
</tr>
<tr class="even">
<td>CROMA-SAR</td>
<td>2×120</td>
<td style="text-align: right;">20.1</td>
<td style="text-align: right;">4,629</td>
<td>ConvNeXt-L (17.2 G)</td>
<td>2×120</td>
<td style="text-align: right;">5,199</td>
<td style="text-align: right;"><strong>0.89×</strong></td>
</tr>
<tr class="odd">
<td>SenPaMAE-B/16</td>
<td>3×144</td>
<td style="text-align: right;">41.4</td>
<td style="text-align: right;">2,889</td>
<td>Swin-L (41.2 G)</td>
<td>3×144</td>
<td style="text-align: right;">1,881</td>
<td style="text-align: right;"><strong>1.54×</strong></td>
</tr>
<tr class="even">
<td>Galileo-Nano/8</td>
<td>10×64</td>
<td style="text-align: right;">0.5</td>
<td style="text-align: right;">4,785</td>
<td>RegNetY-4GF (0.7 G)</td>
<td>10×64</td>
<td style="text-align: right;">25,626</td>
<td style="text-align: right;"><strong>0.19×</strong></td>
</tr>
<tr class="odd">
<td>Galileo-Base/8</td>
<td>10×64</td>
<td style="text-align: right;">54.4</td>
<td style="text-align: right;">1,349</td>
<td>ConvNeXt-L (68.7 G)</td>
<td>3×224</td>
<td style="text-align: right;">1,449</td>
<td style="text-align: right;"><strong>0.93×</strong></td>
</tr>
<tr class="even">
<td>Galileo-Large/8</td>
<td>10×64</td>
<td style="text-align: right;">302.1</td>
<td style="text-align: right;">393</td>
<td>ViT-H/14 (323.8 G)</td>
<td>3×224</td>
<td style="text-align: right;">440</td>
<td style="text-align: right;"><strong>0.89×</strong></td>
</tr>
<tr class="odd">
<td>OlmoEarth-Nano/8</td>
<td>12×128</td>
<td style="text-align: right;">1.3</td>
<td style="text-align: right;">5,617</td>
<td>ResNet-18 (1.4 G)</td>
<td>12×128</td>
<td style="text-align: right;">44,355</td>
<td style="text-align: right;"><strong>0.13×</strong></td>
</tr>
<tr class="even">
<td>OlmoEarth-Tiny/8</td>
<td>12×128</td>
<td style="text-align: right;">8.2</td>
<td style="text-align: right;">2,752</td>
<td>ResNet-152 (7.8 G)</td>
<td>12×128</td>
<td style="text-align: right;">6,398</td>
<td style="text-align: right;"><strong>0.43×</strong></td>
</tr>
<tr class="odd">
<td>OlmoEarth-Base/8</td>
<td>12×128</td>
<td style="text-align: right;">130.8</td>
<td style="text-align: right;">602</td>
<td>ViT-L/16 (119.3 G)</td>
<td>3×224</td>
<td style="text-align: right;">1,147</td>
<td style="text-align: right;"><strong>0.53×</strong></td>
</tr>
<tr class="even">
<td>OlmoEarth-Large/8</td>
<td>12×128</td>
<td style="text-align: right;">464.3</td>
<td style="text-align: right;">205</td>
<td>ViT-L/8 (474.4 G)</td>
<td>3×224</td>
<td style="text-align: right;">185</td>
<td style="text-align: right;"><strong>1.11×</strong></td>
</tr>
</tbody>
</table>
<p>This splits the geo-FMs into two camps: ones whose input handling is essentially free, and ones whose input handling adds work that matters mostly at small model sizes.</p>
<p><strong>DOFA, CROMA, and SenPaMAE are timm-class encoders with remote-sensing-aware inputs.</strong> DOFA-B/16 lands within 2% of ViT-B/16’s throughput at the same input shape; DOFA-L/16 within 2% of ViT-L/16. In this benchmark we don’t measure a meaningful overhead from the wavelength-conditioned dynamic patch embed. CROMA stays close to or above the closest-MAC timm backbone (1.01× for optical, 0.89× for SAR), and SenPaMAE outpaces Swin-L at matched MACs (1.54×). SenPaMAE is a plain ViT-B inside; we suspect Swin’s window attention contributes more small ops per FLOP, though we haven’t profiled to confirm. In the best-config headline numbers, CROMA and DOFA trail their same-shape timm baselines mainly because they’re capped at fp32 + amp; SenPaMAE has full half-precision support and cashes it in (best H100 = 6,630 img/s at bf16 + compile, 2.3× the fp32 baseline).</p>
<p><strong>Galileo and OlmoEarth tokenize each image as multiple <em>bandsets</em></strong> — groups of spectral channels processed with separate patch embeddings. OlmoEarth at 12×128 with patch 8 produces 16×16 × 3 bandsets × 1 timestep = <strong>768 tokens per image</strong> (vs.&nbsp;a vanilla ViT’s 256 at the same shape); Galileo at 10×64 with patch 8 produces 8×8 × 5 bandsets × 1 timestep = <strong>320 tokens</strong>. On top of that, both wrappers run the multi-modal multi-timestep code path the models were trained for — per-timestep position embeddings still computed at T=1, modality-dispatch logic, and host-side reshaping (Galileo’s <code>format_input</code>, OlmoEarth’s <code>MaskedOlmoEarthSample</code> packing) — which adds several CPU ops to every forward.</p>
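<p>The token arithmetic above can be sketched in a few lines. This is a minimal illustration that assumes square inputs and square patches; <code>token_count</code> is a hypothetical helper, not a function from either model’s codebase.</p>

```python
def token_count(image_size, patch_size, bandsets=1, timesteps=1):
    """Tokens per image for a square input split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side * bandsets * timesteps


olmoearth = token_count(128, 8, bandsets=3)  # 16 x 16 x 3 bandsets x 1 timestep
galileo = token_count(64, 8, bandsets=5)     # 8 x 8 x 5 bandsets x 1 timestep
vanilla_vit = token_count(128, 8)            # one patch embedding, no bandsets

print(olmoearth, galileo, vanilla_vit)  # 768 320 256
```

<p>Because self-attention cost grows with the square of the token count, tripling the tokens at a fixed input shape costs more than three times the attention work.</p>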
<p>How much this hurts throughput depends on size. At the small end, the picture is consistent with wrapper overhead dominating a tiny encoder: Galileo-Nano/8 trails an equivalent-MAC RegNetY by ~5×, and OlmoEarth-Nano/8 trails ResNet-18 by ~8×. At the large end, the encoder cost grows past whatever wrapper overhead exists and the geo-FMs catch up: Galileo-Large/8 runs at <strong>0.89×</strong> of ViT-H/14 at matched MACs, and <strong>OlmoEarth-Large/8 actually runs <em>faster</em> than ViT-L/8</strong> at matched MACs (205 vs 185 img/s, 1.11×) — once the encoder is big enough, the per-batch cost of the bandset and time-axis handling looks like a small share of total time.</p>
<p>A useful corollary: <strong>In this benchmark, MACs become more predictive of throughput at the largest scales, and less so at the small end.</strong> At ~464–474 G MACs, OlmoEarth-Large/8 (205 img/s) and ViT-L/8 (185 img/s) land within 11% of each other on H100 fp32 + no compile; both appear to be compute-bound. At the small end the picture inverts: OlmoEarth-Nano/8 has 1.3 G MACs but runs at 13% of the throughput of a near-same-MAC ResNet-18 (1.4 G MACs, 44,355 img/s). Without a profiler trace we can only speculate, but a plausible reading is that the ResNet’s 1.4 G is dominated by a few big matmuls that fuse into a handful of kernels, while the OlmoEarth wrapper’s many small per-modality / per-bandset ops launch hundreds of GPU kernels per batch. If a fixed per-batch wrapper cost is roughly constant across model sizes, then as the encoder grows, that cost takes up a smaller fraction of total time.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Throughput is one dimension; accuracy is another
</div>
</div>
<div class="callout-body-container callout-body">
<p>Cheaper isn’t automatically better. Large models exist because, in practice, they tend to deliver better downstream accuracy. The <a href="https://arxiv.org/abs/2511.13655">OlmoEarth paper</a> (<a href="https://arxiv.org/abs/2511.13655">Herzog et al., CVPR 2026</a>), for example, shows a clear monotonic increase in average task performance going from Nano → Tiny → Base on the benchmarks they evaluate — at the cost of the throughput we measure here. Throughput is only half the picture; the deployment question is which model is cheapest at your required accuracy — and accuracy is a question this post deliberately doesn’t answer. Treat the numbers here as one axis of a two-axis decision; the other axis lives in the foundation-model papers themselves.</p>
</div>
</div>
</section>
</section>
<section id="dataloader-overhead" class="level2">
<h2 class="anchored" data-anchor-id="dataloader-overhead">Dataloader overhead</h2>
<p>Inference throughput numbers are notoriously hard to compare across reports because the measurement protocol is rarely consistent. The most common confound is the input pipeline — how each batch gets to the GPU before the model runs.</p>
<p>To isolate the dataloader cost from any specific image decoder or augmentation stack, we wrote <a href="https://github.com/calebrob6/throughput-bench/blob/main/benchmark_dataloader_ipc.py"><code>benchmark_dataloader_ipc.py</code></a>: it runs ResNet-18 against a <strong>best-case <code>Dataset</code> that returns a cached random tensor for every <code>__getitem__</code></strong> — no image decoding, no augmentation, no disk I/O. What’s left in the pipeline is just <strong>batching, worker-to-main-process IPC, pinning, and the host→GPU copy</strong>. For each batch size and worker count, we measure three timings:</p>
<ul>
<li><strong>compute</strong> — model forward through a pre-allocated GPU batch, plus <code>cuda.synchronize()</code>. Upper bound, no dataloader involved.</li>
<li><strong>fetch-only</strong> — one batch out of the DataLoader, no model. Pure pipeline cost.</li>
<li><strong>end-to-end</strong> — fetch-only + host→GPU copy + compute. The realistic path.</li>
</ul>
<p>For a single 1024×3×224×224 batch on V100 with ResNet-18 at <code>num_workers=8</code>, those three paths cost:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 58%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th>Path</th>
<th>What runs in it</th>
<th style="text-align: right;">Time</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>compute</strong></td>
<td>model forward + <code>cuda.synchronize()</code> (batch already on GPU)</td>
<td style="text-align: right;"><strong>~218 ms</strong></td>
</tr>
<tr class="even">
<td><strong>fetch-only</strong></td>
<td>DataLoader batch fetch (<code>__getitem__</code>s on workers + IPC + <code>pin_memory</code>)</td>
<td style="text-align: right;"><strong>~222 ms</strong></td>
</tr>
<tr class="odd">
<td><strong>end-to-end</strong></td>
<td>fetch-only + host→GPU copy + compute (overlapped)</td>
<td style="text-align: right;"><strong>~275 ms</strong></td>
</tr>
</tbody>
</table>
<p>The fetch-only path (~222 ms) costs about as much as a full ResNet-18 forward pass (~218 ms) — and a chunk of the total wall time even after overlap. If two benchmark reports use different <code>num_workers</code>, different <code>pin_memory</code> settings, different host/device topology, or different storage backends, their throughput numbers measure different things. Worse, these confounds tend to compress the spread between models — if every backbone is bottlenecked on host→device transfer, ResNet-18 and ViT-L/16 look closer than they really are.</p>
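<p>To make the three timings easier to compare with the img/s figures used elsewhere in the post, the sketch below converts them to throughput and estimates how much of the fetch cost is hidden by overlap. The variable and function names are illustrative only; the timings are the rounded values from the table above.</p>

```python
BATCH_SIZE = 1024

# Per-batch wall times (ms) from the table: V100, ResNet-18, num_workers=8.
TIMINGS_MS = {"compute": 218, "fetch-only": 222, "end-to-end": 275}


def images_per_second(batch_size, milliseconds):
    """Convert a per-batch wall time into throughput."""
    return batch_size * 1000 / milliseconds


for path, ms in TIMINGS_MS.items():
    print(path, round(images_per_second(BATCH_SIZE, ms)), "img/s")

# Slowdown of the realistic path relative to the compute-only bound.
slowdown = TIMINGS_MS["end-to-end"] / TIMINGS_MS["compute"]

# With no overlap at all, a batch would take at least fetch + compute.
# The gap between that sum and the measured end-to-end time is what
# prefetching by the worker processes manages to hide.
serial_ms = TIMINGS_MS["fetch-only"] + TIMINGS_MS["compute"]
hidden_ms = serial_ms - TIMINGS_MS["end-to-end"]

print(round(slowdown, 2), serial_ms, hidden_ms)
```

<p>Read this way, overlap hides most of the fetch cost but not all of it: the end-to-end path is still roughly a quarter slower than compute alone.</p>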
<section id="worker-count" class="level3">
<h3 class="anchored" data-anchor-id="worker-count">Worker count</h3>
<p>Compute time per batch is steady at ~218 ms regardless of worker count; only the fetch path varies. With <strong><code>num_workers=1</code> the worker becomes the bottleneck</strong> and end-to-end throughput collapses to ~1,800 img/s (2.6× slowdown vs.&nbsp;compute). Two workers already get most of the benefit; 4–8 workers settle into the steady-state ~3,700 img/s band. Past 8 workers the picture is noisier: most configurations land in the same ~3,500–3,800 range, but occasional contention spikes drop throughput by 10–15%, presumably from main-process scheduling overhead with many concurrent workers:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th style="text-align: right;">Workers</th>
<th style="text-align: right;">End-to-end img/s</th>
<th style="text-align: right;">Slowdown</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: right;">1</td>
<td style="text-align: right;">1,795</td>
<td style="text-align: right;">2.60×</td>
</tr>
<tr class="even">
<td style="text-align: right;">2</td>
<td style="text-align: right;">3,597</td>
<td style="text-align: right;">1.30×</td>
</tr>
<tr class="odd">
<td style="text-align: right;">4</td>
<td style="text-align: right;">3,753</td>
<td style="text-align: right;">1.25×</td>
</tr>
<tr class="even">
<td style="text-align: right;">8</td>
<td style="text-align: right;">3,727</td>
<td style="text-align: right;">1.26×</td>
</tr>
<tr class="odd">
<td style="text-align: right;">12</td>
<td style="text-align: right;">3,301</td>
<td style="text-align: right;">1.42×</td>
</tr>
<tr class="even">
<td style="text-align: right;">16</td>
<td style="text-align: right;">3,196</td>
<td style="text-align: right;">1.47×</td>
</tr>
<tr class="odd">
<td style="text-align: right;">24</td>
<td style="text-align: right;">3,671</td>
<td style="text-align: right;">1.28×</td>
</tr>
</tbody>
</table>
<p>(batch=1024, <code>pin_memory=True</code>, V100, ResNet-18, dummy cached-tensor dataset; full sweep in the <a href="https://github.com/calebrob6/throughput-bench/blob/main/results/dataloader_ipc_consolidated.csv">results CSV</a>.)</p>
<p>The practical takeaway: <strong>realistic ResNet-18 V100 throughput at bs=1024 is ~3,727 img/s, not the ~4,690 the model-only timer reports</strong>, which is about 20% lower. With a real dataset that does decoding and augmentation, the gap will widen further. The headline numbers in this post are model-only; derate them accordingly when budgeting real workloads, and watch worker count especially if you’re sharing CPUs with other processes.</p>
</section>
</section>
<section id="precision-compile" class="level2">
<h2 class="anchored" data-anchor-id="precision-compile">Appendix: precision, compile, and cross-GPU effects</h2>
<p>The headline numbers in this post all use each model’s best precision + compile config. This appendix unpacks how that “best” gets there — which families gain the most from <code>torch.compile</code> and half precision, how the GPU type interacts with all of it, and how the geo-FMs respond to the same axes.</p>
<div class="tb-fig" id="tb-fig-precision-compile"></div>
<div class="tb-fig-caption"><em>Figure A1</em>: Precision + compile lift over fp32 + no compile, by family and GPU. ViT arrows are 4–5× long on V100 but only 2–3× on H100; ResNet arrows compress on V100 (3.2–4×) and slightly outpace ViTs on H100. Open circles mark the fp32 + no-compile baseline; filled circles mark each model's best (fp16/bf16 + compile). Hover any arrow for the speedup and configs.</div>
<p>The speedup you get from going to half precision and turning on <code>torch.compile</code> is <strong>not uniform across model families</strong>, and the family-by-family pattern is different on V100 than on H100. The tables below give the median speedup over the fp32 + no-compile baseline at 3×224, across all models in each family:</p>
<section id="h100-nvl-speedup-over-fp32-no-compile" class="level3">
<h3 class="anchored" data-anchor-id="h100-nvl-speedup-over-fp32-no-compile">H100 NVL — speedup over fp32 + no-compile</h3>
<p>Bolded entries mark each row’s best configuration.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 18%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 19%">
</colgroup>
<thead>
<tr class="header">
<th>Family</th>
<th style="text-align: right;">fp32 + compile</th>
<th style="text-align: right;">fp16 + compile</th>
<th style="text-align: right;">bf16 + compile</th>
<th style="text-align: right;">amp + compile</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>ResNet</td>
<td style="text-align: right;">1.86×</td>
<td style="text-align: right;"><strong>3.40×</strong></td>
<td style="text-align: right;">3.15×</td>
<td style="text-align: right;">3.21×</td>
</tr>
<tr class="even">
<td>EfficientNet</td>
<td style="text-align: right;">1.94×</td>
<td style="text-align: right;"><strong>2.95×</strong></td>
<td style="text-align: right;">2.79×</td>
<td style="text-align: right;">2.83×</td>
</tr>
<tr class="odd">
<td>MobileNet</td>
<td style="text-align: right;">1.79×</td>
<td style="text-align: right;"><strong>2.91×</strong></td>
<td style="text-align: right;">2.61×</td>
<td style="text-align: right;">2.72×</td>
</tr>
<tr class="even">
<td>RegNet</td>
<td style="text-align: right;">1.66×</td>
<td style="text-align: right;"><strong>2.56×</strong></td>
<td style="text-align: right;">2.48×</td>
<td style="text-align: right;">2.44×</td>
</tr>
<tr class="odd">
<td>DeiT</td>
<td style="text-align: right;">1.12×</td>
<td style="text-align: right;">2.39×</td>
<td style="text-align: right;"><strong>2.47×</strong></td>
<td style="text-align: right;">2.32×</td>
</tr>
<tr class="even">
<td>ConvNeXt</td>
<td style="text-align: right;">1.23×</td>
<td style="text-align: right;">2.27×</td>
<td style="text-align: right;"><strong>2.33×</strong></td>
<td style="text-align: right;">2.19×</td>
</tr>
<tr class="odd">
<td>Swin</td>
<td style="text-align: right;">1.49×</td>
<td style="text-align: right;">2.26×</td>
<td style="text-align: right;"><strong>2.33×</strong></td>
<td style="text-align: right;">2.20×</td>
</tr>
<tr class="even">
<td>ViT</td>
<td style="text-align: right;">1.05×</td>
<td style="text-align: right;">2.23×</td>
<td style="text-align: right;"><strong>2.33×</strong></td>
<td style="text-align: right;">2.19×</td>
</tr>
<tr class="odd">
<td>DinoV3</td>
<td style="text-align: right;">1.19×</td>
<td style="text-align: right;">2.13×</td>
<td style="text-align: right;"><strong>2.33×</strong></td>
<td style="text-align: right;">2.13×</td>
</tr>
<tr class="even">
<td>BEiT</td>
<td style="text-align: right;">1.06×</td>
<td style="text-align: right;">2.09×</td>
<td style="text-align: right;"><strong>2.17×</strong></td>
<td style="text-align: right;">2.04×</td>
</tr>
<tr class="odd">
<td>CoAtNet</td>
<td style="text-align: right;">1.21×</td>
<td style="text-align: right;"><strong>1.71×</strong></td>
<td style="text-align: right;">1.71×</td>
<td style="text-align: right;">1.69×</td>
</tr>
</tbody>
</table>
</section>
<section id="v100-speedup-over-fp32-no-compile" class="level3">
<h3 class="anchored" data-anchor-id="v100-speedup-over-fp32-no-compile">V100 — speedup over fp32 + no-compile</h3>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Family</th>
<th style="text-align: right;">fp32 + compile</th>
<th style="text-align: right;">fp16 + compile</th>
<th style="text-align: right;">amp + compile</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>DinoV3</td>
<td style="text-align: right;">0.99×</td>
<td style="text-align: right;"><strong>5.05×</strong></td>
<td style="text-align: right;">4.94×</td>
</tr>
<tr class="even">
<td>BEiT</td>
<td style="text-align: right;">1.04×</td>
<td style="text-align: right;"><strong>4.85×</strong></td>
<td style="text-align: right;">4.54×</td>
</tr>
<tr class="odd">
<td>ViT</td>
<td style="text-align: right;">1.02×</td>
<td style="text-align: right;"><strong>4.81×</strong></td>
<td style="text-align: right;">4.68×</td>
</tr>
<tr class="even">
<td>DeiT</td>
<td style="text-align: right;">1.04×</td>
<td style="text-align: right;"><strong>4.37×</strong></td>
<td style="text-align: right;">4.23×</td>
</tr>
<tr class="odd">
<td>ConvNeXt</td>
<td style="text-align: right;">1.08×</td>
<td style="text-align: right;"><strong>4.29×</strong></td>
<td style="text-align: right;">4.11×</td>
</tr>
<tr class="even">
<td>ResNet</td>
<td style="text-align: right;">1.20×</td>
<td style="text-align: right;"><strong>4.00×</strong></td>
<td style="text-align: right;">3.96×</td>
</tr>
<tr class="odd">
<td>Swin</td>
<td style="text-align: right;">1.17×</td>
<td style="text-align: right;"><strong>3.87×</strong></td>
<td style="text-align: right;">3.75×</td>
</tr>
<tr class="even">
<td>CoAtNet</td>
<td style="text-align: right;">1.21×</td>
<td style="text-align: right;"><strong>3.60×</strong></td>
<td style="text-align: right;">3.57×</td>
</tr>
<tr class="odd">
<td>RegNet</td>
<td style="text-align: right;">1.03×</td>
<td style="text-align: right;"><strong>3.03×</strong></td>
<td style="text-align: right;">2.97×</td>
</tr>
<tr class="even">
<td>EfficientNet</td>
<td style="text-align: right;">1.46×</td>
<td style="text-align: right;"><strong>2.99×</strong></td>
<td style="text-align: right;">2.97×</td>
</tr>
<tr class="odd">
<td>MobileNet</td>
<td style="text-align: right;">1.42×</td>
<td style="text-align: right;"><strong>2.66×</strong></td>
<td style="text-align: right;">2.60×</td>
</tr>
</tbody>
</table>
<p>We find:</p>
<p><strong>Plain ViTs see a much bigger lift on V100 than on H100.</strong> BEiT-L/16 goes 5.05× faster on V100 with fp16 + compile, but only 2.07× on H100. ViT-B/16 sees 4.81× on V100, 2.23× on H100. The reason is that V100’s <code>fp32</code> matmul path doesn’t use Tensor Cores at all; it runs on the regular CUDA cores at ~16 TFLOPs. fp16 unlocks the V100’s Tensor Cores at ~125 TFLOPs (~8× peak ratio), and ViTs are mostly big matmuls, exactly the workload Tensor Cores were built for, so they capture a large share of that speedup. On H100, the <code>fp32</code> rows are already TF32 (Tensor Cores with a 10-bit mantissa), so fp16 only widens the lane that’s already open: ~2× peak ratio, not 8×. The V100 fp16 cliff for transformers is real, but it’s a measure of how <em>bad</em> V100 fp32 is, not how good V100 fp16 is.</p>
<p><strong>CNNs show a much narrower gap between the two GPUs.</strong> ResNet sees 3.40× on H100 vs.&nbsp;4.00× on V100; MobileNet sees 2.91× vs.&nbsp;2.66×, the one family in the tables above whose lift is larger on H100. CNNs spend a much larger fraction of their wall time on small, non-matmul ops (depthwise convs, BN, activations, kernel launches), so the compute speedup of half precision is diluted on both GPUs. On H100 the small-ops penalty is also smaller (faster scheduler, larger SMs), which is why MobileNetV3-S clears 115K img/s.</p>
<p><strong><code>torch.compile</code> alone (fp32 + compile) does very little.</strong> On both GPUs the median lift across families is in the 1.0–1.9× range. The big wins are paired: precision <em>and</em> compile. Compiled fp32 alone rarely justifies the build time.</p>
<p><strong>CoAtNet is an outlier.</strong> This hybrid CNN-ViT family has the smallest precision lift on H100 (1.71×). We aren’t sure why, but we reproduced it several times!</p>
</section>
<section id="gpu-choice-vs.-model-choice" class="level3">
<h3 class="anchored" data-anchor-id="gpu-choice-vs.-model-choice">GPU choice vs.&nbsp;model choice</h3>
<div class="tb-fig" id="tb-fig-cross-gpu"></div>
<div class="tb-fig-caption"><em>Figure A2</em>: Cross-GPU lift, V100 best vs H100 NVL best at 3×224×224. Diagonal reference lines mark 1×, 3×, 5×, and 10× speedup. Most models cluster between 3× and 5×, with plain transformers (ViT/DeiT3/BEiT/DOFA) at the high end (4–5×) and CoAtNet/EfficientNet at the low end (2–3×). The figure shows only 3×224 models, so geo-FMs other than DOFA aren't plotted. Point size scales with parameter count. Toggle families with chips, hover any point for both throughputs and the speedup.</div>
<p>Stack the precision/compile lift on top of the GPU change and you get the full cross-GPU picture. Fix the model, take each GPU’s best precision + compile config: ResNet-18 throughput goes <strong>14,762 → 53,478 img/s</strong> (V100 → H100, 3.6×); ViT-L/16 goes <strong>529 → 2,585</strong> (4.9×); Galileo-Large/8 goes <strong>183 → 843</strong> (4.6×). At the 4 B-patches/year baseline, that translates to 75 → 21 GPU-h/yr for ResNet-18, 2,100 → 430 for ViT-L/16, and ~6,070 → ~1,320 for Galileo-Large/8.</p>
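<p>The GPU-hour figures above follow directly from throughput. The sketch below redoes that arithmetic under the post’s 4 B-patches/year baseline, and prices the H100 hours at the $3.11/hr rate from the cost footnote. The names are illustrative, and the throughputs are the rounded values quoted in the paragraph, so results can differ from the text in the last digit.</p>

```python
PATCHES_PER_YEAR = 4_000_000_000  # the post's 4 B-patches/year baseline
H100_USD_PER_HOUR = 3.11          # H100 NVL rate from the cost footnote


def gpu_hours_per_year(img_per_s, patches=PATCHES_PER_YEAR):
    """GPU-hours needed to push `patches` images through at `img_per_s`."""
    return patches / img_per_s / 3600


# (V100 best img/s, H100 best img/s), as quoted in the paragraph above.
BEST_IMG_PER_S = {
    "ResNet-18": (14762, 53478),
    "ViT-L/16": (529, 2585),
    "Galileo-Large/8": (183, 843),
}

for model, (v100, h100) in BEST_IMG_PER_S.items():
    v100_hours = gpu_hours_per_year(v100)
    h100_hours = gpu_hours_per_year(h100)
    print(
        model,
        round(v100_hours), "V100 GPU-h/yr,",
        round(h100_hours), "H100 GPU-h/yr,",
        round(h100 / v100, 1), "x lift,",
        round(h100_hours * H100_USD_PER_HOUR), "USD/yr on H100",
    )
```

<p>Swapping in a different patch budget or hourly rate only rescales the outputs, since both enter linearly.</p>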
<p>Once AMP is unlocked across the matrix on V100 (and SenPaMAE/OlmoEarth gain full fp16/bf16), the cross-GPU spread is roughly flat at 3–5× across families — the geo-FMs aren’t the outliers they were when V100 was stuck on the slowest fp32 path. The point worth keeping: the <strong>spread across models on a single GPU</strong> (~205× from MobileNetV3-S to ViT-L/8) <strong>dwarfs the spread across GPUs for any single model</strong> (3–5×). Model choice dominates GPU choice for this workload.</p>
</section>
</section>
<section id="links" class="level2">
<h2 class="anchored" data-anchor-id="links">Links</h2>
<ul>
<li><strong>Code:</strong> <a href="https://github.com/calebrob6/throughput-bench">github.com/calebrob6/throughput-bench</a></li>
<li><strong>Interactive viewer:</strong> <a href="https://calebrob.com/throughput-bench/">calebrob.com/throughput-bench</a></li>
<li><strong>Results CSVs:</strong> <a href="https://github.com/calebrob6/throughput-bench/tree/main/results"><code>results/</code></a> (V100, H100 NVL)</li>
</ul>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Costs throughout this post use <strong>$3.11/hr per H100 NVL</strong>, the average across cloud providers from <a href="https://getdeploying.com/gpus/nvidia-h100">getdeploying.com/gpus/nvidia-h100</a>. Costs scale linearly with GPU-hours, so swap in your own rate if you’ve negotiated a better one.↩︎</p></li>
<li id="fn2"><p>ViT-L/8 is non-standard — vanilla ViT-L is typically <code>/16</code>. We added it (along with ViT-H/14, ViT-G/14, and <a href="https://arxiv.org/abs/2508.10104">DinoV3</a>-H+/16) as same-compute / same-token-count comparators for the larger geo-FMs. At 3×224 with patch size 8, ViT-L/8 produces (224/8)² = 784 tokens, roughly matching OlmoEarth-Large/8’s 768 tokens at 12×128, and at 474 G multiply-accumulates per forward pass (MACs) vs OlmoEarth-Large/8’s 464 G it’s a near-MAC match. We use it as the apples-to-apples vanilla ViT for the large geo-FMs in the Geo-FMs vs.&nbsp;shape-matched timm baselines section.↩︎</p></li>
<li id="fn3"><p>The only models that didn’t fit at 512 were CoAtNet-0 and CoAtNet-2, both of which the harness halved to 256 on both V100 and H100. Every other model fits at 512 on both GPUs.↩︎</p></li>
<li id="fn4"><p>Galileo and OlmoEarth are <em>time-series multimodal</em> models that can ingest a sequence of images across multiple modalities; we benchmark them in their simplest configuration (single image, single timestep, Sentinel-2 only). Both wrappers use <code>max_patch_size=8</code> and <code>max_sequence_length=1</code>. Both also tokenize each image as multiple bandsets (groups of spectral channels with separate patch embeddings): OlmoEarth at 12×128 produces 16×16 × 3 bandsets × 1 timestep = <strong>768 tokens per image</strong> (12-band L2A); Galileo at 10×64 produces 8×8 × 5 bandsets × 1 timestep = <strong>320 tokens</strong> (10-band S2 split into RGB / Red Edge / NIR-10m / NIR-20m / SWIR). DOFA, CROMA, and SenPaMAE are not time-series models, so their wrapper settings are minimal: DOFA passes the per-channel wavelength tensor (Sentinel-2 wavelengths truncated to <code>num_channels</code>); CROMA dispatches by modality (<code>"optical"</code> → 12-band Sentinel-2, <code>"SAR"</code> → 2-band Sentinel-1); SenPaMAE uses dummy SRF and GSD constants (the forward needs them but they don’t affect throughput). All five families run with random weights, since throughput doesn’t depend on the parameter values. Full wrapper code: <a href="https://github.com/calebrob6/throughput-bench/blob/main/geo_models.py"><code>geo_models.py</code></a>.↩︎</p></li>
<li id="fn5"><p>We skip bf16 on V100 — Volta has no hardware support for it (Ampere added it).↩︎</p></li>
<li id="fn6"><p>The harness also exposes <code>max-autotune</code>, but we didn’t include it in this sweep.↩︎</p></li>
<li id="fn7"><p>The Galileo paper (<a href="https://arxiv.org/abs/2502.09356">Tseng et al.&nbsp;2025</a>, Table 12) only releases Nano, Tiny, and Base checkpoints. <strong>Galileo-Large/8</strong> in our matrix is a synthetic ViT-Large (1280 dim × 24 depth × 16 heads, 474.7M params) plugged into the same wrapper — a “what does this architecture look like at ViT-Large scale” data point, not an officially released model.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{robinson2026,
  author = {Robinson, Caleb and Corley, Isaac},
  title = {ThroughputBench: {How} Fast Can a Deep Learning Model Map the
    {Earth?}},
  date = {2026-05-04},
  url = {https://geospatialml.com/posts/throughput-bench/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-robinson2026" class="csl-entry quarto-appendix-citeas">
Robinson, Caleb, and Isaac Corley. 2026. <span>“ThroughputBench: How
Fast Can a Deep Learning Model Map the Earth?”</span> May 4. <a href="https://geospatialml.com/posts/throughput-bench/">https://geospatialml.com/posts/throughput-bench/</a>.
</div></div></section></div> ]]></description>
  <category>benchmarking</category>
  <category>throughput</category>
  <category>inference</category>
  <category>gpu</category>
  <category>timm</category>
  <category>foundation-models</category>
  <category>classification</category>
  <category>sdgs</category>
  <guid>https://geospatialml.com/posts/throughput-bench/</guid>
  <pubDate>Mon, 04 May 2026 00:00:00 GMT</pubDate>
  <media:content url="https://geospatialml.com/posts/throughput-bench/throughput_bench.png" medium="image" type="image/png" height="76" width="144"/>
</item>
<item>
  <title>Compressing Earth Embeddings, pt. 3 – DeltaBit</title>
  <dc:creator>Caleb Robinson</dc:creator>
  <dc:creator>Isaac Corley</dc:creator>
  <link>https://geospatialml.com/posts/change-detection/</link>
  <description><![CDATA[ 





<p>In our <a href="../../posts/terrabit/">TerraBit post</a> last time, we binary-quantized the global Clay v1.5 embeddings down to 128 bytes per patch and served 50M of them from static object storage — implementing planetary scale retrieval entirely in the browser. The first post in the series, <a href="../../posts/compressing-earth-embeddings/">Compressing Earth Embeddings</a>, set up the underlying claim: int8 quantization is statistically free across every model and dataset we tested, and PCA can strip most dimensions without meaningful accuracy loss.</p>
<p>Those posts were about <em>patch embeddings</em> — one vector per Sentinel-2 chip/patch/image, queried like a vector database.</p>
<p>This is the other half of the story: <em>pixel embeddings</em>. Dense prediction tasks, like change detection, segmentation, and anomaly maps, need a vector at every pixel and to be available everywhere the user is looking. A single Sentinel-2 tile at 10m resolution holds ~120 million pixels; even at <a href="https://arxiv.org/abs/2507.22291">AlphaEarth</a>’s (AEF) native 64-dimensional int8 (currently the most compact per-pixel earth embedding released) that’s ~7.7 GB per scene, and wider float32 models like DINOv3 ViT-L are ~60× larger per pixel. <strong>However, the same compression that made TerraBit possible should let us serve per-pixel embeddings as XYZ map tiles and train models on them in the browser.</strong> This post tests this idea on change detection.</p>
<video autoplay="" muted="" loop="" playsinline="" style="width:100%; border-radius:6px;">
<source src="deltabit-demo-1.webm" type="video/webm">
</video>
<p>We built a demo called <strong><a href="https://calebrob.com/deltabit/">DeltaBit</a></strong>. Zoom in on Seattle, click a few dozen change/no-change pixels on the map, and a logistic regression trains (in the browser, on just those clicks — no server round-trip) and runs inference across every visible tile in milliseconds. Tiles are int8-quantized PCA-8 difference embeddings — 8 bytes per pixel — served as standard <code>{z}/{x}/{y}.tif</code> GeoTIFFs from a web server.</p>
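<p>A quick back-of-the-envelope check of the storage numbers above, as a sketch rather than a measurement. It uses the approximate 120 M pixels per Sentinel-2 tile quoted earlier and assumes 1024-dimensional float32 features for DINOv3 ViT-L; <code>scene_bytes</code> is a hypothetical helper, and the totals ignore file-format overhead and compression.</p>

```python
def scene_bytes(pixels, dims, bytes_per_value):
    """Raw size of a dense per-pixel embedding raster, without compression."""
    return pixels * dims * bytes_per_value


S2_TILE_PIXELS = 120_000_000  # approximate pixel count of one 10m tile

aef_int8 = scene_bytes(S2_TILE_PIXELS, dims=64, bytes_per_value=1)
pca8_int8 = scene_bytes(S2_TILE_PIXELS, dims=8, bytes_per_value=1)
# Assumption: DINOv3 ViT-L emits 1024-d float32 features per pixel.
dinov3_f32 = scene_bytes(S2_TILE_PIXELS, dims=1024, bytes_per_value=4)

print(aef_int8 / 1e9, "GB for 64-d int8")
print(pca8_int8 / 1e9, "GB for PCA-8 int8")
print(dinov3_f32 // aef_int8, "x larger for 1024-d float32")
```

<p>Going from 64-d int8 to PCA-8 int8 is an 8× cut per pixel, which is what brings a full scene under 1 GB before any tiling or file compression.</p>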
<section id="study-area-and-data" class="level2">
<h2 class="anchored" data-anchor-id="study-area-and-data">Study area and data</h2>
<p>We used the <a href="https://source.coop/tge-labs/aef-mosaic">AEF Mosaic</a> — a global Zarr v3 archive of AlphaEarth embedding fields on S3, built by <a href="https://github.com/geospatial-jeff">Jeff Albrecht</a> in collaboration with Taylor Geospatial, with annual composites from 2017–2025. Each pixel is a 64-dimensional signed int8 vector aligned to the Sentinel-2 10m grid.</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Property</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Sentinel-2 MGRS tile</td>
<td>T10TET (Seattle metro)</td>
</tr>
<tr class="even">
<td>Comparison years</td>
<td>2020 vs 2024</td>
</tr>
<tr class="odd">
<td>Raster</td>
<td>11,099 × 16,337 pixels (181.3 M pixels)</td>
</tr>
<tr class="even">
<td>Embedding</td>
<td>64-d int8 per pixel, AEF Mosaic</td>
</tr>
</tbody>
</table>
<p>The per-pixel temporal difference (<code>emb_2024 − emb_2020</code>) gives a 64-d float32 feature vector at every location. Pixels where the surface didn’t change cluster near zero; changed pixels project to non-zero directions in that 64-d space, and the experiment below tests how well a linear model can read those directions back out.</p>
</section>
<section id="compressing-the-difference-vectors" class="level2">
<h2 class="anchored" data-anchor-id="compressing-the-difference-vectors">Compressing the difference vectors</h2>
<p>Our working hypothesis from pt.&nbsp;1 is that most of the nominal dimensions in an earth embedding are redundant — the <em>intrinsic</em> dimensionality is much lower — and <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a> is the tool we use to collapse down to a compact basis. DeltaBit is a test of whether that same idea holds for <em>difference</em> vectors, where we stand to gain the most from it. We fit PCA on a 1-in-1000 subsample of the full raster diff (~181K pixels) and looked at how variance is distributed across the 64 components.</p>
<div class="theme-figure">
<p><img src="https://geospatialml.com/posts/change-detection/explained_variance_cdf.png" class="theme-light img-fluid" style="width:85.0%" alt="Cumulative explained variance of PCA fit on the Seattle AEF embedding diff."> <img src="https://geospatialml.com/posts/change-detection/explained_variance_cdf_dark.png" class="theme-dark img-fluid" style="width:85.0%" alt="Cumulative explained variance of PCA fit on the Seattle AEF embedding diff (dark theme)."></p>
<p>Cumulative explained variance of PCA fit on a 1-in-1000 subsample of the Seattle 2020→2024 AEF embedding diff. No single component dominates: PC1 captures 18%, 8 components reach 48%, and 40 components are needed for 90%.</p>
</div>
<p><strong>Variance is broadly distributed.</strong> PC1 captures only 18.3%; you need 9 components to cross 50% and 40 to cross 90%. This is much flatter than the static-embedding case from our compression post, where AEF reaches 80% variance in 8 dimensions. One hypothesis: differencing partly cancels the dominant scene-level modes — land cover, broad vegetation state — that two nearby years share, and what’s left is a more isotropic mix of higher-frequency change signals and embedding noise. We haven’t validated that directly.</p>
<p><strong>PCA-8 retains less than half the variance.</strong> That’s a strong test of whether per-pixel change detection survives aggressive compression. Variance is not the same as predictive signal, but if the model needs the long tail, PCA-8 will lose it.</p>
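<p>The “how many components to cross X%” numbers above come from a cumulative sum over the PCA eigenvalues. A minimal sketch of that bookkeeping follows; the two spectra are hypothetical shapes chosen to contrast a flat and a peaked case, not the measured Seattle values.</p>

```python
def cumulative_explained_variance(eigenvalues):
    """Cumulative fraction of total variance, with components sorted high to low."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running, out = 0.0, []
    for value in ordered:
        running += value
        out.append(running / total)
    return out

def components_to_reach(eigenvalues, threshold):
    """Smallest number of leading components whose variance share >= threshold."""
    for k, share in enumerate(cumulative_explained_variance(eigenvalues), start=1):
        if share >= threshold:
            return k
    return len(eigenvalues)

# Hypothetical spectra (NOT the measured Seattle values): one flat, one peaked.
flat = [1.0 / (i + 1) ** 0.5 for i in range(64)]
peaked = [1.0 / (i + 1) ** 2 for i in range(64)]
for name, spectrum in (("flat", flat), ("peaked", peaked)):
    print(name, components_to_reach(spectrum, 0.5), components_to_reach(spectrum, 0.9))
```

<p>A flat spectrum needs many more components to reach the same threshold, which is the qualitative difference between the diff and the static embeddings described above.</p>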
</section>
<section id="change-detection-with-pixel-supervision" class="level2">
<h2 class="anchored" data-anchor-id="change-detection-with-pixel-supervision">Change detection with pixel supervision</h2>
<p>We labeled 336 pixels by hand — 190 change, 146 no-change — using a side-by-side Sentinel-2 swipe of 2020 vs 2024 true-color imagery. At each labeled pixel we sample the 2020 and 2024 embeddings, compute the diff, and project it through three feature pipelines:</p>
<ul>
<li><strong>PCA-3</strong> — first 3 principal components (3-d)</li>
<li><strong>PCA-8</strong> — first 8 principal components (8-d)</li>
<li><strong>Full diff</strong> — the raw 64-d difference, no PCA</li>
</ul>
<p>For each, we run logistic regression with 10-fold stratified nested cross-validation: an outer 10-fold for unbiased performance estimation, and an inner 5-fold sweep to pick the regularization strength on F1.<sup>1</sup></p>
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th>Feature set</th>
<th>Accuracy</th>
<th>F1</th>
<th>Precision</th>
<th>Recall</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>PCA-3 (3-d)</td>
<td>0.90 ± 0.05</td>
<td>0.90 ± 0.05</td>
<td>0.97 ± 0.04</td>
<td>0.85 ± 0.10</td>
</tr>
<tr class="even">
<td>PCA-8 (8-d)</td>
<td>0.96 ± 0.04</td>
<td>0.96 ± 0.04</td>
<td>0.98 ± 0.03</td>
<td>0.94 ± 0.06</td>
</tr>
<tr class="odd">
<td>Full diff (64-d)</td>
<td><strong>0.99 ± 0.02</strong></td>
<td><strong>0.99 ± 0.01</strong></td>
<td>1.00 ± 0.00</td>
<td><strong>0.98 ± 0.03</strong></td>
</tr>
</tbody>
</table>
<p><em>10-fold stratified CV on 336 hand-labeled pixels. Inner 5-fold for <code>C</code> selection on F1. Mean ± std across outer folds. Bolds mark column-wise bests on the metrics that drive the comparison; precision saturates near 1.0 across all three feature sets, so it isn’t a useful tiebreaker.</em></p>
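<p>To make the nested structure concrete, here is a minimal sketch of the index bookkeeping only: stratified outer folds, with inner folds drawn from each outer training split. The model fit and the sweep over <code>C</code> are left out, and the round-robin fold assignment is an assumption for illustration rather than the exact splitter used for the table.</p>

```python
import random

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds that preserve the class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        # Deal indices round-robin so class counts differ by at most one per fold.
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds

def nested_splits(labels, outer_k=10, inner_k=5, seed=0):
    """Yield (train_idx, test_idx, inner_folds) for each outer fold."""
    outer = stratified_folds(labels, outer_k, seed)
    for held_out in range(outer_k):
        test_idx = outer[held_out]
        train_idx = [i for f, fold in enumerate(outer) if f != held_out for i in fold]
        # Inner folds are built from the outer-training labels only, so the
        # held-out fold never influences the choice of regularization strength.
        train_labels = [labels[i] for i in train_idx]
        inner = [[train_idx[j] for j in fold]
                 for fold in stratified_folds(train_labels, inner_k, seed + 1)]
        yield train_idx, test_idx, inner

labels = [1] * 190 + [0] * 146  # 190 change, 146 no-change, as in the post
for train_idx, test_idx, inner in nested_splits(labels):
    n_change = sum(labels[i] for i in test_idx)
    print(len(train_idx), len(test_idx), n_change, [len(f) for f in inner])
```

<p>The key property is that hyperparameter selection only ever sees the outer training indices, so the outer-fold scores stay an honest estimate.</p>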
<p>A few things worth pulling out:</p>
<p><strong>PCA-8 is the practical sweet spot.</strong> 96% F1 with 8 features is more than enough for an interactive viewer. The demo’s compressed tiles (int8-quantized PCA-8 at 8 bytes per pixel) are <strong>8× smaller than the source AEF int8 embeddings</strong> that went in (and 32× smaller than the raw float32 diff the CV ran on), at a cost of about three points of F1 relative to the full 64-d diff. The CV here is on float32 PCA-8; pt.&nbsp;1 found int8 quantization on top of PCA to be statistically indistinguishable from float32, so we expect the in-browser tiles to inherit the same number.</p>
<p><strong>Precision is consistently ≥0.97.</strong> The model rarely flags no-change pixels as change. The error mode is missed changes: recall climbs from 0.85 (PCA-3) to 0.94 (PCA-8) as feature dimensionality grows.</p>
<p><strong>PCA-3 is useful even without a model.</strong> The first three components already reach 90% F1 on a linear probe, which means the change signal is concentrated enough that <em>just looking</em> at the top 3 PCs — mapped directly to R/G/B — surfaces changed pixels by eye. DeltaBit exposes this as an overlay toggle at the top of the labeling panel, and changed areas jump out with no training at all:</p>
<video autoplay="" muted="" loop="" playsinline="" style="width:100%; border-radius:6px;">
<source src="deltabit-demo-3.webm" type="video/webm">
</video>
<p>The middle row of the table above (PCA-8) is what <a href="https://calebrob.com/deltabit/">DeltaBit</a> serves. The same logistic regression evaluated above gets re-fit in the browser on whatever you click, in a few hundred milliseconds. 8 bytes per pixel also means the <strong>entire Seattle scene</strong> fits in 2.60 GiB of XYZ tiles at zoom levels 8–14. The browser fetches GeoTIFFs on demand with <a href="https://geotiffjs.github.io/">GeoTIFF.js</a>, caches them per tile key, and runs a WebGPU dot product over each tile to score every pixel.</p>
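<p>The “8 bytes per pixel” figure is one signed byte per PCA component. Below is a minimal sketch of symmetric int8 quantization for a single pixel. The scale choice (maximum magnitude mapped to 127) and the example values are assumptions for illustration; the post does not specify the exact scheme used for the tiles.</p>

```python
def quantize_int8(values, scale):
    """Map floats to signed 8-bit integers with a shared symmetric scale."""
    out = []
    for v in values:
        q = round(v / scale)
        out.append(max(-127, min(127, q)))  # clip to the symmetric int8 range
    return out

def dequantize_int8(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in codes]

# A hypothetical PCA-8 difference vector for one pixel (illustrative values).
pixel = [0.91, -0.40, 0.07, 0.00, -1.25, 0.33, 0.58, -0.02]
scale = max(abs(v) for v in pixel) / 127.0  # assumption: scale from the max magnitude
codes = quantize_int8(pixel, scale)
packed = bytes(q & 0xFF for q in codes)     # two's-complement bytes, 8 per pixel
restored = dequantize_int8(codes, scale)

print(len(packed), "bytes per pixel")
print("max round-trip error:", max(abs(a - b) for a, b in zip(pixel, restored)))
print("vs. 64-d int8:", 64 // len(packed), "x; vs. 64-d float32:", 64 * 4 // len(packed), "x")
```

<p>With no clipping, the round-trip error per component is bounded by half the scale, which is why a linear model on top barely notices the quantization.</p>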
<p>What 96% F1 actually looks like, on a rural area south of Seattle:</p>
<div class="img-switcher" tabindex="0">
<div class="img-switcher-tabs">
<p><button type="button" class="is-active" data-idx="0" aria-selected="true">2020 Sentinel-2</button> <button type="button" data-idx="1" aria-selected="false">2024 Sentinel-2</button> <button type="button" data-idx="2" aria-selected="false">Predicted change</button></p>
</div>
<div class="img-switcher-stage">
<p><img src="https://geospatialml.com/posts/change-detection/s2_2020.jpg" data-idx="0" class="is-active" alt="2020 Sentinel-2 true-color crop of rural area south of Seattle in the DeltaBit viewer."> <img src="https://geospatialml.com/posts/change-detection/s2_2024.jpg" data-idx="1" alt="2024 Sentinel-2 true-color crop of the same area; several parcels visibly cleared."> <img src="https://geospatialml.com/posts/change-detection/change.jpg" data-idx="2" alt="DeltaBit predicted change overlay. Cleared parcels are flagged red over a blue basemap."></p>
</div>
<p class="img-switcher-caption">
Three views of the same crop: click a tab or press <kbd>←</kbd>/<kbd>→</kbd> to flip between the 2020 and 2024 Sentinel-2 true-color panels and the trained model’s change prediction. Several parcels were cleared between the two dates and the model — a logistic regression fit in the browser on a few dozen clicks — recovers them as red pixels in the third panel.
</p>
</div>
<p>The same flow, end to end in the viewer — 14 clicks is enough to stand up a deforestation detector on this area:</p>
<video autoplay="" muted="" loop="" playsinline="" style="width:100%; border-radius:6px;">
<source src="deltabit-demo-2.webm" type="video/webm">
</video>
<p>We zoom into a rural patch south of Seattle, place 8 change clicks on cleared parcels and 6 no-change clicks on untouched forest, hit <strong>Train</strong>, and switch to the heatmap view. The model generalizes cleanly across the local area, lighting up the other clearings we didn’t label.</p>
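<p>To give a feel for how little machinery is involved, here is a minimal sketch of fitting a logistic regression on 14 clicks with plain batch gradient descent. The feature values, learning rate, epoch count, and L2 strength are all hypothetical, and the optimizer DeltaBit actually runs in the browser may be different.</p>

```python
import math

def sigmoid(z):
    # Clamp the logit so math.exp cannot overflow on extreme inputs.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(features, labels, lr=0.5, epochs=500, l2=1e-3):
    """Batch gradient descent on the L2-regularized logistic loss."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Probability of change for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical 8-d features for 8 "change" and 6 "no-change" clicks: changed
# pixels are displaced along the first two components, unchanged ones sit near zero.
change = [[1.0 + 0.1 * i, -0.8, 0.1, 0.0, 0.0, 0.05, 0.0, 0.0] for i in range(8)]
no_change = [[0.05 * i, 0.02, 0.0, 0.01, 0.0, 0.0, 0.0, 0.0] for i in range(6)]
w, b = fit_logistic(change + no_change, [1] * 8 + [0] * 6)

print(round(predict(w, b, [1.2, -0.7, 0, 0, 0, 0, 0, 0]), 3))  # unlabeled cleared parcel
print(round(predict(w, b, [0.1, 0.0, 0, 0, 0, 0, 0, 0]), 3))   # unlabeled forest
```

<p>The fitted model is just 8 weights and a bias, so scoring a tile reduces to one dot product per pixel.</p>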
<style>
.img-switcher {
  max-width: 900px;
  margin: 1.25rem auto;
  outline: none;
}
.img-switcher-tabs {
  display: flex;
  flex-wrap: wrap;
  justify-content: center;
  gap: 0.4rem;
  margin-bottom: 0.75rem;
}
.img-switcher-tabs > p,
.img-switcher-stage > p {
  display: contents;
}
.img-switcher-tabs button {
  font-family: 'Space Grotesk', system-ui, sans-serif;
  font-size: 0.82rem;
  font-weight: 600;
  padding: 0.35rem 0.85rem;
  background: var(--geo-bg-card);
  color: var(--geo-text-dim);
  border: 1px solid var(--geo-border);
  border-radius: 8px;
  cursor: pointer;
  transition: background 0.15s, color 0.15s, border-color 0.15s;
}
.img-switcher-tabs button:hover {
  border-color: var(--geo-accent);
  color: var(--geo-text);
}
.img-switcher-tabs button.is-active {
  background: var(--geo-accent);
  color: #fff;
  border-color: var(--geo-accent);
}
.img-switcher-stage {
  position: relative;
  border-radius: 10px;
  overflow: hidden;
  border: 1px solid var(--geo-border);
  line-height: 0;
}
.img-switcher-stage img {
  display: none;
  width: 100%;
  max-width: 100%;
  height: auto;
  margin: 0;
  border-radius: 0;
}
.img-switcher-stage img.is-active { display: block; }
.img-switcher-caption {
  font-size: 0.82rem;
  color: var(--geo-text-dim);
  text-align: center;
  margin-top: 0.6rem;
  line-height: 1.45;
}
.img-switcher:focus-visible .img-switcher-stage {
  box-shadow: 0 0 0 2px var(--geo-accent);
}
.img-switcher kbd {
  font-family: var(--geo-mono);
  font-size: 0.78em;
  padding: 0 0.3em;
  border: 1px solid var(--geo-border);
  border-radius: 4px;
  background: var(--geo-bg-card);
}
</style>
<script>
(function () {
  document.querySelectorAll('.img-switcher').forEach(function (root) {
    const tabs = Array.from(root.querySelectorAll('.img-switcher-tabs button'));
    const imgs = Array.from(root.querySelectorAll('.img-switcher-stage img'));
    const n = tabs.length;
    if (!n || imgs.length !== n) return;
    let active = 0;

    function setActive(i) {
      active = ((i % n) + n) % n;
      tabs.forEach(function (t, k) {
        t.classList.toggle('is-active', k === active);
        t.setAttribute('aria-selected', k === active ? 'true' : 'false');
      });
      imgs.forEach(function (im, k) {
        im.classList.toggle('is-active', k === active);
      });
    }

    tabs.forEach(function (t, k) {
      t.addEventListener('click', function () {
        setActive(k);
        root.focus();
      });
    });

    root.addEventListener('keydown', function (e) {
      if (e.key === 'ArrowRight' || e.key === 'ArrowDown') {
        setActive(active + 1);
        e.preventDefault();
      } else if (e.key === 'ArrowLeft' || e.key === 'ArrowUp') {
        setActive(active - 1);
        e.preventDefault();
      } else if (/^[1-9]$/.test(e.key)) {
        const idx = parseInt(e.key, 10) - 1;
        if (idx < n) {
          setActive(idx);
          e.preventDefault();
        }
      }
    });
  });
})();
</script>
<p>One thing we noticed — and which in retrospect had to work this way: the lower zoom levels in the pyramid are built by 2×2-averaging the int8 PCA-8 tiles directly, and the <em>same</em> trained linear model produces coherent change predictions at those zooms. Zoom out in the viewer and cleared parcels and construction still light up. The whole pipeline is linear — differencing, PCA, and the model’s pre-sigmoid logit all commute with averaging — so the logit of an averaged tile equals the mean of its four child logits, modulo a bit of int8 rounding. The sigmoid mildly distorts that on the probability side, but not enough to blur the signal. It’s a free side effect of keeping every stage before the classifier linear.</p>
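<p>The commuting argument is easy to check numerically. The sketch below uses hypothetical weights and four made-up child pixels; it compares the logit of the averaged parent pixel with the mean of the child logits, once with exact averaging and once with the parent rounded back to integers as an int8 pyramid would store it.</p>

```python
def logit(w, b, x):
    """Pre-sigmoid score of a linear classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def average_pixels(pixels):
    """Component-wise mean of several feature vectors (a 2x2 block here)."""
    n = len(pixels)
    return [sum(p[j] for p in pixels) / n for j in range(len(pixels[0]))]

# Hypothetical weights and four child pixels of one parent pixel (int8-valued PCA-8).
w = [0.03, -0.02, 0.01, 0.00, 0.04, -0.01, 0.02, 0.00]
b = -0.5
children = [
    [90, -40, 7, 0, -120, 33, 58, -2],
    [85, -35, 5, 1, -110, 30, 60, 0],
    [10, 2, -3, 0, 4, -1, 6, 1],
    [12, 0, -2, 1, 3, 0, 5, -1],
]

parent_exact = average_pixels(children)
parent_int8 = [round(v) for v in parent_exact]  # what an int8 pyramid would store

mean_child_logit = sum(logit(w, b, c) for c in children) / len(children)
print(logit(w, b, parent_exact) - mean_child_logit)  # zero up to float error
print(logit(w, b, parent_int8) - mean_child_logit)   # small offset from rounding
```

<p>The rounding offset is bounded by half the sum of the absolute weights, so it stays small as long as the weights are modest relative to the int8 range.</p>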
<p>You can try this for yourself — <strong><a href="https://calebrob.com/deltabit/">open DeltaBit</a></strong>, pick a spot in the Seattle metro, click a dozen change and no-change pixels, and see the prediction layer update across the visible map. Arrow keys / scroll to fly around; the same tiles get reused at every zoom level.</p>
</section>
<section id="limitations" class="level2">
<h2 class="anchored" data-anchor-id="limitations">Limitations</h2>
<p><strong>Single scene pair.</strong> One Sentinel-2 tile, one year pair. We haven’t tested other sensors, resolutions, or temporal baselines.</p>
<p><strong>Narrow slice of change types.</strong> The 336 labeled pixels are dominated by deforestation and urban changes — new construction, road work, cleared parcels — because that’s what’s conspicuous in a 2020→2024 side-by-side over Seattle. Agricultural cycles, flooding, fires, snow and ice, and seasonal vegetation shifts are barely represented, so the CV scores don’t speak to how well a linear model separates those change modes from each other (or from no-change).</p>
<p><strong>Tested on AEF only.</strong> The recipe is model-agnostic — PCA + int8 + XYZ tiles works on any per-pixel embedding — but we only ran the experiment on AEF. We picked AEF because a <a href="https://source.coop/tge-labs/aef-mosaic">global Zarr store</a> already exists and we had code to pull from it; the bottleneck was data access, not the pipeline. That friction is decreasing fast. As more per-pixel embedding products ship as global mega-Zarrs, COGs, or APIs, running the same PCA(8) + int8 + XYZ recipe against them becomes a configuration change rather than a project. A global change-detection app over any of those sources isn’t inconceivable — just unbuilt.</p>
</section>
<section id="takeaways" class="level2">
<h2 class="anchored" data-anchor-id="takeaways">Takeaways</h2>
<p><strong>PCA-8 + int8 is the sweet spot for interactive change detection.</strong> A full Sentinel-2 scene’s worth of change embeddings fits in a few GB of tiles — small enough to stream to a browser and run dense inference on the GPU without any server involvement, while still matching what a model trained on the raw embeddings can do. Concretely: 8 bytes per pixel gets 96% F1, which is 8× smaller than the source AEF int8 embeddings (32× smaller than the float32 diff the CV ran on). The recipe is model-agnostic: DINOv3, OlmoEarth, Tessera, or any other per-pixel embedding from <a href="../../posts/compressing-earth-embeddings/">pt.&nbsp;1</a> would slot in the same way.</p>
<p><strong>Difference embeddings compress less neatly than static ones — and it doesn’t matter.</strong> Subtracting two years of embeddings produces a messier, higher-dimensional signal than either year alone, but enough of what matters for change detection survives the first handful of components. Static AEF reaches 80% variance in 8 dims; the 2020→2024 diff needs 40 components to clear 90%, and PCA-8 only retains 48%. Despite that, PCA-8 already hits 96% F1 on the labeled set — variance isn’t the same as predictive signal, and 8 bytes per pixel is the right place to draw the line for an interactive viewer.</p>
<p><strong>Linear is enough.</strong> The embeddings are doing the heavy lifting, so a simple classifier on top is all you need — which, as a bonus, is trivially fast to train and run on the GPU in the browser. Logistic regression on 8 features hits 96% F1; nothing fancier is needed for the webapp to feel like a useful tool.</p>
<p><strong>Links:</strong> <a href="https://calebrob.com/deltabit/">DeltaBit demo</a> · <a href="../../posts/compressing-earth-embeddings/">pt.&nbsp;1: Compressing Earth Embeddings</a> · <a href="../../posts/terrabit/">pt.&nbsp;2: TerraBit</a></p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Slightly interesting side note: the median selected <code>C</code> (inverse regularization strength) decreases as the feature set grows — PCA-3 wants weaker regularization than PCA-8, which wants weaker than the full 64-d diff. With fewer features the model needs more freedom to fit the limited signal; with 64 features, regularization keeps it honest.↩︎</p></li>
</ol>
</section><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{robinson2026,
  author = {Robinson, Caleb and Corley, Isaac},
  title = {Compressing {Earth} {Embeddings,} Pt. 3 -\/- {DeltaBit}},
  date = {2026-04-15},
  url = {https://geospatialml.com/posts/change-detection/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-robinson2026" class="csl-entry quarto-appendix-citeas">
Robinson, Caleb, and Isaac Corley. 2026. <span>“Compressing Earth
Embeddings, Pt. 3 -- DeltaBit.”</span> April 15. <a href="https://geospatialml.com/posts/change-detection/">https://geospatialml.com/posts/change-detection/</a>.
</div></div></section></div> ]]></description>
  <category>change-detection</category>
  <category>embeddings</category>
  <category>compression</category>
  <category>interactive</category>
  <category>webgpu</category>
  <category>browser</category>
  <category>demo</category>
  <guid>https://geospatialml.com/posts/change-detection/</guid>
  <pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://geospatialml.com/posts/change-detection/teaser.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Compressing Earth Embeddings, pt. 2 – TerraBit</title>
  <dc:creator>Isaac Corley</dc:creator>
  <dc:creator>Caleb Robinson</dc:creator>
  <link>https://geospatialml.com/posts/terrabit/</link>
  <description><![CDATA[ 





<section id="unfinished-business" class="level2">
<h2 class="anchored" data-anchor-id="unfinished-business">Unfinished business</h2>
<p><a href="../../posts/compressing-earth-embeddings/">Last time</a>, we compressed earth embeddings 64× with less than 2% loss on patch classification. We found int8 was statistically indistinguishable from float32 and that PCA(64)+int8 was the sweet spot. Binary quantization — reducing each dimension to its sign bit — achieved 16.5× end-to-end compression on disk (32× on the raw embedding payload alone), but we hadn’t yet measured retrieval quality at scale.</p>
<p>We were clear about what we didn’t test. From our limitations section:</p>
<blockquote class="blockquote">
<p>We have not tested: semantic segmentation, pixel regression, object detection, <strong>change detection</strong>, or <strong>retrieval</strong> — ranking quality over large databases may be more sensitive to distance distortion than top-1 classification.</p>
</blockquote>
<p>In other words, patch classification on <a href="https://github.com/phelber/EuroSAT">EuroSAT</a> is a controlled benchmark, not a real workflow. <strong>Can you actually do useful things with aggressively compressed embeddings?</strong> This time we work with <a href="https://clay-foundation.github.io/model/release-notes/specification.html">Clay v1.5</a> — a foundation model trained on multi-sensor satellite imagery — at global scale. <a href="https://lgnd.ai/">LGND</a> made the <a href="https://source.coop/clay/lgnd-clay-v1-5-sentinel-2-l2a">full global corpus available</a> in float32 on Source Cooperative, which gave us the raw material to test compression at scale.</p>
</section>
<section id="terrabit" class="level2">
<h2 class="anchored" data-anchor-id="terrabit">TerraBit</h2>
<video autoplay="" muted="" loop="" playsinline="" style="width:100%; border-radius:6px;">
<source src="terrabit-demo.mp4" type="video/mp4">
</video>
<p>To test this, we built <a href="https://isaac.earth/terrabit/">TerraBit</a> — a global retrieval demo that runs entirely in the browser with no backend or server-side computation. We binary-quantize the full Clay v1.5 corpus into packed bit vectors, store them as spatially-partitioned cloud-native <a href="https://parquet.apache.org/">Parquet</a> on public object storage, and let the browser handle shard discovery, data fetching, and in-memory Hamming scoring. The entire “backend” is a static S3 bucket; all compute happens on your machine.</p>
<p><strong>How it works:</strong></p>
<ol type="1">
<li>You draw one or more regions of interest (ROIs) anywhere; each is loaded independently, and regions can be rectangles or freehand polygons</li>
<li>You click to create exemplar patches on the map (one or many). Positive exemplars outside the ROI have their embeddings fetched on the fly, and negatives work anywhere on the globe for contrastive scoring (<code>pos_dist − neg_dist</code>). You can also invert the search (bitwise NOT) to find the opposite of a reference!</li>
<li>DuckDB-WASM queries a manifest for intersecting geohash shards; only those shards are fetched via HTTP range requests — no full-corpus scan</li>
<li>A Web Worker scores all candidates with brute-force Hamming distance and returns ranked results</li>
<li>The results render via MapLibre GL across several view modes (<em>top-k, heatmap, threshold, outlier, surprise, gradient</em>)</li>
</ol>
<p>Multiple exemplars can be combined via mean distance, or by applying bitwise <code>AND</code> / <code>OR</code> / <code>XOR</code> directly on the packed binary vectors before scoring. These are exact, lossless ops that compose semantically because binary embeddings have nice arithmetic properties.</p>
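<p>A minimal sketch of these operations on toy 16-bit vectors follows, with packed vectors held as Python integers. The bit patterns are made up, and the mean-based form of the contrastive score is an assumption about how <code>pos_dist − neg_dist</code> is aggregated over several exemplars.</p>

```python
def hamming(a, b):
    """Number of differing bits between two packed binary vectors (as ints)."""
    return bin(a ^ b).count("1")

def contrastive_score(candidate, positives, negatives):
    """Mean distance to positives minus mean distance to negatives (lower is better)."""
    pos = sum(hamming(candidate, p) for p in positives) / len(positives)
    neg = sum(hamming(candidate, q) for q in negatives) / len(negatives) if negatives else 0.0
    return pos - neg

def invert(vector, n_bits):
    """Bitwise NOT restricted to the vector's width."""
    return vector ^ ((1 << n_bits) - 1)

N_BITS = 16  # toy width; the post's vectors are 1024 bits
pos_a, pos_b = 0b1111000011110000, 0b1111000011001100
negative = 0b0000111100001111
candidates = {"like_pos": 0b1111000011110100, "like_neg": 0b0000111100001011}

shared = pos_a & pos_b   # bits set in both exemplars
either = pos_a | pos_b   # bits set in at least one exemplar
ranked = sorted(candidates,
                key=lambda k: contrastive_score(candidates[k], [pos_a, pos_b], [negative]))
print(ranked, bin(shared), bin(either), hamming(pos_a, invert(pos_a, N_BITS)))
```

<p>Every step is integer arithmetic on the packed bits, so nothing is dequantized and no precision is lost along the way.</p>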
<video autoplay="" muted="" loop="" playsinline="" style="width:100%; border-radius:6px;">
<source src="terrabit-across-the-world-demo.mp4" type="video/mp4">
</video>
<p>The 50M embeddings are partitioned into geohash-aligned Parquet shards and published on <a href="https://source.coop/geospatialml/terrabit">Source Cooperative</a>, which serves them cloud-natively out of S3 — public HTTP with byte-range support, no egress fees, no intermediate server. A single manifest file records the path, row count, and spatial extent of every shard.</p>
<p>When you draw an ROI, <a href="https://duckdb.org/docs/api/wasm/overview.html">DuckDB-WASM</a> queries the manifest with a bounding-box predicate. This is manifest-based shard pruning: the manifest acts as a coarse spatial index, so the browser never opens metadata on shards outside the ROI. Once the intersecting shard list is resolved, DuckDB streams those shard files over HTTP (via <a href="https://duckdb.org/docs/extensions/httpfs/overview.html"><code>httpfs</code></a> range requests) and applies a second filter at the row level (a bbox predicate for rectangles, or <code>ST_Intersects</code> for freehand polygons) to extract only patches within the drawn region. Ranking over the candidate slice is exact brute-force Hamming: binary embeddings arrive as packed <code>Uint8Array</code> columns (128 bytes per 1024-dim vector) and are scored in a Web Worker via XOR+<a href="https://nimrod.blog/posts/algorithms-behind-popcount/">popcount</a>, which maps directly to hardware-accelerated popcount instructions and completes in milliseconds for a typical ROI partition.</p>
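<p>The two-stage filter can be sketched in a few lines of Python. The manifest field names, shard paths, and coordinates below are hypothetical, and the row-level stage is simplified to a point-in-rectangle check; in TerraBit both stages run as DuckDB queries.</p>

```python
def bbox_intersects(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def prune_shards(manifest, roi):
    """Stage 1: keep only shards whose recorded extent touches the ROI."""
    return [row for row in manifest if bbox_intersects(row["bbox"], roi)]

def filter_rows(rows, roi):
    """Stage 2: keep only patches located inside the rectangular ROI."""
    return [r for r in rows
            if roi[0] <= r["lon"] <= roi[2] and roi[1] <= r["lat"] <= roi[3]]

# Hypothetical manifest: the field names and paths are illustrative only.
manifest = [
    {"path": "shards/c22.parquet", "rows": 1200, "bbox": (-123.8, 46.4, -122.3, 47.8)},
    {"path": "shards/c23.parquet", "rows": 900, "bbox": (-122.4, 47.1, -121.0, 48.5)},
    {"path": "shards/9q8.parquet", "rows": 1500, "bbox": (-123.0, 37.0, -121.6, 38.4)},
]
roi = (-122.5, 47.4, -122.2, 47.8)  # a rectangle around Seattle

hits = prune_shards(manifest, roi)
print([row["path"] for row in hits])
print("rows scanned:", sum(r["rows"] for r in hits), "of", sum(r["rows"] for r in manifest))

# Stage 2 on made-up rows, as they might come out of the surviving shards.
rows = [{"lon": -122.33, "lat": 47.61}, {"lon": -122.9, "lat": 46.9}]
print(len(filter_rows(rows, roi)), "row(s) inside the ROI")
```

<p>The coarse stage only touches the small manifest, so the cost of a query scales with the ROI rather than with the size of the corpus.</p>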
<p>The binary embeddings are lossy though — we find they have ~65% <a href="https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Recall">recall@10</a> (the fraction of true float32 nearest neighbors recovered by the binary representation) which means roughly a third of true neighbors are missed (Figure&nbsp;2). Good enough for coarse exploration; not a claim about downstream curation or labeling productivity. How coarse is too coarse though? 65% recall goes further than you’d expect — <a href="https://isaac.earth/terrabit/">try the demo</a> on your own region!</p>
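<p>For reference, recall@k as defined in the parenthetical above can be computed as follows. This is a minimal sketch with made-up neighbor IDs, not the evaluation script behind the 65% figure.</p>

```python
def recall_at_k(exact_neighbors, approx_neighbors, k=10):
    """Mean fraction of the exact top-k that also appears in the approximate top-k."""
    total = 0.0
    for exact, approx in zip(exact_neighbors, approx_neighbors):
        total += len(set(exact[:k]) & set(approx[:k])) / k
    return total / len(exact_neighbors)

# Two toy queries with k = 4; the IDs are made up.
exact = [[1, 2, 3, 4], [10, 11, 12, 13]]
binary = [[1, 2, 9, 4], [10, 20, 21, 13]]
print(recall_at_k(exact, binary, k=4))
```

<p>Note that the metric ignores order within the top-k: a true neighbor counts as recovered whether it lands first or tenth.</p>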
<p>A few examples of what this enables: a) click a center-pivot irrigation field in Kansas and separate it from rectangular fields across the state, b) pick a greenhouse cluster in Rotterdam and highlight dense greenhouse and vineyard complexes across the region or c) select a solar installation in northwest India and find others at similar scale. None of these queries require labeled data, a trained classifier, or even a definition of what you’re looking for beyond a single click. This is useful for data exploration, bootstrapping training datasets for supervised models, and narrowing the search space before running expensive high-resolution models over targeted areas. The demo also supports exporting ranked candidates as GeoParquet!</p>
<video autoplay="" muted="" loop="" playsinline="" style="width:100%; border-radius:6px;">
<source src="terrabit-places.mp4" type="video/mp4">
</video>
</section>
<section id="binary-earth-embedding-retrieval-at-planet-scale" class="level2">
<h2 class="anchored" data-anchor-id="binary-earth-embedding-retrieval-at-planet-scale">Binary Earth Embedding Retrieval at Planet Scale</h2>
<p>Clay v1.5 produces 1024-dimensional embeddings from <a href="https://www.esa.int/Applications/Observing_the_Earth/Copernicus/Sentinel-2">Sentinel-2</a> imagery. The global corpus spans two years of observations — roughly 50 million embeddings covering Earth’s land surface — and is <strong>183 GiB</strong> on disk in ZSTD-compressed Parquet (≈190 GiB as raw float32 – float32s don’t compress well even if they come from a GeoFM). Serving float32 vectors at this scale to a browser isn’t viable; the question we ask is <strong>how aggressively you can compress without destroying retrieval quality.</strong></p>
<p>Binary quantization reduces each dimension to a single sign bit. 1024 floats (4,096 bytes) become 128 bytes — a 32× reduction on the raw payload. End-to-end on disk (Parquet with ZSTD, geometry and STAC metadata columns, row-group overhead), the full 49.8M-row corpus drops from 182.9 GiB to <strong>11.1 GiB</strong> — <strong>16.5× compression</strong>. The on-disk number is what you pay for on object storage (32× is the raw payload reduction). The web demo corpus is smaller still (~7 GiB) because several columns were dropped and the compression level was increased — a demo-specific optimization on top of the 16.5× quantization win.</p>
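<p>As a minimal sketch of the raw-payload arithmetic, the snippet below packs the sign bits of a 1024-d vector into 128 bytes and scores two packed vectors with XOR and a bit count. The bit order, the treatment of exact zeros as 0 bits, and the synthetic vectors are assumptions for illustration; they are not real Clay embeddings.</p>

```python
def binarize(vector):
    """Pack sign bits into bytes, 8 dimensions per byte (bit = 1 where value > 0)."""
    out = bytearray((len(vector) + 7) // 8)
    for i, value in enumerate(vector):
        if value > 0:
            out[i // 8] |= 1 << (7 - i % 8)  # most significant bit first
    return bytes(out)

def hamming_bytes(a, b):
    """XOR the packed vectors and count the set bits."""
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")

DIM = 1024
# Deterministic stand-ins for float32 embeddings (not real Clay vectors).
u = [((i * 37) % 11) - 5.0 for i in range(DIM)]
v = [((i * 37) % 11) - 5.0 if i % 16 else -1.0 for i in range(DIM)]

bu, bv = binarize(u), binarize(v)
print(len(bu), "bytes vs", DIM * 4, "bytes as float32:", DIM * 4 // len(bu), "x")
print("Hamming distance:", hamming_bytes(bu, bv))
```

<p>The 32× here is the payload ratio only; the on-disk ratio is lower because geometry, metadata columns, and Parquet overhead do not shrink with the embeddings.</p>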
<div id="fig-storage" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-storage-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://geospatialml.com/posts/terrabit/storage.png" class="img-fluid figure-img" style="width:100.0%">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-storage-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: On-disk storage of the full 49.8M Clay v1.5 corpus across quantization levels. fp32 → binary gives 16.5× end-to-end compression.
</figcaption>
</figure>
</div>
<div id="fig-knn-recall" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-knn-recall-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://geospatialml.com/posts/terrabit/knn_recall_k.png" class="img-fluid figure-img" style="width:100.0%">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-knn-recall-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: kNN recall@k vs.&nbsp;on-disk compression ratio across quantization methods. int8 is near-lossless; binary hits 65% recall at 16.5×.
</figcaption>
</figure>
</div>
<p>Why does aggressive quantization work at all on 1024-dimensional vectors? One diagnostic is the <a href="https://en.wikipedia.org/wiki/Intrinsic_dimension">intrinsic dimension</a> (ID) — the degrees of freedom the data actually uses, regardless of ambient dimensionality [<a href="https://www.nature.com/articles/s41598-017-11873-y">Facco et al., 2017</a>; <a href="https://proceedings.neurips.cc/paper/2004/hash/74934548253bcab8490ebd74afed7031-Abstract.html">Levina &amp; Bickel, 2004</a>]. This framing is directly motivated by <a href="https://arxiv.org/abs/2511.02101">Rao et al., 2025</a>, who find that geographic representations — despite operating in 256–512 dimensional spaces — compress to just 2–10 intrinsic dimensions, and that ID correlates with downstream task performance. <strong>For Clay v1.5 we estimate ID ≈ 13–17 (MLE: 17.0, TwoNN: 12.6, Local PCA: 17.0, on a 10k sample subset).</strong> Three estimators with different assumptions agree on a narrow range. Low ID is why aggressive compression is worth attempting — the data simply isn’t using most of its dimensions.</p>
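<p>To illustrate what an intrinsic-dimension estimate measures, here is a minimal sketch of a TwoNN-style estimator on toy data. It uses the maximum-likelihood form (N divided by the sum of log ratios of second to first nearest-neighbor distances) rather than the linear fit in Facco et al., and the brute-force neighbor search only suits small samples. It is not the code behind the numbers above.</p>

```python
import math
import random

def two_nn_id(points):
    """Maximum-likelihood TwoNN estimate: N / sum(log(r2 / r1))."""
    log_ratios = []
    for i, p in enumerate(points):
        r1 = r2 = float("inf")  # nearest and second-nearest distances
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        if r1 > 0:  # skip exact duplicates, which carry no ratio information
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)

# Toy data: a 2-d sheet linearly embedded in 6-d, so the ambient dimension (6)
# overstates the degrees of freedom (2).
rng = random.Random(0)
sheet = [(rng.random(), rng.random()) for _ in range(400)]
points = [(a, b, a + b, a - b, 2 * a, 0.5 * b) for a, b in sheet]
print(round(two_nn_id(points), 2))
```

<p>The estimate should land near 2 despite the 6 ambient coordinates, which is the same gap between nominal and intrinsic dimensionality that motivates compressing 1024-d embeddings.</p>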
</section>
<section id="turboquant-aka-rotate-before-you-quantize" class="level2">
<h2 class="anchored" data-anchor-id="turboquant-aka-rotate-before-you-quantize">TurboQuant aka rotate before you quantize</h2>
<p>Binary is the extreme end of the compression spectrum, and the retrieval demo uses it — but what if you need more recall than binary while keeping storage well below float32?</p>
<p>Standard affine quantization at low bit-widths (int2–int4) suffers from high variance disparity across embedding dimensions: some dimensions carry far more signal than others, and a uniform quantization grid wastes bits on low-variance dimensions while clipping high-variance ones. <a href="https://arxiv.org/abs/2504.19874">TurboQuant</a> fixes this by applying a fixed random orthogonal rotation <img src="https://latex.codecogs.com/png.latex?R%20%5Cin%20%5Cmathbb%7BR%7D%5E%7Bd%20%5Ctimes%20d%7D"> (sampled once from a Haar-distributed ensemble via QR decomposition) before symmetric affine quantization: <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bx%7D%20=%20R%5E%5Ctop%20Q_b(Rx)">. The rotation spreads variance across dimensions so no channel dominates the bit budget. <img src="https://latex.codecogs.com/png.latex?R"> is generated once, stored with the quantized embeddings, and reused for all queries — one matrix multiply at encode/decode, no retraining.</p>
<p>Earth embeddings show the same disparity: ID ≈ 13–17 in a 1024-d space leaves a lot of variance to redistribute.</p>
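<p>The mechanics of rotate-then-quantize can be sketched in pure Python on a toy 16-d vector with one dominant channel. This follows the description above rather than the TurboQuant paper’s implementation: the Gram-Schmidt construction, the per-vector scale, and the 3-bit setting are assumptions, and the snippet only shows that the rotation flattens the peak-to-RMS ratio; it makes no claim about recall.</p>

```python
import math
import random

def random_rotation(d, rng):
    """Random orthogonal matrix via Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # resample in the (very unlikely) degenerate case
            rows.append([a / norm for a in v])
    return rows

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def quantize(y, bits):
    """Symmetric uniform quantizer; returns the dequantized values."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in y) / levels or 1.0
    return [round(v / scale) * scale for v in y]

def peak_to_rms(v):
    rms = math.sqrt(sum(a * a for a in v) / len(v))
    return max(abs(a) for a in v) / rms

def norm_diff(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

rng = random.Random(0)
d = 16
# One dominant channel: the kind of disparity a uniform grid handles poorly.
x = [10.0] + [rng.uniform(-0.3, 0.3) for _ in range(d - 1)]

R = random_rotation(d, rng)
y = matvec(R, x)                                   # rotate
x_hat = matvec(transpose(R), quantize(y, bits=3))  # R^T Q_b(R x)

print("peak/RMS before:", peak_to_rms(x), "after:", peak_to_rms(y))
print("reconstruction error:", norm_diff(x, x_hat))
```

<p>Because the rotation is orthogonal, the reconstruction error in the original space equals the quantization error in the rotated space, so the rotation itself introduces no distortion.</p>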
<p>We ran TurboQuant across bit-widths on the Clay v1.5 embeddings. The gains are largest at low precision and vanish at high precision:</p>
<div id="fig-turbo-vs-int" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-turbo-vs-int-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://geospatialml.com/posts/terrabit/turbo_vs_int.png" class="img-fluid figure-img" style="width:100.0%">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-turbo-vs-int-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;3: TurboQuant vs.&nbsp;standard scalar quantization across bit-widths. The rotation provides the largest recall improvement at 2–3 bits, where inter-channel variance disparity hurts most. By int8, affine quantization is already near-lossless and the rotation adds nothing.
</figcaption>
</figure>
</div>
<div id="fig-pareto" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-pareto-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://geospatialml.com/posts/terrabit/pareto.png" class="img-fluid figure-img" style="width:100.0%">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-pareto-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;4: Quality–compression Pareto front across standard quantization methods, under both cosine and Euclidean ground truth (the two overlap almost exactly). Standard int4 is the sweet spot at ~6× on-disk compression and 91% recall@10; int2 is dominated by binary, which recovers some recall despite 16.5× compression.
</figcaption>
</figure>
</div>
<p>Practical takeaway: if binary recall is too coarse but you still want aggressive compression, TurboQuant at int2–int4 is worth trying first. <strong>TurboQuant int4 reaches 95% recall at ~6× on-disk compression (8× on the raw payload).</strong> By int8, the affine grid is fine enough on its own and the rotation adds nothing.</p>
<section id="search-throughput" class="level3">
<h3 class="anchored" data-anchor-id="search-throughput">Search throughput</h3>
<p>We also benchmarked brute-force kNN on a 1M-vector subset (1K queries, k=10) using <a href="https://github.com/facebookresearch/faiss">FAISS</a> on CPU and PyTorch’s <a href="https://docs.pytorch.org/docs/stable/generated/torch.cdist.html">torch.cdist</a> on an RTX 3090. While the dequantized bit-widths benefit from GPU acceleration, binary Hamming search is unreasonably fast on CPU thanks to hardware-accelerated popcount.</p>
<div id="fig-search-benchmark" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-search-benchmark-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://geospatialml.com/posts/terrabit/search_benchmark.png" class="img-fluid figure-img" style="width:90.0%">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-search-benchmark-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;5: Search throughput benchmark: brute-force kNN on 1M vectors (1K queries, k=10). Binary Hamming search dominates on CPU (800 QPS) thanks to hardware-accelerated popcount — roughly 16× faster than the dequantize-then-search path on CPU. Dequantized methods benefit from GPU acceleration (&gt;1,100 QPS on an RTX 3090, a 23–29× speedup over CPU). Recall tracks quantization fidelity independently of hardware.
</figcaption>
</figure>
</div>
</section>
</section>
<section id="why-not-just-build-a-backend" class="level2">
<h2 class="anchored" data-anchor-id="why-not-just-build-a-backend">Why not just build a backend?</h2>
<p>Reasonable reaction: “Cool demo, but real systems need a database and an API.” Maybe, but in geospatial ML, the gap between a working prototype and a deployed tool is almost all infrastructure: vector databases, REST APIs, auth, scaling, monitoring. Each layer is individually reasonable, but together they form a barrier large enough to keep someone from shipping and maintaining a useful tool.</p>
<p>Furthermore, existing vector DBs primarily partition by embedding similarity; a small AOI query still touches shards scattered across the index, with geospatial filtering applied <em>after</em> the expensive approximate nearest neighbor (ANN) step. Getting geo-first partitioning right takes careful co-design, and no existing system targets zero-ops browser-native serving of a static corpus. Our approach sidesteps that with embeddings partitioned spatially by geohash, a manifest for shard pruning, and a throwaway Hamming scan.</p>
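<p>As a rough illustration of geo-first shard pruning, the sketch below encodes points with the standard geohash algorithm and looks up which shards an AOI touches. The manifest layout (a dict from geohash prefix to shard file name), the shard names, and the grid-sampling strategy are hypothetical and chosen for brevity; they are not TerraBit’s actual format.</p>

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=3):
    """Standard geohash: alternate longitude/latitude bisection bits, 5 bits per character."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars, value, nbits, use_lon = [], 0, 0, True
    while len(chars) < precision:
        bounds, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (bounds[0] + bounds[1]) / 2
        bit = coord >= mid
        bounds[0 if bit else 1] = mid  # keep the half that contains the point
        value = value * 2 + int(bit)
        nbits += 1
        use_lon = not use_lon
        if nbits == 5:
            chars.append(BASE32[value])
            value, nbits = 0, 0
    return "".join(chars)

def shards_for_bbox(manifest, south, west, north, east, precision=3, step=0.5):
    """Sample the AOI on a grid finer than one cell and collect the shards that are hit.

    A precision-3 cell is about 1.4 degrees on each side, so a 0.5 degree step
    (plus the clamped north/east edges) cannot skip a cell.
    """
    hits = set()
    lat = south
    while True:
        lon = west
        while True:
            cell = geohash_encode(min(lat, north), min(lon, east), precision)
            if cell in manifest:
                hits.add(manifest[cell])
            if lon >= east:
                break
            lon += step
        if lat >= north:
            break
        lat += step
    return sorted(hits)

# Hypothetical manifest: geohash prefix -> shard file.
manifest = {"u4p": "shard-u4p.bin", "9q8": "shard-9q8.bin"}
print(shards_for_bbox(manifest, 57.0, 10.0, 57.5, 10.5))
```

<p>Only the shards returned here would need to be fetched and scanned, which is what keeps a small AOI query cheap regardless of corpus size.</p>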
<p>To be clear, backends still have their place. Full-corpus ANN, multi-user serving, auth, and strict SLAs are backend territory. But <strong>for exploration and dataset curation, the barrier to useful interaction with embeddings should be as close to zero as possible, and for a lot of real problems, client-side is enough.</strong></p>
<p><strong>Links:</strong> <a href="https://isaac.earth/terrabit/">TerraBit retrieval demo</a> · <a href="https://source.coop/geospatialml/terrabit">binarized embedding corpus</a> · <a href="../../posts/compressing-earth-embeddings/">pt.&nbsp;1: Compressing Earth Embeddings</a></p>
<div style="font-size: 0.85em; color: gray;">
<p><strong>Acknowledgments.</strong> Thanks to <a href="https://www.linkedin.com/in/jeff-albrecht-5a2b86148">Jeff Albrecht</a> for his review and feedback on this post.</p>
</div>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{corley2026,
  author = {Corley, Isaac and Robinson, Caleb},
  title = {Compressing {Earth} {Embeddings,} Pt. 2 -\/- {TerraBit}},
  date = {2026-04-07},
  url = {https://geospatialml.com/posts/terrabit/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-corley2026" class="csl-entry quarto-appendix-citeas">
Corley, Isaac, and Caleb Robinson. 2026. <span>“Compressing Earth
Embeddings, Pt. 2 -- TerraBit.”</span> April 7. <a href="https://geospatialml.com/posts/terrabit/">https://geospatialml.com/posts/terrabit/</a>.
</div></div></section></div> ]]></description>
  <category>embeddings</category>
  <category>quantization</category>
  <category>compression</category>
  <category>retrieval</category>
  <category>foundation-models</category>
  <category>sentinel-2</category>
  <category>clay</category>
  <category>browser</category>
  <category>demo</category>
  <guid>https://geospatialml.com/posts/terrabit/</guid>
  <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://geospatialml.com/posts/terrabit/thumbnail.png" medium="image" type="image/png"/>
</item>
<item>
  <title>Compressing Earth Embeddings</title>
  <dc:creator>Caleb Robinson</dc:creator>
  <dc:creator>Isaac Corley</dc:creator>
  <link>https://geospatialml.com/posts/compressing-earth-embeddings/</link>
  <description><![CDATA[ 





<blockquote class="blockquote">
<p><strong>Update (2026-03-26):</strong> OlmoEarth-nano results throughout have been recomputed with properly normalized inputs. The initial version we released used unnormalized inputs, which caused us to significantly underestimate OlmoEarth-nano’s performance. Thanks to Gabriel Tseng for flagging this issue!</p>
</blockquote>
<p>Foundation models like Tessera [1], OlmoEarth [2], and AlphaEarth [3] produce dense per-pixel embeddings from satellite imagery. With a kNN classifier or linear probe, you can do classification, change detection, or similarity search — no fine-tuning needed. The appeal here is that you can skip expensive image preprocessing and model inference, download some embeddings, then plug into your task. But the cost of actually storing these embedding products can get out of hand fast.</p>
<p><a href="https://isaac.earth/earth-embedding-products">Isaac’s recent survey</a> of earth embedding products [4] catalogued this growing ecosystem — AlphaEarth, Tessera, Clay, Major-TOM, MOSAIKS — and identified a common problem: <strong>at continental or global scale, embedding storage costs dwarf the compute savings that motivated precomputation in the first place.</strong> Distribution is fragmented across incompatible formats (COG, GeoParquet, raw NumPy), and there are no shared standards for tiling, CRS, or provenance. But the most fundamental issue is size.</p>
<section id="the-storage-problem" class="level2">
<h2 class="anchored" data-anchor-id="the-storage-problem">The storage problem</h2>
<p>Earth’s land surface covers about 150 million km<sup>2</sup>. At Sentinel-2’s 10m resolution, that’s <a href="https://www.wolframalpha.com/input?i=land+area+of+the+world+%2F+%2810+meters+*+10+meters%29"><strong>1.5 trillion pixels</strong></a>. Multiply by embedding dimension and bytes per element:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Model</th>
<th>Dims</th>
<th>Bytes/embedding</th>
<th>1 year (global)</th>
<th>S3 cost/year</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>DINOv3 ViT-L (float32)</td>
<td>1,024</td>
<td>4,096</td>
<td><strong>6.1 PB</strong></td>
<td><strong>$1.7M</strong></td>
</tr>
<tr class="even">
<td>DINOv3 ViT-L (int8)</td>
<td>1,024</td>
<td>1,024</td>
<td>1.5 PB</td>
<td>$424K</td>
</tr>
<tr class="odd">
<td>Tessera encoder (float32)</td>
<td>512</td>
<td>2,048</td>
<td>3.1 PB</td>
<td>$847K</td>
</tr>
<tr class="even">
<td>Tessera product (int8)</td>
<td>128</td>
<td>128</td>
<td>192 TB</td>
<td>$53K</td>
</tr>
<tr class="odd">
<td>OlmoEarth-nano (float32)</td>
<td>128</td>
<td>512</td>
<td>768 TB</td>
<td>$212K</td>
</tr>
<tr class="even">
<td>AEF (int8)</td>
<td>64</td>
<td>64</td>
<td>96 TB</td>
<td>$26K</td>
</tr>
</tbody>
</table>
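<p>The arithmetic behind the table can be reproduced in a few lines. The storage price of $0.023 per GB-month is our assumption here (it is not stated above, but it reproduces the cost column), and the script uses decimal units (1 PB = 10<sup>15</sup> bytes).</p>

```python
LAND_AREA_KM2 = 150e6          # approximate land surface, from the text above
PIXEL_SIZE_M = 10              # Sentinel-2 resolution
USD_PER_GB_MONTH = 0.023       # assumption: flat object-storage price

def global_storage(bytes_per_embedding):
    """Return (pixels, total bytes, USD per year) for one global annual layer."""
    pixels = LAND_AREA_KM2 * 1e6 / PIXEL_SIZE_M ** 2
    total_bytes = pixels * bytes_per_embedding
    usd_per_year = total_bytes / 1e9 * USD_PER_GB_MONTH * 12
    return pixels, total_bytes, usd_per_year

for name, nbytes in [("DINOv3 ViT-L (float32)", 4096), ("DINOv3 ViT-L (int8)", 1024),
                     ("Tessera product (int8)", 128), ("AEF (int8)", 64)]:
    _, total, usd = global_storage(nbytes)
    print(f"{name}: {total / 1e15:.3f} PB, ${usd / 1e3:,.0f}K per year")
```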
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/compressing-earth-embeddings/storage_costs.png" class="img-fluid figure-img" style="width:75.0%"></p>
<figcaption>Global storage for 1024-dimensional embeddings at 10m resolution under different compression schemes. The dashed line marks the annual Sentinel-2 archive volume (~3-4 PB/year). Float32 baseline exceeds the Sentinel-2 annual output; PCA(64)+int8 brings it under 100 TB.</figcaption>
</figure>
</div>
<p>For context, the entire Sentinel-2 archive — every L1C and L2A product collected since 2015 — was roughly 22 PB in 2022 [5] and exceeded 50 PB by mid-2025 [6]. The archive grows by 3-4 PB per year. <strong>A single year of 1024d float32 embeddings (6.1 PB) would exceed the annual Sentinel-2 data volume that produced them.</strong> The embeddings are larger than the source imagery.</p>
<blockquote class="blockquote">
<p>The embeddings are larger than the source imagery.</p>
</blockquote>
<p>And these are per-year numbers. AlphaEarth covers 2017-2025 (9 years). <a href="https://anil.recoil.org/notes/geotessera-python">Tessera plans the same</a>. Multi-year archives at these scales approach a petabyte even for the most compact int8 products (9 × 96 TB ≈ 0.86 PB for AEF) and reach tens of petabytes for float32 embeddings (9 × 6.1 PB ≈ 55 PB at 1024d). So how much can you compress before the embeddings stop being useful?</p>
</section>
<section id="eo-representations-are-redundant" class="level2">
<h2 class="anchored" data-anchor-id="eo-representations-are-redundant">EO representations are redundant</h2>
<p>Two recent papers provide evidence that earth observation representations carry substantial redundancy.</p>
<p><strong>Model-level redundancy.</strong> Hackel et al.&nbsp;[7] applied post-hoc “slimming” to remote sensing foundation models — uniformly reducing the width of transformer layers after training. At just 1% of the original FLOPs, these models retained over 71% of their full-scale accuracy (relative retention). An ImageNet-trained MAE dropped below 10% relative retention under the same treatment. Intermediate model sizes sometimes <em>outperformed</em> the full model, suggesting the extra capacity adds noise rather than signal. If the intermediate representations are this redundant, the output embeddings are too.</p>
<p><strong>Image-level redundancy.</strong> Papazafeiropoulos et al.&nbsp;[8] applied patch-level masking during training and inference of a ViT model, retaining only a fraction of image patches. On BigEarthNet, 15% patch retention achieved 99.4% of baseline accuracy. Even segmentation tolerated 50% patch removal while recovering ~97% of full performance.</p>
<p>These results suggest that standard embedding compression methods — including quantization and dimensionality reduction — may be effective for remotely sensed data as well. So we tested it!</p>
</section>
<section id="experimental-setup" class="level2">
<h2 class="anchored" data-anchor-id="experimental-setup">Experimental setup</h2>
<p>We evaluate combinations of quantization (float32, int8, int4, int2, binary, ternary, product quantization) and dimensionality reduction (PCA, truncated SVD, random projection, feature selection) across 5 embedding models and 6 classification datasets.</p>
<p>All experiments in this section use <a href="https://github.com/phelber/eurosat">EuroSAT</a> [9], a 10-class Sentinel-2 land cover dataset with 21,600 images, as the primary benchmark. We use the precomputed embeddings for AEF, OlmoEarth, and Tessera from Isaac’s <a href="https://github.com/isaaccorley/geopool">geopool</a> repository, and we compute DINOv3 and ResNet50 embeddings separately. We then validate our findings on 5 additional datasets (RESISC45 [10] and 4 GeoBench [11] benchmarks) with the DINOv3- and ResNet50-based embeddings in the cross-dataset section below.</p>
<section id="models" class="level3">
<h3 class="anchored" data-anchor-id="models">Models</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>Architecture</th>
<th>Dims</th>
<th>Bytes/emb</th>
<th>Pretraining</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>AlphaEarth (AEF)</td>
<td>STP Encoder</td>
<td>64</td>
<td>256</td>
<td>3B+ multi-source EO obs.</td>
</tr>
<tr class="even">
<td>OlmoEarth-nano</td>
<td>Transformer</td>
<td>128</td>
<td>512</td>
<td>S1/S2/Landsat self-supervised</td>
</tr>
<tr class="odd">
<td>Tessera</td>
<td>Transformer</td>
<td>512</td>
<td>2,048</td>
<td>S1/S2 self-supervised</td>
</tr>
<tr class="even">
<td>DINOv3 ViT-L/16</td>
<td>Vision Transformer</td>
<td>1,024</td>
<td>4,096</td>
<td>SAT-493M (0.6m Maxar RGB)</td>
</tr>
<tr class="odd">
<td>ResNet50</td>
<td>CNN</td>
<td>2,048</td>
<td>8,192</td>
<td>ImageNet supervised</td>
</tr>
</tbody>
</table>
<p>DINOv3 ViT-L/16 uses Meta’s SAT-493M checkpoint [12] — a ViT-L distilled from the DINOv3 ViT-7B, trained on 493 million 0.6m Maxar RGB tiles. Tessera’s encoder outputs 512-dim embeddings, but the <a href="https://geotessera.readthedocs.io/">distributed product</a> compresses these to 128-dim int8 with per-pixel scale factors. Our experiments use the 512-dim encoder output via <a href="https://github.com/isaaccorley/geopool">geopool</a>.</p>
<p>We evaluate with <strong>kNN</strong> (k=5, cosine distance) and <strong>linear probes</strong> (logistic regression with tuned regularization). All quantization and reduction parameters are fit on training data only. DINOv3 results use mean-pooled patch tokens throughout unless noted.</p>
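<p>For reference, a cosine kNN classifier of the kind described above fits in a few lines of NumPy. This is a minimal sketch, not our evaluation code; the function name and the tie-breaking rule (lowest label wins) are assumptions.</p>

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k=5):
    """Majority vote over the k training embeddings with the highest cosine similarity.

    train_y must hold non-negative integer class labels.
    """
    a = train_x / np.linalg.norm(train_x, axis=1, keepdims=True)
    b = test_x / np.linalg.norm(test_x, axis=1, keepdims=True)
    sims = b @ a.T                                  # cosine similarity matrix
    nearest = np.argsort(-sims, axis=1)[:, :k]      # indices of the top-k neighbors
    votes = train_y[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])
```

<p>Because cosine similarity only depends on direction, any compression step that rescales embeddings uniformly leaves these predictions unchanged.</p>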
</section>
<section id="baselines" class="level3">
<h3 class="anchored" data-anchor-id="baselines">Baselines</h3>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Model</th>
<th>Dims</th>
<th>B/emb</th>
<th>EuroSAT kNN</th>
<th>EuroSAT Linear</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>AEF</td>
<td>64</td>
<td>256</td>
<td>94.5%</td>
<td>95.4%</td>
</tr>
<tr class="even">
<td>OlmoEarth-nano</td>
<td>128</td>
<td>512</td>
<td><strong>94.8%</strong></td>
<td>96.5%</td>
</tr>
<tr class="odd">
<td>Tessera</td>
<td>512</td>
<td>2,048</td>
<td>87.6%</td>
<td>94.2%</td>
</tr>
<tr class="even">
<td>DINOv3</td>
<td>1,024</td>
<td>4,096</td>
<td>94.5%</td>
<td><strong>98.0%</strong></td>
</tr>
<tr class="odd">
<td>ResNet50</td>
<td>2,048</td>
<td>8,192</td>
<td>92.6%</td>
<td>95.8%</td>
</tr>
</tbody>
</table>
</section>
</section>
<section id="int8-is-always-free" class="level2">
<h2 class="anchored" data-anchor-id="int8-is-always-free">int8 is always free</h2>
<p>The simplest compression: reduce each float32 value to int8. For each dimension, compute the min and max across the training set, then linearly map the range into 256 integer levels (4x compression).</p>
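<p>A minimal sketch of that per-dimension min/max mapping follows. The function names are ours, and the 256 levels are stored as unsigned bytes (<code>uint8</code>), which is an implementation choice the description above does not pin down.</p>

```python
import numpy as np

def fit_int8(train):
    """Per-dimension offset and step size, computed on training embeddings only."""
    lo = train.min(axis=0)
    hi = train.max(axis=0)
    scale = np.maximum(hi - lo, 1e-12) / 255.0   # guard against constant dimensions
    return lo, scale

def to_int8(x, lo, scale):
    """Map floats onto 256 levels; values outside the training range are clipped."""
    codes = np.round((x - lo) / scale)
    return np.clip(codes, 0, 255).astype(np.uint8)

def from_int8(codes, lo, scale):
    """Dequantize back to floats for distance computations."""
    return codes.astype(np.float64) * scale + lo
```

<p>Within the training range, the reconstruction error per dimension is at most half a step (<code>scale / 2</code>). Test embeddings that fall outside the training min/max are clipped, so the error there can be larger.</p>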
<table class="caption-top table">
<colgroup>
<col style="width: 12%">
<col style="width: 12%">
<col style="width: 12%">
<col style="width: 12%">
<col style="width: 12%">
<col style="width: 12%">
<col style="width: 12%">
<col style="width: 12%">
</colgroup>
<thead>
<tr class="header">
<th>Method</th>
<th>Bits</th>
<th>B/emb (1024d)</th>
<th>AEF</th>
<th>OlmoEarth-nano</th>
<th>Tessera</th>
<th>DINOv3</th>
<th>ResNet50</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>float32</td>
<td>32</td>
<td>4,096</td>
<td>94.5%</td>
<td>94.8%</td>
<td>87.6%</td>
<td>94.5%</td>
<td>92.6%</td>
</tr>
<tr class="even">
<td><strong>int8</strong></td>
<td><strong>8</strong></td>
<td><strong>1,024</strong></td>
<td><strong>94.6%</strong></td>
<td><strong>94.8%</strong></td>
<td><strong>87.8%</strong></td>
<td><strong>94.5%</strong></td>
<td><strong>92.5%</strong></td>
</tr>
<tr class="odd">
<td>int4</td>
<td>4</td>
<td>512</td>
<td>94.2%</td>
<td>94.4%</td>
<td>86.5%</td>
<td>94.4%</td>
<td>92.4%</td>
</tr>
<tr class="even">
<td>int2</td>
<td>2</td>
<td>256</td>
<td>91.7%</td>
<td>91.0%</td>
<td>84.7%</td>
<td>92.3%</td>
<td>—</td>
</tr>
<tr class="odd">
<td>binary</td>
<td>1</td>
<td>128</td>
<td>88.8%</td>
<td>90.8%</td>
<td>81.4%</td>
<td>91.8%</td>
<td>86.8%</td>
</tr>
</tbody>
</table>
<p><em>EuroSAT kNN accuracy. Bold row = statistically indistinguishable from float32 baseline.</em></p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/compressing-earth-embeddings/quant_comparison.png" class="img-fluid figure-img" style="width:75.0%"></p>
<figcaption>EuroSAT kNN accuracy under different quantization levels for each model. int8 is visually indistinguishable from float32 across all five models.</figcaption>
</figure>
</div>
<p>We find <strong>int8 is never statistically distinguishable from float32.</strong> McNemar’s test gives p ≥ 0.12 for every model-dataset pair, with the smallest at p = 0.12. The 95% bootstrap confidence interval for the accuracy difference is within ±0.2 percentage points everywhere. <strong>There is no reason to store float32 embeddings.</strong></p>
<blockquote class="blockquote">
<p>There is no reason to store float32 embeddings.</p>
</blockquote>
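<p>For readers who want to run the same kind of paired comparison, here is a sketch of McNemar’s test in its exact (binomial) form. The text above does not say which variant was used, so treat the exact form and the function name as assumptions.</p>

```python
from math import comb

def mcnemar_exact(baseline_correct, compressed_correct):
    """Two-sided exact McNemar p-value from paired per-sample correctness flags."""
    pairs = list(zip(baseline_correct, compressed_correct))
    b = sum(1 for base, comp in pairs if base and not comp)   # only baseline correct
    c = sum(1 for base, comp in pairs if comp and not base)   # only compressed correct
    n = b + c
    if n == 0:
        return 1.0  # the two classifiers never disagree
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

<p>Only the samples where the two classifiers disagree enter the test, which is why two models with nearly identical accuracy can still be compared meaningfully.</p>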
<p>We also find int4 loses less than 0.5 percentage points for every model except Tessera (1.1 points). Binary quantization (1 bit per dimension, 32x compression) is worth a closer look: DINOv3 at 128 bytes still hits 91.8%! More on this below.</p>
</section>
<section id="most-embedding-dimensions-are-redundant" class="level2">
<h2 class="anchored" data-anchor-id="most-embedding-dimensions-are-redundant">Most embedding dimensions are redundant</h2>
<p>PCA variance analysis reveals different spectral structures across models:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Model</th>
<th>4d</th>
<th>8d</th>
<th>16d</th>
<th>32d</th>
<th>64d</th>
<th>256d</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>AEF (64d)</td>
<td>57%</td>
<td>80%</td>
<td>91%</td>
<td>97%</td>
<td>100%</td>
<td>—</td>
</tr>
<tr class="even">
<td>OlmoEarth-nano (128d)</td>
<td>77%</td>
<td>88%</td>
<td>95%</td>
<td>98%</td>
<td>100%</td>
<td>—</td>
</tr>
<tr class="odd">
<td>Tessera (512d)</td>
<td>94%</td>
<td>98%</td>
<td>99%</td>
<td>100%</td>
<td>100%</td>
<td>100%</td>
</tr>
<tr class="even">
<td>DINOv3 mean (1024d)</td>
<td>52%</td>
<td>64%</td>
<td>74%</td>
<td>82%</td>
<td>88%</td>
<td>97%</td>
</tr>
<tr class="odd">
<td>DINOv3 cls (1024d)</td>
<td>35%</td>
<td>46%</td>
<td>57%</td>
<td>67%</td>
<td>76%</td>
<td>91%</td>
</tr>
</tbody>
</table>
<p><em>Cumulative variance explained by top-k PCA components, fitted on EuroSAT training embeddings.</em></p>
<p>OlmoEarth-nano spreads its variance more broadly than Tessera: 77% falls in the first 4 dimensions, and it takes 32 dimensions to reach 98%. DINOv3 distributes variance more evenly still, needing 256 dimensions for 97%.</p>
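<p>The cumulative variance numbers in a table like this one can be computed from the singular values of the centered training embeddings. A minimal sketch, with a function name of our choosing:</p>

```python
import numpy as np

def cumulative_variance(train, ks):
    """Fraction of total variance captured by the top-k principal components."""
    centered = train - train.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    ratio = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
    # Skip any k larger than the number of available components.
    return {k: float(ratio[k - 1]) for k in ks if k <= len(ratio)}
```

<p>Calling <code>cumulative_variance(embeddings, [4, 8, 16, 32, 64, 256])</code> on a training matrix of shape (samples, dims) yields one row of the table.</p>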
<p>DINOv3 spreads its variance across many dimensions, so you might expect it to compress poorly — if no dimension is dispensable, PCA can’t help. But DINOv3 at PCA(64)+int8 (6% of its original dimensions) still hits 93.1% kNN accuracy, only 1.4% below baseline. The dimensions PCA discards carry variance but apparently not much task-relevant information.</p>
</section>
<section id="combined-compression-the-pareto-frontier" class="level2">
<h2 class="anchored" data-anchor-id="combined-compression-the-pareto-frontier">Combined compression: the Pareto frontier</h2>
<p>The best configurations combine PCA with quantization — reduce dimensions first, then quantize:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Model</th>
<th>Config</th>
<th>B/emb</th>
<th>EuroSAT kNN</th>
<th>EuroSAT Linear</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>AEF</td>
<td>int8</td>
<td>64</td>
<td>94.6%</td>
<td>95.5%</td>
</tr>
<tr class="even">
<td>AEF</td>
<td>int4</td>
<td>32</td>
<td>94.2%</td>
<td>95.0%</td>
</tr>
<tr class="odd">
<td>DINOv3</td>
<td>int8</td>
<td>1,024</td>
<td>94.5%</td>
<td>98.0%</td>
</tr>
<tr class="even">
<td>DINOv3</td>
<td>int4</td>
<td>512</td>
<td>94.4%</td>
<td>97.9%</td>
</tr>
<tr class="odd">
<td>DINOv3</td>
<td>PCA(128)+int8</td>
<td>128</td>
<td>93.6%</td>
<td>97.3%</td>
</tr>
<tr class="even">
<td>DINOv3</td>
<td>PCA(64)+int8</td>
<td>64</td>
<td>93.1%</td>
<td>96.4%</td>
</tr>
<tr class="odd">
<td>DINOv3</td>
<td>PCA(32)+int8</td>
<td>32</td>
<td>92.4%</td>
<td>94.1%</td>
</tr>
<tr class="even">
<td>DINOv3</td>
<td>PCA(16)+int4</td>
<td>8</td>
<td>89.3%</td>
<td>90.5%</td>
</tr>
</tbody>
</table>
<p><em>EuroSAT accuracy. kNN: k=5, cosine. Linear: logistic regression, C tuned.</em></p>
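<p>The combined configurations in the table reduce dimensions first and quantize second. The sketch below shows one way to build a PCA(k)+int8 encoder with NumPy. The parameter dictionary, the function names, and the <code>uint8</code> storage are illustrative assumptions rather than our exact pipeline.</p>

```python
import numpy as np

def fit_pca_int8(train, k):
    """Fit PCA on training embeddings, then per-dimension min/max scales in PCA space."""
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    components = vt[:k]                              # top-k principal directions
    projected = (train - mean) @ components.T
    lo = projected.min(axis=0)
    scale = np.maximum(projected.max(axis=0) - lo, 1e-12) / 255.0
    return {"mean": mean, "components": components, "lo": lo, "scale": scale}

def encode(x, params):
    """Project to k dimensions and store one byte per dimension."""
    z = (x - params["mean"]) @ params["components"].T
    codes = np.round((z - params["lo"]) / params["scale"])
    return np.clip(codes, 0, 255).astype(np.uint8)

def decode(codes, params):
    """Dequantize to floats in the k-dimensional PCA space."""
    return codes.astype(np.float64) * params["scale"] + params["lo"]
```

<p>With <code>k=64</code> each embedding occupies 64 bytes, matching the PCA(64)+int8 row. Downstream kNN or linear probes can operate directly on the decoded k-dimensional vectors without mapping back to the original space.</p>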
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/compressing-earth-embeddings/pareto_knn.png" class="img-fluid figure-img" style="width:75.0%"></p>
<figcaption>Pareto frontiers on EuroSAT: storage cost vs.&nbsp;kNN accuracy (left) and linear probe accuracy (right). Each point is one compression configuration; lines trace the best accuracy at each storage budget.</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/compressing-earth-embeddings/pareto_linear.png" class="img-fluid figure-img" style="width:75.0%"></p>
<figcaption>Pareto frontier: storage cost vs.&nbsp;linear probe accuracy on EuroSAT.</figcaption>
</figure>
</div>
<p>The two plots tell different stories. For kNN, AEF dominates at every storage budget — its 64 dimensions are compact enough that int8 at 64 bytes is nearly unbeatable, and larger models can’t overcome the dimensionality tax. For linear probes, DINOv3 pulls ahead once budgets exceed ~16 bytes, because a trained classifier can exploit the richer representation even after PCA compression.</p>
<p><strong>PCA(64)+int8 at 64 bytes/embedding is the sweet spot for DINOv3</strong>: 64x compression with only 1.4% kNN loss and 96.4% linear accuracy. That brings a year of global DINOv3 embeddings from 6.1 PB down to 96 TB — the same footprint as AlphaEarth’s native int8 representation. Which model to choose depends on your task: kNN retrieval favors AEF, classification with a trained head favors DINOv3.</p>
</section>
<section id="binary-quantization-on-dinov3" class="level2">
<h2 class="anchored" data-anchor-id="binary-quantization-on-dinov3">Binary quantization on DINOv3</h2>
<p>DINOv3 loses only 2.7% kNN accuracy under binary quantization (1 bit per dimension, 32x compression), while AEF and Tessera lose 5.7-6.2%. We hypothesize that this might be due to:</p>
<ol type="1">
<li><p><strong>High dimensionality.</strong> 1,024 binary dimensions give 2<sup>1024</sup> possible codes — enormous capacity for separating 10 classes.</p></li>
<li><p><strong>Balanced dimensions.</strong> DINOv3’s dimensions are nearly symmetric around their means (average imbalance = 0.018). Each threshold bit carries close to 1 bit of entropy. OlmoEarth-nano is also well-balanced (0.052), while AEF’s higher imbalance (0.082) means many bits are nearly constant.</p></li>
</ol>
<p>A related finding with binary quantization: <strong>Hamming distance on raw bits outperforms reconstructing float32 vectors and computing cosine distance.</strong> The reconstruction step replaces each bit with a centroid value (the mean of all above-threshold or below-threshold values for that dimension), and kNN then uses cosine distance on the reconstructed vectors. Hamming distance instead counts the differing bits between two binary vectors, which seems to preserve the ranking of neighbor distances better:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>Dims</th>
<th>Baseline</th>
<th>Reconstructed + cosine</th>
<th>Hamming on raw bits</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>DINOv3</td>
<td>1,024</td>
<td>94.5%</td>
<td>91.8%</td>
<td><strong>93.2%</strong></td>
</tr>
<tr class="even">
<td>Tessera</td>
<td>512</td>
<td>87.6%</td>
<td>81.4%</td>
<td><strong>87.0%</strong></td>
</tr>
<tr class="odd">
<td>OlmoEarth-nano</td>
<td>128</td>
<td>94.8%</td>
<td>90.8%</td>
<td><strong>92.7%</strong></td>
</tr>
<tr class="even">
<td>AEF</td>
<td>64</td>
<td>94.5%</td>
<td>88.8%</td>
<td><strong>89.1%</strong></td>
</tr>
</tbody>
</table>
<p>Hamming distance is also significantly faster to compute than cosine distance on reconstructed vectors — it reduces to a popcount on XOR’d bit vectors.</p>
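<p>A small NumPy sketch of the XOR-plus-popcount search follows. The per-dimension threshold is passed in by the caller (for example, the training mean, which is an assumption on our part), and the byte-level lookup table stands in for the hardware popcount instruction that optimized libraries use.</p>

```python
import numpy as np

# Number of set bits for every possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def binarize(x, threshold):
    """One bit per dimension (1 if above threshold), packed 8 dimensions per byte."""
    return np.packbits(x > threshold, axis=1)

def hamming_knn(query_bits, db_bits, k):
    """Brute-force top-k by Hamming distance between packed bit vectors."""
    xor = np.bitwise_xor(query_bits[:, None, :], db_bits[None, :, :])
    dist = POPCOUNT[xor].sum(axis=2, dtype=np.int64)   # differing bits per pair
    order = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return order, dist
```

<p>A 1,024-dimensional embedding becomes 128 bytes, and each distance is 128 table lookups and a sum. Production code would process the bits in 64-bit words with a native popcount instead.</p>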
</section>
<section id="cross-dataset-consistency" class="level2">
<h2 class="anchored" data-anchor-id="cross-dataset-consistency">Cross-dataset consistency</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/compressing-earth-embeddings/pareto_all_datasets.png" class="img-fluid figure-img" style="width:75.0%"></p>
<figcaption>Pareto frontiers for kNN accuracy vs.&nbsp;storage cost across all 6 datasets. The relative ordering of models is consistent, though absolute accuracy varies with task difficulty.</figcaption>
</figure>
</div>
<p>The compression patterns hold across all 6 datasets:</p>
<ul>
<li><strong>int8 is effectively lossless on all datasets</strong>, including the 45-class RESISC45.</li>
<li><strong>PCA(64)+int8 at 64 bytes</strong> gives 93.1% on EuroSAT (10 classes) and 82.0% on RESISC45 (45 classes) — proportionally similar retention.</li>
<li><strong>m-forestnet</strong> (deforestation driver classification) is the hardest task at ~40% kNN for DINOv3 and ~36% for ResNet50 — likely because RGB-only embeddings lose the spectral bands needed for this task.</li>
</ul>
</section>
<section id="per-class-failure-modes" class="level2">
<h2 class="anchored" data-anchor-id="per-class-failure-modes">Per-class failure modes</h2>
<p>Under aggressive compression, <strong>specific classes are disproportionately affected</strong>:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Model</th>
<th>Config (B/emb)</th>
<th>Worst class</th>
<th>F1 drop</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>AEF</td>
<td>int4 (32 B)</td>
<td>Highway</td>
<td>-0.012</td>
</tr>
<tr class="even">
<td>AEF</td>
<td>binary (8 B)</td>
<td>Highway</td>
<td>-0.170</td>
</tr>
<tr class="odd">
<td>AEF</td>
<td>PCA(8)+binary (1 B)</td>
<td>Highway</td>
<td>-0.486</td>
</tr>
<tr class="even">
<td>OlmoEarth-nano</td>
<td>PCA(8)+binary (1 B)</td>
<td>PermanentCrop</td>
<td>-0.570</td>
</tr>
<tr class="odd">
<td>Tessera</td>
<td>PCA(8)+binary (1 B)</td>
<td>PermanentCrop</td>
<td>-0.479</td>
</tr>
</tbody>
</table>
<p>Highway and PermanentCrop are consistently the most affected. Both are narrow categories relying on fine-grained spectral or spatial features that aggressive quantization destroys. If your application needs balanced per-class performance (e.g., rare category detection), avoid extreme compression and verify per-class metrics.</p>
</section>
<section id="limitations" class="level2">
<h2 class="anchored" data-anchor-id="limitations">Limitations</h2>
<p>The main experiments here use EuroSAT — a 10-class patch classification dataset that most models find relatively easy (baselines already at 94-98%). We have some evidence from DINOv3 and ResNet50 results on RESISC45 (45 classes) and 4 GeoBench benchmarks that the core findings generalize across patch classification tasks — int8 is effectively lossless on all of them. But all 6 datasets are patch classification. We have not tested:</p>
<ul>
<li><strong>Semantic segmentation</strong> — pixel-level predictions may be more sensitive to per-dimension quantization error</li>
<li><strong>Pixel regression</strong> (e.g., canopy height, biomass estimation) — continuous targets could amplify small reconstruction errors that classification absorbs</li>
<li><strong>Object detection</strong> — localization accuracy may degrade differently than classification accuracy</li>
<li><strong>Change detection</strong> — differencing compressed embeddings across time steps could compound quantization noise</li>
<li><strong>Retrieval</strong> — ranking quality over large databases may be more sensitive to distance distortion than top-1 classification</li>
</ul>
<p>If you are saving embeddings for one of these tasks, we recommend validating compression effects on a representative sample before committing to a storage format.</p>
<p>We also only test OlmoEarth-nano (1.4M params, 128d) — the smallest model in the OlmoEarth family. The larger variants (Tiny at 192d, Base at 768d, Large at 1024d) may have different compression characteristics. And input normalization and patch size play a role in downstream performance that we haven’t disentangled from the compression effects here.</p>
</section>
<section id="takeaways" class="level2">
<h2 class="anchored" data-anchor-id="takeaways">Takeaways</h2>
<p>Some takeaways from these experiments (given the above caveat about patch classification):</p>
<ol type="1">
<li><p><strong>Always use int8.</strong> It is statistically indistinguishable from float32 across every model and dataset we tested (p ≥ 0.12). 4x compression, zero engineering effort, no reason not to.</p></li>
<li><p><strong>Check intrinsic dimensionality before storing.</strong> Many geospatial embeddings carry redundant dimensions. Tessera packs 94% of its variance into 4 dimensions; even DINOv3 can be PCA-reduced to 64d (PCA(64)+int8) with only 1.4% kNN loss.</p></li>
<li><p><strong>PCA(64)+int8 is the sweet spot for DINOv3.</strong> 64 bytes/embedding, 64x compression, 1.4% kNN loss, 96.4% linear accuracy.</p></li>
<li><p><strong>For search indices over binary embeddings, use Hamming distance directly on the bits.</strong> Skip dequantization: it introduces correlated noise that hurts more than it helps.</p></li>
<li><p><strong>Don’t use ternary quantization.</strong> Binary is simpler, uses fewer bits, and performs better in every configuration we tested.</p></li>
<li><p><strong>Tune regularization (C) for linear probes.</strong> The default C=1.0 leaves performance on the table: Tessera gains 0.9% from C=10, DINOv3 gains 0.4% from C=0.1.</p></li>
<li><p><strong>Verify per-class metrics under compression.</strong> Highway and PermanentCrop degrade disproportionately — aggregate accuracy can mask category-level failures.</p></li>
</ol>
</section>
<section id="bibliography" class="level2">
<h2 class="anchored" data-anchor-id="bibliography">Bibliography</h2>
<p><a id="ref-tessera"></a> <strong>[1]</strong> Feng, Z., et al.&nbsp;“Tessera: Global-Scale Pixel Embeddings from Sentinel-2.” arXiv:2506.20380, 2025. <a href="https://arxiv.org/abs/2506.20380">[paper]</a> <a href="https://github.com/ucam-eo/tessera">[code]</a></p>
<p><a id="ref-olmoearth"></a> <strong>[2]</strong> Herzog, H., et al.&nbsp;“OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation.” arXiv:2511.13655, 2025. <a href="https://arxiv.org/abs/2511.13655">[paper]</a> <a href="https://github.com/allenai/olmoearth_pretrain">[code]</a></p>
<p><a id="ref-alphaearth"></a> <strong>[3]</strong> Brown, C.F., et al.&nbsp;“AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data.” arXiv:2507.22291, 2025. <a href="https://arxiv.org/abs/2507.22291">[paper]</a> <a href="https://developers.google.com/earth-engine/datasets/catalog/GOOGLE_SATELLITE_EMBEDDING_V1_ANNUAL">[GEE catalog]</a></p>
<p><a id="ref-embproducts"></a> <strong>[4]</strong> Fang, H., et al.&nbsp;“Earth Embeddings as Products.” arXiv:2601.13134, 2026. <a href="https://arxiv.org/abs/2601.13134">[paper]</a> <a href="https://isaac.earth/earth-embedding-products">[blog]</a></p>
<p><a id="ref-wastingpb"></a> <strong>[5]</strong> Bauer-Marschallinger, B. and Falkner, K. “Wasting Petabytes: A Survey of the Sentinel-2 UTM Tiling Grid and its Spatial Overhead.” <em>ISPRS Journal of Photogrammetry and Remote Sensing</em>, 2023. <a href="https://doi.org/10.1016/j.isprsjprs.2023.07.015">[paper]</a></p>
<p><a id="ref-lps25"></a> <strong>[6]</strong> ESA. “Copernicus Sentinels Mission and Data Management.” Living Planet Symposium, 2025. <a href="https://lps25.esa.int/lps25-presentations/presentations/2505/_2505.pdf">[slides]</a></p>
<p><a id="ref-hackel"></a> <strong>[7]</strong> Hackel, L., Burgert, T., and Demir, B. “How Much of a Model Do We Need? Redundancy and Slimmability in Remote Sensing Foundation Models.” arXiv:2601.22841, 2026. <a href="https://arxiv.org/abs/2601.22841">[paper]</a></p>
<p><a id="ref-hideseek"></a> <strong>[8]</strong> Papazafeiropoulos, T., et al.&nbsp;“Hide and Seek: Investigating Redundancy in Earth Observation Imagery.” arXiv:2603.13524, 2026. <a href="https://arxiv.org/abs/2603.13524">[paper]</a></p>
<p><a id="ref-eurosat"></a> <strong>[9]</strong> Helber, P., et al.&nbsp;“EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification.” <em>IEEE JSTARS</em>, 2019. <a href="https://doi.org/10.1109/JSTARS.2019.2918242">[paper]</a></p>
<p><a id="ref-resisc"></a> <strong>[10]</strong> Cheng, G., Han, J., and Lu, X. “Remote Sensing Image Scene Classification: Benchmark and State of the Art.” <em>Proceedings of the IEEE</em>, 2017. <a href="https://doi.org/10.1109/JPROC.2017.2675998">[paper]</a></p>
<p><a id="ref-geobench"></a> <strong>[11]</strong> Lacoste, A., et al.&nbsp;“GEO-Bench: Toward Foundation Models for Earth Monitoring.” <em>NeurIPS</em>, 2023. <a href="https://arxiv.org/abs/2306.03831">[paper]</a></p>
<p><a id="ref-dinov3"></a> <strong>[12]</strong> Simeoni, O., et al.&nbsp;“DINOv3.” arXiv:2508.10104, 2025. <a href="https://arxiv.org/abs/2508.10104">[paper]</a> <a href="https://github.com/facebookresearch/dinov3">[code]</a></p>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{robinson2026,
  author = {Robinson, Caleb and Corley, Isaac},
  title = {Compressing {Earth} {Embeddings}},
  date = {2026-03-24},
  url = {https://geospatialml.com/posts/compressing-earth-embeddings/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-robinson2026" class="csl-entry quarto-appendix-citeas">
Robinson, Caleb, and Isaac Corley. 2026. <span>“Compressing Earth
Embeddings.”</span> March 24. <a href="https://geospatialml.com/posts/compressing-earth-embeddings/">https://geospatialml.com/posts/compressing-earth-embeddings/</a>.
</div></div></section></div> ]]></description>
  <category>embeddings</category>
  <category>quantization</category>
  <category>compression</category>
  <category>foundation-models</category>
  <guid>https://geospatialml.com/posts/compressing-earth-embeddings/</guid>
  <pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://geospatialml.com/posts/compressing-earth-embeddings/storage_costs.png" medium="image" type="image/png" height="95" width="144"/>
</item>
<item>
  <title>Seeing the Roads Through the Trees: Do Segmentation Models Actually Use Long-Range Context?</title>
  <dc:creator>Caleb Robinson</dc:creator>
  <dc:creator>Isaac Corley</dc:creator>
  <link>https://geospatialml.com/posts/long-range-dependencies/</link>
  <description><![CDATA[ 





<p>How well do segmentation models actually use long-range spatial information to make decisions? No existing benchmark directly measures this, especially in remote sensing where most datasets can be solved with relatively local texture and color cues. This matters beyond any single task — remote sensing is full of cases where local appearance is ambiguous and the correct label depends on spatial context, from mapping flooded areas under tree canopy during disaster response to identifying informal settlements where the signal is the neighborhood-level pattern rather than any individual structure. In <a href="https://arxiv.org/abs/2401.06762">Seeing the Roads Through the Trees</a> we designed a dataset and metric to measure spatial reasoning directly, and found that standard CNN encoder-decoder models are generally bad at it. In this post we revisit the problem with transformer-based architectures and gradient-based receptive field analysis to understand <em>why</em>.</p>
<section id="the-dataset" class="level2">
<h2 class="anchored" data-anchor-id="the-dataset">The dataset</h2>
<p><a href="https://huggingface.co/datasets/torchgeo/ChesapeakeRSC">Chesapeake Roads Spatial Context (RSC)</a> contains 30,000 512x512 NAIP patches from Maryland with 4-band imagery (RGB + near-infrared) and labels for three classes: background, road, and tree canopy over road. The class balance is extreme — 96.3% background, 3.0% road, 0.7% tree canopy over road.</p>
<div style="display: flex; gap: 0.75rem; justify-content: center; max-width: 80%; margin: 0 auto;">
  <img src="https://geospatialml.com/posts/long-range-dependencies/example1.png" alt="Example patch from the dataset" style="width: 48%; height: auto;">
  <img src="https://geospatialml.com/posts/long-range-dependencies/example2.png" alt="Second example patch" style="width: 48%; height: auto;">
</div>
<figcaption style="text-align: center; font-size: 0.9em; color: #666; margin-top: 0.5rem;">Example patches from the dataset. <span style="color: #4a90d9; font-weight: bold;">Blue</span> = visible road pixels, <span style="color: #d94a4a; font-weight: bold;">red</span> = tree canopy over road. The model must classify both as "road," but the red pixels have no local evidence of being road.</figcaption>
<p>The idea is simple: roads pass under tree canopy, and when they do, the local appearance at those pixels looks like trees, not road. A model can only classify those pixels correctly by looking at nearby visible road segments and inferring that the road continues underneath. The distance from each tree-canopy-over-road pixel to the nearest visible road pixel has a median of 4 pixels but a 95th percentile of 107 pixels, so some of these inferences require connecting evidence across a large spatial span.</p>
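<p>As a rough illustration of how such distances can be computed, the sketch below takes a label patch and returns, for every tree-canopy-over-road pixel, the distance to the nearest visible road pixel. The label encoding (<code>0</code>, <code>1</code>, <code>2</code>), the function name, and the use of Euclidean distance are assumptions made for this example. The brute-force pairwise search only suits small patches; a distance transform such as <code>scipy.ndimage.distance_transform_edt</code> would be the more practical choice at 512x512.</p>

```python
import numpy as np

BACKGROUND, ROAD, CANOPY_OVER_ROAD = 0, 1, 2  # assumed label encoding


def distances_to_visible_road(labels):
    """Euclidean distance from each canopy-over-road pixel to the nearest road pixel."""
    labels = np.asarray(labels)
    road = np.argwhere(labels == ROAD).astype(float)
    canopy = np.argwhere(labels == CANOPY_OVER_ROAD).astype(float)
    if len(road) == 0 or len(canopy) == 0:
        return np.empty(0)  # distance is undefined without both classes
    # Brute-force pairwise distances: fine for a toy patch, slow for full tiles.
    diffs = canopy[:, None, :] - road[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)


# Toy 1x6 patch: one visible road pixel followed by three occluded ones.
toy_labels = np.array([[ROAD, CANOPY_OVER_ROAD, CANOPY_OVER_ROAD,
                        CANOPY_OVER_ROAD, BACKGROUND, BACKGROUND]])
toy_distances = distances_to_visible_road(toy_labels)
print(toy_distances.tolist())            # [1.0, 2.0, 3.0]
print(float(np.median(toy_distances)))   # 2.0
```

<p>Summary statistics such as the median and 95th percentile can then be taken over the distances pooled across patches, for example with <code>np.percentile</code>.</p>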
<figure class="figure">
<img src="https://geospatialml.com/posts/long-range-dependencies/dataset_map.png" alt="Map of Maryland showing the distribution of 30,000 train, validation, and test patches" style="width: 80%;" class="figure-img">
<figcaption style="text-align: center; font-size: 0.9em; color: #666;">
Distribution of 30,000 train, validation, and test patches across Maryland.
</figcaption>
</figure>
<p>Other remote sensing datasets with roads (ISPRS Vaihingen/Potsdam, LandCover.ai, DeepGlobe, SpaceNet, RoadTracer) are strong tests of segmentation quality, topology, or connectivity, but none explicitly separate easy road pixels from locally ambiguous ones. Chesapeake RSC partitions the road class by spatial difficulty, which makes it possible to ask not just “how well does this model segment roads?” but “how far away can the model look to make a correct decision?”</p>
</section>
<section id="distance-weighted-recall" class="level2">
<h2 class="anchored" data-anchor-id="distance-weighted-recall">Distance-weighted recall</h2>
<p>To quantify how well a model uses spatial context, we introduced <strong>distance-weighted recall (DWR)</strong>. For each tree-canopy-over-road pixel, we measure its distance to the nearest visible road pixel, then weight the pixel’s contribution to recall by that distance. A model that only gets the easy nearby pixels right will have a high unweighted recall but a low DWR; a model that correctly classifies tree canopy far from any visible road will score much higher.</p>
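<p>One way to read this definition is as a weighted recall in which each tree-canopy-over-road pixel counts in proportion to its distance. The sketch below follows that reading; normalizing by the sum of distances is an assumption of this example, the function names are illustrative, and the distances are assumed to be precomputed.</p>

```python
import numpy as np


def distance_weighted_recall(distances, predicted_road):
    """Recall over occluded road pixels, with each pixel weighted by its distance.

    distances: distance from each tree-canopy-over-road pixel to the nearest
        visible road pixel (assumed precomputed).
    predicted_road: True where the model labelled that pixel as road.
    """
    d = np.asarray(distances, dtype=float)
    hit = np.asarray(predicted_road, dtype=bool)
    if d.sum() == 0:
        return float("nan")  # no occluded pixels, or all weights are zero
    return float((d * hit).sum() / d.sum())


def unweighted_recall(predicted_road):
    """Plain recall: the fraction of occluded road pixels predicted as road."""
    return float(np.mean(np.asarray(predicted_road, dtype=bool)))


# Toy example: three easy pixels next to visible road, one hard pixel far away.
distances = [1, 1, 2, 100]
predicted_road = [True, True, True, False]
print(unweighted_recall(predicted_road))                    # 0.75
print(distance_weighted_recall(distances, predicted_road))  # 4 / 104, about 0.038
```

<p>The toy case shows the behavior described above: missing the single distant pixel barely moves unweighted recall but collapses the distance-weighted score.</p>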
</section>
<section id="theoretical-vs.-effective-receptive-fields" class="level2">
<h2 class="anchored" data-anchor-id="theoretical-vs.-effective-receptive-fields">Theoretical vs.&nbsp;effective receptive fields</h2>
<p>Every segmentation architecture has a <strong>theoretical receptive field (TRF)</strong> — the maximum region of the input that could influence a given output pixel, determined purely by kernel sizes, strides, and network depth. <a href="https://distill.pub/2019/computing-receptive-fields">Araujo et al.&nbsp;(2019)</a> give a clear treatment of how to compute this for convolutional networks.</p>
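<p>For a plain stack of convolution and pooling layers, the standard recurrence adds <code>(kernel_size - 1)</code> times the accumulated stride at each layer. The sketch below implements that recurrence for a single path without dilation; the layer list is a hypothetical toy stack, not the U-Net used in this post, and networks with skip connections need the maximum over all paths.</p>

```python
def theoretical_receptive_field(layers):
    """Receptive field size (in input pixels) of a plain stack of conv/pool layers.

    layers: iterable of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1      # a single output pixel sees one input pixel before any layer
    jump = 1    # spacing, in input pixels, between adjacent units at this depth
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Hypothetical toy stack: 7x7 conv with stride 2, 3x3 pool with stride 2,
# then two 3x3 convs with stride 1.
toy_stack = [(7, 2), (3, 2), (3, 1), (3, 1)]
print(theoretical_receptive_field(toy_stack))  # 27
```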
<p>The <strong>effective receptive field (ERF)</strong> is what the model actually uses. <a href="https://papers.nips.cc/paper/6203-understanding-the-effective-receptive-field-in-deep-convolutional-neural-networks">Luo et al.&nbsp;(2016)</a> showed that in deep CNNs the ERF is typically much smaller than the TRF and has a Gaussian-like concentration around the center pixel. A model can have a 527-pixel theoretical receptive field and still behave as though it only looks at a small local neighborhood. For transformers, self-attention gives a global TRF by construction, but global access does not automatically mean global use.</p>
</section>
<section id="models-and-results" class="level2">
<h2 class="anchored" data-anchor-id="models-and-results">Models and results</h2>
<p>We trained a U-Net with a ResNet-18 backbone (14M params, TRF of 527 pixels) and two SegFormer variants: MiT-B0 (3.7M params) and MiT-B2 (25M params), both with global TRFs via self-attention. <strong>All models were trained on a binary task</strong> (road vs.&nbsp;background, with canopy-over-road grouped into road) using AdamW, cosine annealing, and cross-entropy loss for 150 epochs.</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Model</th>
<th>Params</th>
<th>Road R</th>
<th>Road P</th>
<th>TC/Road R</th>
<th>DWR</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>U-Net (ResNet-18)</td>
<td>14M</td>
<td>83.6</td>
<td>71.8</td>
<td>62.4</td>
<td><strong>44.0</strong></td>
</tr>
<tr class="even">
<td>U-Net (ResNet-18) + Cutout</td>
<td>14M</td>
<td>83.4</td>
<td>71.7</td>
<td>61.8</td>
<td>43.4</td>
</tr>
<tr class="odd">
<td>SegFormer (MiT-B0)</td>
<td>3.7M</td>
<td>83.1</td>
<td>71.7</td>
<td>58.9</td>
<td>37.9</td>
</tr>
<tr class="even">
<td>SegFormer (MiT-B2)</td>
<td>25M</td>
<td><strong>84.6</strong></td>
<td><strong>72.2</strong></td>
<td><strong>63.2</strong></td>
<td>42.3</td>
</tr>
</tbody>
</table>
<p><em>R = recall, P = precision, TC/Road R = recall on tree canopy over road subgroup. Background metrics omitted (all ~99.5%).</em></p>
<p>SegFormer MiT-B2 leads on the overall metrics, with the best road recall (84.6%) and the best tree canopy recall (63.2%). But the U-Net wins on DWR (44.0 vs. 42.3), meaning it’s better at classifying tree canopy pixels that are far from visible road. The SegFormers’ ability to attend to distant tokens doesn’t translate into better performance on the spatially hardest pixels. This isn’t to say the U-Net is <em>good</em> at spatial reasoning (its 62.4% tree canopy recall is still a 21-point drop from its 83.6% visible road recall); rather, the transformers’ global attention doesn’t automatically help here.</p>
<p>We also tested cutout augmentation (randomly masking 64x64 patches of the input during training) to force the model to reconstruct missing regions and improve spatial reasoning. It doesn’t help, and the story was the same across the several cutout sizes we tried. The variant shown here achieves 61.8% tree canopy recall, comparable to the baseline’s 62.4%.</p>
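<p>For readers unfamiliar with cutout, a minimal sketch of the idea follows. It masks a single square per image, fills it with zeros, and leaves the labels untouched; all three choices are assumptions of this example rather than details of the training setup described here.</p>

```python
import numpy as np


def cutout(image, size=64, rng=None):
    """Return a copy of `image` (channels, height, width) with one square zeroed."""
    rng = np.random.default_rng() if rng is None else rng
    _, height, width = image.shape
    # Pick the top-left corner so the square lies fully inside the image.
    top = int(rng.integers(0, height - size + 1))
    left = int(rng.integers(0, width - size + 1))
    out = image.copy()
    out[:, top:top + size, left:left + size] = 0
    return out


# A 4-band 512x512 patch of ones, so the masked area is easy to count.
patch = np.ones((4, 512, 512), dtype=np.float32)
masked = cutout(patch, size=64, rng=np.random.default_rng(0))
print(int((masked == 0).sum()))  # 16384 = 4 bands * 64 * 64 pixels
```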
</section>
<section id="performance-degrades-with-distance" class="level2">
<h2 class="anchored" data-anchor-id="performance-degrades-with-distance">Performance degrades with distance</h2>
<p>For each tree-canopy-over-road pixel in the test set, we measure the distance to the nearest visible road pixel, bin into log-spaced groups, and compute recall within each bin.</p>
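<p>The binning step can be sketched as follows. The number of bins, the function name, and the choice to span the bins from the smallest to the largest observed distance are assumptions of this example; it also assumes every distance is strictly positive, which holds when occluded pixels are never themselves visible road.</p>

```python
import numpy as np


def recall_by_distance(distances, predicted_road, num_bins=8):
    """Recall within log-spaced distance bins.

    Returns (bin_edges, recall_per_bin); bins with no pixels get NaN.
    """
    d = np.asarray(distances, dtype=float)
    hit = np.asarray(predicted_road, dtype=bool)
    edges = np.logspace(np.log10(d.min()), np.log10(d.max()), num_bins + 1)
    # Use interior edges only, so every pixel lands in a bin index 0..num_bins-1.
    idx = np.digitize(d, edges[1:-1])
    recall = np.full(num_bins, np.nan)
    for b in range(num_bins):
        in_bin = idx == b
        if in_bin.any():
            recall[b] = hit[in_bin].mean()
    return edges, recall


# Toy example with two bins split at 10 pixels.
toy_edges, toy_recall = recall_by_distance(
    [1, 2, 50, 100], [True, True, True, False], num_bins=2
)
print(toy_recall.tolist())  # [1.0, 0.5]
```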
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/long-range-dependencies/recall_vs_distance.png" class="img-fluid figure-img" style="width:65.0%"></p>
<figcaption>Recall on tree canopy over road pixels as a function of distance from the nearest visible road pixel (log scale). Both models start at ~74-76% recall for adjacent pixels and decay monotonically.</figcaption>
</figure>
</div>
<p>Both models show monotonic performance degradation. At a distance of ~1 pixel, the U-Net achieves ~76% recall and the SegFormer MiT-B0 ~73%. By ~100 pixels, both are in the 36-43% range, and at 400+ pixels recall falls to 20-28%. The U-Net outperforms the SegFormer MiT-B0 at every distance despite having a narrower effective receptive field.</p>
</section>
<section id="measuring-the-effective-receptive-field" class="level2">
<h2 class="anchored" data-anchor-id="measuring-the-effective-receptive-field">Measuring the effective receptive field</h2>
<p>The distance-stratified recall shows <em>that</em> models fail to use long-range context. Gradient-based ERF analysis shows <em>why</em>.</p>
<p>We computed gradient attributions by backpropagating from pre-softmax road logits to the input for 200 test images, then measured how gradient mass is distributed as a function of radius from the output pixel. The effective diameter at a given percentile is the diameter of the smallest circle centered on the output pixel that encloses that fraction of total gradient mass.</p>
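<p>Given one attribution map, the effective diameter can be read off a cumulative sum of gradient mass sorted by distance from the output pixel. The sketch below does this for a single map; taking absolute gradients summed over channels as the attribution, and the function name, are assumptions of this example, and it says nothing about how maps are aggregated across images.</p>

```python
import numpy as np


def effective_diameter(attribution, center, fraction=0.5):
    """Diameter of the smallest centered circle holding `fraction` of the mass.

    attribution: 2D array of non-negative attribution values (for example,
        absolute input gradients summed over channels).
    center: (row, col) of the output pixel the gradients were taken for.
    """
    a = np.asarray(attribution, dtype=float)
    rows, cols = np.indices(a.shape)
    radius = np.hypot(rows - center[0], cols - center[1]).ravel()
    order = np.argsort(radius)
    cumulative = np.cumsum(a.ravel()[order])
    # First pixel, by increasing radius, at which the enclosed mass reaches the target.
    k = np.searchsorted(cumulative, fraction * cumulative[-1])
    return 2.0 * float(radius[order][k])


# Toy map: mass 2 at the center and mass 1 on each of its four neighbors.
toy_map = np.zeros((5, 5))
toy_map[2, 2] = 2.0
for r, c in [(1, 2), (3, 2), (2, 1), (2, 3)]:
    toy_map[r, c] = 1.0
print(effective_diameter(toy_map, (2, 2), fraction=0.5))  # 2.0
print(effective_diameter(toy_map, (2, 2), fraction=0.2))  # 0.0
```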
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/long-range-dependencies/erf_cumulative_simple.png" class="img-fluid figure-img" style="width:65.0%"></p>
<figcaption>Cumulative gradient mass as a function of radius from the output pixel. The U-Net reaches 50% of its gradient mass within a radius of ~92 pixels; the SegFormer reaches 50% at ~146 pixels. Dashed lines mark the 50th, 90th, and 99th percentile radii.</figcaption>
</figure>
</div>
<p>The U-Net concentrates half its gradient mass within a 184-pixel-diameter circle despite having a 527-pixel theoretical receptive field. The SegFormer reaches 50% at a diameter of 292 pixels, 1.6x wider, but the bulk of its gradient mass (90%) still stays within a 542-pixel diameter.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/long-range-dependencies/erf_effective_diameter_road.png" class="img-fluid figure-img" style="width:65.0%"></p>
<figcaption>Effective receptive field diameter (in pixels) for road-class predictions at three percentile cutoffs. The SegFormer is 1.4-1.6x wider than the U-Net depending on percentile.</figcaption>
</figure>
</div>
<p>The road class has the widest ERF across all models, which suggests models do allocate more spatial attention for road-related predictions. But tree canopy ERFs are approximately equal to background ERFs — when a model needs to look farther to identify a canopy-covered road pixel, it doesn’t.</p>
</section>
<section id="gradient-attribution-interactive-explorer" class="level2">
<h2 class="anchored" data-anchor-id="gradient-attribution-interactive-explorer">Gradient attribution: interactive explorer</h2>
<p>The aggregate ERF statistics above summarize behavior across many pixels. The visualizer below lets you explore gradient attributions for individual predictions — hover over any 8x8 block to see which input pixels the model relies on for that block’s prediction. Toggle the mask overlay to see ground truth labels.</p>
<div id="gv-root">
    <div class="gv-controls">
        <div class="gv-sample-picker">
            <span class="gv-label">Sample</span>
            <div class="gv-thumbs" id="gv-thumbs"></div>
        </div>
        <div class="gv-model-picker">
            <span class="gv-label">Model</span>
            <div class="gv-radios">
                <label class="gv-radio">
                    <input type="radio" name="gv-model" value="unet" checked="">
                    U-Net (ResNet-18)
                </label>
                <label class="gv-radio">
                    <input type="radio" name="gv-model" value="segformer">
                    SegFormer (MiT-B0)
                </label>
            </div>
        </div>
    </div>
    <div class="gv-viewer">
        <div class="gv-image-container" id="gv-hover-area">
            <img id="gv-base-image" class="gv-base-image" alt="Input image" width="512" height="512">
            <img id="gv-mask-overlay" class="gv-mask-overlay" alt="Mask overlay" width="512" height="512">
            <canvas id="gv-gradient-overlay" class="gv-gradient-overlay" width="512" height="512"></canvas>
            <div id="gv-block-highlight" class="gv-block-highlight"></div>
            <div id="gv-loading" class="gv-loading-indicator">Loading…</div>
        </div>
        <div class="gv-info-panel" id="gv-info-panel">
            <div class="gv-info-header">Block Info</div>
            <div id="gv-info-rows">
                <p class="gv-placeholder">Hover over the image to see gradient attribution</p>
            </div>
            <div class="gv-slider-group">
                <label>
                    Gradient opacity: <span id="gv-opacity-val">75%</span>
                </label>
                <input type="range" id="gv-opacity" min="0" max="100" value="75">
            </div>
            <div class="gv-toggle-group">
                <input type="checkbox" id="gv-show-mask">
                <label for="gv-show-mask">Show mask overlay</label>
            </div>
            <div class="gv-toggle-group gv-mask-source-hidden" id="gv-mask-source-group">
                <label class="gv-radio">
                    <input type="radio" name="gv-mask-src" value="pred" checked="">
                    Prediction
                </label>
                <label class="gv-radio">
                    <input type="radio" name="gv-mask-src" value="gt">
                    Ground Truth
                </label>
            </div>
            <div class="gv-legend">
                <div class="gv-legend-title">Legend</div>
                <div class="gv-legend-item">
                    <span class="gv-swatch gv-swatch-bg"></span> Background
                </div>
                <div class="gv-legend-item">
                    <span class="gv-swatch gv-swatch-road"></span> Road
                </div>
                <div class="gv-legend-item">
                    <span class="gv-swatch gv-swatch-canopy"></span> Tree Canopy Over Road
                </div>
                <div class="gv-legend-item">
                    <span class="gv-swatch gv-swatch-gradient"></span> Gradient (inferno)
                </div>
            </div>
        </div>
    </div>
</div>
<style>
#gv-root img {
  margin: 0;
  border-radius: 0;
  max-width: 100%;
}

#gv-root {
  background: #111827;
  border-radius: 8px;
  padding: 20px;
  margin: 1.5rem 0;
  color: #e5e7eb;
  font-family: system-ui, -apple-system, sans-serif;
  font-size: 14px;
  width: 100vw;
  position: relative;
  left: 50%;
  transform: translateX(-50%);
  max-width: 1100px;
  box-sizing: border-box;
}

.gv-controls {
  display: flex;
  gap: 24px;
  margin-bottom: 16px;
  flex-wrap: wrap;
  align-items: flex-start;
}

.gv-label {
  display: block;
  font-size: 11px;
  color: #9ca3af;
  text-transform: uppercase;
  letter-spacing: 0.08em;
  margin-bottom: 6px;
}

.gv-thumbs {
  display: flex;
  gap: 6px;
  flex-wrap: wrap;
}

.gv-thumb {
  width: 64px;
  height: 64px;
  border-radius: 4px;
  border: 2px solid #374151;
  overflow: hidden;
  cursor: pointer;
  transition: border-color 0.15s;
  flex-shrink: 0;
}
.gv-thumb:hover { border-color: #6b7280; }
.gv-thumb.selected { border-color: #22d3ee; box-shadow: 0 0 8px rgba(34,211,238,0.3); }
.gv-thumb img {
  width: 100%;
  height: 100%;
  object-fit: cover;
  image-rendering: pixelated;
  display: block;
}

.gv-radios {
  display: flex;
  gap: 16px;
  flex-wrap: wrap;
}
.gv-radio {
  font-size: 13px;
  color: #9ca3af;
  cursor: pointer;
  display: flex;
  align-items: center;
  gap: 5px;
}
.gv-radio input { accent-color: #22d3ee; cursor: pointer; }

.gv-viewer {
  display: flex;
  gap: 20px;
  align-items: flex-start;
  flex-wrap: wrap;
}

.gv-image-container {
  position: relative;
  width: 512px;
  max-width: 100%;
  aspect-ratio: 1;
  border-radius: 6px;
  overflow: hidden;
  border: 1px solid #374151;
  cursor: crosshair;
  flex-shrink: 0;
  background: #000;
}
.gv-image-container img,
.gv-image-container canvas {
  position: absolute;
  top: 0; left: 0;
  width: 100%; height: 100%;
}
.gv-base-image { z-index: 1; image-rendering: pixelated; }
.gv-mask-overlay {
  z-index: 2; opacity: 0; pointer-events: none;
  image-rendering: pixelated; transition: opacity 0.2s;
}
.gv-gradient-overlay { z-index: 3; opacity: 0; pointer-events: none; }
.gv-block-highlight {
  position: absolute; z-index: 4;
  border: 1.5px solid #22d3ee; border-radius: 1px;
  pointer-events: none; display: none;
  box-shadow: 0 0 6px rgba(34,211,238,0.45);
}
.gv-loading-indicator {
  position: absolute; z-index: 5;
  top: 50%; left: 50%;
  transform: translate(-50%, -50%);
  color: #9ca3af; font-size: 14px;
  display: none;
}

.gv-info-panel {
  background: #1f2937;
  border-radius: 8px;
  padding: 16px 18px;
  min-width: 200px;
  flex: 1;
  max-width: 300px;
}
.gv-info-header {
  font-size: 11px;
  color: #9ca3af;
  text-transform: uppercase;
  letter-spacing: 0.08em;
  margin-bottom: 12px;
}
.gv-info-row {
  display: flex;
  justify-content: space-between;
  align-items: center;
  padding: 5px 0;
  border-bottom: 1px solid #111827;
  font-size: 13px;
}
.gv-info-row:last-child { border-bottom: none; }
.gv-info-label { color: #9ca3af; }
.gv-info-value { color: #fff; font-family: 'SF Mono', Consolas, monospace; font-size: 12px; }

.gv-placeholder { color: #6b7280; font-style: italic; font-size: 13px; padding: 10px 0; }

.gv-slider-group { margin-top: 14px; }
.gv-slider-group label { display: block; font-size: 12px; color: #9ca3af; margin-bottom: 4px; }
.gv-slider-group input[type=range] { width: 100%; accent-color: #22d3ee; cursor: pointer; }

.gv-toggle-group { margin-top: 10px; display: flex; align-items: center; gap: 8px; flex-wrap: wrap; }
.gv-toggle-group label { font-size: 13px; color: #9ca3af; cursor: pointer; }
.gv-toggle-group input[type=checkbox] { accent-color: #22d3ee; cursor: pointer; }

.gv-legend { margin-top: 14px; padding-top: 12px; border-top: 1px solid #374151; }
.gv-legend-title {
  font-size: 11px; color: #9ca3af; text-transform: uppercase;
  letter-spacing: 0.08em; margin-bottom: 8px;
}
.gv-legend-item { display: flex; align-items: center; gap: 8px; font-size: 13px; margin-bottom: 3px; color: #d1d5db; }
.gv-swatch { width: 14px; height: 14px; border-radius: 2px; border: 1px solid #4b5563; flex-shrink: 0; }
.gv-swatch-bg { background: #000; }
.gv-swatch-road { background: #22d3ee; }
.gv-swatch-canopy { background: #f59e0b; }
.gv-swatch-gradient { background: linear-gradient(90deg,#000004,#420a68,#932667,#dd513a,#fca50a,#fcffa4); }
.gv-mask-source-hidden { display: none; }

@media (max-width: 800px) {
  #gv-root { padding: 12px; }
  .gv-image-container { width: 100%; }
  .gv-info-panel { max-width: 100%; }
  .gv-viewer { flex-direction: column; }
}
</style>
<script>
(function() {
  var SAMPLES = ['1717', '2056', '2762', '6212', '8180', '8782'];
  var MODELS = {
    unet:      { suffix: '',                   label: 'U-Net (ResNet-18)' },
    segformer: { suffix: '_segformer-mit-b0',  label: 'SegFormer (MiT-B0)' }
  };
  var STORAGE = 'https://s3.us-west-2.amazonaws.com/us-west-2.opendata.source.coop/calebrob6/geospatialml/gradients';

  var currentSample = SAMPLES[0];
  var currentModel = 'unet';
  var meta = null;
  var gradCache = new Map();
  var metaCache = {};
  var inflight = new Set();
  var wantedPath = null;
  var currentKey = null;
  var preloadQueue = [];
  var gradOpacity = 0.75;
  var PRELOAD_RADIUS = 3;
  var MAX_PRELOADS = 4;

  var container   = document.getElementById('gv-hover-area');
  var baseImage   = document.getElementById('gv-base-image');
  var maskOvl     = document.getElementById('gv-mask-overlay');
  var gradCanvas  = document.getElementById('gv-gradient-overlay');
  var gradCtx     = gradCanvas.getContext('2d');
  var highlight   = document.getElementById('gv-block-highlight');
  var infoRows    = document.getElementById('gv-info-rows');
  var opacityEl   = document.getElementById('gv-opacity');
  var opacityVal  = document.getElementById('gv-opacity-val');
  var showMaskCb  = document.getElementById('gv-show-mask');
  var loadingEl   = document.getElementById('gv-loading');
  var thumbsEl    = document.getElementById('gv-thumbs');

  function getDir() {
    return STORAGE + '/' + currentSample + MODELS[currentModel].suffix + '/';
  }
  function pad3(n) { return String(n).padStart(3, '0'); }

  SAMPLES.forEach(function(id, i) {
    var div = document.createElement('div');
    div.className = 'gv-thumb' + (i === 0 ? ' selected' : '');
    div.dataset.id = id;
    var img = document.createElement('img');
    img.src = STORAGE + '/' + id + '/image.png';
    img.alt = 'Sample ' + id;
    img.loading = 'lazy';
    div.appendChild(img);
    div.addEventListener('click', function() {
      if (currentSample === id) return;
      currentSample = id;
      document.querySelectorAll('.gv-thumb').forEach(function(t) { t.classList.remove('selected'); });
      div.classList.add('selected');
      loadSample();
    });
    thumbsEl.appendChild(div);
  });

  document.querySelectorAll('input[name="gv-model"]').forEach(function(radio) {
    radio.addEventListener('change', function() {
      if (currentModel === this.value) return;
      currentModel = this.value;
      loadSample();
    });
  });

  document.querySelectorAll('input[name="gv-mask-src"]').forEach(function(radio) {
    radio.addEventListener('change', function() {
      var dir = getDir();
      maskOvl.src = this.value === 'gt' ? dir + 'gt_mask.png' : dir + 'mask.png';
    });
  });

  async function loadSample() {
    var dir = getDir();
    var cacheKey = currentSample + '|' + currentModel;
    gradCache.clear();
    inflight.clear();
    preloadQueue = [];
    currentKey = null;
    wantedPath = null;
    gradCtx.clearRect(0, 0, gradCanvas.width, gradCanvas.height);
    gradCanvas.style.opacity = '0';
    highlight.style.display = 'none';
    infoRows.innerHTML = '<p class="gv-placeholder">Hover over the image to see gradient attribution</p>';
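    // Reset the indicator text so an earlier failure message does not persist.
    loadingEl.textContent = 'Loading…';
    loadingEl.style.display = 'block';
    if (metaCache[cacheKey]) {
      meta = metaCache[cacheKey];
    } else {
      try {
        var resp = await fetch(dir + 'metadata.json');
        if (!resp.ok) throw new Error('HTTP ' + resp.status);
        var loaded = await resp.json();
        metaCache[cacheKey] = loaded;
        // Ignore a stale response if the sample or model changed while fetching.
        if (cacheKey !== currentSample + '|' + currentModel) return;
        meta = loaded;
      } catch(e) {
        loadingEl.textContent = 'Failed to load';
        return;
      }
    }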
    gradCanvas.width = meta.image_shape[2];
    gradCanvas.height = meta.image_shape[1];
    baseImage.src = dir + 'image.png';
    maskOvl.src = dir + 'mask.png';
    var predRadio = document.querySelector('input[name="gv-mask-src"][value="pred"]');
    if (predRadio) predRadio.checked = true;
    var maskSrcGroup = document.getElementById('gv-mask-source-group');
    if (meta.has_gt_mask) {
      maskSrcGroup.classList.remove('gv-mask-source-hidden');
    } else {
      maskSrcGroup.classList.add('gv-mask-source-hidden');
    }
    baseImage.onload = function() { loadingEl.style.display = 'none'; };
  }

  function gradPath(r, c) {
    var ext = meta.gradient_format || 'png';
    return getDir() + pad3(r) + '/' + pad3(c) + '.' + ext;
  }

  function drawGradient(bitmap) {
    gradCtx.clearRect(0, 0, gradCanvas.width, gradCanvas.height);
    gradCtx.drawImage(bitmap, 0, 0, gradCanvas.width, gradCanvas.height);
    gradCanvas.style.opacity = gradOpacity;
  }

  function loadBitmap(path) {
    if (gradCache.has(path) || inflight.has(path)) return;
    inflight.add(path);
    fetch(path)
      .then(function(r) { return r.blob(); })
      .then(function(b) { return createImageBitmap(b); })
      .then(function(bmp) {
        gradCache.set(path, bmp);
        inflight.delete(path);
        if (path === wantedPath) drawGradient(bmp);
        pumpPreloads();
      })
      .catch(function() { inflight.delete(path); pumpPreloads(); });
  }

  function requestGradient(r, c) {
    var key = r + ',' + c;
    if (key === currentKey) return;
    currentKey = key;
    var path = gradPath(r, c);
    wantedPath = path;
    if (gradCache.has(path)) {
      drawGradient(gradCache.get(path));
      schedulePreload(r, c);
      return;
    }
    loadBitmap(path);
    schedulePreload(r, c);
  }

  function schedulePreload(r, c) {
    if (!meta) return;
    var GRID_R = meta.grid[0], GRID_C = meta.grid[1];
    var items = [];
    for (var dr = -PRELOAD_RADIUS; dr <= PRELOAD_RADIUS; dr++) {
      for (var dc = -PRELOAD_RADIUS; dc <= PRELOAD_RADIUS; dc++) {
        if (dr === 0 && dc === 0) continue;
        var nr = r + dr, nc = c + dc;
        if (nr >= 0 && nr < GRID_R && nc >= 0 && nc < GRID_C) {
          var p = gradPath(nr, nc);
          if (!gradCache.has(p) && !inflight.has(p)) {
            items.push({ path: p, dist: dr*dr + dc*dc });
          }
        }
      }
    }
    items.sort(function(a, b) { return a.dist - b.dist; });
    preloadQueue = items.map(function(i) { return i.path; });
    pumpPreloads();
  }

  function pumpPreloads() {
    while (inflight.size < MAX_PRELOADS && preloadQueue.length > 0) {
      var p = preloadQueue.shift();
      if (!gradCache.has(p) && !inflight.has(p)) loadBitmap(p);
    }
  }

  function updateInfo(r, c) {
    if (!meta) return;
    var BLOCK_SIZE = meta.block_size;
    var road = meta.block_pred[r][c];
    var cls = road > 0.5 ? 'Road' : 'Background';
    infoRows.innerHTML =
      '<div class="gv-info-row"><span class="gv-info-label">Row, Col</span><span class="gv-info-value">' + r + ', ' + c + '</span></div>' +
      '<div class="gv-info-row"><span class="gv-info-label">Pixel range</span><span class="gv-info-value">' +
        (r*BLOCK_SIZE) + '\u2013' + ((r+1)*BLOCK_SIZE-1) + ', ' +
        (c*BLOCK_SIZE) + '\u2013' + ((c+1)*BLOCK_SIZE-1) + '</span></div>' +
      '<div class="gv-info-row"><span class="gv-info-label">Pred class</span><span class="gv-info-value">' + cls + '</span></div>' +
      '<div class="gv-info-row"><span class="gv-info-label">Road prob</span><span class="gv-info-value">' + (road*100).toFixed(1) + '%</span></div>';
  }

  var raf = null;
  container.addEventListener('mousemove', function(e) {
    if (raf || !meta) return;
    raf = requestAnimationFrame(function() {
      raf = null;
      var rect = container.getBoundingClientRect();
      var x = (e.clientX - rect.left) / rect.width;
      var y = (e.clientY - rect.top) / rect.height;
      var GRID_R = meta.grid[0], GRID_C = meta.grid[1];
      var c = Math.max(0, Math.min(GRID_C-1, Math.floor(x * GRID_C)));
      var r = Math.max(0, Math.min(GRID_R-1, Math.floor(y * GRID_R)));
      var bw = rect.width / GRID_C;
      var bh = rect.height / GRID_R;
      highlight.style.left   = (c * bw) + 'px';
      highlight.style.top    = (r * bh) + 'px';
      highlight.style.width  = bw + 'px';
      highlight.style.height = bh + 'px';
      requestGradient(r, c);
      updateInfo(r, c);
    });
  });

  container.addEventListener('mouseenter', function() {
    highlight.style.display = 'block';
    if (currentKey) gradCanvas.style.opacity = gradOpacity;
  });

  container.addEventListener('mouseleave', function() {
    highlight.style.display = 'none';
    gradCanvas.style.opacity = '0';
    currentKey = null;
    wantedPath = null;
    preloadQueue = [];
    infoRows.innerHTML = '<p class="gv-placeholder">Hover over the image to see gradient attribution</p>';
  });

  container.addEventListener('touchmove', function(e) {
    e.preventDefault();
    if (!meta) return;
    var touch = e.touches[0];
    var rect = container.getBoundingClientRect();
    var x = (touch.clientX - rect.left) / rect.width;
    var y = (touch.clientY - rect.top) / rect.height;
    var GRID_R = meta.grid[0], GRID_C = meta.grid[1];
    var c = Math.max(0, Math.min(GRID_C-1, Math.floor(x * GRID_C)));
    var r = Math.max(0, Math.min(GRID_R-1, Math.floor(y * GRID_R)));
    var bw = rect.width / GRID_C;
    var bh = rect.height / GRID_R;
    highlight.style.left   = (c * bw) + 'px';
    highlight.style.top    = (r * bh) + 'px';
    highlight.style.width  = bw + 'px';
    highlight.style.height = bh + 'px';
    highlight.style.display = 'block';
    requestGradient(r, c);
    updateInfo(r, c);
  }, { passive: false });

  container.addEventListener('touchend', function() {
    highlight.style.display = 'none';
    gradCanvas.style.opacity = '0';
    currentKey = null;
    wantedPath = null;
    preloadQueue = [];
    infoRows.innerHTML = '<p class="gv-placeholder">Hover over the image to see gradient attribution</p>';
  });

  opacityEl.addEventListener('input', function() {
    gradOpacity = this.value / 100;
    opacityVal.textContent = this.value + '%';
    if (currentKey) gradCanvas.style.opacity = gradOpacity;
  });

  showMaskCb.addEventListener('change', function() {
    maskOvl.style.opacity = this.checked ? '0.55' : '0';
  });

  loadSample();
})();
</script>
</section>
<section id="whats-next" class="level2">
<h2 class="anchored" data-anchor-id="whats-next">What’s next</h2>
<p>Across the architectures we tested, the bottleneck appears to be the training signal, not the architecture. Switching from CNN to transformer, increasing model capacity (MiT-B0 -&gt; B2), and adding cutout augmentation all fail to substantially improve spatial reasoning on the hardest pixels. The binary cross-entropy loss treats all road pixels equally — it doesn’t reward the model for propagating information from distant visible road segments to occluded ones. Distance-aware loss functions or auxiliary connectivity tasks might provide a stronger learning signal.</p>
<p>Chesapeake RSC is a controlled version of a broader challenge in remote sensing, and the effective receptive field tools we use here apply directly to any task where local appearance is ambiguous and the correct label depends on spatial context.</p>
</section>
<section id="links" class="level2">
<h2 class="anchored" data-anchor-id="links">Links</h2>
<ul>
<li><strong>Paper</strong>: <a href="https://arxiv.org/abs/2401.06762">Seeing the roads through the trees (arXiv)</a></li>
<li><strong>Code</strong>: <a href="https://github.com/isaaccorley/ChesapeakeRSC">github.com/isaaccorley/ChesapeakeRSC</a></li>
<li><strong>Dataset</strong>: <a href="https://huggingface.co/datasets/torchgeo/ChesapeakeRSC">torchgeo/ChesapeakeRSC on HuggingFace</a></li>
</ul>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{robinson2026,
  author = {Robinson, Caleb and Corley, Isaac},
  title = {Seeing the {Roads} {Through} the {Trees:} {Do} {Segmentation}
    {Models} {Actually} {Use} {Long-Range} {Context?}},
  date = {2026-03-17},
  url = {https://geospatialml.com/posts/long-range-dependencies/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-robinson2026" class="csl-entry quarto-appendix-citeas">
Robinson, Caleb, and Isaac Corley. 2026. <span>“Seeing the Roads Through
the Trees: Do Segmentation Models Actually Use Long-Range
Context?”</span> March 17. <a href="https://geospatialml.com/posts/long-range-dependencies/">https://geospatialml.com/posts/long-range-dependencies/</a>.
</div></div></section></div> ]]></description>
  <category>receptive-fields</category>
  <category>spatial-context</category>
  <category>road-segmentation</category>
  <category>chesapeake-rsc</category>
  <category>semantic-segmentation</category>
  <guid>https://geospatialml.com/posts/long-range-dependencies/</guid>
  <pubDate>Tue, 17 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://geospatialml.com/posts/long-range-dependencies/example1.png" medium="image" type="image/png" height="144" width="144"/>
</item>
<item>
  <title>Characterizing Census Blocks with Satellite Embedding Statistics</title>
  <dc:creator>Caleb Robinson</dc:creator>
  <dc:creator>Isaac Corley</dc:creator>
  <link>https://geospatialml.com/posts/aef-census-block-embeddings/</link>
  <description><![CDATA[ 





<p>How can you join AEF embeddings to census blocks, and how well do they predict different variables? We wrote a <a href="https://gist.github.com/calebrob6/e71adbc64a94e362ec7c251e4fbc5223">script</a> for doing this! We find, for example, that statistics of AEF embeddings can differentiate between urban and rural blocks in Washington with <strong>92.5% accuracy</strong> using a simple logistic regression.</p>
<p>There’s a growing ecosystem of <a href="https://isaac.earth/earth-embedding-products">pixel-level embedding products</a> covering the entire planet — AEF, Clay, Prithvi, and others. These are potentially powerful features for research well beyond remote sensing: sociology, demography, public health, economics — any field that works with administrative boundaries. But there’s still a high technical barrier to actually <em>using</em> them. Going from a wall of raster tiles to a clean feature table keyed by census tract or district requires spatial joins, CRS wrangling, and careful aggregation.</p>
<p>This post is a practical, end-to-end example of how to do exactly that. We take <a href="https://source.coop/tge-labs/aef">AlphaEarth Foundations (AEF)</a> embeddings from <a href="https://source.coop">Source Cooperative</a>, summarize them across the ~149K census blocks in Washington State, and see how well these purely satellite-derived features predict census variables like population density and urban/rural classification.</p>
<section id="the-data" class="level2">
<h2 class="anchored" data-anchor-id="the-data">The data</h2>
<p><strong>AEF embeddings</strong> are 64-dimensional vectors produced by a <a href="https://arxiv.org/abs/2507.22291">geospatial foundation model</a> for every 10m pixel on Earth. See the Google Earth Engine catalog page <a href="https://developers.google.com/earth-engine/datasets/catalog/GOOGLE_SATELLITE_EMBEDDING_V1_ANNUAL">here</a>. They capture land use, land cover, vegetation, and built environment characteristics from satellite imagery. The data is distributed as cloud-optimized GeoTIFFs tiled at 8192x8192 pixels in UTM projection, with a <a href="https://data.source.coop/tge-labs/aef/v1/annual/aef_index.gpkg">GeoPackage spatial index</a> mapping tile footprints to file paths across years 2018-2025.</p>
<p><strong>Census block boundaries</strong> come from the 2020 US Census <a href="https://www.census.gov/cgi-bin/geo/shapefiles/index.php">TIGER/Line shapefiles</a> — the finest-grained census geography, with attributes like population (<code>POP20</code>), housing units (<code>HOUSING20</code>), land/water area, and an urban/rural flag (<code>UR20</code>).</p>
</section>
<section id="method" class="level2">
<h2 class="anchored" data-anchor-id="method">Method</h2>
<section id="step-1-compute-per-block-embedding-statistics" class="level3">
<h3 class="anchored" data-anchor-id="step-1-compute-per-block-embedding-statistics">Step 1: Compute per-block embedding statistics</h3>
<p>For each census block, we compute the <strong>mean</strong> and <strong>standard deviation</strong> of each of the 64 AEF embedding dimensions across all valid 10m pixels within the block for 2020. This produces a 128-dimensional feature vector per block (64 means + 64 stdevs).</p>
<p>Before processing, we filter out blocks that would be uninformative or expensive:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Filter</th>
<th>Blocks removed</th>
<th>Reason</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>All-water (<code>ALAND20 == 0</code>)</td>
<td>8,272 (5.2%)</td>
<td>No land pixels to sample</td>
</tr>
<tr class="even">
<td>Oversized (<code>ALAND20 &gt; 25 km^2</code>)</td>
<td>1,138 (0.7%)</td>
<td>Too expensive to mask at 10m resolution</td>
</tr>
<tr class="odd">
<td><strong>Total kept</strong></td>
<td><strong>148,683 (94.0%)</strong></td>
<td>Covers 99.6% of WA’s population</td>
</tr>
</tbody>
</table>
<p>The pipeline spatial-joins blocks with the AEF tile index, downloads the needed tiles (44 tiles, ~34 GB for Washington in 2020), then masks and aggregates pixel values per block using multithreaded I/O. See <a href="https://gist.github.com/calebrob6/e71adbc64a94e362ec7c251e4fbc5223#file-compute_aef_block_stats-py"><code>compute_aef_block_stats.py</code></a> for the full script.</p>
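<p>As a rough illustration of the aggregation step only (not the linked script), the sketch below computes the mean and standard deviation of each embedding dimension over one block’s pixels and concatenates them. It assumes the valid pixels for a block have already been extracted into a list of vectors; the function name <code>block_features</code>, the toy 2-dim embeddings, and the use of the population standard deviation are assumptions made for the example.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python"><code class="sourceCode python">from statistics import fmean, pstdev


def block_features(pixels):
    """Return per-dimension means followed by per-dimension std devs.

    `pixels` is a list of equal-length embedding vectors, one per valid
    pixel inside a block (a hypothetical input layout for this sketch).
    """
    dims = list(zip(*pixels))  # transpose: one tuple of values per dimension
    means = [fmean(d) for d in dims]
    stds = [pstdev(d) for d in dims]  # population std; the real script may differ
    return means + stds


# Toy block with three pixels and 2-dim embeddings (AEF uses 64 dims).
toy_block = [[0.1, -0.2], [0.3, 0.0], [0.2, 0.2]]
features = block_features(toy_block)
print(len(features))  # twice the embedding dimension
</code></pre></div>
<p>With 64-dim AEF pixels the same function would return the 128 values described above; at state scale you would want array operations rather than Python lists.</p>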
</section>
<section id="step-2-pca-visualization" class="level3">
<h3 class="anchored" data-anchor-id="step-2-pca-visualization">Step 2: PCA visualization</h3>
<p>To visualize the embedding space geographically, we fit a 3-component PCA on the 128-dimensional block feature vectors, scale each component to uint8, and rasterize block polygons at 10m resolution into a 3-band GeoTIFF. The resulting RGB composite shows blocks with similar land use in similar colors. See <a href="https://gist.github.com/calebrob6/e71adbc64a94e362ec7c251e4fbc5223#file-pca_rasterize-py"><code>pca_rasterize.py</code></a> for the full script.</p>
<p>Here’s what the PCA-3 RGB rendering looks like across all of Washington State:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/aef-census-block-embeddings/washington_state_block_pca.png" class="img-fluid figure-img" style="width:85.0%"></p>
<figcaption>PCA-3 RGB rendering of AEF embedding statistics across Washington State census blocks. Color differences reflect differences in the embedding space — similar land use appears in similar colors.</figcaption>
</figure>
</div>
<p>And zoomed into the Seattle/Bellevue metro area, where urban structure is clearly visible at the block level:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/aef-census-block-embeddings/seattle_block_pca.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>PCA-3 RGB rendering zoomed into Seattle/Bellevue. The dense urban core, suburban neighborhoods, and surrounding forests are clearly differentiated by color.</figcaption>
</figure>
</div>
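<p>The post does not say how each PCA component is scaled to uint8, so the following is only one plausible option: a min-max stretch of a component’s per-block scores to the 0 to 255 range. The helper name <code>scale_to_uint8</code> and the toy scores are hypothetical, and <code>pca_rasterize.py</code> may well use a different stretch (for example percentile clipping).</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python"><code class="sourceCode python">def scale_to_uint8(scores):
    """Min-max scale one PCA component's scores to integers in 0..255."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid dividing by zero for a constant component
    return [round(255 * (s - lo) / span) for s in scores]


# Toy scores for one component across three blocks.
red_band = scale_to_uint8([0.0, 1.0, 4.0])
print(red_band)
</code></pre></div>
<p>Applying this to each of the three components gives one R, G, and B value per block, which is what gets burned into the block polygons.</p>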
</section>
<section id="step-3-correlation-with-census-variables" class="level3">
<h3 class="anchored" data-anchor-id="step-3-correlation-with-census-variables">Step 3: Correlation with census variables</h3>
<p>We joined the embedding statistics with census block attributes and tested predictiveness using simple linear models.</p>
<p>We tested two feature representations:</p>
<ul>
<li><strong>128-dim</strong>: The raw 64 per-band means + 64 per-band standard deviations</li>
<li><strong>PCA-10</strong>: A 10-component PCA capturing 79.0% of total variance</li>
</ul>
</section>
</section>
<section id="results" class="level2">
<h2 class="anchored" data-anchor-id="results">Results</h2>
<section id="linear-regression-r2" class="level3">
<h3 class="anchored" data-anchor-id="linear-regression-r2">Linear regression R^2</h3>
<p>We fit ordinary least squares on all 148,683 blocks and report in-sample R^2. No train/test split here — these numbers are upper bounds on what a linear model can extract, meant to gauge the information content of the features rather than predict on held-out data:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Census Variable</th>
<th>R^2 (128-dim)</th>
<th>R^2 (PCA-10)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Log(land area)</td>
<td><strong>0.843</strong></td>
<td>0.665</td>
</tr>
<tr class="even">
<td>Urban/Rural</td>
<td><strong>0.740</strong></td>
<td>0.676</td>
</tr>
<tr class="odd">
<td>Log(population)</td>
<td>0.575</td>
<td>0.366</td>
</tr>
<tr class="even">
<td>Water fraction</td>
<td>0.414</td>
<td>0.073</td>
</tr>
<tr class="odd">
<td>Pop. density</td>
<td>0.380</td>
<td>0.260</td>
</tr>
<tr class="even">
<td>Housing density</td>
<td>0.299</td>
<td>0.165</td>
</tr>
<tr class="odd">
<td>Raw population</td>
<td>0.252</td>
<td>0.105</td>
</tr>
<tr class="even">
<td>Raw housing units</td>
<td>0.229</td>
<td>0.094</td>
</tr>
</tbody>
</table>
<p><strong>Log(land area)</strong> is the most predictable (R^2 = 0.84), which makes intuitive sense: block size tends to track land cover homogeneity, and the per-block embedding statistics capture this. <strong>Urban/rural</strong> is next at R^2 = 0.74, confirming that AEF embeddings strongly encode built environment characteristics. The gap between 128-dim and PCA-10 is also telling: compressing to 10 components loses a lot of signal for some variables (water fraction drops from 0.41 to 0.07), suggesting the tail dimensions carry real information.</p>
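<p>For readers who want the metric spelled out, here is a minimal sketch of in-sample R^2: fit a least-squares line and score it on the same data it was fit to. It uses a single feature to stay dependency-free, whereas the table above comes from fits on 128 or 10 features; the function name <code>in_sample_r2</code> and the toy numbers are made up for the example.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python"><code class="sourceCode python">from statistics import fmean


def in_sample_r2(x, y):
    """Fit y = a + b*x by least squares and return R^2 on the same data."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual and total sums of squares, both on the training data.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Toy feature and target values (not from the post).
x = [0.0, 1.0, 2.0, 3.0]
y = [0.1, 0.9, 2.2, 2.8]
r2 = in_sample_r2(x, y)
print(round(r2, 3))
</code></pre></div>
<p>Because the model is scored on the data it was fit to, this number can only flatter the model relative to a held-out estimate, which is why the table is best read as a measure of information content.</p>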
</section>
<section id="classification-accuracy" class="level3">
<h3 class="anchored" data-anchor-id="classification-accuracy">Classification accuracy</h3>
<p>Using <code>LogisticRegression</code>, now with stratified 5-fold cross-validation (each block is scored only by a model that did not see it during training), we predict Urban vs.&nbsp;Rural:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Task</th>
<th>Baseline</th>
<th>Accuracy (128-dim)</th>
<th>Accuracy (PCA-10)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Urban vs.&nbsp;Rural</td>
<td>59.4%</td>
<td><strong>92.5% +/- 0.1%</strong></td>
<td>90.1% +/- 0.2%</td>
</tr>
</tbody>
</table>
<p>A logistic regression on AEF embedding statistics achieves <strong>92.5% accuracy</strong> at distinguishing urban from rural blocks, a 33 percentage point improvement over the majority-class baseline, using only statistics of pixel embeddings.</p>
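<p>Two ideas in this comparison are easy to sketch without any ML library: the majority-class baseline, and stratified folds that keep the urban/rural ratio roughly equal in every fold. The code below is an illustration under simplifying assumptions (no shuffling, round-robin assignment); the helper names and toy labels are hypothetical, and the actual experiment presumably relies on scikit-learn’s own splitter.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python"><code class="sourceCode python">from collections import Counter, defaultdict


def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, balancing classes."""
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)  # deal each class out round-robin
    return folds


# Toy labels: 6 urban ("U") and 4 rural ("R") blocks.
labels = ["U"] * 6 + ["R"] * 4
baseline = majority_baseline(labels)
folds = stratified_folds(labels, k=2)
print(baseline, [Counter(labels[i] for i in fold) for fold in folds])
</code></pre></div>
<p>Each fold serves once as the held-out set while the model trains on the others, so every block gets exactly one out-of-sample prediction.</p>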
</section>
<section id="pca-component-correlations" class="level3">
<h3 class="anchored" data-anchor-id="pca-component-correlations">PCA component correlations</h3>
<p>To interpret what the PCA components actually capture, we compute the Pearson correlation between each component’s per-block score and several census variables. Values near +1 or -1 indicate a strong linear relationship:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Variable</th>
<th>PC1</th>
<th>PC2</th>
<th>PC3</th>
<th>PC4</th>
<th>PC5</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Urban/Rural</td>
<td><strong>-0.71</strong></td>
<td>-0.34</td>
<td>-0.07</td>
<td>+0.13</td>
<td>+0.00</td>
</tr>
<tr class="even">
<td>Log(land area)</td>
<td><strong>+0.58</strong></td>
<td>+0.39</td>
<td>+0.09</td>
<td>-0.21</td>
<td>+0.01</td>
</tr>
<tr class="odd">
<td>Pop. density</td>
<td><strong>-0.42</strong></td>
<td>-0.21</td>
<td>-0.12</td>
<td>+0.03</td>
<td>-0.07</td>
</tr>
<tr class="even">
<td>Log(population)</td>
<td><strong>-0.40</strong></td>
<td>-0.01</td>
<td>-0.06</td>
<td>+0.29</td>
<td>+0.17</td>
</tr>
<tr class="odd">
<td>Housing density</td>
<td>-0.30</td>
<td>-0.16</td>
<td>-0.08</td>
<td>-0.01</td>
<td>-0.10</td>
</tr>
<tr class="even">
<td>Water fraction</td>
<td>+0.05</td>
<td>+0.06</td>
<td>+0.07</td>
<td>-0.08</td>
<td>-0.09</td>
</tr>
</tbody>
</table>
<p><strong>PC1</strong> is clearly an urban-to-rural axis (r = -0.71 with the urban flag), while <strong>PC2</strong> adds spatial scale information (r = +0.39 with log land area). The higher-order components pick up more nuanced variation: PC3 and PC5 barely correlate with any census variable in the table (|r| of at most 0.17), suggesting those dimensions capture land use patterns that don’t map neatly onto sociodemographic variables.</p>
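<p>Each cell in the table is a Pearson correlation between one component’s scores and one census variable. A minimal standard-library sketch of that computation follows; the function name <code>pearson_r</code> and the toy values are invented for illustration and only mimic the sign of the PC1 vs. urban-flag relationship.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python"><code class="sourceCode python">from math import sqrt
from statistics import fmean


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy data: a component score that falls as the urban flag rises.
pc1_scores = [2.0, 1.5, -0.5, -1.0, -2.0]
urban_flag = [0, 0, 1, 1, 1]
r = pearson_r(pc1_scores, urban_flag)
print(round(r, 2))
</code></pre></div>
<p>Note that the sign of a principal component is arbitrary, so only the magnitude and the pattern across variables carry meaning.</p>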
</section>
</section>
<section id="takeaways" class="level2">
<h2 class="anchored" data-anchor-id="takeaways">Takeaways</h2>
<ol type="1">
<li><p><strong>AEF embeddings encode urban/rural character.</strong> This is not very surprising, but a linear model alone reaches 92.5% classification accuracy.</p></li>
<li><p><strong>The full 128-dim representation is substantially richer than PCA-10.</strong> For log population, R^2 jumps from 0.37 to 0.58 — the higher-order embedding dimensions contain useful information.</p></li>
<li><p><strong>Block-level embedding statistics are a practical feature engineering approach.</strong> Taking the mean and stdev per block compresses many 64-dim pixel vectors into a fixed 128-dim representation per geographic unit, simple enough to throw into any tabular ML pipeline. How to aggregate these was also the question of our recent paper, “From Pixels to Patches: Pooling Strategies for Earth Embeddings” (preprint <a href="https://arxiv.org/abs/2603.02080">here</a>).</p></li>
</ol>
</section>
<section id="reproduction" class="level2">
<h2 class="anchored" data-anchor-id="reproduction">Reproduction</h2>
<p>Both scripts are available in <a href="https://gist.github.com/calebrob6/e71adbc64a94e362ec7c251e4fbc5223">this gist</a>. The precomputed Washington State block-level embedding statistics are available as a GeoParquet file on <a href="https://huggingface.co/datasets/calebrob6/wa-block-aef-stats">Hugging Face</a> if you want to skip the ~34 GB download and play with the data instead.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Download census blocks from:</span></span>
<span id="cb1-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># https://www.census.gov/cgi-bin/geo/shapefiles/index.php</span></span>
<span id="cb1-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># (year 2025, layer "Blocks (2020)", state "Washington")</span></span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Download AEF index</span></span>
<span id="cb1-6"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">curl</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-O</span> https://data.source.coop/tge-labs/aef/v1/annual/aef_index.gpkg</span>
<span id="cb1-7"></span>
<span id="cb1-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compute block statistics (downloads ~34 GB of tiles)</span></span>
<span id="cb1-9"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">python3</span> compute_aef_block_stats.py <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb1-10">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--year</span> 2020 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--max-land-km2</span> 25 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb1-11">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--output</span> wa_block_aef_stats.geoparquet</span>
<span id="cb1-12"></span>
<span id="cb1-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Quick test on Seattle/Bellevue area (~2 tiles)</span></span>
<span id="cb1-14"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">python3</span> compute_aef_block_stats.py <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb1-15">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--bbox</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-122.44</span> 47.49 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-122.07</span> 47.73 <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span></span>
<span id="cb1-16">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--output</span> seattle_bellevue_aef_stats.geoparquet</span>
<span id="cb1-17"></span>
<span id="cb1-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate PCA visualization</span></span>
<span id="cb1-19"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">python3</span> pca_rasterize.py wa_block_aef_stats.geoparquet <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-o</span> wa_pca.tif</span></code></pre></div></div>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{robinson2026,
  author = {Robinson, Caleb and Corley, Isaac},
  title = {Characterizing {Census} {Blocks} with {Satellite} {Embedding}
    {Statistics}},
  date = {2026-03-10},
  url = {https://geospatialml.com/posts/aef-census-block-embeddings/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-robinson2026" class="csl-entry quarto-appendix-citeas">
Robinson, Caleb, and Isaac Corley. 2026. <span>“Characterizing Census
Blocks with Satellite Embedding Statistics.”</span> March 10. <a href="https://geospatialml.com/posts/aef-census-block-embeddings/">https://geospatialml.com/posts/aef-census-block-embeddings/</a>.
</div></div></section></div> ]]></description>
  <category>embeddings</category>
  <category>census</category>
  <category>foundation-models</category>
  <category>pca</category>
  <guid>https://geospatialml.com/posts/aef-census-block-embeddings/</guid>
  <pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://geospatialml.com/posts/aef-census-block-embeddings/washington_state_block_pca.png" medium="image" type="image/png" height="93" width="144"/>
</item>
<item>
  <title>Training a Water Segmentation Model with TorchGeo</title>
  <dc:creator>Caleb Robinson</dc:creator>
  <dc:creator>Isaac Corley</dc:creator>
  <link>https://geospatialml.com/posts/torchgeo-iclr-tutorial/</link>
  <description><![CDATA[ 





<p>One notebook, a few hundred lines of Python, and you go from raw Sentinel-2 imagery to a georeferenced water map you can open in QGIS. That’s the premise of the <a href="https://torchgeo.readthedocs.io/en/stable/tutorials/earth_surface_water.html">TorchGeo tutorial</a> we put together for the <a href="https://ml-for-rs.github.io/iclr2026/">ICLR 2026 ML4RS Workshop</a> (<a href="https://arxiv.org/abs/2603.02386">paper</a>). It walks through the full earth observation (EO) ML workflow: loading multispectral data, training a semantic segmentation model on the <a href="https://zenodo.org/records/5205674">Earth Surface Water dataset</a>, and running gridded inference on a Sentinel-2 scene over Rio de Janeiro.</p>
<section id="why-satellite-imagery-isnt-just-big-computer-vision" class="level2">
<h2 class="anchored" data-anchor-id="why-satellite-imagery-isnt-just-big-computer-vision">Why satellite imagery isn’t just “big computer vision”</h2>
<p>If you’ve tried to plug satellite imagery into a standard computer vision pipeline, you’ve probably run into the friction. Imagery arrives as large georeferenced scenes (often with more than three bands), labels live in separate files with different coordinate reference systems (CRSs) and resolutions, and you can’t just <code>resize</code> and <code>normalize</code> your way to a training loop. Further, once you have a model you need to run inference across entire scenes, which requires stitching together predictions from overlapping tiles and saving the output as a georeferenced raster.</p>
<p>TorchGeo handles this by providing geospatial-aware datasets, samplers, and transforms that slot into standard PyTorch workflows. The key components are:</p>
<ul>
<li><strong>Composable datasets</strong> — use <code>|</code> (union) to mosaic tiles and <code>&amp;</code> (intersection) to pair imagery with labels, all lazily evaluated</li>
<li><strong>Geographic samplers</strong> — <code>RandomGeoSampler</code> for training and <code>GridGeoSampler</code> for inference, sampling in projected coordinates rather than pixel indices</li>
<li><strong>Windowed reads</strong> — no pre-tiling (assuming you have data in Cloud Optimized GeoTIFFs or other cloud native formats); TorchGeo reads only the pixels it needs from large rasters on demand</li>
</ul>
</section>
<section id="the-earth-surface-water-dataset" class="level2">
<h2 class="anchored" data-anchor-id="the-earth-surface-water-dataset">The Earth Surface Water dataset</h2>
<p>The <a href="https://zenodo.org/records/5205674">Earth Surface Water dataset</a> contains Sentinel-2 patches paired with binary water masks from diverse geographic regions. It’s a good fit for a tutorial because it’s small enough to train on quickly but realistic enough to show the full complexity of an EO workflow: patches span multiple UTM zones, the labels are raster masks in separate files, and the task (water vs.&nbsp;non-water) is easy to interpret visually.</p>
</section>
<section id="pairing-imagery-and-labels-across-utm-zones" class="level2">
<h2 class="anchored" data-anchor-id="pairing-imagery-and-labels-across-utm-zones">Pairing imagery and labels across UTM zones</h2>
<p>The tutorial constructs paired <code>RasterDataset</code> objects for imagery and masks, then combines them with TorchGeo’s intersection operator:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torchgeo.datasets <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> RasterDataset</span>
<span id="cb1-2"></span>
<span id="cb1-3">images <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RasterDataset(paths<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>image_dir, crs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"EPSG:3395"</span>, res<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, transforms<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>scale)</span>
<span id="cb1-4">masks <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RasterDataset(paths<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>mask_dir, crs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"EPSG:3395"</span>, res<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb1-5">masks.is_image <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># use nearest-neighbor resampling for discrete labels</span></span>
<span id="cb1-6">dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> images <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> masks</span></code></pre></div></div>
<p>Because the patches are distributed globally (often falling in different UTM zones), the notebook specifies a global CRS (World Mercator, EPSG:3395) so that all samples are consistently aligned during sampling and loading.</p>
</section>
<section id="from-6-bands-to-9-channels-with-spectral-indices" class="level2">
<h2 class="anchored" data-anchor-id="from-6-bands-to-9-channels-with-spectral-indices">From 6 bands to 9 channels with spectral indices</h2>
<p>Satellite data typically has more than three bands, which breaks standard vision preprocessing pipelines. The Earth Surface Water tutorial uses six Sentinel-2 bands — B02 (blue), B03 (green), B04 (red), B08 (NIR) at 10 m resolution, plus B11 and B12 (SWIR) at 20 m. Raw Sentinel-2 digital numbers are divided by 10,000 to convert to surface reflectance (a small detail that’s easy to forget and will silently wreck your training if you skip it).</p>
<p>From those 6 reflectance bands, the notebook computes three spectral indices using TorchGeo’s built-in transforms: NDWI (Normalized Difference Water Index, using green and NIR), MNDWI (Modified NDWI, using green and SWIR2), and NDVI (Normalized Difference Vegetation Index). The full preprocessing pipeline chains index computation and normalization in a single <code>Sequential</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> kornia.augmentation <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> K</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torchgeo.transforms <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> indices</span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compute mean/std over training images for z-score normalization,</span></span>
<span id="cb2-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># then pad with 0s/1s so the 3 index channels pass through unchanged</span></span>
<span id="cb2-6">mean <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.concatenate([band_mean, [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]])</span>
<span id="cb2-7">std <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.concatenate([band_std, [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]])</span>
<span id="cb2-8"></span>
<span id="cb2-9">tfms <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.nn.Sequential(</span>
<span id="cb2-10">    indices.AppendNDWI(index_green<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, index_nir<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>),   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># NDWI: (Green - NIR) / (Green + NIR)</span></span>
<span id="cb2-11">    indices.AppendNDWI(index_green<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, index_nir<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>),   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># MNDWI: (Green - SWIR2) / (Green + SWIR2)</span></span>
<span id="cb2-12">    indices.AppendNDVI(index_nir<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, index_red<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># NDVI: (NIR - Red) / (NIR + Red)</span></span>
<span id="cb2-13">    K.Normalize(mean<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>mean, std<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>std),</span>
<span id="cb2-14">)</span>
<span id="cb2-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Input: 6 bands, Output: 9 channels (6 normalized bands + 3 indices)</span></span></code></pre></div></div>
<p>We pad the mean/std vectors with <code>[0, 0, 0]</code> and <code>[1, 1, 1]</code> so that z-score normalization is a no-op for the index channels, which are already bounded to [-1, 1] by construction.</p>
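<p>To make the no-op claim concrete, here is a minimal pure-Python sketch (not code from the tutorial). The reflectance values and band statistics are made up for illustration, and the helper names <code>normalized_difference</code> and <code>zscore</code> are hypothetical.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">def normalized_difference(a, b):
    """Return (a - b) / (a + b), the form shared by NDWI, MNDWI, and NDVI."""
    return (a - b) / (a + b)


def zscore(values, mean, std):
    """Apply per-channel z-score normalization: (x - mean) / std."""
    return [(x - m) / s for x, m, s in zip(values, mean, std)]


# Made-up reflectances for one pixel: green, red, NIR (illustrative only).
green, red, nir = 0.08, 0.05, 0.30
ndwi = normalized_difference(green, nir)
ndvi = normalized_difference(nir, red)

# Hypothetical band statistics, padded with 0 (mean) and 1 (std) for the indices.
mean = [0.10, 0.07, 0.25] + [0, 0]
std = [0.04, 0.03, 0.10] + [1, 1]

pixel = [green, red, nir, ndwi, ndvi]
normalized = zscore(pixel, mean, std)

# The band channels are rescaled; the index channels pass through unchanged.
assert normalized[3:] == [ndwi, ndvi]
# With non-negative reflectances, a normalized difference stays within [-1, 1].
assert all(1.0 >= abs(v) for v in (ndwi, ndvi))
</code></pre></div></div>
<p>Subtracting 0 and dividing by 1 leaves the index values exactly as they were, which is why the padded statistics are safe to hand to a single normalization step.</p>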
</section>
<section id="adapting-an-rgb-architecture-to-9-channels" class="level2">
<h2 class="anchored" data-anchor-id="adapting-an-rgb-architecture-to-9-channels">Adapting an RGB architecture to 9 channels</h2>
<p>The model is a DeepLabV3 with a ResNet-50 backbone from torchvision, trained from scratch: the ImageNet-pretrained weights expect 3-channel RGB input, so they don’t fit our 9-channel input without modification. The key adaptation is replacing the backbone’s first convolutional layer with one that accepts our 9 input channels:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torchvision.models.segmentation <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> deeplabv3_resnet50</span>
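<span id="cb3-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span></span>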
<span id="cb3-3">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> deeplabv3_resnet50(weights<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>, num_classes<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb3-4">backbone <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.get_submodule(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"backbone"</span>)</span>
<span id="cb3-5">conv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.nn.Conv2d(</span>
<span id="cb3-6">    in_channels<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>, out_channels<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">64</span>,</span>
<span id="cb3-7">    kernel_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>), stride<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>), padding<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>), bias<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb3-8">)</span>
<span id="cb3-9">backbone.register_module(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"conv1"</span>, conv)</span></code></pre></div></div>
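<p>As a rough sanity check on the replacement layer, the arithmetic below (a sketch with hypothetical helper names, not part of the notebook) compares the weight count of the stock 3-channel stem with the 9-channel one, and confirms that the spatial downsampling is unchanged.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">def conv2d_weight_count(in_channels, out_channels, kernel_size):
    """Number of weights in a bias-free 2D convolution."""
    kh, kw = kernel_size
    return in_channels * out_channels * kh * kw


def conv2d_output_size(size, kernel, stride, padding):
    """Output length along one axis (dilation 1): floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


rgb_weights = conv2d_weight_count(3, 64, (7, 7))   # stock ResNet-50 stem
nine_weights = conv2d_weight_count(9, 64, (7, 7))  # 9-channel replacement
print(rgb_weights, nine_weights)  # 9408 28224

# A 512-pixel chip is still halved by the stride-2 stem.
print(conv2d_output_size(512, kernel=7, stride=2, padding=3))  # 256
</code></pre></div></div>
<p>The two weight tensors have different shapes (64x3x7x7 versus 64x9x7x7), so a 3-channel checkpoint cannot be loaded into the new layer directly, while everything downstream of the stem sees the same 64-channel, half-resolution feature map as before.</p>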
<p>The dataset ships with a pre-defined, geographically separated train/validation split — important for avoiding the over-optimistic metrics that spatial autocorrelation can cause in EO. Within each split, <code>RandomGeoSampler</code> draws 512x512 chips in geographic coordinate space, handling CRS alignment and resolution matching automatically. After 10 epochs with Adam (lr=1e-4, weight_decay=0.01) and a batch size of 4, the model reaches <strong>0.977 overall accuracy</strong> and <strong>0.824 IoU</strong> on the validation set. Training takes a few minutes on a single GPU.</p>
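<p>The gap between the accuracy and IoU figures is what you would expect when background pixels dominate. The sketch below uses invented confusion-matrix counts and the standard positive-class IoU definition. Whether the reported IoU covers the water class alone or an average over classes is not stated here, so treat <code>binary_metrics</code> as a hypothetical illustration, not the notebook’s metric code.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">def binary_metrics(tp, fp, fn, tn):
    """Overall accuracy and positive-class IoU from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    iou = tp / (tp + fp + fn)
    return accuracy, iou


# Made-up counts for a scene where about 10% of pixels are water.
accuracy, iou = binary_metrics(tp=90, fp=10, fn=10, tn=890)
print(round(accuracy, 3), round(iou, 3))  # 0.98 0.818
</code></pre></div></div>
<p>True negatives inflate accuracy but never enter the IoU, so IoU is the stricter of the two numbers for a minority class like water.</p>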
</section>
<section id="inference-on-a-sentinel-2-scene" class="level2">
<h2 class="anchored" data-anchor-id="inference-on-a-sentinel-2-scene">Inference on a Sentinel-2 scene</h2>
<p>This is the part of the tutorial where the model stops being a number on a leaderboard and starts being a useful tool! After training, the notebook downloads a Sentinel-2 scene over Rio de Janeiro, Brazil from the <a href="https://planetarycomputer.microsoft.com/">Microsoft Planetary Computer</a>, runs gridded inference across the entire tile, and finally saves the resulting predictions as a georeferenced GeoTIFF.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torch.utils.data <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> DataLoader</span>
<span id="cb4-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torchgeo.datasets <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Sentinel2, stack_samples</span>
<span id="cb4-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torchgeo.samplers <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> GridGeoSampler, Units</span>
<span></span>
<span id="cb4-4">s2_dataset <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Sentinel2(paths<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>scene_dir, bands<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>bands, res<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, transforms<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>scale)</span>
<span id="cb4-5">grid_sampler <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> GridGeoSampler(s2_dataset, size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">512</span>, stride<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">448</span>, units<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>Units.PIXELS)</span>
<span id="cb4-6">s2_dataloader <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DataLoader(</span>
<span id="cb4-7">    s2_dataset, sampler<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>grid_sampler, batch_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span>, collate_fn<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>stack_samples</span>
<span id="cb4-8">)</span></code></pre></div></div>
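<p>The <code>size</code> and <code>stride</code> arguments in the sampler call above determine how many patches cover the scene and how much they overlap. Here is a minimal one-axis sketch of that arithmetic, assuming a 10980-pixel tile side (a 10m Sentinel-2 tile). The helper <code>grid_offsets</code> is hypothetical, and its handling of the last window is a simplifying assumption that may differ from what <code>GridGeoSampler</code> does at the scene edge.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">def grid_offsets(extent, size, stride):
    """Start offsets of windows of `size` pixels, stepped by `stride`, along one axis.

    Simplifying assumption: the final window is shifted back so that it ends
    exactly at `extent`; GridGeoSampler's own edge handling may differ.
    """
    if size >= extent:
        return [0]
    offsets = list(range(0, extent - size + 1, stride))
    if offsets[-1] + size != extent:
        offsets.append(extent - size)
    return offsets


offsets = grid_offsets(extent=10980, size=512, stride=448)
print(len(offsets), offsets[:3], offsets[-1])  # 25 [0, 448, 896] 10468
print(512 - 448)  # 64 pixels shared by neighbouring patches
</code></pre></div></div>
<p>Under these assumptions a full tile needs 25 windows per axis, or 625 patches in total, which at a batch size of 16 is 40 batches.</p>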
<p>The <code>GridGeoSampler</code> tiles the scene into overlapping 512x512 patches (stride=448, so adjacent patches overlap by 64 pixels). Predictions are stitched back together and saved as a GeoTIFF (tiled, compressed, with overviews) that is pixel-aligned with the input scene:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> rasterio <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> rio</span>
<span id="cb5-2"></span>
<span id="cb5-3">profile <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb5-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"driver"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"GTiff"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dtype"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"uint8"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"count"</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,</span>
<span id="cb5-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"width"</span>: img_width, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"height"</span>: img_height,</span>
<span id="cb5-6">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"crs"</span>: crs, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"transform"</span>: transform,</span>
<span id="cb5-7">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"compress"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"deflate"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tiled"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>,</span>
<span id="cb5-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blockxsize"</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">512</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blockysize"</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">512</span>,</span>
<span id="cb5-9">}</span>
<span id="cb5-10"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> rio.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(output_path, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"w"</span>, <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>profile) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> dst:</span>
<span id="cb5-11">    dst.write(prediction, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb5-12">    dst.build_overviews([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span>], rio.enums.Resampling.nearest)</span></code></pre></div></div>
<p>The result is a georeferenced water mask that you can open in QGIS, load into a GIS pipeline, or overlay on the original scene.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/torchgeo-iclr-tutorial/rio_sentinel2.png" class="img-fluid figure-img" style="width:90.0%"></p>
<figcaption>Sentinel-2 true-color composite of Rio de Janeiro</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://geospatialml.com/posts/torchgeo-iclr-tutorial/rio_prediction.png" class="img-fluid figure-img" style="width:90.0%"></p>
<figcaption>Water segmentation predictions (blue) on the same scene</figcaption>
</figure>
</div>
<p>This step bridges the gap between “model that scores well on a test set” and “model that produces a useful geospatial product.” It also lets you explore the model’s behavior beyond aggregate metrics: How sharp are the predictions along coastlines? What’s the smallest water feature it can detect? Where does it fail?</p>
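<p>One way to follow up on those questions is to map a pixel of interest in the mask back to map coordinates, which works because the output shares the scene’s CRS and affine transform. The sketch below assumes a north-up transform with 10m pixels. The origin values are invented and <code>pixel_to_map</code> is a hypothetical helper, though the coefficient order follows the affine convention that rasterio uses.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">def pixel_to_map(transform, row, col):
    """Map a pixel's upper-left corner to CRS coordinates.

    `transform` holds the six affine coefficients (a, b, c, d, e, f):
    x = a*col + b*row + c and y = d*col + e*row + f.
    """
    a, b, c, d, e, f = transform
    return a * col + b * row + c, d * col + e * row + f


# Hypothetical north-up transform: 10 m pixels, origin at (600000, 7500000).
transform = (10.0, 0.0, 600000.0, 0.0, -10.0, 7500000.0)
print(pixel_to_map(transform, row=0, col=0))      # (600000.0, 7500000.0)
print(pixel_to_map(transform, row=512, col=448))  # (604480.0, 7494880.0)
</code></pre></div></div>
<p>Because the mask and the input scene share one transform, the same row and column index refers to the same ground location in both rasters.</p>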
</section>
<section id="try-it-yourself" class="level2">
<h2 class="anchored" data-anchor-id="try-it-yourself">Try it yourself</h2>
<p>The tutorial is distributed as two executable notebooks, and all you need is a machine with a GPU (a Colab T4 works fine):</p>
<ul>
<li><a href="https://torchgeo.readthedocs.io/en/stable/tutorials/torchgeo.html">Introduction to TorchGeo</a> — core abstractions (dataset composition, spatiotemporal indexing, geographic samplers)</li>
<li><a href="https://torchgeo.readthedocs.io/en/stable/tutorials/earth_surface_water.html">Earth Surface Water</a> — the end-to-end case study described in this post</li>
</ul>
<p>For more detail on the design choices and motivation, see our <a href="https://arxiv.org/abs/2603.02386">ICLR 2026 ML4RS Workshop paper</a>. The tutorial also builds on Mauricio Cordeiro’s <a href="https://medium.com/towards-data-science/artificial-intelligence-for-geospatial-analysis-with-pytorchs-torchgeo-part-1-52d17e409f09">3-part Medium series</a> on geospatial analysis with TorchGeo. If you have questions or want to discuss, come find us in the <a href="https://torchgeo.slack.com/join/shared_invite/zt-22rse667m-eqtCeNW0yI000Tl4B~2PIw">TorchGeo Slack</a>.</p>


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-citation"><h2 class="anchored quarto-appendix-heading">Citation</h2><div><div class="quarto-appendix-secondary-label">BibTeX citation:</div><pre class="sourceCode code-with-copy quarto-appendix-bibtex"><code class="sourceCode bibtex">@online{robinson2026,
  author = {Robinson, Caleb and Corley, Isaac},
  title = {Training a {Water} {Segmentation} {Model} with {TorchGeo}},
  date = {2026-03-02},
  url = {https://geospatialml.com/posts/torchgeo-iclr-tutorial/},
  langid = {en}
}
</code></pre><div class="quarto-appendix-secondary-label">For attribution, please cite this work as:</div><div id="ref-robinson2026" class="csl-entry quarto-appendix-citeas">
Robinson, Caleb, and Isaac Corley. 2026. <span>“Training a Water
Segmentation Model with TorchGeo.”</span> March 2. <a href="https://geospatialml.com/posts/torchgeo-iclr-tutorial/">https://geospatialml.com/posts/torchgeo-iclr-tutorial/</a>.
</div></div></section></div> ]]></description>
  <category>torchgeo</category>
  <category>tutorial</category>
  <category>semantic-segmentation</category>
  <category>sentinel-2</category>
  <category>iclr</category>
  <guid>https://geospatialml.com/posts/torchgeo-iclr-tutorial/</guid>
  <pubDate>Mon, 02 Mar 2026 00:00:00 GMT</pubDate>
  <media:content url="https://geospatialml.com/posts/torchgeo-iclr-tutorial/rio_sentinel2.png" medium="image" type="image/png" height="69" width="144"/>
</item>
<item>
  <title>Welcome to GeoSpatial ML</title>
  <dc:creator>Caleb Robinson</dc:creator>
  <dc:creator>Isaac Corley</dc:creator>
  <link>https://geospatialml.com/posts/welcome/</link>
  <description><![CDATA[ 





<p>Welcome to <strong>GeoSpatial ML</strong> — a place to share what we’re exploring, building, and reading at the intersection of geospatial data and machine learning.</p>
<p>Many of us already swap papers, datasets, and half-baked experiments in the <a href="https://torchgeo.slack.com/join/shared_invite/zt-22rse667m-eqtCeNW0yI000Tl4B~2PIw">TorchGeo Slack</a>. This blog is an extension of those conversations — a more permanent home for the things we find interesting each week.</p>
<section id="what-to-expect" class="level2">
<h2 class="anchored" data-anchor-id="what-to-expect">What to expect</h2>
<ul>
<li><strong>Paper highlights</strong> — summaries and takes on new GeoAI / GeoML research we’re reading</li>
<li><strong>Code demos</strong> — small, reproducible experiments with <a href="https://github.com/microsoft/torchgeo">TorchGeo</a> and the broader geospatial ML ecosystem</li>
<li><strong>New models &amp; datasets</strong> — quick tours of recently released foundation models, benchmarks, and datasets worth trying</li>
<li><strong>Geospatial explorations</strong> — anything from satellite imagery tricks to fun visualizations to workflow tips</li>
</ul>
<p>Posts will be short and practical. If something is interesting enough to share in Slack, it’s interesting enough to write up here.</p>
<p>Stay tuned, and come hang out in <a href="https://torchgeo.slack.com/join/shared_invite/zt-22rse667m-eqtCeNW0yI000Tl4B~2PIw">TorchGeo Slack</a> if you haven’t already.</p>


</section>

 ]]></description>
  <category>meta</category>
  <guid>https://geospatialml.com/posts/welcome/</guid>
  <pubDate>Fri, 27 Feb 2026 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>
