How can you join AEF embeddings to census blocks, and how well do they predict different variables? We wrote a script for doing this! We find, for example, that statistics of AEF embeddings can differentiate between urban and rural blocks in Washington with 92.5% accuracy using a simple logistic regression.

There’s a growing ecosystem of pixel-level embedding products covering the entire planet — AEF, Clay, Prithvi, and others. These are potentially powerful features for research well beyond remote sensing: sociology, demography, public health, economics — any field that works with administrative boundaries. But there’s still a high technical barrier to actually using them. Going from a wall of raster tiles to a clean feature table keyed by census tract or district requires spatial joins, CRS wrangling, and careful aggregation.

This post is a practical, end-to-end example of how to do exactly that. We take Alpha Earth Foundation (AEF) embeddings from Source Cooperative, summarize them across the ~149K census blocks in Washington State, and see how well these purely satellite-derived features predict census variables like population density and urban/rural classification.

The data

AEF embeddings are 64-dimensional vectors produced by a geospatial foundation model for every 10m pixel on Earth. See the Google Earth Engine catalog page here. They capture land use, land cover, vegetation, and built environment characteristics from satellite imagery. The data is distributed as cloud-optimized GeoTIFFs tiled at 8192x8192 pixels in UTM projection, with a GeoPackage spatial index mapping tile footprints to file paths across years 2018–2025.

Census block boundaries come from the 2020 US Census TIGER/Line shapefiles — the finest-grained census geography, with attributes like population (POP20), housing units (HOUSING20), land/water area, and an urban/rural flag (UR20).

Method

Step 1: Compute per-block embedding statistics

For each census block, we compute the mean and standard deviation of each of the 64 AEF embedding dimensions across all valid 10m pixels within the block for 2020. This produces a 128-dimensional feature vector per block (64 means + 64 stdevs).

Before processing, we filter out blocks that would be uninformative or expensive:

FilterBlocks removedReason
All-water (ALAND20 == 0)8,272 (5.2%)No land pixels to sample
Oversized (ALAND20 > 25 km²)1,138 (0.7%)Too expensive to mask at 10m resolution
Total kept148,683 (94.0%)Covers 99.6% of WA’s population

The pipeline spatial-joins blocks with the AEF tile index, downloads the needed tiles (44 tiles, ~34 GB for Washington in 2020), then masks and aggregates pixel values per block using multithreaded I/O. See compute_aef_block_stats.py for the full script.

Step 2: PCA visualization

To visualize the embedding space geographically, we fit a 3-component PCA on the 128-dimensional block feature vectors, scale each component to uint8, and rasterize block polygons at 10m resolution into a 3-band GeoTIFF. The resulting RGB composite shows blocks with similar land use in similar colors. See pca_rasterize.py for the full script.

Here’s what the PCA-3 RGB rendering looks like across all of Washington State:

PCA-3 RGB rendering of AEF embedding statistics across Washington State census blocks. Color differences reflect differences in the embedding space — similar land use appears in similar colors.

And zoomed into the Seattle/Bellevue metro area, where urban structure is clearly visible at the block level:

PCA-3 RGB rendering zoomed into Seattle/Bellevue. The dense urban core, suburban neighborhoods, and surrounding forests are clearly differentiated by color.

Step 3: Correlation with census variables

We joined the embedding statistics with census block attributes and tested predictiveness using simple linear models.

We tested two feature representations:

  • 128-dim: The raw 64 per-band means + 64 per-band standard deviations
  • PCA-10: A 10-component PCA capturing 79.0% of total variance

Results

Linear regression R²

We fit ordinary least squares on all 148,683 blocks and report in-sample R². No train/test split here — these numbers are upper bounds on what a linear model can extract, meant to gauge the information content of the features rather than predict on held-out data:

Census VariableR² (128-dim)R² (PCA-10)
Log(land area)0.8430.665
Urban/Rural0.7400.676
Log(population)0.5750.366
Water fraction0.4140.073
Pop. density0.3800.260
Housing density0.2990.165
Raw population0.2520.105
Raw housing units0.2290.094

Log(land area) is the most predictable (R² = 0.84), which makes intuitive sense — block size directly corresponds to land cover homogeneity, and the embeddings capture this. Urban/rural is next at R² = 0.74, confirming that AEF embeddings strongly encode built environment characteristics. The gap between 128-dim and PCA-10 is also telling: compressing to 10 components loses a lot of signal for some variables (water fraction drops from 0.41 to 0.07), suggesting the tail dimensions carry real information.

Classification accuracy

Using LogisticRegression, now with a stratified 5-fold cross-validation (no block sees its own label during training in any given fold), we attempt to train a model to predict Urban vs. Rural:

TaskBaselineAccuracy (128-dim)Accuracy (PCA-10)
Urban vs. Rural59.4%92.5% ± 0.1%90.1% ± 0.2%

A logistic regression on AEF embedding statistics achieves 92.5% accuracy at distinguishing urban from rural blocks — a 33 percentage point improvement over the majority-class baseline – using only statistics of pixel embeddings.

PCA component correlations

To interpret what the PCA components actually capture, we compute the Pearson correlation between each component’s per-block score and several census variables. Values near +1 or -1 indicate a strong linear relationship:

VariablePC1PC2PC3PC4PC5
Urban/Rural-0.71-0.34-0.07+0.13+0.00
Log(land area)+0.58+0.39+0.09-0.21+0.01
Pop. density-0.42-0.21-0.12+0.03-0.07
Log(population)-0.40-0.01-0.06+0.29+0.17
Housing density-0.30-0.16-0.08-0.01-0.10
Water fraction+0.05+0.06+0.07-0.08-0.09

PC1 is clearly an urban-to-rural axis (r = -0.71 with the urban flag), while PC2 adds spatial scale information. The higher-order components pick up more nuanced variation — but even PC5 barely correlates with anything in the census, suggesting those dimensions capture land use patterns that don’t map neatly onto sociodemographic variables.

Takeaways

  1. AEF embeddings encode urban/rural character. While not very surprising, these can get 92.5% classification accuracy from a linear model alone.

  2. The full 128-dim representation is substantially richer than PCA-10. For log population, R² jumps from 0.37 to 0.58 — the higher-order embedding dimensions contain useful information.

  3. Block-level embedding statistics are a practical feature engineering approach. Mean and stdev per block compresses many 64-dim pixel vectors into a fixed 128-dim representation per geographic unit — simple enough to throw into any tabular ML pipeline. How to aggregate these was also the question of our recent paper, “From Pixels to Patches: Pooling Strategies for Earth Embeddings” (preprint here).

Reproduction

Both scripts are available in this gist. The precomputed Washington State block-level embedding statistics are available as a GeoParquet file on Hugging Face if you want to skip the ~34 GB download and play with the data instead.

# Download census blocks from:
# https://www.census.gov/cgi-bin/geo/shapefiles/index.php
# (year 2025, layer "Blocks (2020)", state "Washington")

# Download AEF index
curl -O https://data.source.coop/tge-labs/aef/v1/annual/aef_index.gpkg

# Compute block statistics (downloads ~34 GB of tiles)
python3 compute_aef_block_stats.py \
  --year 2020 --max-land-km2 25 \
  --output wa_block_aef_stats.geoparquet

# Quick test on Seattle/Bellevue area (~2 tiles)
python3 compute_aef_block_stats.py \
  --bbox -122.44 47.49 -122.07 47.73 \
  --output seattle_bellevue_aef_stats.geoparquet

# Generate PCA visualization
python3 pca_rasterize.py wa_block_aef_stats.geoparquet -o wa_pca.tif