How well do segmentation models actually use long-range spatial information to make decisions? No existing benchmark directly measures this, especially in remote sensing where most datasets can be solved with relatively local texture and color cues. This matters beyond any single task — remote sensing is full of cases where local appearance is ambiguous and the correct label depends on spatial context, from mapping flooded areas under tree canopy during disaster response to identifying informal settlements where the signal is the neighborhood-level pattern rather than any individual structure. In Seeing the Roads Through the Trees we designed a dataset and metric to measure spatial reasoning directly, and found that standard CNN encoder-decoder models are generally bad at it. In this post we revisit the problem with transformer-based architectures and gradient-based receptive field analysis to understand why.
The dataset
Chesapeake Roads Spatial Context (RSC) contains 30,000 512×512 NAIP patches from Maryland with 4-band imagery (RGB + near-infrared) and labels for three classes: background, road, and tree canopy over road. The class balance is extreme — 96.3% background, 3.0% road, 0.7% tree canopy over road.

The idea is simple: roads pass under tree canopy, and when they do, the local appearance at those pixels looks like trees, not road. A model can only classify those pixels correctly by looking at nearby visible road segments and inferring that the road continues underneath. The distance from each tree-canopy-over-road pixel to the nearest visible road pixel has a median of 4 pixels but a 95th percentile of 107 pixels, so some of these inferences require connecting evidence across a large spatial span.
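These per-pixel distances fall out of a Euclidean distance transform. A minimal sketch, assuming NumPy/SciPy and binary masks for the visible-road and canopy-over-road classes:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def canopy_distances(road_mask, canopy_mask):
    """Distance (in pixels) from each tree-canopy-over-road pixel
    to the nearest visible road pixel."""
    # EDT on the inverted mask gives, for every pixel, the distance
    # to the nearest True pixel in `road_mask`.
    dist_to_road = distance_transform_edt(~road_mask)
    return dist_to_road[canopy_mask]

# Toy example: a horizontal road that disappears under canopy.
road = np.zeros((9, 9), dtype=bool)
road[4, :3] = True           # visible road on the left
canopy = np.zeros_like(road)
canopy[4, 3:6] = True        # occluded road continuing to the right
print(canopy_distances(road, canopy))  # [1. 2. 3.]
```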

Other remote sensing datasets with roads (ISPRS Vaihingen/Potsdam, LandCover.ai, DeepGlobe, SpaceNet, RoadTracer) are strong tests of segmentation quality, topology, or connectivity, but none explicitly separate easy road pixels from locally ambiguous ones. Chesapeake RSC partitions the road class by spatial difficulty, which makes it possible to ask not just “how well does this model segment roads?” but “how far away can the model look to make a correct decision?”
Distance-weighted recall
To quantify how well a model uses spatial context, we introduced distance-weighted recall (DWR). For each tree-canopy-over-road pixel, we measure its distance to the nearest visible road pixel, then weight the pixel’s contribution to recall by that distance. A model that only gets the easy nearby pixels right will have a high unweighted recall but a low DWR; a model that correctly classifies tree canopy far from any visible road will score much higher.
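In its simplest form this is a weighted mean, with each canopy-over-road pixel's correctness weighted by its distance to the nearest visible road. A sketch of that shape (the paper's exact weighting may differ):

```python
import numpy as np

def distance_weighted_recall(correct, dist):
    """Recall over tree-canopy-over-road pixels, weighting each
    pixel's contribution by its distance to visible road.
    `correct`: boolean per-pixel predictions, `dist`: matching distances."""
    correct = np.asarray(correct, dtype=float)
    dist = np.asarray(dist, dtype=float)
    return (correct * dist).sum() / dist.sum()

dist = np.array([1.0, 2.0, 50.0, 100.0])
near_only = np.array([True, True, False, False])  # only easy pixels right
far_too = np.array([True, True, False, True])     # also gets a far pixel
print(distance_weighted_recall(near_only, dist))  # 3/153 ~ 0.02
print(distance_weighted_recall(far_too, dist))    # 103/153 ~ 0.67
```

Both toy models have 50–75% unweighted recall, but DWR separates them sharply because the far pixel carries most of the weight.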
Theoretical vs. effective receptive fields
Every segmentation architecture has a theoretical receptive field (TRF) — the maximum region of the input that could influence a given output pixel, determined purely by kernel sizes, strides, and network depth. Araujo et al. (2019) give a clear treatment of how to compute this for convolutional networks.
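For a plain stack of convolutions and poolings, the computation reduces to a short recurrence over (kernel size, stride) pairs, following Araujo et al.'s formulation:

```python
def theoretical_rf(layers):
    """Theoretical receptive field of a stack of conv/pool layers,
    each given as (kernel_size, stride): the RF grows by (k - 1)
    times the cumulative stride ("jump") of the preceding layers."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# e.g. three 3x3 convs with the middle one strided:
print(theoretical_rf([(3, 1), (3, 2), (3, 1)]))  # 9
```

This ignores skip connections and dilation, but it is enough to reproduce numbers like the 527-pixel TRF quoted below for a ResNet-18 encoder once its full layer list is plugged in.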
The effective receptive field (ERF) is what the model actually uses. Luo et al. (2016) showed that in deep CNNs the ERF is typically much smaller than the TRF and has a Gaussian-like concentration around the center pixel. A model can have a 527-pixel theoretical receptive field and still behave as though it only looks at a small local neighborhood. For transformers, self-attention gives a global TRF by construction, but global access does not automatically mean global use.
Models and results
We trained a U-Net with a ResNet-18 backbone (14M params, TRF of 527 pixels) and two SegFormer variants: MiT-B0 (3.7M params) and MiT-B2 (25M params), both with global TRFs via self-attention. All models were trained on a binary task (road vs. background, with canopy-over-road grouped into road) using AdamW, cosine annealing, and cross-entropy loss for 150 epochs.
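That training recipe maps onto a few lines of PyTorch. A hypothetical sketch, with a one-layer placeholder standing in for the actual networks:

```python
import torch
from torch import nn, optim

# Placeholder model: 4-band NAIP input, 2 classes (road vs. background).
# Any of the U-Net / SegFormer variants would slot in here instead.
model = nn.Conv2d(4, 2, kernel_size=1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150)
criterion = nn.CrossEntropyLoss()

x = torch.randn(2, 4, 64, 64)           # toy 4-band batch
y = torch.randint(0, 2, (2, 64, 64))    # binary road mask
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
scheduler.step()                        # one cosine-annealing step per epoch
```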
| Model | Params | Road R | Road P | TC/Road R | DWR |
|---|---|---|---|---|---|
| U-Net (ResNet-18) | 14M | 83.6 | 71.8 | 62.4 | 44.0 |
| U-Net (ResNet-18) + Cutout | 14M | 83.4 | 71.7 | 61.8 | 43.4 |
| SegFormer (MiT-B0) | 3.7M | 83.1 | 71.7 | 58.9 | 37.9 |
| SegFormer (MiT-B2) | 25M | 84.6 | 72.2 | 63.2 | 42.3 |
R = recall, P = precision, TC/Road R = recall on tree canopy over road subgroup. Background metrics omitted (all ~99.5%).
SegFormer MiT-B2 leads on the overall metrics, with the best road recall (84.6%) and the best tree canopy recall (63.2%). But the U-Net wins on DWR (44.0 vs. 42.3), meaning it is better at classifying tree canopy pixels that are far from visible road. The SegFormers' ability to attend to distant tokens doesn't translate into better performance on the spatially hardest pixels. This isn't to say the U-Net is good at spatial reasoning (62.4% tree canopy recall is still a 21-point drop from visible road recall); it's that the transformers' global attention doesn't automatically help here.
We also tested cutout augmentation, randomly zeroing 64×64 patches of the input during training, to force the model to predict labels inside masked regions from the surrounding context and thereby encourage spatial reasoning. We tried several cutout sizes and the story was the same: it doesn't help. The variant shown here reaches 61.8% tree canopy recall, comparable to the baseline's 62.4%.
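Cutout itself is a few lines. A sketch assuming channel-first arrays and zero-filled patches:

```python
import numpy as np

def cutout(img, size=64, rng=None):
    """Zero out a random size x size patch of a (C, H, W) image,
    as in the cutout augmentation described above."""
    if rng is None:
        rng = np.random.default_rng()
    _, h, w = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    out = img.copy()
    out[:, top:top + size, left:left + size] = 0
    return out

patch = np.ones((4, 512, 512), dtype=np.float32)
masked = cutout(patch)
print(int((masked == 0).sum()))  # 4 * 64 * 64 = 16384 zeroed values
```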
Performance degrades with distance
For each tree-canopy-over-road pixel in the test set, we measure the distance to the nearest visible road pixel, bin into log-spaced groups, and compute recall within each bin.
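The binning itself can be sketched in a few lines of NumPy, here with a toy model whose accuracy decays exponentially with distance:

```python
import numpy as np

def recall_by_distance(correct, dist, n_bins=8):
    """Bin tree-canopy-over-road pixels by distance to the nearest
    visible road (log-spaced bins) and compute recall in each bin."""
    edges = np.logspace(0, np.log10(dist.max() + 1), n_bins + 1)
    idx = np.digitize(dist, edges) - 1
    return [correct[idx == b].mean() if (idx == b).any() else np.nan
            for b in range(n_bins)]

rng = np.random.default_rng(0)
dist = rng.uniform(1, 400, 10_000)
# Toy model: probability of a correct prediction decays with distance.
correct = rng.random(10_000) < np.exp(-dist / 200)
print(np.round(recall_by_distance(correct, dist), 2))
```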

Both models show monotonic performance degradation. At distance ~1 pixel, the U-Net achieves ~76% recall and the SegFormer ~73%. By ~100 pixels, both are in the 36–43% range. At 400+ pixels, recall falls to 20–28%. The U-Net outperforms the SegFormer MiT-B0 at every distance despite having a narrower effective receptive field.
Measuring the effective receptive field
The distance-stratified recall shows that models fail to use long-range context. Gradient-based ERF analysis shows why.
We computed gradient attributions by backpropagating from pre-softmax road logits to the input for 200 test images, then measured how the gradient mass is distributed as a function of radius from the output pixel. The effective diameter at a given percentile is the smallest circle centered on the output pixel that encloses that fraction of total gradient mass.
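Given a gradient-magnitude map for one output pixel, the effective-diameter statistic can be computed by sorting pixels by radius and accumulating mass. A sketch, using a synthetic Gaussian attribution map of the kind Luo et al. describe:

```python
import numpy as np

def effective_diameter(grad_mag, center, q=0.5):
    """Diameter of the smallest circle centered at `center` that
    encloses a fraction `q` of the total gradient mass."""
    ys, xs = np.indices(grad_mag.shape)
    r = np.hypot(ys - center[0], xs - center[1])
    order = np.argsort(r.ravel())
    # Cumulative gradient mass, walking outward from the center.
    mass = np.cumsum(grad_mag.ravel()[order]) / grad_mag.sum()
    radius = r.ravel()[order][np.searchsorted(mass, q)]
    return 2 * radius

# Synthetic Gaussian-shaped attribution map (sigma = 10 pixels).
ys, xs = np.indices((101, 101))
g = np.exp(-((ys - 50) ** 2 + (xs - 50) ** 2) / (2 * 10 ** 2))
print(effective_diameter(g, (50, 50), q=0.5))  # ~ 2 * sigma * sqrt(2 ln 2) ~ 23.5
```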

The U-Net concentrates half its gradient mass within a 184-pixel-diameter circle despite having a 527-pixel theoretical receptive field. The SegFormer reaches 50% at 292 pixels, about 1.6× wider, but 90% of its gradient mass still falls within a 542-pixel diameter.

The road class has the widest ERF across all models, which suggests models do allocate more spatial attention for road-related predictions. But tree canopy ERFs are approximately equal to background ERFs — when a model needs to look farther to identify a canopy-covered road pixel, it doesn’t.
Gradient attribution: interactive explorer
The aggregate ERF statistics above summarize behavior across many pixels. The visualizer below lets you explore gradient attributions for individual predictions — hover over any 8×8 block to see which input pixels the model relies on for that block’s prediction. Toggle the mask overlay to see ground truth labels.
What’s next
Across the architectures we tested, the bottleneck appears to be the training signal, not the architecture. Switching from CNN to transformer, increasing model capacity (MiT-B0 → B2), and adding cutout augmentation all fail to substantially improve spatial reasoning on the hardest pixels. The binary cross-entropy loss treats all road pixels equally — it doesn’t reward the model for propagating information from distant visible road segments to occluded ones. Distance-aware loss functions or auxiliary connectivity tasks might provide a stronger learning signal.
Chesapeake RSC is a controlled version of a broader challenge in remote sensing, and the effective receptive field tools we use here apply directly to any task where local appearance is ambiguous and the correct label depends on spatial context.