How well do segmentation models actually use long-range spatial information to make decisions? No existing benchmark directly measures this, especially in remote sensing where most datasets can be solved with relatively local texture and color cues. This matters beyond any single task — remote sensing is full of cases where local appearance is ambiguous and the correct label depends on spatial context, from mapping flooded areas under tree canopy during disaster response to identifying informal settlements where the signal is the neighborhood-level pattern rather than any individual structure. In Seeing the Roads Through the Trees we designed a dataset and metric to measure spatial reasoning directly, and found that standard CNN encoder-decoder models are generally bad at it. In this post we revisit the problem with transformer-based architectures and gradient-based receptive field analysis to understand why.
The dataset
Chesapeake Roads Spatial Context (RSC) contains 30,000 512×512 NAIP patches from Maryland with 4-band imagery (RGB + near-infrared) and labels for three classes: background, road, and tree canopy over road. The class balance is extreme — 96.3% background, 3.0% road, 0.7% tree canopy over road.

The idea is simple: roads pass under tree canopy, and when they do, the local appearance at those pixels looks like trees, not road. A model can only classify those pixels correctly by looking at nearby visible road segments and inferring that the road continues underneath. The distance from each tree-canopy-over-road pixel to the nearest visible road pixel has a median of 4 pixels but a 95th percentile of 107 pixels, so some of these inferences require connecting evidence across a large spatial span.
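These per-pixel distances fall out of a Euclidean distance transform. A minimal sketch, assuming NumPy/SciPy and binary masks for the visible-road and canopy-over-road classes:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def canopy_distances(road_mask, canopy_mask):
    """Distance (in pixels) from each tree-canopy-over-road pixel
    to the nearest visible road pixel."""
    # EDT on the inverted mask gives, for every pixel, the distance
    # to the nearest True pixel in `road_mask`.
    dist_to_road = distance_transform_edt(~road_mask)
    return dist_to_road[canopy_mask]

# Toy example: a horizontal road that disappears under canopy.
road = np.zeros((9, 9), dtype=bool)
road[4, :3] = True           # visible road on the left
canopy = np.zeros_like(road)
canopy[4, 3:6] = True        # occluded road continuing to the right
print(canopy_distances(road, canopy))  # [1. 2. 3.]
```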

Other remote sensing datasets with roads (ISPRS Vaihingen/Potsdam, LandCover.ai, DeepGlobe, SpaceNet, RoadTracer) are strong tests of segmentation quality, topology, or connectivity, but none explicitly separate easy road pixels from locally ambiguous ones. Chesapeake RSC partitions the road class by spatial difficulty, which makes it possible to ask not just “how well does this model segment roads?” but “how far away can the model look to make a correct decision?”
Distance-weighted recall
To quantify how well a model uses spatial context, we introduced distance-weighted recall (DWR). For each tree-canopy-over-road pixel, we measure its distance to the nearest visible road pixel, then weight the pixel’s contribution to recall by that distance. A model that only gets the easy nearby pixels right will have a high unweighted recall but a low DWR; a model that correctly classifies tree canopy far from any visible road will score much higher.
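In its simplest form this is a weighted mean, with each canopy-over-road pixel's correctness weighted by its distance to the nearest visible road. A sketch of that shape (the paper's exact weighting may differ):

```python
import numpy as np

def distance_weighted_recall(correct, dist):
    """Recall over tree-canopy-over-road pixels, weighting each
    pixel's contribution by its distance to visible road.
    `correct`: boolean per-pixel predictions, `dist`: matching distances."""
    correct = np.asarray(correct, dtype=float)
    dist = np.asarray(dist, dtype=float)
    return (correct * dist).sum() / dist.sum()

dist = np.array([1.0, 2.0, 50.0, 100.0])
near_only = np.array([True, True, False, False])  # only easy pixels right
far_too = np.array([True, True, False, True])     # also gets a far pixel
print(distance_weighted_recall(near_only, dist))  # 3/153 ~ 0.02
print(distance_weighted_recall(far_too, dist))    # 103/153 ~ 0.67
```

Both toy models have 50–75% unweighted recall, but DWR separates them sharply because the far pixel carries most of the weight.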
Theoretical vs. effective receptive fields
Every segmentation architecture has a theoretical receptive field (TRF) — the maximum region of the input that could influence a given output pixel, determined purely by kernel sizes, strides, and network depth. Araujo et al. (2019) give a clear treatment of how to compute this for convolutional networks.
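For a plain stack of convolutions and poolings, the computation reduces to a short recurrence over (kernel size, stride) pairs, following Araujo et al.'s formulation:

```python
def theoretical_rf(layers):
    """Theoretical receptive field of a stack of conv/pool layers,
    each given as (kernel_size, stride): the RF grows by (k - 1)
    times the cumulative stride ("jump") of the preceding layers."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# e.g. three 3x3 convs with the middle one strided:
print(theoretical_rf([(3, 1), (3, 2), (3, 1)]))  # 9
```

This ignores skip connections and dilation, but it is enough to reproduce numbers like the 527-pixel TRF quoted below for a ResNet-18 encoder once its full layer list is plugged in.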
The effective receptive field (ERF) is what the model actually uses. Luo et al. (2016) showed that in deep CNNs the ERF is typically much smaller than the TRF and has a Gaussian-like concentration around the center pixel. A model can have a 527-pixel theoretical receptive field and still behave as though it only looks at a small local neighborhood. For transformers, self-attention gives a global TRF by construction, but global access does not automatically mean global use.
Models and results
We trained a U-Net with a ResNet-18 backbone (14M params, TRF of 527 pixels) and two SegFormer variants: MiT-B0 (3.7M params) and MiT-B2 (25M params), both with global TRFs via self-attention. All models were trained on a binary task (road vs. background, with canopy-over-road grouped into road) using AdamW, cosine annealing, and cross-entropy loss for 150 epochs.
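That training recipe maps onto a few lines of PyTorch. A hypothetical sketch, with a one-layer placeholder standing in for the actual networks:

```python
import torch
from torch import nn, optim

# Placeholder model: 4-band NAIP input, 2 classes (road vs. background).
# Any of the U-Net / SegFormer variants would slot in here instead.
model = nn.Conv2d(4, 2, kernel_size=1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150)
criterion = nn.CrossEntropyLoss()

x = torch.randn(2, 4, 64, 64)           # toy 4-band batch
y = torch.randint(0, 2, (2, 64, 64))    # binary road mask
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
scheduler.step()                        # one cosine-annealing step per epoch
```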
| Model | Params | Road R | Road P | TC/Road R | DWR |
|---|---|---|---|---|---|
| U-Net (ResNet-18) | 14M | 83.6 | 71.8 | 62.4 | 44.0 |
| U-Net (ResNet-18) + Cutout | 14M | 83.4 | 71.7 | 61.8 | 43.4 |
| SegFormer (MiT-B0) | 3.7M | 83.1 | 71.7 | 58.9 | 37.9 |
| SegFormer (MiT-B2) | 25M | 84.6 | 72.2 | 63.2 | 42.3 |
R = recall, P = precision, TC/Road R = recall on tree canopy over road subgroup. Background metrics omitted (all ~99.5%).
SegFormer MiT-B2 leads on the overall metrics, with the best road recall (84.6%) and the best tree canopy recall (63.2%). But the U-Net wins on DWR (44.0 vs. 42.3), meaning it is better at classifying tree canopy pixels that are far from visible road. The SegFormers' ability to attend to distant tokens doesn't translate into better performance on the spatially hardest pixels. This isn't to say the U-Net is good at spatial reasoning (62.4% tree canopy recall is still a 21-point drop from visible road recall); it's that the transformers' global attention doesn't automatically help here.
We also tested cutout augmentation, randomly zeroing 64×64 patches of the input during training, to force the model to predict labels inside masked regions from the surrounding context and thereby encourage spatial reasoning. We tried several cutout sizes and the story was the same: it doesn't help. The variant shown here reaches 61.8% tree canopy recall, comparable to the baseline's 62.4%.
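Cutout itself is a few lines. A sketch assuming channel-first arrays and zero-filled patches:

```python
import numpy as np

def cutout(img, size=64, rng=None):
    """Zero out a random size x size patch of a (C, H, W) image,
    as in the cutout augmentation described above."""
    if rng is None:
        rng = np.random.default_rng()
    _, h, w = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    out = img.copy()
    out[:, top:top + size, left:left + size] = 0
    return out

patch = np.ones((4, 512, 512), dtype=np.float32)
masked = cutout(patch)
print(int((masked == 0).sum()))  # 4 * 64 * 64 = 16384 zeroed values
```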
Performance degrades with distance
For each tree-canopy-over-road pixel in the test set, we measure the distance to the nearest visible road pixel, bin into log-spaced groups, and compute recall within each bin.
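The binning itself can be sketched in a few lines of NumPy, here with a toy model whose accuracy decays exponentially with distance:

```python
import numpy as np

def recall_by_distance(correct, dist, n_bins=8):
    """Bin tree-canopy-over-road pixels by distance to the nearest
    visible road (log-spaced bins) and compute recall in each bin."""
    edges = np.logspace(0, np.log10(dist.max() + 1), n_bins + 1)
    idx = np.digitize(dist, edges) - 1
    return [correct[idx == b].mean() if (idx == b).any() else np.nan
            for b in range(n_bins)]

rng = np.random.default_rng(0)
dist = rng.uniform(1, 400, 10_000)
# Toy model: probability of a correct prediction decays with distance.
correct = rng.random(10_000) < np.exp(-dist / 200)
print(np.round(recall_by_distance(correct, dist), 2))
```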

Both models show monotonic performance degradation. At distance ~1 pixel, the U-Net achieves ~76% recall and the SegFormer ~73%. By ~100 pixels, both are in the 36–43% range. At 400+ pixels, recall falls to 20–28%. The U-Net outperforms the SegFormer MiT-B0 at every distance despite having a narrower effective receptive field.
Measuring the effective receptive field
The distance-stratified recall shows that models fail to use long-range context. Gradient-based ERF analysis shows why.
We computed gradient attributions by backpropagating from pre-softmax road logits to the input for 200 test images, then measured how the gradient mass is distributed as a function of radius from the output pixel. The effective diameter at a given percentile is the smallest circle centered on the output pixel that encloses that fraction of total gradient mass.
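Given a gradient-magnitude map for one output pixel, the effective-diameter statistic can be computed by sorting pixels by radius and accumulating mass. A sketch, using a synthetic Gaussian attribution map of the kind Luo et al. describe:

```python
import numpy as np

def effective_diameter(grad_mag, center, q=0.5):
    """Diameter of the smallest circle centered at `center` that
    encloses a fraction `q` of the total gradient mass."""
    ys, xs = np.indices(grad_mag.shape)
    r = np.hypot(ys - center[0], xs - center[1])
    order = np.argsort(r.ravel())
    # Cumulative gradient mass, walking outward from the center.
    mass = np.cumsum(grad_mag.ravel()[order]) / grad_mag.sum()
    radius = r.ravel()[order][np.searchsorted(mass, q)]
    return 2 * radius

# Synthetic Gaussian-shaped attribution map (sigma = 10 pixels).
ys, xs = np.indices((101, 101))
g = np.exp(-((ys - 50) ** 2 + (xs - 50) ** 2) / (2 * 10 ** 2))
print(effective_diameter(g, (50, 50), q=0.5))  # ~ 2 * sigma * sqrt(2 ln 2) ~ 23.5
```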

The U-Net concentrates half its gradient mass within a 184-pixel-diameter circle despite having a 527-pixel theoretical receptive field. The SegFormer reaches 50% at 292 pixels, about 1.6× wider, but 90% of its gradient mass still falls within a 542-pixel diameter.

The road class has the widest ERF across all models, which suggests models do allocate more spatial attention for road-related predictions. But tree canopy ERFs are approximately equal to background ERFs — when a model needs to look farther to identify a canopy-covered road pixel, it doesn’t.
Gradient attribution: interactive explorer
The aggregate ERF statistics above summarize behavior across many pixels. The visualizer below lets you explore gradient attributions for individual predictions — hover over any 8×8 block to see which input pixels the model relies on for that block’s prediction. Toggle the mask overlay to see ground truth labels.
What’s next
Across the architectures we tested, the bottleneck appears to be the training signal, not the architecture. Switching from CNN to transformer, increasing model capacity (MiT-B0 → B2), and adding cutout augmentation all fail to substantially improve spatial reasoning on the hardest pixels. The binary cross-entropy loss treats all road pixels equally — it doesn’t reward the model for propagating information from distant visible road segments to occluded ones. Distance-aware loss functions or auxiliary connectivity tasks might provide a stronger learning signal.
Chesapeake RSC is a controlled version of a broader challenge in remote sensing, and the effective receptive field tools we use here apply directly to any task where local appearance is ambiguous and the correct label depends on spatial context.