One notebook, a few hundred lines of Python, and you go from raw Sentinel-2 imagery to a georeferenced water map you can open in QGIS. That’s the premise of the TorchGeo tutorial we put together for the ICLR 2026 ML4RS Workshop (paper). It walks through the full earth observation (EO) ML workflow: loading multispectral data, training a semantic segmentation model on the Earth Surface Water dataset, and running gridded inference on a Sentinel-2 scene over Rio de Janeiro.
Why satellite imagery isn’t just “big computer vision”
If you’ve tried to plug satellite imagery into a standard computer vision pipeline, you’ve probably run into the friction. Imagery arrives as large georeferenced scenes (often with more than three bands), labels live in separate files with different coordinate reference systems (CRSs) and resolutions, and you can’t just resize and normalize your way to a training loop. Further, once you have a model you need to run inference across entire scenes, which requires stitching together predictions from overlapping tiles and saving the output as a georeferenced raster.
TorchGeo handles this by providing geospatial-aware datasets, samplers, and transforms that slot into standard PyTorch workflows. The key components are:
- Composable datasets — use | (union) to mosaic tiles and & (intersection) to pair imagery with labels, all lazily evaluated
- Geographic samplers — RandomGeoSampler for training and GridGeoSampler for inference, sampling in projected coordinates rather than pixel indices
- Windowed reads — no pre-tiling (assuming your data is in Cloud Optimized GeoTIFFs or other cloud-native formats); TorchGeo reads only the pixels it needs from large rasters on demand
The Earth Surface Water dataset
The Earth Surface Water dataset contains Sentinel-2 patches paired with binary water masks from diverse geographic regions. It’s a good fit for a tutorial because it’s small enough to train on quickly but realistic enough to show the full complexity of an EO workflow: patches span multiple UTM zones, the labels are raster masks in separate files, and the task (water vs. non-water) is easy to interpret visually.
Pairing imagery and labels across UTM zones
The tutorial constructs paired RasterDataset objects for imagery and masks, then combines them with TorchGeo’s intersection operator:
from torchgeo.datasets import RasterDataset
images = RasterDataset(paths=image_dir, crs="EPSG:3395", res=10, transforms=scale)
masks = RasterDataset(paths=mask_dir, crs="EPSG:3395", res=10)
masks.is_image = False # use nearest-neighbor resampling for discrete labels
dataset = images & masks
Because the patches are distributed globally (often falling in different UTM zones), the notebook specifies a global CRS (World Mercator, EPSG:3395) so that all samples are consistently aligned during sampling and loading.
From 6 bands to 9 channels with spectral indices
Satellite data typically has more than three bands, which breaks standard vision preprocessing pipelines. The Earth Surface Water tutorial uses six Sentinel-2 bands — B02 (blue), B03 (green), B04 (red), B08 (NIR) at 10 m resolution, plus B11 and B12 (SWIR) at 20 m. Raw Sentinel-2 digital numbers are divided by 10,000 to convert to surface reflectance (a small detail that’s easy to forget and will silently wreck your training if you skip it).
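The scale transform passed to the dataset constructors is this division by 10,000. A minimal sketch of such a transform, following TorchGeo's convention of operating on sample dicts (the exact function body here is our illustration, shown with a numpy array; in the tutorial it runs on torch tensors):

```python
import numpy as np

def scale(sample: dict) -> dict:
    # Convert raw Sentinel-2 digital numbers (DN) to surface reflectance.
    # A numpy array behaves the same as a torch tensor for this division.
    sample["image"] = sample["image"] / 10000.0
    return sample

dn = {"image": np.array([1234.0, 5678.0, 10000.0])}
refl = scale(dn)["image"]
# typical reflectance values now fall in roughly [0, 1]
```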
From those 6 reflectance bands, the notebook computes three spectral indices using TorchGeo’s built-in transforms: NDWI (Normalized Difference Water Index, using green and NIR), MNDWI (Modified NDWI, using green and SWIR2), and NDVI (Normalized Difference Vegetation Index). The full preprocessing pipeline chains index computation and normalization in a single Sequential:
import numpy as np
import torch
import kornia.augmentation as K
from torchgeo.transforms import indices
# Compute mean/std over training images for z-score normalization,
# then pad with 0s/1s so the 3 index channels pass through unchanged
mean = np.concatenate([band_mean, [0, 0, 0]])
std = np.concatenate([band_std, [1, 1, 1]])
tfms = torch.nn.Sequential(
indices.AppendNDWI(index_green=1, index_nir=3), # NDWI: (Green - NIR) / (Green + NIR)
indices.AppendNDWI(index_green=1, index_nir=5), # MNDWI: (Green - SWIR2) / (Green + SWIR2)
indices.AppendNDVI(index_nir=3, index_red=2), # NDVI: (NIR - Red) / (NIR + Red)
K.Normalize(mean=mean, std=std),
)
# Input: 6 bands, Output: 9 channels (6 normalized bands + 3 indices)
We pad the mean/std vectors with [0, 0, 0] and [1, 1, 1] so that z-score normalization becomes a no-op for the index channels, which are already bounded in [-1, 1] by construction.
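A quick numpy check makes both claims concrete (toy reflectance values, not from the dataset): a normalized-difference index is bounded in [-1, 1] whenever its inputs are non-negative, and z-scoring with mean 0 and std 1 is the identity.

```python
import numpy as np

# Toy green/NIR reflectance values (non-negative)
green = np.array([0.10, 0.30, 0.05])
nir = np.array([0.40, 0.10, 0.05])

# Normalized-difference index: bounded in [-1, 1] for inputs >= 0
ndwi = (green - nir) / (green + nir)

# Z-scoring with mean 0 and std 1 is the identity, so padding the
# mean/std vectors with 0s/1s passes index channels through unchanged
normalized = (ndwi - 0.0) / 1.0
```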
Adapting an RGB architecture to 9 channels
The model is a DeepLabV3 with a ResNet-50 backbone from torchvision, trained from scratch — ImageNet-pretrained weights expect 3-channel RGB input, so they’re not useful here. The key adaptation is reinitializing the first convolutional layer to accept our 9 input channels:
from torchvision.models.segmentation import deeplabv3_resnet50
model = deeplabv3_resnet50(weights=None, num_classes=2)
backbone = model.get_submodule("backbone")
conv = torch.nn.Conv2d(
in_channels=9, out_channels=64,
kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False,
)
backbone.register_module("conv1", conv)
The dataset ships with a pre-defined, geographically separated train/validation split — important for avoiding the over-optimistic metrics that spatial autocorrelation can cause in EO. Within each split, RandomGeoSampler draws 512x512 chips in geographic coordinate space, handling CRS alignment and resolution matching automatically. After 10 epochs with Adam (lr=1e-4, weight_decay=0.01) and a batch size of 4, the model reaches 0.977 overall accuracy and 0.824 IoU on the validation set. Training takes a few minutes on a single GPU.
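The IoU reported above is intersection over union for the water class. A minimal numpy version of the metric (toy masks for illustration; not the tutorial's metric code) looks like:

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    # Intersection over union for the positive class (1 = water)
    inter = np.logical_and(pred == 1, target == 1).sum()
    union = np.logical_or(pred == 1, target == 1).sum()
    return inter / union if union > 0 else float("nan")

pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
score = iou(pred, target)  # intersection = 2, union = 4 -> 0.5
```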
Inference on a Sentinel-2 scene
This is the part of the tutorial where the model stops being a number on a leaderboard and starts being a useful tool! After training, the notebook downloads a Sentinel-2 scene over Rio de Janeiro, Brazil from the Microsoft Planetary Computer, runs gridded inference across the entire tile, and finally saves the resulting predictions as a georeferenced GeoTIFF.
from torch.utils.data import DataLoader
from torchgeo.datasets import Sentinel2, stack_samples
from torchgeo.samplers import GridGeoSampler, Units
s2_dataset = Sentinel2(paths=scene_dir, bands=bands, res=10, transforms=scale)
grid_sampler = GridGeoSampler(s2_dataset, size=512, stride=448, units=Units.PIXELS)
s2_dataloader = DataLoader(
s2_dataset, sampler=grid_sampler, batch_size=16, collate_fn=stack_samples
)
The GridGeoSampler tiles the scene into overlapping 512x512 patches (stride=448, so adjacent tiles overlap by 64 pixels). Predictions are stitched back together and saved as a GeoTIFF — tiled, compressed, with overviews — that is pixel-aligned with the input scene:
import rasterio as rio
profile = {
"driver": "GTiff", "dtype": "uint8", "count": 1,
"width": img_width, "height": img_height,
"crs": crs, "transform": transform,
"compress": "deflate", "tiled": True,
"blockxsize": 512, "blockysize": 512,
}
with rio.open(output_path, "w", **profile) as dst:
dst.write(prediction, 1)
dst.build_overviews([2, 4, 8, 16], rio.enums.Resampling.nearest)
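The stitching step itself isn't shown above. One common approach, sketched here with numpy under the assumption that each tile's interior margins are cropped so that edge artifacts near tile borders are discarded (small toy sizes stand in for 512/448):

```python
import numpy as np

size, stride = 8, 6            # toy stand-ins for 512 / 448
margin = (size - stride) // 2  # pixels to crop on interior edges
scene = np.zeros((20, 20), dtype=np.uint8)

for y in range(0, scene.shape[0] - size + 1, stride):
    for x in range(0, scene.shape[1] - size + 1, stride):
        tile_pred = np.ones((size, size), dtype=np.uint8)  # stand-in prediction
        # Crop the leading margin except at the scene border, then paste
        y0 = y if y == 0 else y + margin
        x0 = x if x == 0 else x + margin
        scene[y0:y + size, x0:x + size] = tile_pred[y0 - y:, x0 - x:]
```

With these toy sizes the grid covers the full 20x20 scene with no gaps; a production version would also handle scenes whose dimensions are not a multiple of the stride.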
The result is a georeferenced water mask that you can open in QGIS, load into a GIS pipeline, or overlay on the original scene.
This step bridges the gap between “model that scores well on a test set” and “model that produces a useful geospatial product.” It also lets you explore the model’s behavior beyond aggregate metrics: How sharp are the predictions along coastlines? What’s the smallest water feature it can detect? Where does it fail?
Try it yourself
The tutorial is distributed as two executable notebooks, and all you need is a machine with a GPU (a Colab T4 works fine):
- Introduction to TorchGeo — core abstractions (dataset composition, spatiotemporal indexing, geographic samplers)
- Earth Surface Water — the end-to-end case study described in this post
For more detail on the design choices and motivation, see our ICLR 2026 ML4RS Workshop paper. The tutorial also builds on Mauricio Cordeiro’s 3-part Medium series on geospatial analysis with TorchGeo. If you have questions or want to discuss, come find us in the TorchGeo Slack.