Paper | Examples | Model Code | ESA Blog | IBM Blog
TerraMind 1.0 large
TerraMind is the first multimodal any-to-any generative foundation model for Earth Observation jointly developed by IBM, ESA, and Forschungszentrum Jülich.
Architecture
TerraMind uses a dual-scale transformer-based encoder-decoder architecture, simultaneously processing pixel-level and token-level data. The model was pre-trained on 500B tokens from 9M spatiotemporally aligned multimodal samples from the TerraMesh dataset.
Modality-specific patch embeddings allow direct processing of raw inputs, while modality-specific FSQ-VAEs are used for image tokenization. For sequence-like modalities such as coordinates, an adapted WordPiece tokenizer is employed. During pre-training, TerraMind leverages masked token reconstruction, learning complex cross-modal correlations to generate high-quality latent representations.
Evaluation
We benchmarked TerraMind against other geospatial foundation models using the PANGAEA benchmark. TerraMind consistently achieved state-of-the-art performance, surpassing existing models in various downstream tasks such as land use segmentation, water body mapping, and vegetation assessments. The evaluation highlights its effectiveness in handling diverse Earth Observation scenarios. We present additional experiments in our pre-print.
Usage
TerraMind is fully integrated into the fine-tuning package TerraTorch. This makes it easy to initialize the pre-trained model or fine-tune it via PyTorch Lightning. The weights are automatically downloaded from Hugging Face.
Fine-tuning
You can fine-tune TerraMind with a config using TerraTorch:
terratorch fit -c terramind_config.yaml
For testing the fine-tuned TerraMind model, run:
terratorch test -c terramind_config.yaml --ckpt_path path/to/your/checkpoint.ckpt
We provide config examples and notebooks with step-by-step explanations at https://github.com/IBM/terramind.
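For illustration, a rough Python equivalent of such a fine-tuning setup is sketched below. The Lightning task and model factory come from TerraTorch, but the decoder choice, channels, loss, and data module are assumptions, not the official recipe; please use the linked config examples for tested values.

```python
# Sketch only: fine-tuning the TerraMind backbone for semantic segmentation
# with TerraTorch's Lightning task. Decoder, channels, loss, and data module
# are placeholders.
import lightning.pytorch as pl
from terratorch.tasks import SemanticSegmentationTask

task = SemanticSegmentationTask(
    model_factory="EncoderDecoderFactory",
    model_args={
        "backbone": "terramind_v1_large",
        "backbone_pretrained": True,
        "backbone_modalities": ["S2L2A"],
        "decoder": "UNetDecoder",                 # assumed decoder choice
        "decoder_channels": [512, 256, 128, 64],  # assumed channel sizes
        "num_classes": 2,                         # assumed number of classes
    },
    loss="dice",
)

trainer = pl.Trainer(max_epochs=20)
# trainer.fit(task, datamodule=your_datamodule)  # supply a Lightning data module
```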
Backbone
Alternatively, you can build the backbone with the following code and use it in your custom pipeline.
from terratorch import BACKBONE_REGISTRY
model = BACKBONE_REGISTRY.build(
'terramind_v1_large',
pretrained=True,
modalities=['S2L2A', 'S1GRD']
)
The model supports the following raw inputs, which you can specify in `modalities`: `S2L2A`, `S2L1C`, `S1GRD`, `S1RTC`, `DEM`, and `RGB`.
If your data does not use all bands of a modality, you can specify a subset with `bands={'S2L2A': ['BLUE', 'GREEN', 'RED', 'NIR_NARROW', 'SWIR_1', 'SWIR_2']}`.
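For example, a minimal build call combining both options (a sketch; the band subset matches the snippet above):

```python
model = BACKBONE_REGISTRY.build(
    'terramind_v1_large',
    pretrained=True,
    modalities=['S2L2A', 'S1GRD'],
    # Only use six Sentinel-2 bands instead of all twelve.
    bands={'S2L2A': ['BLUE', 'GREEN', 'RED', 'NIR_NARROW', 'SWIR_1', 'SWIR_2']},
)
```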
You can pass the inputs to the model as a dict. If a tensor is passed directly, the model assumes it belongs to the first defined modality. TerraMind can also handle missing input modalities.
output = model(
{
'S2L2A': s2l2a_tensor, # B, 12, 224, 224
'S1GRD': s1grd_tensor, # B, 2, 224, 224
}
)
output.shape # B, 196, 768
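Consistent with the behavior described above, the following calls should also work (the tensor names are placeholders):

```python
# A single tensor is interpreted as the first modality defined at build time,
# here 'S2L2A'.
output = model(s2l2a_tensor)             # B, 196, 768

# Missing modalities are handled, e.g., passing only the S-1 GRD input.
output = model({'S1GRD': s1grd_tensor})  # B, 196, 768
```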
The model outputs patch embeddings for each input modality. By default, the patch embeddings are averaged over all modalities to reduce the output size.
You can specify another `merge_method` from `'mean'`, `'max'`, `'concat'`, `'dict'`, and `None`:
- `mean` and `max` are applied per patch over all image modality embeddings.
- `concat` stacks all image modalities along the embedding dimension and returns one embedding per patch.
- `dict` returns all tokens split by modality in a dictionary.
- `None` returns the tokens without further processing.
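For example, to keep the embeddings separated by modality, you could build the backbone with `merge_method='dict'` (a sketch, assuming `merge_method` is passed to `build` like the other keyword arguments):

```python
model = BACKBONE_REGISTRY.build(
    'terramind_v1_large',
    pretrained=True,
    modalities=['S2L2A', 'S1GRD'],
    merge_method='dict',  # return patch embeddings per modality instead of the mean
)

output = model({'S2L2A': s2l2a_tensor, 'S1GRD': s1grd_tensor})
output['S2L2A'].shape  # B, 196, 768
output['S1GRD'].shape  # B, 196, 768
```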
Thinking in Modalities
TerraMind introduces a new Thinking-in-Modalities (TiM) approach, where other modalities are predicted as an intermediate step. The fine-tuned encoder then uses both the raw inputs and the generated modalities.
Use TiM models in TerraTorch by adding `_tim` to the model name:
from terratorch import BACKBONE_REGISTRY
model = BACKBONE_REGISTRY.build(
'terramind_v1_large_tim',
pretrained=True,
modalities=['S2L2A', 'S1GRD'],
tim_modalities=['LULC'] # optional, defaults to LULC (land-use land-cover)
)
If you use TiM models, we recommend using the pre-training statistics for standardization.
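A TiM backbone is used like the standard backbone; the sketch below assumes the same forward interface, with the LULC "thinking" step happening internally:

```python
# Forward pass as with the standard backbone (assumed identical interface).
# The model first predicts LULC tokens as an intermediate step and then
# encodes the raw inputs together with the generated modality.
output = model({'S2L2A': s2l2a_tensor, 'S1GRD': s1grd_tensor})
```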
Generations
TerraMind can perform any-to-any generation based on varying combinations of inputs.
Build the full TerraMind model (including the de-tokenizer steps) from the `FULL_MODEL_REGISTRY`:
from terratorch import FULL_MODEL_REGISTRY
model = FULL_MODEL_REGISTRY.build(
'terramind_v1_large_generate',
pretrained=False,
modalities=['S2L2A'],
output_modalities=['S1GRD', 'LULC'],
timesteps=10, # Define diffusion steps
standardize=True, # Apply standardization
)
As with the backbone, pass multiple modalities as a dict or a single modality as a tensor; the model returns the generated `output_modalities` as a dict of tensors.
Note: These generations are not reconstructions but "mental images" representing how the model imagines the modality.
You can control the generation via the number of diffusion steps (`timesteps`), which can be passed to the constructor or the forward function.
By passing `standardize=True`, the pre-training standardization values are automatically applied to the input and output.
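Putting it together, a generation call might look like this (a sketch based on the interface described above; the input tensor is a placeholder):

```python
# Generate S-1 GRD and LULC "mental images" from a Sentinel-2 L2A input.
generated = model({'S2L2A': s2l2a_tensor})  # input: B, 12, 224, 224

generated['S1GRD']  # generated S-1 GRD tensor
generated['LULC']   # generated land-use / land-cover map

# The number of diffusion steps can also be overridden per call (assumed kwarg):
# generated = model({'S2L2A': s2l2a_tensor}, timesteps=50)
```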
We provide an example notebook for generations at https://github.com/IBM/terramind.
Feedback
Your feedback is invaluable to us. Please share it with us by starting a discussion in this HF repository or submitting an issue to TerraMind on GitHub.
Citation
If you use TerraMind in your research, please cite the TerraMind pre-print.
@article{jakubik2025terramind,
title={TerraMind: Large-Scale Generative Multimodality for Earth Observation},
author={Jakubik, Johannes and Yang, Felix and Blumenstiel, Benedikt and Scheurer, Erik and Sedona, Rocco and Maurogiovanni, Stefano and Bosmans, Jente and Dionelis, Nikolaos and Marsocci, Valerio and Kopp, Niklas and others},
journal={arXiv preprint arXiv:2504.11171},
year={2025}
}