Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping
Paper • [arXiv:2505.13777](https://arxiv.org/abs/2505.13777)
Trained checkpoints and backbone weights for Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping, accepted at EarthVision 2026 (IEEE/ISPRS Workshop on Large Scale Computer Vision for Remote Sensing).
| Path | Description |
|---|---|
| `sat2sound/bingmap_nometa.ckpt` | GeoSound-Bing, no metadata |
| `sat2sound/bingmap_withmeta.ckpt` | GeoSound-Bing, with metadata |
| `sat2sound/sentinel_nometa.ckpt` | GeoSound-Sentinel, no metadata |
| `sat2sound/sentinel_withmeta.ckpt` | GeoSound-Sentinel, with metadata |
| `sat2sound/SoundingEarth_nometa.ckpt` | SoundingEarth, no metadata |
| `sat2sound/SoundingEarth_withmeta.ckpt` | SoundingEarth, with metadata |
| `sat2text/bingmap_i2t_baseline.ckpt` | Sat2Text image-text baseline |
| `backbones/pretrain-vit-base-e199.pth` | SatMAE ViT-Base backbone |
| `backbones/mga-clap.pt` | MGA-CLAP audio encoder backbone |
| `demo/GeoSound_gallery_w_bingmap.h5` | Retrieval demo gallery (9,931 samples) |
| `ckpt_cfg.json` | Experiment name → checkpoint path mapping (see the snippet below) |
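For orientation, `ckpt_cfg.json` can also be read directly; a minimal sketch, assuming it is a flat experiment-name → path mapping (the exact layout may differ):

```python
import json

# Assumption: ckpt_cfg.json is a flat {"experiment name": "checkpoint path"} dict.
with open("ckpt_cfg.json") as f:
    ckpt_cfg = json.load(f)

print(ckpt_cfg["bingmap_withmeta"])  # expected: "sat2sound/bingmap_withmeta.ckpt"
```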
Checkpoints and backbones are resolved automatically by the codebase via `src/hub.py:resolve_hf_ckpt`; no manual download is needed.
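Should you want a single file by hand (e.g. to inspect a checkpoint offline), a minimal sketch with `huggingface_hub`; the repo id below is a placeholder for wherever this card is hosted:

```python
from huggingface_hub import hf_hub_download
import torch

# "your-org/sat2sound" is a placeholder repo id, not the real one.
ckpt_path = hf_hub_download(
    repo_id="your-org/sat2sound",
    filename="sat2sound/bingmap_withmeta.ckpt",
)
state = torch.load(ckpt_path, map_location="cpu")  # inspect keys / shapes
```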
Clone the code repo, install the environment, then:
```python
import torch
import torchaudio

from src.engine import l2normalize
from utilities.utils import (
    load_sat2sound, encode_text, encode_gps_time, load_audio_mel, prepare_batch,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
B = 4

model, tokenizer = load_sat2sound("bingmap_withmeta", device)

# Audio: swap the next two lines to use a real recording instead of white noise.
torchaudio.save("/tmp/demo.wav", torch.randn(1, 320_000), sample_rate=32_000)
mel = load_audio_mel("/tmp/demo.wav", device)  # (1, 1001, 64)
mel = mel.repeat(B, 1, 1)  # tile the single clip across the batch -> (B, 1001, 64)

latlong, time_enc, month_enc = encode_gps_time(37.77, -122.42, hour=13, month=5, B=B, device=device)

batch = prepare_batch(
    sat=torch.randn(B, 3, 224, 224, device=device),  # ImageNet-normalised satellite tiles
    audio_mel=mel,
    audio_caption=encode_text(["Traffic noise and distant birds."] * B, tokenizer, device),
    image_caption=encode_text(["An urban intersection with dense buildings."] * B, tokenizer, device),
    latlong=latlong, time_enc=time_enc, month_enc=month_enc,
)

with torch.no_grad():
    embeds = model.get_embeds(batch)

sat_emb = l2normalize(embeds["sat_embeds_dict"]["ctotal"])  # (B, 1024)
audio_emb = l2normalize(embeds["audio_embeds"])             # (B, 1024)
text_emb = l2normalize(embeds["fdt_txt_embeds"])            # (B, 1024)

print(sat_emb @ audio_emb.T)  # (B, B) satellite ↔ audio cosine similarity
```
For `*_nometa` checkpoints, omit `latlong`, `time_enc`, and `month_enc` (they default to `None`).
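The demo gallery above can power a simple audio → satellite retrieval loop, reusing `audio_emb` and `device` from the snippet. A sketch, assuming the HDF5 file exposes a flat embedding array under a `sat_embeds` key (inspect `f.keys()` for the real layout):

```python
import h5py
import torch
import torch.nn.functional as F

with h5py.File("demo/GeoSound_gallery_w_bingmap.h5", "r") as f:
    gallery = torch.from_numpy(f["sat_embeds"][:]).to(device)  # (9931, 1024); key name is assumed

gallery = F.normalize(gallery, dim=-1)     # harmless if already L2-normalised
scores = audio_emb @ gallery.T             # (B, 9931) cosine similarities
top5 = scores.topk(5, dim=-1).indices      # indices of the best-matching tiles per clip
```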
If you use these checkpoints, please cite:

```bibtex
@inproceedings{khanal2026sat2sound,
  title     = {{Sat2Sound}: A Unified Framework for Zero-Shot Soundscape Mapping},
  author    = {Khanal, Subash and Sastry, Srikumar and Dhakal, Aayush and
               Ahmad, Adeel and Stylianou, Abby and Jacobs, Nathan},
  booktitle = {IEEE/ISPRS Workshop on Large Scale Computer Vision for
               Remote Sensing (EarthVision)},
  year      = {2026},
}
```