Add precomputed text-embedding bank for voxtell_v1.1 (14,194 labels)
#2
by MoritzLangenberg - opened
- README.md +258 -63
- embeddings/voxtell_v1.1/labels.json +0 -0
- embeddings/voxtell_v1.1/text_embeddings.npz +3 -0
README.md
CHANGED
|
@@ -15,16 +15,17 @@ tags:
|
|
| 15 |
- Radiology
|
| 16 |
---
|
| 17 |
|
| 18 |
-
# VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
|
| 19 |
|
| 20 |
<div align="center">
|
| 21 |
|
| 22 |
[](https://arxiv.org/abs/2511.11450) 
|
| 23 |
[](https://github.com/MIC-DKFZ/VoxTell) 
|
| 24 |
-
[](https://huggingface.co/
|
| 25 |
[](https://github.com/gomesgustavoo/voxtell-web-plugin) 
|
| 26 |
-
[![
|
| 27 |
-
[![
|
|
|
|
| 28 |
|
| 29 |
</div>
|
| 30 |
|
|
@@ -66,22 +67,20 @@ We release multiple VoxTell versions (continuously updated) to enable both repro
|
|
| 66 |
|
| 67 |
<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellConcepts.png" alt="Concept Coverage"/>
|
| 68 |
|
| 69 |
-
|
| 70 |
|
| 71 |
-
|
| 72 |
|
| 73 |
-
|
| 74 |
-
from huggingface_hub import snapshot_download
|
| 75 |
|
| 76 |
-
|
| 77 |
-
|
| 78 |
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
local_dir=DOWNLOAD_DIR
|
| 83 |
-
)
|
| 84 |
-
```
|
| 85 |
|
| 86 |
## 🛠 Installation
|
| 87 |
|
|
@@ -98,7 +97,7 @@ conda activate voxtell
|
|
| 98 |
|
| 99 |
> [!WARNING]
|
| 100 |
> **Temporary Compatibility Warning**
|
| 101 |
-
> There is a known issue with **PyTorch 2.9.0** causing **OOM errors during inference**
|
| 102 |
> **Until this is resolved, please use PyTorch 2.8.0 or earlier.**
|
| 103 |
|
| 104 |
Install PyTorch compatible with your CUDA version. For example, for Ubuntu with a modern NVIDIA GPU:
|
|
@@ -109,13 +108,14 @@ pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorc
|
|
| 109 |
|
| 110 |
*For other configurations (macOS, CPU, different CUDA versions), please refer to the [PyTorch Get Started](https://pytorch.org/get-started/previous-versions/) page.*
|
| 111 |
|
| 112 |
-
Install
|
|
|
|
| 113 |
|
| 114 |
```bash
|
| 115 |
-
pip install
|
| 116 |
```
|
| 117 |
|
| 118 |
-
|
| 119 |
|
| 120 |
```bash
|
| 121 |
git clone https://github.com/MIC-DKFZ/VoxTell
|
|
@@ -123,7 +123,122 @@ cd VoxTell
|
|
| 123 |
pip install -e .
|
| 124 |
```
|
| 125 |
|
| 126 |
-
|
| 127 |
|
| 128 |
For more control or integration into Python workflows, use the Python API:
|
| 129 |
|
|
@@ -136,15 +251,16 @@ from nnunetv2.imageio.nibabel_reader_writer import NibabelIOWithReorient
|
|
| 136 |
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
|
| 137 |
|
| 138 |
# Load image
|
|
|
|
| 139 |
image_path = "/path/to/your/image.nii.gz"
|
| 140 |
-
img,
|
| 141 |
|
| 142 |
# Define text prompts
|
| 143 |
text_prompts = ["liver", "right kidney", "left kidney", "spleen"]
|
| 144 |
|
| 145 |
# Initialize predictor
|
| 146 |
predictor = VoxTellPredictor(
|
| 147 |
-
model_dir="/path/to/voxtell_model_directory",
|
| 148 |
device=device,
|
| 149 |
)
|
| 150 |
|
|
@@ -153,7 +269,83 @@ predictor = VoxTellPredictor(
|
|
| 153 |
voxtell_seg = predictor.predict_single_image(img, text_prompts)
|
| 154 |
```
|
| 155 |
|
| 156 |
-
####
|
| 157 |
|
| 158 |
You can visualize the segmentation results using [napari](https://napari.org/):
|
| 159 |
|
|
@@ -161,6 +353,10 @@ You can visualize the segmentation results using [napari](https://napari.org/):
|
|
| 161 |
pip install napari[all]
|
| 162 |
```
|
| 163 |
|
| 164 |
```python
|
| 165 |
import napari
|
| 166 |
import numpy as np
|
|
@@ -177,63 +373,62 @@ for i, prompt in enumerate(text_prompts):
|
|
| 177 |
napari.run()
|
| 178 |
```
|
| 179 |
|
| 180 |
-
##
|
| 181 |
-
|
| 182 |
-
- ⚠️ **Image Orientation (Critical)**: For correct anatomical localization (e.g., distinguishing left from right), images **must be in RAS orientation**. VoxTell was trained on data reoriented using [this specific reader](https://github.com/MIC-DKFZ/nnUNet/blob/86606c53ef9f556d6f024a304b52a48378453641/nnunetv2/imageio/nibabel_reader_writer.py#L101). Orientation mismatches can be a source of error. An easy way to test for this is if a simple prompt like "liver" fails and segments parts of the spleen instead. Make sure your image metadata is correct.
|
| 183 |
-
|
| 184 |
-
- **Image Spacing**: The model does not resample images to a standardized spacing for faster inference. Performance may degrade on images with very uncommon voxel spacings (e.g., super high-resolution brain MRI). In such cases, consider resampling the image to a more typical clinical spacing (e.g., 1.5×1.5×1.5 mm³) before segmentation.
|
| 185 |
|
| 186 |
-
--
|
|
|
|
|
|
|
| 187 |
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
VoxTell employs a multi-stage vision-language fusion approach:
|
| 191 |
-
|
| 192 |
-
- **Image Encoder**: Processes 3D volumetric input into latent feature representations
|
| 193 |
-
- **Prompt Encoder**: We use the frozen [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B) model to embed text prompts
|
| 194 |
-
- **Prompt Decoder**: Transforms text queries and image latents into multi-scale text features
|
| 195 |
-
- **Image Decoder**: Fuses visual and textual information at multiple resolutions using MaskFormer-style query-image fusion with deep supervision
|
| 196 |
|
| 197 |
-
|
| 198 |
|
| 199 |
-
|
| 200 |
|
| 201 |
-
|
| 202 |
|
| 203 |
-
-
|
| 204 |
-
|
| 205 |
-
-
|
| 206 |
|
| 207 |
-
|
| 208 |
|
| 209 |
-
|
| 210 |
-
- Real-time emergency medical decision-making
|
| 211 |
-
- Commercial use
|
| 212 |
|
| 213 |
-
#
|
| 214 |
|
| 215 |
-
|
| 216 |
|
| 217 |
-
|
| 218 |
|
|
|
|
| 219 |
|
| 220 |
-
|
| 221 |
|
| 222 |
-
-
|
| 223 |
-
- Prompting structures absent from the image and never seen on this modality (e.g., "liver" in a brain MRI) may lead to undesired results
|
| 224 |
-
- Text prompt quality and specificity affects segmentation accuracy
|
| 225 |
-
- Not validated for direct clinical use without expert review
|
| 226 |
|
| 227 |
## Citation
|
| 228 |
|
| 229 |
```bibtex
|
| 230 |
-
@
|
| 231 |
-
|
| 232 |
-
|
| 233 |
-
|
| 234 |
-
|
| 235 |
-
|
| 236 |
-
pages = {37538-37557}
|
| 237 |
}
|
| 238 |
```
|
| 239 |
|
|
|
|
| 15 |
- Radiology
|
| 16 |
---
|
| 17 |
|
| 18 |
+
# [CVPR2026] VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
|
| 19 |
|
| 20 |
<div align="center">
|
| 21 |
|
| 22 |
[](https://arxiv.org/abs/2511.11450) 
|
| 23 |
[](https://github.com/MIC-DKFZ/VoxTell) 
|
| 24 |
+
[](https://huggingface.co/mrokuss/VoxTell) 
|
| 25 |
[](https://github.com/gomesgustavoo/voxtell-web-plugin) 
|
| 26 |
+
[](https://github.com/CCI-Bonn/OHIF-AI) 
|
| 27 |
+
[](https://github.com/lassoan/SlicerVoxTell)
|
| 28 |
+
[](https://github.com/MIC-DKFZ/napari-voxtell) 
|
| 29 |
|
| 30 |
</div>
|
| 31 |
|
|
|
|
| 67 |
|
| 68 |
<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellConcepts.png" alt="Concept Coverage"/>
|
| 69 |
|
| 70 |
+
---
|
| 71 |
|
| 72 |
+
## Architecture
|
| 73 |
|
| 74 |
+
VoxTell combines **3D image encoding** with **text-prompt embeddings** and **multi-stage vision–language fusion**:
|
|
|
|
| 75 |
|
| 76 |
+
- **Image Encoder**: Processes 3D volumetric input into latent feature representations
|
| 77 |
+
- **Prompt Encoder**: We use the frozen [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B) model to embed text prompts
|
| 78 |
+
- **Prompt Decoder**: Transforms text queries and image latents into multi-scale text features
|
| 79 |
+
- **Image Decoder**: Fuses visual and textual information at multiple resolutions using MaskFormer-style query-image fusion with deep supervision
|
| 80 |
|
| 81 |
+
<img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellArchitecture.png" alt="Architecture Diagram"/>
|
| 82 |
+
|
| 83 |
+
---
|
| 84 |
|
| 85 |
## 🛠 Installation
|
| 86 |
|
|
|
|
| 97 |
|
| 98 |
> [!WARNING]
|
| 99 |
> **Temporary Compatibility Warning**
|
| 100 |
+
> There is a known issue with **PyTorch 2.9.0** causing **OOM errors during inference** (related to 3D convolutions — see the PyTorch issue [here](https://github.com/pytorch/pytorch/issues/166122)).
|
| 101 |
> **Until this is resolved, please use PyTorch 2.8.0 or earlier.**
|
| 102 |
|
| 103 |
Install PyTorch compatible with your CUDA version. For example, for Ubuntu with a modern NVIDIA GPU:
|
|
|
|
| 108 |
|
| 109 |
*For other configurations (macOS, CPU, different CUDA versions), please refer to the [PyTorch Get Started](https://pytorch.org/get-started/previous-versions/) page.*
|
| 110 |
|
| 111 |
+
Install the latest version directly from the repository (you can also use
|
| 112 |
+
[uv](https://docs.astral.sh/uv/)):
|
| 113 |
|
| 114 |
```bash
|
| 115 |
+
pip install git+https://github.com/MIC-DKFZ/VoxTell.git
|
| 116 |
```
|
| 117 |
|
| 118 |
+
For development, clone and install in editable mode:
|
| 119 |
|
| 120 |
```bash
|
| 121 |
git clone https://github.com/MIC-DKFZ/VoxTell
|
|
|
|
| 123 |
pip install -e .
|
| 124 |
```
|
| 125 |
|
| 126 |
+
---
|
| 127 |
+
|
| 128 |
+
## 🚀 Getting Started
|
| 129 |
+
|
| 130 |
+
👉 NEW: [Try VoxTell interactively in the napari viewer](https://github.com/MIC-DKFZ/napari-voxtell)
|
| 131 |
+
|
| 132 |
+
VoxTell downloads its default model (`voxtell_v1.1`) automatically on first use and caches it (in
|
| 133 |
+
the standard Hugging Face cache, `~/.cache/huggingface`), so the examples below work without any
|
| 134 |
+
setup.
|
| 135 |
+
|
| 136 |
+
To download a copy into a directory you control (e.g. to use a different or custom model), fetch
|
| 137 |
+
it with the Hugging Face `huggingface_hub` library:
|
| 138 |
+
|
| 139 |
+
```python
|
| 140 |
+
import os
|
| 141 |
+
from huggingface_hub import snapshot_download
|
| 142 |
+
|
| 143 |
+
MODEL_NAME = "voxtell_v1.1" # the default model
|
| 144 |
+
DOWNLOAD_DIR = "/home/user/temp" # where to put the model
|
| 145 |
+
|
| 146 |
+
local = snapshot_download("mrokuss/VoxTell", allow_patterns=f"{MODEL_NAME}/*", local_dir=DOWNLOAD_DIR)
|
| 147 |
+
model_path = os.path.join(local, MODEL_NAME) # e.g. "/home/user/temp/voxtell_v1.1"
|
| 148 |
+
```
|
| 149 |
+
|
| 150 |
+
Then point VoxTell at that directory with the `VOXTELL_MODEL` environment variable (or pass
|
| 151 |
+
`-m`/`model_dir` to override per run):
|
| 152 |
+
|
| 153 |
+
```bash
|
| 154 |
+
export VOXTELL_MODEL=/path/to/voxtell_v1.1 # a local model directory (e.g. add to ~/.bashrc)
|
| 155 |
+
```
|
| 156 |
+
|
| 157 |
+
Once a model has been downloaded it is cached, so subsequent runs work offline.
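The precedence described above (an explicit `-m`/`model_dir`, then `VOXTELL_MODEL`, then the auto-downloaded default) can be sketched as follows. This is an illustration only: `resolve_model_source` is a hypothetical helper, not part of the VoxTell package, and the real lookup may differ in detail.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL = "voxtell_v1.1"  # default model name as stated in this README


def resolve_model_source(
    model_dir: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: pick the model location by precedence."""
    env = os.environ if env is None else env
    if model_dir:                     # 1. an explicit -m / model_dir wins
        return model_dir
    if env.get("VOXTELL_MODEL"):      # 2. then the environment variable
        return env["VOXTELL_MODEL"]
    return f"hub:{DEFAULT_MODEL}"     # 3. otherwise the auto-downloaded default


print(resolve_model_source(None, {"VOXTELL_MODEL": "/models/voxtell_v1.1"}))
```

The `hub:` prefix is just a marker for this sketch to signal "download from Hugging Face"; it is not a VoxTell convention.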
|
| 158 |
+
|
| 159 |
+
### Command-Line Interface (CLI)
|
| 160 |
+
|
| 161 |
+
VoxTell provides a convenient command-line interface for running predictions:
|
| 162 |
+
|
| 163 |
+
```bash
|
| 164 |
+
voxtell-predict -i input.nii.gz -o output_folder -p "liver" "spleen" "kidney"
|
| 165 |
+
```
|
| 166 |
+
|
| 167 |
+
**Single prompt:**
|
| 168 |
+
```bash
|
| 169 |
+
voxtell-predict -i case001.nii.gz -o output_folder -p "liver"
|
| 170 |
+
# Output: output_folder/case001_liver.nii.gz
|
| 171 |
+
```
|
| 172 |
+
|
| 173 |
+
**Multiple prompts (saves individual files by default):**
|
| 174 |
+
```bash
|
| 175 |
+
voxtell-predict -i case001.nii.gz -o output_folder -p "liver" "spleen" "right kidney"
|
| 176 |
+
# Outputs:
|
| 177 |
+
# output_folder/case001_liver.nii.gz
|
| 178 |
+
# output_folder/case001_spleen.nii.gz
|
| 179 |
+
# output_folder/case001_right_kidney.nii.gz
|
| 180 |
+
```
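Judging from the outputs listed above, each file name appears to combine the input's base name with the prompt, with spaces replaced by underscores. A minimal sketch of that convention follows; `output_name` is a hypothetical helper, and the actual CLI may sanitize prompts differently.

```python
import os


def output_name(image_path: str, prompt: str) -> str:
    """Hypothetical helper mirroring the naming seen in the examples above."""
    base = os.path.basename(image_path)
    # Strip the NIfTI extension (".nii.gz" is checked first, then ".nii").
    for ext in (".nii.gz", ".nii"):
        if base.endswith(ext):
            base = base[: -len(ext)]
            break
    return f"{base}_{prompt.replace(' ', '_')}.nii.gz"


print(output_name("case001.nii.gz", "right kidney"))  # case001_right_kidney.nii.gz
```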
|
| 181 |
+
|
| 182 |
+
**Save combined multi-label file:**
|
| 183 |
+
```bash
|
| 184 |
+
voxtell-predict -i case001.nii.gz -o output_folder -p "liver" "spleen" --save-combined
|
| 185 |
+
# Output: output_folder/case001.nii.gz (multi-label: 1=liver, 2=spleen)
|
| 186 |
+
# ⚠️ WARNING: Overlapping structures will be overwritten by later prompts
|
| 187 |
+
```
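The overwrite warning is easiest to see on a toy example. The sketch below uses two small NumPy masks (made-up data, not VoxTell output) and assigns one label value per prompt, in prompt order, as the `1=liver, 2=spleen` legend above suggests.

```python
import numpy as np

# Two toy binary masks that overlap at index 2 (illustrative data only).
liver = np.array([1, 1, 1, 0, 0], dtype=np.uint8)
spleen = np.array([0, 0, 1, 1, 0], dtype=np.uint8)

combined = np.zeros_like(liver)
for label, mask in enumerate([liver, spleen], start=1):
    combined[mask > 0] = label  # later prompts overwrite earlier ones

print(combined.tolist())  # [1, 1, 2, 2, 0]
```

The voxel at index 2 belonged to both masks but ends up labelled 2. If overlaps matter, saving one file per prompt (the default) keeps every mask intact.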
|
| 188 |
+
|
| 189 |
+
#### CLI Options
|
| 190 |
+
|
| 191 |
+
| Argument | Short | Required | Description |
|
| 192 |
+
|----------|-------|----------|-------------|
|
| 193 |
+
| `--input` | `-i` | Yes | Path to input NIfTI file |
|
| 194 |
+
| `--output` | `-o` | Yes | Path to output folder |
|
| 195 |
+
| `--model` | `-m` | No | Path to a local model directory. If omitted, uses `VOXTELL_MODEL` or downloads the default model (`voxtell_v1.1`) from Hugging Face |
|
| 196 |
+
| `--prompts` | `-p` | Yes | Text prompt(s) for segmentation |
|
| 197 |
+
| `--device` | | No | Device to use: `cuda` (default) or `cpu` |
|
| 198 |
+
| `--gpu` | | No | GPU device ID (default: 0) |
|
| 199 |
+
| `--save-combined` | | No | Save multi-label file instead of individual files |
|
| 200 |
+
| `--embeddings` | | No | Use a local precomputed-embeddings file (`.npz`) instead of auto-download |
|
| 201 |
+
| `--no-precomputed` | | No | Skip the automatic precomputed-embeddings download; embed every prompt with the backbone |
|
| 202 |
+
| `--list-embeddings` | | No | List the available precomputed prompts and exit |
|
| 203 |
+
| `--no-overwrite` | | No | Skip images whose outputs already exist |
|
| 204 |
+
| `--verbose` | | No | Enable verbose output |
|
| 205 |
+
|
| 206 |
+
> `--input` is **either** a single folder (all NIfTI files in it) **or** one or more NIfTI files
|
| 207 |
+
> (absolute or relative to the current directory) — not a mix. The text prompts are embedded once
|
| 208 |
+
> and reused across all images.
|
| 209 |
+
|
| 210 |
+
#### Batch / folder / list inference (same prompts)
|
| 211 |
+
|
| 212 |
+
```bash
|
| 213 |
+
# Every NIfTI in a folder
|
| 214 |
+
voxtell-predict -i images_folder -o output_folder -p "liver" "spleen"
|
| 215 |
+
|
| 216 |
+
# An explicit list of files
|
| 217 |
+
voxtell-predict -i a.nii.gz b.nii.gz c.nii.gz -o out -p "liver"
|
| 218 |
+
```
|
| 219 |
+
|
| 220 |
+
#### Different prompts per image
|
| 221 |
+
|
| 222 |
+
Use `--jobs` to bind each image to its own prompts (images come from the file, so `-i` is not used).
|
| 223 |
+
The *union* of all prompts across the jobs is embedded only once.
|
| 224 |
+
|
| 225 |
+
```bash
|
| 226 |
+
voxtell-predict --jobs jobs.json -o out
|
| 227 |
+
```
|
| 228 |
+
|
| 229 |
+
```json
|
| 230 |
+
// jobs.json
|
| 231 |
+
[
|
| 232 |
+
{"image": "a.nii.gz", "prompts": ["liver", "spleen"]},
|
| 233 |
+
{"image": "b.nii.gz", "prompts": ["tumor"]}
|
| 234 |
+
]
|
| 235 |
+
```
|
| 236 |
+
|
| 237 |
+
(For the same prompts on every image, use `-p` with `-i` instead.)
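Because the jobs file is plain JSON, it can also be generated from Python. Note that a strict JSON parser rejects comments, so a line such as `// jobs.json` is best left out of the actual file. A minimal sketch, assuming the `image`/`prompts` keys shown above; `write_jobs` is a hypothetical helper, not part of VoxTell.

```python
import json
import os
import tempfile


def write_jobs(path: str, jobs: list) -> None:
    """Hypothetical helper: sanity-check the jobs and write them as JSON."""
    for job in jobs:
        if not isinstance(job.get("image"), str) or not job.get("prompts"):
            raise ValueError(f"each job needs an 'image' and non-empty 'prompts': {job}")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(jobs, fh, indent=2)


jobs = [
    {"image": "a.nii.gz", "prompts": ["liver", "spleen"]},
    {"image": "b.nii.gz", "prompts": ["tumor"]},
]
jobs_path = os.path.join(tempfile.mkdtemp(), "jobs.json")
write_jobs(jobs_path, jobs)

# The union of prompts across all jobs, in first-seen order.
unique_prompts = list(dict.fromkeys(p for job in jobs for p in job["prompts"]))
print(unique_prompts)  # ['liver', 'spleen', 'tumor']
```

The resulting file can then be passed to `voxtell-predict --jobs`.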
|
| 238 |
+
|
| 239 |
+
---
|
| 240 |
+
|
| 241 |
+
### Python API
|
| 242 |
|
| 243 |
For more control or integration into Python workflows, use the Python API:
|
| 244 |
|
|
|
|
| 251 |
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
|
| 252 |
|
| 253 |
# Load image
|
| 254 |
+
# Keep `props`: it stores the original affine/orientation and is required to save the masks correctly.
|
| 255 |
image_path = "/path/to/your/image.nii.gz"
|
| 256 |
+
img, props = NibabelIOWithReorient().read_images([image_path])
|
| 257 |
|
| 258 |
# Define text prompts
|
| 259 |
text_prompts = ["liver", "right kidney", "left kidney", "spleen"]
|
| 260 |
|
| 261 |
# Initialize predictor
|
| 262 |
predictor = VoxTellPredictor(
|
| 263 |
+
model_dir="/path/to/voxtell_model_directory", # optional; omit to use $VOXTELL_MODEL or auto-download
|
| 264 |
device=device,
|
| 265 |
)
|
| 266 |
|
|
|
|
| 269 |
voxtell_seg = predictor.predict_single_image(img, text_prompts)
|
| 270 |
```
|
| 271 |
|
| 272 |
+
#### Optional: Save Results
|
| 273 |
+
|
| 274 |
+
Save the masks through the same reader:
|
| 275 |
+
|
| 276 |
+
```python
|
| 277 |
+
import os
|
| 278 |
+
import numpy as np
|
| 279 |
+
|
| 280 |
+
output_folder = "/path/to/output"
|
| 281 |
+
os.makedirs(output_folder, exist_ok=True)
|
| 282 |
+
writer = NibabelIOWithReorient()
|
| 283 |
+
|
| 284 |
+
# Option A - one 3D mask per prompt
|
| 285 |
+
for prompt, seg in zip(text_prompts, voxtell_seg):
|
| 286 |
+
    out_path = os.path.join(output_folder, f"{prompt.replace(' ', '_')}.nii.gz")
|
| 287 |
+
    writer.write_seg(seg, out_path, props)
|
| 288 |
+
|
| 289 |
+
# Option B - a single multi-label 3D file, where each prompt gets its own label
|
| 290 |
+
# value (1, 2, 3, ...). Overlapping structures are overwritten by later prompts.
|
| 291 |
+
combined = np.zeros_like(voxtell_seg[0], dtype=np.uint8)
|
| 292 |
+
for i, seg in enumerate(voxtell_seg):
|
| 293 |
+
    combined[seg > 0] = i + 1 # label 1=first prompt, 2=second, ...
|
| 294 |
+
writer.write_seg(combined, os.path.join(output_folder, "combined.nii.gz"), props)
|
| 295 |
+
# Label legend: {i + 1: prompt for i, prompt in enumerate(text_prompts)}
|
| 296 |
+
```
|
| 297 |
+
|
| 298 |
+
For many images, the `voxtell-predict` CLI and `predictor.predict_from_files` /
|
| 299 |
+
`predict_from_jobs` (below) handle this saving for you.
|
| 300 |
+
|
| 301 |
+
#### Efficient batch / folder inference
|
| 302 |
+
|
| 303 |
+
To segment many images with the same prompts, use `predict_from_files`. The text prompts are
|
| 304 |
+
embedded **once** and reused across every image (a folder, a single file, or a list of files):
|
| 305 |
+
|
| 306 |
+
```python
|
| 307 |
+
predictor = VoxTellPredictor(device=device) # model auto-downloads (or set $VOXTELL_MODEL)
|
| 308 |
+
|
| 309 |
+
written = predictor.predict_from_files(
|
| 310 |
+
inputs="/path/to/images_folder", # folder, file, or list of files
|
| 311 |
+
output_folder="/path/to/output",
|
| 312 |
+
text_prompts=["liver", "spleen"],
|
| 313 |
+
save_combined=False, # one file per prompt (default)
|
| 314 |
+
)
|
| 315 |
+
```
|
| 316 |
+
|
| 317 |
+
For **different prompts per image**, use `predict_from_jobs` (the union of all prompts is embedded
|
| 318 |
+
once):
|
| 319 |
+
|
| 320 |
+
```python
|
| 321 |
+
predictor.predict_from_jobs(
|
| 322 |
+
jobs=[
|
| 323 |
+
{"image": "a.nii.gz", "prompts": ["liver", "spleen"]},
|
| 324 |
+
{"image": "b.nii.gz", "prompts": ["tumor"]},
|
| 325 |
+
],
|
| 326 |
+
output_folder="/path/to/output",
|
| 327 |
+
)
|
| 328 |
+
```
|
| 329 |
+
|
| 330 |
+
You can also embed prompts yourself and feed the embeddings into `predict_single_image` to reuse
|
| 331 |
+
them across custom loops:
|
| 332 |
+
|
| 333 |
+
```python
|
| 334 |
+
embeddings = predictor.embed_text_prompts(["liver", "spleen"])
|
| 335 |
+
seg = predictor.predict_single_image(img, text_embeddings=embeddings)
|
| 336 |
+
```
|
| 337 |
+
|
| 338 |
+
#### Precomputed text embeddings
|
| 339 |
+
|
| 340 |
+
Common prompts are precomputed and downloaded automatically from Hugging Face, skipping the Qwen3
|
| 341 |
+
backbone; any prompt not in the bank is embedded on the fly. To override:
|
| 342 |
+
|
| 343 |
+
```python
|
| 344 |
+
VoxTellPredictor(embedding_bank="/path/to/embeddings.npz") # explicit local file
|
| 345 |
+
VoxTellPredictor(use_precomputed_embeddings=False) # always use the backbone
|
| 346 |
+
```
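To see what such a file contains before passing it to `embedding_bank`, NumPy can list the stored arrays. The sketch below builds a tiny stand-in file so that it runs anywhere. The array names `labels` and `embeddings` and the 4-dimensional vectors are made up for the demo; the layout of the released `text_embeddings.npz` is not documented here, so inspect `.files` rather than assuming key names.

```python
import os
import tempfile

import numpy as np

# Build a tiny stand-in bank (hypothetical layout, demo data only).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_embeddings.npz")
np.savez(
    demo_path,
    labels=np.array(["liver", "spleen"]),
    embeddings=np.zeros((2, 4), dtype=np.float32),
)


def describe_npz(path: str) -> dict:
    """Return {array name: (shape, dtype)} for every array in an .npz file."""
    with np.load(path) as bank:
        return {name: (bank[name].shape, str(bank[name].dtype)) for name in bank.files}


summary = describe_npz(demo_path)
for name, (shape, dtype) in summary.items():
    print(name, shape, dtype)
```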
|
| 347 |
+
|
| 348 |
+
#### Optional: Visualize Results
|
| 349 |
|
| 350 |
You can visualize the segmentation results using [napari](https://napari.org/):
|
| 351 |
|
|
|
|
| 353 |
pip install napari[all]
|
| 354 |
```
|
| 355 |
|
| 356 |
+
> 💡 **Tip**
|
| 357 |
+
> If you work in napari already, the [napari-voxtell plugin](https://github.com/MIC-DKFZ/napari-voxtell) offers the fastest way to explore VoxTell results interactively.
|
| 358 |
+
|
| 359 |
+
|
| 360 |
```python
|
| 361 |
import napari
|
| 362 |
import numpy as np
|
|
|
|
| 373 |
napari.run()
|
| 374 |
```
|
| 375 |
|
| 376 |
+
## 🎯 Fine-Tuning
|
| 377 |
|
| 378 |
+
Transfer VoxTell's pretrained image **encoder** into nnU-Net and fine-tune it for
|
| 379 |
+
multi-class segmentation. The image encoder is transferred and the image decoder
|
| 380 |
+
is trained from scratch.
|
| 381 |
|
| 382 |
+
**1. Preprocess your dataset** with standard nnU-Net:
|
| 383 |
|
| 384 |
+
```bash
|
| 385 |
+
export nnUNet_raw=/path/to/nnUNet_raw
|
| 386 |
+
export nnUNet_preprocessed=/path/to/nnUNet_preprocessed
|
| 387 |
+
export nnUNet_results=/path/to/nnUNet_results
|
| 388 |
+
nnUNetv2_plan_and_preprocess -d DATASET_ID --verify_dataset_integrity
|
| 389 |
+
```
|
| 390 |
|
| 391 |
+
**2. Fine-tune** (positional `dataset configuration fold`):
|
| 392 |
|
| 393 |
+
```bash
|
| 394 |
+
voxtell-finetune DATASET_ID 3d_fullres 0 \
|
| 395 |
+
-pretrained_weights /path/to/voxtell_model/fold_0/checkpoint_final.pth
|
| 396 |
+
```
|
| 397 |
|
| 398 |
+
Use `-tr VoxTellTrainer_noMirroring` for datasets whose labels distinguish left/right. The
|
| 399 |
+
CLI mirrors `nnUNetv2_train` (`--c` to resume, `--val` to validate, etc.); see the
|
| 400 |
+
[nnU-Net repository](https://github.com/MIC-DKFZ/nnUNet) for the full argument reference.
|
| 401 |
|
| 402 |
+
---
|
| 403 |
|
| 404 |
+
## Important: Image Orientation and Spacing
|
|
|
|
|
|
|
| 405 |
|
| 406 |
+
- ⚠️ **Image Orientation (Critical)**: For correct anatomical localization (e.g., distinguishing left from right), images **must be in RAS orientation**. VoxTell was trained on data reoriented using [this specific reader](https://github.com/MIC-DKFZ/nnUNet/blob/86606c53ef9f556d6f024a304b52a48378453641/nnunetv2/imageio/nibabel_reader_writer.py#L101). Orientation mismatches can be a source of error. A quick check: if a simple prompt like "liver" fails and segments parts of the spleen instead, the orientation is probably wrong. Make sure your image metadata is correct.
|
| 407 |
|
| 408 |
+
- **Image Spacing**: To keep inference fast, the model does not resample images to a standardized spacing. Performance may degrade on images with very uncommon voxel spacings (e.g., very high-resolution brain MRI). In such cases, consider resampling the image to a more typical clinical spacing (e.g., 1.5×1.5×1.5 mm³) before segmentation.
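As a rough illustration of the spacing advice, the sketch below reads the voxel spacing off a NIfTI-style affine (the column norms of its 3×3 part) and resamples a volume to 1.5 mm with nearest-neighbour indexing. It uses only NumPy and toy data. `voxel_spacing` and `resample_nearest` are hypothetical helpers, and for real images an interpolating resampler from an imaging library is likely the better choice. The affine would also need to be updated to the new spacing before saving.

```python
import numpy as np


def voxel_spacing(affine: np.ndarray) -> np.ndarray:
    """Voxel size along each image axis: column norms of the 3x3 part."""
    return np.linalg.norm(affine[:3, :3], axis=0)


def resample_nearest(volume, spacing, new_spacing=(1.5, 1.5, 1.5)):
    """Nearest-neighbour resampling of a 3D array to a new voxel spacing."""
    spacing = np.asarray(spacing, dtype=float)
    new_spacing = np.asarray(new_spacing, dtype=float)
    new_shape = np.maximum(
        1, np.round(np.array(volume.shape) * spacing / new_spacing)
    ).astype(int)
    # For each output voxel centre, pick the source voxel that contains it.
    index = [
        np.minimum(((np.arange(n) + 0.5) * ns / s).astype(int), dim - 1)
        for n, ns, s, dim in zip(new_shape, new_spacing, spacing, volume.shape)
    ]
    return volume[np.ix_(*index)]


affine = np.diag([0.75, 0.75, 0.75, 1.0])       # toy affine: 0.75 mm isotropic
volume = np.arange(4 * 4 * 4).reshape(4, 4, 4)  # toy volume
resampled = resample_nearest(volume, voxel_spacing(affine))
print(voxel_spacing(affine), resampled.shape)
```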
|
| 409 |
|
| 410 |
+
---
|
| 411 |
|
| 412 |
+
## 🗺️ Roadmap
|
| 413 |
|
| 414 |
+
- [x] **Paper Published**: [arXiv:2511.11450](https://arxiv.org/abs/2511.11450)
|
| 415 |
+
- [x] **Code Release**: Official implementation published
|
| 416 |
+
- [x] **PyPI Package**: Package downloadable via pip
|
| 417 |
+
- [x] **Model Release**: Public availability of pretrained weights
|
| 418 |
+
- [x] **Napari Plugin**: Integration into the napari viewer as a [plugin](https://github.com/MIC-DKFZ/napari-voxtell)
|
| 419 |
+
- [x] **Fine-Tuning**: Support and scripts for custom fine-tuning
|
| 420 |
|
| 421 |
+
---
|
| 422 |
|
| 423 |
## Citation
|
| 424 |
|
| 425 |
```bibtex
|
| 426 |
+
@inproceedings{rokuss2026voxtell,
|
| 427 |
+
title={Voxtell: Free-text promptable universal 3d medical image segmentation},
|
| 428 |
+
author={Rokuss, Maximilian and Langenberg, Moritz and Kirchhoff, Yannick and Isensee, Fabian and Hamm, Benjamin and Ulrich, Constantin and Regnery, Sebastian and Bauer, Lukas and Katsigiannopulos, Efthimios and Norajitra, Tobias and Maier-Hein, Klaus},
|
| 429 |
+
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
|
| 430 |
+
pages={37538--37557},
|
| 431 |
+
year={2026}
|
|
|
|
| 432 |
}
|
| 433 |
```
|
| 434 |
|
embeddings/voxtell_v1.1/labels.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
embeddings/voxtell_v1.1/text_embeddings.npz
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:65bdee9bcd4eb58d909d7b38701bf1c65e1dc6e4af6733ae00e152b21997c25a
|
| 3 |
+
size 67306013
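The three lines above are a Git LFS pointer: the actual file is identified by its SHA-256 digest and its size in bytes. After downloading `text_embeddings.npz`, both can be checked locally. A minimal sketch follows; `matches_lfs_pointer` is a hypothetical helper, and the demo runs on a small temporary file rather than the real download.

```python
import hashlib
import os
import tempfile


def matches_lfs_pointer(path: str, oid_sha256: str, size: int) -> bool:
    """Check a local file against the oid/size recorded in an LFS pointer."""
    if os.path.getsize(path) != size:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):  # read 1 MiB at a time
            digest.update(chunk)
    return digest.hexdigest() == oid_sha256


# Demo on a small temporary file.
demo = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo, "wb") as fh:
    fh.write(b"voxtell")
expected = hashlib.sha256(b"voxtell").hexdigest()
print(matches_lfs_pointer(demo, expected, 7))  # True
```

For the real file, pass the `oid sha256:` value and `size` from the pointer above.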
|