Add precomputed text-embedding bank for voxtell_v1.1 (14,194 labels)

#2
README.md CHANGED
@@ -15,16 +15,17 @@ tags:
15
  - Radiology
16
  ---
17
 
18
- # VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
19
 
20
  <div align="center">
21
 
22
  [![arXiv](https://img.shields.io/badge/arXiv-2511.11450-B31B1B.svg)](https://arxiv.org/abs/2511.11450)&#160;
23
  [![GitHub](https://img.shields.io/badge/GitHub-VoxTell-181717?logo=github&logoColor=white)](https://github.com/MIC-DKFZ/VoxTell)&#160;
24
- [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Model-VoxTell-yellow)](https://huggingface.co/MIC-DKFZ/VoxTell)&#160;
25
  [![web tool](https://img.shields.io/badge/web-tool-4CAF50)](https://github.com/gomesgustavoo/voxtell-web-plugin)&#160;
26
- [![3D Slicer](https://img.shields.io/badge/3D%20Slicer-plugin-1f65b0)](https://github.com/lassoan/SlicerVoxTell)&#160;
27
- [![napari](https://img.shields.io/badge/napari-plugin-80d1ff)](https://github.com/MIC-DKFZ/napari-voxtell)&#160;
 
28
 
29
  </div>
30
 
@@ -66,22 +67,20 @@ We release multiple VoxTell versions (continuously updated) to enable both repro
66
 
67
  <img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellConcepts.png" alt="Concept Coverage"/>
68
 
69
- ## How to Download
70
 
71
- You can download VoxTell checkpoints using the Hugging Face `huggingface_hub` library:
72
 
73
- ```
74
- from huggingface_hub import snapshot_download
75
 
76
- MODEL_NAME = "voxtell_v1.1" # Updated models may be available in the future
77
- DOWNLOAD_DIR = "/home/user/temp" # Optionally specify the download directory
 
 
78
 
79
- download_path = snapshot_download(
80
- repo_id="mrokuss/VoxTell",
81
- allow_patterns=[f"{MODEL_NAME}/*", "*.json"],
82
- local_dir=DOWNLOAD_DIR
83
- )
84
- ```
85
 
86
  ## 🛠 Installation
87
 
@@ -98,7 +97,7 @@ conda activate voxtell
98
 
99
  > [!WARNING]
100
  > **Temporary Compatibility Warning**
101
- > There is a known issue with **PyTorch 2.9.0** causing **OOM errors during inference** in `VoxTell` (related to 3D convolutions — see the PyTorch issue [here](https://github.com/pytorch/pytorch/issues/166122)).
102
  > **Until this is resolved, please use PyTorch 2.8.0 or earlier.**
103
 
104
  Install PyTorch compatible with your CUDA version. For example, for Ubuntu with a modern NVIDIA GPU:
@@ -109,13 +108,14 @@ pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorc
109
 
110
  *For other configurations (macOS, CPU, different CUDA versions), please refer to the [PyTorch Get Started](https://pytorch.org/get-started/previous-versions/) page.*
111
 
112
- Install via pip (you can also use [uv](https://docs.astral.sh/uv/)):
 
113
 
114
  ```bash
115
- pip install voxtell
116
  ```
117
 
118
- or install directly from the GitHub repository:
119
 
120
  ```bash
121
  git clone https://github.com/MIC-DKFZ/VoxTell
@@ -123,7 +123,122 @@ cd VoxTell
123
  pip install -e .
124
  ```
125
 
126
- ### 3. Python API
127
 
128
  For more control or integration into Python workflows, use the Python API:
129
 
@@ -136,15 +251,16 @@ from nnunetv2.imageio.nibabel_reader_writer import NibabelIOWithReorient
136
  device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
137
 
138
  # Load image
 
139
  image_path = "/path/to/your/image.nii.gz"
140
- img, _ = NibabelIOWithReorient().read_images([image_path])
141
 
142
  # Define text prompts
143
  text_prompts = ["liver", "right kidney", "left kidney", "spleen"]
144
 
145
  # Initialize predictor
146
  predictor = VoxTellPredictor(
147
- model_dir="/path/to/voxtell_model_directory",
148
  device=device,
149
  )
150
 
@@ -153,7 +269,83 @@ predictor = VoxTellPredictor(
153
  voxtell_seg = predictor.predict_single_image(img, text_prompts)
154
  ```
155
 
156
- #### 4. Optional: Visualize Results
157
 
158
  You can visualize the segmentation results using [napari](https://napari.org/):
159
 
@@ -161,6 +353,10 @@ You can visualize the segmentation results using [napari](https://napari.org/):
161
  pip install napari[all]
162
  ```
163
 
 
 
 
 
164
  ```python
165
  import napari
166
  import numpy as np
@@ -177,63 +373,62 @@ for i, prompt in enumerate(text_prompts):
177
  napari.run()
178
  ```
179
 
180
- ## Important: Image Orientation and Spacing
181
-
182
- - ⚠️ **Image Orientation (Critical)**: For correct anatomical localization (e.g., distinguishing left from right), images **must be in RAS orientation**. VoxTell was trained on data reoriented using [this specific reader](https://github.com/MIC-DKFZ/nnUNet/blob/86606c53ef9f556d6f024a304b52a48378453641/nnunetv2/imageio/nibabel_reader_writer.py#L101). Orientation mismatches can be a source of error. An easy way to test for this is if a simple prompt like "liver" fails and segments parts of the spleen instead. Make sure your image metadata is correct.
183
-
184
- - **Image Spacing**: The model does not resample images to a standardized spacing for faster inference. Performance may degrade on images with very uncommon voxel spacings (e.g., super high-resolution brain MRI). In such cases, consider resampling the image to a more typical clinical spacing (e.g., 1.5×1.5×1.5 mm³) before segmentation.
185
 
186
- ---
 
 
187
 
188
- ## Architecture
189
-
190
- VoxTell employs a multi-stage vision-language fusion approach:
191
-
192
- - **Image Encoder**: Processes 3D volumetric input into latent feature representations
193
- - **Prompt Encoder**: We use the frozen [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B) model to embed text prompts
194
- - **Prompt Decoder**: Transforms text queries and image latents into multi-scale text features
195
- - **Image Decoder**: Fuses visual and textual information at multiple resolutions using MaskFormer-style query-image fusion with deep supervision
196
 
197
- <img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellArchitecture.png" alt="Architecture Diagram"/>
 
 
 
 
 
198
 
199
- ## Intended Use
200
 
201
- #### Primary Use Cases
 
 
 
202
 
203
- - Research in vision-language models for medical image analysis
204
- - Text-promptable or automated segmentation of anatomical structures in medical imaging
205
- - Identification and delineation of pathological findings
206
 
207
- #### Out-of-Scope Use
208
 
209
- - Clinical diagnosis without expert radiologist review
210
- - Real-time emergency medical decision-making
211
- - Commercial use
212
 
213
- ## Performance
214
 
215
- VoxTell achieves state-of-the-art performance on anatomical and pathological segmentation tasks across multiple medical imaging benchmarks. Detailed performance metrics and comparisons are available in the [paper](https://arxiv.org/abs/2511.11450).
216
 
217
- Tip: Experiment with different prompts tailored to your use case. For example, the prompt `lesions` is known to be overconfident, i.e. over-segmenting, compared to `lesion`.
218
 
 
219
 
220
- ## Limitations / Known Issues
 
 
 
 
 
221
 
222
- - Performance may vary on imaging modalities or anatomical regions underrepresented in training data
223
- - Prompting structures absent from the image and never seen on this modality (e.g., "liver" in a brain MRI) may lead to undesired results
224
- - Text prompt quality and specificity affects segmentation accuracy
225
- - Not validated for direct clinical use without expert review
226
 
227
  ## Citation
228
 
229
  ```bibtex
230
- @InProceedings{Rokuss_2026_CVPR,
231
- author = {Rokuss, Maximilian and Langenberg, Moritz and Kirchhoff, Yannick and Isensee, Fabian and Hamm, Benjamin and Ulrich, Constantin and Regnery, Sebastian and Bauer, Lukas and Katsigiannopulos, Efthimios and Norajitra, Tobias and Maier-Hein, Klaus},
232
- title = {VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation},
233
- booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
234
- month = {June},
235
- year = {2026},
236
- pages = {37538-37557}
237
  }
238
  ```
239
 
 
15
  - Radiology
16
  ---
17
 
18
+ # [CVPR2026] VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
19
 
20
  <div align="center">
21
 
22
  [![arXiv](https://img.shields.io/badge/arXiv-2511.11450-B31B1B.svg)](https://arxiv.org/abs/2511.11450)&#160;
23
  [![GitHub](https://img.shields.io/badge/GitHub-VoxTell-181717?logo=github&logoColor=white)](https://github.com/MIC-DKFZ/VoxTell)&#160;
24
+ [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Model-VoxTell-yellow)](https://huggingface.co/mrokuss/VoxTell)&#160;
25
  [![web tool](https://img.shields.io/badge/web-tool-4CAF50)](https://github.com/gomesgustavoo/voxtell-web-plugin)&#160;
26
+ [![OHIF integration](https://img.shields.io/badge/OHIF-integration-101332)](https://github.com/CCI-Bonn/OHIF-AI)&#160;
27
+ [![3D Slicer](https://badgen.net/badge/3D%20Slicer/plugin/1f65b0ff?icon=https://raw.githubusercontent.com/Slicer/slicer.org/bc48de2b885e9bb4a725a24ab44b86273014f0ea/assets/img/3D-Slicer-Mark.svg)](https://github.com/lassoan/SlicerVoxTell)
28
+ [![napari](https://badgen.net/badge/napari/plugin/80d1ff?icon=https://raw.githubusercontent.com/napari/napari/8b74cdfb205338a20a2e63dcbba048007ecd2309/src/napari/resources/logos/gradient-plain-light.svg)](https://github.com/MIC-DKFZ/napari-voxtell)&#160;
29
 
30
  </div>
31
 
 
67
 
68
  <img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellConcepts.png" alt="Concept Coverage"/>
69
 
70
+ ---
71
 
72
+ ## Architecture
73
 
74
+ VoxTell combines **3D image encoding** with **text-prompt embeddings** and **multi-stage vision–language fusion**:
 
75
 
76
+ - **Image Encoder**: Processes 3D volumetric input into latent feature representations
77
+ - **Prompt Encoder**: We use the frozen [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B) model to embed text prompts
78
+ - **Prompt Decoder**: Transforms text queries and image latents into multi-scale text features
79
+ - **Image Decoder**: Fuses visual and textual information at multiple resolutions using MaskFormer-style query-image fusion with deep supervision
80
 
81
+ <img src="https://raw.githubusercontent.com/MIC-DKFZ/VoxTell/main/documentation/assets/VoxTellArchitecture.png" alt="Architecture Diagram"/>
82
+
83
+ ---
 
 
 
84
 
85
  ## 🛠 Installation
86
 
 
97
 
98
  > [!WARNING]
99
  > **Temporary Compatibility Warning**
100
+ > There is a known issue with **PyTorch 2.9.0** causing **OOM errors during inference** (related to 3D convolutions — see the PyTorch issue [here](https://github.com/pytorch/pytorch/issues/166122)).
101
  > **Until this is resolved, please use PyTorch 2.8.0 or earlier.**
102
 
103
  Install PyTorch compatible with your CUDA version. For example, for Ubuntu with a modern NVIDIA GPU:
 
108
 
109
  *For other configurations (macOS, CPU, different CUDA versions), please refer to the [PyTorch Get Started](https://pytorch.org/get-started/previous-versions/) page.*
110
 
111
+ Install the latest version directly from the repository (you can also use
112
+ [uv](https://docs.astral.sh/uv/)):
113
 
114
  ```bash
115
+ pip install git+https://github.com/MIC-DKFZ/VoxTell.git
116
  ```
117
 
118
+ For development, clone and install in editable mode:
119
 
120
  ```bash
121
  git clone https://github.com/MIC-DKFZ/VoxTell
 
123
  pip install -e .
124
  ```
125
 
126
+ ---
127
+
128
+ ## 🚀 Getting Started
129
+
130
+ 👉 NEW: [Try VoxTell interactively in the napari viewer](https://github.com/MIC-DKFZ/napari-voxtell)
131
+
132
+ VoxTell downloads its default model (`voxtell_v1.1`) automatically on first use and caches it (in
133
+ the standard Hugging Face cache, `~/.cache/huggingface`), so the examples below work without any
134
+ setup.
135
+
136
+ To download a copy into a directory you control (e.g. to use a different or custom model), fetch
137
+ it with the Hugging Face `huggingface_hub` library:
138
+
139
+ ```python
140
+ import os
141
+ from huggingface_hub import snapshot_download
142
+
143
+ MODEL_NAME = "voxtell_v1.1" # the default model
144
+ DOWNLOAD_DIR = "/home/user/temp" # where to put the model
145
+
146
+ local = snapshot_download("mrokuss/VoxTell", allow_patterns=f"{MODEL_NAME}/*", local_dir=DOWNLOAD_DIR)
147
+ model_path = os.path.join(local, MODEL_NAME) # e.g. "/home/user/temp/voxtell_v1.1"
148
+ ```
149
+
150
+ Then point VoxTell at that directory with the `VOXTELL_MODEL` environment variable (or pass
151
+ `-m`/`model_dir` to override per run):
152
+
153
+ ```bash
154
+ export VOXTELL_MODEL=/path/to/voxtell_v1.1 # a local model directory (e.g. add to ~/.bashrc)
155
+ ```
156
+
157
+ Once a model has been downloaded it is cached, so subsequent runs work offline.
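The precedence described above (an explicit `-m`/`model_dir`, then `VOXTELL_MODEL`, then the auto-downloaded default) can be sketched as follows. This only illustrates the lookup order; `resolve_model_source` and the `hub:` placeholder string are hypothetical and not part of the VoxTell API.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL = "voxtell_v1.1"  # default model name stated in the README


def resolve_model_source(
    model_dir: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: report which model source would win."""
    env = os.environ if env is None else env
    if model_dir:                     # 1. explicit argument (-m / model_dir)
        return model_dir
    if env.get("VOXTELL_MODEL"):      # 2. environment variable
        return env["VOXTELL_MODEL"]
    return f"hub:{DEFAULT_MODEL}"     # 3. placeholder for "auto-download the default"


# An explicit argument overrides the environment variable for a single run.
chosen = resolve_model_source("/models/custom", {"VOXTELL_MODEL": "/models/v1.1"})
```

Passing the environment as a parameter keeps the sketch easy to test without touching the real process environment.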
158
+
159
+ ### Command-Line Interface (CLI)
160
+
161
+ VoxTell provides a convenient command-line interface for running predictions:
162
+
163
+ ```bash
164
+ voxtell-predict -i input.nii.gz -o output_folder -p "liver" "spleen" "kidney"
165
+ ```
166
+
167
+ **Single prompt:**
168
+ ```bash
169
+ voxtell-predict -i case001.nii.gz -o output_folder -p "liver"
170
+ # Output: output_folder/case001_liver.nii.gz
171
+ ```
172
+
173
+ **Multiple prompts (saves individual files by default):**
174
+ ```bash
175
+ voxtell-predict -i case001.nii.gz -o output_folder -p "liver" "spleen" "right kidney"
176
+ # Outputs:
177
+ # output_folder/case001_liver.nii.gz
178
+ # output_folder/case001_spleen.nii.gz
179
+ # output_folder/case001_right_kidney.nii.gz
180
+ ```
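The file names in the examples above follow a visible pattern: the image name without its NIfTI extension, an underscore, and the prompt with spaces replaced by underscores. A minimal sketch of that pattern, inferred from the example outputs only (`output_name` is a hypothetical helper, and the real CLI may sanitize prompts differently):

```python
import os


def output_name(image_path: str, prompt: str) -> str:
    """Hypothetical helper: build '<case>_<prompt>.nii.gz' as in the examples."""
    base = os.path.basename(image_path)
    # Strip the NIfTI extension, checking the longer suffix first.
    for ext in (".nii.gz", ".nii"):
        if base.endswith(ext):
            base = base[: -len(ext)]
            break
    return f"{base}_{prompt.replace(' ', '_')}.nii.gz"


names = [output_name("case001.nii.gz", p) for p in ("liver", "spleen", "right kidney")]
```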
181
+
182
+ **Save combined multi-label file:**
183
+ ```bash
184
+ voxtell-predict -i case001.nii.gz -o output_folder -p "liver" "spleen" --save-combined
185
+ # Output: output_folder/case001.nii.gz (multi-label: 1=liver, 2=spleen)
186
+ # ⚠️ WARNING: Overlapping structures will be overwritten by later prompts
187
+ ```
188
+
189
+ #### CLI Options
190
+
191
+ | Argument | Short | Required | Description |
192
+ |----------|-------|----------|-------------|
193
+ | `--input` | `-i` | Yes (unless `--jobs` is used) | Input NIfTI file(s) or a single folder of NIfTI files |
194
+ | `--output` | `-o` | Yes | Path to output folder |
195
+ | `--model` | `-m` | No | Path to a local model directory. If omitted, uses `VOXTELL_MODEL` or downloads the default model (`voxtell_v1.1`) from Hugging Face |
196
+ | `--prompts` | `-p` | Yes (unless `--jobs` is used) | Text prompt(s) for segmentation |
197
+ | `--device` | | No | Device to use: `cuda` (default) or `cpu` |
198
+ | `--gpu` | | No | GPU device ID (default: 0) |
199
+ | `--save-combined` | | No | Save multi-label file instead of individual files |
200
+ | `--embeddings` | | No | Use a local precomputed-embeddings file (`.npz`) instead of auto-download |
201
+ | `--no-precomputed` | | No | Skip the automatic precomputed-embeddings download; embed every prompt with the backbone |
202
+ | `--list-embeddings` | | No | List the available precomputed prompts and exit |
203
+ | `--no-overwrite` | | No | Skip images whose outputs already exist |
204
+ | `--verbose` | | No | Enable verbose output |
205
+
206
+ > `--input` is **either** a single folder (all NIfTI files in it) **or** one or more NIfTI files
207
+ > (absolute or relative to the current directory) — not a mix. The text prompts are embedded once
208
+ > and reused across all images.
209
+
210
+ #### Batch / folder / list inference (same prompts)
211
+
212
+ ```bash
213
+ # Every NIfTI in a folder
214
+ voxtell-predict -i images_folder -o output_folder -p "liver" "spleen"
215
+
216
+ # An explicit list of files
217
+ voxtell-predict -i a.nii.gz b.nii.gz c.nii.gz -o out -p "liver"
218
+ ```
219
+
220
+ #### Different prompts per image
221
+
222
+ Use `--jobs` to bind each image to its own prompts (images come from the file, so `-i` is not used).
223
+ The *union* of all prompts across the jobs is embedded only once.
224
+
225
+ ```bash
226
+ voxtell-predict --jobs jobs.json -o out
227
+ ```
228
+
229
+ ```json
230
231
+ [
232
+ {"image": "a.nii.gz", "prompts": ["liver", "spleen"]},
233
+ {"image": "b.nii.gz", "prompts": ["tumor"]}
234
+ ]
235
+ ```
236
+
237
+ (For the same prompts on every image, use `-p` with `-i` instead.)
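A jobs file in the format shown above can be generated from a plain mapping of image paths to prompts. The sketch below uses only the standard library; `build_jobs` is a hypothetical helper, and the image names are the placeholders from the example.

```python
import json
from typing import Dict, List


def build_jobs(prompts_by_image: Dict[str, List[str]]) -> List[dict]:
    """Hypothetical helper: turn {image: [prompts]} into the jobs list format."""
    return [
        {"image": image, "prompts": list(prompts)}
        for image, prompts in prompts_by_image.items()
    ]


jobs = build_jobs({
    "a.nii.gz": ["liver", "spleen"],
    "b.nii.gz": ["tumor"],
})

# The union of prompts across all jobs (the set that needs embedding once).
unique_prompts = sorted({p for job in jobs for p in job["prompts"]})

# Serialized content for the jobs file; strict JSON, so no comments inside.
jobs_text = json.dumps(jobs, indent=2)
```

Writing `jobs_text` to `jobs.json` gives a file that can be passed to `--jobs`.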
238
+
239
+ ---
240
+
241
+ ### Python API
242
 
243
  For more control or integration into Python workflows, use the Python API:
244
 
 
251
  device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
252
 
253
  # Load image
254
+ # Keep `props`: it stores the original affine/orientation and is required to save the masks correctly.
255
  image_path = "/path/to/your/image.nii.gz"
256
+ img, props = NibabelIOWithReorient().read_images([image_path])
257
 
258
  # Define text prompts
259
  text_prompts = ["liver", "right kidney", "left kidney", "spleen"]
260
 
261
  # Initialize predictor
262
  predictor = VoxTellPredictor(
263
+ model_dir="/path/to/voxtell_model_directory", # optional; omit to use $VOXTELL_MODEL or auto-download
264
  device=device,
265
  )
266
 
 
269
  voxtell_seg = predictor.predict_single_image(img, text_prompts)
270
  ```
271
 
272
+ #### Optional: Save Results
273
+
274
+ Save the masks through the same reader:
275
+
276
+ ```python
277
+ import os
278
+ import numpy as np
279
+
280
+ output_folder = "/path/to/output"
281
+ os.makedirs(output_folder, exist_ok=True)
282
+ writer = NibabelIOWithReorient()
283
+
284
+ # Option A - one 3D mask per prompt
285
+ for prompt, seg in zip(text_prompts, voxtell_seg):
286
+ out_path = os.path.join(output_folder, f"{prompt.replace(' ', '_')}.nii.gz")
287
+ writer.write_seg(seg, out_path, props)
288
+
289
+ # Option B - a single multi-label 3D file, where each prompt gets its own label
290
+ # value (1, 2, 3, ...). Overlapping structures are overwritten by later prompts.
291
+ combined = np.zeros_like(voxtell_seg[0], dtype=np.uint8)
292
+ for i, seg in enumerate(voxtell_seg):
293
+ combined[seg > 0] = i + 1 # label 1=first prompt, 2=second, ...
294
+ writer.write_seg(combined, os.path.join(output_folder, "combined.nii.gz"), props)
295
+ # Label legend: {i + 1: prompt for i, prompt in enumerate(text_prompts)}
296
+ ```
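The overwrite behaviour of Option B is easy to miss, so here is a toy illustration on 1D arrays standing in for 3D masks. The arrays are made up for the example; the loop is the same as in the block above, and the extra `overlap` mask is one possible way to detect voxels claimed by more than one prompt before merging.

```python
import numpy as np

# Two toy binary masks (one row per prompt); they overlap at index 2.
masks = np.array(
    [
        [1, 1, 1, 0, 0],  # first prompt
        [0, 0, 1, 1, 0],  # second prompt
    ],
    dtype=np.uint8,
)

combined = np.zeros(masks.shape[1:], dtype=np.uint8)
for i, seg in enumerate(masks):
    combined[seg > 0] = i + 1  # later prompts overwrite earlier labels

# Voxels that belong to more than one prompt and therefore lose a label.
overlap = (masks > 0).sum(axis=0) > 1
```

If `overlap.any()` is true, saving one file per prompt (Option A) preserves all masks.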
297
+
298
+ For many images, the `voxtell-predict` CLI and `predictor.predict_from_files` /
299
+ `predict_from_jobs` (below) handle this saving for you.
300
+
301
+ #### Efficient batch / folder inference
302
+
303
+ To segment many images with the same prompts, use `predict_from_files`. The text prompts are
304
+ embedded **once** and reused across every image (a folder, a single file, or a list of files):
305
+
306
+ ```python
307
+ predictor = VoxTellPredictor(device=device) # model auto-downloads (or set $VOXTELL_MODEL)
308
+
309
+ written = predictor.predict_from_files(
310
+ inputs="/path/to/images_folder", # folder, file, or list of files
311
+ output_folder="/path/to/output",
312
+ text_prompts=["liver", "spleen"],
313
+ save_combined=False, # one file per prompt (default)
314
+ )
315
+ ```
316
+
317
+ For **different prompts per image**, use `predict_from_jobs` (the union of all prompts is embedded
318
+ once):
319
+
320
+ ```python
321
+ predictor.predict_from_jobs(
322
+ jobs=[
323
+ {"image": "a.nii.gz", "prompts": ["liver", "spleen"]},
324
+ {"image": "b.nii.gz", "prompts": ["tumor"]},
325
+ ],
326
+ output_folder="/path/to/output",
327
+ )
328
+ ```
329
+
330
+ You can also embed prompts yourself and feed the embeddings into `predict_single_image` to reuse
331
+ them across custom loops:
332
+
333
+ ```python
334
+ embeddings = predictor.embed_text_prompts(["liver", "spleen"])
335
+ seg = predictor.predict_single_image(img, text_embeddings=embeddings)
336
+ ```
337
+
338
+ #### Precomputed text embeddings
339
+
340
+ Common prompts are precomputed and downloaded automatically from Hugging Face, skipping the Qwen3
341
+ backbone; anything uncovered is embedded on the fly. To override:
342
+
343
+ ```python
344
+ VoxTellPredictor(embedding_bank="/path/to/embeddings.npz") # explicit local file
345
+ VoxTellPredictor(use_precomputed_embeddings=False) # always use the backbone
346
+ ```
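The lookup-then-fallback idea can be sketched independently of VoxTell. The sketch below assumes an exact string match against the bank and uses plain lists as vectors; the real bank layout and matching rule (for example, case or whitespace normalization) are not described here, and `embed_with_bank` and `embed_fn` are hypothetical names.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def embed_with_bank(
    prompts: Sequence[str],
    bank: Dict[str, Vector],
    embed_fn: Callable[[List[str]], List[Vector]],
) -> List[Vector]:
    """Hypothetical helper: use precomputed vectors, embed only what is missing."""
    # Deduplicate while keeping order, then keep only prompts absent from the bank.
    missing = [p for p in dict.fromkeys(prompts) if p not in bank]
    # Call the (expensive) backbone once, and only if something is missing.
    fresh = dict(zip(missing, embed_fn(missing))) if missing else {}
    return [bank[p] if p in bank else fresh[p] for p in prompts]


calls: List[List[str]] = []


def fake_backbone(texts: List[str]) -> List[Vector]:
    """Stand-in for the text encoder; records what it was asked to embed."""
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]


vectors = embed_with_bank(["liver", "zz", "liver"], {"liver": [1.0]}, fake_backbone)
```

Here the stand-in backbone is called once, with only the uncovered prompt.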
347
+
348
+ #### Optional: Visualize Results
349
 
350
  You can visualize the segmentation results using [napari](https://napari.org/):
351
 
 
353
  pip install napari[all]
354
  ```
355
 
356
+ > 💡 **Tip**
357
+ > If you work in napari already, the [napari-voxtell plugin](https://github.com/MIC-DKFZ/napari-voxtell) offers the fastest way to explore VoxTell results interactively.
358
+
359
+
360
  ```python
361
  import napari
362
  import numpy as np
 
373
  napari.run()
374
  ```
375
 
376
+ ## 🎯 Fine-Tuning
 
 
 
 
377
 
378
+ Transfer VoxTell's pretrained image **encoder** into nnU-Net and fine-tune it for
379
+ multi-class segmentation. Only the image encoder is transferred; the image decoder
380
+ is trained from scratch.
381
 
382
+ **1. Preprocess your dataset** with standard nnU-Net:
 
 
 
 
 
 
 
383
 
384
+ ```bash
385
+ export nnUNet_raw=/path/to/nnUNet_raw
386
+ export nnUNet_preprocessed=/path/to/nnUNet_preprocessed
387
+ export nnUNet_results=/path/to/nnUNet_results
388
+ nnUNetv2_plan_and_preprocess -d DATASET_ID --verify_dataset_integrity
389
+ ```
390
 
391
+ **2. Fine-tune** (positional `dataset configuration fold`):
392
 
393
+ ```bash
394
+ voxtell-finetune DATASET_ID 3d_fullres 0 \
395
+ -pretrained_weights /path/to/voxtell_model/fold_0/checkpoint_final.pth
396
+ ```
397
 
398
+ Use `-tr VoxTellTrainer_noMirroring` for datasets whose labels distinguish left/right. The
399
+ CLI mirrors `nnUNetv2_train` (`--c` to resume, `--val` to validate, etc.); see the
400
+ [nnU-Net repository](https://github.com/MIC-DKFZ/nnUNet) for the full argument reference.
401
 
402
+ ---
403
 
404
+ ## Important: Image Orientation and Spacing
 
 
405
 
406
+ - ⚠️ **Image Orientation (Critical)**: For correct anatomical localization (e.g., distinguishing left from right), images **must be in RAS orientation**. VoxTell was trained on data reoriented using [this specific reader](https://github.com/MIC-DKFZ/nnUNet/blob/86606c53ef9f556d6f024a304b52a48378453641/nnunetv2/imageio/nibabel_reader_writer.py#L101). Orientation mismatches can be a source of error. A quick check: if a simple prompt like "liver" fails and segments parts of the spleen instead, the orientation is likely wrong. Make sure your image metadata is correct.
407
 
408
+ - **Image Spacing**: To keep inference fast, the model does not resample images to a standardized spacing. Performance may degrade on images with very uncommon voxel spacings (e.g., very high-resolution brain MRI). In such cases, consider resampling the image to a more typical clinical spacing (e.g., 1.5×1.5×1.5 mm³) before segmentation.
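When resampling before segmentation, the target grid size follows from keeping the physical extent constant: each axis gets roughly `n * spacing / target_spacing` voxels. A small sketch of that arithmetic (the function name and the example volume are made up; the interpolation itself would be done with an imaging library of your choice):

```python
from typing import Sequence, Tuple


def resampled_shape(
    shape: Sequence[int],
    spacing: Sequence[float],
    target_spacing: Sequence[float] = (1.5, 1.5, 1.5),
) -> Tuple[int, ...]:
    """Voxel grid size after resampling, keeping the physical extent fixed."""
    return tuple(
        max(1, round(n * s / t))  # at least one voxel per axis
        for n, s, t in zip(shape, spacing, target_spacing)
    )


# Hypothetical high-resolution volume: 512 x 512 x 300 voxels at 0.5 mm isotropic.
new_shape = resampled_shape((512, 512, 300), (0.5, 0.5, 0.5))
```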
409
 
410
+ ---
411
 
412
+ ## 🗺️ Roadmap
413
 
414
+ - [x] **Paper Published**: [arXiv:2511.11450](https://arxiv.org/abs/2511.11450)
415
+ - [x] **Code Release**: Official implementation published
416
+ - [x] **PyPI Package**: Package downloadable via pip
417
+ - [x] **Model Release**: Public availability of pretrained weights
418
+ - [x] **Napari Plugin**: Integration into the napari viewer as a [plugin](https://github.com/MIC-DKFZ/napari-voxtell)
419
+ - [x] **Fine-Tuning**: Support and scripts for custom fine-tuning
420
 
421
+ ---
 
 
 
422
 
423
  ## Citation
424
 
425
  ```bibtex
426
+ @inproceedings{rokuss2026voxtell,
427
+ title={{VoxTell}: Free-Text Promptable Universal {3D} Medical Image Segmentation},
428
+ author={Rokuss, Maximilian and Langenberg, Moritz and Kirchhoff, Yannick and Isensee, Fabian and Hamm, Benjamin and Ulrich, Constantin and Regnery, Sebastian and Bauer, Lukas and Katsigiannopulos, Efthimios and Norajitra, Tobias and Maier-Hein, Klaus},
429
+ booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
430
+ pages={37538--37557},
431
+ year={2026}
 
432
  }
433
  ```
434
 
embeddings/voxtell_v1.1/labels.json ADDED
The diff for this file is too large to render. See raw diff
 
embeddings/voxtell_v1.1/text_embeddings.npz ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:65bdee9bcd4eb58d909d7b38701bf1c65e1dc6e4af6733ae00e152b21997c25a
3
+ size 67306013
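The three lines above are a Git LFS pointer, not the embeddings themselves: the pointer records the SHA-256 digest and byte size of the real `.npz` file. A minimal sketch of reading such a pointer and checking a downloaded file against it, using only the standard library (`parse_lfs_pointer` and `matches_pointer` are illustrative helpers, not part of any VoxTell tooling):

```python
import hashlib
import os


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, info: dict) -> bool:
    """Check a local file's size and SHA-256 digest against the pointer."""
    if os.path.getsize(path) != info["size"]:
        return False
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest() == info["digest"]


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:65bdee9bcd4eb58d909d7b38701bf1c65e1dc6e4af6733ae00e152b21997c25a
size 67306013
"""
info = parse_lfs_pointer(pointer)
size_mib = info["size"] / 2**20  # about 64.2 MiB
```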