VoxelMind
3D-CNN + AffectionCell for classifying Minecraft player actions from short grayscale video clips.
| Property | Value |
|---|---|
| Input | 22 frames × 64×64 grayscale → [1, 1, 22, 64, 64] |
| Output | 11 action classes |
| Checkpoint | AffectionCell hid=24 · 2.58M params · 93.86% val acc |
| Hardware tested | AMD Radeon RX 6600 |
Model description
VoxelMind uses a shared Conv3D encoder followed by an AffectionCell — a learnable N×N influence graph over N = 64 × hid spatial blocks. Unlike flat pooling or sequence heads, AffectionCell exposes interpretable influence maps (see InferenceViz.py).
Classes: w, s, a, d, jump, idle, mouse_up, mouse_down, mouse_left, mouse_right, drop
Benchmark overview
Fair comparison against three baselines on the same dataset, split, preprocessing, and encoder:
- 582 training clips · 114 validation (stratified 20%, seed=42)
- AdamW lr=3e-4 · 100 epochs · input 22×64×64
- Measured on AMD RX 6600, batch=1, median over 200 runs after warmup
| Model | Params | Val acc | Macro-F1 | Latency | GPU FPS |
|---|---|---|---|---|---|
| VoxelMind (AffectionCell, hid=24) | 2,575,211 | 93.86% | 93.08% | 2.05 ms | 488 |
| LSTM head | 198,379 | 92.98% | 91.71% | 1.47 ms | 681 |
| Transformer (2 layers) | 165,291 | 91.23% | 90.29% | 1.58 ms | 633 |
| Linear head | 98,347 | 86.84% | 88.11% | 1.35 ms | 741 |
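The timing protocol listed above (batch=1, warmup, then the median over 200 runs) can be sketched as follows. This is a minimal illustration, not the contents of benchmark.py: `measure_latency_ms`, the warmup count of 20, and the dummy CPU workload are assumptions. On a GPU the device would also need to be synchronized before each clock read (in PyTorch, `torch.cuda.synchronize()`), otherwise mostly kernel launch time is measured.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=20, runs=200):
    """Return the median wall-clock latency of fn() in milliseconds.

    warmup=20 is an assumed value; the source only says "after warmup".
    """
    for _ in range(warmup):          # discard cold-start iterations
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()                         # on GPU: synchronize before reading the clock
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def latency_to_fps(latency_ms):
    """Convert a per-clip latency in ms to forward passes per second (batch=1)."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    # Dummy CPU workload standing in for a model forward pass.
    median_ms = measure_latency_ms(lambda: sum(range(10_000)))
    print(f"median latency: {median_ms:.3f} ms, {latency_to_fps(median_ms):.0f} FPS")
    # The reported VoxelMind latency of 2.05 ms corresponds to about 488 FPS.
    print(round(latency_to_fps(2.05)))
```

The median is less sensitive than the mean to occasional slow runs, which is presumably why the benchmark reports it.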
Accuracy & latency
*Best val accuracy: 93.86% (VoxelMind). Best macro-F1: 93.08%. LSTM trails by 0.88 pp. End-to-end video pipeline: ~140 FPS (I/O bound).*
Model size & throughput
VoxelMind: 2.58M params. LSTM: 198k (13× smaller, ~same accuracy). Linear: 98k, 741 FPS pure forward.
| Model | Val acc | Params | GPU FPS | FPS/Mparam | Quality/M |
|---|---|---|---|---|---|
| Linear head (Conv3D + fc) | 86.84% | 98k | 741 | 7530 | 6539 |
| Transformer (Conv3D + 2-layer encoder) | 91.23% | 165k | 633 | 3832 | 3496 |
| LSTM head (Conv3D + LSTM) | 92.98% | 198k | 681 | 3434 | 3193 |
| VoxelMind (AffectionCell, hid=24) | 93.86% | 2.58M | 488 | 190 | 178 |
Efficiency metrics
Custom metrics for deployment trade-offs:
- FPS/Mparam = GPU FPS ÷ (params / 1M) — pure inference speed per megaparam
- Quality/M = (val acc ÷ 100 × FPS) ÷ (params / 1M) — accuracy × speed per megaparam
| Model | FPS/Mparam | Quality/M |
|---|---|---|
| Linear | 7533 | 6538 |
| LSTM | 3434 | 3191 |
| VoxelMind | 189 | 178 |
Linear is ~40× more FPS-efficient than VoxelMind; LSTM is the best accuracy/size balance at 93% val.
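Both metrics can be recomputed from the benchmark table. The sketch below uses the rounded parameter counts, accuracies, and FPS values published above, so its results can differ from the tabulated figures by a few units; the function names are illustrative and not taken from the repo.

```python
def fps_per_mparam(fps, params):
    """GPU FPS divided by the parameter count in millions."""
    return fps / (params / 1e6)


def quality_per_mparam(val_acc_pct, fps, params):
    """(val acc / 100 * FPS) divided by the parameter count in millions."""
    return (val_acc_pct / 100.0 * fps) / (params / 1e6)


# (params, val acc %, GPU FPS) copied from the benchmark table.
MODELS = {
    "VoxelMind (hid=24)": (2_575_211, 93.86, 488),
    "LSTM head": (198_379, 92.98, 681),
    "Transformer (2 layers)": (165_291, 91.23, 633),
    "Linear head": (98_347, 86.84, 741),
}

if __name__ == "__main__":
    for name, (params, acc, fps) in MODELS.items():
        print(f"{name:24s} FPS/Mparam={fps_per_mparam(fps, params):7.0f} "
              f"Quality/M={quality_per_mparam(acc, fps, params):7.0f}")
```

Because both metrics divide by parameter count, they reward small models heavily; they say nothing about whether the accuracy of a small model is sufficient for a given use.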
AffectionCell hid scaling
hid controls AffectionCell size: N = 64 × hid blocks, N×N connection matrix → params scale O(hid²).
| hid | Blocks N | Params | Val acc | FPS | FPS/Mparam | Quality/M | Source |
|---|---|---|---|---|---|---|---|
| 2 | 128 | 96k | ~75% | ~730 | 7604 | 5703 | estimate |
| 4 | 256 | 180k | ~82% | ~690 | 3833 | 3143 | estimate |
| 8 | 512 | 500k | ~82.5% | ~620 | 1240 | 1023 | estimate |
| 16 | 1024 | 1.20M | ~90% | ~530 | 442 | 398 | estimate |
| 24 | 1536 | 2.58M | 93.86% | 488 | 190 | 178 | measured |
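The quadratic growth follows directly from the N×N matrix. The sketch below counts only the entries of that matrix, assuming it is stored densely; how the remaining parameters are split between the encoder and other layers is not documented here, and the function names are illustrative.

```python
def affection_blocks(hid):
    """Number of spatial blocks, N = 64 * hid (from the model description)."""
    return 64 * hid


def connection_entries(hid):
    """Entries in a dense N x N connection matrix, which grows as hid squared."""
    n = affection_blocks(hid)
    return n * n


if __name__ == "__main__":
    for hid in (2, 4, 8, 16, 24):
        n = affection_blocks(hid)
        print(f"hid={hid:2d}  N={n:4d}  N*N={connection_entries(hid):,}")
    # hid=24 gives N=1536 and 2,359,296 matrix entries, which would account
    # for most of the 2,575,211 parameters reported for the checkpoint.
```

Doubling hid quadruples the matrix, which is why the step from hid=16 to hid=24 more than doubles the parameter count.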
Choosing hid:
| Use case | Recommendation |
|---|---|
| Max accuracy (HF demo) | hid=24 |
| Deployment (no viz needed) | LSTM @ 198k or hid=4–8 |
| Edge / fastest | Linear @ 98k |
| Research / readable heatmaps | hid=2–4 |
Key findings
- Best accuracy: VoxelMind (+0.88 pp vs LSTM, +7.0 pp vs Linear)
- Best speed/size: Linear (26× fewer params than VoxelMind, 741 FPS)
- Best trade-off: LSTM — 92.98% val at 198k params
- Unique feature: only VoxelMind provides N×N influence maps (InferenceViz.py)
Val set = 114 clips → metrics may vary ±5% between epochs. Numbers come from a benchmark.py re-evaluation of saved checkpoints.
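With only 114 validation clips, each clip is worth about 0.88 percentage points, so the reported accuracies map onto whole clip counts. A short check using the accuracies from the benchmark table (the helper name is illustrative):

```python
VAL_CLIPS = 114  # validation set size from the benchmark overview


def correct_clips(val_acc_pct, total=VAL_CLIPS):
    """Nearest whole number of correctly classified clips for an accuracy."""
    return round(val_acc_pct / 100.0 * total)


if __name__ == "__main__":
    print(f"one clip = {100.0 / VAL_CLIPS:.2f} pp")
    for name, acc in [("VoxelMind", 93.86), ("LSTM", 92.98),
                      ("Transformer", 91.23), ("Linear", 86.84)]:
        k = correct_clips(acc)
        print(f"{name:12s} {acc:.2f}% -> {k}/{VAL_CLIPS} "
              f"(= {100.0 * k / VAL_CLIPS:.2f}%)")
```

The 0.88 pp gap between VoxelMind and the LSTM head therefore corresponds to a single validation clip (107 vs 106 correct), which is worth keeping in mind when ranking the two.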
Inference speed (end-to-end)
| Pipeline | FPS (RX 6600) |
|---|---|
| GPU forward only (VoxelMind) | ~488 |
| Inference.py (video I/O + overlay) | ~140 |
| InferenceViz.py (with heatmaps) | ~15–40 |
AffectionCell adds about 52% latency relative to the Linear head (2.05 ms vs 1.35 ms). The full video pipeline is mostly I/O-bound.
Usage
Public inference uses a TorchScript export — no Model.py / Affectors.py required.
pip install -r requirements-hf.txt
# CLI
python hf_inference.py --model models/voxel_scripted.pt --video clip.mp4 --device cpu
from hf_inference import load_model, predict_video
model = load_model("models/voxel_scripted.pt", device="cpu")
for action, prob in predict_video(model, "clip.mp4", topk=3):
    print(f"{action}: {prob * 100:.1f}%")
Input: 22 consecutive frames, 64×64 grayscale (center crop from video).
Output: logits over 11 action classes (see the Classes list above).
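The pure-Python sketch below illustrates two steps around the model: locating a centered square crop in a frame, and turning 11 logits into top-k (action, probability) pairs like those printed in the example above. It is an illustration under assumptions, not the code in hf_inference.py: `center_crop_box` and `topk_from_logits` are hypothetical helpers, the class index order is assumed to match the listed order, and whether the real pipeline crops a square before resizing to 64×64 is not documented here.

```python
import math

# Class order as listed in the model description; the actual index order used
# by the checkpoint is an assumption.
CLASSES = ["w", "s", "a", "d", "jump", "idle",
           "mouse_up", "mouse_down", "mouse_left", "mouse_right", "drop"]


def center_crop_box(width, height):
    """Return (x0, y0, side) of the largest centered square in a frame."""
    side = min(width, height)
    return (width - side) // 2, (height - side) // 2, side


def topk_from_logits(logits, k=3, classes=CLASSES):
    """Softmax the logits and return the k most probable (action, prob) pairs."""
    if len(logits) != len(classes):
        raise ValueError("expected one logit per class")
    peak = max(logits)                            # subtract max for stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(classes, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    print(center_crop_box(1920, 1080))            # (420, 0, 1080)
    fake_logits = [2.0, 0.1, 0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, -1.0]
    for action, prob in topk_from_logits(fake_logits):
        print(f"{action}: {prob * 100:.1f}%")
```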
Live demo (Hugging Face Space)
Create a Space with app.py + requirements-space.txt. The demo downloads voxel_scripted.pt from this repo automatically.
pip install -r requirements-space.txt
python app.py
Local development (full source — not on Hub)
Training, benchmarks, and AffectionCell visualization require the private codebase locally:
python Train.py
python InferenceViz.py --video clip.mp4
Regenerate the public TorchScript artifact after retraining:
python scripts/export_scripted.py
Limitations
- Minecraft gameplay only; 582 training clips
- Confusable classes: w↔mouse_up, s↔w (similar motion at 64×64)
- hid=24 → 2.58M params; val sawtooth on 114 clips suggests overfitting risk
- N×N heatmap at hid=24 appears as dense noise after downscaling; use hid=2–4 for viz
Citation
@misc{voxelmind2026,
title={VoxelMind: Minecraft Action Recognition with AffectionCell},
author={fotographer},
year={2026},
}
Files in this repo
| File | Description |
|---|---|
| models/voxel_scripted.pt | Public TorchScript checkpoint (inference without source) |
| hf_inference.py | Preprocessing + inference API |
| app.py | Gradio demo for HF Spaces |
| requirements-hf.txt | Minimal deps for inference |
| config.json | Model hyperparameters |
| benchmark_results.json | Raw benchmark numbers |





