VoxelMind

3D-CNN + AffectionCell for classifying Minecraft player actions from short grayscale video clips.

Input: 22 frames × 64×64 grayscale → [1, 1, 22, 64, 64]
Output: 11 action classes
Checkpoint: AffectionCell hid=24 · 2.58M params · 93.86% val acc
Hardware tested: AMD Radeon RX 6600

Model description

VoxelMind uses a shared Conv3D encoder followed by an AffectionCell — a learnable N×N influence graph over N = 64 × hid spatial blocks. Unlike flat pooling or sequence heads, AffectionCell exposes interpretable influence maps (see InferenceViz.py).

Classes: w, s, a, d, jump, idle, mouse_up, mouse_down, mouse_left, mouse_right, drop

Benchmark overview

Fair comparison against three baselines on the same dataset, split, preprocessing, and encoder:

  • 582 training clips · 114 validation (stratified 20%, seed=42)
  • AdamW lr=3e-4 · 100 epochs · input 22×64×64
  • Measured on AMD RX 6600, batch=1, median over 200 runs after warmup

Model                              Params     Val acc  Macro-F1  Latency  GPU FPS
VoxelMind (AffectionCell, hid=24)  2,575,211  93.86%   93.08%    2.05 ms  488
LSTM head                          198,379    92.98%   91.71%    1.47 ms  681
Transformer (2 layers)             165,291    91.23%   90.29%    1.58 ms  633
Linear head                        98,347     86.84%   88.11%    1.35 ms  741
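
The measurement protocol above (batch=1, warmup, median over 200 runs) can be sketched as follows. This is a minimal illustration, not the contents of benchmark.py: the function names, the warmup count of 20, and the `sync` hook are assumptions. On a GPU the hook would need to block until queued kernels have finished (for example `torch.cuda.synchronize`), otherwise the timer mostly measures launch overhead.

```python
import statistics
import time


def median_latency_ms(fn, warmup=20, runs=200, sync=None):
    """Median wall-clock latency of fn() in milliseconds.

    warmup: untimed calls that absorb one-off costs (assumed value).
    sync:   optional callable that waits for pending device work.
    """
    def timed_call():
        if sync is not None:
            sync()  # drain earlier work before starting the clock
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for fn's device work to finish
        return (time.perf_counter() - start) * 1000.0

    for _ in range(warmup):
        timed_call()  # results discarded
    samples = [timed_call() for _ in range(runs)]
    return statistics.median(samples)


def fps_from_latency(latency_ms):
    """Throughput at batch=1 implied by a per-clip latency."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    # Stand-in workload; replace with a model forward pass on one clip.
    ms = median_latency_ms(lambda: sum(range(10_000)))
    print(f"{ms:.3f} ms -> {fps_from_latency(ms):.0f} FPS")
```

Applied to the table, 2.05 ms corresponds to roughly 488 FPS and 1.35 ms to roughly 741 FPS.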

Accuracy & latency

[Figure: Validation accuracy, macro-F1 and GPU latency]

Best val accuracy: 93.86% (VoxelMind), which also has the best macro-F1 (93.08%). LSTM trails by 0.88 pp in accuracy. End-to-end video pipeline: ~140 FPS (I/O bound).

Model size & throughput

[Figure: Parameters, GPU throughput and summary table]

VoxelMind has 2.58M parameters. The LSTM head has 198k (13× fewer) and is within 0.88 pp of VoxelMind's validation accuracy. The Linear head has 98k and reaches 741 FPS on the GPU forward pass alone.

Model                     Val acc  Params  GPU FPS  FPS/Mparam  Quality/M
Conv3D + fc               86.84%   98k     741      7530        6539
Conv3D + 2-layer encoder  91.23%   165k    633      3832        3496
Conv3D + LSTM head        92.98%   198k    681      3434        3193
AffectionCell hid=24      93.86%   2.58M   488      190         178

Efficiency metrics

[Figure: Efficiency: FPS/Mparam and Quality/M]

Custom metrics for deployment trade-offs:

  • FPS/Mparam = GPU FPS ÷ (params / 1M) — pure inference speed per megaparam
  • Quality/M = (val acc ÷ 100 × FPS) ÷ (params / 1M) — accuracy × speed per megaparam

Model      FPS/Mparam  Quality/M
Linear     7530        6539
LSTM       3434        3193
VoxelMind  190         178

Linear delivers ~40× more FPS per megaparam than VoxelMind; LSTM offers the best accuracy/size balance at 92.98% val with 198k params.
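
Both metrics can be recomputed from the benchmark table. The sketch below uses the rounded FPS and the exact parameter counts published above, so its results differ from the tabulated values by a fraction of a percent. The helper names are illustrative.

```python
def fps_per_mparam(fps, params):
    """GPU FPS divided by the parameter count in millions."""
    return fps / (params / 1e6)


def quality_per_mparam(val_acc_pct, fps, params):
    """(val acc / 100 * FPS) divided by the parameter count in millions."""
    return (val_acc_pct / 100.0 * fps) / (params / 1e6)


# (val acc %, GPU FPS, params) copied from the benchmark table
models = {
    "Linear":      (86.84, 741, 98_347),
    "Transformer": (91.23, 633, 165_291),
    "LSTM":        (92.98, 681, 198_379),
    "VoxelMind":   (93.86, 488, 2_575_211),
}

for name, (acc, fps, params) in models.items():
    print(f"{name:12s} FPS/Mparam={fps_per_mparam(fps, params):7.0f} "
          f"Quality/M={quality_per_mparam(acc, fps, params):7.0f}")
```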

AffectionCell hid scaling

hid controls AffectionCell size: N = 64 × hid blocks, N×N connection matrix → params scale O(hid²).
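
To make the quadratic growth concrete, the sketch below counts blocks and N×N connection entries per hid. It counts only the connection matrix; the encoder and any other AffectionCell weights are not modelled, so the totals are lower than the published parameter counts. The function names are illustrative.

```python
def affection_blocks(hid):
    """Number of spatial blocks: N = 64 * hid."""
    return 64 * hid


def connection_entries(hid):
    """Entries in the N x N influence matrix, i.e. 4096 * hid**2."""
    n = affection_blocks(hid)
    return n * n


for hid in (2, 4, 8, 16, 24):
    print(f"hid={hid:2d}  N={affection_blocks(hid):5d}  "
          f"N*N={connection_entries(hid):>9,d}")
```

At hid=24 the matrix alone has 1536² = 2,359,296 entries. Assuming one learnable weight per entry, that would account for most of the checkpoint's 2,575,211 parameters.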

[Figure: Validation accuracy and parameters vs hid]

hid  Blocks N  Params  Val acc  FPS   FPS/Mparam  Quality/M  Source
2    128       96k     ~75%     ~730  7604        5703       estimate
4    256       180k    ~82%     ~690  3833        3143       estimate
8    512       500k    ~82.5%   ~620  1240        1023       estimate
16   1024      1.20M   ~90%     ~530  442         398        estimate
24   1536      2.58M   93.86%   488   190         178        measured

[Figure: FPS/Mparam vs hid]

[Figure: hid scaling summary and recommendations]

Choosing hid:

Use case                      Recommendation
Max accuracy (HF demo)        hid=24
Deployment (no viz needed)    LSTM @ 198k or hid=4–8
Edge / fastest                Linear @ 98k
Research / readable heatmaps  hid=2–4

Key findings

  • Best accuracy: VoxelMind (+0.88 pp vs LSTM, +7.0 pp vs Linear)
  • Best speed/size: Linear (26× fewer params than VoxelMind, 741 FPS)
  • Best trade-off: LSTM — 92.98% val at 198k params
  • Unique feature: only VoxelMind provides N×N influence maps (InferenceViz.py)

The validation set has only 114 clips, so metrics can vary by ±5% between epochs. All numbers come from re-evaluating saved checkpoints with benchmark.py.
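
With 114 validation clips, a single clip is worth about 0.88 pp of accuracy, which matters when comparing models that are close together. The sketch below shows that granularity together with a rough binomial standard error. Treating clips as independent trials is a simplifying assumption, and the helper names are illustrative.

```python
import math

N_VAL = 114  # validation clips


def accuracy_pct(correct, total=N_VAL):
    """Accuracy in percent for a given number of correct clips."""
    return 100.0 * correct / total


def binomial_stderr_pp(acc_pct, total=N_VAL):
    """Standard error of an accuracy estimate, in percentage points."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / total)


step = accuracy_pct(1)  # accuracy change caused by one clip
print(f"one clip = {step:.2f} pp")
for correct in (107, 106, 104, 99):
    acc = accuracy_pct(correct)
    print(f"{correct}/{N_VAL} = {acc:.2f}%  "
          f"(stderr ~{binomial_stderr_pp(acc):.1f} pp)")
```

Counts of 107, 106, 104 and 99 correct clips reproduce the reported 93.86%, 92.98%, 91.23% and 86.84%, so the gap between VoxelMind and the LSTM head corresponds to a single validation clip.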

Inference speed (end-to-end)

Pipeline                            FPS (RX 6600)
GPU forward only (VoxelMind)        ~488
Inference.py (video I/O + overlay)  ~140
InferenceViz.py (with heatmaps)     ~15–40

AffectionCell adds 52% latency over the Linear head (2.05 vs 1.35 ms). The full video pipeline is mostly I/O-bound.

Usage

Public inference uses a TorchScript export — no Model.py / Affectors.py required.

pip install -r requirements-hf.txt

# CLI
python hf_inference.py --model models/voxel_scripted.pt --video clip.mp4 --device cpu

# Python API
from hf_inference import load_model, predict_video

model = load_model("models/voxel_scripted.pt", device="cpu")
for action, prob in predict_video(model, "clip.mp4", topk=3):
    print(f"{action}: {prob * 100:.1f}%")

Input: 22 consecutive frames, 64×64 grayscale (center crop from video).
Output: logits over 11 action classes (see the class list in the model description).
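
Turning the 11 logits into (action, probability) pairs like those printed in the usage example can be sketched with a softmax followed by a top-k selection. This is an illustration only: the class order below copies the list in the model description and may not match the checkpoint's actual index order, and `topk_actions` is a hypothetical helper, not part of hf_inference.py.

```python
import math

# Assumed index order; check it against the checkpoint's own label mapping.
CLASSES = ["w", "s", "a", "d", "jump", "idle",
           "mouse_up", "mouse_down", "mouse_left", "mouse_right", "drop"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_actions(logits, k=3):
    """Return the k most probable (action, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(zip(CLASSES, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Made-up logits, for demonstration only.
    example_logits = [2.0, 0.1, -1.0, 0.3, 1.2, 0.0, -0.5, -0.2, 0.4, 0.1, -2.0]
    for action, prob in topk_actions(example_logits):
        print(f"{action}: {prob * 100:.1f}%")
```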

Live demo (Hugging Face Space)

Create a Space with app.py + requirements-space.txt. The demo downloads voxel_scripted.pt from this repo automatically.

pip install -r requirements-space.txt
python app.py

Local development (full source — not on Hub)

Training, benchmarks, and AffectionCell visualization require the private codebase locally:

python Train.py
python InferenceViz.py --video clip.mp4

Regenerate the public TorchScript artifact after retraining:

python scripts/export_scripted.py

Limitations

  • Minecraft gameplay only; 582 training clips
  • Confusable classes: w vs mouse_up, s vs w (similar motion at 64×64)
  • hid=24 → 2.58M params; the sawtooth validation curve on 114 clips suggests overfitting risk
  • N×N heatmap at hid=24 appears as dense noise after downscaling — use hid=2–4 for viz

Citation

@misc{voxelmind2026,
  title={VoxelMind: Minecraft Action Recognition with AffectionCell},
  author={fotographer},
  year={2026},
}

Files in this repo

File                      Description
models/voxel_scripted.pt  Public TorchScript checkpoint (inference without source)
hf_inference.py           Preprocessing + inference API
app.py                    Gradio demo for HF Spaces
requirements-hf.txt       Minimal deps for inference
config.json               Model hyperparameters
benchmark_results.json    Raw benchmark numbers