VoxelMind

3D-CNN + AffectionCell for classifying Minecraft player actions from short grayscale video clips.


Input	22 frames × 64×64 grayscale → `[1, 1, 22, 64, 64]`
Output	11 action classes
Checkpoint	AffectionCell `hid=24` · 2.58M params · 93.86% val acc
Hardware tested	AMD Radeon RX 6600

Model description

VoxelMind uses a shared Conv3D encoder followed by an AffectionCell — a learnable N×N influence graph over N = 64 × hid spatial blocks. Unlike flat pooling or sequence heads, AffectionCell exposes interpretable influence maps (see InferenceViz.py).

Classes: w, s, a, d, jump, idle, mouse_up, mouse_down, mouse_left, mouse_right, drop

Benchmark overview

Fair comparison against three baselines on the same dataset, split, preprocessing, and encoder:

582 training clips · 114 validation (stratified 20%, seed=42)
AdamW lr=3e-4 · 100 epochs · input 22×64×64
Measured on AMD RX 6600, batch=1, median over 200 runs after warmup
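The timing protocol above (batch of 1, warmup, then the median of 200 timed runs) can be sketched roughly as follows. This is an illustration only, not the code in benchmark.py: `median_latency_ms`, `fps_from_latency`, the `warmup=20` default, and the `sync` hook are hypothetical names and values chosen here. For GPU timing, `sync` would typically be `torch.cuda.synchronize`, so that queued kernels finish before the clock is read.

```python
import statistics
import time
from typing import Callable, Optional


def median_latency_ms(
    forward: Callable[[], object],
    warmup: int = 20,   # assumed value; the source only says "after warmup"
    runs: int = 200,    # matches "median over 200 runs"
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the median wall-clock time of forward() in milliseconds."""
    # Warmup iterations are executed but not timed.
    for _ in range(warmup):
        forward()
    if sync is not None:
        sync()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        forward()
        if sync is not None:
            sync()  # wait for queued GPU work before stopping the timer
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fps_from_latency(latency_ms: float) -> float:
    """At batch=1, throughput in FPS is 1000 / latency in ms."""
    return 1000.0 / latency_ms


# Stand-in workload so the sketch runs without a model or GPU.
latency = median_latency_ms(lambda: sum(range(10_000)))
print(f"{latency:.3f} ms -> {fps_from_latency(latency):.0f} FPS")
```

With this conversion, the 2.05 ms latency reported for VoxelMind corresponds to about 488 FPS, which matches the GPU FPS column.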

Model	Params	Val acc	Macro-F1	Latency	GPU FPS
VoxelMind (AffectionCell, hid=24)	2,575,211	93.86%	93.08%	2.05 ms	488
LSTM head	198,379	92.98%	91.71%	1.47 ms	681
Transformer (2 layers)	165,291	91.23%	90.29%	1.58 ms	633
Linear head	98,347	86.84%	88.11%	1.35 ms	741

Accuracy & latency

*Best val accuracy: 93.86% (VoxelMind). Best macro-F1: 93.08%. LSTM trails by 0.88 pp. End-to-end video pipeline: ~140 FPS (I/O bound).*

Model size & throughput

VoxelMind: 2.58M params. LSTM: 198k (13× smaller, ~same accuracy). Linear: 98k, 741 FPS pure forward.

Model	Val acc	Params	GPU FPS	FPS/Mparam	Quality/M
Linear head (Conv3D + fc)	86.84%	98k	741	7530	6539
Transformer (Conv3D + 2-layer encoder)	91.23%	165k	633	3832	3496
LSTM head (Conv3D + LSTM)	92.98%	198k	681	3434	3193
VoxelMind (AffectionCell, hid=24)	93.86%	2.58M	488	190	178

Efficiency metrics

Custom metrics for deployment trade-offs:

FPS/Mparam = GPU FPS ÷ (params / 1M) — pure inference speed per megaparam
Quality/M = (val acc ÷ 100 × FPS) ÷ (params / 1M) — accuracy × speed per megaparam
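As a worked example, the two formulas above can be applied to the parameter counts, accuracies, and FPS values from the benchmark table. The function names `fps_per_mparam` and `quality_per_mparam` are hypothetical helpers introduced here, not part of the repo.

```python
def fps_per_mparam(fps: float, params: int) -> float:
    """GPU FPS divided by the parameter count in millions."""
    return fps / (params / 1_000_000)


def quality_per_mparam(val_acc_pct: float, fps: float, params: int) -> float:
    """Validation accuracy (as a fraction) times FPS, per million parameters."""
    return (val_acc_pct / 100.0) * fps / (params / 1_000_000)


# (params, val acc %, GPU FPS) copied from the benchmark table.
models = {
    "VoxelMind (hid=24)": (2_575_211, 93.86, 488),
    "LSTM head": (198_379, 92.98, 681),
    "Transformer (2 layers)": (165_291, 91.23, 633),
    "Linear head": (98_347, 86.84, 741),
}

for name, (params, acc, fps) in models.items():
    print(
        f"{name:<24} FPS/Mparam={fps_per_mparam(fps, params):8.1f}  "
        f"Quality/M={quality_per_mparam(acc, fps, params):8.1f}"
    )
```

Because the inputs here are the rounded values printed in the table, the results can differ from the reported figures by a few units.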

Model	FPS/Mparam	Quality/M
Linear	7530	6539
LSTM	3434	3193
VoxelMind	190	178

Linear is ~40× more FPS-efficient than VoxelMind per megaparam; LSTM offers the best accuracy/size balance at 92.98% val accuracy.

AffectionCell `hid` scaling

hid controls AffectionCell size: N = 64 × hid blocks, N×N connection matrix → params scale O(hid²).
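To make the quadratic growth concrete, the short sketch below computes N and the number of N×N entries for each `hid` value discussed in this section. It assumes the connection matrix is dense with one parameter per entry, which the source does not state explicitly; `affection_cell_size` is a hypothetical helper.

```python
def affection_cell_size(hid: int, blocks_per_hid: int = 64) -> tuple[int, int]:
    """Return (N, N*N) for a given hid, using N = 64 * hid."""
    n = blocks_per_hid * hid
    return n, n * n


for hid in (2, 4, 8, 16, 24):
    n, connections = affection_cell_size(hid)
    print(f"hid={hid:>2}  N={n:>4}  N*N={connections:>9,}")
```

Under that assumption, hid=24 gives N = 1536 and 2,359,296 matrix entries, which would account for most of the 2,575,211 parameters reported for the checkpoint.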

hid	Blocks N	Params	Val acc	FPS	FPS/M	Quality/M	Source
2	128	96k	~75%	~730	7604	5703	estimate
4	256	180k	~82%	~690	3833	3143	estimate
8	512	500k	~82.5%	~620	1240	1023	estimate
16	1024	1.20M	~90%	~530	442	398	estimate
24	1536	2.58M	93.86%	488	190	178	measured

Choosing hid:

Use case	Recommendation
Max accuracy (HF demo)	`hid=24`
Deployment (no viz needed)	LSTM @ 198k or `hid=4–8`
Edge / fastest	Linear @ 98k
Research / readable heatmaps	`hid=2–4`

Key findings

Best accuracy: VoxelMind (+0.88 pp vs LSTM, +7.0 pp vs Linear)
Best speed/size: Linear (26× fewer params than VoxelMind, 741 FPS)
Best trade-off: LSTM — 92.98% val at 198k params
Unique feature: only VoxelMind provides N×N influence maps (InferenceViz.py)

The validation set has only 114 clips, so metrics may vary by ±5% between epochs. All numbers come from re-evaluating the saved checkpoints with benchmark.py.
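The effect of the small validation set can be illustrated with a little arithmetic. The sketch below converts each reported accuracy back into a count of correct clips out of 114 and adds a simple binomial standard error. Treating clips as independent samples is an assumption made here for illustration and is not necessarily how the ±5% figure was obtained; the helper names are hypothetical.

```python
import math

VAL_CLIPS = 114


def correct_clips(acc_pct: float, n: int = VAL_CLIPS) -> int:
    """Number of correctly classified clips implied by an accuracy in percent."""
    return round(acc_pct / 100.0 * n)


def binomial_std_error_pp(acc_pct: float, n: int = VAL_CLIPS) -> float:
    """Binomial standard error of the accuracy, in percentage points."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


print(f"one clip = {100.0 / VAL_CLIPS:.2f} pp")
for name, acc in [
    ("VoxelMind", 93.86),
    ("LSTM head", 92.98),
    ("Transformer", 91.23),
    ("Linear head", 86.84),
]:
    print(
        f"{name:<12} {correct_clips(acc)}/{VAL_CLIPS} correct, "
        f"std error ~{binomial_std_error_pp(acc):.2f} pp"
    )
```

Read this way, the 0.88 pp gap between VoxelMind (107/114) and the LSTM head (106/114) is a single validation clip.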

Inference speed (end-to-end)

Pipeline	FPS (RX 6600)
GPU forward only (VoxelMind)	~488
`Inference.py` (video I/O + overlay)	~140
`InferenceViz.py` (with heatmaps)	~15–40

AffectionCell adds about 52% latency over the Linear head (2.05 ms vs 1.35 ms). The full video pipeline is mostly I/O-bound.

Usage

Public inference uses a TorchScript export — no Model.py / Affectors.py required.

pip install -r requirements-hf.txt

# CLI
python hf_inference.py --model models/voxel_scripted.pt --video clip.mp4 --device cpu

# Python API
from hf_inference import load_model, predict_video

model = load_model("models/voxel_scripted.pt", device="cpu")
for action, prob in predict_video(model, "clip.mp4", topk=3):
    print(f"{action}: {prob * 100:.1f}%")

Input: 22 consecutive frames, 64×64 grayscale (center crop from video).
Output: logits over 11 action classes (see the class list above).
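For readers who want to see how a clip might be shaped into the `[1, 1, 22, 64, 64]` input, here is a minimal NumPy sketch. It is not the preprocessing in hf_inference.py: the nearest-neighbor resize, the BT.601 luma weights, the RGB channel order, and the scaling to [0, 1] are all assumptions, and every function name below is hypothetical.

```python
import numpy as np

NUM_FRAMES = 22
SIZE = 64


def center_crop_square(frame: np.ndarray) -> np.ndarray:
    """Crop the largest centered square from an HxW or HxWxC frame."""
    h, w = frame.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    return frame[top:top + side, left:left + side]


def to_gray(frame: np.ndarray) -> np.ndarray:
    """Convert to grayscale with BT.601 luma weights (assumes RGB order)."""
    if frame.ndim == 2:
        return frame.astype(np.float32)
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return frame[..., :3].astype(np.float32) @ weights


def resize_nearest(gray: np.ndarray, size: int = SIZE) -> np.ndarray:
    """Nearest-neighbor resize to size x size (assumed interpolation)."""
    h, w = gray.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return gray[np.ix_(rows, cols)]


def clip_to_tensor(frames: list) -> np.ndarray:
    """Stack 22 frames into a float32 array of shape [1, 1, 22, 64, 64]."""
    if len(frames) != NUM_FRAMES:
        raise ValueError(f"expected {NUM_FRAMES} frames, got {len(frames)}")
    stack = np.stack(
        [resize_nearest(to_gray(center_crop_square(f))) / 255.0 for f in frames]
    )
    # Add batch and channel axes in front of the (frames, height, width) stack.
    return stack[None, None].astype(np.float32)


# Random uint8 frames stand in for decoded video frames.
rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, (360, 640, 3), dtype=np.uint8) for _ in range(NUM_FRAMES)]
x = clip_to_tensor(frames)
print(x.shape)  # (1, 1, 22, 64, 64)
```

When using the published checkpoint, prefer `predict_video`, which applies the preprocessing the model was trained with.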

Live demo (Hugging Face Space)

Create a Space with app.py + requirements-space.txt. The demo downloads voxel_scripted.pt from this repo automatically.

pip install -r requirements-space.txt
python app.py

Local development (full source — not on Hub)

Training, benchmarks, and AffectionCell visualization require the private codebase locally:

python Train.py
python InferenceViz.py --video clip.mp4

Regenerate the public TorchScript artifact after retraining:

python scripts/export_scripted.py

Limitations

Minecraft gameplay only; 582 training clips
Confusable classes: w ↔ mouse_up, s ↔ w (similar motion at 64×64)
hid=24 → 2.58M params; val sawtooth on 114 clips suggests overfitting risk
N×N heatmap at hid=24 appears as dense noise after downscaling — use hid=2–4 for viz

Citation

@misc{voxelmind2026,
  title={VoxelMind: Minecraft Action Recognition with AffectionCell},
  author={fotographer},
  year={2026},
}

Files in this repo

File	Description
`models/voxel_scripted.pt`	Public TorchScript checkpoint (inference without source)
`hf_inference.py`	Preprocessing + inference API
`app.py`	Gradio demo for HF Spaces
`requirements-hf.txt`	Minimal deps for inference
`config.json`	Model hyperparameters
`benchmark_results.json`	Raw benchmark numbers
