# MoDiCo
MoDiCo (Motif–Distribution–Context) is a 9.7M-parameter multi-branch architecture for file fragment classification. It identifies the file type of an isolated byte fragment (512 B to 16 KB) without access to file headers, extensions, or filesystem metadata.
This repository hosts the trained Stage 2 (cRT) checkpoint used in the accompanying paper. The model was trained on the Tesserae dataset (1.9 billion fragments across 619 file types).
## Architecture
MoDiCo processes a byte fragment through three parallel encoders and combines them via cross-attention:
| Branch | Inductive bias | Captures |
|---|---|---|
| Motif Encoder | translation-invariant CNN | magic numbers, segment markers, delimiters |
| Distribution Encoder | byte histogram + entropy CDF | encrypted vs. compressed vs. text profiles |
| Context Encoder | hierarchical byte-level transformer | nested brackets, function prologues, long-range structure |
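To make the Distribution branch's inputs concrete, the sketch below computes a normalized byte histogram and a sliding-window Shannon-entropy profile. This is an illustrative sketch only, not the released implementation; the 32 B window size mirrors the entropy-window hyperparameter reported below, and the function names are invented here.

```python
import math

def byte_histogram(data: bytes) -> list[float]:
    """Normalized 256-bin histogram of byte values."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data)
    return [c / total for c in counts]

def entropy_profile(data: bytes, window: int = 32, stride: int = 32) -> list[float]:
    """Shannon entropy (bits per byte) over sliding windows of the fragment."""
    out = []
    for start in range(0, len(data) - window + 1, stride):
        chunk = data[start:start + window]
        counts: dict[int, int] = {}
        for b in chunk:
            counts[b] = counts.get(b, 0) + 1
        h = -sum((c / window) * math.log2(c / window) for c in counts.values())
        out.append(h)
    return out
```

A constant run of bytes yields entropy 0, while a window of 32 distinct byte values yields 5 bits per byte; encrypted or compressed data sits near the 8-bit ceiling, which is what lets this branch separate those profiles from text.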
An attentive fusion module produces input-dependent weights over the three branches, and a small classification head emits 619-way logits. The design and training procedure are described in the paper.
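The fusion step can be sketched as input-dependent softmax gating over the three branch embeddings. This is a deliberate simplification: the actual module uses cross-attention with learned projections, and the scoring vector `w_gate` below is a hypothetical stand-in for that scorer.

```python
import numpy as np

def attentive_fusion(branch_feats: np.ndarray, w_gate: np.ndarray):
    """Fuse branch embeddings with input-dependent softmax weights.

    branch_feats: (num_branches, d_model) stacked branch outputs.
    w_gate:       (d_model,) scoring vector (hypothetical stand-in for the
                  real cross-attention scorer).
    """
    scores = branch_feats @ w_gate                   # (num_branches,)
    scores -= scores.max()                           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over branches
    fused = weights @ branch_feats                   # weighted sum, (d_model,)
    return fused, weights
```

Because the weights depend on the input features, a fragment dominated by magic numbers can lean on the Motif branch while a high-entropy fragment leans on the Distribution branch.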
## Files

- `modico_stage2.pt` — final trained model (cRT classifier head over Stage 1 representations)
- `class_names.json` — index → file-type-name mapping
- `README.md` — this card
## Quick start
```python
import numpy as np
import torch

from modico import MoDiCoClassifier  # see code release

model = MoDiCoClassifier(num_classes=619, max_len=4096)
state = torch.load("modico_stage2.pt", map_location="cpu", weights_only=False)
model.load_state_dict(state["model_state_dict"], strict=False)
model.eval()

# Read the fragment as raw bytes and truncate to the model's context length.
raw_bytes = np.frombuffer(open("mystery.bin", "rb").read(), dtype=np.uint8)
x = torch.from_numpy(raw_bytes[:4096].copy()).long().unsqueeze(0)

with torch.no_grad():
    logits = model(x)
top5 = torch.topk(torch.softmax(logits, dim=-1).squeeze(0), k=5)
print(list(zip(top5.indices.tolist(), top5.values.tolist())))
```
The full inference and evaluation code is at https://anonymous.4open.science/r/tesserae-modico-6145.
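The quick start prints raw class indices; `class_names.json` turns them into readable labels. A minimal sketch, assuming the file stores a JSON array ordered by class index (the helper name and the toy three-entry mapping are invented for illustration; the real file has 619 entries):

```python
import json

def label_topk(topk_indices, topk_probs, class_names):
    """Pair each predicted class index with its name and probability."""
    return [(class_names[i], p) for i, p in zip(topk_indices, topk_probs)]

# Toy stand-in for json.load(open("class_names.json")):
names = json.loads('["elf", "jpg", "pdf"]')
print(label_topk([1, 2], [0.7, 0.2], names))  # [('jpg', 0.7), ('pdf', 0.2)]
```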
## Intended use
- File carving and digital forensics research
- Benchmarking file-fragment classifiers
- Educational use in security and forensics curricula
## Training data
Trained on the Tesserae dataset: 1.9 billion 512-byte fragments drawn from 58.2 million unique source files spanning 619 file types, collected from permissively licensed public sources (The Stack v1, NapierOne, raw.pixls.us, and others). Splits are file-disjoint and content-deduplicated with split-aware survivor selection.
## Training procedure
Two-stage decoupled training following Kang et al. (ICLR 2020):
- Stage 1. Full model trained with instance-balanced sampling on a joint schedule of 512 B and 4 KB fragments. AdamW optimizer, cosine LR schedule with epoch-level warm restarts, BF16 precision.
- Stage 2 (cRT). Stage 1 encoder and fusion weights are frozen. The classification head is reinitialized and retrained with class-balanced sampling, calibrating decision boundaries for tail classes.
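The difference between the two sampling schemes can be made concrete. Instance-balanced sampling draws every fragment with equal probability, so head classes dominate each batch; class-balanced sampling first picks a class uniformly, then a fragment within it. A pure-Python sketch of the resulting per-instance probabilities (function names invented for illustration):

```python
from collections import Counter

def instance_balanced_probs(labels):
    """Every instance equally likely: head classes dominate (Stage 1)."""
    n = len(labels)
    return [1.0 / n for _ in labels]

def class_balanced_probs(labels):
    """Pick a class uniformly, then an instance within it (Stage 2 / cRT)."""
    counts = Counter(labels)
    k = len(counts)
    return [1.0 / (k * counts[y]) for y in labels]

labels = ["jpg"] * 3 + ["elf"]       # toy long-tailed set
class_balanced_probs(labels)         # each "jpg" fragment: 1/6; the lone "elf": 1/2
```

Under class-balanced sampling, the lone tail instance is seen as often as the entire head class, which is what recalibrates the decision boundaries in the cRT stage.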
Training ran on two NVIDIA H200 NVL GPUs for roughly one week, processing 7.4 billion samples. Final hyperparameters: `d_model=512`, Motif window 512 B / stride 256 B, entropy window 32 B with a 64-point CDF, and a Context Encoder with 2 hierarchical stages and embedding dimension 128.
## Evaluation
Accuracy (%) on the Tesserae 619-class test split. Top-1/Top-5 are sample-weighted; Bal-1/Bal-5 are class-balanced.
| Block size | Top-1 | Top-5 | Bal-1 | Bal-5 |
|---|---|---|---|---|
| 512 B | 76.60 | 89.31 | 83.23 | 93.43 |
| 4 KB | 83.14 | 93.80 | 88.67 | 95.76 |
| 8 KB | 83.76 | 94.24 | 88.96 | 95.72 |
| 16 KB | 81.24 | 93.00 | 87.99 | 95.02 |
The model is trained jointly on 512 B and 4 KB fragments only; the 8 KB and 16 KB rows are evaluated zero-shot to test generalization across block sizes. MoDiCo's Bal-1 at 4 KB (88.67%) exceeds the strongest baseline (ByteRCNN, 71.52%) by roughly 17 percentage points.
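The distinction between the two metric families is easy to state in code: Top-1 weights every sample equally, while Bal-1 averages per-class accuracies so a tail class with one sample counts as much as a head class with millions. A small sketch, assuming integer labels (the helper name is invented here):

```python
from collections import defaultdict

def top1_and_bal1(y_true, y_pred):
    """Sample-weighted top-1 and class-balanced (macro-averaged) accuracy."""
    per_class = defaultdict(lambda: [0, 0])   # label -> [correct, total]
    for t, p in zip(y_true, y_pred):
        per_class[t][0] += int(t == p)
        per_class[t][1] += 1
    top1 = sum(c for c, _ in per_class.values()) / len(y_true)
    bal1 = sum(c / n for c, n in per_class.values()) / len(per_class)
    return top1, bal1

# 9 head-class samples, all correct; 1 tail-class sample, wrong:
print(top1_and_bal1([0] * 9 + [1], [0] * 9 + [0]))  # (0.9, 0.5)
```

This is why Bal-1 can exceed Top-1 in the table above: the model does comparatively well on the many small classes that barely move the sample-weighted average.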
## Limitations
A fragment's label reflects the file it belongs to, not the encoding of that specific region. PDFs may embed JPEG payloads whose bytes are indistinguishable from standalone JPEGs, ELF executables can contain plain-ASCII string tables, and so on. This labeling ambiguity produces systematic confusion between embedded and embedding formats at larger block sizes (most prominently `jpg` inside `pptx`/`ppt`/`odp` at 16 KB) and enforces an irreducible error floor. See the paper for a per-class analysis.
The model only assigns a file type to a fragment. It does not, on its own, recover any file. A complete file carving pipeline must additionally discover, group, and order fragments before reconstruction.
## Citation

```bibtex
@misc{tesserae_modico_2026,
  title  = {Tesserae and MoDiCo: A Billion-Fragment Dataset and Multi-Branch
            Architecture for File Fragment Classification},
  author = {Anonymous Authors},
  year   = {2026},
  note   = {Anonymous submission}
}
```
## License
Apache License 2.0. See the code release for the full license text.