MoDiCo

MoDiCo (Motif–Distribution–Context) is a 9.7M-parameter multi-branch architecture for file fragment classification. It identifies the file type of an isolated byte fragment (512 B to 16 KB) without access to file headers, extensions, or filesystem metadata.

This repository hosts the trained Stage 2 (cRT) checkpoint used in the accompanying paper. The model was trained on the Tesserae dataset (1.9 billion fragments across 619 file types).

Architecture

MoDiCo processes a byte fragment through three parallel encoders and combines them via cross-attention:

| Branch | Inductive bias | Captures |
|---|---|---|
| Motif Encoder | translation-invariant CNN | magic numbers, segment markers, delimiters |
| Distribution Encoder | byte histogram + entropy CDF | encrypted vs. compressed vs. text profiles |
| Context Encoder | hierarchical byte-level transformer | nested brackets, function prologues, long-range structure |
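
To make the Distribution Encoder's inputs concrete, here is a minimal sketch of histogram plus entropy-CDF features: a global byte histogram and the quantiles of per-window Shannon entropy (using the 32 B window and 64 CDF points from the hyperparameter list). The function name and exact feature layout are assumptions, not the released code; the fragment is assumed to be at least one window long.

```python
import numpy as np

def distribution_features(fragment: bytes, win: int = 32, cdf_points: int = 64):
    """Sketch (assumed, not the released implementation) of
    histogram + entropy-CDF features for a byte fragment."""
    arr = np.frombuffer(fragment, dtype=np.uint8)
    # global byte histogram, normalized to a probability distribution
    hist = np.bincount(arr, minlength=256) / len(arr)
    # Shannon entropy of each non-overlapping `win`-byte window (0..8 bits)
    ents = []
    for i in range(0, len(arr) - win + 1, win):
        p = np.bincount(arr[i:i + win], minlength=256) / win
        p = p[p > 0]
        ents.append(-(p * np.log2(p)).sum())
    # sample the empirical CDF of window entropies at `cdf_points` quantiles
    cdf = np.quantile(np.array(ents), np.linspace(0.0, 1.0, cdf_points))
    return hist, cdf
```

High, flat entropy CDFs suggest encrypted or well-compressed data; low or bimodal ones suggest text or structured binaries.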

An attentive fusion module produces input-dependent weights over the three branches, and a small classification head emits 619-way logits. The design and training procedure are described in the paper.
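
The fusion step can be pictured as learned softmax gating over the three branch embeddings. This is an illustrative sketch only; the class name, scoring function, and use of a shared `d_model` are assumptions about the released architecture.

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Sketch of input-dependent branch weighting (an assumption,
    not the released module): score each branch embedding, softmax
    the scores, and take the weighted sum."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, motif, dist, context):
        # each branch embedding: (batch, d_model)
        stacked = torch.stack([motif, dist, context], dim=1)   # (B, 3, d)
        weights = torch.softmax(self.score(stacked).squeeze(-1), dim=1)  # (B, 3)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)   # (B, d)
        return fused, weights
```

Returning the weights alongside the fused embedding also makes the per-fragment branch attribution inspectable.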

Files

  • modico_stage2.pt — final trained model (cRT classifier head over Stage 1 representations)
  • class_names.json — index → file-type-name mapping
  • README.md — this card

Quick start

```python
import numpy as np
import torch

from modico import MoDiCoClassifier  # see code release

model = MoDiCoClassifier(num_classes=619, max_len=4096)
# checkpoint uses full pickle serialization; load only from trusted sources
state = torch.load("modico_stage2.pt", map_location="cpu", weights_only=False)
model.load_state_dict(state["model_state_dict"], strict=False)
model.eval()

raw_bytes = np.frombuffer(open("mystery.bin", "rb").read(), dtype=np.uint8)
x = torch.from_numpy(raw_bytes[:4096].copy()).long().unsqueeze(0)

with torch.no_grad():
    logits = model(x)

top5 = torch.topk(torch.softmax(logits, dim=-1).squeeze(0), k=5)
print(list(zip(top5.indices.tolist(), top5.values.tolist())))
```
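
The snippet above prints raw class indices; `class_names.json` maps them to file-type names. A small helper, sketched under the assumption that the JSON is either a list or an object with string keys (JSON objects cannot have integer keys):

```python
import json

def load_class_names(path: str = "class_names.json"):
    """Load the index -> file-type mapping; the exact JSON shape
    (list vs. string-keyed object) is an assumption."""
    with open(path) as f:
        names = json.load(f)
    if isinstance(names, dict):
        names = {int(k): v for k, v in names.items()}
    return names

def decode_topk(indices, probs, names):
    """Pair each predicted index with its file-type name."""
    return [(names[i], p) for i, p in zip(indices, probs)]
```

For example, `decode_topk(top5.indices.tolist(), top5.values.tolist(), load_class_names())` yields `(name, probability)` pairs.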

The full inference and evaluation code is at https://anonymous.4open.science/r/tesserae-modico-6145.

Intended use

  • File carving and digital forensics research
  • Benchmarking file-fragment classifiers
  • Educational use in security and forensics curricula

Training data

Trained on the Tesserae dataset: 1.9 billion 512-byte fragments drawn from 58.2 million unique source files spanning 619 file types, collected from permissively licensed public sources (The Stack v1, NapierOne, raw.pixls.us, and others). Splits are file-disjoint and content-deduplicated with split-aware survivor selection.
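
One plausible reading of split-aware survivor selection (an assumption, not the paper's exact rule) is: group byte-identical fragments by content hash, keep a single survivor per group, and, when a group spans splits, keep the copy in the evaluation split so no content leaks into training. A sketch:

```python
import hashlib
from collections import defaultdict

def dedup_split_aware(fragments: dict, split_of: dict):
    """Hypothetical sketch of content dedup with split-aware survivor
    selection: one survivor per duplicate group, preferring the copy
    assigned to the test split."""
    groups = defaultdict(list)
    for frag_id, data in fragments.items():
        groups[hashlib.sha256(data).hexdigest()].append(frag_id)
    survivors = {}
    for ids in groups.values():
        # prefer a test-split copy as the group's survivor
        chosen = min(ids, key=lambda i: 0 if split_of[i] == "test" else 1)
        survivors[chosen] = split_of[chosen]
    return survivors
```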

Training procedure

Two-stage decoupled training following Kang et al. (ICLR 2020):

  1. Stage 1. Full model trained with instance-balanced sampling on a joint schedule of 512 B and 4 KB fragments. AdamW optimizer, cosine LR schedule with epoch-level warm restarts, BF16 precision.
  2. Stage 2 (cRT). Stage 1 encoder and fusion weights are frozen. The classification head is reinitialized and retrained with class-balanced sampling, calibrating decision boundaries for tail classes.
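
The Stage 2 recipe above can be sketched as: freeze every parameter, reinitialize only the classifier head, and sample classes uniformly by weighting each example inversely to its class frequency. Attribute and function names here are assumptions, not the released training code.

```python
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

def crt_setup(model: nn.Module, head_name: str, labels: torch.Tensor,
              num_classes: int = 619):
    """Sketch of cRT stage 2 (Kang et al., 2020). `head_name` names the
    classifier submodule; its real name in the release is an assumption."""
    head = getattr(model, head_name)
    for p in model.parameters():           # freeze encoder + fusion
        p.requires_grad = False
    for m in head.modules():               # reinitialize the head
        if hasattr(m, "reset_parameters"):
            m.reset_parameters()
    for p in head.parameters():            # train only the head
        p.requires_grad = True
    # class-balanced sampling: each sample weighted by 1 / class frequency
    counts = torch.bincount(labels, minlength=num_classes).clamp(min=1)
    sample_w = (1.0 / counts.float())[labels]
    return WeightedRandomSampler(sample_w, num_samples=len(labels),
                                 replacement=True)
```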

Trained on two NVIDIA H200 NVL GPUs over roughly one week, processing 7.4 billion samples. Final hyperparameters: `d_model=512`, Motif window 512 B / stride 256 B, entropy window 32 B with 64-point CDF, Context Encoder with 2 hierarchical stages and embedding dimension 128.

Evaluation

On the Tesserae 619-class test split (sample-weighted "Top-1" / class-balanced "Bal-1"):

| Block size | Top-1 | Top-5 | Bal-1 | Bal-5 |
|---|---|---|---|---|
| 512 B | 76.60 | 89.31 | 83.23 | 93.43 |
| 4 KB | 83.14 | 93.80 | 88.67 | 95.76 |
| 8 KB | 83.76 | 94.24 | 88.96 | 95.72 |
| 16 KB | 81.24 | 93.00 | 87.99 | 95.02 |

The model is trained jointly on 512 B and 4 KB only; 8 KB and 16 KB are evaluated zero-shot to test size generalization. MoDiCo's Bal-1 at 4 KB exceeds the strongest baseline (ByteRCNN, 71.52%) by 17 percentage points.
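
For reproducing the table's metrics, Bal-k is read here as macro-averaged per-class top-k recall (this interpretation of "class-balanced" is an assumption; the exact definition is in the paper):

```python
import numpy as np

def balanced_topk(y_true, y_pred_topk, num_classes):
    """Sketch of Bal-k as the mean over classes of per-class
    top-k recall (assumed definition)."""
    hits = np.zeros(num_classes)
    totals = np.zeros(num_classes)
    for t, preds in zip(y_true, y_pred_topk):
        totals[t] += 1
        if t in preds:
            hits[t] += 1
    mask = totals > 0                      # ignore classes absent from the split
    return float((hits[mask] / totals[mask]).mean())
```

Unlike sample-weighted Top-k, every class contributes equally, so head classes cannot mask tail-class errors.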

Limitations

A fragment's label reflects the file it belongs to, not the encoding of that specific region. PDFs may embed JPEG payloads whose bytes are indistinguishable from standalone JPEGs, ELF executables can contain plain-ASCII string tables, and so on. This labeling ambiguity produces systematic confusion between embedded and embedding formats at larger block sizes (most prominently `jpg` inside `pptx`/`ppt`/`odp` at 16 KB) and enforces an irreducible error floor. See the paper for a per-class analysis.

The model only assigns a file type to a fragment. It does not, on its own, recover any file. A complete file carving pipeline must additionally discover, group, and order fragments before reconstruction.

Citation

```bibtex
@misc{tesserae_modico_2026,
  title  = {Tesserae and MoDiCo: A Billion-Fragment Dataset and Multi-Branch Architecture for File Fragment Classification},
  author = {Anonymous Authors},
  year   = {2026},
  note   = {Anonymous submission}
}
```

License

Apache License 2.0. See the code release for the full license text.
