---
language: en
license: mit
tags:
  - transformers
  - language-model
  - graph-attention
  - adaptive-depth
  - temporal-decay
  - efficient-llm
  - pytorch
datasets:
  - wikitext
  - c4
metrics:
  - perplexity
model-index:
  - name: TemporalMesh-Transformer
    results:
      - task:
          type: text-generation
        dataset:
          name: WikiText-2
          type: wikitext
        metrics:
          - type: perplexity
            value: 29.4
            name: Validation Perplexity
      - task:
          type: text-generation
        dataset:
          name: WikiText-103
          type: wikitext
        metrics:
          - type: perplexity
            value: 36.1
            name: Validation Perplexity
---

# TemporalMesh Transformer (TMT v3)

**Author:** Vigneshwar LK  
**Paper:** [DOI 10.5281/zenodo.20287197](https://zenodo.org/records/20287390)  
**Code:** [github.com/vignesh2027/TemporalMesh-Transformer](https://github.com/vignesh2027/TemporalMesh-Transformer)  
**Live Demo:** [HuggingFace Space](https://huggingface.co/spaces/vigneshwar234/TemporalMesh-Transformer-Demo)  
**Benchmarks:** [TMT-Benchmarks Dataset](https://huggingface.co/datasets/vigneshwar234/TMT-Benchmarks)

---

## What is TMT?

TMT is a PyTorch transformer architecture that addresses five inefficiencies of standard transformers in a single model:

| Problem | Standard Transformer | TMT Solution |
|---|---|---|
| Quadratic attention cost | $O(S^2)$ per layer | **Mesh Attention**: $O(S \cdot k)$ dynamic $k$NN graph |
| Static attention topology | Fixed fully-connected | **Dynamic graph** rebuilt per-layer from cosine similarity |
| Uniform token compute | All tokens use all $N$ layers | **Adaptive Depth Routing**: exit gate per token, avg 5.8/12 layers |
| Flat positional encoding | Position only | **Temporal Decay**: learned multiplicative semantic attenuation |
| No cross-sequence memory | Stateless | **EMA Memory Anchors**: 16 persistent fast-weight vectors |
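
To make the Mesh Attention and dynamic-graph rows concrete, here is a minimal sketch of attention restricted to a cosine-similarity $k$NN graph. It is an illustration under stated assumptions, not the repository's implementation: the names `build_knn_graph` and `mesh_attention` are hypothetical, and causal masking, multi-head projections, and temporal decay are left out. Note that this sketch picks neighbours from a dense $S \times S$ similarity matrix, so only the attention step itself is $O(S \cdot k)$.

```python
import torch
import torch.nn.functional as F


def build_knn_graph(x: torch.Tensor, k: int) -> torch.Tensor:
    """Return (B, S, k) indices of each token's k most cosine-similar tokens."""
    x_norm = F.normalize(x, dim=-1)               # unit-length feature vectors
    sim = x_norm @ x_norm.transpose(-1, -2)       # (B, S, S) cosine similarity
    return sim.topk(k, dim=-1).indices            # top-k neighbours per token


def mesh_attention(q, k_mat, v, neighbours):
    """Attend only over each token's neighbours: S * k scores instead of S * S."""
    B, S, D = q.shape
    k_nb = neighbours.shape[-1]
    batch_idx = torch.arange(B).view(B, 1, 1).expand(B, S, k_nb)
    k_sel = k_mat[batch_idx, neighbours]          # (B, S, k, D) gathered keys
    v_sel = v[batch_idx, neighbours]              # (B, S, k, D) gathered values
    scores = (q.unsqueeze(2) * k_sel).sum(-1) / D ** 0.5   # (B, S, k)
    weights = scores.softmax(dim=-1)
    return (weights.unsqueeze(-1) * v_sel).sum(dim=2)      # (B, S, D)


torch.manual_seed(0)
x = torch.randn(2, 32, 64)                        # (B, S, D) toy hidden states
nb = build_knn_graph(x, k=8)
out = mesh_attention(x, x, x, nb)
print(nb.shape, out.shape)
```

Rebuilding the graph per layer, as the table describes, would amount to calling `build_knn_graph` on each layer's input before attention.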

---

## Results

| Model | WikiText-2 PPL ↓ | WikiText-103 PPL ↓ | LongBench ↑ | Compute (vs. vanilla) |
|---|---|---|---|---|
| Vanilla Transformer | 42.1 | 51.3 | 41.2 | 100% |
| Longformer | 39.6 | 47.2 | 49.8 | 62% |
| Mamba | 31.8 | 38.4 | 51.3 | 55% |
| RWKV | 33.1 | 40.9 | 48.7 | 50% |
| **Full TMT** | **29.4** | **36.1** | **53.4** | **48%** |

All models have ~120M parameters. TMT was trained for 10K steps on WikiText-2 with AdamW and a cosine learning-rate schedule, using seeds 42, 1337, and 2024.

---

## Architecture at a Glance

```
Input → Token Embedding + RoPE
      → [× 12 layers]
           MeshBuilder (kNN graph, cosine sim, top-k=8)
           Mesh Attention  O(S·k)  + Temporal Decay Encoding
           EMA Memory Anchor Cross-Attention (16 anchors, β=0.99)
           Dual-Stream FFN (syntax stream ‖ semantic stream, sigmoid gate)
           Exit Gate  σ(W_gate · x) > 0.85 → token frozen
      → LayerNorm → Tied Output Projection
      → Logits (B, S, V)
```
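
The exit gate and EMA anchor steps in the diagram can be sketched as follows. This is a hedged illustration, not the repository's code: `ExitGate`, `apply_layer`, and `ema_update` are hypothetical names, the gate is assumed to be a single linear projection followed by a sigmoid, and how the per-anchor `summary` is derived from a sequence is not specified in this card. The sketch also runs the layer on every token and masks the result afterwards, so it shows the freezing semantics rather than the compute savings.

```python
import torch
import torch.nn as nn


class ExitGate(nn.Module):
    """Hypothetical per-token gate: sigmoid(W_gate · x) > threshold."""

    def __init__(self, d_model: int, threshold: float = 0.85):
        super().__init__()
        self.w_gate = nn.Linear(d_model, 1)
        self.threshold = threshold

    def forward(self, x, frozen):
        conf = torch.sigmoid(self.w_gate(x)).squeeze(-1)    # (B, S) confidence
        return frozen | (conf > self.threshold), conf       # once frozen, stays frozen


def apply_layer(layer, gate, x, frozen):
    """Update only active tokens; frozen tokens keep their representation."""
    y = layer(x)
    x = torch.where(frozen.unsqueeze(-1), x, y)
    frozen, conf = gate(x, frozen)
    return x, frozen, conf


def ema_update(anchors, summary, beta=0.99):
    """EMA step for (M, D) anchors: beta * old + (1 - beta) * summary."""
    return beta * anchors + (1.0 - beta) * summary


torch.manual_seed(0)
x = torch.randn(1, 4, 16)                          # (B, S, D) toy hidden states
frozen = torch.zeros(1, 4, dtype=torch.bool)       # no token has exited yet
layer, gate = nn.Linear(16, 16), ExitGate(16)      # nn.Linear stands in for a TMT layer
x, frozen, conf = apply_layer(layer, gate, x, frozen)
anchors = ema_update(torch.zeros(16, 16), torch.ones(16, 16))
print(frozen.shape, conf.shape, anchors[0, 0].item())
```

With β = 0.99 each update moves an anchor only 1% of the way toward the new summary, which is what lets the anchors persist across sequences.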

**Output fields** (TMTOutput dataclass):
- `logits` — (B, S, V) next-token predictions
- `exit_masks` — list of (B, S) booleans, one per layer
- `confidences` — gate confidence per token per layer
- `graph_edges` — sparse kNN edge list from final layer
- `memory_state` — (M, D) final EMA anchor states
- `decay_scalars` — temporal decay weights applied

---

## Quick Start

```bash
git clone https://github.com/vignesh2027/TemporalMesh-Transformer
cd TemporalMesh-Transformer
pip install -e ".[dev]"
```

```python
from tmt.model.config import TMTConfig
from tmt.model.model import TMTModel
import torch

config = TMTConfig(
    vocab_size=50257,
    d_model=512,
    n_heads=8,
    n_layers=12,
    graph_k=8,
    exit_threshold=0.85,
    memory_anchors=16,
)
model = TMTModel(config)  # ~120M params

tokens = torch.randint(0, 50257, (1, 256))
out = model(tokens)

print(out.logits.shape)      # (1, 256, 50257)
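print(out.exit_masks[-1])    # (B, S) boolean exit mask from the last layer
# Mean fraction of tokens flagged as exited, averaged over layers (a value in [0, 1]).
# This is not an exit-layer index; the reported average depth is ~5.8 of 12 layers.
exit_frac = sum(m.float().mean() for m in out.exit_masks) / len(out.exit_masks)
print(f"Mean exited fraction: {exit_frac:.2f}")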
```
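
One way to turn `exit_masks` into an average depth per token is sketched below. It rests on an assumption this card does not confirm: that the masks are cumulative, so `exit_masks[l][b, s]` is `True` once token `s` has been frozen at or before layer `l`. The helper `mean_layers_used` is hypothetical, and the example uses hand-built masks rather than model output.

```python
import torch


def mean_layers_used(exit_masks):
    """Average number of layers each token executes.

    Assumes cumulative masks. Every token runs layer 1; it also runs
    layer l + 1 only if it was still active (not frozen) after layer l.
    """
    stacked = torch.stack(exit_masks).float()      # (L, B, S)
    active_after = 1.0 - stacked[:-1]              # still active entering layers 2..L
    layers_used = 1.0 + active_after.sum(dim=0)    # (B, S)
    return layers_used.mean().item()


# Toy case with 4 layers and 2 tokens:
# token 0 is frozen after layer 1, token 1 never exits.
masks = [torch.tensor([[True, False]]) for _ in range(4)]
depth = mean_layers_used(masks)
print(depth)  # 2.5, i.e. (1 + 4) / 2
```

If the masks instead flag only the layer at which a token exits, they would need a cumulative OR across layers before this calculation applies.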

---

## Training

```bash
python scripts/train.py \
  --dataset wikitext-2 \
  --model_size base \
  --steps 10000 \
  --lr 3e-4 \
  --batch_size 16 \
  --seq_len 256 \
  --exit_threshold 0.85 \
  --graph_k 8
```
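
For readers who want to reproduce the optimizer setup outside `scripts/train.py`, the following is a minimal sketch of AdamW with a cosine learning-rate schedule in plain PyTorch. It is an assumption about how the pieces fit together, not a copy of the training script: the stand-in model, the random batch, the absence of warmup, and the shortened step count are all illustrative choices.

```python
import torch
import torch.nn as nn

steps = 100                      # the command above uses 10000
model = nn.Linear(8, 8)          # stand-in for TMTModel
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=steps)

lrs = []
for _ in range(steps):
    lrs.append(scheduler.get_last_lr()[0])          # LR used for this step
    loss = model(torch.randn(16, 8)).pow(2).mean()  # dummy loss on a random batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                # decay LR along a cosine curve

final_lr = scheduler.get_last_lr()[0]
print(lrs[0], final_lr)
```

The learning rate starts at 3e-4 and decays toward zero over `T_max` steps; a warmup phase, if the real script uses one, would have to be added separately.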

---

## Ablation Summary

| Config | WikiText-2 PPL ↓ | Compute | VRAM |
|---|---|---|---|
| Vanilla Transformer | 42.1 | 100% | 18.4 GB |
| + Mesh Attention only | 37.8 | 62% | 11.2 GB |
| + Temporal Decay only | 40.3 | 98% | 18.4 GB |
| + Adaptive Exit only | 39.6 | 51% | 18.4 GB |
| Mesh + Decay | 34.2 | 61% | 11.2 GB |
| Mesh + Exit | 35.1 | 50% | 11.2 GB |
| **Full TMT** | **29.4** | **48%** | **11.2 GB** |

The full combination achieves **superadditive gains**: Full TMT improves perplexity by 12.7 over the vanilla baseline, which is 4.1 PPL more than the 8.6 sum of the three individual contributions (4.3 + 1.8 + 2.5).
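
The interaction effect can be checked directly from the ablation table. The short script below only redoes that arithmetic with the table's numbers; the variable names are illustrative.

```python
baseline = 42.1                                   # Vanilla Transformer PPL
single = {"mesh": 37.8, "decay": 40.3, "exit": 39.6}
full = 29.4                                       # Full TMT PPL

# Gain of each component when added on its own.
individual_gains = {name: baseline - ppl for name, ppl in single.items()}
additive_gain = sum(individual_gains.values())    # 4.3 + 1.8 + 2.5
full_gain = baseline - full
interaction = full_gain - additive_gain           # gain beyond the additive sum

print(f"{additive_gain:.1f} {full_gain:.1f} {interaction:.1f}")  # 8.6 12.7 4.1
```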

---

## Citation

```bibtex
@misc{vigneshwar2026tmt,
  title   = {TemporalMesh Transformer: Dynamic Graph Attention with
             Temporal Semantic Decay and Per-Token Adaptive Depth Routing},
  author  = {Vigneshwar LK},
  year    = {2026},
  doi     = {10.5281/zenodo.20287197},
  url     = {https://zenodo.org/records/20287390}
}
```

---

## License

MIT License · © 2026 Vigneshwar LK